Service Health

This page provides status information on the services that are part of Google Cloud. Check back here to view the current status of the services listed below. If you are experiencing an issue not listed here, please contact Support. Learn more about what's posted on the dashboard in this FAQ. For additional information on these services, please visit https://cloud.google.com/.

Incident affecting Google BigQuery

BigQuery observing high import/export job latencies in several regions

Incident began at 2022-07-14 19:30 and ended at 2022-07-15 15:02 (all times are US/Pacific).

Date Time Description
27 Jul 2022 17:46 PDT

INCIDENT REPORT

Summary:

Google Cloud Networking experienced reduced capacity for lower priority traffic such as batch, streaming and transfer operations from 19:30 US/Pacific on Thursday, 14 July 2022, through 15:02 US/Pacific on Friday, 15 July 2022. High-priority user-facing traffic was not affected. This service disruption resulted from an issue encountered during a combination of repair work and a routine network software upgrade rollout. Due to the nature of the disruption and resilience capabilities of Google Cloud products, the impacted regions and individual impact windows varied substantially. To our customers whose businesses were impacted during this disruption, we sincerely apologize. This is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform’s availability.

Root Cause:

The root cause was identified as an issue with a new control plane configuration rollout, causing low-priority classified traffic capacity reduction in Google’s internal backbone network connecting data centers. Mitigation efforts were slowed by the capacity reduction, and engineering teams required more than their usual time to safely undo the configuration change. During the period of the rollout and subsequent rollback, constrained traffic in certain cloud zones affected the performance of some Cloud services.

Remediation and Prevention:

At approximately 02:00 US/Pacific on Friday, 15 July, as an in-progress rollout expanded to more regions, Google engineers observed performance degradation in Cloud Networking due to reduced capacity. The engineering team then started an investigation into the cause. At 03:50 US/Pacific, Google engineers pushed the first mitigation attempt to halt the ongoing rollout. While this effort succeeded in pausing any new actions, those already in progress continued, which further reduced network capacity.

Subsequently, the engineering team shifted their mitigation efforts toward a global rollback of the problematic configuration. Their first attempt to mitigate using a configuration push was applied at 08:40 US/Pacific, but it was not successfully applied to all nodes, due in part to the reduced network performance. Google engineers worked through alternate mitigations, and by 12:40 US/Pacific, the configuration was updated correctly, and this mitigated the majority of impact.

By 15:02 US/Pacific on 15 July 2022, services for all customers had been restored. The Google Cloud Service Health Dashboard was updated to reflect this.

Google is committed to preventing future recurrence, and we are taking the following actions:

Detection:

  • Improve signals to detect significant changes in traffic for upgraded domains in the backbone network.
  • Improve dashboards that help debugging global- and domain-level control failures and configuration status.

Prevention:

  • Improve the automated handling of disconnected local Software Defined Networking (SDN) controllers that will reduce overall impact and mitigation time. .
  • Improve the global safety systems that globally halt elective rollouts on Google’s wide area (WAN) networks, when global capacity is reduced.
  • Further isolate production network neighborhoods.

Mitigation:

  • Improve the API to push changes to global and local network controllers during a service disruption.
  • Improve failure domain configuration validation to reject unintended configuration.
  • Improve the feature and testing of emergency tools across different scenarios at regular intervals, and invest in test environments that enable us to do this without impact.

Detailed Description of Impact:

Google Cloud Networking

  • Google Cloud Networking experienced reduced capacity within the Google Cloud regions of us-east1, southamerica-west1, us-central1, and us-central2, starting from Friday 15 July 2022 02:24 US/Pacific. The most severe impact occurred from 03:58 to 12:40.
  • southamerica-west1 experienced reduced capacity between 14:40 and 15:02.
  • southamerica-west1 and southamerica-east1 both experienced egress packet loss between 20% and 30%.

Google Cloud Storage (GCS)

  • Between Thursday, 14 July 2022 21:57 US/Pacific and Friday, 15 July 12:40 US/Pacific, GCS customers may have experienced elevated latency, delays, issues importing, exporting or querying data from GCS buckets, and HTTP 500 errors in the Google Cloud regions of us-east1, southamerica-west1, and us-central2, and for buckets located in us-central1, us-east1, and nam4.
  • This disruption affected 0.007% of GCS requests and impacted customers reading/writing data and metadata.

Google BigQuery

  • Customer workloads running in impacted Google Cloud regions and using the BigQuery Storage Read and Write APIs to read from or write to BigQuery may have experienced elevated latency.

Note

  • Additional services may have been impacted by this event but did not meet the thresholds to be included in this report.
18 Jul 2022 15:31 PDT

This is a preliminary Incident Report (Mini-IR). A Full Incident Report with additional details is being prepared and will be posted at a later date.

We apologize for the inconvenience this service disruption/outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support.

(All Times US/Pacific)

Incident Start: 14 July 2022 19:30

Incident End: 15 July 2022 15:02

Duration: 19 hours, 32 minutes

Affected Services

Google Cloud Networking Google BigQuery Google Cloud Storage (GCS)

Regions/Zones: Global Description:

Google Cloud Networking experienced reduced availability globally for a period of 19 hours and 32 minutes. Because of the nature of the outage and resilience capabilities of GCP products, the impacted regions and individual impacted windows may vary inside of the network impact window. From preliminary analysis, the root cause was due to an issue with a new control plane configuration rollout.

Customer Impact:

Google Cloud Storage

  • Affected customers would have experienced elevated latencies or HTTP 500 errors in multiple regions, including us-east1, us-central1, and southamerica-west1.
  • Affected customers reading/writing data from GCS impacted regions or from buckets located in impacted regions would have observed high latency and/or errors.
  • Affected customers reading/writing metadata for buckets located in impacted regions would have observed high latency and/or errors.

Google BigQuery

  • Affected customers performing cross-regional table copies would have observed increased latency and/or errors.
15 Jul 2022 14:17 PDT

The issue with Google BigQuery has been resolved for all affected users as of Friday, 2022-07-15 14:12 US/Pacific.

We thank you for your patience while we worked on resolving the issue.

15 Jul 2022 14:02 PDT

Summary: BigQuery observing high import/export job latencies in several regions

Description: We are experiencing an issue with Google BigQuery beginning at Friday, 2022-07-15 05:00 US/Pacific.

Our engineering team continues to investigate the issue.

We will provide an update by Friday, 2022-07-15 15:05 US/Pacific with current details.

We apologize to all who are affected by the disruption.

Diagnosis: BigQuery is observing high import/export job latencies (slowness) in several regions, which may ultimately result in errors.

Workaround: None at this time.

15 Jul 2022 13:40 PDT

Summary: BigQuery observing high import/export job latencies in several regions

Description: We are experiencing an issue with Google BigQuery beginning at Friday, 2022-07-15 05:00 US/Pacific.

Our engineering team continues to investigate the issue.

We will provide an update by Friday, 2022-07-15 15:00 US/Pacific with current details.

We apologize to all who are affected by the disruption.

Diagnosis: BigQuery is observing high import/export job latencies (slowness) in several regions, which may ultimately result in errors.

Workaround: None at this time.