Service Health

This page provides status information on the services that are part of Google Cloud. Check back here to view the current status of the services listed below. If you are experiencing an issue not listed here, please contact Support. Learn more about what's posted on the dashboard in this FAQ. For additional information on these services, please visit https://cloud.google.com/.

Incident affecting Google BigQuery, Google Cloud Dataflow

We are investigating elevated error rates and latency for streaming ingestion into BigQuery

Incident began at 2024-12-09 09:33 and ended at 2024-12-09 11:50 (all times are US/Pacific).

Previously affected location(s)

Multi-region: us; Iowa (us-central1), South Carolina (us-east1), Northern Virginia (us-east4), Oregon (us-west1), Los Angeles (us-west2)

Date Time Description
16 Dec 2024 09:38 PST

Incident Report

Summary

Starting on Monday, 9 December 2024 09:24 US/Pacific, some Google BigQuery customers in the US multi-region encountered failures on 80-90% of requests to the insertAll API, which returned ‘5xx’ error codes, along with increased latency. BigQuery Write API customers also saw increased latency for some requests during this time. In addition, 5-15% of Dataflow streaming jobs may have experienced increased latency. The impact lasted for a duration of 2 hours and 16 minutes.

To our BigQuery and Dataflow customers who were impacted during this disruption, we sincerely apologize. This is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform’s performance and availability.

Root Cause

Dataflow depends on BigQuery to write streaming data to a table. BigQuery uses a streaming backend service to persist data. This streaming backend service in one of the clusters in the US multi-region experienced a high number of concurrent connections, caused by a slightly higher than usual fluctuation in customer traffic. The backend service enforces a limit on the number of such concurrent connections for flow control, and a bug in that mechanism prevented the backends from accepting new requests once the limit was reached.
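To illustrate the class of bug described above, the following is a minimal sketch of a backend that enforces a concurrency limit for flow control. The names and limit are hypothetical and this is not BigQuery's actual implementation; the point is that if the in-flight counter is not released on every code path, the backend keeps rejecting new requests even after load subsides.

    import threading

    MAX_CONCURRENT = 100  # hypothetical per-backend connection limit

    class StreamingBackend:
        def __init__(self):
            self._lock = threading.Lock()
            self._inflight = 0

        def try_acquire(self) -> bool:
            with self._lock:
                if self._inflight >= MAX_CONCURRENT:
                    return False  # flow control: refuse new connections over the limit
                self._inflight += 1
                return True

        def release(self) -> None:
            with self._lock:
                self._inflight -= 1

        def handle(self, request) -> str:
            if not self.try_acquire():
                raise RuntimeError("unavailable")  # surfaces to clients as a 5xx error
            try:
                return f"persisted {request}"      # stand-in for writing to storage
            finally:
                # The class of bug described above is equivalent to skipping this
                # release on some code path: the in-flight count never drains back
                # below the limit, so the backend rejects all new requests from then on.
                self.release()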

Due to the reduced bandwidth of the backend servers in the cluster, the regional streaming frontends experienced higher latency and began accumulating in-flight requests. This overloaded the frontends and caused regional impact to BigQuery and Dataflow streaming customers outside the affected cluster.

Remediation and Prevention

Google engineers were alerted to the outage by our internal monitoring system on Monday, 9 December 2024 at 09:33 US/Pacific and immediately started an investigation.

After a thorough investigation, the impacted backend cluster was identified. The initial mitigation attempt focused on reducing server load through traffic throttling. To achieve complete mitigation, our engineers then drained the affected cluster, resulting in immediate and complete recovery.

Google is committed to preventing this issue from repeating in the future and is completing the following actions:

  • Fix the root cause in the backend service to handle surges in concurrent connections and avoid zonal impact.
  • Improve testing coverage of the backend service to prevent similar issues.
  • Enhance the ability to detect and automatically mitigate similar cases of zonal impact.
  • Improve isolation to prevent issues in a particular cluster or availability zone from impacting all users in the region.

Detailed Description of Impact

On Monday, 9 December 2024, from 09:24 to 11:40 US/Pacific, BigQuery and Dataflow customers experienced increased latency and elevated error rates in the US multi-region.

Google BigQuery

  • 80-90% of all requests to the insertAll API failed with a ‘5xx’ status code in the US multi-region. Tail latency also increased substantially, from <100ms to ~30 seconds, during this time.
  • Additionally, AppendRows requests for the Write API saw increased tail latency (99.99th percentile), from <3 seconds to ~30 seconds, during this time.

Cloud Dataflow

  • 5-15% of Dataflow streaming jobs may have experienced increased latency in us-east1, us-east4, us-west1, us-west2 and us-central1 regions for the duration of the incident.
10 Dec 2024 10:23 PST

Mini Incident Report

We apologize for the inconvenience this service disruption/outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support or to Google Workspace Support using help article https://support.google.com/a/answer/1047213.

(All Times US/Pacific)

Incident Start: 9 December 2024 09:24

Incident End: 9 December 2024 11:40

Duration: 2 hours, 16 minutes

Affected Services and Features:

Google BigQuery, Cloud Dataflow

Regions/Zones:

  • Google BigQuery - US multi-region
  • Cloud Dataflow - us-west1, us-east1, us-east4, us-west2, and us-central1 were the most impacted, but all Dataflow pipelines writing to the BigQuery US multi-region were likely impacted as well.

Description:

Google BigQuery experienced increased latency and elevated error rates in the US multi-region for a duration of 2 hours, 16 minutes. Cloud Dataflow customers also observed elevated latency in their streaming jobs writing to the BigQuery US multi-region. Preliminary analysis indicates that the root cause of the issue was a sudden burst of traffic, which overloaded and slowed the backend in one availability zone. This led to aggressive retries, which overloaded the frontend service. The incident was mitigated by rate-limiting requests and by evacuating the slow availability zone.
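For illustration only, here is a minimal sketch of the kind of rate-limiting used to protect an overloaded service, based on a simple token bucket with hypothetical limits; it is not the actual mitigation tooling used during this incident.

    import time

    class TokenBucket:
        def __init__(self, rate_per_sec: float, burst: int):
            self.rate = rate_per_sec       # steady-state admitted requests per second
            self.capacity = burst          # short bursts allowed above the steady rate
            self.tokens = float(burst)
            self.last = time.monotonic()

        def allow(self) -> bool:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False  # shed the request instead of queueing it on the slow backend

    limiter = TokenBucket(rate_per_sec=500, burst=100)

    def frontend_handle(request):
        if not limiter.allow():
            raise RuntimeError("throttled")  # fail fast rather than adding to the backlog
        return request                       # forward to the streaming backend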

Google will complete a full incident report in the following days that will provide a full root cause analysis.

Customer Impact:

Google BigQuery

  • During the incident, customers calling the google.cloud.bigquery.v2.TableDataService.InsertAll API method may have experienced transient failures with a 5XX status code; these requests should have succeeded after retries (see the retry sketch below).
  • Customers using google.cloud.bigquery.storage.v1.AppendRows may have experienced increased latency during this incident.
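The following is a minimal retry sketch for the insertAll path, assuming the google-cloud-bigquery Python client; the table name and rows are hypothetical. It retries transient 5xx responses, such as those seen during this incident, with exponential backoff rather than surfacing the first failure to the caller.

    from google.api_core.retry import Retry, if_transient_error
    from google.cloud import bigquery

    client = bigquery.Client()

    # Retry only transient errors (HTTP 500/503/429), with exponential backoff.
    transient_retry = Retry(
        predicate=if_transient_error,
        initial=1.0,      # first pause, in seconds
        maximum=30.0,     # cap on the pause between attempts
        multiplier=2.0,
        deadline=120.0,   # give up after roughly two minutes overall
    )

    rows = [{"event_id": "1", "payload": "example"}]   # hypothetical rows
    errors = client.insert_rows_json(
        "my-project.my_dataset.my_table",              # hypothetical table
        rows,
        retry=transient_retry,
    )
    if errors:
        # Per-row errors reported by the API (not transport failures) land here.
        raise RuntimeError(f"insertAll reported row errors: {errors}")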

Cloud Dataflow

  • Customers would have experienced increased latency for streaming jobs in the us-east1, us-east4, us-west1, us-west2, and us-central1 regions.
9 Dec 2024 11:50 PST

The issue with Google BigQuery, Google Cloud Dataflow has been resolved for all affected projects as of Monday, 2024-12-09 11:40 US/Pacific.

We will publish an analysis of this incident once we have completed our internal investigation.

We thank you for your patience while we worked on resolving the issue.

9 Dec 2024 11:38 PST

Summary: We are investigating elevated error rates and latency for streaming ingestion into BigQuery

Description: We are experiencing an issue with Google BigQuery, Cloud Dataflow beginning on Monday, 2024-12-09 09:30 US/Pacific.

Our engineering team continues to investigate the issue.

We will provide an update by Monday, 2024-12-09 12:15 US/Pacific with current details.

We apologize to all who are affected by the disruption.

Diagnosis: Google BigQuery customers may see elevated 503 errors and increased latency when using either the Storage Write API or insertAll.

Cloud Dataflow customers may observe elevated latency in their streaming jobs.

Workaround: None at this time.

9 Dec 2024 10:58 PST

Summary: We are investigating elevated error rates and latency for streaming ingestion into BigQuery

Description: We are experiencing an issue with Google BigQuery, Cloud Dataflow beginning on Monday, 2024-12-09 09:30 US/Pacific.

Our engineering team continues to investigate the issue.

We will provide an update by Monday, 2024-12-09 11:45 US/Pacific with current details.

Diagnosis: Google BigQuery customers may see elevated 503 errors and increased latency when using either the Storage Write API or insertAll.

Cloud Dataflow customers may observe elevated latency in their streaming jobs.

Workaround: None at this time.