Service Health
Incident affecting Google BigQuery, Google Cloud Dataflow
We are investigating elevated error rates and latency for streaming ingestion into BigQuery.
Incident began at 2024-12-09 09:33 and ended at 2024-12-09 11:50 (all times are US/Pacific).
Previously affected location(s)
- Multi-region: us
- Iowa (us-central1)
- South Carolina (us-east1)
- Northern Virginia (us-east4)
- Oregon (us-west1)
- Los Angeles (us-west2)
16 Dec 2024 09:38 PST

Incident Report

Summary

Starting on Monday, 9 December 2024 09:24 US/Pacific, some Google BigQuery customers in the US multi-region encountered failures on 80-90% of requests to the insertAll API, with '5xx' error codes and increased latency. BigQuery Write API customers also saw increased latency for some requests during this time, and Dataflow customers may have experienced a 5-15% increase in latency. The impact lasted 2 hours and 16 minutes. To our BigQuery and Dataflow customers who were impacted during this disruption, we sincerely apologize. This is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform's performance and availability.

Root Cause

Dataflow depends on BigQuery to write streaming data to a table, and BigQuery uses a streaming backend service to persist that data. The streaming backend service in one of the clusters in the US multi-region experienced a high number of concurrent connections, caused by a slightly higher than usual fluctuation in customer traffic. This backend service enforces a limit on the number of concurrent connections for flow control, and a bug in that mechanism prevented backends from accepting new requests once the limit was reached. Because of the reduced bandwidth of the backend servers in the cluster, the regional streaming frontends experienced higher latency and began accumulating more in-flight requests. This overloaded the frontends, extending the impact regionally to BigQuery and Dataflow streaming customers outside the affected cluster. (Illustrative sketches of this write path and failure mode follow this report.)

Remediation and Prevention

Google engineers were alerted to the outage by our internal monitoring system on Monday, 9 December 2024 at 09:33 US/Pacific and immediately started an investigation. After a thorough investigation, the impacted backend cluster was identified. The initial mitigation attempt focused on reducing server load through traffic throttling. To achieve complete mitigation, our engineers then drained the affected cluster, resulting in immediate and complete recovery. Google is committed to preventing this issue from repeating in the future and is completing the following actions:
Detailed Description of Impact

On Monday, 9 December 2024 from 09:24 to 11:40 US/Pacific, BigQuery and Dataflow customers experienced increased latency and elevated error rates in the US multi-region.

Google BigQuery

Cloud Dataflow
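As background for the root cause above, Dataflow streaming jobs commonly reach BigQuery through streaming inserts (the insertAll path named in this incident). The sketch below is a minimal, hypothetical Apache Beam pipeline illustrating that write path; the project, topic, and table names are invented placeholders, not drawn from any affected customer's pipeline.

```python
# Minimal Beam streaming pipeline writing to BigQuery via streaming
# inserts (the insertAll path affected by this incident).
# All resource names are hypothetical placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)  # run as a streaming job

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(
           topic="projects/my-project/topics/events")
     | "Parse" >> beam.Map(lambda msg: {"payload": msg.decode("utf-8")})
     | "Write" >> beam.io.WriteToBigQuery(
           "my-project:my_dataset.events",
           schema="payload:STRING",
           method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS))
```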
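The flow-control bug described in the root cause is a generic failure mode: a concurrency limit whose accounting goes wrong can stop a server from ever accepting new work. The toy sketch below is not Google's code; it is a minimal, hypothetical illustration of the pattern and of why permits must be released on every exit path.

```python
import threading

MAX_INFLIGHT = 100  # hypothetical flow-control limit, not Google's value

class StreamingBackend:
    """Toy model of connection flow control on a streaming backend."""

    def __init__(self):
        self._permits = threading.Semaphore(MAX_INFLIGHT)

    def handle(self, request):
        # Reject new work once the in-flight limit is reached.
        if not self._permits.acquire(blocking=False):
            raise RuntimeError("overloaded: connection limit reached")
        try:
            return self._process(request)  # may raise on bad input
        finally:
            # The class of bug described in the report: if any code path
            # skips this release, permits leak and the backend stops
            # accepting requests even after the traffic spike subsides.
            self._permits.release()

    def _process(self, request):
        return f"persisted {request!r}"
```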
10 Dec 2024 10:23 PST

Mini Incident Report

We apologize for the inconvenience this service disruption/outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support or to Google Workspace Support using help article https://support.google.com/a/answer/1047213.

(All Times US/Pacific)

Incident Start: 9 December 2024 09:24

Incident End: 9 December 2024 11:40

Duration: 2 hours, 24 minutes

Affected Services and Features: Google BigQuery, Cloud Dataflow

Regions/Zones:

Description: Google BigQuery experienced increased latency and elevated error rates in the US multi-region for 2 hours, 24 minutes. Cloud Dataflow customers also observed elevated latency in their streaming jobs to the BigQuery US multi-region. Preliminary analysis indicates that the root cause was a sudden burst of traffic, which overloaded and slowed the backend in one availability zone. This led to aggressive retries, which overloaded the frontend service. The incident was mitigated by rate-limiting requests and by evacuating the slow availability zone. (A sketch of this kind of rate limiter follows this report.) Google will complete a full incident report in the following days that will provide a full root cause.

Customer Impact:

Google BigQuery

Cloud Dataflow
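Rate-limiting requests, the first mitigation above, is commonly implemented as a token bucket that sheds load once a budget is exhausted. The sketch below is a generic, hypothetical illustration; the rate and capacity values are invented and nothing here reflects Google's internal throttling.

```python
import threading
import time

class TokenBucket:
    """Minimal token-bucket rate limiter for shedding excess load."""

    def __init__(self, rate_per_sec: float, capacity: float):
        self.rate = rate_per_sec   # sustained requests per second
        self.capacity = capacity   # maximum burst size
        self.tokens = capacity
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def allow(self) -> bool:
        with self.lock:
            now = time.monotonic()
            # Refill in proportion to elapsed time, capped at capacity.
            self.tokens = min(
                self.capacity,
                self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False  # over budget: reject (e.g. respond with 503)

# Hypothetical usage: allow 100 requests/s with bursts up to 200.
limiter = TokenBucket(rate_per_sec=100.0, capacity=200.0)
```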
9 Dec 2024 11:50 PST

The issue with Google BigQuery and Google Cloud Dataflow has been resolved for all affected projects as of Monday, 2024-12-09 11:40 US/Pacific. We will publish an analysis of this incident once we have completed our internal investigation. We thank you for your patience while we worked on resolving the issue.
9 Dec 2024 11:38 PST

Summary: We are investigating elevated error rates and latency for streaming ingestion into BigQuery.

Description: We are experiencing an issue with Google BigQuery and Cloud Dataflow beginning on Monday, 2024-12-09 09:30 US/Pacific. Our engineering team continues to investigate the issue. We will provide an update by Monday, 2024-12-09 12:15 US/Pacific with current details. We apologize to all who are affected by the disruption.

Diagnosis: Google BigQuery customers may see elevated 503 errors and increased latency when using either the Storage Write API or insertAll. Cloud Dataflow customers may observe elevated latency in their streaming jobs.

Workaround: None at this time.
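During a frontend overload like this one, immediate client retries tend to make things worse. For callers reaching the insertAll path through the Python BigQuery client, the sketch below shows exponential backoff with full jitter on 503s; the table ID is a placeholder and the retry budget values are illustrative assumptions, not a Google recommendation.

```python
import random
import time

from google.api_core.exceptions import ServiceUnavailable
from google.cloud import bigquery

client = bigquery.Client()

def insert_rows_with_backoff(table_id, rows, max_attempts=5):
    """Stream rows via insertAll, backing off on 503s.

    Full-jitter exponential backoff spreads retries out so that
    clients do not stampede an already overloaded frontend.
    """
    for attempt in range(max_attempts):
        try:
            errors = client.insert_rows_json(table_id, rows)
            if not errors:
                return
            # Row-level errors are not 503s; do not retry them blindly.
            raise RuntimeError(f"row-level insert errors: {errors}")
        except ServiceUnavailable:
            if attempt == max_attempts - 1:
                raise  # out of retry budget; surface the 503
            # Sleep a random duration up to an exponentially growing cap.
            time.sleep(random.uniform(0, min(32.0, 2.0 ** attempt)))

# Hypothetical usage with a placeholder table ID.
insert_rows_with_backoff("my-project.my_dataset.events",
                         [{"payload": "hello"}])
```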
9 Dec 2024 10:58 PST

Summary: We are investigating elevated error rates and latency for streaming ingestion into BigQuery.

Description: We are experiencing an issue with Google BigQuery and Cloud Dataflow beginning on Monday, 2024-12-09 09:30 US/Pacific. Our engineering team continues to investigate the issue. We will provide an update by Monday, 2024-12-09 11:45 US/Pacific with current details.

Diagnosis: Google BigQuery customers may see elevated 503 errors and increased latency when using either the Storage Write API or insertAll. Cloud Dataflow customers may observe elevated latency in their streaming jobs.

Workaround: None at this time.
- All times are US/Pacific