Service Health
Incident affecting Google BigQuery
We are investigating an issue with Google BigQuery impacting customers in the US region
Incident began at 2023-04-01 03:30 and ended at 2023-04-03 05:30 (all times are US/Pacific).
Previously affected location(s)
Multi-region: us
Date | Time | Description | |
---|---|---|---|
| 10 Apr 2023 | 08:40 PDT | Incident Report. Summary: Between Saturday, 01 April 2023 at 03:30 US/Pacific and Monday, 03 April 2023 at 05:30 US/Pacific, BigQuery experienced three separate windows of elevated latency and unavailability errors (503, 500 errors) in the US multi-region for a total duration of 8 hours, 2 minutes. To our BigQuery customers whose business analytics were impacted during this disruption, we sincerely apologize. This is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform’s performance and availability. We have conducted an internal investigation and are taking steps to improve our service. Root Cause: BigQuery’s API provides users the ability to send job requests as outlined in the public documentation. Elevated latencies or elevated error rates on these API requests may lead to customers experiencing service unavailability. These job requests are received by job frontend servers, which handle the initial processing and routing to BigQuery’s query processing engine. Each server processes these requests in a series of thread pools. The issue was triggered by an unexpected surge of job insert API requests from a single workload in rapid succession. This surge led to excessive load on a small fraction of our metadata serving systems, creating a hotspot. The hotspot increased latencies and backed up requests to read the metadata for the single workload. This, in turn, led to thread pool contention, which resulted in elevated latency and timeout failures for other API job requests hitting the affected job frontend servers. Remediation and Prevention: Google engineers were first alerted to the outage by internal monitoring on 01 April 2023 at 04:05 US/Pacific and immediately started an investigation. A single workload was identified as the trigger. While engineers continued to work on identifying the cause and mitigation, the latency subsided and the issue was mitigated without intervention. 
After mitigation, our engineers continued investigations to understand the connection between the single workload and the broader impact. During ongoing investigations, engineers were alerted to a recurrence of the issue on 02 April 2023 at 03:05 US/Pacific. The issue was further escalated, and engineering added CPU resources to the problematic server. This was completed around the same time the single workload ended, thereby mitigating the issue. After further monitoring, engineers concluded their investigation because the added resources had helped mitigate the issue. On 03 April 2023 at 01:34 US/Pacific, our engineers were alerted to a reappearance of the issue. The engineering team identified and successfully stopped the type of workload that was causing the issue, effectively mitigating the problem. The team has taken additional steps to block that specific workload in the future. Google is committed to preventing a repeat of this issue in the future and is completing the following actions:
Detailed Description of Impact: On 01 April 2023 between 03:30 and 05:14 US/Pacific, on 02 April 2023 between 03:00 and 05:08, and on 03 April 2023 between 01:20 and 05:30, affected BigQuery customers in the US multi-region experienced elevated latency and unavailability errors (503, 500 errors) when using the jobs API to insert a query job, get information about a job, cancel a job, or get back query results. |
| 4 Apr 2023 | 13:17 PDT | Mini Incident Report: We apologize for the inconvenience this service disruption/outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support. (All Times US/Pacific) Impact Window #1: Incident Start: 1 April 2023 03:30. Impact Window #2: Incident Start: 2 April 2023 03:00. Impact Window #3: Incident Start: 3 April 2023 01:20. Cumulative Duration: 8 hours, 2 minutes. Affected Services and Features: BigQuery. Regions/Zones: Multi-Region US. Description: BigQuery experienced elevated latency and error rates in the US multi-region. The incident spanned 1 April 2023 03:30 to 3 April 2023 05:30, for a cumulative duration of 8 hours, 2 minutes. From preliminary analysis, the issue was triggered by an unexpected surge of job API requests from a single workload. The issue was mitigated once Google engineers identified and isolated the problematic workload. Google will complete a detailed Incident Report in the following days that will provide a full root cause. Customer Impact:
|
| 3 Apr 2023 | 14:42 PDT | The issue with Google BigQuery has been resolved for all affected users as of Monday, 2023-04-03 05:30 US/Pacific. We thank you for your patience while we worked on resolving the issue. |
| 3 Apr 2023 | 13:24 PDT | Summary: We are investigating an issue with Google BigQuery impacting customers in the US region. Description: Engineers believe that the issue that was causing high latency and unavailable errors has been mitigated; however, we are continuing to monitor the service to confirm full recovery. We will provide more information by Monday, 2023-04-03 15:30 US/Pacific. Diagnosis: Some customers will see high latency and unavailable errors (503, 500 errors) while using the jobs API to insert a query job or get back query results. Workaround: None at this time. |
| 3 Apr 2023 | 12:25 PDT | Summary: We are investigating an issue with Google BigQuery impacting customers in the US region. Description: Engineers believe that the issue that was causing high latency and unavailable errors has been mitigated; however, we are continuing to monitor the service to confirm full recovery. We will provide more information by Monday, 2023-04-03 13:30 US/Pacific. Diagnosis: Some customers will see high latency and unavailable errors (503, 500 errors) while using the jobs API to insert a query job or get back query results. Workaround: None at this time. |
| 3 Apr 2023 | 11:53 PDT | Summary: We are investigating an issue with Google BigQuery impacting customers in the US region. Description: Engineers believe that the issue that was causing high latency and unavailable errors has been mitigated, and are continuing to monitor the service. We will provide more information by Monday, 2023-04-03 12:30 US/Pacific. Diagnosis: Some customers will see high latency and unavailable errors (503, 500 errors) while using the jobs API to insert a query job or get back query results. Workaround: None at this time. |
| 3 Apr 2023 | 09:38 PDT | Summary: We are investigating an issue with Google BigQuery impacting customers in the US region. Description: Engineers believe that the issue that was causing high latency and unavailable errors has been mitigated, and are continuing to monitor the service. We will provide more information by Monday, 2023-04-03 12:00 US/Pacific. Diagnosis: Some customers will see high latency and unavailable errors (503, 500 errors) while using the jobs API to insert a query job or get back query results. Workaround: None at this time. |
| 3 Apr 2023 | 04:53 PDT | Summary: We are investigating an issue with Google BigQuery impacting customers in the US region. Description: Engineers have localized the issue and mitigation work is currently underway by our engineering team. We believe there is no ongoing impact but will continue to monitor the service. If customers are still impacted, please raise a support case. We will provide more information by Monday, 2023-04-03 12:00 US/Pacific. Diagnosis: Some customers will see high latency and unavailable errors (503, 500 errors) while using the jobs API to insert a query job or get back query results. Workaround: None at this time. |
| 3 Apr 2023 | 02:46 PDT | Summary: We are investigating an issue with Google BigQuery impacting customers in the US region. Description: We are experiencing an issue with Google BigQuery. At present there is no ongoing impact; Google engineers are continuing to investigate and are monitoring closely. If customers are experiencing issues, please raise a support case. We will provide an update by Monday, 2023-04-03 05:00 US/Pacific with current details. Diagnosis: Some customers will see high latency and unavailable errors (503, 500 errors) while using the jobs API to insert a query job or get back query results. Workaround: None at this time. |
| 3 Apr 2023 | 02:18 PDT | Summary: We are investigating an issue with Google BigQuery impacting customers in the US region. Description: We are experiencing an issue with Google BigQuery. Our engineering team continues to investigate the issue. We will provide an update by Monday, 2023-04-03 03:00 US/Pacific with current details. Diagnosis: Some customers will see high latency and unavailable errors (503, 500 errors) while using the jobs API to insert a query job or get back query results. Workaround: None at this time. |
| 3 Apr 2023 | 02:11 PDT | Summary: We are investigating an issue with Google BigQuery impacting customers in the US region. Description: We are experiencing an issue with Google BigQuery. Our engineering team continues to investigate the issue. We will provide an update by Monday, 2023-04-03 03:00 US/Pacific with current details. Diagnosis: Some customers will see unavailable errors while using the jobs API to insert a query job or get back query results. Workaround: None at this time. |
- All times are US/Pacific
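
The Root Cause section of the Incident Report describes slow requests from a single workload tying up a shared thread pool and delaying unrelated job requests. The sketch below is a toy model of that effect, not a description of BigQuery's actual implementation. The function name `mean_wait_by_workload`, the pool size, the request counts, and the service times are all assumptions chosen for illustration. It serves requests in arrival order from one fixed-size pool and compares how long ordinary requests queue with and without a surge of slow requests ahead of them.

```python
import heapq
from collections import defaultdict


def mean_wait_by_workload(requests, workers):
    """Simulate one FIFO thread pool and return the mean queue wait per workload.

    requests: iterable of (arrival_time, service_time, workload) tuples.
    workers:  number of threads in the (hypothetical) shared pool.
    """
    # Min-heap of the times at which each worker thread next becomes free.
    free_at = [0.0] * workers
    heapq.heapify(free_at)
    waits = defaultdict(list)
    # Serve requests in arrival order; sorted() is stable, so ties keep input order.
    for arrival, service, workload in sorted(requests, key=lambda r: r[0]):
        # A request starts when it has arrived and the earliest thread is free.
        start = max(arrival, heapq.heappop(free_at))
        heapq.heappush(free_at, start + service)
        waits[workload].append(start - arrival)
    return {name: sum(w) / len(w) for name, w in waits.items()}


# Illustrative numbers only; none of these come from the incident report.
POOL_SIZE = 4
# Ordinary job requests: one per time unit, each served in 1 unit.
other = [(float(t), 1.0, "other") for t in range(1, 11)]
# Surge: 40 requests at t=0, each slow (5 units), standing in for backed-up metadata reads.
surge = [(0.0, 5.0, "surge")] * 40

baseline = mean_wait_by_workload(other, POOL_SIZE)
with_surge = mean_wait_by_workload(surge + other, POOL_SIZE)

print("mean wait for other requests, no surge:  ", baseline["other"])
print("mean wait for other requests, with surge:", with_surge["other"])
```

In this toy setup the ordinary requests never queue when the surge is absent, but they wait tens of time units once the surge's slow requests occupy every thread. This mirrors the elevated latency and timeout failures the report describes for other API job requests.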