Service Health
Incident affecting Google BigQuery
Elevated latency in BigQuery within EU multiregion
Incident began at 2023-11-22 09:40 and ended at 2023-11-22 10:50 (all times are US/Pacific).
Previously affected location(s)
Multi-region: eu
| Date | Time | Description |
|---|---|---|
| 26 Nov 2023 | 19:37 PST | Incident Report Summary: Beginning at 23:25 US/Pacific on Tuesday 21 November 2023, Google BigQuery’s job servers in the EU multi-region experienced elevated latency and connection errors on three separate occasions, for a cumulative period of 5 hours and 22 minutes. To our BigQuery customers whose backend jobs were impacted during this disruption, we sincerely apologize. This is not the level of quality and reliability we strive to offer you, and we took immediate steps to improve the platform’s performance and availability. Root Cause: The root cause of the issue was contention over the thread pools used by our backend metadata system. In this incident, Google BigQuery received an unexpected spike of jobs, each of which required a large number of complex requests to our backend metadata service per query. This traffic consumed most of the available thread pool in the metadata servers, causing high latency for metadata requests as well as restarts in the job server. As a result, some customers in the impacted region observed high query latency and connection errors. Remediation and Prevention: Google engineers were alerted to the issue at 23:27 US/Pacific on Tuesday 21 November 2023 by our monitoring tools and immediately started taking action to mitigate it. As a first step, Google engineers increased memory on the job servers, which mitigated the issue in the interim, followed by horizontal upscaling of the job servers in the impacted region. After the third occurrence, Google engineers were able to narrow down the underlying root cause to the metadata server thread pools and to identify the traffic pattern causing the spike of requests. The issue was mitigated at 13:47 US/Pacific on Wednesday 22 November 2023, when Google engineers put in place measures to prevent these workloads from affecting the rest of the system.
These measures included limits on the workload triggering the error as well as additional metadata server resources to allow them to handle increased load gracefully. For our Google BigQuery customers whose backend jobs were affected, we apologize for the length and severity of this incident. We are taking immediate steps to prevent a recurrence and improve reliability in the future:
Detailed Description of Impact: Customers in the EU multi-region experienced elevated latencies or could not schedule jobs in Google BigQuery on three separate occasions, for a cumulative period of 5 hours and 22 minutes. |
| 22 Nov 2023 | 16:31 PST | Mini Incident Report: We apologize for the inconvenience this service disruption may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support. (All Times US/Pacific)
Affected Services and Features: Google BigQuery Regions/Zones: Multi-region: eu Description: Google BigQuery experienced three occurrences of elevated latencies in the eu multi-region for a total duration of 5 hours, 22 minutes. From preliminary analysis, this issue was triggered by a pattern of traffic to our backend job servers that resulted in an unusually large number of requests to the downstream metadata servers, causing the job servers to slow down or drop connections. Google engineers were able to identify the source of the incoming traffic that was causing the issue, and put in place measures to prevent these workloads from affecting the rest of the system. The issue was fully mitigated as of 13:47 US/Pacific. Google will complete a full incident report (IR) in the following days that will provide a full root cause analysis. Customer Impact:
|
| 22 Nov 2023 | 14:08 PST | The issue with Google BigQuery has been resolved for all affected users. Google engineers were able to identify the workload causing the error and have put in place measures, as of Wednesday, 2023-11-22 13:47 US/Pacific, to ensure this issue doesn't recur. We will publish the full root cause of this incident in the next few days. We thank you for your patience while we worked on resolving the issue. |
| 22 Nov 2023 | 12:59 PST | Summary: Elevated latency in BigQuery within EU multiregion Description: Google BigQuery has experienced three occurrences of elevated latency in EU multiregion. The first occurrence was between 2023-11-21 23:20 US/Pacific and 2023-11-22 02:20 US/Pacific. The second occurrence was between 2023-11-22 05:30 US/Pacific and 2023-11-22 06:05 US/Pacific. The third occurrence was between 2023-11-22 09:40 US/Pacific and 2023-11-22 10:50 US/Pacific. There are no latency issues noticed in the system at this time. There is also no ongoing impact at the moment. Our engineering team has identified the cause of the issue to be an unexpected workload in the EU multi-region. All necessary teams are engaged and continuing to work on identifying a mitigation strategy. We will provide an update by Wednesday, 2023-11-22 15:00 US/Pacific with current details. We apologize to all who are affected by the disruption. We sincerely appreciate your patience and understanding as we work to resolve it as quickly as possible. Diagnosis: Impacted users may experience intermittent connection errors and higher query latencies. Workaround: There is no known workaround at this time. |
| 22 Nov 2023 | 12:08 PST | Summary: BigQuery is experiencing elevated latency in EU multiregion Description: Google BigQuery has experienced three occurrences of elevated latency in EU multiregion. The first occurrence was between 2023-11-21 23:20 US/Pacific and 2023-11-22 02:20 US/Pacific. The second occurrence was between 2023-11-22 05:30 US/Pacific and 2023-11-22 06:05 US/Pacific. The third occurrence was between 2023-11-22 09:40 US/Pacific and 2023-11-22 10:50 US/Pacific. The latencies are back to normal and there is no ongoing impact at the moment. Our engineering team has identified the cause of the issue to be an unexpected workload in the EU multi-region and is continuing to work on identifying a mitigation strategy. We will provide an update by Wednesday, 2023-11-22 13:08 US/Pacific with current details. We apologize to all who are affected by the disruption. Diagnosis: Impacted users may experience intermittent connection errors and higher query latencies. Workaround: None at this time. |
| 22 Nov 2023 | 11:04 PST | Summary: BigQuery is experiencing elevated latency in EU multiregion Description: Google BigQuery has experienced three occurrences of elevated latency in EU multiregion. The first occurrence was between 2023-11-21 23:20 US/Pacific and 2023-11-22 02:20 US/Pacific. The second occurrence was between 2023-11-22 05:30 US/Pacific and 2023-11-22 06:05 US/Pacific. The third occurrence was between 2023-11-22 09:40 US/Pacific and 2023-11-22 10:50 US/Pacific. Our engineering team is continuing to monitor the BigQuery service closely while they investigate the cause of the issue. We will provide an update by Wednesday, 2023-11-22 12:00 US/Pacific with current details. We apologize to all who are affected by the disruption. Diagnosis: Impacted users may experience intermittent connection errors and higher query latencies. Workaround: None at this time. |
| 22 Nov 2023 | 10:45 PST | Summary: BigQuery experiencing increased latency in Multiregion Europe Description: We are experiencing an issue with Google BigQuery. Our engineering team continues to investigate the issue. We will provide an update by Wednesday, 2023-11-22 11:15 US/Pacific with current details. We apologize to all who are affected by the disruption. Diagnosis: Impacted users may experience connection errors and higher query latencies. Workaround: None at this time. |
- All times are US/Pacific
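
The thread pool contention described under Root Cause can be illustrated with a toy model. The sketch below is not Google's metadata service or its scheduling policy: `simulate_pool`, the pool size of 4, and all timings are hypothetical values chosen only to show the effect. It assumes a fixed-size pool that serves requests first-come, first-served, and compares the latency of a few cheap requests with and without a preceding burst of expensive ones.

```python
import heapq


def simulate_pool(num_threads, requests):
    """Return the latency of each request served by a fixed-size thread pool.

    requests: list of (arrival_time, service_time) tuples sorted by arrival.
    Requests are served first-come, first-served by whichever thread
    becomes idle first.
    """
    # Min-heap holding the time at which each thread next becomes idle.
    free_at = [0] * num_threads
    heapq.heapify(free_at)

    latencies = []
    for arrival, service in requests:
        earliest_free = heapq.heappop(free_at)
        start = max(arrival, earliest_free)  # wait if every thread is busy
        finish = start + service
        heapq.heappush(free_at, finish)
        latencies.append(finish - arrival)   # queueing delay + service time
    return latencies


# Hypothetical numbers: 4 threads, cheap requests take 1 time unit each.
cheap = [(1, 1), (2, 1), (3, 1)]
baseline = simulate_pool(4, cheap)

# The same cheap requests, preceded by 4 expensive requests (10 units each)
# that occupy every thread in the pool.
spike = [(0, 10)] * 4 + cheap
contended = simulate_pool(4, spike)[4:]

print(baseline)   # [1, 1, 1]
print(contended)  # [10, 9, 8]
```

In this model, the two mitigations named in the report map onto the two inputs: limiting the triggering workload reduces the number of expensive entries in `requests`, and adding metadata server resources corresponds to a larger `num_threads`. Either change shortens the queueing delay seen by the cheap requests.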