Service Health
Incident affecting Google BigQuery
BigQuery Jobs are experiencing latency regression
Incident began at 2023-11-20 00:00 and ended at 2023-11-20 03:12 (all times are US/Pacific).
Previously affected location(s)
Multi-region: us
| Date | Time | Description |
|---|---|---|
| 4 Dec 2023 | 07:08 PST | Incident Report. Summary: Beginning on Monday, 20 November 2023, Google BigQuery experienced elevated latency and errors in the US region during three separate impact periods, for a cumulative period of 4 hours, 30 minutes. To our BigQuery customers whose business was impacted during this disruption, we sincerely apologize. This is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform’s performance and availability. Root Cause: BigQuery relies on Colossus [1], Google’s latest-generation distributed file system, which is used by all Google services and is capable of storing exabytes of data. BigQuery's table data is stored in files on Colossus, and BigQuery manages data in tables by replicating between multiple Colossus clusters in each region for availability and disaster resilience. A combination of a recent change in the interaction of BigQuery's table management logic with the Colossus metadata servers and an increase in traffic from new BigQuery workloads led to additional load on the Colossus metadata servers in a subset of clusters in the US region, causing them to throttle requests. This failure affected other metadata operations that are performed during queries on those tables, resulting in errors and increased latency for BigQuery workloads in the region. [1] https://cloud.google.com/blog/products/bigquery/bigquery-under-the-hood Remediation and Prevention: Google engineers were alerted to the issue by our internal monitoring system on Monday, 20 November 2023 at 00:33 US/Pacific, and immediately started an investigation. The issue was mitigated on 20 November at 03:12 when Google engineers moved traffic away from the impacted clusters. Google engineers were alerted to additional recurrences on Friday, 24 November from 00:00 to 00:54 and on Saturday, 25 November from 00:00 to 00:24.
In both cases, engineers applied the same mitigation to the impacted clusters in order to minimize the effects on customer workloads. Based on these recurrences, Google engineers were able to further investigate the root cause of the outage and develop additional techniques for mitigating its impact. These new mitigation techniques prevented recurrences from 25 November onwards. Google is committed to preventing recurrence of this incident. We have taken the following actions:
As always, Google engineers are closely monitoring the system's operation during this critical time period and are prepared to respond to any future incidents promptly. We apologize for the length and severity of this incident. Detailed Description of Impact: Google BigQuery customers experienced elevated latency and errors in the multi-region US for a cumulative duration of 4 hours, 30 minutes. The impact periods were: 20 November 00:00 to 03:12, 24 November 00:00 to 00:54, and 25 November 00:00 to 00:24.
Affected customers experienced elevated latency in the multi-region US on three separate occasions, with query latency reaching up to a few minutes. This incident encompassed three major latency spikes, on 20, 24, and 25 November. During these periods, fewer than 8% of projects with queries executing in the region saw a noticeable increase in query latency compared to their historical means. |
| 21 Nov 2023 | 10:00 PST | Mini Incident Report: We apologize for the inconvenience this service disruption may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support. (All Times US/Pacific) Incident Start: 20 November 2023 00:00 Incident End: 20 November 2023 03:12 Duration: 3 hours, 12 minutes Affected Services and Features: Google BigQuery Regions/Zones: multi-region US Description: Google BigQuery experienced elevated latency in the multi-region US for a duration of 3 hours, 12 minutes. From preliminary analysis, the root cause of the issue is a failure in an underlying storage layer dependency that controls metadata requests. The issue was mitigated by redirecting metadata request traffic to an alternate location. Customer Impact: Google BigQuery
|
| 20 Nov 2023 | 03:27 PST | The issue with Google BigQuery has been resolved for all affected users as of Monday, 2023-11-20 03:15 US/Pacific. We thank you for your patience while we worked on resolving the issue. |
| 20 Nov 2023 | 03:01 PST | Summary: BigQuery Jobs are experiencing latency regression Description: Mitigation work has been completed and we are currently monitoring the services to confirm that the issue is resolved. We will provide more information by Monday, 2023-11-20 04:20 US/Pacific. Diagnosis: Impacted customers may experience increased latency with BigQuery jobs. Workaround: None at this time. |
| 20 Nov 2023 | 02:24 PST | Summary: BigQuery Jobs are experiencing latency regression Description: We are experiencing an issue with Google BigQuery. The resolving teams are still investigating the issue. We will provide an update by Monday, 2023-11-20 03:45 US/Pacific with current details. Diagnosis: Impacted customers may experience increased latency with BigQuery jobs. Workaround: None at this time. |
- All times are US/Pacific
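
The cumulative duration of 4 hours, 30 minutes quoted in the incident report can be cross-checked against the three time windows the report mentions. The following is a minimal sketch of that arithmetic, not part of the original report: it assumes the windows on 24 and 25 November given under Remediation and Prevention are the full impact periods, and the names `WINDOWS` and `cumulative_duration` are illustrative only.

```python
from datetime import datetime, timedelta

# Impact windows as stated in the incident report (US/Pacific, naive local times).
# Assumption: the 24 and 25 November recurrence windows are the full impact periods.
WINDOWS = [
    ("2023-11-20 00:00", "2023-11-20 03:12"),
    ("2023-11-24 00:00", "2023-11-24 00:54"),
    ("2023-11-25 00:00", "2023-11-25 00:24"),
]

FMT = "%Y-%m-%d %H:%M"


def cumulative_duration(windows):
    """Sum the lengths of (start, end) windows given as timestamp strings."""
    total = timedelta()
    for start, end in windows:
        total += datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return total


total = cumulative_duration(WINDOWS)
hours, remainder = divmod(int(total.total_seconds()), 3600)
print(f"{hours} hours, {remainder // 60} minutes")  # prints "4 hours, 30 minutes"
```

The three windows last 3 h 12 min, 54 min, and 24 min, which add up to the reported cumulative figure.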