Service Health


Incident affecting Google BigQuery

BigQuery Jobs are experiencing latency regression

Incident began at 2023-11-20 00:00 and ended at 2023-11-20 03:12 (all times are US/Pacific).

Previously affected location(s)

Multi-region: us

Date Time Description
4 Dec 2023 07:08 PST

Incident Report

Summary

Beginning on Monday, 20 November 2023, Google BigQuery experienced elevated latency and errors in the US region during three separate impact periods, for a cumulative period of 4 hours, 30 minutes.

To our BigQuery customers whose business was impacted during this disruption, we sincerely apologize. This is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform’s performance and availability.

Root Cause

BigQuery relies on Colossus [1], Google’s latest-generation distributed file system, which is used by all Google services and is capable of storing exabytes of data. BigQuery's table data is stored in files on Colossus, and BigQuery manages data in tables by replicating between multiple Colossus clusters in each region for availability and disaster resilience.

A combination of a recent change in the interaction of BigQuery's table management logic with the Colossus metadata servers and an increase in traffic from new BigQuery workloads led to additional load on the Colossus metadata servers in a subset of clusters in the US region, causing them to throttle. This failure affected other metadata operations that are performed during queries on those tables, resulting in errors and increased latency for BigQuery workloads in the region.

[1] https://cloud.google.com/blog/products/bigquery/bigquery-under-the-hood

Remediation and Prevention

Google engineers were alerted to the issue by our internal monitoring system on Monday, 20 November 2023 at 00:33 US/Pacific, and immediately started an investigation. The issue was mitigated on 20 November at 03:12 when Google engineers moved traffic away from the impacted clusters.

Google engineers were alerted to two recurrences: on Friday, 24 November, from 00:00 to 00:54, and on Saturday, 25 November, from 00:00 to 00:24. In both cases, engineers applied the same mitigation to the impacted clusters to minimize the effects on customer workloads. These recurrences allowed Google engineers to investigate the root cause of the outage further and to develop additional techniques for mitigating its impact. These new mitigation techniques prevented recurrences from 25 November onwards.

Google is committed to preventing recurrence of this incident. We have taken the following actions:

  • Conducted a detailed investigation of the underlying cause of the incident, spanning several Google engineering teams.
  • Deployed a targeted change that will reduce the load on individual Colossus metadata servers and thus reduce throttling that was causing these failures. This change has been effective in preventing failures from 25 November onwards.
  • Began developing more comprehensive improvements to the handling of Colossus metadata load to prevent future incidents of this nature.

As always, Google engineers are closely monitoring the system's operation during this critical time period and are prepared to respond to any future incidents promptly. We apologize for the length and severity of this incident.

Detailed Description of Impact

Google BigQuery customers experienced elevated latency and errors in the multi-region US for a cumulative duration of 4 hours, 30 minutes. The impact periods were:

  • 20 November 2023 00:00 to 20 November 2023 03:12 US/Pacific (3 hours, 12 minutes)
  • 24 November 2023 00:00 to 24 November 2023 00:54 US/Pacific (54 minutes)
  • 25 November 2023 00:00 to 25 November 2023 00:24 US/Pacific (24 minutes)
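
The cumulative figure above can be cross-checked against the three listed impact periods. The following is a minimal sketch, not part of Google's report: the names `IMPACT_PERIODS`, `period_duration`, `cumulative_duration`, and `format_duration` are hypothetical, and the timestamps are treated as naive US/Pacific local times, which is safe here only because no daylight saving transition falls inside any of the three windows.

```python
from datetime import datetime, timedelta

# Timestamp layout used in the impact list, e.g. "20 November 2023 00:00".
# Note: "%B" parses English month names under the default C/English locale.
FMT = "%d %B %Y %H:%M"

# Impact periods copied from the report (all times US/Pacific).
# Hypothetical name; the report does not define any such structure.
IMPACT_PERIODS = [
    ("20 November 2023 00:00", "20 November 2023 03:12"),
    ("24 November 2023 00:00", "24 November 2023 00:54"),
    ("25 November 2023 00:00", "25 November 2023 00:24"),
]


def period_duration(start: str, end: str) -> timedelta:
    """Return the length of one impact period."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


def cumulative_duration(periods) -> timedelta:
    """Sum the lengths of all impact periods."""
    return sum((period_duration(s, e) for s, e in periods), timedelta())


def format_duration(d: timedelta) -> str:
    """Render a duration in the report's 'H hours, M minutes' style."""
    minutes = int(d.total_seconds()) // 60
    return f"{minutes // 60} hours, {minutes % 60} minutes"


total = cumulative_duration(IMPACT_PERIODS)
print(format_duration(total))  # 4 hours, 30 minutes
```

The per-period durations (3 hours 12 minutes, 54 minutes, and 24 minutes) sum to the 4 hours, 30 minutes stated in the summary.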

Affected customers experienced elevated query latency in the multi-region US on three separate occasions (20, 24, and 25 November), with query latency reaching up to a few minutes. During the impact periods, fewer than 8% of projects with queries executing in the region saw a noticeable increase in query latency compared to their historical means.


21 Nov 2023 10:00 PST

Mini Incident Report

We apologize for the inconvenience this service disruption may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support.

(All Times US/Pacific)

Incident Start: 20 November 2023 00:00

Incident End: 20 November 2023 03:12

Duration: 3 hours, 12 minutes

Affected Services and Features:

Google BigQuery

Regions/Zones: multi-region US

Description:

Google BigQuery experienced elevated latency in the multi-region US for a duration of 3 hours, 12 minutes. From preliminary analysis, the root cause is a failure in an underlying storage layer dependency that controls metadata requests. The issue was mitigated by redirecting metadata request traffic to an alternate location.

Customer Impact:

Google BigQuery

  • Affected customers experienced elevated query latency in the multi-region US, with query latency reaching up to a few minutes.

20 Nov 2023 03:27 PST

The issue with Google BigQuery has been resolved for all affected users as of Monday, 2023-11-20 03:15 US/Pacific.

We thank you for your patience while we worked on resolving the issue.

20 Nov 2023 03:01 PST

Summary: BigQuery Jobs are experiencing latency regression

Description: Mitigation work has been completed, and we are currently monitoring the service to confirm that the issue is resolved.

We will provide more information by Monday, 2023-11-20 04:20 US/Pacific.

Diagnosis: Impacted customers may experience increased latency with BigQuery jobs.

Workaround: None at this time.

20 Nov 2023 02:24 PST

Summary: BigQuery Jobs are experiencing latency regression

Description: We are experiencing an issue with Google BigQuery.

The resolving teams are still investigating the issue.

We will provide an update by Monday, 2023-11-20 03:45 US/Pacific with current details.

Diagnosis: Impacted customers may experience increased latency with BigQuery jobs.

Workaround: None at this time.