Service Health

Incident affecting Google BigQuery

Google BigQuery users are experiencing issues with slot autoscaling and with purchasing new commitments in multi-region US

Incident began at 2023-09-28 06:00 and ended at 2023-09-28 17:12 (all times are US/Pacific).

Previously affected location(s)

Multi-region: us

Date Time Description
6 Oct 2023 09:30 PDT

Incident Report

Summary

On 28 September 2023, Google BigQuery users experienced issues with slot autoscaling[1], purchasing new capacity commitments[2], and propagation delays in reservation assignment[3] updates in US multi-region for a period of 11 hours, 12 minutes.
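The stated duration can be checked directly from the incident start and end times given on this page. The following is a minimal sketch; the variable names are illustrative, and naive timestamps are used on the assumption that both times are US/Pacific on the same day with no daylight-saving change in between.

```python
from datetime import datetime

# Both timestamps are US/Pacific on the same calendar day, so naive
# datetimes are sufficient for computing the difference.
incident_start = datetime(2023, 9, 28, 6, 0)
incident_end = datetime(2023, 9, 28, 17, 12)

duration = incident_end - incident_start
hours, remainder = divmod(int(duration.total_seconds()), 3600)
minutes = remainder // 60

print(f"{hours} hours, {minutes} minutes")  # prints "11 hours, 12 minutes"
```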

Root Cause

When a BigQuery customer purchases slots, either directly through the Reservations API[4] or through the Autoscaler, the request is sent to a Reservation server in that region. The Reservation server performs permission and quota checks and sends a request to the Capacity Manager server to approve capacity. In order to calculate available capacity in the region, Capacity Manager maintains a list that maps users to different data centers within the region. This information is continuously synchronized to a backend database.
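The request path described above can be pictured with a small model. This is an illustrative sketch only, not Google's implementation: the class names, method names, status strings, and in-memory dictionaries are all hypothetical stand-ins for the Reservation server, the Capacity Manager, and its user-to-data-center mapping.

```python
from dataclasses import dataclass


@dataclass
class SlotRequest:
    user: str
    slots: int


class CapacityManager:
    """Hypothetical stand-in: approves capacity using a user -> data center map."""

    def __init__(self, free_slots_by_dc, user_to_dc):
        self.free_slots_by_dc = dict(free_slots_by_dc)
        self.user_to_dc = dict(user_to_dc)

    def approve(self, request):
        dc = self.user_to_dc.get(request.user)
        if dc is None or self.free_slots_by_dc.get(dc, 0) < request.slots:
            return False
        self.free_slots_by_dc[dc] -= request.slots
        return True


class ReservationServer:
    """Hypothetical stand-in: runs permission and quota checks, then asks for capacity."""

    def __init__(self, capacity_manager, allowed_users, quota_by_user):
        self.capacity_manager = capacity_manager
        self.allowed_users = set(allowed_users)
        self.quota_by_user = dict(quota_by_user)

    def purchase(self, request):
        if request.user not in self.allowed_users:
            return "PERMISSION_DENIED"
        if request.slots > self.quota_by_user.get(request.user, 0):
            return "QUOTA_EXCEEDED"
        if not self.capacity_manager.approve(request):
            return "CAPACITY_UNAVAILABLE"
        return "APPROVED"


manager = CapacityManager({"dc-a": 100}, {"alice": "dc-a"})
server = ReservationServer(manager, {"alice"}, {"alice": 500})
print(server.purchase(SlotRequest("alice", 60)))  # prints "APPROVED"
print(server.purchase(SlotRequest("alice", 60)))  # prints "CAPACITY_UNAVAILABLE"
```

In this model every purchase depends on the Capacity Manager answering, which is why its unavailability blocks both direct purchases and autoscaling.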

On 27 September 2023, an issue with the size of the data written to the database resulted in errors for some transactions. By 06:00 US/Pacific on 28 September 2023, the transactions began to fail consistently, which sent the Capacity Management system into a crash loop. As a result, available capacity in the region could not be calculated and slot purchase approvals failed.

Remediation and Prevention

Google engineers were alerted to the issue by internal monitoring on 28 September 2023 at 06:48 US/Pacific and immediately started an investigation. Engineers initially determined that the start of the issue aligned with a recent release and initiated a rollback at 10:13 US/Pacific.

At 10:42 US/Pacific, testing showed that the rollback had not mitigated the issue. Due to the nature and scope of the issue, engineers began manually approving slot requests. At 11:50 US/Pacific, engineers implemented auto-approval for all slot requests in the US multi-region as a temporary mitigation while the root cause was investigated further.

Engineers were able to fully mitigate the issue at 17:12 US/Pacific by reducing the size of the database transaction. Once the mitigation was confirmed, engineers rolled back the temporary auto-approval change.
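The report does not say how the transaction size was reduced. One common way to keep individual transactions small is to split a large write into several bounded batches, each committed separately. The sketch below illustrates that general technique with Python's built-in sqlite3 module; the table name, column names, and batch size are assumptions made for the example and do not describe BigQuery's backend database.

```python
import sqlite3
from itertools import islice


def chunked(rows, size):
    """Yield successive lists of at most `size` rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def write_mappings(conn, mappings, batch_size=500):
    """Write (user_id, data_center) rows in several small transactions.

    Returns the number of transactions committed.
    """
    committed = 0
    for batch in chunked(mappings, batch_size):
        with conn:  # commits on success, rolls back if an exception is raised
            conn.executemany(
                "INSERT OR REPLACE INTO user_dc (user_id, data_center) VALUES (?, ?)",
                batch,
            )
        committed += 1
    return committed


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_dc (user_id TEXT PRIMARY KEY, data_center TEXT)")
rows = [(f"user-{i}", f"dc-{i % 3}") for i in range(1200)]
transactions = write_mappings(conn, rows, batch_size=500)
print(transactions)  # prints 3
```

The trade-off is that the full write is no longer atomic: a failure partway through leaves earlier batches committed, so readers must tolerate a partially updated mapping.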

Google is committed to preventing a repeat of this issue in the future and is completing the following actions:

  • Short term, we are enhancing the retry logic for project assignment mappings when errors are detected in the BigQuery capacity mapping system.
  • Short term, we are working to optimize the reservation database to reduce contention and reduce runtime dependencies on it.
  • Medium term, we are working to decouple reservation processing from the capacity mapping system to make capacity allocation independent of reservation mapping.
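The first action item mentions enhanced retry logic without giving details. A widely used pattern for retrying transient failures is exponential backoff with jitter, sketched below. The exception type, function names, and delay parameters are hypothetical and chosen only for illustration; they are not taken from BigQuery.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical error type for failures that are worth retrying."""


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1,
                       max_delay=2.0, sleep=time.sleep):
    """Call `operation` until it succeeds or `max_attempts` is reached.

    The wait before each retry is a random value between 0 and an
    exponentially growing cap ("full jitter"), which spreads out retries
    from many clients instead of having them all retry at once.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))


attempts = {"count": 0}


def flaky_assignment_update():
    """Fails twice, then succeeds, to simulate a transient mapping error."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("mapping store unavailable")
    return "assigned"


result = retry_with_backoff(flaky_assignment_update, sleep=lambda _seconds: None)
print(result, attempts["count"])  # prints "assigned 3"
```

Passing `sleep` as a parameter keeps the example fast to run and easy to test; real callers would leave the default `time.sleep` in place.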

Detailed Description of Impact

On 28 September 2023, from 06:00 to 17:12 US/Pacific, Google BigQuery users in the US multi-region were unable to purchase additional slots through autoscaling or new capacity commitments. Additionally, propagation of reservation assignment updates was delayed.

29 Sep 2023 11:22 PDT

Mini Incident Report

We apologize for the inconvenience this service disruption may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support

(All Times US/Pacific)

Incident Start: 28 Sept 2023 06:00

Incident End: 28 Sept 2023 17:12

Duration: 11 hours, 12 mins

Affected Services and Features:

Google BigQuery - Workload Management

Regions/Zones: US multi-region

Description:

Google BigQuery users experienced issues with slot autoscaling, purchasing new capacity commitments, and propagation delays in reservation assignment updates in US multi-region for a period of 11 hours, 12 minutes.

From preliminary analysis, the root cause is the BigQuery capacity manager component becoming unavailable due to a bug exposed by increased contention in the database this component depends on.

A mitigation was applied around 28 Sept at 15:00, at which point customers started seeing their autoscaling and capacity commitment requests succeed. Full mitigation was applied at 17:12. This resolved the propagation delays in reservation assignment updates.

Customer Impact:

  • BigQuery users in US multi-region were unable to scale up slots or purchase new capacity commitments.
  • Users also experienced propagation delays in reservation assignment updates.

28 Sep 2023 17:33 PDT

The issue with Google BigQuery has been resolved for all affected users as of Thursday, 2023-09-28 17:12 US/Pacific. Capacity slot purchases are now succeeding and reservation assignments are propagating correctly.

We thank you for your patience while we worked on resolving the issue.

28 Sep 2023 16:18 PDT

Summary: Google BigQuery users are experiencing issues with slot autoscaling and with purchasing new commitments in multi-region US

Description: Mitigation work is currently underway by our engineering team. A mitigation has been implemented which has fixed capacity purchase and autoscaling requests. Reservation assignment updates are still experiencing propagation delays.

We will provide an update by Thursday, 2023-09-28 17:30 US/Pacific with current details.

Diagnosis: BigQuery users in multi-region US may see issues with scaling up slots and purchasing new commitments, as well as propagation delays in reservation assignment updates.

Workaround: None at this time.

28 Sep 2023 14:37 PDT

Summary: Google BigQuery users are experiencing issues with slot autoscaling and with purchasing new commitments in multi-region US

Description: Mitigation work is currently underway by our engineering team.

The mitigation is expected to complete by Thursday, 2023-09-28 16:30 US/Pacific.

We will provide more information by Thursday, 2023-09-28 16:40 US/Pacific.

Diagnosis: BigQuery users in multi-region US may see issues with scaling up slots and purchasing new commitments.

Workaround: None at this time.

28 Sep 2023 13:49 PDT

Summary: Google BigQuery users are experiencing issues with slot autoscaling and with purchasing new commitments in multi-region US

Description: Mitigation work is currently underway by our engineering team.

The mitigation is expected to complete by Thursday, 2023-09-28 14:50 US/Pacific.

We will provide more information by Thursday, 2023-09-28 15:00 US/Pacific.

Diagnosis: BigQuery users in multi-region US may see issues with scaling up slots and purchasing new commitments.

Workaround: None at this time.

28 Sep 2023 13:11 PDT

Summary: Google BigQuery users are experiencing issues with slot autoscaling and with purchasing new commitments in multi-region US

Description: Mitigation work is currently underway by our engineering team.

We do not have an ETA for mitigation at this point.

We will provide more information by Thursday, 2023-09-28 13:45 US/Pacific.

Diagnosis: BigQuery users in multi-region US may see issues with scaling up slots and purchasing new commitments.

Workaround: None at this time.

28 Sep 2023 12:31 PDT

Summary: Google BigQuery users are experiencing issues with slot autoscaling and with purchasing new commitments in multi-region US

Description: Upon further investigation, our engineering team concluded that the previously identified mitigation does not resolve the issue.

Our engineers are continuing to investigate the issue to identify a mitigation.

We will provide more information by Thursday, 2023-09-28 13:10 US/Pacific.

Diagnosis: BigQuery users in multi-region US may see issues with scaling up slots and purchasing new commitments.

Workaround: None at this time.

28 Sep 2023 11:46 PDT

Summary: Google BigQuery users are experiencing issues with slot autoscaling and with purchasing new commitments in multi-region US

Description: Our engineering team has identified a mitigation for the issue and is working to initiate mitigation work.

We will provide more information by Thursday, 2023-09-28 12:20 US/Pacific.

Diagnosis: BigQuery users in multi-region US may see issues with scaling up slots and purchasing new commitments.

Workaround: None at this time.