Service Health

This page provides status information on the services that are part of Google Cloud. Check back here to view the current status of the services listed below. If you are experiencing an issue not listed here, please contact Support. Learn more about what's posted on the dashboard in this FAQ. For additional information on these services, please visit https://cloud.google.com/.

Incident affecting Google BigQuery

We are investigating an issue with Google BigQuery impacting customers in the US region

Incident began at 2023-04-01 03:30 and ended at 2023-04-03 05:30 (all times are US/Pacific).

Previously affected location(s)

Multi-region: us

Date Time Description
10 Apr 2023 08:40 PDT

Incident Report

Summary

Between Saturday, 01 April 2023 at 03:30 US/Pacific and Monday, 03 April 2023 at 05:30 US/Pacific, BigQuery experienced three separate windows of elevated latency and unavailability errors (503, 500 errors) in the US multi-region for a total duration of 8 hours, 2 minutes. To our BigQuery customers whose business analytics were impacted during this disruption, we sincerely apologize. This is not the level of quality and reliability we strive to offer you. We have conducted an internal investigation and are taking immediate steps to improve the platform’s performance and availability.

Root Cause

BigQuery’s API provides users the ability to send job requests as outlined in the public documentation. Elevated latencies or elevated error rates on these API requests may lead to customers experiencing service unavailability. These job requests are received by job frontend servers, which handle the initial processing and routing to BigQuery’s query processing engine. Each server processes these requests in a series of thread pools.

The issue was triggered by an unexpected surge of job insert API requests sent in rapid succession by a single workload. This surge put excessive load on a small fraction of our metadata serving systems, creating a hotspot. The hotspot increased latencies and backed up the requests to read metadata for that workload. This, in turn, led to contention in the thread pools, which resulted in elevated latency and timeout failures for other job API requests hitting the affected job frontend servers.

Remediation and Prevention

Google engineers were first alerted to the outage by internal monitoring on 01 April 2023 at 04:05 US/Pacific and immediately started an investigation. A single workload was identified as the trigger. While engineers continued to work on identifying the cause and mitigation, the latency subsided, mitigating the issue automatically. After mitigation, our engineers continued investigations to understand the connection between the single workload and the broader impact.

During ongoing investigations, engineers were alerted to a recurrence of the issue on 02 April 2023 at 03:05 US/Pacific. The issue was further escalated, and engineering added CPU resources to the problematic server. This was completed at around the same time that the single workload ended, thereby mitigating the issue. After further monitoring, engineers concluded their investigation, as the added resources appeared to have helped mitigate the issue.

On 03 April 2023 at 01:34 US/Pacific time, our engineers were alerted to a reappearance of the issue. The engineering team identified and successfully stopped the type of workload that was causing the issue, effectively mitigating the problem. The team has taken additional steps to block that specific workload in the future.

Google is committed to preventing a repeat of this issue in the future and is completing the following actions:

  • Continue to block the problematic workload until we implement additional safeguards for it to run without causing broader impact.

  • Improve API execution by introducing isolation between the threadpools that process different job API calls and by introducing limits on the instantaneous threadpool capacity that can be used by a single workload.
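The two ideas in the last action item, separate thread pools per job API call type and a cap on how much pool capacity one workload can hold at a time, can be illustrated with a minimal Python sketch. This is not Google's implementation; the class name `IsolatedPools`, the exception `WorkloadOverLimit`, and all parameters are hypothetical and chosen only for illustration.

```python
import threading
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


class WorkloadOverLimit(Exception):
    """Raised when one workload already holds its full share of capacity."""


class IsolatedPools:
    """Hypothetical sketch: one thread pool per API call type, plus a
    per-workload cap on requests that are queued or running."""

    def __init__(self, call_types, workers_per_pool, max_in_flight_per_workload):
        # Isolation: a backlog of one call type cannot occupy another type's threads.
        self._pools = {
            name: ThreadPoolExecutor(max_workers=workers_per_pool)
            for name in call_types
        }
        self._limit = max_in_flight_per_workload
        self._in_flight = defaultdict(int)
        self._lock = threading.Lock()

    def submit(self, call_type, workload_id, fn, *args):
        # Admission control: reject instead of queueing once the cap is reached.
        with self._lock:
            if self._in_flight[workload_id] >= self._limit:
                raise WorkloadOverLimit(workload_id)
            self._in_flight[workload_id] += 1
        try:
            future = self._pools[call_type].submit(fn, *args)
        except Exception:
            self._release(workload_id)
            raise
        # Free the slot when the request finishes, whether it succeeded or failed.
        future.add_done_callback(lambda _done: self._release(workload_id))
        return future

    def _release(self, workload_id):
        with self._lock:
            self._in_flight[workload_id] -= 1
```

In this sketch a surge from one workload is rejected early once it reaches its cap, so requests from other workloads, and requests of other call types, still find free threads.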

Detailed Description of Impact

On 01 April 2023 between 03:30 and 05:14 US/Pacific, on 02 April 2023 between 03:00 and 05:08, and on 03 April 2023 between 01:20 and 05:30:

BigQuery:

Affected customers in multi-region US experienced elevated latency and unavailability errors (503, 500 errors) when using the jobs API to insert a query job, get information about a job, cancel a job, or get back query results.
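The report lists no workaround, and client-side retries would not have restored service during the impact windows. As general hygiene for transient 500 and 503 responses, though, callers often wrap requests in capped exponential backoff with jitter. The sketch below shows that pattern only; `ApiError`, `call_with_backoff`, and the default delays are hypothetical names and values, not part of any BigQuery client library.

```python
import random
import time

# Status codes named in this report as the unavailability errors.
RETRYABLE_STATUS = {500, 503}


class ApiError(Exception):
    """Hypothetical error type carrying the HTTP status of a failed call."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def call_with_backoff(request, max_attempts=5, base_delay=1.0,
                      max_delay=32.0, sleep=time.sleep):
    """Call request() and retry on 500/503 with capped, jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return request()
        except ApiError as err:
            is_last_attempt = attempt == max_attempts - 1
            if err.status not in RETRYABLE_STATUS or is_last_attempt:
                raise
            # Double the ceiling each attempt, then pick a random delay below it
            # so that many clients do not retry in lockstep.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
```

The jitter matters here: synchronized retries from many clients can themselves produce the kind of request surge described in the root cause.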


4 Apr 2023 13:17 PDT

Mini Incident Report

We apologize for the inconvenience this service disruption/outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support.

(All Times US/Pacific)

Impact Window #1:

Incident Start: 1 April 2023 03:30
Incident End: 1 April 2023 05:14
Duration: 1 hour, 44 minutes

Impact Window #2:

Incident Start: 2 April 2023 03:00
Incident End: 2 April 2023 05:08
Duration: 2 hours, 8 minutes

Impact Window #3:

Incident Start: 3 April 2023 01:20
Incident End: 3 April 2023 05:30
Duration: 4 hours, 10 minutes

Cumulative Duration: 8 hours, 2 minutes
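The per-window durations and the cumulative figure can be checked with a short calculation. The sketch below uses the start and end times listed above, treated as naive US/Pacific local times (no daylight-saving change falls inside these windows); the variable and function names are illustrative only.

```python
from datetime import datetime, timedelta

# Impact windows as listed in this report (US/Pacific local time).
WINDOWS = [
    ("2023-04-01 03:30", "2023-04-01 05:14"),
    ("2023-04-02 03:00", "2023-04-02 05:08"),
    ("2023-04-03 01:20", "2023-04-03 05:30"),
]


def duration(start, end):
    """Return the length of one impact window as a timedelta."""
    fmt = "%Y-%m-%d %H:%M"
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


durations = [duration(start, end) for start, end in WINDOWS]
total = sum(durations, timedelta())

for (start, end), length in zip(WINDOWS, durations):
    print(f"{start} to {end}: {length}")
print(f"Cumulative: {total}")  # 8:02:00
```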

Affected Services and Features:

BigQuery

Regions/Zones: Multi-Region US

Description:

BigQuery experienced elevated latency and error rates in the US multi-region. The incident spanned from 1 April 2023 03:30 to 3 April 2023 05:30, with a cumulative duration of 8 hours, 2 minutes across the three impact windows. From preliminary analysis, the issue was triggered by an unexpected surge of job API requests from a single workload. The issue was mitigated once Google engineers identified and isolated the problematic workload.

Google will complete a detailed Incident Report in the following days that will provide a full root cause.

Customer Impact:

  • Affected customers experienced elevated latency and unavailability errors (503, 500 errors) while using the jobs API to insert a query job, get information about a job, cancel a job, or get back query results.

3 Apr 2023 14:42 PDT

The issue with Google BigQuery has been resolved for all affected users as of Monday, 2023-04-03 05:30 US/Pacific.

We thank you for your patience while we worked on resolving the issue.

3 Apr 2023 13:24 PDT

Summary: We are investigating an issue with Google BigQuery impacting customers in the US region

Description: Engineers believe that the issue that was causing high latency and unavailability errors has been mitigated; however, we are continuing to monitor the service to confirm full recovery.

We will provide more information by Monday, 2023-04-03 15:30 US/Pacific.

Diagnosis: Some customers will see high latency or unavailability errors (503, 500 errors) while using the jobs API to insert a query job or get back query results.

Workaround: None at this time.

3 Apr 2023 12:25 PDT

Summary: We are investigating an issue with Google BigQuery impacting customers in the US region

Description: Engineers believe that the issue that was causing high latency and unavailability errors has been mitigated; however, we are continuing to monitor the service to confirm full recovery.

We will provide more information by Monday, 2023-04-03 13:30 US/Pacific.

Diagnosis: Some customers will see high latency or unavailability errors (503, 500 errors) while using the jobs API to insert a query job or get back query results.

Workaround: None at this time.

3 Apr 2023 11:53 PDT

Summary: We are investigating an issue with Google BigQuery impacting customers in the US region

Description: Engineers believe that the issue that was causing high latency and unavailability errors has been mitigated, and are continuing to monitor the service.

We will provide more information by Monday, 2023-04-03 12:30 US/Pacific.

Diagnosis: Some customers will see high latency or unavailability errors (503, 500 errors) while using the jobs API to insert a query job or get back query results.

Workaround: None at this time.

3 Apr 2023 09:38 PDT

Summary: We are investigating an issue with Google BigQuery impacting customers in the US region

Description: Engineers believe that the issue that was causing high latency and unavailability errors has been mitigated, and are continuing to monitor the service.

We will provide more information by Monday, 2023-04-03 12:00 US/Pacific.

Diagnosis: Some customers will see high latency or unavailability errors (503, 500 errors) while using the jobs API to insert a query job or get back query results.

Workaround: None at this time.

3 Apr 2023 04:53 PDT

Summary: We are investigating an issue with Google BigQuery impacting customers in the US region

Description: Engineers have localized the issue, and mitigation work is currently underway by our engineering team. We believe there is no ongoing impact but will continue to monitor the service. If customers are still impacted, please raise a support case.

We will provide more information by Monday, 2023-04-03 12:00 US/Pacific.

Diagnosis: Some customers will see high latency or unavailability errors (503, 500 errors) while using the jobs API to insert a query job or get back query results.

Workaround: None at this time.

3 Apr 2023 02:46 PDT

Summary: We are investigating an issue with Google BigQuery impacting customers in the US region

Description: We are experiencing an issue with Google BigQuery.

At present there is no ongoing impact. Google engineers are continuing to investigate and are monitoring closely. If customers are experiencing issues, please raise a support case.

We will provide an update by Monday, 2023-04-03 05:00 US/Pacific with current details.

Diagnosis: Some customers will see high latency or unavailability errors (503, 500 errors) while using the jobs API to insert a query job or get back query results.

Workaround: None at this time.

3 Apr 2023 02:18 PDT

Summary: We are investigating an issue with Google BigQuery impacting customers in the US region

Description: We are experiencing an issue with Google BigQuery.

Our engineering team continues to investigate the issue.

We will provide an update by Monday, 2023-04-03 03:00 US/Pacific with current details.

Diagnosis: Some customers will see high latency or unavailability errors (503, 500 errors) while using the jobs API to insert a query job or get back query results.

Workaround: None at this time.

3 Apr 2023 02:11 PDT

Summary: We are investigating an issue with Google BigQuery impacting customers in the US region

Description: We are experiencing an issue with Google BigQuery.

Our engineering team continues to investigate the issue.

We will provide an update by Monday, 2023-04-03 03:00 US/Pacific with current details.

Diagnosis: Some customers will see unavailability errors while using the jobs API to insert a query job or get back query results.

Workaround: None at this time.