Service Health

This page provides status information on the services that are part of Google Cloud. Check back here to view the current status of the services listed below. If you are experiencing an issue not listed here, please contact Support. Learn more about what's posted on the dashboard in this FAQ. For additional information on these services, please visit https://cloud.google.com/.

Incident affecting Google BigQuery

Increased connection errors for some BigQuery jobs impacting customers in US regions.

Incident began at 2023-04-28 18:00 and ended at 2023-04-29 13:10 (all times are US/Pacific).

Previously affected location(s)

Multi-region: us

10 May 2023 15:27 PDT

Incident Report

Summary

On Friday, 28 April 2023 at 14:18 US/Pacific, Google BigQuery experienced elevated connection errors impacting BigQuery jobs in US regions for a duration of 19 hours and 10 minutes.

To our BigQuery customers who were impacted during this disruption, we sincerely apologize. This is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform’s performance and availability. We have conducted an internal investigation and are taking measures to improve our service.

Root Cause and Trigger

BigQuery is adding a new INFORMATION_SCHEMA view that gives users a more detailed view of their billed storage usage. The view is populated by a background data processing pipeline that outputs a storage usage row for every table in a project or organization. To guarantee the accurate and consistent delivery of each record, BigQuery uses a queue database to update batches of rows for tables in a GCP project. The queue database is a shared resource for our metadata subsystem. A sharding technique based on unique project IDs was implemented to distribute the load on the queue database. However, this technique proved problematic because the distribution of work among projects in the reporting set was highly skewed: projects with a significant amount of work degraded queue database performance on the shards hosting them. This skew is more pronounced in the US multi-region.
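
The sketch below illustrates this with made-up numbers rather than BigQuery internals: hashing on project IDs balances the number of projects per shard, but not the amount of work per shard, when one project dominates the reporting set.

```python
# Minimal sketch (illustrative numbers, not BigQuery internals): sharding queue
# work by a hash of the project ID balances the number of *projects* per shard,
# but not the amount of *work*, when a few projects own most of the tables.
import hashlib
from collections import Counter

NUM_SHARDS = 8

def shard_for(project_id: str) -> int:
    """Pick a shard from a stable hash of the project ID."""
    digest = hashlib.sha256(project_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Hypothetical reporting set: most projects contribute a handful of
# storage-usage rows, while one large organization contributes millions.
tables_per_project = {f"proj-{i}": 100 for i in range(1_000)}
tables_per_project["proj-big-org"] = 5_000_000

rows_per_shard = Counter()
for project_id, num_tables in tables_per_project.items():
    rows_per_shard[shard_for(project_id)] += num_tables

for shard, rows in sorted(rows_per_shard.items()):
    print(f"shard {shard}: {rows:>9,} queued rows")
# One shard ends up with ~5M queued rows while the rest sit near ~12K, so the
# queue database backing that shard degrades even though project counts are even.
```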

The pipeline was rolling out gradually, per region, across the BigQuery fleet. On Thursday, 27 April 2023, it was enabled in the US multi-region, and it started running on Friday, 28 April 2023 at 06:00 US/Pacific. At 14:24 US/Pacific, the pipeline reached the stage of submitting tasks to the queue database, which triggered the problem. Queue database shards hosting the skewed project load began pushing back under the high load. Queue tasks for serving-path operations such as Table Copy, Load, and Query that were collocated with the overloaded servers were consequently impacted.

The user-visible impact on the affected operations manifested as internal errors when operations exceeded their deadlines, or as increased tail latencies for operations that succeeded.
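
A minimal simulation of this failure mode, using assumed task costs and deadlines rather than real BigQuery figures: serving-path tasks that share a shard's queue with a burst of background batches finish far too late and miss their deadlines.

```python
# Minimal simulation of the failure mode above, with assumed costs and deadlines
# (not real BigQuery figures): serving-path tasks share a shard's FIFO queue
# with a burst of background storage-usage batches and miss their deadlines.
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    cost_ms: int      # time the shard needs to process the task
    deadline_ms: int  # budget before the caller gives up with an internal error

# A burst of background batches lands ahead of normal serving-path work.
queue = [Task(f"storage-usage-batch-{i}", cost_ms=400, deadline_ms=60_000)
         for i in range(50)]
queue += [
    Task("table-copy", cost_ms=20, deadline_ms=5_000),
    Task("load-commit", cost_ms=20, deadline_ms=5_000),
    Task("query-metadata", cost_ms=20, deadline_ms=5_000),
]

clock_ms = 0
for task in queue:  # strict FIFO: no isolation or prioritization
    clock_ms += task.cost_ms
    status = "ok" if clock_ms <= task.deadline_ms else "DEADLINE_EXCEEDED"
    if not task.name.startswith("storage-usage"):
        print(f"{task.name:15s} finished at {clock_ms:>6,} ms -> {status}")
# The serving-path tasks complete only after ~20 s of queued background work,
# well past their 5 s deadlines: the internal errors and tail latency users saw.
```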

Remediation and Prevention

On Friday, 28 April 2023 at 18:47 US/Pacific, Google engineers were alerted to elevated job connection errors on different job types and immediately started an investigation. Once the nature and scope of the issue became clear, Google engineers quickly halted a binary release and initiated its rollback at 20:00 US/Pacific; however, this did not mitigate the issue. The issue was further escalated at 22:48 US/Pacific, and engineers initiated several mitigation attempts, which were unsuccessful. On Saturday, 29 April 2023 at 12:36 US/Pacific, engineers identified and terminated the problematic job, and the issue was fully mitigated by 13:10 US/Pacific.

Google is committed to preventing a repeat of this issue in the future and is completing the following actions:

  • Enhance our detection procedures for internal or external triggers that cause increased resource load.
  • Improve internal rollout procedures for pipelines that utilize the queue databases to control the pace of scaling up their traffic.
  • Develop job monitoring dashboards for the queue databases to detect errors or latency issues and reduce time to mitigation.
  • Redesign the pipeline that triggered the incident to avoid overloading the system.
  • Develop an isolation mechanism to prioritize critical workloads over background database jobs (see the sketch after this list).
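
As referenced in the last action item, the following is a minimal sketch of such an isolation mechanism, reusing the assumed task model from the earlier simulation. It is an illustration of the idea, not Google's design: serving-path work gets a strictly higher priority than background pipeline batches, so a background backlog can no longer starve it.

```python
# Minimal isolation sketch (illustration only): critical serving-path tasks are
# always dequeued before background pipeline batches, whatever the arrival order.
import heapq
import itertools

CRITICAL, BACKGROUND = 0, 1
_order = itertools.count()  # tie-breaker keeps FIFO order within a priority

heap = []

def submit(priority: int, name: str, cost_ms: int) -> None:
    heapq.heappush(heap, (priority, next(_order), name, cost_ms))

# Same arrival order as the FIFO example: a burst of batches, then serving work.
for i in range(50):
    submit(BACKGROUND, f"storage-usage-batch-{i}", 400)
submit(CRITICAL, "table-copy", 20)
submit(CRITICAL, "load-commit", 20)
submit(CRITICAL, "query-metadata", 20)

clock_ms = 0
while heap:
    priority, _, name, cost_ms = heapq.heappop(heap)
    clock_ms += cost_ms
    if priority == CRITICAL:
        print(f"{name:15s} finished at {clock_ms:>6,} ms")
# Critical tasks now finish within tens of milliseconds; the background batches
# drain afterwards instead of delaying the serving path.
```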

Google is committed to quickly and continually improving our technology and operations to prevent service disruptions. We appreciate your patience and apologize again for the impact to your organization. We thank you for your business.

Detailed Description of Impact

On Friday, 28 April 2023 from 14:24 to Saturday, 29 April 2023 13:10 US/Pacific, Google BigQuery experienced elevated connection errors impacting BigQuery jobs.

  • Approximately 8.56% of projects in the US multi-region observed elevated connection errors; others might have seen elevated latency at the tail end of metadata API operations while attempting to execute jobs in the US multi-region.
  • At the peak of the incident between Friday, 28 April 2023 17:00 and Saturday, 29 April 2023 01:00 US/Pacific:
    • ~3.7% of all projects in the US multi-region saw an error ratio > 1% across all job types.
    • ~37% of projects saw an error ratio > 0.01%.
    • Table Copy jobs were the most severely affected (an error ratio of almost 2% at peak).
    • Query and Load operations observed error ratios between 0.1% and 1% (a sketch for estimating a project's own error ratios from its job metadata follows this list).
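
The error ratios above come from Google's internal measurements. As a rough, unofficial way for a project owner to approximate the errors their own jobs saw during the window, the sketch below queries the region-qualified INFORMATION_SCHEMA.JOBS_BY_PROJECT view with the google-cloud-bigquery Python client. It assumes the library is installed, credentials are configured, and the caller can list the project's job history; the view retains roughly the last 180 days of jobs, and the timestamps are the incident window converted to UTC.

```python
# Unofficial sketch: approximate per-job-type error ratios for one project over
# the incident window using BigQuery's own job metadata. Not how the figures
# above were computed.
from google.cloud import bigquery

client = bigquery.Client()  # default project and credentials

sql = """
SELECT
  job_type,
  COUNTIF(error_result IS NOT NULL) AS failed_jobs,
  COUNT(*) AS total_jobs,
  SAFE_DIVIDE(COUNTIF(error_result IS NOT NULL), COUNT(*)) AS error_ratio
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time BETWEEN TIMESTAMP('2023-04-28 21:24:00 UTC')  -- 14:24 PDT
                        AND TIMESTAMP('2023-04-29 20:10:00 UTC')  -- 13:10 PDT
GROUP BY job_type
ORDER BY error_ratio DESC
"""

for row in client.query(sql).result():
    print(f"{row.job_type:10s} {row.failed_jobs:>8,} / {row.total_jobs:>10,} "
          f"= {row.error_ratio:.4%}")
```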

2 May 2023 07:17 PDT

Mini Incident Report

We apologize for the inconvenience this service disruption/outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support.

(All Times US/Pacific)

Incident Start: 28 April 2023 18:00

Incident End: 29 April 2023 13:10

Duration: 19 hours, 10 minutes

Affected Services and Features:

Google BigQuery

Regions/Zones: US multi-region

Description:

Google BigQuery experienced increased connection errors impacting BigQuery jobs in US regions for a duration of 19 hours and 10 minutes. From preliminary analysis, the system returned connection errors for some operations as it attempted to shed the extra load that was being created by a problem in the internal data processing pipeline.
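
For illustration only (assumed behavior, not a description of BigQuery internals): load shedding of this kind usually means rejecting new work quickly with an availability-style error once a backlog threshold is crossed, rather than queueing it indefinitely, as in the minimal sketch below.

```python
# Minimal load-shedding sketch: once the backlog crosses a threshold, new
# operations are rejected immediately instead of waiting in the queue.
# Names and limits here are made up for illustration.
from collections import deque

MAX_BACKLOG = 1_000
backlog = deque()

class OverloadedError(Exception):
    """Stand-in for the connection errors callers observed."""

def submit(operation: str) -> None:
    if len(backlog) >= MAX_BACKLOG:
        raise OverloadedError(f"backlog full, shedding load: rejecting {operation}")
    backlog.append(operation)

# A misbehaving background pipeline fills the backlog...
for i in range(MAX_BACKLOG):
    submit(f"storage-usage-batch-{i}")

# ...so the next serving-path operation is shed with a fast error.
try:
    submit("table-copy")
except OverloadedError as exc:
    print(exc)
```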

Customer Impact:

During the incident, some customers may have experienced increased connection errors; others might have seen elevated latency at the tail end of metadata API operations while attempting to execute jobs in the US multi-region.

29 Apr 2023 13:20 PDT

The issue with Google BigQuery has been resolved for all affected users as of Saturday, 2023-04-29 12:00 US/Pacific.

We thank you for your patience while we worked on resolving the issue.

29 Apr 2023 11:30 PDT

Summary: Increased connection errors for some BigQuery jobs impacting customers in US regions.

Description: Our engineering team continues to work on mitigating errors for table copy and load jobs, which remain at elevated levels.

We will provide an update by Saturday, 2023-04-29 13:30 US/Pacific with current details.

Diagnosis: Customers may experience increased connection errors for some jobs in the US multi-region.

Workaround: None at this time.

29 Apr 2023 09:10 PDT

Summary: Increased connection errors for some BigQuery jobs impacting customers in US regions.

Description: Our engineering team has been able to mitigate the issue for most job types and is currently working on mitigating errors for table copy and load jobs, which remain at elevated levels.

We will provide an update by Saturday, 2023-04-29 11:30 US/Pacific with current details.

Diagnosis: Customers may experience increased connection errors for some jobs in the US multi-region.

Workaround: None at this time.

29 Apr 2023 07:38 PDT

Summary: Increased connection errors for some BigQuery jobs impacting customers in US regions.

Description: Our engineering team continues to actively investigate this issue.

We will provide an update by Saturday, 2023-04-29 09:30 US/Pacific with current details.

Diagnosis: Customers may experience increased connection errors for some jobs in the US multi-region.

Workaround: None at this time.

29 Apr 2023 05:26 PDT

Summary: Increased connection errors for some BigQuery jobs impacting customers in US regions only.

Description: Our engineering team continues to actively investigate this issue.

We will provide an update by Saturday, 2023-04-29 08:00 US/Pacific with current details.

Diagnosis: Customers may experience increased connection errors for some jobs in the US multi-region.

Workaround: None at this time.

29 Apr 2023 03:25 PDT

Summary: Increased connection errors for some BigQuery jobs impacting customers in US regions only.

Description: Our engineering team has determined that further investigation is required to mitigate the issue.

We will provide an update by Saturday, 2023-04-29 05:30 US/Pacific with current details.

Diagnosis: Customers may experience increased connection errors for DREMEL_IMPORT, TABLE_COPY, METADATA_SNAPSHOT_GENERATE, and DERIVED_TABLE_REFRESH jobs in the US multi-region.

Workaround: None at this time.

29 Apr 2023 02:59 PDT

Summary: Increased connection errors for some BigQuery jobs impacting customers in US regions only.

Description: Our engineering team has determined that further investigation is required to mitigate the issue.

We will provide an update by Saturday, 2023-04-29 03:30 US/Pacific with current details.

Diagnosis: Customers may experience increased connection errors for DREMEL_IMPORT, TABLE_COPY, METADATA_SNAPSHOT_GENERATE, and DERIVED_TABLE_REFRESH jobs in the US multi-region.

Workaround: None at this time.