Service Health

This page provides status information on the services that are part of Google Cloud. Check back here to view the current status of the services listed below. If you are experiencing an issue not listed here, please contact Support. Learn more about what's posted on the dashboard in this FAQ. For additional information on these services, please visit https://cloud.google.com/.

Incident affecting Google BigQuery

Elevated latency in BigQuery within EU multiregion

Incident began at 2023-11-22 09:40 and ended at 2023-11-22 10:50 (all times are US/Pacific).

Previously affected location(s)

Multi-region: eu

Date Time Description
26 Nov 2023 19:37 PST

Incident Report

Summary

Beginning at 23:25 US/Pacific on Tuesday 21 November 2023, Google BigQuery’s job servers in the EU multi-region experienced elevated latency and connection errors on three separate occasions, for a cumulative period of 5 hours and 22 minutes.

To our BigQuery customers whose backend jobs were impacted during this disruption, we sincerely apologize. This is not the level of quality and reliability we strive to offer you, and we took immediate steps to improve the platform’s performance and availability.

Root Cause

The root cause of the issue was contention over the thread pools used by our backend metadata system.

In this incident, Google BigQuery received an unexpected spike of jobs that required a large number of complex requests to our backend metadata service per query. This traffic utilized the available thread pool to a high degree in the metadata servers, causing high latency for metadata requests as well as restarts in the job server. As a result, some customers in the impacted region observed high query latency and connection errors.
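The contention effect described above can be illustrated with a toy model. The sketch below is not a description of BigQuery's metadata servers: the pool size, request costs, and the function name simulate_pool are all invented for illustration. It simulates a fixed-size, first-come-first-served thread pool and shows how a few expensive requests that occupy every thread inflate the latency of cheap requests queued behind them.

```python
import heapq


def simulate_pool(num_threads, requests):
    """Simulate a fixed-size FIFO thread pool (hypothetical toy model).

    requests: list of (arrival_ms, service_ms), sorted by arrival time.
    Returns the latency of each request (queue wait + service time) in ms.
    """
    # Min-heap of the times at which each worker thread becomes free.
    free_at = [0] * num_threads
    heapq.heapify(free_at)
    latencies = []
    for arrival, service in requests:
        worker_free = heapq.heappop(free_at)
        start = max(arrival, worker_free)  # wait if every thread is busy
        finish = start + service
        heapq.heappush(free_at, finish)
        latencies.append(finish - arrival)
    return latencies


# Assumed numbers: 4 threads, cheap requests cost 10 ms, expensive ones 200 ms.
light = [(5, 10), (10, 10), (15, 10), (20, 10)]
heavy = [(0, 200)] * 4

baseline = simulate_pool(4, light)
spiked_light = simulate_pool(4, heavy + light)[len(heavy):]

print("light latency without spike:", baseline)
print("light latency behind spike: ", spiked_light)
```

In this model the cheap requests take 10 ms each when the pool is idle, but roughly 200 ms when four expensive requests arrive just ahead of them, even though the cheap requests themselves did not change.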

Remediation and Prevention

Google engineers were alerted to the issue at 23:27 US/Pacific on Tuesday 21 November 2023 by our monitoring tools and immediately started to take actions to mitigate the issue.

As a first step, Google engineers increased memory on the job servers, which mitigated the issue in the interim, and then horizontally scaled up the job servers in the impacted region. After the third occurrence, Google engineers narrowed down the underlying root cause to the metadata server thread pools and identified the traffic pattern causing the spike of requests.

The issue was mitigated at 13:47 US/Pacific on Wednesday 22 November 2023, when Google engineers put in place measures to prevent these workloads from affecting the rest of the system. These measures included limits on the workload triggering the error, as well as additional metadata server resources to allow the metadata servers to handle increased load gracefully.

For our Google BigQuery customers whose backend jobs were affected, we apologize for the length and severity of this incident. We are taking immediate steps to prevent a recurrence and improve reliability in the future:

  • Increase the capacity of the metadata and job server systems.
  • Review and improve our admission control logic to ensure that appropriate limits are enforced earlier, preventing downstream servers from being overwhelmed.
  • Improve the resource governance in the thread pool system to prevent a single query pattern from consuming a significant amount of resources.
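As a rough illustration of what admission control with per-workload limits can look like, the sketch below caps the number of in-flight requests any one workload may hold, so that a spike from one source cannot consume every slot. It is a minimal sketch under stated assumptions, not Google's implementation: the class name AdmissionController, the slot counts, and the workload labels are all hypothetical.

```python
class AdmissionController:
    """Hypothetical per-workload concurrency cap in front of a shared backend."""

    def __init__(self, total_slots, per_workload_limit):
        self.total_slots = total_slots
        self.per_workload_limit = per_workload_limit
        self.in_flight = {}  # workload id -> number of admitted requests

    def try_admit(self, workload):
        """Return True and take a slot, or False if a limit is reached."""
        used = sum(self.in_flight.values())
        mine = self.in_flight.get(workload, 0)
        if used >= self.total_slots or mine >= self.per_workload_limit:
            return False  # reject early, before the downstream server is hit
        self.in_flight[workload] = mine + 1
        return True

    def release(self, workload):
        """Give back a slot once the request has completed."""
        if self.in_flight.get(workload, 0) <= 0:
            raise ValueError(f"no in-flight requests for {workload!r}")
        self.in_flight[workload] -= 1


# Assumed limits: 4 shared slots, at most 2 per workload.
controller = AdmissionController(total_slots=4, per_workload_limit=2)

spike = [controller.try_admit("spike") for _ in range(5)]
other = [controller.try_admit("other") for _ in range(2)]

print("spike workload admitted:", spike)
print("other workload admitted:", other)
```

Here the spiking workload is held to two concurrent requests, so the second workload is still admitted. Enforcing the check at the entry point, rather than in the downstream servers, is what "enforced earlier" means in this sketch.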

Detailed Description of Impact

Customers in the EU multi-region experienced elevated latencies or could not schedule jobs in Google BigQuery on three separate occasions, for a cumulative period of 5 hours and 22 minutes.

22 Nov 2023 16:31 PST

Mini Incident Report

We apologize for the inconvenience this service disruption may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support.

(All Times US/Pacific)

  • First occurrence Start: 21 November, 2023 23:25
  • First occurrence End: 22 November, 2023 03:08
  • First occurrence Duration: 3 hours, 43 minutes
  • Second occurrence Start: 22 November, 2023 05:33
  • Second occurrence End: 22 November, 2023 06:02
  • Second occurrence Duration: 29 minutes
  • Third occurrence Start: 22 November, 2023 09:40
  • Third occurrence End: 22 November, 2023 10:50
  • Third occurrence Duration: 1 hour, 10 minutes
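The per-occurrence durations above, and the cumulative total of 5 hours and 22 minutes, can be checked with a short calculation. The sketch below uses only the start and end times listed above; the variable and function names are illustrative.

```python
from datetime import datetime, timedelta

# Start and end of each occurrence, as listed above (US/Pacific).
OCCURRENCES = [
    ("2023-11-21 23:25", "2023-11-22 03:08"),
    ("2023-11-22 05:33", "2023-11-22 06:02"),
    ("2023-11-22 09:40", "2023-11-22 10:50"),
]

FMT = "%Y-%m-%d %H:%M"


def occurrence_durations(occurrences):
    """Return the duration of each (start, end) pair as a timedelta."""
    return [
        datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
        for start, end in occurrences
    ]


durations = occurrence_durations(OCCURRENCES)
total = sum(durations, timedelta())

for d in durations:
    print(d)
print("cumulative:", total)  # cumulative: 5:22:00
```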

Affected Services and Features:

Google BigQuery

Regions/Zones: Multi-region: eu

Description:

Google BigQuery experienced three occurrences of elevated latencies in the EU multi-region for a total duration of 5 hours, 22 minutes.

From preliminary analysis, this issue was triggered by a pattern of traffic to our backend job servers that resulted in an unusually large number of requests to the downstream metadata servers, causing the job servers to slow down or drop connections. Google engineers were able to identify the source of the incoming traffic that was causing the issue, and put in place measures to prevent these workloads from affecting the rest of the system. The issue was fully mitigated as of 13:47 US/Pacific.

Google will complete a full Incident Report (IR) in the following days that will provide a full root cause analysis.

Customer Impact:

  • Affected customers experienced elevated latencies and 500 errors in the EU multi-region.

22 Nov 2023 14:08 PST

The issue with Google BigQuery has been resolved for all affected users.

Google engineers identified the workload causing the error and, as of Wednesday, 2023-11-22 13:47 US/Pacific, have put in place measures to ensure this issue does not recur.

We will publish the full root cause of this incident in the next few days.

We thank you for your patience while we worked on resolving the issue.

22 Nov 2023 12:59 PST

Summary: Elevated latency in BigQuery within EU multiregion

Description: Google BigQuery has experienced three occurrences of elevated latency in EU multiregion.

The first occurrence was between 2023-11-21 23:20 US/Pacific and 2023-11-22 02:20 US/Pacific. The second occurrence was between 2023-11-22 05:30 US/Pacific and 2023-11-22 06:05 US/Pacific. The third occurrence was between 2023-11-22 09:40 US/Pacific and 2023-11-22 10:50 US/Pacific.

There are no latency issues noticed in the system at this time. There is also no ongoing impact at the moment.

Our engineering team has identified the cause of the issue to be an unexpected workload in the EU multi-region.

All necessary teams are engaged and continuing to work on identifying a mitigation strategy.

We will provide an update by Wednesday, 2023-11-22 15:00 US/Pacific with current details.

We apologize to all who are affected by the disruption. We sincerely appreciate your patience and understanding as we work to resolve it as quickly as possible.

Diagnosis: Impacted users may experience intermittent connection errors and higher query latencies.

Workaround: There is no known workaround at this time.

22 Nov 2023 12:08 PST

Summary: BigQuery is experiencing elevated latency in EU multiregion

Description: Google BigQuery has experienced three occurrences of elevated latency in EU multiregion.

The first occurrence was between 2023-11-21 23:20 US/Pacific and 2023-11-22 02:20 US/Pacific.

The second occurrence was between 2023-11-22 05:30 US/Pacific and 2023-11-22 06:05 US/Pacific.

The third occurrence was between 2023-11-22 09:40 US/Pacific and 2023-11-22 10:50 US/Pacific.

The latencies are back to normal and there is no ongoing impact at the moment.

Our engineering team has identified the cause of the issue to be an unexpected workload in the EU multi-region and is continuing to work on identifying a mitigation strategy.

We will provide an update by Wednesday, 2023-11-22 13:08 US/Pacific with current details.

We apologize to all who are affected by the disruption.

Diagnosis: Impacted users may experience intermittent connection errors and higher query latencies.

Workaround: None at this time.

22 Nov 2023 11:04 PST

Summary: BigQuery is experiencing elevated latency in EU multiregion

Description: Google BigQuery has experienced three occurrences of elevated latency in EU multiregion.

The first occurrence was between 2023-11-21 23:20 US/Pacific and 2023-11-22 02:20 US/Pacific.

The second occurrence was between 2023-11-22 05:30 US/Pacific and 2023-11-22 06:05 US/Pacific.

The third occurrence was between 2023-11-22 09:40 US/Pacific and 2023-11-22 10:50 US/Pacific.

Our engineering team is continuing to monitor the BigQuery service closely while they investigate the cause of the issue.

We will provide an update by Wednesday, 2023-11-22 12:00 US/Pacific with current details.

We apologize to all who are affected by the disruption.

Diagnosis: Impacted users may experience intermittent connection errors and higher query latencies.

Workaround: None at this time.

22 Nov 2023 10:45 PST

Summary: BigQuery experiencing increased latency in Multiregion Europe

Description: We are experiencing an issue with Google BigQuery.

Our engineering team continues to investigate the issue.

We will provide an update by Wednesday, 2023-11-22 11:15 US/Pacific with current details.

We apologize to all who are affected by the disruption.

Diagnosis: Impacted users may experience connection errors and higher query latencies.

Workaround: None at this time.