Service Health

This page provides status information on the services that are part of Google Cloud. Check back here to view the current status of the services listed below. If you are experiencing an issue not listed here, please contact Support. Learn more about what's posted on the dashboard in this FAQ. For additional information on these services, please visit https://cloud.google.com/.

Incident affecting Cloud Logging, Google BigQuery, Operations

Issue with BigQuery Streaming API and Cloud Logging in US Multiregion

Incident began at 2024-06-10 18:07 and ended at 2024-06-10 20:09 (all times are US/Pacific).

Previously affected location(s)

Multi-region: us

26 Jun 2024 15:08 PDT

Incident Report

Summary

Between Thursday, 30 May 2024 and 23:45 US/Pacific on Thursday, 13 June 2024, BigQuery customers experienced elevated query latencies in the us and eu multi-regions. As a result, operations requiring compute resources (queries, loads, extracts) would also have experienced resource contention during the same period, and some customers with queries or operations relying on the BigQuery Autoscaler would have been unable to scale up to their desired limits.

Additionally, between 16:46 and 20:00 US/Pacific on Monday, 10 June 2024, BigQuery experienced elevated errors across multiple APIs (Jobs, Query, Load, Extract, Storage Read and Write APIs) in the us multi-region.

Root Cause

BigQuery uses a distributed shuffle infrastructure to execute the large and complex joins, aggregations, and analytic operations needed by queries. Shuffle storage is tiered: it is optimized to keep data in memory, but flushes to SSD and then HDD as aggregate demand increases. The incident was caused by a combination of factors.

  1. Colossus, Google's distributed file system [1], was migrating to a newer version. This migration caused a gradual increase in traffic to the new system per zone across the fleet.
  2. The SSD cache configuration for flushing was not appropriately set up for the newer file system version.
  3. As a result, BigQuery gradually lost portions of its SSD cache in the relevant zones, proportional to the traffic migrated to the new system.
  4. Queries needing to flush to disk experienced increased latency as flushing directly to HDD increasingly dominated.
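
The tiered write path described above can be illustrated with a short sketch. This is a simplified model, not BigQuery's actual implementation: the tier names, capacities, and latencies below are invented solely to show how losing most of the SSD cache pushes shuffle flushes onto HDD and drives up latency.

```python
# Simplified model of a tiered shuffle store (illustrative only, not
# BigQuery's actual code): writes prefer memory, fall back to an SSD
# cache, and only then land on HDD. Shrinking the SSD tier, as happened
# when the cache configuration did not carry over to the newer file
# system version, pushes more flushes onto the much slower HDD tier.

from dataclasses import dataclass


@dataclass
class Tier:
    name: str
    capacity_bytes: int
    write_latency_ms: float
    used_bytes: int = 0

    def try_write(self, size_bytes: int) -> bool:
        """Accept the block if it fits; otherwise fall through to the next tier."""
        if self.used_bytes + size_bytes > self.capacity_bytes:
            return False
        self.used_bytes += size_bytes
        return True


class TieredShuffleStore:
    def __init__(self, tiers):
        self.tiers = tiers  # ordered fastest-first: memory, SSD cache, HDD

    def flush(self, size_bytes: int) -> float:
        """Write to the fastest tier with free space and return the latency paid."""
        for tier in self.tiers:
            if tier.try_write(size_bytes):
                return tier.write_latency_ms
        raise RuntimeError("all tiers are full")


GIB = 1 << 30
# Hypothetical capacities and latencies, chosen only to make the effect visible.
healthy = TieredShuffleStore([
    Tier("memory", 64 * GIB, 0.1),
    Tier("ssd-cache", 512 * GIB, 1.0),
    Tier("hdd", 10_000 * GIB, 20.0),
])
degraded = TieredShuffleStore([
    Tier("memory", 64 * GIB, 0.1),
    Tier("ssd-cache", 32 * GIB, 1.0),  # most of the SSD cache lost
    Tier("hdd", 10_000 * GIB, 20.0),
])

for label, store in (("healthy", healthy), ("degraded", degraded)):
    latencies = [store.flush(GIB) for _ in range(200)]  # 200 x 1 GiB shuffle blocks
    print(f"{label}: mean flush latency {sum(latencies) / len(latencies):.1f} ms")
```

In this toy model the degraded store pays HDD latency for most flushes once the small SSD tier fills, which mirrors the gradually increasing query latency observed during the incident.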

Diverting network traffic away from an affected zone is a mitigation usually taken while determining the root cause of a problem. However, an operator error resulted in a reduction of capacity in multiple zones simultaneously at 16:46 US/Pacific on Monday, 10 June 2024. This led to elevated errors across BigQuery APIs until 20:00 US/Pacific the same day, when the impact of the excessive traffic redirection was mitigated.

[1] : https://cloud.google.com/blog/products/storage-data-transfer/a-peek-behind-colossus-googles-file-system

Remediation and Prevention

  • At 10:23 US/Pacific on Thursday, 30 May 2024, Google engineers were alerted to occasional increased latency of operations materializing data into storage in one zone in the US region. BigQuery automation redirected traffic out of the zone and an investigation into the root cause began.
  • By Wednesday, 5 June 2024, Google engineers expedited the investigation by coordinating between BigQuery engineering and several GCP infrastructure teams.
  • By Monday, 10 June 2024, other BigQuery zones had alerted on similar symptoms, and automation was prevented from redirecting traffic as a mitigation. Customer reports of slowness began to accumulate.
  • [Wider Impact Starts] At 16:46 on Monday, 10 June 2024, an erroneous, excessive redirection of traffic occurred. At 18:07 US/Pacific, the incident escalated further, with multiple infrastructure, BigQuery engineering, and incident and customer management teams involved.
  • At 18:28 the excessive traffic redirection was rectified. Google engineers then took several actions to quickly absorb the incoming traffic by adding front end capacity and shifting traffic between zones while internal caches recovered.
  • [Wider Impact Fully Mitigated] At 20:00 on Monday, 10 June 2024, traffic returned to its baseline from before the major sub-incident. Escalated incident management continued in order to root-cause the original symptoms of storage slowness.
  • Due to the nature of the root cause, zone traffic redirection, which is the fastest and most reliable mitigation for users impacted by a slow zone, caused the problem to shift to other BigQuery customers elsewhere. This unfortunately complicated investigation and remediation and extended the duration of impact until the true root cause and trigger were identified.
  • [Incident Fully Mitigated] At 21:00 on Thursday, 13 June 2024, the fundamental root cause (the gradual loss of SSD caching) and its secondary and tertiary impacts on shuffle flush performance and query latency were confirmed. Google engineers rectified the SSD cache configuration (whose size had already been increased in many zones as a mitigation), and the incident was fully mitigated.

We apologize for the length and severity of this incident. We are taking immediate steps to prevent a recurrence and improve reliability in the future.

  • Enhanced Detection:
    • Enhance BigQuery’s telemetry to detect anomalies in SSD cache utilization faster and more efficiently.
  • Increase the SSD cache capacity for BigQuery Shuffle flushing across the fleet and rectify the SSD cache configuration to restore shuffle flushing performance, shortening mitigation time for any future occurrences.
  • Preventive Action Items:
    • Increase the safeguards against excessive traffic redirection by any means (manual or automatic).
    • Improve BigQuery’s resilience to sudden increases in traffic to the Streaming APIs so that it recovers faster.

Detailed Description of Impact

BigQuery:

  • A subset of customers would have experienced 500 errors while executing calls to the insertAll and Storage Write APIs in the US multi-region. Additionally, some customers may have experienced system errors using the Jobs and Query APIs. (An illustrative client-side retry sketch follows this list.)
  • BigQuery customers would have also experienced periods of reduced query performance and longer latencies. As a side effect, resource contention within user reservations would have also increased.
  • Some BigQuery customers experienced periods during which they were unable to consistently scale up resources using the BigQuery Autoscaler.
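
As a rough illustration of the insertAll impact noted above, the sketch below shows how a client might retry streaming inserts that fail with transient server-side errors. This is not an official workaround (none was available during the incident); it assumes the google-cloud-bigquery Python client, and the table name, rows, and backoff values are placeholders.

```python
# Illustrative only: retry transient 500/503 errors on streaming inserts.
# The project/dataset/table and backoff values below are hypothetical.
import time

from google.api_core import exceptions
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"  # hypothetical table
rows = [{"event": "example", "ts": "2024-06-10T17:00:00Z"}]

delay_s = 1.0
for attempt in range(5):
    try:
        # insert_rows_json uses the tabledata.insertAll API under the hood.
        errors = client.insert_rows_json(table_id, rows)
        if errors:
            print("row-level errors:", errors)
        break
    except (exceptions.InternalServerError, exceptions.ServiceUnavailable):
        # Back off and retry on transient server-side errors.
        time.sleep(delay_s)
        delay_s *= 2
```

In practice, supplying row insert IDs (the row_ids argument of insert_rows_json) lets the service deduplicate rows that are retried.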

Cloud Logging: Ingestion delays for analytics buckets in the global location. Logs Explorer queries were not affected.

11 Jun 2024 09:11 PDT

Mini Incident Report

We apologize for the inconvenience this service disruption/outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support.

(All Times US/Pacific)

Incident Start: 10 June 2024 16:45

Incident End: 10 June 2024 20:00

Duration: 3 hours and 15 minutes

Affected Services and Features:

Google BigQuery and Google Cloud Logging

Regions/Zones: Multi-regions: US

Description:

BigQuery experienced elevated errors across multiple APIs in the US multi-region due to the concurrent mitigation of simultaneous degradations, which impacted the BigQuery projects hosted in two degraded clusters. Google will complete a full incident report in the following days that will provide a full root cause.

Customer Impact:

BigQuery: A subset of customers would have experienced HTTP 500 errors while executing calls to the insertAll and Storage Write APIs in the US multi-region. Additionally, some customers may have experienced system errors using the Jobs and Query APIs.

Cloud Logging: Cloud Logging customers using Log Analytics faced log ingestion delays of up to 2 hours for their analytics buckets if they ingested logs via the Cloud regions us-central1 or us-central2. Logs Explorer queries in the Google Cloud console were not impacted.

10 Jun 2024 20:09 PDT

The issue with Cloud Logging and Google BigQuery has been resolved for all affected users as of Monday, 2024-06-10 20:00 US/Pacific.

We will publish an analysis of this incident once we have completed our internal investigation.

We thank you for your patience while we worked on resolving the issue.

10 Jun 2024 19:06 PDT

Summary: Issue with BigQuery Streaming API and Cloud Logging in US Multiregion

Description: Google engineers are working on mitigating the problem, and we are observing the error rate for the impacted APIs dropping. We are closely monitoring the progress of the mitigation.

We will provide more information by Monday, 2024-06-10 20:30 US/Pacific.

Diagnosis: Google BigQuery: Customers impacted by this issue may see 500 errors while executing BigQuery statements.

Cloud Logging: Ingestion delays for analytics buckets in the global location. Logs Explorer queries are not affected.

Workaround: None at this time.

10 Jun 2024 18:36 PDT

Summary: Issue with BigQuery Streaming API and Cloud Logging

Description: We are experiencing an issue with the Google BigQuery Streaming API and Cloud Logging, beginning on Monday, 2024-06-10 16:46 US/Pacific.

Our engineering team continues to investigate the issue.

We will provide an update by Monday, 2024-06-10 18:59 US/Pacific with current details.

Diagnosis: Google BigQuery: Customers impacted by this issue may see 500 errors while executing BigQuery statements.

Cloud Logging: Ingestion delays for analytics buckets in the global location. Logs Explorer queries are not affected.

Workaround: None at this time.