Service Health


Incident affecting Google BigQuery

Elevated error rates for API and Web UI

Incident began at 2015-05-07 20:45 and ended at 2015-05-07 21:20 (all times are US/Pacific).

Date Time Description
8 May 2015 21:08 PDT

SUMMARY:

On Thursday 7 May 2015, requests to the Google BigQuery Web UI and APIs experienced errors for a total duration of 2 hours and 9 minutes over two separate periods. We understand the high level of reliability that is demanded and expected of a service like BigQuery and apologize for the disruption. We are taking immediate actions to ensure we minimize the risk of this issue repeating itself.

DETAILED DESCRIPTION OF IMPACT:

On Thursday 7 May 2015 from 20:45 to 21:20 PDT and on Friday 8 May 2015 from 03:13 to 04:47 PDT, requests to the Web UI resulted in the page hanging with the message “Loading BigQuery…”. Additionally, users accessing BigQuery via the API received responses with error code 400 or 500.
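As a cross-check, the total duration quoted in the summary can be reproduced from the two impact windows given above. The snippet below is only an arithmetic sketch; the function name `total_minutes` is illustrative and not part of any BigQuery tooling.

```python
from datetime import datetime

# Impact windows as stated in this report (all times US/Pacific).
WINDOWS = [
    ("2015-05-07 20:45", "2015-05-07 21:20"),
    ("2015-05-08 03:13", "2015-05-08 04:47"),
]


def total_minutes(windows):
    """Sum the length of each (start, end) window in whole minutes."""
    fmt = "%Y-%m-%d %H:%M"
    total = 0
    for start, end in windows:
        delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
        total += int(delta.total_seconds() // 60)
    return total


minutes = total_minutes(WINDOWS)
hours, remainder = divmod(minutes, 60)
print(f"{hours} hours and {remainder} minutes")  # 2 hours and 9 minutes
```

The first window contributes 35 minutes and the second 94 minutes, which matches the 2 hours and 9 minutes reported in the summary.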

ROOT CAUSE:

A routine software upgrade to the authorization process in BigQuery had the side effect of reducing the cache hit rate of dataset permission validation. A particular query load then triggered a cascade of live authorization checks that fanned out and amplified throughout the BigQuery service, eventually causing user-visible errors as the authorization backends became overwhelmed. As a byproduct, error rates for the service increased as individual requests failed to authorize.
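To illustrate why a lower cache hit rate can overwhelm a backend, the sketch below models the number of live authorization checks as a function of query rate, fan-out, and cache hit rate. All numbers and the function name are hypothetical and chosen only for illustration; they are not measurements from the BigQuery service.

```python
def live_checks_per_second(queries_per_second, checks_per_query, cache_hit_percent):
    """Authorization checks per second that miss the cache and reach the backend.

    cache_hit_percent is an integer from 0 to 100.
    """
    miss_percent = 100 - cache_hit_percent
    return queries_per_second * checks_per_query * miss_percent / 100


# Hypothetical load: 100 queries/s, each fanning out into 50 permission checks.
before = live_checks_per_second(100, 50, cache_hit_percent=99)
after = live_checks_per_second(100, 50, cache_hit_percent=80)

print(f"backend checks/s at 99% hit rate: {before:.0f}")
print(f"backend checks/s at 80% hit rate: {after:.0f}")
print(f"amplification factor: {after / before:.0f}x")
```

Under these assumed numbers, a drop in hit rate from 99% to 80% multiplies backend load by 20 even though query traffic is unchanged. Any retries issued on failure would add to this load.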

REMEDIATION AND PREVENTION:

Google engineers identified and canceled the problematic in-flight BigQuery queries that were causing a high number of retries to the permissions validation backend. Engineers then temporarily disabled retries for these queries so that they would not amplify the effect of unhealthy permission validation backends. Google engineers also adjusted the retry parameters of the authorization system to return cache hit rates to normal. As the system stabilized, BigQuery engineers gradually allowed query traffic to flow in and re-enabled permission validation, restoring service.

To prevent future recurrences of this issue, Google engineers will change the structure of permissions validation so that continual retries will not destabilize the entire service. This restructuring includes reducing the number of backends that require permissions validation by changing the steps involved in the BigQuery request validation process. Engineers will also introduce safety limits governing communication between BigQuery and the permissions validation system. Google engineers are also adding monitoring to better detect and potentially preemptively mitigate issues of this nature.
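This report does not describe how the safety limits will be implemented. One common way to keep retries from amplifying load on an unhealthy backend is a retry budget, sketched below under stated assumptions. The class `RetryBudget`, the function `backend_calls`, and all parameters are hypothetical and are not BigQuery's actual mechanism.

```python
class RetryBudget:
    """Permit retries only up to a fixed percentage of first attempts."""

    def __init__(self, retry_percent=10, min_retries=1):
        self.retry_percent = retry_percent  # retries allowed per 100 first attempts
        self.min_retries = min_retries      # small floor so low traffic can still retry
        self.first_attempts = 0
        self.retries = 0

    def record_attempt(self):
        self.first_attempts += 1

    def can_retry(self):
        allowed = max(self.min_retries,
                      self.first_attempts * self.retry_percent // 100)
        if self.retries < allowed:
            self.retries += 1
            return True
        return False


def backend_calls(requests, max_retries, budget=None):
    """Count calls sent to a backend that fails every call (worst case)."""
    calls = 0
    for _ in range(requests):
        calls += 1  # first attempt
        if budget is not None:
            budget.record_attempt()
        for _ in range(max_retries):
            if budget is not None and not budget.can_retry():
                break  # budget exhausted: fail fast instead of retrying
            calls += 1
    return calls


unbounded = backend_calls(requests=100, max_retries=3)
bounded = backend_calls(requests=100, max_retries=3, budget=RetryBudget())
print(f"calls without a budget: {unbounded}")
print(f"calls with a 10% retry budget: {bounded}")
```

In this worst-case model, 100 failing requests with three retries each produce 400 backend calls, while the budget caps the total at 110, so a failing backend sees close to its normal request volume rather than a multiple of it.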

8 May 2015 05:12 PDT

The issue with BigQuery elevated error rates for API and Web UI should be resolved for all affected users as of 2015-05-08 04:47 (all times are in US/Pacific). We will provide a more detailed analysis of this incident once we have completed our internal investigation.

8 May 2015 04:27 PDT

We are still investigating the issue with BigQuery elevated error rates for API and Web UI. We will provide another status update by 2015-05-08 05:30 (all times are in US/Pacific) with current details.

8 May 2015 04:05 PDT

We are experiencing a recurrence of this problem starting at 03:00 US/Pacific time on Friday 2015-05-08. We will provide an update by 04:30 US/Pacific time.

7 May 2015 22:44 PDT

We experienced an issue with the Google BigQuery Web UI and BigQuery APIs, beginning at 2015-05-07 20:45 (all times are in US/Pacific). The issue should be resolved as of 2015-05-07 21:20. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.