Google Cloud Status Dashboard
This page provides status information on the services that are part of Google Cloud Platform. Check back here to view the current status of the services listed below. If you are experiencing an issue not listed here, please contact Support. Learn more about what's posted on the dashboard in this FAQ. For additional information on these services, please visit cloud.google.com.
Google BigQuery Incident #19002
High error rate on multiple Google BigQuery APIs in the US region
Incident began at 2019-03-08 00:45 and ended at 2019-03-08 01:30 (all times are US/Pacific).
Mar 18, 2019 11:25

ISSUE SUMMARY

On Friday 8 March 2019, Google BigQuery’s jobs.insert API in the US region experienced an elevated error rate averaging 51.21% for a duration of 45 minutes. BigQuery’s Streaming API was unaffected during this period. We understand how important BigQuery’s availability is to our customers’ business analytics, and we sincerely apologize for the impact caused by this incident. We are taking the immediate steps detailed below to prevent this situation from happening again.

DETAILED DESCRIPTION OF IMPACT

On Friday 8 March 2019 from 00:45 to 01:30 US/Pacific, BigQuery’s jobs.insert [1] API (responsible for import/export, query, and copy jobs) in the US region experienced an average error rate of 51.21%. Affected customers received error responses such as “Error encountered during Execution, retrying may solve the problem” and “Read timed out” when sending requests to BigQuery. BigQuery’s Streaming API was not impacted by this incident. The following is a breakdown of the errors experienced during the incident:

ROOT CAUSE

A recent change to BigQuery’s shuffle scheduling service [2] introduced the potential for the service to enter a state in which it was unable to process shuffle jobs. A new canary release was deployed to fix the potential issue. However, this release contained an unrelated issue that placed an overly restrictive rate limit on the shuffle service, preventing it from operating nominally. The strict rate limit created a large job backlog for the BigQuery Job Server, which caused BigQuery to return errors such as “Error encountered during Execution, retrying may solve the problem” and “Read timed out” to users.

REMEDIATION AND PREVENTION

Google engineers were automatically alerted at 00:47 and immediately began their investigation. The root cause was identified at 01:23, and our engineers mitigated the issue by redirecting traffic away from the impacted datacenter at 01:27. The incident was fully resolved by 01:30. We are taking immediate action to prevent recurrence. First, we have implemented a fix to prevent the shuffle service from entering a state in which it is unable to process jobs. Second, we are allocating additional capacity to BigQuery’s US region to reduce the impact of traffic redirections on adjacent datacenters running the service. Additionally, we are increasing the precision of our monitoring to enable swifter and more accurate diagnosis of BigQuery issues going forward.

[1] https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/insert
[2] https://cloud.google.com/blog/products/gcp/in-memory-query-execution-in-google-bigquery
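The root-cause chain above (an overly restrictive rate limit on an internal service building up a job backlog, which then surfaces to callers as errors and timeouts) can be made concrete with a small, purely hypothetical queue simulation. Nothing below reflects BigQuery's actual shuffle scheduler or Job Server; the arrival rate, rate limit, and timeout values are made-up numbers chosen only to show how a limit below the arrival rate makes the backlog, and therefore each job's wait, grow until requests start failing.

```python
# Hypothetical illustration only: a job queue drained by a rate-limited worker.
# If the permitted processing rate drops below the arrival rate, the backlog
# (and each job's wait) grows until callers start timing out.
from collections import deque

ARRIVAL_RATE = 100       # jobs arriving per second (made-up)
TIMEOUT_SECONDS = 30     # caller gives up ("Read timed out") after this wait

def simulate(processing_limit, seconds=60):
    """Return (backlog, timed_out) after `seconds` with a limit of
    `processing_limit` jobs/second applied to the queue."""
    backlog = deque()
    timed_out = 0
    for now in range(seconds):
        backlog.extend([now] * ARRIVAL_RATE)          # new jobs arrive
        for _ in range(min(processing_limit, len(backlog))):
            enqueued_at = backlog.popleft()           # oldest job first
            if now - enqueued_at > TIMEOUT_SECONDS:
                timed_out += 1                        # surfaced to the caller as an error
    return len(backlog), timed_out

print(simulate(processing_limit=120))  # generous limit: backlog stays empty
print(simulate(processing_limit=10))   # overly strict limit: backlog piles up
```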
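For callers hit by transient failures like the ones described in this incident (“Error encountered during Execution, retrying may solve the problem”, “Read timed out”, and 503 responses), the usual client-side mitigation is to retry the job with exponential backoff. The sketch below shows one way to do that with the google-cloud-bigquery Python client; the exception set, attempt count, and backoff values are illustrative choices, not official guidance from this incident report.

```python
# Minimal sketch: retry a BigQuery query job on transient server-side errors.
# Assumes the google-cloud-bigquery client library and default credentials.
import time

from google.api_core import exceptions
from google.cloud import bigquery

# Errors that are generally safe to retry (illustrative, not exhaustive).
TRANSIENT_ERRORS = (
    exceptions.InternalServerError,   # 500: "Error encountered during Execution"
    exceptions.ServiceUnavailable,    # 503: backend temporarily unavailable
    exceptions.DeadlineExceeded,      # timeouts such as "Read timed out"
)

def run_query_with_retries(client, sql, max_attempts=5):
    """Submit a query job (jobs.insert under the hood) and retry transient failures."""
    for attempt in range(max_attempts):
        try:
            job = client.query(sql)          # inserts a new query job
            return job.result()              # waits for the job to finish
        except TRANSIENT_ERRORS:
            if attempt == max_attempts - 1:
                raise                        # out of attempts: surface the error
            time.sleep(2 ** attempt)         # exponential backoff: 1s, 2s, 4s, ...

if __name__ == "__main__":
    rows = run_query_with_retries(bigquery.Client(), "SELECT 1")
    for row in rows:
        print(row)
```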
Mar 08, 2019 02:51

The issue with the Google BigQuery API returning 503 errors has been resolved for all affected projects as of 01:30 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.
Mar 08, 2019 02:30

Mitigation work is currently underway by our engineering team. We will provide another status update by Friday, 2019-03-08 03:30 US/Pacific with current details.
Mar 08, 2019 01:24

We are investigating an issue with Google BigQuery. We will provide more information by Friday, 2019-03-08 02:30 US/Pacific.
Mar 08, 2019 01:24

We are investigating an issue with Google BigQuery.