Google Cloud Service Health

Google Cloud Service Health
Incidents
Elevated error rate with Google Cloud Storage.

This page provides status information on the services that are part of Google Cloud. Check back here to view the current status of the services listed below. If you are experiencing an issue not listed here, please contact Support. Learn more about what's posted on the dashboard in this FAQ. For additional information on these services, please visit https://cloud.google.com/.

For incidents related to Google Security Products, visit https://status.cloud.google.com/security.

Available
Service information
Service disruption
Service outage

Incident affecting Google Cloud Storage, Google App Engine, Cloud Monitoring

Elevated error rate with Google Cloud Storage.

Incident began at 2019-03-12 18:40 and ended at 2019-03-12 22:50 (all times are US/Pacific).

Date	Time	Description
14 Mar 2019	11:09 PDT	ISSUE SUMMARY On Tuesday 12 March 2019, Google's internal blob storage service experienced a service disruption for a duration of 4 hours and 10 minutes. We apologize to customers whose service or application was impacted by this incident. We know that our customers depend on Google Cloud Platform services and we are taking immediate steps to improve our availability and prevent outages of this type from recurring. DETAILED DESCRIPTION OF IMPACT On Tuesday 12 March 2019 from 18:40 to 22:50 PDT, Google's internal blob (large data object) storage service experienced elevated error rates, averaging 20% error rates with a short peak of 31% errors during the incident. User-visible Google services including Gmail, Photos, and Google Drive, which make use of the blob storage service also saw elevated error rates, although (as was the case with GCS) the user impact was greatly reduced by caching and redundancy built into those services. There will be a separate incident report for non-GCP services affected by this incident. The Google Cloud Platform services that experienced the most significant customer impact were the following: Google Cloud Storage experienced elevated long tail latency and an average error rate of 4.8%. All bucket locations and storage classes were impacted. GCP services that depend on Cloud Storage were also impacted. Stackdriver Monitoring experienced up to 5% errors retrieving historical time series data. Recent time series data was available. Alerting was not impacted. App Engine's Blobstore API experienced elevated latency and an error rate that peaked at 21% for fetching blob data. App Engine deployments experienced elevated errors that peaked at 90%. Serving of static files from App Engine also experienced elevated errors. ROOT CAUSE On Monday 11 March 2019, Google SREs were alerted to a significant increase in storage resources for metadata used by the internal blob service. On Tuesday 12 March, to reduce resource usage, SREs made a configuration change which had a side effect of overloading a key part of the system for looking up the location of blob data. The increased load eventually lead to a cascading failure. REMEDIATION AND PREVENTION SREs were alerted to the service disruption at 18:56 PDT and immediately stopped the job that was making configuration changes. In order to recover from the cascading failure, SREs manually reduced traffic levels to the blob service to allow tasks to start up without crashing due to high load. In order to prevent service disruptions of this type, we will be improving the isolation between regions of the storage service so that failures are less likely to have global impact. We will be improving our ability to more quickly provision resources in order to recover from a cascading failure triggered by high load. We will make software measures to prevent any configuration changes that cause overloading of key parts of the system. We will improve load shedding behavior of the metadata storage system so that it degrades gracefully under overload.
13 Mar 2019	00:05 PDT	We did a preliminary analysis about the impact of this issue. We confirmed that the error rate to Cloud Storage has been less than 6% during the incident.
12 Mar 2019	23:43 PDT	The issue with Google Cloud Storage has been resolved for all affected projects as of Tuesday, 2019-03-12 23:18 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.
12 Mar 2019	23:15 PDT	The issue with Cloud Storage should be resolved for the majority of projects and we expect a full resolution in the near future. We will provide another status update by Tuesday, 2019-03-12 23:45 US/Pacific with current details.
12 Mar 2019	22:51 PDT	The underlying storage infrastructure of Cloud Storage is gradually recovering . We will provide another status update by Tuesday, 2019-03-12 23:15 US/Pacific with current details.
12 Mar 2019	22:15 PDT	We still have an issue with Google Cloud Storage. Our Engineering team understands the root cause and is working to implement the solution. We will provide another status update by Tuesday, 2019-03-12 22:45 US/Pacific with current details.
12 Mar 2019	21:47 PDT	We are still working to address the root cause of the issue. We will provide another status update by Tuesday, 2019-03-12 22:15 US/Pacific with current details.
12 Mar 2019	21:15 PDT	Mitigation work is still underway by our Engineering Team. We will provide another status update by Tuesday, 2019-03-12 21:45 US/Pacific with current details.
12 Mar 2019	20:46 PDT	We confirmed that the issue affects customers in all regions. Also our Engineering Team believes they have identified the potential root causes of the errors and is still working to mitigate. We will provide another status update by Tuesday, 2019-03-12 21:15 US/Pacific with current details.
12 Mar 2019	20:31 PDT	We are still seeing the increased error rate with Google Cloud Storage. Mitigation work is currently underway by our Engineering Team. We will provide another status update by Tuesday, 2019-03-12 20:45 US/Pacific with current details.
12 Mar 2019	20:31 PDT	Elevated error rate with Google Cloud Storage

All times are US/Pacific