Google Cloud Status Dashboard

This page provides status information on the services that are part of Google Cloud Platform. Check back here to view the current status of the services listed below. If you are experiencing an issue not listed here, please contact Support. Learn more about what's posted on the dashboard in this FAQ. For additional information on these services, please visit cloud.google.com.

Google Cloud Storage Incident #16023

Elevated Error Rate from Google Cloud Storage

Incident began at 2015-05-06 19:26 and ended at 2015-05-06 20:00 (all times are US/Pacific).

May 08, 2015 01:16

SUMMARY:

On Tuesday 5 May 2015 and Wednesday 6 May 2015, Google Cloud Storage (GCS) experienced elevated request latency and error rates for a total duration of 43 minutes during two separate periods. We understand how important uptime and latency are to you and we apologize for this disruption. We are using the lessons from this incident to achieve a higher level of service in the future.

DETAILED DESCRIPTION OF IMPACT:

On Tuesday 5 May 2015 from 12:26 to 12:48 PDT requests to GCS returned with elevated latency and error rates. Averaged over the incident, 33% of requests returned error code 500 or 503. At 12:34 the error rate peaked at 42% of requests. Median latency for successful requests increased by 29%.

On Wednesday 6 May from 19:25 to 19:41 PDT, and for two minutes at 19:46 and three minutes at 19:52, the same symptoms were seen with a 64% average error rate and 55% increase in median latency.
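
The 500 and 503 responses described above are transient server-side errors, which clients typically handle by retrying with exponential backoff. As an illustration only (a minimal sketch, not the official GCS client library; the URL below is a placeholder), such a retry loop might look like:

    # Minimal sketch: retry transient GCS errors (HTTP 500/503) with
    # exponential backoff and jitter. Illustrative only; the URL is a
    # placeholder and real applications should use the official client library.
    import random
    import time

    import requests

    def fetch_object(url, max_attempts=5):
        """Fetch an object over HTTP, retrying 500/503 responses."""
        for attempt in range(max_attempts):
            resp = requests.get(url)
            if resp.status_code not in (500, 503):
                resp.raise_for_status()
                return resp.content
            # Wait 2^attempt seconds plus jitter before the next attempt.
            time.sleep((2 ** attempt) + random.random())
        raise RuntimeError("retries exhausted for " + url)

    # Example call (placeholder bucket and object):
    # data = fetch_object("https://storage.googleapis.com/example-bucket/example-object")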

ROOT CAUSE:

At 12:25 on 5 May 2015 GCS received an extremely high rate of requests to a small set of GCS objects, causing high load and queuing on a single metadata database shard. This load caused a fraction of unrelated GCS requests to be queued as well, resulting in latency and timeouts visible to other GCS users.

At 19:25 on 6 May 2015 GCS received a second round of unusual load to a different set of objects, causing a recurrence of the same issue.
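
To illustrate the failure mode (a toy model, not a description of GCS internals; the shard count and hash-based placement are assumptions), a burst of requests against a handful of object names concentrates load on whichever shard holds their metadata, and unrelated objects that happen to live on that shard queue behind the burst:

    # Toy model of metadata-shard hotspotting. Illustrative only: the shard
    # count and hash-based placement are assumptions, not GCS internals.
    from collections import Counter

    NUM_SHARDS = 16

    def shard_for(object_name):
        # Assume an object's metadata lives on a shard chosen by hashing its name.
        return hash(object_name) % NUM_SHARDS

    # A roughly uniform background workload ...
    workload = ["bucket/obj-%d" % (i % 1000) for i in range(10000)]
    # ... plus a burst against a single hot object.
    workload += ["bucket/hot-object"] * 20000

    load = Counter(shard_for(name) for name in workload)
    hot_shard, hot_count = load.most_common(1)[0]
    print("shard %d serves %d of %d requests" % (hot_shard, hot_count, len(workload)))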

REMEDIATION AND PREVENTION:

In both incidents, Google engineers identified the affected set of GCS objects and increased localized caching to add serving capacity. The Google support team also contacted the project generating the load and worked with its owners to reduce their demand.

In addition to these tactical fixes, Google engineers will enable service-wide caching for the affected GCS components. Google engineers are also working on other steps to improve service isolation between unrelated GCS objects and projects.
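
The general shape of the caching mitigation is a read-through cache in front of the metadata lookup, so repeated requests for the same hot objects are absorbed locally rather than queuing on the backing shard. A minimal sketch, assuming a hypothetical lookup_metadata_from_shard backend call (not a real GCS API):

    # Minimal read-through cache sketch. lookup_metadata_from_shard is a
    # hypothetical stand-in for the expensive shard read, not a GCS API.
    from functools import lru_cache

    def lookup_metadata_from_shard(object_name):
        # Placeholder for the real metadata read against the backing shard.
        return {"name": object_name, "generation": 1}

    @lru_cache(maxsize=10000)
    def cached_metadata(object_name):
        # After the first lookup, hot objects are served from this local cache,
        # so repeated requests no longer queue on the shard.
        return lookup_metadata_from_shard(object_name)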

May 06, 2015 20:28

We experienced elevated error rates on Google Cloud Storage beginning at 2015-05-06 19:26 (all times are in US/Pacific). The issue was resolved as of 2015-05-06 20:00. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.
