Google Cloud Status

This page provides status information on the services that are part of the Google Cloud Platform. Check back here to view the current status of the services listed below. For additional information on these services, please visit cloud.google.com.

Google Cloud Storage Incident #16024

Increased error rate and latency in GCS uploads

Incident began at 2015-05-15 03:30 and ended at 2015-05-15 05:45 (all times are US/Pacific).

Date Time Description
May 18, 2015 04:11

SUMMARY:

On Friday 15 May 2015, uploads to Google Cloud Storage experienced increased latency and error rates for a duration of 1 hour 38 minutes. If your service or application was affected, we apologize. We understand that many services and applications rely on consistent performance when uploading objects to our service and we failed to uphold that level of performance. We are taking immediate actions to ensure this issue does not happen again.

DETAILED DESCRIPTION OF IMPACT:

On Friday 15 May 2015 from 03:35 to 05:13 PDT, uploads to Google Cloud Storage either failed or took longer than expected. During the incident, 6% of all POST requests globally returned error code 503 between 03:35 and 03:43, and the error rate remained above 0.5% until 05:13. Google Cloud Storage is a highly distributed system; much of the impact was concentrated in one of Google's US datacenters, where the error rate peaked at over 40%. Median latency for successful requests increased 16% compared to typical levels, while latency at the 90th and 99th percentiles increased 29% and 63%, respectively.
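
For illustration, error-rate and percentile figures like those above can be derived from request logs along the following lines. This is a minimal sketch in Python that assumes a hypothetical list of (status_code, latency_ms) records and hypothetical baseline latencies; it is not GCS's actual log format or tooling.

    # Minimal sketch: error rate and latency-percentile increases from
    # hypothetical (status_code, latency_ms) request records. The record
    # format and baseline values are illustrative assumptions only.
    import numpy as np

    def summarize(records, baseline_ms):
        """records: list of (status_code, latency_ms) tuples.
        baseline_ms: dict with typical 'p50', 'p90', 'p99' latencies."""
        statuses = np.array([s for s, _ in records])
        ok_latencies = np.array([lat for s, lat in records if s < 400])

        p50, p90, p99 = np.percentile(ok_latencies, [50, 90, 99])
        return {
            "503_rate_pct": 100.0 * np.mean(statuses == 503),
            # percentage increase over typical latency at each percentile
            "p50_increase_pct": 100.0 * (p50 / baseline_ms["p50"] - 1),
            "p90_increase_pct": 100.0 * (p90 / baseline_ms["p90"] - 1),
            "p99_increase_pct": 100.0 * (p99 / baseline_ms["p99"] - 1),
        }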

ROOT CAUSE:

A periodic replication job, run automatically against Google Cloud Storage's underlying storage system, increased load on that system and reduced the resources available for processing new uploads. As a result, latency increased and uploads to GCS either stalled while waiting to complete or failed.

REMEDIATION AND PREVENTION:

Google engineers were alerted at 03:12 PDT to increased latency in one of the datacenters responsible for processing uploads, and redirected upload traffic to several other datacenters to distribute the load. When it became evident that this redirection had not alleviated the increase in latency, engineers began provisioning additional capacity while continuing to investigate the underlying root cause. Once the replication job was identified as the source of the increased load, Google engineers reduced the rate of replication and service was restored.
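
As a generic illustration of reducing a background job's request rate (a sketch under stated assumptions, not Google's internal replication system), a token-bucket limiter can cap how quickly replication work is issued so that latency-sensitive upload traffic retains headroom:

    # Illustrative sketch only: throttling a background replication job with
    # a token bucket. Names, rates, and the replication loop are hypothetical.
    import time

    class TokenBucket:
        def __init__(self, rate_per_sec, burst):
            self.rate = rate_per_sec          # tokens refilled per second
            self.capacity = burst             # maximum burst size
            self.tokens = burst
            self.last = time.monotonic()

        def acquire(self, n=1):
            """Block until n tokens are available, then consume them."""
            while True:
                now = time.monotonic()
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= n:
                    self.tokens -= n
                    return
                time.sleep((n - self.tokens) / self.rate)

    # Hypothetical usage: cap replication at 100 copy operations per second.
    limiter = TokenBucket(rate_per_sec=100, burst=20)
    # for chunk in replication_queue:   # replication_queue is hypothetical
    #     limiter.acquire()
    #     replicate(chunk)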

In the short term, Google engineers will add monitoring to the underlying storage layer to better identify problematic load conditions and the tasks responsible for them. In the longer term, Google engineers will isolate the impact that the replication job can have on the latency and performance of other services.
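
As a generic sketch of the kind of load monitoring described above, assuming a hypothetical per-task utilization feed rather than the storage layer's actual interfaces, an alert can report both the overall load and the tasks contributing most to it:

    # Generic sketch: flag high storage-layer load and attribute it to the
    # tasks responsible. The metrics feed and threshold are assumptions.
    def check_storage_load(per_task_utilization, alert_threshold=0.8):
        """per_task_utilization: dict mapping task name to the fraction of
        storage-layer capacity it currently consumes (hypothetical feed)."""
        total = sum(per_task_utilization.values())
        if total < alert_threshold:
            return None
        # Surface the largest contributors so responders can throttle the
        # right job (e.g. a periodic replication run).
        top = sorted(per_task_utilization.items(),
                     key=lambda kv: kv[1], reverse=True)[:3]
        return {"total_utilization": total, "top_contributors": top}

    # Example: a replication job dominating load would be listed first.
    alert = check_storage_load({"replication": 0.55, "uploads": 0.30, "gc": 0.05})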

May 15, 2015 06:02

Correction: the start of the incident occurred on Friday, 2015-05-15 03:30 US/Pacific.

May 15, 2015 05:58

GCS API experienced increased error rate and latency for uploads starting Friday, 2015-05-15 05:30 and ending Friday, 2015-05-15 05:15 US/Pacific. We will provide a more detailed analysis of this incident once we have completed our internal investigation.
