Google Cloud Status Dashboard

This page provides status information on the services that are part of Google Cloud Platform. Check back here to view the current status of the services listed below. If you are experiencing an issue not listed here, please contact Support. Learn more about what's posted on the dashboard in this FAQ. For additional information on these services, please visit cloud.google.com.

Google Cloud Storage Incident #16025

Increased error rate in Google Cloud Storage

Incident began at 2015-06-09 15:31 and ended at 2015-06-09 17:15 (all times are US/Pacific).

Jun 11, 2015 02:11

SUMMARY:

On Tuesday 9 June 2015, Google Cloud Storage served elevated error rates and latency for 1 hour and 40 minutes. If your service or application was affected, we apologize. We understand that many services and applications rely on consistent performance of our service, and we failed to uphold that level of performance. We are taking immediate action to ensure this issue does not happen again.

DETAILED DESCRIPTION OF IMPACT:

On Tuesday 9 June 2015 from 15:29 to 17:09 PDT, an average of 52.7% of global requests to Cloud Storage resulted in HTTP 500 or 503 responses. The impact was most pronounced for GCS requests in the US, with a 55% average error rate, while requests to GCS in Europe and Asia failed at average rates of 25% and 1%, respectively.
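
Applications affected by an outage like this typically treat HTTP 500 and 503 responses as transient and retry with exponential backoff and jitter rather than failing immediately. The sketch below is a minimal illustration of that client-side pattern in Python; it is not part of the incident report, and the bucket and object in the example URL are hypothetical placeholders.

```python
import random
import time

import requests

RETRYABLE = {500, 503}  # the transient server-side errors seen during this incident


def get_with_backoff(url, max_attempts=5, base_delay=1.0, max_delay=32.0):
    """Fetch a URL, retrying transient 5xx errors with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        resp = requests.get(url)
        if resp.status_code not in RETRYABLE or attempt == max_attempts - 1:
            return resp
        # Back off 1s, 2s, 4s, ... (capped), with jitter to avoid synchronized retries.
        delay = min(max_delay, base_delay * (2 ** attempt)) * random.uniform(0.5, 1.5)
        time.sleep(delay)


# Hypothetical public object URL; substitute a real bucket and object name.
resp = get_with_backoff("https://storage.googleapis.com/example-bucket/example-object")
print(resp.status_code)
```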

ROOT CAUSE:

The incident resulted from a change in the default behavior of a dependency of Cloud Storage that was included in a new GCS server release. At scale, the change led to pathological retry behavior that caused increased latency, request timeouts, threadpool saturation, and an elevated error rate for the service. Google follows a canary process for all new releases, upgrading a small number of servers and looking for problems before releasing the change everywhere in a gradual fashion. In this case, the traffic served from those canary servers was not sufficient to expose the issue.
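
The report does not spell out the dependency's retry logic, but the general failure mode is well known: when a dependency retries failed requests aggressively and without limits, every failure adds more load, which produces more failures, until worker threadpools saturate. The toy model below (illustrative numbers only, not taken from the incident) estimates the steady-state offered load when each failed request is immediately retried a fixed number of times, showing how a small overload can snowball into a majority error rate.

```python
def steady_state_load(incoming_qps, capacity_qps, retries_per_failure, iters=100):
    """Fixed-point estimate of offered load when every failed request is
    immediately retried `retries_per_failure` times (no backoff, no retry budget)."""
    offered = incoming_qps
    failure_fraction = 0.0
    for _ in range(iters):
        # Requests beyond capacity fail; failures generate retries, raising offered load.
        failure_fraction = max(0.0, 1.0 - capacity_qps / offered)
        offered = incoming_qps * (1.0 + retries_per_failure * failure_fraction)
    return offered, failure_fraction


# A backend 10% over capacity: without retries ~9% of requests fail, but
# unbounded immediate retries push the failure rate far higher.
for retries in (0, 1, 3):
    offered, p_fail = steady_state_load(incoming_qps=1100, capacity_qps=1000,
                                        retries_per_failure=retries)
    print(f"retries={retries}: offered load ~{offered:.0f} qps, failure rate ~{p_fail:.0%}")
```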

REMEDIATION AND PREVENTION:

Automated monitoring detected increased latency for Cloud Storage requests in one datacenter at 15:44. In an attempt to mitigate the problem and allow troubleshooting of the underlying issue to continue, engineers increased resources to several backend systems that were exhibiting hotspotting problems, which resulted in a small improvement. Once it became evident that the problem was related to a behavior change in one of the libraries of a dependent system, Google engineers stabilized the system by quickly disabling the pathological retry behavior in production.

To prevent similar issues from happening in the future, we are making a number of changes: Cloud Storage release tools will be upgraded to allow for quicker rollbacks, risky changes will be released through an experimental traffic framework that allows for more precise canarying, and our outage response procedures will treat quickly rolling back new releases as the default mitigation for all incidents.
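
The internal release and canary tooling referenced above is not described in detail, but the gist of "more precise canarying" can be sketched as a gate that compares the canary's error rate against the baseline release and refuses to make a decision until the canary has seen enough traffic. The check below is a hypothetical illustration with made-up thresholds, not Google's actual framework; note how a low-traffic canary never trips the gate, which mirrors how the real canary missed this load-dependent bug.

```python
def should_roll_back(canary_errors, canary_requests,
                     baseline_errors, baseline_requests,
                     min_canary_requests=10_000, max_rate_ratio=2.0):
    """Illustrative canary gate: recommend rollback only if the canary has seen
    enough traffic to be meaningful AND its error rate is well above baseline."""
    if canary_requests < min_canary_requests:
        return False, "insufficient canary traffic -- no decision yet"
    canary_rate = canary_errors / canary_requests
    baseline_rate = max(baseline_errors / baseline_requests, 1e-6)  # avoid divide-by-zero
    if canary_rate > max_rate_ratio * baseline_rate:
        return True, f"canary error rate {canary_rate:.2%} vs baseline {baseline_rate:.2%}"
    return False, "canary error rate within tolerance of baseline"


# A canary serving only a trickle of traffic yields no signal either way.
print(should_roll_back(canary_errors=3, canary_requests=2_000,
                       baseline_errors=500, baseline_requests=1_000_000))
# With enough traffic, an elevated canary error rate triggers a rollback recommendation.
print(should_roll_back(canary_errors=5_500, canary_requests=10_000,
                       baseline_errors=500, baseline_requests=1_000_000))
```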

Jun 09, 2015 18:01

The problem with Google Cloud Storage should be resolved as of 2015-06-09 17:15 (US/Pacific). We apologize for any issues this may have caused you or your users, and thank you for your patience and continued support. Please rest assured that system reliability is a top priority at Google, and we are constantly working to improve the reliability of our systems.

We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Jun 09, 2015 17:32

We're still investigating an issue with code 500 Backend Error on Google Cloud Storage RPC calls, beginning at 2015-06-09 15:31 US/Pacific. We will provide another update by 18:00 US/Pacific.

Jun 09, 2015 17:01

We are experiencing an issue with Google Cloud Storage that began on Tuesday, 2015-06-09 at 15:31 US/Pacific. Impacted customers receive a 500 Backend Error when using the service.

For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 17:30 US/Pacific with current details.

Jun 09, 2015 16:35

We're investigating an issue with Google Cloud Storage beginning at 15:31 Pacific Time. We will provide more information shortly.
