Google Cloud Status

This page provides status information on the services that are part of the Google Cloud Platform. Check back here to view the current status of the services listed below. For additional information on these services, please visit cloud.google.com.

Google App Engine Incident #15006

Authentication issues with Google App Engine

Incident began at 2015-03-05 07:33 and ended at 2015-03-05 08:27 (all times are US/Pacific).

Date Time Description
Mar 06, 2015 14:00

SUMMARY:

On Thursday 5 March 2015, for a duration of 84 minutes, Google App Engine applications that accessed some Google APIs over HTTP experienced elevated error rates. We apologize for any impact this incident had on your service or application, and have made immediate changes to prevent this issue from recurring.

DETAILED DESCRIPTION OF IMPACT:

On Thursday 5 March 2015, from 07:04 AM to 08:28 AM, some Google App Engine applications making calls to other Google APIs via HTTP experienced elevated error rates. During the incident, the global error rate for all API calls remained under 1%, and in total, the outage affected 2% of applications that were active during the incident. The effect on those applications was significant: requests to issue OAuth tokens experienced an error rate of over 85%. In addition, the HTTP APIs at googleapis.com/storage and googleapis.com/gmail saw error rates between 50% and 60%. Other googleapis.com endpoints were affected with error rates of 10% to 20%.

ROOT CAUSE:

A component in Google’s shared HTTP load balancing fabric experienced a non-malicious increase in traffic, exceeding its provisioned capacity. This triggered an automatic DoS protection which shunted a portion of the incoming traffic to a CAPTCHA. The unexpected response caused some clients to issue automated retries, exacerbating the problem.
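The retry amplification described above is a general client-side hazard. As an illustration only, and not something this report prescribes, the sketch below shows one common way for a client to space out retries: capped exponential backoff with full jitter. The names backoff_delays, call_with_backoff, send and is_retryable are hypothetical, and which responses count as retryable is an assumption left to the caller.

```python
import random
import time


def backoff_delays(retries=4, base=1.0, cap=32.0, rng=random.random):
    """Yield the wait (in seconds) before each retry.

    The ceiling doubles on every attempt up to `cap`; the actual wait is a
    random fraction of that ceiling ("full jitter"), so that many clients
    failing at the same moment do not all retry at the same moment.
    """
    for attempt in range(retries):
        ceiling = min(cap, base * 2 ** attempt)
        yield rng() * ceiling


def call_with_backoff(send, is_retryable, retries=4,
                      sleep=time.sleep, rng=random.random):
    """Call send() once, then retry up to `retries` times.

    `send` and `is_retryable` are hypothetical callables supplied by the
    caller: `send` performs the request and returns a response, and
    `is_retryable` decides whether that response is worth retrying.
    """
    response = send()
    for delay in backoff_delays(retries, rng=rng):
        if not is_retryable(response):
            break
        sleep(delay)
        response = send()
    return response
```

A caller would pass its own request function and retry predicate, for example call_with_backoff(fetch_token, lambda r: r.status >= 500), where fetch_token and the status check are likewise placeholders. Bounding the number of retries and randomizing the wait keeps the extra load a struggling backend sees from growing in lockstep across clients.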

REMEDIATION AND PREVENTION:

Google engineers were alerted to the issue by automated monitoring at 07:02, as the load balancing system detected excess traffic and attempted to automatically mitigate it. At 07:46, Google engineers enabled standby load balancing capacity to rectify the issue. From 08:15 to 08:40, Google engineers continued to provision additional resources in the load balancing fabric in order to serve the increased traffic. During this period, at 08:28, Google engineers determined that sufficient capacity was in place to serve both regular and retry traffic, and instructed the load balancing system to cease mitigation and resume normal traffic serving. This action marked the end of the event.

To prevent this issue from recurring, Google engineers are comprehensively re-examining the affected load balancing fabric to ensure it is and remains correctly provisioned. Additionally, Google engineers are improving monitoring rules to provide an early warning of capacity shortfall. Finally, Google engineers are examining the services that depend on this load balancing system, and will move some services to a separate pool of more easily scalable load balancers where appropriate.

Mar 05, 2015 15:01

At 7:04 AM PST Google systems began returning errors for approximately 20% of requests from App Engine to many Google Cloud Platform APIs. The error rate peaked around 50% at 7:50 and remained at that level until the incident was resolved at 8:26. Many users observed this issue as a failure of the authentication service. We will post a complete incident report following our internal investigation.

Mar 05, 2015 08:56

The problem with authentication on Google App Engine and the Google APIs was resolved as of Thursday, 2015-03-05 08:27 (all times are in US/Pacific). We apologize to our customers for the inconvenience, and we thank you for your patience and continued support.

We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Mar 05, 2015 08:46

The issue with Google App Engine and Google APIs authentication is resolved for most applications as of 08:28 US/Pacific. Our engineers continue to monitor the situation to ensure that service is fully restored and stable.

We will provide more information by 09:15 US/Pacific.

Mar 05, 2015 08:27

We are investigating an issue with authentication on Google App Engine beginning at Thursday, 2015-03-05 07:32 (all times are in US/Pacific).

Affected applications are responding with an HTTP 302 on user login, or with a 403 error when connecting to Google APIs.

We will provide more information by 09:00 US/Pacific time.

Mar 05, 2015 07:52

We are investigating an issue with App Engine authentication services. We will post an update shortly.
