Service Health

This page provides status information on the services that are part of Google Cloud. Check back here to view the current status of the services listed below. If you are experiencing an issue not listed here, please contact Support. Learn more about what's posted on the dashboard in this FAQ. For additional information on these services, please visit https://cloud.google.com/.

Incident affecting Google App Engine

Issues with App Engine Java and Go runtimes

Incident began at 2016-02-03 18:38 and ended at 2016-02-03 18:57 (all times are US/Pacific).

Date Time Description
4 Feb 2016 19:05 PST

SUMMARY:

On Wednesday 3 February 2016, some App Engine applications running on Java7, Go and Python runtimes served errors with HTTP response 500 for a duration of 18 minutes. We sincerely apologize to customers who were affected. We have taken and are taking immediate steps to improve the platform's performance and availability.

DETAILED DESCRIPTION OF IMPACT:

On Wednesday 3 February 2016, from 18:37 PST to 18:55 PST, 1.1% of Java7, 3.1% of Go and 0.2% of all Python applications served errors with HTTP response code 500. The impact varied across applications, with less than 0.8% of all applications serving more than 100 errors during this time period. The distribution of errors was heavily tail-weighted, with a few applications receiving a large fraction of errors for their traffic during the event.

ROOT CAUSE:

An experiment meant to test a new feature on a small number of applications was inadvertently applied to Java7 and Go applications globally. Requests to these applications tripped over the incompatible experimental feature, causing the instances to shut down without serving any requests successfully, while the depletion of healthy instances caused these applications to serve HTTP requests with a 500 response. Additionally, the high rate of failure in Java and Go instances caused resource contention as the system tried to start new instances, which resulted in collateral damage to a small number of Python applications.

REMEDIATION AND PREVENTION:

At 18:35, a configuration change was erroneously enabled globally instead of to the intended subset of applications. Within a few minutes, Google Engineers noticed a drop in global traffic to GAE applications and determined that the configuration change was the root cause. At 18:53 the configuration change was rolled back and normal operations were restored by 18:55.

To prevent a recurrence of this problem, Google Engineers are modifying the fractional push framework to inhibit changes which would simultaneously apply to the majority of applications, and creating telemetry to accurately predict the fraction of instances affected by a given change. Google Engineers are also enhancing the alerts on traffic drop and error spikes to quickly identify and mitigate similar incidents.

3 Feb 2016 19:26 PST

The issue with App Engine Java and Go runtimes serving errors should have been resolved for all affected applications as of 18:57 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

3 Feb 2016 19:04 PST

We are investigating reports of an issue with App Engine Java and Go applications. We will provide more information by 19:30 US/Pacific.