Service Health

This page provides status information on the services that are part of Google Cloud. Check back here to view the current status of the services listed below. If you are experiencing an issue not listed here, please contact Support. Learn more about what's posted on the dashboard in this FAQ. For additional information on these services, please visit https://cloud.google.com/.

Incident affecting Google Compute Engine

Issue with Google Compute Engine APIs on November 1st, 2014

Incident began at 2014-10-31 23:50 and ended at 2014-11-01 02:40 (all times are US/Pacific).

Date Time Description
5 Nov 2014 10:56 PST

SUMMARY: On Friday 31 October 2014, consumers of the Google Compute Engine API experienced increased latency and error rates for 115 minutes. Additionally, some requests to create new projects were delayed by up to 4 hours. We apologize if this incident affected your service or application. We have taken immediate steps to improve the API's performance and availability.

DETAILED DESCRIPTION OF IMPACT: From Friday 31 October 2014 at 23:50 PDT until Saturday 1 November 2014 at 01:45 PDT, Google Compute Engine API consumers experienced increased latency of up to 120 seconds per call, along with 500 and 503 response rates as high as 1%. Users who tried to create projects between 23:53 and 02:30 experienced delays lasting until 03:55.
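The durations quoted above can be checked with a little arithmetic. The sketch below is illustrative only: the variable and function names are ours, and it uses naive datetimes on the assumption that no clock change falls inside the incident window.

```python
from datetime import datetime

# Timestamps copied from the report (US/Pacific wall-clock times).
# Assumption: no daylight-saving change occurs inside this window,
# so naive datetimes give correct differences.
api_start = datetime(2014, 10, 31, 23, 50)
api_end = datetime(2014, 11, 1, 1, 45)
first_delayed_create = datetime(2014, 10, 31, 23, 53)
last_project_repaired = datetime(2014, 11, 1, 3, 55)


def minutes_between(start, end):
    """Return the whole number of minutes from start to end."""
    return int((end - start).total_seconds() // 60)


# Length of the API latency and error-rate impact.
api_impact_minutes = minutes_between(api_start, api_end)

# Worst case for project creation: requested at the start of the
# affected window and not usable until the repair finished.
max_create_delay_minutes = minutes_between(first_delayed_create, last_project_repaired)

print(api_impact_minutes)
print(max_create_delay_minutes)
```

The first value matches the 115 minutes given in the summary; the second is the upper bound behind the "up to 4 hours" figure for project creation.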

ROOT CAUSE: A scheduled maintenance task began running at 23:45 to clean up resources linked to deleted projects. The maintenance task consumed more Compute Engine resources than expected, which caused the system to queue requests from other sources (for example, the Google Developer Console and Compute Engine APIs). Resource contention combined with a latent bug in the project creation subsystem prevented retries when API calls failed, which stopped project creation from working correctly during the outage.
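To illustrate the kind of retry behavior that the latent bug prevented, here is a minimal sketch of retrying a failed call with exponential backoff. It is not the Compute Engine implementation; `TransientAPIError`, `call_with_retries`, and `flaky_create_project` are hypothetical names introduced for this example.

```python
import time


class TransientAPIError(Exception):
    """Hypothetical error standing in for a 500 or 503 response."""


def call_with_retries(operation, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call operation(), retrying on TransientAPIError with exponential backoff.

    The delay doubles after each failed attempt. The last failure is
    re-raised so the caller can see that the operation did not succeed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientAPIError:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Usage: a hypothetical call that fails twice, then succeeds.
_attempts = {"count": 0}


def flaky_create_project():
    _attempts["count"] += 1
    if _attempts["count"] < 3:
        raise TransientAPIError("backend overloaded")
    return "project-created"


recorded_delays = []
result = call_with_retries(flaky_create_project, sleep=recorded_delays.append)
print(result, recorded_delays)
```

Passing `sleep` as a parameter keeps the example fast to run and lets the backoff schedule be inspected without actually waiting.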

REMEDIATION AND PREVENTION: At 23:53 PDT, API call latency spiked. Within one minute, Google engineers were alerted by Compute Engine monitoring systems and began investigating the spike. Within 17 minutes of the alert, Google engineers had identified the maintenance pipeline as the cause of the spike and engaged the team responsible for it; the offending job was shut down by 00:23. Once the job had been stopped, the system took some time to process the backlog of work. API calls returned to normal latency for most users by 00:50 and recovered completely by 01:45.

At 02:23, Google engineers decreased the rate at which project creations were processed, reducing contention and allowing new projects to be created successfully. Due to a latent bug, projects created during the outage remained in an incomplete state; Google engineers ran a tool to reset these incomplete projects to a pre-provisioned state, finishing at 03:55.

To prevent this class of incident from recurring, Google engineers are adding API quota safeguards for internal pipelines. Additionally, Google engineers will ensure that large pipeline tasks of this nature are run more frequently to avoid a large queue of tasks. To facilitate quicker identification and resolution of such incidents, Google engineers are further tuning the monitoring and alerting systems to provide extra insight into the task queue length for the Compute Engine subsystems involved in this incident.
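As a rough illustration of what a quota safeguard and a queue-length alert might look like, the sketch below uses a token bucket to cap how fast a batch job may issue API calls, plus a simple threshold check on queue length. The report does not describe the actual mechanism; the token-bucket approach and the names `TokenBucket` and `queue_needs_alert` are assumptions made for this example.

```python
import time


class TokenBucket:
    """Caps call rate at rate_per_sec, allowing bursts of up to capacity calls."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity  # start with a full bucket
        self.last = clock()

    def try_acquire(self, cost=1):
        """Return True and spend tokens if quota is available, else False."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def queue_needs_alert(queue_length, threshold):
    """Return True when the task queue is longer than the alert threshold."""
    return queue_length > threshold


# Usage: a hypothetical cleanup job limited to 2 calls per second.
bucket = TokenBucket(rate_per_sec=2, capacity=2)
for task_id in range(3):
    if bucket.try_acquire():
        print("run task", task_id)
    else:
        print("defer task", task_id)
```

A job that defers work when `try_acquire` returns False leaves capacity for user-facing traffic, while the queue-length check gives operators an early signal that a backlog is building.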

1 Nov 2014 04:03 PDT

The problem with Google Compute Engine APIs was resolved as of 02:40 US/Pacific. We apologize for any issue this may have caused you or your users and thank you for your patience and continued support. Please rest assured that system reliability is a top priority at Google, and we are making continuous improvements to make our systems better.

1 Nov 2014 03:04 PDT

Normal API operations have been restored for most projects. We continue to monitor the health of our system and will provide another update before 05:00 PDT.

1 Nov 2014 01:45 PDT

We're investigating an issue with the Google Compute Engine API; a few operations may be taking longer than usual. We will provide more information before 03:00 PDT.