Google Cloud Service Health

Incidents
We are currently investigating an issue with persistent disk creation/deletion as well as virtual machine creation/deletion. For everyone who is affected, we apologize. We know you count on Google to work for you, and we're working hard to restore normal operation. We will provide an update by 15:30 PST.

This page provides status information on the services that are part of Google Cloud. Check back here to view the current status of the services listed below. If you are experiencing an issue not listed here, please contact Support. Learn more about what's posted on the dashboard in this FAQ. For additional information on these services, please visit https://cloud.google.com/.

Incident affecting Google Compute Engine

Incident began at 2014-11-06 14:00 and ended at 2014-11-06 16:26 (all times are US/Pacific).

12 Nov 2014, 17:17 PST

SUMMARY: For a period of 121 minutes on Thursday 6 November 2014, all Google Compute Engine users were unable to create or delete virtual machines or persistent disks, and 17% of Compute Engine requests experienced elevated latency. If this issue had an impact on you or your service, we apologize; we understand the standard of reliability and availability you have come to expect from Google and appreciate that we did not meet that standard in this case. We have taken immediate steps to prevent future recurrences of this issue.

DETAILED DESCRIPTION OF IMPACT: On Thursday 6 November 2014 from 13:40 PST until 15:40 PST, Compute Engine users were unable to create persistent disks, snapshots, or new instances, and were unable to delete persistent disks. Additionally, 17% of operations that do not rely on disks experienced increased latency, but failure rates did not deviate from normal levels.

ROOT CAUSE: Google engineers triggered a migration job to upgrade existing Compute Engine images to a new format, which involved creating a new persistent disk for each image and a new image from that disk. This job exposed an unoptimized code path in the garbage collection portion of the persistent disk subsystem that triggered database lock contention, causing all requests to time out. A latent bug in the resource management layer of Compute Engine caused the system to retry persistent disk requests too aggressively, which amplified this contention.
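The report does not describe the retry logic itself. As an illustration of why retry pacing matters under contention, the following is a minimal sketch of capped exponential backoff with full jitter, a common way to keep clients from retrying in lockstep. The names `backoff_delays`, `call_with_retries`, and `operation` are hypothetical and are not part of any Compute Engine API.

```python
import random
import time


def backoff_delays(max_retries, base=0.5, cap=30.0, rng=None):
    """Yield one wait time (in seconds) for each allowed retry.

    The upper bound doubles with every retry up to `cap`, and the actual
    delay is drawn uniformly from [0, bound] so that concurrent callers
    spread out instead of retrying at the same moment.
    """
    rng = rng or random.Random()
    for attempt in range(max_retries):
        bound = min(cap, base * (2 ** attempt))
        yield rng.uniform(0.0, bound)


def call_with_retries(operation, max_attempts=5, sleep=time.sleep, rng=None):
    """Run `operation`, retrying on TimeoutError with jittered backoff.

    `operation` is a hypothetical zero-argument callable standing in for
    a persistent disk request.
    """
    delays = backoff_delays(max_attempts - 1, rng=rng)
    while True:
        try:
            return operation()
        except TimeoutError:
            delay = next(delays, None)
            if delay is None:
                # Retry budget exhausted: surface the last timeout.
                raise
            sleep(delay)
```

Bounding both the number of attempts and the delay keeps a slow backend from being flooded by its own callers, which is the amplification effect described above.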

REMEDIATION AND PREVENTION: To resolve the issue, Google engineers stopped the migration job and decreased the number of tasks processing persistent disk requests, which eliminated the lock contention that had prevented those tasks from completing. Once the underlying lock contention was resolved, the backlog of tasks was cleared and requests were again served in a timely fashion.
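Reducing the number of tasks that work on a contended resource can be sketched with a bounded worker pool. The example below is an assumption-laden illustration using Python's standard `ThreadPoolExecutor`, not the mechanism Google used; `run_with_bounded_workers` and `ConcurrencyProbe` are hypothetical names.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_with_bounded_workers(requests, handler, max_workers):
    """Process `requests` with at most `max_workers` handlers running at once.

    Results are returned in the same order as the input requests.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(handler, requests))


class ConcurrencyProbe:
    """Hypothetical handler that records peak simultaneous activity."""

    def __init__(self):
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def __call__(self, request):
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        try:
            time.sleep(0.01)  # stand-in for database work
            return request
        finally:
            with self._lock:
                self.active -= 1
```

With a smaller `max_workers`, fewer handlers compete for the same locks at any moment, at the cost of lower peak throughput.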

To prevent this issue from recurring, Google engineers have resized the pool of workers that process persistent disk requests so that the lock contention does not occur. Additionally, Google engineers are optimizing the persistent disk garbage collection code path to acquire locks for a shorter duration and to perform database-intensive tasks in a staggered fashion, which will allow a greater number of requests to succeed. Google engineers are also creating additional alerts that notify on-call engineers of database contention and elevated persistent-disk-related error rates.
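The idea of holding locks briefly and staggering intensive work can be sketched as a garbage collector that deletes in small batches. This is a generic illustration under stated assumptions: an in-memory dictionary stands in for the database, and `collect_garbage`, `store`, and `is_garbage` are hypothetical names, not the persistent disk subsystem's actual code.

```python
import threading
import time


def collect_garbage(store, lock, is_garbage, batch_size=100, pause=0.05,
                    sleep=time.sleep):
    """Remove garbage entries from `store` in small, staggered batches.

    The lock is acquired once per batch and released before each pause,
    so other requests can make progress between batches.
    Returns the number of entries removed.
    """
    # Snapshot the candidate keys under the lock, then release it.
    with lock:
        candidates = [key for key, value in store.items() if is_garbage(value)]

    removed = 0
    for start in range(0, len(candidates), batch_size):
        batch = candidates[start:start + batch_size]
        with lock:
            for key in batch:
                # Re-check: the entry may have changed since the snapshot.
                if key in store and is_garbage(store[key]):
                    del store[key]
                    removed += 1
        if start + batch_size < len(candidates):
            sleep(pause)  # stagger the work between batches
    return removed
```

Smaller batches shorten each lock hold but lengthen the total collection time, so `batch_size` and `pause` would need tuning against real workloads.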

7 Nov 2014, 03:50 PST

The problem with persistent disk and virtual machine creation and deletion was resolved as of 16:26 PST on 06 November 2014. We apologize for any issues this may have caused you or your users and thank you for your patience and continued support. Please rest assured that system reliability is a top priority at Google, and we are constantly working to improve the reliability of our systems.

6 Nov 2014, 15:40 PST

We are continuing to investigate an issue related to persistent disks and virtual machines. We will provide another status update by Thursday 06 November 2014 at 16:30 PST.
