Google Cloud Status Dashboard
This page provides status information on the services that are part of Google Cloud Platform. Check back here to view the current status of the services listed below. If you are experiencing an issue not listed here, please contact Support. Learn more about what's posted on the dashboard in this FAQ. For additional information on these services, please visit cloud.google.com.
Google Compute Engine Incident #15053
Elevated latency and error rate for Google Compute Engine API
Incident began at 2015-05-12 02:58 and ended at 2015-05-12 05:12 (all times are US/Pacific).
Date | Time | Description |
---|---|---|
May 12, 2015 | 23:28 | SUMMARY: On Tuesday 12 May 2015 an infrastructure event caused the reboot of 6% of virtual machines in the Google Compute Engine zone us-central1-a. In addition, API operations targeting the us-central1-a zone resulted in errors for a duration of 1 hour and 38 minutes, while other Compute Engine API operations experienced elevated latency for the same duration. If you or your customers were affected by either the reboots or the API issues, we apologize. We failed to contain the issue to the affected power hardware and are working to improve the failure isolation of the systems involved. DETAILED DESCRIPTION OF IMPACT: On Tuesday 12 May 2015 at 02:58 PDT, 6% of virtual machines in the us-central1-a zone rebooted due to a power domain failure. The affected instances finished rebooting by 03:35 PDT. At 03:21 PDT, Compute Engine API operations began to fail for the us-central1-a zone, and other Compute Engine API operations experienced higher than usual latency. The API issue was resolved at 04:59 PDT and API latency recovered by 05:12 PDT. ROOT CAUSE: At 02:58 PDT power systems in the us-central1-a zone initiated a shutdown for safety reasons, and alerted Google engineers to the issue. In response to the power issue Google engineers initiated a change at 03:15 PDT intended to direct lower priority traffic away from us-central1-a during the event. However, a software bug in the GCE control plane interacted poorly with this change and caused API requests directed to us-central1-a to be rejected starting at 03:21 PDT. Retries and timeouts from the failed calls caused increased load on other API backends, resulting in higher latency for all GCE API calls. The API issues were resolved when Google engineers identified the control plane issue and corrected it at 04:59 PDT, with the backlog fully cleared by 05:12 PDT. REMEDIATION AND PREVENTION: Google engineers are fixing the bug in the control plane software so it will not unintentionally reject requests in similar situations in the future. Google engineers have manually validated the configuration of the components of the API system to ensure that no similar errors will happen in the future. Google engineers will also improve the robustness of the API backends so that a single zone failure does not cause increased latency outside of the affected zone. |
May 12, 2015 | 05:34 | The issue with Google Compute Engine API latency should be resolved for all affected users as of 05:28 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation. |
May 12, 2015 | 05:18 | We are experiencing an issue with elevated GCE API latency globally, which also impacts the ability to create new instances and perform other operations in the us-central1-a zone, beginning Tuesday, 2015-05-12 at 03:04 US/Pacific. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 05:50 US/Pacific with current details. |
May 12, 2015 | 04:55 | We are investigating reports of an issue with Google Compute Engine. We will provide more information by 05:20 US/Pacific. |
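
The durations quoted in the 23:28 report can be checked against its own timestamps. The following is a minimal sketch in Python that uses only the times stated in that report; the dictionary keys and helper names are illustrative labels chosen here, not terms from the dashboard.

```python
from datetime import datetime

# Timestamps (US/Pacific, PDT) copied from the detailed incident report.
# The key names are illustrative labels, not official terminology.
TIME_FORMAT = "%Y-%m-%d %H:%M"
TIMELINE = {
    "power_shutdown": "2015-05-12 02:58",
    "reboots_finished": "2015-05-12 03:35",
    "api_errors_began": "2015-05-12 03:21",
    "api_errors_resolved": "2015-05-12 04:59",
    "latency_recovered": "2015-05-12 05:12",
}


def minutes_between(start_key, end_key):
    """Return the whole minutes elapsed between two timeline events."""
    start = datetime.strptime(TIMELINE[start_key], TIME_FORMAT)
    end = datetime.strptime(TIMELINE[end_key], TIME_FORMAT)
    return int((end - start).total_seconds() // 60)


def format_duration(minutes):
    """Render a minute count as hours and minutes."""
    hours, mins = divmod(minutes, 60)
    return f"{hours} h {mins:02d} min"


if __name__ == "__main__":
    # Window in which API calls targeting us-central1-a were rejected.
    print(format_duration(minutes_between("api_errors_began", "api_errors_resolved")))
    # Time for the affected instances to finish rebooting.
    print(format_duration(minutes_between("power_shutdown", "reboots_finished")))
    # Full incident window as listed in the incident header.
    print(format_duration(minutes_between("power_shutdown", "latency_recovered")))
```

The first value corresponds to the "1 hour and 38 minutes" of API errors stated in the summary; the last spans the incident window from 02:58 to 05:12 given in the header.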