Service Health

This page provides status information on the services that are part of Google Cloud. Check back here to view the current status of the services listed below. If you are experiencing an issue not listed here, please contact Support. Learn more about what's posted on the dashboard in this FAQ. For additional information on these services, please visit https://cloud.google.com/.

Incident affecting Cloud Build, Cloud Developer Tools, Google Cloud Dataflow, Google Cloud Deploy, Google Cloud SQL, Google Compute Engine, Google Kubernetes Engine

Google Compute Engine (GCE) VM instance creation/deletion and all related operations for other products were failing in the asia-northeast1 region.

Incident began at 2024-09-07 04:20 and ended at 2024-09-07 06:10 (all times are US/Pacific).

Previously affected location(s)

Tokyo (asia-northeast1)

Date Time Description
13 Sep 2024 06:44 PDT

Incident Report

Summary

On 7 September 2024 starting at 04:20 US/Pacific, several Google Cloud products experienced a service degradation of varying impact or were unavailable in the asia-northeast1 region for a period of 1 hour 50 minutes. The list of impacted products and services is detailed below.

To our Google Cloud customers whose businesses were impacted during this outage, we sincerely apologize. This is not the level of quality and reliability we strive to offer you.

Root Cause

Most Google Cloud products and services use a regional metadata store to support their internal operations. The metadata store supports critical functions such as servicing customer requests, load balancing, admin operations, and retrieving/storing metadata including server location information.

Google Compute Engine (GCE) internal DNS depends on the regional metadata store for storing instance metadata. A routine update of the metadata store to a new software version included a change that mishandled a rare resource contention corner case, which caused writes from GCE internal DNS to a zonal replica of the metadata store to fail.

During such zonal issues, automated failover mechanisms normally switch to the healthy replicas in other zones. During this disruption, however, a secondary issue prevented automated failover from working, which made the entire metadata store unavailable even though two other healthy zones were available.
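To illustrate the failover behavior described above, here is a minimal sketch of a write path that falls back to replicas in other zones when one zonal replica rejects writes. It is not Google's implementation: the `Replica` class, the `write_with_failover` function, and the zone names are hypothetical and exist only for this example.

```python
from dataclasses import dataclass, field


class ReplicaUnavailable(Exception):
    """Raised when a zonal replica cannot accept a write."""


@dataclass
class Replica:
    # Hypothetical stand-in for one zonal replica of a regional metadata store.
    zone: str
    healthy: bool = True
    data: dict = field(default_factory=dict)

    def write(self, key, value):
        if not self.healthy:
            raise ReplicaUnavailable(self.zone)
        self.data[key] = value


def write_with_failover(replicas, key, value):
    """Try each replica in order and return the zone that accepted the write."""
    failed_zones = []
    for replica in replicas:
        try:
            replica.write(key, value)
            return replica.zone
        except ReplicaUnavailable:
            failed_zones.append(replica.zone)
    # Reached only when no replica in any zone accepted the write.
    raise ReplicaUnavailable(f"all replicas failed: {failed_zones}")


# One unhealthy zone and two healthy ones, mirroring the situation described above.
replicas = [
    Replica("zone-a", healthy=False),
    Replica("zone-b"),
    Replica("zone-c"),
]
accepted_zone = write_with_failover(replicas, "instance-1", "10.0.0.5")
```

In this sketch the write lands on the first healthy replica. During the incident, the equivalent fallback step did not take effect until engineers triggered it manually.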

This resulted in disruptions to all GCE instance operations in the asia-northeast1 region. Actions such as creating, deleting, starting, and stopping instances or consuming reservations were affected. This, in turn, affected operations of other services dependent on GCE instances, including GKE, Cloud Build, Cloud Dataflow, Cloud Deploy and Cloud SQL.

Remediation and Prevention

Google engineers were alerted to the issue by internal monitoring on 7 September 2024 at 04:26 US/Pacific and immediately started an investigation. The issue was fully mitigated at 06:10 US/Pacific after failover of the metadata storage operations to the healthy zones was manually initiated by our engineering team.
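The timestamps in this report imply a detection and mitigation timeline that can be checked with simple arithmetic. The sketch below uses only the times stated above (all US/Pacific); the variable and function names are illustrative.

```python
from datetime import datetime

# Times taken from this report, all US/Pacific on 7 September 2024.
incident_start = datetime(2024, 9, 7, 4, 20)
alerted = datetime(2024, 9, 7, 4, 26)
mitigated = datetime(2024, 9, 7, 6, 10)


def minutes_between(earlier, later):
    """Return the whole number of minutes from earlier to later."""
    return int((later - earlier).total_seconds() // 60)


time_to_detect = minutes_between(incident_start, alerted)    # 6 minutes
time_to_mitigate = minutes_between(alerted, mitigated)       # 104 minutes
total_duration = minutes_between(incident_start, mitigated)  # 110 minutes (1 hour 50 minutes)
```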

Google is committed to preventing a repeat of this issue in the future and is completing the following actions:

  • Improve the automated failover logic used in the metadata storage infrastructure.
  • Improve the testing of how the metadata storage service handles resource contention corner cases.
  • Improve internal processes and documentation to enable faster response and mitigation for this type of issue.

Detailed Description of Impact

  • Google Compute Engine: Google Compute Engine requests and operations like creating, deleting, starting or modifying VMs were not executed.

  • Cloud Build: Cloud Build deployments via the Cloud Build API or via Cloud Build Integrations were not being executed.

  • Cloud Dataflow: Google Cloud Dataflow jobs failed at startup, and autoscaling of existing jobs was disrupted.

  • Cloud Deploy: Google Cloud Deploy actions (render, deploy, verify, pre- and post-deploy hooks) were unresponsive and unable to complete their tasks.

  • Google Cloud SQL: Google Cloud SQL, dependent on Google Compute Engine, experienced operational issues during this incident. No database creation or deletion operations were possible during the incident.

  • Google Kubernetes Engine: Google Kubernetes Engine experienced issues with operations that depend on GCE. Specifically, cluster and node pool creation, node autoscaling, cluster and node upgrades, and automatic repairs were unsuccessful in the affected region and zones.

9 Sep 2024 06:35 PDT

Mini Incident Report

We apologize for the inconvenience this service disruption/outage may have caused. We would like to provide some information about this incident below.

Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues.

If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support.

(All Times US/Pacific)

Incident Start: 07 September, 2024 04:20

Incident End: 07 September, 2024 06:10

Duration: 1 hour, 50 minutes

Affected Services and Features:

  • Cloud Build
  • Google Cloud Dataflow
  • Google Cloud Deploy
  • Google Cloud SQL
  • Google Compute Engine
  • Google Kubernetes Engine

Regions/Zones: asia-northeast1

Description:

Several Google Cloud products experienced a service degradation of varying impact, or were unavailable, for 1 hour, 50 minutes in the asia-northeast1 region.

Google engineers have identified the cause to be a change rollout to an internal component. This change was subsequently rolled back, which mitigated all known impacts.

Google will complete a full Incident Report (IR) in the following days that will provide a full root cause.

Customer Impact:

Throughout the incident, the impacted Google Cloud services experienced different kinds of service degradation, as detailed below.

  • Google Compute Engine requests and operations like creating, deleting, starting or modifying VMs were not executed.
  • Cloud Build deployments via the Cloud Build API or via Cloud Build Integrations were not being executed.
  • Google Cloud Dataflow jobs failed at startup, and autoscaling of existing jobs was disrupted.
  • Google Cloud Deploy actions (render, deploy, verify, pre- and post-deploy hooks) were unresponsive and unable to complete their tasks.
  • Google Cloud SQL, dependent on Google Compute Engine, experienced operational issues during this incident. No databases could be created or deleted during the incident.

7 Sep 2024 07:15 PDT

The issue with Cloud Build, Google Cloud Dataflow, Google Cloud Deploy, Google Cloud SQL, Google Compute Engine, and Google Kubernetes Engine has been resolved for all affected customers as of Saturday, 2024-09-07 06:10 US/Pacific.

We will publish an analysis of this incident once we have completed our internal investigation.

We thank you for your patience while we worked on resolving the issue.

7 Sep 2024 07:13 PDT

Summary: Google Compute Engine (GCE) VM instance creation/deletion and all related operations for other products were failing in the asia-northeast1 region.

Description: The engineering team has rolled out a fix, which mitigated the impact on Saturday, 2024-09-07 06:10 US/Pacific.

We will provide more information by Saturday, 2024-09-07 07:30 US/Pacific.

Diagnosis: GCE customers impacted by this issue were experiencing "Internal error. Please try again or contact Google Support. (Code: '-1343002181035865699')" while attempting instance creation.

Cloud Build customers were observing their builds not being executed.

Google Cloud Dataflow customers were unable to create jobs or scale existing jobs.

Workaround: Customers may retry their operation in case they experience failures.
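As an illustration of the retry workaround, the following is a minimal sketch of retrying a failed operation with exponential backoff and jitter. It assumes the caller can tell retryable failures apart from permanent ones; `TransientError`, `retry_with_backoff`, and `flaky_create` are hypothetical names and are not part of any Google Cloud client library.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for a retryable failure, such as an internal error."""


def retry_with_backoff(operation, max_attempts=5, base_delay=1.0,
                       max_delay=60.0, sleep=time.sleep):
    """Call operation until it succeeds or max_attempts is reached."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error to the caller.
            # Double the delay ceiling on each attempt, capped at max_delay,
            # and sleep a random amount below it so clients do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))


# Simulated operation that fails twice and then succeeds.
attempts = {"count": 0}
recorded_sleeps = []


def flaky_create():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("Internal error")
    return "created"


# The sleep function is injected so the example runs without actually waiting.
result = retry_with_backoff(flaky_create, sleep=recorded_sleeps.append)
```

Capping the number of attempts matters in a regional outage like this one: retries only help once the underlying issue has been mitigated, so unbounded retries would add load without succeeding.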

7 Sep 2024 06:11 PDT

Summary: Google Compute Engine (GCE) VM instance creation/deletion operations, Google Cloud Dataflow and Cloud Build are failing in the asia-northeast1 region.

Description: Mitigation work is currently underway by our engineering teams.

We do not have an ETA for mitigation at this point.

We will provide more information by Saturday, 2024-09-07 07:30 US/Pacific.

Diagnosis: GCE customers impacted by this issue may experience "Internal error. Please try again or contact Google Support. (Code: '-1343002181035865699')" while attempting instance creation.

Cloud Build customers may observe their builds not being executed.

Google Cloud Dataflow customers are unable to create jobs or scale existing jobs.

Workaround: Customers may attempt their GCE, Cloud Build and Cloud Dataflow operations in another region.
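The other-region workaround can be sketched as trying an operation against an ordered list of regions and using the first one that succeeds. This is an illustrative sketch only: `RegionalOperationError`, `run_in_first_available_region`, and the simulated `create_instance` are hypothetical, and the choice of asia-northeast2 as the fallback is an assumption for the example, not a recommendation from this report.

```python
class RegionalOperationError(Exception):
    """Hypothetical error raised when an operation fails in a region."""


def run_in_first_available_region(operation, regions):
    """Run operation(region) for each region in order; return (region, result)."""
    errors = {}
    for region in regions:
        try:
            return region, operation(region)
        except RegionalOperationError as exc:
            errors[region] = str(exc)
    raise RegionalOperationError(f"operation failed in all regions: {errors}")


def create_instance(region):
    # Simulated call: pretend the affected region rejects the request.
    if region == "asia-northeast1":
        raise RegionalOperationError("Internal error")
    return f"instance created in {region}"


used_region, result = run_in_first_available_region(
    create_instance, ["asia-northeast1", "asia-northeast2"]
)
```

Whether a fallback region is acceptable depends on the workload, for example on data residency or latency requirements, so the region list should be chosen deliberately.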