Service Health

This page provides status information on the services that are part of Google Cloud. Check back here to view the current status of the services listed below. If you are experiencing an issue not listed here, please contact Support. Learn more about what's posted on the dashboard in this FAQ. For additional information on these services, please visit https://cloud.google.com/.

Incident affecting Cloud Build, Cloud Developer Tools, Cloud Machine Learning, Google Cloud Dataflow, Google Cloud Dataproc, Google Cloud Pub/Sub, Google Compute Engine, Google Kubernetes Engine, Persistent Disk, Vertex AI Batch Prediction

Multiple GCP services impacted in the europe-west3-c zone

Incident began at 2024-10-23 18:22 and ended at 2024-10-24 02:01 (all times are US/Pacific).

Previously affected location(s)

Frankfurt (europe-west3)

Date Time Description
31 Oct 2024 08:50 PDT

Incident Report

Summary

On Wednesday, 23 October 2024, a power failure occurred in a single data center within the europe-west3 region. This failure degraded the building’s cooling infrastructure, leading to a partial shutdown of the europe-west3-c zone to avoid thermal damage and causing Virtual Machines (VMs) to go offline. The event lasted 7 hours and 39 minutes and impacted various Google Cloud services in the affected zone.

To our Google Cloud customers whose businesses were impacted during this outage, we sincerely apologize. This is not the level of quality and reliability we strive to offer, and we are taking immediate steps to improve the platform’s performance and resilience.

Root Cause

On 23 October 2024, at 18:22 US/Pacific time, an electrical arc flash occurred in one of the europe-west3-c zone's power distribution units, resulting in a partial power outage. This incident also affected the cooling infrastructure, leading to a rise in ambient temperature. To prevent damage, some IT equipment at the facility was shut down, causing Virtual Machines (VMs) in the datacenter to go offline and impacting multiple cloud services in the zone.

Remediation and Prevention

Google engineers were alerted to VM failures in the europe-west3-c zone at 18:39 US/Pacific on 23 October 2024, and immediately launched an investigation. Upon understanding the issue's nature and scope, engineers took precautionary measures to ensure equipment safety by shutting it down and diverting traffic away from the affected infrastructure at 20:43 US/Pacific. Power was manually restored at 21:44 US/Pacific by transferring the load away from the failed components. Cloud traffic was gradually reintroduced to the datacenter at 00:30 US/Pacific on 24 October 2024. Full restoration of all cloud services in the affected zone was completed by 02:09 US/Pacific.
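For readers who want to reason about the timeline, the milestones above can be expressed as elapsed time since the initial arc flash. The following is a minimal Python sketch, not an official tool: the milestone labels are informal shorthand, and the datetimes are naive values assumed to all be in US/Pacific (no daylight-saving change falls inside this window).

```python
from datetime import datetime

# Milestones transcribed from the report (all times US/Pacific).
# The label strings are informal shorthand, not official event names.
TIMELINE = [
    ("arc flash in power distribution unit", datetime(2024, 10, 23, 18, 22)),
    ("engineers alerted to VM failures", datetime(2024, 10, 23, 18, 39)),
    ("equipment shut down, traffic diverted", datetime(2024, 10, 23, 20, 43)),
    ("power manually restored", datetime(2024, 10, 23, 21, 44)),
    ("traffic gradually reintroduced", datetime(2024, 10, 24, 0, 30)),
    ("all services restored", datetime(2024, 10, 24, 2, 9)),
]


def elapsed_since_start(timeline):
    """Return (label, whole minutes since the first milestone) for each entry."""
    start = timeline[0][1]
    return [
        (label, int((ts - start).total_seconds() // 60))
        for label, ts in timeline
    ]


for label, minutes in elapsed_since_start(TIMELINE):
    print(f"+{minutes // 60}h{minutes % 60:02d}m  {label}")
```

The output lists each milestone with its offset from 18:22, which makes the gaps between detection, mitigation, and recovery easier to compare.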

We apologize for the length and severity of this incident. We are taking immediate steps to prevent a recurrence and improve reliability. To ensure continued high availability, Google is pursuing the following actions:

  • Complete the root cause investigation of the arc flash and complete repairs of the affected power distribution unit.
  • Ensure that the underlying root cause(s) of the arc flash are not present in any other data centers, and remediate any risks discovered in the analysis of the event.
  • Further harden GCP’s Persistent Disk services to prevent any regional impact during single-zone issues. This work is anticipated to be fully rolled out in the coming weeks.

Detailed Description of Impact

Google Compute Engine (GCE) and Persistent Disk (PD):
Customers experienced increased latency and errors when creating new GCE instances and attaching Persistent Disk volumes to existing instances in all europe-west3 zones from 23 October 2024 19:12 to 23:34 US/Pacific. The elevated error rates resulted from problems in access control services for multiple zones that were caused by a misconfiguration and triggered by the outage in europe-west3-c. To prevent a recurrence of this issue, engineers have hardened this component against zonal outages in europe-west3.

Additionally, a small percentage of Persistent Disk volumes in europe-west3-c were unavailable from 23 October 2024 18:24 to 24 October 2024 00:58 US/Pacific.

Google Cloud Pub/Sub:
A small percentage of customers may have experienced errors in API calls from 23 October 2024 18:25 to 18:41 US/Pacific and elevated latency for push subscriptions from 23 October 2024 22:57 to 24 October 2024 00:11 US/Pacific in europe-west3-c.

Google Cloud Dataflow:
A small percentage of customers may have experienced higher latencies for both batch and streaming jobs from 23 October 2024 19:05 to 23:55 US/Pacific in europe-west3.

Dataproc:
A small percentage of customers may have experienced errors in all API calls from 23 October 2024 19:12 to 22:00 US/Pacific in europe-west3 and a few customers might have experienced errors in all API calls except the create cluster API call through 24 October 2024 23:47 US/Pacific.

Cloud Build:
A small percentage of customers may have experienced errors in API calls from 23 October 2024 20:52 to 24 October 2024 01:11 US/Pacific in europe-west3.

Google Kubernetes Engine (GKE):
During the period of the outage, a small percentage of customers may have experienced errors in API calls to europe-west3. A small percentage of customers may have additionally experienced unavailability of their Kubernetes Control Planes in europe-west3 or europe-west3-c and/or Kubernetes nodes in europe-west3-c.

Vertex AI Batch Prediction:
A small percentage of customers may have experienced errors in batch prediction jobs from 23 October 2024 19:45 to 23:30 US/Pacific in europe-west3.

Cloud SQL:
Customers creating or updating instances in europe-west3 may have experienced errors from 23 October 2024 18:35 US/Pacific to 24 October 2024 23:28 US/Pacific. Users with self-service or scheduled maintenance on instances in europe-west3 during this period may have experienced a failed update causing downtime.
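To correlate an error timestamp from your own logs with the windows listed above, a lookup along the following lines may help. This is a minimal Python sketch under stated assumptions: it covers only a subset of the services, the dictionary keys are informal labels, the function name `services_impacted_at` is hypothetical, and all datetimes are naive values assumed to be US/Pacific.

```python
from datetime import datetime

# Impact windows transcribed from the descriptions above (US/Pacific).
# Only a subset of services is included; keys are informal labels.
IMPACT_WINDOWS = {
    "GCE/PD create and attach (europe-west3)": (
        datetime(2024, 10, 23, 19, 12), datetime(2024, 10, 23, 23, 34)),
    "PD volumes (europe-west3-c)": (
        datetime(2024, 10, 23, 18, 24), datetime(2024, 10, 24, 0, 58)),
    "Dataflow": (
        datetime(2024, 10, 23, 19, 5), datetime(2024, 10, 23, 23, 55)),
    "Cloud Build": (
        datetime(2024, 10, 23, 20, 52), datetime(2024, 10, 24, 1, 11)),
    "Vertex AI Batch Prediction": (
        datetime(2024, 10, 23, 19, 45), datetime(2024, 10, 23, 23, 30)),
}


def services_impacted_at(ts, windows=IMPACT_WINDOWS):
    """Return the sorted labels whose impact window contains ts (inclusive)."""
    return sorted(
        name for name, (start, end) in windows.items() if start <= ts <= end
    )


# Example: an error logged at 19:30 US/Pacific on 23 October 2024.
print(services_impacted_at(datetime(2024, 10, 23, 19, 30)))
```

A timestamp that matches no window returns an empty list, which suggests the error may be unrelated to this incident or may involve a service not included in the sketch.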


24 Oct 2024 10:49 PDT

Mini Incident Report

We apologize for the inconvenience this service disruption/outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support.

(All Times US/Pacific)

Incident Start: 23 October 2024 18:30

Incident End: 24 October 2024 02:09

Duration: 7 hours, 39 minutes
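The stated duration can be checked from the start and end times, and the same values can be converted to UTC or Frankfurt local time. The sketch below is illustrative only and uses fixed offsets that we assume apply to 23-24 October 2024 (PDT at UTC-7 and CEST at UTC+2); it should not be reused for other dates without a proper time zone database.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets assumed valid for 23-24 October 2024 only:
# US/Pacific was on PDT (UTC-7) and Frankfurt was on CEST (UTC+2).
PDT = timezone(timedelta(hours=-7), "PDT")
CEST = timezone(timedelta(hours=2), "CEST")

start = datetime(2024, 10, 23, 18, 30, tzinfo=PDT)
end = datetime(2024, 10, 24, 2, 9, tzinfo=PDT)

# Break the elapsed time into whole hours and minutes.
duration = end - start
hours, remainder = divmod(int(duration.total_seconds()), 3600)
minutes = remainder // 60

print(f"Duration: {hours} hours, {minutes} minutes")
print("Start (UTC):      ", start.astimezone(timezone.utc).isoformat())
print("Start (Frankfurt):", start.astimezone(CEST).isoformat())
```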

Affected Services and Features:

  • Persistent Disk (PD)
  • Google Compute Engine
  • Google Cloud Pub/Sub
  • Google Cloud Dataflow
  • Dataproc
  • Cloud Build
  • Google Kubernetes Engine (GKE)
  • Vertex AI Batch Prediction

Regions/Zones: Region europe-west3 / Zone europe-west3-c

Description:

Multiple Google Cloud products were impacted in the europe-west3-c zone for a duration of 7 hours, 39 minutes. From preliminary analysis, the root cause was a power failure and a resulting cooling issue, which led to a fraction of the zone being powered down and services being degraded.

Google engineers implemented a fix to return the datacenter to full operation and this mitigated the issue.

Google will complete a full Incident Report (IR) in the following days that will provide a full root cause.

Customer Impact:

  • Persistent Disk (PD): Customers may have observed stuck VM creation and PD control plane errors.
  • Google Compute Engine: Customers lost access to several VMs and disks in the europe-west3-c zone. For the other two zones in the same region, less than 1% of the operations that touch instance and disk resources experienced internal errors.
  • Google Cloud Pub/Sub: Customers may have experienced higher latency for push subscriptions as well as a brief period of elevated Publish unavailability.
  • Google Cloud Dataflow: Batch jobs: some existing jobs experienced delays when scaling workers. Streaming jobs: jobs may not have progressed or scaled up workers.
  • Dataproc: Cluster operations failed. Cluster creation and deletion errors increased during the initial stages of the incident; this was mitigated on the Dataproc side by blocking the europe-west3-c zone.
  • Cloud Build: Builds in Custom Worker pools took a long time to start.
  • Google Kubernetes Engine (GKE): Customers were unable to create new Google Kubernetes Engine (GKE) cluster nodes in the europe-west3-c zone.
  • Vertex AI Batch Prediction: Vertex batch prediction jobs failed with an "Unable to prepare an infrastructure for serving within time" error.

24 Oct 2024 02:14 PDT

The issue with Google Cloud Pub/Sub, Google Compute Engine, Persistent Disk, Google Cloud Dataflow, Google Cloud Dataproc, Google Kubernetes Engine, Cloud Build, Vertex AI Batch Prediction has been resolved for all affected users as of Thursday, 2024-10-24 02:09 US/Pacific.

We will publish an analysis of this incident once we have completed our internal investigation.

We thank you for your patience while we worked on resolving the issue.

24 Oct 2024 01:25 PDT

Summary: Multiple GCP services impacted in the europe-west3-c zone

Description: We are experiencing an issue with multiple GCP services, including Google Compute Engine, Persistent Disk, and Google Cloud Dataflow, in the europe-west3-c zone due to a power and cooling issue.

Mitigation work is still underway by our engineering team and we do not have an ETA at the moment.

We will provide more information by Thursday, 2024-10-24 02:30 US/Pacific.

Diagnosis: The majority of the services impact is now limited to the zonal level. Vertex AI Batch Prediction continues to be impacted at the regional level.

Services impacted include:

Google Compute Engine:

The loss of power has led to capacity failure in the region. Customers may experience:

A percentage of Virtual Machines (VMs) being terminated and not available until power is restored.

A percentage of VMs may have lost access to their Persistent Disk and may be crashlooping.

A percentage of regional Persistent Disks may be running in a degraded state.

The incident is affecting the Compute API in the following ways:

Creation of new VMs or disks in europe-west3-c may fail.

A percentage of customers attempting to consume VM reservations will be unable to do so.

A percentage of customers who would like to delete their previously running VMs in europe-west3 can delete VMs via the console or GCE APIs. However, there may be a delay in processing these deletions. All deletions will be fully processed when issues in europe-west3-c are resolved.

Google Kubernetes Engine:

The Google Kubernetes Engine nodes in the impacted location may be inaccessible and creation of new nodes may fail.

Google Cloud Dataflow:

Some existing batch jobs may experience delays when scaling workers. In addition streaming jobs may not be progressing or scaling up workers.

Google Cloud Dataproc:

While the existing clusters are not impacted, creation of new clusters may fail.

Cloud Build:

Builds in Custom Worker pools take a long time to start.

Vertex AI Batch Prediction:

A Vertex batch prediction job may fail with an error, "Unable to prepare an infrastructure for serving within time".

Google Cloud Pub/Sub:

There is no ongoing impact for the users at the moment.

Workaround:

  1. If you are impacted, please migrate the workload or operations from the europe-west3-c zone to other available zones or regions.

  2. For the customers with a degraded Regional Persistent Disk, we recommend you take regular snapshots of your disk.
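The first workaround amounts to trying other zones when the affected one is unavailable. The sketch below illustrates that idea in Python under stated assumptions: `place_workload`, `try_create`, and `fake_create` are hypothetical names, the provisioning call is a stub rather than a real Google Cloud API call, and the zone names other than europe-west3-c are assumed from the region's usual naming.

```python
from typing import Callable, Iterable, Optional, Set


def place_workload(
    candidate_zones: Iterable[str],
    unavailable: Set[str],
    try_create: Callable[[str], bool],
) -> Optional[str]:
    """Try each candidate zone in order, skipping zones marked unavailable.

    Returns the first zone where try_create reported success, or None
    if no candidate zone accepted the workload.
    """
    for zone in candidate_zones:
        if zone in unavailable:
            continue  # Do not attempt zones known to be impacted.
        if try_create(zone):
            return zone
    return None


def fake_create(zone: str) -> bool:
    # Hypothetical stand-in for a real provisioning call;
    # pretends that only europe-west3-b currently has capacity.
    return zone == "europe-west3-b"


chosen = place_workload(
    ["europe-west3-c", "europe-west3-a", "europe-west3-b"],
    {"europe-west3-c"},
    fake_create,
)
print(chosen)
```

In a real deployment, the stub would be replaced by whatever provisioning mechanism you already use, and the candidate list could include zones in other regions.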

24 Oct 2024 00:26 PDT

Summary: Multiple GCP services impacted in the europe-west3-c zone

Description: We are experiencing an issue with multiple GCP services, including Google Compute Engine, Persistent Disk, and Google Cloud Dataflow, in the europe-west3-c zone due to a power and cooling issue.

Mitigation work is still underway by our engineering team and we do not have an ETA at the moment.

We will provide more information by Thursday, 2024-10-24 01:30 US/Pacific.

Diagnosis: The majority of the services impact is now limited to the zonal level. Vertex AI Batch Prediction continues to be impacted at the regional level.

Services impacted include:

Google Compute Engine:

The loss of power has led to capacity failure in the region. Customers may experience:

A percentage of Virtual Machines (VMs) being terminated and not available until power is restored.

A percentage of VMs may have lost access to their Persistent Disk and may be crashlooping.

A percentage of regional Persistent Disks may be running in a degraded state.

The incident is affecting the Compute API in the following ways:

Creation of new VMs or disks in europe-west3-c may fail.

A percentage of customers attempting to consume VM reservations will be unable to do so.

A percentage of customers who would like to delete their previously running VMs in europe-west3 can delete VMs via the console or GCE APIs. However, there may be a delay in processing these deletions. All deletions will be fully processed when issues in europe-west3-c are resolved.

Google Kubernetes Engine:

The Google Kubernetes Engine nodes in the impacted location may be inaccessible and creation of new nodes may fail.

Google Cloud Dataflow:

Some existing batch jobs may experience delays when scaling workers. In addition streaming jobs may not be progressing or scaling up workers.

Google Cloud Dataproc:

While the existing clusters are not impacted, creation of new clusters may fail.

Cloud Build:

Builds in Custom Worker pools take a long time to start.

Vertex AI Batch Prediction:

A Vertex batch prediction job may fail with an error, "Unable to prepare an infrastructure for serving within time".

Google Cloud Pub/Sub:

There is no ongoing impact for the users at the moment.

Workaround:

  1. If you are impacted, please migrate the workload or operations from the europe-west3-c zone to other available zones or regions.

  2. For the customers with a degraded Regional Persistent Disk, we recommend you take regular snapshots of your disk.

23 Oct 2024 23:55 PDT

Summary: Multiple GCP services impacted in the europe-west3-c zone

Description: We are experiencing an issue with multiple GCP services, including Google Compute Engine, Persistent Disk, and Google Cloud Dataflow, in the europe-west3-c zone due to a power and cooling issue.

Mitigation work is still underway by our engineering team and we do not have an ETA at the moment.

We will provide more information by Thursday, 2024-10-24 01:30 US/Pacific.

Diagnosis: The impact is now determined to be back at zonal level. Regional level impact is mitigated at the moment.

Multiple services are impacted in the europe-west3-c zone:

Google Compute Engine:

The loss of power has led to capacity failure in the region. Customers may experience:

A percentage of Virtual Machines (VMs) being terminated and not available until power is restored.

A percentage of VMs may have lost access to their Persistent Disk and may be crashlooping.

A percentage of regional Persistent Disks may be running in a degraded state.

The incident is affecting the Compute API in the following ways:

Creation of new VMs or disks in europe-west3-c may fail.

A percentage of customers attempting to consume VM reservations will be unable to do so.

A percentage of customers who would like to delete their previously running VMs in europe-west3 can delete VMs via the console or GCE APIs. However, there may be a delay in processing these deletions. All deletions will be fully processed when issues in europe-west3-c are resolved.

Google Kubernetes Engine:

The Google Kubernetes Engine nodes in the impacted location may be inaccessible and creation of new nodes may fail.

Google Cloud Dataflow:

Some existing batch jobs may experience delays when scaling workers. In addition streaming jobs may not be progressing or scaling up workers.

Google Cloud Dataproc:

While the existing clusters are not impacted, creation of new clusters may fail.

Cloud Build:

Builds in Custom Worker pools take a long time to start.

Google Cloud Pub/Sub:

There is no ongoing impact for the users at the moment.

Workaround:

  1. If you are impacted, please migrate the workload or operations from the europe-west3-c zone to other available zones or regions.

  2. For the customers with a degraded regional Persistent Disk, we recommend you take regular snapshots of your disk.

23 Oct 2024 23:14 PDT

Summary: Multiple GCP services impacted in the europe-west3 region

Description: We are experiencing an issue with multiple GCP services, including Google Compute Engine, Persistent Disk, and Google Cloud Dataflow, in the europe-west3 region due to a power and cooling issue.

Mitigation work is still underway by our engineering team and we do not have an ETA at the moment.

We will provide more information by Thursday, 2024-10-24 01:00 US/Pacific.

Diagnosis: Multiple services are impacted in the europe-west3 region:

Google Compute Engine:

The loss of power has led to capacity failure in the region. Customers may experience:

A percentage of Virtual Machines (VMs) being terminated and not available until power is restored.

A percentage of VMs may have lost access to their Persistent Disk and may be crashlooping.

A percentage of regional Persistent Disks may be running in a degraded state.

The incident is affecting the Compute API in the following ways:

Creation of new VMs or disks in europe-west3 may fail.

A percentage of customers attempting to consume VM reservations will be unable to do so.

A percentage of customers who would like to delete their previously running VMs in europe-west3 can delete VMs via the console or GCE APIs. However, there may be a delay in processing these deletions. All deletions will be fully processed when issues in europe-west3 are resolved.

Google Kubernetes Engine:

The Google Kubernetes Engine nodes in the impacted location may be inaccessible and creation of new nodes may fail.

Google Cloud Dataflow:

Some existing batch jobs may experience delays when scaling workers. In addition streaming jobs may not be progressing or scaling up workers.

Google Cloud Dataproc:

While the existing clusters are not impacted, creation of new clusters may fail.

Google Cloud Pub/Sub:

There is no ongoing impact for the users at the moment.

Workaround:

  1. If you are impacted, please migrate the workload or operations from the europe-west3 region to other available regions.

  2. For the customers with a degraded regional Persistent Disk, we recommend you take regular snapshots of your disk.

23 Oct 2024 22:59 PDT

Summary: Multiple GCP services impacted in europe-west3-c zone

Description: We are experiencing an issue with multiple GCP services, including Google Compute Engine, Persistent Disk, and Google Cloud Dataflow, in the europe-west3-c zone due to a power and cooling issue.

Mitigation work is still underway by our engineering team and we do not have an ETA at the moment.

We will provide more information by Thursday, 2024-10-24 01:00 US/Pacific.

Diagnosis: Multiple services are impacted in europe-west3-c:

Google Compute Engine:

The loss of power has led to capacity failure in the zone. Customers may experience:

A percentage of Virtual Machines (VMs) being terminated and not available until power is restored.

A percentage of VMs may have lost access to their Persistent Disk and may be crashlooping.

A percentage of regional Persistent Disks may be running in a degraded state.

The incident is affecting the Compute API in the following ways:

Creation of new VMs or disks in europe-west3-c may fail.

A percentage of customers attempting to consume VM reservations will be unable to do so.

A percentage of customers who would like to delete their previously running VMs in europe-west3-c can delete VMs via the console or GCE APIs. However, there may be a delay in processing these deletions. All deletions will be fully processed when issues in europe-west3-c are resolved.

Google Kubernetes Engine:

The Google Kubernetes Engine nodes in the impacted location may be inaccessible and creation of new nodes may fail.

Google Cloud Dataflow:

Some existing batch jobs may experience delays when scaling workers. In addition streaming jobs may not be progressing or scaling up workers.

Google Cloud Dataproc:

While the existing clusters are not impacted, creation of new clusters may fail.

Google Cloud Pub/Sub:

There is no ongoing impact for the users at the moment.

Workaround:

  1. If you are impacted, please migrate the workload or operations from the europe-west3-c zone to other available zones or regions.

  2. For the customers with a degraded regional Persistent Disk, we recommend you take regular snapshots of your disk.

23 Oct 2024 21:59 PDT

Summary: Multiple GCP services impacted in europe-west3-c zone

Description: We are experiencing an issue with Google Cloud Pub/Sub, Google Compute Engine, Persistent Disk, and Google Cloud Dataflow.

Our engineering team continues to investigate the issue.

We will provide an update by Wednesday, 2024-10-23 23:00 US/Pacific with current details.

Diagnosis: Multiple services are impacted in europe-west3-c:

Google Compute Engine: Impacted users may observe VM creation failing and some instances may not be available for operations in this zone.

Google Kubernetes Engine: The Google Kubernetes Engine nodes in the impacted location might be inaccessible. Also, creation of new nodes may fail.

Persistent Disk: The persistent disk instances might be unreachable for operations.

Google Cloud Dataflow: Some existing batch jobs may experience delays when scaling workers. Also, the streaming jobs may not be progressing or scaling up workers.

Google Cloud Dataproc: While the existing clusters are not impacted, creating new clusters may fail.

Google Cloud Pub/Sub: There is no ongoing impact for the users at the moment.

Workaround: If you are impacted, please migrate the workload or operations from the europe-west3-c zone to other available zones or regions.

23 Oct 2024 21:27 PDT

Summary: Multiple GCP services impacted in europe-west3-c zone

Description: We are experiencing an issue with Google Cloud Pub/Sub, Google Compute Engine, Persistent Disk, and Google Cloud Dataflow.

Our engineering team continues to investigate the issue.

We will provide an update by Wednesday, 2024-10-23 23:00 US/Pacific with current details.

Diagnosis: Multiple services are impacted in europe-west3-c:

Google Compute Engine: Impacted users may observe VM creation failing and some instances may not be available for operations in this zone.

Persistent Disk: The persistent disk instances might be unreachable for operations.

Google Cloud Dataflow: While the existing clusters are not impacted, creating new clusters may fail.

Google Cloud Dataproc: While the existing clusters are not impacted, creating new clusters may fail.

Google Cloud Pub/Sub: There is no ongoing impact for the users at the moment.

Workaround: If you are impacted, please migrate the workload or operations from the europe-west3-c zone to other available zones or regions.

23 Oct 2024 20:32 PDT

Summary: Multiple GCP services impacted in europe-west3-c zone

Description: We are experiencing an issue with Google Cloud Pub/Sub, Google Compute Engine, Persistent Disk, and Google Cloud Dataflow.

Our engineering team continues to investigate the issue.

We will provide an update by Wednesday, 2024-10-23 21:30 US/Pacific with current details.

Diagnosis: Multiple services are impacted in europe-west3-c:

Google Compute Engine: Impacted users may observe VM creation failing and some instances may not be available for operations in this zone.

Persistent Disk: The persistent disk instances might be unreachable for operations.

Google Cloud Dataflow: While the existing clusters are not impacted, creating new clusters may fail.

Google Cloud Dataproc: While the existing clusters are not impacted, creating new clusters may fail.

Workaround: None at this time.

23 Oct 2024 19:56 PDT

Summary: Multiple GCP services impacted in europe-west3-c zone

Description: We are experiencing an issue with Google Cloud Pub/Sub, Google Compute Engine, and Persistent Disk beginning on Wednesday, 2024-10-23 18:24 US/Pacific.

Our engineering team continues to investigate the issue.

We will provide an update by Wednesday, 2024-10-23 20:30 US/Pacific with current details.

We apologize to all who are affected by the disruption.

Diagnosis: None at this time.

Workaround: None at this time.