Service Health

This page provides status information on the services that are part of Google Cloud. Check back here to view the current status of the services listed below. If you are experiencing an issue not listed here, please contact Support. Learn more about what's posted on the dashboard in this FAQ. For additional information on these services, please visit https://cloud.google.com/.

Incident affecting Apigee, Apigee Edge Public Cloud, Batch, Cloud Filestore, Cloud Firestore, Cloud Key Management Service, Cloud NAT, Cloud Run, Google BigQuery, Google Cloud Dataflow, Google Cloud Dataproc, Google Cloud Deploy, Google Cloud Networking, Google Cloud Pub/Sub, Google Cloud SQL, Google Cloud Storage, Google Compute Engine, Google Kubernetes Engine, Identity and Access Management, Persistent Disk, Resource Manager API, Virtual Private Cloud (VPC)

All the impacted GCP products in australia-southeast2 have recovered.

Incident began at 2024-10-29 16:21 and ended at 2024-10-29 16:43 (all times are US/Pacific).

Previously affected location(s)

Melbourne (australia-southeast2)

4 Dec 2024 13:43 PST

Incident Report

Summary

On Tuesday, 29 October 2024 at 16:21 US/Pacific, a power voltage swell event impacted the Network Point of Presence (PoP) infrastructure in one of the campuses supporting the australia-southeast2 region, causing networking devices to unexpectedly reboot. The impacted network paths recovered after the reboot, and connectivity was restored at 16:43 US/Pacific.

To our Google Cloud customers whose businesses were impacted during this outage, we sincerely apologize. This is not the level of quality and reliability we strive to offer, and we are taking immediate steps to improve the platforms’ performance and resilience.

Impact

GCP operates multiple PoPs in the australia-southeast2 region to connect Cloud zones to each other and to Google’s Global Network. Google maintains sufficient capacity to ensure that occasional losses of network capacity go unnoticed by customers or cause only minimal disruption.

On Tuesday, 29 October 2024, two fiber failures occurred near this region. These failures reduced the available inter-region network capacity but did not impact the availability of GCP services in the region.

Later that same day, the PoP hosting 50% of the region's network capacity experienced a power voltage swell event that caused the networking devices in the PoP to reboot. While these devices rebooted, that network capacity was temporarily unavailable.

This rare triple failure resulted in reduced connectivity across zones in the region and caused multiple Google Cloud services to lose connectivity for 21 minutes. Additionally, customers using zones A and C experienced up to 15% increased latency and error rates for 16 minutes due to degraded inter-zone network connectivity while the networking devices recovered from the reboot.
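
As an illustration of the redundancy reasoning above, here is a minimal sketch with hypothetical capacity and demand figures; only the ~50% share of regional capacity carried by the affected PoP is taken from this report, and the path names and numbers are invented for the example.

```python
# Hypothetical illustration of the failure-domain reasoning described above.
# Only the "one PoP carried ~50% of regional capacity" figure comes from the
# report; the remaining split and the demand threshold are invented.

# Fraction of regional network capacity carried by each path (hypothetical split).
paths = {
    "pop_a (rebooted)": 0.50,   # the PoP that lost power carried ~50%
    "fiber_1 (cut)":    0.20,
    "fiber_2 (cut)":    0.15,
    "other_paths":      0.15,
}

demand = 0.30  # hypothetical steady-state demand as a fraction of total capacity

def remaining_capacity(failed: set[str]) -> float:
    """Capacity still available once the named paths have failed."""
    return sum(cap for name, cap in paths.items() if name not in failed)

# Any single failure leaves ample headroom, so customers see no impact:
assert remaining_capacity({"fiber_1 (cut)"}) >= demand
assert remaining_capacity({"pop_a (rebooted)"}) >= demand

# The rare triple failure exhausts the redundancy margin:
triple = {"pop_a (rebooted)", "fiber_1 (cut)", "fiber_2 (cut)"}
assert remaining_capacity(triple) < demand   # inter-zone connectivity degrades
```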

Google engineers were already working to remediate the two fiber failures when they were alerted to the network disruption caused by the voltage swell via an internal monitoring alert on Tuesday, 29 October 2024 at 16:29 US/Pacific, and they immediately started an investigation.

After the devices had completed rebooting, the impacted network paths were recovered and all affected Cloud Zones regained full connectivity to the network at 16:43 US/Pacific.

The majority of GCP services impacted by the issue recovered shortly thereafter. A few Cloud services experienced longer restoration times because manual actions were required in some cases to complete full recovery.

(Updated on 04 December 2024)

Root cause of device reboots

Upon review of the power management systems, Google engineers determined that there is a mismatch in voltage tolerances in the affected network equipment at the site. The affected network racks are designed to tolerate up to ~110% of the designed voltage, while the UPS that supplies power to the network equipment is designed to tolerate up to ~120% of the designed voltage.

The voltage swell event pushed the supply voltage to between 110% and 120% of the designed voltage. The networking equipment's rectifiers detected this as outside their allowable range and powered down to protect the equipment.

We have determined that a voltage regulator for the datacenter-level UPS was enabled, which is a standard setting. This caused an additional boost during power fluctuations, pushing the voltage into the problematic 110%-120% range. The voltage regulator is necessary to ensure sufficient voltage when loads are high, but because the equipment is operating well below its load limit, the boost caused a deviation above expected levels.
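
For illustration, here is a minimal sketch of the tolerance mismatch described above; the ~110% and ~120% thresholds are the approximate figures from this report, while the nominal voltage, the exact swell level, and the function names are hypothetical.

```python
# Illustrative model of the tolerance mismatch described in this report.
# Thresholds are the approximate figures quoted above; the nominal voltage
# and the example swell level are hypothetical.

NOMINAL_V = 240.0            # assumed nominal supply voltage (hypothetical)
RACK_TOLERANCE = 1.10        # network racks tolerate up to ~110% of nominal
UPS_TOLERANCE = 1.20         # datacenter-level UPS tolerates up to ~120% of nominal

def rack_shuts_down(measured_v: float) -> bool:
    """Rectifiers power down when the voltage exceeds the rack tolerance."""
    return measured_v / NOMINAL_V > RACK_TOLERANCE

def ups_passes_through(measured_v: float) -> bool:
    """The UPS only intervenes above its own, higher tolerance."""
    return measured_v / NOMINAL_V <= UPS_TOLERANCE

# A swell boosted to ~115% of nominal falls in the problematic 110%-120% band:
swell_v = NOMINAL_V * 1.15
assert ups_passes_through(swell_v)   # the UPS does not clamp the swell
assert rack_shuts_down(swell_v)      # but the rack rectifiers power off
```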

Remediation and Prevention

We are taking the following actions to prevent a recurrence and improve reliability in the future:

  • Review our datacenter power distribution design in the region and implement any recommended additional protections for critical networking devices against voltage swells and sags.
  • As an initial risk reduction measure, the datacenter-level UPS voltage regulator will be reconfigured, and we have instituted monthly reviews to ensure its configuration correctly matches future site load.
  • We are also deploying a double conversion UPS in the affected datacenter's power distribution design for the equipment that failed during this event.
  • Implement changes to network device configuration that reduce time to recover full network connectivity after a failure.
  • Review other locations to verify whether this risk exists, and where it does, proactively perform the above remediations.
  • Determine the root cause of, and corrective actions for, GCP services that did not recover quickly from the incident after network connectivity was restored.

To summarize, the root cause investigation for the network device reboots impacting the australia-southeast2 region has been concluded. Google teams are implementing additional preventative measures to minimize the risk of recurrence.

This is the final version of the Incident Report.


8 Nov 2024 07:14 PST

Incident Report

Summary

On Tuesday, 29 October 2024 at 16:21 US/Pacific, a power voltage swell event impacted the Network Point of Presence (PoP) infrastructure in one of the campuses supporting the australia-southeast2 region, causing networking devices to unexpectedly reboot. The impacted network paths recovered after the reboot, and connectivity was restored at 16:43 US/Pacific.

To our Google Cloud customers whose businesses were impacted during this outage, we sincerely apologize. This is not the level of quality and reliability we strive to offer, and we are taking immediate steps to improve the platforms’ performance and resilience.

Impact

GCP operates multiple PoPs in the australia-southeast2 region to connect Cloud zones to each other and to Google’s Global Network. Google maintains sufficient capacity to ensure that occasional losses of network capacity go unnoticed by customers or cause only minimal disruption.

On Tuesday, 29 October 2024, two fiber failures occurred near this region. These failures reduced the available inter-region network capacity but did not impact the availability of GCP services in the region.

Later that same day, the PoP hosting 50% of the region's network capacity experienced a power voltage swell event that caused the networking devices in the PoP to reboot. While these devices rebooted, that network capacity was temporarily unavailable.

This rare triple failure resulted in reduced connectivity across zones in the region and caused multiple Google Cloud services to lose connectivity for 21 minutes. Additionally, customers using zones A and C experienced up to 15% increased latency and error rates for 16 minutes due to degraded inter-zone network connectivity while the networking devices recovered from the reboot.

Google engineers were already working to remediate the two fiber failures when they were alerted to the network disruption caused by the voltage swell via an internal monitoring alert on Tuesday, 29 October 2024 at 16:29 US/Pacific, and they immediately started an investigation.

After the devices had completed rebooting, the impacted network paths were recovered and all affected Cloud Zones regained full connectivity to the network at 16:43 US/Pacific.

The majority of GCP services impacted by the issue recovered shortly thereafter. A few Cloud services experienced longer restoration times because manual actions were required in some cases to complete full recovery.

Root cause of device reboots

Google’s investigation into the cause of the networking device reboots is ongoing; we will provide additional information once the analysis is completed.

Remediation and Prevention

We are taking the following actions to prevent a recurrence and improve reliability in the future:

  • Review our datacenter power distribution design in the region and implement any recommended additional protections for critical networking devices against voltage swells and sags. This includes ensuring there are two additional static UPS systems in the affected datacenter's power distribution design.
  • Implement changes to network device configuration that reduce time to recover full network connectivity after a failure.
  • Determine the root cause of, and corrective actions for, GCP services that did not recover quickly from the incident after network connectivity was restored.

30 Oct 2024 12:30 PDT

Mini Incident Report

We apologize for the inconvenience this service disruption/outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support.

(All Times US/Pacific)

Incident Start: 29 October 2024 16:21

Incident End: 29 October 2024 19:34

Duration: 3 hours, 13 minutes

Affected Services and Features:

Apigee, Apigee Edge Public Cloud, Batch, Cloud Filestore, Cloud Firestore, Cloud Key Management Service, Cloud NAT, Cloud Run, Google BigQuery, Google Cloud Dataflow, Google Cloud Dataproc, Google Cloud Networking, Google Cloud Pub/Sub, Google Cloud SQL, Google Cloud Storage, Google Compute Engine, Google Kubernetes Engine, Identity and Access Management, Persistent Disk, Resource Manager API, Virtual Private Cloud (VPC)

Regions/Zones: australia-southeast2

Description:

Multiple Google Cloud products experienced service disruptions of varying impact and duration, with the longest lasting 2 hours, 3 minutes in the australia-southeast2 region.

From preliminary analysis, the root cause of the issue was a power interruption causing network and optical infrastructure to reboot in a subset of the sites in the australia-southeast2 region.

Google will complete a full Incident Report in the following days that will provide a detailed root cause.

Customer Impact:

Apigee - Impacted users experienced issues with pod scheduling from 16:40 to 19:34.

Apigee Edge Public Cloud - Impacted users observed increased runtime latency and an increase in 5XX errors from the runtime from 16:21 to 17:46.

Batch - New batch jobs that were created in the australia-southeast2 region remained in SCHEDULED status and did not progress. Customers had the option to switch to different regions.

Cloud Filestore - Instance creation/deletion operations failed in the region. Instances with virtual machines in these clusters were not reachable from other regions. Regional instances, depending on the placement of the majority of the virtual machines, went into lockdown.

Cloud Firestore - Customers in the australia-southeast2 region experienced limited to no availability from 16:21 to 16:44.

Cloud Key Management Service - KMS was not reachable in the australia-southeast2 region. Approximately 60% of the traffic was lost during the impacted period.

Cloud NAT - Customers experienced loss of connectivity in the australia-southeast2 region.

Cloud Run - Impacted users experienced dropped requests and high error rates from 16:20 to 17:14.

Google BigQuery - Impacted users observed delays/errors in handling requests from 16:20 to 16:55.

Google Cloud Dataflow - Impacted users had limited to no availability in the australia-southeast2 region between 16:21 and 17:40.

Google Cloud Dataproc - Impacted users observed that all Dataproc services were down from 17:13 to 19:10.

Google Cloud Networking - Impacted users experienced up to 100% of requests being dropped from 16:21 to 16:43.

Google Cloud Pub/Sub - Cloud Pub/Sub had limited to no availability in the australia-southeast2 region between 16:21 and 17:06.

Google Cloud SQL - Multiple instances of Cloud SQL were unavailable between 16:23 and 17:52.

Google Cloud Storage - Impacted users/projects experienced request timeout or unavailable errors between 16:21 and 17:12.

Google Compute Engine - Impacted users observed GCE operations such as compute.instances.insert failing from 16:40 to 19:34.

Google Kubernetes Engine - Impacted users were unable to make changes to their workloads on GKE clusters from 16:20 to 17:30.

Identity and Access Management - Impacted users experienced loss of traffic and visible latency/unavailability within the region from 16:28 to 17:14.

Persistent Disk - Customers experienced high I/O latency of up to several minutes in the australia-southeast2 region. As a workaround, customers could restore snapshots in another region or fail over to asynchronous replicas outside the impacted region (a sketch of the snapshot-restore option appears after this impact list).

Virtual Private Cloud (VPC) - Impacted users experienced a network outage in and out of the impacted region from 16:35 to 18:41.

Resource Manager API - Impacted users experienced loss of traffic and visible latency/unavailability within the region from 16:28 to 17:14.
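
For reference, here is a minimal sketch of the snapshot-restore workaround mentioned for Persistent Disk above, using the google-cloud-compute Python client. The project, snapshot, disk name, and target zone are placeholders, and the function shown is an illustrative example rather than Google-provided tooling.

```python
# Hedged sketch: restore a Persistent Disk snapshot into a zone in another
# region (e.g. australia-southeast1) while australia-southeast2 is impaired.
# Project, snapshot, disk name, and zone below are placeholders.
from google.cloud import compute_v1

def restore_snapshot_to_other_region(project: str, snapshot: str,
                                     new_disk_name: str, target_zone: str) -> None:
    disks = compute_v1.DisksClient()
    disk = compute_v1.Disk(
        name=new_disk_name,
        source_snapshot=f"projects/{project}/global/snapshots/{snapshot}",
    )
    # insert() returns an operation; result() blocks until the disk is created.
    operation = disks.insert(project=project, zone=target_zone, disk_resource=disk)
    operation.result()

# Example usage (placeholder values):
# restore_snapshot_to_other_region("my-project", "my-snapshot",
#                                  "restored-disk", "australia-southeast1-b")
```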

29 Oct 2024 19:29 PDT

The issue with Google Cloud Networking, Cloud Run, Identity and Access Management, Resource Manager API, Persistent Disk, Virtual Private Cloud (VPC), Google Cloud Dataflow, Apigee, Google Compute Engine, Google Cloud SQL, Apigee Edge Public Cloud, Google Cloud Storage, Google BigQuery, Google Kubernetes Engine, Cloud Dataproc has been resolved for all affected users as of Tuesday, 2024-10-29 18:24 US/Pacific.

We thank you for your patience while we worked on resolving the issue.

29 Oct 2024 18:47 PDT

Summary: Some GCP products in australia-southeast2 may have intermittent network connectivity.

Description: We are experiencing an issue with Google Cloud Networking, Cloud Run, Resource Manager API, Google Cloud Dataflow, Google Cloud SQL, Google Cloud Storage, Google Kubernetes Engine, Google Compute Engine beginning at Tuesday, 2024-10-29 16:28 US/Pacific.

Based on internal metrics, the following products have seen recovery.

  • Apigee and Apigee Edge Public Cloud
  • Cloud IAM
  • Cloud KMS
  • Persistent disk
  • Google BigQuery
  • Virtual Private Cloud (VPC)

Our engineering team continues to investigate the issue.

We will provide an update by Tuesday, 2024-10-29 19:30 US/Pacific with current details.

Diagnosis: Customers impacted by this may see issues with new deployments.

Workaround: None at this time.

29 Oct 2024 17:59 PDT

Summary: Multiple GCP products impacted in australia-southeast2 with intermittent network connectivity.

Description: We are experiencing an issue with Google Cloud Networking, Cloud Run, Resource Manager API, Virtual Private Cloud (VPC), Google Cloud Dataflow, Google Compute Engine, Google Cloud SQL, Google Cloud Storage, Google BigQuery beginning at Tuesday, 2024-10-29 16:28 US/Pacific.

Based on internal metrics, the following products have seen recovery.

  • Apigee and Apigee Edge Public Cloud
  • Cloud IAM
  • Cloud KMS
  • Persistent disk

Our engineering team continues to investigate the issue.

We will provide an update by Wednesday, 2024-10-30 00:41 US/Pacific with current details.

Diagnosis: Customers impacted by this may see issues with new deployments.

Workaround: None at this time.

29 Oct 2024 17:29 PDT

Summary: Multiple GCP products impacted in australia-southeast2

Description: We are experiencing an issue with Google Cloud Networking, Cloud Run, Identity and Access Management, Resource Manager API, Persistent Disk, Virtual Private Cloud (VPC), Google Cloud Dataflow, Google Compute Engine beginning at Tuesday, 2024-10-29 16:21 US/Pacific.

Our engineering team continues to investigate the issue.

We will provide an update by Tuesday, 2024-10-29 18:00 US/Pacific with current details.

Diagnosis: Customers impacted by this may see issues with new deployments.

Workaround: None at this time.

29 Oct 2024 17:17 PDT

Summary: Multiple GCP products impacted in australia-southeast2

Description: We are experiencing an issue with multiple GCP products beginning at Tuesday, 2024-10-29 16:21 US/Pacific.

Our engineering team continues to investigate the issue.

We will provide an update by Tuesday, 2024-10-29 18:00 US/Pacific with current details.

We apologize to all who are affected by the disruption.

Diagnosis: Customers impacted by this may see issues with new deployments.

Workaround: None at this time.