Service Health
Incident affecting AlloyDB for PostgreSQL, Cloud Firestore, Google BigQuery, Google Cloud Bigtable, Google Cloud Composer, Google Cloud Dataflow, Google Cloud Dataproc, Google Cloud SQL, Google Compute Engine, Google Kubernetes Engine, Hybrid Connectivity, Identity and Access Management, Persistent Disk, Virtual Private Cloud (VPC)
Customers are experiencing connectivity issues with multiple Google Cloud services in zone us-east5-c
Incident began at 2025-03-29 12:53 and ended at 2025-03-29 19:15 (all times are US/Pacific).
Previously affected location(s)
Columbus (us-east5)
Date | Time | Description
---|---|---
| 11 Apr 2025 | 09:10 PDT | Incident Report

Summary: On Saturday, 29 March 2025, multiple Google Cloud services in the us-east5-c zone experienced degraded service or unavailability for a duration of 6 hours and 10 minutes. To our Google Cloud customers whose services were impacted during this disruption, we sincerely apologize. This is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform’s performance and availability.

Root Cause: The root cause of the service disruption was a loss of utility power in the affected zone. This power outage triggered a cascading failure within the uninterruptible power supply (UPS) system responsible for maintaining power to the zone during such events. The UPS system, which relies on batteries to bridge the gap between utility power loss and generator power activation, experienced a critical battery failure. This failure rendered the UPS unable to perform its core function of ensuring continuous power to the system. As a direct consequence of the UPS failure, virtual machine instances within the affected zone lost power and went offline, resulting in service downtime for customers. The power outage and subsequent UPS failure also triggered a series of secondary issues, including packet loss within the us-east5-c zone, which impacted network communication and performance. Additionally, a limited number of storage disks within the zone became unavailable during the outage.

Remediation and Prevention: Google engineers were alerted to the incident by our internal monitoring alerts at 12:54 US/Pacific on Saturday, 29 March and immediately started an investigation. Google engineers diverted traffic away from the impacted location to partially mitigate impact for some services that did not have zonal resource dependencies. Engineers bypassed the failed UPS and restored power via generator by 14:49 US/Pacific on Saturday, 29 March. The majority of Google Cloud services recovered shortly thereafter. A few services experienced longer restoration times because manual actions were required in some cases to complete full recovery. Google is committed to preventing a repeat of this issue in the future and is completing the following actions:
Google is committed to quickly and continually improving our technology and operations to prevent service disruptions. We appreciate your patience and apologize again for the impact to your organization. We thank you for your business.

Detailed Description of Impact: Customers experienced degraded service or unavailability for multiple Google Cloud products in the us-east5-c zone, of varying impact and severity, as noted below:

- AlloyDB for PostgreSQL: A few clusters experienced transient unavailability during the failover. Two impacted clusters did not fail over automatically and required manual intervention from Google engineers to complete the failover.
- BigQuery: A few customers in the impacted region experienced brief unavailability of the product between 12:57 US/Pacific and 13:19 US/Pacific.
- Cloud Bigtable: The outage resulted in increased errors and latency for a few customers between 12:47 US/Pacific and 19:37 US/Pacific.
- Cloud Composer: External streaming jobs for a few customers experienced increased latency for a period of 16 minutes.
- Cloud Dataflow: Streaming and batch jobs saw brief periods of performance degradation. 17% of streaming jobs experienced degradation from 12:52 US/Pacific to 13:08 US/Pacific, while 14% of batch jobs experienced degradation from 15:42 US/Pacific to 16:00 US/Pacific.
- Cloud Filestore: All basic, high scale, and zonal instances in us-east5-c were unavailable, and all enterprise and regional instances in us-east5 were operating in degraded mode from 12:54 to 18:47 US/Pacific on Saturday, 29 March 2025.
- Cloud Firestore: Limited impact of approximately 2 minutes during which customers experienced elevated unavailability and latency while jobs were rerouted automatically.
- Cloud Identity and Access Management: A few customers experienced slight latency or errors on retry for a short period of time.
- Cloud Interconnect: All us-east5 attachments connected to zone1 were unavailable for a duration of 2 hours, 7 minutes.
- Cloud Key Management Service: Customers experienced 5XX errors for a brief period of time (less than 4 minutes). Google engineers rerouted traffic to healthy cells shortly after the power loss to mitigate the impact.
- Google Kubernetes Engine: Customers experienced terminations of their nodes in us-east5-c. Some zonal clusters in us-east5-c experienced loss of connectivity to their control plane. No impact was observed for nodes or control planes outside of us-east5-c.
- Cloud NAT: Transient control plane outage affecting new VM creation processes and/or dynamic port allocation.
- Cloud Router: Cloud Router was unavailable for up to 30 seconds while leadership shifted to other clusters. This downtime was within the thresholds of most customers' graceful restart configurations (60 seconds).
- Cloud SQL: Based on monitoring data, 318 zonal instances experienced 3 hours of downtime in the us-east5-c zone. All external high-availability instances successfully failed out of the impacted zone.
- Cloud Spanner: Customers in the us-east5 region may have seen errors or increased latency for a few minutes after 12:52 US/Pacific, when the cluster first failed.
- Cloud VPN: A few legacy customers experienced loss of connectivity of their sessions for up to 5 minutes.
- Compute Engine: Customers experienced instance unavailability and inability to manage instances in us-east5-c from 12:54 to 18:30 US/Pacific on Saturday, 29 March 2025.
- Managed Service for Apache Kafka: CreateCluster and some UpdateCluster commands (those that increased capacity configuration) had a 100% error rate in the region, with the symptom being INTERNAL errors or timeouts. Based on our monitoring, the impact was limited to one customer who attempted to use these methods during the incident.
- Memorystore for Redis: High availability instances failed over to healthy zones during the incident. 12 instances required manual intervention to bring back provisioned capacity. All instances were recovered by 19:28 US/Pacific.
- Persistent Disk: Customers experienced very high I/O latency, including stalled I/O operations or errors on some disks in us-east5-c, from 12:54 US/Pacific to 20:45 US/Pacific on Saturday, 29 March 2025. Other products using Persistent Disk or communicating with impacted Persistent Disk devices experienced service issues with varied symptoms.
- Secret Manager: Customers experienced 5XX errors for a brief period of time (less than 4 minutes). Google engineers rerouted traffic to healthy cells shortly after the power loss to mitigate the impact.
- Virtual Private Cloud: Virtual machine instances running in the us-east5-c zone were unable to reach the network. Services were partially unavailable from the impacted zone. Customers, wherever applicable, were able to fail over workloads to different Cloud zones (an illustrative sketch of such zone-redundant configurations follows the update history below). |
| 1 Apr 2025 | 01:53 PDT | Mini Incident Report

We apologize for the inconvenience this outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support.

(All Times US/Pacific)

Incident Start: 29 March 2025 12:53
Incident End: 29 March 2025 19:12
Duration: 6 hours, 19 minutes

Affected Services and Features:

Regions/Zones: us-east5-c

Description: Multiple Google Cloud products were impacted in us-east5-c, with some zonal resources unavailable, for a duration of 6 hours and 19 minutes. The root cause of the issue was a utility power outage in the zone and a subsequent failure of batteries within the uninterruptible power supply (UPS) system supporting a portion of the impacted zone. This failure prevented the UPS from operating correctly, thereby preventing a power source transfer to generators during the utility power outage. As a result, some Compute Engine instances in the zone experienced downtime. The incident also caused some packet loss within the us-east5-c zone, as well as some capacity constraints for Google Kubernetes Engine in other zones of us-east5. Additionally, a small number of Persistent Disks were unavailable during the outage. Google engineers diverted traffic away from the impacted location to partially mitigate impact for some services that did not have zonal resource dependencies. Engineers bypassed the failed UPS and restored power via generator, allowing the underlying infrastructure to come back online. Impact to all affected Cloud services was mitigated by 29 March 2025 at 19:12 US/Pacific. Google will complete a full Incident Report in the following days that will provide a detailed root cause analysis.

Customer Impact: Customers experienced degraded service or zonal unavailability for multiple Google Cloud products in us-east5-c.

Additional details: The us-east5-c zone has transitioned back to primary power without further impact as of 30 March 2025 at 17:30 US/Pacific. |
| 29 Mar 2025 | 19:43 PDT | Currently, the us-east5-c zone is stable on an alternate power source. All previously impacted products are mitigated as of 19:12 US/Pacific. A small number of Persistent Disks remain in recovery and are actively being worked on. Customers still experiencing issues attaching Persistent Disks should open a support case. Our engineers continue to monitor service stability prior to transitioning back to primary power. We will provide continuing updates via Personalized Service Health (PSH) by Sunday, 2025-03-30 01:30 US/Pacific with current details. We apologize to all who are affected by the disruption. |
| 29 Mar 2025 | 18:30 PDT | Our engineers are actively working on recovery following a power event in the affected zone. Full recovery is currently expected to take several hours. The impacted services include Cloud Interconnect, Virtual Private Cloud (VPC), Google Compute Engine, Persistent Disk, AlloyDB for PostgreSQL, Cloud Dataproc, Cloud Dataflow, Cloud Filestore, Identity and Access Management, Cloud SQL, Google Kubernetes Engine, Cloud Composer, BigQuery, Cloud Bigtable, and more. We have determined that no other zones (a, b) in the us-east5 region are impacted. We will provide an update by Saturday, 2025-03-29 20:00 US/Pacific with current details. We apologize to all who are affected by the disruption. |
- All times are US/Pacific
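
The incident report above notes that regional and high-availability resources (for example, Cloud SQL HA instances and regional Filestore instances) failed out of us-east5-c automatically, while strictly zonal resources remained unavailable until power was restored. As a rough illustration of the kind of zone-redundant configurations the report refers to, the following gcloud commands sketch how such resources are typically provisioned. This sketch is not part of the incident report; the resource names, sizes, and versions are placeholders.

```shell
# Illustrative examples of zone-redundant provisioning in us-east5.
# Names, sizes, and versions below are hypothetical placeholders.

# Cloud SQL instance with regional (high) availability: a standby in a
# second zone allows automatic failover if one zone loses power.
gcloud sql instances create example-pg-instance \
  --database-version=POSTGRES_15 \
  --tier=db-custom-2-8192 \
  --region=us-east5 \
  --availability-type=REGIONAL

# Regional Persistent Disk: data is replicated synchronously across two
# zones, so the disk can be attached to a VM in the surviving zone.
gcloud compute disks create example-regional-disk \
  --region=us-east5 \
  --replica-zones=us-east5-a,us-east5-b \
  --size=200GB \
  --type=pd-ssd

# Regional GKE cluster: the control plane and nodes span multiple zones,
# avoiding the single-zone control-plane connectivity loss described above.
gcloud container clusters create example-cluster \
  --region=us-east5 \
  --node-locations=us-east5-a,us-east5-b \
  --num-nodes=1
```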