Google Cloud Service Health


This page provides status information on the services that are part of Google Cloud. Check back here to view the current status of the services listed below. If you are experiencing an issue not listed here, please contact Support. Learn more about what's posted on the dashboard in this FAQ. For additional information on these services, please visit https://cloud.google.com/.

For incidents related to Google Security Products, visit https://status.cloud.google.com/security. For incidents related to Looker (original), visit https://status.cloud.google.com/looker.


Incident affecting AlloyDB for PostgreSQL, Apigee, Artifact Registry, Certificate Authority Service, Cloud Armor, Cloud Billing, Cloud Build, Cloud External Key Manager, Cloud Firestore, Cloud HSM, Cloud Key Management Service, Cloud Load Balancing, Cloud Memorystore, Cloud Monitoring, Cloud Run, Cloud Spanner, Cloud Storage for Firebase, Cloud Workflows, Database Migration Service, Dataproc Metastore, Dialogflow CX, Eventarc, Google App Engine, Google BigQuery, Google Cloud Bigtable, Google Cloud Console, Google Cloud Dataflow, Google Cloud Dataproc, Google Cloud Pub/Sub, Google Cloud SQL, Google Cloud Storage, Google Cloud Support, Google Cloud Tasks, Google Compute Engine, Google Kubernetes Engine, Hybrid Connectivity, Identity and Access Management, Media CDN, Memorystore for Memcached, Memorystore for Redis, Memorystore for Redis Cluster, Persistent Disk, Private Service Connect, Secret Manager, Service Directory, Vertex AI Online Prediction, Virtual Private Cloud (VPC)

We are investigating elevated error rates with multiple products in us-east1

Incident began at 2025-07-18 07:42 and ended at 2025-07-18 09:47 (all times are US/Pacific).

Previously affected location(s)

South Carolina (us-east1)

Date	Time	Description
22 Jul 2025	06:42 PDT	# Incident Report

## Summary

On Friday, 18 July 2025 07:50 US/Pacific, several Google Cloud Platform (GCP) and Google Workspace (GWS) products experienced elevated latencies and error rates in the us-east1 region for a duration of up to 1 hour and 57 minutes.

- GCP Impact Duration: 18 July 2025 07:50 - 09:47 US/Pacific: 1 hour 57 minutes
- GWS Impact Duration: 18 July 2025 07:50 - 08:40 US/Pacific: 50 minutes

We sincerely apologize for this incident, which does not reflect the level of quality and reliability we strive to offer. We are taking immediate steps to improve the platform’s performance and availability.

## Root Cause

The service interruption was triggered by a procedural error during a planned hardware replacement in our datacenter. An incorrect physical disconnection was made to the active network switch serving our control plane, rather than the redundant unit scheduled for removal. The redundant unit had been properly de-configured as part of the procedure, and the combination of these two events led to partitioning of the network control plane.

Our network is designed to withstand this type of control plane failure by failing open and continuing operation. However, an operational topology change while the network control plane was in a failed-open state caused our network fabric's topology information to become stale. This led to packet loss and service disruption until services were moved away from the fabric and control plane connectivity was restored.

## Remediation and Prevention

Google engineers were alerted to the outage by our monitoring system on 18 July 2025 07:06 US/Pacific and immediately started an investigation. The following timeline details the remediation and restoration efforts:

- 07:39 US/Pacific: The underlying root cause (device disconnect) was identified and onsite technicians were engaged to reconnect the control plane device and restore control plane connectivity. At that moment, the network's fail-open mechanisms worked as expected and no impact was observed.
- 07:50 US/Pacific: A topology change led to traffic being routed suboptimally, due to the network being in a fail-open state. This caused congestion on a subset of links, packet loss, and added latency for customer traffic. Engineers made a decision to move traffic away from the affected fabric, which mitigated the impact for the majority of the services.
- 08:40 US/Pacific: Engineers mitigated Workspace impact by shifting traffic away from the affected region.
- 09:47 US/Pacific: Onsite technicians reconnected the device, control plane connectivity was fully restored, and all services were back to a stable state.

Google is committed to preventing a repeat of the issue in the future, and is completing the following actions:

- Pause non-critical workflows until safety controls are implemented (complete).
- Strengthen safety controls for hardware upgrade workflows by end of Q3 2025.
- Design and implement a mechanism to prevent control plane partitioning in case of dual failure of upstream routers by end of Q4 2025.

## Detailed Description of Impact

### GCP Impact:

Multiple products in us-east1 were affected by the loss of network connectivity, with the most significant impacts seen in us-east1-b. Other regions were not affected.

The outage caused a range of issues for customers with zonal resources in the region, including packet loss across VPC networks, increased error rates and latency, service unavailable (503) errors, and slow or stuck operations, up to and including loss of network connectivity. While regional products were briefly impacted, they recovered quickly by failing over to unaffected zones.

A small number (0.1%) of Persistent Disks in us-east1-b were unavailable for the duration of the outage. These disks became available once the outage was mitigated, with no customer data loss.

### GWS Impact:

A small subset of Workspace users, primarily around the Southeast US, experienced varying degrees of unavailability and increased delays across multiple products, including Gmail, Google Meet, Google Drive, Google Chat, Google Calendar, Google Groups, Google Doc/Editors, and Google Voice.
18 Jul 2025	15:08 PDT	Mini Incident Report

We apologize for the inconvenience this service disruption/outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support or to Google Workspace Support using help article https://support.google.com/a/answer/1047213.

(All Times US/Pacific)

GCP Impact start and end time: 18 July 2025 08:10 - 09:47
Duration: 1 hour 37 minutes

GWS Impact start and end time: 18 July 2025 08:10 - 08:40
Duration: 30 minutes

Regions/Zones: us-east1

Description: On Friday, 18 July 2025 08:10 US/Pacific, multiple GCP and GWS products experienced elevated latencies and error rates in the us-east1 region for a duration of up to 1 hour and 37 minutes. Based on the preliminary analysis, the root cause of the issue is a procedural error during planned hardware maintenance in one of our data centers in the us-east1 region. Our engineering team mitigated the issue by draining traffic away from the clusters and then restoring the affected hardware. Google will be completing a full incident report in the following days that will provide a full root cause and preventive actions.

Customer Impact: The affected GCP and GWS products experienced elevated latencies and error rates in the us-east1 region.

Affected Products:

GCP: AlloyDB for PostgreSQL, Apigee, Artifact Registry, Cloud Armor, Cloud Billing, Cloud Build, Cloud External Key Manager, Cloud Filestore, Cloud HSM, Cloud Key Management Service, Cloud Load Balancing, Cloud Monitoring, Cloud Run, Cloud Spanner, Cloud Storage for Firebase, Cloud Workflows, Database Migration Service, Dialogflow CX, Dialogflow ES, Google BigQuery, Google Cloud Dataflow, Google Cloud Dataproc, Google Cloud Storage, Google Cloud Support, Google Cloud Tasks, Google Compute Engine, Hybrid Connectivity, Media CDN, Network Telemetry, Private Service Connect, Secret Manager, Service Directory, Vertex AI Online Prediction, Virtual Private Cloud (VPC)

Workspace: Gmail, Google Meet, Google Drive, Google Chat, Google Calendar, Google Groups, Google Doc/Editors, Google Voice

Google SecOps: Google SecOps SOAR & Google SecOps
18 Jul 2025	11:03 PDT	The issue has been resolved for all affected products as of 2025-07-18 09:47 US/Pacific. From preliminary analysis, during routine maintenance of our network in us-east1-b, we experienced elevated packet loss, causing service disruption in the zone. We will publish a full Incident Report with root cause once we have completed our internal investigations. We thank you for your patience while we worked on resolving the issue.
18 Jul 2025	10:32 PDT	Our engineers have successfully recovered the network control plane in the affected us-east1 zones. We're seeing multiple services reporting full recovery, and product engineers continue to validate the remaining services. We'll provide another update with more details by 11:00 AM US/Pacific, July 18, 2025.
18 Jul 2025	09:58 PDT	Our engineers have successfully recovered the network control plane in the affected us-east1 zones. We're seeing multiple services reporting full recovery, and product engineers are now validating the remaining services. We'll provide another update with more details by 10:30 AM US/Pacific, July 18, 2025.
18 Jul 2025	09:29 PDT	Our engineers have confirmed that us-east1-b is partially affected. All other zones in us-east1 are currently operating normally. Our engineers have recovered the failed hardware and are currently recovering the network control plane in the affected zones. We'll provide another update by 10:00 AM US/Pacific, July 18, 2025.
18 Jul 2025	08:54 PDT	We're currently experiencing elevated latency and error rates for several Cloud services in the us-east1 region, beginning at 7:06 AM PDT today, July 18, 2025. Our initial investigation points to a hardware infrastructure failure as the likely cause. We apologize for any disruption this may be causing. We'll provide an update with more details by 9:15 AM PDT today.

All times are US/Pacific
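
As a quick sanity check on the impact windows quoted in the reports above, the following Python sketch recomputes each duration from its start and end times. The function name `impact_duration` and the `windows` table are illustrative assumptions, not part of Google's report. Naive datetimes are adequate here only because every timestamp falls on the same day in the same US/Pacific zone.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"


def impact_duration(start: str, end: str) -> str:
    """Return the elapsed time between two same-zone timestamps as 'H h M min'."""
    elapsed = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    hours, minutes = divmod(int(elapsed.total_seconds()) // 60, 60)
    return f"{hours} h {minutes} min"


# Windows as quoted in the reports (all US/Pacific, 18 July 2025).
windows = {
    "GCP impact (full report)": ("2025-07-18 07:50", "2025-07-18 09:47"),
    "GWS impact (full report)": ("2025-07-18 07:50", "2025-07-18 08:40"),
    "GCP impact (mini report)": ("2025-07-18 08:10", "2025-07-18 09:47"),
    "Dashboard incident window": ("2025-07-18 07:42", "2025-07-18 09:47"),
}

for label, (start, end) in windows.items():
    print(f"{label}: {impact_duration(start, end)}")
```

The first three results agree with the durations stated in the reports (1 hour 57 minutes, 50 minutes, and 1 hour 37 minutes). The dashboard window begins at 07:42, eight minutes before the 07:50 impact start given in the full report, so it comes out at 2 hours 5 minutes.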
