Service Health
Incident affecting AlloyDB for PostgreSQL, Apigee, Backup and DR, Batch, Cloud Build, Cloud Data Fusion, Cloud Developer Tools, Cloud Filestore, Cloud Load Balancing, Cloud Machine Learning, Cloud Memorystore, Cloud NAT, Cloud Run, Cloud Workstations, Colab Enterprise, Contact Center AI Platform, Dataproc Metastore, Firebase Test Lab, Google App Engine, Google Cloud Composer, Google Cloud Dataflow, Google Cloud Dataproc, Google Cloud Deploy, Google Cloud Networking, Google Cloud SQL, Google Compute Engine, Google Kubernetes Engine, Hybrid Connectivity, Memorystore for Memcached, Memorystore for Redis, Vertex AI Workbench User Managed Notebooks, Virtual Private Cloud (VPC)
Multiple cloud products are experiencing network connectivity issues. New instances of several cloud products will come up without a network in many regions.
Incident began at 2024-05-16 15:22 and ended at 2024-05-16 18:10 (all times are US/Pacific).
Previously affected location(s)
Johannesburg (africa-south1), Taiwan (asia-east1), Hong Kong (asia-east2), Tokyo (asia-northeast1), Osaka (asia-northeast2), Seoul (asia-northeast3), Mumbai (asia-south1), Delhi (asia-south2), Singapore (asia-southeast1), Jakarta (asia-southeast2), Sydney (australia-southeast1), Melbourne (australia-southeast2), Warsaw (europe-central2), Finland (europe-north1), Madrid (europe-southwest1), Belgium (europe-west1), Berlin (europe-west10), Turin (europe-west12), London (europe-west2), Frankfurt (europe-west3), Netherlands (europe-west4), Zurich (europe-west6), Milan (europe-west8), Paris (europe-west9), Global, Doha (me-central1), Dammam (me-central2), Tel Aviv (me-west1), Montréal (northamerica-northeast1), Toronto (northamerica-northeast2), São Paulo (southamerica-east1), Santiago (southamerica-west1), Iowa (us-central1), South Carolina (us-east1), Northern Virginia (us-east4), Columbus (us-east5), Dallas (us-south1), Oregon (us-west1), Los Angeles (us-west2), Salt Lake City (us-west3), Las Vegas (us-west4)
| Date | Time | Description |
| --- | --- | --- |
| 22 May 2024 | 09:44 PDT | Incident Report

## Summary

On 16 May 2024, the Google Infrastructure team was executing a routine maintenance action to shut down an unused Virtual Private Cloud (VPC) controller in a single Google Cloud zone. Unfortunately, a bug in the automation caused the component to be shut down in all zones where it was still in use. This resulted in network connectivity issues and/or service disruptions for multiple Google Cloud products. The majority of the impact lasted 2 hours and 48 minutes; some products took longer to fully recover depending on the failures they experienced, as outlined in the impact section below. To our Google Cloud customers whose services were impacted, we sincerely apologize. This is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform's performance and availability.

## Root Cause

The root cause was a bug in maintenance automation intended to shut down an unused VPC controller in a single zone. A parameter specifying the target zone for maintenance operations was modified during a refactor of the automation software earlier in the year. As a result of this modification, the target zone parameter was ignored and the shutdown operation took effect on VPC controllers in all cloud zones. The test environment caught this failure mode, but the error was misinterpreted as an expected failure in the test environment, so the change made it to production.

Shutdowns are a routine operational practice that we perform to maintain our systems without customer impact. The automation has a safeguard in place to limit the scope of maintenance operations. This safeguard was inadvertently disabled during a migration to a new version of the automation framework, allowing the shutdown operation to take effect on VPC controllers in all cloud zones.

A separate instance of the VPC controller is launched into every Google Cloud zone. The jobs running in a given zone are responsible for programming the hosts in that zone (such as GCE VM hosts) so that the endpoints running on those machines can reach other endpoints in their VPC networks, reach on-premises destinations via VPN / Interconnect, and reach the Internet. Within each zone, the VPC control plane is sharded into different components, each of which is responsible for programming a subset of traffic paths. The intent behind this architecture is to isolate outages affecting control plane jobs and to limit the blast radius: outages affecting control plane jobs within a given cluster are expected to affect only the local cluster, and ideally only a subset of traffic within the local cluster.

The VPC network architecture follows a fail-static design: if control plane jobs are not available, the data plane continues using the last programmed state until the control plane jobs provide an update. This design reduced the impact of this outage to network activity involving interaction with the control plane, detailed below.

The VPC controller shutdown affected a subset of Cloud VPC control plane jobs in all zones, impacting control plane operations such as creating new VMs, autoscaler operations, applying configuration changes to customer projects, and accurately reflecting changes in VM health state. The data plane impact included packet loss during VM live migration and network congestion in some zones.
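To make this failure mode concrete, the following minimal Python sketch shows a maintenance routine whose zone-scoping parameter survives a refactor in the function signature but is no longer applied, together with the kind of blast-radius safeguard that was disabled during the framework migration. All names here (`Controller`, `MaintenanceRequest`, `run_maintenance`) are hypothetical illustrations, not Google's internal automation:

```python
# Hypothetical sketch of the failure mode; not Google's internal automation.
from dataclasses import dataclass

@dataclass
class Controller:
    zone: str
    in_use: bool

@dataclass
class MaintenanceRequest:
    target_zone: str             # the single zone the operator intends to touch
    max_zones_affected: int = 1  # blast-radius safeguard

def select_targets_v1(controllers, request):
    """Pre-refactor behavior: only controllers in the requested zone."""
    return [c for c in controllers if c.zone == request.target_zone]

def select_targets_v2(controllers, request):
    """Post-refactor bug: target_zone is still accepted but never applied,
    so the selection silently fans out to every zone."""
    return list(controllers)

def run_maintenance(controllers, request, select_targets):
    targets = select_targets(controllers, request)
    # Scope safeguard: refuse any operation touching more zones than declared.
    # In the incident, a check of this kind was inadvertently disabled during
    # a migration to a new automation framework, so an over-broad selection
    # like select_targets_v2's went through unchecked.
    zones = {c.zone for c in targets}
    if len(zones) > request.max_zones_affected:
        raise RuntimeError(f"refusing shutdown: {len(zones)} zones selected, "
                           f"limit is {request.max_zones_affected}")
    for c in targets:
        print(f"shutting down VPC controller in {c.zone}")

controllers = [Controller("us-central1-a", in_use=True),
               Controller("europe-west1-b", in_use=True),
               Controller("asia-east1-a", in_use=False)]
request = MaintenanceRequest(target_zone="asia-east1-a")
run_maintenance(controllers, request, select_targets_v1)  # one zone, as intended
# With select_targets_v2 the safeguard trips; with the safeguard disabled as
# well, controllers in every zone would be shut down.
```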
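The fail-static design called out above, which kept already-programmed traffic flowing while leaving newly created or migrated endpoints unprogrammed, can be sketched as follows. Again, the classes and methods are illustrative stand-ins under assumed names, not the actual VPC control or data plane:

```python
# Illustrative fail-static sketch; not the actual VPC implementation.
class ControlPlaneUnavailable(Exception):
    pass

class StubControlPlane:
    def __init__(self):
        self.available = True
    def fetch_routes(self):
        if not self.available:
            raise ControlPlaneUnavailable()
        return {"10.0.0.2": "host-a"}

class DataPlane:
    def __init__(self, control_plane):
        self.control_plane = control_plane
        self.routes = {}  # last programmed state: destination -> next hop

    def refresh(self):
        """Pull fresh programming; on failure, fail static."""
        try:
            self.routes = self.control_plane.fetch_routes()
        except ControlPlaneUnavailable:
            # Fail static: keep serving the last programmed state. Existing
            # flows keep working, but endpoints created or live-migrated
            # during the outage are never programmed, matching the impact
            # described above.
            pass

    def forward(self, dst):
        next_hop = self.routes.get(dst)
        if next_hop is None:
            raise LookupError(f"no programmed route for {dst}")
        return next_hop

dp = DataPlane(StubControlPlane())
dp.refresh()                                 # normal programming
dp.control_plane.available = False
dp.refresh()                                 # control plane down: state retained
assert dp.forward("10.0.0.2") == "host-a"    # already-programmed traffic flows
```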
## Remediation and Prevention

Google engineers were alerted to the outage through internal monitoring alerts at 15:41 US/Pacific and immediately started an investigation. Maintenance operations were immediately paused to prevent a recurrence. In parallel, affected VPC controllers were brought back to a serving state. Once these operations were completed, VPC controllers in most zones were operational by 17:30 US/Pacific. By 18:09 US/Pacific, all VPC controllers were restored, and the backlog of changes to apply to the data plane cleared by 18:40 US/Pacific, mitigating impact for most of the products. Some products took longer to mitigate, as outlined in the impact section below. Google Cloud is committed to preventing a repeat of this issue in the future and is completing the following actions:
We apologize for the impact of this issue and are taking steps to address the scope and duration of this incident as well as the root cause itself. We thank you for your business.

## Detailed Description of Impact

AlloyDB for PostgreSQL:
Apigee:
Apigee Edge Public Cloud:
Batch:
Cloud Build:
Cloud Composer:
Cloud Data Fusion:
Cloud Filestore:
Cloud Firewall:
Cloud IDS:
Cloud Interconnect:
Cloud Load Balancing:
Cloud NAT:
Cloud NGFW Enterprise:
Cloud Router:
Cloud Run:
Cloud Security Command Center:
Cloud Shell:
Cloud VPN:
Cloud Workstations:
Colab Enterprise:
Firebase Test Lab:
Google App Engine Flexible:
Google Cloud Dataflow:
Google Cloud Dataproc:
Google Cloud Deploy:
Google Cloud Functions:
Google Cloud SQL:
Google Compute Engine:
Google Kubernetes Engine:
Infrastructure Manager:
Looker:
Memorystore for Memcached:
Memorystore for Redis:
Vertex AI Workbench:
Virtual Private Cloud:
|
| 16 May 2024 | 23:02 PDT | Mini Incident Report

We extend our sincerest apologies for the service interruption incurred as a result of this service outage. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support.

(All Times US/Pacific)
Incident Start: 16 May 2024, 15:22
Incident End: 16 May 2024, 18:10
Duration: 2 hours, 48 minutes

Affected Services and Features:
Regions/Zones: Global

Description: Multiple Google Cloud products experienced network connectivity issues and service outages of varying impact for a duration of up to 2 hours and 48 minutes. Preliminary findings are that a bug in maintenance automation intended to shut down an unused network control component in a single location instead caused the component to be shut down in many locations where it was still in use. Google engineers restarted the affected component, restoring normal operation. Google engineers have identified the automation that was responsible for this change and have terminated it until appropriate safeguards are put in place. There is no risk of a recurrence of this outage at the moment. Google will complete an Incident Report in the following days that will provide a full root cause.

Customer Impact: During the impact timeframe, Google Cloud Networking exhibited the following degradations:
Additionally, other Google products that depended on GCE VM creation or network configuration updates were not able to successfully complete operations during this time. |
| 16 May 2024 | 18:40 PDT | The issue with Network programming of Apigee, Backup and DR, Cloud Build, Cloud Data Fusion, Cloud Filestore, Cloud Load Balancing, Cloud NAT, Cloud Run, Cloud Workstations, Colab Enterprise, Contact Center AI Platform, Dataproc Metastore, Firebase Test Lab, Google App Engine, Google Cloud Composer, Google Cloud Dataflow, Google Cloud Dataproc, Google Cloud Deploy, Google Cloud Networking, Google Compute Engine, Google Kubernetes Engine, Hybrid Connectivity, Memorystore for Memcached, Memorystore for Redis, Vertex AI Workbench Instances, Virtual Private Cloud (VPC) has been resolved for all affected projects as of Thursday, 2024-05-16 18:10 US/Pacific. We thank you for your patience while we worked on resolving the issue. |
| 16 May 2024 | 18:37 PDT | Summary: Multiple cloud products are experiencing network connectivity issues. New instances of several cloud products will come up without a network in many regions. Description: We are seeing recovery for most of the affected products across all regions. Cloud Functions, Cloud Run, and App Engine Flex are recovered as of 18:06 US/Pacific. Google engineers are actively working to fully restore control plane functionality for all affected products and regions. We do not have an ETA for complete mitigation at this point. We will provide an update by Thursday, 2024-05-16 18:55 US/Pacific with current details. Diagnosis:
Workaround: None at this time. |
| 16 May 2024 | 17:59 PDT | Summary: Multiple cloud products are experiencing network connectivity issues. New instances of several cloud products will come up without a network in many regions. Description: We are experiencing an issue with Google Cloud products that use our network infrastructure beginning on Thursday, 2024-05-16 14:44 US/Pacific. We are seeing considerable recovery on most of the affected products across all regions. Google engineers are actively working to fully restore control plane functionality for all affected products and regions. We do not have an ETA for complete mitigation at this point. We will provide an update by Thursday, 2024-05-16 18:20 US/Pacific with current details. Diagnosis:
Workaround: None at this time. |
| 16 May 2024 | 17:44 PDT | Summary: Multiple cloud products are experiencing network connectivity issues. New instances of several cloud products will come up without a network in many regions. Description: We are experiencing an issue with Google Cloud products that use our network infrastructure beginning on Thursday, 2024-05-16 14:44 US/Pacific. We are seeing considerable recovery on most of the affected products across all regions. Google engineers are actively working to fully restore control plane functionality for all affected products and regions. We do not have an ETA for complete mitigation at this point. We will provide an update by Thursday, 2024-05-16 18:15 US/Pacific with current details. Diagnosis:
Workaround: None at this time. |
| 16 May 2024 | 17:12 PDT | Summary: Multiple cloud products are experiencing network connectivity issues. New instances of several cloud products will come up without a network in many regions. Description: Mitigation work is currently underway by our engineering team. Mitigation actions in us-central1 are completed. Google engineers are actively working to restore control plane functionality to remaining affected regions. We do not have an ETA for mitigation at this point. We will provide more information by Thursday, 2024-05-16 17:45 US/Pacific. Diagnosis:
Workaround: None at this time. |
| 16 May 2024 | 16:24 PDT | Summary: Programming failures for new Virtual Private Cloud endpoints globally, affecting new GCE VM network programming and Cloud Run using Direct VPC. Description: We are experiencing an issue with Virtual Private Cloud (VPC) beginning on Thursday, 2024-05-16 14:44 US/Pacific. Our engineering team continues to investigate the issue. We will provide an update by Thursday, 2024-05-16 16:57 US/Pacific with current details. Diagnosis: Customers impacted by this issue may see slow programming in the Cloud Networking control plane. New VMs or newly migrated VMs may have delayed network programming. For Cloud Run Direct VPC customers, Cloud Run scaling, including from 0, may not work. Serverless VPC Connectors cannot be created and cannot scale. App Engine Flex deployments cannot be created and cannot scale. Workaround: None at this time. |
- All times are US/Pacific