Service Health
Incident affecting AlloyDB for PostgreSQL, Apigee, Backup and DR, Batch, Cloud Build, Cloud Data Fusion, Cloud Developer Tools, Cloud Filestore, Cloud Load Balancing, Cloud Machine Learning, Cloud Memorystore, Cloud NAT, Cloud Run, Cloud Workstations, Colab Enterprise, Contact Center AI Platform, Dataproc Metastore, Firebase Test Lab, Google App Engine, Google Cloud Composer, Google Cloud Dataflow, Google Cloud Dataproc, Google Cloud Deploy, Google Cloud Networking, Google Cloud SQL, Google Compute Engine, Google Kubernetes Engine, Hybrid Connectivity, Memorystore for Memcached, Memorystore for Redis, Vertex AI Workbench User Managed Notebooks, Virtual Private Cloud (VPC)
Multiple cloud products are experiencing network connectivity issues. New instances of several cloud products will come up without a network in many regions.
Incident began at 2024-05-16 15:22 and ended at 2024-05-16 18:10 (all times are US/Pacific).
Previously affected location(s)
Johannesburg (africa-south1), Taiwan (asia-east1), Hong Kong (asia-east2), Tokyo (asia-northeast1), Osaka (asia-northeast2), Seoul (asia-northeast3), Mumbai (asia-south1), Delhi (asia-south2), Singapore (asia-southeast1), Jakarta (asia-southeast2), Sydney (australia-southeast1), Melbourne (australia-southeast2), Warsaw (europe-central2), Finland (europe-north1), Madrid (europe-southwest1), Belgium (europe-west1), Berlin (europe-west10), Turin (europe-west12), London (europe-west2), Frankfurt (europe-west3), Netherlands (europe-west4), Zurich (europe-west6), Milan (europe-west8), Paris (europe-west9), Global, Doha (me-central1), Dammam (me-central2), Tel Aviv (me-west1), Montréal (northamerica-northeast1), Toronto (northamerica-northeast2), São Paulo (southamerica-east1), Santiago (southamerica-west1), Iowa (us-central1), South Carolina (us-east1), Northern Virginia (us-east4), Columbus (us-east5), Dallas (us-south1), Oregon (us-west1), Los Angeles (us-west2), Salt Lake City (us-west3), Las Vegas (us-west4)
| Date | Time | Description |
| --- | --- | --- |
| 22 May 2024 | 09:44 PDT | Incident Report

## Summary

On 16 May 2024, the Google Infrastructure team was executing a routine maintenance action to shut down an unused Virtual Private Cloud (VPC) controller in a single Google Cloud zone. Unfortunately, a bug in the automation caused the component to be shut down in all zones where it was still in use. This resulted in network connectivity issues and/or service disruptions for multiple Google Cloud products. The majority of the impact lasted 2 hours and 48 minutes; some products took longer to fully recover depending on the failures they experienced, as outlined in the impact section below. To our Google Cloud customers whose services were impacted, we sincerely apologize. This is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform's performance and availability.

## Root Cause

The root cause was a bug in maintenance automation intended to shut down an unused VPC controller in a single zone. A parameter specifying the target zone for maintenance operations was modified during a refactor of the automation software earlier in the year. As a result of this modification, the target zone parameter was ignored and the shutdown operation took effect on VPC controllers in all cloud zones. The test environment caught this failure mode, but the error was misinterpreted as an expected failure in the test environment, so the change made it to production.

Shutdowns are a routine operational practice that we perform to maintain our systems without customer impact. The automation has a safeguard in place to limit the scope of maintenance operations. This safeguard was inadvertently disabled during a migration to a new version of the automation framework, allowing the shutdown operation to take effect on VPC controllers in all cloud zones.

A separate instance of the VPC controller is launched into every Google Cloud zone. The jobs running in a given zone are responsible for programming the hosts in that zone (such as GCE VM hosts) so that the endpoints running on those machines can reach other endpoints in their VPC networks, reach on-premises destinations via VPN / Interconnect, and reach the Internet. Within each zone, the VPC control plane is sharded into different components, each of which is responsible for programming a subset of traffic paths. The intent behind this architecture is to isolate outages affecting control plane jobs and to limit the blast radius: outages affecting control plane jobs within a given cluster are expected to affect only the local cluster, and ideally only a subset of traffic within the local cluster.

The VPC network architecture follows a fail-static design: if control plane jobs are not available, the data plane continues using the last programmed state until the control plane jobs provide an update. This design reduced the impact of this outage to network activity involving interaction with the control plane, detailed below.

The VPC controller shutdown affected a subset of Cloud VPC control plane jobs in all zones, impacting control plane operations such as creating new VMs, autoscaler operations, applying configuration changes to customer projects, and accurately reflecting changes in VM health state. The data plane impact included packet loss during VM live migration and network congestion in some zones.
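To make this failure mode concrete, the following minimal Python sketch shows a maintenance routine whose zone-scoping parameter survives a refactor in the function signature but is no longer applied, together with the kind of blast-radius safeguard that was disabled during the framework migration. All names here (`Controller`, `MaintenanceRequest`, `run_maintenance`) are hypothetical illustrations, not Google's internal automation:

```python
# Hypothetical sketch of the failure mode; not Google's internal automation.
from dataclasses import dataclass

@dataclass
class Controller:
    zone: str
    in_use: bool

@dataclass
class MaintenanceRequest:
    target_zone: str             # the single zone the operator intends to touch
    max_zones_affected: int = 1  # blast-radius safeguard

def select_targets_v1(controllers, request):
    """Pre-refactor behavior: only controllers in the requested zone."""
    return [c for c in controllers if c.zone == request.target_zone]

def select_targets_v2(controllers, request):
    """Post-refactor bug: target_zone is still accepted but never applied,
    so the selection silently fans out to every zone."""
    return list(controllers)

def run_maintenance(controllers, request, select_targets):
    targets = select_targets(controllers, request)
    # Scope safeguard: refuse any operation touching more zones than declared.
    # In the incident, a check of this kind was inadvertently disabled during
    # a migration to a new automation framework, so an over-broad selection
    # like select_targets_v2's went through unchecked.
    zones = {c.zone for c in targets}
    if len(zones) > request.max_zones_affected:
        raise RuntimeError(f"refusing shutdown: {len(zones)} zones selected, "
                           f"limit is {request.max_zones_affected}")
    for c in targets:
        print(f"shutting down VPC controller in {c.zone}")

controllers = [Controller("us-central1-a", in_use=True),
               Controller("europe-west1-b", in_use=True),
               Controller("asia-east1-a", in_use=False)]
request = MaintenanceRequest(target_zone="asia-east1-a")
run_maintenance(controllers, request, select_targets_v1)  # one zone, as intended
# With select_targets_v2 the safeguard trips; with the safeguard disabled as
# well, controllers in every zone would be shut down.
```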
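The fail-static design called out above, which kept already-programmed traffic flowing while leaving newly created or migrated endpoints unprogrammed, can be sketched as follows. Again, the classes and methods are illustrative stand-ins under assumed names, not the actual VPC control or data plane:

```python
# Illustrative fail-static sketch; not the actual VPC implementation.
class ControlPlaneUnavailable(Exception):
    pass

class StubControlPlane:
    def __init__(self):
        self.available = True
    def fetch_routes(self):
        if not self.available:
            raise ControlPlaneUnavailable()
        return {"10.0.0.2": "host-a"}

class DataPlane:
    def __init__(self, control_plane):
        self.control_plane = control_plane
        self.routes = {}  # last programmed state: destination -> next hop

    def refresh(self):
        """Pull fresh programming; on failure, fail static."""
        try:
            self.routes = self.control_plane.fetch_routes()
        except ControlPlaneUnavailable:
            # Fail static: keep serving the last programmed state. Existing
            # flows keep working, but endpoints created or live-migrated
            # during the outage are never programmed, matching the impact
            # described above.
            pass

    def forward(self, dst):
        next_hop = self.routes.get(dst)
        if next_hop is None:
            raise LookupError(f"no programmed route for {dst}")
        return next_hop

dp = DataPlane(StubControlPlane())
dp.refresh()                                 # normal programming
dp.control_plane.available = False
dp.refresh()                                 # control plane down: state retained
assert dp.forward("10.0.0.2") == "host-a"    # already-programmed traffic flows
```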
## Remediation and Prevention

Google engineers were alerted to the outage through internal monitoring alerts at 15:41 US/Pacific and immediately started an investigation. Maintenance operations were immediately paused to prevent a recurrence. In parallel, affected VPC controllers were brought back to a serving state. Once these operations were completed, VPC controllers in most zones were operational by 17:30 US/Pacific. By 18:09 US/Pacific, all VPC controllers were restored, and the backlog of changes to apply to the data plane cleared by 18:40 US/Pacific, mitigating impact for most of the products. Some products took longer to mitigate, as outlined in the impact section below. Google Cloud is committed to preventing a repeat of this issue in the future and is completing the following actions:
We apologize for the impact of this issue and are taking steps to address the scope and duration of this incident as well as the root cause itself. We thank you for your business.

## Detailed Description of Impact

AlloyDB for PostgreSQL:
Apigee:
Apigee Edge Public Cloud:
Batch:
Cloud Build:
Cloud Composer:
Cloud Data Fusion:
Cloud Filestore:
Cloud Firewall:
Cloud IDS:
Cloud Interconnect:
Cloud Load Balancing:
Cloud NAT:
Cloud NGFW Enterprise:
Cloud Router:
Cloud Run:
Cloud Security Command Center:
Cloud Shell:
Cloud VPN:
Cloud Workstations:
Colab Enterprise:
Firebase Test Lab:
Google App Engine Flexible:
Google Cloud Dataflow:
Google Cloud Dataproc:
Google Cloud Deploy:
Google Cloud Functions:
Google Cloud SQL:
Google Compute Engine:
Google Kubernetes Engine:
Infrastructure Manager:
Looker:
Memorystore for Memcached:
Memorystore for Redis:
Vertex AI Workbench:
Virtual Private Cloud:
|
| 16 May 2024 | 23:02 PDT | Mini Incident Report

We extend our sincerest apologies for the service interruption incurred as a result of this service outage. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support.

(All Times US/Pacific)
Incident Start: 16 May 2024, 15:22
Incident End: 16 May 2024, 18:10
Duration: 2 hours, 48 minutes

Affected Services and Features:
Regions/Zones: Global

Description: Multiple Google Cloud products experienced network connectivity issues and service outages of varying impact for a duration of up to 2 hours and 48 minutes. Preliminary findings are that a bug in maintenance automation intended to shut down an unused network control component in a single location instead caused the component to be shut down in many locations where it was still in use. Google engineers restarted the affected component, restoring normal operation. Google engineers have identified the automation that was responsible for this change and have terminated it until appropriate safeguards are put in place. There is no risk of a recurrence of this outage at the moment. Google will complete an Incident Report in the following days that will provide a full root cause.

Customer Impact: During the impact timeframe, Google Cloud Networking exhibited the following degradations:
Additionally, other Google products that depended on GCE VM creation or network configuration updates were not able to successfully complete operations during this time. |
| 16 May 2024 | 18:40 PDT | The issue with Network programming of Apigee, Backup and DR, Cloud Build, Cloud Data Fusion, Cloud Filestore, Cloud Load Balancing, Cloud NAT, Cloud Run, Cloud Workstations, Colab Enterprise, Contact Center AI Platform, Dataproc Metastore, Firebase Test Lab, Google App Engine, Google Cloud Composer, Google Cloud Dataflow, Google Cloud Dataproc, Google Cloud Deploy, Google Cloud Networking, Google Compute Engine, Google Kubernetes Engine, Hybrid Connectivity, Memorystore for Memcached, Memorystore for Redis, Vertex AI Workbench Instances, Virtual Private Cloud (VPC) has been resolved for all affected projects as of Thursday, 2024-05-16 18:10 US/Pacific. We thank you for your patience while we worked on resolving the issue. |
| 16 May 2024 | 18:37 PDT | Summary: Multiple cloud products are experiencing network connectivity issues. New instances of several cloud products will come up without a network in many regions. Description: We are seeing recovery for most of the affected products across all regions. Cloud Functions, Cloud Run, and App Engine Flex are recovered as of 18:06 US/Pacific. Google engineers are actively working to fully restore control plane functionality for all affected products and regions. We do not have an ETA for complete mitigation at this point. We will provide an update by Thursday, 2024-05-16 18:55 US/Pacific with current details. Diagnosis:
Workaround: None at this time. |
| 16 May 2024 | 17:59 PDT | Summary: Multiple cloud products are experiencing network connectivity issues. New instances of several cloud products will come up without a network in many regions. Description: We are experiencing an issue with Google Cloud products that use our network infrastructure beginning on Thursday, 2024-05-16 14:44 US/Pacific. We are seeing considerable recovery on most of the affected products across all regions. Google engineers are actively working to fully restore control plane functionality for all affected products and regions. We do not have an ETA for complete mitigation at this point. We will provide an update by Thursday, 2024-05-16 18:20 US/Pacific with current details. Diagnosis:
Workaround: None at this time. |
| 16 May 2024 | 17:44 PDT | Summary: Multiple cloud products are experiencing network connectivity issues. New instances of several cloud products will come up without a network in many regions. Description: We are experiencing an issue with Google Cloud products that use our network infrastructure beginning on Thursday, 2024-05-16 14:44 US/Pacific. We are seeing considerable recovery on most of the affected products across all regions. Google engineers are actively working to fully restore control plane functionality for all affected products and regions. We do not have an ETA for complete mitigation at this point. We will provide an update by Thursday, 2024-05-16 18:15 US/Pacific with current details. Diagnosis:
Workaround: None at this time. |
| 16 May 2024 | 17:12 PDT | Summary: Multiple cloud products are experiencing network connectivity issues. New instances of several cloud products will come up without a network in many regions. Description: Mitigation work is currently underway by our engineering team. Mitigation actions in us-central1 are completed. Google engineers are actively working to restore control plane functionality to remaining affected regions. We do not have an ETA for mitigation at this point. We will provide more information by Thursday, 2024-05-16 17:45 US/Pacific. Diagnosis:
Workaround: None at this time. |
| 16 May 2024 | 16:24 PDT | Summary: Programming failures for new Virtual Private Cloud endpoints globally, affecting new GCE VM network programming and Cloud Run using Direct VPC. Description: We are experiencing an issue with Virtual Private Cloud (VPC) beginning on Thursday, 2024-05-16 14:44 US/Pacific. Our engineering team continues to investigate the issue. We will provide an update by Thursday, 2024-05-16 16:57 US/Pacific with current details. Diagnosis: Customers impacted by this issue may see slow programming in the Cloud Networking control plane. New VMs or newly migrated VMs may have delayed network programming. For Cloud Run Direct VPC customers, Cloud Run scaling, including from 0, may not work. Serverless VPC Connectors cannot be created and cannot scale. App Engine Flex deployments cannot be created and cannot scale. Workaround: None at this time. |
- All times are US/Pacific