Service Health
Incident affecting Batch, Virtual Private Cloud (VPC), Google Compute Engine, Google Kubernetes Engine, Google Cloud Dataflow, Google Cloud Networking, Google Cloud SQL, Cloud Filestore, Cloud Data Fusion, Google Cloud Dataproc
Multiple cloud products are experiencing networking issues in us-central1
Incident began at 2023-10-05 03:08 and ended at 2023-10-05 10:55 (all times are US/Pacific).
Previously affected location(s)
Iowa (us-central1)
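As a quick cross-check, the begin and end times stated above can be subtracted to confirm the "7 hours, 47 minutes" duration quoted in the incident report. This is a minimal sketch: the helper name `format_duration` is illustrative only, and naive datetimes are assumed to be sufficient because both timestamps fall on the same US/Pacific calendar day.

```python
from datetime import datetime

# Begin and end times as published on this page (US/Pacific).
# Both fall on the same day, so naive datetimes are enough for a subtraction.
FMT = "%Y-%m-%d %H:%M"
began = datetime.strptime("2023-10-05 03:08", FMT)
ended = datetime.strptime("2023-10-05 10:55", FMT)


def format_duration(delta):
    """Render a timedelta as 'H hours, M minutes' (illustrative helper)."""
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours, {minutes} minutes"


duration_text = format_duration(ended - began)
print(duration_text)
```

The mini incident report rounds the same window to 03:00 through 11:00, which is why it quotes 8 hours.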
Date | Time | Description | |
---|---|---|---|
| 11 Oct 2023 | 07:30 PDT | Incident Report. Summary: On 5 October, multiple Google Cloud products experienced networking connectivity issues which impacted new and migrated VMs in the us-central1 region for a duration of 7 hours, 47 minutes. Existing VMs were not directly affected. We sincerely apologize for the impact caused to your business. We have identified the root cause and are taking immediate steps to prevent future failures. Root Cause: The root cause of the issues was a management plane behavior change that had been rolling out slowly across Google Cloud. The aim of the change was to provide better decoupling in processing API updates to GCP Instance Groups and Network Endpoint Groups used as load balancer backends, thus providing better reliability and performance. This change had been rolled out to several regions without incident. However, when it was deployed in us-central1, large workload sizes in the region triggered an unexpected memory increase for the control plane for virtual network routers. The controllers eventually ran out of memory, and although they were automatically restarted, the large workload size meant that they repeated the out-of-memory and restart sequence. Virtual routers and their controllers are deployed into separate zonal failure domains. However, as the management plane change affected a regional API, this extended the issue to all virtual routers in the region, causing synchronized memory pressure and unavailability of controllers. This unavailability of controllers prevented the virtual network routers from being updated with fresh state, such as new VMs, new locations of migrated VMs, dynamic routes, and health state of load balancer backends. As the frequency of out-of-memory events increased, delays in updating router state increased until there was no practical progress being made. Existing VMs that did not migrate and did not change their health state were not affected directly. 
However, traffic to or from these VMs may have passed through a separate affected device such as a VPN Gateway, internal load balancer, or other VM. There are separate sets of virtual routers for intra-region and cross-region traffic, each with their own control plane component. The cross-region routers were affected first and for a longer duration than the intra-region routers. Remediation and Prevention: Google engineers were alerted to slowness in the virtual network control plane in us-central1 on 04 October at 21:45 US/Pacific and immediately started investigations. Initial investigations revealed that slowness was intermittent. At 02:11 US/Pacific on 05 October alerts were received for failures in the virtual network router controllers due to memory exhaustion. Engineers immediately began an attempt to mitigate by allocating more memory. At 03:08 US/Pacific, our networking telemetry began to indicate cross-region packet loss to or from us-central1. By 05:27 US/Pacific, the memory allocation change started to reach production. At 07:00 US/Pacific, the telemetry indicated intra-region packet loss primarily to and from us-central1-c, but it then subsided at 08:15 US/Pacific due to the rollout of the increased memory allocation. At 08:22 US/Pacific, the increased memory usage was correlated with the rollout of the management plane change. At 08:52 US/Pacific, a rollback of the management plane change was started, completing in us-central1 at 09:35 US/Pacific. At this point all out-of-memory events had stopped. While impact had been greatly reduced, a small number of routers were not accepting updates and had to be manually restarted. These restarts did not cause any additional packet loss. By 10:55 US/Pacific all packet loss had stopped and the control plane was processing updates normally. If your service or application was affected, we apologize; this is not the level of quality and reliability we strive to offer you. 
Google is committed to preventing a repeat of this issue in the future and is completing the following actions:
Detailed Description of Impact: On 5 October 2023 from 03:08 to 10:55 US/Pacific, multiple Google Cloud products experienced networking connectivity issues in us-central1. Newly created and recently migrated VMs experienced extended delays before networking became functional. This impacted higher-level workloads that rely on provisioning VMs. Virtual Private Cloud:
Google Kubernetes Engine:
Cloud Data Fusion:
Cloud Filestore:
Cloud SQL:
Cloud Dataproc:
Cloud Dataflow:
Cloud Datastream:
|
| 5 Oct 2023 | 18:24 PDT | Mini Incident Report: We apologize for the inconvenience this service outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support . (All Times US/Pacific) Incident Start: 5 October 2023 03:00 Incident End: 5 October 2023 11:00 Duration: 8 hours Affected Services and Features:
Regions/Zones: us-central1 Description: Multiple Google Cloud products experienced networking connectivity issues which impacted VMs in the us-central1 region for a duration of 8 hours. From preliminary analysis, the issue was due to a recent rollout of the management plane which caused the control plane for some traffic routers to run out of memory. This caused the routing policy in the data plane to become stale. The issue was mitigated by rolling back the management plane change that triggered the issue. The memory allocation for the affected control plane component was increased to prevent recurrence of the issue. Google will complete a full Incident Report in the following days that will provide a detailed root cause. Customer Impact: Virtual Private Cloud:
Google Kubernetes Engine:
Cloud Data Fusion:
Cloud Filestore:
Cloud SQL:
Cloud Dataproc:
Cloud Dataflow:
Any products or services reliant on VM creation may have observed impact for the duration of the incident. We are continuing to investigate and will provide further detail on additional impact in the full Incident Report. |
| 5 Oct 2023 | 11:29 PDT | The issue with Batch, Cloud Data Fusion, Cloud Filestore, Google Cloud Dataflow, Google Cloud Dataproc, Google Cloud Networking, Google Cloud SQL, Google Compute Engine, Google Kubernetes Engine, Virtual Private Cloud (VPC) has been resolved for all affected projects as of Thursday, 2023-10-05 11:12 US/Pacific. We thank you for your patience while we worked on resolving the issue. |
| 5 Oct 2023 | 11:13 PDT | Summary: Multiple cloud products are experiencing networking issues in us-central1 Description: Our engineering team rolled out a mitigation and our internal monitoring shows signs of recovery. We are closely monitoring for full resolution. We do not have an ETA for full resolution at this point. Cloud SQL impact was mitigated at 10:37 US/Pacific. We will provide more information by Thursday, 2023-10-05 11:45 US/Pacific. Diagnosis: Networking Impact:
GKE impact:
Cloud SQL:
Cloud Data Fusion:
Workaround: Customers can use unaffected regions where feasible. |
| 5 Oct 2023 | 10:58 PDT | Summary: Multiple cloud products are experiencing networking issues in us-central1 Description: Our engineering team rolled out a mitigation and our internal monitoring shows signs of recovery. We are closely monitoring for full resolution. We do not have an ETA for full resolution at this point. We will provide more information by Thursday, 2023-10-05 11:45 US/Pacific. Diagnosis: Networking Impact:
GKE impact:
Cloud SQL:
Workaround: Customers can use unaffected regions where feasible. |
| 5 Oct 2023 | 10:35 PDT | Summary: Multiple cloud products are experiencing networking issues in us-central1 Description: Mitigation work is currently underway by our engineering team. We do not have an ETA for mitigation at this point. We will provide more information by Thursday, 2023-10-05 11:10 US/Pacific. Diagnosis: Networking Impact:
GKE impact:
Cloud SQL:
Workaround: Customers can use unaffected regions where feasible. |
| 5 Oct 2023 | 10:21 PDT | Summary: Multiple cloud products are experiencing networking issues in us-central1 Description: Mitigation work is currently underway by our engineering team. We do not have an ETA for mitigation at this point. We will provide more information by Thursday, 2023-10-05 11:00 US/Pacific. Diagnosis: Networking Impact:
GKE impact:
Workaround: Customers can use unaffected regions where feasible. |
| 5 Oct 2023 | 09:43 PDT | Summary: Google Virtual Private Cloud and Google Kubernetes Engine are experiencing issues in us-central1 Description: Mitigation work is currently underway by our engineering team. We do not have an ETA for mitigation at this point. We will provide more information by Thursday, 2023-10-05 10:45 US/Pacific. Diagnosis: Networking Impact:
GKE impact:
Workaround: Customers can use unaffected regions where feasible. |
| 5 Oct 2023 | 09:15 PDT | Summary: Google Virtual Private Cloud is experiencing network issues in us-central1 Description: Mitigation work is currently underway by our engineering team. We do not have an ETA for mitigation at this point. We will provide more information by Thursday, 2023-10-05 10:45 US/Pacific. Diagnosis:
Workaround: Customers can use unaffected regions where feasible. |
| 5 Oct 2023 | 08:59 PDT | Summary: Google Virtual Private Cloud is experiencing network issues in us-central1 Description: Mitigation work is currently underway by our engineering team. We do not have an ETA for mitigation at this point. We will provide more information by Thursday, 2023-10-05 10:30 US/Pacific. Diagnosis: Newly created VMs in us-central1 may experience delays of a few minutes before the networking stack becomes fully operational. Some customers also experienced packet loss for VM network traffic in us-central1 between 07:00 US/Pacific and 08:15 US/Pacific on 2023-10-05. Workaround: Customers can use unaffected regions where feasible. |
| 5 Oct 2023 | 08:35 PDT | Summary: Connectivity issue impacting VMs in us-central1 Description: Mitigation work is currently underway by our engineering team. We do not have an ETA for mitigation at this point. We will provide more information by Thursday, 2023-10-05 10:00 US/Pacific. Diagnosis: Users may experience packet loss in region: us-central1 Workaround: None at this time |
| 5 Oct 2023 | 07:37 PDT | Summary: Connectivity issue impacting newly created VMs in us-central1 Description: Mitigation work is currently underway by our engineering team. We do not have an ETA for mitigation at this point. We will provide more information by Thursday, 2023-10-05 09:00 US/Pacific. Diagnosis: Newly created VMs may experience delays of a few minutes before the networking stack becomes fully operational. Existing VMs should continue working as before. Workaround: Deploy VMs into unaffected zones. |
| 5 Oct 2023 | 07:21 PDT | Summary: Connectivity issue impacting newly created VMs in us-central1 Description: We are experiencing an issue with Google Cloud Networking beginning on Thursday, 2023-10-05 05:14 US/Pacific. Our engineering team continues to investigate the issue. We will provide an update by Thursday, 2023-10-05 07:50 US/Pacific with current details. We apologize to all who are affected by the disruption. Diagnosis: Newly created VMs may experience delays of a few minutes before the networking stack becomes fully operational. Existing VMs should continue working as before. Workaround: Deploy VMs into unaffected zones. |
- All times are US/Pacific
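
The remediation timeline in the incident report is easier to follow when the gaps between events are made explicit. The sketch below is one possible way to tabulate them. The timestamps are copied from the report, while the event labels are paraphrases and the names `TIMELINE`, `minutes_between`, and `gaps` are illustrative assumptions. Naive datetimes are assumed to be adequate because no daylight-saving change falls within 4 to 5 October 2023.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Timestamps are taken from the incident report (US/Pacific);
# the labels are paraphrased summaries, not official wording.
TIMELINE = [
    ("2023-10-04 21:45", "alert: control plane slowness"),
    ("2023-10-05 02:11", "alert: controller memory exhaustion"),
    ("2023-10-05 03:08", "cross-region packet loss begins"),
    ("2023-10-05 05:27", "memory increase starts reaching production"),
    ("2023-10-05 07:00", "intra-region packet loss begins"),
    ("2023-10-05 08:15", "intra-region packet loss subsides"),
    ("2023-10-05 08:22", "memory growth correlated with rollout"),
    ("2023-10-05 08:52", "rollback started"),
    ("2023-10-05 09:35", "rollback complete in us-central1"),
    ("2023-10-05 10:55", "all packet loss stopped"),
]


def minutes_between(start, end):
    """Whole minutes from one timestamp string to another."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


def gaps(timeline):
    """Return (timestamp, label, minutes since previous, minutes since first)."""
    first = timeline[0][0]
    previous = first
    rows = []
    for stamp, label in timeline:
        rows.append((stamp, label,
                     minutes_between(previous, stamp),
                     minutes_between(first, stamp)))
        previous = stamp
    return rows


rows = gaps(TIMELINE)
for stamp, label, since_prev, since_first in rows:
    print(f"{stamp}  +{since_prev:>3} min  (T+{since_first:>3} min)  {label}")
```

Read this way, the rollback began roughly eleven hours after the first slowness alert and took 43 minutes to complete in us-central1.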