Google Cloud Status Dashboard

This page provides status information on the services that are part of Google Cloud Platform. Check back here to view the current status of the services listed below. If you are experiencing an issue not listed here, please contact Support. Learn more about what's posted on the dashboard in this FAQ. For additional information on these services, please visit cloud.google.com.

Google Cloud Networking Incident #18007

We are investigating an issue with increased packet loss in us-central1 with Google Cloud Networking.

Incident began at 2018-05-02 14:02 and ended at 2018-05-02 14:19 (all times are US/Pacific).

Date Time Description
May 08, 2018 08:24

ISSUE SUMMARY

On Wednesday 2 May, 2018 Google Cloud Networking experienced increased packet loss to the internet as well as other Google regions from the us-central1 region for a duration of 21 minutes. We understand that the network is a critical component that binds all services together. We have conducted an internal investigation and are taking steps to improve our service.

DETAILED DESCRIPTION OF IMPACT

On Wednesday 2 May, 2018 from 13:47 to 14:08 PDT, traffic between all zones in the us-central1 region and all destinations experienced 12% packet loss, and traffic between us-central1 zones experienced 22% packet loss. Because the loss was not evenly distributed, customers may still have seen requests succeed to services hosted in us-central1: some connections experienced no loss at all, while others experienced 100% packet loss.

ROOT CAUSE

A control plane is used to manage configuration changes to the network fabric that connects zones in us-central1 to each other and to the Internet. On Wednesday 2 May, 2018, Google Cloud Network engineering began deploying a configuration change through the control plane as part of planned maintenance work. During the deployment, a bad configuration was generated that blackholed a portion of the traffic flowing over the fabric.

A bug in the control plane caused it to generate an incorrect configuration. New configurations deployed to the network fabric are evaluated for correctness and regenerated if an error is found. In this case, the error was introduced after the configuration had been evaluated, so the erroneous configuration was deployed to the network fabric.
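The failure mode described above resembles a time-of-check/time-of-use gap: the configuration that was validated was not the configuration that was ultimately pushed. The sketch below is purely illustrative (the function and field names are hypothetical, not Google's actual control plane API) and shows why validating the exact artifact being deployed closes that window:

```python
# Hypothetical sketch of the validate-then-deploy ordering gap.
# All names and data shapes are illustrative, not Google's control plane.

def validate(config: dict) -> bool:
    """Reject configurations that would blackhole traffic
    (here: any route missing a next hop)."""
    return all(route.get("next_hop") is not None for route in config["routes"])

def deploy_unsafe(generate, push):
    """Validate a snapshot, then push a later regeneration:
    an error introduced between the two steps goes undetected."""
    snapshot = generate()
    assert validate(snapshot)    # check passes on the snapshot...
    config = generate()          # ...but a different artifact is produced
    push(config)                 # and pushed without re-validation

def deploy_safe(generate, push):
    """Validate the exact artifact that will be deployed."""
    config = generate()
    if not validate(config):
        raise ValueError("invalid configuration; regenerate instead of pushing")
    push(config)                 # what was validated is what is deployed
```

Under this toy model, `deploy_unsafe` mirrors the incident (evaluation happened, but the error appeared afterward), while `deploy_safe` rejects the bad artifact before it reaches the fabric.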

REMEDIATION AND PREVENTION

Automated monitoring alerted engineering teams 2 minutes after the loss started. Google engineers correlated the alerts to the configuration push and routed traffic away from the affected part of the fabric. Mitigation completed 21 minutes after loss began, ending impact to customers.

After isolating the root cause, engineers then audited all configuration changes that were generated by the control plane and replaced them with known-good configurations.

To prevent this from recurring, we will correct the control plane defect that generated the incorrect configuration, and we are adding validation at the fabric layer to more robustly detect configuration errors. Additionally, we intend to add logic to the network control plane so that it can self-heal by automatically routing traffic away from parts of the network fabric that are in an error state. Finally, we plan to evaluate further isolation of control plane configuration changes to reduce the size of the possible failure domain.
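As an illustration of the "additional validation at the fabric layer" idea, a receiving fabric could re-check each pushed configuration itself and fall back to its last known-good state rather than blackhole traffic. This is a hypothetical sketch under assumed data shapes; none of the names come from Google's systems:

```python
# Illustrative fabric-side guard: the fabric does not trust the control
# plane's validation and re-checks every pushed configuration itself.
# Names and structures are hypothetical.

def apply_config(fabric_state: dict, config: dict) -> dict:
    """Apply a pushed configuration only if it passes the fabric's own
    check; otherwise keep the last known-good state."""
    blackholing = [r for r in config["routes"] if r.get("next_hop") is None]
    if blackholing:
        # Reject rather than blackhole: the fabric keeps serving traffic
        # with its current (known-good) routes.
        return fabric_state
    return {**fabric_state, "routes": config["routes"]}
```

The key design choice this models is defense in depth: even if a defective configuration slips past control-plane validation (as happened here), a second, independent check at the point of application limits the blast radius.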

Again, we would like to apologize for this issue. We are taking immediate steps to improve the platform’s performance and availability.

May 02, 2018 14:19

The issue with Google Cloud Networking having increased packet loss in us-central1 has been resolved for all affected users as of Wednesday, 2018-05-02 14:10 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence.

May 02, 2018 14:02

We are investigating an issue with Google Cloud Networking. We will provide more information by 14:45 US/Pacific.