Service Health
Incident affecting Google Cloud Networking, Google Compute Engine, Google Kubernetes Engine, Google App Engine, Cloud Spanner, Cloud Load Balancing, Hybrid Connectivity
Increased latency and packet loss
Incident began at 2021-03-17 08:20 and ended at 2021-03-17 12:50 (all times are US/Pacific).
| Date | Time | Description |
|---|---|---|
| 23 Mar 2021 | 09:38 PDT | ISSUE SUMMARY
On Wednesday 17 March 2021, Google Cloud Networking and Cloud Services that depend on Google's backbone network experienced a service disruption that resulted in increased latency, packet loss, and service unavailable errors for some services for a duration of 3 hours, 39 minutes. Cloud Interconnect had an extended impact duration of 4 hours, 30 minutes. We understand that this issue has impacted our valued customers and users, and we apologize to those who were affected.

ROOT CAUSE
Google Cloud datacenters connect to Google's global backbone network through a datacenter edge networking stack that uses routers to bridge a region's network with the global backbone network. Routers in Google's network serve multiple roles: some are dedicated to providing connectivity to the Google backbone network, while others provide aggregation for customer and peering routes. Google uses routers from multiple vendors to provide defense in depth and to reduce the impact of issues that affect a specific router vendor.

The trigger for this service disruption was a new set of routers being connected to Google's backbone network as part of the normal router build process. The routers were part of a new network topology that changed which routes some router roles received. This change in topology inadvertently caused the associated routes to be communicated to the routers responsible for providing connectivity to the Google backbone, as well as to the aggregation routers. This triggered a defect in routers of a specific model, causing their routing process to fail. We previously communicated that this defect was unknown; this was incorrect: further investigation found that the defect was previously known, but it was not known to affect routers in these roles. During a routing failure, these routers are configured to automatically redirect traffic away from themselves to minimize congestion and traffic loss; however, this results in some packet loss while the network reconverges onto new paths. This behavior worked as intended to reduce the potential impact of the issue, as repeated widespread routing process failures have the potential to create cascading failures in the backbone network.

REMEDIATION AND PREVENTION
Once the nature and scope of the problem became clear, Google engineers isolated the new set of routers from the network to prevent invalid routes from being sent to the backbone routers. Once it was confirmed that the affected routers were healthy, no longer had invalid routes, and impact for most services had ended, engineers began work to return traffic to the routers that had rebooted. During this mitigation work to return traffic to a large number of routers, congestion caused a temporary period of increased loss and latency. Once the rebooted routers were back in the traffic path and the network had reconverged, the incident was considered mitigated.

In addition to fixing the underlying cause that allowed invalid routes to trigger routing process failures on specific routers, and repairing the bug in the vendor OS, we will implement changes to prevent and reduce the impact of this type of failure in several ways:
1. Improve internal tooling for redirecting traffic away from routers, to reduce time to mitigation for issues with widespread network impact.
2. Improve the testing and release process for new router builds to ensure that topology changes for router roles are identified before the routers are connected to the backbone network.

In addition to these changes, we are also working on long-term architectural changes to help prevent issues of this type in the future. These changes will create well-defined functional domains in the backbone network, allowing more consistent enforcement of route policies and limiting the scale of potential impact. These policies would provide better systematic protection against invalid routes propagating through the network. (Illustrative sketches of role-based route filtering and of a staged traffic-return procedure appear after the update history below.)

DETAILED DESCRIPTION OF IMPACT
On Wednesday 17 March 2021 from 08:20 to 12:50 US/Pacific, Google Cloud Networking experienced increased latency, packet loss, and service unavailable errors for traffic between regions and from Google to external endpoints, including on-premises environments and the public internet. The issue was mitigated when the source of the invalid routes was isolated by 11:00, and work to redirect traffic back to the affected routers began. During this time, the manual one-time mitigation to return traffic to a large number of routers caused a temporary period of congestion, leading to increased loss and latency between 11:13 and 11:24. Additionally, a set of faulty routers in us-east4 incorrectly had traffic routed to them as part of this mitigation work, which resulted in further impact to the network in that region between 11:08 and 11:59. Finally, Cloud Interconnect had extended impact until 12:50 due to a lack of router vendor redundancy in some interconnect locations.

Compute Engine
VM to VM inter-regional traffic via public IPs experienced intermittent packet loss with peaks of up to 42.6%. VM to VM inter-regional traffic via private IPs experienced intermittent packet loss with peaks of up to 15.6%. Connectivity within a zone was not impacted. The period of impact was between 08:20 and 11:59.

Cloud Load Balancing
Requests to Cloud load balancers and Cloud CDN experienced increased timeouts, errors of up to 25%, and additional latency between 08:20 and 10:43.

Cloud Interconnect
Cloud Interconnect experienced intermittent packet loss for interconnects, with peaks of up to 100% loss lasting several minutes between 08:20 and 12:50. Some interconnect locations did not have sufficient vendor redundancy for routers, resulting in a longer period of impact during traffic redirection.

Cloud VPN
Up to 13.5% of Cloud VPN tunnels experienced intermittent connectivity loss between 08:29 and 10:46.

Kubernetes Engine
Zonal clusters were impacted by the issues in Cloud Networking and Compute Engine, and experienced increased latency, errors, and packet loss for requests to the Kubernetes control plane. Regional clusters were unaffected due to master node redundancy.

App Engine Flex
Some App Engine Flex traffic experienced increased latency during the incident.

Cloud Spanner
Cloud Spanner experienced increased latency and increased DEADLINE_EXCEEDED errors in us-central1, us-east1, and europe-west1 from 08:50 to 11:45.

ADDITIONAL DETAILS
The end time used in the preliminary incident statement was incorrect, as some services had lingering impact after the issue was mitigated. The corrected impact times are contained in the report above. |
| 17 Mar 2021 | 14:25 PDT | Preliminary Incident Statement while full Incident Report is prepared. (All Times US/Pacific)
Affected: Google Cloud Networking, Google Compute Engine, Google Kubernetes Engine, Google App Engine, Cloud Spanner, Cloud Load Balancing, Hybrid Connectivity.
Description: Google Cloud Platform experienced a multi-region service disruption resulting in intermittent connectivity issues. The root cause was an invalid route that triggered a previously unknown defect in routers of a specific vendor in Google's backbone network, causing some to reboot. We have mitigated the issue by isolating the origin of the invalid route and have observed networks stabilize.
Customer Impact: Google Cloud Networking experienced increased latency, packet loss, and service unavailable errors for traffic between regions and from Google to external endpoints, including on-premises environments and the internet. Connectivity within a zone was not impacted.
Additional Details: We communicated during the disruption that Google Workspace was impacted; however, upon further analysis we have determined that the impact to Google Workspace was negligible. Google systems are built with a defense-in-depth strategy in mind. In this disruption, a diverse set of router vendors in our network helped reduce the length and severity of this issue. We will publish an analysis of this incident once we have completed our internal investigation. |
| 17 Mar 2021 | 11:16 PDT | Our engineering team has implemented a mitigation and is now confident that the issue will not recur. Impact to Cloud Networking began improving at 10:45 US/Pacific, with complete resolution for all users by 11:00 US/Pacific. We will publish an analysis of this incident once we have completed our internal investigation. We thank you for your patience while we worked on resolving the issue. |
| 17 Mar 2021 | 10:55 PDT | Description: We are experiencing an intermittent issue with Google Cloud Networking impacting traffic between regions and between Google and the internet. The issue started occurring intermittently at 08:26 US/Pacific. The issue is impacting Google's backbone network and may impact various services when accessing them from a different region or from the internet. Impacted services include Cloud Services (Workspace, Firebase, GCP) as well as other Google properties. Connectivity within a zone should not be impacted. Our engineering team has implemented a mitigation and is now monitoring the effectiveness of the change. We will provide an update by Wednesday, 2021-03-17 11:30 US/Pacific with current details. Diagnosis: Customers impacted by this issue may see high latency, packet loss, and service unavailable errors. The issue manifests itself intermittently, and different regions and locations may experience the issue at different times. Workaround: None at this time. |
| 17 Mar 2021 | 10:14 PDT | Description: We are experiencing an intermittent issue with Google Cloud Networking impacting traffic between regions and from Google to the internet. The issue started occurring intermittently at 08:26 US/Pacific. The issue is impacting Google’s Backbone network and may impact various services when accessing them from a different region or from the internet. Connectivity within a region should not be impacted. The symptom is understood and our engineering team continues to investigate the issue to find triggers and mitigation. We will provide an update by Wednesday, 2021-03-17 11:00 US/Pacific with current details. Diagnosis: Customers impacted by this issue may see high latency, packet loss, and service unavailable errors. The issue manifests itself intermittently and different regions and locations may experience the issue at different times. Workaround: None at this time. |
| 17 Mar 2021 | 09:23 PDT | Description: We are experiencing an intermittent issue with Cloud Networking. Our engineering team continues to investigate the issue. We will provide an update by Wednesday, 2021-03-17 11:00 US/Pacific with current details. Diagnosis: High latency, packet loss, and service unavailable errors. Workaround: None at this time. |
- All times are US/Pacific
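
The incident report above attributes the disruption to routes from a new topology reaching router roles that were never meant to receive them, and proposes well-defined functional domains with consistently enforced route policies as a long-term fix. The sketch below is a minimal illustration of that idea, not Google's implementation: the router roles, route "domains", and export filter are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum, auto

# Hypothetical router roles and route domains; the real backbone policy
# model is not public, so these names are illustrative only.
class Role(Enum):
    BACKBONE = auto()      # provides connectivity to the backbone network
    AGGREGATION = auto()   # aggregates customer and peering routes

class Domain(Enum):
    BACKBONE_INFRA = auto()
    CUSTOMER_PEERING = auto()

# Which route domains each router role is allowed to receive.
ALLOWED_DOMAINS = {
    Role.BACKBONE: {Domain.BACKBONE_INFRA},
    Role.AGGREGATION: {Domain.BACKBONE_INFRA, Domain.CUSTOMER_PEERING},
}

@dataclass(frozen=True)
class Route:
    prefix: str
    domain: Domain

def export_routes(routes, receiver_role):
    """Apply a role-based export policy: only advertise routes whose
    domain is permitted for the receiving router role."""
    allowed = ALLOWED_DOMAINS[receiver_role]
    accepted = [r for r in routes if r.domain in allowed]
    rejected = [r for r in routes if r.domain not in allowed]
    return accepted, rejected

if __name__ == "__main__":
    # Routes introduced by a (hypothetical) new topology.
    new_routes = [
        Route("10.0.0.0/8", Domain.CUSTOMER_PEERING),
        Route("172.16.0.0/12", Domain.BACKBONE_INFRA),
    ]
    accepted, rejected = export_routes(new_routes, Role.BACKBONE)
    # The customer/peering route never reaches a backbone-role router,
    # so a defect specific to that role cannot be triggered by it.
    print("advertised to backbone role:", [r.prefix for r in accepted])
    print("filtered by policy:", [r.prefix for r in rejected])
```

Under this kind of policy, a topology change that introduces routes of the wrong domain is dropped at the boundary of the functional domain rather than being propagated to routers that were never intended to carry them.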
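
The report also notes that returning traffic to a large number of rebooted routers at once caused a brief period of congestion, and lists improved internal traffic-redirection tooling as a remediation item. The sketch below illustrates one generic way to reduce that risk: restoring drained devices in small batches with a settling period and a health check between batches. All names here (Router, restore_traffic, link_utilization_ok, the batch size) are hypothetical and stand in for whatever the real tooling does.

```python
import time
from dataclasses import dataclass

@dataclass
class Router:
    name: str
    drained: bool = True  # traffic is currently redirected away

    def restore_traffic(self):
        # Placeholder for the real action, e.g. re-advertising routes or
        # raising routing preference so traffic returns to the device.
        self.drained = False

def link_utilization_ok():
    # Placeholder health check; real tooling would query link utilization
    # and loss metrics before allowing the next batch to proceed.
    return True

def restore_in_batches(routers, batch_size=2, settle_seconds=30):
    """Return drained routers to service gradually so the network can
    reconverge between batches instead of absorbing all traffic at once."""
    for i in range(0, len(routers), batch_size):
        for router in routers[i:i + batch_size]:
            router.restore_traffic()
        time.sleep(settle_seconds)          # let routing reconverge
        if not link_utilization_ok():
            raise RuntimeError("congestion detected; pausing restoration")

if __name__ == "__main__":
    fleet = [Router(f"edge-{n}") for n in range(8)]
    restore_in_batches(fleet, batch_size=2, settle_seconds=1)
    print("restored:", [r.name for r in fleet if not r.drained])
```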