Service Health
Incident affecting Google Cloud Networking, Cloud Run, Google App Engine, Google Cloud Functions, Apigee, Cloud Load Balancing
Global: Experiencing Issue with Cloud networking
Incident began at 2021-11-16 09:34 and ended at 2021-11-16 11:28 (all times are US/Pacific).
Date | Time | Description | |
---|---|---|---|
| 22 Nov 2021 | 18:29 PST | INCIDENT REPORTIntroductionWe apologize for any impact the service disruption on Tuesday, 16 November 2021, may have had on your organization. Thank you for your patience and understanding as we worked to resolve the issue. We want to share some information about what happened and the steps we are taking to ensure this issue doesn’t occur again. We also want to assure you that this service disruption does not have any bearing on our preparedness or platform reliability going into Black Friday/Cyber Monday (BFCM). Incident SummaryOn Tuesday, 16 November 2021 at 09:35 PT, Google Cloud Networking experienced issues with the Google External Proxy Load Balancing (GCLB) service. Affected customers received Google 404 errors in response to HTTP/S requests.
Google engineers were alerted to the issue via automated alerting at 09:50 PT, which aligned with incoming customer support requests, and we immediately started to mitigate the issue by rolling back to the last known good configuration.
Between 09:35 and 10:08 PT, customers affected by the outage may have encountered 404 errors when accessing any web page (URL) served by Google External Proxy Load Balancing.
A rollback to the last known good configuration completed at 10:08 PT, which resolved the 404 errors. To avoid the risk of a recurrence, our engineers suspended customer-initiated configuration changes in GCLB. As a result, GCLB service customers were unable to make changes to their load balancing configuration between 10:04 and 11:28 PT.
During the change suspension period, we validated the fix to safeguard against recurrence and deployed additional proctoring and monitoring to ensure safe resumption of service. Root CauseThis incident was caused by a bug in the configuration pipeline that propagates customer configuration rules to GCLB. The bug was introduced 6 months ago and allowed a race condition (when behavior depends on the timing of data accesses) that would, in very rare cases, push a corrupted configuration file to GCLB. The GCLB update pipeline contains extensive validation checks to prevent corrupt configurations, but the race condition was one that could corrupt the file near the end of the pipeline. A Google engineer discovered this bug on 12 November, which caused us to declare an internal high-priority incident because of the latent risk to production systems. After analyzing the bug, we froze a part of our configuration system to make the likelihood of the race condition even lower. Since the race condition had existed in the fleet for several months already, the team believed that this extra step made the risk even lower. Thus the team believed the lowest-risk path, especially given the proximity to BFCM, was to roll out fixes in a controlled manner as opposed to a same-day emergency patch. We developed two mitigations: patch A closed the race condition itself; and patch B added additional input validation to the binary receiving the configuration to prevent it from accepting the new configuration, even if the race condition occurred. Both patches were ready and verified to fix the problem by 13 November. Gradual rollouts of both patches started on Monday, 15 November, and patch B completed rollout by that evening. On Tuesday, 16 November, as the patch A rollout was within 30 minutes of completing, the race condition did manifest in an unpatched cluster, and the outage started. Additionally, even though patch B did protect against the kind of input errors observed during testing, the actual race condition produced a different form of error in the configuration, which the completed rollout of patch B did not prevent from being accepted. Once the root cause was identified, our engineers mitigated the issue by restoring a known-good configuration, and completed and verified the fix, which eliminates the risk of recurrence. Service(s) Affected:
Zone(s) Affected:Global How Customers Experienced the Issue:Between 09:35 and 10:08 PT, most endpoints served by global GCLB load balancers returned a 404 error. For an additional 1 hour and 20 minutes, customers were unable to make changes to their load balancing configuration. Workaround(s):None. Service was restored on 16 November 2021 at 11:28 PT, and the Google Cloud Status Dashboard was updated by 12:08 PT to reflect this. Remediation and PreventionWe have fixed the underlying bug and are taking the following actions to prevent recurrence: We immediately added additional alerting, which will notify us to similar issues significantly faster going forward. We are adding safeguards to prevent similar issues from occurring in the future. These safeguards provide strengthened automated correctness-checking to configurations before they are applied. We are accelerating planned architectural changes that will improve how we isolate and resolve such issues in the future. |
| 16 Nov 2021 | 20:56 PST | PRELIMINARY INCIDENT REPORTWe apologize for the inconvenience this service outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Support by opening a case using https://cloud.google.com/support. (All Times US/Pacific) Incident Start: 16 November 2021 09:34 Incident End: 16 November 2021 11:28 Duration: 1 hour, 54 minutes Affected Services and Features:
Regions/Zones: us-central, europe-west1, global Description: Google Cloud Networking experienced issues with Google Cloud Load Balancing (GCLB) service resulting in impact to several downstream Google Cloud services. Impacted customers observed Google 404 errors on their websites. From preliminary analysis, the root cause of the issue was a latent bug in a network configuration service which was triggered during routine system operation. The outage has been root caused and the mitigation fully deployed, with two forms of safeguards protecting against the issue happening in the future. Customer Impact:
|
| 16 Nov 2021 | 12:08 PST | The issue with Cloud Networking has been resolved for all affected projects as of Tuesday, 2021-11-16 11:28 US/Pacific. Customers impacted by the issue may have encountered 404 errors when accessing web pages served by the Google External Proxy Load Balancer between 09:35 and 10:10 US/Pacific. Customer impact from 10:10 to 11:28 US/Pacific was configuration changes to External Proxy Load Balancers not taking effect. As of 11:28 US/Pacific configuration pushes resumed. Google Cloud Run, Google App Engine, Google Cloud Functions, and Apigee were also impacted. We will publish an analysis of this incident, once we have completed our internal investigation. We thank you for your patience while we worked on resolving the issue. |
| 16 Nov 2021 | 11:26 PST | Summary: Global: Experiencing Issue with Cloud networking Description: We believe the issue with Cloud Networking is partially resolved. Customers will be unable to apply changes to their load balancers until the issue is fully resolved. We do not have an ETA for full resolution at this point. We will provide an update by Tuesday, 2021-11-16 12:28 US/Pacific with current details. Diagnosis: Customers impacted by the issue may have encountered 404 errors when accessing web pages served by the Google External Proxy Load Balancer between 09:35 and 10:10 US/Pacific. Customer impact from 10:10 US/Pacific onward is configuration changes to External Proxy Load Balancers not taking effect. Workaround: None at this time. |
| 16 Nov 2021 | 10:17 PST | Summary: Global: Experiencing Issue with Cloud networking Description: We believe the issue with Cloud Networking is partially resolved. Customers will be unable to apply changes to their load balancers until the issue is fully resolved. We do not have an ETA for full resolution at this point. We will provide an update by Tuesday, 2021-11-16 11:28 US/Pacific with current details. Diagnosis: Customers may encounter 404 errors when accessing web pages. Workaround: None at this time. |
| 16 Nov 2021 | 10:10 PST | Summary: Global: Experiencing Issue with Cloud networking Description: We are experiencing an issue with Cloud Networking beginning at Tuesday, 2021-11-16 09:53 US/Pacific. Our engineering team continues to investigate the issue. We will provide an update by Tuesday, 2021-11-16 10:40 US/Pacific with current details. We apologize to all who are affected by the disruption. Diagnosis: Customers may encounter 404 errors when accessing web pages. Workaround: None at this time. |
- All times are US/Pacific