Google Cloud Status Dashboard

Incident affecting Google Cloud Networking

Global: Experiencing Issue with Cloud networking

Incident began at 2021-11-16 09:34 and ended at 2021-11-16 11:28 (all times are US/Pacific).

Date Time Description
22 Nov 2021 18:29 PST

INCIDENT REPORT

Introduction

We apologize for any impact the service disruption on Tuesday, 16 November 2021, may have had on your organization. Thank you for your patience and understanding as we worked to resolve the issue. We want to share some information about what happened and the steps we are taking to ensure this issue doesn’t occur again. We also want to assure you that this service disruption does not have any bearing on our preparedness or platform reliability going into Black Friday/Cyber Monday (BFCM).

Incident Summary

On Tuesday, 16 November 2021 at 09:35 PT, Google Cloud Networking experienced issues with the Google External Proxy Load Balancing (GCLB) service. Affected customers received Google 404 errors in response to HTTP/S requests. Google engineers were alerted to the issue via automated alerting at 09:50 PT, which aligned with incoming customer support requests, and we immediately started to mitigate the issue by rolling back to the last known good configuration.

Between 09:35 and 10:08 PT, customers affected by the outage may have encountered 404 errors when accessing any web page (URL) served by Google External Proxy Load Balancing. A rollback to the last known good configuration completed at 10:08 PT, which resolved the 404 errors.

To avoid the risk of a recurrence, our engineers suspended customer-initiated configuration changes in GCLB. As a result, GCLB service customers were unable to make changes to their load balancing configuration between 10:04 and 11:28 PT. During the change suspension period, we validated the fix to safeguard against recurrence and deployed additional proctoring and monitoring to ensure safe resumption of service. By 11:28 PT, customer configuration pushes resumed, and normal service was restored. The total duration of impact was 1 hour and 53 minutes.
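
The durations quoted in this summary can be checked directly from its timestamps. The following is a minimal sketch for verifying the arithmetic only; the helper and variable names are illustrative and do not come from the report.

```python
from datetime import datetime

DAY = "2021-11-16 "
FMT = "%Y-%m-%d %H:%M"

def at(hhmm):
    """Parse a US/Pacific wall-clock time on the incident day."""
    return datetime.strptime(DAY + hhmm, FMT)

errors_start = at("09:35")    # 404 errors begin
rollback_done = at("10:08")   # last known good configuration restored
pushes_resumed = at("11:28")  # customer configuration pushes resume

error_window = rollback_done - errors_start      # period of 404 errors
recovery_window = pushes_resumed - rollback_done  # traffic served, changes suspended
total_impact = pushes_resumed - errors_start

print(error_window, recovery_window, total_impact)  # 0:33:00 1:20:00 1:53:00
```

The 33-minute error window plus the 1 hour and 20 minutes that followed gives the stated total of 1 hour and 53 minutes.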

Root Cause

This incident was caused by a bug in the configuration pipeline that propagates customer configuration rules to GCLB. The bug was introduced 6 months ago and allowed a race condition (when behavior depends on the timing of data accesses) that would, in very rare cases, push a corrupted configuration file to GCLB. The GCLB update pipeline contains extensive validation checks to prevent corrupt configurations, but the race condition was one that could corrupt the file near the end of the pipeline.
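
To illustrate how a configuration can be corrupted after it has already passed validation, here is a minimal sketch. It is not Google's pipeline: the `validate` rule, the configuration shape, and the `before_push` hook (which stands in for a concurrent writer winning the race) are all assumptions made for the example.

```python
import copy

def validate(config):
    """Hypothetical rule: every route must name a backend."""
    return all(route.get("backend") for route in config["routes"])

def push_shared(config, before_push=lambda: None):
    """Racy: validates the shared object, then pushes that same object."""
    if not validate(config):
        raise ValueError("invalid configuration")
    before_push()  # stands in for a concurrent writer winning the race
    return copy.deepcopy(config)  # what the receiver ends up with

def push_snapshot(config, before_push=lambda: None):
    """Safer: validates and pushes a private snapshot nobody else can touch."""
    snapshot = copy.deepcopy(config)
    if not validate(snapshot):
        raise ValueError("invalid configuration")
    before_push()
    return snapshot

def demo(push):
    """Return True if the configuration that was pushed is still valid."""
    config = {"routes": [{"host": "example.com", "backend": "pool-1"}]}

    def corrupt():
        config["routes"][0]["backend"] = None  # late write to shared state

    return validate(push(config, before_push=corrupt))

print(demo(push_shared), demo(push_snapshot))  # False True
```

In the racy variant, the validation check passes and the corruption still reaches the receiver, because the write lands between the check and the push. Validating and pushing the same private snapshot removes that gap.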

A Google engineer discovered this bug on 12 November, which caused us to declare an internal high-priority incident because of the latent risk to production systems. After analyzing the bug, we froze a part of our configuration system to further reduce the likelihood of the race condition. Because the race condition had already existed in the fleet for several months and the freeze reduced the risk further, the team judged that the lowest-risk path, especially given the proximity to BFCM, was to roll out fixes in a controlled manner rather than as a same-day emergency patch.

We developed two mitigations: patch A closed the race condition itself, and patch B added input validation to the binary receiving the configuration, so that it would reject a corrupted configuration even if the race condition occurred.

Both patches were ready and verified to fix the problem by 13 November. Gradual rollouts of both patches started on Monday, 15 November, and patch B completed rollout by that evening. On Tuesday, 16 November, as the patch A rollout was within 30 minutes of completing, the race condition did manifest in an unpatched cluster, and the outage started.

Additionally, even though patch B did protect against the kind of input errors observed during testing, the actual race condition produced a different form of error in the configuration, which the completed rollout of patch B did not prevent from being accepted.
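
Validation that targets specific error patterns can miss a form of corruption it was not written for. One general technique that does not depend on the form of the error is to record a digest of the validated bytes and have the receiver verify it. The sketch below illustrates that technique only; it is not a description of patch B or of Google's fix, and the function names and configuration shape are hypothetical. It also only helps when the corruption happens after the digest is taken.

```python
import hashlib
import json

def seal(config):
    """Serialize a validated config and record a digest of the exact bytes."""
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return payload, hashlib.sha256(payload).hexdigest()

def accept(payload, expected_digest):
    """Receiver side: refuse any payload whose bytes changed after sealing."""
    if hashlib.sha256(payload).hexdigest() != expected_digest:
        raise ValueError("configuration digest mismatch; keeping current config")
    return json.loads(payload)

payload, digest = seal({"routes": [{"host": "example.com", "backend": "pool-1"}]})
accepted = accept(payload, digest)  # unchanged bytes are accepted

try:
    accept(payload.replace(b"pool-1", b""), digest)  # simulated corruption
except ValueError as err:
    print("rejected:", err)
```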

Once the root cause was identified, our engineers mitigated the issue by restoring a known-good configuration, and completed and verified the fix, which eliminates the risk of recurrence.

Service(s) Affected:

  • Google Cloud Networking: Customer HTTP/S endpoints served 404 error pages. During partial recovery, traffic was served, but customers were unable to make changes to their load balancer configurations.
  • GCLB can be used to load balance traffic to a number of other Google Cloud services, which lost traffic because of the outage. Customers who use serverless network endpoint groups on GCLB as a frontend to Google Cloud Run, Google App Engine, Google App Engine Flex, or Google Cloud Functions received 404 errors when attempting to access their service. Customers using Apigee, Firebase, or Google App Engine Flex received 404 errors when attempting to access their service.

Zone(s) Affected:

Global

How Customers Experienced the Issue:

Between 09:35 and 10:08 PT, most endpoints served by global GCLB load balancers returned a 404 error. For an additional 1 hour and 20 minutes, customers were unable to make changes to their load balancing configuration.

Workaround(s):

None.

Service was restored on 16 November 2021 at 11:28 PT, and the Google Cloud Status Dashboard was updated by 12:08 PT to reflect this.

Remediation and Prevention

We have fixed the underlying bug and are taking the following actions to prevent recurrence:

  • We immediately added additional alerting, which will notify us of similar issues significantly faster going forward.
  • We are adding safeguards to prevent similar issues from occurring in the future. These safeguards provide strengthened automated correctness-checking of configurations before they are applied.
  • We are accelerating planned architectural changes that will improve how we isolate and resolve such issues in the future.
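
The two ideas that recur in this report are checking a configuration before it is applied and rolling back to a known-good configuration. They can be sketched together as follows. This is an illustrative toy under stated assumptions, not Google's implementation: the `ConfigStore` class, the `has_backends` check, and the configuration shape are all hypothetical.

```python
class ConfigStore:
    """Keeps the active config plus the previously active one to roll back to."""

    def __init__(self, initial, validate):
        if not validate(initial):
            raise ValueError("initial configuration is invalid")
        self._validate = validate
        self.active = initial
        self.last_known_good = initial

    def apply(self, candidate):
        """Apply a candidate only if it passes the correctness check."""
        if not self._validate(candidate):
            return False  # rejected; the active config is untouched
        self.last_known_good = self.active
        self.active = candidate
        return True

    def rollback(self):
        """Restore the previous good config, for example after alerts fire."""
        self.active = self.last_known_good
        return self.active

def has_backends(config):
    """Hypothetical check: every route must name a backend."""
    return all(route.get("backend") for route in config["routes"])

good = {"routes": [{"host": "a.example", "backend": "pool-1"}]}
bad = {"routes": [{"host": "a.example", "backend": None}]}
newer = {"routes": [{"host": "a.example", "backend": "pool-2"}]}

store = ConfigStore(good, has_backends)
rejected = store.apply(bad)     # fails the check, so nothing changes
applied = store.apply(newer)    # passes the check and becomes active
restored = store.rollback()     # returns to the config that was active before
```

A check like this catches only the errors it knows how to detect, which is why the sketch keeps the rollback path as well.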

16 Nov 2021 20:56 PST

PRELIMINARY INCIDENT REPORT

We apologize for the inconvenience this service outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Support by opening a case using https://cloud.google.com/support.

(All Times US/Pacific)

Incident Start: 16 November 2021 09:34

Incident End: 16 November 2021 11:28

Duration: 1 hour, 54 minutes

Affected Services and Features:

  • Google Cloud Networking
  • Google Cloud Functions
  • Google Cloud Run
  • Google App Engine
  • Google App Engine Flex
  • Apigee
  • Firebase

Regions/Zones:

us-central, europe-west1, global

Description:

Google Cloud Networking experienced issues with Google Cloud Load Balancing (GCLB) service resulting in impact to several downstream Google Cloud services. Impacted customers observed Google 404 errors on their websites. From preliminary analysis, the root cause of the issue was a latent bug in a network configuration service which was triggered during routine system operation.

The root cause of the outage has been identified and the mitigation fully deployed, with two forms of safeguards protecting against the issue happening in the future.

Customer Impact:

  • Google Cloud Networking – Customers were unable to make changes to their website load balancing and their websites served 404 error pages.
  • Google Cloud Functions – Customers who use GCLB service received 404 errors when attempting to access their service.
  • Google Cloud Run – Observed a 25% decrease in traffic in us-central1. Customers who use GCLB service received 404 errors when attempting to access their service.
  • Google App Engine – Observed 80% decrease in traffic in us-central and europe-west1. Customers who use GCLB service received 404 errors when attempting to access their service.
  • Google App Engine Flex – Customers who use GCLB received 404 errors when attempting to access their service and customer deployments experienced failures.
  • Apigee – Customers who use GCLB received 404 errors for runtime requests.
  • Google Firebase – Customers who use GCLB service received 404 errors when attempting to access their service.

16 Nov 2021 12:08 PST

The issue with Cloud Networking has been resolved for all affected projects as of Tuesday, 2021-11-16 11:28 US/Pacific.

Customers impacted by the issue may have encountered 404 errors when accessing web pages served by the Google External Proxy Load Balancer between 09:35 and 10:10 US/Pacific.

From 10:10 to 11:28 US/Pacific, configuration changes to External Proxy Load Balancers did not take effect. Configuration pushes resumed at 11:28 US/Pacific.

Google Cloud Run, Google App Engine, Google Cloud Functions, and Apigee were also impacted.

We will publish an analysis of this incident once we have completed our internal investigation.

We thank you for your patience while we worked on resolving the issue.

16 Nov 2021 11:26 PST

Summary: Global: Experiencing Issue with Cloud networking

Description: We believe the issue with Cloud Networking is partially resolved.

Customers will be unable to apply changes to their load balancers until the issue is fully resolved.

We do not have an ETA for full resolution at this point.

We will provide an update by Tuesday, 2021-11-16 12:28 US/Pacific with current details.

Diagnosis: Customers impacted by the issue may have encountered 404 errors when accessing web pages served by the Google External Proxy Load Balancer between 09:35 and 10:10 US/Pacific.

From 10:10 US/Pacific onward, configuration changes to External Proxy Load Balancers are not taking effect.

Workaround: None at this time.

16 Nov 2021 10:17 PST

Summary: Global: Experiencing Issue with Cloud networking

Description: We believe the issue with Cloud Networking is partially resolved.

Customers will be unable to apply changes to their load balancers until the issue is fully resolved.

We do not have an ETA for full resolution at this point.

We will provide an update by Tuesday, 2021-11-16 11:28 US/Pacific with current details.

Diagnosis: Customers may encounter 404 errors when accessing web pages.

Workaround: None at this time.

16 Nov 2021 10:10 PST

Summary: Global: Experiencing Issue with Cloud networking

Description: We are experiencing an issue with Cloud Networking that began on Tuesday, 2021-11-16 at 09:53 US/Pacific.

Our engineering team continues to investigate the issue.

We will provide an update by Tuesday, 2021-11-16 10:40 US/Pacific with current details.

We apologize to all who are affected by the disruption.

Diagnosis: Customers may encounter 404 errors when accessing web pages.

Workaround: None at this time.