Service Health

This page provides status information on the services that are part of Google Cloud. Check back here to view the current status of the services listed below. If you are experiencing an issue not listed here, please contact Support. Learn more about what's posted on the dashboard in this FAQ. For additional information on these services, please visit https://cloud.google.com/.

Incident affecting Google Cloud Networking, Cloud Load Balancing

Global: Elevated HTTP 4xx Errors on External Application Load Balancer

Incident began at 2023-09-21 23:30 and ended at 2023-09-22 23:34 (all times are US/Pacific).

Previously affected location(s)

Global

Date Time Description
28 Sep 2023 10:56 PDT

Incident Report

Summary

On Thursday, 21 September 2023, Cloud Load Balancing experienced a configuration propagation issue resulting in a failure to serve traffic for some customers. To our Cloud Load Balancing customers whose businesses were impacted during this disruption, we sincerely apologize. This is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform’s performance and availability.

Root Cause

The root cause of the issue was a change in the leader election process of a component of Google Cloud Load Balancer’s configuration management pipeline. This change triggered a race condition in a downstream component in the pipeline, resulting in overwriting some customer configurations with empty configurations.

This change to leader election was part of a rollout to increase the resiliency of the system. Google engineers rolled out the change gradually and used a canary process to gate progress on correct functionality; however, our automated rollout verification did not detect this issue because, in isolation, the component being rolled out was performing correctly. While the rollout was in progress, our dataplane monitoring detected the issue through the increased rate of 4xx errors being served.
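To illustrate the class of failure described above, the following is a minimal sketch of one possible guard against a non-empty configuration being replaced by an empty one. It is not Google's pipeline or its fix: `ConfigStore`, `apply`, and `allow_empty` are hypothetical names invented for this example, and the scenario of a newly elected leader publishing an empty snapshot is an assumption used only to make the race concrete.

```python
from dataclasses import dataclass, field


@dataclass
class ConfigStore:
    """Toy in-memory store; every name here is hypothetical."""

    configs: dict = field(default_factory=dict)

    def apply(self, project: str, new_config: dict, *, allow_empty: bool = False) -> bool:
        """Write new_config unless it would blank out an existing non-empty config.

        Returns True if the write was applied, False if it was rejected.
        """
        current = self.configs.get(project)
        if current and not new_config and not allow_empty:
            # Assumed scenario: a writer that has not yet loaded state
            # publishes an empty snapshot. Refuse it rather than overwrite.
            return False
        self.configs[project] = new_config
        return True


if __name__ == "__main__":
    store = ConfigStore()
    print(store.apply("project-a", {"backend": "neg-1"}))  # first write is accepted
    print(store.apply("project-a", {}))  # empty overwrite is rejected
    print(store.configs["project-a"])  # original config is still in place
```

An explicit `allow_empty=True` keeps intentional deletions possible while making accidental ones fail closed.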

Remediation and Prevention

Google engineers were alerted to the issue on 22 September 2023 at 04:51 US/Pacific by dataplane monitoring for HTTP 4xx responses and froze the configuration pipeline to avoid further impact.

At 05:15 US/Pacific, Google engineers reinstated a snapshot of known-good configuration for the impacted configuration type, mitigating the dataplane errors for which we had been alerted. At 07:15 US/Pacific, the configuration pipeline was unfrozen, following our standard process. However, our validations identified that an underlying issue was still present and that impact to the data plane had resurfaced after the configuration pipeline was unfrozen.

Between 07:15 and 12:40, Google engineers focused on root cause investigation and continuous manual mitigation of impacted project configurations, which were being affected by natural leader election flips. During this time we identified the root cause component but, due to the nature of the issue, were unable to follow our standard rollback procedure. This resulted in a delay until 16:58, when Google engineers started a custom rollback of the relevant management plane component, which was completed at 23:34.
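As a cross-check of the timeline above, the sketch below computes the elapsed time between the reported milestones. The timestamps are copied from this report; the event labels and the helper names `gaps` and `total` are illustrative choices, not part of the original report. Naive datetimes are used on the assumption that all times are US/Pacific and that no daylight-saving transition falls inside the window.

```python
from datetime import datetime

# Timestamps taken from this report (all US/Pacific).
EVENTS = [
    ("incident start", datetime(2023, 9, 21, 23, 30)),
    ("alert received", datetime(2023, 9, 22, 4, 51)),
    ("known-good snapshot reinstated", datetime(2023, 9, 22, 5, 15)),
    ("pipeline unfrozen", datetime(2023, 9, 22, 7, 15)),
    ("end of investigation window", datetime(2023, 9, 22, 12, 40)),
    ("custom rollback started", datetime(2023, 9, 22, 16, 58)),
    ("custom rollback completed", datetime(2023, 9, 22, 23, 34)),
]


def gaps(events):
    """Return (from_label, to_label, timedelta) for each consecutive pair."""
    return [(a[0], b[0], b[1] - a[1]) for a, b in zip(events, events[1:])]


def total(events):
    """Return the time between the first and last event."""
    return events[-1][1] - events[0][1]


if __name__ == "__main__":
    for start, end, delta in gaps(EVENTS):
        print(f"{start} -> {end}: {delta}")
    print("total:", total(EVENTS))
```

The total works out to 1 day and 4 minutes, matching the incident duration stated in this report.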

Once this rollback was completed, Google engineers performed additional validations by triggering leader election changes and ensuring that these changes no longer caused the unintended behavior.

Affected customers were advised to try making changes to the configuration of their External Application Load Balancer, which resolved URL loading failures for impacted customers. We apologize for the length and severity of this incident. We are taking the following steps to prevent a recurrence and improve reliability:

  • Identify and resolve the underlying race condition in our configuration lifecycle that resulted in the unintended modifications.
  • Improve tooling and processes to roll back components in the configuration management system more quickly and safely.
  • Improve Google's internal monitoring and visualization to better assess and understand the source of 4xx response alerting and expedite detection.
  • The Configuration Management System used by the Global Load Balancer currently operates at a project level within the global scope. This architecture was a key reason customers experienced global impact. We had already initiated a program to move these operations to a regional level, in every region, as part of our ongoing reliability improvement roadmap. This would limit the impact of a similar outage to a single region rather than making it global.

Detailed Description of Impact

Starting at 23:30 on 21 September 2023, some customers experienced a global outage of their Global External Load Balancer data plane for affected projects for as long as their configuration was in the bad state. This manifested as HTTP 4xx errors until either the suggested workaround or the Google mitigation was applied. The incident lasted 1 day and 4 minutes overall; however, individual customers were impacted for a much shorter period.

Customer impact was noted mostly in one specific configuration type of Google Front End (GFE) that is used for Internet Network Endpoint Groups (NEGs) [1], where up to 20% of customers’ queries failed due to missing configuration. At peak, this impact was observed in up to 9.3% of customers’ projects hosted in this specific configuration type. Impact to customers using other configuration types was minimal to negligible.

[1] - https://cloud.google.com/load-balancing/docs/negs
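For readers who want to measure this kind of impact in their own request logs, here is a minimal sketch that computes the share of HTTP 4xx responses in a sample of status codes. The function names `ratio_4xx` and `should_alert` and the 5% default threshold are assumptions made for illustration; they do not describe Google's dataplane monitoring.

```python
from collections import Counter


def ratio_4xx(status_codes):
    """Return the fraction of responses whose status is in the 400-499 range."""
    counts = Counter(status_codes)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    client_errors = sum(n for code, n in counts.items() if 400 <= code <= 499)
    return client_errors / total


def should_alert(status_codes, threshold=0.05):
    """Return True when the 4xx share exceeds threshold (an illustrative default)."""
    return ratio_4xx(status_codes) > threshold


if __name__ == "__main__":
    # Hypothetical sample: 80 successful responses and 20 "404 Not Found".
    sample = [200] * 80 + [404] * 20
    print(ratio_4xx(sample), should_alert(sample))
```

In practice the threshold would be tuned against a service's normal 4xx baseline, since some level of client errors is expected in healthy traffic.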

25 Sep 2023 15:07 PDT

Mini Incident Report

We apologize for the inconvenience this service disruption may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support.

(All Times US/Pacific)

Incident Start: 21 Sep 2023 23:30

Incident End: 22 Sep 2023 23:34

Duration: 1 day and 4 minutes

Affected Services and Features:

Google Cloud Load Balancing

Regions/Zones: Global

Description:

Google Cloud Load Balancing experienced elevated HTTP 4xx errors for a period of 1 day, 4 minutes.

From preliminary analysis, the root cause of the issue is a change in the leader election process of a component of Google Cloud Load Balancer’s management pipeline. This change triggered a race condition in a downstream component in the pipeline, resulting in the unintended modification of some customer configurations.

Google will complete a full incident report in the following days that will provide a detailed root cause.

Customer Impact:

Customers would have experienced a global outage of their data plane for the period their configuration was in the bad state, and would have observed 4xx errors from external load balancers, resulting in URLs failing to load.

23 Sep 2023 01:14 PDT

The issue with Cloud Load Balancing has been resolved for all affected users as of Saturday, 2023-09-23 01:10 US/Pacific.

We thank you for your patience while we worked on resolving the issue.

23 Sep 2023 00:15 PDT

Summary: Global: Elevated HTTP 4xx Errors on External Application Load Balancer

Description: Mitigation efforts are ongoing with our engineering team, and almost all impact has been mitigated. The impact is now limited to extremely rare cases of individual load balancers experiencing a spike in errors lasting at most three minutes. Engineering teams are continuing to monitor the situation and are working to resolve the root cause of the problem.

We do not have an ETA for full mitigation at this point.

We will provide more information by Saturday, 2023-09-23 10:00 US/Pacific.

Diagnosis: Some customers may see errors, in particular HTTP 4xx errors, on External Application Load Balancer.

Workaround: Affected customers are recommended to try making any change to the configuration of their External Application Load Balancer. Projects with recently updated configuration should be unaffected by the problem.

22 Sep 2023 16:56 PDT

Summary: Global: Elevated HTTP 4xx Errors on External Application Load Balancer

Description: Mitigation work is still underway with our engineering team and the majority of impact has been mitigated. Engineering teams are continuing to monitor the situation and are working to resolve the source of the problem.

We do not have an ETA for mitigation at this point.

We will provide more information by Saturday, 2023-09-23 10:00 US/Pacific.

Diagnosis: Some customers may see errors, in particular HTTP 4xx errors, on External Application Load Balancer.

Workaround: Affected customers are recommended to try making any change to the configuration of their External Application Load Balancer. Projects with recently updated configuration should be unaffected by the problem.

22 Sep 2023 15:52 PDT

Summary: Global: Elevated HTTP 4xx Errors on External Application Load Balancer

Description: Mitigation work is still underway with our engineering team and the majority of impact has been mitigated. Engineering teams are continuing to monitor to confirm full recovery.

We do not have an ETA for mitigation at this point.

We will provide more information by Friday, 2023-09-22 17:00 US/Pacific.

Diagnosis: Some customers may see errors, in particular HTTP 4xx errors, on External Application Load Balancer.

Workaround: Affected customers are recommended to try making any change to the configuration of their External Application Load Balancer. Projects with recently updated configuration should be unaffected by the problem.

22 Sep 2023 13:19 PDT

Summary: Global: Elevated HTTP 4xx Errors on External Application Load Balancer

Description: Mitigation work is currently underway by our engineering team and the majority of impact is mitigated. Engineering teams are continuing to monitor to confirm full recovery.

We do not have an ETA for mitigation at this point.

We will provide more information by Friday, 2023-09-22 16:00 US/Pacific.

Diagnosis: Some customers may see errors, in particular HTTP 4xx errors, on External Application Load Balancer.

Workaround: Affected customers are recommended to try making any change to the configuration of their External Application Load Balancer. Projects with recently updated configuration should be unaffected by the problem.

22 Sep 2023 10:21 PDT

Summary: Global: Elevated HTTP 4xx Errors on External Application Load Balancer

Description: Mitigation work is currently underway and our engineering team continues to investigate the root cause of the issue.

We do not have an ETA for mitigation at this point.

We will provide more information by Friday, 2023-09-22 13:00 US/Pacific.

Diagnosis: Some customers may see errors, in particular HTTP 4xx errors, on External Application Load Balancer.

Workaround: Affected customers are recommended to try making any change to the configuration of their External Application Load Balancer. Projects with recently updated configuration should be unaffected by the problem.

22 Sep 2023 09:34 PDT

Summary: Global: Elevated HTTP 4xx Errors on External Application Load Balancer

Description: Mitigation work is currently underway and our engineering team continues to investigate the root cause of the issue.

We do not have an ETA for mitigation at this point.

We will provide more information by Friday, 2023-09-22 11:00 US/Pacific.

Diagnosis: Some customers may see errors, in particular HTTP 4xx errors, on External Application Load Balancer.

Workaround: None at this time.

22 Sep 2023 08:49 PDT

Summary: Global: Elevated HTTP 4xx Errors on External Application Load Balancer

Description: Mitigation work is currently underway by our engineering team.

We do not have an ETA for mitigation at this point.

We will provide more information by Friday, 2023-09-22 09:51 US/Pacific.

Diagnosis: Some customers may see errors, in particular HTTP 4xx errors, on External Application Load Balancer.

Workaround: None at this time.

22 Sep 2023 08:47 PDT

Summary: We are experiencing an issue with Cloud Load Balancing.

Description: We are experiencing an issue with Cloud Load Balancing.

Our engineering team continues to investigate the issue.

We will provide an update by Friday, 2023-09-22 09:20 US/Pacific with current details.

We apologize to all who are affected by the disruption.

Diagnosis: Customers may experience elevated 400 errors.

Workaround: None at this time.