Service Health
Incident affecting Google Cloud Networking, Cloud Load Balancing
Global: Elevated HTTP 4xx Errors on External Application Load Balancer
Incident began at 2023-09-21 23:30 and ended at 2023-09-22 23:34 (all times are US/Pacific).
Previously affected location(s)
Global
| Date | Time | Description |
|---|---|---|
| 28 Sep 2023 | 10:56 PDT | Incident Report. Summary: On Thursday, 21 September 2023, Cloud Load Balancing experienced a configuration propagation issue resulting in a failure to serve traffic for some customers. To our Cloud Load Balancing customers whose businesses were impacted during this disruption, we sincerely apologize. This is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform’s performance and availability. Root Cause: The root cause of the issue was a change in the leader election process of a component of Google Cloud Load Balancer’s configuration management pipeline. This change triggered a race condition in a downstream component in the pipeline, which overwrote some customer configurations with empty configurations. The leader election change was part of a rollout intended to increase the resiliency of the system. Google engineers rolled out the change gradually and used a canary process to gate progress on correct functionality; however, our automated rollout verification did not detect this issue because, in isolation, the component being rolled out was performing correctly. While the rollout was in progress, our dataplane monitoring detected the issue through the increased rate of 4xx errors being served. Remediation and Prevention: Google engineers were alerted to an issue on 22 September 2023 at 04:51 US/Pacific by dataplane monitoring for HTTP 4xx responses and froze the configuration pipeline to avoid further impact. At 05:15 US/Pacific, Google engineers reinstated a snapshot of known good config for the impacted config type, mitigating the dataplane errors for which we had been alerted. At 07:15 US/Pacific the configuration pipeline was unfrozen, following our standard process. However, our validations found that an underlying issue was still present, and impact to the data plane resurfaced after the configuration pipeline was unfrozen. Between 07:15 and 12:40 Google engineers focused on root cause investigation and continuous manual mitigation of impacted project configurations, which were being affected by natural leader election flips. During this time we identified the root cause component but, due to the nature of the issue, were unable to follow our standard rollback procedure. This caused a delay until 16:58, when Google engineers started a custom rollback of the relevant management plane component, which was completed at 23:34. Once this rollback was completed, Google engineers performed additional validations by triggering leader election changes and ensuring that these changes no longer caused the unintended behavior. Affected customers were advised to try making a change to the configuration of their External Application Load Balancer. This resolved URL loading failures for impacted customers. We apologize for the length and severity of this incident. We are taking multiple steps to prevent a recurrence and improve reliability in the future. Detailed Description of Impact: Starting at 23:30 on 21 September 2023, some customers experienced a global outage of the Global External Load Balancer data plane in affected projects for as long as their configuration remained in the bad state. This manifested as HTTP 4xx errors until either the suggested workaround or the Google mitigation was applied. The incident lasted for a period of 1 day, 4 minutes. However, individual customers were impacted for a much shorter period. Customer impact was noted mostly in one specific configuration type of Google Front End (GFE) that is used for Internet Network Endpoint Groups (NEGs) [1], where up to 20% of customers’ queries failed due to missing configuration. At peak, this impact was observed in up to 9.3% of customers’ projects hosted in this specific type of configuration. Impact to customers using other configuration types was minimal to negligible. |
| 25 Sep 2023 | 15:07 PDT | Mini Incident Report: We apologize for the inconvenience this service disruption may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support. (All Times US/Pacific) Incident Start: 21 Sep 2023 23:30 Incident End: 22 Sep 2023 23:34 Duration: 1 day and 4 minutes Affected Services and Features: Google Cloud Load Balancing Regions/Zones: Global Description: Google Cloud Load Balancing experienced elevated HTTP 4xx errors for a period of 1 day, 4 minutes. From preliminary analysis, the root cause of the issue is a change in the leader election process of a component of Google Cloud Load Balancer’s management pipeline that triggered a race condition in a downstream component in the pipeline, resulting in the unintended modification of some customer configurations. Google will complete a full incident report in the following days that will provide a detailed root cause. Customer Impact: Customers would have experienced a global outage to their data plane for the period their config was in the bad state and would have observed 4xx errors from external load balancers, resulting in URLs failing to load. |
| 23 Sep 2023 | 01:14 PDT | The issue with Cloud Load Balancing has been resolved for all affected users as of Saturday, 2023-09-23 01:10 US/Pacific. We thank you for your patience while we worked on resolving the issue. |
| 23 Sep 2023 | 00:15 PDT | Summary: Global: Elevated HTTP 4xx Errors on External Application Load Balancer Description: Mitigation efforts are ongoing with our engineering team, and almost all impact has been mitigated. The impact is now limited to extremely rare cases of individual load balancers experiencing a spike in errors of at most three minutes. Engineering teams are continuing to monitor the situation and are working to resolve the root cause of the problem. We do not have an ETA for full mitigation at this point. We will provide more information by Saturday, 2023-09-23 10:00 US/Pacific. Diagnosis: Some customers may see errors, in particular HTTP 4xx errors, on External Application Load Balancer. Workaround: Affected customers are recommended to try making any change to the configuration of their External Application Load Balancer. Projects with recently updated configuration should be unaffected by the problem. |
| 22 Sep 2023 | 16:56 PDT | Summary: Global: Elevated HTTP 4xx Errors on External Application Load Balancer Description: Mitigation work is still underway with our engineering team and the majority of impact has been mitigated. Engineering teams are continuing to monitor the situation and are working to resolve the source of the problem. We do not have an ETA for mitigation at this point. We will provide more information by Saturday, 2023-09-23 10:00 US/Pacific. Diagnosis: Some customers may see errors, in particular HTTP 4xx errors, on External Application Load Balancer. Workaround: Affected customers are recommended to try making any change to the configuration of their External Application Load Balancer. Projects with recently updated configuration should be unaffected by the problem. |
| 22 Sep 2023 | 15:52 PDT | Summary: Global: Elevated HTTP 4xx Errors on External Application Load Balancer Description: Mitigation work is still underway with our engineering team and the majority of impact has been mitigated. Engineering teams are continuing to monitor to confirm full recovery. We do not have an ETA for mitigation at this point. We will provide more information by Friday, 2023-09-22 17:00 US/Pacific. Diagnosis: Some customers may see errors, in particular HTTP 4xx errors, on External Application Load Balancer. Workaround: Affected customers are recommended to try making any change to the configuration of their External Application Load Balancer. Projects with recently updated configuration should be unaffected by the problem. |
| 22 Sep 2023 | 13:19 PDT | Summary: Global: Elevated HTTP 4xx Errors on External Application Load Balancer Description: Mitigation work is currently underway by our engineering team and the majority of impact is mitigated. Engineering teams are continuing to monitor to confirm full recovery. We do not have an ETA for mitigation at this point. We will provide more information by Friday, 2023-09-22 16:00 US/Pacific. Diagnosis: Some customers may see errors, in particular HTTP 4xx errors, on External Application Load Balancer. Workaround: Affected customers are recommended to try making any change to the configuration of their External Application Load Balancer. Projects with recently updated configuration should be unaffected by the problem. |
| 22 Sep 2023 | 10:21 PDT | Summary: Global: Elevated HTTP 4xx Errors on External Application Load Balancer Description: Mitigation work is currently underway and our engineering team continues to investigate the root cause of the issue. We do not have an ETA for mitigation at this point. We will provide more information by Friday, 2023-09-22 13:00 US/Pacific. Diagnosis: Some customers may see errors, in particular HTTP 4xx errors, on External Application Load Balancer. Workaround: Affected customers are recommended to try making any change to the configuration of their External Application Load Balancer. Projects with recently updated configuration should be unaffected by the problem. |
| 22 Sep 2023 | 09:34 PDT | Summary: Global: Elevated HTTP 4xx Errors on External Application Load Balancer Description: Mitigation work is currently underway and our engineering team continues to investigate the root cause of the issue. We do not have an ETA for mitigation at this point. We will provide more information by Friday, 2023-09-22 11:00 US/Pacific. Diagnosis: Some customers may see errors, in particular HTTP 4xx errors, on External Application Load Balancer. Workaround: None at this time. |
| 22 Sep 2023 | 08:49 PDT | Summary: Global: Elevated HTTP 4xx Errors on External Application Load Balancer Description: Mitigation work is currently underway by our engineering team. We do not have an ETA for mitigation at this point. We will provide more information by Friday, 2023-09-22 09:51 US/Pacific. Diagnosis: Some customers may see errors, in particular HTTP 4xx errors, on External Application Load Balancer. Workaround: None at this time. |
| 22 Sep 2023 | 08:47 PDT | Summary: We are experiencing an issue with Cloud Load Balancing. Description: We are experiencing an issue with Cloud Load Balancing. Our engineering team continues to investigate the issue. We will provide an update by Friday, 2023-09-22 09:20 US/Pacific with current details. We apologize to all who are affected by the disruption. Diagnosis: Customers may experience elevated 400 errors. Workaround: None at this time. |
- All times are US/Pacific
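
The incident report quotes several timestamps and a total duration of 1 day, 4 minutes. As a cross-check, the intervals between those milestones can be worked out with a short script. This is only a sketch: the key names in `TIMELINE` and the helpers `split` and `between` are illustrative and do not come from the report; only the timestamps themselves are taken from it.

```python
from datetime import datetime, timedelta

# Timestamps quoted in the incident report. All are US/Pacific and no
# daylight-saving change falls inside this window, so naive datetimes
# are sufficient for interval arithmetic. Key names are illustrative.
FMT = "%Y-%m-%d %H:%M"
TIMELINE = {
    "impact_start": datetime.strptime("2023-09-21 23:30", FMT),
    "alert": datetime.strptime("2023-09-22 04:51", FMT),
    "snapshot_restored": datetime.strptime("2023-09-22 05:15", FMT),
    "pipeline_unfrozen": datetime.strptime("2023-09-22 07:15", FMT),
    "custom_rollback_start": datetime.strptime("2023-09-22 16:58", FMT),
    "rollback_complete": datetime.strptime("2023-09-22 23:34", FMT),
}


def split(delta: timedelta) -> tuple[int, int, int]:
    """Break a timedelta into whole (days, hours, minutes)."""
    total_minutes = int(delta.total_seconds()) // 60
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    return days, hours, minutes


def between(start: str, end: str) -> tuple[int, int, int]:
    """Elapsed (days, hours, minutes) between two named milestones."""
    return split(TIMELINE[end] - TIMELINE[start])


if __name__ == "__main__":
    names = list(TIMELINE)
    # Print each consecutive phase, then the overall incident duration.
    for start, end in zip(names, names[1:]):
        d, h, m = between(start, end)
        print(f"{start} -> {end}: {d}d {h}h {m}m")
    d, h, m = between("impact_start", "rollback_complete")
    print(f"total: {d}d {h}h {m}m")
```

With the report's timestamps, `between("impact_start", "rollback_complete")` gives `(1, 0, 4)`, matching the stated duration, and `between("impact_start", "alert")` gives `(0, 5, 21)`, the time from the start of impact to the first alert.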