Service Health
Incident affecting Cloud Key Management Service, Cloud Spanner, Google Compute Engine, Google Cloud Bigtable, Traffic Director, Google Cloud Dataflow, Google Cloud Storage, Google Cloud Networking, Dataplex, Google Cloud Pub/Sub, Cloud Load Balancing, Service Directory
Multiple services impacted in us-central1 region
Incident began at 2022-08-25 01:29 and ended at 2022-08-25 02:30 (all times are US/Pacific).
Previously affected location(s)
Iowa (us-central1)
Date | Time | Description | |
---|---|---|---|
| 6 Sep 2022 | 14:21 PDT | Incident Report. Summary: On Thursday, 25 August 2022, Google Cloud Networking experienced increased latency in us-central1 starting at 01:29 US/Pacific and lasting 1 hour and 1 minute. This caused errors and failures in several downstream services. To our customers who were impacted during this outage, we sincerely apologize. This is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform’s performance and availability. We have conducted an internal investigation and are taking steps to improve our service. Root Cause: Google Cloud Networking utilizes physical routers to aggregate traffic across our network. Google regularly deploys updates to our network control plane to enhance performance, security, and reliability. Periodically, these routers need to be rebooted for various maintenance activities. We take steps to ensure these reboots minimize downtime and customer impact by moving traffic away from the device before the reboot. During routine maintenance, and after draining existing network traffic, one of the routers being rebooted in us-central1 failed to boot from the primary boot media. Instead, it booted from the alternative boot media, which contained an outdated configuration that was missing updated routing information. That configuration caused the router to attract traffic that it was unable to route correctly. Because the network had changed since that configuration was written, the existing fail-safe that should have made the router take itself out of service again was ineffective. Remediation and Prevention: This failure was automatically remediated by Google’s automation systems, which pushed the updated configuration on Thursday, 25 August 2022, at 02:21 US/Pacific. The proper routes were restored by 02:30. The configuration push is one of the steps taken by automation before closing out the maintenance.
Google is committed to preventing this type of disruption from recurring and is taking the following actions: Complete an audit of old router configurations, remove them, and prevent them from accidentally being re-enabled. This is expected to be completed by 9 September 2022. Deploy additional monitoring to notify Google engineers when a device that is expected to be out of service (i.e., drained) is re-attracting traffic, in order to reduce future outage durations. This will be rolled out by 30 September 2022. Detailed Description of Impact: On Thursday, 25 August 2022, from 01:29 to 02:30 US/Pacific unless otherwise noted: Cloud Pub/Sub: Customers in us-central1 experienced reduced availability in the form of HTTP 502 and 503 (retriable) errors across both publish and subscribe operations. Less than 1% of customer projects were affected. Backlog statistics freshness was also impacted, with less than 1% of subscriptions experiencing stale backlog data. Google Compute Engine (GCE): Compute API requests globally would have experienced a 20% rate of errors or hangs. Retrying would have helped requests succeed. Additionally, only 50% of the compute.instance.insert operations in the us-central1-a zone succeeded. Cloud Key Management Service (KMS): Affected Cloud KMS customers would have received errors in nam-eur-asia1, nam7, nam9, nam10, nam11, and nam12 for a small subset of methods. Approximately 3% of requests over the impact period would have seen an Unavailable error. Cloud Bigtable: Less than 2% of Cloud Bigtable customers (approximately 10% of requests) experienced elevated latency and errors in the Data API and Admin API from 01:29 to 02:30 on 25 August in us-central1. The disruption also impacted users of Key Visualizer for Cloud Bigtable during this time. Cloud Dataflow: During the incident (01:29 to 02:30 PT), affected customers may have been temporarily unable to create new Dataflow jobs, both Batch and Streaming (~2% of all customer projects).
Information about existing jobs may have been temporarily unavailable (~57.5% of all customer projects). Some Streaming jobs may have been stuck (~4.2% of us-central1 Streaming jobs). Google Cloud Storage: Approximately 0.5% of customer uploads and downloads saw authentication errors during the impact period of 01:31 PDT to 02:24 PDT. Customers would have experienced errors indicating deadline exceeded. Cloud Spanner: Less than 5% of Cloud Spanner customers (approximately 15% of requests) experienced elevated latency and errors in the Data API and Admin API from 01:29 to 02:30 on 25 August in us-central1. Cloud Load Balancing: Affected customers using Cloud Load Balancing products would have been unable to submit configuration changes in us-central1 between 01:29 and 02:30. Already configured load balancers would have continued serving traffic. Service Directory: 10-20% of registration operations and 0.2% of resolve operations to Service Directory in us-central1 failed during the affected period. |
| 29 Aug 2022 | 13:38 PDT | Mini Incident Report. We apologize for the inconvenience this service disruption may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support or to Google Workspace Support using help article https://support.google.com/a/answer/1047213. (All Times US/Pacific) Incident Start: 25 August 2022 01:29. Incident End: 25 August 2022 02:30. Duration: 1 hour, 1 minute. Affected Services and Features: Cloud Pub/Sub, Google Compute Engine, Cloud Key Management Service, Cloud Bigtable, Cloud Dataflow, Google Cloud Storage, Cloud Spanner, Cloud Load Balancing, Service Directory, Traffic Director, Dataplex. Regions/Zones: us-central1. Description: Multiple Google Cloud services experienced increased latency, errors, and failures in us-central1 for a period of 1 hour and 1 minute. From preliminary analysis, the root cause of the issue appears to be a new rollout to our network control plane, in which a networking component was rebooted and returned to an old configuration. Customer Impact: Customers experienced an increase in error rates across all the listed impacted services. |
| 25 Aug 2022 | 03:22 PDT | The issue with Cloud Key Management Service, Cloud Load Balancing, Cloud Spanner, Dataplex, Google Cloud Bigtable, Google Cloud Dataflow, Google Cloud Pub/Sub, Google Cloud Storage, Google Compute Engine, Service Directory, Traffic Director has been resolved for all affected users as of Thursday, 2022-08-25 02:30 US/Pacific. We thank you for your patience while we worked on resolving the issue. |
| 25 Aug 2022 | 03:14 PDT | Summary: Multiple services impacted in us-central1 region. Description: We are experiencing an issue with Cloud Load Balancing, Traffic Director, Service Directory, Google Compute Engine, Google Cloud Pub/Sub, Dataplex, Google Cloud Bigtable. Our engineering team continues to investigate the issue. We will provide an update by Thursday, 2022-08-25 03:40 US/Pacific with current details. We apologize to all who are affected by the disruption. Diagnosis: Customers can observe errors while using services in us-central1. Workaround: None at this time. |
| 25 Aug 2022 | 02:53 PDT | Summary: Multiple services impacted in us-central1 region. Description: We are experiencing an issue with Cloud Load Balancing, Traffic Director, Service Directory, Google Compute Engine, Google Cloud Pub/Sub, Dataplex, Google Cloud Bigtable. Our engineering team continues to investigate the issue. We will provide an update by Thursday, 2022-08-25 03:20 US/Pacific with current details. We apologize to all who are affected by the disruption. Diagnosis: Customers can observe errors while using services in us-central1. Workaround: None at this time. |
| 25 Aug 2022 | 02:49 PDT | Summary: Multiple services impacted in us-central1 region. Description: We are experiencing an issue with Cloud Load Balancing, Traffic Director, Service Directory, Google Compute Engine, Google Cloud Pub/Sub, Dataplex, Google Cloud Bigtable. Our engineering team continues to investigate the issue. We will provide an update by Thursday, 2022-08-25 03:20 US/Pacific with current details. We apologize to all who are affected by the disruption. Diagnosis: Customers can observe errors while using services in us-central1. Workaround: None at this time. |
- All times are US/Pacific
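
The incident report describes the Cloud Pub/Sub HTTP 502 and 503 errors as retriable and notes that retrying would have helped Compute API requests succeed. As an illustration only, the following is a minimal client-side sketch of retrying with capped exponential backoff and jitter. The names `TransientError` and `call_with_backoff`, and all default values, are hypothetical and are not part of any Google Cloud client library; the official client libraries ship their own retry settings, which may be the better choice in practice.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a retriable failure such as HTTP 502 or 503."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.5,
                      max_delay=8.0, sleep=time.sleep):
    """Call operation(), retrying on TransientError.

    The wait before each retry is drawn uniformly from
    [0, min(max_delay, base_delay * 2**attempt)] (full jitter), so that
    many clients recovering at once do not retry in lockstep.
    All parameter defaults are assumptions chosen for illustration.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            # Give up and surface the error after the final attempt.
            if attempt == max_attempts - 1:
                raise
            delay_cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay_cap))
```

A caller would wrap an idempotent request in a function that raises `TransientError` on a 502 or 503 response and pass that function to `call_with_backoff`. Non-idempotent operations need more care, since a request that appeared to hang may still have been applied.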