Service Health
Incident affecting Google Compute Engine, Google Kubernetes Engine, Google Cloud Bigtable, Persistent Disk, Google Cloud Dataflow, Google App Engine, Google Cloud SQL
Multiple services for Google Cloud Platform are impacted in us-central1-a
Incident began at 2023-09-12 23:46 and ended at 2023-09-13 03:32 (all times are US/Pacific).
Previously affected location(s)
Iowa (us-central1)
Date | Time | Description | |
---|---|---|---|
| 19 Sep 2023 | 16:04 PDT | Incident Report. Summary: On Tuesday, 12 September 2023, multiple Google Cloud products experienced elevated error rates and request failures, mostly in the us-central1-a zone. The total duration of this incident was 3 hours and 46 minutes. To our Google Cloud customers whose businesses were impacted during this outage, we sincerely apologize. This is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform’s performance and availability. Root Cause: Google’s data centers rely on a distributed, strongly consistent file distribution system to perform operations such as name resolution in the data plane. The system consists of servers that distribute widely used data. The root cause of the issue was a significant increase in traffic, due to internal changes that generated more tasks than expected, which caused the file distribution system to begin crashing. Remediation and Prevention: Google engineers were alerted to the issue via internal monitoring on 12 September 2023 at 23:46 US/Pacific and immediately started an investigation. Once the nature and scope of the issue became clear, Google engineers began redirecting traffic away from the affected servers and adding more memory resources. This procedure took a few hours because it is a manual process that requires extra care, given the criticality of the service and its foundational role in data center operations. During this time, some services recovered before others, and impact was fully mitigated for all services on 13 September 2023 at 03:32 US/Pacific. Google is committed to preventing a repeat of this issue and is completing the following actions:
We apologize for the impact this incident had on our customers and their businesses in the us-central1 region. We are taking immediate steps to prevent a recurrence. Detailed Description of Impact: On Tuesday, 12 September 2023 from 23:46 to Wednesday, 13 September 2023 at 03:32 US/Pacific, multiple Google Cloud products experienced elevated error rates and request failures in us-central1, as detailed below: Google Compute Engine:
Impact began on Wednesday, 13 September 2023 at 00:05 and was mitigated at 02:40 US/Pacific. Total duration of impact was 2 hours, 35 minutes. Persistent Disk:
Impact began on Wednesday, 13 September 2023 at 00:28 and was mitigated at 02:53 US/Pacific. Total duration of impact was 2 hours, 25 minutes. Google Kubernetes Engine:
Impact began on Wednesday, 13 September 2023 at 00:55 and was mitigated at 04:00 US/Pacific. Total duration of impact was 3 hours, 5 minutes. Google Cloud Bigtable:
Impact began on Tuesday, 12 September 2023 at 23:57. The first symptoms were detected on Wednesday, 13 September 2023 at 01:00, and the incident was mitigated at 02:20 US/Pacific. Total duration of impact was 2 hours, 23 minutes. Google Cloud Dataflow:
Impact began on Wednesday, 13 September 2023 at 00:07 and was mitigated at 01:15 US/Pacific. Total duration of impact was 1 hour, 8 minutes. Google Cloud App Engine:
Impact began on Wednesday, 13 September 2023 at 00:15 and was mitigated at 00:56 US/Pacific. Total duration of impact was 41 minutes. Google Cloud SQL:
Impact began on Wednesday, 13 September 2023 at 00:11 and was mitigated at 00:45 US/Pacific. Total duration of impact was 34 minutes. |
| 13 Sep 2023 | 12:16 PDT | Mini Incident Report: We apologize for the inconvenience this service disruption/outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support. (All Times US/Pacific) Incident Start: 12 September 2023 23:46 Incident End: 13 September 2023 03:32 Duration: 3 hours, 46 minutes Affected Services and Features:
Regions/Zones: us-central1 Description: Multiple Google Cloud products experienced elevated error rates and request failures in us-central1 for a duration of 3 hours, 46 minutes. From preliminary analysis, the root cause of the issue is task failures in the caching proxy of Google's distributed lock service in us-central1-a due to high memory usage. Our engineers mitigated the issue by redirecting traffic away from the affected servers and by adding more memory resources. While the mitigation activities were ongoing, some products saw service recovery before others. Google will complete a full Incident Report in the following days that will provide a full root cause. Customer Impact: Google Compute Engine:
Persistent Disk:
Google Kubernetes Engine:
Google Cloud Bigtable:
Google Cloud Dataflow:
Google Cloud App Engine:
Google Cloud SQL:
|
| 13 Sep 2023 | 04:03 PDT | The issue with Google App Engine, Google Cloud Bigtable, Google Cloud Dataflow, Google Cloud SQL, Google Kubernetes Engine, Persistent Disk has been resolved for all affected projects as of Wednesday, 2023-09-13 04:01 US/Pacific. Google Compute Engine The issue for GCE was resolved on Wednesday, 2023-09-13 01:13 US/Pacific. Persistent Disk The issue for Persistent Disk was resolved on Wednesday, 2023-09-13 03:07 US/Pacific. Google App Engine The issue for Google App Engine was resolved on Wednesday, 2023-09-13 03:40 US/Pacific. Cloud Dataflow The issue for Cloud Dataflow was resolved on Wednesday, 2023-09-13 03:14 US/Pacific. Google Kubernetes Engine The issue for Google Kubernetes Engine was resolved on Wednesday, 2023-09-13 04:01 US/Pacific. Cloud Bigtable Cloud Bigtable had elevated latency and elevated error rates in us-central1 but was mitigated on Wednesday, 2023-09-13 02:57 US/Pacific. Google Cloud SQL Google Cloud SQL had elevated latency and elevated error rates for instance creations and upgrades in us-central1 and was mitigated on Wednesday, 2023-09-13 00:45 US/Pacific. We thank you for your patience while we worked on resolving the issue. |
| 13 Sep 2023 | 03:47 PDT | Summary: Multiple services for Google Cloud Platform are impacted in us-central1-a Description: Mitigation work is currently underway by our engineering team. The mitigation is expected to complete by Wednesday, 2023-09-13 04:30 US/Pacific. We will provide more information by Wednesday, 2023-09-13 04:30 US/Pacific. Diagnosis: Google Compute Engine The issue for GCE was resolved on Wednesday, 2023-09-13 01:13 US/Pacific. Persistent Disk The issue for Persistent Disk was resolved on Wednesday, 2023-09-13 03:07 US/Pacific. Google App Engine The issue for Google App Engine was resolved on Wednesday, 2023-09-13 03:40 US/Pacific. Cloud Dataflow The issue for Cloud Dataflow was resolved on Wednesday, 2023-09-13 03:14 US/Pacific. Google Kubernetes Engine Cluster creation and upgrade operations are failing in us-central1-a. Cloud Bigtable Cloud Bigtable had elevated latency and elevated error rates in us-central1 but was mitigated on Wednesday, 2023-09-13 02:57 US/Pacific. Workaround: None at this time. |
| 13 Sep 2023 | 03:05 PDT | Summary: Multiple services for Google Cloud Platform are impacted in us-central1-a Description: Mitigation work is currently underway by our engineering team. The mitigation is expected to complete by Wednesday, 2023-09-13 04:00 US/Pacific. We will provide more information by Wednesday, 2023-09-13 04:00 US/Pacific. Diagnosis: Google Compute Engine
Persistent Disk
Google App Engine
Cloud Dataflow
Google Kubernetes Engine
Cloud Bigtable
Workaround: None at this time. |
| 13 Sep 2023 | 02:32 PDT | Summary: Multiple services for Google Cloud Platform are impacted in us-central1-a Description: Mitigation work is currently underway by our engineering team. The regional impact was mitigated on Wednesday, 2023-09-13 01:05 US/Pacific, but the impact in us-central1-a is still ongoing. We do not have an ETA for mitigation at this point. We will provide more information by Wednesday, 2023-09-13 03:30 US/Pacific. Diagnosis: Google Compute Engine An issue is preventing VM creation in us-central1 clusters. Also, HTTP requests to the GCE API in us-central1-a are failing intermittently. Persistent Disk Input/output operations from virtual machines to Persistent Disk are not completing and are stuck in us-central1-a. Google App Engine Google App Engine Flexible deployments, version updates, and deletes fail in us-central1-a. Cloud Dataflow Unable to start and run Dataflow jobs in us-central1-a. Google Kubernetes Engine Cluster creation and upgrade operations are failing in us-central1-a. Workaround: None at this time. |
| 13 Sep 2023 | 01:51 PDT | Summary: Multiple services for Google Cloud Platform are impacted in us-central1 Description: We are experiencing an issue with Google Cloud Dataflow, Google Compute Engine, Google App Engine, Google Kubernetes Engine, Persistent Disk beginning on Wednesday, 2023-09-13 00:30 US/Pacific. Our engineering team continues to investigate the issue. We will provide an update by Wednesday, 2023-09-13 03:00 US/Pacific with current details. We apologize to all who are affected by the disruption. Diagnosis: Google Compute Engine An issue is preventing VM creation in us-central1 clusters. Also, HTTP requests to the GCE API in us-central1 and its zones are failing intermittently. Persistent Disk Input/output operations from virtual machines to Persistent Disk are not completing and are stuck. Google App Engine Google App Engine Flexible deployments, version updates, and deletes fail in us-central1. Cloud Dataflow Unable to start and run Dataflow jobs. Google Kubernetes Engine Cluster creation operations are failing in us-central1 and us-central1-a. Workaround: None at this time. |
- All times are US/Pacific
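
The per-product impact durations in the 19 Sep 2023 incident report follow directly from the stated start and mitigation times, so they can be recomputed as a consistency check. The sketch below is illustrative only: the `WINDOWS` table, `impact_minutes`, and `format_duration` are hypothetical names introduced here, the timestamps are copied from the report, and naive timestamps are assumed to be safe because no US/Pacific daylight-saving change falls within the incident window.

```python
from datetime import datetime

# Impact windows copied from the 19 Sep 2023 incident report (US/Pacific).
# The dictionary name and layout are assumptions made for this sketch.
WINDOWS = {
    "Whole incident": ("2023-09-12 23:46", "2023-09-13 03:32"),
    "Google Compute Engine": ("2023-09-13 00:05", "2023-09-13 02:40"),
    "Persistent Disk": ("2023-09-13 00:28", "2023-09-13 02:53"),
    "Google Kubernetes Engine": ("2023-09-13 00:55", "2023-09-13 04:00"),
    "Google Cloud Bigtable": ("2023-09-12 23:57", "2023-09-13 02:20"),
    "Google Cloud Dataflow": ("2023-09-13 00:07", "2023-09-13 01:15"),
    "Google App Engine": ("2023-09-13 00:15", "2023-09-13 00:56"),
    "Google Cloud SQL": ("2023-09-13 00:11", "2023-09-13 00:45"),
}

FMT = "%Y-%m-%d %H:%M"


def impact_minutes(start: str, end: str) -> int:
    """Return the whole minutes elapsed between two timestamps in FMT."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


def format_duration(minutes: int) -> str:
    """Render a minute count as hours and minutes."""
    hours, mins = divmod(minutes, 60)
    return f"{hours} h {mins:02d} min"


if __name__ == "__main__":
    for product, (start, end) in WINDOWS.items():
        print(f"{product}: {format_duration(impact_minutes(start, end))}")
```

Comparing the printed values against the durations quoted in the report is a quick way to spot a window whose stated duration does not match its timestamps.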