Service Health
Incident affecting Google BigQuery, Apigee, Google Compute Engine, Google Kubernetes Engine, Cloud Memorystore, Google Cloud Bigtable, Persistent Disk, Google Cloud Dataflow, Google Cloud Networking, Google Cloud Pub/Sub, Google Cloud SQL, Cloud Filestore, Cloud Data Fusion, Cloud Load Balancing, Memorystore for Redis
We are experiencing an issue with Persistent Disk affecting multiple services in us-central1-b
Incident began at 2022-05-06 01:30 and ended at 2022-05-06 12:06 (all times are US/Pacific).
Previously affected location(s)
Iowa (us-central1)
Date | Time | Description | |
---|---|---|---|
| 16 May 2022 | 16:12 PDT | INCIDENT REPORT Summary: On 6 May 2022 at 01:30 US/Pacific, multiple Google Cloud services experienced issues in the us-central1 region. These issues were mostly isolated to us-central1-b for zonal services, but some regional services experienced degradation until their traffic could be shifted away from the impacted zone. Most Google Cloud services recovered automatically after the underlying problem was resolved. We sincerely apologize for the impact to your service or application. We completed an internal investigation and are taking immediate steps to improve the quality and reliability of our services. If you believe that your services experienced an SLA violation as a result of this incident, please contact us. Root Cause: Google Cloud systems are built on a zonal distributed storage system called Colossus, which replicates data across a large number of individual storage servers called D Servers. In this incident, a background job responsible for repacking storage objects began to retry those repack operations more aggressively as part of its normal operations. This subsequently increased the load on the Colossus system in the zone, including the number of open connections to the D Servers. The sudden increase in connection load to D Servers caused a small number of servers to unexpectedly crash due to high memory pressure. This led our automated management systems to remove them from the serving fleet for Colossus. This further reduced the number of D Servers available to handle the rising traffic loads and increased the traffic latency within the Colossus system in the impacted zone. This significant increase in latency subsequently impacted our customers’ performance across a range of Google Cloud services that are built atop Colossus, including Persistent Disk, BigQuery, and many others. This zonal incident impacted some regional services due to the specific failure mode.
When a Colossus cluster is marked down, the regional services receive proactive notification and automatically shift traffic away from the cluster. Since this cluster was still up, but with variable latency for some operations, the regional services received no proactive notification and were unable to automatically shift traffic away from the cluster. Therefore, the impact to a number of regional services was extended as they had to manually remove the impacted cluster from serving. Remediation and Prevention: Google engineers were alerted to the issue on Friday, 6 May 2022 at 01:54 US/Pacific and immediately started an investigation. Google engineers stopped the background traffic. To increase traffic capacity, Google engineers re-added the impacted D Servers to the serving fleet, mitigating the issue at 12:06 US/Pacific. Google is committed to quickly and continually improving our technology and operations to prevent service disruptions. We are taking the following steps to prevent this or similar issues from happening again:
Detailed Description of Impact: Some customers may have experienced high latency or errors in multiple Google Cloud services in the impacted region.
|
| 6 May 2022 | 16:12 PDT | Mini Incident Report We apologize for the inconvenience this service disruption may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Support by opening a case using https://cloud.google.com/support or help article https://support.google.com/a/answer/1047213. (All Times US/Pacific) Incident Start: 06 May 2022 01:30 Incident End: 06 May 2022 12:06 Duration: 10 hours, 36 minutes Affected Services and Features:
Regions/Zones: us-central1-b Description: Multiple Google Cloud services experienced issues in the us-central1 region beginning Friday, 6 May 2022 at 01:30 PT. These issues were predominantly isolated to us-central1-b for zonal services, but some regional services experienced degradation until their traffic could be shifted away from the impacted zone. Most services recovered automatically after the underlying problem was resolved. The issues were triggered by an unexpected increase in normally occurring background traffic in the Google Cloud distributed storage infrastructure[1] within the us-central1-b zone. The system automatically directed load away from backend file servers that were impacted by this load increase. This subsequently reduced the overall traffic capacity in the zone. Google engineers mitigated the issue by stopping the background traffic and marking the impacted file servers as available in order to increase capacity. Customer Impact: How Customers Experienced the Issue: Some customers may have experienced high latency or errors in multiple Google Cloud services in the impacted region.
|
| 6 May 2022 | 12:09 PDT | The issue with Cloud Data Fusion, Cloud Filestore, Cloud Memorystore, Google BigQuery, Google Cloud Dataflow, Google Cloud Networking, Google Cloud Pub/Sub, Google Cloud SQL, Google Kubernetes Engine, Persistent Disk, Apigee has been resolved for all affected projects as of Friday, 2022-05-06 12:06 US/Pacific. Products with Narrow Impact:
We will publish an Incident Report once we have completed our internal investigation. We thank you for your patience while we worked on resolving the issue. |
| 6 May 2022 | 11:16 PDT | Summary: We are experiencing an issue with Persistent Disk affecting multiple services in us-central1-b Description: We are experiencing an issue with multiple cloud services including BigQuery, Cloud Networking, Cloud SQL, Google Kubernetes Engine (GKE), Cloud Filestore, Cloud Bigtable, Cloud Memorystore, Apigee, Cloud Dataflow services, Cloud Data Fusion (CDF) beginning at Friday, 2022-05-06 01:20 US/Pacific in us-central1-b. Mitigation is completed, and most of the affected services have recovered. We will provide more information by Friday, 2022-05-06 12:30 US/Pacific. Products Recovered:
Products Still Recovering:
Diagnosis: Customers might see connectivity issues in us-central1-b Workaround: Move the workloads to a different zone if possible |
| 6 May 2022 | 10:57 PDT | Summary: We are experiencing an issue with Persistent Disk affecting multiple services in us-central1-b Description: We are experiencing an issue with multiple cloud services including BigQuery, Cloud Networking, Cloud SQL, Google Kubernetes Engine (GKE), Cloud Filestore, Cloud Bigtable, Cloud Memorystore, Apigee, Cloud Dataflow services, Cloud Data Fusion (CDF) beginning at Friday, 2022-05-06 01:20 US/Pacific in us-central1-b. Mitigation is completed, and most of the affected services have recovered. We will provide more information by Friday, 2022-05-06 11:30 US/Pacific. Products Recovered:
Products Still Recovering:
Diagnosis: Customers might see connectivity issues in us-central1-b Workaround: Move the workloads to a different zone if possible |
| 6 May 2022 | 10:28 PDT | Summary: We are experiencing an issue with Persistent Disk affecting multiple services in us-central1-b Description: We are experiencing an issue with Persistent Disk affecting multiple services including BigQuery, Cloud Networking, Cloud SQL, Google Kubernetes Engine (GKE), Cloud Filestore, Cloud Bigtable, Cloud Memorystore, Apigee, Cloud Dataflow services, Cloud Data Fusion (CDF) beginning at Friday, 2022-05-06 01:20 US/Pacific in us-central1-b. Mitigation is completed and most of the affected services have recovered. We will provide more information by Friday, 2022-05-06 11:00 US/Pacific. Products Recovered:
Products Still Recovering:
Diagnosis: Customers might see connectivity issues in us-central1-b Workaround: Move the workloads to a different zone if possible |
| 6 May 2022 | 09:55 PDT | Summary: We are experiencing an issue with Persistent Disk affecting multiple services in us-central1-b Description: We are experiencing an issue with Persistent Disk affecting multiple services including BigQuery, Cloud Networking, Cloud SQL, Google Kubernetes Engine (GKE), Cloud Filestore, Cloud Bigtable, Cloud Memorystore, Apigee, Cloud Dataflow services, Cloud Data Fusion (CDF) beginning at Friday, 2022-05-06 01:20 US/Pacific in us-central1-b. Mitigation is completed and most of the affected services have recovered. We will provide more information by Friday, 2022-05-06 10:30 US/Pacific. Products Recovered:
Products Still Recovering:
Diagnosis: Customers might see connectivity issues in us-central1-b Workaround: Move the workloads to a different zone if possible |
| 6 May 2022 | 09:26 PDT | Summary: We are experiencing an issue with Persistent Disk affecting multiple services in us-central1-b Description: We are experiencing an issue with Persistent Disk affecting multiple services including BigQuery, Cloud Networking, Cloud SQL, Google Kubernetes Engine (GKE), Cloud Filestore, Cloud Bigtable, Cloud Memorystore, Apigee, Cloud Dataflow services, Cloud Data Fusion (CDF) beginning at Friday, 2022-05-06 01:20 US/Pacific in us-central1-b. Mitigation is completed and most of the affected services have recovered. We will provide more information by Friday, 2022-05-06 10:00 US/Pacific. Products Recovered:
Products Still Recovering:
Diagnosis: Customers might see connectivity issues in us-central1-b Workaround: Move the workloads to a different zone if possible |
| 6 May 2022 | 08:56 PDT | Summary: We are experiencing an issue with Persistent Disk affecting multiple services in us-central1-b Description: We are experiencing an issue with Persistent Disk affecting multiple services including BigQuery, Cloud Networking, Cloud SQL, Google Kubernetes Engine (GKE), Cloud Filestore, Cloud Bigtable, Cloud Memorystore, Apigee, Cloud Dataflow services, Cloud Data Fusion (CDF) beginning at Friday, 2022-05-06 01:20 US/Pacific in us-central1-b. Mitigation is completed and we see that most of the affected services have recovered. We will provide more information by Friday, 2022-05-06 09:30 US/Pacific. Products Recovered:
Products Still Recovering:
Diagnosis: Customers might see connectivity issues in us-central1-b Workaround: Move the workloads to a different zone if possible |
| 6 May 2022 | 08:27 PDT | Summary: We are experiencing an issue with Persistent Disk affecting multiple services in us-central1-b Description: We are experiencing an issue with Persistent Disk affecting multiple services including BigQuery, Cloud Networking, Cloud SQL, Google Kubernetes Engine (GKE), Cloud Filestore, Cloud Bigtable, Cloud Memorystore, Apigee, Cloud Dataflow services, Cloud Data Fusion (CDF) beginning at Friday, 2022-05-06 01:20 US/Pacific in us-central1-b. Mitigation is completed and we see that most of the affected services have recovered. We will provide more information by Friday, 2022-05-06 09:00 US/Pacific. Products Recovered: BigQuery Engine:Cloud Pub/Sub, Cloud Networking, Compute Engine, Datastream, Cloud Filestore, Cloud Memorystore, Cloud SQL, Apigee, Dataflow Products Still Recovering:
Diagnosis: Customers might see connectivity issues in us-central1-b Workaround: Move the workloads to a different zone if possible |
| 6 May 2022 | 08:01 PDT | Summary: We are experiencing an issue with Persistent Disk affecting multiple services in us-central1-b Description: We are experiencing an issue with Persistent Disk affecting multiple services including BigQuery, Cloud Networking, Cloud SQL, Google Kubernetes Engine (GKE), Cloud Filestore, Cloud Bigtable, Cloud Memorystore, Apigee, Cloud Dataflow services, Cloud Data Fusion (CDF) beginning at Friday, 2022-05-06 01:20 US/Pacific in us-central1-b. Mitigation is completed and we see that most of the affected services have recovered. We will provide more information by Friday, 2022-05-06 08:30 US/Pacific. Products Recovered: BigQuery Engine:Cloud Pub/Sub, Cloud Networking, Compute Engine, Datastream, Cloud Filestore, Cloud Memorystore, Cloud SQL, Apigee, Dataflow Products Still Recovering:
Diagnosis: Customers might see connectivity issues in us-central1-b Workaround: Move the workloads to a different zone if possible |
| 6 May 2022 | 07:25 PDT | Summary: We are experiencing an issue with Persistent Disk affecting multiple services in us-central1-b Description: We are experiencing an issue with Persistent Disk affecting multiple services including BigQuery, Cloud Networking, Cloud SQL, GKE, Cloud Filestore, Cloud Bigtable, Cloud Memorystore, Apigee, Cloud Dataflow services, Cloud Data Fusion beginning at Friday, 2022-05-06 01:20 US/Pacific in us-central1-b. Mitigation is completed and we see that most of the affected services have recovered. We will provide more information by Friday, 2022-05-06 08:00 US/Pacific. Product Impact:
Diagnosis: Customers might see connectivity issues in us-central1-b Workaround: Move the workloads to a different zone if possible |
| 6 May 2022 | 06:54 PDT | Summary: We are experiencing an issue with Persistent Disk affecting multiple services in us-central1-b Description: We are experiencing an issue with Persistent Disk affecting multiple services including BigQuery, Cloud Networking, Cloud SQL, GKE, Cloud Filestore, Cloud Bigtable, Cloud Memorystore, Apigee, Cloud Dataflow services beginning at Friday, 2022-05-06 01:20 US/Pacific in us-central1-b. Mitigation work is currently underway by our engineering team. We see partial recovery for some services. We will provide more information by Friday, 2022-05-06 07:30 US/Pacific. Product Impact:
Diagnosis: Customers might see connectivity issues in us-central1-b Workaround: Move the workloads to a different zone if possible |
| 6 May 2022 | 06:20 PDT | Summary: We are experiencing an issue with Persistent Disk affecting multiple services in us-central1-b Description: We are experiencing an issue with Persistent Disk affecting multiple services including BigQuery, Cloud Networking, Cloud SQL, GKE Control Plane, Cloud Filestore, Cloud Bigtable, Cloud Memorystore, Apigee, Cloud Dataflow services beginning at Friday, 2022-05-06 01:20 US/Pacific in us-central1-b. Mitigation work is currently underway by our engineering team. We will provide more information by Friday, 2022-05-06 07:00 US/Pacific. Diagnosis: Customers might see connectivity issues in us-central1-b Workaround: Move the workloads to a different zone if possible |
| 6 May 2022 | 06:04 PDT | Summary: We are experiencing an issue with Persistent Disk affecting multiple services in us-central1-b Description: We are experiencing an issue with Persistent Disk affecting multiple services including BigQuery, Cloud Networking, Cloud SQL, GKE beginning at Friday, 2022-05-06 01:20 US/Pacific in us-central1-b. Mitigation work is currently underway by our engineering team. We will provide more information by Friday, 2022-05-06 06:30 US/Pacific. Diagnosis: Customers might see connectivity issues in us-central1-b Workaround: Move the workloads to a different zone if possible |
| 6 May 2022 | 05:58 PDT | Summary: We are experiencing an issue with Persistent Disk affecting multiple services in us-central1-b Description: Mitigation work is currently underway by our engineering team. We will provide more information by Friday, 2022-05-06 06:30 US/Pacific. Diagnosis: Customers might see connectivity issues in us-central1-b Workaround: Move the workloads to a different zone if possible |
| 6 May 2022 | 05:45 PDT | Summary: We are experiencing an issue with Persistent Disk affecting multiple services in us-central1 Description: We are experiencing an issue with Persistent Disk affecting multiple services beginning at Friday, 2022-05-06 01:20 US/Pacific in us-central1. Our engineering team continues to investigate the issue. We will provide an update by Friday, 2022-05-06 06:30 US/Pacific with current details. We apologize to all who are affected by the disruption. Diagnosis: Some I/O operations in Persistent Disk Standard devices are stuck for a long time (>1 min) Workaround: Move the workloads to a different zone if possible |
| 6 May 2022 | 05:01 PDT | Summary: We are experiencing an issue with Persistent Disk affecting multiple services in us-central1 Description: We are experiencing an issue with Persistent Disk affecting multiple services beginning at Friday, 2022-05-06 01:20 US/Pacific in us-central1. Our engineering team continues to investigate the issue. We will provide an update by Friday, 2022-05-06 06:30 US/Pacific with current details. We apologize to all who are affected by the disruption. Diagnosis: Some I/O operations in Persistent Disk Standard devices are stuck for a long time (>1 min) Workaround: Move the workloads to a different zone if possible |
- All times are US/Pacific
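
The Root Cause section of the incident report notes that regional services shift traffic automatically only when a cluster is marked down, so a cluster that stayed up with variable latency triggered no automatic failover. The following is a minimal sketch of how a client-side drain decision might also consider tail latency. It is an illustration only, not Google's mechanism: `ClusterSample`, `p99`, `should_drain`, and the 200 ms budget are hypothetical names and values chosen for this example.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ClusterSample:
    """Hypothetical probe results for one storage cluster (not a real API)."""
    name: str
    marked_down: bool
    latencies_ms: Sequence[float]


def p99(latencies_ms: Sequence[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(len(ordered) * 99 / 100)  # 1-based nearest rank
    return ordered[rank - 1]


def should_drain(sample: ClusterSample, p99_budget_ms: float) -> bool:
    """Drain when the cluster is marked down OR its tail latency exceeds the budget."""
    if sample.marked_down:
        return True  # the explicit "down" signal described in the report
    if not sample.latencies_ms:
        return False  # no probe data, so no latency-based decision
    return p99(sample.latencies_ms) > p99_budget_ms


if __name__ == "__main__":
    # Assumed probe data: the cluster is up, but 10% of operations stall.
    slow = ClusterSample("us-central1-b", False, [10.0] * 90 + [5000.0] * 10)
    print(slow.name, "drain:", should_drain(slow, p99_budget_ms=200.0))
```

A check of this kind treats "up but slow" as a drain condition alongside "marked down". The latency budget and the number of probe samples are tuning choices that would depend on the workload.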