Service Health


Incident affecting Google Cloud Networking

We are experiencing an issue with Cloud Networking in us-east1-c and us-east1-d

Incident began at 2020-06-29 08:20 and ended at 2020-06-29 13:06 (all times are US/Pacific).

8 Jul 2020 12:40 PDT

ISSUE SUMMARY

On 2020-06-29 07:47 US/Pacific, Google Cloud experienced unavailability for some services hosted from our us-east1-c and us-east1-d zones. The unavailability primarily impacted us-east1-c but also had a brief impact on us-east1-d. For approximately 1 hour and 30 minutes, 22.5% of Google Compute Engine (GCE) instances in us-east1-c were unavailable. For approximately 7 minutes, 1.8% of GCE instances in us-east1-d were unavailable. In addition, 0.0267% of Persistent Disk (PD) devices hosted in us-east1-c were unavailable for up to 28 hours, and the us-east1 region as a whole experienced 5% packet loss for Public IP and Network LB traffic between 07:55 and 08:05.

We sincerely apologize and are taking steps detailed below to ensure this doesn’t happen again.

BACKGROUND

Google Cloud Platform is built on various layers of abstraction in order to provide scale and distinct failure domains. One of those abstractions is Zones and clusters [1]. Zonal services such as Google Compute Engine (GCE) assign a project to one cluster, which handles the majority of that project's compute needs when it requests resources in a cloud zone. If a cluster backing a zone becomes degraded, services in that zone have resilience built in to handle some level of machine failures. Regional services, depending on their architecture, may see a short degradation before automatically recovering, or may see no impact at all. Regional services with tasks in a degraded cluster are generally migrated to other functional clusters in the same region to reduce overall impact. In the DETAILED DESCRIPTION OF IMPACT section below, the impact applies only to projects and services mapped to the affected clusters, unless otherwise noted.
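
To make the task-migration behavior concrete, the following is a minimal, hypothetical sketch of how a regional service might reschedule tasks away from a degraded cluster onto healthy clusters in the same region. It is illustrative only, under the assumptions that cluster health is already known and that placement simply balances task counts; the names used (Cluster, reschedule_tasks) are invented for this sketch and do not correspond to actual Google systems.

    # Hypothetical sketch of rescheduling regional service tasks away from a
    # degraded cluster. Illustrative only; not Google's internal scheduler.
    from dataclasses import dataclass, field
    from typing import List


    @dataclass
    class Cluster:
        name: str
        healthy: bool = True
        tasks: List[str] = field(default_factory=list)


    def reschedule_tasks(clusters: List[Cluster]) -> None:
        """Move tasks off degraded clusters onto healthy clusters in the same region."""
        healthy = [c for c in clusters if c.healthy]
        if not healthy:
            raise RuntimeError("no healthy clusters available in region")
        for cluster in clusters:
            if cluster.healthy:
                continue
            while cluster.tasks:
                task = cluster.tasks.pop()
                # Naive placement: pick the healthy cluster with the fewest tasks.
                target = min(healthy, key=lambda c: len(c.tasks))
                target.tasks.append(task)


    # Example: one degraded cluster and two healthy clusters in the same region.
    region = [
        Cluster("cluster-a", healthy=False, tasks=["vpn-gw-1", "router-task-7"]),
        Cluster("cluster-b", tasks=["vpn-gw-2"]),
        Cluster("cluster-c"),
    ]
    reschedule_tasks(region)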

Datacenter power delivery is architected in three tiers. The primary tier is utility power, with multiple grid feeds and robust substations. Utility power is backed up by generators: each generator powers a different part of each cluster, and additional backup generators and fuel are available in the event that part of this backup power system fails. The fuel supply system for the generators is broken into two parts: storage tanks, which store fuel in bulk, and a system that pumps that fuel to the generators for consumption. The final tier of power delivery is batteries, which provide power conditioning and a short run time when power from the other two tiers is interrupted.

[1] https://cloud.google.com/compute/docs/regions-zones#zones_and_clusters

ROOT CAUSE

During planned substation maintenance by the site's electrical utility provider, two clusters supporting the us-east1 region were transferred to backup generator power for the duration of the maintenance, which was scheduled as a four-hour window. Three hours into the maintenance window, 17% of the operating generators began to run out of fuel due to fuel delivery system failures, even though adequate fuel was available in the storage tanks. Multiple redundancies built into the backup power system were automatically activated as primary generators began to run out of fuel; however, as more primary generators ran out of fuel, the part of the cluster they were supporting shut down.

REMEDIATION AND PREVENTION

Google engineers were alerted to the power issue impacting us-east1-c and us-east1-d at 2020-06-29 07:50 US/Pacific and immediately started an investigation. Impact to us-east1-d was resolved automatically by cluster-level services; other than some Persistent Disk devices, service impact in us-east1-d ended by 08:24. Onsite datacenter operators identified a fuel supply issue as the root cause of the power loss and quickly established a mitigation plan. Once a workaround for the fuel supply issue was deployed, the operators began restoring the affected generators to active service at 08:49. At nearly the same time, at 08:55, the planned substation maintenance concluded and utility power returned to service. Between the restored utility power and the recovered generators, power was fully restored to both clusters by 08:59.

In a datacenter recovery scenario, there is a sequential process that must be followed for downstream service recovery to succeed. By 2020-06-29 09:34, most GCE instances had recovered as the necessary upstream services were restored. All services had recovered by 10:50, except for a small percentage of instances impacted by Persistent Disk. A more detailed timeline of individual service impact is included in the "DETAILED DESCRIPTION OF IMPACT" section below.

In the days following this incident, the same backup power system was again put under load. There was an unplanned utility power outage at the same location on 2020-06-30 (the next day) due to a lightning strike near a substation transformer, and the system was tested once more on 2020-07-02 when a final maintenance operation was conducted on the site substation.

We are committed to preventing this situation from happening again and are implementing the following actions:

Resolving the issues identified with the fuel supply system that led to this incident. An audit of sites that have a similar fuel system has been conducted, and onsite personnel have been provided with updated procedures and training for dealing with this situation should it occur again.

DETAILED DESCRIPTION OF IMPACT

On 2020-06-29 from 07:47 to 10:50 US/Pacific, Google Cloud experienced unavailability for some services hosted from cloud zones us-east1-c and us-east1-d as described in detail below:

Google Compute Engine

22.5% of Google Compute Engine (GCE) instances in the us-east1-c zone were unavailable starting 2020-06-29 07:57 US/Pacific for 1 hour and 30 minutes. Up to 1.8% of instances in the us-east1-d zone were unavailable starting 2020-06-29 08:17 for 7 minutes. A small percentage of the instances in us-east1-c continued to be unavailable for up to 28 hours due to manual recovery of PD devices.

Persistent Disk

23% of Persistent Disk (PD) devices in us-east1-c became degraded from 2020-06-29 07:53 to 09:28 US/Pacific, a duration of 1 hour and 35 minutes. 0.0267% of PD devices were unable to recover automatically and required manual recovery, which completed at 2020-06-30 09:54, resulting in 26 hours of additional unavailability.

The delay in recovery was primarily due to a configuration setting in PD clients that capped metadata initialization retry attempts at a maximum value (with exponential backoff). Upon power loss, 0.0267% of PD devices in us-east1-c reached this limit and were unable to recover automatically, as they had exhausted their retry attempts before power had been fully restored. To prevent this scenario from recurring, we are significantly increasing the number of retry attempts performed by PD metadata initialization to ensure PD can recover from extended periods of power loss.
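
As an illustration of the retry behavior described above, the sketch below shows exponential backoff with a capped number of attempts: once the cap is reached, the operation gives up, which mirrors how devices that exhausted their retries before power returned could not recover automatically. The function name, attempt limit, and delays are hypothetical and do not reflect the actual PD client configuration.

    import random
    import time


    def retry_with_backoff(operation, max_attempts=10, base_delay=1.0, max_delay=300.0):
        """Retry `operation` with capped exponential backoff (hypothetical sketch).

        If the outage outlasts the final retry, the operation never recovers
        automatically; raising `max_attempts` extends how long an extended
        power loss can be tolerated before manual recovery is needed.
        """
        for attempt in range(1, max_attempts + 1):
            try:
                return operation()
            except Exception:
                if attempt == max_attempts:
                    raise  # retries exhausted: manual recovery required
                delay = min(max_delay, base_delay * 2 ** (attempt - 1))
                time.sleep(delay + random.uniform(0, 1))  # add jitter between retries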

A secondary factor resulting in the delay of some customer VMs was due to filesystem errors triggered by the PD unavailability. PD itself maintains defense-in-depth through a variety of end-to-end integrity mechanisms which prevented any PD corruption during this incident. However, some filesystems are not designed to be robust against cases where some parts of the block device presented by PD fail to initialize while others are still usable. This issue was technically external to PD, and only repairable by customers using filesystem repair utilities. The PD product team assisted affected customers in their manual recovery efforts during the extended incident window.
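
For illustration, a typical manual repair involves attaching the affected disk to a healthy VM as an unmounted secondary device and running a filesystem repair utility against it. The sketch below wraps such an invocation; the device path and fsck flags are examples only, and customers should follow the Persistent Disk recovery documentation for their specific filesystem.

    # Hypothetical example: repair the filesystem on an affected Persistent Disk
    # that has been attached to a rescue VM as an unmounted secondary device.
    # The device path is illustrative; confirm the correct device (e.g. with lsblk)
    # before running any repair, and never run fsck on a mounted filesystem.
    import subprocess

    DEVICE = "/dev/sdb1"  # example device node for the attached secondary disk

    # "fsck -y" answers yes to repair prompts.
    result = subprocess.run(["fsck", "-y", DEVICE], capture_output=True, text=True)
    print(result.stdout)
    print(result.stderr)
    if result.returncode != 0:
        print(f"fsck exited with status {result.returncode}; review the output before remounting")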

Additionally, up to 0.429% of PD devices in us-east1-d were unhealthy for approximately 1 hour from 2020-06-29 08:15 to 2020-06-29 09:10. All PD devices in us-east1-d recovered automatically once power had been restored.

Cloud Networking

The us-east1 region as a whole experienced 5% packet loss between 07:55 and 08:05 for Public IP and Network LB traffic as Maglevs [1] serving the region from the impacted cluster became unavailable.

Cloud VPN saw 7% of Classic VPN tunnels in us-east1 reset between 07:57 and 08:07. As tunnels disconnected, they were rescheduled automatically in other clusters in the region. HA VPN tunnels were not impacted.

Cloud Router saw 13% of BGP sessions in us-east1 flap between 07:57 and 08:07; Cloud Router tasks in the impacted cluster were rescheduled automatically in other clusters in the region.

Cloud HTTP(S) Load Balancing saw a 166% spike over baseline HTTP 500 errors between 08:00 and 08:10.

Starting at 09:38, the network control plane in the impacted cluster began to initialize but ran into an issue that required manual intervention to resolve. Between 09:38 and 10:14, instances remained inaccessible until the control plane initialized, and updates were not being propagated down to the cluster control plane. As a result, some resources, such as Internal Load Balancers and instances that were deleted during this period and then recreated at any time between 09:38 and 12:47, would have seen availability issues. This was resolved with additional intervention from the SRE team.

To reduce the time to recover from similar classes of issues, we are increasing the robustness of the control plane to better handle such exceptional failure conditions. We are also implementing additional monitoring and alerting to more quickly detect update propagation issues under exceptional failure conditions.

[1] https://cloud.google.com/blog/products/gcp/google-shares-software-network-load-balancer-design-powering-gcp-networking

Cloud SQL

Google Cloud SQL experienced a 7% drop in network connections, resulting in database unavailability from 2020-06-29 08:00 to 10:50 US/Pacific and affecting <1.5% of instances in us-east1. The power loss degraded Cloud SQL dependencies (Persistent Disk and GCE) in us-east1-c and us-east1-d for a period of 2 hours and 50 minutes.

Filestore + Memorystore

Cloud Filestore and Memorystore instances in us-east1-c were unavailable due to power loss from 2020-06-29 07:56 - 10:24 US/Pacific for a duration of 2 hours and 28 minutes. 7.4% of Redis non-HA (zonal) instances in us-east1 were unavailable and 5.9% of Redis HA (regional) standard tier instances in us-east1 failed over. All Cloud Filestore and Cloud Memorystore instances recovered by 10:24.

Google Kubernetes Engine

Google Kubernetes Engine (GKE) customers were unable to create or delete clusters in either the us-east1-c zone or the us-east1 region from 2020-06-29 08:00 to 10:30 US/Pacific. Additionally, customers could not create node pools in the us-east1-c zone during that same period. A maximum of 29% of zonal clusters in us-east1-c and 2% of zonal clusters in us-east1-d could not be administered; all but a small subset of clusters recovered by 11:00, and the remaining clusters recovered by 14:30. Existing regional clusters in us-east1 were not affected, except for customers who had workloads on nodes in us-east1-c that they were unable to migrate.

Google Cloud Storage

Google Cloud Storage (GCS) in us-east1 experienced 10 minutes of impact during which availability fell to 99.7%, with a 3-minute burst down to 98.6% availability. GCS multi-region experienced a total of 40 minutes of impact, with availability down to 99.55%.

SLA CREDITS

If you believe your paid application experienced an SLA violation as a result of this incident, please submit an SLA credit request: https://support.google.com/cloud/contact/cloud_platform_sla

A full list of all Google Cloud Platform Service Level Agreements can be found at https://cloud.google.com/terms/sla/

29 Jun 2020 13:06 PDT

The issue with Cloud Networking and Persistent Disk has been resolved for the majority of affected projects as of Monday, 2020-06-29 10:20 US/Pacific, and we expect full mitigation to occur for remaining projects within the hour.

If you have questions or feel that you may be impacted, please open a case with the Support Team and we will work with you until the issue is resolved.

We will publish analysis of this incident once we have completed our internal investigation.

We thank you for your patience while we work on resolving the issue.

29 Jun 2020 12:17 PDT

Description: We are experiencing an issue with Cloud Networking in us-east1-c and us-east1-d, beginning on Monday, 2020-06-29 07:54 US/Pacific, affecting multiple Google Cloud Services.

Services in us-east1-d have been fully restored. Services in us-east1-c are fully restored except for Persistent Disk, which is partially restored. There is no ETA for full recovery of Persistent Disk yet.

Impact is due to power failure. A more detailed analysis will be available at a later time.

Our engineering team is working on recovery of impacted services.

We will provide an update by Monday, 2020-06-29 13:00 US/Pacific with current details.

We apologize to all who are affected by the disruption.

Diagnosis: Some services in us-east1-c and us-east1-d are failing. Customers impacted by this incident would likely experience total unavailability of zonal services hosted in us-east1-c or us-east1-d. It is possible for customers to experience service interruption in none, one, or both zones.

Workaround: Other zones in the region are not impacted. If possible, migrating workloads would mitigate impact. If workloads are unable to be migrated, there is no workaround at this time.

29 Jun 2020 11:45 PDT

Description: We are experiencing an issue with Cloud Networking in us-east1-c and us-east1-d, beginning on Monday, 2020-06-29 07:54 US/Pacific, affecting multiple Google Cloud Services.

Services in us-east1-d have been fully restored. Services in us-east1-c are fully restored except for Persistent Disk, which is partially restored. There is no ETA for full recovery of Persistent Disk yet.

Impact is due to power failure. A more detailed analysis will be available at a later time.

Our engineering team is working on recovery of impacted services.

We will provide an update by Monday, 2020-06-29 12:20 US/Pacific with current details.

We apologize to all who are affected by the disruption.

Diagnosis: Some services in us-east1-c and us-east1-d are failing. Customers impacted by this incident would likely experience total unavailability of zonal services hosted in us-east1-c or us-east1-d. It is possible for customers to experience service interruption in none, one, or both zones.

Workaround: Other zones in the region are not impacted. If possible, migrating workloads would mitigate impact. If workloads are unable to be migrated, there is no workaround at this time.

29 Jun 2020 11:00 PDT

Description: We are experiencing an issue with Cloud Networking in us-east1-c and us-east1-d, beginning on Monday, 2020-06-29 07:54 US/Pacific, affecting multiple Google Cloud Services.

Services in us-east1-d have been restored. Most services in us-east1-c are restored except for Persistent Disk. There is no ETA for Persistent Disk recovery as of now.

Impact is due to power failure. A more detailed analysis will be available at a later time.

Our engineering team is working on recovery of impacted services.

We will provide an update by Monday, 2020-06-29 11:45 US/Pacific with current details.

We apologize to all who are affected by the disruption.

Diagnosis: Some services in us-east1-c and us-east1-d are failing. Customers impacted by this incident would likely experience total unavailability of zonal services hosted in us-east1-c or us-east1-d. It is possible for customers to experience service interruption in none, one, or both zones.

Workaround: Other zones in the region are not impacted. If possible, migrating workloads would mitigate impact. If workloads are unable to be migrated, there is no workaround at this time.

29 Jun 2020 10:31 PDT

Description: We are experiencing an issue with Cloud Networking in us-east1-c and us-east1-d, beginning on Monday, 2020-06-29 07:54 US/Pacific, affecting multiple Google Cloud Services.

Services in us-east1-d have been restored. Most services in us-east1-c are restored except for Persistent Disk. There is no ETA for Persistent Disk recovery as of now.

Impact is due to power failure. A more detailed analysis will be available at a later time.

Our engineering team is working on recovery of impacted services.

We will provide an update by Monday, 2020-06-29 11:00 US/Pacific with current details.

We apologize to all who are affected by the disruption.

Diagnosis: Some services in us-east1-c and us-east1-d are failing. Customers impacted by this incident would likely experience total unavailability of zonal services hosted in us-east1-c or us-east1-d. It is possible for customers to experience service interruption in none, one, or both zones.

Workaround: Other zones in the region are not impacted. If possible, migrating workloads would mitigate impact. If workloads are unable to be migrated, there is no workaround at this time.

29 Jun 2020 10:00 PDT

Description: We are experiencing an issue with Cloud Networking in us-east1-c and us-east1-d, beginning on Monday, 2020-06-29 07:54 US/Pacific, affecting multiple Google Cloud Services.

Services in us-east1-d have been restored. Services in us-east1-c are still being restored. No ETA as of now.

Impact is due to power failure. A more detailed analysis will be available at a later time.

Our engineering team is working on recovery of impacted services.

We will provide an update by Monday, 2020-06-29 10:30 US/Pacific with current details.

We apologize to all who are affected by the disruption.

Diagnosis: Some services in us-east1-c and us-east1-d are failing. Customers impacted by this incident would likely experience total unavailability of zonal services hosted in us-east1-c or us-east1-d. It is possible for customers to experience service interruption in none, one, or both zones.

Workaround: Other zones in the region are not impacted. If possible, migrating workloads would mitigate impact. If workloads are unable to be migrated, there is no workaround at this time.

29 Jun 2020 09:22 PDT

Description: We are experiencing an issue with Cloud Networking in us-east1-c and us-east1-d, beginning on Monday, 2020-06-29 07:54 US/Pacific, affecting multiple Google Cloud Services.

We expect services in us-east1-d to recover within the next 30 minutes; there is no ETA for service recovery in us-east1-c yet.

Impact is due to power failure. A more detailed analysis will be available at a later time.

Our engineering team is working on recovery of impacted services.

We will provide an update by Monday, 2020-06-29 10:00 US/Pacific with current details.

We apologize to all who are affected by the disruption.

Diagnosis: Some services in us-east1-c and us-east1-d are failing. Customers impacted by this incident would likely experience total unavailability of zonal services hosted in us-east1-c or us-east1-d. It is possible for customers to experience service interruption in none, one, or both zones.

Workaround: Other zones in the region are not impacted. If possible, migrating workloads would mitigate impact. If workloads are unable to be migrated, there is no workaround at this time.

29 Jun 2020 08:48 PDT

Description: We are experiencing an issue with Cloud Networking in us-east1-c and us-east1-d, beginning on Monday, 2020-06-29 08:15 US/Pacific, affecting multiple Google Cloud Services.

Symptoms: Connections to and from VMs in us-east1-c and us-east1-d may fail.

Our engineering team continues to investigate the issue.

We will provide an update by Monday, 2020-06-29 10:10 US/Pacific with current details.

We apologize to all who are affected by the disruption.

Diagnosis: VM networking in us-east1-c and us-east1-d is failing.

Workaround: None at this time.