Google Cloud Status Dashboard
This page provides status information on the services that are part of Google Cloud Platform. Check back here to view the current status of the services listed below. If you are experiencing an issue not listed here, please contact Support. Learn more about what's posted on the dashboard in this FAQ. For additional information on these services, please visit cloud.google.com.
Google Cloud Datastore Incident #19006
We've received a report of an issue with Cloud Firestore.
Incident began at 2019-11-11 03:15 and ended at 2019-11-11 05:36 (all times are US/Pacific).
Nov 15, 2019 12:09

ISSUE SUMMARY

On Monday 11 November 2019, Google's internal key management system (KMS) suffered a failure which began to cause user-facing impact in the us-east1, us-east4, and southamerica-east1 regions at 02:39 US/Pacific and recovered by 03:27. Some services took longer to fully recover and continued to experience issues outside of this period. Because of Google's commitment to user privacy and data security, KMS is a common dependency across many infrastructure components. Specific service impact is outlined below.

DETAILED DESCRIPTION OF IMPACT

On Monday 11 November 2019 from 02:39 to 03:27 US/Pacific, various services in us-east1, us-east4, and southamerica-east1 experienced varying degrees of impact, as detailed below. Some global services were also impacted during this period.

Cloud IAM
Cloud IAM saw average error rates of 93.1% in us-east4 and 89.3% in southamerica-east1 for the first 36 minutes of the incident. Cloud IAM in us-east1 saw an average error rate of 94.5% for the duration of the event.

Cloud Pub/Sub
Cloud Pub/Sub saw an average error rate of 15% for Publish operations during the impact period.

Google Compute Engine (GCE)
Google Compute Engine API requests saw a 16% error rate during the impact period. Persistent Disk control plane operations (Device Creation, Grow, Deletion, Attachment, Snapshot upload/download, and Image Restores) failed in the us-east1, us-east4, and southamerica-east1 regions. Existing instances were unaffected.

Google App Engine (GAE)
Google App Engine in us-east1 served an average elevated error rate of 10% (peaking at 15%). Applications in us-east4 served an average of 10% errors and applications in southamerica-east1 served an average of 3% errors for the first 35 minutes of the incident.

Cloud Functions
Cloud Functions HTTP triggering was degraded in us-east1, us-east4, and southamerica-east1, with peak error rates of 1%, 1.7%, and 8.5% respectively.

Cloud Run
Cloud Run experienced elevated error rates averaging 11% in us-east1 for the first 35 minutes of the incident.

In all cases, the majority of serverless errors were due to permission checks (for example, whether a user was an admin, or whether a user was authorized to access a restricted endpoint). Errors were returned in the form of HTTP 5XX responses. The Serverless API served an average of 70% errors during the impact period because it requires IAM permission checks.

Cloud Bigtable
Cloud Bigtable instances in us-east1, us-east4, and southamerica-east1 were inaccessible for the first 18 minutes of the incident.

Cloud Memorystore
Cloud Memorystore Delete Instance, Get Instance, List, and List Instance operations were unavailable during the impact period. Existing instances were unaffected.

Cloud DNS
The Cloud DNS API had an average error rate of 27% for the first 25 minutes of the incident. Domain resolution was unaffected.

Cloud Console
The Cloud Console served an average of 4.7% errors for the impact duration.

Google Cloud Storage (GCS)
Buckets in us-east1 served an average of 99% errors for the duration of the incident; buckets in us-east4 and southamerica-east1 served averages of 97% and 96% errors respectively for the first 36 minutes of the incident. Buckets in the US multi-region served an average of 1.7% errors from 02:39 to 05:54 US/Pacific, a duration of 3 hours 15 minutes.

Google BigQuery
BigQuery Dataset, Table, and Job operations were unavailable in us-east1, us-east4, and southamerica-east1 for the duration of the incident.
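Several of the products above, including the serverless platforms, surfaced these failures to callers as HTTP 5XX responses. As a purely illustrative sketch (not guidance from this report; the URL and retry limits are invented placeholders), a client could treat such 5XX responses as transient and retry with exponential backoff and jitter:

```python
# Hypothetical client-side sketch: retry transient HTTP 5XX responses with
# exponential backoff and full jitter. The endpoint URL and retry limits are
# illustrative placeholders, not values taken from the incident report.
import random
import time
import urllib.error
import urllib.request


def fetch_with_backoff(url, max_attempts=5, base_delay=1.0):
    """Return the response body, retrying 5XX errors up to max_attempts times."""
    for attempt in range(max_attempts):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            # Only 5XX responses are treated as transient; 4XX errors are re-raised,
            # as is the final failed attempt.
            if err.code < 500 or attempt == max_attempts - 1:
                raise
            # Sleep between 0 and base_delay * 2**attempt seconds before retrying.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))


if __name__ == "__main__":
    body = fetch_with_backoff("https://example.com/")  # placeholder endpoint
    print(len(body))
```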
Google Cloud Load Balancing
Health checks to new instances, or to instances that were live-migrated during the incident, failed in us-east1-c and us-east1-d. Instances that fail health checks generally do not receive any traffic; existing instances were unaffected unless they were migrated. 4% of Managed Instance Groups (MIGs) in us-east1-b failed to repair themselves, while up to 85% of MIG instance creations failed: instances were created successfully, but their health checks timed out and the instances were unable to receive traffic. Autohealing and MIG health checks failed for up to 60% of requests in us-east1-d and up to 25% in us-east1-c. Up to 6% of Internal Load Balancer (ILB) backend health checks failed in us-east1-d and up to 3.5% in us-east1-c. Google Cloud HTTP(S) Load Balancing backends saw up to 0.5% of health checks fail globally.

Stackdriver Logging
Stackdriver Logging saw an average error rate of 11% globally, and up to a 98% error rate in us-east1, us-east4, and southamerica-east1.

Stackdriver Monitoring
Stackdriver Monitoring was unavailable in us-east1, us-east4, and southamerica-east1 from 02:44 to 03:03 US/Pacific.

G Suite
The impact on G Suite users was different from, and generally lower than, the impact on Google Cloud Platform users due to differences in the architecture and provisioning of these services. Please see the G Suite Status Dashboard (https://www.google.com/appsstatus) for details on affected G Suite services.

ROOT CAUSE AND REMEDIATION

At Google, data security is a critical part of service design. To accomplish this, services depend on the KMS to perform cryptographic functions such as encrypting and decrypting the keys used to protect user data.

On Monday 11 November 2019 at 00:11 US/Pacific, a new version of the KMS began to roll out, starting with a single zone in the us-east1 region. This binary added a new feature which was incompatible with older versions of related components. Six minutes later, an alert notified Google engineers of an increased error rate in the KMS. Additional alerts went off at 00:34 while investigation was ongoing. At this time there was no impact to Google Cloud users.

To begin recovery, Google engineers started to roll the change forward in the affected region, bringing all tasks up to the same version. A roll forward was necessary due to the backwards-incompatible nature of the newest version of the KMS, as keys it persisted could not be processed by components running older versions. The roll forward began at 02:32 and resulted in a temporary increase of internal errors from the KMS as the ratio of tasks running the new, incompatible version to tasks running the previous version changed. By 02:39 the error rate reached a point where external users began to see service degradation.

KMS is a low-level service and a critical dependency for core services such as IAM and Storage. The dependency on these core services created distinct start and end times for impact across higher-level services. The roll forward completed by 02:59 and brought the KMS back to a healthy state. From 02:59 onward, services which depend on the KMS began to recover. Most services returned to a healthy state by 03:27; however, some services which integrate with the KMS in multiple locations, such as GCS, saw a longer tail of errors until those other locations were brought back to a healthy state.
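To illustrate the backwards-incompatibility problem described above, the following hypothetical sketch (the record format, version numbers, and function names are invented and do not reflect the actual KMS design) shows why a rollback cannot help once a new binary has persisted data that older tasks cannot read, leaving a roll forward as the only recovery path:

```python
# Hypothetical illustration of a backwards-incompatible persisted format.
# The version-tagged record layout below is invented for this sketch.

OLD_SUPPORTED_VERSIONS = {1}     # formats the previous binary can read
NEW_SUPPORTED_VERSIONS = {1, 2}  # formats the new binary can read


def write_record_v2(payload: bytes) -> bytes:
    """The new binary persists records in a format only it understands."""
    return b"v2:" + payload


def read_record(record: bytes, supported_versions: set) -> bytes:
    """Readers reject any record whose version they do not support."""
    version_tag, payload = record.split(b":", 1)
    version = int(version_tag[1:])
    if version not in supported_versions:
        raise ValueError(f"unsupported record version {version}")
    return payload


# Once the new binary has written v2 records, rolling the fleet *back* does not
# help: old tasks still cannot read them. A consistent fleet is only restored by
# rolling *forward* until every task supports v2.
record = write_record_v2(b"wrapped-key")
print(read_record(record, NEW_SUPPORTED_VERSIONS))   # readable by new tasks
try:
    read_record(record, OLD_SUPPORTED_VERSIONS)      # fails on old tasks
except ValueError as err:
    print("old task:", err)
```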
PREVENTION AND FOLLOW-UP

A key reliability principle for GCP is that regions provide failure-domain isolation; as a result, we will be modifying our processes to ensure that multiple regions are not impacted simultaneously by a rollout of this particular component. Projects are already in progress to split and isolate shared dependencies for the regions impacted by this incident. We are making changes to our canary and rollout validation processes to improve the detection speed of similar incidents in the future. An additional principle is the ability to mitigate incidents quickly, typically through rollbacks, and we will be introducing explicit backwards-compatibility testing for the components implicated in this outage to ensure we comply with that principle in the future.
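As a hypothetical sketch of the kind of canary and rollout validation gate mentioned above (the stage names, thresholds, and measurements are invented for illustration and do not describe Google's internal tooling), a staged rollout could be halted automatically when the canary's error rate exceeds a fixed limit or regresses against its baseline:

```python
# Hypothetical error-rate gate for a staged rollout. Stages, thresholds, and
# the sample measurements below are invented for illustration only.

CANARY_STAGES = ["one-zone", "one-region", "multi-region"]
MAX_ERROR_RATE = 0.01   # halt if the canary exceeds 1% errors
MAX_REGRESSION = 0.005  # or regresses more than 0.5% over the baseline


def gate(stage: str, baseline_error_rate: float, canary_error_rate: float) -> bool:
    """Return True if the rollout may proceed past this stage."""
    if canary_error_rate > MAX_ERROR_RATE:
        print(f"halt at {stage}: canary error rate {canary_error_rate:.2%}")
        return False
    if canary_error_rate - baseline_error_rate > MAX_REGRESSION:
        print(f"halt at {stage}: regression over baseline")
        return False
    return True


# Example run with made-up measurements: the rollout is stopped before it can
# spread beyond a single zone.
measurements = {"one-zone": (0.001, 0.04), "one-region": (0.001, 0.001)}
for stage in CANARY_STAGES:
    baseline, canary = measurements.get(stage, (0.0, 0.0))
    if not gate(stage, baseline, canary):
        break
    print(f"{stage}: ok, proceeding")
```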
Nov 11, 2019 12:15

We have updated our initial incident description below to reflect impact more accurately.
Nov 11, 2019 05:47

We will publish an analysis of this incident once we have completed our internal investigation.
Nov 11, 2019 05:36

The issue with multiple products is resolved for most projects, and our Engineering Team is working on full resolution for Stackdriver Monitoring and Google Cloud Storage in the US multi-region. If you have questions or are impacted, please open a case with the Support Team and we will work with you until this issue is resolved. No further updates will be provided here.
Nov 11, 2019 04:44

Description: We are investigating an issue with an infrastructure component impacting multiple products. We believe we have identified the cause and are currently rolling out a mitigation. We will provide an update by Monday, 2019-11-11 05:30 US/Pacific with current details.
Diagnosis: None at this time.
Workaround: None at this time.
Nov 11, 2019 04:00

Description: Mitigation work is currently underway by our engineering team. We will provide more information by Monday, 2019-11-11 05:00 US/Pacific.
Diagnosis: None at this time.
Workaround: None at this time.
Nov 11, 2019 03:15

Description: We are experiencing an issue with some Google Cloud APIs across us-east1, us-east4, and southamerica-east1, with some APIs impacted globally. This includes the APIs for Compute Engine, Cloud Storage, BigQuery, Dataflow, Dataproc, and Pub/Sub. App Engine applications in those regions are also impacted. Our engineering team continues to investigate the issue. We will provide an update by Monday, 2019-11-11 03:45 US/Pacific with current details.
Diagnosis: None at this time.
Workaround: None at this time.