Incident affecting Google Cloud Infrastructure Components, Google Cloud Support, Google Cloud Console, Google BigQuery, Google Cloud Storage, Google Cloud Networking, Google Kubernetes Engine, Virtual Private Cloud (VPC)
Google Cloud services are experiencing issues and we have an other update at 5:30 PDT
Incident began at 2020-12-14 04:07 and ended at 2020-12-14 06:23 (all times are US/Pacific).
|22 Dec 2020
The following is a correction to the previously posted ISSUE SUMMARY, which after further research we determined needed an amendment. All services that require sign-in via a Google Account were affected with varying impact. Some operations with Cloud service accounts experienced elevated error rates on requests to the following endpoints: www.googleapis.com or oauth2.googleapis.com. Impact varied based on the Cloud Service and service account. Please open a support case if you were impacted and have further questions.
|18 Dec 2020
On Monday 14 December, 2020, for a duration of 47 minutes, customer-facing Google services that required Google OAuth access were unavailable. Cloud Service accounts used by GCP workloads were not impacted and continued to function. We apologize to our customers whose services or businesses were impacted during this incident, and we are taking immediate steps to improve the platform’s performance and availability.
The Google User ID Service maintains a unique identifier for every account and handles authentication credentials for OAuth tokens and cookies. It stores account data in a distributed database, which uses Paxos protocols to coordinate updates. For security reasons, this service will reject requests when it detects outdated data.
Google uses an evolving suite of automation tools to manage the quota of various resources allocated for services. As part of an ongoing migration of the User ID Service to a new quota system, a change was made in October to register the User ID Service with the new quota system, but parts of the previous quota system were left in place which incorrectly reported the usage for the User ID Service as 0. An existing grace period on enforcing quota restrictions delayed the impact, which eventually expired, triggering automated quota systems to decrease the quota allowed for the User ID service and triggering this incident. Existing safety checks exist to prevent many unintended quota changes, but at the time they did not cover the scenario of zero reported load for a single service:
• Quota changes to large number of users, since only a single group was the target of the change,
• Lowering quota below usage, since the reported usage was inaccurately being reported as zero,
• Excessive quota reduction to storage systems, since no alert fired during the grace period,
• Low quota, since the difference between usage and quota exceeded the protection limit.
As a result, the quota for the account database was reduced, which prevented the Paxos leader from writing. Shortly after, the majority of read operations became outdated which resulted in errors on authentication lookups.
REMEDIATION AND PREVENTION
The scope of the problem was immediately clear as the new quotas took effect. This was detected by automated alerts for capacity at 2020-12-14 03:43 US/Pacific, and for errors with the User ID Service starting at 03:46, which paged Google Engineers at 03:48 within one minute of customer impact. At 04:08 the root cause and a potential fix were identified, which led to disabling the quota enforcement in one datacenter at 04:22. This quickly improved the situation, and at 04:27 the same mitigation was applied to all datacenters, which returned error rates to normal levels by 04:33. As outlined below, some user services took longer to fully recover.
In addition to fixing the underlying cause, we will be implementing changes to prevent, reduce the impact of, and better communicate about this type of failure in several ways:
1. Review our quota management automation to prevent fast implementation of global changes
2. Improve monitoring and alerting to catch incorrect configurations sooner
3. Improve reliability of tools and procedures for posting external communications during outages that affect internal tools
4. Evaluate and implement improved write failure resilience into our User ID service database
5. Improve resilience of GCP Services to more strictly limit the impact to the data plane during User ID Service failures
We would like to apologize for the scope of impact that this incident had on our customers and their businesses. We take any incident that affects the availability and reliability of our customers extremely seriously, particularly incidents which span multiple regions. We are conducting a thorough investigation of the incident and will be making the changes which result from that investigation our top priority in Google Engineering.
DETAILED DESCRIPTION OF IMPACT
On Monday 14 December, 2020 from 03:46 to 04:33 US/Pacific, credential issuance and account metadata lookups for all Google user accounts failed. As a result, we could not verify that user requests were authenticated and served 5xx errors on virtually all authenticated traffic. The majority of authenticated services experienced similar control plane impact: elevated error rates across all Google Cloud Platform and Google Workspace APIs and Consoles. Products continued to deliver service normally during the incident except where specifically called out below. Most services recovered automatically within a short period of time after the main issue ended at 04:33. Some services had unique or lingering impact, which is detailed below.
Any users who hadn't already previously authenticated to Cloud Console were unable to login. Users who had already authenticated may have been able to use Cloud Console but may have seen some features degraded.
During the incident, streaming requests returned ~75% errors, while BigQuery jobs returned ~10% errors on average globally.
Google Cloud Storage
Approximately 15% of requests to Google Cloud Storage (GCS) were impacted during the outage, specifically those using OAuth, HMAC or email authentication. After 2020-12-14 04:31 US/Pacific, the majority of impact was resolved, however, there was lingering impact, for <1% of clients that attempted to finalize resumable uploads that started during the window. These uploads were left in a non-resumable state; the error code GCS returned was retryable, but subsequent retries were unable to make progress, leaving these objects unfinalized.
Google Cloud Networking
The networking control plane continued to see elevated error rates on operations until it fully recovered at 2020-12-14 05:21 US/Pacific. Only operations that made modifications to the data plane VPC network were impacted. All existing configurations in the data plane remained operational.
Google Kubernetes Engine
During the incident, ~4% of requests to the GKE control plane API failed, and nearly all Google-managed and customer workloads could not report metrics to Cloud Monitoring.
We believe ~5% of requests to Kubernetes control planes failed but do not have accurate measures due to unreported Cloud Monitoring metrics.
For up to an hour after the outage, ~1.9% nodes reported conditions such as StartGracePeriod or NetworkUnavailable which may have had an impact on user workloads.
All Google Workspace services rely on Google's account infrastructure for login, authentication, and enforcement of access control on resources (e.g. documents, Calendar events, Gmail messages). As a consequence, all authenticated Google Workspace apps were down for the duration of the incident. After the issue was mitigated at 2020-12-14 04:32 US/Pacific, Google Workspace apps recovered, and most services were fully recovered by 05:00. Some services, including Google Calendar and Google Workspace Admin Console, served errors up to 05:21 due to a traffic spike following initial recovery. Some Gmail users experienced errors for up to an hour after recovery due to caching of errors from identity services.
Cloud Support's internal tools were impacted, which delayed our ability to share outage communications with customers on the Google Cloud Platform and Google Workspace Status Dashboards. Customers were unable to create or view cases in the Cloud Console. We were able to update customers at 2020-12-14 05:34 US/Pacific after the impact had ended.
|14 Dec 2020
Preliminary Incident Statement while full Incident Report is prepared.
(All Times US/Pacific)
Google Cloud Platform and Google Workspace experienced a global outage affecting all services which require Google account authentication for a duration of 50 minutes. The root cause was an issue in our automated quota management system which reduced capacity for Google's central identity management system, causing it to return errors globally. As a result, we couldn’t verify that user requests were authenticated and served errors to our users.
|14 Dec 2020
As of 4:32 PST the system affected was restored and all services recovered shortly afterwards.
|14 Dec 2020
Google Cloud services are experiencing issues and we have an other update at 5:30 PST
- All times are US/Pacific