Service Health

This page provides status information on the services that are part of Google Cloud. Check back here to view the current status of the services listed below. If you are experiencing an issue not listed here, please contact Support. Learn more about what's posted on the dashboard in this FAQ. For additional information on these services, please visit https://cloud.google.com/.

Incident affecting Google BigQuery, Apigee, Cloud Developer Tools, Cloud Firestore, Google Compute Engine, Google Cloud Bigtable, Google Cloud Storage, Google Cloud Networking, Identity and Access Management, Google Cloud Pub/Sub, Cloud Build, Google Cloud Functions, Container Registry

Multiple services are being impacted globally within Google Cloud

Incident began at 2023-02-27 04:58 and ended at 2023-02-27 05:11 (all times are US/Pacific).

Previously affected location(s)

Taiwan (asia-east1), Hong Kong (asia-east2), Tokyo (asia-northeast1), Osaka (asia-northeast2), Seoul (asia-northeast3), Mumbai (asia-south1), Delhi (asia-south2), Singapore (asia-southeast1), Jakarta (asia-southeast2), Sydney (australia-southeast1), Melbourne (australia-southeast2), Warsaw (europe-central2), Finland (europe-north1), Madrid (europe-southwest1), Belgium (europe-west1), London (europe-west2), Frankfurt (europe-west3), Netherlands (europe-west4), Zurich (europe-west6), Milan (europe-west8), Paris (europe-west9), Global, Tel Aviv (me-west1), Montréal (northamerica-northeast1), Toronto (northamerica-northeast2), São Paulo (southamerica-east1), Santiago (southamerica-west1), Iowa (us-central1), South Carolina (us-east1), Northern Virginia (us-east4), Columbus (us-east5), Dallas (us-south1), Oregon (us-west1), Los Angeles (us-west2), Salt Lake City (us-west3), Las Vegas (us-west4)

3 Mar 2023 10:25 PST

Incident Report

Summary

On Monday, February 27, 2023, Google Cloud Networking’s production network experienced significant packet loss starting at 04:58 US/Pacific and lasting up to seven minutes. This caused errors and failures in several downstream Google Cloud and Google Workspace services, which took up to an additional six minutes to recover. To our customers who were impacted during this outage, we sincerely apologize. This is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform’s performance and availability. We have conducted an internal investigation and are taking steps to improve our service.

Root Cause

Google’s production network has several levels of redundancy and several systems to ensure optimal bandwidth routing. One of our back-end control plane elements, responsible for calculating optimal paths for bandwidth, consumes snapshot data from a critical element that provides detailed network modeling, including topology, statistics, and forwarding table information. During a routine update to the critical element’s snapshot data, an incomplete snapshot was inadvertently shared, which removed several sites from the topology map. This caused traffic originating from those sites to be discarded until recovery mechanisms kicked in and correctly reprogrammed all paths. The packet loss caused errors and failures in multiple downstream Google Cloud and Google Workspace services.
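To make the failure mode concrete, the following is a minimal, purely illustrative sketch (in Python), not Google's actual control-plane code: when a path-computation step consumes a topology snapshot from which some sites are missing, no path can be programmed for traffic originating at those sites, so that traffic is dropped until a complete snapshot is applied. All names here (Topology, compute_path, route_packet) are hypothetical.

    # Purely illustrative sketch of the failure mode described above; the types
    # and names are hypothetical, not Google's actual control-plane code.
    from typing import Optional

    class Topology:
        """A toy topology: a mapping from site name to its neighbor sites."""
        def __init__(self, links: dict[str, set[str]]):
            self.links = links

        def has_site(self, site: str) -> bool:
            return site in self.links

    def compute_path(topology: Topology, src: str, dst: str) -> Optional[list[str]]:
        """Breadth-first search for a path; returns None if either site is unknown."""
        if not topology.has_site(src) or not topology.has_site(dst):
            return None  # site absent from the snapshot: no path can be programmed
        frontier, seen = [[src]], {src}
        while frontier:
            path = frontier.pop(0)
            if path[-1] == dst:
                return path
            for nxt in topology.links[path[-1]] - seen:
                seen.add(nxt)
                frontier.append(path + [nxt])
        return None

    def route_packet(topology: Topology, src: str, dst: str) -> str:
        path = compute_path(topology, src, dst)
        # With an incomplete snapshot, packets from the missing sites are dropped
        # until a complete snapshot is pushed and all paths are reprogrammed.
        return "forwarded via " + " -> ".join(path) if path else "dropped"

    full = Topology({"site-a": {"site-b"}, "site-b": {"site-a", "site-c"}, "site-c": {"site-b"}})
    incomplete = Topology({"site-b": {"site-c"}, "site-c": {"site-b"}})  # site-a missing

    print(route_packet(full, "site-a", "site-c"))        # forwarded via site-a -> site-b -> site-c
    print(route_packet(incomplete, "site-a", "site-c"))  # dropped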

Remediation and Prevention

Google's automation systems mitigated this failure by pushing a complete topology snapshot during the next programming cycle. The missing sites were restored to the topology map and the network converged by 05:05 US/Pacific.

Google is committed to preventing this type of disruption from reoccurring and is taking the following actions:

  • Input validation in the control plane has been fully activated for the most critical components and is rolling out to the remaining components. The bandwidth routing system is now more robust to unexpectedly large changes in the topology (a minimal illustrative sketch of such validation follows this list).
  • Deploy a fix to the topology system to prevent this source of incomplete snapshots.
  • Shard the system providing topology input to the control plane so that it does not span multiple regions. This will ensure that an erroneous input, if any, will not impact traffic originating from many regions at the same time.
  • Use safer sequencing when removing sites from the topology map.
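As an illustration of the first action above, a guard along the following lines (a hypothetical sketch, not Google's implementation; the 10% threshold and all names are assumptions) could reject a snapshot that unexpectedly removes a large fraction of sites before it reaches the bandwidth routing system.

    # Hypothetical sketch of input validation for topology snapshots; the 10%
    # threshold and all names are illustrative assumptions, not Google's code.

    class InvalidSnapshotError(Exception):
        pass

    def validate_snapshot(current_sites: set[str], new_sites: set[str],
                          max_removed_fraction: float = 0.10) -> None:
        """Reject a snapshot that removes an unexpectedly large share of sites."""
        if not current_sites:
            return  # nothing to compare against on the first programming cycle
        removed = current_sites - new_sites
        fraction = len(removed) / len(current_sites)
        if fraction > max_removed_fraction:
            raise InvalidSnapshotError(
                f"snapshot removes {len(removed)} of {len(current_sites)} sites "
                f"({fraction:.0%}); refusing to apply without review")

    # Example: an incomplete snapshot that silently drops a third of the sites
    # is rejected instead of being pushed to the path-computation element.
    try:
        validate_snapshot({"site-a", "site-b", "site-c"}, {"site-b", "site-c"})
    except InvalidSnapshotError as err:
        print(f"rejected: {err}")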

Detailed Description of Impact

On Monday, 27 February 2023, from 04:58 to 05:12 US/Pacific unless otherwise noted:

Google Cloud Platform Services:

Apigee: Up to 20% of requests in the affected regions (southamerica-east1, us-east1, and us-central1) experienced timeouts and elevated 5xx error rates.

Virtual Private Cloud: Affected customers/users experienced increased packet loss for cross-region traffic between the affected regions: asia-east1, asia-southeast1, europe-north1, europe-west1, europe-west4, us-central1, us-central2, us-east1, us-east4, us-west1, us-west4.

Cloud Interconnect: Affected customers/users in regions us-east4, asia-southeast1, southamerica-east1, us-east7, us-west4 experienced increased packet loss on their interconnects.

BigQuery: Affected customers/users running queries in affected BigQuery regions experienced increased latency and elevated UNAVAILABLE error rates (retriable 503 errors) between 04:58 and 05:10. Regions affected were: aws-us-east-1, azure-eastus2, southamerica-east1, us-multiregion, us-east4, us-east5, us-east7, us-south1.

Cloud Dataflow: Overall, two regions were impacted: us-central1 and us-east4. ~20% of Dataflow jobs in us-central1 experienced streaming data disruption: no data passed through the Dataflow pipelines for about 13 minutes. Dataflow Batch jobs in us-central1 saw no visible impact. In us-east4, the impact affected 55% of Dataflow Streaming jobs and 75% of Dataflow Batch jobs, for about 13 minutes.

Cloud Bigtable: Affected customers/users in us-central1, us-east4, and us-west1 experienced unavailable errors (retriable 503 errors) or deadline exceeded (504 errors). 11.3% of customer projects were affected by the issue.

Cloud Key Management Service (KMS): Affected customers/users in us-east4, us-east5, us-central1, multi-region us, multi-region nam7, and global experienced reduced availability in the form of retriable 503 (unavailable) and 504 (deadline exceeded) errors across two categories of KMS keys: software and hardware keys. For software keys, 2.4% of customers were affected across all the regions mentioned above. For hardware keys, the impact was limited to us-central1, multi-region us, and global, where 0.7% of customers were affected. During the outage, 0.1% of software requests and 0.78% of hardware requests returned 503 and 504 errors. Requests that were unable to reach our servers likely received a 503 (unavailable) error, in which case they were retried and eventually succeeded, or a 504 (deadline exceeded) error.

Cloud Monitoring: Affected customers/users experienced elevated latency on Cloud Monitoring dashboards. Customers writing metric data through the Monitoring API experienced elevated error rates.

Persistent Disk: Affected customers/users would have seen reads, writes, and unmaps stall. Approximately 0.12% of devices were affected globally. Affected regions were: us-central1, us-east4, southamerica-east1, us-west4, asia-southeast1.

Cloud SQL: Affected customers/users may have seen intermittent connectivity issues for ~10 minutes from 05:00-05:10 US/Pacific (us-central1, us-east4, us-west4, southamerica-east1). Retrying should have succeeded, as the success rate was ~80-95% depending on region and method.
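Several of the impact descriptions above and below (BigQuery, Cloud Bigtable, Cloud KMS, Cloud SQL, Cloud Spanner, Cloud Storage) note that the errors were retriable 503/504 responses and that retries generally succeeded. As a general illustration only, not an official client recommendation, a caller can wrap such requests in retries with exponential backoff and jitter; send_request below is a placeholder for whatever call the client is making.

    # Generic retry-with-exponential-backoff sketch for retriable 503/504 errors.
    # `send_request` is a placeholder for the caller's own request function; the
    # retry parameters are illustrative, not an official recommendation.
    import random
    import time

    RETRIABLE_STATUS = {503, 504}

    def call_with_retries(send_request, max_attempts=5, base_delay=0.5, max_delay=8.0):
        for attempt in range(1, max_attempts + 1):
            status, body = send_request()
            if status not in RETRIABLE_STATUS:
                return status, body        # success or a non-retriable error
            if attempt == max_attempts:
                return status, body        # give up after the final attempt
            # Exponential backoff with full jitter to avoid synchronized retries.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))

    # Example with a fake request that fails twice and then succeeds.
    responses = iter([(503, ""), (504, ""), (200, "ok")])
    print(call_with_retries(lambda: next(responses)))  # (200, 'ok')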

Cloud Workflows: ~1% of (global) requests to the workflow executions API failed. 75% of requests failed in us-west4. 46% of requests failed in us-east4. ~3.5% of requests failed in us-central1. ~2.5% of requests failed in southamerica-east1 and northamerica-northeast2.

Cloud Console: Affected customers/users experienced an elevated number of GUI failures. Up to 9% of page views were affected during the impact period.

Cloud Load Balancing: Affected customers/users experienced elevated 500 error rates from load balancers for traffic passing through asia-southeast1, us-central1, us-west4.

Google Compute Engine: Affected customers/users experienced elevated 500 error rates when sending HTTP requests to Google Compute Engine APIs. This also included timeouts when reading or writing GCE metadata guest attributes. Additionally, affected users would have experienced an increase in latency and elevated UNAVAILABLE error rates (retriable 503 errors) between 05:00 and 05:10 US/Pacific for Compute Engine Frontend UI pages.

Cloud Run: Affected customers/users experienced elevated error rates (400 or 500 errors), request timeouts, and control plane request failures. Retrying the requests may have succeeded in some cases.

Cloud App Engine: Affected customers/users experienced elevated error rates (400 or 500 errors), request timeouts, and control plane request failures. Retrying the requests may have succeeded in some cases.

Cloud Functions: Affected customers/users experienced elevated error rates (400 or 500 errors), request timeouts, and control plane request failures. Retrying the requests may have succeeded in some cases.

Cloud VPN: Affected customers/users experienced elevated packet loss in us-east4, asia-southeast1, southamerica-east1, us-east7, us-west4.

Identity and Access Management: Affected customers/users experienced NOT_FOUND, PERMISSION_DENIED, DEADLINE_EXCEEDED and UNAVAILABLE errors.

Cloud Pub/Sub: Affected customers/users experienced unavailability and increased latency for Publish operations, with the most severe impact occurring in us-east1, us-east4, and us-east5, where approximately 40% of projects experienced at least one minute in which fewer than 99% of requests succeeded. In most regions, <10% of projects experienced impact. Availability of subscribe operations was also impacted in a similar pattern. In combination with the system's inability to move message data between locations as normal, this led to increased end-to-end message delivery latency for approximately 20% of subscriptions.

Google Cloud Storage: On 2023-02-27, from 04:58 to 05:06 (8 minutes), some GCS customers/users in us-east4 experienced reduced availability in the form of Service Unavailable (retryable) errors at a rate of about 1% overall. Less than 1% of customer projects had an error rate of more than 1%. Some GCS customers in the configurations us-multiregion, us-central1 and us-east1 also experienced an elevated error rate during the impact window that did not exceed 1%.

Cloud Firestore: Affected customers/users would have seen increased unavailability errors with the Firestore and Datastore APIs in various regions, including southamerica-east1, us-west4, us-east4, us-east1, and nam5. Globally, around 33% of active customers and 0.38% of active requests received 502 (unavailable) errors or increased latency between 04:58 and 05:10 US/Pacific.

Cloud Data Loss Prevention: No user impact

Cloud Memorystore:

  • Control Plane: Between 4:58 and 5:09 US/Pacific, some customers/users issuing control plane requests (like GetInstance, CreateInstance, etc.) experienced a significant increase in latency and, in ~35% of the cases, request failures with 5xx error codes. The issue was most pronounced in the us-east4, us-west4, us-east1, and asia-southeast1 cloud regions, but was noticeable in other regions as well.

  • Data plane: Between 5:00 and 5:04 US/Pacific, a number of instances in us-east4, us-west4, southamerica-east1, us-central1, and europe-west2 experienced 1-5 minutes of unavailability, where customers were unable to connect to their Redis server. In many STANDARD-tier instances, this resulted in a failover.

Cloud Spanner: On 2023-02-27, from 04:57 to 05:06 US/Pacific (9 minutes), some customers/users in us-east4, nam3, nam7, nam9, nam11, nam12, nam-eur-asia3 experienced reduced availability in the form of deadline exceeded (retryable) errors and also an increase in latency. Less than 1% of customer projects had an error rate of more than 1%.

Cloud Build: Cloud Build API customers/users in 2 regions (southamerica-east1 and prod-global) experienced high latency and DEADLINE_EXCEEDED responses. Availability SLO for {Get,List}WorkerPool in southamerica-east1 was down to 13% for 3 minutes and consumed 25% of the 30-day error budget. Availability SLO for ReceiveGitHubDotComWebhook was down to 28% for 3 minutes and consumed 9% of the 30-day error budget.

Container Registry: Container Registry customers/users in the global region experienced high latency and HTTP 504 responses. Availability SLO for manifests_get consumed 6% of the 30-day error budget. Availability SLO for ping_and_token_availability consumed 15% of the 30-day error budget.

Cloud Tasks: Cloud Tasks customers/users in the us-central1 region experienced high latency and DEADLINE_EXCEEDED responses for CreateTasks requests. Remote Procedure Call (RPC) error rate increased from 0 to 10% for 3 minutes.

Google Kubernetes Engine (GKE): GKE customers/users may have experienced service degradation and elevated 500 errors in affected locations.

Workspace services

Gmail: Affected customers/users would have experienced unavailability, 502 errors when accessing Gmail, and email delivery delays and failures between 04:58 and 05:06 US/Pacific.

Google Calendar: Affected customers/users experienced general unavailability when accessing Calendar.

Google Chat: Affected customers/users at affected locations experienced errors when attempting to access and use Google Chat.

Google Meet: Affected customers/users experienced failure rates of up to 14% when attempting to start or join a new meeting.

Google Docs: Customers/users in affected locations would have experienced errors when loading or accessing documents.

Google Drive: Up to 10% of customers/users accessing Google Drive during the time window experienced unavailability (HTTP 500 errors).

Google Tasks: Affected customers/users experienced availability issues with Tasks.

Google Voice: Affected customers/users experienced up to a 2.9% error rate when interacting with the Voice API. Up to 2% of ongoing Google Voice calls may have been dropped. Up to 13% of desk phones may not have been able to make or receive calls during the window, and any outgoing calls on these phones would have been dropped.

27 Feb 2023 10:54 PST

Mini Incident Report

We apologize for the inconvenience this service disruption/outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support or to Google Workspace Support using help article https://support.google.com/a/answer/1047213.

(All Times US/Pacific)

Incident Start: 27 Feb 2023 04:58

Incident End: 27 Feb 2023 05:11

Duration: 13 minutes

Affected Services and Features: Multiple Google Cloud Platform and Google Workspace services

Regions/Zones: Multiple regions

Description:

Multiple Google Cloud Platform and Google Workspace services experienced elevated error rates for 13 minutes due to packet loss on Google’s production backbone network. Google will be completing a full Incident Report in the following days that will include additional details and provide a full root cause.

Customer Impact:

Google Cloud Platform Services:

  • Apigee - Affected users experienced client connect timeouts and elevated 5XX errors.
  • Virtual Private Cloud (VPC) - Affected users experienced increased packet loss for cross region traffic.
  • Cloud Interconnect - Affected users in regions us-east4, asia-southeast1, southamerica-east1, us-east7, us-west4 experienced increased packet loss on their interconnects.
  • BigQuery - Affected users experienced availability issues with BigQuery.
  • Cloud Dataflow - Affected users experienced elevated latency and error rates for batch and streaming jobs. Some users may have experienced increased latency in Create Workflow operations.
  • Cloud Bigtable - Affected users experienced deadline exceeded or unavailable errors with Data and Admin APIs in us-east4, us-central1, and us-west1 regions.
  • Cloud Key Management Service (KMS) - Affected users experienced “unavailable” errors when attempting to use the service.
  • Cloud Monitoring - Affected users experienced elevated latency on Cloud Monitoring dashboard when accessing resources.
  • Persistent Disk - Affected users experienced elevated read/write latency.
  • Cloud SQL - Customers may have seen intermittent connectivity issues for ~10 minutes from 05:00-05:10 (us-central1, us-east4, us-west4, southamerica-east1). Retrying should have succeeded, as the success rate was ~80-95% depending on region and method.
  • Cloud Workflows - Affected users may have seen increased error rates.
  • Cloud Console - Affected users may have seen increased GUI failures for the impact duration.
  • Google Compute Engine - Affected users experienced elevated 500 error rates when sending HTTP requests to Google Compute Engine APIs.
  • Cloud Load Balancing - Affected users experienced elevated 500 error rates from load balancers for traffic passing through asia-southeast1, us-central1, us-west4.
  • Cloud Run - Affected users experienced elevated error rates (400 or 500 errors), request time outs, and control plane request failures. Retrying the requests may have succeeded in some cases.
  • Cloud App Engine - Affected users experienced elevated error rates (400 or 500 errors), request time outs, and control plane request failures. Retrying the requests may have succeeded in some cases.
  • Cloud Functions - Affected users experienced elevated error rates (400 or 500 errors), request time outs, and control plane request failures. Retrying the requests may have succeeded in some cases.
  • Cloud VPN - Affected users experienced elevated packet loss in us-east4, asia-southeast1, southamerica-east1, us-east7, us-west4.
  • Identity and Access Management - Affected users experienced NOT_FOUND, PERMISSION_DENIED, DEADLINE_EXCEEDED and UNAVAILABLE errors.
  • Cloud Pub/Sub - Affected users experienced availability issues.
  • Google Cloud Storage - Affected users experienced “unavailable” errors.
  • Cloud Firestore - Affected users experienced issues with Firestore service availability.
  • Cloud Data Loss Prevention - Affected users experienced elevated RPC error rates.
  • Cloud Memorystore - Affected users experienced increased latency and deadline exceeded errors for some requests like GetInstance, ListInstance, ExportInstance. Some users experienced issues with creating new Memorystore for Redis instances.

Workspace services:

  • Gmail - Affected users experienced elevated error rates.
  • Google Calendar - Affected users experienced availability issues with Calendar.
  • Google Chat - Affected users experienced issues with chat interactions on web and mobile.
  • Google Meet - Affected users experienced failures to join a meeting. Retrying the join succeeded for most of the affected users.
  • Google Docs - Affected users experienced errors when loading or accessing a document.
  • Google Drive - Affected users experienced elevated errors.
  • Google Tasks - Affected users experienced availability issues with tasks.
  • Google Voice - Affected users experienced elevated error rate when interacting with Voice API.

In most cases for Workspace services, retrying or refreshing the request had a chance of succeeding. A subset of users experienced some persistent unavailability (more than ~3-10 server errors in a 10-minute timespan).

27 Feb 2023 06:26 PST

The issue with Apigee, Cloud Build, Cloud Firestore, Container Registry, Google BigQuery, Google Cloud Bigtable, Google Cloud Networking, Google Cloud Pub/Sub, Google Cloud Storage, Google Compute Engine, Identity and Access Management has been resolved for all affected users as of Monday, 2023-02-27 05:11 US/Pacific.

We thank you for your patience while we worked on resolving the issue.

27 Feb 2023 06:02 PST

Summary: Multiple services are being impacted globally within Google Cloud

Description: We are experiencing an issue with Cloud Build, Cloud Firestore, Google BigQuery, Google Cloud Networking, Google Cloud Pub/Sub, Google Compute Engine.

Our engineering team continues to investigate the issue while some services are starting to be restored.

We will provide an update by Monday, 2023-02-27 06:40 US/Pacific with current details.

Diagnosis: Multiple services can be affected globally due to an ongoing outage

Workaround: None at this time

27 Feb 2023 05:58 PST

Summary: Multiple services are being impacted globally within Google Cloud

Description: We are experiencing an issue with Cloud Build, Cloud Firestore, Google BigQuery, Google Cloud Networking, Google Cloud Pub/Sub, Google Compute Engine.

Our engineering team continues to investigate the issue while some services are starting to be restored.

We will provide an update by Monday, 2023-02-27 06:30 US/Pacific with current details.

Diagnosis: Multiple services can be affected globally due to an ongoing outage

Workaround: None at this time

27 Feb 2023 05:55 PST

Summary: Multiple services are being impacted globally within Google Cloud

Description: We are experiencing an issue with Cloud Build, Cloud Firestore, Google BigQuery, Google Cloud Networking, Google Cloud Pub/Sub, Google Compute Engine. Our engineering team continues to investigate the issue. We will provide an update by Monday, 2023-02-27 06:30 US/Pacific with current details.

Diagnosis: Multiple services can be affected globally due to an ongoing outage

Workaround: None at this time

27 Feb 2023 05:41 PST

Summary: Multiple services are being impacted globally within Google Cloud

Description: We are experiencing an issue with Cloud Build, Cloud Firestore, Google BigQuery, Google Cloud Networking, Google Cloud Pub/Sub, Google Compute Engine.

Our engineering team continues to investigate the issue.

We will provide an update by Monday, 2023-02-27 06:10 US/Pacific with current details.

Diagnosis: Multiple services can be affected globally due to an ongoing outage

Workaround: None at this time