Service Health

This page provides status information on the services that are part of Google Cloud. Check back here to view the current status of the services listed below. If you are experiencing an issue not listed here, please contact Support. Learn more about what's posted on the dashboard in this FAQ. For additional information on these services, please visit https://cloud.google.com/.

Incident affecting Vertex AI AutoML Image, Vertex AI Matching Engine, Vertex AI AutoML Tabular, Pub/Sub Lite, Hybrid Connectivity, Cloud Key Management Service, Google Cloud Deploy, Cloud Run, Vertex AI TensorBoard, Cloud Developer Tools, Virtual Private Cloud (VPC), Dialogflow CX, Cloud Workflows, Operations, Cloud Spanner, Vertex AI Explainable AI, Vertex AI Workbench User Managed Notebooks, Google Compute Engine, Cloud Memorystore, Dataproc Metastore, Cloud Logging, Certificate Authority Service, Artifact Registry, Vertex AI Vizier, Persistent Disk, Vertex AI Data Labeling, Google Cloud Dataflow, Data Catalog, Vertex AI Model Registry, Google Cloud Networking, Google Cloud Console, Eventarc, Identity and Access Management, Vertex AI Training, Google Cloud Pub/Sub, Cloud Build, Vertex AI AutoML Video, Vertex AI AutoML Text, Cloud Load Balancing, Vertex AI Pipelines, Vertex AI Feature Store, Vertex AI ML Metadata, Vertex AI Online Prediction, Vertex AI Model Monitoring, Google Cloud Tasks, Vertex AI Batch Prediction, Google Cloud Dataproc, Cloud Machine Learning, Healthcare and Life Sciences, Google Cloud SQL, Google Kubernetes Engine, GKE fleet management, Document AI Warehouse

Multiple Google Cloud Products are experiencing issues in us-west1

Incident began at 2024-02-14 09:45 and ended at 2024-02-14 12:52 (all times are US/Pacific).

Previously affected location(s)

Oregon (us-west1)

21 Feb 2024 13:39 PST

Incident Report

Summary

On 14 February 2024 from 09:45 AM to 12:52 PM US/Pacific, Google Cloud customers in us-west1 experienced control plane unavailability due to elevated latencies and errors. In addition, a few services experienced data plane unavailability for the same reason. The full list of impacted products and services is detailed below.

To our Google Cloud customers whose businesses were impacted during this outage, we sincerely apologize. This is not the level of quality and reliability we strive to offer you.

Root Cause

Most Google Cloud products and services use a regional metadata store to support their internal operations. The metadata store supports critical functions such as servicing customer requests, handling scale, load balancing, and admin operations, and retrieving/storing metadata, including server location information.

The regional metadata store continuously manages load by automatically adjusting compute capacity in response to changes in demand. When usage increases, additional resources are added and load is also balanced automatically. However, an unexpected spike in demand exceeded the system’s ability to quickly provision additional resources. As a result, multiple Google Cloud products and services in the region experienced elevated latencies and errors until the unexpected load was isolated.
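To illustrate this failure mode, below is a minimal, hypothetical sketch (in Python) of an autoscaled store whose capacity provisioning lags a sudden demand spike, so unserved work accumulates as a backlog and surfaces as latency and errors until the excess load is isolated. Every name and number here is an illustrative assumption and does not describe the actual metadata store implementation.

# Illustrative only: toy model of delayed capacity provisioning under a spike.
def simulate(minutes=40, spike_start=10, baseline_qps=1000, spike_qps=5000,
             capacity_step=500, provisioning_lag=5):
    """Each minute the autoscaler orders more capacity when overloaded, but the
    new capacity only comes online `provisioning_lag` minutes later. Work that
    cannot be served in a minute accumulates as backlog (latency and errors)."""
    capacity = baseline_qps
    pending = []          # (minute the ordered capacity arrives, extra capacity)
    backlog = 0
    for minute in range(minutes):
        demand = spike_qps if minute >= spike_start else baseline_qps

        # Capacity ordered earlier finally comes online.
        capacity += sum(c for t, c in pending if t == minute)
        pending = [(t, c) for t, c in pending if t != minute]

        # The autoscaler reacts to overload, but the added capacity is delayed.
        if demand + backlog > capacity:
            pending.append((minute + provisioning_lag, capacity_step))

        served = min(demand + backlog, capacity)
        backlog = demand + backlog - served
        print(f"t={minute:2d}m demand={demand:5d} capacity={capacity:5d} backlog={backlog:7d}")

simulate()

In this toy model the backlog keeps growing for several minutes after the spike begins because each capacity increment arrives too late, which is the shape of impact customers observed as elevated latencies and errors.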

Remediation and Prevention/Detection

Google engineers were alerted to this problem by our internal monitoring systems and throttled the spiking workloads on the underlying regional metadata store. This allowed Google Cloud products and services to read/write state at a normal rate, restoring healthy servicing of customer requests once the backlog of operations on the regional metadata store was processed.

Google is committed to preventing a repeat of this issue in the future and is completing the following actions:

  • Improve monitoring and alerting for earlier detection of unexpected spikes in the regional metadata stores.
  • Enhance the ability of the regional metadata stores to automatically throttle workloads more aggressively when experiencing unexpected spikes (a simplified sketch of this kind of load shedding follows below).
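As a purely illustrative sketch of the second action item, the snippet below (Python) admits requests up to a per-second limit and halves that limit whenever arrivals spike well above it, recovering gradually once load normalizes. Google's internal throttling implementation is not public; the class name, thresholds, and recovery policy here are hypothetical assumptions.

# Illustrative only: a minimal adaptive throttle; all thresholds are hypothetical.
import time

class AdaptiveThrottle:
    """Admits up to `limit` requests per one-second window. If arrivals in a
    window exceed `spike_factor` times the limit, the limit is halved for the
    next window (aggressive load shedding); otherwise it recovers gradually."""

    def __init__(self, base_limit=1000, spike_factor=2.0, min_limit=100):
        self.base_limit = base_limit
        self.limit = base_limit
        self.spike_factor = spike_factor
        self.min_limit = min_limit
        self.window_start = time.monotonic()
        self.arrivals = 0
        self.admitted = 0

    def allow(self) -> bool:
        now = time.monotonic()
        if now - self.window_start >= 1.0:
            # Tighten the limit after a spike; relax it otherwise.
            if self.arrivals > self.spike_factor * self.limit:
                self.limit = max(self.min_limit, self.limit // 2)
            else:
                self.limit = min(self.base_limit, int(self.limit * 1.1) + 1)
            self.window_start, self.arrivals, self.admitted = now, 0, 0

        self.arrivals += 1
        if self.admitted >= self.limit:
            return False  # shed load: caller should return a retryable error
        self.admitted += 1
        return True

A caller would check allow() before doing work and return a retryable error (for example UNAVAILABLE) when it returns False, so clients back off instead of adding to the backlog.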

Detailed Description of Impact

Google Compute Engine

  • From 10:30 AM to 12:30 PM US/Pacific, a number of GCE APIs returned internal or timeout errors in the us-west1 region, across all zones. The overall error rate for most APIs remained around 1% with errors affecting around 33% of projects in the region.

  • From 10:00 AM to 12:00 PM US/Pacific, less than 0.004% of VMs running in the region crashed; 80% of the crashed VMs recovered, while 20% experienced delays in rescheduling.

  • From 10:00 AM to 12:00 PM US/Pacific, about 1% of read/write requests in the region for guest attributes failed or experienced latency exceeding 1 second.

Google Cloud Pub/Sub Lite

  • From 10:05 AM to 12:00 PM US/Pacific, customers may have experienced elevated end-to-end latency and publish request latency in us-west1. At peak impact, 65% of publish requests in the region failed with aborted, canceled, deadline exceeded, or unavailable errors, affecting up to 47% of projects publishing in the region. Resource administrative operations also displayed unavailability in the region.

App Engine Standard

  • From 09:45 AM to 11:42 AM US/Pacific, ~6% of customers in the region experienced deployment failures and latency on deployments for Google App Engine apps.

Cloud Functions

  • From 09:45 AM to 11:42 AM US/Pacific, ~6% of customers in the region experienced deployment failures and latency on deployments for Cloud Function apps.

Cloud Run

  • From 09:45 AM to 11:42 AM US/Pacific, ~2% of customers in the region experienced deployment failures and latency on deployments for Cloud Run apps.

Dialogflow CX

  • From 9:45 AM to 11:45 AM US/Pacific, a percentage of Dialogflow requests returned internal or timeout errors in the us-west1 region. The error rate stayed below 5% before peaking at 100% around 11:00 AM US/Pacific.

Vertex AI products:

  • From about 10:15 AM to 11:35 AM US/Pacific, all Vertex AI services that heavily rely on metadata store operations, including Online Prediction, Training, Feature Store, ML Metadata, and Notebooks, experienced ~50% error rates (spiking to near 100% at times) in the region.

Google Cloud Pub/Sub

  • From 09:57 AM to 11:26 AM US/Pacific, Cloud Pub/Sub customers with traffic in us-west1 experienced publish errors and unavailability. The publish error rate peaked at ~99% for customers with publish traffic in us-west1. In addition, backlog stats metrics were unavailable for some customers who did not have publish or subscribe traffic in us-west1.

Cloud Memorystore

  • From 10:05 AM to 11:35 AM US/Pacific, customers creating, updating, deleting, importing, or exporting Redis Standalone or Cluster instances in us-west1 may have experienced failures. Around 17% of such requests failed with timeouts or internal errors.

Eventarc

  • From 10:30 AM to 11:30 AM US/Pacific, Eventarc customers in us-west1 experienced event delivery delays of up to 50 minutes as event publish errors in our data plane peaked at 100%. There were high error ratios and latencies for all control plane long-running operations, with a peak error rate of 100% and a peak latency of 55 minutes.

Dataproc Metastore

  • From 10:00 AM to 11:45 AM US/Pacific, all control plane operations sporadically returned internal or deadline exceeded errors (at differing ratios throughout the outage) in the region. Peak impact was from around 10:30 AM to 11:30 AM US/Pacific, when only around 3.33% of operations completed with OK.

Google Cloud Tasks

  • From 10:40 AM to 11:30 AM US/Pacific, Cloud Tasks' main data operation (CreateTask) returned DEADLINE_EXCEEDED errors for all requests in us-west1. This means customers in this region were not able to buffer Tasks and, subsequently, our system was not able to dispatch them.

Cloud Build

  • From 9:45 AM to 12:00 PM US/Pacific, attempts to create or retry builds with Cloud Build may have failed in us-west1. ~15-20% of requests failed during the issue.

Cloud SQL

  • From 10:06 AM to 11:46 AM US/Pacific, many operations requiring the regional metadata store in us-west1 timed out or failed. This affected Cloud SQL instance creations and any modifications to existing us-west1 instances: 30% of instance creations failed, 10% of export operations failed, and <10% of update operations failed in us-west1.

Speech-to-text

  • From 10:30 AM to 11:30 AM US/Pacific, Speech-to-Text (STT) experienced a spike in server errors in us-west1. The issue primarily affected control plane traffic to client STT resources (e.g., CreateRecognizer), which experienced a spike in INTERNAL server errors. At peak, around 10% of Create.* traffic, or around 0.5 QPS, returned such errors.

Cloud Load Balancing

  • From 9:50 AM to 12:52 PM US/Pacific, new Cloud Load Balancer creation was failing for load balancers with backends in the us-west1 region. Also, configuration changes on the same family of products could not be made. The data plane was not affected.

Cloud Networking

  • From 10:00 AM to 11:30 AM US/Pacific, Cloud NAT, Cloud Router, Cloud Interconnect, and Cloud VPN users experienced timeouts for add/delete/modify operations in the us-west1 region.

  • Existing programming and forwarding rules were not impacted.

Cloud Deploy

  • From 9:45 AM to 11:40 AM US/Pacific, Cloud Deploy releases and rollouts in the region were either delayed or failed due to the inability to create builds with Cloud Build, which was also affected by the regional metadata store issue. Whether a release/rollout was delayed or failed depended on whether retrying was successful. We also saw errors creating or updating Cloud Deploy resources due to metadata store RPC errors at the time.

Workflows

  • From 10:10 AM to 11:30 AM US/Pacific, Cloud Workflows experienced latency and availability issues in the us-west1 region. This issue impacted ~2% of customer projects, and customers experienced internal errors such as "deadline_exceeded: metadata store reads could not be validated after transaction function returned error: context deadline exceeded". Example methods that were impacted include CancelExecutions, CreateExecutions, CreateWorkflows, and TriggerPubsubExecution.

Cloud Logging

  • From 9:45 AM to 12:45 PM US/Pacific, ingestion into Cloud Logging storage was delayed in the us-west1 region. This issue impacted ~12.5% of global buckets; regional buckets do not appear to have been delayed.

Dataform

  • From 09:55 AM to 11:40 AM US/Pacific, the availability of our business-critical consumer API dropped to a low of 14.29% in us-west1 during the metadata store outage.

  • The metadata store RPC error ratio peaked at 57.6807%. This metadata store is used for executing customers' release and workflow schedules.

Certificate Authority Service

  • From 9:50 AM to 11:40 AM US/Pacific, 3% of overall traffic to Certificate Authority Service in us-west1 experienced slowness and errors for control and data plane operations. Customers experienced an error rate of 85% for Create Certificate Revocation List requests, while other operations were affected at rates between 1% and 15%.

VPC and Serverless VPC Access

  • From 9:48 AM to 11:16 AM US/Pacific, Serverless VPC Access customers in us-west1 were unable to create, delete, modify, or list Serverless VPC Access Connectors. Error rates hovered between 50% and 90%, with customers seeing DEADLINE_EXCEEDED errors. Serverless VPC Access Connector proxying functionality was unaffected by this incident.

Cloud Dataflow

  • From 10:03 AM to 12:22 PM US/Pacific, Dataflow customers in us-west1 experienced job submission failures peaking at 100%. ~6% of running streaming jobs experienced unusually high system watermarks. Up to 100% of running batch jobs were stuck and failed to make progress during the outage.

Cloud Key Management Service

  • From 10:32 AM to 11:24 AM US/Pacific, ~0.0046% of overall traffic for Cloud Key Management Service in us-west1 served errors (INTERNAL, UNAVAILABLE, or DEADLINE_EXCEEDED) for control and crypto operations.

  • Customers experienced an error rate of 0.000189% for Crypto operations (within SLO) due to serving path redundancy with another storage system.

  • Customers experienced an error rate of 0.28% for Control operations; CreateCryptoKey, CreateKeyRing, and DestroyCryptoKeyVersion were the most affected, though any metadata read/write operation could potentially have been affected as well. Around 0.0219% of resource projects are believed to have been affected during the outage.

Persistent Disk

  • From 10:35 AM to 12:10 PM US/Pacific, some Persistent Disk deletion flows were stuck in the us-west1 region. Affected customers would have observed very long-running disk delete operations without any errors. Less than 0.01% of projects were affected.

Cloud Data Loss Prevention

  • From 10:35 AM to 11:30 AM US/Pacific, around 60% of total requests encountered errors in us-west1.

Cloud Dataproc

  • From 10:00 AM to 12:07 PM US/Pacific, Dataproc customers were unable to perform cluster and batch operations in us-west1. At peak impact, between 10:30 AM and 11:32 AM US/Pacific, 65% of requests to Dataproc returned errors, mostly DEADLINE_EXCEEDED, with some requests, such as create cluster, returning a 100% error rate during this period.

Dataplex Catalog

  • From 9:45 AM to 11:35 AM US/Pacific, Dataplex Catalog customers in us-west1 were unable to create, delete, modify, list, or search data stored in Dataplex Catalog. Error rates reached up to 90% of requests, with customers seeing server errors and increased latency overall. Customers using other regions were not affected by this incident.

Cloud Composer

  • From 9:48 AM to ~11:28 AM US/Pacific, some Cloud Composer customers in us-west1 experienced issues performing control plane operations such as creating/deleting/updating environments; operations requiring the control plane, such as snapshots, could also have been impacted. The Composer data plane (i.e. the Composer environment) was operating normally.

  • The problem was detected by Composer probers, and the Composer Control Plane API availability SLO was impacted.

Instances API

  • From 10:35 AM to 11:50 AM US/Pacific, scheduled snapshots for Persistent Disks in us-west1 were not created on schedule and were instead created with a delay.

Added on 28 Feb 2024

Document AI Warehouse

  • From 11:00 AM to 11:17 AM US/Pacific, the API returned server and client error messages in the US multi-region.
  • During the impact window, all API requests experienced elevated error rates. The overall error rate intermittently spiked beyond 90%.

Google Kubernetes Engine

  • From 10:47 AM to 11:32 AM US/Pacific, customers may have experienced API call failures and in some cases cluster unavailability in the us-west1 region.

  • The API call failure rate peaked at 25% of API calls to the us-west1 region and the cluster unavailability in the us-west1 region peaked at 1% of clusters.

GKE Fleet Management

  • From 09:52 AM to 11:40 AM US/Pacific, customers may have experienced API call failures to add GKE clusters into Fleet in the us-west1 region.

  • The API call failure rate peaked at 100% of API calls to the us-west1 region for the Fleet (GKE Hub) services.


To summarize, multiple Google Cloud Products experienced unavailability and/or elevated error rates for services in the us-west1 region during this issue.

This is the final version of the Incident Report.



15 Feb 2024 09:40 PST

Mini Incident Report

We apologize for the inconvenience this service disruption/outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support. (All Times US/Pacific)

Incident Start: 14 February 2024 10:30

Incident End: 14 February 2024 13:10

Duration: 2 hours, 40 minutes

Affected Services and Features:

  • Artifact Registry
  • Certificate Authority Service
  • Cloud Build
  • Cloud Healthcare
  • Cloud Key Management Service
  • Cloud Load Balancing
  • Cloud Logging
  • Cloud Memorystore
  • Cloud Run
  • Cloud Spanner
  • Cloud SQL
  • Cloud Workflows
  • Data Catalog
  • Dataproc Metastore
  • Dialogflow CX
  • Eventarc
  • Google Cloud Console
  • Google Cloud Dataflow
  • Google Cloud Dataproc
  • Google Cloud Deploy
  • Google Cloud Networking
  • Google Cloud Pub/Sub
  • Google Cloud Tasks
  • Google Compute Engine
  • Hybrid Connectivity
  • Identity and Access Management
  • Persistent Disk
  • Pub/Sub Lite
  • Vertex AI AutoML Image
  • Vertex AI AutoML Tabular
  • Vertex AI AutoML Text
  • Vertex AI AutoML Video
  • Vertex AI Batch Prediction
  • Vertex AI Data Labeling
  • Vertex AI Explainable AI
  • Vertex AI Feature Store
  • Vertex AI Matching Engine
  • Vertex AI ML Metadata
  • Vertex AI Model Monitoring
  • Vertex AI Model Registry
  • Vertex AI Online Prediction
  • Vertex AI Pipelines
  • Vertex AI Search
  • Vertex AI TensorBoard
  • Vertex AI Training
  • Vertex AI Vizier
  • Vertex AI Workbench Instances
  • Virtual Private Cloud (VPC)

Regions/Zones: us-west1

Description:

Customers of multiple Google Cloud products experienced increased latency and error rates in us-west1 for a period of 2 hours, 40 minutes. From preliminary analysis, the root cause of the issue has been narrowed down to an internal database resource allocation issue, which caused reduced availability and increased latency for many GCP services in the region.

Our engineering team mitigated the issue by isolating the problematic traffic and has implemented measures to prevent a recurrence.

Google will complete a full Incident Report in the following days that will provide a detailed root cause.

Customer Impact:

During the time of impact, customers would have experienced high latency and error rates for GCP services in the us-west1 region.

14 Feb 2024 13:07 PST

The core issue affecting Google Cloud Products in us-west1 has been mitigated and all the affected products have full service restoration. We understand the disruption this may have caused and sincerely apologize for any inconvenience.

The root cause of the issue was identified to be an overloaded common infrastructure component. Our engineering team mitigated the issue by isolating the traffic and has implemented measures to prevent a recurrence.

If you have questions or are still experiencing issues, please open a case with the Support Team and we will work with you until this issue is resolved.

We thank you for your patience while we worked on resolving the issue. We will publish a preliminary analysis of this incident once we have completed our internal investigation.

14 Feb 2024 12:24 PST

Summary: Multiple Google Cloud Products are experiencing issues in us-west1

Description: We are experiencing an issue with multiple Google Cloud Products beginning on Wednesday, 2024-02-14 9:40 US/Pacific.

Our engineers have identified and mitigated the underlying issue. Most of the affected products have recovered and we expect the remaining products to fully recover in the next 1 to 2 hours.

The following services have already recovered: Google Kubernetes Engine, Cloud Pub/Sub, Virtual Private Cloud (VPC), VPC Serverless Access, Google Compute Engine, Dataplex Catalog, Cloud Interconnect, Cloud Workflows, Cloud Logging, Google Cloud Storage, Eventarc, Cloud SQL, Cloud Key Management Service, Cloud Run, Cloud Dataproc, Cloud Spanner, Dialogflow, Cloud Tasks

We will provide an update by Wednesday, 2024-02-14 13:00 US/Pacific with current details.

Diagnosis: Existing customer load balancers will continue to function. New load balancers and changes to existing load balancers will not propagate configurations, and changes to the configurations of load balancers may result in an error.

Configuration changes cannot be made to the Regional Internal, Regional External, and Global External Application Load Balancers in the affected region.

Customers may see errors when making configuration changes.

Workaround: None at this time.

14 Feb 2024 12:10 PST

Summary: Multiple Google Cloud Products are experiencing issues in us-west1

Description: We are experiencing an issue with multiple Google Cloud Products beginning on Wednesday, 2024-02-14 9:40 US/Pacific.

Our engineers have identified a common infrastructure component as the root cause and we are attempting a mitigation. As the mitigation progresses, some products may see partial recovery.

The following services have recovered:

Google Kubernetes Engine, Cloud Pub/Sub, Virtual Private Cloud (VPC), VPC Serverless Access, Google Compute Engine, Dataplex Catalog, Cloud Interconnect, Cloud Workflows, Cloud Logging, Google Cloud Storage, Eventarc, Cloud SQL, Cloud Key Management Service, Cloud Run, Cloud Dataproc

We do not have an ETA for mitigation at this point.

We will provide an update by Wednesday, 2024-02-14 12:45 US/Pacific with current details.

Diagnosis: Existing customer load balancers will continue to function. New load balancers and changes to existing load balancers will not propagate configurations, and changes to the configurations of load balancers may result in an error.

Configuration changes cannot be made to the Regional Internal, Regional External, and Global External Application Load Balancers in the affected region.

Customers may see errors when making configuration changes.

Workaround: None at this time.

14 Feb 2024 11:47 PST

Summary: Multiple Google Cloud Products are experiencing issues in us-west1

Description: We are experiencing an issue with multiple Google Cloud Products beginning on Wednesday, 2024-02-14 9:40 US/Pacific.

Our engineers have identified a common infrastructure component as the root cause and we are attempting a mitigation. As the mitigation progresses, some products may see partial recovery.

We do not have an ETA for mitigation at this point.

We will provide an update by Wednesday, 2024-02-14 12:20 US/Pacific with current details.

Diagnosis: Existing customer load balancers will continue to function. New load balancers and changes to existing load balancers will not propagate configurations, and changes to the configurations of load balancers may result in an error.

Configuration changes cannot be made to the Regional Internal, Regional External, and Global External Application Load Balancers in the affected region.

Customers may see errors when making configuration changes.

Workaround: None at this time.

14 Feb 2024 11:29 PST

Summary: Multiple Google Cloud Products are experiencing issues in us-west1

Description: We are experiencing an issue with multiple Google Cloud Products beginning on Wednesday, 2024-02-14 9:40 US/Pacific.

Our engineering team continues to investigate the issue.

We will provide an update by Wednesday, 2024-02-14 12:30 US/Pacific with current details.

We apologize to all who are affected by the disruption.

Diagnosis: Existing customer load balancers will continue to function. New load balancers and changes to existing load balancers will not propagate configurations, and changes to the configurations of load balancers may result in an error.

Configuration changes cannot be made to the Regional Internal, Regional External, and Global External Application Load Balancers in the affected region.

Customers may see errors when making configuration changes.

Workaround: None at this time.

14 Feb 2024 11:05 PST

Summary: We are experiencing an issue with Cloud Load Balancing.

Description: We are experiencing an issue with Cloud Load Balancing.

Our engineering team continues to investigate the issue.

We will provide an update by Wednesday, 2024-02-14 12:44 US/Pacific with current details.

We apologize to all who are affected by the disruption.

Diagnosis: None at this time.

Workaround: None at this time.