Service Health

This page provides status information on the services that are part of Google Cloud. Check back here to view the current status of the services listed below. If you are experiencing an issue not listed here, please contact Support. Learn more about what's posted on the dashboard in this FAQ. For additional information on these services, please visit https://cloud.google.com/.

Incident affecting Cloud Monitoring, Cloud Spanner, Google Cloud Bigtable, Google Kubernetes Engine, Operations

us-central1: Elevated Errors and Degraded Query Performance with Monitoring Metrics

Incident began at 2024-07-15 09:29 and ended at 2024-07-15 11:50 (all times are US/Pacific).

Previously affected location(s)

Global, Iowa (us-central1)

22 Jul 2024 09:17 PDT

Incident Report

Summary

On Monday, 15 July 2024, Google Cloud Monitoring experienced elevated query errors and degraded performance in the us-central1 region for 2 hours and 6 minutes. This impacted monitoring metrics for cloud products in the region, including Cloud Spanner, Google Kubernetes Engine, Cloud Bigtable, AlloyDB and Cloud SQL.

To our Cloud Monitoring customers whose monitoring capabilities were impacted during this disruption, we sincerely apologize. We understand the critical role monitoring plays in maintaining your cloud environments, and this is not the level of service we strive to provide.

We are committed to preventing similar disruptions in the future and continuing to improve the platform's reliability and performance.

Root Cause

Cloud Monitoring experienced a sudden, unexpected, and inorganic increase in usage, growing roughly 30% over the past 30 days. Our automation responded to this unexpected growth and pushed services past their current scaling limits, leading to out-of-memory crashes that reduced and degraded Cloud Monitoring query capacity in the us-central1 region.

As a mitigation, engineers increased the memory allocation limit on the affected services, raising their scaling limits, and will be working with the source of the unexpected growth to bring its usage back within expected bounds.

Remediation and Prevention

Google engineers were alerted to the issue by internal monitoring on 15 July 2024 at 08:53 US/Pacific and immediately started an investigation.

At 10:11 US/Pacific, engineers began rolling out a mitigation to increase the memory allocation limit on the affected services. The mitigation was completed at 10:52 US/Pacific, resolving the issue.

Google is committed to preventing a repeat of this issue in the future and is completing the following actions:

  • Dynamically adjust our scaling limits to better respond to load demands using spare capacity.
  • Partition the zonal query processor into multiple partitions so that it does not become a bottleneck in times of heavy load.

Detailed Description of Impact

On Monday, 15 July 2024, from 08:46 to 10:52 US/Pacific, multiple Google Cloud services experienced increased query latency and/or reduced availability in the us-central1 region.

Cloud Monitoring

Cloud Monitoring customers experienced increased query latency and/or reduced availability for Cloud Monitoring metrics stored in the us-central1 cloud region. Queries for metrics stored in other regions, including the "global" region, were unaffected.

Metrics: Cloud Monitoring API queries, e.g. via the QueryTimeSeries, ListTimeSeries, or PromQL endpoints, for metrics in this region may have returned a partial or empty response. Queries fanning out to multiple regions would have returned applicable data from all other regions.
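
For illustration only, the minimal sketch below shows how such a query might be issued with the google-cloud-monitoring Python client and how an empty response would surface; the project ID and metric type are placeholders, not details from this report.

```python
# Illustrative sketch: a ListTimeSeries-style query scoped to us-central1 zones.
# Project ID and metric type are hypothetical placeholders.
import time

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-sample-project"  # placeholder project

now = time.time()
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": int(now)},
        "start_time": {"seconds": int(now) - 3600},  # last hour
    }
)

# During the outage, a query like this could have returned a partial or
# empty list of time series for metrics stored in us-central1.
results = client.list_time_series(
    request={
        "name": project_name,
        "filter": (
            'metric.type = "compute.googleapis.com/instance/cpu/utilization" '
            'AND resource.labels.zone = starts_with("us-central1")'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

series = list(results)
if not series:
    # An empty result can mean "no data" or, as in this incident, a degraded
    # query backend; automation should treat it cautiously.
    print("No time series returned for the requested window.")
else:
    print(f"Returned {len(series)} time series.")
```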

Certain service metrics which are backed by precomputed queries in this region were unavailable during the outage window. Due to the real-time nature of precomputed queries, these gaps cannot be backfilled and will remain unavailable indefinitely.

Dashboards: Cloud Console dashboards displaying metrics from this region may have had data gaps and, in turn, presented a degraded experience to end users during the outage window. Dashboards displaying metrics backed by precomputed queries will continue to display data gaps for this period.

Incidents and Alert Notifications: Cloud Alerting policies whose location is retained and maps to the us-central1 region may have returned incorrect results, which prevented alerts from firing and associated notifications from being sent in a timely manner or, if short-lived, at all.

80% of alerts in us-central1 (8% of all alerts) were dropped during the outage window; however, most Cloud Alerting policies are global, not region-specific.
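
As an illustration of the distinction between global and region-specific policies, the hedged sketch below lists alert policies and flags any whose threshold-condition filter is pinned to us-central1. It assumes the google-cloud-monitoring Python client; the project ID is a placeholder.

```python
# Illustrative sketch: auditing which alerting policies are effectively
# pinned to us-central1 resources via their condition filters.
from google.cloud import monitoring_v3

client = monitoring_v3.AlertPolicyServiceClient()
project_name = "projects/my-sample-project"  # placeholder project

for policy in client.list_alert_policies(name=project_name):
    for condition in policy.conditions:
        # Threshold conditions carry a Monitoring filter string; a filter that
        # constrains resources or metrics to us-central1 makes the policy
        # region-specific rather than global.
        flt = condition.condition_threshold.filter
        if "us-central1" in flt:
            print(f"Region-pinned policy: {policy.display_name!r} "
                  f"(condition: {condition.display_name!r})")
```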

Customers may have experienced the following related to Cloud Alerting incidents and alert notifications:

  • Incident creation: Some incidents were never created, and related notifications were not sent out.
  • Incident creation: Some incidents were created up to 2 hours late, and the related notifications were delayed by a similar duration.
  • Incident close: Some incidents opened prior to the outage were prematurely closed due to the absence of the alerting signal during the outage window.
  • Incident reopen: Incidents that were prematurely closed, as described above, could reopen once the alerts started firing again, leading to double alerting for customers.

When the query processing service was restored, all ad-hoc and precomputed queries, dashboards, alerts, and notifications also returned to normal operation with the exception of the data gaps noted for precomputed queries during the outage.

Cloud Bigtable

Cloud Bigtable customers experienced a period of missing Google Cloud Monitoring metrics for bigtable.googleapis.com for the duration of this outage. When Google Cloud Monitoring returned to normal operation, Cloud Bigtable metrics returned as well. Cloud Bigtable's internal autoscaling capability was not impacted, but customers who use Google Cloud Monitoring metrics to scale their Cloud Bigtable usage would have lost the metric signal and may have incorrectly scaled their instances as a result of this outage.

AlloyDB

AlloyDB customers intermittently experienced missing Google Cloud Monitoring metrics from 9:00 to 10:30 PDT. When Google Cloud Monitoring returned to normal operation, AlloyDB metrics returned as well. There were no missing metrics after 10:30 PDT.

Google Kubernetes Engine

Google Kubernetes Engine customers intermittently experienced missing Google Cloud Monitoring metrics from 9:00 to 10:30 PDT. When Google Cloud Monitoring returned to normal operation, GKE metrics returned as well. There were no missing metrics after 10:30 PDT.

Workload autoscaling based on external or custom metrics may not have been actuated during this period. Workload autoscaling based on CPU or memory was not affected.
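
As a rough illustration, the sketch below (assuming the official Kubernetes Python client, with a hypothetical HPA name and namespace) checks whether an HPA that scales on external or custom metrics is currently able to fetch them.

```python
# Illustrative sketch: inspecting HPA status conditions to see whether
# external/custom metric scaling is being actuated. Names are placeholders.
from kubernetes import client, config

config.load_kube_config()
autoscaling = client.AutoscalingV2Api()

hpa = autoscaling.read_namespaced_horizontal_pod_autoscaler(
    name="my-workload-hpa", namespace="default"
)

for cond in hpa.status.conditions or []:
    # During a metrics outage, ScalingActive typically turns False with a
    # reason such as FailedGetExternalMetric, and the HPA stops actuating.
    if cond.type == "ScalingActive":
        print(f"ScalingActive={cond.status} reason={cond.reason}: {cond.message}")
```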

Cloud Spanner

Cloud Spanner customers experienced a period of missing Google Cloud Monitoring metrics for spanner.googleapis.com for the duration of this outage. When Google Cloud Monitoring returned to normal operation, Cloud Spanner metrics returned as well. Cloud Spanner's native autoscaler was not impacted, but customers who use Google Cloud Monitoring metrics to scale their Cloud Spanner usage (e.g. via open-source autoscalers) would have lost the metric signal and may have incorrectly scaled their instances as a result of this outage. Data Boost customers who have set up alerts for usage may have received alerts as well, but Data Boost billing was not impacted.
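
To illustrate this failure mode, the sketch below shows a defensive metric-driven scaling loop that holds the current node count when the monitoring signal is missing. This is not the open-source Spanner autoscaler; the thresholds and helper names are illustrative assumptions.

```python
# Illustrative sketch: a metric-driven scaling decision that holds steady when
# the monitoring signal is missing, instead of treating "no data" as "no load".
import time
from typing import Optional

from google.cloud import monitoring_v3


def latest_cpu_utilization(project_id: str, instance_id: str) -> Optional[float]:
    """Return the most recent Spanner CPU utilization sample, or None if absent."""
    client = monitoring_v3.MetricServiceClient()
    now = time.time()
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": int(now)},
         "start_time": {"seconds": int(now) - 600}}  # last 10 minutes
    )
    results = client.list_time_series(
        request={
            "name": f"projects/{project_id}",
            "filter": (
                'metric.type = "spanner.googleapis.com/instance/cpu/utilization" '
                f'AND resource.labels.instance_id = "{instance_id}"'
            ),
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )
    # Points are returned newest-first; take the latest sample if any exist.
    points = [p for ts in results for p in ts.points]
    return points[0].value.double_value if points else None


def desired_node_count(current_nodes: int, cpu: Optional[float]) -> int:
    # If the metric signal is missing (as during this outage), hold the current
    # size rather than scaling on an empty reading.
    if cpu is None:
        return current_nodes
    if cpu > 0.75:
        return current_nodes + 1
    if cpu < 0.30 and current_nodes > 1:
        return current_nodes - 1
    return current_nodes
```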

Cloud SQL

Cloud SQL customers experienced missing Google Cloud Monitoring metrics for cloudsql.googleapis.com for the duration of this period. Some customers who set alerts based on these metrics may have been incorrectly notified, but Cloud SQL operations and the database datapath were not affected by this incident. The databases all continued to operate normally.

15 Jul 2024 16:21 PDT

Mini Incident Report

We apologize for the inconvenience this service outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support.

(All Times US/Pacific)

Incident Start: 15 July, 2024 08:46

Incident End: 15 July, 2024 11:00

Duration: 2 hours, 14 minutes

Affected Services and Features:

  • Cloud Monitoring
  • Cloud Spanner
  • Google Cloud Bigtable
  • Google Kubernetes Engine
  • AlloyDB

Regions/Zones: us-central1

Description:

Cloud Monitoring experienced elevated query errors and degraded query performance, impacting monitoring metrics for multiple cloud products in us-central1, due to out-of-memory crashes in part of the query processing service. Google engineers increased the memory allocation limits for this service to mitigate the problem.

Google will publish a full incident report in the following days that will provide a complete root cause.

Customer Impact:

  • Customers experienced errors and/or increased latency when querying monitoring data. This includes Cloud Monitoring API queries, Google-Managed Prometheus API requests, autoscaling, and viewing dashboards.
  • Alert evaluations were impacted, resulting in potentially missed or false-positive alerts.
  • Any operations relying on querying monitoring metrics were affected.
  • Some GCP system metrics may have missing or incorrect data from the outage period.
  • No customer-written metric data was lost.

15 Jul 2024 11:50 PDT

The issue with Cloud Monitoring metrics has been resolved for all affected users as of Monday, 2024-07-15 10:52 US/Pacific.

We will publish an analysis of this incident once we have completed our internal investigation.

We thank you for your patience while we worked on resolving the issue.

15 Jul 2024 11:07 PDT

Summary: us-central1: Elevated Errors and Degraded Query Performance with Monitoring Metrics

Description: We are experiencing issues querying Monitoring metrics with Cloud Monitoring, affecting system metrics from multiple Cloud products and user-defined metrics.

We’ve implemented a mitigation that is showing improvement, and engineers will continue to monitor.

We will provide more information by Monday, 2024-07-15 12:00 US/Pacific.

Diagnosis: Affected customers may observe errors and/or latency when trying to query monitoring data, autoscaling, and dashboards for metrics in the us-central1 and global regions.

  • Alert evaluations are also impacted which means customers may not see expected alerts or may see some false positive alerts.

  • The issue impacts any operations that rely on monitoring metrics.

Workaround: None at this time.

15 Jul 2024 10:54 PDT

Summary: us-central1: Multiple Cloud Products Experiencing Elevated Errors and Degraded Query Performance with Monitoring Metrics

Description: We are experiencing issues with Monitoring metrics across Cloud Monitoring, Bigtable, Cloud Spanner, Cloud SQL, and Google Kubernetes Engine.

Mitigation work is currently underway by our engineering team.

The mitigation is expected to complete by Monday, 2024-07-15 12:00 US/Pacific.

We will provide more information by Monday, 2024-07-15 12:30 US/Pacific.

Diagnosis: Affected customers may observe errors and/or latency when trying to query monitoring data, autoscaling, dashboards, and alert evaluations in the us-central1 region.

Workaround: None at this time.

15 Jul 2024 10:43 PDT

Summary: us-central1: Multiple Cloud Products Experiencing Elevated Errors and Degraded Query Performance with Monitoring Metrics

Description: We are experiencing issues with Monitoring metrics across Cloud Monitoring, Bigtable, Cloud Spanner, Cloud SQL, and Google Kubernetes Engine.

Mitigation work is currently underway by our engineering team.

We will provide more information by Monday, 2024-07-15 11:20 US/Pacific.

Diagnosis: Affected customers may observe errors and/or latency when trying to query monitoring data, autoscaling, dashboards, and alert evaluations in the us-central1 region.

Workaround: None at this time.