Service Health
Incident affecting Cloud Monitoring, Cloud Spanner, Google Cloud Bigtable, Google Kubernetes Engine, Operations
us-central1: Elevated Errors and Degraded Query Performance with Monitoring Metrics
Incident began at 2024-07-15 09:29 and ended at 2024-07-15 11:50 (all times are US/Pacific).
Previously affected location(s)
Global, Iowa (us-central1)
22 Jul 2024 09:17 PDT

Incident Report

Summary

On Monday, 15 July 2024, Google Cloud Monitoring experienced elevated query errors and degraded performance in the us-central1 region for 2 hours and 6 minutes. This impacted monitoring metrics for cloud products in the region, including Cloud Spanner, Google Kubernetes Engine, Cloud Bigtable, AlloyDB, and Cloud SQL.

To our Cloud Monitoring customers whose monitoring capabilities were impacted during this disruption, we sincerely apologize. We understand the critical role monitoring plays in maintaining your cloud environments, and this is not the level of service we strive to provide. We are committed to preventing similar disruptions in the future and to continuing to improve the platform's reliability and performance.

Root Cause

Cloud Monitoring experienced a sudden, unexpected, and inorganic increase in usage, observing 30% growth over the past 30 days. Our automation responded to the unexpected growth, which pushed services past their current scaling limits, leading to out-of-memory crashes that reduced or degraded Cloud Monitoring query capacity in the us-central1 region. As a mitigation, engineers increased the memory allocation limit on the affected services to raise their scaling limits, and will work with the source of the unexpected growth to bring its usage back within expected limits.

Remediation and Prevention

Google engineers were alerted to the issue by internal monitoring on 15 July 2024 at 08:53 US/Pacific and immediately started an investigation. At 10:11 US/Pacific, engineers began rolling out a mitigation to increase the memory allocation limit on the affected services. The mitigation was completed at 10:52 US/Pacific, resolving the issue. Google is committed to preventing a repeat of this issue in the future and is completing follow-up remediation actions.
Detailed Description of Impact

On Monday, 15 July 2024, from 08:46 to 10:52 US/Pacific, multiple Google Cloud services experienced increased query latency and/or reduced availability in the us-central1 region.

Cloud Monitoring

Cloud Monitoring customers experienced increased query latency and/or reduced availability for Cloud Monitoring metrics stored in the us-central1 cloud region. Queries for metrics stored in other regions, including the "global" region, were unaffected.

Metrics

Cloud Monitoring API queries for metrics in this region, e.g. via the QueryTimeSeries, ListTimeSeries, or PromQL endpoints, may have returned a partial or empty response. Queries fanning out to multiple regions would have returned the applicable data from all other regions. Certain service metrics that are backed by precomputed queries in this region were unavailable during the outage window. Due to the real-time nature of precomputed queries, these gaps cannot be backfilled and will remain unavailable indefinitely.

Dashboards

Cloud Console dashboards displaying metrics from this region may have data gaps and, in turn, presented a degraded experience to end users during the outage window. Dashboards displaying metrics backed by precomputed queries will continue to display data gaps for this period.

Incidents and Alert Notifications

Cloud Alerting policies whose location is retained and maps to the us-central1 region may have returned incorrect results, which prevented alerts from firing and the associated notifications from being sent in a timely manner or, if short-lived, at all. 80% of alerts in us-central1 (8% of all alerts) were dropped during the outage window; however, most Cloud Alerting policies are global, not region-specific. Customers may also have experienced related impact to Cloud Alerting incidents and alert notifications.
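For illustration only: since region-scoped alerting policies were the ones at risk of dropping alerts, a customer could audit their policies for conditions pinned to us-central1. The following is a minimal sketch, assuming the google-cloud-monitoring Python client and a placeholder PROJECT_ID; it only inspects threshold-style conditions, and other condition types would need their own checks.

```python
# Hypothetical sketch: flag Cloud Alerting policies whose condition
# filters pin metrics to us-central1, the category of policies that
# could have dropped alerts during this incident.
# Assumptions: google-cloud-monitoring is installed and PROJECT_ID
# is replaced with your own project.
from google.cloud import monitoring_v3

PROJECT_ID = "my-project"  # assumption: replace with your project ID

client = monitoring_v3.AlertPolicyServiceClient()

for policy in client.list_alert_policies(name=f"projects/{PROJECT_ID}"):
    for condition in policy.conditions:
        # Threshold conditions carry a metric filter string; MQL or
        # PromQL conditions are not covered by this simple check.
        filter_str = condition.condition_threshold.filter
        if filter_str and "us-central1" in filter_str:
            print(f"{policy.display_name}: condition scoped to us-central1")
```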
When the query processing service was restored, all ad hoc and precomputed queries, dashboards, alerts, and notifications also returned to normal operation, with the exception of the data gaps noted above for precomputed queries during the outage.

Cloud Bigtable

Cloud Bigtable customers experienced a period of missing Google Cloud Monitoring metrics for bigtable.googleapis.com for the duration of this outage. When Google Cloud Monitoring returned to normal operations, Cloud Bigtable monitoring metrics returned as well. Cloud Bigtable's internal autoscaling capability was not impacted, but customers who use Google Cloud Monitoring metrics to scale their Cloud Bigtable usage would have lost the metric signal and may have incorrectly scaled their instances as a result of this outage.

AlloyDB

AlloyDB customers intermittently experienced missing Google Cloud Monitoring metrics from 9:00 to 10:30 PDT. When Google Cloud Monitoring returned to normal operations, AlloyDB monitoring metrics returned as well. There were no missing metrics after 10:30 PDT.

Google Kubernetes Engine

Google Kubernetes Engine customers intermittently experienced missing Google Cloud Monitoring metrics from 9:00 to 10:30 PDT. When Google Cloud Monitoring returned to normal operations, GKE monitoring metrics returned as well. There were no missing metrics after 10:30 PDT. Workload autoscaling based on external or custom metrics may not have been actuated during this period. Workload autoscaling based on CPU or memory was not affected.

Cloud Spanner

Cloud Spanner customers experienced a period of missing Google Cloud Monitoring metrics for spanner.googleapis.com for the duration of this outage. When Google Cloud Monitoring returned to normal operations, Cloud Spanner monitoring metrics returned as well. Cloud Spanner's native autoscaler was not impacted, but customers who use Google Cloud Monitoring metrics to scale their Cloud Spanner usage (e.g. via open-source autoscalers) would have lost the metric signal and may have incorrectly scaled their instances as a result of this outage. Data Boost customers who have set up alerts for usage may have been alerted as well, but Data Boost billing is not impacted.

Cloud SQL

Cloud SQL customers experienced missing Google Cloud Monitoring metrics for cloudsql.googleapis.com for the duration of this period. Some customers who set alerts based on these metrics may have been incorrectly notified, but Cloud SQL operations and the database datapath were not affected by this incident; all databases continued to operate normally.
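As both the Cloud Bigtable and Cloud Spanner sections above note, autoscalers driven by Monitoring metrics lost their signal during the outage and may have scaled incorrectly. A minimal sketch of the defensive pattern follows, assuming the google-cloud-monitoring Python client, a placeholder PROJECT_ID, and the Spanner CPU utilization metric as the example signal: treat an empty response as "data missing", not "zero load".

```python
# Hypothetical sketch of a metric-driven scaling check that fails
# safe when Cloud Monitoring returns no data, as happened during
# this incident. PROJECT_ID and the metric filter are illustrative
# assumptions, not taken from any specific autoscaler.
import time

from google.cloud import monitoring_v3

PROJECT_ID = "my-project"  # assumption: replace with your project ID

client = monitoring_v3.MetricServiceClient()
now = time.time()
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": int(now)},
        "start_time": {"seconds": int(now) - 300},  # last 5 minutes
    }
)

results = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": 'metric.type = "spanner.googleapis.com/instance/cpu/utilization"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

points = [p for series in results for p in series.points]
if not points:
    # Missing data is not the same as zero load: scaling down on an
    # empty response during an outage like this one would be wrong.
    print("No metric data returned; holding current instance size.")
else:
    peak = max(p.value.double_value for p in points)
    print(f"Peak CPU utilization over window: {peak:.2%}")
```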
15 Jul 2024 16:21 PDT

Mini Incident Report

We apologize for the inconvenience this service outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support.

(All Times US/Pacific)

Incident Start: 15 July 2024, 08:46

Incident End: 15 July 2024, 11:00

Duration: 2 hours, 14 minutes

Affected Services and Features: Cloud Monitoring, Cloud Spanner, Google Cloud Bigtable, Google Kubernetes Engine, Operations
Regions/Zones: us-central1

Description: Cloud Monitoring experienced elevated query errors and degraded query performance, impacting monitoring metrics for multiple cloud products in us-central1, due to out-of-memory crashes in part of the query processing service. Google engineers increased the memory allocation limits for this service to mitigate the problem. Google will complete a full incident report in the following days that will provide a full root cause.

Customer Impact: Affected customers may have observed errors and/or latency when querying monitoring data, along with impact to autoscaling, dashboards, and alert evaluations in the us-central1 region.
15 Jul 2024 11:50 PDT

The issue with Cloud Monitoring metrics has been resolved for all affected users as of Monday, 2024-07-15 10:52 US/Pacific. We will publish an analysis of this incident once we have completed our internal investigation. We thank you for your patience while we worked on resolving the issue.
15 Jul 2024 11:07 PDT

Summary: us-central1: Elevated Errors and Degraded Query Performance with Monitoring Metrics

Description: We are experiencing issues querying Monitoring metrics with Cloud Monitoring, affecting system metrics from multiple Cloud products as well as user-defined metrics. We have implemented a mitigation which is showing improvement, and engineers will continue to monitor. We will provide more information by Monday, 2024-07-15 12:00 US/Pacific.

Diagnosis: Affected customers may observe errors and/or latency when querying monitoring data, and impact to autoscaling and dashboards, for metrics in the us-central1 and global regions.

Workaround: None at this time.
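For illustration: the incident report above notes that PromQL endpoint queries could return partial or empty responses. A minimal sketch of an instant PromQL query against Cloud Monitoring's Prometheus-compatible API follows, assuming the google-auth and requests Python packages, a placeholder PROJECT_ID, and "up" as a stand-in query; the empty-result handling mirrors the caution above about missing data.

```python
# Hypothetical sketch: issue a PromQL instant query against Cloud
# Monitoring's Prometheus-compatible endpoint and treat an empty
# result set as "data may be missing" rather than "metric is zero".
# PROJECT_ID and the example query are illustrative assumptions.
import google.auth
import google.auth.transport.requests
import requests

PROJECT_ID = "my-project"  # assumption: replace with your project ID

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/monitoring.read"]
)
credentials.refresh(google.auth.transport.requests.Request())

url = (
    "https://monitoring.googleapis.com/v1/projects/"
    f"{PROJECT_ID}/location/global/prometheus/api/v1/query"
)
resp = requests.post(
    url,
    headers={"Authorization": f"Bearer {credentials.token}"},
    data={"query": "up"},
    timeout=30,
)
resp.raise_for_status()
result = resp.json()["data"]["result"]
if not result:
    print("Empty PromQL result; data for the window may be missing.")
else:
    print(f"Received {len(result)} series.")
```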
15 Jul 2024 10:54 PDT

Summary: us-central1: Multiple Cloud Products Experiencing Elevated Errors and Degraded Query Performance with Monitoring Metrics

Description: We are experiencing issues with Monitoring metrics in Cloud Monitoring, Cloud Bigtable, Cloud Spanner, Cloud SQL, and Google Kubernetes Engine. Mitigation work is currently underway by our engineering team. The mitigation is expected to complete by Monday, 2024-07-15 12:00 US/Pacific. We will provide more information by Monday, 2024-07-15 12:30 US/Pacific.

Diagnosis: Affected customers may observe errors and/or latency when querying monitoring data, autoscaling, dashboards, and alert evaluations in the us-central1 region.

Workaround: None at this time.
15 Jul 2024 10:43 PDT

Summary: us-central1: Multiple Cloud Products Experiencing Elevated Errors and Degraded Query Performance with Monitoring Metrics

Description: We are experiencing issues with Monitoring metrics in Cloud Monitoring, Cloud Bigtable, Cloud Spanner, Cloud SQL, and Google Kubernetes Engine. Mitigation work is currently underway by our engineering team. We will provide more information by Monday, 2024-07-15 11:20 US/Pacific.

Diagnosis: Affected customers may observe errors and/or latency when querying monitoring data, autoscaling, dashboards, and alert evaluations in the us-central1 region.

Workaround: None at this time.
- All times are US/Pacific