Service Health
Incident affecting Cloud Monitoring, Operations, Google Cloud Console
Multiple Google Cloud products are impacted by a Cloud Monitoring issue
Incident began at 2024-01-10 07:23 and ended at 2024-01-12 03:08 (all times are US/Pacific).
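As a quick arithmetic check, the window stated above can be turned into a duration and compared with the figures quoted in the incident report further down. This is only a sketch: `describe` is a hypothetical helper, and naive datetimes are used on the assumption that no daylight-saving change falls inside the window (all times are US/Pacific in January).

```python
from datetime import datetime, timedelta


def describe(delta: timedelta) -> str:
    """Format a timedelta as 'D day(s), H hours, M minutes'."""
    total_minutes = int(delta.total_seconds()) // 60
    days, rem = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(rem, 60)
    return f"{days} day(s), {hours} hours, {minutes} minutes"


# Metric data query issue: start and end as stated in the incident header.
query_start = datetime(2024, 1, 10, 7, 23)
query_end = datetime(2024, 1, 12, 3, 8)

# Significant-impact window for metric data unavailability, as stated in the report.
gap_start = datetime(2024, 1, 10, 9, 30)
gap_end = datetime(2024, 1, 10, 16, 45)

print(describe(query_end - query_start))
print(describe(gap_end - gap_start))
```

Both results agree with the durations given in the report (1 day, 19 hours, 45 minutes and 7 hours, 15 minutes).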
Previously affected location(s)
Taiwan (asia-east1), Hong Kong (asia-east2), Tokyo (asia-northeast1), Osaka (asia-northeast2), Seoul (asia-northeast3), Mumbai (asia-south1), Delhi (asia-south2), Singapore (asia-southeast1), Jakarta (asia-southeast2), Sydney (australia-southeast1), Melbourne (australia-southeast2), Warsaw (europe-central2), Finland (europe-north1), Madrid (europe-southwest1), Belgium (europe-west1), Berlin (europe-west10), Turin (europe-west12), London (europe-west2), Frankfurt (europe-west3), Netherlands (europe-west4), Zurich (europe-west6), Milan (europe-west8), Paris (europe-west9), Global, Doha (me-central1), Dammam (me-central2), Tel Aviv (me-west1), Montréal (northamerica-northeast1), Toronto (northamerica-northeast2), São Paulo (southamerica-east1), Santiago (southamerica-west1), Iowa (us-central1), South Carolina (us-east1), Northern Virginia (us-east4), Columbus (us-east5), Dallas (us-south1), Oregon (us-west1), Los Angeles (us-west2), Salt Lake City (us-west3), Las Vegas (us-west4)
| Date | Time | Description |
|---|---|---|
| 22 Jan 2024 | 10:33 PST | Incident Report. Summary: On Wednesday, 10 January 2024, Google Cloud Monitoring and all Google Cloud Products that expose Google Cloud Monitoring experienced dashboard delays and metric query failures for a duration of 1 day, 19 hours, 45 minutes (initial degradation, due to data staleness, started on 09 January 2024 08:30 PST). Service metric data was also unavailable: this started on 03 January 2024 11:23 PST with a low-impact window until 10 January 2024 09:30 PST, followed by a significant-impact window starting 10 January 2024 09:30 PST for a duration of 7 hours, 15 minutes. To our Google Cloud Monitoring and Google Cloud Products customers who were affected, we sincerely apologize. This is not the level of quality and reliability we strive to offer you and we are taking immediate steps to improve the platform’s performance and availability. Root Cause: Google Cloud Monitoring experienced two distinct issues that impacted system metric data for most Google Cloud Products. Metric Data Queries: Metric data is stored in memory before being stored on disk. Initial degradation started on 09 January 2024 08:30 PST, due to data staleness. A configuration change in data replication for us-central1 triggered a bottleneck in the pipeline responsible for moving data to disk for querying. Initially, this bottleneck-induced backlog did not cause user-visible impact, because data continued to be served from the in-memory tier for the most recent 24 hours. When the pipeline blockage was mitigated on 10 January 2024 08:00 PST, the entire 20-hour backlog of files was rapidly ingested into the system that serves queries from disk. The resulting huge number of files triggered yet another bottleneck in the on-disk system, causing high latency or failure for most queries. Metric Data Unavailability: A combination of two changes, one permissions-related and another scheduling-related, caused certain Cloud Metrics to be unavailable. 
These changes were rolled out on 03 January 2024 11:23 PST. The impact from these changes was limited initially, but when mitigation was attempted on 09 January 2024 2:07 PST, it induced a bigger issue. The new problem surfaced due to a higher rate of server restarts. Remediation and Prevention: Metric Data Queries: Google engineers were alerted to the (not yet user-visible) data staleness issue by internal SLIs (Service Level Indicators) on 09 January 2024 16:58 PST and immediately started an investigation. Staleness began 09 January 2024 08:30:00 PST. When the high-latency replica was removed at 10 January 2024 07:23 PST, the processing pipeline returned to normal. While the response to this first issue was still ongoing, engineers were alerted by user-facing SLIs at 10 January 2024 07:29 PST to user-visible query unavailability that had begun at 07:23 PST. They reconfigured the system to remove bottlenecks and increased the overall amount of compute resources available. This eventually reduced the backlog and returned the system to normal. All query availability and latency SLIs recovered fully at 12 January 2024 03:15 PST. Metric Data Unavailability: Google engineers were alerted by SLIs on 09 January 2024 23:51 PST and immediately started an investigation. The initial attempt at mitigating the problem caused wider issues. We stopped this mitigation and applied a newer patch that fixed the root cause. All remaining errors were resolved on 10 January 2024 16:40 PST. Remediation: Google is committed to preventing a repeat of this issue in the future and is completing the following actions:
|
| 12 Jan 2024 | 13:02 PST | Mini Incident Report. We apologize for the inconvenience this service outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support. (All Times US/Pacific) Metric data unavailability:
Metric data queries issue start: 10 January 2024, 07:23. Metric data queries issue end: 12 January 2024, 03:08. Duration: 1 day, 19 hours, 45 minutes. Affected Services and Features: Google Cloud Monitoring and all Google Cloud Products that expose Google Cloud Monitoring. Regions/Zones: Global. Description: Google Cloud Monitoring experienced two distinct issues (metric data unavailability and metric data query failures) that impacted system metric data for most Google Cloud Products. Between 03 January 2024, 11:23 and 10 January 2024, 16:45 US/Pacific, a small number of Google Cloud Monitoring users experienced sporadic issues where metric data was unavailable, creating gaps in the metric data. From preliminary analysis, the root cause was a rollout intended to increase the monitoring platform’s reliability, which inadvertently introduced an issue that is triggered when a monitoring server restarts. Between 03 January 11:23 and 10 January 09:30 US/Pacific, the rate of monitoring server restarts (and thus the chance of triggering the issue) was very low. On 10 January starting 09:30 US/Pacific, restarts for a subsequent rollout triggered the issue more frequently. By 10 January 16:45 US/Pacific, engineers had mitigated the issue by rolling back the change that triggered it on monitoring server restarts. Between 10 January 2024, 07:23 US/Pacific and 12 January 2024, 03:08 US/Pacific, queries for metric data older than 24 hours in us-central1 experienced significant delays, leading to query failures. This also led to issues loading some Cloud Console dashboards. From preliminary analysis, the root cause of the issue is a failure in the pipeline responsible for transmitting data from our memory component to our storage component, which created a backlog. This issue was mitigated by resolving the failure and clearing the backlog. Google will complete a full Incident Report (IR) in the following days that will provide a full root cause. Customer Impact: Impact of Metric data unavailability:
Impact on Metric data queries:
|
| 12 Jan 2024 | 07:19 PST | The issue with Cloud Monitoring, Google Cloud Console has been resolved for all affected users as of Friday, 2024-01-12 06:29 US/Pacific. We will publish an analysis of this incident once we have completed our internal investigation. We thank you for your patience while we worked on resolving the issue. |
| 11 Jan 2024 | 19:26 PST | Summary: Multiple Google Cloud products are impacted by a Cloud Monitoring issue. Description: We are experiencing an issue with Cloud Monitoring which is impacting metrics and dashboards related to multiple products. Our Engineering team identified two distinct issues causing the impact and is continuing to work on applying mitigations. The first problem, which caused gaps in monitoring data, has been completely mitigated. There should be no further impact on alerting, or gaps in monitoring data. The mitigation for the second problem, which causes latency in querying metric data in us-central1, is currently running; we expect the latency to return to normal levels by 2024-01-12 18:00 US/Pacific. We will provide a progress update by Friday, 2024-01-12 08:00 US/Pacific with current details. Diagnosis:
Workaround: Customers can load the metric data in their monitoring dashboards by excluding us-central1 using a location filter. Queries against data from after 2024-01-10 16:00 should be faster. |
| 11 Jan 2024 | 15:14 PST | Summary: Multiple Google Cloud products are impacted by a Cloud Monitoring issue. Description: We are experiencing an issue with Cloud Monitoring which is impacting metrics and dashboards related to multiple products. Our Engineering team identified two distinct issues causing the impact and is continuing to work on applying mitigations. We are rolling out the mitigation for the gaps in monitoring data and estimate we are 90% mitigated. Engineers are evaluating the timeline to complete mitigation. We have tested a mitigation for the delays in querying metric data in us-central1; the mitigation appears to reduce the backlog and we anticipate the delays will gradually return to base levels over the next 24 hours. We will provide a progress update by Friday, 2024-01-12 08:00 US/Pacific with current details. Diagnosis: Customers impacted by this issue may experience delays and in some cases may not see the metric data in their monitoring dashboard. Some dashboards may fail to load entirely. GCP and third-party products and services that consume these metrics via the Cloud Monitoring API may experience the issues stated above. This includes products that rely on metrics for autoscaling. Alerting mechanisms reliant on this metric data may experience the issue too. The issue affects alerts for both Google Cloud service-defined and custom metrics. The metric data for custom metrics is unaffected by the issue. Workaround: Customers that are querying metric data older than 24 hours can work around timeout failures in loading the data by querying for data that is not older than 24 hours. Alternatively, customers can load the metric data in their monitoring dashboards by excluding us-central1 using a location filter. Customers can continue to write and query custom metric data newer than 24 hours without any issues. |
| 11 Jan 2024 | 12:39 PST | Summary: Multiple Google Cloud products are impacted by a Cloud Monitoring issue. Description: We are experiencing an issue with Cloud Monitoring which is impacting metrics and dashboards related to multiple products. Our Engineering team identified two distinct issues causing the impact and is continuing to work on applying mitigations. We identified a potential fix for the issue where metrics are not available and are testing it in some affected regions. The testing is taking longer than anticipated and is expected to take another 90 minutes. The issue with delays in querying metric data in us-central1 is being actively investigated for a mitigation. Our Engineering team currently understands the cause of this issue and has taken measures to limit the volume of backlog in the region. The delays are more significant for metric data between 2024-01-09 16:00 US/Pacific and 2024-01-10 16:00 US/Pacific, leading to query failures. Some customers may encounter query delays and failures with data outside this window. We will provide an update by Thursday, 2024-01-11 15:00 US/Pacific with current details. Diagnosis: Customers impacted by this issue may experience delays and in some cases may not see the metric data in their monitoring dashboard. Some dashboards may fail to load entirely. GCP and third-party products and services that consume these metrics via the Cloud Monitoring API may experience the issues stated above. This includes products that rely on metrics for autoscaling. Alerting mechanisms reliant on this metric data may experience the issue too. The issue affects alerts for both Google Cloud service-defined and custom metrics. The metric data for custom metrics is unaffected by the issue. Workaround: Customers that are querying metric data older than 24 hours can work around timeout failures in loading the data by querying for data that is not older than 24 hours. 
Alternatively, customers can load the metric data in their monitoring dashboards by excluding us-central1 using a location filter. Customers can continue to write and query custom metric data newer than 24 hours without any issues. |
| 11 Jan 2024 | 11:29 PST | Summary: Multiple Google Cloud products are impacted by a Cloud Monitoring issue. Description: We are experiencing an issue with Cloud Monitoring which is impacting metrics and dashboards related to multiple products. Our Engineering team identified two distinct issues causing the impact and is continuing to work on applying mitigations. We identified a potential fix for the issue where metrics are not available and are testing it in some affected regions. We currently do not have an ETA for the fix rollout. The issue with delays in querying metric data in us-central1 that is older than 24 hours is being actively investigated for a mitigation. Our Engineering team currently understands the cause of this issue and has taken measures to limit the volume of backlog in the region. We will provide an update by Thursday, 2024-01-11 12:30 US/Pacific with current details. Diagnosis: Customers impacted by this issue may experience delays and in some cases may not see the metric data in their monitoring dashboard. Some dashboards may fail to load entirely. GCP and third-party products and services that consume these metrics via the Cloud Monitoring API may experience the issues stated above. This includes products that rely on metrics for autoscaling. Alerting mechanisms reliant on this metric data may experience the issue too. The issue affects alerts for both Google Cloud service-defined and custom metrics. The metric data for custom metrics is unaffected by the issue. Workaround: Customers that are querying metric data older than 24 hours can work around timeout failures in loading the data by querying for data that is not older than 24 hours. Alternatively, customers can load the metric data in their monitoring dashboards by excluding us-central1 using a location filter. Customers can continue to write and query custom metric data newer than 24 hours without any issues. |
| 11 Jan 2024 | 10:23 PST | Summary: Multiple Google Cloud products are impacted by a Cloud Monitoring issue. Description: We are experiencing an issue with Cloud Monitoring which is impacting metrics and dashboards related to multiple products. We are currently working on applying mitigations. We do not have an ETA. We will provide an update by Thursday, 2024-01-11 11:30 US/Pacific with current details. Diagnosis: Customers impacted by this issue may experience delays and in some cases may not see the metric data in their monitoring dashboard. Some dashboards may fail to load entirely.
The issue affects alerts for both Google Cloud service-defined and custom metrics. The metric data for custom metrics is unaffected by the issue. Workaround: Customers that are querying metric data older than 24 hours can work around timeout failures in loading the data by querying for data that is not older than 24 hours.
|
| 11 Jan 2024 | 09:40 PST | Summary: Multiple Google Cloud products are impacted by a Cloud Monitoring issue. Description: We are experiencing an issue with Cloud Monitoring which is impacting metrics and dashboards related to multiple products. We are currently working on applying mitigations. We do not have an ETA. We will provide an update by Thursday, 2024-01-11 10:30 US/Pacific with current details. Diagnosis: Customers impacted by this issue may experience delays and in some cases may not see the metric data in their monitoring dashboard. Some dashboards may fail to load entirely.
Workaround: None at this time. |
| 11 Jan 2024 | 09:26 PST | Summary: Multiple Google Cloud products are impacted by a Cloud Monitoring issue. Description: We are experiencing an issue with Cloud Monitoring which is impacting metrics and dashboards related to multiple products. We are currently working on applying mitigations. We do not have an ETA. We will provide an update by Thursday, 2024-01-11 10:00 US/Pacific with current details. Diagnosis: Customers may experience intermittent issues when querying metrics. Impact is more visible when querying metrics older than 24 hours. Workaround: None at this time. |
- All times are US/Pacific
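
The two workarounds described in the updates (query only data from the last 24 hours, or exclude us-central1 with a location filter) could be prepared on the client side roughly as follows. This is a minimal sketch, not the documented procedure: `clamp_window` and `exclude_region_filter` are hypothetical helpers, the `zone` label is an assumption that fits resource types such as `gce_instance` (other types may use `location` or `region`), and the filter string follows the Cloud Monitoring filter syntax as I understand it, so it should be checked against the filter documentation before use.

```python
from datetime import datetime, timedelta, timezone

# The 24-hour limit and the affected region come from the workaround text;
# the helper names and the label choice are assumptions for illustration.
MAX_LOOKBACK = timedelta(hours=24)
AFFECTED_REGION = "us-central1"


def clamp_window(start: datetime, end: datetime, now: datetime):
    """Shrink [start, end] so it never reaches back more than 24 hours before now."""
    earliest = now - MAX_LOOKBACK
    clamped_start = max(start, earliest)
    if clamped_start >= end:
        raise ValueError("requested window lies entirely outside the last 24 hours")
    return clamped_start, end


def exclude_region_filter(metric_type: str, label: str = "zone",
                          region: str = AFFECTED_REGION) -> str:
    """Build a filter string that skips resources whose location label starts with the region."""
    return (
        f'metric.type = "{metric_type}" '
        f'AND NOT resource.labels.{label} = starts_with("{region}")'
    )


# Example: a three-day request made during the incident is cut down to 24 hours.
now = datetime(2024, 1, 11, 12, 0, tzinfo=timezone.utc)
start, end = clamp_window(now - timedelta(days=3), now, now)
print(start.isoformat(), end.isoformat())
print(exclude_region_filter("compute.googleapis.com/instance/cpu/utilization"))
```

The clamped interval and the filter string would then be passed to whatever client issues the time series query; that call is left out here because it depends on the client library in use.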