Service Health

This page provides status information on the services that are part of Google Cloud. Check back here to view the current status of the services listed below. If you are experiencing an issue not listed here, please contact Support. Learn more about what's posted on the dashboard in this FAQ. For additional information on these services, please visit https://cloud.google.com/.

Incident affecting Operations, Cloud Monitoring, Google Cloud Dataflow, Google Cloud Pub/Sub, Cloud NAT, Google Cloud Bigtable

Global: Cloud Monitoring elevated errors requesting underlying monitoring data

Incident began at 2021-10-19 11:00 and ended at 2021-10-19 12:45 (all times are US/Pacific).

Date Time Description
27 Oct 2021 14:26 PDT

INCIDENT REPORT

Summary

On 19 October 2021 at 11:00 US/Pacific, Cloud Monitoring experienced errors querying all monitoring data for approximately 1 hour and 45 minutes in the us-central1 region. We apologize for the inconvenience and are taking steps to prevent a recurrence.

Root Cause

Cloud Monitoring is a global service but is subdivided into internal locales, each of which collects monitoring data that is generated locally. When users query Cloud Monitoring, each query fans out through a series of nodes (called mixers) within the corresponding locales. The mixers reach out to source nodes to gather the appropriate data, temporarily retaining it within a limited amount of memory.
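
The fan-out pattern described above can be illustrated with a minimal Python sketch. It is not Cloud Monitoring's implementation: the names `SOURCE_NODES`, `read_source`, `fan_out_query` and the `max_buffered_points` budget are hypothetical, chosen only to show a mixer-like function querying several source nodes in parallel and holding the merged result in a bounded buffer.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "source nodes": each maps a metric name to data points.
SOURCE_NODES = {
    "source-a": {"cpu": [0.2, 0.4]},
    "source-b": {"cpu": [0.9], "mem": [0.5]},
    "source-c": {"mem": [0.7]},
}


def read_source(node, metric):
    """Return the points one source node holds for a metric (empty if none)."""
    return SOURCE_NODES[node].get(metric, [])


def fan_out_query(metric, nodes=None, max_buffered_points=1000):
    """Query every source node in parallel and merge the results in memory.

    max_buffered_points stands in for the limited memory a mixer can use
    while it temporarily retains the gathered data (an assumption of this sketch).
    """
    nodes = sorted(SOURCE_NODES) if nodes is None else list(nodes)
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        # pool.map preserves the order of `nodes` in its results.
        partials = list(pool.map(lambda n: read_source(n, metric), nodes))
    merged = [point for part in partials for point in part]
    if len(merged) > max_buffered_points:
        raise MemoryError("query exceeds the mixer's buffer budget")
    return merged
```

Lowering `max_buffered_points` in this sketch makes previously successful queries fail, which is loosely analogous to a reduced memory allocation.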

During a recent infrastructure change in the U.S. locale, the amount of memory allocated to mixers in the us-central1 region was inadvertently reduced. This caused mixer tasks to run low on memory. The number of tasks in a low memory state grew over a period of several days as the change was gradually rolled out to production, following Google's standard progressive rollout policies.

The mixer task has safeguards which are designed to detect and reduce the impact of low memory conditions by pausing queries that use significant memory. However, in this case, an existing misconfiguration of this safeguard prevented it from activating correctly. Eventually, tasks which were low on memory failed; enough tasks failed in total to cause widespread failures and service impact.
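
As a rough illustration of such a safeguard, the sketch below pauses memory-heavy queries when free memory drops below a threshold. The report does not say what the actual misconfiguration was; the `Query` type, the `partition_queries` function, and all thresholds are hypothetical, and the zero threshold in the usage note is only one way a guard of this kind could fail to activate.

```python
from dataclasses import dataclass


@dataclass
class Query:
    name: str
    estimated_bytes: int  # hypothetical estimate of the memory a query needs


def partition_queries(queries, free_bytes, low_memory_threshold, heavy_query_bytes):
    """Split queries into (run, paused) lists.

    When free memory is below low_memory_threshold, queries estimated to need
    at least heavy_query_bytes are paused; everything else keeps running.
    """
    low_memory = free_bytes < low_memory_threshold
    run, paused = [], []
    for query in queries:
        if low_memory and query.estimated_bytes >= heavy_query_bytes:
            paused.append(query)
        else:
            run.append(query)
    return run, paused
```

With `free_bytes=100` and `low_memory_threshold=200`, a query estimated at 500 bytes is paused when `heavy_query_bytes=400`. If the threshold were mistakenly set to 0, the low-memory condition would never be detected and nothing would be paused.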

Remediation and Prevention

Google engineers were alerted to the problem on 19 October 2021 at 11:11 and immediately started an investigation. The root cause, the reduction in memory allocation for mixer nodes, was identified at 11:32. Google engineers quickly identified a mitigation, which we began to roll out at 11:50. Restoring the proper memory capacity for mixer nodes fully mitigated the issue at 12:54.

Google is committed to quickly and continually improving our technology and operations to prevent service disruptions.

We are taking the following immediate steps to prevent this or similar issues from happening again:

  • Fix the misconfiguration so that mixers which are low on memory will correctly detect that condition.
  • Introduce load-shedding, such that mixers which run out of memory will simply reject new queries until memory usage subsides, rather than failing.
  • Optimize the mixers to reduce the likelihood of out-of-memory scenarios.
  • Modify Cloud Monitoring's rollout automation so that it automatically spots problems of this type, allowing engineers to be alerted sooner.
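
The load-shedding step in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the planned implementation: the `LoadSheddingMixer` class, the `QueryRejected` exception, and the byte-based budget are all hypothetical.

```python
class QueryRejected(Exception):
    """Raised when a new query is shed instead of risking an out-of-memory failure."""


class LoadSheddingMixer:
    """Hypothetical mixer that admits queries only while memory budget remains."""

    def __init__(self, memory_limit_bytes):
        self.memory_limit_bytes = memory_limit_bytes
        self.in_use_bytes = 0

    def admit(self, query_bytes):
        """Reserve memory for a query, or reject it if the budget would be exceeded."""
        if self.in_use_bytes + query_bytes > self.memory_limit_bytes:
            raise QueryRejected("mixer is low on memory; retry later")
        self.in_use_bytes += query_bytes

    def release(self, query_bytes):
        """Return a finished query's memory to the budget."""
        self.in_use_bytes = max(0, self.in_use_bytes - query_bytes)
```

In this sketch a rejected query leaves the mixer running, and new queries are admitted again once earlier ones call `release`. The task degrades under memory pressure instead of failing outright.
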
19 Oct 2021 15:48 PDT

Mini Incident Report while full Incident Report is prepared

We apologize for the inconvenience this service disruption may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Support by opening a case using https://cloud.google.com/support

(All Times US/Pacific)

Incident Start: 19 October 2021 11:00

Incident End: 19 October 2021 12:45

Duration: 1 hour, 45 minutes

Affected Services and Features:

Google Cloud Monitoring, Google Cloud Dataflow, Google Cloud Pub/Sub, Google Cloud NAT, Google Cloud Router, Google Cloud Interconnect, Google Bigtable, Google Cloud Databases

Regions/Zones: us-central1 and us-central2

Description:

Google Cloud Monitoring experienced errors querying monitoring data for approximately 1 hour and 45 minutes. From preliminary analysis, the root cause of the issue was resource contention that occurred following a recent rollout which included a misconfiguration. Engineers corrected the configuration and restarted the affected instances to resolve the issue.

Customer Impact:

- Customers may have experienced errors or incomplete monitoring data.

- Precomputed data is expected to be missing between 11:00 PT and 12:45 PT, but the data for that period can still be viewed via raw query.

- Customers may have also experienced false alerts during the impact window based on the underlying monitoring data.

19 Oct 2021 13:27 PDT

The issue with Cloud Monitoring has been resolved for all affected users as of Tuesday, 2021-10-19 12:45 US/Pacific.

We will publish an analysis of this incident once we have completed our internal investigation.

We thank you for your patience while we worked on resolving the issue.

19 Oct 2021 13:07 PDT

Summary: Global: Cloud Monitoring elevated errors requesting underlying monitoring data

Description: All customer impact is mitigated as of Tuesday, 2021-10-19 12:45 US/Pacific. Precomputed data is expected to be missing between 11:00 and 12:45, but the data for that period can still be viewed via a raw query. Users might see lingering impact when running queries with large windows. Users may have received false alerts during that window based on the underlying monitoring data.

We will continue to monitor the situation. We do not have an ETA for full resolution at this point.

We will provide an update by Tuesday, 2021-10-19 14:00 US/Pacific with current details.

Diagnosis: Affected customers may see errors when trying to query their monitoring data.

Workaround: None at this time.

19 Oct 2021 12:51 PDT

Summary: Global: Cloud Monitoring elevated errors requesting underlying monitoring data

Description: We believe the issue with Cloud Monitoring is partially resolved and the mitigation is continuing to reduce the error rate. We will continue to monitor the situation.

We do not have an ETA for full resolution at this point.

We will provide an update by Tuesday, 2021-10-19 14:00 US/Pacific with current details.

Diagnosis: Affected customers may see errors when trying to query their monitoring data.

Workaround: None at this time.

19 Oct 2021 12:11 PDT

Summary: Global: Cloud Monitoring elevated errors requesting underlying monitoring data

Description: Mitigation work is currently underway by our engineering team.

We do not have an ETA for mitigation at this point.

We will provide more information by Tuesday, 2021-10-19 13:05 US/Pacific.

Diagnosis: Affected customers may see errors when trying to query their monitoring data.

Workaround: None at this time.