Service Health
Incident affecting Operations, Cloud Monitoring, Google Cloud Dataflow, Google Cloud Pub/Sub, Cloud NAT, Google Cloud Bigtable
Global: Cloud Monitoring elevated errors requesting underlying monitoring data
Incident began at 2021-10-19 11:00 and ended at 2021-10-19 12:45 (all times are US/Pacific).
| Date | Time | Description |
|---|---|---|
| 27 Oct 2021 | 14:26 PDT | INCIDENT REPORT Summary: On 19 October 2021 at 11:00 US/Pacific, Cloud Monitoring experienced errors querying all monitoring data for approximately 1 hour and 45 minutes in the us-central1 region. We apologize for the inconvenience and are taking steps toward preventing recurrence in the future. Root Cause: Cloud Monitoring is a global service but is subdivided into internal locales, each of which collects monitoring data that is generated locally. When users query Cloud Monitoring, each query fans out through a series of nodes (called mixers) within the corresponding locales. The mixers reach out to source nodes to gather the appropriate data, temporarily retaining it within a limited amount of memory. During a recent infrastructure change in the U.S. locale, the amount of memory allocated to mixers in the us-central1 region was inadvertently reduced. This caused mixer tasks to run low on memory. The number of tasks in a low-memory state grew over a period of several days as the change was gradually rolled out to production, following Google's standard progressive rollout policies. The mixer task has safeguards that are designed to detect and reduce the impact of low-memory conditions by pausing queries that use significant memory. However, in this case, an existing misconfiguration of this safeguard prevented it from activating correctly. Eventually, tasks that were low on memory failed, and enough of them failed to cause widespread failures and service impact. Remediation and Prevention: Google engineers were alerted to the problem on 19 October 2021 at 11:11 and immediately started an investigation. The root cause, the reduction in memory allocation for mixer nodes, was identified at 11:32. Google engineers quickly identified a mitigation, which we began to roll out at 11:50. Restoring the proper memory capacity for mixer nodes fully mitigated the issue at 12:54.
Google is committed to quickly and continually improving our technology and operations to prevent service disruptions. We are taking the following immediate steps to prevent this or similar issues from happening again:
|
| 19 Oct 2021 | 15:48 PDT | Mini Incident Report while full Incident Report is prepared. We apologize for the inconvenience this service disruption may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Support by opening a case using https://cloud.google.com/support (All Times US/Pacific) Incident Start: 19 October 2021 11:00 Incident End: 19 October 2021 12:45 Duration: 1 hour, 45 minutes Affected Services and Features: Google Cloud Monitoring, Google Cloud Dataflow, Google Cloud Pub/Sub, Google Cloud NAT, Google Cloud Router, Google Cloud Interconnect, Google Bigtable, Google Cloud Databases Regions/Zones: us-central1 and us-central2 Description: Google Cloud Monitoring experienced errors querying monitoring data for approximately 1 hour and 45 minutes. From preliminary analysis, the root cause of the issue was resource contention that occurred following a recent rollout which included a misconfiguration. Engineers corrected the configuration and restarted the affected instances to resolve the issue. Customer Impact: - Customers may have experienced errors or incomplete monitoring data. - Missing precomputed data from between 11:00 PT and 12:45 PT is expected, but can still be viewed via a raw query. - Customers may have also experienced false alerts during the impact window based on the underlying monitoring data. |
| 19 Oct 2021 | 13:27 PDT | The issue with Cloud Monitoring has been resolved for all affected users as of Tuesday, 2021-10-19 12:45 US/Pacific. We will publish an analysis of this incident once we have completed our internal investigation. We thank you for your patience while we worked on resolving the issue. |
| 19 Oct 2021 | 13:07 PDT | Summary: Global: Cloud Monitoring elevated errors requesting underlying monitoring data Description: All customer impact is mitigated as of Tuesday, 2021-10-19 12:45 US/Pacific. Missing precomputed data from between 11:00 and 12:45 is expected, but can still be viewed via a raw query. Users might see lingering impact when running queries with large windows. Users may have received false alerts during that window based on the underlying monitoring data. We will continue to monitor the situation. We do not have an ETA for full resolution at this point. We will provide an update by Tuesday, 2021-10-19 14:00 US/Pacific with current details. Diagnosis: Affected customers may see errors when trying to query their monitoring data. Workaround: None at this time. |
| 19 Oct 2021 | 12:51 PDT | Summary: Global: Cloud Monitoring elevated errors requesting underlying monitoring data Description: We believe the issue with Cloud Monitoring is partially resolved and the mitigation is continuing to reduce the error rate. We will continue to monitor the situation. We do not have an ETA for full resolution at this point. We will provide an update by Tuesday, 2021-10-19 14:00 US/Pacific with current details. Diagnosis: Affected customers may see errors when trying to query their monitoring data. Workaround: None at this time. |
| 19 Oct 2021 | 12:11 PDT | Summary: Global: Cloud Monitoring elevated errors requesting underlying monitoring data Description: Mitigation work is currently underway by our engineering team. We do not have an ETA for mitigation at this point. We will provide more information by Tuesday, 2021-10-19 13:05 US/Pacific. Diagnosis: Affected customers may see errors when trying to query their monitoring data. Workaround: None at this time. |
- All times are US/Pacific