Service Health


Incident affecting Cloud Monitoring, Cloud Run, Operations, Cloud Spanner, Google Compute Engine, Google Kubernetes Engine, Google Cloud Bigtable, Google Cloud Dataflow, Google Cloud Pub/Sub, Google App Engine

Cloud Monitoring query failures, errors, and unavailable metrics affecting Google Compute Engine, Cloud Spanner, Cloud Dataflow, Cloud Bigtable, Google App Engine, Kubernetes Engine, Cloud Pub/Sub, and Cloud Run in us-central

Incident began at 2022-08-20 05:25 and ended at 2022-08-20 08:20 (all times are US/Pacific).

Date Time Description
26 Aug 2022 12:28 PDT

Full Incident Report

BACKGROUND:

Cloud Monitoring relies on an indexing service to look up data relevant to queries. Each region’s index is dynamically built from data, and then acts as a routing layer for all queries. The index is authoritative, and without it, queries will fail or return empty results.
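The role of the index described above can be illustrated with a small sketch. This is not Google's implementation: the names RegionIndex, IndexUnavailableError, run_query, and the shard layout are all hypothetical, chosen only to show why a query fails when the index is down and returns empty results when the index has no route for the requested data.

```python
from __future__ import annotations

from dataclasses import dataclass, field


class IndexUnavailableError(RuntimeError):
    """Raised when the (hypothetical) regional index cannot serve lookups."""


@dataclass
class RegionIndex:
    # Hypothetical routing table: metric name -> storage shards holding its data.
    routes: dict[str, list[str]] = field(default_factory=dict)
    healthy: bool = True

    def lookup(self, metric: str) -> list[str]:
        if not self.healthy:
            # Models an index instance that has crashed.
            raise IndexUnavailableError("index cannot serve routing lookups")
        # A metric with no route yields no shards, so the query finds nothing.
        return self.routes.get(metric, [])


def run_query(index: RegionIndex, storage: dict, metric: str) -> list[float]:
    """Route the query through the index, then read only the shards it names."""
    points: list[float] = []
    for shard in index.lookup(metric):
        points.extend(storage.get(shard, {}).get(metric, []))
    return points


# The data exists in storage in all three cases below.
storage = {"shard-a": {"cpu": [1.0, 2.0]}}

print(run_query(RegionIndex({"cpu": ["shard-a"]}), storage, "cpu"))  # routed normally
print(run_query(RegionIndex(), storage, "cpu"))  # no route: empty result

try:
    run_query(RegionIndex(healthy=False), storage, "cpu")
except IndexUnavailableError as exc:
    print(f"query failed: {exc}")
```

In this sketch the stored data is never lost; it is simply unreachable because every read goes through the index first.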

SUMMARY:

On Saturday, 20 August 2022, starting at 05:25 US/Pacific, some Cloud Monitoring queries and metrics were unavailable for multiple Google Cloud products (Google Compute Engine, Cloud Spanner, Cloud Dataflow, Cloud Bigtable, Google App Engine, Kubernetes Engine, Cloud Pub/Sub, Cloud Run) for 2 hours and 55 minutes. Google Compute Engine (GCE) Autoscaling (for virtual machines on GCE) may have been impacted during the incident, as it relies on Cloud Monitoring metrics.

ROOT CAUSE:

An increase in storage layer tasks triggered a previously unknown bug in the indexing service. All instances of the indexing service in the us-central1 and us-central2 regions repeatedly crashed and were unable to provide routing lookups for any queries. As a result, Cloud Monitoring queries either did not retrieve data stored in us-central1 and us-central2 or, in a smaller fraction of cases, failed.

REMEDIATION AND PREVENTION:

Google engineers were first alerted to the issue on Saturday, 20 August 2022 at 05:33 US/Pacific by internal monitoring systems. The issue was then mitigated by reducing the number of storage layer tasks. Services were fully recovered at 08:20 US/Pacific.
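The timeline figures in this report can be checked with a short calculation. The sketch below uses only the timestamps stated in this report (start 05:25, first alert 05:33, recovery 08:20, all US/Pacific on the same day); the helper names elapsed and describe are illustrative, and the month abbreviation parsing assumes an English locale.

```python
from datetime import datetime, timedelta

FMT = "%d %b %Y %H:%M"  # e.g. "20 Aug 2022 05:25"; all times share one time zone


def elapsed(start: str, end: str) -> timedelta:
    """Return the time between two same-zone timestamps in the report's format."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


def describe(delta: timedelta) -> str:
    """Format a timedelta as 'H hours, M minutes'."""
    hours, remainder = divmod(int(delta.total_seconds()), 3600)
    return f"{hours} hours, {remainder // 60} minutes"


incident_start = "20 Aug 2022 05:25"
first_alert = "20 Aug 2022 05:33"
recovered = "20 Aug 2022 08:20"

print("Time to first alert:", describe(elapsed(incident_start, first_alert)))
print("Total duration:", describe(elapsed(incident_start, recovered)))
```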

Google is committed to preventing future issues like this and is completing the following actions to prevent a recurrence:

  • Enhancing the indexing service to support a larger number of tasks (beyond the current threshold).
  • Improving testing to ensure we detect any similar limitations in the future before they reach production.

We apologize for the length and severity of this incident. We are taking immediate steps to improve reliability in the future.

23 Aug 2022 13:04 PDT

Mini Incident Report while full Incident Report is prepared

We apologize for the inconvenience this service disruption/outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support.

(All Times US/Pacific)

Incident Start: 20 Aug 2022 05:25

Incident End: 20 Aug 2022 08:20

Duration: 2 hours, 55 minutes

Affected Services and Features:

Google Compute Engine, Cloud Spanner, Google Cloud Dataflow, Google Cloud Bigtable, Cloud AppEngine, Google Kubernetes Engine, Google Cloud Pub/Sub, Cloud Run, Operations, Cloud Monitoring

Regions/Zones: us-central1 and us-central2

Description: Cloud Monitoring metrics for multiple Google Cloud products (Google Compute Engine, Cloud Spanner, Cloud Dataflow, Cloud Bigtable, Google App Engine, Kubernetes Engine, Cloud Pub/Sub, Cloud Run) were missing for 2 hours and 55 minutes. During the same period, Google Compute Engine (GCE) Autoscaling (for virtual machines on GCE) was also not functioning, as it relies on Cloud Monitoring metrics.

From preliminary analysis, the root cause of the issue was an overload in the system that is responsible for handling the metrics in Cloud Monitoring. This resulted in the failure of the queries which provide these metrics.

Customer Impact:

Customers lost the ability to comprehensively observe cloud operations at scale for the impacted services. Customers who use these metrics retroactively may be missing this information for the duration of the impact. While custom metrics will become available again, a few system-generated metrics will no longer be available due to system limitations.

Additional details:

The issue was mitigated by reducing the number of storage layer tasks. To prevent recurrence, engineers are also actively working on bolstering the system responsible for handling metrics to support a larger number of tasks (beyond the current threshold).

20 Aug 2022 08:33 PDT

The issue with Cloud Monitoring, Cloud Run, Cloud Spanner, Google App Engine, Google Cloud Bigtable, Google Cloud Dataflow, Google Cloud Pub/Sub, Google Compute Engine, Google Kubernetes Engine has been resolved for all affected users as of Saturday, 2022-08-20 08:32 US/Pacific.

We thank you for your patience while we worked on resolving the issue.

20 Aug 2022 08:29 PDT

Summary: Cloud Monitoring query failures, errors, and unavailable metrics affecting Google Compute Engine, Cloud Spanner, Cloud Dataflow, Cloud Bigtable, Google App Engine, Kubernetes Engine, Cloud Pub/Sub, and Cloud Run in us-central

Description: We are experiencing an issue with Cloud Monitoring.

Identified impacted services: Google Compute Engine, Cloud Spanner, Cloud Dataflow, Cloud Bigtable, Google App Engine, Kubernetes Engine, Cloud Pub/Sub, Cloud Run

Our engineering team continues to investigate the issue.

We will provide an update by Saturday, 2022-08-20 09:30 US/Pacific with current details.

Diagnosis: Customers may experience query failures, errors, and unavailable metrics in us-central.

Workaround: None at the moment.

20 Aug 2022 07:55 PDT

Summary: Cloud Monitoring query failures, errors, and unavailable metrics affecting Google Compute Engine, Cloud Spanner, Cloud Dataflow, Cloud Bigtable, Google App Engine, and Kubernetes Engine in us-central

Description: We are experiencing an issue with Cloud Monitoring.

Identified impacted services: Google Compute Engine, Cloud Spanner, Cloud Dataflow, Cloud Bigtable, Google App Engine, Kubernetes Engine

Our engineering team continues to investigate the issue.

We will provide an update by Saturday, 2022-08-20 09:30 US/Pacific with current details.

Diagnosis: Customers may experience query failures, errors, and unavailable metrics in us-central.

Workaround: None at the moment.

20 Aug 2022 07:30 PDT

Summary: Cloud Monitoring is returning bad query results (errors and/or empty results) in Cloud Monarch us-central.

Description: We are experiencing an issue with Cloud Monitoring and Google Compute Engine.

Our engineering team continues to investigate the issue.

We will provide an update by Saturday, 2022-08-20 09:30 US/Pacific with current details.

Diagnosis: Customers may experience query failures, errors, and unavailable metrics in us-central.

Workaround: None at the moment.