Service Health


Incident affecting Operations, Cloud Logging

Multiple regions: Cloud Logging was experiencing issues with displaying log data in Google Cloud Console

Incident began at 2024-01-23 09:30 and ended at 2024-01-24 22:00 (all times are US/Pacific).

Previously affected location(s)

Taiwan (asia-east1), Hong Kong (asia-east2), Tokyo (asia-northeast1), Osaka (asia-northeast2), Seoul (asia-northeast3), Mumbai (asia-south1), Delhi (asia-south2), Singapore (asia-southeast1), Jakarta (asia-southeast2), Sydney (australia-southeast1), Melbourne (australia-southeast2), Multi-region: eu, Warsaw (europe-central2), Finland (europe-north1), Madrid (europe-southwest1), Belgium (europe-west1), Berlin (europe-west10), Turin (europe-west12), London (europe-west2), Frankfurt (europe-west3), Netherlands (europe-west4), Zurich (europe-west6), Milan (europe-west8), Paris (europe-west9), Global, Doha (me-central1), Dammam (me-central2), Tel Aviv (me-west1), Montréal (northamerica-northeast1), Toronto (northamerica-northeast2), São Paulo (southamerica-east1), Santiago (southamerica-west1), Multi-region: us, Iowa (us-central1), South Carolina (us-east1), Northern Virginia (us-east4), Columbus (us-east5), Dallas (us-south1), Oregon (us-west1), Los Angeles (us-west2), Salt Lake City (us-west3), Las Vegas (us-west4)

Date Time Description
30 Jan 2024 17:58 PST

Incident Report

Summary

On Tuesday, 23 January 2024 at 09:30 PT, Cloud Logging, and Google Cloud products and services that rely on Cloud Logging, experienced delays ingesting logs originating from us-central1, us-east1, europe-west1, europe-north1, and asia-east1. This resulted in customers not being able to view these logs in Google Cloud Console or other places that make use of the logging query APIs during the impact windows for each region:

  • us-central1: 9:30-13:30 PT (4h)
  • us-east1: 9:30-10:20 PT (50m)
  • europe-west1: 9:30-14:15 PT (4h45m)
  • europe-north1: 9:30-14:15 PT (4h45m)
  • asia-east1: 9:30-10:10 PT (40m)
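The durations quoted in parentheses above follow directly from the start and end times. As a minimal sketch for readers who want to reproduce them (for example, when comparing against their own log timestamps), the snippet below recomputes each window. The names `WINDOWS`, `window_duration`, and `DURATIONS` are illustrative and not part of any Google Cloud API.

```python
from datetime import datetime, timedelta

# Impact windows on 23 January 2024 (PT), copied from the list above.
WINDOWS = {
    "us-central1": ("09:30", "13:30"),
    "us-east1": ("09:30", "10:20"),
    "europe-west1": ("09:30", "14:15"),
    "europe-north1": ("09:30", "14:15"),
    "asia-east1": ("09:30", "10:10"),
}


def window_duration(start: str, end: str) -> timedelta:
    """Return the length of a same-day HH:MM window."""
    fmt = "%H:%M"
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


# Duration of the impact window for each region.
DURATIONS = {region: window_duration(*w) for region, w in WINDOWS.items()}

if __name__ == "__main__":
    for region, duration in DURATIONS.items():
        minutes = int(duration.total_seconds() // 60)
        print(f"{region}: {minutes // 60}h{minutes % 60:02d}m")
```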

Exports of logs as well as writing to log-based metrics were also delayed in these regions during the impact windows. The issue also affected alerting through Cloud Console for Personalized Service Health (PSH).

To our Google Cloud customers who were affected, we sincerely apologize. This is not the level of quality and reliability we strive to offer you and we are taking immediate steps to improve the platform’s performance and availability.

Root Cause

The first time data is received for a log bucket that has Log Analytics enabled, Cloud Logging dynamically provisions resources necessary to ingest and store logs in BigQuery for that bucket. This requires updating state in a configuration database, which is accessed during log ingestion and routing. This configuration is required to ensure data is stored in compliance with each customer's organization settings.

As part of ongoing feature development, Cloud Logging engineers increased traffic from a new set of projects to Log Analytics. These projects were ingesting logs in multiple regions and the ramp-up resulted in a large number of concurrent dynamic provisioning requests in the five regions listed above. This caused contention and slowdowns in accessing the configuration database. While the Log Router has a load-shedding mechanism to protect against loss of throughput in such situations, there was a previously unknown latent issue that caused the problematic traffic to not be isolated quickly enough in a separate buffer. As a result, Log Router throughput was reduced by about 40% in the impacted regions, causing a log processing backlog to form.
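To illustrate why the speed of isolation matters, the following is a toy model only; it is not the Log Router's actual implementation, which is not described in this report. It assumes a router that gives each configuration lookup a fixed timeout and, once a source exceeds it, diverts that source's entries to a side buffer without further lookups. The class `SheddingRouter`, the function `simulate`, the 5-second and 0.01-second lookup costs, and the source names are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List, Set, Tuple


@dataclass
class SheddingRouter:
    """Toy router: sources whose config lookup is slow are moved to a side buffer."""

    lookup_seconds: Callable[[str], float]  # hypothetical cost of a config lookup per source
    timeout: float                          # lookup budget before a source counts as slow
    routed: List[Tuple[str, str]] = field(default_factory=list)
    side_buffer: Deque[Tuple[str, str]] = field(default_factory=deque)
    slow_sources: Set[str] = field(default_factory=set)
    time_spent: float = 0.0                 # simulated seconds spent on the main path

    def route(self, source: str, entry: str) -> None:
        if source in self.slow_sources:
            # Already isolated: skip the lookup and buffer the entry.
            self.side_buffer.append((source, entry))
            return
        cost = self.lookup_seconds(source)
        if cost > self.timeout:
            # Give up at the timeout and isolate this source from now on.
            self.time_spent += self.timeout
            self.slow_sources.add(source)
            self.side_buffer.append((source, entry))
        else:
            self.time_spent += cost
            self.routed.append((source, entry))


def simulate(timeout: float) -> SheddingRouter:
    """Run a hypothetical workload: ten slow 'new' sources interleaved with one fast source."""

    def cost(source: str) -> float:
        return 5.0 if source.startswith("new") else 0.01

    router = SheddingRouter(lookup_seconds=cost, timeout=timeout)
    for i in range(100):
        router.route(f"new-{i % 10}", f"entry-{i}")
        router.route("steady", f"entry-{i}")
    return router
```

In this model, `simulate(1.0)` and `simulate(4.0)` route the same 100 fast entries and buffer the same 100 slow ones, but the main path spends roughly 11 simulated seconds with the shorter timeout versus roughly 41 with the longer one. That is the intuition behind shortening the database timeout so that problematic traffic is isolated sooner.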

The primary impact was that recently written logs were not visible to queries and log exports were delayed for log data originating from any of the impacted regions. This delay also affected Log Analytics, log-based metrics, and other Google products and services that rely on log data, including Personalized Service Health. No log entries were permanently lost, and all log entries were eventually successfully ingested, indexed for queries, and exported to configured destinations.

To quickly process the large backlog of data, the Log Analytics ingesters and log-based metrics pipeline were scaled up. This scale-up led to two unintended secondary impacts.

  • Log-based Metrics: Queries for some log-based metrics with high cardinality were degraded for some users for about 25 hours following the event start time of 09:30 PT on 23 January 2024. Queries for such log-based metrics would time out if the query time interval overlapped with the interval containing high cardinality values, as it is less efficient to store and query high cardinality data points. The internal cardinality of the metrics was increased because the writer tasks had scaled up to ingest the backlog and there was an increased number of late-arriving logs. The issue was resolved when the high cardinality data aged out of the 25-hour in-memory retention window.
  • Log Analytics: Queries in Log Analytics would not return recently ingested data during the outage period until 24 January 2024 02:00 PT because the ingester scale-up led to exceeding an internal connection quota limit.
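The link between writer scale-up and metric cardinality can be pictured with a rough upper-bound calculation. This is a sketch under a stated assumption, not a description of the actual pipeline: it assumes each writer task emits its own internal series for every label combination it sees, so the internal series count grows with the number of writers. The function `internal_series` and the figures used (2,000 label combinations, 4 and 40 writers) are hypothetical.

```python
def internal_series(label_combinations: int, writer_tasks: int) -> int:
    """Upper bound on distinct internal series, assuming every writer task
    emits a separate series for every label combination (an assumption)."""
    if label_combinations < 0 or writer_tasks < 0:
        raise ValueError("counts must be non-negative")
    return label_combinations * writer_tasks


# Hypothetical figures: the same metric before and after a 10x writer scale-up.
BASELINE = internal_series(label_combinations=2_000, writer_tasks=4)
SCALED_UP = internal_series(label_combinations=2_000, writer_tasks=40)

if __name__ == "__main__":
    print(f"baseline series:  {BASELINE}")
    print(f"scaled-up series: {SCALED_UP}")
```

Under this assumption, a tenfold increase in writers yields a tenfold increase in internal series for the same user-visible metric, which is consistent with the remediation of scaling writer tasks back down.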

Remediation and Prevention

Google engineers were alerted to the outage via a support case and internal alerts on Tuesday, 23 January 2024 09:55 PT and immediately started an investigation. Once they determined that the feature ramp-up was the cause of the outage, they rolled back the feature ramp-up at 11:15 PT. This reduced contention but the recovery process was slow because the backlog already contained many logs that triggered provisioning. A change was then rolled out to accelerate the recovery of the impacted regions by disabling provisioning for the affected logs. The rollout completed by 13:40 PT, and the backlog was fully processed in all impacted regions by 14:15 PT.

To remediate the log-based metrics issue, Google engineers scaled down the number of writer tasks to reduce cardinality of the generated metrics. The degraded query performance was mitigated by 24 January 2024 22:00 PT as the high cardinality data aged out of the in-memory retention window.

To remediate the issue in Log Analytics, Google engineers raised the internal connection quota and changed the connection type used by the buffer processor from exclusive connections to shared multiplexed connections. These changes mitigated the issue for Log Analytics and the buffered logs were fully processed by 24 January 2024 02:00 PT. The mitigation will also reduce the likelihood of a future occurrence of connection quota issues in Log Analytics.
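The effect of moving from exclusive to shared multiplexed connections can be sketched with simple arithmetic. The numbers below (a quota of 500 connections, 2,000 workers after scale-up, 100 streams per shared connection) are hypothetical and chosen only to show the shape of the problem; the report does not state the real values, and the function names are illustrative.

```python
import math


def exclusive_connections(workers: int) -> int:
    """Each worker holds its own connection."""
    return workers


def multiplexed_connections(workers: int, streams_per_connection: int) -> int:
    """Workers share connections, each carrying several concurrent streams."""
    if streams_per_connection < 1:
        raise ValueError("streams_per_connection must be at least 1")
    return math.ceil(workers / streams_per_connection)


# Hypothetical values for illustration only.
QUOTA = 500
WORKERS_AFTER_SCALE_UP = 2_000

EXCLUSIVE = exclusive_connections(WORKERS_AFTER_SCALE_UP)
SHARED = multiplexed_connections(WORKERS_AFTER_SCALE_UP, streams_per_connection=100)

if __name__ == "__main__":
    print(f"exclusive: {EXCLUSIVE} connections, over quota: {EXCLUSIVE > QUOTA}")
    print(f"shared:    {SHARED} connections, over quota: {SHARED > QUOTA}")
```

With exclusive connections the count scales one-to-one with workers, so a scale-up can exceed a fixed quota; with multiplexing the count grows far more slowly, which is why the change reduces the likelihood of hitting the quota again.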

Google is committed to preventing recurrence of this incident. The following actions are in progress:

  • Change the rollout process used to implement these types of traffic ramp-ups, to reduce the blast radius of any issues that result.
  • Reduce the timeout for database operations in the Log Router, to ensure that problematic data is isolated more quickly.
  • Implement fault injection testing under load to verify that Log Router throughput can be maintained during both failures and slow operation of the configuration database.
  • Change the dynamic provisioning process to eliminate this dependency from the log routing critical path.
  • Improve monitoring, alerting, and playbooks so engineers are notified of and able to respond to log entry backlogs more quickly.
  • Improve the log-based metrics pipeline processing to reduce the cardinality issues that were caused by scaling up writers to consume the backlog.

Google is committed to quickly and continually improving our technology and operations to prevent service disruptions. We appreciate your patience and apologize again for the impact to your organization. We thank you for your business.

Detailed Description of Impact

Cloud Logging

  • Log ingestion in Cloud Logging was delayed for logs ingested from us-central1, us-east1, europe-west1, europe-north1, and asia-east1 regions. Ingestion of logs from other regions was not impacted. The following are the impact windows for each affected region on 23 January 2024:

    • us-central1: 9:30-13:30 PT (4h) ~50% of messages
    • us-east1: 9:30-10:20 PT (50m) ~40% of messages
    • europe-west1: 9:30-14:15 PT (4h45m) ~75% of messages
    • europe-north1: 9:30-14:15 PT (4h45m) ~1% of messages
    • asia-east1: 9:30-10:10 PT (40m) ~20% of messages
  • Queries of logs from the impacted regions would not return recently written data during the outage period.

  • Exports of logs from the impacted regions to BigQuery, GCS, and Cloud PubSub destinations were delayed.

  • Graphs and queries for log-based metrics may have appeared to have missing or incomplete data during the outage period. Alerts that depended on this data may have been missed.

  • Log-based metrics queries for some high cardinality metrics experienced degraded performance for some users for about 25 hours following the event due to in-memory retention of high cardinality data.

  • No logs were lost during the incident. However, some log-based metrics derived from the logs may have gaps in the corresponding data points during the outage period.

  • Ingestion to Log Analytics was degraded for a duration of 12 hours.

Personalized Service Health

  • PSH Alerting through Cloud Console would have been unavailable during the impact windows for each affected region.
  • Additionally, messages written by PSH to Cloud Logging were also delayed for the affected regions.
  • The ability to view incident status on the PSH Dashboard and integrations to the PSH API were not impacted.
24 Jan 2024 12:45 PST

Mini Incident Report

We apologize for the inconvenience this service outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support.

(All Times US/Pacific)

Incident Start: 23 Jan 2024 09:30

Incident End for us-east1 logs delay: 23 Jan 2024 10:20

Incident End for us-central1 logs delay: 23 Jan 2024 13:45

Incident End for europe-west1 logs delay: 23 Jan 2024 14:15

Incident End for Cloud Metrics backfill: 23 Jan 2024 18:05

Cumulative Duration: 8 hours, 35 minutes

Affected Services and Features:

  • Google Cloud Logging
  • Cloud Dataflow
  • Personalized Service Health
  • Google Cloud products and services that rely on Cloud Logging.

Regions/Zones: us-central1, europe-west1, us-east1

Description:

Cloud Logging experienced delays ingesting logs that originated from us-central1, us-east1, and europe-west1 resulting in customers not being able to view these logs during that time in Google Cloud Console or other places that make use of the Logging query APIs. Exports of logs as well as writing to log-based metrics were also delayed during this period.

Ingestion delays in Cloud Logging had a downstream impact on Google Cloud products and services that rely on Cloud Logging.

From our preliminary investigations, the root cause of the issue is a rollout of an internal feature for Cloud Trace that uses Cloud Logging. The rollout of the new feature caused unexpected contention in accessing the configuration database used for Log Routing, which caused a backlog in the ingestion pipeline.

The issue was mitigated by rolling back the internal feature of Cloud Trace, which concluded at 13:45 US/Pacific, and the logs in the pending queue were gradually processed, mitigating the delayed logs issue by 14:15 US/Pacific. The issue with log-based metrics, where the logs were not written to corresponding data points, was mitigated at 18:05 US/Pacific.

Google will complete a full Incident Report in the following days to provide a full root cause.

Customer Impact:

Impact to Google Cloud products and services:

  • Log ingestion in Cloud Logging was delayed for logs ingested from us-central1, europe-west1, and us-east1. Ingestion of logs from other regions was not impacted.

  • Log queries, exports of logs (to BigQuery, GCS, or Cloud PubSub destinations), and writing corresponding data points to log-based metrics were delayed for the three impacted regions.

  • No logs were lost during the incident. However, some log-based metrics derived from the logs may have gaps in the corresponding data points during the outage period.

Impact to Personalized Service Health (PSH):

  • The PSH dashboard displayed outage status; however, the customer notification system was down during the incident.
  • In addition to this, the log messages written by PSH were also impacted in the locations above.
  • Integrations to the PSH API were not impacted.

Additional Information:

  • A subset of customers are experiencing delays with querying historical log-based metrics data and our engineers are continuing to work on resolving the issue for those customers via a separate incident.

23 Jan 2024 18:05 PST

The issue with Cloud Logging and Personalized Service Health has been resolved for all affected users as of Tuesday, 2024-01-23 17:50 US/Pacific.

We will publish an analysis of this incident once we have completed our internal investigation.

We thank you for your patience while we worked on resolving the issue.

23 Jan 2024 15:17 PST

Summary: Multiple regions: Cloud Logging is experiencing issues with displaying log data in Google Cloud Console

Description: Our engineering team has narrowed down the root cause to a feature rollout that started at 09:30 US/Pacific. The new feature usage caused contention in the backend database.

After further investigation, the impact is narrowed down to logs from GCP products or services in us-central1, us-east1, and europe-west1 destined for any Cloud Logging bucket. We apologize for any confusion caused by the previous communications.

us-east1 recovered as of 10:20 US/Pacific and us-central1 recovered as of 13:42 US/Pacific.

Our engineers have completed the rollback of the new feature and are currently working on processing the logs in the pending queue in us-central1, europe-west1, and Global. The rollout to clear the logs backlog has completed in all the affected regions. The logs backlog was cleared for us-east1 as of 10:20 US/Pacific, for us-central1 as of 13:42 US/Pacific, and for europe-west1 as of 14:15 US/Pacific.

Remaining Impact: Customers using log-based metrics may observe metrics with no backing logs. Our engineering teams are working on mitigating the metrics backfill. We do not have an ETA for mitigation of the log-based metrics issue at this point.

The notification error rate for PSH has reduced significantly and we believe that the majority of the backed up notifications are delivered. As we gradually process the remaining logs in the pending queue, customers may see an increase of historical outage notifications from PSH for the affected time period. We apologize for the inconvenience.

We will provide more information by Tuesday, 2024-01-23 20:00 US/Pacific.

Diagnosis:

  • Customers using log-based metrics may observe metrics with no backing logs.

  • Other impacts that were previously reported for Cloud Logging and PSH notifications should now be resolved.

Workaround: None at this time.

23 Jan 2024 14:24 PST

Summary: Multiple regions: Cloud Logging is experiencing issues with displaying log data in Google Cloud Console.

Description: Our engineering team has narrowed down the root cause to a feature rollout that started at 09:30 US/Pacific. The new feature usage caused contention in the backend database.

After further investigation, the impact is narrowed down to logs from GCP products or services in us-central1, us-east1, and europe-west1 destined for any Cloud Logging bucket. We apologize for any confusion caused by the previous communications.

us-east1 recovered as of 10:20 US/Pacific and us-central1 recovered as of 13:42 US/Pacific.

The rollout to clear the logs backlog has completed in all the affected regions. The logs backlog was cleared for us-east1 as of 10:20 US/Pacific, for us-central1 as of 13:42 US/Pacific, and for europe-west1 as of 14:15 US/Pacific.

Remaining Impact: Customers using log-based metrics may observe metrics with no backing logs. Our engineering teams are working on mitigating the metrics backfill. We do not have an ETA for mitigation of the log-based metrics issue at this point.

The notification error rate for PSH has reduced significantly and we believe that the majority of the backed up notifications are delivered. As we gradually process the remaining logs in the pending queue, customers may see an increase of historical outage notifications from PSH for the affected time period. We apologize for the inconvenience.

We will provide more information by Tuesday, 2024-01-23 15:30 US/Pacific.

Diagnosis:

  • Customers using log-based metrics may observe metrics with no backing logs.

  • Other impacts that were previously reported for Cloud Logging and PSH notifications should now be resolved.

Workaround: None at this time.

23 Jan 2024 13:17 PST

Summary: Multiple regions: Cloud Logging Data is experiencing issues with displaying log data in Google Cloud Console

Description: We are experiencing an issue with Cloud Logging, Cloud Dataflow, and Personalized Service Health (PSH).

After further investigation, the impact is narrowed down to logs from GCP products or services in us-central1, us-east1, and europe-west1 destined for any Cloud Logging bucket. We apologize for any confusion caused by the previous communications.

us-east1 recovered as of 10:20 US/Pacific.

Our engineering team has narrowed down the root cause to a feature rollout that started at 09:30 US/Pacific. The new feature usage caused contention in the backend database.

Our engineers have completed the rollback of the new feature and are currently working on processing the logs in the pending queue in us-central1, europe-west1, and Global. Currently, the ETA for full mitigation is within the hour.

As we gradually process the logs in the pending queue, customers may see an increase of historical outage notifications from PSH for the affected time period. We apologize for the inconvenience.

We will provide more information by Tuesday, 2024-01-23 14:00 US/Pacific.

Diagnosis:

  • Customers are unable to receive logs at this time in the Google Cloud Console.

  • Cloud Dataflow is impacted and we are working to identify other downstream impacts.

  • GCP products and services using Cloud Logging are potentially impacted.

  • Customers using PSH notifications are also impacted. While the PSH dashboard is available and outage communications are displayed on it, customers will not get notifications on the outages until issue resolution. Additionally, log messages written by PSH are also impacted. Integrations to the PSH API are not impacted.

Workaround: None at this time.

23 Jan 2024 12:56 PST

Summary: Global: Cloud Logging Data is experiencing issues with displaying log data in Google Cloud Console

Description: We are experiencing an issue with Cloud Logging, Cloud Dataflow, and Personalized Service Health (PSH).

After further investigation, the impact is narrowed down to Cloud Logging customers using buckets configured in us-central1, us-east1, europe-west1, and Global rather than all the regions as previously mentioned. We apologize for any confusion this may have caused.

Our engineering team has narrowed down the root cause to a feature rollout that started at 09:30 US/Pacific. The new feature usage caused contention in the backend database.

Our engineers have completed the rollback of the new feature and are currently working on processing the logs in the pending queue in us-central1, europe-west1, and Global buckets. us-east1 recovered as of 10:20 US/Pacific. Currently, the ETA for full mitigation is 1 hour.

We will provide more information by Tuesday, 2024-01-23 14:00 US/Pacific.

Diagnosis:

  • Customers are unable to receive logs at this time in the Google Cloud Console.

  • Cloud Dataflow is impacted and we are working to identify other downstream impacts.

  • GCP products and services using Cloud Logging are potentially impacted.

  • Customers using PSH notifications are also impacted. While the PSH dashboard is available and outage communications are displayed on it, customers will not get notifications on the outages until issue resolution. Additionally, log messages written by PSH are also impacted. Integrations to the PSH API are not impacted.

Workaround: None at this time.

23 Jan 2024 12:24 PST

Summary: Global: Cloud Logging Data is experiencing issues with displaying log data in Google Cloud Console

Description: We are experiencing an issue with Cloud Logging, Cloud Dataflow, and Personalized Service Health (PSH).

Our engineering team has narrowed down the root cause to a feature rollout that started at 09:30 US/Pacific. The new feature usage caused contention in the backend database.

Our engineers have completed the rollback of the new feature and are currently working on processing the logs in the pending queue. Currently, the ETA for full mitigation is 3 hours.

We will provide more information by Tuesday, 2024-01-23 13:00 US/Pacific.

Diagnosis:

  • Customers are unable to receive logs at this time in the Google Cloud Console.

  • Cloud Dataflow is impacted and we are working to identify other downstream impacts.

  • GCP products and services using Cloud Logging are potentially impacted.

  • Customers using PSH notifications are also impacted. While the PSH dashboard is available and outage communications are displayed on it, customers will not get notifications on the outages until issue resolution. Additionally, log messages written by PSH are also impacted. Integrations to the PSH API are not impacted.

Workaround: None at this time.

23 Jan 2024 11:31 PST

Summary: Global: Cloud Logging Data Unavailable in Google Cloud Console

Description: We are experiencing an issue with Cloud Logging.

Our engineering team continues to investigate the issue. We have identified a potential root cause and are working on a mitigation strategy.

We will provide more information by Tuesday, 2024-01-23 12:00 US/Pacific.

Diagnosis: Customers are unable to receive fresh logs at this time in the Google Cloud Console.

Workaround: None at this time.

23 Jan 2024 11:12 PST

Summary: Global: Cloud Logging Data Unavailable in Google Cloud Console

Description: We've received a report of an issue with Cloud Logging as of Tuesday, 2024-01-23 10:36 US/Pacific. We will provide more information by Tuesday, 2024-01-23 11:35 US/Pacific.

Diagnosis: Cloud Logging data is unavailable in the Google Cloud Console.

Workaround: None at this time.