Service Health
Incident affecting Operations, Cloud Logging
Multiple regions: Cloud Logging was experiencing issues with displaying log data in Google Cloud Console
Incident began at 2024-01-23 09:30 and ended at 2024-01-24 22:00 (all times are US/Pacific).
Previously affected location(s)
Taiwan (asia-east1), Hong Kong (asia-east2), Tokyo (asia-northeast1), Osaka (asia-northeast2), Seoul (asia-northeast3), Mumbai (asia-south1), Delhi (asia-south2), Singapore (asia-southeast1), Jakarta (asia-southeast2), Sydney (australia-southeast1), Melbourne (australia-southeast2), Multi-region: eu, Warsaw (europe-central2), Finland (europe-north1), Madrid (europe-southwest1), Belgium (europe-west1), Berlin (europe-west10), Turin (europe-west12), London (europe-west2), Frankfurt (europe-west3), Netherlands (europe-west4), Zurich (europe-west6), Milan (europe-west8), Paris (europe-west9), Global, Doha (me-central1), Dammam (me-central2), Tel Aviv (me-west1), Montréal (northamerica-northeast1), Toronto (northamerica-northeast2), São Paulo (southamerica-east1), Santiago (southamerica-west1), Multi-region: us, Iowa (us-central1), South Carolina (us-east1), Northern Virginia (us-east4), Columbus (us-east5), Dallas (us-south1), Oregon (us-west1), Los Angeles (us-west2), Salt Lake City (us-west3), Las Vegas (us-west4)
Date | Time | Description | |
---|---|---|---|
| 30 Jan 2024 | 17:58 PST | Incident Report. Summary: On Tuesday, 23 January 2024 at 09:30 PT, Cloud Logging, and Google Cloud products and services that rely on Cloud Logging, experienced delays ingesting logs originating from us-central1, us-east1, europe-west1, europe-north1, and asia-east1. This resulted in customers not being able to view these logs in Google Cloud Console or other places that make use of the logging query APIs during the impact windows for each region:
Exports of logs as well as writing to log-based metrics were also delayed in these regions during the impact windows. The issue also affected alerting through Cloud Console for Personalized Service Health (PSH). To our Google Cloud customers who were affected, we sincerely apologize. This is not the level of quality and reliability we strive to offer you and we are taking immediate steps to improve the platform’s performance and availability. Root Cause: The first time data is received for a log bucket that has Log Analytics enabled, Cloud Logging dynamically provisions resources necessary to ingest and store logs in BigQuery for that bucket. This requires updating state in a configuration database, which is accessed during log ingestion and routing. This configuration is required to ensure data is stored in compliance with each customer's organization settings. As part of ongoing feature development, Cloud Logging engineers increased traffic from a new set of projects to Log Analytics. These projects were ingesting logs in multiple regions and the ramp-up resulted in a large number of concurrent dynamic provisioning requests in the five regions listed above. This caused contention and slowdowns accessing the configuration database. While the Log Router has a load-shedding mechanism to protect against loss of throughput in such situations, there was a previously unknown latent issue that caused the problematic traffic to not be isolated quickly enough in a separate buffer. As a result, Log Router throughput was reduced by about 40% in the impacted regions, causing a log processing backlog to form. The primary impact was that recently written logs were not visible to queries and log exports were delayed for log data originating from any of the impacted regions. This delay also affected Log Analytics, log-based metrics, and other Google products and services that rely on log data, including Personalized Service Health.
No log entries were permanently lost, and all log entries were eventually successfully ingested, indexed for queries, and exported to configured destinations. To quickly process the large backlog of data, the Log Analytics ingesters and log-based metrics pipeline were scaled up. This scale-up led to two additional unintended secondary impacts.
Remediation and Prevention: Google engineers were alerted to the outage via a support case and internal alerts on Tuesday, 23 January 2024 09:55 PT and immediately started an investigation. Once they determined that the feature ramp-up was the cause of the outage, they rolled back the feature ramp-up at 11:15 PT. This reduced contention but the recovery process was slow because the backlog already contained many logs that triggered provisioning. A change was then rolled out to accelerate the recovery of the impacted regions by disabling provisioning for the affected logs. The rollout completed by 13:40 PT, and the backlog was fully processed in all impacted regions by 14:15 PT. To remediate the log-based metrics issue, Google engineers scaled down the number of writer tasks to reduce cardinality of the generated metrics. The degraded query performance was mitigated by 24 January 2024 22:00 PT as the high cardinality data aged out of the in-memory retention window. To remediate the issue in Log Analytics, Google engineers raised the internal connection quota and changed the connection type used by the buffer processor from exclusive connections to shared multiplexed connections. These changes mitigated the issue for Log Analytics and the buffered logs were fully processed by 24 January 2024 02:00 PT. The mitigation will also reduce the likelihood of a future occurrence of connection quota issues in Log Analytics. Google is committed to preventing recurrence of this incident. The following actions are in progress:
Google is committed to quickly and continually improving our technology and operations to prevent service disruptions. We appreciate your patience and apologize again for the impact to your organization. We thank you for your business. Detailed Description of Impact:
Cloud Logging
Personalized Service Health
|
| 24 Jan 2024 | 12:45 PST | Mini Incident Report: We apologize for the inconvenience this service outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support. (All Times US/Pacific) Incident Start: 23 Jan 2024 09:30; Incident End for us-east1 logs delay: 23 Jan 2024 10:20; Incident End for us-central1 logs delay: 23 Jan 2024 13:45; Incident End for europe-west1 logs delay: 23 Jan 2024 14:15; Incident End for Cloud Metrics backfill: 23 Jan 2024 18:05; Cumulative Duration: 8 hours, 35 minutes. Affected Services and Features:
Regions/Zones: us-central1, europe-west1, us-east1. Description: Cloud Logging experienced delays ingesting logs that originated from us-central1, us-east1, and europe-west1, resulting in customers not being able to view these logs during that time in Google Cloud Console or other places that make use of the Logging query APIs. Exports of logs as well as writing to log-based metrics were also delayed during this period. Ingestion delays in Cloud Logging had a downstream impact on Google Cloud products and services that rely on Cloud Logging. From our preliminary investigations, the root cause of the issue is a rollout of an internal feature for Cloud Trace that uses Cloud Logging. The rollout of the new feature caused unexpected contention in accessing the configuration database used for Log Routing, which caused a backlog in the ingestion pipeline. The issue was mitigated by rolling back the internal feature of Cloud Trace, which concluded at 13:45 US/Pacific, and the logs in the pending queue were gradually processed, mitigating the delayed logs issue by 14:15 US/Pacific. The issue with log-based metrics, where the logs were not written to corresponding data points, was mitigated at 18:05 US/Pacific. Google will complete a full Incident Report in the following days to provide a full root cause. Customer Impact: Impact to Google Cloud products and services:
Impact to Personalized Service Health (PSH):
Additional Information:
|
| 23 Jan 2024 | 18:05 PST | The issue with Cloud Logging, Personalized Service Health has been resolved for all affected users as of Tuesday, 2024-01-23 17:50 US/Pacific. We will publish an analysis of this incident once we have completed our internal investigation. We thank you for your patience while we worked on resolving the issue. |
| 23 Jan 2024 | 15:17 PST | Summary: Multiple regions: Cloud Logging is experiencing issues with displaying log data in Google Cloud Console. Description: Our engineering team has narrowed down the root cause to a feature rollout that started at 09:30 US/Pacific. The new feature usage caused contention in the backend database. After further investigation, the impact is narrowed down to logs from GCP products or services in us-central1, us-east1, and europe-west1 destined to any Cloud Logging buckets. We apologize for any confusion caused by the previous communications. us-east1 recovered as of 10:20 US/Pacific and us-central1 recovered as of 13:42 US/Pacific. Our engineers have completed the rollback of the new feature and are currently working on processing the logs in the pending queue in us-central1, europe-west1, Global. The rollout to clear the logs backlog has completed in all the affected regions. The logs backlog for us-east1 completed as of 10:20 US/Pacific, for us-central1 completed as of 13:42 US/Pacific and for europe-west1 completed as of 14:15 US/Pacific. Remaining Impact: Customers using log-based metrics may observe metrics with no backing logs. Our engineering teams are working on mitigating the metrics backfill. We do not have an ETA for mitigation of the log-based metrics issue at this point. The notification error rate for PSH has reduced significantly and we believe that the majority of the backed up notifications are delivered. As we gradually process the remaining logs in the pending queue, customers may see an increase of historical outage notifications from PSH for the affected time period. We apologize for the inconvenience. We will provide more information by Tuesday, 2024-01-23 20:00 US/Pacific. Diagnosis: - Customers using log-based metrics may observe metrics with no backing logs.
Workaround: None at this time. |
| 23 Jan 2024 | 14:24 PST | Summary: Multiple regions: Cloud Logging is experiencing issues with displaying log data in Google Cloud Console. Description: Our engineering team has narrowed down the root cause to a feature rollout that started at 09:30 US/Pacific. The new feature usage caused contention in the backend database. After further investigation, the impact is narrowed down to logs from GCP products or services in us-central1, us-east1, and europe-west1 destined to any Cloud Logging buckets. We apologize for any confusion caused by the previous communications. us-east1 recovered as of 10:20 US/Pacific and us-central1 recovered as of 13:42 US/Pacific. The rollout to clear the logs backlog has completed in all the affected regions. The logs backlog for us-east1 completed as of 10:20 US/Pacific, for us-central1 completed as of 13:42 US/Pacific and for europe-west1 completed as of 14:15 US/Pacific. Remaining Impact: Customers using log-based metrics may observe metrics with no backing logs. Our engineering teams are working on mitigating the metrics backfill. We do not have an ETA for mitigation of the log-based metrics issue at this point. The notification error rate for PSH has reduced significantly and we believe that the majority of the backed up notifications are delivered. As we gradually process the remaining logs in the pending queue, customers may see an increase of historical outage notifications from PSH for the affected time period. We apologize for the inconvenience. We will provide more information by Tuesday, 2024-01-23 15:30 US/Pacific. Diagnosis:
Workaround: None at this time. |
| 23 Jan 2024 | 13:17 PST | Summary: Multiple regions: Cloud Logging Data is experiencing issues with displaying log data in Google Cloud Console. Description: We are experiencing an issue with Cloud Logging, Cloud Dataflow, and Personalized Service Health (PSH). After further investigation, the impact is narrowed down to logs from GCP products or services in us-central1, us-east1, and europe-west1 destined to any Cloud Logging buckets. We apologize for any confusion caused by the previous communications. us-east1 recovered as of 10:20 US/Pacific. Our engineering team has narrowed down the root cause to a feature rollout that started at 09:30 US/Pacific. The new feature usage caused contention in the backend database. Our engineers have completed the rollback of the new feature and are currently working on processing the logs in the pending queue in us-central1, europe-west1, Global. Currently, the ETA for the full mitigation is within the hour. As we gradually process the logs in the pending queue, customers may see an increase of historical outage notifications from PSH for the affected time period. We apologize for the inconvenience. We will provide more information by Tuesday, 2024-01-23 14:00 US/Pacific. Diagnosis:
Workaround: None at this time. |
| 23 Jan 2024 | 12:56 PST | Summary: Global: Cloud Logging Data is experiencing issues with displaying log data in Google Cloud Console. Description: We are experiencing an issue with Cloud Logging, Cloud Dataflow, and Personalized Service Health (PSH). After further investigation, the impact is narrowed down to Cloud Logging customers using buckets configured in us-central1, us-east1, europe-west1, and Global rather than all the regions as previously mentioned. We apologize for any confusion this may have caused. Our engineering team has narrowed down the root cause to a feature rollout that started at 09:30 US/Pacific. The new feature usage caused contention in the backend database. Our engineers have completed the rollback of the new feature and are currently working on processing the logs in the pending queue in us-central1, europe-west1, and Global buckets. us-east1 recovered as of 10:20 US/Pacific. Currently, the ETA for the full mitigation is 1 hour. We will provide more information by Tuesday, 2024-01-23 14:00 US/Pacific. Diagnosis:
Workaround: None at this time. |
| 23 Jan 2024 | 12:24 PST | Summary: Global: Cloud Logging Data is experiencing issues with displaying log data in Google Cloud Console Description: We are experiencing an issue with Cloud Logging, Cloud Dataflow, and Personalized Service Health (PSH). Our engineering team has narrowed down the root cause to a feature rollout that started at 09:30 US/Pacific. The new feature usage caused contention in the backend database. Our engineers have completed the roll back of the new feature and are currently working on processing the logs in the pending queue. Currently, the ETA for the full mitigation is 3 hours. We will provide more information by Tuesday, 2024-01-23 13:00 US/Pacific. Diagnosis:
Workaround: None at this time. |
| 23 Jan 2024 | 11:31 PST | Summary: Global: Cloud Logging Data Unavailable in Google Cloud Console. Description: We are experiencing an issue with Cloud Logging. Our engineering team continues to investigate the issue. We have identified the potential root cause and are working on a mitigation strategy. We will provide more information by Tuesday, 2024-01-23 12:00 US/Pacific. Diagnosis: Customers are unable to receive fresh logs at this time in the Google Cloud Console. Workaround: None at this time. |
| 23 Jan 2024 | 11:12 PST | Summary: Global: Cloud Logging Data Unavailable in Google Cloud Console Description: We've received a report of an issue with Cloud Logging as of Tuesday, 2024-01-23 10:36 US/Pacific. We will provide more information by Tuesday, 2024-01-23 11:35 US/Pacific. Diagnosis: Cloud Logging data is unavailable in the Google Cloud Console. Workaround: None at this time. |
- All times are US/Pacific
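
The Root Cause section of the incident report above describes a load-shedding mechanism that is meant to move problematic traffic into a separate buffer so that it does not reduce Log Router throughput. The Python sketch below illustrates that general idea only. It is not Cloud Logging's implementation: the `Router` class, its fields, the `provision` method, and the bucket names are all hypothetical, and real provisioning is asynchronous rather than a single method call.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class LogEntry:
    bucket: str   # hypothetical destination bucket name
    payload: str


@dataclass
class Router:
    """Toy router: fast-path entries are delivered at once, slow ones are set aside."""
    provisioned: set = field(default_factory=set)    # buckets whose config is ready
    delivered: list = field(default_factory=list)    # entries routed on the fast path
    isolated: deque = field(default_factory=deque)   # entries waiting on provisioning

    def route(self, entry: LogEntry) -> None:
        if entry.bucket in self.provisioned:
            self.delivered.append(entry)
        else:
            # Shed the slow entry into a separate buffer instead of
            # blocking entries behind it that need no provisioning.
            self.isolated.append(entry)

    def provision(self, bucket: str) -> None:
        """Mark a bucket ready and drain its waiting entries in arrival order."""
        self.provisioned.add(bucket)
        still_waiting = deque()
        while self.isolated:
            entry = self.isolated.popleft()
            if entry.bucket == bucket:
                self.delivered.append(entry)
            else:
                still_waiting.append(entry)
        self.isolated = still_waiting


def demo() -> tuple:
    router = Router(provisioned={"ready-bucket"})
    router.route(LogEntry("new-bucket", "a"))
    router.route(LogEntry("ready-bucket", "b"))
    router.route(LogEntry("new-bucket", "c"))
    before = [e.payload for e in router.delivered]
    router.provision("new-bucket")
    after = [e.payload for e in router.delivered]
    return before, after, len(router.isolated)
```

In this sketch, the entry for `ready-bucket` is delivered immediately even though an entry for the unprovisioned bucket arrived first. The report attributes the throughput drop to such isolation not happening quickly enough.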