Service Health
Incident affecting Cloud Firestore, Google App Engine, Google Cloud Functions
Increased latency and error rates observed on Google App Engine, Cloud Firestore, and Google Cloud Functions gen 1.
Incident began at 2024-09-18 12:34 and ended at 2024-09-18 15:30 (all times are US/Pacific).
Previously affected location(s)
Taiwan (asia-east1), Osaka (asia-northeast2), Seoul (asia-northeast3), Mumbai (asia-south1), Singapore (asia-southeast1), Jakarta (asia-southeast2), Sydney (australia-southeast1), Warsaw (europe-central2), London (europe-west2), Frankfurt (europe-west3), Zurich (europe-west6), São Paulo (southamerica-east1), Iowa (us-central1), South Carolina (us-east1), Northern Virginia (us-east4), Salt Lake City (us-west3)
| Date | Time | Description |
|---|---|---|
| 23 Sep 2024 | 07:11 PDT | Incident Report
Summary: On Wednesday, 18 September 2024, Google App Engine, Cloud Firestore, and Google Cloud Run functions (1st gen) experienced increased latency and error rates for 2 hours and 56 minutes in multiple regions. In some regions, customers experienced a complete service outage lasting between 5 minutes and 67 minutes. The issue began on 18 September 2024 at 12:34 US/Pacific and was completely resolved on 18 September 2024 at 15:30 US/Pacific. To our customers who were impacted during this disruption, we sincerely apologize. This is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform’s performance and availability.
Root Cause: The root cause was newly implemented automation code that created a bad traffic routing policy. This policy incorrectly directed our traffic routing control plane to mark all clusters as unavailable to serve traffic for App Engine, Google Cloud Run functions (1st gen)* and dependent services. Google engineers intervened before the policy was rolled out to all clusters, resulting in a partial outage of the service.
Remediation and Prevention: Google engineers were alerted to the issue via internal production monitoring on 18 September 2024 at 13:01 US/Pacific, shortly after customers began experiencing the impact. Engineering teams identified the automation that caused the impact and terminated it at 13:46. However, customer impact was not mitigated until 15:30, after engineers manually directed traffic back to the affected clusters. Google is committed to preventing a repeat of the issue in the future and is completing the following actions:
Detailed Description of Impact: On Wednesday, 18 September 2024, from 12:34 US/Pacific to 15:30 US/Pacific, Google App Engine, Google Cloud Run functions (1st gen)* and Cloud Firestore experienced elevated error rates and increased latency. Customers reported high latency and 5xx errors with the message “Request was aborted after waiting too long to attempt to service your request.” Customers also experienced high cold starts during this time. In 13 regions, customers experienced a complete service outage lasting between 5 minutes and 67 minutes.
In 11 other regions, customers might have observed elevated error rates:
*Cloud Run and Cloud Run functions (gen2) were not affected. |
| 18 Sep 2024 | 21:36 PDT | Mini Incident Report
We apologize for the inconvenience this service outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support
(All Times US/Pacific)
Incident Start: 18 September 2024, 13:01
Incident End: 18 September 2024, 15:30
Duration: 2 hours, 29 minutes
Affected Services and Features:
Regions/Zones: Global
Description: Google App Engine, Google Cloud Functions Gen1, and Firestore experienced elevated error rates and increased latency for a period of 2 hours, 29 minutes. Based on our preliminary analysis, the root cause of the issue was newly implemented automation code that created a bad traffic routing policy. This policy incorrectly directed our traffic routing control plane to mark all clusters as unavailable to serve traffic for App Engine and dependent services. Google engineers intervened before the policy was rolled out to all clusters, resulting in a partial outage of the service. Google engineers have identified the automation that was responsible for this change and have terminated it until appropriate safeguards are put in place. The impact was mitigated by manually directing the traffic back to the affected clusters. There is no risk of a recurrence of this outage at the moment. Google will complete a full Incident Report (IR) in the following days that will provide a full root cause.
Customer Impact:
|
| 18 Sep 2024 | 15:50 PDT | The issue with Google App Engine, Google Cloud Functions, and Cloud Firestore has been resolved for all affected users as of Wednesday, 2024-09-18 15:30 US/Pacific. We will publish an analysis of this incident once we have completed our internal investigation. We thank you for your patience while we worked on resolving the issue. |
| 18 Sep 2024 | 15:19 PDT | Summary: Increased latency and error rates observed on Google App Engine, Cloud Firestore, and Google Cloud Functions gen 1. Description: Mitigation has been successfully applied by our engineering team. We are currently monitoring our environment to ensure stability. We will provide more information by Wednesday, 2024-09-18 16:00 US/Pacific. Diagnosis: Affected users may encounter elevated latency or an elevated error rate for the impacted products. Workaround: None at this time. |
| 18 Sep 2024 | 14:57 PDT | Summary: Increased latency and error rates observed on Google App Engine and Google Cloud Functions gen 1. Description: Mitigation work is currently underway by our engineering team. Based on the investigation thus far, our engineers have identified that Cloud Run is not currently impacted. We do not have an ETA for mitigation at this point. We will provide more information by Wednesday, 2024-09-18 16:00 US/Pacific. Diagnosis: Affected users may encounter elevated latency or an elevated error rate for the impacted products. Workaround: None at this time. |
- All times are US/Pacific
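
The two reports above quote different durations (2 hours and 56 minutes in the full report, 2 hours and 29 minutes in the mini report) because they use different start times: 12:34, when customer impact began, versus 13:01, when engineers were alerted. The following is a minimal sketch that checks this arithmetic from the timestamps in the reports. The helper names `elapsed` and `describe` are hypothetical and exist only for illustration. Timestamps are treated as naive US/Pacific wall-clock times, which is safe here only because all of them fall on the same day with no daylight-saving change.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"


def elapsed(start: str, end: str) -> timedelta:
    """Return end - start for two wall-clock timestamps in the same time zone."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


def describe(delta: timedelta) -> str:
    """Format a timedelta as whole hours and minutes."""
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60} hours, {minutes % 60} minutes"


# Timestamps taken from the incident reports (US/Pacific).
impact_start = "2024-09-18 12:34"        # customer impact began
alerted = "2024-09-18 13:01"             # internal monitoring alerted engineers
automation_stopped = "2024-09-18 13:46"  # faulty automation terminated
mitigated = "2024-09-18 15:30"           # traffic manually restored

# Duration in the full incident report: 2 hours, 56 minutes
print(describe(elapsed(impact_start, mitigated)))

# Duration in the mini incident report: 2 hours, 29 minutes
print(describe(elapsed(alerted, mitigated)))

# Time from start of impact to alert, and from stopping the
# automation to full mitigation.
print(elapsed(impact_start, alerted))
print(elapsed(automation_stopped, mitigated))
```

The last two lines show that the alert arrived 27 minutes after impact began, and that manually redirecting traffic took a further 1 hour and 44 minutes after the automation was stopped.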