Service Health
Incident affecting Cloud Build, Cloud Developer Tools, Google Cloud Deploy
Cloud Build - builds execution is degraded in us-central1
Incident began at 2024-10-25 02:15 and ended at 2024-10-25 08:03 (all times are US/Pacific).
Previously affected location(s)
Iowa (us-central1)
Incident updates (most recent first)
31 Oct 2024, 10:26 PDT

Incident Report

Summary
Cloud Build in the us-central1 region experienced an outage for 4 hours and 30 minutes starting at 02:15 US/Pacific on Friday, 25 October 2024, causing builds to become stuck and subsequently expire. Beginning at 06:45 US/Pacific, Cloud Build resumed processing requests in the us-central1 region, but experienced significant execution delays for 2 hours and 2 minutes. To our Google Cloud customers who were impacted during this outage, we sincerely apologize. This is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform's performance and availability.

Root Cause
Cloud Build uses an internal component that manages build execution using transient Google Compute Engine (GCE) virtual machine (VM) worker instances. Each VM is used to execute exactly one build and is deleted afterwards. The root cause of the issue was Cloud Build's failure to degrade gracefully upon being throttled by its GCE API quota in the us-central1 region. Once throttled, the Cloud Build component responsible for managing worker pools initiated multiple retries due to an incorrect configuration, further exacerbating the quota exhaustion and completely preventing the creation of new worker instances in the region. This effectively halted build processing in the us-central1 region, leading to a regional outage.

Remediation and Prevention
Google engineers were alerted to the outage by an internal monitoring alert at 02:31 US/Pacific on Friday, 25 October 2024 and immediately began an investigation. Once the nature and scope of the issue became clear, Google engineers tuned retry settings by 06:40 US/Pacific to bring calls to the GCE API down to sustainable levels. This ensured the requests were no longer throttled, and the internal system was able to create worker VM instances again to execute new builds and process the backlog.
Subsequently, at 08:47 US/Pacific, the backlog queue dropped to normal levels. Google engineers proactively increased the quota at 09:43 US/Pacific to ensure this incident would not immediately reoccur. Google is committed to preventing a repeat of this issue in the future and is completing the following actions:
Google is committed to quickly and continually improving our technology and operations to prevent service disruptions. We appreciate your patience and apologize again for the impact to your organization. We thank you for your business.

Detailed Description of Impact

Cloud Build
During the period of impact, customers would have noticed that all builds (both created manually and scheduled by build triggers) in the us-central1 region were being queued but not executed, and appeared stuck. Once the incident was resolved, all builds that had exceeded the amount of time they could be queued were marked as expired. The remaining builds were eventually executed, but with a delay.

Google Cloud Deploy
Customers were able to create new releases and rollouts (whether initiated manually or through automation), but the resources became stuck in an "in progress" state. Effectively, the Cloud Deploy service was not operational in the us-central1 region, as no deployments were being executed.

Google App Engine (GAE)
GAE version deployments in the us-central1 region saw elevated execution latency or failed with an "INTERNAL" error. Request traffic to existing GAE versions was not impacted by this incident.

Google Cloud Functions
Customers would have seen elevated "RESOURCE_EXHAUSTED" errors or elevated latency for create and update operations on Cloud Run functions in the us-central1 region. Request traffic to existing Cloud Run functions (1st and 2nd gen) was not impacted by this incident.

Google Cloud Run
A few customers deploying source code to Cloud Run in the us-central1 region experienced increased deployment latency or deployment failures with an "INTERNAL" error.
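The root cause above is a classic retry storm: when the GCE API began throttling, the worker-pool manager retried immediately and repeatedly, deepening the quota exhaustion. The standard mitigation is capped exponential backoff with jitter. The sketch below is a minimal, generic illustration of that pattern, not Cloud Build's actual implementation; `QuotaExceededError` and `call_with_backoff` are hypothetical names introduced here for illustration.

```python
import random
import time


class QuotaExceededError(Exception):
    """Raised when the upstream API throttles the caller (hypothetical)."""


def call_with_backoff(api_call, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Retry api_call with capped exponential backoff plus full jitter.

    Instead of retrying immediately on throttling (which amplifies quota
    exhaustion into a retry storm), each failed attempt waits a randomized,
    exponentially growing delay before trying again.
    """
    for attempt in range(max_attempts):
        try:
            return api_call()
        except QuotaExceededError:
            if attempt == max_attempts - 1:
                raise  # give up and surface the throttling error
            # Full jitter: wait a random time in [0, min(cap, base * 2^attempt)].
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            sleep(delay)
```

The jitter matters as much as the backoff: without it, many throttled clients would retry in lockstep and hit the quota again simultaneously.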
25 Oct 2024, 10:09 PDT

Mini Incident Report

We apologize for the inconvenience this service disruption/outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support.

(All times US/Pacific)

Incident Start: 25 October 2024, 02:15
Incident End: 25 October 2024, 08:03
Duration: 5 hours, 48 minutes

Affected Services and Features:

Regions/Zones: us-central1

Description: Cloud Build and Cloud Deploy customers in the us-central1 region experienced a service disruption lasting 5 hours and 48 minutes. During this time, 100% of builds were either stuck in the queued phase and subsequently expired, or faced significant execution delays. The root cause was identified as a bug in how Cloud Build interacts with Google Compute Engine (GCE) to provision compute resources. Due to organic traffic growth, Cloud Build exceeded its GCE quota, triggering a retry storm instead of graceful degradation. This prevented the creation of new worker instances, leading to a regional outage as existing builds could not be processed. Google will complete a full incident report in the following days that will provide a full root cause.

Customer Impact:
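The remediation described in the report, tuning retry settings "to bring calls to the GCE API to sustainable levels", amounts to rate-limiting outbound calls rather than firing them into an already-exhausted quota. A token bucket is one common way to do that; the sketch below is a generic illustration under that assumption, not Cloud Build's internal mechanism.

```python
import time


class TokenBucket:
    """Token-bucket limiter: on average at most `rate` calls per second,
    with bursts of up to `capacity` calls. A worker-pool manager could
    consult this before each instance-creation call, so that under load
    calls are delayed instead of contributing to quota exhaustion."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)          # tokens refilled per second
        self.capacity = float(capacity)  # maximum burst size
        self.tokens = float(capacity)    # start full
        self.clock = clock
        self.last = clock()

    def try_acquire(self):
        """Consume one token if available; return False to signal 'wait'."""
        now = self.clock()
        # Refill tokens accrued since the last check, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A caller that receives False can queue the request or back off, which is exactly the graceful degradation the incident report says was missing.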
25 Oct 2024, 08:06 PDT

The issue with Cloud Build and Google Cloud Deploy has been resolved for all affected projects as of Friday, 2024-10-25 08:06 US/Pacific. We thank you for your patience while we worked on resolving the issue.

25 Oct 2024, 07:18 PDT

Summary: Cloud Build - builds execution is degraded in us-central1
Description: We believe the issue with Cloud Build and Google Cloud Deploy is partially resolved. Builds should now be executing, however delays are still possible. We are working on a full resolution but do not have an ETA for it at this point. We will provide an update by Friday, 2024-10-25 09:30 US/Pacific with current details.
Diagnosis: Customers might still observe scheduled builds timing out, being stuck in the queued phase, or executing with delays.
Workaround: Customers may try to use another Cloud region.

25 Oct 2024, 07:03 PDT

Summary: Cloud Build - no builds are being executed in us-central1
Description: Mitigation work is still underway by our engineering team. We will provide more information by Friday, 2024-10-25 09:00 US/Pacific.
Diagnosis: Customers would observe scheduled builds timing out or being stuck in the queued phase.
Workaround: Customers may try to use another Cloud region.

25 Oct 2024, 05:33 PDT

Summary: Cloud Build - no builds are being executed in us-central1
Description: Mitigation work is currently underway by our engineering team. We do not have an ETA for mitigation at this point. We will provide more information by Friday, 2024-10-25 07:00 US/Pacific.
Diagnosis: Customers would observe scheduled builds timing out or being stuck in the queued phase.
Workaround: Customers may try to use another Cloud region.

25 Oct 2024, 05:01 PDT

Summary: Cloud Build - no builds are being executed in us-central1
Description: Our resolving teams are still investigating the issue and a possible mitigation plan. We do not have an ETA for mitigation at this point. We will provide more information by Friday, 2024-10-25 06:00 US/Pacific.
Diagnosis: Customers would observe scheduled builds timing out or being stuck in the queued phase.
Workaround: Customers may try to use another Cloud region.

25 Oct 2024, 04:13 PDT

Summary: Cloud Build - no builds are being executed in us-central1
Description: Investigation is currently still ongoing. We do not have an ETA for mitigation at this point. We will provide more information by Friday, 2024-10-25 05:02 US/Pacific.
Diagnosis: Customers would observe scheduled builds timing out or being stuck in the queued phase.
Workaround: Customers may try to use another Cloud region.
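The workaround repeated in the updates above, "use another Cloud region", can be automated on the client side. The sketch below is a generic region-failover wrapper, not a Google-provided API: `submit` stands in for whatever call a client uses to create a build (for example, a wrapper around the Cloud Build client or CLI), and the function names and region list are illustrative assumptions.

```python
def submit_with_failover(submit, regions):
    """Try submitting a build to each region in order; return the first
    success as (region, result).

    submit  -- any callable taking a region name and returning a result,
               or raising on failure (hypothetical; supplied by the caller)
    regions -- preferred order, e.g. ["us-central1", "us-east1"]
    """
    errors = {}
    for region in regions:
        try:
            return region, submit(region)
        except Exception as exc:  # real code would catch the client's error type
            errors[region] = exc  # remember why this region failed
    raise RuntimeError(f"all regions failed: {errors}")
```

Note that during this incident builds in us-central1 were queued rather than rejected, so a practical version would also need a client-side timeout to treat a long-stuck submission as a failure before trying the next region.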
- All times are US/Pacific