Service Health

This page provides status information on the services that are part of Google Cloud. Check back here to view the current status of the services listed below. If you are experiencing an issue not listed here, please contact Support. Learn more about what's posted on the dashboard in this FAQ. For additional information on these services, please visit https://cloud.google.com/.

Incident affecting Cloud Build, Cloud Developer Tools, Google Cloud Deploy

Cloud Build - builds execution is degraded in us-central1

Incident began at 2024-10-25 02:15 and ended at 2024-10-25 08:03 (all times are US/Pacific).

Previously affected location(s)

Iowa (us-central1)

Date Time Description
31 Oct 2024 10:26 PDT

Incident Report

Summary

Cloud Build in the us-central1 region experienced an outage for 4 hours and 30 minutes starting at 02:15 US/Pacific on Friday, 25 October 2024, causing builds to become stuck and subsequently expire. Beginning at 06:45 US/Pacific, Cloud Build resumed processing requests in the us-central1 region, but experienced significant execution delays for 2 hours and 2 minutes. To our Google Cloud customers who were impacted during this outage, we sincerely apologize. This is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform’s performance and availability.

Root Cause

Cloud Build uses an internal component that manages build execution using transient Google Compute Engine (GCE) virtual machine (VM) worker instances. Each VM is used to execute a single build and is deleted afterward.

The root cause of the issue is Cloud Build's failure to degrade gracefully upon being throttled by its GCE API quota in the us-central1 region. As a result of being throttled, the Cloud Build component responsible for managing worker pools initiated multiple retries due to an incorrect configuration, further exacerbating the quota exhaustion and completely preventing the creation of new worker instances in the region. This effectively halted build processing in the us-central1 region, leading to a regional outage.
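The "graceful degradation" that was missing here usually means capped exponential backoff with jitter instead of tight, immediate retries. The sketch below is illustrative only (it is not Cloud Build's actual implementation, and `QuotaExceededError` is a stand-in for a RESOURCE_EXHAUSTED response); it shows why jittered backoff avoids the retry storm described above:

```python
import random
import time


class QuotaExceededError(Exception):
    """Stand-in for an HTTP 429 / RESOURCE_EXHAUSTED quota response."""


def call_with_backoff(api_call, max_attempts=5, base_delay=1.0, max_delay=60.0):
    """Retry a throttled API call with capped exponential backoff and full jitter.

    Unlike a tight retry loop (which amplifies quota exhaustion into a
    retry storm), each failed attempt waits progressively longer, and the
    random jitter spreads retries from many concurrent workers over time.
    """
    for attempt in range(max_attempts):
        try:
            return api_call()
        except QuotaExceededError:
            if attempt == max_attempts - 1:
                raise  # degrade gracefully: surface the error, stop hammering the API
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))  # full jitter
```

With a misconfigured retry policy (no backoff, no attempt cap), every throttled call immediately becomes another call, so the quota never recovers; with the pattern above, load on the quota decays after each failure.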

Remediation and Prevention

Google engineers were alerted to the outage by an internal monitoring alert at 02:31 US/Pacific on Friday, 25 October 2024 and immediately started an investigation.

Once the nature and scope of the issue became clear, Google engineers tuned retry settings by 06:40 US/Pacific to bring calls to the GCE API down to sustainable levels. This ensured requests were no longer throttled, and the internal system was able to create worker VM instances again to execute new builds and process the backlog. Subsequently, at 08:47 US/Pacific, the backlog queue dropped to normal levels.

Google engineers proactively increased the quota at 09:43 US/Pacific to ensure this incident does not immediately reoccur.

Google is committed to preventing a repeat of this issue in the future and is completing the following actions:

  • We are improving the mechanism by which Cloud Build creates and deletes GCE worker VMs to ensure graceful degradation of service in case of errors.
  • We are conducting a thorough investigation of internal quotas in all regions to ensure that we have enough capacity to execute builds from all customers at peak traffic.

Google is committed to quickly and continually improving our technology and operations to prevent service disruptions. We appreciate your patience and apologize again for the impact to your organization. We thank you for your business.

Detailed Description of Impact

Cloud Build

During the period of impact, customers would have noticed that all builds (both created manually and scheduled by build triggers) in the us-central1 region were queued but not executed, and appeared stuck.

Once the incident was resolved, all builds that had exceeded the maximum time they could remain queued were marked as expired. The remaining builds were eventually executed, but with a delay.

Google Cloud Deploy

Customers were able to create new releases and rollouts (whether initiated manually or through automation); however, these resources became stuck in an 'in-progress' state. Essentially, the Cloud Deploy service was not operational in the us-central1 region, as no deployments were being executed.

Google App Engine (GAE)

GAE version deployments in the us-central1 region experienced elevated latency or failed with an “INTERNAL” error. Request traffic to existing GAE versions was not impacted by this incident.

Google Cloud Functions

Customers would have seen elevated “RESOURCE_EXHAUSTED” errors or elevated latency for the create and update operations in the us-central1 region for Cloud Run functions. Request traffic to existing Cloud Run functions (1st and 2nd Gen) was not impacted by this incident.

Google Cloud Run

A few customers deploying source code to Cloud Run in the us-central1 region experienced increased deployment latency or deployment failures with an “INTERNAL” error.

25 Oct 2024 10:09 PDT

Mini Incident Report

We apologize for the inconvenience this service disruption/outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support.

(All Times US/Pacific)

Incident Start: 25 October, 2024 02:15

Incident End: 25 October, 2024 08:03

Duration: 5 hours, 48 minutes

Affected Services and Features:

  • Cloud Build
  • Cloud Deploy

Regions/Zones: us-central1

Description:

Cloud Build and Cloud Deploy customers in the us-central1 region experienced a service disruption lasting 5 hours and 48 minutes. During this time, 100% of builds were either stuck in the queued phase and subsequently expired, or faced significant execution delays. The root cause was identified as a bug in how Cloud Build interacts with Google Compute Engine (GCE) to provision compute resources. Due to organic traffic growth, Cloud Build exceeded its GCE quota, triggering a retry-storm instead of graceful degradation. This prevented the creation of new worker instances, leading to a regional outage as existing builds could not be processed.

Google will complete a full incident report in the following days that will provide a full root cause.

Customer Impact:

  • Cloud Build: During the incident, builds were queued for execution but not actually executed and would have appeared stuck. This applied both to builds created manually and via triggers. Upon the incident’s resolution, most of the builds had exceeded the maximum time a build can remain queued and were marked as expired. These builds will not be executed and must be re-created.
  • Google Cloud Deploy: New releases and rollouts (whether initiated manually or through automation) could be created, but the resources became stuck in an 'in-progress' state. Essentially, the Cloud Deploy service was not operational in the us-central1 region, as no deployments were being executed.

25 Oct 2024 08:06 PDT

The issue with Cloud Build and Google Cloud Deploy has been resolved for all affected projects as of Friday, 2024-10-25 08:06 US/Pacific.

We thank you for your patience while we worked on resolving the issue.

25 Oct 2024 07:18 PDT

Summary: Cloud Build - builds execution is degraded in us-central1

Description: We believe the issue with Cloud Build and Google Cloud Deploy is partially resolved. Builds should now be executing, though delays are possible.

We are still working on a full resolution. We do not have an ETA for full resolution at this point.

We will provide an update by Friday, 2024-10-25 09:30 US/Pacific with current details.

Diagnosis: Customers might still observe scheduled builds timing out, being stuck in the queued phase, or executing with delays.

Workaround: Customers may try to use another Cloud region.
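The suggested workaround can be automated with a simple region-failover wrapper. The sketch below is hypothetical: the region list and `submit_fn` are placeholders (not a real Cloud Build client), and `RegionUnavailableError` stands in for whatever error or stuck-queue signal a degraded region produces:

```python
# Hypothetical region preference order; us-central1 was the degraded region.
REGION_PREFERENCE = ["us-central1", "us-east1", "us-west1"]


class RegionUnavailableError(Exception):
    """Stand-in for a RESOURCE_EXHAUSTED error or a build stuck in QUEUED."""


def submit_build(submit_fn, regions=REGION_PREFERENCE):
    """Try each region in order, falling back when one is degraded.

    `submit_fn(region)` is a placeholder for the actual client call that
    submits a build to the given region; it should raise
    RegionUnavailableError when that region cannot run builds.
    Returns the (region, result) pair for the first region that succeeds.
    """
    errors = {}
    for region in regions:
        try:
            return region, submit_fn(region)
        except RegionUnavailableError as exc:
            errors[region] = exc  # record the failure and try the next region
    raise RuntimeError(f"all regions failed: {sorted(errors)}")
```

Note that falling back to another region only helps for builds whose inputs (source, artifacts, service accounts) are available there, so this is a mitigation pattern rather than a universal fix.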

25 Oct 2024 07:03 PDT

Summary: Cloud Build - no builds are being executed in us-central1

Description: Mitigation work is still underway by our engineering team.

We will provide more information by Friday, 2024-10-25 09:00 US/Pacific.

Diagnosis: Customers would observe scheduled builds timing out or being stuck in the queued phase.

Workaround: Customers may try to use another Cloud region.

25 Oct 2024 05:33 PDT

Summary: Cloud Build - no builds are being executed in us-central1

Description: Mitigation work is currently underway by our engineering team.

We do not have an ETA for mitigation at this point.

We will provide more information by Friday, 2024-10-25 07:00 US/Pacific.

Diagnosis: Customers would observe scheduled builds timing out or being stuck in the queued phase.

Workaround: Customers may try to use another Cloud region.

25 Oct 2024 05:01 PDT

Summary: Cloud Build - no builds are being executed in us-central1

Description: Our engineering teams are still investigating the issue and a possible mitigation plan.

We do not have an ETA for mitigation at this point.

We will provide more information by Friday, 2024-10-25 06:00 US/Pacific.

Diagnosis: Customers would observe scheduled builds timing out or being stuck in the queued phase.

Workaround: Customers may try to use another Cloud region.

25 Oct 2024 04:13 PDT

Summary: Cloud Build - no builds are being executed in us-central1

Description: Investigation is currently still ongoing.

We do not have an ETA for mitigation at this point.

We will provide more information by Friday, 2024-10-25 05:02 US/Pacific.

Diagnosis: Customers would observe scheduled builds timing out or being stuck in the queued phase.

Workaround: Customers may try to use another Cloud region.