Service Health


Incident affecting Google Kubernetes Engine

Customers using Container-Optimized OS (COS) in Google Kubernetes Engine (GKE) were not able to fetch specific NVIDIA GPU drivers.

Incident began at 2024-03-12 08:00 and ended at 2024-03-12 14:55 (all times are US/Pacific).

Previously affected location(s)

Johannesburg (africa-south1), Taiwan (asia-east1), Hong Kong (asia-east2), Tokyo (asia-northeast1), Osaka (asia-northeast2), Seoul (asia-northeast3), Mumbai (asia-south1), Delhi (asia-south2), Singapore (asia-southeast1), Jakarta (asia-southeast2), Sydney (australia-southeast1), Melbourne (australia-southeast2), Warsaw (europe-central2), Finland (europe-north1), Madrid (europe-southwest1), Belgium (europe-west1), Berlin (europe-west10), Turin (europe-west12), London (europe-west2), Frankfurt (europe-west3), Netherlands (europe-west4), Zurich (europe-west6), Milan (europe-west8), Paris (europe-west9), Doha (me-central1), Dammam (me-central2), Tel Aviv (me-west1), Montréal (northamerica-northeast1), Toronto (northamerica-northeast2), São Paulo (southamerica-east1), Santiago (southamerica-west1), Iowa (us-central1), South Carolina (us-east1), Northern Virginia (us-east4), Columbus (us-east5), Dallas (us-south1), Oregon (us-west1), Los Angeles (us-west2), Salt Lake City (us-west3), Las Vegas (us-west4)

Date Time Description
13 Mar 2024 11:10 PDT

Mini Incident Report

We apologize for the inconvenience this service outage may have caused. We would like to provide some information about this incident below. Please note that Google worked with the appropriate partner to resolve the underlying issue. This is the final version of the report and no further information will be provided here. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support.

(All Times US/Pacific)

Incident Start: 12 March 2024 08:00

Incident End: 12 March 2024 14:55

Duration: 6 hours, 55 minutes

Affected Services and Features:

Google Kubernetes Engine (GKE)

Regions/Zones: All GPU regions and zones

Description:

Google Kubernetes Engine experienced elevated errors for 6 hours and 55 minutes because some NVIDIA GPU drivers for Container-Optimized OS (COS) could not be downloaded. In some cases, the failed downloads led to node unavailability. The incident impacted customers using T4, L4, H100 80GB, and A100 GPUs, COS milestone 105 or above, and those attempting to install GPU driver versions R525 and above.

Preliminary analysis indicates that the root cause was a failure to access the storage bucket required for driver downloads. The bucket is owned by the partner that supplies these GPU drivers. To limit the impact while the issue was ongoing, Google Cloud halted automatic node recreations (which attempt GPU driver downloads) until the issue was mitigated. Other GKE features continued to operate normally.

Customer Impact:

  • GKE users encountered an error "Failed to download GPU driver installer, status: 403 Forbidden" on the GPU node when installing affected GPU drivers using COS. In some cases, the GPU driver download failures led to node unavailability.
  • GPU driver downloads for GPU models P4, P100, V100, K80 were unaffected.
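
The impact criteria above can be sketched as a small check, for example when triaging which GPU node pools to inspect. This is a minimal sketch and not Google tooling: the function name `was_likely_affected`, its parameters, and the GPU model strings are hypothetical. It also assumes that the three criteria from the report (GPU model, COS milestone, driver branch) had to hold together, which the report does not state explicitly.

```python
# Hypothetical triage helper. Names and string formats are assumptions
# made for illustration; they are not part of any Google Cloud API.
AFFECTED_GPU_MODELS = {"T4", "L4", "H100 80GB", "A100"}
UNAFFECTED_GPU_MODELS = {"P4", "P100", "V100", "K80"}

MIN_COS_MILESTONE = 105  # report: "COS milestone 105 or above"
MIN_DRIVER_BRANCH = 525  # report: "GPU driver versions R525 and above"


def was_likely_affected(gpu_model: str, cos_milestone: int, driver_branch: int) -> bool:
    """Return True if a node configuration matches the reported impact criteria.

    Assumption: the three criteria are combined with AND.
    """
    if gpu_model in UNAFFECTED_GPU_MODELS:
        return False
    return (
        gpu_model in AFFECTED_GPU_MODELS
        and cos_milestone >= MIN_COS_MILESTONE
        and driver_branch >= MIN_DRIVER_BRANCH
    )


if __name__ == "__main__":
    # Example configurations (made up for illustration).
    print(was_likely_affected("L4", 109, 535))
    print(was_likely_affected("V100", 109, 535))
```

If the criteria were in fact independent (any one of them sufficient), the `and` chain would need to become an `or`; the report's wording leaves this open.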

12 Mar 2024 15:39 PDT

The issue with Google Kubernetes Engine affecting customers using NVIDIA GPU drivers on Container-Optimized OS (COS) has been resolved as of Tuesday, 2024-03-12 15:09 US/Pacific.

Customers who have disabled auto-repair may need to recreate or restart the affected nodes to restore functionality.

We will publish an analysis of this incident once we have completed our internal investigation.

We thank you for your patience while we worked on resolving the issue.

12 Mar 2024 14:06 PDT

Summary: Customers using Container-Optimized OS (COS) in Google Kubernetes Engine (GKE) may not be able to fetch specific NVIDIA GPU drivers

Description: We are investigating an issue with Google Kubernetes Engine affecting customers using NVIDIA GPU drivers on Container-Optimized OS (COS). Affected nodes that are newly created or recreated have non-functional GPU drivers, which prevents workloads that use those drivers from running. Drivers for some GPU models (P4, P100, V100, K80) are unaffected.

Our engineering team continues to work towards resolving the driver fetching issue.

We will provide more information by Tuesday, 2024-03-12 18:00 US/Pacific.

We apologize to all who are affected by the disruption.

Diagnosis: When installing the GPU driver, GKE users will see an error message on the GPU node such as "Failed to download GPU driver installer, status: 403 Forbidden".
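
One way to check collected node logs for this symptom is a plain substring search for the quoted error text. The sketch below assumes the logs are already available as text lines; the function name `find_driver_download_failures` and the sample lines are hypothetical, and real installer log output will contain additional fields.

```python
from typing import Iterable, List

# Error text as quoted in the incident report.
DRIVER_DOWNLOAD_ERROR = "Failed to download GPU driver installer, status: 403 Forbidden"


def find_driver_download_failures(log_lines: Iterable[str]) -> List[str]:
    """Return the log lines that contain the reported 403 error message."""
    return [
        line.rstrip("\n")
        for line in log_lines
        if DRIVER_DOWNLOAD_ERROR in line
    ]


if __name__ == "__main__":
    # Made-up sample lines for illustration only.
    sample = [
        "starting GPU driver installation",
        "Failed to download GPU driver installer, status: 403 Forbidden",
        "installation aborted",
    ]
    for hit in find_driver_download_failures(sample):
        print(hit)
```

Because the function accepts any iterable of strings, an open file handle for a saved log can be passed in directly.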

Workaround: None at this time. However, the impact can be reduced by avoiding recreation of existing nodes running GPUs. Note that Google Cloud has halted automatic node recreation as a partial mitigation.