Service Health
Incident affecting Google Kubernetes Engine
Customers using Container-Optimized OS (COS) in Google Kubernetes Engine (GKE) were not able to fetch specific NVIDIA GPU drivers.
Incident began at 2024-03-12 08:00 and ended at 2024-03-12 14:55 (all times are US/Pacific).
Previously affected location(s)
Johannesburg (africa-south1), Taiwan (asia-east1), Hong Kong (asia-east2), Tokyo (asia-northeast1), Osaka (asia-northeast2), Seoul (asia-northeast3), Mumbai (asia-south1), Delhi (asia-south2), Singapore (asia-southeast1), Jakarta (asia-southeast2), Sydney (australia-southeast1), Melbourne (australia-southeast2), Warsaw (europe-central2), Finland (europe-north1), Madrid (europe-southwest1), Belgium (europe-west1), Berlin (europe-west10), Turin (europe-west12), London (europe-west2), Frankfurt (europe-west3), Netherlands (europe-west4), Zurich (europe-west6), Milan (europe-west8), Paris (europe-west9), Doha (me-central1), Dammam (me-central2), Tel Aviv (me-west1), Montréal (northamerica-northeast1), Toronto (northamerica-northeast2), São Paulo (southamerica-east1), Santiago (southamerica-west1), Iowa (us-central1), South Carolina (us-east1), Northern Virginia (us-east4), Columbus (us-east5), Dallas (us-south1), Oregon (us-west1), Los Angeles (us-west2), Salt Lake City (us-west3), Las Vegas (us-west4)
| Date | Time | Description |
|---|---|---|
| 13 Mar 2024 | 11:10 PDT | Mini Incident Report: We apologize for the inconvenience this service outage may have caused. We would like to provide some information about this incident below. Please note that Google worked with the appropriate partner to resolve the underlying issue. This is the final version of the report and no further information will be provided here. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support. (All times US/Pacific.) Incident Start: 12 March 2024 08:00. Incident End: 12 March 2024 14:55. Duration: 6 hours, 55 minutes. Affected Services and Features: Google Kubernetes Engine (GKE). Regions/Zones: All GPU regions and zones. Description: Google Kubernetes Engine experienced elevated errors for 6 hours, 55 minutes due to failures in downloading some NVIDIA GPU drivers for use with Container-Optimized OS (COS). These download failures led to node unavailability in some cases and impacted customers using T4, L4, H100 80GB and A100 GPUs, COS milestone 105 or above, and those who were attempting to install GPU driver versions R525 and above. Preliminary analysis indicates that the root cause was an access issue with the storage bucket required for driver downloads. This bucket is owned by the partner that supplies these GPU drivers. To limit the impact while the issue was ongoing, Google Cloud halted automatic node recreations (which attempt GPU driver downloads) until the issue was mitigated. Other GKE features continued to operate normally without disruption. Customer Impact: |
| 12 Mar 2024 | 15:39 PDT | The issue with Google Kubernetes Engine affecting customers using NVIDIA GPU drivers on Container-Optimized OS (COS) has been resolved as of Tuesday, 2024-03-12 15:09 US/Pacific. Customers who have disabled auto-repair may need to recreate or restart the affected nodes to regain functionality. We will publish an analysis of this incident once we have completed our internal investigation. We thank you for your patience while we worked on resolving the issue. |
| 12 Mar 2024 | 14:06 PDT | Summary: Customers using Container-Optimized OS (COS) in Google Kubernetes Engine (GKE) may not be able to fetch specific NVIDIA GPU drivers. Description: We are investigating an issue with Google Kubernetes Engine affecting customers using NVIDIA GPU drivers on Container-Optimized OS (COS). Affected nodes that are newly created or recreated have non-functional GPU drivers, which prevents workloads that use those drivers from running. Drivers for some GPUs (P4, P100, V100, K80) are unaffected. Our engineering team continues to work towards resolving the driver fetching issue. We will provide more information by Tuesday, 2024-03-12 18:00 US/Pacific. We apologize to all who are affected by the disruption. Diagnosis: When installing the GPU driver, GKE users will see an error message on the GPU node similar to "Failed to download GPU driver installer, status: 403 Forbidden". Workaround: None at this time. However, the issue can be mitigated by avoiding recreation of existing nodes running GPUs. Note: GCP has halted automatic node recreation as a partial mitigation. |
- All times are US/Pacific
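
The Diagnosis entry above quotes the error text that appeared on affected GPU nodes. As a minimal sketch, assuming the driver installer's log output has already been exported as plain-text lines (how those logs are collected is not covered by this report), the following Python helper flags lines that contain that message. The function name and the sample log lines are hypothetical and used only for illustration.

```python
import re

# Error text quoted in the Diagnosis section of the incident updates.
DRIVER_403_PATTERN = re.compile(
    r"Failed to download GPU driver installer, status: 403 Forbidden"
)


def find_driver_download_failures(log_lines):
    """Return (line_number, line) pairs for lines containing the 403 error.

    `log_lines` is any iterable of strings, such as an open text file.
    Line numbers start at 1.
    """
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(log_lines, start=1)
        if DRIVER_403_PATTERN.search(line)
    ]


if __name__ == "__main__":
    # Hypothetical sample lines; real installer logs will look different.
    sample = [
        "Starting GPU driver installation\n",
        "Failed to download GPU driver installer, status: 403 Forbidden\n",
        "Retrying\n",
    ]
    for number, line in find_driver_download_failures(sample):
        print(f"line {number}: {line}")
```

A match only indicates that a node hit the download failure described in this incident; it does not by itself confirm that the node's GPU workloads were affected.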