Google Cloud Service Health Updates
2024-03-13T18:10:58+00:00
Google Cloud
https://status.cloud.google.com/
RESOLVED: Customers using Container-Optimized OS (COS) in Google Kubernetes Engine (GKE) were not able to fetch specific NVIDIA GPU drivers.
tag:status.cloud.google.com,2024:feed:aRSt8sTQLKMTVgdbbK6P.zf3BrUfdoVL74AqhpFeV
2024-03-13T18:10:58+00:00
<p> Incident began at <strong>2024-03-12 08:00</strong> and ended at <strong>2024-03-12 14:55</strong> <span>(all times are <strong>US/Pacific</strong>).</span></p><div class="cBIRi14aVDP__status-update-text"><h1>Mini Incident Report</h1>
<p>We apologize for the inconvenience this service outage may have caused. We would like to provide some information about this incident below. Please note that Google worked with the appropriate partner to resolve the underlying issue. This is the final version of the report and no further information will be provided here. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using <a href="https://cloud.google.com/support">https://cloud.google.com/support</a>.</p>
<p>(All Times US/Pacific)</p>
<p><strong>Incident Start:</strong> 12 March 2024 08:00</p>
<p><strong>Incident End:</strong> 12 March 2024 14:55</p>
<p><strong>Duration:</strong> 6 hours, 55 minutes</p>
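<p>The stated duration can be cross-checked from the start and end times above. The following is a minimal sketch; the function name <code>incident_duration</code> is illustrative, and it assumes both timestamps fall on the same day with no daylight-saving transition in between, so naive datetimes are sufficient.</p>

```python
from datetime import datetime


def incident_duration(start: datetime, end: datetime) -> tuple[int, int]:
    """Return the elapsed time between start and end as (hours, minutes)."""
    total_seconds = int((end - start).total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    return hours, remainder // 60


# Times from the report, both US/Pacific on the same calendar day.
start = datetime(2024, 3, 12, 8, 0)
end = datetime(2024, 3, 12, 14, 55)

hours, minutes = incident_duration(start, end)
print(f"{hours} hours, {minutes} minutes")
```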
<p><strong>Affected Services and Features:</strong></p>
<p>Google Kubernetes Engine (GKE)</p>
<p><strong>Regions/Zones:</strong> <a href="https://cloud.google.com/compute/docs/gpus/gpu-regions-zones">All GPU regions and zones</a></p>
<p><strong>Description:</strong></p>
<p>For 6 hours and 55 minutes, Google Kubernetes Engine experienced elevated errors because downloads of some NVIDIA GPU drivers for Container-Optimized OS (COS) failed. In some cases, these download failures led to node unavailability. The incident impacted customers using T4, L4, H100 80GB and A100 GPUs, <a href="https://cloud.google.com/container-optimized-os/docs/release-notes">COS milestone</a> 105 or above, and those who were attempting to install GPU driver versions R525 and above.</p>
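<p>To review a node inventory against the impact criteria in this description, a small check along the following lines may help. This is only a sketch: the <code>NodeConfig</code> type, its field names, and the <code>matched_criteria</code> function are hypothetical, and because the report does not say whether the three criteria must all hold at once, the function simply reports which of them a node matches.</p>

```python
from dataclasses import dataclass

# Values taken from the incident description.
AFFECTED_GPUS = {"T4", "L4", "H100 80GB", "A100"}
MIN_COS_MILESTONE = 105
MIN_DRIVER_BRANCH = 525  # R525


@dataclass(frozen=True)
class NodeConfig:
    """Hypothetical description of one GPU node; not a GKE API type."""
    gpu_model: str
    cos_milestone: int
    driver_branch: int  # numeric driver branch, e.g. 525 for R525


def matched_criteria(node: NodeConfig) -> list[str]:
    """List which impact criteria from the report this node matches."""
    matches = []
    if node.gpu_model in AFFECTED_GPUS:
        matches.append("gpu_model")
    if node.cos_milestone >= MIN_COS_MILESTONE:
        matches.append("cos_milestone")
    if node.driver_branch >= MIN_DRIVER_BRANCH:
        matches.append("driver_branch")
    return matches


if __name__ == "__main__":
    print(matched_criteria(NodeConfig("L4", 109, 535)))
    print(matched_criteria(NodeConfig("V100", 101, 470)))
```

<p>A node that matches all three criteria is the clearest candidate for review; partial matches are left for the reader to interpret.</p>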
<p>From the preliminary analysis, the root cause was an access problem with the storage bucket required for driver downloads. The bucket is owned by the partner that supplies these GPU drivers. To limit the impact while the issue was ongoing, Google Cloud halted automatic node recreations (which attempt GPU driver downloads) until the issue was mitigated. Other GKE features continued to operate normally without disruption.</p>
<p><strong>Customer Impact:</strong></p>
<ul>
<li>GKE users encountered an error "Failed to download GPU driver installer, status: 403 Forbidden" on the GPU node when installing affected GPU drivers using COS. In some cases, the GPU driver download failures led to node unavailability.</li>
<li>GPU driver downloads for GPU models P4, P100, V100, K80 were unaffected.</li>
</ul>
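<p>To confirm whether a node hit this failure, its logs can be searched for the error message quoted above. The sketch below assumes the logs have already been exported as plain text lines; the function name <code>find_driver_download_failures</code> and the sample log lines are illustrative, not actual GKE output.</p>

```python
from typing import Iterable

# Error text quoted in the Customer Impact section of the report.
ERROR_MARKER = "Failed to download GPU driver installer, status: 403 Forbidden"


def find_driver_download_failures(log_lines: Iterable[str]) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs for lines containing the error text."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(log_lines, start=1)
        if ERROR_MARKER in line
    ]


if __name__ == "__main__":
    # Made-up sample lines for illustration only.
    sample = [
        "installer: starting GPU driver installation\n",
        "installer: Failed to download GPU driver installer, status: 403 Forbidden\n",
        "installer: exiting\n",
    ]
    for number, line in find_driver_download_failures(sample):
        print(f"line {number}: {line}")
```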
<hr>
</div><hr><p>Affected products: Google Kubernetes Engine</p><p>Affected locations: Johannesburg (africa-south1), Taiwan (asia-east1), Hong Kong (asia-east2), Tokyo (asia-northeast1), Osaka (asia-northeast2), Seoul (asia-northeast3), Mumbai (asia-south1), Delhi (asia-south2), Singapore (asia-southeast1), Jakarta (asia-southeast2), Sydney (australia-southeast1), Melbourne (australia-southeast2), Warsaw (europe-central2), Finland (europe-north1), Madrid (europe-southwest1), Belgium (europe-west1), Berlin (europe-west10), Turin (europe-west12), London (europe-west2), Frankfurt (europe-west3), Netherlands (europe-west4), Zurich (europe-west6), Milan (europe-west8), Paris (europe-west9), Doha (me-central1), Dammam (me-central2), Tel Aviv (me-west1), Montréal (northamerica-northeast1), Toronto (northamerica-northeast2), São Paulo (southamerica-east1), Santiago (southamerica-west1), Iowa (us-central1), South Carolina (us-east1), Northern Virginia (us-east4), Columbus (us-east5), Dallas (us-south1), Oregon (us-west1), Los Angeles (us-west2), Salt Lake City (us-west3), Las Vegas (us-west4)</p>