Service Health

This page provides status information on the services that are part of Google Cloud. If you are experiencing an issue not listed here, please contact Support. For additional information on these services, please visit https://cloud.google.com/.

Incident affecting Cloud Machine Learning, Vertex AI Training

Vertex AI custom training jobs failing if using more than 2GB ephemeral storage

Incident began at 2024-08-16 11:44 and ended at 2024-08-16 16:23 (all times are US/Pacific).

Previously affected location(s)

Taiwan (asia-east1), Hong Kong (asia-east2), Tokyo (asia-northeast1), Seoul (asia-northeast3), Mumbai (asia-south1), Singapore (asia-southeast1), Sydney (australia-southeast1), Belgium (europe-west1), London (europe-west2), Frankfurt (europe-west3), Netherlands (europe-west4), Zurich (europe-west6), Montréal (northamerica-northeast1), Toronto (northamerica-northeast2), Iowa (us-central1), South Carolina (us-east1), Northern Virginia (us-east4), Oregon (us-west1), Los Angeles (us-west2)

Date Time Description
16 Aug 2024 16:23 PDT

The issue with Vertex AI Training has been resolved for all affected users as of Friday, 2024-08-16 16:07 US/Pacific.

We thank you for your patience while we worked on resolving the issue.

Thank you for choosing us.

16 Aug 2024 12:03 PDT

Summary: Vertex AI custom training jobs failing if using more than 2GB ephemeral storage

Description: Mitigation work is currently underway by our engineering team.

We do not have an ETA for mitigation at this point.

We will provide more information by Friday, 2024-08-16 17:30 US/Pacific.

Diagnosis: Custom Vertex AI training jobs running on GKE and using more than 2GB of ephemeral storage may fail with the error "Pod ephemeral local storage usage exceeds the total limit of containers 2Gi."

Workaround: None at this time.

16 Aug 2024 11:58 PDT

Summary: Vertex AI custom training jobs failing if using more than 2GB ephemeral storage

Description: Mitigation work is currently underway by our engineering team.

We do not have an ETA for mitigation at this point.

We will provide more information by Friday, 2024-08-16 17:00 US/Pacific.

Diagnosis: Custom Vertex AI training jobs running on GKE and using more than 2GB of ephemeral storage may fail with the error "Pod ephemeral local storage usage exceeds the total limit of containers 2Gi."

Workaround: None at this time.
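
Note: To check whether a failed job's error text matches the signature quoted in the Diagnosis entries, a minimal Python sketch might look like the following. The function name `match_ephemeral_storage_error` is hypothetical, and capturing the limit value instead of hard-coding "2Gi" is an assumption; only the quoted message text comes from this incident report.

```python
import re

# Error text quoted in the Diagnosis entries. The limit value is captured
# rather than hard-coded; that the value may vary is an assumption.
EPHEMERAL_LIMIT_ERROR = re.compile(
    r"Pod ephemeral local storage usage exceeds the total limit of containers "
    r"(\d+(?:\.\d+)?[A-Za-z]+)"
)


def match_ephemeral_storage_error(message: str):
    """Return the reported limit (e.g. '2Gi') if the message matches, else None."""
    found = EPHEMERAL_LIMIT_ERROR.search(message)
    return found.group(1) if found else None
```

A job whose error message does not match returns None and is probably unrelated to this incident.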