Service Health

This page provides status information on the services that are part of Google Cloud. Check back here to view the current status of the services listed below. If you are experiencing an issue not listed here, please contact Support. Learn more about what's posted on the dashboard in this FAQ. For additional information on these services, please visit https://cloud.google.com/.

Incident affecting Vertex AI Training, Cloud Machine Learning

Cloud AI Platform and Vertex AI Training elevated error rates for GPU jobs in us-central1, us-east1, and europe-west3

Incident began at 2023-03-03 21:56 and ended at 2023-03-04 22:40 (all times are US/Pacific).

Previously affected location(s)

Frankfurt (europe-west3), Iowa (us-central1), South Carolina (us-east1)

Date Time Description
4 Mar 2023 22:40 PST

The issue with Vertex AI Training has been resolved for all affected users as of Saturday, 2023-03-04 22:39 US/Pacific.

We thank you for your patience while we worked on resolving the issue.

4 Mar 2023 10:57 PST

Summary: Cloud AI Platform and Vertex AI Training elevated error rates for GPU jobs in us-central1, us-east1, and europe-west3

Description: Mitigation work is currently underway by our engineering team.

At this time, we believe the issue has been resolved for the us-central1 region and are working to confirm. We are also working on mitigation for us-east1 and europe-west3.

We do not have an ETA for mitigation in us-east1 and europe-west3 at this point.

We will provide more information by Sunday, 2023-03-05 14:00 US/Pacific.

Diagnosis: Cloud AI Platform and Vertex AI Training GPU jobs may experience elevated failure rates in us-central1, us-east1, and europe-west3.

Workaround: None at this time.

3 Mar 2023 23:00 PST

Summary: Cloud AI Platform and Vertex AI Training elevated error rates for GPU jobs in us-central1, us-east1, and europe-west3

Description: Mitigation work is currently underway by our engineering team.

At this time, we believe the issue has been resolved for the us-central1 region and are working to confirm. We are also working on mitigation for us-east1 and europe-west3.

We do not have an ETA for mitigation in us-east1 and europe-west3 at this point.

We will provide more information by Saturday, 2023-03-04 11:00 US/Pacific.

Diagnosis: Cloud AI Platform and Vertex AI Training GPU jobs may experience elevated failure rates in us-central1, us-east1, and europe-west3.

Workaround: None at this time.

3 Mar 2023 22:20 PST

Summary: Cloud AI Platform and Vertex AI Training elevated error rates for GPU jobs in us-central1, us-east1, and europe-west3

Description: Mitigation work is currently underway by our engineering team.

At this time, we believe the issue has been resolved for the us-central1 region and are working to confirm.

We do not have an ETA for mitigation in us-east1 and europe-west3 at this point.

We will provide more information by Friday, 2023-03-03 23:30 US/Pacific.

Diagnosis: Cloud AI Platform and Vertex AI Training GPU jobs may experience elevated failure rates in us-central1, us-east1, and europe-west3.

Workaround: None at this time.