Service Health
Incident affecting Vertex AI Training, Cloud Machine Learning
Cloud AI Platform and Vertex AI Training elevated error rates for GPU jobs in us-central1, us-east1, and europe-west3
Incident began at 2023-03-03 21:56 and ended at 2023-03-04 22:40 (all times are US/Pacific).
Previously affected location(s)
Frankfurt (europe-west3), Iowa (us-central1), South Carolina (us-east1)
| Date | Time | Description |
|---|---|---|
| 4 Mar 2023 | 22:40 PST | The issue with Vertex AI Training has been resolved for all affected users as of Saturday, 2023-03-04 22:39 US/Pacific. We thank you for your patience while we worked on resolving the issue. |
| 4 Mar 2023 | 10:57 PST | Summary: Cloud AI Platform and Vertex AI Training elevated error rates for GPU jobs in us-central1, us-east1, and europe-west3 Description: Mitigation work is currently underway by our engineering team. At this time, we believe the issue has been resolved for the us-central1 region; we are working to confirm this and are also working on mitigation for us-east1 and europe-west3. We do not have an ETA for mitigation in us-east1 and europe-west3 at this point. We will provide more information by Sunday, 2023-03-05 14:00 US/Pacific. Diagnosis: Cloud AI Platform and Vertex AI Training GPU jobs may experience elevated failure rates in us-central1, us-east1, and europe-west3. Workaround: None at this time. |
| 3 Mar 2023 | 23:00 PST | Summary: Cloud AI Platform and Vertex AI Training elevated error rates for GPU jobs in us-central1, us-east1, and europe-west3 Description: Mitigation work is currently underway by our engineering team. At this time, we believe the issue has been resolved for the us-central1 region; we are working to confirm this and are also working on mitigation for us-east1 and europe-west3. We do not have an ETA for mitigation in us-east1 and europe-west3 at this point. We will provide more information by Saturday, 2023-03-04 11:00 US/Pacific. Diagnosis: Cloud AI Platform and Vertex AI Training GPU jobs may experience elevated failure rates in us-central1, us-east1, and europe-west3. Workaround: None at this time. |
| 3 Mar 2023 | 22:20 PST | Summary: Cloud AI Platform and Vertex AI Training elevated error rates for GPU jobs in us-central1, us-east1, and europe-west3 Description: Mitigation work is currently underway by our engineering team. At this time, we believe the issue has been resolved for the us-central1 region and are working to confirm. We do not have an ETA for mitigation in us-east1 and europe-west3 at this point. We will provide more information by Friday, 2023-03-03 23:30 US/Pacific. Diagnosis: Cloud AI Platform and Vertex AI Training GPU jobs may experience elevated failure rates in us-central1, us-east1, and europe-west3. Workaround: None at this time. |