Service Health
Incident affecting Cloud Machine Learning, Vertex AI Online Prediction
Large Vertex Model Garden Deployment may experience failures.
Incident began at 2024-10-22 14:08 and ended at 2024-11-04 20:30 (all times are US/Pacific).
Previously affected location(s)
Taiwan (asia-east1), Hong Kong (asia-east2), Tokyo (asia-northeast1), Osaka (asia-northeast2), Seoul (asia-northeast3), Mumbai (asia-south1), Jakarta (asia-southeast2), Sydney (australia-southeast1), Melbourne (australia-southeast2), Warsaw (europe-central2), Finland (europe-north1), Madrid (europe-southwest1), Belgium (europe-west1), London (europe-west2), Frankfurt (europe-west3), Zurich (europe-west6), Milan (europe-west8), Paris (europe-west9), Doha (me-central1), Dammam (me-central2), Tel Aviv (me-west1), Montréal (northamerica-northeast1), Toronto (northamerica-northeast2), São Paulo (southamerica-east1), Santiago (southamerica-west1), Columbus (us-east5), Dallas (us-south1), Oregon (us-west1), Los Angeles (us-west2), Salt Lake City (us-west3), Las Vegas (us-west4)
| Date | Time | Description |
|---|---|---|
| 5 Nov 2024 | 11:31 PST | Mini Incident Report: We apologize for the inconvenience this service disruption/outage may have caused. The information below is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support (https://cloud.google.com/support). All times are US/Pacific. Incident Start: 22 October 2024, 14:08. Incident End: 4 November 2024, 20:30. Duration: 13 days, 6 hours, 22 minutes. Affected Services and Features: Vertex AI Online Prediction (Vertex Model Garden Deployments). Regions/Zones: All regions except asia-southeast1, europe-west4, us-central1, us-east1, us-east4. Description: Deployment of large models (those that require more than 100 GB of disk) in Vertex AI Online Prediction (Vertex Model Garden Deployments) failed in most regions for up to 13 days, 6 hours, 22 minutes, starting on Tuesday, 22 October 2024 at 14:08 US/Pacific. From preliminary analysis, the root cause is an internal storage provisioning configuration error introduced by a recent change. Google engineers mitigated the impact by rolling back that configuration change. Customer Impact:
Additional details:
|
| 4 Nov 2024 | 20:52 PST | The issue with Vertex AI Online Prediction (Large Vertex Model Garden deployment failure) has been resolved for all affected users as of Monday, 2024-11-04 20:28 US/Pacific. We thank you for your patience while we worked on resolving the issue. |
| 4 Nov 2024 | 18:52 PST | Summary: Large Vertex Model Garden Deployment may experience failures. Description: Our engineering team is continuing to work on mitigating the issue. We do not have an ETA for mitigation at this point. We will provide more information by Monday, 2024-11-04 21:30 US/Pacific. Diagnosis: Customers may experience failures with large Vertex Model Garden deployments when using L4 Graphics Processing Unit (GPU). Workaround: None at this time. |
| 4 Nov 2024 | 13:46 PST | Summary: Large Vertex Model Garden Deployment may experience failures. Description: Mitigation work is currently underway by our engineering team. We do not have an ETA for mitigation at this point. We will provide more information by Monday, 2024-11-04 19:00 US/Pacific. Diagnosis: Customers may experience failures with large Vertex Model Garden deployments when using L4 Graphics Processing Unit (GPU). Workaround: None at this time. |
| 4 Nov 2024 | 13:17 PST | Summary: Large Vertex Model Garden Deployment Failures. Description: We are experiencing an issue with Vertex Model Garden Deployments. Our engineering team continues to investigate the issue. We will provide an update by Monday, 2024-11-04 14:00 US/Pacific with current details. We apologize to all who are affected by the disruption. Diagnosis: Customers may experience failures with large Vertex Model Garden deployments (greater than 100GB) when deployed on a GKE Autopilot cluster. Workaround: None at this time. |
- All times are US/Pacific
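
The duration quoted in the mini incident report can be re-derived from the stated start and end times. The sketch below is an illustration only, not part of the incident tooling. It assumes fixed UTC offsets for US/Pacific (PDT before and PST after the daylight saving change on 3 November 2024, which falls inside the incident window); the names `PDT`, `PST`, and `describe` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC offsets stand in for the US/Pacific zone.
# US daylight saving time ended on 3 November 2024, inside the incident window.
PDT = timezone(timedelta(hours=-7), "PDT")
PST = timezone(timedelta(hours=-8), "PST")

start = datetime(2024, 10, 22, 14, 8, tzinfo=PDT)  # Incident Start (report)
end = datetime(2024, 11, 4, 20, 30, tzinfo=PST)    # Incident End (report)

# Wall-clock difference: subtract the local clock readings, ignoring offsets.
wall_clock = end.replace(tzinfo=None) - start.replace(tzinfo=None)

# Elapsed time: aware datetimes with different offsets are compared in UTC.
elapsed = end - start


def describe(delta):
    """Format a timedelta as 'D days, H hours, M minutes'."""
    hours, rem = divmod(delta.seconds, 3600)
    return f"{delta.days} days, {hours} hours, {rem // 60} minutes"


print("wall clock:", describe(wall_clock))
print("elapsed:   ", describe(elapsed))
```

Under these assumptions the wall-clock difference matches the reported 13 days, 6 hours, 22 minutes, while the elapsed time comes out one hour longer because of the clock change.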