Service Health

This page provides status information on the services that are part of Google Cloud. Check back here to view the current status of the services listed below. If you are experiencing an issue not listed here, please contact Support. Learn more about what's posted on the dashboard in this FAQ. For additional information on these services, please visit https://cloud.google.com/.

Incident affecting Cloud Machine Learning, Vertex AI Online Prediction

Large Vertex Model Garden Deployment may experience failures.

Incident began at 2024-10-22 14:08 and ended at 2024-11-04 20:30 (all times are US/Pacific).

Previously affected location(s)

Taiwan (asia-east1), Hong Kong (asia-east2), Tokyo (asia-northeast1), Osaka (asia-northeast2), Seoul (asia-northeast3), Mumbai (asia-south1), Jakarta (asia-southeast2), Sydney (australia-southeast1), Melbourne (australia-southeast2), Warsaw (europe-central2), Finland (europe-north1), Madrid (europe-southwest1), Belgium (europe-west1), London (europe-west2), Frankfurt (europe-west3), Zurich (europe-west6), Milan (europe-west8), Paris (europe-west9), Doha (me-central1), Dammam (me-central2), Tel Aviv (me-west1), Montréal (northamerica-northeast1), Toronto (northamerica-northeast2), São Paulo (southamerica-east1), Santiago (southamerica-west1), Columbus (us-east5), Dallas (us-south1), Oregon (us-west1), Los Angeles (us-west2), Salt Lake City (us-west3), Las Vegas (us-west4)

Date Time Description
5 Nov 2024 11:31 PST

Mini Incident Report

We apologize for the inconvenience this service disruption/outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support

(All Times US/Pacific)

Incident Start: 22 October 2024, 14:08

Incident End: 4 November 2024, 20:30

Duration: 13 days, 6 hours, 22 minutes
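The stated duration can be reproduced from the start and end times listed above. The following is a minimal sketch, assuming the two timestamps are treated as plain wall-clock values in US/Pacific with no time zone or daylight-saving adjustment; the function name `format_duration` is illustrative and not part of any Google Cloud tooling.

```python
from datetime import datetime

# Wall-clock start and end times as listed in the report (US/Pacific).
# Assumption: both are handled as naive local times, so no
# daylight-saving adjustment is applied to the difference.
INCIDENT_START = datetime(2024, 10, 22, 14, 8)
INCIDENT_END = datetime(2024, 11, 4, 20, 30)


def format_duration(start: datetime, end: datetime) -> str:
    """Return the difference between two datetimes as days, hours, minutes."""
    delta = end - start
    hours, remainder = divmod(delta.seconds, 3600)
    minutes = remainder // 60
    return f"{delta.days} days, {hours} hours, {minutes} minutes"


print(format_duration(INCIDENT_START, INCIDENT_END))
# prints "13 days, 6 hours, 22 minutes"
```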

Affected Services and Features: Vertex AI Online Prediction (Vertex Model Garden Deployments)

Regions/Zones: All regions except asia-southeast1, europe-west4, us-central1, us-east1, us-east4
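Because the affected scope is given as "all regions except" a short list, a region check is easiest to express against the unaffected set. The sketch below is illustrative only: the helper names `was_affected` and `workaround_regions` are hypothetical, and the code assumes the caller passes a valid Vertex AI region identifier.

```python
# Regions the report lists as NOT impacted by this incident.
UNAFFECTED_REGIONS = frozenset({
    "asia-southeast1",
    "europe-west4",
    "us-central1",
    "us-east1",
    "us-east4",
})


def was_affected(region: str) -> bool:
    """Return True if the region was in scope for this incident.

    Assumption: `region` is a valid region identifier; any string
    outside the unaffected set is reported as affected.
    """
    return region not in UNAFFECTED_REGIONS


def workaround_regions() -> list[str]:
    """Return the non-impacted regions, sorted, as candidate deployment targets."""
    return sorted(UNAFFECTED_REGIONS)


print(was_affected("us-west1"))     # prints True
print(was_affected("us-central1"))  # prints False
```

During the incident window, a deployment aimed at a region for which `was_affected` returns `True` could have been redirected to one of the entries from `workaround_regions()`, in line with the workaround described in this report.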

Description:

Deployments of large models (those requiring more than 100 GB of disk) in Vertex AI Online Prediction (Vertex Model Garden Deployments) failed in most regions for up to 13 days, 6 hours, 22 minutes, starting on Tuesday, 22 October 2024 at 14:08 US/Pacific.

From preliminary analysis, the root cause of the issue is an error in an internal storage provisioning configuration that was introduced as part of a recent change.

Google engineers mitigated the impact by rolling back the configuration change that caused the issue.

Customer Impact:

  • Customers would have received the error “Model server never became ready” when performing deployments during the period of impact.

Additional details:

  • As a workaround, customers were able to deploy in one of the non-impacted regions noted above.

4 Nov 2024 20:52 PST

The issue with Vertex AI Online Prediction (Large Vertex Model Garden deployment failure) has been resolved for all affected users as of Monday, 2024-11-04 20:28 US/Pacific.

We thank you for your patience while we worked on resolving the issue.

4 Nov 2024 18:52 PST

Summary: Large Vertex Model Garden Deployment may experience failures.

Description: Our engineering team is continuing to work on mitigating the issue.

We do not have an ETA for mitigation at this point.

We will provide more information by Monday, 2024-11-04 21:30 US/Pacific.

Diagnosis: Customers may experience failures with large Vertex Model Garden deployments when using L4 Graphics Processing Units (GPUs).

Workaround: None at this time.

4 Nov 2024 13:46 PST

Summary: Large Vertex Model Garden Deployment may experience failures.

Description: Mitigation work is currently underway by our engineering team.

We do not have an ETA for mitigation at this point.

We will provide more information by Monday, 2024-11-04 19:00 US/Pacific.

Diagnosis: Customers may experience failures with large Vertex Model Garden deployments when using L4 Graphics Processing Units (GPUs).

Workaround: None at this time.

4 Nov 2024 13:17 PST

Summary: Large Vertex Model Garden Deployment Failures

Description: We are experiencing an issue with Vertex Model Garden Deployments.

Our engineering team continues to investigate the issue.

We will provide an update by Monday, 2024-11-04 14:00 US/Pacific with current details.

We apologize to all who are affected by the disruption.

Diagnosis: Customers may experience failures with large Vertex Model Garden deployments (greater than 100GB) when deployed on a GKE Autopilot cluster.

Workaround: None at this time.