Service Health

This page provides status information on the services that are part of Google Cloud. Check back here to view the current status of the services listed below. If you are experiencing an issue not listed here, please contact Support. Learn more about what's posted on the dashboard in this FAQ. For additional information on these services, please visit https://cloud.google.com/.

Incident affecting Cloud Machine Learning, AI Platform Training, Vertex AI Training

Global: Jobs failing with internal error for GKE version 1.18

Incident began at 2021-10-05 15:53 and ended at 2021-10-05 18:23 (all times are US/Pacific).

Date Time Description
6 Oct 2021 10:06 PDT

We apologize for the inconvenience this service disruption/outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Support by opening a case using https://cloud.google.com/support

(All Times US/Pacific)

Incident Start: 05 October 2021 15:53

Incident End: 05 October 2021 18:23

Duration: 2 hours, 30 minutes

Affected Services and Features:

  • Google Cloud AI: Training Jobs

Regions/Zones: Global

Description:

Google Cloud AI experienced an issue in which distributed training jobs, Bring your own service account (BYOSA) jobs and VPC peering jobs running on GKE v1.18 failed with an internal error for a duration of 2 hours, 30 minutes. Preliminary analysis indicates that the root cause was the removal of certain GKE v1.18 releases as an available GKE cluster configuration. The issue was fixed by rolling Cloud AI forward to use GKE v1.19 for training jobs.

Customer Impact:

  • Google Cloud AI Training jobs (distributed training jobs, BYOSA jobs and VPC peering jobs) that ran on GKE v1.18 failed with an internal error.

5 Oct 2021 18:24 PDT

The issue with Cloud AI has been resolved for all affected projects as of Tuesday, 2021-10-05 18:23 US/Pacific.

We thank you for your patience while we worked on resolving the issue.

5 Oct 2021 18:11 PDT

Summary: Global: Jobs failing with internal error for GKE version 1.18

Description: Our engineering team continues to work on mitigating the issue and has observed a reduction in customer impact.

The mitigation is expected to complete by Tuesday, 2021-10-05 18:45 US/Pacific.

We will provide more information by Tuesday, 2021-10-05 18:45 US/Pacific.

Diagnosis: All training jobs (distributed training jobs, BYOSA jobs and VPC peering jobs) that run on GKE v1.18 fail with an internal error.

Workaround: None at this time.

5 Oct 2021 16:50 PDT

Summary: Global: Jobs failing with internal error for GKE version 1.18

Description: Our engineering team is currently working on mitigation.

The mitigation is expected to complete by Tuesday, 2021-10-05 18:30 US/Pacific.

We will provide more information by Tuesday, 2021-10-05 18:30 US/Pacific.

Diagnosis: All training jobs (distributed training jobs, BYOSA jobs and VPC peering jobs) that run on GKE v1.18 fail with an internal error.

Workaround: None at this time.

5 Oct 2021 16:25 PDT

Summary: Global: Jobs failing with internal error for GKE version 1.18

Description: We are experiencing an issue with Cloud AI where distributed training jobs, BYOSA jobs and VPC peering jobs that run on GKE v1.18 fail with an internal error.

Our engineering team continues to investigate the issue.

We will provide an update by Tuesday, 2021-10-05 17:30 US/Pacific with current details.

We apologize to all who are affected by the disruption.

Diagnosis: All training jobs (distributed training jobs, BYOSA jobs and VPC peering jobs) that run on GKE v1.18 fail with an internal error.

Workaround: None at this time.