Service Health
Incident affecting Cloud Machine Learning, AI Platform Training, Vertex AI Training
Global: Jobs failing with internal error for GKE version 1.18
Incident began at 2021-10-05 15:53 and ended at 2021-10-05 18:23 (all times are US/Pacific).
| Date | Time | Description |
|---|---|---|
| 6 Oct 2021 | 10:06 PDT | We apologize for the inconvenience this service disruption/outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Support by opening a case using https://cloud.google.com/support (All Times US/Pacific) Incident Start: 05 October 2021 15:53 Incident End: 05 October 2021 18:23 Duration: 2 hours, 30 minutes Affected Services and Features:
Regions/Zones: Global Description: For 2 hours, 30 minutes, Google Cloud AI distributed training jobs, bring your own service account (BYOSA) jobs, and VPC peering jobs that run on GKE v1.18 failed with an internal error. From preliminary analysis, the root cause was the removal of certain GKE v1.18 releases as an available GKE cluster configuration. The issue was fixed by rolling Cloud AI forward to use GKE v1.19 for training jobs. Customer Impact:
|
| 5 Oct 2021 | 18:24 PDT | The issue with Cloud AI has been resolved for all affected projects as of Tuesday, 2021-10-05 18:23 US/Pacific. We thank you for your patience while we worked on resolving the issue. |
| 5 Oct 2021 | 18:11 PDT | Summary: Global: Jobs failing with internal error for GKE version 1.18 Description: Our engineering team continues to work on mitigating the issue and has observed a reduction in customer impact. The mitigation is expected to complete by Tuesday, 2021-10-05 18:45 US/Pacific. We will provide more information by Tuesday, 2021-10-05 18:45 US/Pacific. Diagnosis: All training jobs (distributed training jobs, BYOSA jobs and VPC peering jobs) that run on GKE v1.18 fail with an internal error. Workaround: None at this time. |
| 5 Oct 2021 | 16:50 PDT | Summary: Global: Jobs failing with internal error for GKE version 1.18 Description: Our engineering team is currently working on mitigation. The mitigation is expected to complete by Tuesday, 2021-10-05 18:30 US/Pacific. We will provide more information by Tuesday, 2021-10-05 18:30 US/Pacific. Diagnosis: All training jobs (distributed training jobs, BYOSA jobs and VPC peering jobs) that run on GKE v1.18 fail with an internal error. Workaround: None at this time. |
| 5 Oct 2021 | 16:25 PDT | Summary: Global: Jobs failing with internal error for GKE version 1.18 Description: We are experiencing an issue with Cloud AI where distributed training jobs, BYOSA jobs and VPC peering jobs that run on GKE v1.18 fail with an internal error. Our engineering team continues to investigate the issue. We will provide an update by Tuesday, 2021-10-05 17:30 US/Pacific with current details. We apologize to all who are affected by the disruption. Diagnosis: All training jobs (distributed training jobs, BYOSA jobs and VPC peering jobs) that run on GKE v1.18 fail with an internal error. Workaround: None at this time. |
- All times are US/Pacific
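
The stated duration can be cross-checked against the incident start and end times reported above. The following is a minimal sketch, not part of the original report. It assumes both timestamps are in US/Pacific on the same day with no daylight saving change in between, so naive datetimes are sufficient. The variable names are illustrative.

```python
from datetime import datetime, timedelta

# Timestamps copied from the report. Both are US/Pacific on the same day,
# so naive datetimes are used (assumption: no DST transition in between).
FMT = "%Y-%m-%d %H:%M"
incident_start = datetime.strptime("2021-10-05 15:53", FMT)
incident_end = datetime.strptime("2021-10-05 18:23", FMT)

# Subtracting two datetimes yields a timedelta.
duration = incident_end - incident_start

# Split the total seconds into whole hours and remaining minutes.
hours, remainder = divmod(int(duration.total_seconds()), 3600)
minutes = remainder // 60

print(f"Duration: {hours} hours, {minutes} minutes")
# prints "Duration: 2 hours, 30 minutes"
```

The same subtraction applies to any pair of update timestamps in the table, for example to measure the gap between the first notice and the resolution notice.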