Service Health

Incident affecting Google Kubernetes Engine, Google Cloud Networking, Cloud Load Balancing, Cloud Filestore

We are experiencing an issue with Google Cloud Networking in us-central1 across multiple products, beginning on Tuesday, 2022-06-14 06:00 US/Pacific.

Incident began at 2022-06-14 06:56 and ended at 2022-06-14 17:43 (all times are US/Pacific).

Previously affected location(s)

Iowa (us-central1)

Date Time Description
28 Jun 2022 11:27 PDT

Incident Report

Summary:

On Tuesday, 14 June 2022, at approximately 06:00 US/Pacific, Google Cloud Networking in us-central1 began experiencing increased delays in applying administrative operations, impacting several downstream services. Customers performing administrative actions on resources in us-central1 experienced delays, connectivity issues, and elevated rates of failure.

To our customers that were impacted during this outage, we sincerely apologize. We are conducting an internal investigation and are taking steps to improve our service.

Background:

Google’s Cloud Load Balancer (GCLB) is a collection of software and services that load balances HTTP traffic across customer services. A key component of GCLB is the Google Front End (GFE), which load balances traffic over customer backend instances. GCLB also includes a health checking service to determine whether backends, such as customer virtual machines, are responding to traffic as expected or are unhealthy and should be removed from service.

Root Cause:

The health checking service and GFE share a common resource pool. For a period of time, GCLB traffic was directed away from certain clusters, and, independently, during this time, health checking load in these clusters increased significantly.

The incident was triggered when Google engineers rerouted traffic to these clusters as part of a standard maintenance activity. This rerouted traffic increased load in those clusters, which slowed the health checking service and produced an increasing rate of health check failures. This in turn led to a general networking control plane slowdown, as the service struggled to keep up with erroneous and rapid health status changes of load balancer backends. The control plane slowdown then affected administrative operations for other resource types that involve network configuration. Customers may have experienced this as slowness or timeouts in administrative operations for the resource types listed in the Impact section, or as delays in new resources (like VM instances) connecting to networks.
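The mechanism described above, a shared pool where added traffic load slows health probes until they exceed their timeout, can be illustrated with a toy model. This is only a sketch under stated assumptions: the `SharedPool` class, the queueing-style latency curve, and every number below are hypothetical and are not taken from the incident report or from any Google system.

```python
from dataclasses import dataclass


@dataclass
class SharedPool:
    """Toy model of one resource pool shared by traffic serving and health checking."""
    capacity: float       # abstract work units per second (hypothetical)
    base_probe_ms: float  # probe latency when the pool is idle (hypothetical)

    def probe_latency_ms(self, traffic_load: float, probe_load: float) -> float:
        # Assumption: latency follows a simple 1 / (1 - utilization) curve,
        # so it rises sharply as the pool approaches saturation.
        utilization = (traffic_load + probe_load) / self.capacity
        if utilization >= 1.0:
            return float("inf")  # saturated: probes never finish in time
        return self.base_probe_ms / (1.0 - utilization)


def probe_times_out(pool: SharedPool, traffic_load: float,
                    probe_load: float, timeout_ms: float) -> bool:
    """A probe that takes longer than its timeout is recorded as a failure."""
    return pool.probe_latency_ms(traffic_load, probe_load) > timeout_ms


pool = SharedPool(capacity=100.0, base_probe_ms=50.0)

# Traffic is directed away while health checking load has grown: probes still pass.
before_reroute = probe_times_out(pool, traffic_load=0.0, probe_load=60.0, timeout_ms=500.0)

# Traffic is rerouted back onto the same pool: probes now exceed the timeout,
# so healthy backends start to be reported as unhealthy.
after_reroute = probe_times_out(pool, traffic_load=35.0, probe_load=60.0, timeout_ms=500.0)

print(before_reroute, after_reroute)
```

In this model the backends themselves never change; only the load on the shared pool does, which is why the resulting health status changes are erroneous.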

Remediation and Prevention:

On Monday, 13 June 2022, at 16:30 US/Pacific, Google engineers rerouted traffic to additional compute resources in the us-central1 region as part of a standard maintenance activity.

Google engineers were alerted to control plane slowness on Tuesday, 14 June 2022, at 09:17 and started an investigation. Initially, Google engineers were unable to determine the severity of the latency due to insufficient monitoring. After attempts to mitigate by adding resources were unsuccessful, the incident was escalated at 15:50. A cross-team group of Google engineers was then assembled and, at 17:39, mitigated the incident by removing the earlier reroute. This reduced the load on the health check system, and the networking control plane subsequently recovered at 17:43.

Google is committed to improving our service in the future and will be completing the following actions:

  • Prevent similar incidents by improving the capacity modeling of our health checking service, and implementing improved resource isolation in our health checking service.
  • Detect similar incidents more rapidly by improving alerting related to the health checking process to notify responsible teams earlier and speed time to mitigate.
  • Add defense in depth: protect the downstream networking control plane from high rates of load balancing health reports, thus avoiding this type of incident in future.
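One common way to shield a downstream consumer from rapid health status changes is to forward a change only after several consecutive observations agree. The sketch below illustrates that general idea; the `DampedHealthReporter` class and its threshold are hypothetical, and nothing here should be read as the mechanism Google actually implemented.

```python
class DampedHealthReporter:
    """Forwards a health state change only after N consecutive differing observations."""

    def __init__(self, required_consecutive: int = 3, initial_healthy: bool = True):
        self.required = required_consecutive  # hypothetical threshold
        self.reported = initial_healthy       # last state sent downstream
        self._streak = 0                      # consecutive observations that differ

    def observe(self, healthy: bool) -> bool:
        """Record one probe result; return True if a change was forwarded downstream."""
        if healthy == self.reported:
            self._streak = 0  # observation agrees with reported state
            return False
        self._streak += 1
        if self._streak >= self.required:
            self.reported = healthy
            self._streak = 0
            return True
        return False


# A flapping backend: raw results change six times, none are forwarded.
flapping = [True, False, True, False, False, True, False, True]
reporter = DampedHealthReporter(required_consecutive=3)
forwarded_while_flapping = sum(reporter.observe(result) for result in flapping)

# A sustained failure: forwarded once, on the third consecutive failed probe.
forwarded_on_outage = sum(reporter.observe(False) for _ in range(3))

print(forwarded_while_flapping, forwarded_on_outage)
```

The trade-off in this kind of damping is slower reaction to genuine failures in exchange for far fewer updates reaching the control plane during erroneous flapping.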

Detailed Description of Impact:

From 13 June 2022 16:30 to 14 June 2022 17:43 US/Pacific:

Google Kubernetes Engine (GKE)

Affected customers would have observed latency and up to ~40% elevated errors or timeouts during GKE Private Service Connect cluster operations including creation, deletion, and updates for a subset of clusters in us-central1.

Google Cloud Load Balancing (GCLB)

Customers with resources in us-central1-c and us-central1-f would have observed increased latency or timeouts and connection errors from the Load Balancer service for resources in us-central1. Customers would have seen up to a 4.5% overall error rate, with up to 23% of requests timing out.

Filestore

Affected customers would have observed new instance creation failures for ~50% of the outage duration. In addition, existing instances would have been running at reduced capacity as some of their nodes may have been incorrectly marked down.

Virtual Private Cloud (VPC)

Increased latency and timeouts for creating and updating networking resources in us-central1.

14 Jun 2022 22:58 PDT

Mini Incident Report

We apologize for the inconvenience this service disruption may have caused. We would like to provide some information about this incident below. Please note that this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Support by opening a case at https://cloud.google.com/support or by following the help article at https://support.google.com/a/answer/1047213.

(All Times US/Pacific)

Incident Start: 14 Jun 2022 06:56

Incident End: 14 Jun 2022 17:43

Duration: 10 hrs, 47 minutes
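The stated duration can be checked directly from the start and end times above. This is a minimal sketch; it uses naive datetimes on the assumption that both timestamps fall on the same day in the same time zone, with no daylight saving transition between them.

```python
from datetime import datetime

# Both timestamps are US/Pacific on the same day, so naive datetimes suffice here.
start = datetime(2022, 6, 14, 6, 56)
end = datetime(2022, 6, 14, 17, 43)

total_seconds = int((end - start).total_seconds())
hours, remainder = divmod(total_seconds, 3600)
minutes = remainder // 60

print(f"{hours} hrs, {minutes} minutes")  # prints "10 hrs, 47 minutes"
```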

Affected Services and Features:

Google Compute Engine, Google Kubernetes Engine, Filestore, Google Cloud Load Balancers

Regions/Zones: us-central1

Description:

Delays in creating virtual machines or connecting to existing virtual machines for a period of 10 hrs 47 minutes. From preliminary analysis, the root cause of the issue is a slowdown in the global health checking infrastructure, to the extent that the flow cache in the packet processing service was expiring.

Customer Impact:

Google Compute Engine - Users may have observed an error while creating new VMs or connecting VMs in use.

Google Kubernetes Engine - Users observed latency and timeouts during GKE Cluster operations.

Google Cloud Load Balancers - Customers may have observed errors from Load Balancer service for resources in us-central1.

Filestore - Users may have observed an error while creating new VMs.

14 Jun 2022 18:15 PDT

The issue with Google Cloud Networking that impacted multiple products is fully mitigated as of 2022-06-14 17:43 US/Pacific. Our Engineering team is continuing to monitor the environment and working towards full resolution.

If you still have questions or are impacted, please open a case with the Support Team, and we will continue to work with you.

We thank you for your patience while we are working towards resolving this issue.

14 Jun 2022 17:47 PDT

Summary: We are experiencing an issue with Google Cloud Networking in us-central1 across multiple products, beginning on Tuesday, 2022-06-14 06:00 US/Pacific.

Description: We are experiencing an issue with Google Cloud Networking in us-central1, beginning on Tuesday, 2022-06-14 06:00 US/Pacific, with new VM creation errors, latencies, timeouts, and connection errors reported across multiple products:

  • Google Compute Engine
  • Google Kubernetes Engine
  • Google Cloud Load Balancers
  • Filestore

Mitigation work is currently underway by our engineering team. We do not have an ETA for mitigation at this point.

We will provide an update by Tuesday, 2022-06-14 18:20 US/Pacific with current details.

Diagnosis: Customers may see new VM creation errors, latencies, timeouts, and connection errors reported across multiple products in us-central1.

Workaround: None at this time.

14 Jun 2022 17:15 PDT

Summary: We are experiencing an issue with Google Cloud Networking in us-central1 across multiple products, beginning on Tuesday, 2022-06-14 06:00 US/Pacific.

Description: We are experiencing an issue with Google Cloud Networking in us-central1, beginning on Tuesday, 2022-06-14 06:00 US/Pacific, with latencies across products:

  • Google Compute Engine
  • Google Kubernetes Engine
  • Google Cloud Load Balancers
  • Filestore

Our engineering team continues to investigate the issue.

We will provide an update by Tuesday, 2022-06-14 17:45 US/Pacific with current details.

Diagnosis: Services running from us-central1 may experience latency issues.

Workaround: None at this time.

14 Jun 2022 17:09 PDT

Summary: We are experiencing an issue with Google Cloud Networking in us-central1 across multiple products, beginning on Tuesday, 2022-06-14 06:00 US/Pacific.

Description: We are experiencing an issue with Google Cloud Networking, beginning on Tuesday, 2022-06-14 06:00 US/Pacific, with reports of possible latencies across products:

  • Google Kubernetes Engine
  • Google Cloud Load Balancers
  • Filestore

Our engineering team continues to investigate the issue.

We will provide an update by Tuesday, 2022-06-14 17:45 US/Pacific with current details.

Diagnosis: Services running from us-central1 may experience latency issues.

Workaround: None at this time.