Incident affecting Google Cloud Networking, Private Service Connect, Google Compute Engine, Google Kubernetes Engine, Hybrid Connectivity, Cloud Load Balancing
GCE Networking issue in us-east4
Incident began at 2018-05-16 19:24 and ended at 2018-05-16 20:17 (all times are US/Pacific).
| ||22 May 2018||13:32 PDT|| |
On Wednesday 16 May 2018, Google Cloud Networking experienced loss of connectivity to external IP addresses located in us-east4 for a duration of 58 minutes.
DETAILED DESCRIPTION OF IMPACT
On Wednesday 16 May 2018 from 18:43 to 19:41 PDT, Google Compute Engine, Google Cloud VPN, and Google Cloud Network Load Balancers hosted in the us-east4 region experienced 100% packet loss from the internet and other GCP regions. Google Dedicated Interconnect Attachments located in us-east4 also experienced loss of connectivity.
Every zone in Google Cloud Platform advertises several sets of IP addresses to the Internet via BGP. Some of these IP addresses are global and are advertised from every zone, others are regional and advertised only from zones in the region. The software that controls the advertisement of these IP addresses contained a race condition during application startup that would cause regional IP addresses to be filtered out and withdrawn from a zone. During a routine binary rollout of this software, the race condition was triggered in each of the three zones in the us-east4 region. Traffic continued to be routed until the last zone received the rollout and stopped advertising regional prefixes. Once the last zone stopped advertising the regional IP addresses, external regional traffic stopped entering us-east4.
REMEDIATION AND PREVENTION
Google engineers were alerted to the problem within one minute and as soon as investigation pointed to a problem with the BGP advertisements, a rollback of the binary in the us-east4 region was created to mitigate the issue. Once the rollback proved effective, the original rollout was paused globally to prevent any further issues.
We are taking the following steps to prevent the issue from happening again. We are adding additional monitoring which will provide better context in future alerts to allow us to diagnose issues faster. We also plan on improving the debuggability of the software that controls the BGP advertisements. Additionally, we will be reviewing the rollout policy for these types of software changes so we can detect issues before they impact an entire region.
We apologize for this incident and we recognize that regional outages like this should be rare and quickly rectified. We are taking immediate steps to prevent recurrence and improve reliability in the future.
| ||16 May 2018||20:17 PDT|| |
The issue with GCE Networking (affecting GCE, GKE, Cloud VPN and Cloud Private Interconnect) in us-east4 region has been resolved for all affected users as of Wednesday, 2018-05-16 19:40 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.
| ||16 May 2018||19:59 PDT|| |
The issue with GCE Networking (affecting GCE, GKE, Cloud VPN and Cloud Private Interconnect) in us-east4 region should be resolved for the majority of users and we expect a full resolution in the near future. We will provide another status update by Wednesday, 2018-05-16 20:20 US/Pacific with current details.
| ||16 May 2018||19:38 PDT|| |
Mitigation work is currently underway by our Engineering Team. We will provide another status update by Wednesday, 2018-05-16 20:10 US/Pacific with current details.
| ||16 May 2018||19:24 PDT|| |
We are investigating an issue with GCE Networking in us-east4 region affecting GCE VMs, GKE, Cloud VPN and Cloud Private Interconnect resulting in network packet loss. We will provide more information by Wednesday, 2018-05-16 19:43 US/Pacific.