Google Cloud Service Health

Incident affecting Google Compute Engine

Short DNS outage and network packet loss affecting Compute Engine in three Asian zones.

Incident began at 2014-10-13 03:42 and ended at 2014-10-13 04:30 (all times are US/Pacific).


16 Oct 2014, 16:58 PDT

SUMMARY: For a period of 51 minutes on Monday 13 October 2014, Google Compute Engine instances in the asia-east1 region experienced intermittent network connectivity problems, including DNS failures and loss of connectivity to other Google services. Additionally, a subset of external users in Asia were unable to connect to any Compute Engine instances. If this issue affected you or your systems, we sincerely apologize; we strive for a very high level of service and in this case we did not meet that bar.

DETAILED DESCRIPTION OF IMPACT: Beginning at 03:39 PDT and ending at 03:50, Compute Engine instances in the asia-east1 region were unable to resolve DNS queries. From 03:52 until 04:30, connectivity to all Compute Engine instances was disrupted for users entering Google’s network in Taiwan. Between 04:13 and 04:30, Compute Engine instances in the asia-east1 region were unable to contact other Google services.

All network connectivity between Compute Engine instances continued to operate normally throughout the incident.
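
The impact windows above overlap, so it can help to check the arithmetic behind the 51-minute figure in the summary. The following is a minimal sketch that uses only the times stated in this report; the window labels and helper names are illustrative and not part of the original post.

```python
from datetime import datetime

# Impact windows as stated in the report (13 Oct 2014, all times US/Pacific).
# The labels are illustrative shorthand, not wording from the report.
DAY = "2014-10-13"
WINDOWS = {
    "dns_resolution_failed": ("03:39", "03:50"),
    "ingress_via_taiwan_disrupted": ("03:52", "04:30"),
    "other_google_services_unreachable": ("04:13", "04:30"),
}


def parse(hhmm):
    """Turn an HH:MM string into a datetime on the incident day."""
    return datetime.strptime(f"{DAY} {hhmm}", "%Y-%m-%d %H:%M")


def minutes_between(start, end):
    """Whole minutes between two HH:MM strings on the incident day."""
    return int((parse(end) - parse(start)).total_seconds() // 60)


# Duration of each individual impact window.
durations = {name: minutes_between(s, e) for name, (s, e) in WINDOWS.items()}

# Overall span: earliest start to latest end across all windows.
first_start = min(parse(s) for s, _ in WINDOWS.values())
last_end = max(parse(e) for _, e in WINDOWS.values())
overall = int((last_end - first_start).total_seconds() // 60)

for name, mins in durations.items():
    print(f"{name}: {mins} min")
print(f"overall span: {overall} min")
```

The overall span from 03:39 to 04:30 matches the 51 minutes quoted in the summary. The third window lies entirely inside the second, and there is a two-minute gap between the first and second windows.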

ROOT CAUSE: While performing a network upgrade, Google engineers encountered a process bug in the upgrade procedure, resulting in uneven traffic distribution and a traffic overload on several links serving two datacenters in Taiwan. Google engineers were quickly alerted to the traffic overload and attempted to reroute traffic away from the affected datacenters. The initial attempt uncovered a latent bug in our network control plane software that prevented a load balancer from withdrawing all routes. Traffic continued to be routed to this device, which became overwhelmed.

REMEDIATION AND PREVENTION: Upon receiving alerts of network overload in the affected datacenters, Google network engineers responded within 5 minutes to withdraw routes to these datacenters. When this did not have the desired effect, the engineers redirected traffic around the misbehaving network components. At 04:28, Google engineers resolved the underlying issue by manually disabling the faulty control plane software.

The issue in the upgrade procedure has been identified and fixed. The bug in the network control plane software has been identified and Google engineers are currently developing and testing a fix. Google engineers are augmenting the Compute Engine network monitoring to quickly and accurately pinpoint network issues affecting Compute Engine instances. Finally, Google engineers are improving the way traffic is routed to and from Compute Engine instances to better segment the system’s network failure domains.
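
The report distinguishes three symptoms: DNS resolution failing, other Google services being unreachable, and instance-to-instance traffic continuing to work. As a rough illustration of how an operator might tell these apart from inside an instance, here is a minimal sketch using only the Python standard library. It is not Google's monitoring; the function names, the classification labels, and the idea of probing one peer and one external service are all assumptions made for this example.

```python
import socket


def can_resolve(hostname, resolver=socket.getaddrinfo):
    """Return True if the hostname resolves, False on a resolution error."""
    try:
        resolver(hostname, None)
        return True
    except socket.gaierror:
        return False


def can_connect(host, port, timeout=3.0, connector=socket.create_connection):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        conn = connector((host, port), timeout)
        conn.close()
        return True
    except OSError:
        return False


def classify(dns_ok, peer_ok, service_ok):
    """Map three probe results to a coarse, hypothetical symptom label."""
    if dns_ok and peer_ok and service_ok:
        return "healthy"
    if not dns_ok and peer_ok:
        return "dns"  # resolution fails while peers stay reachable
    if peer_ok and not service_ok:
        return "external-services"  # peers fine, external service is not
    return "general-connectivity"
```

A caller would supply its own targets, for example a neighbouring instance's internal address for the peer probe and a service endpoint it depends on for the service probe. The `resolver` and `connector` parameters exist so the probes can be exercised without real network access.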

13 Oct 2014, 06:39 PDT

We experienced a DNS outage and network packet loss affecting Compute Engine instances in the zones asia-east1-a, asia-east1-b and asia-east1-c. Users with instances in these zones may have been unable to reach their instances or may have seen high packet loss. The issue started at 03:42 AM and was resolved by 04:30 AM (US/Pacific time). For everyone who was affected, we apologize for any inconvenience you may have experienced.