Google Cloud Status Dashboard

Google Compute Engine Incident #15045

GCE instances are not reachable

Incident began at 2015-02-18 22:59 and ended at 2015-02-19 01:01 (all times are US/Pacific).

Date Time Description
Feb 19, 2015 16:10

SUMMARY

For 40 minutes spanning Wednesday 18th and Thursday 19th February 2015, the majority of Google Compute Engine instances experienced traffic loss for outbound network connectivity, with lower levels of loss beginning at 22:40 PST on February 18th and ending at 01:20 PST on February 19th. The total length of detectable external traffic loss was 2 hours and 40 minutes. We consider GCE’s availability over the last 24 hours to be unacceptable, and we apologise if your service was affected by this outage. Today we are completely focused on addressing the incident and its root causes, so that this problem or other hypothetical similar problems cannot recur in the future.

DETAILED DESCRIPTION OF IMPACT

Starting on 18 February at 22:40 PST, outbound traffic from Google Compute Engine instances began to experience 10% loss of flows. The fraction of flows experiencing loss increased linearly to a peak of 70% loss at 23:55. That level of loss lasted 40 minutes until 00:35 PST on 19 February, at which point engineering remediation efforts rapidly reduced loss to 15% by 00:50. Traffic loss was eliminated and normal traffic levels resumed by 01:20.

The issue manifested as a loss of external connectivity to the instances, and an inability of the instances to send traffic outside their private network. The instances themselves continued to run, and became available again as their external traffic loss cleared.
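The durations quoted above can be cross-checked from the reported timestamps. The following is a minimal sketch, not part of the original report: the `TIMELINE` structure and function names are hypothetical. Only the 22:40 to 23:55 ramp is described as linear in the report; interpolating the other segments linearly is an assumption made here for illustration.

```python
from datetime import datetime

# (timestamp in US/Pacific, percent of flows lost), taken from the
# description above. The data structure itself is illustrative only.
TIMELINE = [
    (datetime(2015, 2, 18, 22, 40), 10.0),  # loss begins
    (datetime(2015, 2, 18, 23, 55), 70.0),  # peak loss reached
    (datetime(2015, 2, 19, 0, 35), 70.0),   # end of peak, remediation takes effect
    (datetime(2015, 2, 19, 0, 50), 15.0),   # loss reduced
    (datetime(2015, 2, 19, 1, 20), 0.0),    # loss eliminated
]


def total_duration(timeline):
    """Time between the first and the last reported point."""
    return timeline[-1][0] - timeline[0][0]


def peak_duration(timeline):
    """Time between the first and the last point at the maximum loss level."""
    peak = max(loss for _, loss in timeline)
    at_peak = [t for t, loss in timeline if loss == peak]
    return at_peak[-1] - at_peak[0]


def estimated_loss(timeline, when):
    """Linearly interpolate loss between reported points.

    Assumption: every segment is treated as linear, although the report
    only states this for the initial ramp.
    """
    if when < timeline[0][0] or when > timeline[-1][0]:
        return 0.0
    for (t0, l0), (t1, l1) in zip(timeline, timeline[1:]):
        if t0 <= when <= t1:
            fraction = (when - t0) / (t1 - t0)
            return l0 + fraction * (l1 - l0)
    return 0.0


if __name__ == "__main__":
    print(total_duration(TIMELINE))  # 2:40:00
    print(peak_duration(TIMELINE))   # 0:40:00
```

Under these assumptions the total window is 2 hours and 40 minutes and the peak lasts 40 minutes, matching the figures in the summary.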

ROOT CAUSE [PRELIMINARY]

The internal software system which programs GCE’s virtual network for VM egress traffic stopped issuing updated routing information. The cause of this interruption is still under active investigation. Cached route information provided a defense in depth against missing updates, but GCE VM egress traffic started to be dropped as the cached routes expired.
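To illustrate why loss grew gradually rather than all at once, the sketch below models a cache of routes whose entries expire a fixed time after their last update. This is a toy model and not Google's implementation: `RouteCache`, `loss_fraction`, the destination names, and the time units are all hypothetical. It assumes that a lookup against an expired entry results in the traffic being dropped.

```python
class RouteCache:
    """Toy cache of egress routes with per-entry expiry (hypothetical)."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._entries = {}  # destination -> (next_hop, expires_at)

    def update(self, destination, next_hop, now):
        """Record a fresh route; it stays valid for ttl seconds from now."""
        self._entries[destination] = (next_hop, now + self.ttl)

    def lookup(self, destination, now):
        """Return the cached next hop, or None if missing or expired."""
        entry = self._entries.get(destination)
        if entry is None:
            return None
        next_hop, expires_at = entry
        if now >= expires_at:
            return None  # expired: in this model the traffic is dropped
        return next_hop


def loss_fraction(cache, destinations, now):
    """Fraction of destinations with no usable route at time now."""
    dropped = sum(1 for d in destinations if cache.lookup(d, now) is None)
    return dropped / len(destinations)


if __name__ == "__main__":
    destinations = ["dest-a", "dest-b", "dest-c", "dest-d"]
    cache = RouteCache(ttl_seconds=100)
    # Routes are last refreshed at staggered times, then updates stop.
    for i, dest in enumerate(destinations):
        cache.update(dest, "next-hop", now=i * 10)
    for t in (50, 105, 125, 130):
        print(t, loss_fraction(cache, destinations, t))
```

Because the entries in this model were last refreshed at different times, they expire at different times, so the fraction of dropped traffic rises step by step once updates stop. A longer `ttl_seconds` delays the first expiry, leaving more time to restore the update source.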

RESOLUTION AND PREVENTION

Google Engineers were alerted to the dropped packets caused by cached route expiration. They were able to identify a potential fix (reload the entire route information) approximately 45 minutes after being alerted, while most routing entries had not expired. They were able to force a reload to fix the networking approximately 60 minutes after the issue was identified and well before all entries had expired.

Google Engineers have already made a change to extend the expiration lifespan of routing entries from several hours to a week, which will allow ample time to take corrective action should a similar problem occur in the future. They expect to make several other more positive defense-in-depth changes to prevent recurrence in the coming days, including updates to the system which programs route information and additional monitoring and alerting. The engineering work will proceed in parallel with the completion and validation of the full post-mortem for this event.

Feb 19, 2015 01:31

The problem with network connectivity in Google Compute Engine was resolved shortly after 01:00 US/Pacific. We are sorry for any issues this may have caused you or your users, and we thank you for your patience and continued support. Please rest assured that system reliability is a top priority at Google, and we are constantly working to improve the reliability of our systems.

We will provide a detailed analysis of this incident once we have completed our internal investigation.

Feb 19, 2015 01:00

We are still investigating the network issues with Google Compute Engine. We will provide another status update by 01:30 US/Pacific time.

Feb 19, 2015 00:30

We are still experiencing network issues with Google Compute Engine.

We will provide another status update by 01:00 US/Pacific time.

Feb 19, 2015 00:01

We are currently experiencing a network issue with GCE, and instances in multiple zones have lost network connectivity. For everyone who is affected, we apologize for any inconvenience you may be experiencing.

We will provide an update with current details by 00:30 US/Pacific on Feb 19, 2015.

Feb 18, 2015 23:46

We're investigating an issue with Google Compute Engine beginning at 22:59 on Feb 18, 2015. We will provide more information shortly.
