Google Cloud Status

This page provides status information on the services that are part of the Google Cloud Platform. Check back here to view the current status of the services listed below. For additional information on these services, please visit cloud.google.com.

Google Compute Engine Incident #15046

GCE connection issue

Incident began at 2015-03-07 09:56 and ended at 2015-03-07 10:35 (all times are US/Pacific).

Date Time Description
Mar 08, 2015 14:37

SUMMARY: On Saturday, March 7 2015, Google Compute Engine VMs experienced intermittent packet loss on egress network traffic between 09:55 PST and 10:38 PST. VM execution and VM-to-VM network traffic were unaffected during this interval.

DETAILED DESCRIPTION OF IMPACT: Beginning March 7 2015 at 09:55 PST, Google Compute Engine traffic bound for both the Internet and other Google services experienced intermittent packet loss. The intermittent packet loss persisted for 43 minutes until 10:38 PST, at which time packet loss returned to baseline levels. The user impact of this intermittent packet loss depended on the VM, zone, and user netblock, and ranged from no visible impact, to unusually slow responses, to timeouts when attempting to contact the VM.

ROOT CAUSE: The root cause of the packet loss was a configuration change introduced to the network stack, designed to provide greater isolation between VMs and projects by capping the traffic volume allowed for an individual VM. The configuration change had been tested prior to deployment to production without incident. However, as it was introduced into the production environment, it affected some VMs in an unexpected manner.
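The per-VM traffic cap described above is commonly implemented as a token-bucket rate limiter. The actual mechanism inside Google's network stack is not public; the following is a minimal illustrative sketch of how such a cap can convert excess egress traffic into drops (observed by users as packet loss):

```python
import time

class TokenBucket:
    """Illustrative per-VM egress cap: limits average throughput to
    `rate_bytes_per_s` while allowing bursts up to `burst_bytes`.
    Packets that arrive when the bucket is empty are dropped."""

    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self.tokens = burst_bytes          # start with a full bucket
        self.last = time.monotonic()

    def allow(self, packet_bytes):
        # Refill tokens in proportion to elapsed time, up to capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= packet_bytes:
            self.tokens -= packet_bytes
            return True                    # packet forwarded
        return False                       # packet dropped -> seen as loss
```

A cap like this is exactly the kind of mechanism whose effective limit must be validated against real production traffic patterns: if the configured rate or burst size is too low for some VMs' workloads, those VMs see intermittent drops of the sort described in this incident.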

REMEDIATION AND PREVENTION: Automated network monitoring systems alerted Google engineers to the issue at 10:13 PST, 18 minutes after detectable packet loss first appeared. The Google engineering team identified the root cause and rolled back the configuration change starting at 10:35 PST, which immediately decreased the incidence rate of packet loss, with full recovery complete at 10:38 PST.

Google engineers are investigating why the prior testing of the change did not accurately predict the performance of the isolation mechanism in production. Future changes will not be applied to production until the test suite has been improved to demonstrate parity with behavior observed in production during this incident. Additionally, Google engineers are immediately amending the rollout protocol for network configuration changes so that future production changes will be applied to a small fraction of VMs at a time, reducing the exposure in the event of undetected behavior.
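The amended rollout protocol described above — applying a network configuration change to a small fraction of VMs at a time and halting on regression — can be sketched as a staged (canary) rollout. The function names and stage fractions below are illustrative assumptions, not Google's actual tooling:

```python
def staged_rollout(vms, apply_change, is_healthy,
                   stages=(0.01, 0.10, 0.50, 1.0)):
    """Apply a config change to progressively larger fractions of the
    fleet (hypothetical stage sizes). Halt and signal a rollback if
    any VM touched so far looks unhealthy after a stage completes."""
    applied = 0
    for frac in stages:
        # Advance to the next stage boundary (at least one new VM).
        target = max(applied + 1, int(len(vms) * frac))
        for vm in vms[applied:target]:
            apply_change(vm)
        applied = target
        # Health-check everything changed so far before continuing.
        if not all(is_healthy(vm) for vm in vms[:applied]):
            return applied, "rollback"
        if applied >= len(vms):
            break
    return applied, "complete"
```

The key property is blast-radius containment: a change that misbehaves in production is detected after affecting only the first small stage, rather than the whole fleet at once.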

Mar 07, 2015 11:49

The problem with Google Compute Engine connectivity should be resolved as of 10:35 AM 2015-03-07 UTC-8. We apologize for any issues this may have caused you or your users and thank you for your patience and continued support. Please rest assured that system reliability is a top priority at Google, and we are constantly working to improve the reliability of our systems.

We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Mar 07, 2015 10:40

We apologize for any issues this may have caused you or your users and thank you for your patience and continued support. Please rest assured that system reliability is a top priority at Google, and we are constantly working to improve the reliability of our systems.
