Google Cloud Service Health

Incident affecting Google Compute Engine

GCE connection issue

Incident began at 2015-03-07 09:56 and ended at 2015-03-07 10:35 (all times are US/Pacific).

8 Mar 2015, 14:37 PDT

SUMMARY: On Saturday, March 7, 2015, Google Compute Engine VMs experienced intermittent packet loss on egress network traffic between 09:55 PST and 10:38 PST. VM execution and VM-to-VM network traffic were unaffected during this interval.

DETAILED DESCRIPTION OF IMPACT: Beginning March 7, 2015 at 09:55 PST, Google Compute Engine traffic bound for both the Internet and other Google services experienced intermittent packet loss. The packet loss persisted for 43 minutes, until 10:38 PST, at which time it returned to baseline levels. The user impact depended on VM, zone, and user netblock, and ranged from no visible impact, to unusually slow responses, to timeouts when attempting to contact the VM.

ROOT CAUSE: The root cause of the packet loss was a configuration change to the network stack. The change was designed to provide greater isolation between VMs and projects by capping the traffic volume allowed for an individual VM. It had been tested without incident prior to deployment to production. However, when it was introduced into the production environment, it affected some VMs in an unexpected manner.
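
The report does not say how the per-VM traffic cap was implemented. Purely as an illustration of how a cap of this kind can turn into packet loss when it is set too low for a VM's real traffic, the sketch below uses a token bucket, a common rate-limiting technique. The names TokenBucket and drop_fraction, and all of the numbers, are hypothetical and are not taken from Google's network stack.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Hypothetical per-VM egress cap; NOT Google's actual mechanism."""

    rate_bytes_per_s: float  # sustained rate allowed for the VM
    burst_bytes: float       # largest burst the VM may send at once
    tokens: float = 0.0
    last_time: float = 0.0

    def __post_init__(self):
        # Start with a full bucket so the first burst is allowed.
        self.tokens = self.burst_bytes

    def allow(self, now: float, packet_bytes: int) -> bool:
        # Refill in proportion to elapsed time, never above the burst size.
        elapsed = now - self.last_time
        self.tokens = min(self.burst_bytes,
                          self.tokens + elapsed * self.rate_bytes_per_s)
        self.last_time = now
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True
        return False  # over the cap: this packet is dropped


def drop_fraction(bucket, packets):
    """Fraction of (timestamp_seconds, size_bytes) packets the cap drops."""
    packets = list(packets)
    dropped = sum(not bucket.allow(t, size) for t, size in packets)
    return dropped / len(packets)


if __name__ == "__main__":
    # Ten 1500-byte packets sent 100 ms apart (made-up traffic).
    traffic = [(i / 10, 1500) for i in range(10)]
    print(drop_fraction(TokenBucket(1_000.0, 1500.0), traffic))      # tight cap
    print(drop_fraction(TokenBucket(1_000_000.0, 1500.0), traffic))  # generous cap
```

The same traffic passes cleanly under a generous cap and is mostly dropped under a tight one, which is one way a cap that behaves well against test traffic could still affect some production VMs.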

REMEDIATION AND PREVENTION: Automated network monitoring systems alerted Google engineers to the issue at 10:13 PST, 18 minutes after detectable packet loss first appeared. The Google engineering team identified the root cause and began rolling back the configuration change at 10:35 PST, which immediately decreased the rate of packet loss, with full recovery complete at 10:38 PST.

Google engineers are investigating why the prior testing of the change did not accurately predict the performance of the isolation mechanism in production. Future changes will not be applied to production until the test suite has been improved to demonstrate parity with behavior observed in production during this incident. Additionally, Google engineers are immediately amending the rollout protocol for network configuration changes so that future production changes will be applied to a small fraction of VMs at a time, reducing the exposure in the event of undetected behavior.
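
The amended rollout protocol described above (applying a change to a small fraction of VMs at a time) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Google's rollout tooling: the function staged_rollout, its callbacks apply_change, rollback, and is_healthy, and the wave fractions are all hypothetical.

```python
import random
from typing import Callable, Iterable, List, Sequence


def staged_rollout(
    vms: Sequence[str],
    apply_change: Callable[[str], None],
    rollback: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    fractions: Iterable[float] = (0.01, 0.05, 0.25, 1.0),  # assumed wave sizes
    seed: int = 0,
) -> bool:
    """Apply a change in growing waves; undo it everywhere on the first failure.

    Returns True if every wave stayed healthy, False if the change was rolled back.
    """
    order = list(vms)
    random.Random(seed).shuffle(order)  # avoid always picking the same VMs first
    changed: List[str] = []
    for fraction in fractions:
        # Cumulative number of VMs that should carry the change after this wave.
        target = max(1, int(len(order) * fraction))
        for vm in order[len(changed):target]:
            apply_change(vm)
            changed.append(vm)
        # Check every VM changed so far before widening the exposure.
        if not all(is_healthy(vm) for vm in changed):
            for vm in reversed(changed):
                rollback(vm)
            return False
    return True
```

With 100 VMs and the default fractions, a change that misbehaves immediately is applied to a single VM before being rolled back, rather than to the whole fleet.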

7 Mar 2015, 11:49 PST

The problem with Google Compute Engine connectivity should be resolved as of 10:35 AM 2015-03-07 UTC-8. We apologize for any issues this may have caused you or your users and thank you for your patience and continued support. Please rest assured that system reliability is a top priority at Google, and we are constantly working to improve the reliability of our systems.

We will provide a more detailed analysis of this incident once we have completed our internal investigation.

7 Mar 2015, 11:45 PST

We apologize for any issues this may have caused to you or your users and thank you for your patience and continued support. Please rest assured that system reliability is a top priority at Google, and we are constantly working to improve the reliability of our systems.
