Google Cloud Status Dashboard

This page provides status information on the services that are part of Google Cloud Platform. Check back here to view the current status of the services listed below. For additional information on these services, please visit cloud.google.com.

Google Compute Engine Incident #15064

Network Connectivity Issues in europe-west1

Incident began at 2015-11-23 12:14 and ended at 2015-11-23 13:04 (all times are US/Pacific).

Date Time Description
Nov 27, 2015 07:21

SUMMARY:

On Monday 23 November 2015, for a duration of 70 minutes, a subset of Internet destinations was unreachable from the Google Compute Engine europe-west1 region. If your service or application was affected, we apologize — this is not the level of quality and reliability we strive to offer you, and we have taken and are taking immediate steps to improve the platform’s performance and availability.

DETAILED DESCRIPTION OF IMPACT:

On Monday 23 November 2015 from 11:55 to 13:05 PST, a number of Internet regions (Autonomous Systems) became unreachable from Google Compute Engine's europe1-west region. The region's traffic volume decreased by 13% during the incident. The majority of affected destination addresses were located in eastern Europe and the Middle East.

Traffic to other external destinations was not affected. There was no impact on Google Compute Engine instances in any other region, nor on traffic to any destination within Google.

ROOT CAUSE:

At 11:51 on Monday 23 November, Google networking engineers activated an additional link in Europe to a network carrier with whom Google already shares many peering links globally. On this link, the peer's network signalled that it could route traffic to many more destinations than Google engineers had anticipated, and more than the link had capacity for. Google's network responded accordingly by routing a large volume of traffic to the link. At 11:55, the link saturated and began dropping the majority of its traffic.

In normal operation, peering links are activated by automation whose safety checks would have detected and rectified this condition. In this case, the automation was not operational due to an unrelated failure, and the link was brought online manually, so the automation's safety checks did not occur.

The automated checks were expected to protect the network for approximately one hour after link activation, and normal congestion monitoring began at the end of that period. As the post-activation checks were missing, this allowed a 61-minute delay before the normal monitoring started, detected the congestion, and alerted Google network engineers.

REMEDIATION AND PREVENTION:

Automated alerts fired at 12:56. At 13:02, Google network engineers directed traffic away from the new link and traffic flows returned to normal by 13:05.

To prevent recurrence of this issue, Google network engineers are changing procedure to disallow manual link activation. Links may only be brought up using automated mechanisms, including extensive safety checks both before and after link activation. Additionally, monitoring now begins immediately after link activation, providing redundant error detection.

SUMMARY:

On Monday 23 November 2015, for a duration of 70 minutes, a subset of Internet destinations was unreachable from the Google Compute Engine europe-west1 region. If your service or application was affected, we apologize — this is not the level of quality and reliability we strive to offer you, and we have taken and are taking immediate steps to improve the platform’s performance and availability.

DETAILED DESCRIPTION OF IMPACT:

On Monday 23 November 2015 from 11:55 to 13:05 PST, a number of Internet regions (Autonomous Systems) became unreachable from Google Compute Engine's europe1-west region. The region's traffic volume decreased by 13% during the incident. The majority of affected destination addresses were located in eastern Europe and the Middle East.

Traffic to other external destinations was not affected. There was no impact on Google Compute Engine instances in any other region, nor on traffic to any destination within Google.

ROOT CAUSE:

At 11:51 on Monday 23 November, Google networking engineers activated an additional link in Europe to a network carrier with whom Google already shares many peering links globally. On this link, the peer's network signalled that it could route traffic to many more destinations than Google engineers had anticipated, and more than the link had capacity for. Google's network responded accordingly by routing a large volume of traffic to the link. At 11:55, the link saturated and began dropping the majority of its traffic.

In normal operation, peering links are activated by automation whose safety checks would have detected and rectified this condition. In this case, the automation was not operational due to an unrelated failure, and the link was brought online manually, so the automation's safety checks did not occur.

The automated checks were expected to protect the network for approximately one hour after link activation, and normal congestion monitoring began at the end of that period. As the post-activation checks were missing, this allowed a 61-minute delay before the normal monitoring started, detected the congestion, and alerted Google network engineers.

REMEDIATION AND PREVENTION:

Automated alerts fired at 12:56. At 13:02, Google network engineers directed traffic away from the new link and traffic flows returned to normal by 13:05.

To prevent recurrence of this issue, Google network engineers are changing procedure to disallow manual link activation. Links may only be brought up using automated mechanisms, including extensive safety checks both before and after link activation. Additionally, monitoring now begins immediately after link activation, providing redundant error detection.

Nov 23, 2015 14:32

The issue with network connectivity issues in europe-west1 should have been resolved for all affected users as of 13:04 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

The issue with network connectivity issues in europe-west1 should have been resolved for all affected users as of 13:04 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Nov 23, 2015 13:26

We are investigating reports of an issue with network connectivity in europe-west1. We will provide more information by 14:22 US/Pacific.

We are investigating reports of an issue with network connectivity in europe-west1. We will provide more information by 14:22 US/Pacific.