Service Health

This page provides status information on the services that are part of Google Cloud. Check back here to view the current status of the services listed below. If you are experiencing an issue not listed here, please contact Support. Learn more about what's posted on the dashboard in this FAQ. For additional information on these services, please visit https://cloud.google.com/.

Incident affecting Google Cloud Networking, Cloud Monitoring, Google Cloud Console, Cloud Memorystore, Hybrid Connectivity

We are investigating an issue with Google Cloud Networking in the zone europe-west1-b. We will provide more information by Wednesday, 2018-12-19 07:00 US/Pacific.

Incident began at 2018-12-19 06:00 and ended at 2018-12-19 06:39 (all times are US/Pacific).

Date Time Description
21 Dec 2018 14:49 PST

ISSUE SUMMARY

On Wednesday 19 December 2018 multiple GCP services in europe-west1-b experienced a disruption for a duration of 34 minutes. Several GCP services were impacted: GCE, Monitoring, Cloud Console, GAE Admin API, Task Queues, Cloud Spanner, Cloud SQL, GKE, Cloud Bigtable, and Redis. GCP services in all other zones remained unaffected.

This service disruption was caused by an erroneous trigger leading to a switch re-installation during upgrades to two control plane network (CPN) switches impacting a portion of europe-west1-b. Most impacted GCP services in the zone recovered within a few minutes after the issue was mitigated.

We understand that these services are critical to our customers and sincerely apologize for the disruption caused by this incident. To prevent the issue from recurring we are fixing our repair workflows to catch such errors before serving traffic.

DETAILED DESCRIPTION OF IMPACT

On Wednesday 19 December 2018 from 05:53 to 06:27 US/Pacific, multiple GCP services in europe-west1-b experienced disruption due to a network outage in one of Google’s data centers.

The following Google Cloud Services in europe-west1-b were impacted: GCE instance creation, GCE networking, Cloud VPN, Cloud Interconnect, Stackdriver Monitoring API, Cloud Console, App Engine Admin API, App Engine Task Queues, Cloud Spanner, Cloud SQL, GKE, Cloud Bigtable, and Cloud Memorystore for Redis. Most of these services suffered a brief disruption during the duration of the incident and recovered when the issue was mitigated.

  • Stackdriver: Around 1% of customers accessing Stackdriver Monitoring API directly received 5xx errors.
  • Cloud Console: Affected customers may not have been able to view graphs and API usage statistics. Impacted dashboards include: /apis/dashboard, /home/dashboard, /google/maps-api/api list.
  • Redis: After the network outage ended, ~50 standard Redis instances in europe-west1 remained unavailable until 07:55 US/Pacific due to a failover bug triggered by the outage.

ROOT CAUSE

As part of a program to upgrade network switches in control plane networks across Google’s data center, two control plane network (CPN) switches supporting a single CPN were scheduled to undergo upgrades. On December 17, the first switch was upgraded and was back online the same day. The issue triggered on December 19 when the second switch was due to be upgraded. During the upgrade of the second switch, a reinstallation was erroneously triggered on the first switch, causing it to go offline for a short period of time. Having both switches down partitioned the network supporting a portion of europe-west1-b. Due to this isolation, the zone was left partially functional.

REMEDIATION AND PREVENTION

The issue was mitigated at 06:27 US/Pacific when reinstallation of the first switch in the CPN completed.

To prevent the issue from recurring we are changing the switch upgrade workflow to prevent erroneous triggers. The trigger inadvertently caused the switch to re-install before any CPN switch is deemed healthy to serve traffic. We are also adding additional checks to make sure upgraded devices are in full functional state before they are deemed healthy to start serving. We will also be improving our automation to catch offline peer devices sooner and help prevent related issues.

19 Dec 2018 06:39 PST

The Google Cloud Networking issue in zone europe-west1-b has been resolved. No further updates will be provided here.

19 Dec 2018 06:29 PST

We are investigating an issue with Google Cloud Networking in the zone europe-west1-b. We will provide more information by Wednesday, 2018-12-19 07:00 US/Pacific.

19 Dec 2018 06:29 PST

We are investigating an issue with Google Cloud Networking in the zone europe-west1-b. We will provide more information by Wednesday, 2018-12-19 07:00 US/Pacific.