Google Cloud Status Dashboard

This page provides status information on the services that are part of Google Cloud Platform. Check back here to view the current status of the services listed below. For additional information on these services, please visit cloud.google.com.

Google Cloud Datastore Incident #17002

Cloud Datastore Internal errors in the European region

Incident began at 2017-02-14 00:15 and ended at 2017-02-14 14:40 (all times are US/Pacific).

Date Time Description
Feb 21, 2017 14:00

ISSUE SUMMARY

On Tuesday 14 February 2017, some applications using Google Cloud Datastore in Western Europe or the App Engine Search API in Western Europe experienced 2%-4% error rates and elevated latency for three periods with an aggregate duration of three hours and 36 minutes. We apologize for the disruption this caused to your service. We have already taken several measures to prevent incidents of this type from recurring and to improve the reliability of these services.

DETAILED DESCRIPTION OF IMPACT

On Tuesday 14 February 2017 between 00:15 and 01:18 PST, 54% of applications using Google Cloud Datastore in Western Europe or the App Engine Search API in Western Europe experienced elevated error rates and latency. The average error rate for affected applications was 4%.

Between 08:35 and 08:48 PST, 50% of applications using Google Cloud Datastore in Western Europe or the App Engine Search API in Western Europe experienced elevated error rates. The average error rate for affected applications was 4%.

Between 12:20 and 14:40 PST, 32% of applications using Google Cloud Datastore in Western Europe or the App Engine Search API in Western Europe experienced elevated error rates and latency. The average error rate for affected applications was 2%.

Errors received by affected applications during all three periods were either "internal error" or "timeout".
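Transient errors of this kind are typically handled on the client side by retrying with exponential backoff. The sketch below is illustrative only, assuming a generic stand-in exception rather than any specific Datastore client library (modern Google client libraries perform similar retries internally); the parameter values are not Google's.

```python
import random
import time

class TransientError(Exception):
    """Stand-in for a transient server error ("internal error" / "timeout")."""

def call_with_backoff(operation, max_attempts=5, base_delay=0.1):
    """Retry `operation` on TransientError with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
            # Full jitter spreads retries out, which avoids adding
            # synchronized retry pressure to an already-loaded service.
            time.sleep(random.uniform(0, base_delay * (2 ** attempt)))
```

With a 2%-4% transient error rate, even a single retry reduces the client-visible failure rate to well under 0.2%, at the cost of added tail latency.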

ROOT CAUSE

The incident was caused by a latent bug in a service used by both Cloud Datastore and the App Engine Search API that was triggered by high load on the service.

Starting at 00:15 PST, several applications changed their usage patterns in one zone in Western Europe and began running more complex queries, which caused higher load on the service.

REMEDIATION AND PREVENTION

Google's monitoring systems paged our engineers at 00:35 PST to alert us to elevated errors in a single zone. Our engineers followed normal practice by redirecting traffic to other zones to reduce the impact on customers while debugging the underlying issue. At 01:15, we redirected traffic to other zones in Western Europe, which resolved the incident three minutes later.

At 08:35 we redirected traffic back to the zone that previously had errors. We found that the error rate in that zone was still high and so redirected traffic back to other zones at 08:48.

At 12:45, our monitoring systems detected elevated errors in other zones in Western Europe. At 14:06 Google engineers added capacity to the affected service in those zones. This removed the trigger for the incident.

We have now identified and fixed the latent bug that caused errors when the system was at high load. We expect to roll out this fix over the next few days.

Our capacity planning team has generated forecasts for peak load on Cloud Datastore and the App Engine Search API and confirmed that sufficient capacity is provisioned to handle those peaks.

We will be making several changes to our monitoring systems to improve our ability to quickly detect and diagnose errors of this type.
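One common approach to detecting errors of this type is a sliding-window error-rate alert that pages when a zone's error rate crosses a threshold. The sketch below is a generic illustration, not Google's actual monitoring configuration; the window size and 2% threshold are assumptions chosen to match the error rates reported above.

```python
from collections import deque

class ErrorRateMonitor:
    """Sliding-window error-rate monitor for a single zone."""

    def __init__(self, window=1000, threshold=0.02):
        self.window = deque(maxlen=window)  # 1 = error, 0 = success
        self.threshold = threshold          # e.g. page above a 2% error rate

    def record(self, is_error):
        """Record the outcome of one request."""
        self.window.append(1 if is_error else 0)

    def should_page(self):
        """True when the observed error rate exceeds the threshold."""
        # Require a reasonably full window so a handful of early
        # errors does not trigger a false page.
        if len(self.window) < self.window.maxlen // 2:
            return False
        return sum(self.window) / len(self.window) > self.threshold
```

In practice such a check would run per zone, so that traffic can be drained from an unhealthy zone, as was done during this incident, while healthy zones continue serving.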

Once again, we apologize for the impact of this incident on your application.

Feb 14, 2017 14:53

The issue with Cloud Datastore serving elevated internal errors in the European region should have been resolved for all affected projects as of 14:34 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.

Feb 14, 2017 14:04

We are investigating an issue with Cloud Datastore in the European region. We will provide more information by 15:00 US/Pacific.
