Google Cloud Status Dashboard
Incident affecting Operations
The Stackdriver logging service is experiencing a 30-minute delay.
Incident began at 2018-05-20 18:40 and ended at 2018-05-20 23:05 (all times are US/Pacific).
24 May 2018, 13:03 PDT
On Sunday, 20 May 2018 for 4 hours and 25 minutes, approximately 6% of Stackdriver Logging logs experienced a median ingest latency of 90 minutes. To our Stackdriver Logging customers whose operations monitoring was impacted during this outage, we apologize. We have conducted an internal investigation and are taking steps to ensure this doesn’t happen again.
DETAILED DESCRIPTION OF IMPACT
On Sunday, 20 May 2018 from 18:40 to 23:05 Pacific Time, 6% of logs ingested by Stackdriver Logging experienced log event ingest latency of up to 2 hours 30 minutes, with a median latency of 90 minutes. Customers requesting log events within the latency window would receive empty responses. Logging export sinks were not affected.
ROOT CAUSE
Stackdriver Logging uses a pool of workers to persist ingested log events. On Sunday, 20 May 2018 at 17:40, a load spike in the Stackdriver Logging storage subsystem caused 0.05% of persist calls made by the workers to time out. The workers would then retry persisting to the same address until reaching a retry timeout. While the workers were retrying, they were not persisting other log events. This resulted in multiple workers being removed from the pool of available workers.
By 18:40, enough workers had been removed from the pool to reduce throughput below the level of incoming traffic, creating delays for 6% of logs.
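The relationship between pool size and delay can be illustrated with a toy model. This is only a sketch: the function name `backlog_over_time`, the rates, and the worker counts are hypothetical and are not taken from the incident data. It shows that a backlog builds up only once the combined capacity of the remaining workers drops below the arrival rate, and drains again once capacity recovers.

```python
def backlog_over_time(arrival_rate, per_worker_rate, active_workers):
    """Return the queue backlog after each time step.

    arrival_rate:    log events arriving per step (hypothetical units)
    per_worker_rate: events one healthy worker persists per step
    active_workers:  number of workers available in each step
    """
    backlog = 0
    history = []
    for workers in active_workers:
        capacity = workers * per_worker_rate
        # The backlog grows when arrivals exceed capacity and drains
        # otherwise, but it can never drop below zero.
        backlog = max(0, backlog + arrival_rate - capacity)
        history.append(backlog)
    return history


# Hypothetical pool that shrinks from 12 to 8 workers, then recovers.
print(backlog_over_time(100, 10, [12, 10, 8, 8, 12, 12]))
```

In this example the backlog stays at zero while the pool can still absorb the traffic, grows during the two steps with only 8 workers, and clears after the pool returns to full size.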
REMEDIATION AND PREVENTION
After Google Engineering was paged, engineers isolated the issue to the workers that were timing out. At 20:35, engineers configured the workers to return timed-out log events to the queue and move on to a different log event after a timeout. This allowed the workers to catch up with the ingest rate. At 23:02, the last delayed message was delivered.
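The mitigation described above can be sketched roughly as follows. This is not the Stackdriver implementation: `drain`, `PersistTimeout`, the `persist` callback, and the `max_attempts` cap are all hypothetical names and assumptions introduced for illustration. The point is only that a timed-out event goes back to the end of the queue so the worker can continue with other events instead of blocking on one.

```python
from collections import deque


class PersistTimeout(Exception):
    """Raised by the hypothetical persist call when storage times out."""


def drain(events, persist, max_attempts=3):
    """Persist events, requeueing any that time out instead of blocking.

    Returns (persisted, gave_up). The max_attempts cap is an assumption
    of this sketch, not something stated in the incident report.
    """
    pending = deque((event, 0) for event in events)
    persisted, gave_up = [], []
    while pending:
        event, attempts = pending.popleft()
        try:
            persist(event)
            persisted.append(event)
        except PersistTimeout:
            if attempts + 1 < max_attempts:
                # Return the event to the queue and move on to the next one.
                pending.append((event, attempts + 1))
            else:
                gave_up.append(event)
    return persisted, gave_up


# Usage: event "b" times out once, so it is persisted after "a" and "c".
calls = {}

def flaky_persist(event):
    calls[event] = calls.get(event, 0) + 1
    if event == "b" and calls[event] == 1:
        raise PersistTimeout()

print(drain(["a", "b", "c"], flaky_persist))
```

The trade-off in this design is ordering: requeued events are persisted later than events that arrived after them, which is acceptable only if consumers do not depend on strict persist order.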
We are taking the following steps to prevent the issue from happening again: we are modifying the workers to retry persists using alternate addresses to reduce the impact of persist timeouts; we are increasing the persist capacity of the storage subsystem to manage load spikes; we are modifying Stackdriver Logging workers to reduce their unavailability when the storage subsystem experiences higher latency.
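The first prevention step, retrying against alternate addresses, might look something like the sketch below. The function `persist_with_alternates`, the `persist_to` callback, and the address strings are hypothetical; the report does not describe how addresses are chosen or how many are tried, so trying each address once in order is an assumption of this example.

```python
def persist_with_alternates(event, addresses, persist_to):
    """Try each address in turn; return the address that accepted the event.

    persist_to(address, event) is a hypothetical callback that raises
    TimeoutError when the storage backend at that address times out.
    """
    last_error = None
    for address in addresses:
        try:
            persist_to(address, event)
            return address
        except TimeoutError as error:
            # Do not retry the same address; move on to the next one.
            last_error = error
    raise TimeoutError(f"all {len(addresses)} addresses timed out") from last_error


# Usage: the first address is overloaded, so the second one is used.
def fake_persist_to(address, event):
    if address == "storage-1":
        raise TimeoutError("storage-1 overloaded")

print(persist_with_alternates("log-event", ["storage-1", "storage-2"], fake_persist_to))
```

Compared with retrying the same address until a retry timeout, this keeps a single slow backend from holding a worker for the full retry window.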
20 May 2018, 22:53 PDT
The issue with Stackdriver Logging delays has been resolved for all affected projects as of Sunday, 2018-05-20 22:45 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.
20 May 2018, 22:04 PDT
Mitigation work is currently underway by our Engineering Team. We will provide another status update by Sunday, 2018-05-20 23:00 US/Pacific with current details.
20 May 2018, 20:44 PDT
The Stackdriver logging service is experiencing a 30-minute delay. We will provide another status update by Sunday, 2018-05-20 22:00 US/Pacific with current details.
20 May 2018, 20:19 PDT
We are investigating an issue with Google Stackdriver. We will provide more information by Sunday, 2018-05-20 20:30 US/Pacific.