Service Health

Incident affecting Dialogflow CX, Dialogflow ES

Elevated latency and memory usage in Dialogflow CX and Dialogflow ES

Incident began at 2023-07-28 10:17 and ended at 2023-07-28 15:52 (all times are US/Pacific).

Previously affected location(s)

  • Tokyo (asia-northeast1)
  • Mumbai (asia-south1)
  • Singapore (asia-southeast1)
  • Sydney (australia-southeast1)
  • Belgium (europe-west1)
  • London (europe-west2)
  • Frankfurt (europe-west3)
  • Global
  • Montréal (northamerica-northeast1)
  • Iowa (us-central1)
  • South Carolina (us-east1)
  • Oregon (us-west1)

Date Time Description
3 Aug 2023 15:20 PDT

Incident Report

Summary

On Friday, 28 July 2023 from 10:17 to 15:52 US/Pacific, Dialogflow Customer Experience (Dialogflow CX) and Dialogflow Essentials (Dialogflow ES) experienced elevated DEADLINE_EXCEEDED and UNAVAILABLE errors, along with elevated latency, for certain functions for a duration of 5 hours, 35 minutes. High error rates were seen for the CreateConversation, CreateParticipant, and (Streaming)AnalyzeContent functions. Any customer applications or services dependent on the Dialogflow endpoints “global-dialogflow.googleapis.com” or “dialogflow.googleapis.com” would have experienced service interruptions. Only the “Global” location was affected; no other locations were affected.

To our Dialogflow customers whose applications were impacted during this outage, we sincerely apologize. This is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform’s availability.

Root Cause

The root cause of the issue is exhaustion of thread and memory resources in the component that handles incoming and outgoing API requests. The exhaustion was triggered by recent code changes to improve the efficiency of interactions between Dialogflow API, Pub/Sub messaging, and internal authentication services, and by the unexpected behavior of some incoming queries.

One of our largest customers began trying out a new, private feature in Dialogflow. Their client behaved in an unexpected way, exposing a potential resource exhaustion involving Cloud Pub/Sub subscriptions, which we use to keep conversation state synchronized between multiple platforms. To create conversations using Dialogflow, topic subscriptions are initialized. Once conversations are completed, these subscriptions are cleared. As a result of the recent code change, customer projects that initiated conversations without completing them started to accumulate topic subscriptions, leading to a backlog.

The resource exhaustion only became apparent after we received a large volume of traffic that exhibited this unusual behavior on 28 July 2023. This manifested in our servers as spiking memory usage and thread exhaustion. When the pool of available threads began to dwindle, the incident started impacting other customers as well.
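The accumulation described above can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not Dialogflow's implementation: the `ConversationRegistry` class, its method names, and the age threshold are all hypothetical, and a plain dictionary stands in for the per-conversation topic subscriptions.

```python
class ConversationRegistry:
    """Toy stand-in for per-conversation topic subscriptions (hypothetical)."""

    def __init__(self):
        # conversation_id -> creation time, in arbitrary time units
        self.subscriptions = {}

    def create_conversation(self, conversation_id, now):
        # Creating a conversation initializes a subscription.
        self.subscriptions[conversation_id] = now

    def complete_conversation(self, conversation_id):
        # Completing a conversation clears its subscription.
        self.subscriptions.pop(conversation_id, None)

    def stale_conversations(self, now, max_age):
        # Hypothetical monitoring signal: conversations that were opened
        # more than max_age ago and never completed.
        return [cid for cid, created in self.subscriptions.items()
                if now - created > max_age]


registry = ConversationRegistry()
for i in range(1000):
    registry.create_conversation(f"conv-{i}", now=i)
    if i % 10 == 0:
        # Only one client in ten completes its conversation.
        registry.complete_conversation(f"conv-{i}")

leaked = len(registry.subscriptions)
stale = registry.stale_conversations(now=1000, max_age=500)
print(f"open subscriptions: {leaked}, older than threshold: {len(stale)}")
```

In this model every uncompleted conversation leaves one entry behind, so the backlog grows in step with the volume of traffic that never completes its conversations.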

Remediation and Prevention

Google engineers were alerted to the issue by an internal monitoring system on 28 July 2023 at 10:36 US/Pacific and immediately started an investigation. From the initial analysis, Google engineers found that the majority of errors were from one data center. At 11:25 US/Pacific, a mitigation was attempted by directing the traffic away from that data center. This attempt did not resolve the errors; instead, the issue moved to another data center. After further investigation, the engineering team identified that many threads were waiting for specific events to complete. Google engineers temporarily disabled those events at 13:10 US/Pacific. This showed signs of improvement in error rate and latency. However, around 30 minutes later, the issue reappeared.

Google engineers continued the investigation and identified a rollout from 17 July 2023 to be contributing to the issue and initiated a rollback at 14:20 US/Pacific. At 15:25 US/Pacific, the engineering team started noticing an improvement in error rate, and the customer impact was completely mitigated at 15:52 US/Pacific.
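For reference, the elapsed time from the start of the incident to each step in the timeline above can be worked out as follows. This is a small sketch that uses only the times stated in this report; the event labels are paraphrases, and all times are treated as same-day US/Pacific wall-clock times.

```python
from datetime import datetime

DAY = "2023-07-28"

# (label, HH:MM US/Pacific) pairs taken from the report text
events = [
    ("incident began", "10:17"),
    ("engineers alerted by monitoring", "10:36"),
    ("traffic directed away from one data center", "11:25"),
    ("waiting events temporarily disabled", "13:10"),
    ("rollback initiated", "14:20"),
    ("error rate started improving", "15:25"),
    ("customer impact fully mitigated", "15:52"),
]


def at(hhmm):
    # All events fall on the same day and in the same time zone,
    # so naive datetimes are enough for computing differences.
    return datetime.strptime(f"{DAY} {hhmm}", "%Y-%m-%d %H:%M")


start = at(events[0][1])
elapsed_minutes = {
    label: int((at(hhmm) - start).total_seconds() // 60)
    for label, hhmm in events
}

for label, minutes in elapsed_minutes.items():
    print(f"+{minutes // 60}h{minutes % 60:02d}m  {label}")
```

The last entry, 335 minutes, matches the stated duration of 5 hours, 35 minutes.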

Google is committed to preventing a repeat of this issue in the future and is completing the following actions:

  • Implement a fix to thread resource exhaustion triggered by queries that do not complete the conversation properly.
  • Implement safeguards to gradually roll out changes to limit the scope of impact.
  • Implement monitoring for queries that do not exit properly upon completion of conversations.
  • Improve traffic isolation by implementing separate resource pools for different types of API requests.
  • Improve traffic handling in the component that handles the API requests by implementing a safety mechanism that allows for graceful handling of overload conditions.
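The last two actions, separate resource pools and graceful handling of overload, correspond to general techniques often called bulkheading and load shedding. The following is a minimal sketch of those techniques under stated assumptions, not a description of Google's fix: the `Bulkhead` class, the `Overloaded` exception, and the pool sizes are hypothetical, and the pool names only borrow the API method names mentioned in this report.

```python
import threading


class Overloaded(Exception):
    """Raised when a pool is full and the request is shed (hypothetical)."""


class Bulkhead:
    """Caps concurrent work for one type of request."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, handler, *args):
        # Non-blocking acquire: when the pool is full, fail fast
        # instead of parking yet another thread in a queue.
        if not self._slots.acquire(blocking=False):
            raise Overloaded(f"{self.name} pool is full")
        try:
            return handler(*args)
        finally:
            self._slots.release()


# One pool per request type, so one saturated type cannot starve another.
pools = {
    "CreateConversation": Bulkhead("CreateConversation", max_concurrent=2),
    "AnalyzeContent": Bulkhead("AnalyzeContent", max_concurrent=2),
}

# Demo: two slow CreateConversation requests occupy their whole pool.
release = threading.Event()
both_running = threading.Barrier(3)  # two workers plus the main thread


def slow_handler():
    both_running.wait()  # signal that this worker holds a slot
    release.wait()       # stay busy until the demo lets go
    return "ok"


workers = [
    threading.Thread(target=pools["CreateConversation"].run, args=(slow_handler,))
    for _ in range(2)
]
for worker in workers:
    worker.start()
both_running.wait()  # both CreateConversation slots are now held

try:
    pools["CreateConversation"].run(lambda: "ok")
    shed = False
except Overloaded:
    shed = True  # the third request is rejected immediately

other_result = pools["AnalyzeContent"].run(lambda: "ok")  # unaffected pool

release.set()
for worker in workers:
    worker.join()

print(f"third CreateConversation shed: {shed}, AnalyzeContent: {other_result}")
```

Rejecting excess requests immediately keeps threads from piling up behind a slow dependency, and the per-type pools keep a backlog in one request type from consuming the capacity that other request types rely on.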

Detailed Description of Impact

On Friday, 28 July 2023 from 10:17 to 15:52 US/Pacific, requests to Dialogflow CX and Dialogflow ES experienced elevated DEADLINE_EXCEEDED errors and latency issues.

  • Around 9% of overall traffic to “global-dialogflow.googleapis.com” or “dialogflow.googleapis.com” was experiencing elevated error rates and latency.
  • Based on analysis of the previous day's requests, we estimate that around 25% of expected traffic to the “Global” location was served UNAVAILABLE errors.

Depending on the workload and query distribution, some customers may have noticed higher error rates than the rates provided above.


31 Jul 2023 09:39 PDT

Mini Incident Report

We apologize for the inconvenience this service disruption/outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support.

(All Times US/Pacific)

Incident Start: 28 July 2023 at 10:17

Incident End: 28 July 2023 at 15:52

Duration: 5 hours, 35 minutes

Affected Services and Features:

  • Dialogflow Customer Experience (Dialogflow CX)
  • Dialogflow Essentials (Dialogflow ES)

Regions/Zones: Global

Description:

Dialogflow CX and Dialogflow ES experienced elevated DEADLINE_EXCEEDED errors and latency for a duration of 5 hours, 35 minutes. From preliminary analysis, the root cause of the issue is a recent change to how Dialogflow interacts with messaging and internal authentication services. The issue was mitigated by rolling back this change.

Google will complete a full Incident Report in the following days that will provide a detailed root cause.

Customer Impact:

  • Affected customers experienced elevated DEADLINE_EXCEEDED errors and latency.
  • Around 45% of total requests to the “dialogflow.googleapis.com” endpoint experienced elevated error rates and latency.
  • Fewer than 10% of total requests to “[region]-dialogflow.googleapis.com” endpoints were affected.

28 Jul 2023 16:57 PDT

The issue with Dialogflow CX, Dialogflow ES has been resolved for all affected users as of Friday, 2023-07-28 16:56 US/Pacific.

We will publish an analysis of this incident once we have completed our internal investigation.

We thank you for your patience while we worked on resolving the issue.

28 Jul 2023 16:17 PDT

Summary: Elevated latency and memory usage in Dialogflow CX and Dialogflow ES

Description: Our Engineering team has completed the rollback across all regions and is currently validating for full recovery.

We will provide more information by Friday, 2023-07-28 17:20 US/Pacific.

Diagnosis: Customers are experiencing elevated latency and may observe deadline exceeded errors for Dialogflow products

Workaround: None at this time.

28 Jul 2023 15:41 PDT

Summary: Elevated latency and memory usage in Dialogflow CX and Dialogflow ES

Description: Our Engineering team rolled out a fix and is seeing improvements in error rate and latency.

We will provide more information by Friday, 2023-07-28 16:20 US/Pacific.

Diagnosis: Customers are experiencing elevated latency and may observe deadline exceeded errors for Dialogflow products

Workaround: None at this time.

28 Jul 2023 15:18 PDT

Summary: Elevated latency and memory usage in Dialogflow CX and Dialogflow ES

Description: Our Engineering team rolled out a potential fix and is closely monitoring to ensure error rates subside.

We will provide more information by Friday, 2023-07-28 15:50 US/Pacific.

Diagnosis: Customers are experiencing elevated latency and may observe deadline exceeded errors for Dialogflow products

Workaround: None at this time.

28 Jul 2023 15:02 PDT

Summary: Elevated latency and memory usage in Dialogflow CX and Dialogflow ES

Description: Mitigation work is currently underway by our engineering team.

We do not have an ETA for mitigation at this point.

We will provide more information by Friday, 2023-07-28 16:05 US/Pacific.

Diagnosis: Customers are experiencing elevated latency and may observe deadline exceeded errors for Dialogflow products

Workaround: None at this time.

28 Jul 2023 14:53 PDT

Summary: Elevated latency and memory usage in Dialogflow CX and Dialogflow ES

Description: We are experiencing an issue with Dialogflow CX, Dialogflow ES.

Our engineering team continues to investigate the issue.

We will provide an update by Friday, 2023-07-28 15:40 US/Pacific with current details.

We apologize to all who are affected by the disruption.

Diagnosis: Customers are experiencing elevated latency and may observe deadline exceeded errors for Dialogflow products

Workaround: None at this time.

28 Jul 2023 14:37 PDT

Summary: Elevated latency and memory usage in Dialogflow CX and Dialogflow ES

Description: We are experiencing an issue with Dialogflow CX, Dialogflow ES.

Our engineering team continues to investigate the issue.

We will provide an update by Friday, 2023-07-28 15:30 US/Pacific with current details.

We apologize to all who are affected by the disruption.

Diagnosis: Customers are experiencing elevated latency and may observe deadline exceeded errors for Dialogflow products

Workaround: None at this time.