Service Health
Incident affecting Dialogflow CX, Dialogflow ES
Elevated latency and memory usage in Dialogflow CX and Dialogflow ES
Incident began at 2023-07-28 10:17 and ended at 2023-07-28 15:52 (all times are US/Pacific).
Previously affected location(s)
Tokyo (asia-northeast1), Mumbai (asia-south1), Singapore (asia-southeast1), Sydney (australia-southeast1), Belgium (europe-west1), London (europe-west2), Frankfurt (europe-west3), Global, Montréal (northamerica-northeast1), Iowa (us-central1), South Carolina (us-east1), Oregon (us-west1)
Date | Time | Description | |
---|---|---|---|
| 3 Aug 2023 | 15:20 PDT | Incident Report. Summary: On Friday, 28 July 2023 from 10:17 to 15:52 US/Pacific, Dialogflow Customer Experience (Dialogflow CX) and Dialogflow Essentials (Dialogflow ES) experienced elevated DEADLINE_EXCEEDED and UNAVAILABLE errors and elevated latency for certain functions, for a duration of 5 hours, 35 minutes. High error rates were seen for the CreateConversation, CreateParticipant, and (Streaming)AnalyzeContent functions. Any customer applications or services dependent on Dialogflow “global-dialogflow.googleapis.com” or “dialogflow.googleapis.com” would have experienced service interruptions. Only the “Global” location was affected by the issue; no other locations were affected. To our Dialogflow customers whose applications were impacted during this outage, we sincerely apologize. This is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform’s availability. Root Cause: The root cause of the issue is exhaustion of thread and memory resources in the component that handles incoming and outgoing API requests. The exhaustion was triggered by recent code changes to improve the efficiency of interactions between Dialogflow API, Pub/Sub messaging, and internal authentication services, and by the unexpected behavior of some incoming queries. One of our largest customers began trying out a new, private feature in Dialogflow. Their client behaved in an unexpected way, exposing a potential resource exhaustion involving Cloud Pub/Sub subscriptions, which we use to keep conversation state synchronized between multiple platforms. When a conversation is created in Dialogflow, topic subscriptions are initialized. Once the conversation is completed, these subscriptions are cleared. As a result of the recent code change, customer projects that initiated conversations without completing them started to accumulate topic subscriptions, leading to a backlog. 
The resource exhaustion only became apparent after we received a large volume of traffic that exhibited this unusual behavior on 28 July 2023. This manifested in our servers as spiking memory usage and thread exhaustion. When the pool of available threads began to dwindle, the incident started impacting other customers as well. Remediation and Prevention: Google engineers were alerted to the issue by an internal monitoring system on 28 July 2023 at 10:36 US/Pacific and immediately started an investigation. From the initial analysis, Google engineers found that the majority of errors were from one data center. At 11:25 US/Pacific, a mitigation was attempted by directing the traffic away from that data center. This attempt did not resolve the errors; instead, the issue moved to another data center. After further investigation, the engineering team identified that many threads were waiting for specific events to complete. Google engineers temporarily disabled those events at 13:10 US/Pacific. This showed signs of improvement in error rate and latency. However, around 30 minutes later, the issue reappeared. Google engineers continued the investigation, identified a rollout from 17 July 2023 as contributing to the issue, and initiated a rollback at 14:20 US/Pacific. At 15:25 US/Pacific, the engineering team started noticing an improvement in error rate, and the customer impact was completely mitigated at 15:52 US/Pacific. Google is committed to preventing a repeat of this issue in the future and is completing the following actions:
Detailed Description of Impact: On Friday, 28 July 2023 from 10:17 to 15:52 US/Pacific, requests to Dialogflow CX and Dialogflow ES experienced elevated DEADLINE_EXCEEDED errors and latency issues.
Depending on the workload and query distribution, some customers may have noticed higher error rates than the rates provided above. |
| 31 Jul 2023 | 09:39 PDT | Mini Incident Report. We apologize for the inconvenience this service disruption/outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support. (All Times US/Pacific) Incident Start: 28 July 2023 at 10:17. Incident End: 28 July 2023 at 15:52. Duration: 5 hours, 35 minutes. Affected Services and Features:
Regions/Zones: Global Description: Dialogflow CX and Dialogflow ES experienced elevated DEADLINE_EXCEEDED errors and latency for a duration of 5 hours, 35 minutes. From preliminary analysis, the root cause of the issue is a recent change to how Dialogflow interacts with messaging and internal authentication services. The issue was mitigated by rolling back this change. Google will complete a full Incident Report in the following days that will provide a detailed root cause. Customer Impact:
|
| 28 Jul 2023 | 16:57 PDT | The issue with Dialogflow CX, Dialogflow ES has been resolved for all affected users as of Friday, 2023-07-28 16:56 US/Pacific. We will publish an analysis of this incident once we have completed our internal investigation. We thank you for your patience while we worked on resolving the issue. |
| 28 Jul 2023 | 16:17 PDT | Summary: Elevated latency and memory usage in Dialogflow CX and Dialogflow ES Description: Our Engineering team has completed the rollback across all regions and is currently validating full recovery. We will provide more information by Friday, 2023-07-28 17:20 US/Pacific. Diagnosis: Customers are experiencing elevated latency and may observe deadline exceeded errors for Dialogflow products Workaround: None at this time. |
| 28 Jul 2023 | 15:41 PDT | Summary: Elevated latency and memory usage in Dialogflow CX and Dialogflow ES Description: Our Engineering team rolled out a fix and is seeing improvements in error rate and latency. We will provide more information by Friday, 2023-07-28 16:20 US/Pacific. Diagnosis: Customers are experiencing elevated latency and may observe deadline exceeded errors for Dialogflow products Workaround: None at this time. |
| 28 Jul 2023 | 15:18 PDT | Summary: Elevated latency and memory usage in Dialogflow CX and Dialogflow ES Description: Our Engineering team rolled out a potential fix and is closely monitoring to ensure error rates subside. We will provide more information by Friday, 2023-07-28 15:50 US/Pacific. Diagnosis: Customers are experiencing elevated latency and may observe deadline exceeded errors for Dialogflow products Workaround: None at this time. |
| 28 Jul 2023 | 15:02 PDT | Summary: Elevated latency and memory usage in Dialogflow CX and Dialogflow ES Description: Mitigation work is currently underway by our engineering team. We do not have an ETA for mitigation at this point. We will provide more information by Friday, 2023-07-28 16:05 US/Pacific. Diagnosis: Customers are experiencing elevated latency and may observe deadline exceeded errors for Dialogflow products Workaround: None at this time. |
| 28 Jul 2023 | 14:53 PDT | Summary: Elevated latency and memory usage in Dialogflow CX and Dialogflow ES Description: We are experiencing an issue with Dialogflow CX, Dialogflow ES. Our engineering team continues to investigate the issue. We will provide an update by Friday, 2023-07-28 15:40 US/Pacific with current details. We apologize to all who are affected by the disruption. Diagnosis: Customers are experiencing elevated latency and may observe deadline exceeded errors for Dialogflow products Workaround: None at this time. |
| 28 Jul 2023 | 14:37 PDT | Summary: Elevated latency and memory usage in Dialogflow CX and Dialogflow ES Description: We are experiencing an issue with Dialogflow CX, Dialogflow ES. Our engineering team continues to investigate the issue. We will provide an update by Friday, 2023-07-28 15:30 US/Pacific with current details. We apologize to all who are affected by the disruption. Diagnosis: Customers are experiencing elevated latency and may observe deadline exceeded errors for Dialogflow products Workaround: None at this time. |
- All times are US/Pacific
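
The Root Cause section of the 3 Aug 2023 report describes subscriptions that are created when a conversation starts and cleared only when it completes, so conversations that are never completed accumulate subscriptions. The toy model below illustrates that accumulation. It is a minimal sketch, not Dialogflow or Pub/Sub code: `SubscriptionRegistry`, its methods, and the age-based `sweep` are all hypothetical names introduced here, and the sweep is only one assumed safeguard, not the remediation Google applied (the report states the fix was a rollback).

```python
from dataclasses import dataclass, field


@dataclass
class SubscriptionRegistry:
    """Hypothetical toy model: one subscription per open conversation."""

    max_age_seconds: float
    # Maps conversation id -> time the subscription was opened.
    _opened_at: dict = field(default_factory=dict, init=False)

    def create_conversation(self, conversation_id: str, now: float) -> None:
        # Creating a conversation initializes a subscription.
        self._opened_at[conversation_id] = now

    def complete_conversation(self, conversation_id: str) -> None:
        # Completing a conversation clears its subscription.
        self._opened_at.pop(conversation_id, None)

    def sweep(self, now: float) -> int:
        # Assumed safeguard: drop subscriptions older than max_age_seconds.
        stale = [cid for cid, opened in self._opened_at.items()
                 if now - opened > self.max_age_seconds]
        for cid in stale:
            del self._opened_at[cid]
        return len(stale)

    def open_count(self) -> int:
        return len(self._opened_at)


registry = SubscriptionRegistry(max_age_seconds=60.0)
for i in range(1000):
    registry.create_conversation(f"conv-{i}", now=0.0)
for i in range(100):
    # Only 100 of the 1000 conversations are ever completed.
    registry.complete_conversation(f"conv-{i}")

leaked = registry.open_count()      # subscriptions left behind
swept = registry.sweep(now=120.0)   # removed by the age-based sweep
remaining = registry.open_count()
print(leaked, swept, remaining)
```

Without the sweep, the count of open subscriptions grows with every abandoned conversation, which is the backlog pattern the report describes.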