Service Health
Incident affecting Google Kubernetes Engine
Customers may see unexpected additional messages in GKE cluster logs.
Incident began at 2023-07-06 07:13 and ended at 2023-07-18 07:27 (all times are US/Pacific).
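As a quick check on the reported incident duration, the start and end times above can be subtracted directly. This is a minimal sketch that assumes both timestamps are in Pacific Daylight Time (UTC-7), which holds for July 2023, so a fixed offset is used instead of a time zone database.

```python
from datetime import datetime, timedelta, timezone

# Assumption: both timestamps fall within Pacific Daylight Time (UTC-7),
# with no DST transition in between, so a fixed offset is sufficient.
PDT = timezone(timedelta(hours=-7), "PDT")

incident_start = datetime(2023, 7, 6, 7, 13, tzinfo=PDT)
incident_end = datetime(2023, 7, 18, 7, 27, tzinfo=PDT)

duration = incident_end - incident_start
print(duration)  # 12 days, 0:14:00
```

The result matches the "12 days, 14 minutes" duration given in the incident report.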
Previously affected location(s)
Taiwan (asia-east1), Hong Kong (asia-east2), Tokyo (asia-northeast1), Osaka (asia-northeast2), Seoul (asia-northeast3), Mumbai (asia-south1), Delhi (asia-south2), Singapore (asia-southeast1), Jakarta (asia-southeast2), Sydney (australia-southeast1), Melbourne (australia-southeast2), Warsaw (europe-central2), Finland (europe-north1), Madrid (europe-southwest1), Belgium (europe-west1), Turin (europe-west12), London (europe-west2), Frankfurt (europe-west3), Netherlands (europe-west4), Zurich (europe-west6), Milan (europe-west8), Paris (europe-west9), Doha (me-central1), Tel Aviv (me-west1), Montréal (northamerica-northeast1), Toronto (northamerica-northeast2), São Paulo (southamerica-east1), Santiago (southamerica-west1), Iowa (us-central1), South Carolina (us-east1), Northern Virginia (us-east4), Columbus (us-east5), Dallas (us-south1), Oregon (us-west1), Los Angeles (us-west2), Salt Lake City (us-west3), Las Vegas (us-west4)
| Date | Time | Description |
|---|---|---|
| 18 Jul 2023 | 10:49 PDT | Mini Incident Report: We apologize for the inconvenience this service disruption/outage may have caused. We would like to provide some information about this incident below. Please note that this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Cloud Support using https://cloud.google.com/support. Incident Start: 06 July 2023 at 07:13. Incident End: 18 July 2023 at 07:27. Duration: 12 days, 14 minutes. Affected Services and Features: Google Kubernetes Engine - Logging. Regions/Zones: Global. Description: Google Kubernetes Engine (GKE) experienced an issue that added unexpected log messages to GKE cluster logs. The extra log messages consumed additional Logging API write quota, and a subset of customers exhausted their quota. The issue affected both Standard and Autopilot clusters on GKE version 1.24.x and later. Preliminary analysis indicates that the root cause was the rollout of a new version of the component responsible for processing logs. During the outage, customers were given a workaround: request a Logging API quota increase and/or apply log exclusions to avoid storing the additional messages. The issue was mitigated by gradually rolling out a fix to the affected clusters. Customer Impact: Affected customers would have seen unexpected additional log messages in GKE cluster logs. The following are the Error/Warning log entries: |
| 18 Jul 2023 | 07:58 PDT | The issue with Google Kubernetes Engine has been resolved for all affected users as of Tuesday, 2023-07-18 07:27 US/Pacific. We thank you for your patience while we worked on resolving the issue. |
| 10 Jul 2023 | 16:25 PDT | Summary: Customers may see unexpected additional messages in GKE cluster logs. Description: Mitigation work is currently underway by our engineering team. We have identified and tested a fix. The rollout of the fix started on Monday, 2023-07-10, and is expected to be complete by 2023-07-18. We will continue to provide updates on any status changes. Customers may see unexpected additional messages in the logs of GKE clusters on version 1.24 and later. These additional messages should not have been added to the logs by GKE and can be safely ignored. Concerns about large workloads approaching quota have already been mitigated. The workloads processed by GKE clusters are not impacted. We will provide more information by Tuesday, 2023-07-18 13:00 US/Pacific. Diagnosis: Customers may see unexpected additional log messages in GKE cluster logs. Workaround: To avoid hitting the Logging API quota, customers can request a Logging API quota increase. To avoid storing these additional messages (which affects logging billing), customers can apply log exclusions using the following query: `resource.labels.container_name = "fluentbit-gke" OR LOG_ID("fluentbit") (jsonPayload.message =~ "^Failed to get record: decoder: failed to decode payload: EOF" OR jsonPayload.message =~ "^cannot parse .* after %L$" OR (jsonPayload.message =~ "^invalid time format" AND jsonPayload.plugin = "parser:glog"))` |
| 7 Jul 2023 | 16:12 PDT | Summary: Customers may see unexpected additional messages in GKE cluster logs. Description: Mitigation work is currently underway by our engineering team. We have identified a fix and are still working to test and roll it out. We do not have an ETA for rollout completion at this time; however, we will continue to provide updates on any status changes. Customers may see unexpected additional messages in the logs of GKE clusters on version 1.24 and later. These additional messages should not have been added to the logs by GKE and can be safely ignored. For very large workloads (i.e., thousands of nodes per project), these logs may exceed the Logging API quota and cause other log messages to be dropped. The workloads processed by GKE clusters are not impacted. We will provide more information by Monday, 2023-07-10 17:00 US/Pacific. Diagnosis: Customers may see unexpected additional log messages in GKE cluster logs. Workaround: To avoid hitting the Logging API quota, customers can request a Logging API quota increase. To avoid storing these additional messages (which affects logging billing), customers can apply log exclusions using the following query: `resource.labels.container_name = "fluentbit-gke" OR LOG_ID("fluentbit") (jsonPayload.message =~ "^Failed to get record: decoder: failed to decode payload: EOF" OR jsonPayload.message =~ "^cannot parse .* after %L$" OR (jsonPayload.message =~ "^invalid time format" AND jsonPayload.plugin = "parser:glog"))` |
| 7 Jul 2023 | 12:45 PDT | Summary: Customers may see unexpected additional messages in GKE cluster logs. Description: Mitigation work is currently underway by our engineering team. We have identified a fix and are working to roll it out. We do not have an ETA for rollout completion at this time; however, we will continue to provide updates on any status changes. Customers may see unexpected additional messages in the logs of GKE clusters on version 1.24 and later. These additional messages should not have been added to the logs by GKE and can be safely ignored. For very large workloads (i.e., thousands of nodes per project), these logs may exceed the Logging API quota and cause other log messages to be dropped. The workloads processed by GKE clusters are not impacted. We will provide more information by Friday, 2023-07-07 17:00 US/Pacific. Diagnosis: Customers may see unexpected additional log messages in GKE cluster logs. Workaround: To avoid hitting the Logging API quota, customers can request a Logging API quota increase. To avoid storing these additional messages (which affects logging billing), customers can apply log exclusions using the following query: `resource.labels.container_name = "fluentbit-gke" OR LOG_ID("fluentbit") (jsonPayload.message =~ "^Failed to get record: decoder: failed to decode payload: EOF" OR jsonPayload.message =~ "^cannot parse .* after %L$" OR (jsonPayload.message =~ "^invalid time format" AND jsonPayload.plugin = "parser:glog"))` |
| 6 Jul 2023 | 16:32 PDT | Summary: Customers may see unexpected additional messages in GKE cluster logs. Description: Mitigation work is currently underway by our engineering team. We have identified a fix and are working to roll it out. We do not have an ETA for rollout completion at this time; however, we will continue to provide updates on any status changes. Customers may see unexpected additional messages in the logs of GKE clusters on version 1.24 and later. These additional messages should not have been added to the logs by GKE and can be safely ignored. For very large workloads (i.e., thousands of nodes per project), these logs may exceed the Logging API quota and cause other log messages to be dropped. The workloads processed by GKE clusters are not impacted. We will provide more information by Friday, 2023-07-07 13:00 US/Pacific. Diagnosis: Customers may see unexpected additional log messages in GKE cluster logs. Workaround: To avoid hitting the Logging API quota, customers can request a Logging API quota increase. To avoid storing these additional messages (which affects logging billing), customers can apply log exclusions using the following query: `resource.labels.container_name = "fluentbit-gke" OR LOG_ID("fluentbit") (jsonPayload.message =~ "^Failed to get record: decoder: failed to decode payload: EOF" OR jsonPayload.message =~ "^cannot parse .* after %L$" OR (jsonPayload.message =~ "^invalid time format" AND jsonPayload.plugin = "parser:glog"))` |
| 6 Jul 2023 | 11:24 PDT | Summary: Customers may see unexpected additional messages in GKE cluster logs. Description: We are experiencing an issue with Google Kubernetes Engine. Our engineering team continues to investigate the issue. Customers may see unexpected additional messages in GKE cluster logs. These additional messages should not be added to the logs by GKE and can be ignored. The workloads processed by GKE clusters are not impacted. For very large workloads (using thousands of nodes per project), this may cause the Logging API quota to be reached and, as a result, some other log messages to be lost. This issue may impact GKE clusters on version 1.24 and later. We will provide an update by Thursday, 2023-07-06 17:00 US/Pacific with current details. Diagnosis: Customers may see unexpected additional log messages in GKE cluster logs. Workaround: To avoid hitting the Logging API quota, customers can request a Logging API quota increase. To avoid storing these additional messages (which affects logging billing), customers can apply log exclusions using the following query: `resource.labels.container_name = "fluentbit-gke" OR LOG_ID("fluentbit") (jsonPayload.message =~ "^Failed to get record: decoder: failed to decode payload: EOF" OR jsonPayload.message =~ "^cannot parse .* after %L$" OR (jsonPayload.message =~ "^invalid time format" AND jsonPayload.plugin = "parser:glog"))` |
| 6 Jul 2023 | 09:13 PDT | Summary: Google is investigating issues with GKE. Customers may see unexpected additional messages in GKE cluster logs. Description: We are experiencing an issue with Google Kubernetes Engine. Our engineering team continues to investigate the issue. Customers may see unexpected additional messages in GKE cluster logs. These additional messages should not be added to the logs and can be ignored. The workloads processed by GKE clusters are not impacted. For very large workloads (using thousands of nodes per project), this may cause the Logging API quota to be reached and, as a result, some other log messages to be lost. This issue may impact GKE clusters on version 1.24 and later. We will provide an update by Thursday, 2023-07-06 12:00 US/Pacific with current details. Diagnosis: Customers may see unexpected additional log entries in GKE cluster logs. Workaround: To avoid hitting the Logging API quota, customers can request a Logging API quota increase. To avoid storing these additional messages (which affects billing), customers can apply log exclusions to prevent storage of excessive logs by using the following query: `resource.labels.container_name = "fluentbit-gke" OR LOG_ID("fluentbit") (jsonPayload.message =~ "^Failed to get record: decoder: failed to decode payload: EOF" OR jsonPayload.message =~ "^cannot parse .* after %L$" OR (jsonPayload.message =~ "^invalid time format" AND jsonPayload.plugin = "parser:glog"))` |
| 6 Jul 2023 | 08:30 PDT | Summary: Google is investigating issues with GKE. Customers may see increased logging in their GKE clusters. Description: We are experiencing an issue with Google Kubernetes Engine. Our engineering team continues to investigate the issue. We will provide an update by Thursday, 2023-07-06 09:00 US/Pacific with current details. Diagnosis: Customers may see additional log entries in their GCP projects containing GKE clusters. Workaround: Customers can apply log exclusions to prevent storage of excessive logs by using the following query: `resource.labels.container_name = "fluentbit-gke" OR LOG_ID("fluentbit") (jsonPayload.message =~ "^Failed to get record: decoder: failed to decode payload: EOF" OR jsonPayload.message =~ "^cannot parse .* after %L$" OR (jsonPayload.message =~ "^invalid time format" AND jsonPayload.plugin = "parser:glog"))` |
| 6 Jul 2023 | 07:43 PDT | Summary: Google is investigating issues with GKE. Customers may see increased logging in their GKE clusters. Description: We are experiencing an issue with Google Kubernetes Engine. Our engineering team continues to investigate the issue. We will provide an update by Thursday, 2023-07-06 09:05 US/Pacific with current details. Diagnosis: Customers may see additional logs written to their GCP projects containing GKE clusters. Workaround: Customers can apply log exclusions to prevent storage of excessive logs by using the following query: `LOG_ID("fluentbit") (jsonPayload.message =~ "^cannot parse .* after %L$" OR (jsonPayload.message =~ "^invalid time format" AND jsonPayload.plugin = "parser:glog"))` |
- All times are US/Pacific
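The exclusion query in the workaround can be hard to read because it relies on operator precedence. In the Logging query language, OR binds more tightly than AND, including the implicit AND between adjacent expressions. The query therefore reads as "(container is `fluentbit-gke` OR log ID is `fluentbit`) AND (one of the message conditions)". The sketch below is a local approximation for checking which entries the published filter would drop. It is not Cloud Logging's evaluator, and it makes several assumptions: entries are plain dicts shaped like `LogEntry` JSON, `LOG_ID("fluentbit")` is approximated by a `logName` suffix check, and Python's `re` stands in for RE2. The helper names (`is_fluentbit_entry`, `is_noisy_message`, `would_be_excluded`) are hypothetical.

```python
import re

# Message patterns copied from the published exclusion query.
NOISY_MESSAGE_PATTERNS = [
    r"^Failed to get record: decoder: failed to decode payload: EOF",
    r"^cannot parse .* after %L$",
]
# This pattern only counts when the plugin is "parser:glog".
INVALID_TIME_PATTERN = r"^invalid time format"


def is_fluentbit_entry(entry: dict) -> bool:
    """Approximates: resource.labels.container_name = "fluentbit-gke" OR LOG_ID("fluentbit")."""
    container = entry.get("resource", {}).get("labels", {}).get("container_name")
    # Assumption: LOG_ID("fluentbit") matches a logName ending in /logs/fluentbit.
    log_name = entry.get("logName", "")
    return container == "fluentbit-gke" or log_name.endswith("/logs/fluentbit")


def is_noisy_message(entry: dict) -> bool:
    """Approximates the parenthesized jsonPayload.message conditions."""
    payload = entry.get("jsonPayload", {})
    message = payload.get("message", "")
    # =~ is a regex match anywhere in the field; re.search mirrors that here.
    if any(re.search(pattern, message) for pattern in NOISY_MESSAGE_PATTERNS):
        return True
    return bool(re.search(INVALID_TIME_PATTERN, message)) and payload.get("plugin") == "parser:glog"


def would_be_excluded(entry: dict) -> bool:
    # OR binds tighter than the implicit AND, so the query is (A OR B) AND (C).
    return is_fluentbit_entry(entry) and is_noisy_message(entry)


if __name__ == "__main__":
    sample = {
        "logName": "projects/example-project/logs/fluentbit",
        "jsonPayload": {"message": "cannot parse '2023-07-06' after %L"},
    }
    print(would_be_excluded(sample))  # True
```

The sketch follows the broader query published from 08:30 onward. The 07:43 version omitted both the `container_name` clause and the EOF pattern.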