Description
What happened?
A combination of software releases in the Beam dependency chain has surfaced a failure mode that might cause pipelines to get stuck without explanation. The issue affects Apache Beam 2.55.0 and 2.55.1, but may also affect other SDKs when the pipeline runtime environment has google-api-core version 2.17.0 or above AND a grpcio version in the range 1.59.0<=grpcio<=1.62.1.
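The version conditions above can be checked mechanically. The following is a minimal sketch, not an official Beam tool: the helper names `parse_version`, `is_affected`, and `installed_versions` are made up for illustration, and the parser only looks at the numeric major.minor.patch part of a version string, ignoring any pre-release suffix. To be meaningful it has to run in the pipeline runtime environment (for example, inside the worker container), not only on the machine that submits the job.

```python
import re
from importlib import metadata


def parse_version(version):
    """Return (major, minor, patch) from a version string such as '1.62.1'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def is_affected(grpcio_version, api_core_version):
    """True when both versions fall in the ranges named in this issue."""
    grpcio_bad = (1, 59, 0) <= parse_version(grpcio_version) <= (1, 62, 1)
    api_core_bad = parse_version(api_core_version) >= (2, 17, 0)
    return grpcio_bad and api_core_bad


def installed_versions():
    """Look up installed versions; a missing package maps to None."""
    versions = {}
    for name in ("grpcio", "google-api-core"):
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    found = installed_versions()
    if None in found.values():
        print("grpcio or google-api-core is not installed:", found)
    else:
        affected = is_affected(found["grpcio"], found["google-api-core"])
        print(found, "affected:", affected)
```

For example, `is_affected("1.62.1", "2.17.0")` is true, while `is_affected("1.62.2", "2.17.0")` and `is_affected("1.60.0", "2.16.2")` are both false.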
Symptoms
Beam pipelines might get stuck. Dataflow jobs might have errors like:
- Unable to retrieve status info from SDK harness
- There are 10 consecutive failures obtaining SDK worker status info, SDK worker appears to be permanently unresponsive. Aborting the SDK.
Mitigation
Upgrade to Apache Beam 2.56.0 or above once it is available. Until then, install any one of the following dependency combinations in the Beam pipeline runtime environment:
- upgrade grpcio to version 1.62.2 or above, OR
- downgrade grpcio and grpcio-status to 1.58.0 or below, OR
- downgrade google-api-core to version 2.16.2 or below
You can define dependencies in the pipeline runtime environment using a --requirements_file pipeline option or other options outlined in https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/.
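As an illustration of the --requirements_file route, here is a small sketch that writes a requirements file containing one of the mitigations (grpcio 1.62.2 or above) and builds the matching pipeline argument. The helper names `write_requirements` and `pipeline_args`, the file name `requirements.txt`, and the choice of this particular pin are assumptions made for the example. Any of the listed combinations could be pinned instead.

```python
from pathlib import Path

# One of the mitigations listed in this issue: grpcio 1.62.2 or newer.
MITIGATION_PINS = ["grpcio>=1.62.2"]


def write_requirements(path, pins=MITIGATION_PINS):
    """Write the pins to a requirements file and return its path."""
    path = Path(path)
    path.write_text("\n".join(pins) + "\n")
    return path


def pipeline_args(requirements_path, extra_args=()):
    """Build the argument list to hand to the pipeline's option parser."""
    return [f"--requirements_file={requirements_path}", *extra_args]


if __name__ == "__main__":
    req = write_requirements("requirements.txt")
    # Hypothetical invocation: other required options are omitted here.
    print(pipeline_args(req, ["--runner=DataflowRunner"]))
```

The resulting list can be passed to `apache_beam.options.pipeline_options.PipelineOptions`, or the same flag can be given on the command line when launching the pipeline.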
Users of Apache Beam 2.55.0 might be able to avoid the issue by downgrading to apache-beam==2.54.0, since the default containers for that version's runtime environment have a set of dependencies that does not trigger the bug.
Root cause
The issue was caused by a regression introduced in grpcio==1.59.0 (grpc/grpc#36265), which has now been fixed in grpcio==1.62.2 and above. The regression triggers the failure mode when grpcio is used with google-api-core==2.17.0 and above.
Description updated: 2023-04-23.
Original description:
Updating the Python grpcio dependency to version 1.62.1 caused Dataflow jobs to stall, with excessive waits for responses in a gRPC multi-threaded rendezvous, probably somewhere in the SDK worker. An upstream issue exists here: grpc/grpc#36256
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components
- Component: Python SDK
- Component: Java SDK
- Component: Go SDK
- Component: Typescript SDK
- Component: IO connector
- Component: Beam YAML
- Component: Beam examples
- Component: Beam playground
- Component: Beam katas
- Component: Website
- Component: Spark Runner
- Component: Flink Runner
- Component: Samza Runner
- Component: Twister2 Runner
- Component: Hazelcast Jet Runner
- Component: Google Cloud Dataflow Runner