KEMBAR78
[Bug]: [Python SDK] Memory leak in 2.47.0 - 2.51.0 SDKs. · Issue #28246 · apache/beam · GitHub
Skip to content

[Bug]: [Python SDK] Memory leak in 2.47.0 - 2.51.0 SDKs.  #28246

@tvalentyn

Description

@tvalentyn

What happened?

We have identified a memory leak that affects Beam Python SDK versions 2.47.0 and above. The leak was triggered by an upgrade to protobuf==4.x.x. We rootcaused this leak to protocolbuffers/protobuf#14571 and it has been remediated in Beam 2.52.0.

[update: 2023-12-19]: Due to another issue related to protobuf upgrade, Python streaming users should continue to apply the mitigation steps below with Beam 2.52.0 or switch to Beam 2.53.0 once available.

Mitigation

Until Beam 2.52.0 is released, consider any of the following workarounds:

  • Use apache-beam==2.46.0 or below.

  • Install protobuf 3.x in the submission and runtime environment. For example, you can use a --requirements_file pipeline option with a file that includes:

    protobuf==3.20.3
    grpcio-status==1.48.2
    

    For more information, see: https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/

  • Use a python implementation of protobuf by setting a PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python environment variable in the runtime environment. This might degrade the performance since python implementation is less efficient. For example, you could create a custom Beam SDK container from a Dockerfile that looks like the following:

    FROM apache/beam_python3.10_sdk:2.47.0
    ENV PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
    

    For more information, see: https://beam.apache.org/documentation/runtime/environments/

  • Install protobuf==4.25.0 or newer in the submission and runtime environment.

Users of Beam 2.50.0 SDK should additionally follow mitigation options for #28318.

Additional details

The leak can be reproduced by a pipeline:

  with beam.Pipeline(options=pipeline_options) as p:
    # duplicate reads to increase throughput
    inputs = []
    for i in range(32):
      inputs.append(
          p | f"Read pubsub{i}" >> ReadFromPubSub(topic='projects/pubsub-public-data/topics/taxirides-realtime', with_attributes=True)
      )

    inputs | beam.Flatten()

Dataflow pipeline options for the above pipeline: --max_num_workers=1 --autoscaling_algorithm=NONE --worker_machine_type=n2-standard-32

The leak was triggered by Beam switching default protobuf package version from 3.19.x to 4.22.x in #24599. The new versions of protobuf also switched the default protobuf implemetation to a upb implementation. The upb implementation had two known leaks that have since been mitigated by protobuf team in: protocolbuffers/protobuf#10088, https://github.com/protocolbuffers/upb/issues/1243 . The latest available protobuf==4.24.4 does not yet have the fix, but we have confirmed that using a patched version built in https://github.com/protocolbuffers/upb/actions/runs/6028136812 fixes the leak.

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner

Metadata

Metadata

Assignees

No one assigned

    Labels

    P2bugdone & doneIssue has been reviewed after it was closed for verification, followups, etc.python

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions