
Windows heap corruption caused by upgrading protobuf #53234

@meteorcloudy

System information

Describe the problem

This is a summary of the Windows heap corruption bug discovered when upgrading protobuf in #52853 (comment).

In the PR, we tried to upgrade protobuf from 3.9.2 to 3.19.0, but some tf optimizer-related tests fail on Windows with:

Windows fatal exception: code 0xc0000374

The error code (STATUS_HEAP_CORRUPTION) indicates heap corruption. After some investigation, we discovered that the problem is caused by a combination of two things:

  • How protobuf is linked in TensorFlow on Windows
    On Windows, most of the TF C++ code is linked into _pywrap_tensorflow_internal.pyd, and some extension code is linked into individual pyd files, like _pywrap_tf_optimizer.pyd. The latter is also linked against the former. However, the protobuf library is statically linked into both dynamic libraries, so the process ends up with two copies of the protobuf runtime.
  • How protobuf works
    Basically, a proto destructor decides whether to delete a string by comparing its address with the address of a global default string. When there are two protobuf runtimes, there are two default strings with different addresses. When objects from the two runtimes are mixed, the comparison fails for a string that is actually the other runtime's global default, so that global gets deleted by accident.
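
The address comparison described above can be illustrated with a small C++ sketch. This is not protobuf's actual code: Runtime, Message, WouldDelete and CrossRuntimeMismatch are hypothetical names made up for illustration, each Runtime object stands in for one statically linked copy of protobuf, and nothing is really deleted, so the sketch is safe to run.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for one statically linked copy of the protobuf
// runtime. Each copy carries its own "global" default string, which lives
// at its own address.
struct Runtime {
  std::string default_string;

  // Mirrors the destructor check described above: a string field is treated
  // as heap-owned (and would be deleted) unless it points at THIS runtime's
  // default string.
  bool WouldDelete(const std::string* field) const {
    return field != &default_string;
  }
};

// Hypothetical message with one string field that points at a default
// string until a real value is set.
struct Message {
  const std::string* field;
};

// Returns true when a message created by one runtime would have its default
// string "deleted" by the other runtime, i.e. the mismatch that corrupts the
// heap in the real build.
bool CrossRuntimeMismatch() {
  Runtime internal_pyd;   // copy linked into _pywrap_tensorflow_internal.pyd
  Runtime optimizer_pyd;  // copy linked into _pywrap_tf_optimizer.pyd

  Message msg{&internal_pyd.default_string};

  // The creating runtime recognizes its own default and leaves it alone.
  bool same_runtime_deletes = internal_pyd.WouldDelete(msg.field);
  // The other runtime sees an unknown address and would call delete on it.
  bool other_runtime_deletes = optimizer_pyd.WouldDelete(msg.field);

  return !same_runtime_deletes && other_runtime_deletes;
}
```

In this sketch CrossRuntimeMismatch() returns true: the same pointer is harmless when checked by the runtime that created it, but looks like owned heap memory to the second runtime.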

@acozzette also explained in #52853 (comment) why protobuf doesn't work well when linked in multiple places.

Provide the exact sequence of commands / steps that you executed before running into the problem

To reproduce the original error in a full build:

git clone https://github.com/meteorcloudy/tensorflow.git
cd tensorflow
git fetch origin upgrade_protobuf_grpc
configure
bazel test --announce_rc --config=opt tensorflow/python/grappler:memory_optimizer_test --test_arg=MemoryOptimizerSwapTest.testNoSwapping

However, the full TF build isn't debuggable, so I have constructed some smaller targets to imitate the situation.
To build the minimal reproduction case in a debug build:

git clone https://github.com/meteorcloudy/tensorflow.git
cd tensorflow
git fetch origin reproduce_win_heap_corruption_2
configure
bazel run --announce_rc --config=dbg tensorflow/python/grappler:tf_optimizer_wrapper_bin

When debugging in Visual Studio, you should be able to see the following:
[image: Visual Studio debugger screenshot]

Possible Solution

To properly solve this problem, we probably need to migrate TF to cc_shared_library, which makes dynamic linking more controllable.
