Description
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows
- TensorFlow version: At PR #52853 ("Upgrade grpc to 1.41.1 and protobuf to 3.19.0")
- Bazel version (if compiling from source): 3.7.2
- GCC/Compiler version (if compiling from source): MSVC
Describe the problem
This is a summary of the Windows heap corruption bug discovered when upgrading protobuf in #52853 (comment).
In the PR, we tried to upgrade protobuf from 3.9.2 to 3.19.0, but some TF optimizer-related tests fail on Windows with:
Windows fatal exception: code 0xc0000374
The error code indicates heap corruption. After some investigation, we found that the problem is caused by the combination of:
- How protobuf is linked in TensorFlow on Windows
On Windows, most of the TF C++ code is linked into _pywrap_tensorflow_internal.pyd, and some extension code is linked into individual pyd files, like _pywrap_tf_optimizer.pyd. The latter is also linked against the former. However, the protobuf library is statically linked into both dynamic libraries.
- How protobuf works
Basically, protobuf decides whether to delete a string in a proto destructor by comparing its address with the address of a global default string. When there are two protobuf runtimes, there are two default strings with different addresses, so memory can be deleted by mistake when objects from the two runtimes are mixed.
@acozzette also explained why protobuf doesn't work well when linked in multiple places at #52853 (comment)
Provide the exact sequence of commands / steps that you executed before running into the problem
To reproduce the original error in a full build:
git clone https://github.com/meteorcloudy/tensorflow.git
cd tensorflow
git fetch origin upgrade_protobuf_grpc
configure
bazel test --announce_rc --config=opt tensorflow/python/grappler:memory_optimizer_test --test_arg=MemoryOptimizerSwapTest.testNoSwapping
However, the full TF build isn't debuggable, so I have constructed some smaller targets to imitate the situation.
To build the minimal reproduction case in a debug build:
git clone https://github.com/meteorcloudy/tensorflow.git
cd tensorflow
git fetch origin reproduce_win_heap_corruption_2
configure
bazel run --announce_rc --config=dbg tensorflow/python/grappler:tf_optimizer_wrapper_bin
When debugging in Visual Studio, you should be able to see the following:

Possible Solution
To properly solve this problem, we probably need to migrate TF to cc_shared_library, which makes dynamic linking more controllable and would let protobuf be linked in only one place.