KEMBAR78
faster shut down by snnn · Pull Request #24891 · microsoft/onnxruntime · GitHub
Skip to content

Conversation

@snnn
Copy link
Member

@snnn snnn commented May 28, 2025

This PR is to reduce the chance of having crashes when a process is shutting down. The main idea is: if we know the process is shutting down(or if we know the ORT DLL won't be reloaded), we do not need to run C++ object destructors. The approach is recommended by Windows SDK's official document and many Windows developers. For example, 18 years ago Raymond Chen wrote a blog The old-fashioned theory on how processes exit. This is ORT's current behavior. Raymond Chen also wrote a blog what a better approach is

In our case, when onnxruntime is built as a python package, the DLL(onnxruntime_pybind11_state.pyd ) will never be manually unloaded. Same on Linux. Python does not unload the DLLs on exit. Therefore, we do not need to worry about the potential memory leaks caused by any global variables. Therefore we do not need to call OrtEnv's destructors, and we do not need to unload any EP dlls.

In most cases, people do not unload DLLs on Windows. And, on Linux it is even more complicated because GCC needs to maintain a unique table to avoid odr violations, and this feature makes most C++ shared library cannot be unloaded.

So, this change detects if the os is Windows and if the process is shutdown when calling destructors. If yes, the destructor will do nothing.

After this change on Windows in most cases OrtEnv will not be destroyed. The only exception is: if someone manually load the DLL and manually unload the DLL, and also do not have a global threadpool. Then I think the user is an advanced user and they should know that they need to destroy all inference session objects and the OrtEnv singleton object before unloading the DLL. Besides, if they have enabled global thread pool, the DLL won't be unloaded if they haven't shutdown the thread pool and delete the OrtEnv object. And, even if the user has manually loaded/unloaded the DLL, there would still be memory leaks(that are not related to this change). It's hard to get 100% clean.

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can commit the suggested changes from lintrunner.

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can commit the suggested changes from lintrunner.

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can commit the suggested changes from lintrunner.

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can commit the suggested changes from lintrunner.

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can commit the suggested changes from lintrunner.

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can commit the suggested changes from lintrunner.

@snnn snnn marked this pull request as ready for review May 31, 2025 03:17
@snnn snnn requested review from skottmckay and tianleiwu June 2, 2025 17:19
@jywu-msft jywu-msft requested a review from adrianlizarraga June 2, 2025 17:53
tianleiwu
tianleiwu previously approved these changes Jun 2, 2025
Copy link
Contributor

@tianleiwu tianleiwu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

adrianlizarraga
adrianlizarraga previously approved these changes Jun 2, 2025
Copy link
Contributor

@adrianlizarraga adrianlizarraga left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"The building is being demolished. Don't bother sweeping the floor" 👍

@snnn snnn dismissed stale reviews from adrianlizarraga and tianleiwu via 100d864 June 3, 2025 18:52
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can commit the suggested changes from lintrunner.

@snnn snnn requested a review from skottmckay June 3, 2025 22:45
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can commit the suggested changes from lintrunner.

snnn and others added 3 commits June 4, 2025 17:51
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@snnn snnn force-pushed the snnn/shutdown branch from 8d65926 to 5f24f38 Compare June 5, 2025 01:42
@snnn snnn marked this pull request as draft June 5, 2025 04:39
@snnn snnn marked this pull request as ready for review June 5, 2025 23:00
@snnn snnn merged commit 494d356 into main Jun 6, 2025
88 checks passed
@snnn snnn deleted the snnn/shutdown branch June 6, 2025 02:41
khoover added a commit to khoover/onnxruntime that referenced this pull request Jun 9, 2025
javier-intel pushed a commit to intel/onnxruntime that referenced this pull request Jun 15, 2025
This PR is to reduce the chance of having crashes when a process is
shutting down. The main idea is: if we know the process is shutting
down(or if we know the ORT DLL won't be reloaded), we do not need to run
C++ object destructors. The approach is recommended by Windows SDK's
official document and many Windows developers. For example, 18 years ago
Raymond Chen wrote a blog [The old-fashioned theory on how processes
exit](https://devblogs.microsoft.com/oldnewthing/20070502-00/?p=27023).
This is ORT's current behavior. Raymond Chen also wrote a blog [what a
better approach
is](https://devblogs.microsoft.com/oldnewthing/20120105-00/?p=8683)

In our case, when onnxruntime is built as a python package, the
DLL(onnxruntime_pybind11_state.pyd ) will never be manually unloaded.
Same on Linux. Python does not unload the DLLs on exit. Therefore, we do
not need to worry about the potential memory leaks caused by any global
variables. Therefore, we do not need to call OrtEnv's destructors, and we
do not need to unload any EP dlls.

In most cases, people do not unload DLLs on Windows. And, on Linux it is
even more complicated because GCC needs to maintain a unique table to
avoid odr violations, and this feature makes most C++ shared library
cannot be unloaded.

So, this change detects if the os is Windows and if the process is
shutdown when calling destructors. If yes, the destructor will do
nothing.

After this change on Windows in most cases OrtEnv will not be destroyed.
The only exception is: if someone manually load the DLL and manually
unload the DLL, and also do not have a global threadpool. Then I think
the user is an advanced user and they should know that they need to
destroy all inference session objects and the OrtEnv singleton object
before unloading the DLL. Besides, if they have enabled global thread
pool, the DLL won't be unloaded if they haven't shutdown the thread pool
and delete the OrtEnv object. And, even if the user has manually
loaded/unloaded the DLL, there would still be some memory leaks(that are not
related to this change). It's hard to get 100% clean.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants