Support NUMA Binding for Callable Entrypoints, Take 2 #161183
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/161183
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 989679d with merge base 419a2db.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pdesupinski has imported this pull request. If you are a Meta employee, you can view this in D80724395.
5e0dc8c to ec92647 (Compare)
@pdesupinski has imported this pull request. If you are a Meta employee, you can view this in D80724395.
Pull Request Overview
This PR refactors NUMA binding support for both str and Callable entrypoints by replacing the previous numactl CLI approach with a simpler CPU affinity inheritance mechanism. Instead of creating temporary executable files with numactl commands, the parent process temporarily sets its own CPU affinity before spawning subprocesses, allowing child processes to inherit the correct NUMA bindings.
Key changes:
- Replaced `numactl` CLI wrapping with `os.sched_setaffinity()`-based CPU affinity management (see the sketch below)
- Introduced a context manager approach for temporarily applying NUMA bindings to the current process
- Unified the NUMA binding logic to work with both `str` and `Callable` entrypoints through inheritance
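As a rough illustration of the inheritance mechanism described above, a context manager along these lines can temporarily pin the parent so that children spawned inside the `with` block inherit the binding (a minimal sketch with hypothetical names; the actual helper in `torch/numa/binding.py` may differ):

```python
import os
from contextlib import contextmanager

@contextmanager
def _temporarily_bind_current_process(logical_cpus: set[int]):
    """Pin the current process to `logical_cpus`, then restore the original affinity."""
    original_affinity = os.sched_getaffinity(0)  # 0 == the current process
    os.sched_setaffinity(0, logical_cpus)
    try:
        # Any subprocess started here inherits `logical_cpus` as its affinity.
        yield
    finally:
        os.sched_setaffinity(0, original_affinity)
```

For example, the launcher can wrap the spawn call for local rank `i` in `with _temporarily_bind_current_process(cpus_for_rank_i): ...`, so the child starts with the intended binding while the parent's affinity is unchanged afterwards.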
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| torch/numa/binding.py | Core refactoring - replaced numactl CLI functions with CPU affinity context manager and logical CPU binding functions |
| torch/multiprocessing/spawn.py | Updated to use new context manager approach for NUMA binding inheritance during process spawning |
| torch/distributed/launcher/api.py | Updated condition to check for parallel start compatibility with NUMA bindings |
| torch/distributed/elastic/multiprocessing/subprocess_handler/subprocess_handler.py | Applied new NUMA binding context manager to subprocess creation |
| test/test_numa_binding.py | Comprehensive test updates to verify CPU affinity binding instead of numactl command generation |
| "Continuing executing without applying NUMA binding, despite exception %s", | ||
| traceback.format_exc(), | ||
| ) | ||
| return None |
Copilot AI, Aug 22, 2025
The function should not return None in the except block. This is a context manager that yields, so it should either re-raise the exception or yield without returning a value. Returning None will cause a TypeError when used as a context manager.
Suggested change:
```diff
-return None
+return
```
```python
    traceback.format_exc(),
)
return None
raise
```
Copilot AI, Aug 22, 2025
This raise statement is unreachable because the previous line returns None. The function flow will never reach this point due to the return statement on line 126.
Suggested change:
```diff
-raise
+else:
+    raise
```
```python
if numa_options.should_fall_back_if_binding_fails:
    logger.warning("Falling back to original command without NUMA bindings.")
    logger.warning(
        "Continuing executing without applying NUMA binding, despite exception %s",
```
Copilot AI, Aug 22, 2025
The word 'executing' should be 'execution' to be grammatically correct. The correct phrase should be 'Continuing execution without applying NUMA binding'.
| "Continuing executing without applying NUMA binding, despite exception %s", | |
| "Continuing execution without applying NUMA binding, despite exception %s", |
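To make the control flow discussed in the comments above concrete, here is a minimal sketch of the intended behavior under assumed names (not the exact PyTorch source): a binding failure should either be logged and ignored or re-raised, never both.

```python
import logging
import os
import traceback
from contextlib import contextmanager
from dataclasses import dataclass

logger = logging.getLogger(__name__)

@dataclass
class NumaOptions:
    should_fall_back_if_binding_fails: bool = True

@contextmanager
def _maybe_apply_numa_binding(logical_cpus: set[int], numa_options: NumaOptions):
    try:
        # May raise, e.g. on platforms without sched_setaffinity support.
        os.sched_setaffinity(0, logical_cpus)
    except Exception:
        if not numa_options.should_fall_back_if_binding_fails:
            raise
        logger.warning("Falling back to original command without NUMA bindings.")
        logger.warning(
            "Continuing execution without applying NUMA binding, despite exception %s",
            traceback.format_exc(),
        )
    yield
```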
LGTM
ec92647 to 93ce3a8 (Compare)
@pdesupinski has imported this pull request. If you are a Meta employee, you can view this in D80724395.
93ce3a8 to 0a0810d (Compare)
@pdesupinski has imported this pull request. If you are a Meta employee, you can view this in D80724395.
0a0810d to 989679d (Compare)
@pdesupinski has imported this pull request. If you are a Meta employee, you can view this in D80724395.
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)

Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
# Context

In #161183, we added NUMA-binding support for `Callable` entrypoints to `elastic_launch`. However, we would raise an exception if the subprocesses were spawned in parallel via `ThreadPoolExecutor`, which is an option configurable via the `TORCH_MP_PARALLEL_START` environment variable (see diff).

The logic here was that `os.sched_setaffinity`, which we used to set CPU affinities, is [per process](https://docs.python.org/3/library/os.html#os.sched_setaffinity), so there could be a race condition during a parallel start:

> Restrict the process with PID pid (or the current process if zero) to a set of CPUs. mask is an iterable of integers representing the set of CPUs to which the process should be restricted.

But on further reading, the Linux docs say [`sched_setaffinity` is per *thread*.](https://man7.org/linux/man-pages/man2/sched_setaffinity.2.html) As it turns out, the Python documentation is imprecise on this point. I [verified that `sched_setaffinity` only affects the calling thread, not the entire calling process.](https://gist.github.com/pdesupinski/7e2de3cbe5bb48d489f257b83ccddf07)

The upshot is that we actually *can* safely use the inheritance trick from #161183 even with parallel start, since the setting will be inherited from the calling thread, and `os.sched_setaffinity` only affects the calling thread.

# This PR

Remove restrictions against parallel start for NUMA binding.

Pull Request resolved: #161576
Approved by: https://github.com/d4l3k
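A quick way to check the per-thread behavior described above (a standalone, Linux-only sketch, separate from the gist linked in the commit message):

```python
import os
import threading

def show(label: str) -> None:
    # With pid=0, sched_getaffinity reports the affinity of the calling thread.
    print(label, sorted(os.sched_getaffinity(0)))

def restrict_worker_to_cpu0() -> None:
    os.sched_setaffinity(0, {0})  # on Linux this affects only the calling thread
    show("worker after setaffinity:")

show("main before:")
worker = threading.Thread(target=restrict_worker_to_cpu0)
worker.start()
worker.join()
show("main after:")  # unchanged, confirming the call did not affect the whole process
```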
# Context

In pytorch#160163, we added support for NUMA binding for `Callable` entrypoints to `elastic_launch`. This requires special consideration, because they go through a different path to spawn subprocesses compared to `str` entrypoints, a path which does not provide a straightforward way to utilize the `numactl` CLI. See pytorch#160006 for a full description of the challenges.

Although pytorch#160163 worked in initial local experiments, we ran into some linker errors in other environments when we tried to call `numactl`. This appeared to be due to interactions with how the `LD_PRELOAD` environment variable was being set.

# This PR

On further thought, the most straightforward, foolproof solution here is to use [the trick that @d4l3k suggested.](pytorch#160006 (comment)) Specifically, for each local rank `i`:

1. The parent process sets its own CPU affinity to what local rank `i`'s should be.
2. Then, the parent spawns the subprocess for local rank `i`.
3. Finally, the parent resets its own CPU affinity to what it was originally.

There were other solutions that would work just for `Callable` entrypoints, but I believe this is the simplest one that can work for *both* `str` and `Callable` entrypoints.

This required a bit of refactoring:

1. Turn all the `_get_.*_numactl_options` into functions which return a set of logical CPUs to bind to, rather than options like `--cpunodebind=0`.
2. Instead of wrapping commands with `numactl`, use `os.sched_setaffinity` to bind to the CPUs from (1.).
3. Put this all inside a context manager which encapsulates applying and restoring the bindings in the parent process.
4. Use the context manager for both `str` and `Callable` paths.

# Test Plan

## Automated

`$ pytest test/test_numa_binding.py`

## Manual

See [doc.](https://docs.google.com/document/d/1vxD-OKYBTT27jbBwtW9iz9g0tNM0u-i0tiTJg_ieQA8/edit?tab=t.0) Meta only, but TL;DR: tried out every combination of `str`, `Callable`, binding disabled, and binding enabled on the same model and saw 2x SM utilization for binding enabled.

Pull Request resolved: pytorch#161183
Approved by: https://github.com/d4l3k
Context

In #160163, we added support for NUMA binding for `Callable` entrypoints to `elastic_launch`. This requires special consideration, because they go through a different path to spawn subprocesses compared to `str` entrypoints, a path which does not provide a straightforward way to utilize the `numactl` CLI. See #160006 for a full description of the challenges.

Although #160163 worked in initial local experiments, we ran into some linker errors in other environments when we tried to call `numactl`. This appeared to be due to interactions with how the `LD_PRELOAD` environment variable was being set.

This PR

On further thought, the most straightforward, foolproof solution here is to use the trick that @d4l3k suggested. Specifically, for each local rank `i`:

1. The parent process sets its own CPU affinity to what local rank `i`'s should be.
2. Then, the parent spawns the subprocess for local rank `i`.
3. Finally, the parent resets its own CPU affinity to what it was originally (see the sketch after this list).
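A minimal sketch of those three steps using an assumed `multiprocessing` spawn context (illustrative only; the real integration lives in the spawn and subprocess-handler code paths touched by this PR):

```python
import multiprocessing as mp
import os

def _worker() -> None:
    # The child reports the affinity it inherited from the parent at spawn time.
    print("child affinity:", sorted(os.sched_getaffinity(0)))

def spawn_with_numa_binding(cpus_for_rank: set[int]):
    original_affinity = os.sched_getaffinity(0)
    os.sched_setaffinity(0, cpus_for_rank)          # 1. bind the parent to rank i's CPUs
    try:
        proc = mp.get_context("spawn").Process(target=_worker)
        proc.start()                                # 2. the child inherits the binding
    finally:
        os.sched_setaffinity(0, original_affinity)  # 3. restore the parent's affinity
    return proc

if __name__ == "__main__":
    spawn_with_numa_binding({0, 1}).join()
```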
There were other solutions that would work just for `Callable` entrypoints, but I believe this is the simplest one that can work for *both* `str` and `Callable`.

This required a bit of refactoring:

1. Turn all the `_get_.*_numactl_options` helpers into functions which return a set of logical CPUs to bind to, rather than options like `--cpunodebind=0` (one way to compute such a set is sketched after this list).
2. Instead of wrapping commands with `numactl`, use `os.sched_setaffinity` to bind to the CPUs from (1.).
3. Put this all inside a context manager which encapsulates applying and restoring the bindings in the parent process.
4. Use the context manager for both `str` and `Callable` paths.
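For illustration, one way such a function could derive the logical CPUs for a given NUMA node on Linux is by parsing sysfs (an assumption about the approach; the actual helpers may compute the set differently, e.g. taking GPU affinity into account):

```python
def _logical_cpus_for_numa_node(node_index: int) -> set[int]:
    """Parse a sysfs cpulist such as '0-31,64-95' into a set of logical CPU indices."""
    with open(f"/sys/devices/system/node/node{node_index}/cpulist") as f:
        cpulist = f.read().strip()
    cpus: set[int] = set()
    for chunk in cpulist.split(","):
        if "-" in chunk:
            start, end = chunk.split("-")
            cpus.update(range(int(start), int(end) + 1))
        else:
            cpus.add(int(chunk))
    return cpus
```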
Test Plan

Automated

`$ pytest test/test_numa_binding.py`

Manual

See doc (Meta only), but TL;DR: tried out every combination of `str`, `Callable`, binding disabled, and binding enabled on the same model and saw 2x SM utilization for binding enabled.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta
str,Callable, binding disabled, and binding enabled on the same model and saw 2x SM utilization for binding enabled.cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta