Support NUMA Binding for Callable Entrypoints, Take 2 #161183
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/161183
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 989679d with merge base 419a2db.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pdesupinski has imported this pull request. If you are a Meta employee, you can view this in D80724395.
5e0dc8c to ec92647 (Compare)
@pdesupinski has imported this pull request. If you are a Meta employee, you can view this in D80724395.
Pull Request Overview
This PR refactors NUMA binding support for both str and Callable entrypoints by replacing the previous numactl CLI approach with a simpler CPU affinity inheritance mechanism. Instead of creating temporary executable files with numactl commands, the parent process temporarily sets its own CPU affinity before spawning subprocesses, allowing child processes to inherit the correct NUMA bindings.
Key changes:
- Replaced `numactl` CLI wrapping with `os.sched_setaffinity()`-based CPU affinity management (see the sketch below)
- Introduced a context manager approach for temporarily applying NUMA bindings to the current process
- Unified the NUMA binding logic to work with both `str` and `Callable` entrypoints through inheritance
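As a rough illustration of the inheritance mechanism described above, a context manager along these lines can temporarily pin the parent so that children spawned inside the `with` block inherit the binding (a minimal sketch with hypothetical names; the actual helper in `torch/numa/binding.py` may differ):

```python
import os
from contextlib import contextmanager

@contextmanager
def _temporarily_bind_current_process(logical_cpus: set[int]):
    """Pin the current process to `logical_cpus`, then restore the original affinity."""
    original_affinity = os.sched_getaffinity(0)  # 0 == the current process
    os.sched_setaffinity(0, logical_cpus)
    try:
        # Any subprocess started here inherits `logical_cpus` as its affinity.
        yield
    finally:
        os.sched_setaffinity(0, original_affinity)
```

For example, the launcher can wrap the spawn call for local rank `i` in `with _temporarily_bind_current_process(cpus_for_rank_i): ...`, so the child starts with the intended binding while the parent's affinity is unchanged afterwards.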
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| torch/numa/binding.py | Core refactoring - replaced numactl CLI functions with CPU affinity context manager and logical CPU binding functions |
| torch/multiprocessing/spawn.py | Updated to use new context manager approach for NUMA binding inheritance during process spawning |
| torch/distributed/launcher/api.py | Updated condition to check for parallel start compatibility with NUMA bindings |
| torch/distributed/elastic/multiprocessing/subprocess_handler/subprocess_handler.py | Applied new NUMA binding context manager to subprocess creation |
| test/test_numa_binding.py | Comprehensive test updates to verify CPU affinity binding instead of numactl command generation |
| "Continuing executing without applying NUMA binding, despite exception %s", | ||
| traceback.format_exc(), | ||
| ) | ||
| return None |
Copilot AI, Aug 22, 2025
The function should not return None in the except block. This is a context manager that yields, so it should either re-raise the exception or yield without returning a value. Returning None will cause a TypeError when used as a context manager.
Suggested change:
```diff
-return None
+return
```
```python
    traceback.format_exc(),
)
return None
raise
```
Copilot AI, Aug 22, 2025
This raise statement is unreachable because the previous line returns None. The function flow will never reach this point due to the return statement on line 126.
Suggested change:
```diff
-raise
+else:
+    raise
```
```python
if numa_options.should_fall_back_if_binding_fails:
    logger.warning("Falling back to original command without NUMA bindings.")
    logger.warning(
        "Continuing executing without applying NUMA binding, despite exception %s",
```
Copilot AI, Aug 22, 2025
The word 'executing' should be 'execution' to be grammatically correct. The correct phrase should be 'Continuing execution without applying NUMA binding'.
| "Continuing executing without applying NUMA binding, despite exception %s", | |
| "Continuing execution without applying NUMA binding, despite exception %s", |
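To make the control flow discussed in the comments above concrete, here is a minimal sketch of the intended behavior under assumed names (not the exact PyTorch source): a binding failure should either be logged and ignored or re-raised, never both.

```python
import logging
import os
import traceback
from contextlib import contextmanager
from dataclasses import dataclass

logger = logging.getLogger(__name__)

@dataclass
class NumaOptions:
    should_fall_back_if_binding_fails: bool = True

@contextmanager
def _maybe_apply_numa_binding(logical_cpus: set[int], numa_options: NumaOptions):
    try:
        # May raise, e.g. on platforms without sched_setaffinity support.
        os.sched_setaffinity(0, logical_cpus)
    except Exception:
        if not numa_options.should_fall_back_if_binding_fails:
            raise
        logger.warning("Falling back to original command without NUMA bindings.")
        logger.warning(
            "Continuing execution without applying NUMA binding, despite exception %s",
            traceback.format_exc(),
        )
    yield
```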
LGTM
ec92647 to 93ce3a8 (Compare)
@pdesupinski has imported this pull request. If you are a Meta employee, you can view this in D80724395.
93ce3a8 to 0a0810d (Compare)
@pdesupinski has imported this pull request. If you are a Meta employee, you can view this in D80724395.
0a0810d to 989679d (Compare)
@pdesupinski has imported this pull request. If you are a Meta employee, you can view this in D80724395.
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)

Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
# Context

In #161183, we added NUMA-binding support for `Callable` entrypoints to `elastic_launch`. However, we would raise an exception if the subprocesses were spawned in parallel via `ThreadPoolExecutor`, which is an option configurable via the `TORCH_MP_PARALLEL_START` environment variable (see diff).

The logic here was that `os.sched_setaffinity`, which we used to set CPU affinities, is [per process](https://docs.python.org/3/library/os.html#os.sched_setaffinity), so there could be a race condition during a parallel start:

> Restrict the process with PID pid (or the current process if zero) to a set of CPUs. mask is an iterable of integers representing the set of CPUs to which the process should be restricted.

But on further reading, the Linux docs say [`sched_setaffinity` is per *thread*.](https://man7.org/linux/man-pages/man2/sched_setaffinity.2.html) As it turns out, the Python documentation is imprecise on this point. I [verified that `sched_setaffinity` only affects the calling thread, not the entire calling process.](https://gist.github.com/pdesupinski/7e2de3cbe5bb48d489f257b83ccddf07)

The upshot is that we actually *can* safely use the inheritance trick from #161183 even with parallel start, since the setting will be inherited from the calling thread, and `os.sched_setaffinity` only affects the calling thread.

# This PR

Remove restrictions against parallel start for NUMA binding.

Pull Request resolved: #161576
Approved by: https://github.com/d4l3k
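A quick way to check the per-thread behavior described above (a standalone, Linux-only sketch, separate from the gist linked in the commit message):

```python
import os
import threading

def show(label: str) -> None:
    # With pid=0, sched_getaffinity reports the affinity of the calling thread.
    print(label, sorted(os.sched_getaffinity(0)))

def restrict_worker_to_cpu0() -> None:
    os.sched_setaffinity(0, {0})  # on Linux this affects only the calling thread
    show("worker after setaffinity:")

show("main before:")
worker = threading.Thread(target=restrict_worker_to_cpu0)
worker.start()
worker.join()
show("main after:")  # unchanged, confirming the call did not affect the whole process
```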
# Context

In pytorch#160163, we added support for NUMA binding for `Callable` entrypoints to `elastic_launch`. This requires special consideration, because they go through a different path to spawn subprocesses compared to `str` entrypoints, a path which does not provide a straightforward way to utilize the `numactl` CLI. See pytorch#160006 for a full description of the challenges.

Although pytorch#160163 worked in initial local experiments, we ran into some linker errors in other environments when we tried to call `numactl`. This appeared to be due to interactions with how the `LD_PRELOAD` environment variable was being set.

# This PR

On further thought, the most straightforward, foolproof solution here is to use [the trick that @d4l3k suggested.](pytorch#160006 (comment)) Specifically, for each local rank `i`:

1. The parent process sets its own CPU affinity to what local rank `i`'s should be.
2. Then, the parent spawns the subprocess for local rank `i`.
3. Finally, the parent resets its own CPU affinity to what it was originally.

There were other solutions that would work just for `Callable` entrypoints, but I believe this is the simplest one that can work for *both* `str` and `Callable` entrypoints.

This required a bit of refactoring:

1. Turn all the `_get_.*_numactl_options` into functions which return a set of logical CPUs to bind to, rather than options like `--cpunodebind=0`.
2. Instead of wrapping commands with `numactl`, use `os.sched_setaffinity` to bind to the CPUs from (1.).
3. Put this all inside a context manager which encapsulates applying and restoring the bindings in the parent process.
4. Use the context manager for both `str` and `Callable` paths.

# Test Plan

## Automated

`$ pytest test/test_numa_binding.py`

## Manual

See [doc.](https://docs.google.com/document/d/1vxD-OKYBTT27jbBwtW9iz9g0tNM0u-i0tiTJg_ieQA8/edit?tab=t.0) Meta only, but TL;DR: tried out every combination of `str`, `Callable`, binding disabled, and binding enabled on the same model and saw 2x SM utilization for binding enabled.

Pull Request resolved: pytorch#161183
Approved by: https://github.com/d4l3k
Context

In #160163, we added support for NUMA binding for `Callable` entrypoints to `elastic_launch`. This requires special consideration, because they go through a different path to spawn subprocesses compared to `str` entrypoints, a path which does not provide a straightforward way to utilize the `numactl` CLI. See #160006 for a full description of the challenges.

Although #160163 worked in initial local experiments, we ran into some linker errors in other environments when we tried to call `numactl`. This appeared to be due to interactions with how the `LD_PRELOAD` environment variable was being set.

This PR

On further thought, the most straightforward, foolproof solution here is to use the trick that @d4l3k suggested. Specifically, for each local rank `i`:

1. The parent process sets its own CPU affinity to what local rank `i`'s should be.
2. Then, the parent spawns the subprocess for local rank `i`.
3. Finally, the parent resets its own CPU affinity to what it was originally (see the sketch after this list).
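A minimal sketch of those three steps using an assumed `multiprocessing` spawn context (illustrative only; the real integration lives in the spawn and subprocess-handler code paths touched by this PR):

```python
import multiprocessing as mp
import os

def _worker() -> None:
    # The child reports the affinity it inherited from the parent at spawn time.
    print("child affinity:", sorted(os.sched_getaffinity(0)))

def spawn_with_numa_binding(cpus_for_rank: set[int]):
    original_affinity = os.sched_getaffinity(0)
    os.sched_setaffinity(0, cpus_for_rank)          # 1. bind the parent to rank i's CPUs
    try:
        proc = mp.get_context("spawn").Process(target=_worker)
        proc.start()                                # 2. the child inherits the binding
    finally:
        os.sched_setaffinity(0, original_affinity)  # 3. restore the parent's affinity
    return proc

if __name__ == "__main__":
    spawn_with_numa_binding({0, 1}).join()
```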
There were other solutions that would work just for `Callable` entrypoints, but I believe this is the simplest one that can work for *both* `str` and `Callable`.

This required a bit of refactoring:

1. Turn all the `_get_.*_numactl_options` helpers into functions which return a set of logical CPUs to bind to, rather than options like `--cpunodebind=0` (one way to compute such a set is sketched after this list).
2. Instead of wrapping commands with `numactl`, use `os.sched_setaffinity` to bind to the CPUs from (1.).
3. Put this all inside a context manager which encapsulates applying and restoring the bindings in the parent process.
4. Use the context manager for both `str` and `Callable` paths.
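For illustration, one way such a function could derive the logical CPUs for a given NUMA node on Linux is by parsing sysfs (an assumption about the approach; the actual helpers may compute the set differently, e.g. taking GPU affinity into account):

```python
def _logical_cpus_for_numa_node(node_index: int) -> set[int]:
    """Parse a sysfs cpulist such as '0-31,64-95' into a set of logical CPU indices."""
    with open(f"/sys/devices/system/node/node{node_index}/cpulist") as f:
        cpulist = f.read().strip()
    cpus: set[int] = set()
    for chunk in cpulist.split(","):
        if "-" in chunk:
            start, end = chunk.split("-")
            cpus.update(range(int(start), int(end) + 1))
        else:
            cpus.add(int(chunk))
    return cpus
```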
Test Plan

Automated

`$ pytest test/test_numa_binding.py`

Manual

See doc (Meta only), but TL;DR: tried out every combination of `str`, `Callable`, binding disabled, and binding enabled on the same model and saw 2x SM utilization for binding enabled.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta
str,Callable, binding disabled, and binding enabled on the same model and saw 2x SM utilization for binding enabled.cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta