Support NUMA Binding for Callable Entrypoints by pdesupinski · Pull Request #160163 · pytorch/pytorch · GitHub

Conversation

@pdesupinski
Contributor

@pdesupinski pdesupinski commented Aug 8, 2025

NOTE: This implementation has been replaced with #161183 to fix some crashes.

Context

This is an extension of #149334.

This PR

Add support for NUMA bindings with Callable entrypoints, such as do_train instead of /usr/local/bin/python.

Most notably, we utilize a hack in order to force Process.start() to use custom NUMA bindings for each subprocess. Please search for HACK: in the code to see a description of the implementation we chose, and #160006 for discussion of alternatives and why this is necessary.

Other changes:

  • Remove unnecessary --preferred option from all binding strategies. By default, Linux already allocates memory to the NUMA node local to the CPU which triggered the allocation. (See MPOL_LOCAL.)
  • Refactor so that the main API is maybe_wrap_command_with_numa_bindings, which computes bindings for a single rank at a time, rather than maybe_wrap_with_numa_bindings which computed bindings for all ranks at once. This allowed for more code sharing between Callable and str entrypoints.
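To make the refactored shape concrete, here is a hedged sketch of the per-rank API (the function name comes from this PR, but the signature, the numa_options handling, and the _node_for_local_rank helper are illustrative, not the actual implementation):

```python
from typing import Optional

def _node_for_local_rank(local_rank: int) -> int:
    # Hypothetical mapping: round-robin ranks across an illustrative
    # two-NUMA-node machine; the real logic derives this from topology.
    return local_rank % 2

def maybe_wrap_command_with_numa_bindings(
    command_args: list[str],
    local_rank: int,
    numa_options: Optional[str],  # e.g. "node"; the real type differs
) -> Optional[list[str]]:
    """Prepend numactl binding options computed for one local rank."""
    if numa_options is None:
        return None
    node = _node_for_local_rank(local_rank)
    # No --preferred/--membind flag is needed: under the kernel's default
    # MPOL_LOCAL policy, memory is allocated local to the bound CPUs.
    return ["numactl", f"--cpunodebind={node}", *command_args]
```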

Test Plan

Automated

$ pytest test/test_numa_binding.py

Manual

Using this benchmark (https://gist.github.com/pdesupinski/bbe01ade455d86e989794f2c612e2d91), I ran

$ PYTHONUNBUFFERED=1 LOGLEVEL=INFO perf stat -e ls_dmnd_fills_from_sys.dram_io_far,ls_dmnd_fills_from_sys.dram_io_near -- python -m torch.distributed.run --standalone --nproc-per-node=8 --numa-binding=node --run-path mlp_train.py 2>&1 | tee node_callable.txt && PYTHONUNBUFFERED=1 LOGLEVEL=INFO perf stat -e ls_dmnd_fills_from_sys.dram_io_far,ls_dmnd_fills_from_sys.dram_io_near -- python -u -m torch.distributed.run --standalone --nproc-per-node=8 --run-path mlp_train.py 2>&1 | tee none_callable.txt

and observed

  • 6.6% remote memory accesses with 'node' bindings
  • 11.6% remote without bindings
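For reference, these figures are presumably the far share of demand fills, i.e. dram_io_far / (dram_io_far + dram_io_near) from the two perf counters above. A quick sketch of that arithmetic, with illustrative counter values chosen to reproduce the reported ratios:

```python
def remote_fraction(far: float, near: float) -> float:
    """Share of demand fills served from far (remote-NUMA) DRAM/IO."""
    return far / (far + near)

# Illustrative counts chosen to reproduce the reported ratios:
print(f"{remote_fraction(6.6, 93.4):.1%}")   # 6.6%  with 'node' bindings
print(f"{remote_fraction(11.6, 88.4):.1%}")  # 11.6% without bindings
```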

I also ran the same comparison with str entrypoints, as before, to confirm they still work.

NOTE: --run-path triggers the code to be run inside a Callable.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @raghavhrishi

@pytorch-bot

pytorch-bot bot commented Aug 8, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/160163

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 45984e7 with merge base 94b91a8:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (torchelastic) labels Aug 8, 2025
@pdesupinski pdesupinski force-pushed the feature/function-numa-binding branch from 322306f to 6c4f40f Compare August 8, 2025 04:31
@pdesupinski pdesupinski requested a review from d4l3k August 8, 2025 17:36
Member

@d4l3k d4l3k left a comment


This code looks pretty reasonable

Member


should the docs move to src/numa.rst instead of being under elastic?

Contributor Author


For context, I moved torch/distributed/numa --> torch/numa so that torch/distributed/__init__.py won't have to run when someone imports torch/multiprocessing/spawn.py. Might have been overly paranoid, but thought it might be preferable to keep torch/multiprocessing's dependencies light!

I think ultimately we want NUMA documentation on the torchrun and elastic agent webpages, so maybe it makes sense to keep it here for now? I don't really understand these .rst files yet though, so I'll defer to you.

I was planning to go through and do another PR to spruce up the public docs once we were more confident the impl was stable, is that fair?

if (
self.numa_options is None
and self.start_method == "spawn"
and torch.cuda.is_available()
Member


should this be torch.accelerator.is_available()?

Contributor Author


Uhhhhh. We are using torch.cuda APIs so far here and elsewhere. I'm not as familiar with the topology / practices with other accelerators.

Fair to call that a potential followup since it's more than just this one callsite?

if self.numa_options is None and torch.cuda.is_available():
if (
self.numa_options is None
and self.start_method == "spawn"
Member


This is sort of odd, since for CLI invocations this isn't even used, but we're gating on it anyway

Contributor Author

@pdesupinski pdesupinski Aug 11, 2025


Yeah, it does default to "spawn" though, even for torchrun, so it should be fine to keep it simple like this. If you feel strongly though, we can make this `and (self.start_method == "spawn" or isinstance(entrypoint, str))`.

Member


I see -- that's good to know and I think it's fine then. Maybe leave a comment about it

if full_numactl_command is None:
return None

return _get_temporary_executable_for_command(command_args=full_numactl_command)
Member


Why do we need this? We can't execute numactl commands via just the args?

Contributor Author

@pdesupinski pdesupinski Aug 11, 2025


So `multiprocessing` provides a `set_executable` API, but it just takes a path to an executable that will start up a Python process. So we need a single command that both runs `numactl` with our options and invokes Python. The only way I could see to do that was to put them into a `.sh` file.
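A minimal sketch of that trick, assuming the `_get_temporary_executable_for_command` name from the snippet above (the body here is illustrative, not the PR's actual implementation; only `set_executable` is a real `multiprocessing` API):

```python
import multiprocessing
import os
import stat
import sys
import tempfile

def _get_temporary_executable_for_command(command_args: list[str]) -> str:
    # Write a throwaway shell script that execs the whole command
    # (numactl + options + the Python interpreter) as one executable,
    # forwarding whatever args multiprocessing's spawn machinery appends.
    fd, path = tempfile.mkstemp(suffix=".sh")
    with os.fdopen(fd, "w") as f:
        f.write("#!/bin/bash\nexec " + " ".join(command_args) + ' "$@"\n')
    os.chmod(path, stat.S_IRWXU)  # owner read/write/execute
    return path

# Point Process.start() at the wrapper instead of sys.executable:
wrapper = _get_temporary_executable_for_command(
    ["numactl", "--cpunodebind=0", sys.executable]
)
multiprocessing.get_context("spawn").set_executable(wrapper)
```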

Member


Ahh I see, yeah that makes sense

Member

@d4l3k d4l3k left a comment


LGTM

# We do rm first to guarantee the file deletes itself. The rest of the file
# will still run as intended.
contents = f"""#!/bin/bash
rm -- "$0"
Member


maybe add a comment so that people who see these files lying around know what they're for / what created them?
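Something like this would work (a sketch extending the template quoted above; the wording and attribution in the breadcrumb are illustrative):

```python
contents = f"""#!/bin/bash
# Auto-generated by torch's NUMA binding support to wrap a subprocess
# launch in numactl; the script deletes itself on first run.
# We do rm first to guarantee the file deletes itself. The rest of the file
# will still run as intended.
rm -- "$0"
"""
```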

@pdesupinski pdesupinski added the suppress-bc-linter Suppresses the failures of API backward-compatibility linter (Lint/bc_linter) label Aug 11, 2025
@pdesupinski pdesupinski force-pushed the feature/function-numa-binding branch 2 times, most recently from 992317a to f84a9dd Compare August 11, 2025 23:48
@pdesupinski
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Aug 11, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team. Raised by workflow job.

Failing merge rule: Core Maintainers

Support function entrypoints in elastic agent
@pdesupinski pdesupinski force-pushed the feature/function-numa-binding branch from f84a9dd to 45984e7 Compare August 12, 2025 17:18
@pdesupinski
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

chuanhaozhuge pushed a commit that referenced this pull request Aug 14, 2025
# Context
This is an extension of #149334.

# This PR
Add support for NUMA bindings with Callable entrypoints, such as `do_train` instead of `/usr/local/bin/python`.

Most notably, we utilize a hack in order to force `Process.start()` to use custom NUMA bindings for each subprocess. Please search for `HACK:` in the code to see a description of the implementation we chose, and #160006 for discussion of alternatives and why this is necessary.

Other changes:
* Remove unnecessary `--preferred` option from all binding strategies. By default, Linux already allocates memory to the NUMA node local to the CPU which triggered the allocation. (See [MPOL_LOCAL](https://man7.org/linux/man-pages/man2/set_mempolicy.2.html).)
* Refactor so that the main API is `maybe_wrap_command_with_numa_bindings`, which computes bindings for a single rank at a time, rather than `maybe_wrap_with_numa_bindings` which computed bindings for all ranks at once. This allowed for more code sharing between `Callable` and `str` entrypoints.

# Test Plan
## Automated
`$ pytest test/test_numa_binding.py`

## Manual
Using [this benchmark](https://gist.github.com/pdesupinski/bbe01ade455d86e989794f2c612e2d91), ran

```
$ PYTHONUNBUFFERED=1 LOGLEVEL=INFO perf stat -e ls_dmnd_fills_from_sys.dram_io_far,ls_dmnd_fills_from_sys.dram_io_near -- python -m torch.distributed.run --standalone --nproc-per-node=8 --numa-binding=node --run-path mlp_train.py 2>&1 | tee node_callable.txt && PYTHONUNBUFFERED=1 LOGLEVEL=INFO perf stat -e ls_dmnd_fills_from_sys.dram_io_far,ls_dmnd_fills_from_sys.dram_io_near -- python -u -m torch.distributed.run --standalone --nproc-per-node=8 --run-path mlp_train.py 2>&1 | tee none_callable.txt
```

and observed
* 6.6% remote memory accesses with 'node' bindings
* 11.6% remote without bindings

I also ran similar with `str` entrypoints as before just to be sure it's still working.

NOTE: [--run-path triggers the code to be run inside a `Callable`.](https://github.com/pytorch/pytorch/blob/017259f9c65b6fad55fb9597d7077e2543eaae46/torch/distributed/run.py#L870)

Pull Request resolved: #160163
Approved by: https://github.com/d4l3k
pytorchmergebot pushed a commit that referenced this pull request Aug 15, 2025
# Context
Broader context in #160163.

In order for the `_utils_internal` version of `signpost_event` to do proper logging, its `parameters` argument needs to be JSON serializable.

# This PR
Convert `NumaOptions` to serializable form before inputting to `signpost_event`.

# Test Plan
## Automated
Added tests `$ pytest test/test_numa_binding.py`.

## Manual
See [D80317206](https://www.internalfb.com/diff/D80317206).

Pull Request resolved: #160710
Approved by: https://github.com/kiukchung
chuanhaozhuge pushed a commit that referenced this pull request Aug 18, 2025
can-gaa-hou pushed a commit to can-gaa-hou/pytorch that referenced this pull request Aug 22, 2025
can-gaa-hou pushed a commit to can-gaa-hou/pytorch that referenced this pull request Aug 22, 2025
pytorchmergebot pushed a commit that referenced this pull request Aug 23, 2025
# Context
In #160163, we added support for NUMA binding for `Callable` entrypoints to `elastic_launch`. This requires special consideration, because they go through a different path to spawn subprocesses compared to `str` entrypoints, a path which does not provide a straightforward way to utilize `numactl` CLI. See #160006 for a full description of the challenges.

Although #160163 worked in initial local experiments, we ran into some linker errors in other environments when we tried to call `numactl`. This appeared to be due to interactions with how the `LD_PRELOAD` environment variable was being set.

# This PR
On further thought, the most straightforward, foolproof solution here is to use the trick that @d4l3k suggested in #160006 (comment).

Specifically, for each local rank `i`:
1. The parent process sets its own CPU affinity to what local rank `i`'s should be.
2. Then, the parent spawns the subprocess for local rank `i`.
3. Finally, the parent resets its own CPU affinity to what it was originally.

There were other solutions that would work just for `Callable` entrypoints, but I believe this is the simplest approach that works for *both* `str` and `Callable`.

This required a bit of refactoring:
1. Turn all the `_get_.*_numactl_options` into functions which return a set of logical CPUs to bind to, rather than options like `--cpunodebind=0`.
2. Instead of wrapping commands with `numactl`, use `os.sched_setaffinity` to bind to the CPUs from (1.).
3. Put this all inside a context manager which encapsulates applying and restoring the bindings in the parent process (see the sketch after this list).
4. Use the context manager for both `str` and `Callable` paths.
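
A minimal sketch of the context-manager idea (the name and usage are illustrative; `os.sched_getaffinity` and `os.sched_setaffinity` are the real Linux-only APIs):

```python
import os
from contextlib import contextmanager

@contextmanager
def _parent_bound_to_rank_cpus(cpus: set[int]):
    # Temporarily adopt the CPU set computed for one local rank; a
    # subprocess spawned inside the `with` block inherits this affinity.
    original = os.sched_getaffinity(0)  # 0 == the current (parent) process
    os.sched_setaffinity(0, cpus)
    try:
        yield
    finally:
        # Restore the parent's original affinity after spawning.
        os.sched_setaffinity(0, original)

# Usage for local rank i (cpus_for_rank is a hypothetical mapping):
# with _parent_bound_to_rank_cpus(cpus_for_rank[i]):
#     process.start()
```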

# Test Plan
## Automated
`$ pytest test/test_numa_binding.py`

## Manual
See [doc.](https://docs.google.com/document/d/1vxD-OKYBTT27jbBwtW9iz9g0tNM0u-i0tiTJg_ieQA8/edit?tab=t.0) Meta only, but TLDR tried out every combination of `str`, `Callable`, binding disabled, and binding enabled on the same model and saw 2x SM utilization for binding enabled.

Pull Request resolved: #161183
Approved by: https://github.com/d4l3k
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
@github-actions github-actions bot deleted the feature/function-numa-binding branch September 22, 2025 02:15