Support NUMA Binding for Callable Entrypoints, Take 2 by pdesupinski · Pull Request #161183 · pytorch/pytorch · GitHub

Conversation

@pdesupinski
Contributor

@pdesupinski pdesupinski commented Aug 21, 2025

Context

In #160163, we added support for NUMA binding for Callable entrypoints to elastic_launch. This requires special consideration, because Callable entrypoints spawn subprocesses through a different path than str entrypoints, one which does not provide a straightforward way to use the numactl CLI. See #160006 for a full description of the challenges.

Although #160163 worked in initial local experiments, we ran into some linker errors in other environments when we tried to call numactl. This appeared to be due to interactions with how the LD_PRELOAD environment variable was being set.

This PR

On further thought, the most straightforward, foolproof solution here is to use the trick that @d4l3k suggested.

Specifically, for each local rank i:

  1. The parent process sets its own CPU affinity to what local rank i's affinity should be.
  2. Then, the parent spawns the subprocess for local rank i.
  3. Finally, the parent resets its own CPU affinity to what it was originally.

There were other solutions that would work just for Callable entrypoints, but I believe this is the simplest approach that works for both str and Callable entrypoints (see the sketch below).
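
To make the trick concrete, here is a minimal sketch of the idea, assuming a plain subprocess.Popen spawn; the helper name and arguments (spawn_with_affinity, cpus_for_rank) are illustrative, not the PR's actual API, and the affinity calls are Linux-only.

```python
import os
import subprocess

def spawn_with_affinity(cmd, cpus_for_rank):
    # Hypothetical helper, for illustration only.
    original_affinity = os.sched_getaffinity(0)  # remember the parent's current affinity
    try:
        # 1. Bind the parent to the CPUs intended for this local rank.
        os.sched_setaffinity(0, cpus_for_rank)
        # 2. Spawn the subprocess; the child inherits the parent's affinity.
        proc = subprocess.Popen(cmd)
    finally:
        # 3. Restore the parent's original affinity.
        os.sched_setaffinity(0, original_affinity)
    return proc
```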

This required a bit of refactoring:

  1. Turn all the _get_.*_numactl_options into functions which return a set of logical CPUs to bind to, rather than options like --cpunodebind=0.
  2. Instead of wrapping commands with numactl, use os.sched_setaffinity to bind to the CPUs from (1.).
  3. Put this all inside a context manager which encapsulates applying and restoring the bindings in the parent process.
  4. Use the context manager for both the str and Callable paths (a rough sketch of this shape follows below).
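
As a rough sketch of the refactored shape under these assumptions (the names _cpus_for_node and _apply_numa_binding are made up for illustration, and the /sys cpulist parsing is a simplification of whatever the real CPU-discovery logic does):

```python
import os
from contextlib import contextmanager

def _cpus_for_node(node: int) -> set[int]:
    # Illustrative stand-in for the refactored _get_*_numactl_options functions:
    # return the logical CPUs of a NUMA node instead of flags like --cpunodebind=0.
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        cpulist = f.read().strip()  # e.g. "0-15,32-47"
    cpus: set[int] = set()
    for part in cpulist.split(","):
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus

@contextmanager
def _apply_numa_binding(cpus: set[int]):
    # Temporarily bind the current (parent) process to `cpus`; anything spawned
    # inside the `with` block inherits the binding, then the parent is restored.
    original = os.sched_getaffinity(0)
    os.sched_setaffinity(0, cpus)
    try:
        yield
    finally:
        os.sched_setaffinity(0, original)

# Usable around either spawn path, e.g.:
#   with _apply_numa_binding(_cpus_for_node(0)):
#       ...spawn the str or Callable entrypoint subprocess here...
```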

Test Plan

Automated

$ pytest test/test_numa_binding.py

Manual

See doc (Meta only). TL;DR: I tried every combination of str and Callable entrypoints with binding enabled and disabled on the same model, and saw 2x SM utilization with binding enabled.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (torchelastic) labels Aug 21, 2025
@pytorch-bot

pytorch-bot bot commented Aug 21, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/161183

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 989679d with merge base 419a2db:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Contributor

@pdesupinski has imported this pull request. If you are a Meta employee, you can view this in D80724395.

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Aug 21, 2025
@pdesupinski pdesupinski force-pushed the feature/function-numa-binding-take2 branch from 5e0dc8c to ec92647 Compare August 22, 2025 02:10
@pdesupinski pdesupinski changed the title os.sched_setaffinity instead of numactl Support NUMA Binding for Callable Entrypoints, Take 2 Aug 22, 2025
@pdesupinski pdesupinski requested review from d4l3k and kiukchung August 22, 2025 02:46
@pdesupinski pdesupinski marked this pull request as ready for review August 22, 2025 02:47
@facebook-github-bot
Contributor

@pdesupinski has imported this pull request. If you are a Meta employee, you can view this in D80724395.

@pdesupinski pdesupinski added the suppress-bc-linter Suppresses the failures of API backward-compatibility linter (Lint/bc_linter) label Aug 22, 2025
@d4l3k d4l3k requested a review from Copilot August 22, 2025 03:00
Contributor

Copilot AI left a comment


Pull Request Overview

This PR refactors NUMA binding support for both str and Callable entrypoints by replacing the previous numactl CLI approach with a simpler CPU affinity inheritance mechanism. Instead of creating temporary executable files with numactl commands, the parent process temporarily sets its own CPU affinity before spawning subprocesses, allowing child processes to inherit the correct NUMA bindings.

Key changes:

  • Replaced numactl CLI wrapping with os.sched_setaffinity() based CPU affinity management
  • Introduced a context manager approach for temporarily applying NUMA bindings to the current process
  • Unified the NUMA binding logic to work with both str and Callable entrypoints through inheritance

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Summary per file:

  • torch/numa/binding.py: core refactoring; replaced numactl CLI functions with a CPU affinity context manager and logical-CPU binding functions
  • torch/multiprocessing/spawn.py: updated to use the new context manager approach for NUMA binding inheritance during process spawning
  • torch/distributed/launcher/api.py: updated condition to check for parallel start compatibility with NUMA bindings
  • torch/distributed/elastic/multiprocessing/subprocess_handler/subprocess_handler.py: applied the new NUMA binding context manager to subprocess creation
  • test/test_numa_binding.py: comprehensive test updates to verify CPU affinity binding instead of numactl command generation


"Continuing executing without applying NUMA binding, despite exception %s",
traceback.format_exc(),
)
return None

Copilot AI Aug 22, 2025


The function should not return None in the except block. This is a context manager that yields, so it should either re-raise the exception or yield without returning a value. Returning None will cause a TypeError when used as a context manager.

Suggested change
return None
return


traceback.format_exc(),
)
return None
raise

Copilot AI Aug 22, 2025


This raise statement is unreachable because the previous line returns None. The function flow will never reach this point due to the return statement on line 126.

Suggested change
raise
else:
raise
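
For context on both suggestions, a minimal sketch of the fallback shape they converge on, assuming a contextlib.contextmanager-based helper with made-up names (the actual code and parameter names in the PR differ):

```python
import logging
import os
import traceback
from contextlib import contextmanager

logger = logging.getLogger(__name__)

@contextmanager
def _maybe_bind(cpus, should_fall_back_if_binding_fails=True):
    original = os.sched_getaffinity(0)
    try:
        os.sched_setaffinity(0, cpus)
    except Exception:
        if should_fall_back_if_binding_fails:
            logger.warning(
                "Continuing execution without applying NUMA binding, despite exception %s",
                traceback.format_exc(),
            )
            # A generator-based context manager must still yield on the fallback
            # path; returning before the yield makes entering the `with` block fail.
            yield
            return
        raise  # reachable: only taken when fallback is disabled
    try:
        yield
    finally:
        os.sched_setaffinity(0, original)
```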


if numa_options.should_fall_back_if_binding_fails:
logger.warning("Falling back to original command without NUMA bindings.")
logger.warning(
"Continuing executing without applying NUMA binding, despite exception %s",

Copilot AI Aug 22, 2025


The word 'executing' should be 'execution' to be grammatically correct. The correct phrase should be 'Continuing execution without applying NUMA binding'.

Suggested change
"Continuing executing without applying NUMA binding, despite exception %s",
"Continuing execution without applying NUMA binding, despite exception %s",


Member

@d4l3k d4l3k left a comment


LGTM

@pdesupinski pdesupinski force-pushed the feature/function-numa-binding-take2 branch from ec92647 to 93ce3a8 Compare August 22, 2025 17:48
@facebook-github-bot
Copy link
Contributor

@pdesupinski has imported this pull request. If you are a Meta employee, you can view this in D80724395.

@pdesupinski pdesupinski force-pushed the feature/function-numa-binding-take2 branch from 93ce3a8 to 0a0810d Compare August 22, 2025 22:37
@facebook-github-bot
Copy link
Contributor

@pdesupinski has imported this pull request. If you are a Meta employee, you can view this in D80724395.

@pdesupinski pdesupinski force-pushed the feature/function-numa-binding-take2 branch from 0a0810d to 989679d Compare August 23, 2025 00:13
@pdesupinski pdesupinski requested a review from albanD as a code owner August 23, 2025 00:13
@facebook-github-bot
Copy link
Contributor

@pdesupinski has imported this pull request. If you are a Meta employee, you can view this in D80724395.

@facebook-github-bot
Copy link
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Aug 28, 2025
# Context
In #161183, we added NUMA-binding support for `Callable` entrypoints to `elastic_launch`.

However, we raised an exception if subprocesses were to be spawned in parallel via `ThreadPoolExecutor`, an option configurable via the `TORCH_MP_PARALLEL_START` environment variable (see diff).

The logic here was that `os.sched_setaffinity`, which we used to set CPU affinities, is [per process](https://docs.python.org/3/library/os.html#os.sched_setaffinity), so there could be a race condition during a parallel start:

> Restrict the process with PID pid (or the current process if zero) to a set of CPUs. mask is an iterable of integers representing the set of CPUs to which the process should be restricted.

But on further reading, the Linux docs say [`sched_setaffinity` is per *thread*.](https://man7.org/linux/man-pages/man2/sched_setaffinity.2.html) As it turns out, the Python documentation's wording is misleading here.

I [verified that `sched_setaffinity` only affects the calling thread, not the entire calling process.](https://gist.github.com/pdesupinski/7e2de3cbe5bb48d489f257b83ccddf07)
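
A small sketch of the kind of check the linked gist performs (an illustrative reconstruction rather than the gist itself; it assumes CPU 0 is available to the process and, like `sched_setaffinity`, is Linux-only):

```python
import os
import threading

main_affinity = os.sched_getaffinity(0)  # affinity mask of the calling (main) thread

def restrict_worker():
    # Restrict only this worker thread to CPU 0.
    os.sched_setaffinity(0, {0})
    print("worker affinity:", os.sched_getaffinity(0))

t = threading.Thread(target=restrict_worker)
t.start()
t.join()

# On Linux, sched_getaffinity/sched_setaffinity with pid 0 act on the calling
# thread, so the main thread's affinity should be unchanged by the worker's call.
print("main affinity unchanged:", os.sched_getaffinity(0) == main_affinity)
```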

The upshot is that we actually *can* safely use the inheritance trick from #161183 even with parallel start, since the setting will be inherited from the calling thread, and `os.sched_setaffinity` only affects the calling thread.

# This PR
Remove restrictions against parallel start for NUMA binding.

Pull Request resolved: #161576
Approved by: https://github.com/d4l3k
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
@github-actions github-actions bot deleted the feature/function-numa-binding-take2 branch September 23, 2025 02:08