Support NUMA Binding for Callable entrypoints to `elastic_launch`

# Context
As of #149334, we now support automatic NUMA binding when we pass a `str` entrypoint (e.g. `"train.py"`) to `elastic_launch`, but it is also possible to pass a callable entrypoint such as `do_train`. The `Callable` path is actually fairly common, so we need to support it too.

Under the hood, there are [two divergent paths](https://github.com/pytorch/pytorch/blob/512b4730e3c7b931360ae7f78953d943bb483d9a/torch/distributed/elastic/multiprocessing/__init__.py#L211-L230) for launching the subprocesses depending on whether the entrypoint is a `str` or a `Callable`.

For `str`, we end up [calling `subprocess.Popen` directly,](https://github.com/pytorch/pytorch/blob/512b4730e3c7b931360ae7f78953d943bb483d9a/torch/distributed/elastic/multiprocessing/subprocess_handler/subprocess_handler.py#L57) so it is fairly straightforward to [prepend some `numactl` CLI arguments to what the args would otherwise have been.](https://github.com/pytorch/pytorch/blob/main/torch/distributed/elastic/multiprocessing/api.py#L817)
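As a rough illustration of the `str` path, the sketch below prepends `numactl` arguments to an existing command line. The helper names (`numactl_prefix`, `bind_command`) and the "one NUMA node per process" policy are assumptions made for this example, not the actual code in `api.py`; only the `--cpunodebind` and `--membind` flags are standard `numactl` options.

```python
import sys


def numactl_prefix(numa_node: int) -> list[str]:
    """Hypothetical helper: numactl args binding CPUs and memory to one node."""
    return ["numactl", f"--cpunodebind={numa_node}", f"--membind={numa_node}"]


def bind_command(args: list[str], numa_node: int) -> list[str]:
    """Return a new command line with the numactl prefix prepended."""
    return numactl_prefix(numa_node) + list(args)


# The resulting list could be handed to subprocess.Popen as-is.
cmd = bind_command([sys.executable, "train.py", "--epochs", "1"], numa_node=0)
```

Because the binding is applied by `numactl` before the Python interpreter even starts, every import and allocation in the child already runs under the binding.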

# `Callable` Implementation Options

For the `Callable` path, there are two main possible approaches.
## 1. Wrap the `Callable` so that it affinitizes itself to the correct CPUs.
So far, I've prototyped this method. However, top-level code outside the function, such as `import torch`, naturally executes in the subprocess before the `Callable` runs, and therefore before our bindings take effect. In a [benchmark,](https://gist.github.com/pdesupinski/51d5e4dec7579383b7173bd5bd82da8f) I noticed this method still significantly improved memory locality compared to no bindings, but it clearly had worse locality than the `str` equivalent of the benchmark.

When I forced the bindings to occur earlier by manually adding the right `os.sched_setaffinity` to the top of `torch/__init__.py`, the memory locality improved again but was still not as good as the `str` equivalent. Also, non-torch code could theoretically run before the bindings occur, and I'm not sure there's a particularly clean way to productionize this method anyway.
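A minimal sketch of the wrapping approach is shown here, assuming Linux (where `os.sched_setaffinity` is available). The names `bind_callable` and `_run_with_affinity` are hypothetical and this is not the prototype itself; it only illustrates the shape of the idea and why the binding happens late.

```python
import functools
import os
from typing import Any, Callable, Iterable


def _run_with_affinity(
    cpus: frozenset, fn: Callable[..., Any], *args: Any, **kwargs: Any
) -> Any:
    # Linux-only call; pid 0 means "the calling process".
    # Anything imported or allocated before this line ran unbound.
    os.sched_setaffinity(0, cpus)
    return fn(*args, **kwargs)


def bind_callable(fn: Callable[..., Any], cpus: Iterable[int]) -> Callable[..., Any]:
    """Wrap fn so that it pins the process to `cpus` before running.

    A partial over a module-level function stays picklable as long as
    fn itself is picklable, which the "spawn" start method requires.
    """
    return functools.partial(_run_with_affinity, frozenset(cpus), fn)


def do_train(x: int) -> int:
    # Stand-in for a real training entrypoint.
    return x * 2
```

The wrapper can only act once the child has unpickled it, which requires importing the module that defines `fn`, so that module's top-level imports necessarily run first.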

## 2. Force [`Process.start`](https://github.com/pytorch/pytorch/blob/a5725965ea21f684a314defab0bba5b9b5407705/torch/multiprocessing/spawn.py#L275) to use `numactl` CLI

Under the hood, we use [`Process.start`](https://github.com/pytorch/pytorch/blob/a5725965ea21f684a314defab0bba5b9b5407705/torch/multiprocessing/spawn.py#L275) to kick off the subprocesses in the `Callable` case. This allows us [to input various args to the `Callable` via pickling and piping.](https://github.com/pytorch/pytorch/blob/a5725965ea21f684a314defab0bba5b9b5407705/torch/multiprocessing/spawn.py#L271-L272)

Ultimately, this just invokes [a `list[str]` command line,](https://github.com/python/cpython/blob/main/Lib/multiprocessing/popen_spawn_posix.py#L55) so there's no inherent reason we couldn't prepend `numactl` args to it. However, there is no API for doing this. The closest API is [`set_executable`.](https://github.com/python/cpython/blob/ee72c95aa947e5a87308e3657b6b3983805a086e/Lib/multiprocessing/spawn.py#L36)

Unfortunately, `set_executable` doesn't straightforwardly work for us, because
1. It only accepts a `str`, whereas we would need to prepend multiple `str` since `numactl` needs arguments.
2. The arguments to `numactl` need to be different for each local rank to do the correct bindings.

This led to a couple more radical ideas:
### 1. Create a different temp `.sh` file to execute for each local rank containing its proper `numactl` arguments, and then call `set_executable` in between each `Process.start()` call.
Problem: I'm a little antsy because there is [this parallel path for starting the processes.](https://github.com/pytorch/pytorch/blob/a5725965ea21f684a314defab0bba5b9b5407705/torch/multiprocessing/spawn.py#L283)
* Or we could use environment variables and a single script, but the race condition concern still seems to apply.
* However, that parallel path only applies to `"fork"`, not to `"spawn"`, and `"spawn"` is the start method we actually care about.
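To make the per-rank script idea concrete, here is a sketch of generating one launcher per local rank. The function name `write_numactl_launcher`, the file naming, and the one-node-per-rank binding are all assumptions made for this example. The script forwards `"$@"` because the `"spawn"` start method passes the interpreter arguments after the executable.

```python
import os
import shlex
import stat
import sys
import tempfile


def write_numactl_launcher(numa_node: int, python: str = sys.executable) -> str:
    """Hypothetical helper: write an executable .sh wrapper and return its path.

    The wrapper replaces itself with `numactl ... python <original args>`,
    so the interpreter starts already bound to the given NUMA node.
    """
    script = (
        "#!/bin/sh\n"
        f"exec numactl --cpunodebind={numa_node} --membind={numa_node} "
        f'{shlex.quote(python)} "$@"\n'
    )
    fd, path = tempfile.mkstemp(prefix=f"numa_node_{numa_node}_", suffix=".sh")
    with os.fdopen(fd, "w") as f:
        f.write(script)
    os.chmod(path, stat.S_IRWXU)  # owner read/write/execute only
    return path
```

The intended usage would be to call `multiprocessing.set_executable(write_numactl_launcher(node))` immediately before each `Process.start()`, strictly sequentially. Since `set_executable` changes process-global state, any concurrent start could pick up another rank's script, which is the race condition concern above.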

### 2. Create a single `.sh` file to execute which automatically applies the correct bindings based on the local rank
Problem: I don't see any way to access local rank at this point. The env var would not yet be set until the entrypoint runs.

### 3. Ask for a new API in `multiprocessing`. Maybe add ability to override `set_executable` per `Process` object.
Problem: Don't want to wait for this.

# Discussion
Any ideas or preferences? Feels like I've thought through every conceivable method by now, but they each seem problematic.

At this point, I would go with the multiple temporary `.sh` files method and raise an exception if `start_method != "spawn"`.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @raghavhrishi
