Support NUMA Binding for Callable entrypoints to `elastic_launch` · Issue #160006 · pytorch/pytorch · GitHub
Support NUMA Binding for Callable entrypoints to elastic_launch #160006

@pdesupinski

Context

As of #149334, we now support automatic NUMA binding when we pass a str entrypoint (e.g. "train.py") to elastic_launch, but it is also possible to pass a callable entrypoint such as do_train. The Callable path is actually fairly common, so we need to support it too.

Under the hood, there are two divergent paths for launching the subprocesses depending on whether the entrypoint is a str or a Callable.

For str, we end up calling subprocess.Popen directly, so it is fairly straightforward to prepend some numactl CLI arguments to what the args would otherwise have been.
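A minimal sketch of what prepending numactl arguments could look like for the str path. The helper names, the one-NUMA-node-per-process mapping, and the choice of `--cpunodebind`/`--membind` are assumptions for illustration; the actual implementation from #149334 may build the command differently.

```python
import subprocess
import sys


def numactl_prefix(numa_node: int) -> list[str]:
    # Hypothetical helper: bind CPU scheduling and memory allocation
    # to a single NUMA node.
    return ["numactl", f"--cpunodebind={numa_node}", f"--membind={numa_node}"]


def build_command(entrypoint: str, args: list[str], numa_node: int) -> list[str]:
    # Without bindings, the command would just be [python, entrypoint, *args].
    return numactl_prefix(numa_node) + [sys.executable, entrypoint, *args]


def launch(entrypoint: str, args: list[str], numa_node: int) -> subprocess.Popen:
    # Requires the numactl binary to be installed on the host.
    return subprocess.Popen(build_command(entrypoint, args, numa_node))
```

Because numactl wraps the interpreter itself, the bindings are in place before any Python code runs in the child.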

Callable Implementation Options

For the Callable path, there are two main possible approaches.

1. Wrap the Callable so that it affinitizes itself to the correct CPUs.

So far, I've prototyped this method. However, top-level code outside the function, such as import torch, will naturally execute in the subprocess before the Callable and therefore before our bindings take effect. In a benchmark, I noticed this method still significantly improved memory locality compared to no bindings, but it also clearly had worse locality than the str equivalent of the benchmark.

When I forced the bindings to occur earlier by manually adding the right os.sched_setaffinity to the top of torch/__init__.py, the memory locality improved again but was still not as good as the str equivalent. Also, non-torch code could theoretically run before the bindings occur, and I'm not sure there's a particularly clean way to productionize this method anyway.
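For reference, the wrapping approach could be sketched roughly as below. This is not the prototype itself: the class name and the way the CPU set is passed in are assumptions, and os.sched_setaffinity is Linux-only. A class is used instead of a closure so the wrapper stays picklable for the "spawn" start method.

```python
import os
from typing import Any, Callable, Iterable


class AffinitizedEntrypoint:
    """Hypothetical picklable wrapper: pin the process to `cpus`, then call `fn`."""

    def __init__(self, fn: Callable[..., Any], cpus: Iterable[int]) -> None:
        self.fn = fn
        self.cpus = set(cpus)

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        # pid 0 means "the calling process". Everything imported while the
        # child unpickled this object has already run by the time we get here.
        os.sched_setaffinity(0, self.cpus)
        return self.fn(*args, **kwargs)


def current_cpus() -> set[int]:
    # Small helper used to observe the effect of the wrapper.
    return set(os.sched_getaffinity(0))
```

The comment in `__call__` is the crux of the locality gap described above: the binding can only happen once the Callable is invoked, which is after module-level imports in the child.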

2. Force Process.start to use numactl CLI

Under the hood, we use Process.start to kick off the subprocesses in the Callable case. This allows us to input various args to the Callable via pickling and piping.

Ultimately, this just invokes a list[str] command line, so there's no inherent reason that we couldn't prepend it with numactl args. But, there is no API for doing this. The closest API is set_executable.

Unfortunately, set_executable doesn't straightforwardly work for us, because

  1. It only accepts a str, whereas we would need to prepend multiple str since numactl needs arguments.
  2. The arguments to numactl need to be different for each local rank to do the correct bindings.

This led to a couple more radical ideas:

1. Create a different temp .sh file to execute for each local rank containing its proper numactl arguments, and then call set_executable in between each Process.start() call.

Problem: I'm a little antsy because there is this parallel path for starting the processes, and set_executable is a process-wide setting, so changing it between Process.start() calls could race.

  • Or we could use environment variables and a single script, but the race condition concern still seems to apply.
  • However, this only applies to "fork" and not to "spawn", which is the path we actually care about.
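A rough sketch of generating one wrapper script per local rank for this option. The function name, the script layout, and the numactl flags are assumptions for illustration. The script forwards whatever arguments multiprocessing passes to the executable on to the real interpreter, under numactl.

```python
import os
import shlex
import stat
import sys
import tempfile


def write_numactl_wrapper(numa_node: int, directory: str) -> str:
    # Hypothetical helper: the script replaces itself (exec) with numactl,
    # which in turn runs the real interpreter with the original arguments.
    script = (
        "#!/bin/sh\n"
        f"exec numactl --cpunodebind={numa_node} --membind={numa_node} "
        f'{shlex.quote(sys.executable)} "$@"\n'
    )
    fd, path = tempfile.mkstemp(
        prefix=f"numa_node{numa_node}_", suffix=".sh", dir=directory
    )
    with os.fdopen(fd, "w") as f:
        f.write(script)
    # mkstemp creates the file as 0600; add the owner-execute bit.
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
    return path
```

Each returned path is a single str, which is the shape set_executable accepts, while the per-rank numactl arguments live inside the script.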

2. Create a single .sh file to execute which automatically applies the correct bindings based on the local rank

Problem: I don't see any way to access the local rank at this point. The env var would not be set until the entrypoint runs.

3. Ask for a new API in multiprocessing. Maybe add ability to override set_executable per Process object.

Problem: Don't want to wait for this.

Discussion

Any ideas or preferences? Feels like I've thought through every conceivable method by now, but they each seem problematic.

At this point, I would just go with the multiple temporary .sh files method and just raise an exception if the start_method != "spawn".
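To make that proposal concrete, here is a sketch using the standard library multiprocessing directly rather than the actual elastic_launch code path. The function name and parameters are hypothetical, and per_rank_executables is assumed to hold one executable wrapper path per local rank.

```python
import multiprocessing
import multiprocessing.spawn
from typing import Any, Callable, Sequence


def start_bound_processes(
    fn: Callable[[int], Any],
    per_rank_executables: Sequence[str],
    start_method: str = "spawn",
) -> list:
    if start_method != "spawn":
        raise ValueError(
            f"NUMA binding for Callable entrypoints requires 'spawn', got {start_method!r}"
        )
    ctx = multiprocessing.get_context("spawn")
    original = multiprocessing.spawn.get_executable()
    processes = []
    try:
        for local_rank, exe in enumerate(per_rank_executables):
            # set_executable is process-wide, so starts must stay sequential.
            ctx.set_executable(exe)
            proc = ctx.Process(target=fn, args=(local_rank,))
            proc.start()
            processes.append(proc)
    finally:
        # Restore the interpreter for any later, unrelated spawns.
        ctx.set_executable(original)
    return processes
```

Restoring the original executable in the finally block keeps the change from leaking into other code in the same parent process that spawns workers later.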

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @raghavhrishi
