Description
Context
As of #149334, we now support automatic NUMA binding when we pass a str entrypoint (e.g. "train.py") to elastic_launch, but it is also possible to pass a callable entrypoint such as do_train. The Callable path is actually fairly common, so we need to support it too.
Under the hood, there are two divergent paths for launching the subprocesses depending on whether the entrypoint is a str or a Callable.
For str, we end up calling subprocess.Popen directly, so it is fairly straightforward to prepend some numactl CLI arguments to what the args would otherwise have been.
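As a rough illustration of the str path, here is a minimal sketch of prepending numactl arguments to the command line. The helper names (with_numactl, launch) are hypothetical and not the actual code from #149334, and binding both CPUs and memory to a single NUMA node is an assumption about the desired policy.

```python
import subprocess
import sys


def with_numactl(cmd, numa_node):
    """Return cmd prefixed with numactl flags binding CPUs and memory to one node."""
    return [
        "numactl",
        f"--cpunodebind={numa_node}",
        f"--membind={numa_node}",
        *cmd,
    ]


def launch(entrypoint, numa_node, script_args=()):
    """Hypothetical launcher: start e.g. "train.py" under numactl.

    Assumes numactl is installed and on PATH.
    """
    cmd = with_numactl([sys.executable, entrypoint, *script_args], numa_node)
    return subprocess.Popen(cmd)
```

Because the bindings are applied by numactl before the interpreter even starts, everything the script does, including its imports, runs under them.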
Callable Implementation Options
For the Callable path, there are two main possible approaches.
1. Wrap the Callable so that it affinitizes itself to the correct CPUs.
So far, I've prototyped this method. However, top-level code outside the function, like import torch, will naturally execute in the subprocess before the Callable and therefore before our bindings. In a benchmark, I noticed this method still significantly improved memory locality compared to no bindings, but it clearly had worse locality than the str equivalent of the benchmark.
When I forced the bindings to occur earlier by manually adding the right os.sched_setaffinity to the top of torch/__init__.py, the memory locality improved again but was still not as good as the str equivalent. Also, non-torch code could theoretically run before the bindings occur, and I'm not sure there's a particularly clean way to productionize this method anyway.
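For reference, a minimal sketch of what wrapping the Callable could look like. The class name AffinitizedCallable is hypothetical, this is not the actual prototype, and it only sets CPU affinity (no memory policy). os.sched_setaffinity is Linux-only, hence the hasattr guard.

```python
import os


class AffinitizedCallable:
    """Picklable wrapper that pins the current process to `cpus`, then calls `fn`.

    Hypothetical helper for illustration. Anything imported at module top level
    in the subprocess has already run by the time __call__ executes.
    """

    def __init__(self, fn, cpus):
        self.fn = fn
        self.cpus = set(cpus)

    def __call__(self, *args, **kwargs):
        # pid 0 means "the calling process"; only available on some platforms.
        if hasattr(os, "sched_setaffinity"):
            os.sched_setaffinity(0, self.cpus)
        return self.fn(*args, **kwargs)
```

A module-level class is used instead of a closure so the wrapper can be pickled and sent to the subprocess, provided fn itself is picklable.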
2. Force Process.start to use numactl CLI
Under the hood, we use Process.start to kick off the subprocesses in the Callable case. This allows us to input various args to the Callable via pickling and piping.
Ultimately, this just invokes a list[str] command line, so there's no inherent reason that we couldn't prepend it with numactl args. But, there is no API for doing this. The closest API is set_executable.
Unfortunately, set_executable doesn't straightforwardly work for us, because
- It only accepts a str, whereas we would need to prepend multiple strs since numactl needs arguments.
- The arguments to numactl need to be different for each local rank to do the correct bindings.
This led to a couple more radical ideas:
1. Create a different temp .sh file to execute for each local rank containing its proper numactl arguments, and then call set_executable in between each Process.start() call.
Problem: I'm a little antsy because there is this parallel path for starting the processes.
- Or we could use environment variables and a single script, but the race condition concern still seems to apply.
- This only applies to "fork" and not "spawn" though, which is the path we actually care about.
2. Create a single .sh file to execute which automatically applies the correct bindings based on the local rank
Problem: I don't see any way to access local rank at this point. The env var would not yet be set until the entrypoint runs.
3. Ask for a new API in multiprocessing. Maybe add ability to override set_executable per Process object.
Problem: Don't want to wait for this.
Discussion
Any ideas or preferences? Feels like I've thought through every conceivable method by now, but they each seem problematic.
At this point, I would just go with the multiple temporary .sh files method and just raise an exception if the start_method != "spawn".
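A rough sketch of the multiple temporary .sh files method, to make the proposal concrete. The function names and the one-NUMA-node-per-local-rank mapping are hypothetical, numactl is assumed to be on PATH, and I have not verified this end to end against elastic_launch.

```python
import multiprocessing as mp
import os
import shlex
import stat
import sys
from multiprocessing import spawn


def write_numactl_wrapper(numa_node, directory):
    """Write an executable .sh file that re-execs Python under numactl."""
    script = (
        "#!/bin/sh\n"
        f"exec numactl --cpunodebind={numa_node} --membind={numa_node} "
        f'{shlex.quote(sys.executable)} "$@"\n'
    )
    path = os.path.join(directory, f"python_numa_node_{numa_node}.sh")
    with open(path, "w") as f:
        f.write(script)
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
    return path


def start_with_numa_bindings(target, node_for_local_rank, directory,
                             start_method="spawn"):
    """Start one process per local rank, each via its own wrapper script."""
    if start_method != "spawn":
        raise ValueError("NUMA binding for Callable entrypoints requires 'spawn'")
    ctx = mp.get_context(start_method)
    original = spawn.get_executable()
    procs = []
    try:
        for local_rank, node in enumerate(node_for_local_rank):
            # set_executable is process-global, so it is swapped before each start().
            ctx.set_executable(write_numactl_wrapper(node, directory))
            proc = ctx.Process(target=target, args=(local_rank,))
            proc.start()
            procs.append(proc)
    finally:
        ctx.set_executable(original)
    return procs
```

The caller owns directory and should only delete it after the subprocesses have started. Since set_executable is global state, this sketch is only safe if nothing else starts processes concurrently.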
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @raghavhrishi