
[RFC] Upstream TorchElastic to PyTorch #50621


@kiukchung

-- with @pritamdamania, @mrshenli, @H-Huang, @Tierex

This document describes the proposal to upstream TorchElastic to PyTorch. For brevity, this doc only discusses the topics at a high level; topics that require an in-depth technical RFC will be updated with a link to that RFC.

Goal

A training-paradigm-agnostic, fault-tolerant/elastic distributed launcher (global restart on a failure or scaling event)

TorchElastic adds fault tolerance and elasticity support to distributed PyTorch jobs (under the assumptions mentioned below). The first step towards a training-paradigm-agnostic distributed launcher is therefore to upstream TorchElastic to PyTorch proper and, over time, blend it into distributed PyTorch core.

TorchElastic Overview

TorchElastic (https://pytorch.org/elastic) is a library to launch and manage distributed PyTorch worker processes, handling complexities that naturally arise in distributed applications such as fault tolerance, membership coordination, rank and role assignment, and distributed error reporting and propagation. Since its inception (December 2019), TorchElastic has become the default way to run distributed PyTorch jobs at Facebook.

At a high level, the TorchElastic agent (henceforth simply referred to as the “agent”) is started on each host and is responsible for creating and managing the worker processes. TorchElastic is compatible with any existing PyTorch script under the following assumptions:

  1. Rank and role assignment is done by the agent before workers are created
  2. All workers belonging to a process group are restarted on any worker failure

The assumptions above imply that:

  1. Work between checkpoints is lost on failures
  2. Membership is static up to a failure/scaling event (e.g. the number of workers for each rendezvous version is static)
  3. Users need not manually assign RANK, MASTER_ADDRESS, or MASTER_PORT
  4. A no-argument dist.init_process_group() can and should be used (see the minimal script sketch below)
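
To make these assumptions concrete, here is a minimal sketch of a worker script that satisfies them. It is not taken from the TorchElastic codebase; load_checkpoint, save_checkpoint, and train_one_epoch are hypothetical helpers.

# Minimal sketch of a TorchElastic-compatible worker script (illustrative only).
# load_checkpoint, save_checkpoint and train_one_epoch are hypothetical helpers.
import torch.distributed as dist

def main():
    # RANK, WORLD_SIZE and the master address/port are exported by the agent before
    # the worker starts, so the process group is initialized from the environment.
    dist.init_process_group(backend="gloo", init_method="env://")

    state = load_checkpoint()      # resume from the last checkpoint after a restart
    while not state.done():
        train_one_epoch(state)
        save_checkpoint(state)     # work since the last checkpoint is lost on failure

    dist.destroy_process_group()

if __name__ == "__main__":
    main()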

The diagram below shows how TorchElastic works:

[Diagram: torchelastic_diagram]

Topics

The sections below assume familiarity with TorchElastic; please refer to the TorchElastic Overview above for a high-level summary.

The high level discussion/design topics for upstreaming TorchElastic to PyTorch are:

  1. (MVP) Module Name
  2. (MVP) Rendezvous
  3. (MVP) Launcher CLI
  4. Tutorials
  5. Release
  6. Open Questions

Below we discuss each topic in detail. Each topic is organized into Background and Proposal sub-sections: the former introduces the current state and context, and the latter describes the proposal for upstreaming.

1. Module Name

Background

TorchElastic’s GitHub repository (https://github.com/pytorch/elastic) contains the core platform code as well as Kubernetes, AWS, and Azure integration code. For the purposes of this doc we focus on upstreaming the core platform code, which resides under the pytorch/elastic/torchelastic subdirectory as:

torchelastic
  |- agent
  |- distributed
  |- events
  |- metrics
  |- multiprocessing
  |- rendezvous
  |- timer
  |- utils
# module path is: torchelastic.<module> (e.g. torchelastic.agent)

Proposal

We propose the upstream to be done in two phases:

  1. Phase I (MVP): the torchelastic/** directory is moved to torch/distributed/elastic/** and torch.distributed.elastic is released as a Beta feature (an illustrative import-path sketch follows below).
  2. Phase II (over time): blend enhanced torchelastic modules into existing top-level modules in torch. Some of these modules are (not an exhaustive list):
    1. torchelastic.multiprocessing → torch.multiprocessing
    2. torchelastic.logging|events|metrics → (new) torch.logging|events|metrics
    3. torchelastic.distributed → (fold into) torch.distributed
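
For illustration only, the Phase I move changes import paths roughly as follows; torchelastic.rendezvous is used as the example submodule here, and the exact symbols users would import from it are not spelled out in this doc.

# Illustrative only: Phase I moves torchelastic/** under torch/distributed/elastic/**,
# so imports change along these lines.
#
#   before:  import torchelastic.rendezvous as rdzv
#   after:   import torch.distributed.elastic.rendezvous as rdzv
#
# Phase II would later fold some functionality into existing top-level torch modules,
# e.g. torchelastic.multiprocessing into torch.multiprocessing.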

2. Rendezvous

Background

While the concept of rendezvous in TorchElastic and PyTorch is similar, there are differences:

                                        TorchElastic               PyTorch
  Defines RendezvousHandler interface   Yes                        No
  Registration mechanism                explicit (as a registry)   implicit (via import statements)
  Implementations                       etcd, zeus, mast           zeus
  Required to run a job                 Yes                        No
  Part of                               agent process              worker process

Strictly speaking, rendezvous is NOT needed to run distributed PyTorch as long as rank and role (master) assignment is done by some entity (e.g. for each worker the following env vars are set: RANK, MASTER_ADDRESS, MASTER_PORT); a hypothetical example of this follows below. For PyTorch jobs invoked through TorchElastic, users need not worry about rendezvous, rank, or role assignment.
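
The sketch below illustrates this rendezvous-less path; the concrete host, port, and rank values are placeholders, and the entity exporting them (a scheduler, wrapper script, etc.) is left unspecified.

# Hypothetical example: no rendezvous at all; some external entity sets these per worker.
# The values are placeholders; MASTER_ADDR is the env var name read by init_method="env://".
import os
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "192.168.1.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")           # unique per worker
os.environ.setdefault("WORLD_SIZE", "2")     # total number of workers

dist.init_process_group(backend="gloo", init_method="env://")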

In OSS, TorchElastic ships with a non-trivial etcd-based rendezvous implementation, which implies a runtime dependency on an etcd server and a compile-time dependency on an etcd client.

Proposal

We propose that:

  1. EtcdRendezvous is upstreamed to torch as-is.
  2. A dependency on etcd-client is added to requirements.txt in torch.
  3. A StandaloneRendezvous is implemented on top of c10d::TCPStore (a rough sketch follows after this list) that:
    1. At a minimum supports a fixed number of nodes plus fault tolerance (i.e. elasticity is not part of the MVP)
    2. Takes MASTER_ADDRESS and MASTER_PORT as input from the user
    3. Has a single point of failure at the elected MASTER_ADDRESS
    4. Has no external library/service dependency (hence the name “Standalone”)
  4. A ManualRendezvous is implemented that:
    1. Takes RANK, WORLD_SIZE, MASTER_ADDRESS and MASTER_PORT from the user (i.e. the existing behavior in torch)
    2. Makes it possible to keep the launcher CLI backwards compatible
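
As referenced in item 3, here is a rough sketch of how a rendezvous built on c10d::TCPStore could assign ranks for a fixed-size, non-elastic job. This is an assumption for illustration, not the proposed implementation; the key name, env var usage, and NODE_RANK-based master election are all made up here.

# Rough sketch only: rank assignment via a TCPStore hosted on the elected master.
# Key names, env var usage and master election below are assumptions for illustration.
import datetime
import os
from torch.distributed import TCPStore

master_addr = os.environ["MASTER_ADDR"]          # provided by the user
master_port = int(os.environ["MASTER_PORT"])     # provided by the user
world_size = int(os.environ["WORLD_SIZE"])       # fixed number of workers (no elasticity)
is_master = os.environ.get("NODE_RANK", "0") == "0"

# The elected master hosts the store; it is the single point of failure noted above.
store = TCPStore(master_addr, master_port, world_size, is_master,
                 timeout=datetime.timedelta(seconds=300))

# add() atomically increments the counter and returns the new value,
# yielding ranks 0 .. world_size - 1 in arrival order.
rank = store.add("/rdzv/participants", 1) - 1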

The proposal above will ensure that:

  1. Users DO NOT have to use etcd if they do not want to
  2. Users who want a highly available rendezvous CAN use etcd (but have to set up the etcd server on their own)

Followups

  1. RFC on StandaloneRendezvous
  2. Check with the release team that adding the etcd-client dependency is OK (esp. for Windows builds)

3. Launcher CLI

Background

While one can use TorchElastic programmatically, it also includes an out-of-the-box launcher CLI that is a drop-in replacement for torch.distributed.launch. The image below shows launching YOUR_TRAINING_SCRIPT.py with the pt.launcher (left) versus the pet.launcher (right):

[Image: side-by-side comparison of pt.launcher and pet.launcher invocations]

Proposal

Replace the implementation of torch.distributed.launch with torchelastic.distributed.launch. The new launch command would look like:

# --rdzv_backend could be etcd if torchelastic is installed
python -m torch.distributed.launch \
    --nproc_per_node $NUM_TRAINERS \
    --nnodes $NUM_HOSTS \
    --rdzv_backend standalone \
    --rdzv_id $JOB_ID \
    YOUR_TRAINING_SCRIPT.py [--arg1 ... train script args ...]

-- OR (arguments backwards compatible with existing launcher)

# uses ManualRendezvous
python -m torch.distributed.launch \
    --nproc_per_node $NUM_TRAINERS \
    --nnodes $NUM_HOSTS \
    --node_rank 0 \
    --master_addr "192.168.1.1" \
    --master_port 8080 \
    YOUR_TRAINING_SCRIPT.py [--arg1 ... train script args ...]

The behavior and arguments of the launcher are preserved and remain 100% backwards compatible.

4. Tutorials

Background

Tutorial
Currently there is a set of “Parallel and Distributed Training” tutorials on pytorch.org (https://pytorch.org/tutorials/) covering:

  1. Distributed data parallel
  2. Single-server model parallel + pipelining
  3. Distributed RPC

What is lacking is a “syllabus” (similar to an academic course) that the user can follow to learn how to work with distributed training in PyTorch in general. That is, while the existing tutorials guide the user through several paradigms, they could be enhanced to teach the basic primitives needed to build custom distributed applications with well-defined behaviors, topologies, and failure handling.

Documentation
TorchElastic uses gh-pages + Sphinx, which is the same mechanism that PyTorch uses to document its APIs. As long as the top-level .rst files are added to the main PyTorch docs gh-pages branch, everything else should work seamlessly.

Proposal

  1. Tutorial:
    1. Edit the existing tutorials to match the upstreamed usage pattern (this may be enough)
    2. Revamp the existing tutorials to more clearly guide the user across different “levels”
  2. Documentation:
    1. Merge rst files from TorchElastic gh-pages branch (https://github.com/pytorch/elastic/tree/gh-pages) to the correct places on the PyTorch gh-pages branch

5. Release

Open Questions

  1. How should we time the release? PyTorch 1.8? 1.9?
  2. Do we upstream in phases? (e.g. target some MVP for 1.8 and then fast follow with a 1.8.1 or complete in 1.9)

Followups

  1. Concrete timeline (with dates matching the 1.8 and 1.9 release dates) and phase delineations

6. Open Questions

  1. Should TSM also be open sourced with TorchElastic?
  2. Coordinating Kubeflow integration efforts (elastic#117: “[request] Do we have plan to merge Kubernetes part to kubeflow/pytorch-operator?”) with this upstream, since once upstreamed, torchelastic module paths change and the PyPI packages will be deprecated, breaking the pre-upstream integration
  3. Designate PoCs from:
    1. Documentation team - to revamp existing tutorials and ensure that docs are generated properly
    2. Release team - to coordinate release dates and versions
    3. Distributed PT - code reviews during upstreaming, review StandaloneRendezvous + other RFCs

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd
