Labels
module: nn (Related to torch.nn), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Description
🚀 The feature, motivation and pitch
Gradient norm clipping requires first computing the total gradient norm across the entire model. The current design of torch.nn.utils.clip_grad_norm_ fuses the norm computation and the clipping into a single call, which is insufficient for cases like pipeline parallelism (PP).
With PP, each stage holds only a subset of the model's parameters, so the grad norm needs to be computed on each PP stage and then reduced across all PP stages before clipping. Separating the grad norm computation from torch.nn.utils.clip_grad_norm_ would allow developers to properly reduce the grad norm before applying the clip, as sketched below.
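A minimal sketch of what the split could look like, assuming an L2 norm and a hypothetical `pp_group` process group spanning the PP stages. The function names below are illustrative only, not an existing PyTorch API:

```python
import torch
import torch.distributed as dist

def compute_local_grad_norm(parameters, norm_type: float = 2.0) -> torch.Tensor:
    # Norm over only the gradients owned by this PP stage.
    grads = [p.grad for p in parameters if p.grad is not None]
    if not grads:
        return torch.tensor(0.0)
    per_grad = torch.stack([torch.linalg.vector_norm(g, norm_type) for g in grads])
    return torch.linalg.vector_norm(per_grad, norm_type)

def clip_with_total_norm_(parameters, max_norm: float, total_norm: torch.Tensor):
    # Clip using a norm that was already reduced across stages.
    clip_coef = torch.clamp(max_norm / (total_norm + 1e-6), max=1.0)
    for p in parameters:
        if p.grad is not None:
            p.grad.mul_(clip_coef)

# On each PP stage (pp_group is assumed to already exist):
# local_norm = compute_local_grad_norm(model.parameters())
# total_sq = local_norm ** 2
# dist.all_reduce(total_sq, op=dist.ReduceOp.SUM, group=pp_group)   # sum of squares across stages
# clip_with_total_norm_(model.parameters(), max_norm=1.0, total_norm=total_sq.sqrt())
```

The key point is that the reduction step in the middle is left to the caller, so it can match whatever parallelism layout the model uses.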
Alternatives
No response
Additional context
See pytorch/torchtitan#596 and pytorch/torchtitan#649 for more context.
cc @albanD @mruberry @jbschlosser @walterddr @mikaylagawarecki