Labels
module: nn (Related to torch.nn), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Description
🚀 The feature, motivation and pitch
Gradient norm clipping requires first computing the total gradient norm across the entire model. The current design of torch.nn.utils.clip_grad_norm_ fuses the norm computation and the clipping into a single call, which is insufficient for cases like pipeline parallelism (PP).
With PP, each stage holds only a subset of the model's parameters, so the grad norm needs to be computed on each PP stage and then reduced across all PP stages before clipping. Separating the grad norm computation from torch.nn.utils.clip_grad_norm_ would allow developers to properly reduce the grad norm before applying the clip, as sketched below.
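A minimal sketch of what the split could look like, assuming an L2 norm and a hypothetical `pp_group` process group spanning the PP stages. The function names below are illustrative only, not an existing PyTorch API:

```python
import torch
import torch.distributed as dist

def compute_local_grad_norm(parameters, norm_type: float = 2.0) -> torch.Tensor:
    # Norm over only the gradients owned by this PP stage.
    grads = [p.grad for p in parameters if p.grad is not None]
    if not grads:
        return torch.tensor(0.0)
    per_grad = torch.stack([torch.linalg.vector_norm(g, norm_type) for g in grads])
    return torch.linalg.vector_norm(per_grad, norm_type)

def clip_with_total_norm_(parameters, max_norm: float, total_norm: torch.Tensor):
    # Clip using a norm that was already reduced across stages.
    clip_coef = torch.clamp(max_norm / (total_norm + 1e-6), max=1.0)
    for p in parameters:
        if p.grad is not None:
            p.grad.mul_(clip_coef)

# On each PP stage (pp_group is assumed to already exist):
# local_norm = compute_local_grad_norm(model.parameters())
# total_sq = local_norm ** 2
# dist.all_reduce(total_sq, op=dist.ReduceOp.SUM, group=pp_group)   # sum of squares across stages
# clip_with_total_norm_(model.parameters(), max_norm=1.0, total_norm=total_sq.sqrt())
```

The key point is that the reduction step in the middle is left to the caller, so it can match whatever parallelism layout the model uses.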
Alternatives
No response
Additional context
See pytorch/torchtitan#596 and pytorch/torchtitan#649 for more context.
cc @albanD @mruberry @jbschlosser @walterddr @mikaylagawarecki