📚 The doc issue
Hi, I have a question about the implementation of Adafactor.
According to the PyTorch documentation and the Adafactor paper, a sum of the squared gradients should be used in the factored second-moment update. However, the existing implementations use a mean of the squared gradients (sketched below):
https://github.com/pytorch/pytorch/blob/v2.7.0/torch/optim/_adafactor.py#L538
https://github.com/pytorch/pytorch/blob/v2.7.0/torch/optim/_adafactor.py#L547
https://github.com/facebookresearch/fairseq/blob/main/fairseq/optim/adafactor.py#L236
https://github.com/google-research/t5x/blob/main/t5x/adafactor.py#L524
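For concreteness, here is a minimal sketch of the mean-based update along the lines of the linked implementations. The function name is mine and the eps terms are omitted; this is a simplification, not code taken from any of the repositories above:

```python
import torch

def factored_second_moment_mean(row_var, col_var, grad, beta2_t):
    """Mean-based variant, roughly as in the linked implementations (eps omitted)."""
    grad_sq = grad * grad
    # EMA of the per-row / per-column *means* of the squared gradient.
    row_var.lerp_(grad_sq.mean(dim=-1, keepdim=True), 1 - beta2_t)  # shape (m, 1)
    col_var.lerp_(grad_sq.mean(dim=-2, keepdim=True), 1 - beta2_t)  # shape (1, n)
    # Rank-1 reconstruction, renormalized by the mean of the row factor.
    return row_var @ col_var / row_var.mean(dim=-2, keepdim=True)

# Toy usage on a random m x n gradient.
m, n = 4, 3
row_var, col_var = torch.zeros(m, 1), torch.zeros(1, n)
grad = torch.randn(m, n)
v_hat = factored_second_moment_mean(row_var, col_var, grad, beta2_t=0.999)
```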
In contrast, the PyTorch documentation and the Adafactor paper both describe a sum of the squared gradients (see the second sketch below):
https://docs.pytorch.org/docs/stable/generated/torch.optim.Adafactor.html
https://arxiv.org/pdf/1804.04235
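And the corresponding sum-based update as written in Algorithm 4 of the paper and in the PyTorch docs. Again a simplified sketch with names of my own and eps terms omitted:

```python
import torch

def factored_second_moment_sum(row_acc, col_acc, grad, beta2_t):
    """Sum-based variant, as written in the paper and the docs (eps omitted)."""
    grad_sq = grad * grad
    # EMA of the per-row / per-column *sums* of the squared gradient.
    row_acc.lerp_(grad_sq.sum(dim=-1, keepdim=True), 1 - beta2_t)  # shape (m, 1)
    col_acc.lerp_(grad_sq.sum(dim=-2, keepdim=True), 1 - beta2_t)  # shape (1, n)
    # Rank-1 reconstruction, renormalized by the total of the row factor.
    return row_acc @ col_acc / row_acc.sum(dim=-2, keepdim=True)
```

If I am not mistaken, because the EMA is linear, the 1/n and 1/m factors introduced by the mean carry through both accumulators and cancel in the final ratio, so the two variants should produce the same variance estimate up to eps handling. But that equivalence (if intended) is exactly the kind of thing the documentation could state.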
Why is this discrepancy not documented or discussed?
Reference:
#109581
Suggest a potential alternative/fix
No response
cc @svekars @sekyondaMeta @AlannaBurke @vincentqb @jbschlosser @albanD @janeyx99 @crcrpar