📚 The doc issue
Hi, I have a question about the implementation of Adafactor.
According to the PyTorch documentation and the Adafactor paper, a sum of the squared gradients should be used in the factored second-moment update. However, the existing implementations use a mean of the squared gradients (sketched below):
https://github.com/pytorch/pytorch/blob/v2.7.0/torch/optim/_adafactor.py#L538
https://github.com/pytorch/pytorch/blob/v2.7.0/torch/optim/_adafactor.py#L547
https://github.com/facebookresearch/fairseq/blob/main/fairseq/optim/adafactor.py#L236
https://github.com/google-research/t5x/blob/main/t5x/adafactor.py#L524
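For concreteness, here is a minimal sketch of the mean-based update along the lines of the linked implementations. The function name is mine and the eps terms are omitted; this is a simplification, not code taken from any of the repositories above:

```python
import torch

def factored_second_moment_mean(row_var, col_var, grad, beta2_t):
    """Mean-based variant, roughly as in the linked implementations (eps omitted)."""
    grad_sq = grad * grad
    # EMA of the per-row / per-column *means* of the squared gradient.
    row_var.lerp_(grad_sq.mean(dim=-1, keepdim=True), 1 - beta2_t)  # shape (m, 1)
    col_var.lerp_(grad_sq.mean(dim=-2, keepdim=True), 1 - beta2_t)  # shape (1, n)
    # Rank-1 reconstruction, renormalized by the mean of the row factor.
    return row_var @ col_var / row_var.mean(dim=-2, keepdim=True)

# Toy usage on a random m x n gradient.
m, n = 4, 3
row_var, col_var = torch.zeros(m, 1), torch.zeros(1, n)
grad = torch.randn(m, n)
v_hat = factored_second_moment_mean(row_var, col_var, grad, beta2_t=0.999)
```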
In contrast, the PyTorch documentation and the Adafactor paper both describe a sum of the squared gradients (see the second sketch below):
https://docs.pytorch.org/docs/stable/generated/torch.optim.Adafactor.html
https://arxiv.org/pdf/1804.04235
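And the corresponding sum-based update as written in Algorithm 4 of the paper and in the PyTorch docs. Again a simplified sketch with names of my own and eps terms omitted:

```python
import torch

def factored_second_moment_sum(row_acc, col_acc, grad, beta2_t):
    """Sum-based variant, as written in the paper and the docs (eps omitted)."""
    grad_sq = grad * grad
    # EMA of the per-row / per-column *sums* of the squared gradient.
    row_acc.lerp_(grad_sq.sum(dim=-1, keepdim=True), 1 - beta2_t)  # shape (m, 1)
    col_acc.lerp_(grad_sq.sum(dim=-2, keepdim=True), 1 - beta2_t)  # shape (1, n)
    # Rank-1 reconstruction, renormalized by the total of the row factor.
    return row_acc @ col_acc / row_acc.sum(dim=-2, keepdim=True)
```

If I am not mistaken, because the EMA is linear, the 1/n and 1/m factors introduced by the mean carry through both accumulators and cancel in the final ratio, so the two variants should produce the same variance estimate up to eps handling. But that equivalence (if intended) is exactly the kind of thing the documentation could state.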
Why is this discrepancy not documented or discussed?
Reference:
#109581
Suggest a potential alternative/fix
No response
cc @svekars @sekyondaMeta @AlannaBurke @vincentqb @jbschlosser @albanD @janeyx99 @crcrpar