
documentation of Adafactor does not match the implementation #154862

@yorkerlin

Description

📚 The doc issue

Hi, I have a question about the implementation of Adafactor.

According to the PyTorch documentation and the Adafactor paper, a sum of the squared gradients should be used when updating the factored second-moment estimates. However, the existing implementations use a mean of the squared gradients.
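
For reference, this is how I read the factored second-moment update in the paper (notation follows the paper: $G_t$ is the gradient matrix, $1_n$ and $1_m$ are all-ones vectors, $\epsilon_1$ is the small regularizer), where multiplying by $1_m$ or $1_n^\top$ is a row or column sum:

$$
R_t = \hat{\beta}_2 R_{t-1} + (1 - \hat{\beta}_2)\left(G_t^2 + \epsilon_1 1_n 1_m^\top\right) 1_m
$$

$$
C_t = \hat{\beta}_2 C_{t-1} + (1 - \hat{\beta}_2)\, 1_n^\top \left(G_t^2 + \epsilon_1 1_n 1_m^\top\right)
$$

$$
\hat{V}_t = \frac{R_t C_t}{1_n^\top R_t}
$$

These are the sums I am referring to; the implementations linked below take a mean over the same dimensions instead.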

The mean of the squared gradients is used in these implementations:
https://github.com/pytorch/pytorch/blob/v2.7.0/torch/optim/_adafactor.py#L538
https://github.com/pytorch/pytorch/blob/v2.7.0/torch/optim/_adafactor.py#L547
https://github.com/facebookresearch/fairseq/blob/main/fairseq/optim/adafactor.py#L236
https://github.com/google-research/t5x/blob/main/t5x/adafactor.py#L524

The sum of the squared gradients is used in the PyTorch documentation and the Adafactor paper:
https://docs.pytorch.org/docs/stable/generated/torch.optim.Adafactor.html
https://arxiv.org/pdf/1804.04235
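
To make the difference concrete, here is a minimal self-contained sketch. It is not the PyTorch implementation: the helper name `factored_second_moment`, its signature, the `eps` value, and the choice of normalizer in the reconstruction are my own illustrative assumptions.

```python
import torch

def factored_second_moment(grad, row_var, col_var, beta2, eps=1e-30, use_mean=True):
    # Illustrative row/column second-moment update; NOT the actual PyTorch code.
    sq = grad * grad + eps  # squared gradients plus a small regularizer
    if use_mean:
        # What the linked implementations do: average over the reduced dimension.
        row_stat = sq.mean(dim=-1, keepdim=True)  # shape (n, 1)
        col_stat = sq.mean(dim=-2, keepdim=True)  # shape (1, m)
    else:
        # What the paper and the PyTorch docs write: a sum over the reduced dimension.
        row_stat = sq.sum(dim=-1, keepdim=True)
        col_stat = sq.sum(dim=-2, keepdim=True)
    # Exponential moving averages R_t and C_t.
    row_var.lerp_(row_stat, 1.0 - beta2)
    col_var.lerp_(col_stat, 1.0 - beta2)
    # Rank-1 reconstruction V_t = R_t C_t / normalizer(R_t); the normalizer is a sum
    # under the paper's convention and a mean under the mean-based variant.
    norm = row_var.mean() if use_mean else row_var.sum()
    return (row_var @ col_var) / norm

# Tiny usage example: both variants give the same reconstruction here.
g = torch.randn(4, 3)
v_sum = factored_second_moment(g, torch.zeros(4, 1), torch.zeros(1, 3), beta2=0.999, use_mean=False)
v_mean = factored_second_moment(g, torch.zeros(4, 1), torch.zeros(1, 3), beta2=0.999, use_mean=True)
print(torch.allclose(v_sum, v_mean))  # True: the row/column scale cancels in the reconstruction
```

At least in this simplified sketch, the two conventions reconstruct the same $\hat{V}_t$ because the extra $1/n$ and $1/m$ factors cancel in the normalization; whether that equivalence holds exactly with the documented epsilon and clipping handling is part of what I am asking.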

Why is this discrepancy not documented or discussed?

Reference:
#109581

@janeyx99

Suggest a potential alternative/fix

No response

cc @svekars @sekyondaMeta @AlannaBurke @vincentqb @jbschlosser @albanD @janeyx99 @crcrpar

Metadata

Assignees

No one assigned

    Labels

    actionable
    module: docs (Related to our documentation, both in docs/ and docblocks)
    module: optimizer (Related to torch.optim)
    triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
