[small][muon] Use addmm for Newton–Schulz orthogonalization #161379
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/161379
Note: Links to docs will display an error until the docs builds have been completed. ✅ No Failures as of commit 4758522 with merge base 74280d0. This comment was automatically generated by Dr. CI and updates every 15 minutes.
I think it's ok to make it the default implementation with addmm as it's strictly mathematically more accurate! (The differences are due to fusion, I would imagine)
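To see the kind of fusion-induced difference being described, here is a minimal standalone comparison of the unfused `scale + add + matmul` against `torch.addmm`. This is illustrative only: the shapes, coefficients, and dtype are arbitrary choices, not taken from this PR, and whether the two paths differ at all (and by how much) depends on the backend.

```python
import torch

torch.manual_seed(0)
M = torch.randn(256, 256, dtype=torch.bfloat16)
N = torch.randn(256, 256, dtype=torch.bfloat16)
beta, alpha = -4.7750, 2.0315  # illustrative coefficients, not from this PR

# Unfused: the matmul, the scale, and the add each round to bf16 separately.
unfused = beta * M + alpha * (M @ N)

# Fused: one kernel computes beta*M + alpha*(M @ N) with fewer intermediate roundings.
fused = torch.addmm(M, M, N, beta=beta, alpha=alpha)

# Any mismatch comes from the intermediate roundings of the unfused path;
# its magnitude (and whether it is nonzero at all) is backend-dependent.
print((unfused - fused).abs().max())
```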
great!
briefly discussed offline with @janeyx99. since
lgtm, the change to line 33 isn't needed anymore btw
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed, first few of them are: trunk / verify-cachebench-cpu-test / test (verify_cachebench, 1, 1, linux.2xlarge). Details for Dev Infra team: raised by workflow job.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…161379)

A performance optimization: use `torch.addmm`, which fuses `matrix multiply + scale + add` into one op.

**Benchmark**

In training a QWEN-like 0.5B model we observed an average `optimizer.step()` latency improvement from ~44.5 ms (matmul) to ~27.4 ms (addmm): a **1.62×** speedup.

matmul
<img width="1403" height="600" alt="Screenshot 2025-08-24 at 3 15 37 PM" src="https://github.com/user-attachments/assets/a77a68d4-da3c-473a-97f0-e6ef0a3b46d9" />

addmm
<img width="1426" height="602" alt="Screenshot 2025-08-24 at 3 13 42 PM" src="https://github.com/user-attachments/assets/e493af36-44d3-4026-9f7c-fd0f9cdbc7e5" />

**Testing**

End-to-end training: we used a training script that pre-trains a QWEN-like model on the `openwebtext-100k` dataset. We trained for one epoch, and the resulting loss curves are consistent between plain matmul and addmm.

<img width="1035" height="434" alt="Screenshot 2025-08-24 at 2 56 21 PM" src="https://github.com/user-attachments/assets/b96b13e3-0a01-4908-853c-d917b41f3d75" />

Unit test (imports added and the previously undefined `nesterov` pinned so the snippet runs standalone; the `Muon` import path is assumed):

```python
import copy

import torch
from torch.nn import Linear, MSELoss
from torch.optim import Muon  # import path assumed for this snippet

# dummy model and data
model0 = Linear(10, 10, bias=False)
model1 = copy.deepcopy(model0)
inputs = torch.randn(8, 10)
targets = torch.randn(8, 10)
loss = MSELoss()

lr = 1e-3
wd = 0.1
momentum = 0.95
nesterov = True  # undefined in the original snippet; True assumed

opt_ref_muon = Muon(
    params=model0.parameters(),
    lr=lr,
    weight_decay=wd,
    momentum=momentum,
    nesterov=nesterov,
    adjust_lr_fn="original",
)
opt_exp_muon = Muon(
    params=model1.parameters(),
    lr=lr,
    weight_decay=wd,
    momentum=momentum,
    nesterov=nesterov,
    adjust_lr_fn="original",
    use_addmm=True,
)

out_ref = model0(inputs)
loss_ref = loss(out_ref, targets)
opt_ref_muon.zero_grad()
loss_ref.backward()
opt_ref_muon.step()

out_exp = model1(inputs)
loss_exp = loss(out_exp, targets)
opt_exp_muon.zero_grad()
loss_exp.backward()
opt_exp_muon.step()

for p_ref, p_exp in zip(model0.parameters(), model1.parameters()):
    torch.testing.assert_close(p_ref, p_exp)
```

The test shows a numeric difference, but this is expected at bf16 precision:

```
Mismatched elements: 96 / 100 (96.0%)
Greatest absolute difference: 8.985400199890137e-05 at index (1, 9) (up to 1e-06 allowed)
Greatest relative difference: 0.007370449136942625 at index (0, 6) (up to 1e-05 allowed)
```

~~Introduced a flag that allows users to opt in, as there are numerical differences relative to the original implementation.~~

Update: since `addmm` fuses the math ops, there are fewer intermediate roundings, so it is more numerically accurate than the original form. Based on this, we make `addmm` the default and only option.

Pull Request resolved: pytorch#161379
Approved by: https://github.com/janeyx99
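For context on where the fusion applies, here is a minimal sketch of the quintic Newton–Schulz orthogonalization step that Muon uses, with each `scale + add + matmul` written as a single `torch.addmm`. The coefficients and the normalization follow the commonly cited Muon reference implementation rather than this PR's diff, so treat them as assumptions.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Quintic Newton-Schulz iteration: X <- a*X + b*(X X^T) X + c*(X X^T)^2 X.
    # Coefficients below are the common Muon reference values (assumed here).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    # Dividing by the Frobenius norm bounds the spectral norm by 1,
    # which the iteration needs to converge.
    X = X / (X.norm() + 1e-7)
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.mT
    for _ in range(steps):
        A = X @ X.mT
        # Unfused form: B = b * A + c * (A @ A); addmm does it in one kernel.
        B = torch.addmm(A, A, A, beta=b, alpha=c)
        # Unfused form: X = a * X + B @ X; again fused into a single addmm.
        X = torch.addmm(X, B, X, beta=a, alpha=1.0)
    if transposed:
        X = X.mT
    return X
```

Each iteration thus issues three GEMM-class kernels (one matmul and two addmm calls) instead of three matmuls plus separate scale-and-add kernels, which is where the `optimizer.step()` latency win reported in the benchmark comes from.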