perf(inductor): use for loop with shortcut in `Optimizer`s to speedup against list comprehensions (e.g. complex conversion) by jon-chuang · Pull Request #110613 · pytorch/pytorch · GitHub

Conversation

@jon-chuang (Collaborator) commented Oct 5, 2023

Fully fixes: #110506

Depends: #110607
Potential merge conflicts:

Related:

Results

Benchmark: 100 params.

Breakdowns (float32, dynamo):

Adagrad: this PR: 4.4s, main: 8.8s
Adam: this PR: 2.1s, main: 9.8s
AdamW: this PR: 2.5s, main: 8.2s
ASGD: this PR: 3.1s, main: 8.5s
RMSProp: this PR: 1.3s, main: 4.2s
RProp: this PR: 6.7s, main: 14.9s

Notes:

  1. Adagrad is still slow due to the _get_value list comprehension. This can be fixed in https://github.com/pytorch/pytorch/pull/110339/files by utilizing the capturable path.
  2. Adamax is not actually compiled (it is currently disabled).
  3. Inductor compile time is quite variable. We calculate the dynamo time by subtracting the call_user_compiler time from the compile_inner timing.

This PR:

Adagrad (torch.float32): 28.47496461868286s
Adagrad (torch.complex64): 29.379547357559204s
Adam (torch.float32): 17.334211587905884s
Adam (torch.complex64): 29.637500524520874s
Adamax (torch.float32): 2.4749321937561035s
Adamax (torch.complex64): 3.1997995376586914s
AdamW (torch.float32): 18.06532859802246s
AdamW (torch.complex64): 28.25661015510559s
ASGD (torch.float32): 23.70255398750305s
ASGD (torch.complex64): 25.33756995201111s
RMSprop (torch.float32): 7.964028596878052s
RMSprop (torch.complex64): 12.909599781036377s
Rprop (torch.float32): 30.512362003326416s
Rprop (torch.complex64): 44.74405765533447s

Main

Adagrad (torch.float32): 26.919506072998047s
Adagrad (torch.complex64): 35.190622091293335s
Adam (torch.float32): 25.715000867843628s
Adam (torch.complex64): 24.17716670036316s
Adamax (torch.float32): 2.4404726028442383s
Adamax (torch.complex64): 3.3538928031921387s
AdamW (torch.float32): 25.2022807598114s
AdamW (torch.complex64): 28.915700912475586s
ASGD (torch.float32): 24.108731985092163s
ASGD (torch.complex64): 26.589075088500977s
RMSprop (torch.float32): 10.781344175338745s
RMSprop (torch.complex64): 15.136352777481079s
Rprop (torch.float32): 42.46482181549072s
Rprop (torch.complex64): 48.28277635574341s

It seems this doesn't help the complex case by much (but that's not the majority case). The torch.float32 results are generally positive; where a run does not show a drastic improvement or appears to regress, it is due to inductor variance (confirmed by manually inspecting the logs).
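For context, here is a minimal sketch of the shape of the change, using the variable names from the foreach implementations reviewed below. The helper name _to_real_views is hypothetical; the real code inlines this loop in each optimizer's multi-tensor path.

import torch

def _to_real_views(device_params):
    # Before: a list comprehension that always rebuilds the list, e.g.
    #   [torch.view_as_real(x) if torch.is_complex(x) else x for x in device_params]
    # After: a plain for loop that mutates the list in place and only rewrites
    # the entries that are actually complex; dynamo traces this noticeably
    # faster (see the float32 breakdowns above).
    for i in range(len(device_params)):
        if torch.is_complex(device_params[i]):
            device_params[i] = torch.view_as_real(device_params[i])
    return device_params

The same loop shape applies to the gradient and optimizer-state tensor lists.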

Benchmark Script

import torch
import time
from torch.optim import Adagrad, Adam, Adamax, AdamW, ASGD, RMSprop, Rprop

OPTIMS = [Adagrad, Adam, Adamax, AdamW, ASGD, RMSprop, Rprop]
DTYPES = [torch.float, torch.cfloat]

NUM_PARAMS = 100
kwargs = { "lr": 0.01, "foreach": True }
summary = []

for optim_cls in OPTIMS:
    for dtype in DTYPES:
        # Reset dynamo so each optimizer/dtype pair is compiled from scratch.
        torch._dynamo.reset()
        # torch._inductor.metrics.reset()
        input = torch.ones([10, 10], dtype=dtype, device="cuda:0")
        model = torch.nn.Sequential(
            *[torch.nn.Linear(10, 10, dtype=dtype, device="cuda:0") for _ in range(NUM_PARAMS)]
        )

        # Populate .grad on every parameter before stepping the optimizer.
        model(input).sum().abs().backward()
        opt_compiled = optim_cls(model.parameters(), **kwargs)
        compiled_step = torch.compile(opt_compiled.step)

        with torch.set_grad_enabled(False):
            # Time the first call, which includes dynamo + inductor compilation.
            start_time = time.time()
            compiled_step()
            summary.append(f"{optim_cls.__name__} ({dtype}): {time.time() - start_time}s")

        print(optim_cls, kwargs, dtype, torch._dynamo.utils.compile_times())

for s in summary:
    print(s)

CC: @janeyx99 @mlazos

@pytorch-bot (bot) commented Oct 5, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/110613

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit b123833 with merge base cf1b494:

UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and have been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@jon-chuang changed the title from "Jon chuang/fast multi tensor optim" to "perf(inductor): use for loop with shortcut in Optimizers to speedup against list comprehensions (e.g. complex conversion)" Oct 5, 2023
@jon-chuang (Collaborator, Author) commented Oct 5, 2023

Actually, I know how to speed this up even further. We can add a has_complex flag, similar to has_sparse_grad.

See the use of has_sparse_grad to shortcut the any iterator in adagrad

This is currently being canary tested in #110607
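For reference, the has_sparse_grad shortcut works because Python's `and` short-circuits: when the caller already knows the flag is False, the any(...) generator is never consumed. A has_complex flag can follow the same shape. A minimal sketch below; the helper name _group_flags is hypothetical, and the exact plumbing is what #110607 is testing.

import torch

def _group_flags(device_params, device_grads, has_sparse_grad, has_complex):
    # `and` short-circuits: if has_sparse_grad is already False, the per-tensor
    # any(...) scan below never runs (this is the existing Adagrad pattern).
    device_has_sparse_grad = has_sparse_grad and any(g.is_sparse for g in device_grads)
    # Proposed analogue for complex support, mirroring the same shortcut.
    device_has_complex = has_complex and any(torch.is_complex(p) for p in device_params)
    return device_has_sparse_grad, device_has_complex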

@janeyx99 (Contributor) left a comment

Approval contingent on green CI :)

-    device_params = [torch.view_as_real(x) if torch.is_complex(x) else x for x in device_params]
+    for i in range(len(device_params)):
+        if torch.is_complex(device_params[i]):
Contributor review comment:

nice

for ((device_params, device_grads, device_state_sums, device_state_steps), _) in grouped_tensorlists.values():

-    device_has_sparse_grad = any(grad.is_sparse for grad in device_grads)
+    device_has_sparse_grad = has_sparse_grad and any(grad.is_sparse for grad in device_grads)
Contributor review comment:

oh, very good heuristic catch

@jon-chuang (Collaborator, Author) replied:

Yes, we can also do this for has_complex. Being tested in: #110607

Perhaps we should try to cover a smaller surface area with just the Adam changes (#110607) and then proceed with the large change here if things go smoothly?

@jon-chuang (Collaborator, Author) replied:

Anyway, the has_complex changes are orthogonal, so I'm also down to just merge this one first and rebase #110607 and further has_complex improvements on this PR.
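To make the orthogonal has_complex idea concrete, a rough sketch follows; the names are hypothetical and the eventual change landed via #110607/#110631, which may differ in detail. The flag is computed once while the param group is set up, so an all-real group skips the conversion loop entirely.

import torch

def _init_group_has_complex(group):
    # Computed once per param group, mirroring how has_sparse_grad is tracked.
    return any(torch.is_complex(p) for p in group["params"])

def _maybe_view_complex_as_real(device_params, has_complex):
    # An all-real group skips the whole loop thanks to the precomputed flag.
    if has_complex:
        for i in range(len(device_params)):
            if torch.is_complex(device_params[i]):
                device_params[i] = torch.view_as_real(device_params[i])
    return device_params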

@janeyx99 (Contributor) commented Oct 5, 2023

Also, please land the Adagrad sparse fix separately

@janeyx99 (Contributor) left a comment

Hey! Don't land before addressing my new comment

@jon-chuang (Collaborator, Author) commented

@janeyx99 to be more explicit, this is how the single_tensor case always shortcuts:

if torch.is_complex(param):

(the same check guards the complex handling in each of the single-tensor implementations; one of them precomputes it as a flag:)

if is_complex_param:
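For completeness, a simplified sketch of that single-tensor shortcut. The real single-tensor implementations also view the optimizer state (e.g. exp_avg) as real, and the update math is optimizer-specific; a plain SGD-style update stands in here, and the function name is hypothetical.

import torch

@torch.no_grad()
def _single_tensor_step_sketch(params, grads, lr=0.01):
    for param, grad in zip(params, grads):
        if torch.is_complex(param):
            # View the complex tensors as real [..., 2] so the update below runs
            # on real storage; the in-place add still updates the original
            # complex parameter through the view.
            param = torch.view_as_real(param)
            grad = torch.view_as_real(grad)
        param.add_(grad, alpha=-lr)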

@jon-chuang (Collaborator, Author) commented

@pytorchbot merge

@pytorch-bot (bot) added the ciflow/trunk (Trigger trunk jobs on your pull request) label Oct 5, 2023
@pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

pytorchmergebot pushed commits that referenced this pull request Oct 6, 2023
…_complex` shortcut (#110631)

Partial fix: #110606

More on `has_complex` shortcut: #110613 (comment)

CC: @janeyx99, @mlazos, @lezcano
Pull Request resolved: #110631
Approved by: https://github.com/lezcano

Development

Successfully merging this pull request may close these issues:

[dynamo] Slow compile times for optimizers due to for loops