fix(optim): adagrad sparse multitensor incorrect early exit #110454

jon-chuang · 2023-10-03T17:44:49Z

This PR: `test_adagrad_sparse` passes.

Main: `test_adagrad_sparse` fails with the following output:

test/optim/test_optim.py::TestOptim::test_adagrad_sparse FAILED [0.0058s]

================================ FAILURES ================================
____________________ TestOptim.test_adagrad_sparse ____________________
Traceback (most recent call last):
  File "/home/jonch/Desktop/Programming/mlsys/pytorch/test/optim/test_optim.py", line 1448, in test_adagrad_sparse
    self._test_rosenbrock_sparse(
  File "/home/jonch/Desktop/Programming/mlsys/pytorch/test/optim/test_optim.py", line 128, in _test_rosenbrock_sparse
    self.assertEqual(params, params_c, atol=1e-6, rtol=1e-6)
  File "/home/jonch/Desktop/Programming/mlsys/pytorch/torch/testing/_internal/common_utils.py", line 3309, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: Tensor-likes are not close!

Mismatched elements: 1 / 2 (50.0%)
Greatest absolute difference: 0.09999999999993325 at index (1,) (up to 1e-06 allowed)
Greatest relative difference: 0.06249999999996089 at index (1,) (up to 1e-06 allowed)
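
For readers unfamiliar with this class of bug: the title describes an incorrect early exit in the multi-tensor Adagrad path when sparse gradients are involved. The following is a plain-Python toy, not the PyTorch source. The function names (`step_buggy`, `step_fixed`, `apply_update`) and the group layout are hypothetical. It assumes the optimizer step loops over groups of parameters and handles groups with sparse gradients through a fallback path. Returning from inside the loop, rather than continuing, would leave later groups un-updated, which is one way to end up with only part of the parameters mismatching.

```python
# Illustrative only: a toy model of an early-exit bug, not the PyTorch implementation.

def apply_update(params, grads, lr=0.1):
    """Plain gradient step, applied in place to one group of scalars."""
    for i, g in enumerate(grads):
        params[i] -= lr * g


def step_buggy(groups):
    for group in groups:
        if group["sparse"]:
            # Fallback path for sparse gradients. The `return` leaves the
            # whole function, so the remaining groups are never updated.
            apply_update(group["params"], group["grads"])
            return
        apply_update(group["params"], group["grads"])


def step_fixed(groups):
    for group in groups:
        if group["sparse"]:
            apply_update(group["params"], group["grads"])
            continue  # move on to the next group instead of exiting
        apply_update(group["params"], group["grads"])


def make_groups():
    # Hypothetical data: one group flagged sparse, one dense.
    return [
        {"sparse": True, "params": [1.0], "grads": [1.0]},
        {"sparse": False, "params": [1.0], "grads": [1.0]},
    ]


buggy, fixed = make_groups(), make_groups()
step_buggy(buggy)
step_fixed(fixed)
print([g["params"] for g in buggy])  # second group is left untouched
print([g["params"] for g in fixed])  # both groups are updated
```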

CC: @janeyx99

pytorch-bot · 2023-10-03T17:44:53Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/110454

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 86aecab with merge base 4069d1d:

UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and have been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

jon-chuang · 2023-10-03T17:50:23Z

test/optim/test_optim.py

+                if w:
+                    grad_out = torch.tensor([grad[0], 0], dtype=param.dtype)
                else:
-                    param.grad = x.to_dense()


Note to reviewer:

Removing `to_dense` cuts the adagrad test time from 200s to 8s. `sgd_sparse` is still very slow because the optimizer itself is slow.

Instead of densifying a sparse tensor, we instantiate the cyclic grads for the dense case directly from the grad.
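
To illustrate why the dense gradient can be built directly: the diff above constructs `[grad[0], 0]` for the `w` case. The sketch below is a plain-Python stand-in that does not use PyTorch. It assumes the sparse gradient is an uncoalesced COO tensor whose duplicate entries at one index sum to that component of the gradient. The helper `coo_to_dense`, the sample gradient, and the split into `x / 4` and `x - x / 4` are illustrative assumptions, not taken from the test file.

```python
import math


def coo_to_dense(indices, values, size):
    """Densify a 1-D COO tensor; entries at duplicate indices are summed."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] += v
    return dense


grad = [8.0, -3.0]  # hypothetical dense gradient of a 2-element parameter

# Uncoalesced sparse gradient that only touches index 0 (assumed layout).
x = grad[0]
via_sparse = coo_to_dense([0, 0], [x / 4.0, x - x / 4.0], size=2)

# Direct construction, mirroring `[grad[0], 0]` in the diff.
direct = [grad[0], 0.0]

same = all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(via_sparse, direct))
print(same)
```

Under that assumption the two constructions agree, so skipping the sparse-to-dense round trip should not change what the dense optimizer sees.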

janeyx99

Looks good overall! Thanks for the catch and the quick fix.

I would still like the SGD testing moved to its own PR, just for modularity, even if it is only one line.

jon-chuang · 2023-10-05T16:48:08Z

Hello @janeyx99, could we merge this to unblock #110562?

(Anyway, apologies for not using ghstack; I'll try to get the hang of it shortly.)

janeyx99

awesome!

janeyx99 · 2023-10-05T18:02:55Z

@pytorchbot merge

pytorchmergebot · 2023-10-05T18:04:51Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

@janeyx99

… against list comprehensions (e.g. complex conversion) (#110613)

Fully fixes: #110506

Depends: #110607

Potential merge conflicts:
- #110339
- #110345
- #110454

Related:
- #110606 (we can apply the improvements here orthogonally to the complex support)

### Results

Benchmark: 100 params.

Breakdowns (float32, dynamo):
```
Adagrad: this PR: 4.4s, main: 8.8s
Adam: this PR: 2.1s, main: 9.8s
AdamW: this PR: 2.5s, main: 8.2s
ASGD: this PR: 3.1s, main: 8.5s
RMSProp: this PR: 1.3s, main: 4.2s
RProp: this PR: 6.7s, main: 14.9s
```

Notes:
1. Adagrad is still slow due to `_get_value` list comprehension. Can be fixed in https://github.com/pytorch/pytorch/pull/110339/files by utilizing capturable path
2. Adamax is not actually compiled (it is currently disabled).
3. Inductor compile time is quite variable. We calculate dynamo by subtracting `call_user_compiler` from `compile_inner` timing.

<details>

This PR:
```
Adagrad (torch.float32): 28.47496461868286s
Adagrad (torch.complex64): 29.379547357559204s
Adam (torch.float32): 17.334211587905884s
Adam (torch.complex64): 29.637500524520874s
Adamax (torch.float32): 2.4749321937561035s
Adamax (torch.complex64): 3.1997995376586914s
AdamW (torch.float32): 18.06532859802246s
AdamW (torch.complex64): 28.25661015510559s
ASGD (torch.float32): 23.70255398750305s
ASGD (torch.complex64): 25.33756995201111s
RMSprop (torch.float32): 7.964028596878052s
RMSprop (torch.complex64): 12.909599781036377s
Rprop (torch.float32): 30.512362003326416s
Rprop (torch.complex64): 44.74405765533447s
```

Main
```
Adagrad (torch.float32): 26.919506072998047s
Adagrad (torch.complex64): 35.190622091293335s
Adam (torch.float32): 25.715000867843628s
Adam (torch.complex64): 24.17716670036316s
Adamax (torch.float32): 2.4404726028442383s
Adamax (torch.complex64): 3.3538928031921387s
AdamW (torch.float32): 25.2022807598114s
AdamW (torch.complex64): 28.915700912475586s
ASGD (torch.float32): 24.108731985092163s
ASGD (torch.complex64): 26.589075088500977s
RMSprop (torch.float32): 10.781344175338745s
RMSprop (torch.complex64): 15.136352777481079s
Rprop (torch.float32): 42.46482181549072s
Rprop (torch.complex64): 48.28277635574341s
```

Seems that it doesn't help the complex case by much (but that's not the majority case). torch.float32 is generally positive, when it does not show drastic improvement / regresses, it is due to inductor variance (by manually inspecting the logs).

</details>

### Benchmark Script

```python
import torch
import time
from torch.optim import Adagrad, Adam, Adamax, AdamW, ASGD, RMSprop, Rprop

OPTIMS = [Adagrad, Adam, Adamax, AdamW, ASGD, RMSprop, Rprop]
DTYPES = [torch.float, torch.cfloat]

NUM_PARAMS = 100
kwargs = { "lr": 0.01, "foreach": True }
summary = []

for optim_cls in OPTIMS:
    for dtype in DTYPES:
        torch._dynamo.reset()
        # torch._inductor.metrics.reset()
        input = torch.ones([10, 10], dtype=dtype, device="cuda:0")
        model = torch.nn.Sequential(
            *[torch.nn.Linear(10, 10, dtype=dtype, device="cuda:0") for _ in range(NUM_PARAMS)]
        )

        model(input).sum().abs().backward()
        opt_compiled = optim_cls(model.parameters(), **kwargs)
        compiled_step = torch.compile(opt_compiled.step)

        with torch.set_grad_enabled(False):
            start_time = time.time()
            compiled_step()
            summary.append(f"{optim_cls.__name__} ({dtype}): {time.time() - start_time}s")

        print(optim_cls, kwargs, dtype, torch._dynamo.utils.compile_times())

for s in summary:
    print(s)
```

CC: @janeyx99 @mlazos

Pull Request resolved: #110613
Approved by: https://github.com/janeyx99
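
The float32 dynamo breakdown quoted in the commit message above can be read as speedup ratios (main time divided by this PR's time). A small sketch using only the numbers quoted there; the names `breakdown` and `speedups` are ours, not part of the benchmark script.

```python
# (this PR seconds, main seconds), copied from the float32 dynamo breakdown.
breakdown = {
    "Adagrad": (4.4, 8.8),
    "Adam": (2.1, 9.8),
    "AdamW": (2.5, 8.2),
    "ASGD": (3.1, 8.5),
    "RMSProp": (1.3, 4.2),
    "RProp": (6.7, 14.9),
}

# Speedup = time on main / time with the PR applied.
speedups = {name: main / pr for name, (pr, main) in breakdown.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
```

Given the note about inductor variance, these ratios are best treated as rough indications rather than precise measurements.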

Follow up to: #110454, which defines the infra for sparse multi tensor optimizer testing

Pull Request resolved: #110562
Approved by: https://github.com/janeyx99

jon-chuang added 3 commits October 3, 2023 11:32

fix early exit · 5ebfbda

add multi_tensor test to rosenbrock sparse · bffca00

speedup rosenbrock by avoiding to_dense · 361a433

jon-chuang requested review from albanD and janeyx99 as code owners October 3, 2023 17:44

pytorch-bot bot added the release notes: optim label Oct 3, 2023

jon-chuang changed the title ~~fix(optim): adagrad sparse multitensor early exit~~ fix(optim): adagrad sparse multitensor incorrect early exit Oct 3, 2023

jon-chuang commented Oct 3, 2023

pytorchbot added the open source label Oct 3, 2023

jon-chuang commented Oct 3, 2023

jon-chuang commented Oct 3, 2023

remove stray · b2741fc

janeyx99 reviewed Oct 3, 2023

remove torch stack, add sgd to test · 964c571

colesbury added the triaged label Oct 3, 2023

janeyx99 reviewed Oct 4, 2023

jon-chuang added 2 commits October 4, 2023 17:51

add comment, remove sgd · dd22efa

lint · 86aecab

This was referenced Oct 4, 2023

feat(optim): add SGD sparse multitensor to testing path #110562

Closed

perf(inductor): use for loop with shortcut in Optimizers to speedup against list comprehensions (e.g. complex conversion) #110613

Closed

janeyx99 approved these changes Oct 5, 2023

pytorch-bot bot added the ciflow/trunk label Oct 5, 2023

pytorchmergebot added the merging label Oct 5, 2023

pytorchmergebot added Merged and removed merging labels Oct 5, 2023

pytorchmergebot closed this in c99de9f Oct 5, 2023
