[MPS] Expand fused forloop to bfloat16 #141104

malfet · 2024-11-20T03:39:57Z

Stack from ghstack (oldest at bottom):

-> [MPS] Expand fused forloop to bfloat16 #141104

For MacOS14+

Running following script (adapted from one mentioned in #127242 )

import torch
from torch.optim import adam, adamw
import torch.utils.benchmark as benchmark
import itertools


def profile(fn, params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, amsgrad, fused):
    fn(
        params,
        grads,
        exp_avgs,
        exp_avg_sqs,
        max_exp_avg_sqs,
        state_steps,
        foreach=False,
        capturable=False,
        fused=fused,
        amsgrad=amsgrad,
        beta1=0.9,
        beta2=0.99,
        lr=1e-3,
        weight_decay=.0,
        eps=1e-5,
        maximize=False,
        grad_scale=None,
        found_inf=None,
    )
    torch.mps.synchronize()

device, dtype = "mps", torch.bfloat16

results = []

for num_tensors, numel, adamWflag, amsgrad in itertools.product([10, 50, 100], [1024, 65536, 1048576], [True, False], [True, False]):
    print(f"amsgrad: {amsgrad}, adamWflag: {adamWflag}, numel: {numel}, num_tensors: {num_tensors}")
    params, grads, exp_avgs, exp_avg_sqs = [[torch.arange(numel, dtype=dtype, device=device) + (numel * i) for i in range(num_tensors)] for _ in range(4)]
    max_exp_avg_sqs = [torch.arange(numel, dtype=dtype, device=device) for _ in range(num_tensors)] if amsgrad else []
    state_steps = [torch.tensor([5], dtype=dtype, device=device) for _ in range(num_tensors)]
    fn = adamw.adamw if adamWflag else adam.adam

    for fused in [True, False]:

        t = benchmark.Timer(
                stmt='profile(fn, params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, amsgrad, fused)',
                label=f'Fused Adam on {device} using {dtype}',
                sub_label=f"amsgrad: {amsgrad}, adamWflag: {adamWflag}, numel: {numel}, num_tensors: {num_tensors}",
                globals=locals(),
                description= f"Fused: {fused}",
            ).blocked_autorange(min_run_time=5)
        results.append(t)

compare = benchmark.Compare(results)
compare.trim_significant_figures()
compare.colorize(rowwise=True)
compare.print()

Produces following results on M4Pro running MacOS 15

[-------------------------------- Fused Adam on mps using torch.bfloat16 -------------------------------]
                                                                          |  Fused: True  |  Fused: False
1 threads: ----------------------------------------------------------------------------------------------
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 10        |       283     |      2810    
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 10       |       277     |      2430    
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 10       |       285     |      2400    
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 10      |       278     |      2250    
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 10       |       504     |      2700    
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 10      |       478     |      2600    
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 10      |       506     |      2500    
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 10     |       482     |      2300    
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 10     |      2089     |      4190    
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 10    |      1940     |      3800    
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 10    |      2100     |      3770    
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 10   |      1950     |      3600    
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 50        |       842     |     14000    
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 50       |       835     |     11800    
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 50       |       845     |     11700    
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 50      |       855     |     11000    
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 50       |      1410     |     14000    
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 50      |      1350     |     12000    
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 50      |      1400     |     12000    
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 50     |      1340     |     11000    
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 50     |      9767     |     20400    
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 50    |      8991     |     18600    
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 50    |      9803     |     18300    
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 50   |      9070     |     17600    
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 100       |      1600     |     27000    
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 100      |      1600     |     24100    
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 100      |      1600     |     23500    
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 100     |      1600     |     21800    
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 100      |      2740     |     26000    
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 100     |      2580     |     24000    
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 100     |      2730     |     25000    
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 100    |      2600     |     23000    
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 100    |     19350     |     39000    
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 100   |     17780     |     37300    
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 100   |     19400     |     37000    
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 100  |     17900     |     35500    
Times are in microseconds (us).

[ghstack-poisoned]

pytorch-bot · 2024-11-20T03:40:00Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141104

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

[DomainsOnly] Jobs fail with GLIBC version not found

❌ 2 New Failures, 1 Unrelated Failure

As of commit 3503e19 with merge base 0443398 ():

NEW FAILURES - The following jobs have failed:

Lint / lintrunner-noclang / linux-job (gh)
>>> Lint for test/test_nestedtensor.py:
linux-binary-manywheel / manywheel-py3_9-cuda12_6-test / test (gh)
RuntimeError: cuDNN version incompatibility: PyTorch was compiled against (9, 5, 1) but found runtime version (9, 1, 0). PyTorch already comes bundled with cuDNN. One option to resolving this error is to ensure PyTorch can find the bundled cuDNN. one possibility is that there is a conflicting cuDNN in LD_LIBRARY_PATH.

FLAKY - The following job failed but was likely due to flakiness present on trunk:

trunk / win-vs2019-cpu-py3 / test (default, 1, 3, lf.windows.4xlarge.nonephemeral) (gh) (similar failure)
'Test'

This comment was automatically generated by Dr. CI and updates every 15 minutes.

On MacOS14+ ghstack-source-id: 31feeee Pull Request resolved: #141104

qqaatw

Thanks for adding bf16 support. Would you mind pasting some perf numbers comparing cpu and mps on bf16?

aten/src/ATen/native/mps/kernels/FusedOptimizerOps.metal

[ghstack-poisoned]

On MacOS14+ ghstack-source-id: 6a13cd8 Pull Request resolved: #141104

malfet · 2024-11-21T23:22:59Z

@pytorchbot merge -i

pytorchmergebot · 2024-11-21T23:24:54Z

Merge started

Your change will be merged while ignoring the following 2 checks: Lint / lintrunner-noclang / linux-job, Mac MPS / macos-py3-arm64-mps / test (test_mps, 1, 1, macos-m2-15)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

malfet · 2024-11-22T01:02:07Z

@pytorchbot revert -m "Want to add test script to the commit message" -c weird

pytorchmergebot · 2024-11-22T01:03:34Z

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot · 2024-11-22T01:03:45Z

@malfet your PR has been successfully reverted.

This reverts commit 9a72939. Reverted #141104 on behalf of https://github.com/malfet due to Want to add test script to the commit message ([comment](#141104 (comment)))

malfet · 2024-11-22T01:04:55Z

@pytorchbot merge -f "Just reverting to amend commit message"

pytorchmergebot · 2024-11-22T01:06:36Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

For MacOS14+ Running following script ```python ``` Produces following results on M4Pro running MacOS 15 ``` [-------------------------------- Fused Adam on mps using torch.bfloat16 -------------------------------] | Fused: True | Fused: False 1 threads: ---------------------------------------------------------------------------------------------- amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 10 | 283 | 2810 amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 10 | 277 | 2430 amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 10 | 285 | 2400 amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 10 | 278 | 2250 amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 10 | 504 | 2700 amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 10 | 478 | 2600 amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 10 | 506 | 2500 amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 10 | 482 | 2300 amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 10 | 2089 | 4190 amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 10 | 1940 | 3800 amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 10 | 2100 | 3770 amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 10 | 1950 | 3600 amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 50 | 842 | 14000 amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 50 | 835 | 11800 amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 50 | 845 | 11700 amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 50 | 855 | 11000 amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 50 | 1410 | 14000 amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 50 | 1350 | 12000 amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 50 | 1400 | 12000 amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 50 | 1340 | 11000 amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 50 | 9767 | 20400 amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 50 | 8991 | 18600 amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 50 | 9803 | 18300 amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 50 | 9070 | 17600 amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 100 | 1600 | 27000 amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 100 | 1600 | 24100 amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 100 | 1600 | 23500 amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 100 | 1600 | 21800 amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 100 | 2740 | 26000 amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 100 | 2580 | 24000 amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 100 | 2730 | 25000 amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 100 | 2600 | 23000 amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 100 | 19350 | 39000 amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 100 | 17780 | 37300 amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 100 | 19400 | 37000 amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 100 | 17900 | 35500 Times are in microseconds (us). ``` Pull Request resolved: pytorch#141104 Approved by: https://github.com/qqaatw, https://github.com/kulinseth, https://github.com/Skylion007 ghstack dependencies: pytorch#141089, pytorch#141090, pytorch#141092, pytorch#141103

This reverts commit 9a72939. Reverted pytorch#141104 on behalf of https://github.com/malfet due to Want to add test script to the commit message ([comment](pytorch#141104 (comment)))

For MacOS14+ Running following script (adapted from one mentioned in pytorch#127242 ) ```python import torch from torch.optim import adam, adamw import torch.utils.benchmark as benchmark import itertools def profile(fn, params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, amsgrad, fused): fn( params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, foreach=False, capturable=False, fused=fused, amsgrad=amsgrad, beta1=0.9, beta2=0.99, lr=1e-3, weight_decay=.0, eps=1e-5, maximize=False, grad_scale=None, found_inf=None, ) torch.mps.synchronize() device, dtype = "mps", torch.bfloat16 results = [] for num_tensors, numel, adamWflag, amsgrad in itertools.product([10, 50, 100], [1024, 65536, 1048576], [True, False], [True, False]): print(f"amsgrad: {amsgrad}, adamWflag: {adamWflag}, numel: {numel}, num_tensors: {num_tensors}") params, grads, exp_avgs, exp_avg_sqs = [[torch.arange(numel, dtype=dtype, device=device) + (numel * i) for i in range(num_tensors)] for _ in range(4)] max_exp_avg_sqs = [torch.arange(numel, dtype=dtype, device=device) for _ in range(num_tensors)] if amsgrad else [] state_steps = [torch.tensor([5], dtype=dtype, device=device) for _ in range(num_tensors)] fn = adamw.adamw if adamWflag else adam.adam for fused in [True, False]: t = benchmark.Timer( stmt='profile(fn, params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, amsgrad, fused)', label=f'Fused Adam on {device} using {dtype}', sub_label=f"amsgrad: {amsgrad}, adamWflag: {adamWflag}, numel: {numel}, num_tensors: {num_tensors}", globals=locals(), description= f"Fused: {fused}", ).blocked_autorange(min_run_time=5) results.append(t) compare = benchmark.Compare(results) compare.trim_significant_figures() compare.colorize(rowwise=True) compare.print() ``` Produces following results on M4Pro running MacOS 15 ``` [-------------------------------- Fused Adam on mps using torch.bfloat16 -------------------------------] | Fused: True | Fused: False 1 threads: ---------------------------------------------------------------------------------------------- amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 10 | 283 | 2810 amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 10 | 277 | 2430 amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 10 | 285 | 2400 amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 10 | 278 | 2250 amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 10 | 504 | 2700 amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 10 | 478 | 2600 amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 10 | 506 | 2500 amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 10 | 482 | 2300 amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 10 | 2089 | 4190 amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 10 | 1940 | 3800 amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 10 | 2100 | 3770 amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 10 | 1950 | 3600 amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 50 | 842 | 14000 amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 50 | 835 | 11800 amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 50 | 845 | 11700 amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 50 | 855 | 11000 amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 50 | 1410 | 14000 amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 50 | 1350 | 12000 amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 50 | 1400 | 12000 amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 50 | 1340 | 11000 amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 50 | 9767 | 20400 amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 50 | 8991 | 18600 amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 50 | 9803 | 18300 amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 50 | 9070 | 17600 amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 100 | 1600 | 27000 amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 100 | 1600 | 24100 amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 100 | 1600 | 23500 amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 100 | 1600 | 21800 amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 100 | 2740 | 26000 amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 100 | 2580 | 24000 amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 100 | 2730 | 25000 amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 100 | 2600 | 23000 amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 100 | 19350 | 39000 amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 100 | 17780 | 37300 amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 100 | 19400 | 37000 amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 100 | 17900 | 35500 Times are in microseconds (us). ``` Pull Request resolved: pytorch#141104 Approved by: https://github.com/qqaatw, https://github.com/kulinseth, https://github.com/Skylion007 ghstack dependencies: pytorch#141089, pytorch#141090, pytorch#141092, pytorch#141103

For MacOS14+ Running following script ```python ``` Produces following results on M4Pro running MacOS 15 ``` [-------------------------------- Fused Adam on mps using torch.bfloat16 -------------------------------] | Fused: True | Fused: False 1 threads: ---------------------------------------------------------------------------------------------- amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 10 | 283 | 2810 amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 10 | 277 | 2430 amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 10 | 285 | 2400 amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 10 | 278 | 2250 amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 10 | 504 | 2700 amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 10 | 478 | 2600 amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 10 | 506 | 2500 amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 10 | 482 | 2300 amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 10 | 2089 | 4190 amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 10 | 1940 | 3800 amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 10 | 2100 | 3770 amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 10 | 1950 | 3600 amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 50 | 842 | 14000 amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 50 | 835 | 11800 amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 50 | 845 | 11700 amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 50 | 855 | 11000 amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 50 | 1410 | 14000 amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 50 | 1350 | 12000 amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 50 | 1400 | 12000 amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 50 | 1340 | 11000 amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 50 | 9767 | 20400 amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 50 | 8991 | 18600 amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 50 | 9803 | 18300 amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 50 | 9070 | 17600 amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 100 | 1600 | 27000 amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 100 | 1600 | 24100 amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 100 | 1600 | 23500 amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 100 | 1600 | 21800 amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 100 | 2740 | 26000 amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 100 | 2580 | 24000 amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 100 | 2730 | 25000 amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 100 | 2600 | 23000 amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 100 | 19350 | 39000 amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 100 | 17780 | 37300 amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 100 | 19400 | 37000 amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 100 | 17900 | 35500 Times are in microseconds (us). ``` Pull Request resolved: pytorch#141104 Approved by: https://github.com/qqaatw, https://github.com/kulinseth, https://github.com/Skylion007 ghstack dependencies: pytorch#141089, pytorch#141090, pytorch#141092, pytorch#141103

This reverts commit 9a72939. Reverted pytorch#141104 on behalf of https://github.com/malfet due to Want to add test script to the commit message ([comment](pytorch#141104 (comment)))

For MacOS14+ Running following script (adapted from one mentioned in pytorch#127242 ) ```python import torch from torch.optim import adam, adamw import torch.utils.benchmark as benchmark import itertools def profile(fn, params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, amsgrad, fused): fn( params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, foreach=False, capturable=False, fused=fused, amsgrad=amsgrad, beta1=0.9, beta2=0.99, lr=1e-3, weight_decay=.0, eps=1e-5, maximize=False, grad_scale=None, found_inf=None, ) torch.mps.synchronize() device, dtype = "mps", torch.bfloat16 results = [] for num_tensors, numel, adamWflag, amsgrad in itertools.product([10, 50, 100], [1024, 65536, 1048576], [True, False], [True, False]): print(f"amsgrad: {amsgrad}, adamWflag: {adamWflag}, numel: {numel}, num_tensors: {num_tensors}") params, grads, exp_avgs, exp_avg_sqs = [[torch.arange(numel, dtype=dtype, device=device) + (numel * i) for i in range(num_tensors)] for _ in range(4)] max_exp_avg_sqs = [torch.arange(numel, dtype=dtype, device=device) for _ in range(num_tensors)] if amsgrad else [] state_steps = [torch.tensor([5], dtype=dtype, device=device) for _ in range(num_tensors)] fn = adamw.adamw if adamWflag else adam.adam for fused in [True, False]: t = benchmark.Timer( stmt='profile(fn, params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, amsgrad, fused)', label=f'Fused Adam on {device} using {dtype}', sub_label=f"amsgrad: {amsgrad}, adamWflag: {adamWflag}, numel: {numel}, num_tensors: {num_tensors}", globals=locals(), description= f"Fused: {fused}", ).blocked_autorange(min_run_time=5) results.append(t) compare = benchmark.Compare(results) compare.trim_significant_figures() compare.colorize(rowwise=True) compare.print() ``` Produces following results on M4Pro running MacOS 15 ``` [-------------------------------- Fused Adam on mps using torch.bfloat16 -------------------------------] | Fused: True | Fused: False 1 threads: ---------------------------------------------------------------------------------------------- amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 10 | 283 | 2810 amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 10 | 277 | 2430 amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 10 | 285 | 2400 amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 10 | 278 | 2250 amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 10 | 504 | 2700 amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 10 | 478 | 2600 amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 10 | 506 | 2500 amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 10 | 482 | 2300 amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 10 | 2089 | 4190 amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 10 | 1940 | 3800 amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 10 | 2100 | 3770 amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 10 | 1950 | 3600 amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 50 | 842 | 14000 amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 50 | 835 | 11800 amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 50 | 845 | 11700 amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 50 | 855 | 11000 amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 50 | 1410 | 14000 amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 50 | 1350 | 12000 amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 50 | 1400 | 12000 amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 50 | 1340 | 11000 amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 50 | 9767 | 20400 amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 50 | 8991 | 18600 amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 50 | 9803 | 18300 amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 50 | 9070 | 17600 amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 100 | 1600 | 27000 amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 100 | 1600 | 24100 amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 100 | 1600 | 23500 amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 100 | 1600 | 21800 amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 100 | 2740 | 26000 amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 100 | 2580 | 24000 amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 100 | 2730 | 25000 amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 100 | 2600 | 23000 amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 100 | 19350 | 39000 amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 100 | 17780 | 37300 amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 100 | 19400 | 37000 amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 100 | 17900 | 35500 Times are in microseconds (us). ``` Pull Request resolved: pytorch#141104 Approved by: https://github.com/qqaatw, https://github.com/kulinseth, https://github.com/Skylion007 ghstack dependencies: pytorch#141089, pytorch#141090, pytorch#141092, pytorch#141103

Update

64daa80

[ghstack-poisoned]

malfet requested a review from kulinseth as a code owner November 20, 2024 03:39

malfet mentioned this pull request Nov 20, 2024

[MPS][BE] Move FusedOptimizerOps to its own shader #141092

Closed

malfet mentioned this pull request Nov 20, 2024

[MPS][BE] Let preprocessor do preprocessing #141103

Closed

pytorch-bot bot added ciflow/mps Run MPS tests (subset of trunk) release notes: mps Release notes category labels Nov 20, 2024

malfet added a commit that referenced this pull request Nov 20, 2024

[MPS] Expand fused forloop to bfloat16

2a4f2db

On MacOS14+ ghstack-source-id: 31feeee Pull Request resolved: #141104

malfet requested review from Skylion007 and qqaatw November 20, 2024 03:46

qqaatw approved these changes Nov 20, 2024

View reviewed changes

aten/src/ATen/native/mps/kernels/FusedOptimizerOps.metal Show resolved Hide resolved

kulinseth approved these changes Nov 20, 2024

View reviewed changes

malfet added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 20, 2024

Update

3503e19

[ghstack-poisoned]

malfet added a commit that referenced this pull request Nov 20, 2024

[MPS] Expand fused forloop to bfloat16

0e5cff0

On MacOS14+ ghstack-source-id: 6a13cd8 Pull Request resolved: #141104

Skylion007 approved these changes Nov 20, 2024

View reviewed changes

pytorchmergebot added the merging label Nov 21, 2024

pytorchmergebot added the Merged label Nov 21, 2024

pytorchmergebot closed this in 9a72939 Nov 21, 2024

pytorchmergebot removed the merging label Nov 21, 2024

pytorchmergebot added Reverted ci-no-td Do not run TD on this PR labels Nov 22, 2024

pytorchmergebot reopened this Nov 22, 2024

pytorchmergebot added the merging label Nov 22, 2024

pytorchmergebot closed this in d8b4406 Nov 22, 2024

pytorchmergebot removed the merging label Nov 22, 2024

malfet deleted the gh/malfet/63/head branch December 12, 2024 22:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[MPS] Expand fused forloop to bfloat16 #141104

[MPS] Expand fused forloop to bfloat16 #141104

Uh oh!

malfet commented Nov 20, 2024 •

edited

Loading

Uh oh!

pytorch-bot bot commented Nov 20, 2024 •

edited

Loading

Uh oh!

qqaatw left a comment

Uh oh!

Uh oh!

malfet commented Nov 21, 2024

Uh oh!

pytorchmergebot commented Nov 21, 2024

Uh oh!

malfet commented Nov 22, 2024

Uh oh!

pytorchmergebot commented Nov 22, 2024

Uh oh!

pytorchmergebot commented Nov 22, 2024

Uh oh!

malfet commented Nov 22, 2024

Uh oh!

pytorchmergebot commented Nov 22, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

[MPS] Expand fused forloop to bfloat16 #141104

[MPS] Expand fused forloop to bfloat16 #141104

Uh oh!

Conversation

malfet commented Nov 20, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pytorch-bot bot commented Nov 20, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141104

❗ 1 Active SEVs

❌ 2 New Failures, 1 Unrelated Failure

Uh oh!

qqaatw left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

malfet commented Nov 21, 2024

Uh oh!

pytorchmergebot commented Nov 21, 2024

Merge started

Uh oh!

malfet commented Nov 22, 2024

Uh oh!

pytorchmergebot commented Nov 22, 2024

Uh oh!

pytorchmergebot commented Nov 22, 2024

Uh oh!

malfet commented Nov 22, 2024

Uh oh!

pytorchmergebot commented Nov 22, 2024

Merge started

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

malfet commented Nov 20, 2024 •

edited

Loading

pytorch-bot bot commented Nov 20, 2024 •

edited

Loading