[ROCm] Improve reduction sum performance #160466
* Use input vectorization for reduction_on_fastest_striding_dimension when dim0 >= 128
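As a rough illustration of what "input vectorization" means for a reduction over the fastest-striding (contiguous) dimension, here is a minimal pure-Python sketch. It is not the kernel code from this PR: the function names `scalar_sum` and `vectorized_sum` and the vector width `vec_size=4` are assumptions made for the example. The idea is that each step consumes several contiguous elements at once (standing in for one wide load) into per-lane accumulators, which are combined at the end.

```python
def scalar_sum(row):
    """Baseline: one element per step, one accumulator."""
    acc = 0.0
    for v in row:
        acc += v
    return acc


def vectorized_sum(row, vec_size=4):
    """Sum `row` by consuming `vec_size` contiguous elements per step.

    `vec_size` is a hypothetical vector width chosen for illustration.
    """
    acc = [0.0] * vec_size  # one accumulator per vector lane
    n_full = len(row) - len(row) % vec_size
    for start in range(0, n_full, vec_size):
        chunk = row[start:start + vec_size]  # stands in for one wide load
        for lane in range(vec_size):
            acc[lane] += chunk[lane]
    total = sum(acc)  # combine the per-lane partial sums
    for v in row[n_full:]:  # leftover elements that do not fill a vector
        total += v
    return total


row = [float(i) for i in range(130)]
print(scalar_sum(row), vectorized_sum(row))
```

On a GPU the benefit would come from issuing fewer, wider memory transactions per thread; the Python version only shows the access pattern and the lane-wise accumulation.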
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/160466
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit bff4a8a with merge base 4d419a7.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
* Use input vectorization for reduction_on_fastest_striding_dimension when dim0 >= 0

**Reproducer:**

```
import time
import torch

shapes = [(5079670, 128)]
dims = [1]

for i, shape in enumerate(shapes):
    x = torch.randn(shape, device='cuda', dtype=torch.float)
    # Warm-up runs
    for _ in range(10):
        w = torch.sum(x, dims[i])
        torch.cuda.synchronize()
    print(w.size())

    # Timed runs; synchronize so the GPU work finishes before the clock is read
    start_time = time.time()
    for _ in range(50):
        _ = torch.sum(x, dims[i])
        torch.cuda.synchronize()
    end_time = time.time()
    mean_time = (end_time - start_time) / 50
    print(f"Avg time for shape {shape}: {mean_time * 1e6:.2f} us")
```

**Before (MI300X):**
Avg time for shape (5079670, 128): 1629.99 us

**After (MI300X):**
Avg time for shape (5079670, 128): 1008.59 us

cherry-pick of pytorch#160466
lgtm
* Use input vectorization for reduction_on_fastest_striding_dimension when dim0 >= 0

**Reproducer:**

```
import time
import torch

shapes = [(5079670, 128)]
dims = [1]

for i, shape in enumerate(shapes):
    x = torch.randn(shape, device='cuda', dtype=torch.float)
    # Warm-up runs
    for _ in range(10):
        w = torch.sum(x, dims[i])
        torch.cuda.synchronize()
    print(w.size())

    # Timed runs; synchronize so the GPU work finishes before the clock is read
    start_time = time.time()
    for _ in range(50):
        _ = torch.sum(x, dims[i])
        torch.cuda.synchronize()
    end_time = time.time()
    mean_time = (end_time - start_time) / 50
    print(f"Avg time for shape {shape}: {mean_time * 1e6:.2f} us")
```

**Before (MI300X):**
Avg time for shape (5079670, 128): 1629.99 us

**After (MI300X):**
Avg time for shape (5079670, 128): 1008.59 us

cherry-pick of pytorch#160466

Fixes SWDEV-546136
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours).
* Use input vectorization for reduction_on_fastest_striding_dimension when dim0 >= 128

**Reproducer:**

```
import time
import torch

shapes = [(5079670, 128)]
dims = [1]

for i, shape in enumerate(shapes):
    x = torch.randn(shape, device='cuda', dtype=torch.float)
    # Warm-up runs
    for _ in range(10):
        w = torch.sum(x, dims[i])
        torch.cuda.synchronize()
    print(w.size())

    # Timed runs; synchronize so the GPU work finishes before the clock is read
    start_time = time.time()
    for _ in range(50):
        _ = torch.sum(x, dims[i])
        torch.cuda.synchronize()
    end_time = time.time()
    mean_time = (end_time - start_time) / 50
    print(f"Avg time for shape {shape}: {mean_time * 1e6:.2f} us")
```

**Before (MI300X):**
Avg time for shape (5079670, 128): 1629.99 us

**After (MI300X):**
Avg time for shape (5079670, 128): 1008.59 us

cherry-pick of pytorch#160466

Fixes SWDEV-546136
* Use input vectorization for reduction_on_fastest_striding_dimension when dim0 >= 128

**Reproducer:**

```
import time
import torch

shapes = [(5079670, 128)]
dims = [1]

for i, shape in enumerate(shapes):
    x = torch.randn(shape, device='cuda', dtype=torch.float)
    # Warm-up runs
    for _ in range(10):
        w = torch.sum(x, dims[i])
        torch.cuda.synchronize()
    print(w.size())

    # Timed runs; synchronize so the GPU work finishes before the clock is read
    start_time = time.time()
    for _ in range(50):
        _ = torch.sum(x, dims[i])
        torch.cuda.synchronize()
    end_time = time.time()
    mean_time = (end_time - start_time) / 50
    print(f"Avg time for shape {shape}: {mean_time * 1e6:.2f} us")
```

**Before (MI300X):**
Avg time for shape (5079670, 128): 1629.99 us

**After (MI300X):**
Avg time for shape (5079670, 128): 1008.59 us

Pull Request resolved: #160466
Approved by: https://github.com/petrex, https://github.com/jeffdaily
Reproducer:
Before (MI300X):
Avg time for shape (5079670, 128): 1629.99 us
After (MI300X):
Avg time for shape (5079670, 128): 1008.59 us
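To put the two timings in perspective, the short calculation below derives the speedup and a rough input-read bandwidth from the numbers reported above. It is a back-of-the-envelope sketch, not a measurement: it assumes 4 bytes per element (the reproducer uses `torch.float`, i.e. float32) and counts only the bytes read from the input, ignoring the much smaller output write. The variable names are chosen for this example.

```python
# Timings reported in this PR for shape (5079670, 128) on MI300X
before_us = 1629.99
after_us = 1008.59

rows, cols = 5079670, 128
bytes_per_elem = 4  # float32; assumption stated in the lead-in
bytes_read = rows * cols * bytes_per_elem  # input reads only

speedup = before_us / after_us
time_saved_pct = (before_us - after_us) / before_us * 100

# Effective read bandwidth in TB/s (1 TB = 1e12 bytes)
bw_before_tbs = bytes_read / (before_us * 1e-6) / 1e12
bw_after_tbs = bytes_read / (after_us * 1e-6) / 1e12

print(f"Speedup: {speedup:.2f}x ({time_saved_pct:.1f}% less time)")
print(f"Approx. read bandwidth: {bw_before_tbs:.2f} -> {bw_after_tbs:.2f} TB/s")
```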
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd