cublaslt/hipblaslt persistent workspace #156495
Conversation
Similar to cublas/hipblas, LT now allocates one workspace per handle+stream combo.
- fixes hipblaslt issue where memory use increased during graph capture
- preserves CUDA env var TORCH_CUBLASLT_UNIFIED_WORKSPACE
- moves LT workspace and size from CUDABlas.cpp into CublasHandlePool.cpp, new APIs:
  - size_t getCUDABlasLtWorkspaceSize()
  - void* getCUDABlasLtWorkspace()
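For illustration, here is a minimal, self-contained sketch of the caching pattern described above: one workspace kept per (handle, stream) pair, allocated lazily and reused across calls. The function names mirror the new APIs, but everything else is an assumption made for this sketch: the map type, the 32 MiB default, the CUBLASLT_WORKSPACE_SIZE override, and the use of std::malloc in place of PyTorch's CUDA caching allocator. The real functions also take no arguments and derive the current handle and stream internally.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <map>
#include <utility>

// Key identifying one cuBLASLt handle + CUDA stream combination. Plain
// integers stand in for cublasLtHandle_t and cudaStream_t in this sketch.
using HandleStreamKey = std::pair<std::uintptr_t, std::uintptr_t>;

// One persistent workspace per handle+stream combo, allocated lazily and kept
// for the lifetime of the process (the persistence is what the PR describes as
// avoiding the memory growth seen during graph capture).
static std::map<HandleStreamKey, void*> lt_workspaces;

// Sketch of getCUDABlasLtWorkspaceSize(): pick a default and let an
// environment variable override it. The variable name, the 32 MiB default,
// and the KiB units are assumptions for this sketch.
size_t getCUDABlasLtWorkspaceSize() {
  const char* env = std::getenv("CUBLASLT_WORKSPACE_SIZE");
  if (env != nullptr) {
    return static_cast<size_t>(std::atoll(env)) * 1024;  // value given in KiB
  }
  return 32ull * 1024 * 1024;  // 32 MiB default
}

// Sketch of getCUDABlasLtWorkspace(): look up the workspace for the given
// handle+stream pair, allocating it on first use. The real API takes no
// arguments and queries the current handle and stream itself; std::malloc
// stands in for the CUDA caching allocator.
void* getCUDABlasLtWorkspace(std::uintptr_t handle, std::uintptr_t stream) {
  const HandleStreamKey key{handle, stream};
  auto it = lt_workspaces.find(key);
  if (it == lt_workspaces.end()) {
    void* ws = std::malloc(getCUDABlasLtWorkspaceSize());
    it = lt_workspaces.emplace(key, ws).first;
  }
  return it->second;
}

int main() {
  // Two lookups with the same key return the same pointer; a different stream
  // gets its own workspace.
  void* a = getCUDABlasLtWorkspace(/*handle=*/1, /*stream=*/7);
  void* b = getCUDABlasLtWorkspace(1, 7);
  void* c = getCUDABlasLtWorkspace(1, 8);
  std::printf("same pair reused: %d, different stream distinct: %d\n",
              a == b, a != c);
  return 0;
}
```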
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/156495
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 35928f5 with merge base 3644b41.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@eqy this PR touches NVIDIA-specific code and the unified cublas/cublaslt workspace option that you recently worked on, so I would appreciate your review. This change was needed to fix memory behavior on ROCm, but making the same change for CUDA seemed like the correct thing to do as well.
Sure, I'm OOTO until next Friday; can I take a look then, or is this blocking something on your end?
@eqy thanks for taking a look. Not blocking anything.
@pytorchbot merge |
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job failed: trunk / cuda12.8-py3.10-gcc9-sm80 / build. Details for the Dev Infra team: raised by workflow job.
@pytorchbot merge |
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@jeffdaily since this PR merged, a new error is happening in ROCm trunk tests; not sure if this is to blame (https://github.com/pytorch/pytorch/actions/runs/15948855901/job/44987650811). Should we revert to evaluate?
@jeanschmidt the test is passing in the latest CI runs. Using the latest tip of main, I could not repro using the command from the CI link.
Cherry-pick of pytorch#156495. --------- Co-authored-by: Eddie Yan <eddiey@nvidia.com>
Similar to cublas/hipblas, LT now allocates one workspace per handle+stream combo.
- fixes hipblaslt issue where memory use increased during graph capture
- preserves CUDA env var TORCH_CUBLASLT_UNIFIED_WORKSPACE
- moves LT workspace and size from CUDABlas.cpp into CublasHandlePool.cpp, new APIs:
  - size_t getCUDABlasLtWorkspaceSize()
  - void* getCUDABlasLtWorkspace()

Fixes ROCm#2286. Pull Request resolved: pytorch#156495. Approved by: https://github.com/eqy (cherry picked from commit 996206e)
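As a usage note, the unified-workspace toggle preserved by this change is an environment variable. Below is a minimal, hedged sketch of how such a gate could be read; only the variable name TORCH_CUBLASLT_UNIFIED_WORKSPACE comes from this PR, while the helper name and the convention that "1" means enabled are assumptions for illustration.

```cpp
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: returns true when the user asks cuBLASLt to share the
// cuBLAS workspace rather than keep its own per handle+stream buffers. Only
// the variable name comes from the PR; treating "1" as enabled is an
// assumption for this sketch.
static bool useUnifiedCublasLtWorkspace() {
  const char* v = std::getenv("TORCH_CUBLASLT_UNIFIED_WORKSPACE");
  return v != nullptr && std::strcmp(v, "1") == 0;
}

int main() {
  std::printf("unified LT workspace: %s\n",
              useUnifiedCublasLtWorkspace() ? "enabled" : "disabled");
  return 0;
}
```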