cublaslt/hipblaslt persistent workspace by jeffdaily · Pull Request #156495 · pytorch/pytorch

Conversation

@jeffdaily
Collaborator

@jeffdaily jeffdaily commented Jun 20, 2025

Similar to cublas/hipblas, LT now allocates one workspace per handle+stream combo.

  • fixes hipblaslt issue where memory use increased during graph capture
  • preserves CUDA env var TORCH_CUBLASLT_UNIFIED_WORKSPACE
  • moves LT workspace and size from CUDABlas.cpp into CublasHandlePool.cpp, new APIs
    • size_t getCUDABlasLtWorkspaceSize()
    • void* getCUDABlasLtWorkspace()

Fixes ROCm#2286.

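To make the new APIs concrete, below is a minimal, self-contained C++ sketch of the per handle+stream caching pattern they expose. It is not the PyTorch implementation: LtHandle and Stream are stand-ins for cublasLtHandle_t and cudaStream_t/hipStream_t, getCurrentLtHandle() and getCurrentStream() are hypothetical helpers, the 1 MiB fallback is a placeholder, and the real code allocates through the CUDA caching allocator rather than new[]. Only the two function names and the CUBLASLT_WORKSPACE_SIZE override (value in KiB) correspond to PyTorch.

```cpp
// Minimal sketch of "one workspace per handle+stream combo" -- NOT the
// actual PyTorch code. LtHandle/Stream stand in for cublasLtHandle_t and
// cudaStream_t/hipStream_t; allocation uses new[] instead of the CUDA
// caching allocator.
#include <cstdio>
#include <cstdlib>
#include <map>
#include <memory>
#include <utility>

using LtHandle = void*;  // stand-in for cublasLtHandle_t
using Stream   = void*;  // stand-in for cudaStream_t / hipStream_t

// Dummy "current context" helpers so the sketch compiles on its own.
LtHandle getCurrentLtHandle() { static int h; return &h; }
Stream   getCurrentStream()   { static int s; return &s; }

// Workspace size, overridable via CUBLASLT_WORKSPACE_SIZE (value in KiB);
// the 1 MiB fallback is a placeholder, not PyTorch's real default logic.
size_t getCUDABlasLtWorkspaceSize() {
  static const size_t size = [] {
    if (const char* env = std::getenv("CUBLASLT_WORKSPACE_SIZE")) {
      return static_cast<size_t>(std::strtoull(env, nullptr, 10)) * 1024;
    }
    return static_cast<size_t>(1024 * 1024);
  }();
  return size;
}

// One lazily allocated workspace per (handle, stream) pair, cached for the
// lifetime of the process so repeated GEMMs (and graph capture) reuse the
// same buffer instead of allocating a new one on every call.
void* getCUDABlasLtWorkspace() {
  using Key = std::pair<LtHandle, Stream>;
  static std::map<Key, std::unique_ptr<char[]>> workspaces;
  auto& ws = workspaces[{getCurrentLtHandle(), getCurrentStream()}];
  if (!ws) {
    ws = std::make_unique<char[]>(getCUDABlasLtWorkspaceSize());
  }
  return ws.get();
}

int main() {
  void* a = getCUDABlasLtWorkspace();
  void* b = getCUDABlasLtWorkspace();
  std::printf("same buffer reused: %s\n", a == b ? "yes" : "no");
}
```

Keying the cache on (handle, stream) is what stops the memory growth during graph capture: replayed matmuls on the same stream reuse one workspace instead of allocating a fresh buffer on every call.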
@pytorch-bot

pytorch-bot bot commented Jun 20, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/156495

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 35928f5 with merge base 3644b41:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@jeffdaily jeffdaily marked this pull request as ready for review June 20, 2025 15:32
@jeffdaily jeffdaily requested review from eqy and syed-ahmed as code owners June 20, 2025 15:32
@jeffdaily jeffdaily added the ciflow/rocm label Jun 20, 2025
@jeffdaily jeffdaily added the release notes: rocm and release notes: cuda labels Jun 20, 2025
@jeffdaily
Collaborator Author

@eqy this PR touches NVIDIA-specific code and the unified cublas/cublaslt workspace option that you recently worked on. I would appreciate your review. This change was needed to fix memory behavior on ROCm, but it seemed like the correct thing to do was to make this change for CUDA as well.
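For context on the unified-workspace option mentioned above: the PR preserves TORCH_CUBLASLT_UNIFIED_WORKSPACE, which lets the cuBLASLt path share the regular cuBLAS workspace instead of keeping a separate one. The sketch below only illustrates what such a gate could look like; getCUDABlasWorkspace() and getDedicatedLtWorkspace() are hypothetical stubs, not confirmed PyTorch APIs, and the real wiring in CublasHandlePool.cpp may differ.

```cpp
// Illustration only: a hypothetical gate on TORCH_CUBLASLT_UNIFIED_WORKSPACE.
#include <cstdlib>
#include <cstring>

void* getCUDABlasWorkspace()    { static char buf[1]; return buf; }  // stand-in for the cuBLAS workspace
void* getDedicatedLtWorkspace() { static char buf[1]; return buf; }  // stand-in for the LT-only workspace

void* chooseLtWorkspace() {
  const char* env = std::getenv("TORCH_CUBLASLT_UNIFIED_WORKSPACE");
  const bool unified = env != nullptr && std::strcmp(env, "1") == 0;
  // Unified: LT matmuls reuse the cuBLAS workspace, so there is a single
  // buffer per handle+stream combo. Otherwise LT keeps its own buffer.
  return unified ? getCUDABlasWorkspace() : getDedicatedLtWorkspace();
}

int main() { return chooseLtWorkspace() != nullptr ? 0 : 1; }
```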

@eqy
Collaborator

eqy commented Jun 21, 2025

Sure, I'm OOTO until next Friday; can I take a look then, or is this blocking something on your end?

@jeffdaily
Collaborator Author

@eqy thanks for taking a look. Not blocking anything.

@mikaylagawarecki mikaylagawarecki added the triaged label Jun 26, 2025
@jeffdaily
Collaborator Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Jun 27, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 job failed; the first failure: trunk / cuda12.8-py3.10-gcc9-sm80 / build

Details for Dev Infra team (raised by workflow job)

@jeffdaily
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@jeanschmidt
Contributor

@jeffdaily since this PR merged, a new error is happening in ROCm trunk tests; not sure if this is to blame:

https://github.com/pytorch/pytorch/actions/runs/15948855901/job/44987650811

Should we revert to evaluate?

@jeffdaily
Collaborator Author

@jeanschmidt the test is passing in the latest CI runs. Using the latest tip of main, I could not repro using the command from the CI link:

PYTORCH_TEST_WITH_ROCM=1 python test/test_cuda.py TestMemPool.test_mempool_limited_memory_with_allocator

pruthvistony pushed a commit to ROCm/pytorch that referenced this pull request Jul 15, 2025
Cherry-pick of pytorch#156495.

---------

Co-authored-by: Eddie Yan <eddiey@nvidia.com>
pragupta pushed a commit to ROCm/pytorch that referenced this pull request Jul 21, 2025

Pull Request resolved: pytorch#156495
Approved by: https://github.com/eqy

(cherry picked from commit 996206e)
pragupta pushed a commit to pragupta/pytorch that referenced this pull request Jul 21, 2025
pragupta pushed a commit to ROCm/pytorch that referenced this pull request Jul 22, 2025
jithunnair-amd pushed a commit to ROCm/pytorch that referenced this pull request Jul 22, 2025
pragupta pushed a commit to ROCm/pytorch that referenced this pull request Jul 29, 2025

Labels

ciflow/rocm, ciflow/trunk, Merged, open source, release notes: cuda, release notes: rocm, triaged


Development

Successfully merging this pull request may close these issues.

Memory issues in nn.Linear due to creation of an empty tensor

6 participants