[BE]: Reduce binary size 40% using aggressive fatbin compression. #157791
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157791
Note: links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure as of commit 89b2585 with merge base 179dcc1. The following job has failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed 9979f4f to 95f63c6.
Official Docker image build failures appear unrelated.

@atalman The better compression finally makes libtorch small enough to survive building!

Hi @Skylion007, can we try to do it only for Linux for now? I see some Windows builds are failing; introducing this on Windows can be a follow-up PR. If the Linux builds are working, this will already be a big win. This is a big improvement in binary size:
```diff
 # AWS specific CUDA build guidance
 ENV TORCH_CUDA_ARCH_LIST Maxwell
-ENV TORCH_NVCC_FLAGS "-Xfatbin -compress-all"
+ENV TORCH_NVCC_FLAGS "-Xfatbin -compress-all -compress-mode=size"
```
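For anyone wanting to reproduce this locally, a minimal sketch of a source build with the new flag (assumes a CUDA >= 12.4 toolkit; TORCH_NVCC_FLAGS and TORCH_CUDA_ARCH_LIST are the env vars PyTorch's build already reads):

```bash
# Sketch: build a PyTorch wheel with size-mode fatbin compression.
# Requires a CUDA >= 12.4 toolkit; the resulting binary needs a >= 12.4 driver.
export TORCH_NVCC_FLAGS="-Xfatbin -compress-all -compress-mode=size"
export TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6;9.0;10.0;12.0+PTX"
python setup.py bdist_wheel
```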
Aren't whl files zip files already? If that's the case, should we try to package them with -9 or something and achieve the same result?
We had the -9 flag for zipping wheels before uploading to PyPI for a while. This did not produce a noticeable difference:
https://github.com/pytorch/test-infra/blob/b4870dd25914dbc61267c945726717f74207c0ef/release/pypi/prep_binary_for_pypi.sh#L77
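For context, roughly what that repack step does (a sketch; the wheel file name is a placeholder):

```bash
# Sketch: repack an existing wheel with maximum deflate compression (-9).
whl=torch-2.8.0-cp312-cp312-linux_x86_64.whl   # hypothetical file name
tmp=$(mktemp -d)
unzip -q "$whl" -d "$tmp"
(cd "$tmp" && zip -q -9 -r "$OLDPWD/repacked-$whl" .)
rm -rf "$tmp"
ls -l "$whl" "repacked-$whl"   # sizes come out nearly identical
```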
This uses a more aggressive compression BEFORE the binary gets compressed by the wheel, so it compresses better than trying to run aggressive compression on top of mild compression.
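A quick way to convince yourself of this effect (a sketch; the path is a placeholder and zstd merely stands in for whatever fatbin compression nvcc applies):

```bash
# Sketch: data that is already aggressively compressed barely shrinks
# when zipped again, while raw data still compresses somewhat.
so=libtorch_cuda.so                   # placeholder: any large binary
zstd -19 -k "$so" -o "$so.zst"        # aggressive compression first
zip -9 once.zip "$so"                 # zip over the raw binary
zip -9 twice.zip "$so.zst"            # zip over the pre-compressed one
ls -l once.zip twice.zip "$so.zst"    # twice.zip ends up ≈ "$so.zst" in size
```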
Just curious if there are any public docs on what compression algorithms are used by nvcc/cudafe.
Presumably it's zlib, but it's not clear. I think they used to use fatbin, but they switched to nvfatbin recently; I'm not sure it's open source. https://docs.nvidia.com/cuda/nvfatbin/index.html
.ci/manywheel/build_cuda.sh (outdated)
```diff
 ;;
 12.9)
-TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6;9.0;10.0;12.0+PTX"
+TORCH_CUDA_ARCH_LIST="6.0;7.0;7.5;8.0;8.6;9.0;10.0;12.0+PTX"
```
Adding 6.0 and 7.0 is not an option according to #157517 and #157517 (comment). Please remove them.
It wasn't an option until we reduced the size with the compression flags. Now everything seems like it might work (at least we aren't hitting the libtorch size limit).
Actually, this PR helped more than I anticipated and reduced binary size by nearly 40%, so we can include these archs after all.
These architectures are still deprecated and, more importantly, untested. Users on older GPUs should use the well-tested builds with CUDA 12.6.
FWIW, I don't see how "deprecated == untested". As long as they're supported (even if deprecated), they should be getting tested on the nvidia side just as well (also, 12.9 evolved from 12.6, so I find it improbable that the support for older architectures just fell off a cliff).
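Incidentally, one way to verify which SASS/PTX targets actually end up embedded in a given build (cuobjdump ships with the CUDA toolkit; the library path is an assumption):

```bash
# Sketch: list the ELF (SASS) cubins and PTX entries baked into the fatbin.
so=torch/lib/libtorch_cuda.so    # assumed install location
cuobjdump --list-elf "$so"       # one line per embedded sm_XX cubin
cuobjdump --list-ptx "$so"       # PTX entries (the "+PTX" part of the list)
```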
.ci/manywheel/build_cuda.sh (outdated)
```diff
 # WAR to resolve the ld error in libtorch build with CUDA 12.9
 if [[ "$PACKAGE_TYPE" == "libtorch" ]]; then
-TORCH_CUDA_ARCH_LIST="7.5;8.0;9.0;10.0;12.0+PTX"
+TORCH_CUDA_ARCH_LIST="6.0;7.0;7.5;8.0;9.0;10.0;12.0+PTX"
```
Same as above
What's the performance impact?

Regarding the failure related to NVSHMEM in the CUDA 12.6 build: do you think it would be possible to drop this compression feature for the 12.6 build? The CUDA 12.8 and 12.9 builds do not have this issue because they have already dropped those SM archs.

I believe we can only implement compression for CUDA 12.8+ builds. CUDA 12.6 builds are legacy builds and we currently do not plan to upload them to PyPI, so they can stay as they are. cc @kwen2501 @Skylion007
Force-pushed 6609b38 to a1381b0.
@albanD Another alternative is that we just also release a CUDA 12.6 wheel, and require newer drivers for CUDA 12.8+.

I may be wrong, but from what I understand, any usage of --compress-mode=size/balance will drop support for older CUDA drivers and will bump the minimum driver requirement to CUDA 12.4.
@pytorchbot revert -m "Reverting to avoid regressing on the driver supported" -c nosignal

@pytorchbot successfully started a revert job. Check the current status here.
Revert "[BE]: Reduce binary size 40% using aggressive fatbin compression. (#157791)"

This reverts commit 9bdf87e. Reverted #157791 on behalf of https://github.com/albanD due to Reverting to avoid regressing on the driver supported ([comment](#157791 (comment)))
@Skylion007 your PR has been successfully reverted.

Cross-posting from other questions outside of this PR:

So I saw this in action in vLLM, which merged a similar change that we had to revert (vllm-project/vllm#20847). The CUDA driver version I was running is 12.2 (reported by nvidia-smi), which is less than 12.4. @Skylion007, are there other libraries that adopted this change that we should tell about this constraint?

FYI @tridao with Flash Attention.
…161316) #159779

CUDA 13 added support for the --compress-mode flag for nvcc across all drivers of the CUDA 13.x toolkits, enabling the use of --compress-mode=size for significant size reduction (~71% less for the CUDA Math APIs, for example). https://developer.nvidia.com/blog/whats-new-and-important-in-cuda-toolkit-13-0/

Why we have to add this for CUDA 13 only, quoting @ptrblck: "Any usage of --compress-mode=size/balance will drop the support of older CUDA drivers and will bump the min. driver requirement to CUDA 12.4." #157791 (comment)

The default for CUDA 13 will be --compress-mode=balance, which gives smaller binaries than the LZ4 speed mode used in previous CUDA versions.

Related: #157791
Pull Request resolved: #161316
Approved by: https://github.com/nWEIdia, https://github.com/Skylion007
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.

Can there be a variant of PyTorch where this flag can be applied?
This fix has been merged into the PyTorch CUDA 13 binaries since those do not have the driver compatibility issue. See #161316 for more info.

NVCC has had a --compress-mode flag since CUDA 12.4 that tells it how to compress the fatbinary. The mode defaults to speed (pick a light compression mode so the file loads quickly). Since we are running into PyPI size issues, this will allow us to upload smaller wheel files.
From: https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#compress-mode-default-size-speed-balance-none-compress-mode
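As a standalone illustration of the flag (a toy kernel, not PyTorch's actual build invocation; requires a CUDA >= 12.4 toolkit):

```bash
# Sketch: compare fatbin compression modes on a single translation unit.
cat > toy.cu <<'EOF'
__global__ void axpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
EOF
for mode in speed balance size none; do
  nvcc -gencode arch=compute_90,code=sm_90 --compress-mode=$mode \
       -c toy.cu -o toy_$mode.o
done
ls -l toy_*.o   # "size" should come out smallest, "none" largest
```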
Up to a 37.2% reduction in binary size with virtually no drawback (except potentially slightly slower loading of the .so at PyTorch startup).
With -compress-mode=size vs. the previous default (note the compressed CUDA 12.9 build even includes two extra archs):
- CUDA 12.9: 694 MB (6.0;7.0;7.5;8.0;8.6;9.0;10.0;12.0+PTX) vs. 1.08 GB (7.5;8.0;8.6;9.0;10.0;12.0+PTX)
- CUDA 12.8: 604 MB vs. 845 MB
This ends up saving PyPI.org approximately 19.6 PiB of bandwidth per month for the CUDA 12.9 case.
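A back-of-the-envelope check of that figure (the per-download saving comes from the sizes above; the download count is implied, not measured):

```bash
# Sketch: what monthly download volume the 19.6 PiB/month figure implies.
awk 'BEGIN {
  saved = 1.08e9 - 694e6;   # bytes saved per CUDA 12.9 download
  total = 19.6 * 2^50;      # 19.6 PiB in bytes
  printf "implied downloads/month: %.1e\n", total / saved  # ~5.7e7
}'
```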
This will also allow us to add 12.0+PTX back to the CUDA 12.8 builds, which will make the package forward compatible with newer GPUs, undoing the need for PRs #157516 and #157634.
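For reference, my reading of what "12.0+PTX" in TORCH_CUDA_ARCH_LIST roughly expands to (the toy file name is a placeholder):

```bash
# Sketch: "12.0+PTX" = sm_120 SASS plus compute_120 PTX; the driver can JIT
# the PTX on GPUs newer than sm_120, which is what makes the wheel forward
# compatible.
nvcc -gencode arch=compute_120,code=sm_120 \
     -gencode arch=compute_120,code=compute_120 \
     -c toy.cu -o toy.o
```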
More details can be found in Nvidia's technical blog for CUDA 12.4: https://developer.nvidia.com/blog/runtime-fatbin-creation-using-the-nvidia-cuda-toolkit-12-4-compiler/
Note: This PR has been merged as #161316
cc @ptrblck @msaroufim @eqy @jerryzh168