[BE]: Reduce binary size 40% using aggressive fatbin compression. by Skylion007 · Pull Request #157791 · pytorch/pytorch

Conversation

@Skylion007
Collaborator

@Skylion007 Skylion007 commented Jul 8, 2025

NVCC has apparently had a --compress-mode flag since CUDA 12.4 that tells it how you want the fatbinary compressed. The mode defaults to speed (a low-compression setting so the file loads quickly). Since we are running into PyPI size limits, this will allow us to upload smaller wheel files.

From: https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#compress-mode-default-size-speed-balance-none-compress-mode

size
Uses a compression mode more focused on reduced binary size, at the cost of compression and decompression time.
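As a rough illustration, the flag can be passed on a plain nvcc compile like this (a minimal sketch: the source file and arch list are illustrative, not PyTorch's actual build invocation; requires a CUDA 12.4+ toolkit):

# Compile a CUDA source, asking for the size-optimized fatbin compression mode.
nvcc --compress-mode=size \
     -gencode arch=compute_80,code=sm_80 \
     -gencode arch=compute_90,code=sm_90 \
     -c kernel.cu -o kernel.o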

Up to 37.2% reduction in binary size with virtually no drawback (except potentially a little slower loading of the .so at PyTorch startup).

694 MB for CUDA 12.9 builds with 6.0;7.0;7.5;8.0;8.6;9.0;10.0;12.0+PTX
vs
1.08GB for CUDA 12.9 builds with 7.5;8.0;8.6;9.0;10.0;12.0+PTX

CUDA 12.9 694MB vs 1.08GB

CUDA 12.8 604MB vs 845MB

This ends up saving PyPI.org approximately 19.6 PiB of bandwidth per month for the CUDA 12.9 case.

This will also allow us to add 12.0+PTX back to the CUDA 12.8 build, which will make the package forward compatible on newer GPUs, undoing the need for PRs #157516 and #157634.

[screenshot: binary size comparison]

More details can be found in Nvidia's technical blog for CUDA 12.4: https://developer.nvidia.com/blog/runtime-fatbin-creation-using-the-nvidia-cuda-toolkit-12-4-compiler/

Note: This PR has been merged as #161316

cc @ptrblck @msaroufim @eqy @jerryzh168

@Skylion007 Skylion007 requested review from atalman and malfet July 8, 2025 13:49
@pytorch-bot

pytorch-bot bot commented Jul 8, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157791

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 89b2585 with merge base 179dcc1:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: releng release notes category label Jul 8, 2025
@Skylion007 Skylion007 changed the title [BE]: NVCC use fatbin compression mode size [BE]: NVCC use fatbin compression mode size. Try to add SM60 back Jul 8, 2025
@Skylion007 Skylion007 force-pushed the skylion007/nvcc-size-compress-mode-2025-07-08 branch from 9979f4f to 95f63c6 Compare July 8, 2025 13:59
@Skylion007
Collaborator Author

Official Docker image build failures appear unrelated.

@Skylion007 Skylion007 added ciflow/binaries_wheel Trigger binary build and upload jobs for wheel on the PR ciflow/binaries_libtorch Trigger binary build and upload jobs for libtorch on the PR labels Jul 8, 2025
@Skylion007
Collaborator Author

@atalman The better compression finally makes libtorch small enough to survive building!

@atalman
Contributor

atalman commented Jul 8, 2025

Hi @Skylion007, can we try to do this only for Linux for now? I see some Windows builds are failing. Introducing this on Windows can be a follow-up PR. If the Linux builds are working, that will already be a big win. This is a big improvement in binary size:

[screenshot: binary size comparison]

# AWS specific CUDA build guidance
ENV TORCH_CUDA_ARCH_LIST Maxwell
ENV TORCH_NVCC_FLAGS "-Xfatbin -compress-all"
ENV TORCH_NVCC_FLAGS "-Xfatbin -compress-all -compress-mode=size"
Contributor

Aren't whl files already zip archives? If that's the case, should we try to package them with -9 or something and achieve the same result?

Contributor

We had the -9 flag for zipping wheels before uploading to PyPI for a while. This has not produced a noticeable difference:
https://github.com/pytorch/test-infra/blob/b4870dd25914dbc61267c945726717f74207c0ef/release/pypi/prep_binary_for_pypi.sh#L77
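For reference, a sketch of what re-zipping a wheel at maximum deflate level looks like (illustrative only; this is not the exact logic of prep_binary_for_pypi.sh, and the wheel file name below is hypothetical):

# Unpack the wheel (a zip archive) and re-zip it at the highest deflate level.
WHL=torch-2.8.0-cp312-cp312-manylinux_2_28_x86_64.whl   # hypothetical file name
mkdir -p unpacked && unzip -q "$WHL" -d unpacked
(cd unpacked && zip -q -9 -r "../repacked-$WHL" .)
ls -lh "$WHL" "repacked-$WHL"   # compare sizes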

Collaborator Author

@Skylion007 Skylion007 Jul 8, 2025


This uses a more aggressive compression BEFORE the binary gets compressed into the wheel, so it compresses better than trying to run aggressive compression on top of mild compression.

Contributor

Just curious if there are any public docs on what compression algorithms are used by nvcc/cudafe

Collaborator Author

@Skylion007 Skylion007 Jul 9, 2025


Presumably it's zlib, but it's not clear. They used to use fatbin, I think, but switched to nvfatbin recently; I'm not sure whether it's open source. https://docs.nvidia.com/cuda/nvfatbin/index.html
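As an aside, cuobjdump (shipped with the CUDA toolkit) does not report the compression algorithm, but it can at least show which cubin/PTX entries end up embedded in a fat binary; the library path below is illustrative:

# List the ELF (cubin) and PTX entries packed into the fatbin sections of a shared library.
cuobjdump --list-elf /path/to/libtorch_cuda.so
cuobjdump --list-ptx /path/to/libtorch_cuda.so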

ptrblck
ptrblck previously requested changes Jul 8, 2025
;;
12.9)
TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6;9.0;10.0;12.0+PTX"
TORCH_CUDA_ARCH_LIST="6.0;7.0;7.5;8.0;8.6;9.0;10.0;12.0+PTX"
Collaborator

Adding 6.0 and 7.0 is not an option according to #157517 and #157517 (comment). Please remove them.

Collaborator Author

It wasn't an option until we reduced the size with the compression flags. Now everything seems like it might work (at the very least we aren't hitting the libtorch size limit).

Collaborator Author

Actually, this PR helped more than I anticipated and reduced binary size by nearly 40%, so we can include these archs after all.

Collaborator

These architectures are still deprecated and, more importantly, untested. Users on older GPUs should use the well-tested builds with CUDA 12.6.

Contributor

FWIW, I don't see how "deprecated == untested". As long as they're supported (even if deprecated), they should be getting tested on the nvidia side just as well (also, 12.9 evolved from 12.6, so I find it improbable that the support for older architectures just fell off a cliff).

# WAR to resolve the ld error in libtorch build with CUDA 12.9
if [[ "$PACKAGE_TYPE" == "libtorch" ]]; then
TORCH_CUDA_ARCH_LIST="7.5;8.0;9.0;10.0;12.0+PTX"
TORCH_CUDA_ARCH_LIST="6.0;7.0;7.5;8.0;9.0;10.0;12.0+PTX"
Collaborator

Same as above
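For anyone reproducing this locally, a rough sketch of a source build with the discussed arch list and flags (a sketch under the assumption that a local checkout honors these variables the same way the CI builder scripts do):

# Expanded arch list plus the size-optimized fatbin compression (needs CUDA 12.4+).
export TORCH_CUDA_ARCH_LIST="6.0;7.0;7.5;8.0;8.6;9.0;10.0;12.0+PTX"
export TORCH_NVCC_FLAGS="-Xfatbin -compress-all -compress-mode=size"
python setup.py bdist_wheel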

@ptrblck
Collaborator

ptrblck commented Jul 8, 2025

What's the performance impact?

speed
    Uses a compression mode more focused on reduced decompression time, at the cost of less reduction in final binary size.
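One crude way to eyeball the load-time cost is to time the first CUDA touch on otherwise identical wheels built with speed vs. size (a sketch, not a rigorous benchmark):

# Cold-start timing: importing torch and touching the GPU forces the fatbin to be loaded.
python -c "import time; t0 = time.time(); import torch; torch.zeros(1, device='cuda'); print(f'cold start: {time.time() - t0:.2f}s')"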

@kwen2501
Contributor

kwen2501 commented Jul 8, 2025

Regarding the failure related to NVSHMEM in the CUDA 12.6 build:

nvlink error   : Uncompress failed (target: sm_50)
nvlink fatal   : elfLink fatbinary error (target: sm_50)
nvlink error   : Uncompress failed (target: sm_60)
nvlink fatal   : elfLink fatbinary error (target: sm_60)

Do you think it would be possible to drop this compression feature for the 12.6 build?
Soon, in 2.9, we are going to deprecate those SM archs:
#157517

CUDA 12.8 and 12.9 builds do not have this issue because they have already dropped those SM archs.
cc @atalman

@atalman
Contributor

atalman commented Jul 8, 2025

I believe we can only implement compression for cu12.8+ builds. CUDA 12.6 builds are legacy builds, and currently we do not plan to upload them to PyPI, so they can stay as they are. cc @kwen2501 @Skylion007
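A minimal sketch of how such per-version gating could look in a build script (the version detection here is illustrative; the real builder scripts key off their own variables):

# Only append the new flag for the toolkits where the smaller PyPI wheels matter (12.8+).
CUDA_VERSION=$(nvcc --version | sed -n 's/.*release \([0-9][0-9]*\.[0-9]*\).*/\1/p')
case "$CUDA_VERSION" in
  12.8|12.9|13.*) TORCH_NVCC_FLAGS="${TORCH_NVCC_FLAGS} -compress-mode=size" ;;
  *) ;;  # leave 12.6 and older legacy builds untouched
esac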

@Skylion007 Skylion007 requested a review from ptrblck July 8, 2025 22:48
@Skylion007 Skylion007 force-pushed the skylion007/nvcc-size-compress-mode-2025-07-08 branch 5 times, most recently from 6609b38 to a1381b0 Compare July 8, 2025 22:57
@Skylion007 Skylion007 changed the title [BE]: NVCC use fatbin compression mode size. Try to add SM60 back [BE]: NVCC use fatbin compression mode size. Jul 8, 2025
@Skylion007
Collaborator Author

Skylion007 commented Jul 10, 2025

@albanD Another alternative is that we just also release a CUDA 12.6 wheel and require newer drivers for CUDA 12.8+.

@traversaro

It is not. We are using the same tooling we were using before; we are just passing slightly different settings into the fatbin utility we were already using. fatbin is a general compiler tool and not specific to a CUDA version.

I may be wrong, but from what I understand the --compress-mode flag is passed directly to nvcc; it is not passed to fatbin, as only -compress-all is prefixed by -Xfatbin. I could not find any docs on how nvcc propagates that flag to fatbin.
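One way to see what nvcc actually hands to the fatbinary step is --dryrun, which prints the sub-commands without running them (file name and arch below are illustrative):

# Inspect the generated fatbinary invocation to check whether --compress-mode propagates.
nvcc --dryrun --compress-mode=size -gencode arch=compute_80,code=sm_80 -c kernel.cu 2>&1 | grep -i fatbinary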

@ptrblck
Collaborator

ptrblck commented Jul 10, 2025

@ptrblck can you confirm what the state is here? Can the binaries generated with the "compress-mode=" flag be used with older drivers? In particular, is it any more restrictive than ">=525.60.13" currently advertised for all 12.X toolkits?

Any usage of --compress-mode=size/balance will drop support for older CUDA drivers and will bump the min. driver requirement to CUDA 12.4.
We would thus shrink our support matrix, and I expect to see a lot of issues for the next release.
My suggestion would be to hold this PR until the next CUDA 13 major release, which would require a driver update anyway.
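For users wondering whether a size/balance-compressed wheel will work on their machine, a quick driver check (per the comment above, the driver needs to be at least the CUDA 12.4 one):

# Print the installed driver version; the nvidia-smi banner also reports the max supported CUDA version.
nvidia-smi --query-gpu=driver_version --format=csv,noheader
nvidia-smi | head -n 4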

@albanD
Collaborator

albanD commented Jul 10, 2025

@pytorchbot revert -m "Reverting to avoid regressing on the driver supported" -c nosignal

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Jul 10, 2025
…ion. (#157791)"

This reverts commit 9bdf87e.

Reverted #157791 on behalf of https://github.com/albanD due to Reverting to avoid regressing on the driver supported ([comment](#157791 (comment)))
@pytorchmergebot
Collaborator

@Skylion007 your PR has been successfully reverted.

@ptrblck
Collaborator

ptrblck commented Jul 10, 2025

Cross-posting from other questions outside of this PR:
CUDA 12.4 was released in March 2024 and contained driver 550.54.14 for Linux

@zou3519
Contributor

zou3519 commented Jul 12, 2025

So I saw this in action in vLLM which merged a similar change that we had to revert (vllm-project/vllm#20847). The CUDA driver version I was running is 12.2 (reported by nvidia-smi), which is less than 12.4.

@Skylion007 are there other libraries that adopted this change that we should tell about this constraint?

@Skylion007
Collaborator Author

FYI @tridao with Flash Attention.

@albanD albanD added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Jul 14, 2025
pytorchmergebot pushed a commit that referenced this pull request Aug 24, 2025
…161316)

#159779

CUDA 13 added support for the --compress-mode flag for nvcc across all drivers of the CUDA 13.X toolkits, enabling the use of --compress-mode=size for significant size reduction (~71% smaller for the CUDA Math APIs, for example). https://developer.nvidia.com/blog/whats-new-and-important-in-cuda-toolkit-13-0/

Why we add this for CUDA 13 only, quoting @ptrblck: Any usage of --compress-mode=size/balance will drop the support of older CUDA drivers and will bump the min. driver requirement to CUDA 12.4. #157791 (comment)

The default for CUDA 13 will be --compress-mode=balance, which gives smaller binaries than the LZ4 speed mode used in previous CUDA versions.

Related - #157791

Pull Request resolved: #161316
Approved by: https://github.com/nWEIdia, https://github.com/Skylion007
@github-actions
Contributor

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Sep 12, 2025
@brunoais

Can there be a variant of pytorch where this flag can be applied?
Maybe with a suffix in the version number?

markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
@Skylion007
Collaborator Author

This fix has been merged for the CUDA 13 binaries since those do not have the driver compatibility issue. See #161316 for more info.


Labels

better-engineering - Relatively self-contained tasks for better engineering contributors
ci-no-td - Do not run TD on this PR
ciflow/binaries_libtorch - Trigger binary build and upload jobs for libtorch on the PR
ciflow/binaries_wheel - Trigger binary build and upload jobs for wheel on the PR
ciflow/trunk - Trigger trunk jobs on your pull request
Merged
module: cuda - Related to torch.cuda, and CUDA support in general
open source
release notes: build - release notes category
release notes: cuda - release notes category
release notes: releng - release notes category
Reverted
Stale
triaged - This issue has been looked at a team member, and triaged and prioritized into an appropriate module
