[BE]: Reduce binary size 40% using aggressive fatbin compression. by Skylion007 · Pull Request #157791 · pytorch/pytorch

Conversation

@Skylion007
Collaborator

@Skylion007 Skylion007 commented Jul 8, 2025

NVCC has apparently had a --compress-mode flag since CUDA 12.4 that tells it how you want the fatbinary compressed. The mode defaults to speed (a low-compression setting so the file loads quickly). Since we are running into PyPI size limits, this will allow us to upload smaller wheel files.

From: https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#compress-mode-default-size-speed-balance-none-compress-mode

size
Uses a compression mode more focused on reduced binary size, at the cost of compression and decompression time.
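As a rough illustration, the flag can be passed on a plain nvcc compile like this (a minimal sketch: the source file and arch list are illustrative, not PyTorch's actual build invocation; requires a CUDA 12.4+ toolkit):

# Compile a CUDA source, asking for the size-optimized fatbin compression mode.
nvcc --compress-mode=size \
     -gencode arch=compute_80,code=sm_80 \
     -gencode arch=compute_90,code=sm_90 \
     -c kernel.cu -o kernel.o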

Up to 37.2% reduction in binary size with virtually no drawback (except potentially a little slower loading of the .so at PyTorch startup).

694 MB for CUDA 12.9 builds with 6.0;7.0;7.5;8.0;8.6;9.0;10.0;12.0+PTX
vs
1.08GB for CUDA 12.9 builds with 7.5;8.0;8.6;9.0;10.0;12.0+PTX

CUDA 12.9 694MB vs 1.08GB

CUDA 12.8 604MB vs 845MB

This ends up saving PyPI.org approximately 19.6 PiB of bandwidth per month for the CUDA 12.9 case.

This will also allow us to add 12.0+PTX back to the CUDA 12.8 build, which will make the package forward compatible on newer GPUs, undoing the need for PRs #157516 and #157634.

[screenshot: binary size comparison]

More details can be found in Nvidia's technical blog for CUDA 12.4: https://developer.nvidia.com/blog/runtime-fatbin-creation-using-the-nvidia-cuda-toolkit-12-4-compiler/

Note: This PR has been merged as #161316

cc @ptrblck @msaroufim @eqy @jerryzh168

@Skylion007 Skylion007 requested review from atalman and malfet July 8, 2025 13:49
@pytorch-bot

pytorch-bot bot commented Jul 8, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157791

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 89b2585 with merge base 179dcc1:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: releng release notes category label Jul 8, 2025
@Skylion007 Skylion007 changed the title [BE]: NVCC use fatbin compression mode size [BE]: NVCC use fatbin compression mode size. Try to add SM60 back Jul 8, 2025
@Skylion007 Skylion007 force-pushed the skylion007/nvcc-size-compress-mode-2025-07-08 branch from 9979f4f to 95f63c6 Compare July 8, 2025 13:59
@Skylion007
Collaborator Author

Official Docker image build failures appear unrelated.

@Skylion007 Skylion007 added ciflow/binaries_wheel Trigger binary build and upload jobs for wheel on the PR ciflow/binaries_libtorch Trigger binary build and upload jobs for libtorch on the PR labels Jul 8, 2025
@Skylion007
Collaborator Author

@atalman The better compression finally makes libtorch small enough to survive building!

@atalman
Contributor

atalman commented Jul 8, 2025

Hi @Skylion007, can we try to do this only for Linux for now? I see some Windows builds are failing. Introducing this on Windows can be a follow-up PR. If the Linux builds are working, that will already be a big win. This is a big improvement in binary size:

[screenshot: binary size comparison]

# AWS specific CUDA build guidance
ENV TORCH_CUDA_ARCH_LIST Maxwell
ENV TORCH_NVCC_FLAGS "-Xfatbin -compress-all"
ENV TORCH_NVCC_FLAGS "-Xfatbin -compress-all -compress-mode=size"
Contributor

Aren't whl files already zip archives? If that's the case, should we try to package them with -9 or something and achieve the same result?

Contributor

We had the -9 flag for zipping wheels before uploading to PyPI for a while. This has not produced a noticeable difference:
https://github.com/pytorch/test-infra/blob/b4870dd25914dbc61267c945726717f74207c0ef/release/pypi/prep_binary_for_pypi.sh#L77
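For reference, a sketch of what re-zipping a wheel at maximum deflate level looks like (illustrative only; this is not the exact logic of prep_binary_for_pypi.sh, and the wheel file name below is hypothetical):

# Unpack the wheel (a zip archive) and re-zip it at the highest deflate level.
WHL=torch-2.8.0-cp312-cp312-manylinux_2_28_x86_64.whl   # hypothetical file name
mkdir -p unpacked && unzip -q "$WHL" -d unpacked
(cd unpacked && zip -q -9 -r "../repacked-$WHL" .)
ls -lh "$WHL" "repacked-$WHL"   # compare sizes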

Collaborator Author

@Skylion007 Skylion007 Jul 8, 2025


This uses a more aggressive compression BEFORE the binary gets compressed into the wheel, so it compresses better than trying to run aggressive compression on top of mild compression.

Contributor

Just curious if there are any public docs on what compression algorithms are used by nvcc/cudafe

Collaborator Author

@Skylion007 Skylion007 Jul 9, 2025


Presumably it's zlib, but it's not clear. They used to use fatbin, I think, but switched to nvfatbin recently; I'm not sure whether it's open source. https://docs.nvidia.com/cuda/nvfatbin/index.html
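As an aside, cuobjdump (shipped with the CUDA toolkit) does not report the compression algorithm, but it can at least show which cubin/PTX entries end up embedded in a fat binary; the library path below is illustrative:

# List the ELF (cubin) and PTX entries packed into the fatbin sections of a shared library.
cuobjdump --list-elf /path/to/libtorch_cuda.so
cuobjdump --list-ptx /path/to/libtorch_cuda.so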

ptrblck
ptrblck previously requested changes Jul 8, 2025
;;
12.9)
TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6;9.0;10.0;12.0+PTX"
TORCH_CUDA_ARCH_LIST="6.0;7.0;7.5;8.0;8.6;9.0;10.0;12.0+PTX"
Collaborator

Adding 6.0 and 7.0 is not an option according to #157517 and #157517 (comment). Please remove them.

Collaborator Author

It wasn't an option until we reduced the size with the compression flags. Now everything seems like it might work (at the very least we aren't hitting the libtorch size limit).

Collaborator Author

Actually, this PR helped more than I anticipated and reduced binary size by nearly 40%, so we can include these archs after all.

Collaborator

These architectures are still deprecated and, more importantly, untested. Users on older GPUs should use the well-tested builds with CUDA 12.6.

Contributor

FWIW, I don't see how "deprecated == untested". As long as they're supported (even if deprecated), they should be getting tested on the nvidia side just as well (also, 12.9 evolved from 12.6, so I find it improbable that the support for older architectures just fell off a cliff).

# WAR to resolve the ld error in libtorch build with CUDA 12.9
if [[ "$PACKAGE_TYPE" == "libtorch" ]]; then
TORCH_CUDA_ARCH_LIST="7.5;8.0;9.0;10.0;12.0+PTX"
TORCH_CUDA_ARCH_LIST="6.0;7.0;7.5;8.0;9.0;10.0;12.0+PTX"
Collaborator

Same as above
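For anyone reproducing this locally, a rough sketch of a source build with the discussed arch list and flags (a sketch under the assumption that a local checkout honors these variables the same way the CI builder scripts do):

# Expanded arch list plus the size-optimized fatbin compression (needs CUDA 12.4+).
export TORCH_CUDA_ARCH_LIST="6.0;7.0;7.5;8.0;8.6;9.0;10.0;12.0+PTX"
export TORCH_NVCC_FLAGS="-Xfatbin -compress-all -compress-mode=size"
python setup.py bdist_wheel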

@ptrblck
Collaborator

ptrblck commented Jul 8, 2025

What's the performance impact?

speed
    Uses a compression mode more focused on reduced decompression time, at the cost of less reduction in final binary size.
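One crude way to eyeball the load-time cost is to time the first CUDA touch on otherwise identical wheels built with speed vs. size (a sketch, not a rigorous benchmark):

# Cold-start timing: importing torch and touching the GPU forces the fatbin to be loaded.
python -c "import time; t0 = time.time(); import torch; torch.zeros(1, device='cuda'); print(f'cold start: {time.time() - t0:.2f}s')"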

@kwen2501
Contributor

kwen2501 commented Jul 8, 2025

Regarding the failure related to NVSHMEM in the CUDA 12.6 build:

nvlink error   : Uncompress failed (target: sm_50)
nvlink fatal   : elfLink fatbinary error (target: sm_50)
nvlink error   : Uncompress failed (target: sm_60)
nvlink fatal   : elfLink fatbinary error (target: sm_60)

Do you think it would be possible to drop this compression feature for the 12.6 build?
Soon, in 2.9, we are going to deprecate those SM archs:
#157517

CUDA 12.8 and 12.9 builds do not have this issue because they have already dropped those SM archs.
cc @atalman

@atalman
Contributor

atalman commented Jul 8, 2025

I believe we can only implement compression for cu12.8+ builds. CUDA 12.6 builds are legacy builds, and currently we do not plan to upload them to PyPI, so they can stay as they are. cc @kwen2501 @Skylion007
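A minimal sketch of how such per-version gating could look in a build script (the version detection here is illustrative; the real builder scripts key off their own variables):

# Only append the new flag for the toolkits where the smaller PyPI wheels matter (12.8+).
CUDA_VERSION=$(nvcc --version | sed -n 's/.*release \([0-9][0-9]*\.[0-9]*\).*/\1/p')
case "$CUDA_VERSION" in
  12.8|12.9|13.*) TORCH_NVCC_FLAGS="${TORCH_NVCC_FLAGS} -compress-mode=size" ;;
  *) ;;  # leave 12.6 and older legacy builds untouched
esac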

@Skylion007 Skylion007 requested a review from ptrblck July 8, 2025 22:48
@Skylion007 Skylion007 force-pushed the skylion007/nvcc-size-compress-mode-2025-07-08 branch 5 times, most recently from 6609b38 to a1381b0 Compare July 8, 2025 22:57
@Skylion007 Skylion007 changed the title [BE]: NVCC use fatbin compression mode size. Try to add SM60 back [BE]: NVCC use fatbin compression mode size. Jul 8, 2025
@Skylion007
Collaborator Author

Skylion007 commented Jul 10, 2025

@albanD Another alternative is that we just also release a CUDA 12.6 wheel and require newer drivers for CUDA 12.8+.

@traversaro

It is not. We are using the same tooling we were using before; we are just passing slightly different settings into the fatbin utility we were already using. fatbin is a general compiler tool and not specific to a CUDA version.

I may be wrong, but from what I understand the --compress-mode flag is passed directly to nvcc; it is not passed to fatbin, as only -compress-all is prefixed by -Xfatbin. I could not find any docs on how nvcc propagates that flag to fatbin.
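One way to see what nvcc actually hands to the fatbinary step is --dryrun, which prints the sub-commands without running them (file name and arch below are illustrative):

# Inspect the generated fatbinary invocation to check whether --compress-mode propagates.
nvcc --dryrun --compress-mode=size -gencode arch=compute_80,code=sm_80 -c kernel.cu 2>&1 | grep -i fatbinary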

@ptrblck
Collaborator

ptrblck commented Jul 10, 2025

@ptrblck can you confirm what the state is here? Can the binaries generated with the "compress-mode=" flag be used with older drivers? In particular, is it any more restrictive than ">=525.60.13" currently advertised for all 12.X toolkits?

Any usage of --compress-mode=size/balance will drop support for older CUDA drivers and will bump the min. driver requirement to CUDA 12.4.
We would thus shrink our support matrix, and I expect to see a lot of issues for the next release.
My suggestion would be to hold this PR until the next CUDA 13 major release, which would require a driver update anyway.
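For users wondering whether a size/balance-compressed wheel will work on their machine, a quick driver check (per the comment above, the driver needs to be at least the CUDA 12.4 one):

# Print the installed driver version; the nvidia-smi banner also reports the max supported CUDA version.
nvidia-smi --query-gpu=driver_version --format=csv,noheader
nvidia-smi | head -n 4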

@albanD
Collaborator

albanD commented Jul 10, 2025

@pytorchbot revert -m "Reverting to avoid regressing on the driver supported" -c nosignal

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Jul 10, 2025
…ion. (#157791)"

This reverts commit 9bdf87e.

Reverted #157791 on behalf of https://github.com/albanD due to Reverting to avoid regressing on the driver supported ([comment](#157791 (comment)))
@pytorchmergebot
Collaborator

@Skylion007 your PR has been successfully reverted.

@ptrblck
Collaborator

ptrblck commented Jul 10, 2025

Cross-posting from other questions outside of this PR:
CUDA 12.4 was released in March 2024 and contained driver 550.54.14 for Linux

@zou3519
Contributor

zou3519 commented Jul 12, 2025

So I saw this in action in vLLM which merged a similar change that we had to revert (vllm-project/vllm#20847). The CUDA driver version I was running is 12.2 (reported by nvidia-smi), which is less than 12.4.

@Skylion007 are there other libraries that adopted this change that we should tell about this constraint?

@Skylion007
Collaborator Author

FYI @tridao with Flash Attention.

@albanD albanD added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Jul 14, 2025
pytorchmergebot pushed a commit that referenced this pull request Aug 24, 2025
…161316)

#159779

CUDA 13 added support for the --compress-mode flag for nvcc across all drivers of the CUDA 13.X toolkits, enabling the use of --compress-mode=size for significant size reduction (~71% smaller for the CUDA Math APIs, for example). https://developer.nvidia.com/blog/whats-new-and-important-in-cuda-toolkit-13-0/

Why we add this for CUDA 13 only, quoting @ptrblck: Any usage of --compress-mode=size/balance will drop the support of older CUDA drivers and will bump the min. driver requirement to CUDA 12.4. #157791 (comment)

The default for CUDA 13 will be --compress-mode=balance, which gives smaller binaries than the LZ4 speed mode used in previous CUDA versions.

Related - #157791

Pull Request resolved: #161316
Approved by: https://github.com/nWEIdia, https://github.com/Skylion007
@github-actions
Contributor

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Sep 12, 2025
@brunoais

Can there be a variant of pytorch where this flag can be applied?
Maybe with a suffix in the version number?

markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
@Skylion007
Collaborator Author

This fix has been merged for the CUDA 13 binaries since those do not have the driver compatibility issue. See #161316 for more info.


Labels

better-engineering - Relatively self-contained tasks for better engineering contributors
ci-no-td - Do not run TD on this PR
ciflow/binaries_libtorch - Trigger binary build and upload jobs for libtorch on the PR
ciflow/binaries_wheel - Trigger binary build and upload jobs for wheel on the PR
ciflow/trunk - Trigger trunk jobs on your pull request
Merged
module: cuda - Related to torch.cuda, and CUDA support in general
open source
release notes: build - release notes category
release notes: cuda - release notes category
release notes: releng - release notes category
Reverted
Stale
triaged - This issue has been looked at a team member, and triaged and prioritized into an appropriate module
