Nccl update to 2.25.1 for cuda 12.4-12.8 #146073
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/146073
Note: Links to docs will display an error until the docs builds have been completed.
❌ 4 New Failures, 15 Pending, 5 Unrelated Failures
As of commit aa9dd24 with merge base 858bc0c.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
d14775d to 8a30e7f
Thank you, that should work for PyPI, but might be a bit challenging for Poetry, wouldn't it? Why not just build NCCL for CUDA 11.8 from source and publish the wheel to download.pytorch.org?
Hi @atalman, could you help add 12.8 to this PR as well? I'm planning to close https://github.com/pytorch/pytorch/pull/145776/files as it needed the NCCL submodule version update, which will be fixed by this PR.
Ah nvm, I see Nvidia actually stopped publishing for CU11 and yanked the unsupported version (hence us building from scratch).
a87430b to 5dd816d
OK, so the link problems you are seeing are because newer libnccl has two collectives.o, and it looks like one of them is lost during the slimming. Checking whether undoing it will help.
This reverts commit eecee58. Reverted #146073 on behalf of https://github.com/atalman due to breaks Locally building benchmarks ([comment](#146073 (comment)))
@pytorchmergebot merge -f "Reverted because of false alarm"
Merge started
Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).
This wastes some time checking out NCCL even for ROCm builds, which will use system RCCL instead.
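One way to avoid the unnecessary checkout would be to decide up front whether a bundled NCCL source tree is needed at all. The sketch below is illustrative only: the helper name `needs_nccl_checkout` is hypothetical, and treating `USE_ROCM`, `USE_NCCL`, and `USE_SYSTEM_NCCL` as simple on/off environment flags is an assumption, not the logic used in this PR.

```python
import os
from typing import Mapping, Optional

# Values treated as "on" for a build flag (an assumption for this sketch).
_TRUTHY = {"1", "ON", "YES", "TRUE", "Y"}


def _flag(env: Mapping[str, str], name: str, default: bool = False) -> bool:
    """Interpret an environment variable as an on/off build flag."""
    value = env.get(name)
    if value is None:
        return default
    return value.strip().upper() in _TRUTHY


def needs_nccl_checkout(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when a bundled NCCL source tree would be used."""
    env = os.environ if env is None else env
    if _flag(env, "USE_ROCM"):
        # ROCm builds use the system RCCL, so no NCCL sources are needed.
        return False
    if _flag(env, "USE_SYSTEM_NCCL"):
        # A system-provided NCCL also makes the checkout unnecessary.
        return False
    # Hypothetical default: NCCL is enabled unless explicitly turned off.
    return _flag(env, "USE_NCCL", default=True)
```

A build script could call `needs_nccl_checkout()` before any `git` invocation and skip the clone entirely when it returns `False`.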
Should resolve: #144768

We use one common NCCL version for CUDA builds 12.4-12.8: ``NCCL_VERSION=v2.25.1-1``
For CUDA 11.8 we use the legacy ``NCCL_VERSION=v2.21.1-1``
We use a pinned version of NCCL rather than a submodule.
Move the NCCL location from ``third_party/nccl/nccl`` to ``third_party/nccl``

Pull Request resolved: #146073
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/kwen2501, https://github.com/fduwjj
This reverts commit 06f4a5c. Reverted #146073 on behalf of https://github.com/atalman due to breaks macos builds: ModuleNotFoundError: No module named 'torch._C._distributed_c10d'; 'torch._C' is not a package ([comment](#146073 (comment)))
I installed the PyTorch CUDA 12.8 beta (pre-release) build with `pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128`. It successfully installed torch and NCCL (2.25.1+cuda12.8). However, when I run my code, it shows two errors.
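When reporting an issue like this, it helps to include the NCCL version that the installed torch build actually reports. A minimal sketch, assuming torch is importable in the current environment; the helper names are hypothetical, and the function returns `None` rather than failing when torch or NCCL support is missing:

```python
from typing import Optional, Tuple


def format_nccl_version(version: Tuple[int, ...]) -> str:
    """Render a version tuple such as (2, 25, 1) as '2.25.1'."""
    return ".".join(str(part) for part in version)


def installed_nccl_version() -> Optional[str]:
    """Return the NCCL version reported by torch, or None if unavailable."""
    try:
        from torch.cuda import nccl  # requires a torch build with NCCL support
        return format_nccl_version(nccl.version())
    except Exception:
        # torch is not installed, or this build has no NCCL bindings.
        return None


if __name__ == "__main__":
    print("NCCL reported by torch:", installed_nccl_version())
```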
Note that there is a regression in NCCL 2.25 for RoCEv2 that isn't Spectrum-X.
Generate source tarball with PEP 517 conforming build tools instead of the custom routine in place right now.

Closes #150461.

The current procedure for generating the source tarball consists of creating a source tree by manually copying and pruning source files.

This PR replaces that with a call to the standard [build tool](https://build.pypa.io/en/stable/), which works with the build backend to produce an sdist. For that to work correctly, the build backend also needs to be configured. In the case of PyTorch, the backend currently is (the legacy version of) the setuptools backend, the source dist part of which is mostly configured via the `MANIFEST.in` file.

The resulting source distribution can be used to install directly from source with `pip install ./torch-{version}.tar.gz` or to build wheels directly from source with `pip wheel ./torch-{version}.tar.gz`; both should be considered experimental for now.

## Issues

### sdist name

According to PEP 517, the name of the source distribution file must coincide with the project name, or [more precisely](https://peps.python.org/pep-0517/#source-distributions), the source distribution of a project that generates `{NAME}-{...}.whl` wheels is required to be named `{NAME}-{...}.tar.gz`. Currently, the source tarball is called `pytorch-{...}.tar.gz`, but the generated wheels and Python package are called `torch-{...}`.

### Symbolic Links

The source tree at the moment contains a small number of symbolic links. This [has been seen as problematic](pypa/pip#5919) largely because of lack of support on Windows, but also because of [a problem in setuptools](pypa/setuptools#4937). Particularly unfortunate is a circular symlink in the third party `ittapi` module, which cannot be resolved by replacing it with a copy.

PEP 721 (now integrated in the [Source Distribution Format Specification](https://packaging.python.org/en/latest/specifications/source-distribution-format/#source-distribution-archive-features)) allows for symbolic links, but only if they don't point outside the destination directory and if they don't contain `../` in their target.

The list of symbolic links currently is as follows:

<details>

|source|target|problem|solution|
|-|-|-|-|
| `.dockerignore` | `.gitignore` | ✅ ok (individual file) ||
| `docs/requirements.txt` | `../.ci/docker/requirements-docs.txt` |❗`..` in target|swap source and target[^1]|
| `functorch/docs/source/notebooks` | `../../notebooks/` |❗`..` in target|swap source and target[^1]|
| `.github/ci_commit_pins/triton.txt` | `../../.ci/docker/ci_commit_pins/triton.txt` | ✅ ok (omitted from sdist)||
| `third_party/flatbuffers/docs/source/CONTRIBUTING.md` | `../../CONTRIBUTING.md` |❗`..` in target|omit from sdist[^2]|
| `third_party/flatbuffers/java/src/test/java/DictionaryLookup` | `../../../../tests/DictionaryLookup` |❗`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/java/src/test/java/MyGame` | `../../../../tests/MyGame` |❗`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/java/src/test/java/NamespaceA` | `../../../../tests/namespace_test/NamespaceA` |❗`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/java/src/test/java/NamespaceC` | `../../../../tests/namespace_test/NamespaceC` |❗`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/java/src/test/java/optional_scalars` | `../../../../tests/optional_scalars` |❗`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/java/src/test/java/union_vector` | `../../../../tests/union_vector` |❗`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/kotlin/benchmark/src/jvmMain/java` | `../../../../java/src/main/java` |❗`..` in target|omit from sdist[^3]|
| `third_party/ittapi/rust/ittapi-sys/c-library` | `../../` |❗`..` in target|omit from sdist[^4]|
| `third_party/ittapi/rust/ittapi-sys/LICENSES` | `../../LICENSES` |❗`..` in target|omit from sdist[^4]|
| `third_party/opentelemetry-cpp/buildscripts/pre-merge-commit` | `./pre-commit` |✅ ok (individual file)||
| `third_party/opentelemetry-cpp/third_party/prometheus-cpp/cmake/project-import-cmake/sample_client.cc` | `../../push/tests/integration/sample_client.cc` |❗`..` in target|omit from sdist[^5]|
| `third_party/opentelemetry-cpp/third_party/prometheus-cpp/cmake/project-import-cmake/sample_server.cc` | `../../pull/tests/integration/sample_server.cc` |❗`..` in target|omit from sdist[^5]|
| `third_party/opentelemetry-cpp/third_party/prometheus-cpp/cmake/project-import-pkgconfig/sample_client.cc` | `../../push/tests/integration/sample_client.cc` |❗`..` in target|omit from sdist[^5]|
| `third_party/opentelemetry-cpp/third_party/prometheus-cpp/cmake/project-import-pkgconfig/sample_server.cc` | `../../pull/tests/integration/sample_server.cc` |❗`..` in target|omit from sdist[^5]|
| `third_party/XNNPACK/tools/xngen` | `xngen.py` | ✅ ok (individual file)||

</details>

The introduction of symbolic links inside the `.ci/docker` folder creates a new problem, however, because Docker's `COPY` command does not allow symlinks in this way. We work around that by using `tar ch` to dereference the symlinks before handing them over to `docker build`.

[^1]: These resources can naturally be considered part of the docs, so moving the actual files into the place of the current symlinks and replacing them with (unproblematic) symlinks can be said to improve semantics as well.

[^2]: The flatbuffers docs already use the original file, not the symlink, and in the most recent releases, starting from flatbuffers-25.1.21, the symlink is replaced by the actual file thanks to a documentation overhaul.

[^3]: These resources are flatbuffers tests for Java and Kotlin and can be omitted from our sdist.

[^4]: We don't need to ship the Rust bindings for ittapi.

[^5]: These are demonstration examples for how to link to prometheus-cpp using CMake and can be omitted.

### NCCL

NCCL used to be included as a submodule. However, with #146073 (first released in v2.7.0-rc1), the submodule was removed and replaced with a build-time checkout procedure in `tools/build_pytorch_libs.py`, which checks out the required version of NCCL from the upstream repository based on a commit pin recorded in `.ci/docker/ci_commit_pins/nccl-cu{11,12}.txt`. This means that a crucial third-party dependency is missing from the source distribution, and because the `.ci` folder is omitted from the source distribution, it is not possible to use the build-time download. However, it *is* possible to use a system-provided NCCL via the `USE_SYSTEM_NCCL` environment variable, which is now also the default for the official PyTorch wheels.

Pull Request resolved: #152098
Approved by: https://github.com/atalman
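The build-time checkout described above can be pictured roughly as follows. This is a sketch under stated assumptions, not the code in `tools/build_pytorch_libs.py`: the function names are hypothetical, the upstream URL is assumed to be NVIDIA's GitHub repository, and the pin is assumed to be a tag name (which `git clone --branch` accepts), such as `v2.25.1-1`.

```python
from pathlib import Path
from typing import List

# Assumed upstream location; not stated in the text above.
NCCL_REPO = "https://github.com/NVIDIA/nccl.git"


def pin_file_name(cuda_version: str) -> str:
    """Pick the pin file by CUDA major version (11 vs. everything else)."""
    return "nccl-cu11.txt" if cuda_version.startswith("11") else "nccl-cu12.txt"


def read_nccl_pin(repo_root: Path, cuda_version: str) -> str:
    """Read the pinned NCCL ref from the ci_commit_pins directory."""
    pin_path = (
        repo_root / ".ci" / "docker" / "ci_commit_pins" / pin_file_name(cuda_version)
    )
    return pin_path.read_text().strip()


def clone_command(pin: str, dest: Path) -> List[str]:
    """Build a shallow clone command for the pinned ref (not executed here).

    `--branch` takes a branch or tag name, so this assumes the pin is a tag.
    """
    return ["git", "clone", "--depth", "1", "--branch", pin, NCCL_REPO, str(dest)]
```

Because `read_nccl_pin` depends on the `.ci` folder, it fails with `FileNotFoundError` in a tree where that folder is absent, which is the situation the sdist ends up in.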
Should resolve: #144768

We use one common NCCL version for CUDA builds 12.4-12.8: ``NCCL_VERSION=v2.25.1-1``
For CUDA 11.8 we use the legacy ``NCCL_VERSION=v2.21.1-1``
We use a pinned version of NCCL rather than a submodule.
Move the NCCL location from ``third_party/nccl/nccl`` to ``third_party/nccl``
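The version choice in this description amounts to a small lookup keyed on the CUDA version. A minimal sketch, assuming the CUDA version is available as a `major.minor` string; the function name `nccl_version_for` is hypothetical, and versions not listed in the description are rejected rather than guessed:

```python
from typing import Tuple

# Versions taken from the PR description.
NCCL_FOR_CUDA_12 = "v2.25.1-1"  # common pin for CUDA 12.4 through 12.8
NCCL_FOR_CUDA_11 = "v2.21.1-1"  # legacy pin for CUDA 11.8


def _parse(cuda_version: str) -> Tuple[int, int]:
    """Split '12.8' (or '12.8.1') into (12, 8)."""
    major, minor = cuda_version.split(".")[:2]
    return int(major), int(minor)


def nccl_version_for(cuda_version: str) -> str:
    """Map a CUDA version to the NCCL pin named in the description."""
    ver = _parse(cuda_version)
    if (12, 4) <= ver <= (12, 8):
        return NCCL_FOR_CUDA_12
    if ver == (11, 8):
        return NCCL_FOR_CUDA_11
    raise ValueError(f"no NCCL pin listed for CUDA {cuda_version}")
```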
cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames @StrongerXi