Detect NVSHMEM location #153010

kwen2501 · 2025-05-07T00:53:17Z

Stack from ghstack (oldest at bottom):

-> Detect NVSHMEM location #153010

Changes

- Detect the NVSHMEM install location via `sysconfig.get_path("purelib")`, which typically resolves to `<conda_env>/lib/python/site-packages`; the NVSHMEM include and lib directories live under `nvidia/nvshmem` there
- Added the link directory via `target_link_directories`
- Removed the direct dependency on mlx5
- Added a preload rule (following the other NVIDIA libs)
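
The detection described in the first item can be sketched roughly as follows. This is a minimal illustration, not the code added by this PR: the function name `detect_nvshmem_home` and its optional `site_packages` argument are hypothetical, and it assumes the wheel lays out `include` and `lib` directly under `nvidia/nvshmem`.

```python
import os
import sysconfig


def detect_nvshmem_home(site_packages=None):
    """Return the NVSHMEM wheel directory if it looks installed, else None.

    Hypothetical helper: the name and signature are assumptions made for
    illustration and are not taken from the PR.
    """
    # "purelib" is the site-packages directory for pure-Python packages.
    base = site_packages or sysconfig.get_path("purelib")
    candidate = os.path.join(base, "nvidia", "nvshmem")
    # Require both subdirectories that the description says live here.
    has_include = os.path.isdir(os.path.join(candidate, "include"))
    has_lib = os.path.isdir(os.path.join(candidate, "lib"))
    return candidate if has_include and has_lib else None
```

Calling `detect_nvshmem_home()` with no argument checks the active environment; a build script could use the result to set a variable such as `NVSHMEM_HOME` when it is not `None`.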

Plan of Record

1. End-user experience: link against NVSHMEM dynamically. The NVSHMEM lib is about 100M, similar to NCCL, so we'd rather have users `pip install nvshmem` than have torch carry the bits.
2. Developer experience: at compile time, prefer a wheel dependency over a Git submodule.
   - General rule: use a submodule for a small lib that torch can statically link with.
   - If users pip install a lib, our CI build process should do the same, rather than building from a Git submodule (just for its headers, for example).
3. Keep `USE_NVSHMEM` to gate non-Linux platforms such as Windows and Mac.
4. At configuration time, we should be able to detect whether NVSHMEM is available; if it is not, we don't build `NVSHMEMSymmetricMemory` at all.

For now, we have symbol dependencies on two particular libs from NVSHMEM:

- `libnvshmem_host.so`: contains host-side APIs;
- `libnvshmem_device.a`: contains both device-side global variables and device function implementations.
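
As one way to picture the configuration-time availability check from the plan, the sketch below tests whether both libraries exist in a given lib directory. The helper name `nvshmem_libs_present` is hypothetical, and the assumption that the host library may carry a version suffix is mine, not something stated in the PR description.

```python
import glob
import os


def nvshmem_libs_present(lib_dir):
    """Return True if both NVSHMEM libraries named above exist in lib_dir.

    Hypothetical helper for illustration only.
    """
    # The host library is shared and may carry a version suffix (assumed).
    host = glob.glob(os.path.join(lib_dir, "libnvshmem_host.so*"))
    # The device library is a static archive with a fixed file name.
    device = os.path.isfile(os.path.join(lib_dir, "libnvshmem_device.a"))
    return bool(host) and device
```

A build script could skip `NVSHMEMSymmetricMemory` whenever this returns `False`.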

pytorch-bot · 2025-05-07T00:53:20Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/153010

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 6a1e35e with merge base e9e1aac:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

kwen2501 · 2025-05-07T01:20:13Z

@malfet can you please let me know if the following lines would satisfy the rpath requirement automatically? Thanks!

pytorch/cmake/Dependencies.cmake

Lines 14 to 16 in 81b6920

    
    # Automatically add all linked folders that are NOT in the build directory to
    # the rpath (per library?)
    set(CMAKE_INSTALL_RPATH_USE_LINK_PATH TRUE)

Skylion007 · 2025-05-07T13:43:43Z

torch/__init__.py

            "cusolver": "libcusolver.so.*[0-9]",
            "nccl": "libnccl.so.*[0-9]",
            "nvtx": "libnvToolsExt.so.*[0-9]",
+            "nvshmem": "libnvshmem_host.so.*[0-9]",


We may want to consider dynamically linking here for the wheels and adding it as a dependency to help reduce binary size: https://pypi.org/project/nvidia-nvshmem-cu12/

kwen2501 · 2025-05-07

Thanks @Skylion007. It is indeed the plan to link dynamically against the nvshmem wheel (please see "Plan of Record" in the PR description).
The code here does a preload, copying the note from [Global dependencies]:

    # ... we try to be good citizens and avoid polluting the symbol
    # namespaces, so libtorch is loaded with all its dependencies in a local scope.
    # That usually leads to missing symbol errors at run-time, so to avoid a situation like
    # this we have to preload those libs in a global namespace.
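
To make the preload idea concrete, here is a rough sketch of locating a versioned library with the glob pattern from the diff and loading it into the global symbol namespace. It is not the code in `torch/__init__.py`: the helper names `find_versioned_lib` and `preload_global` and the `search_paths` parameter are hypothetical, and the `nvidia/<folder>/lib` layout is assumed from the PR description.

```python
import ctypes
import glob
import os
import sys


def find_versioned_lib(lib_folder, lib_pattern, search_paths=None):
    """Return the first file matching nvidia/<lib_folder>/lib/<lib_pattern>.

    Hypothetical helper; search_paths defaults to sys.path.
    """
    paths = sys.path if search_paths is None else search_paths
    for base in paths:
        pattern = os.path.join(base, "nvidia", lib_folder, "lib", lib_pattern)
        matches = sorted(glob.glob(pattern))
        if matches:
            return matches[0]
    return None


def preload_global(lib_path):
    """Load a shared library so its symbols are visible globally."""
    # RTLD_GLOBAL makes the symbols available to libraries loaded later.
    return ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
```

For example, `find_versioned_lib("nvshmem", "libnvshmem_host.so.*[0-9]")` returns a path only when a versioned host library is found; passing that path to `preload_global` would then load it before libtorch.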

seth-howell · 2025-05-07T16:53:21Z

tools/setup_helpers/cmake.py

            }
        )

+        # Detect build dependencies from python lib path (in order to set *_HOME variables)


Since you are looking at alternative installations of NVSHMEM, are you also interested in covering the case where users may have installed the NVSHMEM packages from a deb or rpm file?
It's not really our recommended workflow, but in that case, setting NVSHMEM_HOME won't work directly. However, we do support the cmake find_package utility in those distributions.

ngimel · 2025-05-07

Thanks, using find_package seems like a more robust solution! Which cmake version is required for this to work?

kwen2501 · 2025-05-07

Thanks. IIUC, find_package requires us to write a FindNVSHMEM.cmake containing heuristics for finding NVSHMEM in the system install path. I think I can implement that in a follow-up PR.
I think, though, a wheel install should take precedence over a system install if both exist?

seth-howell · 2025-06-05

Looking back at this, I don't think a FindNVSHMEM.cmake file is necessary. NVSHMEM already ships NVSHMEMConfig.cmake files in its installations. These should just work with the find_package command without additional work.

kwen2501 · 2025-05-07T23:27:45Z

@pytorchbot merge

pytorchmergebot · 2025-05-07T23:29:32Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

malfet · 2025-05-08T23:41:51Z

You'll need to add an entry to this particular list (actually there are 3 such lists in that file):

pytorch/.ci/manywheel/build_cuda.sh

Line 184 in cbcb57d

CUDA_RPATHS=(

Add NVSHMEM_HOME detection

7a92626

Add link dir Remove direct dependency on mlx5 Add preload rule [ghstack-poisoned]

kwen2501 changed the title ~~Add NVSHMEM_HOME detection~~ Detect NVSHMEM location May 7, 2025

kwen2501 requested review from atalman, fduwjj, fegin, malfet and ngimel May 7, 2025 01:11

kwen2501 added a commit that referenced this pull request May 7, 2025

Add NVSHMEM_HOME detection

4c39c31

Add link dir Remove direct dependency on mlx5 Add preload rule ghstack-source-id: 360a115 Pull Request resolved: #153010

kwen2501 added the "release notes: distributed (c10d)" label May 7, 2025

ngimel approved these changes May 7, 2025

fduwjj approved these changes May 7, 2025

Skylion007 reviewed May 7, 2025

Skylion007 approved these changes May 7, 2025

kwen2501 added the "ciflow/trunk" label May 7, 2025

seth-howell reviewed May 7, 2025

pytorchmergebot added the merging label May 7, 2025

pytorchmergebot closed this in 5bf0c35 May 7, 2025

pytorchmergebot added Merged and removed merging labels May 7, 2025

github-actions bot deleted the gh/kwen2501/147/head branch July 6, 2025 02:19
