KEMBAR78
Detect NVSHMEM location by kwen2501 · Pull Request #153010 · pytorch/pytorch · GitHub
Skip to content

Conversation

@kwen2501
Copy link
Contributor

@kwen2501 kwen2501 commented May 7, 2025

Stack from ghstack (oldest at bottom):

Changes

  • Detect NVSHMEM install location via sysconfig.get_path("purelib"), which typically resolves to <conda_env>/lib/python/site-packages, and NVSHMEM include and lib live under nvidia/nvshmem
  • Added link dir via target_link_directories
  • Removed direct dependency on mlx5
  • Added preload rule (following other other NVIDIA libs)

Plan of Record

  1. End user experience: link against NVSHMEM dynamically (NVSHMEM lib size is 100M, similar to NCCL, thus we'd like users to pip install nvshmem than torch carrying the bits)
  2. Developer experience: at compile time, prefers wheel dependency than using Git submodule
    General rule: submodule for small lib that torch can statically link with
    If user pip install a lib, our CI build process should do the same, rather than building from Git submodule (just for its header, for example)
  3. Keep USE_NVSHMEM to gate non-Linux platforms, like Windows, Mac
  4. At configuration time, we should be able to detect whether nvshmem is available, if not, we don't build NVSHMEMSymmetricMemory at all.

For now, we have symbol dependency on two particular libs from NVSHMEM:

  • libnvshmem_host.so: contains host side APIs;
  • libnvshmem_device.a: contains device-side global variables AND device function impls.

Add link dir

Remove direct dependency on mlx5

Add preload rule

[ghstack-poisoned]
@pytorch-bot
Copy link

pytorch-bot bot commented May 7, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/153010

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 6a1e35e with merge base e9e1aac (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@kwen2501 kwen2501 changed the title Add NVSHMEM_HOME detection Detect NVSHMEM location May 7, 2025
@kwen2501 kwen2501 requested review from atalman, fduwjj, fegin, malfet and ngimel May 7, 2025 01:11
### Changes
- Detect NVSHMEM install location via `sysconfig.get_path("purelib")`, which typically resolves to `<conda_env>/lib/python/site-packages`, and NVSHMEM include and lib live under `nvidia/nvshmem`
- Added link dir via `target_link_directories`
- Removed direct dependency on mlx5
- Added preload rule (following other other NVIDIA libs)

### Plan of Record
1. End user experience: link against NVSHMEM dynamically (NVSHMEM lib size is 100M, similar to NCCL, thus we'd like users to `pip install nvshmem` than torch carrying the bits)
2. Developer experience: at compile time, prefers wheel dependency than using Git submodule
General rule: submodule for small lib that torch can statically link with
If user pip install a lib, our CI build process should do the same, rather than building from Git submodule (just for its header, for example)
3. Keep `USE_NVSHMEM` to gate non-Linux platforms, like Windows, Mac
4. At configuration time, we should be able to detect whether nvshmem is available, if not, we don't build `NVSHMEMSymmetricMemory` at all.

For now, we have symbol dependency on two particular libs from NVSHMEM:
- libnvshmem_host.so: contains host side APIs;
- libnvshmem_device.a: contains device-side global variables AND device function impls.




[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request May 7, 2025
Add link dir

Remove direct dependency on mlx5

Add preload rule

ghstack-source-id: 360a115
Pull Request resolved: #153010
@kwen2501 kwen2501 added the release notes: distributed (c10d) release notes category label May 7, 2025
@kwen2501
Copy link
Contributor Author

kwen2501 commented May 7, 2025

@malfet can you please let me know if the following lines would satisfy the rpath requirement automatically? Thanks!

# Automatically add all linked folders that are NOT in the build directory to
# the rpath (per library?)
set(CMAKE_INSTALL_RPATH_USE_LINK_PATH TRUE)

"cusolver": "libcusolver.so.*[0-9]",
"nccl": "libnccl.so.*[0-9]",
"nvtx": "libnvToolsExt.so.*[0-9]",
"nvshmem": "libnvshmem_host.so.*[0-9]",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to consider dynamically linking here for the wheels and adding it as a dependency to help reduce binary size: https://pypi.org/project/nvidia-nvshmem-cu12/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Skylion007. It is indeed the plan to dynamically link against nvshmem wheel. (Please see "Plan of Record" in PR description.)
The code here does a pre-load, copying notes from [Global dependencies]:

# ... we try to be good citizens and avoid polluting the symbol
# namespaces, so libtorch is loaded with all its dependencies in a local scope.
# That usually leads to missing symbol errors at run-time, so to avoid a situation like
# this we have to preload those libs in a global namespace.

@kwen2501 kwen2501 added the ciflow/trunk Trigger trunk jobs on your pull request label May 7, 2025
}
)

# Detect build dependencies from python lib path (in order to set *_HOME variables)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you are looking at alternative installations of NVSHMEM, are you also interested in covering the case where users may have installed the NVSHMEM packages from a deb or rpm file?
It's not really our recommended workflow, but in that case, setting NVSHMEM_HOME won't work directly. However, we do support the cmake find_package utility in those distributions.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, using find_package seems like a more robust solution! Which cmake version is required for this to work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. IIUC, find_package requires us writing a FindNVSHMEM.cmake containing heuristic for searching NVSHMEM from system install path. I think I can implement that in a follow-up PR.
I think, though, wheel install should take precedence over system install if both exist?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking back at this, I don't think a FindNVSHMEM.cmake file is necessary. NVSHMEM already ships with NVSHMEMConfig.cmake files in the installations. These should just work with the find_package command without additional work.

@kwen2501
Copy link
Contributor Author

kwen2501 commented May 7, 2025

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@malfet
Copy link
Contributor

malfet commented May 8, 2025

You'll need to add an entry to this particular list (actually there are 3 such lists in that file):

CUDA_RPATHS=(

@github-actions github-actions bot deleted the gh/kwen2501/147/head branch July 6, 2025 02:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged release notes: distributed (c10d) release notes category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants