-
Notifications
You must be signed in to change notification settings - Fork 25.7k
Detect NVSHMEM location #153010
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Detect NVSHMEM location #153010
Conversation
Add link dir Remove direct dependency on mlx5 Add preload rule [ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/153010
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit 6a1e35e with merge base e9e1aac ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
### Changes
- Detect NVSHMEM install location via `sysconfig.get_path("purelib")`, which typically resolves to `<conda_env>/lib/python/site-packages`, and NVSHMEM include and lib live under `nvidia/nvshmem`
- Added link dir via `target_link_directories`
- Removed direct dependency on mlx5
- Added preload rule (following other other NVIDIA libs)
### Plan of Record
1. End user experience: link against NVSHMEM dynamically (NVSHMEM lib size is 100M, similar to NCCL, thus we'd like users to `pip install nvshmem` than torch carrying the bits)
2. Developer experience: at compile time, prefers wheel dependency than using Git submodule
General rule: submodule for small lib that torch can statically link with
If user pip install a lib, our CI build process should do the same, rather than building from Git submodule (just for its header, for example)
3. Keep `USE_NVSHMEM` to gate non-Linux platforms, like Windows, Mac
4. At configuration time, we should be able to detect whether nvshmem is available, if not, we don't build `NVSHMEMSymmetricMemory` at all.
For now, we have symbol dependency on two particular libs from NVSHMEM:
- libnvshmem_host.so: contains host side APIs;
- libnvshmem_device.a: contains device-side global variables AND device function impls.
[ghstack-poisoned]
|
@malfet can you please let me know if the following lines would satisfy the pytorch/cmake/Dependencies.cmake Lines 14 to 16 in 81b6920
|
| "cusolver": "libcusolver.so.*[0-9]", | ||
| "nccl": "libnccl.so.*[0-9]", | ||
| "nvtx": "libnvToolsExt.so.*[0-9]", | ||
| "nvshmem": "libnvshmem_host.so.*[0-9]", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may want to consider dynamically linking here for the wheels and adding it as a dependency to help reduce binary size: https://pypi.org/project/nvidia-nvshmem-cu12/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Skylion007. It is indeed the plan to dynamically link against nvshmem wheel. (Please see "Plan of Record" in PR description.)
The code here does a pre-load, copying notes from [Global dependencies]:
# ... we try to be good citizens and avoid polluting the symbol
# namespaces, so libtorch is loaded with all its dependencies in a local scope.
# That usually leads to missing symbol errors at run-time, so to avoid a situation like
# this we have to preload those libs in a global namespace.
| } | ||
| ) | ||
|
|
||
| # Detect build dependencies from python lib path (in order to set *_HOME variables) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you are looking at alternative installations of NVSHMEM, are you also interested in covering the case where users may have installed the NVSHMEM packages from a deb or rpm file?
It's not really our recommended workflow, but in that case, setting NVSHMEM_HOME won't work directly. However, we do support the cmake find_package utility in those distributions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, using find_package seems like a more robust solution! Which cmake version is required for this to work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. IIUC, find_package requires us writing a FindNVSHMEM.cmake containing heuristic for searching NVSHMEM from system install path. I think I can implement that in a follow-up PR.
I think, though, wheel install should take precedence over system install if both exist?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking back at this, I don't think a FindNVSHMEM.cmake file is necessary. NVSHMEM already ships with NVSHMEMConfig.cmake files in the installations. These should just work with the find_package command without additional work.
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
|
You'll need to add an entry to this particular list (actually there are 3 such lists in that file): pytorch/.ci/manywheel/build_cuda.sh Line 184 in cbcb57d
|
Stack from ghstack (oldest at bottom):
Changes
sysconfig.get_path("purelib"), which typically resolves to<conda_env>/lib/python/site-packages, and NVSHMEM include and lib live undernvidia/nvshmemtarget_link_directoriesPlan of Record
pip install nvshmemthan torch carrying the bits)General rule: submodule for small lib that torch can statically link with
If user pip install a lib, our CI build process should do the same, rather than building from Git submodule (just for its header, for example)
USE_NVSHMEMto gate non-Linux platforms, like Windows, MacNVSHMEMSymmetricMemoryat all.For now, we have symbol dependency on two particular libs from NVSHMEM: