[None][feat] Add NCCL Symmetric Integration for All Reduce #4500
Conversation
Force-pushed from da346ee to 5455c72
📝 Walkthrough

This change introduces support for a new allreduce strategy, NCCL_SYMMETRIC.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant PythonApp
    participant ModelConfig
    participant PyTorchModelEngine
    participant UserBuffersManager
    participant UserBufferAllocator
    participant NCCLUserBufferAllocator
    participant AllreduceOp
    PythonApp->>ModelConfig: Set allreduce_strategy = "NCCL_SYMMETRIC"
    ModelConfig-->>PyTorchModelEngine: Pass config
    PyTorchModelEngine->>UserBuffersManager: initialize(..., use_nccl_symmetric=True)
    UserBuffersManager->>UserBufferAllocator: Set use_nccl_symmetric = True
    UserBufferAllocator->>NCCLUserBufferAllocator: Return NCCLUserBufferAllocator singleton
    AllreduceOp->>AllreduceOp: run(..., strategy=NCCL_SYMMETRIC)
    AllreduceOp->>NCCLUserBufferAllocator: registerUBBuffer(...)
    AllreduceOp->>AllreduceOp: runNCCLAllReduceSymmetric(...)
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Actionable comments posted: 5
🧹 Nitpick comments (4)
tensorrt_llm/_torch/distributed/ops.py (1)
1-1: Logger imported but never used
`logging` is imported and a module-level `logger` is created, yet no subsequent call site references it. If no log messages are planned, drop the import/variable; otherwise add real logging to avoid dead code.

Also applies to: 18-18
cpp/tensorrt_llm/pybind/runtime/bindings.cpp (1)
451-452: Enum exposure looks good, consider docstring / ordering touch-ups

The two new `.value()` entries correctly expose `TWOSHOT` and `NCCL_SYMMETRIC` to Python, keeping the binding in sync with the C++ enum. Minor nits you may want to address while you're here:

- The list is no longer roughly alphabetical nor grouped by concept (e.g., all NCCL* strategies together); re-ordering would improve skim-ability (see the sketch below).
- Public-facing additions normally get a brief note in the surrounding file-level comment or the project's Python docs so users discover the new options. Consider adding one.
No functional issues spotted.
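As a concrete illustration of the ordering nit, here is a minimal, hypothetical pybind11 module that groups the NCCL* strategies together. The enum definition is a stand-in: every member value except `NCCL_SYMMETRIC = 8` is illustrative, and this is not the project's actual binding code.

```cpp
#include <pybind11/pybind11.h>

namespace py = pybind11;

// Illustrative stand-in for AllReduceStrategyType in customAllReduceKernels.h.
// Only NCCL_SYMMETRIC = 8 is confirmed by this PR; other values are assumed.
enum class AllReduceStrategyType : int
{
    NCCL = 0,
    ONESHOT = 1,
    TWOSHOT = 2,
    NCCL_SYMMETRIC = 8,
};

PYBIND11_MODULE(bindings_sketch, m)
{
    // Grouping the NCCL* strategies together keeps the list skimmable.
    py::enum_<AllReduceStrategyType>(m, "AllReduceStrategy")
        .value("NCCL", AllReduceStrategyType::NCCL)
        .value("NCCL_SYMMETRIC", AllReduceStrategyType::NCCL_SYMMETRIC) // new in this PR
        .value("ONESHOT", AllReduceStrategyType::ONESHOT)
        .value("TWOSHOT", AllReduceStrategyType::TWOSHOT);              // newly exposed
}
```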
tensorrt_llm/_torch/compilation/patterns/ub_allreduce.py (1)
187-241: Well-implemented pattern but clarify the commented-out registration.

The `register_no_quant_pattern` function is well-implemented and follows the established pattern structure. However, the registration is commented out on line 241, making this pattern inactive.

Technical Implementation:

- Pattern matching logic correctly handles allreduce without quantization
- Target function properly passes `None` for scale-related arguments (lines 213-214)
- Dtype validation follows the same constraints as other patterns (FP16/BF16 only)
- Function structure is consistent with existing quantization patterns

Question: Is there a specific reason why `register_no_quant_pattern(custom_pass)` is commented out? If this pattern is intended for future use, consider adding a TODO comment explaining when it should be enabled. If it's ready for use, consider uncommenting it or documenting any prerequisites.

```diff
- # register_no_quant_pattern(custom_pass)
+ # TODO: Enable when no-quantization UB allreduce support is ready
+ # register_no_quant_pattern(custom_pass)
```

cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.cpp (1)
24-36: Consider caching the environment variable check and using descriptive variable names.

The singleton pattern is correct, but repeatedly calling `getenv` could impact performance. Also, the underscore variable names are not descriptive.

```diff
 UserBufferAllocator& UserBufferAllocator::Instance()
 {
-    // if environment variable TLLM_USE_NCCL_UB is set to 1, use NCCLUserBufferAllocator
-    char* useNCCLUB = std::getenv("TLLM_USE_NCCL_UB");
-    if (useNCCLUB != nullptr)
+    static bool useNCCL = std::getenv("TLLM_USE_NCCL_UB") != nullptr;
+    if (useNCCL)
     {
-        static NCCLUserBufferAllocator _;
-        return _;
+        static NCCLUserBufferAllocator ncclAllocator;
+        return ncclAllocator;
     }
     else
     {
-        static UserBufferAllocator _;
-        return _;
+        static UserBufferAllocator defaultAllocator;
+        return defaultAllocator;
     }
 }
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
- cpp/tensorrt_llm/kernels/customAllReduceKernels.h (1 hunks)
- cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.cpp (1 hunks)
- cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.h (2 hunks)
- cpp/tensorrt_llm/kernels/userbuffers/ub_interface.cpp (1 hunks)
- cpp/tensorrt_llm/pybind/runtime/bindings.cpp (1 hunks)
- cpp/tensorrt_llm/thop/allreduceOp.cpp (6 hunks)
- tensorrt_llm/_torch/compilation/backend.py (1 hunks)
- tensorrt_llm/_torch/compilation/patterns/ub_allreduce.py (1 hunks)
- tensorrt_llm/_torch/distributed/ops.py (2 hunks)
- tensorrt_llm/_torch/model_config.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (1 hunks)
- tensorrt_llm/functional.py (1 hunks)
- tensorrt_llm/llmapi/llm_args.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (5)
cpp/tensorrt_llm/kernels/userbuffers/ub_interface.cpp (1)
cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.cpp (2)
- Instance (22-36), Instance (22-22)
tensorrt_llm/_torch/model_config.py (1)
tensorrt_llm/functional.py (1)
- AllReduceStrategy (3768-3777)
tensorrt_llm/_torch/compilation/patterns/ub_allreduce.py (5)
cpp/tensorrt_llm/thop/allreduceOp.cpp (18)
- allreduce (1009-1031), allreduce (1009-1013)
- strategy (691-722), strategy (691-691)
- input (159-191), input (159-162), input (207-285), input (207-209), input (287-306), input (287-289), input (308-339), input (308-310), input (341-406), input (341-343), input (408-590), input (408-412), input (592-658), input (592-595)
tensorrt_llm/functional.py (5)
- allreduce (3938-4029), Tensor (105-550), AllReduceStrategy (3768-3777), dtype (248-251), dtype (254-258)
tensorrt_llm/mapping.py (1)
- tp_group (337-338)
tensorrt_llm/_torch/distributed/ops.py (1)
- userbuffers_allreduce_finalize (104-109)
tensorrt_llm/_torch/compilation/patterns/ar_residual_norm.py (1)
- extra_check (72-87)
cpp/tensorrt_llm/thop/allreduceOp.cpp (2)
tests/unittest/_torch/multi_gpu/test_user_buffers.py (1)
- create_userbuffers_tensor (41-46)
cpp/tensorrt_llm/common/opUtils.cpp (2)
- getDtypeMap (33-47), getDtypeMap (33-33)
cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.h (1)
cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.cpp (18)
- initialize (38-48), initialize (38-38), initialize (87-100), initialize (87-87)
- isInitialized (50-53), isInitialized (50-50)
- allocate (64-71), allocate (64-64)
- deallocate (73-73), deallocate (73-73)
- get (75-79), get (75-75)
- comm (81-85), comm (81-81)
- registerUBBuffer (55-62), registerUBBuffer (55-55), registerUBBuffer (102-111), registerUBBuffer (102-102)
🔇 Additional comments (17)
tensorrt_llm/functional.py (1)
3884-3886: Confirm all downstream mappings handle the new `NCCL_SYMMETRIC` (value 8).

Adding the enum member is straightforward, but any switch/if-else ladders, array-indexed look-ups, or serialization schemas that assume a contiguous [0..7] range will now be off by one. Please double-check:

- Kernel dispatch tables in C++ (`customAllReduceKernels.h`, `allreduceOp.cpp`).
- Python <-> C++ marshalling layers (`bindings.cpp`) for exhaustive if/elif chains.
- Config validators that still call `max(AllReduceStrategy)` or hard-code 7.

A quick grep or unit test asserting round-trip (enum → int → enum) will catch any omissions; one way to pin the C++ side is sketched below.
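A self-contained sketch of such a compile-time guard (in the real tree the enum lives in `customAllReduceKernels.h`; only the value 8 is confirmed by this review):

```cpp
// Stand-in for the real definition in customAllReduceKernels.h.
enum class AllReduceStrategyType : int
{
    /* ... existing members 0..7 ... */
    NCCL_SYMMETRIC = 8,
};

// Fails the build if the value ever drifts from the Python-side
// tensorrt_llm.functional.AllReduceStrategy.NCCL_SYMMETRIC (== 8).
static_assert(static_cast<int>(AllReduceStrategyType::NCCL_SYMMETRIC) == 8,
    "NCCL_SYMMETRIC must stay in sync with tensorrt_llm.functional.AllReduceStrategy");
```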
cpp/tensorrt_llm/kernels/customAllReduceKernels.h (1)
59-61: Confirm Python/C++ enum parity after adding `MNNVL` & `NCCL_SYMMETRIC`

`AllReduceStrategyType` is documented as "must stay in sync with `tensorrt_llm.functional.AllReduceStrategy`". Double-check `functional.py`, the pybind bindings, and any switch/serialization tables for the two new values to avoid UB at runtime.

cpp/tensorrt_llm/kernels/userbuffers/ub_interface.cpp (1)
39-40: LGTM – API rename propagated correctly
`isInitialized()` matches the allocator refactor; no functional issues spotted here.

tensorrt_llm/llmapi/llm_args.py (1)
1879-1883: NCCL_SYMMETRIC strategy fully integrated and verified

All references to the new `NCCL_SYMMETRIC` option are present across Python and C++ layers, confirming the implementation is complete:

- tensorrt_llm/functional.py: Added `NCCL_SYMMETRIC = 8` in the `AllReduceStrategy` enum
- tensorrt_llm/llmapi/llm_args.py: Included `'NCCL_SYMMETRIC'` in the `Literal` for `allreduce_strategy`
- tensorrt_llm/_torch/model_config.py: Maps `"NCCL_SYMMETRIC"` to `AllReduceStrategy.NCCL_SYMMETRIC`
- tensorrt_llm/_torch/pyexecutor/model_engine.py: Handles the `"NCCL_SYMMETRIC"` branch in the executor
- cpp/tensorrt_llm/kernels/customAllReduceKernels.h: Enum `AllReduceStrategyType::NCCL_SYMMETRIC = 8`
- cpp/tensorrt_llm/thop/allreduceOp.cpp: Implements the `runNCCLAllReduceSymmetric` case
- cpp/tensorrt_llm/pybind/runtime/bindings.cpp: Exposes `"NCCL_SYMMETRIC"` in the Python bindings

No further changes required; ready to merge.
cpp/tensorrt_llm/thop/allreduceOp.cpp (5)
- 166-167: Good addition for debugging message sizes. The debug log provides useful information about the allreduce message size, which can help diagnose performance issues.
- 180-181: Correct dispatch for the new NCCL_SYMMETRIC strategy. The case statement properly routes to the new symmetric allreduce implementation (see the sketch after this list).
- 671-674: Correct handling of NCCL_SYMMETRIC strategy selection. The runtime strategy getter properly returns the NCCL_SYMMETRIC strategy when configured.
- 700-704: Good addition of debug logging for the NCCL_SYMMETRIC strategy. The logging is consistent with other strategies and helps with debugging.
- 720-720: Improved debug logging for unknown strategies. Including the strategy value helps identify unexpected strategy types during debugging.
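For context, the dispatch described in the 180-181 comment plausibly has the following shape. This is a hedged excerpt, not the PR's actual code: `runNCCLAllReduceSymmetric` and the enum member come from this PR, while `runNCCLAllReduce`, `output`, and the exact logging call are illustrative stand-ins.

```cpp
// Hypothetical excerpt from AllreduceOp::run(); signatures are illustrative.
switch (strategy)
{
case AllReduceStrategyType::NCCL:
    output = runNCCLAllReduce(input);
    break;
case AllReduceStrategyType::NCCL_SYMMETRIC:
    // New in this PR: route through NCCL symmetric (window-registered) buffers.
    output = runNCCLAllReduceSymmetric(input);
    break;
default:
    // Mirrors the improved debug logging noted above: include the strategy value.
    TLLM_LOG_DEBUG("Unknown allreduce strategy: %d", static_cast<int>(strategy));
    break;
}
```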
cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.h (3)
- 31-37: Good extension of UBBuffer to support NCCL windows. The addition of the window member and updated constructor maintains backward compatibility while enabling NCCL window support.
- 54-69: Good refactoring for naming consistency and extensibility. The changes improve code consistency with camelCase naming and the mPrefix convention for member variables. Making `registerUBBuffer` virtual and moving members to protected scope enables proper subclass implementation.
- 71-79: Well-designed NCCLUserBufferAllocator class. The new allocator properly extends the base class and uses `shared_ptr` for safe NCCL communicator management. (A sketch of the resulting hierarchy follows this list.)
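Reading those three comments together, the revised header plausibly looks something like this sketch. It assumes an NCCL version that defines `ncclWindow_t`; names not quoted in the review (the `UBBuffer` field names, `mIsInitialized`) are guesses, not the PR's actual declarations.

```cpp
#include <cstddef>
#include <memory>
#include <nccl.h>

// Buffer record extended with an NCCL window handle.
struct UBBuffer
{
    void* addr{nullptr};
    int handle{-1};
    size_t size{0};
    ncclWindow_t window{nullptr}; // nullptr when the buffer has no NCCL window

    UBBuffer(void* a = nullptr, int h = -1, size_t s = 0, ncclWindow_t w = nullptr)
        : addr(a), handle(h), size(s), window(w)
    {
    }
};

class UserBufferAllocator
{
public:
    static UserBufferAllocator& Instance();
    virtual ~UserBufferAllocator() = default;

protected:
    // Virtual so NCCLUserBufferAllocator can register buffers through NCCL.
    virtual UBBuffer registerUBBuffer(size_t bytes);

    bool mIsInitialized{false}; // mPrefix convention from the refactor
};

class NCCLUserBufferAllocator : public UserBufferAllocator
{
protected:
    UBBuffer registerUBBuffer(size_t bytes) override;

private:
    std::shared_ptr<ncclComm_t> mComm; // shared_ptr for safe communicator lifetime
};
```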
cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.cpp (5)
- 17-18: Necessary includes for the NCCLUserBufferAllocator implementation. The includes are required for the `getComm` function and `std::set` usage. (A sketch of what that implementation might look like follows this list.)
- 38-52: Good refactoring for naming consistency. The member variable renaming and camelCase parameter names improve code consistency.
- 55-62: Consistent method naming refactoring. The camelCase method name and updated member variables maintain consistency.
- 66-71: Consistent refactoring in the allocate method. The method properly uses the updated naming conventions.
- 76-85: Consistent member variable usage. The methods properly use the updated member variable names.
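Continuing the header sketch above, the NCCL override could be implemented along these lines. This is an assumption-heavy sketch: it presumes NCCL >= 2.27, where `ncclMemAlloc`, `ncclCommWindowRegister`, and the `NCCL_WIN_COLL_SYMMETRIC` flag exist, and the error-check macro is local to the sketch rather than the project's own.

```cpp
#include <nccl.h>
#include <stdexcept>

// Minimal error check for the sketch; the real code base has its own macros.
#define NCCL_CHECK(cmd)                                                        \
    do                                                                         \
    {                                                                          \
        ncclResult_t r = (cmd);                                                \
        if (r != ncclSuccess)                                                  \
            throw std::runtime_error(ncclGetErrorString(r));                   \
    } while (0)

UBBuffer NCCLUserBufferAllocator::registerUBBuffer(size_t bytes)
{
    void* ptr = nullptr;
    NCCL_CHECK(ncclMemAlloc(&ptr, bytes)); // NCCL-owned, symmetric-capable allocation

    ncclWindow_t window = nullptr;
    // NCCL_WIN_COLL_SYMMETRIC makes the window usable by the symmetric
    // collective kernels that the NCCL_SYMMETRIC strategy relies on.
    NCCL_CHECK(ncclCommWindowRegister(*mComm, ptr, bytes, &window, NCCL_WIN_COLL_SYMMETRIC));

    return UBBuffer(ptr, /*handle=*/0, bytes, window);
}
```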
Force-pushed from 34a1010 to 07d53a9

/bot run
Force-pushed from 07d53a9 to 9661360
PR_Github #12354 [ run ] triggered by Bot

PR_Github #12354 [ run ] completed with state
Force-pushed from 3d447b9 to d1e3b46

/bot run

PR_Github #12606 [ run ] triggered by Bot

PR_Github #12606 [ run ] completed with state
LGTM from the llmapi side.
Force-pushed from d1e3b46 to 1323445

Force-pushed from 993994d to 4340e8e

/bot run

PR_Github #14231 Bot args parsing error: usage: /bot [-h]

/bot run --multi-gpu-test-only

PR_Github #14232 Bot args parsing error: usage: /bot [-h]

/bot run --only-multi-gpu-test

PR_Github #14234 [ run ] triggered by Bot
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>

Add log message

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>

Enable NCCL symmetric for non-torch compile path

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>

Fix build

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>

Address review comments

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Force-pushed from d623585 to 4838b90

/bot run --only-multi-gpu-test

PR_Github #14234 [ run ] completed with state

PR_Github #14254 [ run ] triggered by Bot

PR_Github #14254 [ run ] completed with state
/bot reuse-pipeline

PR_Github #14319 [ reuse-pipeline ] triggered by Bot

PR_Github #14319 [ reuse-pipeline ] completed with state
What's the plan for the "UserBuffer" kernels? Is NCCL symmetric designed to replace them?

@litaotju I think once we have the fused version of the NCCL symmetric APIs we can start replacing the user buffer kernels (assuming the performance is similar).
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Add NCCL Symmetric Integration for All Reduce
Add NCCL symmetric integration for all reduce. Performance results on 8×H100:

Summary by CodeRabbit
New Features
Enhancements
Tests