[None][feat] Add NCCL Symmetric Integration for All Reduce #4500
Conversation
Force-pushed from da346ee to 5455c72
📝 Walkthrough

This change introduces support for a new allreduce strategy, NCCL_SYMMETRIC.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant PythonApp
    participant ModelConfig
    participant PyTorchModelEngine
    participant UserBuffersManager
    participant UserBufferAllocator
    participant NCCLUserBufferAllocator
    participant AllreduceOp
    PythonApp->>ModelConfig: Set allreduce_strategy = "NCCL_SYMMETRIC"
    ModelConfig-->>PyTorchModelEngine: Pass config
    PyTorchModelEngine->>UserBuffersManager: initialize(..., use_nccl_symmetric=True)
    UserBuffersManager->>UserBufferAllocator: Set use_nccl_symmetric = True
    UserBufferAllocator->>NCCLUserBufferAllocator: Return NCCLUserBufferAllocator singleton
    AllreduceOp->>AllreduceOp: run(..., strategy=NCCL_SYMMETRIC)
    AllreduceOp->>NCCLUserBufferAllocator: registerUBBuffer(...)
    AllreduceOp->>AllreduceOp: runNCCLAllReduceSymmetric(...)
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Actionable comments posted: 5
🧹 Nitpick comments (4)
tensorrt_llm/_torch/distributed/ops.py (1)
1-1: Logger imported but never used
`logging` is imported and a module-level `logger` is created, yet no subsequent call site references it. If no log messages are planned, drop the import/variable; otherwise add real logging to avoid dead code.

Also applies to: 18-18
cpp/tensorrt_llm/pybind/runtime/bindings.cpp (1)
451-452: Enum exposure looks good, consider docstring / ordering touch-ups

The two new `.value()` entries correctly expose `TWOSHOT` and `NCCL_SYMMETRIC` to Python, keeping the binding in sync with the C++ enum. Minor nits you may want to address while you're here:

- The list is no longer roughly alphabetical nor grouped by concept (e.g., all NCCL* strategies together); re-ordering would improve skim-ability (see the sketch below).
- Public-facing additions normally get a brief note in the surrounding file-level comment or the project's Python docs so users discover the new options. Consider adding one.
No functional issues spotted.
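As a concrete illustration of the ordering nit, here is a minimal, hypothetical pybind11 module that groups the NCCL* strategies together. The enum definition is a stand-in: every member value except `NCCL_SYMMETRIC = 8` is illustrative, and this is not the project's actual binding code.

```cpp
#include <pybind11/pybind11.h>

namespace py = pybind11;

// Illustrative stand-in for AllReduceStrategyType in customAllReduceKernels.h.
// Only NCCL_SYMMETRIC = 8 is confirmed by this PR; other values are assumed.
enum class AllReduceStrategyType : int
{
    NCCL = 0,
    ONESHOT = 1,
    TWOSHOT = 2,
    NCCL_SYMMETRIC = 8,
};

PYBIND11_MODULE(bindings_sketch, m)
{
    // Grouping the NCCL* strategies together keeps the list skimmable.
    py::enum_<AllReduceStrategyType>(m, "AllReduceStrategy")
        .value("NCCL", AllReduceStrategyType::NCCL)
        .value("NCCL_SYMMETRIC", AllReduceStrategyType::NCCL_SYMMETRIC) // new in this PR
        .value("ONESHOT", AllReduceStrategyType::ONESHOT)
        .value("TWOSHOT", AllReduceStrategyType::TWOSHOT);              // newly exposed
}
```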
tensorrt_llm/_torch/compilation/patterns/ub_allreduce.py (1)
187-241: Well-implemented pattern but clarify the commented-out registration.

The `register_no_quant_pattern` function is well-implemented and follows the established pattern structure. However, the registration is commented out on line 241, making this pattern inactive.

Technical Implementation:

- Pattern matching logic correctly handles allreduce without quantization
- Target function properly passes `None` for scale-related arguments (lines 213-214)
- Dtype validation follows the same constraints as other patterns (FP16/BF16 only)
- Function structure is consistent with existing quantization patterns

Question: Is there a specific reason why `register_no_quant_pattern(custom_pass)` is commented out? If this pattern is intended for future use, consider adding a TODO comment explaining when it should be enabled. If it's ready for use, consider uncommenting it or documenting any prerequisites.

```diff
- # register_no_quant_pattern(custom_pass)
+ # TODO: Enable when no-quantization UB allreduce support is ready
+ # register_no_quant_pattern(custom_pass)
```

cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.cpp (1)
24-36: Consider caching the environment variable check and using descriptive variable names.

The singleton pattern is correct, but repeatedly calling `getenv` could impact performance. Also, the underscore variable names are not descriptive.

```diff
 UserBufferAllocator& UserBufferAllocator::Instance()
 {
-    // if environment variable TLLM_USE_NCCL_UB is set to 1, use NCCLUserBufferAllocator
-    char* useNCCLUB = std::getenv("TLLM_USE_NCCL_UB");
-    if (useNCCLUB != nullptr)
+    static bool useNCCL = std::getenv("TLLM_USE_NCCL_UB") != nullptr;
+    if (useNCCL)
     {
-        static NCCLUserBufferAllocator _;
-        return _;
+        static NCCLUserBufferAllocator ncclAllocator;
+        return ncclAllocator;
     }
     else
     {
-        static UserBufferAllocator _;
-        return _;
+        static UserBufferAllocator defaultAllocator;
+        return defaultAllocator;
     }
 }
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
- cpp/tensorrt_llm/kernels/customAllReduceKernels.h (1 hunks)
- cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.cpp (1 hunks)
- cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.h (2 hunks)
- cpp/tensorrt_llm/kernels/userbuffers/ub_interface.cpp (1 hunks)
- cpp/tensorrt_llm/pybind/runtime/bindings.cpp (1 hunks)
- cpp/tensorrt_llm/thop/allreduceOp.cpp (6 hunks)
- tensorrt_llm/_torch/compilation/backend.py (1 hunks)
- tensorrt_llm/_torch/compilation/patterns/ub_allreduce.py (1 hunks)
- tensorrt_llm/_torch/distributed/ops.py (2 hunks)
- tensorrt_llm/_torch/model_config.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (1 hunks)
- tensorrt_llm/functional.py (1 hunks)
- tensorrt_llm/llmapi/llm_args.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (5)
cpp/tensorrt_llm/kernels/userbuffers/ub_interface.cpp (1)
cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.cpp (2)
- Instance (22-36), Instance (22-22)
tensorrt_llm/_torch/model_config.py (1)
tensorrt_llm/functional.py (1)
- AllReduceStrategy (3768-3777)
tensorrt_llm/_torch/compilation/patterns/ub_allreduce.py (5)
cpp/tensorrt_llm/thop/allreduceOp.cpp (18)
- allreduce (1009-1031), allreduce (1009-1013)
- strategy (691-722), strategy (691-691)
- input (159-191), input (159-162), input (207-285), input (207-209), input (287-306), input (287-289), input (308-339), input (308-310), input (341-406), input (341-343), input (408-590), input (408-412), input (592-658), input (592-595)
tensorrt_llm/functional.py (5)
- allreduce (3938-4029), Tensor (105-550), AllReduceStrategy (3768-3777), dtype (248-251), dtype (254-258)
tensorrt_llm/mapping.py (1)
- tp_group (337-338)
tensorrt_llm/_torch/distributed/ops.py (1)
- userbuffers_allreduce_finalize (104-109)
tensorrt_llm/_torch/compilation/patterns/ar_residual_norm.py (1)
- extra_check (72-87)
cpp/tensorrt_llm/thop/allreduceOp.cpp (2)
tests/unittest/_torch/multi_gpu/test_user_buffers.py (1)
- create_userbuffers_tensor (41-46)
cpp/tensorrt_llm/common/opUtils.cpp (2)
- getDtypeMap (33-47), getDtypeMap (33-33)
cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.h (1)
cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.cpp (18)
- initialize (38-48), initialize (38-38), initialize (87-100), initialize (87-87)
- isInitialized (50-53), isInitialized (50-50)
- allocate (64-71), allocate (64-64)
- deallocate (73-73), deallocate (73-73)
- get (75-79), get (75-75)
- comm (81-85), comm (81-81)
- registerUBBuffer (55-62), registerUBBuffer (55-55), registerUBBuffer (102-111), registerUBBuffer (102-102)
🔇 Additional comments (17)
tensorrt_llm/functional.py (1)
3884-3886: Confirm all downstream mappings handle the new `NCCL_SYMMETRIC` (value 8).

Adding the enum member is straightforward, but any switch/if-else ladders, array-indexed look-ups, or serialization schemas that assume a contiguous [0..7] range will now be off by one. Please double-check:

- Kernel dispatch tables in C++ (`customAllReduceKernels.h`, `allreduceOp.cpp`).
- Python <-> C++ marshalling layers (`bindings.cpp`) for exhaustive if/elif chains.
- Config validators that still call `max(AllReduceStrategy)` or hard-code 7.

A quick grep or unit test asserting round-trip (enum → int → enum) will catch any omissions; one way to pin the C++ side is sketched below.
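A self-contained sketch of such a compile-time guard (in the real tree the enum lives in `customAllReduceKernels.h`; only the value 8 is confirmed by this review):

```cpp
// Stand-in for the real definition in customAllReduceKernels.h.
enum class AllReduceStrategyType : int
{
    /* ... existing members 0..7 ... */
    NCCL_SYMMETRIC = 8,
};

// Fails the build if the value ever drifts from the Python-side
// tensorrt_llm.functional.AllReduceStrategy.NCCL_SYMMETRIC (== 8).
static_assert(static_cast<int>(AllReduceStrategyType::NCCL_SYMMETRIC) == 8,
    "NCCL_SYMMETRIC must stay in sync with tensorrt_llm.functional.AllReduceStrategy");
```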
cpp/tensorrt_llm/kernels/customAllReduceKernels.h (1)
59-61: Confirm Python/C++ enum parity after adding `MNNVL` & `NCCL_SYMMETRIC`

`AllReduceStrategyType` is documented as "must stay in sync with `tensorrt_llm.functional.AllReduceStrategy`". Double-check `functional.py`, the pybind bindings, and any switch/serialization tables for the two new values to avoid UB at runtime.

cpp/tensorrt_llm/kernels/userbuffers/ub_interface.cpp (1)
39-40: LGTM – API rename propagated correctly
`isInitialized()` matches the allocator refactor; no functional issues spotted here.

tensorrt_llm/llmapi/llm_args.py (1)
1879-1883: NCCL_SYMMETRIC strategy fully integrated and verified

All references to the new `NCCL_SYMMETRIC` option are present across Python and C++ layers, confirming the implementation is complete:

- tensorrt_llm/functional.py: Added `NCCL_SYMMETRIC = 8` in the `AllReduceStrategy` enum
- tensorrt_llm/llmapi/llm_args.py: Included `'NCCL_SYMMETRIC'` in the `Literal` for `allreduce_strategy`
- tensorrt_llm/_torch/model_config.py: Maps `"NCCL_SYMMETRIC"` to `AllReduceStrategy.NCCL_SYMMETRIC`
- tensorrt_llm/_torch/pyexecutor/model_engine.py: Handles the `"NCCL_SYMMETRIC"` branch in the executor
- cpp/tensorrt_llm/kernels/customAllReduceKernels.h: Enum `AllReduceStrategyType::NCCL_SYMMETRIC = 8`
- cpp/tensorrt_llm/thop/allreduceOp.cpp: Implements the `runNCCLAllReduceSymmetric` case
- cpp/tensorrt_llm/pybind/runtime/bindings.cpp: Exposes `"NCCL_SYMMETRIC"` in the Python bindings

No further changes required; ready to merge.
cpp/tensorrt_llm/thop/allreduceOp.cpp (5)
- 166-167: Good addition for debugging message sizes. The debug log provides useful information about the allreduce message size, which can help diagnose performance issues.
- 180-181: Correct dispatch for the new NCCL_SYMMETRIC strategy. The case statement properly routes to the new symmetric allreduce implementation (see the sketch after this list).
- 671-674: Correct handling of NCCL_SYMMETRIC strategy selection. The runtime strategy getter properly returns the NCCL_SYMMETRIC strategy when configured.
- 700-704: Good addition of debug logging for the NCCL_SYMMETRIC strategy. The logging is consistent with other strategies and helps with debugging.
- 720-720: Improved debug logging for unknown strategies. Including the strategy value helps identify unexpected strategy types during debugging.
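For context, the dispatch described in the 180-181 comment plausibly has the following shape. This is a hedged excerpt, not the PR's actual code: `runNCCLAllReduceSymmetric` and the enum member come from this PR, while `runNCCLAllReduce`, `output`, and the exact logging call are illustrative stand-ins.

```cpp
// Hypothetical excerpt from AllreduceOp::run(); signatures are illustrative.
switch (strategy)
{
case AllReduceStrategyType::NCCL:
    output = runNCCLAllReduce(input);
    break;
case AllReduceStrategyType::NCCL_SYMMETRIC:
    // New in this PR: route through NCCL symmetric (window-registered) buffers.
    output = runNCCLAllReduceSymmetric(input);
    break;
default:
    // Mirrors the improved debug logging noted above: include the strategy value.
    TLLM_LOG_DEBUG("Unknown allreduce strategy: %d", static_cast<int>(strategy));
    break;
}
```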
cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.h (3)
- 31-37: Good extension of UBBuffer to support NCCL windows. The addition of the window member and updated constructor maintains backward compatibility while enabling NCCL window support.
- 54-69: Good refactoring for naming consistency and extensibility. The changes improve code consistency with camelCase naming and the mPrefix convention for member variables. Making `registerUBBuffer` virtual and moving members to protected scope enables proper subclass implementation.
- 71-79: Well-designed NCCLUserBufferAllocator class. The new allocator properly extends the base class and uses `shared_ptr` for safe NCCL communicator management. (A sketch of the resulting hierarchy follows this list.)
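Reading those three comments together, the revised header plausibly looks something like this sketch. It assumes an NCCL version that defines `ncclWindow_t`; names not quoted in the review (the `UBBuffer` field names, `mIsInitialized`) are guesses, not the PR's actual declarations.

```cpp
#include <cstddef>
#include <memory>
#include <nccl.h>

// Buffer record extended with an NCCL window handle.
struct UBBuffer
{
    void* addr{nullptr};
    int handle{-1};
    size_t size{0};
    ncclWindow_t window{nullptr}; // nullptr when the buffer has no NCCL window

    UBBuffer(void* a = nullptr, int h = -1, size_t s = 0, ncclWindow_t w = nullptr)
        : addr(a), handle(h), size(s), window(w)
    {
    }
};

class UserBufferAllocator
{
public:
    static UserBufferAllocator& Instance();
    virtual ~UserBufferAllocator() = default;

protected:
    // Virtual so NCCLUserBufferAllocator can register buffers through NCCL.
    virtual UBBuffer registerUBBuffer(size_t bytes);

    bool mIsInitialized{false}; // mPrefix convention from the refactor
};

class NCCLUserBufferAllocator : public UserBufferAllocator
{
protected:
    UBBuffer registerUBBuffer(size_t bytes) override;

private:
    std::shared_ptr<ncclComm_t> mComm; // shared_ptr for safe communicator lifetime
};
```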
cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.cpp (5)
- 17-18: Necessary includes for the NCCLUserBufferAllocator implementation. The includes are required for the `getComm` function and `std::set` usage. (A sketch of what that implementation might look like follows this list.)
- 38-52: Good refactoring for naming consistency. The member variable renaming and camelCase parameter names improve code consistency.
- 55-62: Consistent method naming refactoring. The camelCase method name and updated member variables maintain consistency.
- 66-71: Consistent refactoring in the allocate method. The method properly uses the updated naming conventions.
- 76-85: Consistent member variable usage. The methods properly use the updated member variable names.
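Continuing the header sketch above, the NCCL override could be implemented along these lines. This is an assumption-heavy sketch: it presumes NCCL >= 2.27, where `ncclMemAlloc`, `ncclCommWindowRegister`, and the `NCCL_WIN_COLL_SYMMETRIC` flag exist, and the error-check macro is local to the sketch rather than the project's own.

```cpp
#include <nccl.h>
#include <stdexcept>

// Minimal error check for the sketch; the real code base has its own macros.
#define NCCL_CHECK(cmd)                                                        \
    do                                                                         \
    {                                                                          \
        ncclResult_t r = (cmd);                                                \
        if (r != ncclSuccess)                                                  \
            throw std::runtime_error(ncclGetErrorString(r));                   \
    } while (0)

UBBuffer NCCLUserBufferAllocator::registerUBBuffer(size_t bytes)
{
    void* ptr = nullptr;
    NCCL_CHECK(ncclMemAlloc(&ptr, bytes)); // NCCL-owned, symmetric-capable allocation

    ncclWindow_t window = nullptr;
    // NCCL_WIN_COLL_SYMMETRIC makes the window usable by the symmetric
    // collective kernels that the NCCL_SYMMETRIC strategy relies on.
    NCCL_CHECK(ncclCommWindowRegister(*mComm, ptr, bytes, &window, NCCL_WIN_COLL_SYMMETRIC));

    return UBBuffer(ptr, /*handle=*/0, bytes, window);
}
```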
Force-pushed from 34a1010 to 07d53a9

/bot run
Force-pushed from 07d53a9 to 9661360
PR_Github #12354 [ run ] triggered by Bot

PR_Github #12354 [ run ] completed with state
Force-pushed from 3d447b9 to d1e3b46

/bot run

PR_Github #12606 [ run ] triggered by Bot

PR_Github #12606 [ run ] completed with state
LGTM from the llmapi side.
Force-pushed from d1e3b46 to 1323445

Force-pushed from 993994d to 4340e8e

/bot run

PR_Github #14231 Bot args parsing error: usage: /bot [-h]

/bot run --multi-gpu-test-only

PR_Github #14232 Bot args parsing error: usage: /bot [-h]

/bot run --only-multi-gpu-test

PR_Github #14234 [ run ] triggered by Bot
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>

Add log message

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>

Enable NCCL symmetric for non-torch compile path

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>

Fix build

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>

Address review comments

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Force-pushed from d623585 to 4838b90

/bot run --only-multi-gpu-test

PR_Github #14234 [ run ] completed with state

PR_Github #14254 [ run ] triggered by Bot

PR_Github #14254 [ run ] completed with state
/bot reuse-pipeline

PR_Github #14319 [ reuse-pipeline ] triggered by Bot

PR_Github #14319 [ reuse-pipeline ] completed with state
What's the plan for the "UserBuffer" kernels? Is NCCL symmetric designed to replace them?

@litaotju I think once we have the fused version of the NCCL symmetric APIs we can start replacing the user buffer kernels (assuming the performance is similar).
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Add NCCL Symmetric Integration for All Reduce
Add NCCL symmetric integration for all reduce. Performance results on 8×H100:

Summary by CodeRabbit
New Features
Enhancements
Tests