[None][feat] Add NCCL Symmetric Integration for All Reduce by Tabrizian · Pull Request #4500 · NVIDIA/TensorRT-LLM · GitHub

Conversation

@Tabrizian Tabrizian (Member) commented May 20, 2025

Add NCCL Symmetric Integration for All Reduce

Add NCCL symmetric integration for all reduce. Performance results on 8×H100:
[performance comparison image]

Summary by CodeRabbit

  • New Features

    • Added support for a new allreduce strategy: NCCL_SYMMETRIC, available in backend and Python interfaces.
    • Extended configuration options to select the NCCL_SYMMETRIC strategy via model arguments and configuration files.
    • Introduced a new user buffer allocator variant to enable NCCL symmetric communication.
  • Enhancements

    • Improved user buffer management and initialization to support NCCL_SYMMETRIC communication.
    • Enhanced runtime strategy handling, logging, and dispatch to include NCCL_SYMMETRIC.
    • Updated PyTorch model engine to conditionally enable user buffers for NCCL_SYMMETRIC.
  • Tests

    • Extended multi-GPU AllReduce tests to cover both MNNVL and NCCL_SYMMETRIC strategies.
    • Adjusted user buffer manager initialization in tests to support the new strategy.

@Tabrizian Tabrizian force-pushed the user/imant/ncclsymmetric branch from da346ee to 5455c72 on July 18, 2025 22:33
@coderabbitai coderabbitai bot (Contributor) commented Jul 18, 2025

📝 Walkthrough

This change introduces support for a new allreduce strategy, NCCL_SYMMETRIC, across both C++ and Python codebases. It updates enums, Python bindings, and configuration options, and implements the corresponding logic in core allocator and operator classes. Method and member renaming, new class additions, and API signature updates are included for user buffer management.

Changes

Cohort / File(s) Change Summary
AllReduce Strategy Enum Extension
cpp/tensorrt_llm/kernels/customAllReduceKernels.h, tensorrt_llm/functional.py, cpp/tensorrt_llm/pybind/runtime/bindings.cpp
Added NCCL_SYMMETRIC (and MNNVL in C++) to the AllReduceStrategyType enum in C++ and Python, and exposed the new value to Python bindings.
User Buffer Allocator Refactor & NCCL Symmetric Support
cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.cpp, cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.h
Refactored member/method names to camelCase, added NCCLUserBufferAllocator subclass, introduced logic for NCCL_SYMMETRIC allocation, and added a static flag to select allocator type.
User Buffer Manager API Update
cpp/tensorrt_llm/kernels/userbuffers/userbuffersManager.cpp, cpp/tensorrt_llm/kernels/userbuffers/userbuffersManager.h, tests/unittest/_torch/multi_gpu/test_user_buffers.py
Updated initialize and related functions to accept an extra use_nccl_symmetric boolean parameter; updated test to pass the new argument.
User Buffer Allocator Interface Update
cpp/tensorrt_llm/kernels/userbuffers/ub_interface.cpp
Updated to use the renamed isInitialized() method.
Allreduce Operator NCCL Symmetric Implementation
cpp/tensorrt_llm/thop/allreduceOp.cpp
Added support for NCCL_SYMMETRIC strategy in the allreduce operator, including a new method for NCCL symmetric allreduce logic and updated runtime strategy dispatch.
Python Model and Config API Update
tensorrt_llm/llmapi/llm_args.py, tensorrt_llm/_torch/model_config.py, tensorrt_llm/_torch/pyexecutor/model_engine.py
Added "NCCL_SYMMETRIC" as an allowed value in model arguments, updated strategy mapping, and adjusted model engine logic to handle the new strategy and pass the flag to userbuffer initialization.
Python Logging Addition
tensorrt_llm/_torch/distributed/ops.py
Introduced logging module and logger instance at the module level.
Multi-GPU AllReduce Test Extension
tests/unittest/_torch/multi_gpu/test_mnnvl_allreduce.py
Parameterized tests to run with both MNNVL and NCCL_SYMMETRIC strategies, including conditional user buffer initialization for NCCL_SYMMETRIC.

Sequence Diagram(s)

sequenceDiagram
    participant PythonApp
    participant ModelConfig
    participant PyTorchModelEngine
    participant UserBuffersManager
    participant UserBufferAllocator
    participant NCCLUserBufferAllocator
    participant AllreduceOp

    PythonApp->>ModelConfig: Set allreduce_strategy = "NCCL_SYMMETRIC"
    ModelConfig-->>PyTorchModelEngine: Pass config
    PyTorchModelEngine->>UserBuffersManager: initialize(..., use_nccl_symmetric=True)
    UserBuffersManager->>UserBufferAllocator: Set use_nccl_symmetric = True
    UserBufferAllocator->>NCCLUserBufferAllocator: Return NCCLUserBufferAllocator singleton
    AllreduceOp->>AllreduceOp: run(..., strategy=NCCL_SYMMETRIC)
    AllreduceOp->>NCCLUserBufferAllocator: registerUBBuffer(...)
    AllreduceOp->>AllreduceOp: runNCCLAllReduceSymmetric(...)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

Community want to contribute, Documentation

Suggested reviewers

  • nv-guomingz
  • pcastonguay
  • juney-nvidia
  • shaharmor98



📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d623585 and 4838b90.

📒 Files selected for processing (16)
  • cpp/tensorrt_llm/kernels/customAllReduceKernels.h (1 hunks)
  • cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.h (3 hunks)
  • cpp/tensorrt_llm/kernels/userbuffers/ub_interface.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/userbuffers/userbuffersManager.cpp (2 hunks)
  • cpp/tensorrt_llm/kernels/userbuffers/userbuffersManager.h (2 hunks)
  • cpp/tensorrt_llm/pybind/runtime/bindings.cpp (1 hunks)
  • cpp/tensorrt_llm/thop/allreduceOp.cpp (6 hunks)
  • tensorrt_llm/_torch/distributed/ops.py (2 hunks)
  • tensorrt_llm/_torch/model_config.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py (2 hunks)
  • tensorrt_llm/functional.py (1 hunks)
  • tensorrt_llm/llmapi/llm_args.py (1 hunks)
  • tests/unittest/_torch/multi_gpu/test_mnnvl_allreduce.py (8 hunks)
  • tests/unittest/_torch/multi_gpu/test_user_buffers.py (1 hunks)
  • tests/unittest/api_stability/references/llm.yaml (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • tensorrt_llm/_torch/distributed/ops.py
🚧 Files skipped from review as they are similar to previous changes (13)
  • cpp/tensorrt_llm/kernels/customAllReduceKernels.h
  • tensorrt_llm/_torch/model_config.py
  • tests/unittest/api_stability/references/llm.yaml
  • tests/unittest/_torch/multi_gpu/test_user_buffers.py
  • cpp/tensorrt_llm/pybind/runtime/bindings.cpp
  • cpp/tensorrt_llm/kernels/userbuffers/ub_interface.cpp
  • tensorrt_llm/functional.py
  • cpp/tensorrt_llm/kernels/userbuffers/userbuffersManager.h
  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • cpp/tensorrt_llm/kernels/userbuffers/userbuffersManager.cpp
  • tests/unittest/_torch/multi_gpu/test_mnnvl_allreduce.py
  • cpp/tensorrt_llm/thop/allreduceOp.cpp
🧰 Additional context used
🧬 Code Graph Analysis (1)
cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.h (2)
cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.cpp (34)
  • initialize (37-47)
  • initialize (37-37)
  • initialize (86-99)
  • initialize (86-86)
  • isInitialized (49-52)
  • isInitialized (49-49)
  • allocate (63-70)
  • allocate (63-63)
  • deallocate (72-72)
  • deallocate (72-72)
  • get (74-78)
  • get (74-74)
  • comm (80-84)
  • comm (80-80)
  • registerUBBuffer (54-61)
  • registerUBBuffer (54-54)
  • registerUBBuffer (101-120)
  • registerUBBuffer (101-101)
  • NCCLHelper (135-142)
  • NCCLHelper (144-155)
  • getNCCLCommWindowRegister (232-235)
  • getNCCLCommWindowRegister (232-232)
  • getNCCLMemAlloc (237-240)
  • getNCCLMemAlloc (237-237)
  • isLoaded (242-245)
  • isLoaded (242-242)
  • loadNCCLLibrary (157-207)
  • loadNCCLLibrary (157-157)
  • loadLibraryHandle (209-216)
  • loadLibraryHandle (209-209)
  • getSymbolAddress (218-230)
  • getSymbolAddress (218-218)
  • getNCCLHelper (125-132)
  • getNCCLHelper (125-125)
cpp/tensorrt_llm/kernels/userbuffers/userbuffersManager.cpp (5)
  • initialize (32-43)
  • initialize (32-33)
  • comm (96-99)
  • comm (96-96)
  • world_config (36-36)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (15)
cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.h (7)

17-17: Include nccl.h header added for NCCL integration.

The NCCL header inclusion is appropriate for the new symmetric buffer functionality.


22-26: Platform-specific includes for dynamic library loading.

The conditional compilation for Windows/Unix dynamic library headers is correctly implemented.


38-44: UBBuffer struct extended with ncclWindow_t member.

The addition of the window member and updated constructor are consistent with NCCL windowed communication requirements. The default parameter value nullptr for the window is appropriate.
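
For orientation, a minimal sketch of the shape such a struct could take; the field names, defaults, and the invalid() helper here are assumptions for illustration, not the actual header.

#include <cstddef>
#include <nccl.h> // ncclWindow_t requires a recent NCCL (2.27+)

// Hypothetical sketch of a user-buffer record extended with an NCCL window.
struct UBBuffer
{
    void* addr;          // device pointer backing the user buffer
    int handle;          // slot/handle used by the user-buffer runtime
    size_t size;         // allocation size in bytes
    ncclWindow_t window; // NCCL window for symmetric collectives; nullptr when unused

    explicit UBBuffer(void* a = nullptr, int h = -1, size_t s = 0, ncclWindow_t w = nullptr)
        : addr(a)
        , handle(h)
        , size(s)
        , window(w)
    {
    }

    bool invalid() const
    {
        return addr == nullptr;
    }
};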


61-67: UserBufferAllocator made polymorphic with virtual methods.

Making initialize and the new registerUBBuffer methods virtual enables proper polymorphism for the NCCLUserBufferAllocator subclass. The design follows good OOP practices.


69-69: Static flag for NCCL symmetric mode control.

The use_nccl_symmetric static boolean provides a clean way to toggle between allocator implementations at runtime.


80-113: NCCLHelper class for dynamic library management.

The NCCLHelper class properly encapsulates dynamic loading of NCCL library symbols. The design includes:

  • Function pointer typedefs for required NCCL functions
  • Platform-specific library handle types
  • Proper resource management in destructor
  • Clean interface for symbol access
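
As a rough interface sketch of this design (not the actual header): the function-pointer typedefs below assume NCCL 2.27-style symmetric-memory signatures, and the loading logic itself is elided here (see the platform-helper sketch later in the review).

#include <cstddef>
#include <nccl.h>
#ifdef _WIN32
#include <windows.h>
#else
#include <dlfcn.h>
#endif

// Hypothetical sketch of a helper that loads NCCL at runtime and resolves the
// symbols needed for symmetric user buffers.
class NCCLHelper
{
public:
    // Function-pointer typedefs; signatures are assumptions based on NCCL 2.27-style APIs.
    using ncclMemAllocFn = ncclResult_t (*)(void** ptr, size_t size);
    using ncclCommWindowRegisterFn
        = ncclResult_t (*)(ncclComm_t comm, void* buff, size_t size, ncclWindow_t* win, int winFlags);

    NCCLHelper() = default;

    ~NCCLHelper() // RAII: release the library handle on destruction
    {
#ifdef _WIN32
        if (mHandle) FreeLibrary(mHandle);
#else
        if (mHandle) dlclose(mHandle);
#endif
    }

    void loadNCCLLibrary(); // dlopen/LoadLibrary + symbol lookup; definition elided in this sketch

    bool isLoaded() const { return mLoaded; }
    ncclMemAllocFn getNCCLMemAlloc() const { return mMemAlloc; }
    ncclCommWindowRegisterFn getNCCLCommWindowRegister() const { return mWindowRegister; }

private:
#ifdef _WIN32
    HMODULE mHandle = nullptr; // platform-specific library handle
#else
    void* mHandle = nullptr;
#endif
    bool mLoaded = false;
    ncclMemAllocFn mMemAlloc = nullptr;
    ncclCommWindowRegisterFn mWindowRegister = nullptr;
};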

115-127: NCCLUserBufferAllocator subclass design.

The subclass correctly overrides virtual methods and maintains proper inheritance hierarchy. The static NCCLHelper instance ensures shared library loading across instances.

cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.cpp (8)

17-19: New includes for functionality additions.

The added includes support the new NCCL integration and error handling requirements.


25-34: Conditional allocator instantiation in Instance() method.

The singleton pattern correctly returns either the base UserBufferAllocator or NCCLUserBufferAllocator based on the static flag. This allows runtime switching between allocation strategies.


37-46: UserBufferAllocator::initialize implementation with camelCase members.

The method correctly initializes the base allocator using the existing user buffer infrastructure. The camelCase naming is consistent with the class refactoring.


101-119: NCCLUserBufferAllocator::registerUBBuffer implementation.

The method properly:

  • Checks initialization status
  • Loads required NCCL symbols dynamically
  • Allocates memory using ncclMemAlloc
  • Registers the memory window for symmetric collective operations
  • Returns a properly constructed UBBuffer

The error handling for library loading and symbol resolution is appropriate.
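
A hedged sketch of the allocate-then-register flow described above. The PR resolves these symbols through NCCLHelper at runtime; here they are called directly, and the exact signatures and the NCCL_WIN_COLL_SYMMETRIC flag are assumptions based on NCCL 2.27-style APIs.

#include <cstddef>
#include <stdexcept>
#include <nccl.h> // assumes an NCCL version that declares ncclMemAlloc / ncclCommWindowRegister

// Hypothetical sketch of registering a symmetric user buffer.
struct SymmetricBuffer
{
    void* addr = nullptr;
    size_t size = 0;
    ncclWindow_t window = nullptr;
};

SymmetricBuffer registerSymmetricBuffer(ncclComm_t comm, size_t bytes)
{
    SymmetricBuffer buf;
    buf.size = bytes;

    // Allocate through NCCL so the memory is eligible for window registration.
    if (ncclMemAlloc(&buf.addr, bytes) != ncclSuccess)
    {
        throw std::runtime_error("ncclMemAlloc failed");
    }

    // Register the allocation as a window usable by symmetric collectives.
    if (ncclCommWindowRegister(comm, buf.addr, bytes, &buf.window, NCCL_WIN_COLL_SYMMETRIC) != ncclSuccess)
    {
        throw std::runtime_error("ncclCommWindowRegister failed");
    }
    return buf;
}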


135-155: NCCLHelper constructor and destructor.

The RAII pattern is correctly implemented with proper resource cleanup in the destructor. Platform-specific library handle cleanup is handled appropriately.


157-207: NCCLHelper::loadNCCLLibrary implementation.

The dynamic library loading implementation is robust:

  • Tries appropriate library names for each platform
  • Proper error handling with try-catch
  • Loads required symbols with appropriate error messages
  • Sets loaded state only when all required symbols are available
  • Graceful degradation with warnings when symbols are missing

209-230: Platform-specific library loading helper methods.

The helper methods properly abstract platform differences for dynamic library operations. The implementations are standard and appropriate for each platform.
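
For illustration, a POSIX-only sketch of what such helpers typically look like (the Windows branch would use LoadLibraryA/GetProcAddress/FreeLibrary); the error handling shown is an assumption, not copied from the PR.

#include <stdexcept>
#include <string>
#include <dlfcn.h>

// Hypothetical platform helper: open a shared library by name.
void* loadLibraryHandle(std::string const& name)
{
    void* handle = dlopen(name.c_str(), RTLD_NOW | RTLD_GLOBAL);
    if (handle == nullptr)
    {
        throw std::runtime_error("Failed to load " + name + ": " + dlerror());
    }
    return handle;
}

// Hypothetical platform helper: resolve a symbol, surfacing dlerror() on failure.
void* getSymbolAddress(void* handle, std::string const& symbol)
{
    dlerror(); // clear any stale error state
    void* addr = dlsym(handle, symbol.c_str());
    if (char const* err = dlerror())
    {
        throw std::runtime_error("Failed to resolve " + symbol + ": " + err);
    }
    return addr;
}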


86-98: Verify NCCL communicator group usage across all ranks

It looks like NCCLUserBufferAllocator::initialize always builds a full-world communicator by inserting ranks 0…worldConfig.getSize()-1 and calling getComm(group). No other code calls getComm with a subset of ranks. Before proceeding:

• Confirm that you really want a single NCCL communicator spanning every rank in worldConfig (across nodes and GPUs).
• If your deployment uses separate parallelism groups (data/tensor/pipeline or per-node only), you’ll need to pass the appropriate subset of ranks into getComm rather than the full range (see the sketch after the file list below).

Files to review:

  • cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.cpp → NCCLUserBufferAllocator::initialize
  • cpp/tensorrt_llm/common/opUtils.cpp/h → implementation of getComm(std::set<int> const&)
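
To make the two options concrete, a hedged sketch; the getComm signature is taken from the file list above, while the shared_ptr return type and the grouping math are assumptions.

#include <memory>
#include <set>
#include <nccl.h>

// Assumed declaration, based on cpp/tensorrt_llm/common/opUtils.h.
std::shared_ptr<ncclComm_t> getComm(std::set<int> const& group);

// Option 1: one communicator spanning every rank in the world (what initialize appears to do today).
std::shared_ptr<ncclComm_t> makeWorldComm(int worldSize)
{
    std::set<int> group;
    for (int rank = 0; rank < worldSize; ++rank)
    {
        group.insert(rank);
    }
    return getComm(group);
}

// Option 2: restrict the communicator to one tensor-parallel group,
// e.g. the tpSize consecutive ranks starting at tpGroupStart.
std::shared_ptr<ncclComm_t> makeTpComm(int tpGroupStart, int tpSize)
{
    std::set<int> group;
    for (int rank = tpGroupStart; rank < tpGroupStart + tpSize; ++rank)
    {
        group.insert(rank);
    }
    return getComm(group);
}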


@Tabrizian Tabrizian changed the title from "Add UB NCCL integration" to "Add NCCL Symmetric Integration for All Reduce" on Jul 18, 2025
@coderabbitai coderabbitai bot (Contributor) left a comment


Actionable comments posted: 5

🧹 Nitpick comments (4)
tensorrt_llm/_torch/distributed/ops.py (1)

1-1: Logger imported but never used

logging is imported and a module‐level logger is created, yet no subsequent call site references it.
If no log messages are planned, drop the import/variable; otherwise add real logging to avoid dead code.

Also applies to: 18-18

cpp/tensorrt_llm/pybind/runtime/bindings.cpp (1)

451-452: Enum exposure looks good, consider docstring / ordering touch-ups

The two new .value() entries correctly expose TWOSHOT and NCCL_SYMMETRIC to Python, keeping the binding in sync with the C++ enum.
Minor nits you may want to address while you’re here:

  1. The list is no longer roughly alphabetical nor grouped by concept (e.g., all NCCL* strategies together). Re-ordering would improve skim-ability.
  2. Public-facing additions normally get a brief note in the surrounding file-level comment or the project’s Python docs so users discover the new options. Consider adding one.

No functional issues spotted.

tensorrt_llm/_torch/compilation/patterns/ub_allreduce.py (1)

187-241: Well-implemented pattern but clarify the commented-out registration.

The register_no_quant_pattern function is well-implemented and follows the established pattern structure. However, the registration is commented out on line 241, making this pattern inactive.

Technical Implementation:

  • Pattern matching logic correctly handles allreduce without quantization
  • Target function properly passes None for scale-related arguments (lines 213-214)
  • Dtype validation follows the same constraints as other patterns (FP16/BF16 only)
  • Function structure is consistent with existing quantization patterns

Question: Is there a specific reason why register_no_quant_pattern(custom_pass) is commented out? If this pattern is intended for future use, consider adding a TODO comment explaining when it should be enabled. If it's ready for use, consider uncommenting it or documenting any prerequisites.

-        # register_no_quant_pattern(custom_pass)
+        # TODO: Enable when no-quantization UB allreduce support is ready
+        # register_no_quant_pattern(custom_pass)
cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.cpp (1)

24-36: Consider caching the environment variable check and using descriptive variable names.

The singleton pattern is correct, but repeatedly calling getenv could impact performance. Also, the underscore variable names are not descriptive.

 UserBufferAllocator& UserBufferAllocator::Instance()
 {
-    // if environment variable TLLM_USE_NCCL_UB is set to 1, use NCCLUserBufferAllocator
-    char* useNCCLUB = std::getenv("TLLM_USE_NCCL_UB");
-    if (useNCCLUB != nullptr)
+    static bool useNCCL = std::getenv("TLLM_USE_NCCL_UB") != nullptr;
+    if (useNCCL)
     {
-        static NCCLUserBufferAllocator _;
-        return _;
+        static NCCLUserBufferAllocator ncclAllocator;
+        return ncclAllocator;
     }
     else
     {
-        static UserBufferAllocator _;
-        return _;
+        static UserBufferAllocator defaultAllocator;
+        return defaultAllocator;
     }
 }
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fc8b29c and 5455c72.

📒 Files selected for processing (13)
  • cpp/tensorrt_llm/kernels/customAllReduceKernels.h (1 hunks)
  • cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.h (2 hunks)
  • cpp/tensorrt_llm/kernels/userbuffers/ub_interface.cpp (1 hunks)
  • cpp/tensorrt_llm/pybind/runtime/bindings.cpp (1 hunks)
  • cpp/tensorrt_llm/thop/allreduceOp.cpp (6 hunks)
  • tensorrt_llm/_torch/compilation/backend.py (1 hunks)
  • tensorrt_llm/_torch/compilation/patterns/ub_allreduce.py (1 hunks)
  • tensorrt_llm/_torch/distributed/ops.py (2 hunks)
  • tensorrt_llm/_torch/model_config.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py (1 hunks)
  • tensorrt_llm/functional.py (1 hunks)
  • tensorrt_llm/llmapi/llm_args.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (5)
cpp/tensorrt_llm/kernels/userbuffers/ub_interface.cpp (1)
cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.cpp (2)
  • Instance (22-36)
  • Instance (22-22)
tensorrt_llm/_torch/model_config.py (1)
tensorrt_llm/functional.py (1)
  • AllReduceStrategy (3768-3777)
tensorrt_llm/_torch/compilation/patterns/ub_allreduce.py (5)
cpp/tensorrt_llm/thop/allreduceOp.cpp (18)
  • allreduce (1009-1031)
  • allreduce (1009-1013)
  • strategy (691-722)
  • strategy (691-691)
  • input (159-191)
  • input (159-162)
  • input (207-285)
  • input (207-209)
  • input (287-306)
  • input (287-289)
  • input (308-339)
  • input (308-310)
  • input (341-406)
  • input (341-343)
  • input (408-590)
  • input (408-412)
  • input (592-658)
  • input (592-595)
tensorrt_llm/functional.py (5)
  • allreduce (3938-4029)
  • Tensor (105-550)
  • AllReduceStrategy (3768-3777)
  • dtype (248-251)
  • dtype (254-258)
tensorrt_llm/mapping.py (1)
  • tp_group (337-338)
tensorrt_llm/_torch/distributed/ops.py (1)
  • userbuffers_allreduce_finalize (104-109)
tensorrt_llm/_torch/compilation/patterns/ar_residual_norm.py (1)
  • extra_check (72-87)
cpp/tensorrt_llm/thop/allreduceOp.cpp (2)
tests/unittest/_torch/multi_gpu/test_user_buffers.py (1)
  • create_userbuffers_tensor (41-46)
cpp/tensorrt_llm/common/opUtils.cpp (2)
  • getDtypeMap (33-47)
  • getDtypeMap (33-33)
cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.h (1)
cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.cpp (18)
  • initialize (38-48)
  • initialize (38-38)
  • initialize (87-100)
  • initialize (87-87)
  • isInitialized (50-53)
  • isInitialized (50-50)
  • allocate (64-71)
  • allocate (64-64)
  • deallocate (73-73)
  • deallocate (73-73)
  • get (75-79)
  • get (75-75)
  • comm (81-85)
  • comm (81-81)
  • registerUBBuffer (55-62)
  • registerUBBuffer (55-55)
  • registerUBBuffer (102-111)
  • registerUBBuffer (102-102)
🔇 Additional comments (17)
tensorrt_llm/functional.py (1)

3884-3886: Confirm all downstream mappings handle the new NCCL_SYMMETRIC (value 8).

Adding the enum member is straightforward, but any switch/if-else ladders, array-indexed look-ups, or serialization schemas that assume a contiguous [0..7] range will silently miss the new value. Please double-check:

  1. Kernel dispatch tables in C++ (customAllReduceKernels.h, allreduceOp.cpp).
  2. Python <-> C++ marshalling layers (bindings.cpp) for exhaustive if/elif chains.
  3. Config validators that still call max(AllReduceStrategy) or hard-code 7.

A quick grep or unit test asserting round-trip (enum → int → enum) will catch any omissions.

cpp/tensorrt_llm/kernels/customAllReduceKernels.h (1)

59-61: Confirm Python/C++ enum parity after adding MNNVL & NCCL_SYMMETRIC

AllReduceStrategyType is documented as must stay in sync with tensorrt_llm.functional.AllReduceStrategy.
Double-check functional.py, pybind bindings and any switch/serialization tables for the two new values to avoid UB at runtime.
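
To illustrate the parity requirement, a hedged sketch of the C++ side; NCCL_SYMMETRIC = 8 matches the value reported elsewhere in this review, while the other entries and the underlying type are assumptions.

#include <cstdint>

// Illustrative only: must stay in sync with tensorrt_llm.functional.AllReduceStrategy.
enum class AllReduceStrategyType : int8_t
{
    NCCL = 0,
    // ... existing strategies elided in this sketch ...
    MNNVL = 7,          // assumed value
    NCCL_SYMMETRIC = 8, // matches the Python-side value noted in this review
};

// A cheap compile-time guard against accidental renumbering on the C++ side.
static_assert(static_cast<int>(AllReduceStrategyType::NCCL_SYMMETRIC) == 8,
    "AllReduceStrategyType::NCCL_SYMMETRIC must match AllReduceStrategy.NCCL_SYMMETRIC in functional.py");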

cpp/tensorrt_llm/kernels/userbuffers/ub_interface.cpp (1)

39-40: LGTM – API rename propagated correctly

isInitialized() matches the allocator refactor; no functional issues spotted here.

tensorrt_llm/llmapi/llm_args.py (1)

1879-1883: NCCL_SYMMETRIC strategy fully integrated and verified

All references to the new NCCL_SYMMETRIC option are present across Python and C++ layers, confirming the implementation is complete:

  • tensorrt_llm/functional.py: Added NCCL_SYMMETRIC = 8 in AllReduceStrategy enum
  • tensorrt_llm/llmapi/llm_args.py: Included 'NCCL_SYMMETRIC' in the Literal for allreduce_strategy
  • tensorrt_llm/_torch/model_config.py: Maps "NCCL_SYMMETRIC" to AllReduceStrategy.NCCL_SYMMETRIC
  • tensorrt_llm/_torch/pyexecutor/model_engine.py: Handles "NCCL_SYMMETRIC" branch in executor
  • cpp/tensorrt_llm/kernels/customAllReduceKernels.h: Enum AllReduceStrategyType::NCCL_SYMMETRIC = 8
  • cpp/tensorrt_llm/thop/allreduceOp.cpp: Implements runNCCLAllReduceSymmetric case
  • cpp/tensorrt_llm/pybind/runtime/bindings.cpp: Exposes "NCCL_SYMMETRIC" in Python bindings

No further changes required—ready to merge.

cpp/tensorrt_llm/thop/allreduceOp.cpp (5)

166-167: Good addition for debugging message sizes.

The debug log provides useful information about the allreduce message size, which can help diagnose performance issues.


180-181: Correct dispatch for the new NCCL_SYMMETRIC strategy.

The case statement properly routes to the new symmetric allreduce implementation.
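
For orientation, a simplified, self-contained sketch of how the new case slots into the strategy dispatch; the per-strategy functions here are stubs, and only runNCCLAllReduceSymmetric is a name taken from this PR.

#include <cstdio>

// Hypothetical, trimmed-down stand-in for the operator's strategy enum.
enum class AllReduceStrategyType { NCCL, UB, NCCL_SYMMETRIC };

// Stub kernels standing in for the real implementations.
void runNCCLAllReduce() { std::puts("plain NCCL allreduce"); }
void runUBAllReduce() { std::puts("user-buffer allreduce"); }
void runNCCLAllReduceSymmetric() { std::puts("NCCL symmetric (window-registered) allreduce"); }

// Dispatch shape analogous to the case statement this comment refers to.
void runAllReduce(AllReduceStrategyType strategy)
{
    switch (strategy)
    {
    case AllReduceStrategyType::NCCL_SYMMETRIC: runNCCLAllReduceSymmetric(); break;
    case AllReduceStrategyType::UB: runUBAllReduce(); break;
    case AllReduceStrategyType::NCCL:
    default: runNCCLAllReduce(); break;
    }
}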


671-674: Correct handling of NCCL_SYMMETRIC strategy selection.

The runtime strategy getter properly returns the NCCL_SYMMETRIC strategy when configured.


700-704: Good addition of debug logging for NCCL_SYMMETRIC strategy.

The logging is consistent with other strategies and helps with debugging.


720-720: Improved debug logging for unknown strategies.

Including the strategy value helps identify unexpected strategy types during debugging.

cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.h (3)

31-37: Good extension of UBBuffer to support NCCL windows.

The addition of the window member and updated constructor maintains backward compatibility while enabling NCCL window support.


54-69: Good refactoring for naming consistency and extensibility.

The changes improve code consistency with camelCase naming and the mPrefix convention for member variables. Making registerUBBuffer virtual and moving members to protected scope enables proper subclass implementation.


71-79: Well-designed NCCLUserBufferAllocator class.

The new allocator properly extends the base class and uses shared_ptr for safe NCCL communicator management.
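
A compressed sketch of the inheritance and ownership relationship being described; the method signatures are simplified assumptions (the real initialize takes a world/TP configuration and returns richer buffer records), and the bodies are placeholders.

#include <cstddef>
#include <memory>
#include <set>
#include <nccl.h>

// Hypothetical reduction of the allocator hierarchy to its structural essentials.
class UserBufferAllocator
{
public:
    virtual ~UserBufferAllocator() = default;

    // Classic user-buffer path; virtual so the NCCL-symmetric variant can override it.
    virtual void initialize(std::set<int> const& /*ranks*/) {}
    virtual void* registerUBBuffer(size_t /*bytes*/) { return nullptr; }
};

class NCCLUserBufferAllocator : public UserBufferAllocator
{
public:
    void initialize(std::set<int> const& /*ranks*/) override
    {
        // Shared ownership keeps the communicator alive as long as the allocator is;
        // a getComm(ranks)-style call would populate mComm in the real code.
    }

    void* registerUBBuffer(size_t /*bytes*/) override
    {
        // ncclMemAlloc + ncclCommWindowRegister, as sketched earlier in this review.
        return nullptr;
    }

private:
    std::shared_ptr<ncclComm_t> mComm; // safe NCCL communicator management via shared_ptr
};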

cpp/tensorrt_llm/kernels/userbuffers/ub_allocator.cpp (5)

17-18: Necessary includes for NCCLUserBufferAllocator implementation.

The includes are required for getComm function and std::set usage.


38-52: Good refactoring for naming consistency.

The member variable renaming and camelCase parameter names improve code consistency.


55-62: Consistent method naming refactoring.

The camelCase method name and updated member variables maintain consistency.


66-71: Consistent refactoring in allocate method.

The method properly uses the updated naming conventions.


76-85: Consistent member variable usage.

The methods properly use the updated member variable names.

@Tabrizian Tabrizian changed the title from "Add NCCL Symmetric Integration for All Reduce" to "[1/N] Add NCCL Symmetric Integration for All Reduce" on Jul 18, 2025
@Tabrizian Tabrizian force-pushed the user/imant/ncclsymmetric branch 2 times, most recently from 34a1010 to 07d53a9 on July 18, 2025 23:41
@Tabrizian Tabrizian marked this pull request as ready for review July 18, 2025 23:41
@Tabrizian Tabrizian requested review from a team as code owners July 18, 2025 23:41
@Tabrizian (Member Author)

/bot run

@Tabrizian Tabrizian force-pushed the user/imant/ncclsymmetric branch from 07d53a9 to 9661360 on July 18, 2025 23:47
@tensorrt-cicd (Collaborator)

PR_Github #12354 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #12354 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #9176 completed with status: 'FAILURE'

@Tabrizian Tabrizian force-pushed the user/imant/ncclsymmetric branch 2 times, most recently from 3d447b9 to d1e3b46 on July 22, 2025 20:44
@Tabrizian (Member Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #12606 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #12606 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9381 completed with status: 'FAILURE'

@Superjomn Superjomn (Collaborator) left a comment


LGTM from the llmapi side.

@Tabrizian Tabrizian force-pushed the user/imant/ncclsymmetric branch from d1e3b46 to 1323445 on July 31, 2025 23:14
@Tabrizian Tabrizian requested a review from liji-nv July 31, 2025 23:18
@Tabrizian Tabrizian force-pushed the user/imant/ncclsymmetric branch 2 times, most recently from 993994d to 4340e8e on July 31, 2025 23:24
@Tabrizian (Member Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #14231 Bot args parsing error: usage: /bot [-h]
{run,kill,skip,submit,reviewers,reuse-pipeline,reuse-review} ...
/bot: error: unrecognized arguments: --multi-gpu-only-test

@Tabrizian (Member Author)

/bot run --multi-gpu-test-only

@tensorrt-cicd (Collaborator)

PR_Github #14232 Bot args parsing error: usage: /bot [-h]
{run,kill,skip,submit,reviewers,reuse-pipeline,reuse-review} ...
/bot: error: unrecognized arguments: --multi-gpu-test-only

@Tabrizian (Member Author)

/bot run --only-multi-gpu-test

@tensorrt-cicd (Collaborator)

PR_Github #14234 [ run ] triggered by Bot

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>

Add log message

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>

Enable NCCL symmetric for non-torch compile path

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>

Fix build

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>

Address review comments

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
@Tabrizian Tabrizian force-pushed the user/imant/ncclsymmetric branch from d623585 to 4838b90 on August 6, 2025 06:27
@Tabrizian (Member Author)

/bot run --only-multi-gpu-test

@tensorrt-cicd (Collaborator)

PR_Github #14234 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #10749 (Partly Tested) completed with status: 'FAILURE'

@tensorrt-cicd (Collaborator)

PR_Github #14254 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #14254 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #10761 (Partly Tested) completed with status: 'SUCCESS'

@Tabrizian (Member Author)

/bot reuse-pipeline

@tensorrt-cicd (Collaborator)

PR_Github #14319 [ reuse-pipeline ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #14319 [ reuse-pipeline ] completed with state SUCCESS
Reusing PR_Github #14254 (Partly Tested) for commit 4838b90

@litaotju litaotju (Collaborator) commented Aug 7, 2025

What's the plan for the "UserBuffer" kernels? Is NCCL symmetric designed to replace them?

@Tabrizian (Member Author)

@litaotju I think once we have the fused version of the NCCL symmetric APIs, we can start replacing/removing the user buffer kernels (assuming the performance is similar).

@Tabrizian Tabrizian merged commit 8227616 into NVIDIA:main Aug 8, 2025
4 checks passed
Shunkangz pushed a commit to hcyezhang/TensorRT-LLM that referenced this pull request Aug 8, 2025
Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>