[None][fix] Complete the last missing allreduce op in Llama3/4. #6850
Conversation
📝 Walkthrough

Disable POST_MLP_FUSION and POST_MOE_FUSION for the final decoder layer; change post-fusion gating to consult the next layer's layernorm and next attention/quantization for scale and fusion_op selection; adjust all-reduce/moe_allreduce paths and unpacking, and preserve PRE_MLP_FUSION intent.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Input as hidden_states
    participant Layer as LlamaDecoderLayer
    participant NextLN as next_layer_layernorm
    participant NextAttn as next_attn
    participant Quant as Quant (nvfp4/FP8)
    participant Fusion as AllReduce/MOE_AllReduce
    Input->>Layer: forward(hidden_states)
    Layer->>Layer: determine fusion path (PRE/POST/POST_MOE)
    alt POST path and no NextLN (final layer)
        Layer->>Fusion: all_reduce(fusion_op=None)
    else POST path with NextLN
        Layer->>NextAttn: check presence
        alt NextAttn present and Quant active
            NextAttn->>Layer: provide scale (qkv_proj.input_scale)
            Layer->>Fusion: all_reduce(fusion_op=..., scale=provided)
        else
            Layer->>Fusion: all_reduce(fusion_op=RESIDUAL_RMS_NORM, scale=None)
        end
    end
    Fusion-->>Layer: unpack (Fp4QuantizedTensor or (hidden, residual))
    Layer-->>Input: output
```
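To make the gating in the diagram concrete, here is a minimal, self-contained Python sketch of the decision it describes. The enum members, the `choose_post_fusion` helper, and the `quant_is_nvfp4` flag are illustrative stand-ins under stated assumptions, not the actual TensorRT-LLM API.

```python
# Illustrative sketch only: simplified stand-ins for the fusion gating logic,
# not the real tensorrt_llm enums or layer objects.
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class FusionOp(Enum):
    NONE = auto()                           # plain all-reduce, no fused epilogue
    RESIDUAL_RMS_NORM = auto()              # fused residual-add + RMSNorm
    RESIDUAL_RMS_NORM_QUANT_NVFP4 = auto()  # fused norm + NVFP4 quantization


@dataclass
class FusionDecision:
    fusion_op: FusionOp
    scale: Optional[float]  # placeholder for the real next_attn.qkv_proj.input_scale


def choose_post_fusion(next_layer_layernorm, next_attn, quant_is_nvfp4: bool) -> FusionDecision:
    """Pick the fusion_op/scale pair for the post-MLP/post-MOE all-reduce."""
    if next_layer_layernorm is None:
        # Final decoder layer: run a pure all-reduce with fusion disabled.
        return FusionDecision(FusionOp.NONE, None)
    if next_attn is not None and quant_is_nvfp4:
        # A quantized next attention exists: fuse norm + quant and pass its input scale.
        return FusionDecision(FusionOp.RESIDUAL_RMS_NORM_QUANT_NVFP4,
                              getattr(next_attn, "input_scale", 1.0))
    # Next layer exists but no quantized attention: fused residual RMSNorm, no scale.
    return FusionDecision(FusionOp.RESIDUAL_RMS_NORM, None)


if __name__ == "__main__":
    print(choose_post_fusion(None, None, False))         # last layer -> plain all-reduce
    print(choose_post_fusion(object(), None, False))     # next layer, unquantized
    print(choose_post_fusion(object(), object(), True))  # next quantized attention
```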
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
/bot run --disable-fail-fast --add-multi-gpu-test
Actionable comments posted: 0
🧹 Nitpick comments (2)
tensorrt_llm/_torch/models/modeling_llama.py (2)
710-714: Consider removing or guarding debug prints for production
The debug prints at lines 711-714 and 765-768 are useful for debugging but should be either removed or properly guarded for production use. Consider using a debug flag or the logging framework instead of direct prints.

```diff
-        if self.mapping.tp_rank == 0:
-            print(
-                f"{self.layer_idx} pre_mlp_fusion_op: {self.pre_mlp_fusion_op}"
-            )
+        if self.mapping.tp_rank == 0 and logger.isEnabledFor(logging.DEBUG):
+            logger.debug(
+                f"{self.layer_idx} pre_mlp_fusion_op: {self.pre_mlp_fusion_op}"
+            )
```
765-768: Remove or properly guard debug print statements
Similar to the earlier comment, these debug prints should be properly handled for production code.

```diff
-        if self.mapping.tp_rank == 0:
-            print(
-                f"{self.layer_idx} post_mlp_fusion_op: {self.post_mlp_fusion_op}"
-            )
+        if self.mapping.tp_rank == 0 and logger.isEnabledFor(logging.DEBUG):
+            logger.debug(
+                f"{self.layer_idx} post_mlp_fusion_op: {self.post_mlp_fusion_op}"
+            )
```
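As a side note, the replacement suggested in both diffs assumes a module-level `logger` is available. A minimal standard-library setup is sketched below; the real `modeling_llama.py` may already ship an equivalent TensorRT-LLM logger, so treat this as an assumption.

```python
# Minimal stdlib logger setup assumed by the diffs above (illustrative only;
# the module may already define its own logger).
import logging

logger = logging.getLogger(__name__)

# Example of the guarded debug pattern from the suggestion, with placeholder values.
layer_idx, post_mlp_fusion_op = 0, "RESIDUAL_RMS_NORM"
if logger.isEnabledFor(logging.DEBUG):
    logger.debug(f"{layer_idx} post_mlp_fusion_op: {post_mlp_fusion_op}")
```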
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tensorrt_llm/_torch/models/modeling_llama.py (6 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/models/modeling_llama.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_torch/models/modeling_llama.py
🔇 Additional comments (6)
tensorrt_llm/_torch/models/modeling_llama.py (6)
443-443: Consistent fusion gating for MOE layers
Good consistency - the POST_MOE_FUSION also excludes the last layer, matching the MLP fusion behavior. This ensures proper allreduce operation for MOE-based models as well.

554-562: Improved fusion decision logic using next_layer_layernorm
The change from checking `next_attn` to `next_layer_layernorm` for determining fusion context is more robust. The logic correctly handles the last decoder layer by setting the fusion operation to `RESIDUAL_RMS_NORM` without scale when there's no next layer.

661-662: Consistent implementation in LlamaDecoderLayer
The POST_MLP_FUSION gating is consistently applied in the regular LlamaDecoderLayer class, maintaining consistency across decoder implementations.

757-763: Consistent fusion context detection
The change to use `next_layer_layernorm` instead of direct next-layer checks provides a cleaner and more consistent approach to determining fusion context. The fallback to `RESIDUAL_RMS_NORM` for the last layer is appropriate.

778-778: Improved condition for NVFP4 handling
The condition now correctly checks for `next_attn` existence before checking for NVFP4 quantization, preventing potential attribute errors when `next_attn` is None on the last layer.
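A tiny sketch of why the guard order matters; every name below is a placeholder, not the real module state:

```python
# On the last decoder layer next_attn is None, so its attributes must not be
# read before the existence check; `and` short-circuits the access.
next_attn = None        # what the last decoder layer sees
nvfp4_enabled = True    # placeholder for "NVFP4 quantization is active"

# Unsafe ordering would raise AttributeError here:
# scale = next_attn.qkv_proj.input_scale if nvfp4_enabled else None

# Safe ordering: check existence first, then the quantization mode.
scale = (next_attn.qkv_proj.input_scale
         if next_attn is not None and nvfp4_enabled else None)
print(scale)  # -> None on the last layer
```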
427-427: Please confirm performance and behavior impact of disabling post-MLP fusion on the final decoder layer
We searched the repo and found:

- No test files referencing `POST_MLP_FUSION` or `POST_MOE_FUSION` under `tests/`
- No performance benchmarks for Llama models under `benchmark/`
- Fusion/allreduce briefly mentioned across various `README.md` files, but without details on expected behavior

Without existing tests or benchmarks, please:
- Add or update a unit/integration test to cover the fusion toggle on the last layer (see the sketch after this list)
- Run your Llama performance benchmarks to detect any regressions
- Document the change in behavior (e.g., update README or design docs)
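Picking up the first bullet above, here is a hedged, pytest-style sketch of such a test. It is written against the hypothetical `choose_post_fusion`/`FusionOp` stand-ins from the earlier sketch (assumed importable from a `fusion_gating` module), not against the real `modeling_llama` internals.

```python
# Pytest-style sketch only: `fusion_gating` is a hypothetical module holding
# the stand-ins from the earlier sketch, not an existing TensorRT-LLM target.
from fusion_gating import FusionOp, choose_post_fusion


def test_last_layer_falls_back_to_plain_allreduce():
    # No next_layer_layernorm means final decoder layer: fusion must be off.
    decision = choose_post_fusion(next_layer_layernorm=None, next_attn=None,
                                  quant_is_nvfp4=False)
    assert decision.fusion_op is FusionOp.NONE
    assert decision.scale is None


def test_intermediate_layer_keeps_fused_residual_rmsnorm():
    # A next layer exists but its attention is not quantized.
    decision = choose_post_fusion(next_layer_layernorm=object(), next_attn=None,
                                  quant_is_nvfp4=False)
    assert decision.fusion_op is FusionOp.RESIDUAL_RMS_NORM
    assert decision.scale is None
```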
PR_Github #15054 [ run ] triggered by Bot
f9c3b18 to 8103f90
/bot run --disable-fail-fast --add-multi-gpu-test
PR_Github #15058 [ run ] triggered by Bot
PR_Github #15054 [ run ] completed with state
Actionable comments posted: 0
🧹 Nitpick comments (1)
tensorrt_llm/_torch/models/modeling_llama.py (1)
710-710: Remove unnecessary blank line.
There's an extra blank line that can be removed for consistency with the surrounding code style.
```diff
-
         all_reduce_output = self.all_reduce(
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tensorrt_llm/_torch/models/modeling_llama.py (6 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (7)
tensorrt_llm/_torch/models/modeling_llama.py (7)
427-427: LGTM! Correctly disables POST_MLP_FUSION for the last decoder layer.
This change properly prevents post-MLP fusion on the final decoder layer of Llama4, which ensures the missing allreduce operation is properly executed.

443-443: LGTM! Correctly disables POST_MOE_FUSION for the last decoder layer.
This change properly prevents post-MOE fusion on the final decoder layer of Llama4 MOE models, ensuring the missing allreduce operation is properly executed.

552-562: Logical improvement: Next-layer context detection via layernorm.
The change from `next_attn`-based gating to a `next_layer_layernorm` presence check is a more robust way to detect the existence of a next layer. The logic correctly handles the last-layer case by setting the fusion op to `RESIDUAL_RMS_NORM` when no scale is needed.

591-596: Proper unpacking of allreduce output based on quantization mode.
The conditional unpacking correctly handles different quantization scenarios - FP4 quantization with a next attention layer vs. other cases. This ensures the correct data types are propagated through the model.
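For reference, a simplified sketch of the two unpacking shapes being discussed; `Fp4QuantizedTensor` mirrors the name used in the review, but the field names and the three-element fused-FP4 output are assumptions for illustration, not the exact TensorRT-LLM return values.

```python
# Simplified sketch of the conditional unpacking; shapes and field names are
# assumptions for illustration only.
from dataclasses import dataclass
from typing import Any


@dataclass
class Fp4QuantizedTensor:
    fp4_tensor: Any
    scaling_factor: Any


def unpack_allreduce_output(output, fused_fp4_active: bool):
    """Normalize both all-reduce return shapes to (hidden_states, residual)."""
    if fused_fp4_active:
        # Fused norm + NVFP4 quant path: quantized activation, its scale, residual.
        act_fp4, act_sf, residual = output
        hidden_states = Fp4QuantizedTensor(act_fp4, act_sf)
    else:
        # Plain fused-norm (or unfused) path: just hidden_states and residual.
        hidden_states, residual = output
    return hidden_states, residual


print(unpack_allreduce_output(("h", "r"), fused_fp4_active=False))
print(unpack_allreduce_output(("q", "sf", "r"), fused_fp4_active=True))
```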
661-662: LGTM! Correctly disables POST_MLP_FUSION for the last decoder layer in LlamaModel.
This change ensures consistency between Llama3 and Llama4 models by preventing post-MLP fusion on the final decoder layer, fixing the missing allreduce operation issue.

753-760: Logical improvement: Next-layer context detection for standard Llama models.
The change properly handles the last-layer case by checking for `next_layer_layernorm` presence and adjusting the fusion operation and scale accordingly. This ensures consistent behavior across both Llama3 and Llama4 architectures.

770-774: Proper conditional unpacking based on quantization state.
The code correctly unpacks the allreduce output based on whether the next attention layer exists and FP4 quantization is enabled, maintaining consistency with the Llama4 implementation.
PR_Github #15058 [ run ] completed with state
/bot run --disable-fail-fast --add-multi-gpu-test
PR_Github #15088 [ run ] triggered by Bot
PR_Github #15088 [ run ] completed with state
/bot run --disable-fail-fast --add-multi-gpu-test
PR_Github #15195 [ run ] triggered by Bot
PR_Github #15195 [ run ] completed with state
8103f90 to 8f1c511
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
8f1c511 to d32194e
/bot run --disable-fail-fast --add-multi-gpu-test
Actionable comments posted: 0
🧹 Nitpick comments (4)
tensorrt_llm/_torch/models/modeling_llama.py (4)
573-573: Consider making MIN_LATENCY_MODE configurable
The comment indicates that MIN_LATENCY_MODE is hardcoded to False. Consider making this configurable through environment variables or configuration parameters to allow flexibility in deployment scenarios.
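One possible way to do that is an environment-variable override, sketched below; `TRTLLM_MIN_LATENCY_MODE` is a hypothetical name, not an existing TensorRT-LLM setting.

```python
# Hypothetical env-var override for the hardcoded flag; the variable name is
# an assumption for illustration only.
import os


def _env_flag(name: str, default: bool = False) -> bool:
    """Interpret '1'/'true'/'yes' (case-insensitive) as True."""
    return os.environ.get(name, str(int(default))).strip().lower() in ("1", "true", "yes")


MIN_LATENCY_MODE = _env_flag("TRTLLM_MIN_LATENCY_MODE", default=False)
```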
719-719: Remove extra blank line
There's an unnecessary blank line that should be removed for consistency.
Apply this diff:
```diff
-
         all_reduce_output = self.all_reduce(
```
762-762: Remove trailing whitespace
There's trailing whitespace on this line that should be removed.
Apply this diff:
```diff
-                hidden_states, residual) 
+                hidden_states, residual)
```
85-86: Fix line continuation formatting
The line continuation is broken across two lines in an awkward way. Consider reformatting for better readability.
Apply this diff:
```diff
-            rope_fusion=not self.
-            use_qk_norm,  # Llama4 uses qk_norm after RoPE, so it is not possible to fuse RoPE into the attention OP with qk_norm.
+            rope_fusion=not self.use_qk_norm,  # Llama4 uses qk_norm after RoPE, so it is not possible to fuse RoPE into the attention OP with qk_norm.
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tensorrt_llm/_torch/models/modeling_llama.py (3 hunks)
🔇 Additional comments (5)
tensorrt_llm/_torch/models/modeling_llama.py (5)
552-562: Ensure correct handling of missing allreduce op on the last decoder layer for Llama4
The implementation correctly addresses the missing allreduce operation on the last decoder layer. When `next_layer_layernorm` is None (indicating the last layer), it performs a pure allreduce with `fusion_op=None` to disable fusion. This is the appropriate fix for the issue.

564-571: Good implementation of scale adjustment for the last layer
The logic correctly handles the case where `next_layer_layernorm` exists but could be the last decoder layer. It appropriately adjusts the scale based on the presence of `next_attn` and quantization settings, ensuring proper handling of edge cases.

574-606: Correct unpacking logic for different quantization scenarios
The unpacking logic correctly handles both the min-latency MOE path and the standard allreduce path, with appropriate handling for NVFP4 quantization. The implementation properly differentiates between scenarios based on the presence of `next_attn` and quantization modes.

763-771: Good implementation of POST_MLP_FUSION for LlamaDecoderLayer
The implementation correctly mirrors the Llama4DecoderLayer logic for handling the missing allreduce op on the last decoder layer. When `next_layer_layernorm` is None, it properly performs a pure allreduce with `fusion_op=None`.

773-795: Consistent scale adjustment logic across both decoder layer implementations
The scale adjustment and unpacking logic for LlamaDecoderLayer correctly mirrors the Llama4DecoderLayer implementation, ensuring consistent behavior across both model types when handling the last decoder layer.
PR_Github #15266 [ run ] triggered by Bot
PR_Github #15266 [ run ] completed with state
…IA#6850) The allreduce op of the last decoder layer is missing in some circumstances for the models Llama3 and Llama4. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
…IA#6850) The allreduce op of the last decoder layer is missing in some circumstances for the models Llama3 and Llama4. Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
The allreduce op of the last decoder layer is missing in some circumstances for the models Llama3 and Llama4.
Summary by CodeRabbit
Bug Fixes
Chores