KEMBAR78
[TRTLLM-7385][feat] Optimize Qwen2/2.5-VL performance by yechank-nvidia · Pull Request #7250 · NVIDIA/TensorRT-LLM · GitHub
Skip to content

Conversation

@yechank-nvidia
Copy link
Collaborator

@yechank-nvidia yechank-nvidia commented Aug 26, 2025

Summary by CodeRabbit

  • New Features

    • Added multimodal mRoPE support across models, enabling Qwen2/Qwen2.5-VL with configurable rope fusion.
    • Enabled importing HuggingFace Qwen2-VL weights through a new mapping flow.
    • Added selective device transfer for multimodal inputs via path filtering.
  • Performance

    • CUDA Graphs path now supports mRoPE with per-batch parameters and non-blocking transfers.
  • Improvements

    • Refined vision encoder integration and positional handling.
    • Quickstart example can optionally print context/generation logits and logprobs via flags.
  • Tests

    • Added comprehensive Qwen2.5-VL integration tests and a unit test for selective device transfer.

@yechank-nvidia yechank-nvidia self-assigned this Aug 26, 2025
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Aug 26, 2025

📝 Walkthrough

Walkthrough

Adds mRoPE support and refactors Qwen2-VL weight loading and vision wiring. Introduces MRotaryEmbedding and mrope_section handling. Removes mrope_config from multiple model forward signatures. Updates multimodal data flow through executor (position_ids/multimodal_params). Adds HF weight mapper and tests for Qwen2_5-VL and selective to_device.

Changes

Cohort / File(s) Summary of changes
Example diagnostics
examples/llm-api/quickstart_multimodal.py
Conditional printing of context logits, generation logits, and logprobs based on CLI flags.
Positional embedding interface
tensorrt_llm/_torch/attention_backend/interface.py
Adds PositionalEmbeddingParams.mrope_section: Optional[List[int]].
Rotary embedding + attention wiring
tensorrt_llm/_torch/modules/rotary_embedding.py, tensorrt_llm/_torch/modules/attention.py
Adds MRotaryEmbedding with mrope_section handling and integrates selection between RotaryEmbedding and MRotaryEmbedding based on pos_embd_params.type.is_mrope() and rope_fusion. Exposes MRotaryEmbedding.
Remove mrope_config from forward paths
tensorrt_llm/_torch/models/modeling_gpt_oss.py, .../modeling_hunyuan_moe.py, .../modeling_llama.py, .../modeling_llama_min_latency.py, .../modeling_qwen3.py
Drops mrope_config from method signatures and call chains in Transformer, HunYuanAttention, Llama4Attention, Llama4MinLatencyAttention, Qwen3DecoderLayer/Qwen3Model.
Qwen attention updates
tensorrt_llm/_torch/models/modeling_qwen.py
Passes rope_fusion flag to base Attention and forwards mrope_section into PositionalEmbeddingParams when rope_scaling is present.
Qwen2-VL refactor and weight mapping
tensorrt_llm/_torch/models/modeling_qwen2vl.py, tensorrt_llm/_torch/models/checkpoints/__init__.py, .../checkpoints/hf/qwen2vl_weight_mapper.py
Refactors vision encoder construction, adds filter_weights, prepares mrope_config in forward, supports disable_fuse_rope, exposes multimodal_data_device_paths. Adds Qwen2VLHfWeightMapper and exports it.
Executor mRoPE pipeline
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py, .../model_engine.py, .../resource_manager.py
Switches to per-batch MultimodalParams for mRoPE, uses 3D position_ids buffer when enabled, propagates multimodal_params and mrope_position_ids, updates warmup to honor use_mrope, removes legacy mrope_position_deltas path in runner and request construction.
Multimodal inputs API
tensorrt_llm/inputs/multimodal.py
MultimodalParams.to_device gains keyword path filter; adds internal helper to move selected paths; strip_for_generation keeps only mrope_position_deltas under mrope_config.
Tests
tests/unittest/_torch/modeling/test_modeling_qwen2_5vl.py, tests/unittest/_torch/multimodal/test_share_multiparams.py
Adds Qwen2_5-VL integration and HF parity tests, CUDA graph scenarios, and a unit test for to_device keyword-path filtering behavior.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant User
  participant Model as Qwen2VLModel
  participant WM as Qwen2VLHfWeightMapper
  participant Vision as VisionEncoder
  participant LLM as LLM

  User->>Model: load_weights(weights, WM)
  Model->>Model: filter_weights(prefix, weights)
  Model->>Vision: load vision weights (if not DISAGG)
  Model->>LLM: llm.load_weights(WM)
  alt fuse_rope enabled
    Model->>Model: prepare_mrope_config(...)
  end
  Model-->>User: load complete
Loading
sequenceDiagram
  autonumber
  participant Client
  participant RM as ResourceManager
  participant Engine as PyTorchModelEngine
  participant CG as CUDAGraphRunner
  participant Model as Model.forward
  participant Attn as Attention/MRotaryEmbedding

  Client->>RM: build dummy/real request (use_mrope?)
  RM->>RM: attach py_multimodal_data.mrope_config (position_ids [+ deltas])
  RM-->>Engine: LlmRequest
  Engine->>Engine: _prepare_tp_inputs(use_mrope)
  Engine->>CG: capture(inputs: position_ids, multimodal_params)
  Client->>CG: run(inputs: position_ids, multimodal_params)
  CG->>Model: forward(attn_metadata, position_ids, multimodal_params, ...)
  alt rope_fusion disabled and is_mrope
    Model->>Attn: MRotaryEmbedding.forward(position_ids, qkv)
  else
    Model->>Attn: RotaryEmbedding or fused path
  end
  Model-->>Client: logits
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Suggested reviewers

  • yizhang-nv
  • QiJune
  • juney-nvidia
  • Funatiq
  • rakib-hasan

Tip

🔌 Remote MCP (Model Context Protocol) integration is now available!

Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats.

✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai or @coderabbitai title anywhere in the PR title to generate the title automatically.

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@yechank-nvidia yechank-nvidia changed the title [TRTLLM-7385][feat] Optimize & clean Qwen2/2.5-VL performance [TRTLLM-7385][feat] Optimize Qwen2/2.5-VL performance Aug 26, 2025
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 11

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (11)
examples/llm-api/quickstart_multimodal.py (1)

1-1: Add standard NVIDIA multi-line header to Python example

The file examples/llm-api/quickstart_multimodal.py is missing the required NVIDIA copyright and license header.
Based on the most common header in other Python files (e.g. triton_backend/all_models/.../fbank.py), please prepend this exact block at the very top of the file:

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.

• File: examples/llm-api/quickstart_multimodal.py (insert before line 1)

tensorrt_llm/_torch/attention_backend/interface.py (1)

1-1: Missing NVIDIA copyright header (2025) and future-annotations import

Per coding guidelines, prepend the NVIDIA copyright header to all Python sources. Also, several modules in this PR use PEP 604/585 typing elsewhere; adding from __future__ import annotations early helps keep Python 3.8 compatibility if annotations evolve here later.

Apply at file top:

+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ from __future__ import annotations
 import copy
 import weakref
tensorrt_llm/_torch/models/modeling_llama.py (1)

1-1: Missing NVIDIA copyright header (2025) and future-annotations import

Header is required. This module uses PEP 604 unions (|) and PEP 585 generics (tuple[...]). Add from __future__ import annotations immediately after the header to keep Python 3.8 compatibility by deferring annotation evaluation.

+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ from __future__ import annotations
 import copy
 import os
tensorrt_llm/_torch/models/modeling_llama_min_latency.py (1)

1-1: Missing NVIDIA copyright header (2025) and future-annotations import

This module uses PEP 604/585 typing in multiple places. Add the header and from __future__ import annotations at the top to ensure Python 3.8 compatibility and satisfy repository policy.

+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ from __future__ import annotations
 from collections.abc import Callable
 from functools import partial
tensorrt_llm/inputs/multimodal.py (1)

1-1: Missing NVIDIA copyright header (2025)

Please prepend the required header.

+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
 """Multimodal utilities for handling images and other media types in TensorRT-LLM."""
tests/unittest/_torch/multimodal/test_share_multiparams.py (1)

1-1: Missing required NVIDIA copyright header (2025)

Per repository guidelines, prepend the standard NVIDIA copyright header (current year) to all Python files, including tests.

tensorrt_llm/_torch/models/checkpoints/__init__.py (1)

1-1: Missing required NVIDIA copyright header (2025)

Please add the standard NVIDIA header at the top of this file.

tensorrt_llm/_torch/modules/attention.py (1)

1-1: Missing required NVIDIA copyright header (2025)

Add the standard header to comply with repo policy.

tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)

1-1: Missing required NVIDIA copyright header (2025)

Please prepend the standard NVIDIA header.

tensorrt_llm/_torch/models/modeling_qwen2vl.py (2)

244-265: Incorrect temporal padding computation (under-pads for non-multiples).

padded_num_frames = num_frames + num_frames % temporal_patch_size produces 12 for 10 frames with size 8; the next multiple is 16. Use ceil-to-multiple.

Apply:

-        padded_num_frames = num_frames + num_frames % temporal_patch_size
+        padded_num_frames = (
+            ((num_frames + temporal_patch_size - 1) // temporal_patch_size)
+            * temporal_patch_size
+        )

609-711: Avoid padding to max_position_embeddings; vectorize and fix torch.concatenate usage.

Current approach pads to max_position_embeddings per sample, which is memory-heavy and risks OOM on GPU. You can directly index self.cos_ori/self.sin_ori with variable-length IDs and then reshape/split without padding. Also, prefer torch.cat (broader compatibility) and move deltas to the model device with correct dtype.

Apply:

-        if mrope_position_ids:
-            concat_cos_sin_list = []
-            for mrope_position_id in mrope_position_ids:
-                mrope_position_id = mrope_position_id.transpose(1, 0)
-
-                # TODO: Test whether padding is needed
-                mrope_position_ids_padding = torch.zeros(
-                    mrope_position_id.shape[:-1] +
-                    (self.model_config.pretrained_config.
-                     max_position_embeddings, ),
-                    dtype=torch.int32,
-                    device=mrope_position_id.device)
-                mrope_position_ids_padding[:, :, :mrope_position_id.
-                                           shape[-1]] = mrope_position_id
-
-                mrope_position_ids_padding = mrope_position_ids_padding.to(
-                    self.cos_ori.device)
-                cos = self.cos_ori[mrope_position_ids_padding]
-                sin = self.sin_ori[mrope_position_ids_padding]
-
-                mrope_section = [16, 24, 24]
-                cos = torch.cat([
-                    m[:, i % 3]
-                    for i, m in enumerate(cos.split(mrope_section, dim=-1))
-                ],
-                                dim=-1).unsqueeze(-1)
-                sin = torch.cat([
-                    m[:, i % 3]
-                    for i, m in enumerate(sin.split(mrope_section, dim=-1))
-                ],
-                                dim=-1).unsqueeze(-1)
-                concat_cos_sin = torch.concatenate((cos, sin), axis=-1)
-                concat_cos_sin = concat_cos_sin.reshape(concat_cos_sin.shape[0],
-                                                        -1)
-                concat_cos_sin_list.append(concat_cos_sin)
-            mrope_config['mrope_rotary_cos_sin'] = torch.cat(
-                concat_cos_sin_list, dim=0).to(self.device)
+        if mrope_position_ids:
+            concat_cos_sin_list = []
+            sections = getattr(self.llm.config, "mrope_section", [16, 24, 24])
+            for mrope_position_id in mrope_position_ids:
+                pos = mrope_position_id.transpose(1, 0).to(self.cos_ori.device, dtype=torch.long)
+                cos = self.cos_ori[pos]
+                sin = self.sin_ori[pos]
+                cos = torch.cat([chunk[:, i % 3] for i, chunk in enumerate(cos.split(sections, dim=-1))], dim=-1).unsqueeze(-1)
+                sin = torch.cat([chunk[:, i % 3] for i, chunk in enumerate(sin.split(sections, dim=-1))], dim=-1).unsqueeze(-1)
+                concat = torch.cat((cos, sin), dim=-1).reshape(cos.shape[0], -1)
+                concat_cos_sin_list.append(concat)
+            mrope_config['mrope_rotary_cos_sin'] = torch.cat(concat_cos_sin_list, dim=0).to(self.device)
@@
-        if mrope_position_deltas:
-            mrope_config['mrope_position_deltas'] = torch.cat(
-                mrope_position_deltas, dim=0)
+        if mrope_position_deltas:
+            mrope_config['mrope_position_deltas'] = torch.cat(mrope_position_deltas, dim=0).to(self.device, dtype=torch.int32)

Note: consider making sections configurable instead of hardcoding [16, 24, 24].

🧹 Nitpick comments (26)
examples/llm-api/quickstart_multimodal.py (1)

221-228: Avoid dumping full logits to stdout; print shapes/summaries to prevent massive logs and slowdown

Raw context/generation logits can be extremely large (seq_len × vocab_size × beams), which can flood stdout, slow runs, and OOM terminals. Summarize instead and be resilient to missing attributes.

-        if args.return_context_logits:
-            print(f"[{i}] Context logits: {output.context_logits}")
-        if args.return_generation_logits:
-            print(
-                f"[{i}] Generation logits: {output.outputs[0].generation_logits}"
-            )
-        if args.logprobs:
-            print(f"[{i}] Logprobs: {output.outputs[0].logprobs}")
+        if getattr(args, "return_context_logits", False):
+            ctx = getattr(output, "context_logits", None)
+            if ctx is None:
+                print(f"[{i}] Context logits: None")
+            else:
+                shape = getattr(ctx, "shape", None)
+                dtype = getattr(ctx, "dtype", type(ctx))
+                print(f"[{i}] Context logits: shape={tuple(shape) if shape is not None else 'unknown'}, dtype={dtype}")
+
+        if getattr(args, "return_generation_logits", False):
+            gen = getattr(output.outputs[0], "generation_logits", None)
+            if gen is None:
+                print(f"[{i}] Generation logits: None")
+            else:
+                shape = getattr(gen, "shape", None)
+                dtype = getattr(gen, "dtype", type(gen))
+                print(f"[{i}] Generation logits: shape={tuple(shape) if shape is not None else 'unknown'}, dtype={dtype}")
+
+        if getattr(args, "logprobs", False):
+            lps = getattr(output.outputs[0], "logprobs", None)
+            if isinstance(lps, list):
+                head = ", ".join(f"{x:.4f}" for x in lps[:5])
+                more = "..." if len(lps) > 5 else ""
+                print(f"[{i}] Logprobs: n={len(lps)}, head=[{head}{more}]")
+            else:
+                print(f"[{i}] Logprobs: {lps}")
tensorrt_llm/_torch/attention_backend/interface.py (1)

506-508: mRoPE section: add light validation and docstring clarity

Good addition. To make failures obvious when a wrong shape is passed from model wiring, add validation in __post_init__ for mrope_section when present (e.g., positive ints, non-empty, optionally sum <= rope.dim) and briefly document expected semantics (Qwen2/2.5-VL commonly uses three sections). Keep it optional to avoid over-constraining new models.

Example guard (illustrative; adapt location accordingly):

if self.type.is_rope() and self.mrope_section is not None:
    assert all(isinstance(x, int) and x > 0 for x in self.mrope_section), "mrope_section must be positive ints"
    if self.rope and isinstance(self.rope.dim, int):
        assert sum(self.mrope_section) <= self.rope.dim, "sum(mrope_section) must not exceed rope.dim"
tensorrt_llm/_torch/models/modeling_llama.py (2)

182-183: Forward call now passes fewer positional args; prefer keywords to avoid drift

You removed one positional None in the forward_impl(...) call. To make this resilient to future signature changes and to ensure correctness now, pass these trailing optionals by name (e.g., kv_cache_params=None, lora_params=None, skip_attn_scaling=..., attention_sinks=None). This avoids accidental misalignment if forward_impl’s parameter order differs across backends.

Example pattern:

attn_output = self.forward_impl(
    q, k, v,
    attn_metadata,
    attention_mask=attention_mask,
    all_reduce_params=all_reduce_params,
    skip_attn_scaling=skip_attn_scaling,
    attention_sinks=None,
)

Please confirm the exact parameter names.


433-435: Min-latency NOPE path: keep call-site robust with keywords

Same concern as above: the parent _forward_nope changed and you removed the mrope_config positional. Pass the trailing parameters by name to guard against parameter drift and improve readability.

return super()._forward_nope(
    position_ids,
    hidden_states,
    attn_metadata,
    attention_mask=attention_mask,
    all_reduce_params=all_reduce_params,
    skip_attn_scaling=skip_attn_scaling,
)
tensorrt_llm/_torch/models/modeling_llama_min_latency.py (1)

433-435: Call into base _forward_nope: use keyword args for clarity

The removed mrope_config positional looks correct. To prevent subtle regressions if _forward_nope evolves, pass the last arguments by name.

-        return super()._forward_nope(position_ids, hidden_states, attn_metadata,
-                                     attention_mask, all_reduce_params,
-                                     skip_attn_scaling)
+        return super()._forward_nope(
+            position_ids,
+            hidden_states,
+            attn_metadata,
+            attention_mask=attention_mask,
+            all_reduce_params=all_reduce_params,
+            skip_attn_scaling=skip_attn_scaling,
+        )
tensorrt_llm/inputs/multimodal.py (2)

388-403: Path-filtered device move: validate dict type; behavior on empty list

Good defensive check that multimodal_data is a dict when keyword is provided. Consider treating keyword=[] as a no-op explicitly (currently it is), and optionally log when a path is not found to ease debugging (silent skip is fine for “Chill”).


405-448: Helper mutates input dict in place; document this or deep-copy

result = data and subsequent assignments update the original dict. This is efficient, but note it in the docstring to avoid surprises for callers reusing the dict elsewhere. Alternatively, shallow-copy the parent dict(s) before mutation if you want isolation, at a small cost.

Example shallow-copy at top:

-        result = data
+        result = dict(data)

And copy nested parents before mutation if needed.

tests/unittest/_torch/multimodal/test_share_multiparams.py (1)

149-167: Selective to_device path filter works; consider minor robustness tweaks

The new test validates the dot-path filtering behavior and confirms non-targeted tensors remain on CPU. Looks good. Two optional hardening ideas:

  • Add a skip guard for environments without CUDA to avoid false negatives on CPU-only CI.
  • Consider adding one more assertion for a non-existent path (e.g., "image.not_there") to confirm it’s safely ignored.

Example (optional):

@unittest.skipIf(not torch.cuda.is_available(), "CUDA required for this test")
def test_to_device_with_keyword(self):
    ...
tensorrt_llm/_torch/models/checkpoints/__init__.py (1)

18-19: Keep all tidy (optional)

Appending the new symbol is fine. Consider keeping all sorted alphabetically to reduce churn in future diffs.

tensorrt_llm/_torch/modules/attention.py (1)

267-279: Initialize MRotaryEmbedding defensively when mrope_section may be absent

The MRope path looks correct. To avoid AttributeError with older configs that don’t carry pos_embd_params.mrope_section, construct MRotaryEmbedding without the parameter when it’s missing (class has a sensible default). This preserves backward compatibility.

Apply this diff within the block:

-            if self.pos_embd_params.type.is_mrope():
-                self.rotary_emb = MRotaryEmbedding(
-                    self.pos_embd_params.rope,
-                    head_dim=self.head_dim,
-                    is_neox=self.pos_embd_params.is_neox,
-                    mrope_section=self.pos_embd_params.mrope_section,
-                )
+            if self.pos_embd_params.type.is_mrope():
+                if hasattr(self.pos_embd_params, "mrope_section"):
+                    self.rotary_emb = MRotaryEmbedding(
+                        self.pos_embd_params.rope,
+                        head_dim=self.head_dim,
+                        is_neox=self.pos_embd_params.is_neox,
+                        mrope_section=self.pos_embd_params.mrope_section,
+                    )
+                else:
+                    self.rotary_emb = MRotaryEmbedding(
+                        self.pos_embd_params.rope,
+                        head_dim=self.head_dim,
+                        is_neox=self.pos_embd_params.is_neox,
+                    )
tensorrt_llm/_torch/pyexecutor/resource_manager.py (2)

484-484: Fix typo and wrap long NOTE to satisfy Ruff E501

There’s a minor typo (“mrop_config”) and the line exceeds 120 chars. Reflow into two lines and fix the spelling.

Apply this diff:

-            # NOTE: Planning to get dummy_data from each model. Before that, we need to add dummy mrop_config to the request here.
+            # NOTE: Planning to get dummy_data from each model.
+            # Before that, add a dummy mrope_config to the request here.

493-498: Dummy mrope_position_deltas shape looks fine; consider setdefault if py_multimodal_data may exist later

For dummy requests it’s safe to overwrite py_multimodal_data. If this function ever reuses existing requests, prefer setdefault to avoid clobbering other dummy modalities.

Alternative (for future-proofing):

req.py_multimodal_data = getattr(req, "py_multimodal_data", {}) or {}
req.py_multimodal_data.setdefault("mrope_config", {})["mrope_position_ids"] = dummy_mrope_position_ids
...
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1)

61-69: Consider using a factory function for MultimodalParams creation.

The multimodal_params initialization with default mrope_config could be extracted into a helper method for better readability and reusability.

Consider refactoring to improve readability:

+    def _create_mrope_multimodal_params(self, batch_size: int, device: str):
+        """Create MultimodalParams with default mrope configuration."""
+        return [
+            MultimodalParams(
+                multimodal_data={
+                    "mrope_config": {
+                        "mrope_position_deltas":
+                        torch.zeros((1, 1), device=device, dtype=torch.int32)
+                    }
+                }) for _ in range(batch_size)
+        ]
+
     def __init__(self, ...):
         ...
-        self.multimodal_params = [
-            MultimodalParams(
-                multimodal_data={
-                    "mrope_config": {
-                        "mrope_position_deltas":
-                        torch.zeros((1, 1), device=device, dtype=torch.int32)
-                    }
-                }) for _ in range(batch_size)
-        ] if self.use_mrope else []
+        self.multimodal_params = self._create_mrope_multimodal_params(
+            batch_size, device) if self.use_mrope else []
tensorrt_llm/_torch/modules/rotary_embedding.py (1)

167-170: Consider adding a warning for unexpected 2D position_ids in mRoPE mode.

Since MRotaryEmbedding is specifically for mRoPE, receiving 2D position_ids might indicate a configuration issue.

Consider adding a warning when falling back to standard RoPE:

         else:
             # Fallback to the original RoPE where position_ids is 2D for dummy requests
+            import warnings
+            warnings.warn(
+                "MRotaryEmbedding received 2D position_ids, falling back to standard RoPE. "
+                "This may indicate a configuration issue.",
+                RuntimeWarning,
+                stacklevel=2
+            )
             cos_sin = self.rotary_cos_sin[position_ids.view(-1)]
             cos, sin = cos_sin[:, 0, :], cos_sin[:, 1, :]
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)

1287-1306: Fix line length issues for better readability.

The code logic is correct, but several lines exceed the 120-character limit, affecting readability.

Apply this formatting fix:

             if self.use_mrope:
                 ctx_mrope_position_ids = multimodal_params.multimodal_data[
-                    'mrope_config'][
-                        'mrope_position_ids'][:, :,
-                                              begin_compute:begin_compute +
-                                              len(prompt_tokens)]
+                    'mrope_config']['mrope_position_ids'][
+                        :, :, begin_compute:begin_compute + len(prompt_tokens)]
                 mrope_position_ids.append(ctx_mrope_position_ids)

                 # TODO: Visit later to decide the appropriate position of sending multimodal data & selectively sending multimodal data
                 multimodal_params.to_device(
                     "multimodal_data",
                     "cuda",
                     pin_memory=True,
-                    keyword=getattr(self.model, "multimodal_data_device_paths",
-                                    None))
+                    keyword=getattr(
+                        self.model, "multimodal_data_device_paths", None))

1441-1465: Consider extracting mRoPE position ID generation logic.

The mRoPE position ID generation and multimodal parameter handling could be extracted into a helper method for better maintainability.

Consider refactoring into a helper method:

+    def _prepare_generation_mrope_params(
+            self, past_seen_token_num: int,
+            multimodal_data: dict) -> Tuple[torch.Tensor, MultimodalParams]:
+        """Prepare mRoPE parameters for generation requests."""
+        mrope_position_deltas = multimodal_data[
+            'mrope_config']['mrope_position_deltas']
+        gen_mrope_position_ids = (
+            past_seen_token_num + mrope_position_deltas
+        ).expand(3, 1, 1)
+        
+        multimodal_params = MultimodalParams(multimodal_data=multimodal_data)
+        multimodal_params.strip_for_generation()
+        if multimodal_params.has_content():
+            multimodal_params.to_device(
+                "multimodal_data", "cuda", pin_memory=True,
+                keyword=["mrope_config.mrope_position_deltas"])
+        return gen_mrope_position_ids, multimodal_params
+
                 # Multimodal
-                multimodal_params = MultimodalParams(
-                    multimodal_data=request.py_multimodal_data)
-                multimodal_params.strip_for_generation()
-                if multimodal_params.has_content():
+                if request.py_multimodal_data and self.use_mrope:
+                    gen_mrope_position_ids, multimodal_params = (
+                        self._prepare_generation_mrope_params(
+                            past_seen_token_num, request.py_multimodal_data))
+                    mrope_position_ids.append(gen_mrope_position_ids)
+                    multimodal_params_list.append(multimodal_params)
tests/unittest/_torch/modeling/test_modeling_qwen2_5vl.py (2)

101-102: Fix line length in repr method.

The return statement exceeds the 120-character limit.

Split the string for better readability:

     def __repr__(self) -> str:
-        return f"modality:{self.modality.lower()}-use_cuda_graph:{self.use_cuda_graph}-disable_fuse_rope:{self.disable_fuse_rope}"
+        return (f"modality:{self.modality.lower()}-"
+                f"use_cuda_graph:{self.use_cuda_graph}-"
+                f"disable_fuse_rope:{self.disable_fuse_rope}")

134-136: Consider adding error handling for tokenizer initialization.

The tokenizer loading could fail if the model path is invalid.

Add error handling:

     def get_inputprocessor(self, model_path):
-        tokenizer = AutoTokenizer.from_pretrained(model_path)
-        return create_input_processor(model_path, tokenizer=tokenizer)
+        try:
+            tokenizer = AutoTokenizer.from_pretrained(model_path)
+            return create_input_processor(model_path, tokenizer=tokenizer)
+        except Exception as e:
+            raise RuntimeError(f"Failed to load tokenizer from {model_path}: {e}")
tensorrt_llm/_torch/models/modeling_qwen2vl.py (8)

62-89: Align docstring with return shape and ensure consistent dtype/shape.

  • Docstring says mrope_position_deltas has shape (batch_size), but the code returns (B, 1). Keep them consistent.
  • Consider using torch.int32 for deltas at the source to avoid later casting churn.

Apply:

-        Returns:
-            position_ids: A tensor of shape (3, batch_size, sequence_length)
-            mrope_position_deltas: A tensor of shape (batch_size)
+        Returns:
+            position_ids: torch.IntTensor of shape (3, batch_size, sequence_length)
+            mrope_position_deltas: torch.IntTensor of shape (batch_size, 1)

And when returning in the no-vision path:

-                mrope_position_deltas = torch.zeros(
+                mrope_position_deltas = torch.zeros(
                     [input_ids.shape[0], 1],
                     device=input_ids.device,
-                    dtype=input_ids.dtype,
+                    dtype=torch.int32,
                 )

Also applies to: 232-235, 296-313


135-146: Avoid shadowing input_ids parameter inside loop.

Reusing input_ids for the per-row tensor shadows the function parameter and risks confusion.

Apply:

-        for i, input_ids in enumerate(total_input_ids):
-            input_ids = input_ids[attention_mask[i] == 1]
+        for i, seq_input_ids in enumerate(total_input_ids):
+            seq_input_ids = seq_input_ids[attention_mask[i] == 1]
             image_nums, video_nums = 0, 0
             vision_start_indices = torch.argwhere(
-                input_ids == vision_start_token_id).squeeze(1)
-            vision_tokens = input_ids[vision_start_indices + 1]
+                seq_input_ids == vision_start_token_id).squeeze(1)
+            vision_tokens = seq_input_ids[vision_start_indices + 1]
-            input_tokens = input_ids.tolist()
+            input_tokens = seq_input_ids.tolist()

303-313: Make CPU clones explicit and unify dtypes.

  • Keeping mrope_position_ids as int64 is fine for indexing later, but spell out intent.
  • mrope_position_deltas is cast to int32; consider casting IDs to int64 explicitly to avoid implicit conversions.

Apply:

-        mrope_config['mrope_position_ids'] = mrope_position_ids.to(
-            'cpu').clone()
+        mrope_config['mrope_position_ids'] = mrope_position_ids.to('cpu', dtype=torch.long).clone()

349-358: Remove commented debug prints and long lines (Ruff E501).

Dead debug prints add noise and break line-length checks. Please delete them.

Apply:

-        # print(f"fused_input_ids: {fused_input_ids}")
-        # print(f"multimodal_data['image']['pixel_values'].shape: {multimodal_data['image']['pixel_values'].shape}")
-        # print(f"multimodal_data['image']['image_grid_thw']: {multimodal_data['image']['image_grid_thw']}")
-        # print(f"multimodal_data['mrope_config']['mrope_position_ids'].shape: {multimodal_data['mrope_config']['mrope_position_ids'].shape}")
-        # print(f"multimodal_data['mrope_config']['mrope_position_deltas'].shape: {multimodal_data['mrope_config']['mrope_position_deltas'].shape}")

521-533: Replace print statements with logger and avoid dumping giant key lists.

Use logger.debug and optionally gate behind a verbosity flag; dumping entire weights.keys() is noisy.

Apply:

-        if not DISAGG:
-            print(f"weights.keys(): {weights.keys()}")
-            vision_encoder_weights = filter_weights("visual", weights)
-            print(
-                f"vision_encoder_weights.keys(): {vision_encoder_weights.keys()}"
-            )
+        if not DISAGG:
+            vision_encoder_weights = filter_vision_weights("visual", weights)
+            logger.debug("Loaded %d vision weights", len(vision_encoder_weights))

32-34: Global naming: prefer G_ prefix for globals.

Per guidelines, globals should be UPPER_SNAKE_CASE prefixed with G_. Consider renaming DISAGG to G_DISAGG for consistency.

Sample:

-DISAGG = os.getenv('TLLM_MULTIMODAL_DISAGGREGATED', '0') == '1'
+G_DISAGG = os.getenv('TLLM_MULTIMODAL_DISAGGREGATED', '0') == '1'

Update references accordingly.


356-357: Resolve Ruff E501: wrap or reflow long lines.

A few lines exceed 120 characters. Please wrap them to satisfy linters.

You can break the long comment at Line 477 and remove the debug prints at Lines 356–357 as suggested earlier. For other occurrences, wrap strings/comments across lines.

Also applies to: 477-479, 566-569


296-313: Add minimal docstrings for public helpers.

get_mrope_config and prepare_mrope_config are externally consumed via MultimodalParams; add short Google-style docstrings for clarity.

Also applies to: 609-711

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between bf377d0 and 87b279a.

📒 Files selected for processing (19)
  • examples/llm-api/quickstart_multimodal.py (1 hunks)
  • tensorrt_llm/_torch/attention_backend/interface.py (1 hunks)
  • tensorrt_llm/_torch/models/checkpoints/__init__.py (2 hunks)
  • tensorrt_llm/_torch/models/checkpoints/hf/qwen2vl_weight_mapper.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_gpt_oss.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_hunyuan_moe.py (0 hunks)
  • tensorrt_llm/_torch/models/modeling_llama.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_llama_min_latency.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_qwen.py (2 hunks)
  • tensorrt_llm/_torch/models/modeling_qwen2vl.py (12 hunks)
  • tensorrt_llm/_torch/models/modeling_qwen3.py (0 hunks)
  • tensorrt_llm/_torch/modules/attention.py (2 hunks)
  • tensorrt_llm/_torch/modules/rotary_embedding.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (4 hunks)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py (11 hunks)
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py (1 hunks)
  • tensorrt_llm/inputs/multimodal.py (2 hunks)
  • tests/unittest/_torch/modeling/test_modeling_qwen2_5vl.py (1 hunks)
  • tests/unittest/_torch/multimodal/test_share_multiparams.py (1 hunks)
💤 Files with no reviewable changes (2)
  • tensorrt_llm/_torch/models/modeling_qwen3.py
  • tensorrt_llm/_torch/models/modeling_hunyuan_moe.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/attention_backend/interface.py
  • tensorrt_llm/_torch/models/checkpoints/hf/qwen2vl_weight_mapper.py
  • examples/llm-api/quickstart_multimodal.py
  • tensorrt_llm/_torch/models/modeling_llama_min_latency.py
  • tests/unittest/_torch/multimodal/test_share_multiparams.py
  • tensorrt_llm/_torch/models/modeling_qwen.py
  • tensorrt_llm/_torch/models/modeling_llama.py
  • tensorrt_llm/_torch/modules/attention.py
  • tensorrt_llm/_torch/modules/rotary_embedding.py
  • tensorrt_llm/_torch/models/checkpoints/__init__.py
  • tensorrt_llm/_torch/models/modeling_gpt_oss.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
  • tensorrt_llm/inputs/multimodal.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tests/unittest/_torch/modeling/test_modeling_qwen2_5vl.py
  • tensorrt_llm/_torch/models/modeling_qwen2vl.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/attention_backend/interface.py
  • tensorrt_llm/_torch/models/checkpoints/hf/qwen2vl_weight_mapper.py
  • examples/llm-api/quickstart_multimodal.py
  • tensorrt_llm/_torch/models/modeling_llama_min_latency.py
  • tests/unittest/_torch/multimodal/test_share_multiparams.py
  • tensorrt_llm/_torch/models/modeling_qwen.py
  • tensorrt_llm/_torch/models/modeling_llama.py
  • tensorrt_llm/_torch/modules/attention.py
  • tensorrt_llm/_torch/modules/rotary_embedding.py
  • tensorrt_llm/_torch/models/checkpoints/__init__.py
  • tensorrt_llm/_torch/models/modeling_gpt_oss.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
  • tensorrt_llm/inputs/multimodal.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tests/unittest/_torch/modeling/test_modeling_qwen2_5vl.py
  • tensorrt_llm/_torch/models/modeling_qwen2vl.py
🧠 Learnings (1)
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
🧬 Code graph analysis (12)
tensorrt_llm/_torch/models/checkpoints/hf/qwen2vl_weight_mapper.py (3)
tensorrt_llm/_torch/models/checkpoints/hf/weight_mapper.py (1)
  • HfWeightMapper (10-99)
tensorrt_llm/_torch/models/modeling_utils.py (1)
  • register_mapper (597-608)
tensorrt_llm/_torch/models/modeling_qwen2vl.py (1)
  • filter_weights (35-42)
examples/llm-api/quickstart_multimodal.py (2)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
  • generation_logits (208-217)
tensorrt_llm/scaffolding/task.py (1)
  • logprobs (99-100)
tests/unittest/_torch/multimodal/test_share_multiparams.py (1)
tensorrt_llm/inputs/multimodal.py (1)
  • to_device (359-403)
tensorrt_llm/_torch/models/modeling_qwen.py (2)
tensorrt_llm/_torch/models/modeling_utils.py (1)
  • config (500-501)
tensorrt_llm/_torch/models/modeling_bert.py (1)
  • config (266-285)
tensorrt_llm/_torch/modules/attention.py (2)
tensorrt_llm/_torch/modules/rotary_embedding.py (2)
  • MRotaryEmbedding (130-193)
  • RotaryEmbedding (10-127)
tensorrt_llm/functional.py (1)
  • is_mrope (707-708)
tensorrt_llm/_torch/modules/rotary_embedding.py (1)
tensorrt_llm/_torch/attention_backend/interface.py (1)
  • RopeParams (342-494)
tensorrt_llm/_torch/models/checkpoints/__init__.py (1)
tensorrt_llm/_torch/models/checkpoints/hf/qwen2vl_weight_mapper.py (1)
  • Qwen2VLHfWeightMapper (7-21)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
  • use_mrope (482-490)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
tensorrt_llm/inputs/multimodal.py (1)
  • MultimodalParams (152-475)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
  • use_mrope (482-490)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (2)
  • add_dummy_requests (68-69)
  • add_dummy_requests (430-499)
tensorrt_llm/inputs/multimodal.py (4)
  • to_device (359-403)
  • MultimodalParams (152-475)
  • strip_for_generation (449-471)
  • has_content (473-475)
tests/unittest/_torch/modeling/test_modeling_qwen2_5vl.py (11)
tensorrt_llm/_torch/attention_backend/utils.py (1)
  • get_attention_backend (10-24)
tensorrt_llm/_torch/metadata.py (1)
  • KVCacheParams (9-31)
tensorrt_llm/_torch/models/checkpoints/hf/qwen2vl_weight_mapper.py (1)
  • Qwen2VLHfWeightMapper (7-21)
tensorrt_llm/_torch/models/modeling_qwen2vl.py (1)
  • Qwen2_5_VLModel (802-815)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (3)
  • DecodingCUDAGraphRunner (12-143)
  • capture (78-104)
  • run (109-143)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (6)
  • KVCacheManager (141-943)
  • add_dummy_requests (68-69)
  • add_dummy_requests (430-499)
  • shutdown (80-81)
  • shutdown (355-356)
  • shutdown (1120-1121)
tensorrt_llm/inputs/registry.py (1)
  • create_input_processor (274-303)
tensorrt_llm/inputs/utils.py (1)
  • default_multimodal_input_loader (442-602)
tensorrt_llm/inputs/data.py (1)
  • prompt_inputs (55-69)
tensorrt_llm/inputs/multimodal.py (2)
  • MultimodalParams (152-475)
  • to_device (359-403)
tensorrt_llm/_torch/model_config.py (1)
  • torch_dtype (147-152)
tensorrt_llm/_torch/models/modeling_qwen2vl.py (5)
tensorrt_llm/_torch/models/checkpoints/base_weight_mapper.py (1)
  • BaseWeightMapper (10-165)
tensorrt_llm/_torch/models/modeling_utils.py (2)
  • filter_weights (710-716)
  • config (500-501)
tensorrt_llm/runtime/multimodal_model_runner.py (1)
  • get_rope_index (1802-1968)
tensorrt_llm/_torch/model_config.py (1)
  • torch_dtype (147-152)
tensorrt_llm/inputs/multimodal.py (1)
  • MultimodalParams (152-475)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/resource_manager.py

484-484: Line too long (130 > 120)

(E501)

tensorrt_llm/_torch/pyexecutor/model_engine.py

1296-1296: Line too long (135 > 120)

(E501)


1462-1462: Line too long (131 > 120)

(E501)


1583-1583: Line too long (145 > 120)

(E501)


1584-1584: Line too long (140 > 120)

(E501)

tests/unittest/_torch/modeling/test_modeling_qwen2_5vl.py

102-102: Line too long (130 > 120)

(E501)

tensorrt_llm/_torch/models/modeling_qwen2vl.py

356-356: Line too long (142 > 120)

(E501)


357-357: Line too long (148 > 120)

(E501)


477-477: Line too long (140 > 120)

(E501)


566-566: Line too long (126 > 120)

(E501)

🔇 Additional comments (22)
examples/llm-api/quickstart_multimodal.py (2)

223-226: Double-check the correct accessor for generation logits and expected shape

The internal API snippet shows generation_logits shaped [beam_width, seq_length, vocab_size] after transpose; here you access output.outputs[0].generation_logits. Confirm that the sample-level accessor is intended (vs a top-level output.generation_logits) and that its shape semantics match your expectations before printing/summarizing.

If needed, we can adjust the print path to whichever accessor is canonical in the updated executor.


221-228: No missing CLI flags—add_llm_args already defines them
The flags --return_context_logits, --return_generation_logits, and --logprobs are all added in examples/llm-api/quickstart_advanced.py (lines 143–149) with default=False and action='store_true', so the checks in quickstart_multimodal.py will never raise an AttributeError.

tensorrt_llm/inputs/multimodal.py (2)

360-364: API extension looks good; positional compatibility preserved

Adding keyword: Optional[List[str]] = None after pin_memory preserves existing positional callers. Docstring updates clearly explain behavior. LGTM.


452-454: Generation strip behavior change: only keep mRoPE deltas

The narrowing to retain only mrope_position_deltas under mrope_config is sensible and aligns with MRope-only needs during generation. Please confirm no downstream consumer expects mrope_rotary_cos_sin at generation time; if any path does, this will now drop it.

Also applies to: 458-471

tensorrt_llm/_torch/models/checkpoints/__init__.py (1)

9-9: Expose Qwen2VL mapper in public API

Importing Qwen2VLHfWeightMapper here is correct and enables external access. LGTM.

tensorrt_llm/_torch/modules/attention.py (1)

27-27: Publicly exporting MRotaryEmbedding is appropriate

Bringing MRotaryEmbedding into this module’s namespace matches the new MRope path usage. No issues.

tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)

486-488: Bug: invalid use of Tensor.expand with -1; shape creation for dummy mrope_position_ids will raise at runtime

torch.Tensor.expand does not accept -1 as a size. The current code will throw. Create the (3, 1, token_num) shape by first viewing to 3D, then expanding with concrete sizes.

Apply this fix:

-                dummy_mrope_position_ids = torch.arange(
-                    0, token_num, dtype=torch.int32).expand(3, 1, -1).clone()
+                base = torch.arange(token_num, dtype=torch.int32).view(1, 1, token_num)
+                dummy_mrope_position_ids = base.expand(3, 1, token_num).clone()

Likely an incorrect or invalid review comment.

tensorrt_llm/_torch/models/checkpoints/hf/qwen2vl_weight_mapper.py (1)

6-21: Mapper correctly normalizes 'model.language_model.' → 'model.' prefixes before delegating to HfWeightMapper

The transformation is correct and scoped, then reuses the base filter logic. Clean and minimal. LGTM.

tensorrt_llm/_torch/models/modeling_qwen.py (2)

35-35: LGTM!

The addition of mrope_section parameter retrieval from the config is properly handled with a default of None.


49-49: LGTM!

The rope_fusion parameter is correctly derived from the disable_fuse_rope config option with proper negation logic.

tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)

48-59: LGTM! Clean implementation of mRoPE position IDs handling.

The conditional initialization of position IDs based on use_mrope is well-structured, using the appropriate 3D shape for mRoPE and 2D shape for standard RoPE.


129-139: LGTM! Proper handling of mRoPE data transfer.

The non-blocking copy of mrope_position_deltas ensures efficient data transfer without blocking the compute stream.

tensorrt_llm/_torch/modules/rotary_embedding.py (2)

130-142: LGTM! Well-structured MRotaryEmbedding class initialization.

The MRotaryEmbedding class properly extends RotaryEmbedding with mRoPE-specific parameters. The default mrope_section values [16, 24, 24] are appropriate for Qwen2/2.5-VL models.


153-165: LGTM! Elegant multi-section RoPE implementation.

The splitting and concatenation of cos/sin values based on mrope_section is well-implemented. The logic correctly handles the 3D position_ids for mRoPE while maintaining backward compatibility with 2D position_ids.

tensorrt_llm/_torch/pyexecutor/model_engine.py (2)

446-448: LGTM! Proper initialization of mRoPE position IDs buffer.

The 3D tensor allocation for mrope_position_ids_cuda with shape (3, 1, max_num_tokens) is correct for mRoPE.


1581-1600: LGTM! Proper handling of mRoPE vs standard position IDs.

The conditional logic correctly handles the 3D mRoPE position IDs when available and falls back to standard 2D position IDs otherwise.

tensorrt_llm/_torch/models/modeling_qwen2vl.py (6)

746-749: Only compute mRoPE config when needed; ensure it’s complete.

Good gating via disable_fuse_rope. Ensure mrope_config includes both mrope_rotary_cos_sin and mrope_position_deltas on the correct device/dtypes as in the previous refactor.


730-759: End-to-end data flow looks solid. Minor consistency checks.

  • mm_embeds computed from encoder or cached path, then fused—LGTM.
  • Passing mrope_config into llm.forward(...) is consistent with the new flow. After the above fixes, device/dtype mismatches should be resolved.

750-761: Log level/shape logging looks appropriate.

Context/generation split logging and final output shape debug are helpful and not noisy.


374-376: Ensure stability by using the public from_config API or strict version pinning

The private method _from_config isn’t part of the guaranteed public API and may change in future transformers releases. To avoid unexpected breakages:

• Replace

self.visual = model_class._from_config(config).eval()

with the public constructor:

self.visual = model_class.from_config(config).eval()

• If from_config isn’t available or doesn’t meet your needs in Transformers 4.53.x, pin your dependency explicitly (e.g. in setup.py or requirements.txt):

transformers>=4.53.0,<4.54.0

and add a comment noting why the pin is required.

Please verify that Qwen2VisionTransformerPretrainedModel.from_config exists and behaves as expected in your target version.


96-117: Filling masked positions with 1 is the established convention

A search across the codebase shows the exact same position_ids.masked_fill_(attention_mask == 0, 1) pattern in:

  • tensorrt_llm/runtime/multimodal_model_runner.py:1950
  • tensorrt_llm/runtime/generation.py:2898 and 2984

This confirms that using 1 for masked positions is intentional and consistent throughout. No change is needed here.


490-493: No unreachable early return in Qwen2VLModelBase.init

The if hasattr(self, "llm"): return guard in Qwen2VLModelBase.init can never fire under normal usage, because:

  • Subclasses (Qwen2VLModel and Qwen2_5_VLModel) do not set self.llm before calling super().__init__.
  • Within Qwen2VLModelBase.init, self.llm is only assigned after the mm_encoder setup (lines 487–491).

Therefore, when DISAGG is False, self.mm_encoder will always be initialized. You can safely ignore the original concern.

Likely an incorrect or invalid review comment.

@yechank-nvidia yechank-nvidia force-pushed the optimize_qwen2vl branch 2 times, most recently from 7095cdc to f7cd755 Compare September 6, 2025 07:28
@yechank-nvidia yechank-nvidia marked this pull request as ready for review September 6, 2025 07:31
@yechank-nvidia yechank-nvidia requested review from a team as code owners September 6, 2025 07:31
@yechank-nvidia
Copy link
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #17861 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #17861 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13373 completed with status: 'FAILURE'

@PerkzZheng
Copy link
Collaborator

The fmha_v2 changes LGTM. I will let others review the rest of the codes. Thanks!

@yechank-nvidia
Copy link
Collaborator Author

/bot run

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
@yechank-nvidia
Copy link
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19335 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19335 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14519 completed with status: 'FAILURE'

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
@yechank-nvidia
Copy link
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19357 [ run ] triggered by Bot

@yechank-nvidia
Copy link
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19384 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19384 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14560 completed with status: 'FAILURE'

@yechank-nvidia
Copy link
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19410 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19410 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14582 completed with status: 'FAILURE'

@yechank-nvidia
Copy link
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19442 [ run ] triggered by Bot

@yechank-nvidia
Copy link
Collaborator Author

/bot kill

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19492 [ kill ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19442 [ run ] completed with state ABORTED
LLM/main/L0_MergeRequest_PR #14611 (Blue Ocean) completed with status: ABORTED

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19492 [ kill ] completed with state SUCCESS
Successfully killed previous jobs for commit b6c29de

@yechank-nvidia
Copy link
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19494 [ run ] triggered by Bot

@yechank-nvidia
Copy link
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19534 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19494 [ run ] completed with state ABORTED
LLM/main/L0_MergeRequest_PR #14653 (Blue Ocean) completed with status: ABORTED

@tensorrt-cicd
Copy link
Collaborator

PR_Github #19534 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #14683 completed with status: 'SUCCESS'

@jaedeok-nvidia jaedeok-nvidia merged commit f77aca9 into NVIDIA:main Sep 22, 2025
5 checks passed
# NOTE: self.use_mrope is enough for differentiating whether to use mrope_position_ids but
# `_create_dummy_context_requests` from `kv_cache_creater` makes an exception that I can not add multimodal_data to the dummy_request
# so that we only replace position_ids with mrope_position_ids when it is not a dummy request and for models who is using mrope.
mrope_position_ids = torch.cat(mrope_position_ids,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've identified a potential device mismatch issue in the code. The tensor ctx_mrope_position_ids might be on the CUDA device, while gen_mrope_position_ids remains on the CPU device. This discrepancy will cause an error when attempting to concatenate these tensors.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Nekofish-L, thx for raising this issue! Yes, as you pointed out, we found that this cause the memory location discrepancy. We are trying to address this in this PR.

The solution is that we removed mrope_position_ids from multimodal_data_device_paths which makes it to reside on CPU region.

Thx again for the catch.

JunyiXu-nv pushed a commit to JunyiXu-nv/TensorRT-LLM that referenced this pull request Sep 22, 2025
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
nv-lschneider pushed a commit to nv-lschneider/TensorRT-LLM that referenced this pull request Sep 22, 2025
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants