[TRTLLM-7385][feat] Optimize Qwen2/2.5-VL performance #7250
Conversation
📝 Walkthrough
Adds mRoPE support and refactors Qwen2-VL weight loading and vision wiring. Introduces MRotaryEmbedding and mrope_section handling. Removes mrope_config from multiple model forward signatures. Updates multimodal data flow through the executor (position_ids/multimodal_params). Adds an HF weight mapper and tests for Qwen2_5-VL and selective to_device.
Changes
Sequence Diagram(s)
sequenceDiagram
autonumber
participant User
participant Model as Qwen2VLModel
participant WM as Qwen2VLHfWeightMapper
participant Vision as VisionEncoder
participant LLM as LLM
User->>Model: load_weights(weights, WM)
Model->>Model: filter_weights(prefix, weights)
Model->>Vision: load vision weights (if not DISAGG)
Model->>LLM: llm.load_weights(WM)
alt fuse_rope enabled
Model->>Model: prepare_mrope_config(...)
end
Model-->>User: load complete
sequenceDiagram
autonumber
participant Client
participant RM as ResourceManager
participant Engine as PyTorchModelEngine
participant CG as CUDAGraphRunner
participant Model as Model.forward
participant Attn as Attention/MRotaryEmbedding
Client->>RM: build dummy/real request (use_mrope?)
RM->>RM: attach py_multimodal_data.mrope_config (position_ids [+ deltas])
RM-->>Engine: LlmRequest
Engine->>Engine: _prepare_tp_inputs(use_mrope)
Engine->>CG: capture(inputs: position_ids, multimodal_params)
Client->>CG: run(inputs: position_ids, multimodal_params)
CG->>Model: forward(attn_metadata, position_ids, multimodal_params, ...)
alt rope_fusion disabled and is_mrope
Model->>Attn: MRotaryEmbedding.forward(position_ids, qkv)
else
Model->>Attn: RotaryEmbedding or fused path
end
Model-->>Client: logits
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Possibly related PRs
Suggested reviewers
Actionable comments posted: 11
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (11)
examples/llm-api/quickstart_multimodal.py (1)
1-1: Add standard NVIDIA multi-line header to Python example
The file examples/llm-api/quickstart_multimodal.py is missing the required NVIDIA copyright and license header.
Based on the most common header in other Python files (e.g. triton_backend/all_models/.../fbank.py), please prepend this exact block at the very top of the file:
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
• File: examples/llm-api/quickstart_multimodal.py (insert before line 1)
tensorrt_llm/_torch/attention_backend/interface.py (1)
1-1: Missing NVIDIA copyright header (2025) and future-annotations import
Per coding guidelines, prepend the NVIDIA copyright header to all Python sources. Also, several modules in this PR use PEP 604/585 typing elsewhere; adding from __future__ import annotations early helps keep Python 3.8 compatibility if annotations evolve here later.
Apply at file top:
+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ from __future__ import annotations
  import copy
  import weakref
tensorrt_llm/_torch/models/modeling_llama.py (1)
1-1: Missing NVIDIA copyright header (2025) and future-annotations import
Header is required. This module uses PEP 604 unions (|) and PEP 585 generics (tuple[...]). Add from __future__ import annotations immediately after the header to keep Python 3.8 compatibility by deferring annotation evaluation.
+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ from __future__ import annotations
  import copy
  import os
tensorrt_llm/_torch/models/modeling_llama_min_latency.py (1)
1-1: Missing NVIDIA copyright header (2025) and future-annotations import
This module uses PEP 604/585 typing in multiple places. Add the header and from __future__ import annotations at the top to ensure Python 3.8 compatibility and satisfy repository policy.
+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ from __future__ import annotations
  from collections.abc import Callable
  from functools import partial
tensorrt_llm/inputs/multimodal.py (1)
1-1: Missing NVIDIA copyright header (2025)
Please prepend the required header.
+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
  """Multimodal utilities for handling images and other media types in TensorRT-LLM."""
tests/unittest/_torch/multimodal/test_share_multiparams.py (1)
1-1: Missing required NVIDIA copyright header (2025)
Per repository guidelines, prepend the standard NVIDIA copyright header (current year) to all Python files, including tests.
tensorrt_llm/_torch/models/checkpoints/__init__.py (1)
1-1: Missing required NVIDIA copyright header (2025)
Please add the standard NVIDIA header at the top of this file.
tensorrt_llm/_torch/modules/attention.py (1)
1-1: Missing required NVIDIA copyright header (2025)
Add the standard header to comply with repo policy.
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
1-1: Missing required NVIDIA copyright header (2025)
Please prepend the standard NVIDIA header.
tensorrt_llm/_torch/models/modeling_qwen2vl.py (2)
244-265: Incorrect temporal padding computation (under-pads for non-multiples).
padded_num_frames = num_frames + num_frames % temporal_patch_size produces 12 for 10 frames with size 8; the next multiple is 16. Use ceil-to-multiple.
Apply:
- padded_num_frames = num_frames + num_frames % temporal_patch_size
+ padded_num_frames = (
+     ((num_frames + temporal_patch_size - 1) // temporal_patch_size)
+     * temporal_patch_size
+ )
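As a quick sanity check of the rounding, a minimal sketch (the helper name is hypothetical, not part of the PR):
def pad_to_multiple(num_frames: int, temporal_patch_size: int) -> int:
    # Round num_frames up to the next multiple of temporal_patch_size.
    return ((num_frames + temporal_patch_size - 1) // temporal_patch_size) * temporal_patch_size

assert pad_to_multiple(10, 8) == 16  # the old formula gives 10 + 10 % 8 == 12
assert pad_to_multiple(16, 8) == 16  # exact multiples stay unchanged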
609-711: Avoid padding to max_position_embeddings; vectorize and fix torch.concatenate usage.
Current approach pads to max_position_embeddings per sample, which is memory-heavy and risks OOM on GPU. You can directly index self.cos_ori/self.sin_ori with variable-length IDs and then reshape/split without padding. Also, prefer torch.cat (broader compatibility) and move deltas to the model device with correct dtype.
Apply:
-        if mrope_position_ids:
-            concat_cos_sin_list = []
-            for mrope_position_id in mrope_position_ids:
-                mrope_position_id = mrope_position_id.transpose(1, 0)
-
-                # TODO: Test whether padding is needed
-                mrope_position_ids_padding = torch.zeros(
-                    mrope_position_id.shape[:-1] +
-                    (self.model_config.pretrained_config.max_position_embeddings, ),
-                    dtype=torch.int32,
-                    device=mrope_position_id.device)
-                mrope_position_ids_padding[:, :, :mrope_position_id.shape[-1]] = mrope_position_id
-
-                mrope_position_ids_padding = mrope_position_ids_padding.to(self.cos_ori.device)
-                cos = self.cos_ori[mrope_position_ids_padding]
-                sin = self.sin_ori[mrope_position_ids_padding]
-
-                mrope_section = [16, 24, 24]
-                cos = torch.cat([
-                    m[:, i % 3]
-                    for i, m in enumerate(cos.split(mrope_section, dim=-1))
-                ], dim=-1).unsqueeze(-1)
-                sin = torch.cat([
-                    m[:, i % 3]
-                    for i, m in enumerate(sin.split(mrope_section, dim=-1))
-                ], dim=-1).unsqueeze(-1)
-                concat_cos_sin = torch.concatenate((cos, sin), axis=-1)
-                concat_cos_sin = concat_cos_sin.reshape(concat_cos_sin.shape[0], -1)
-                concat_cos_sin_list.append(concat_cos_sin)
-            mrope_config['mrope_rotary_cos_sin'] = torch.cat(
-                concat_cos_sin_list, dim=0).to(self.device)
+        if mrope_position_ids:
+            concat_cos_sin_list = []
+            sections = getattr(self.llm.config, "mrope_section", [16, 24, 24])
+            for mrope_position_id in mrope_position_ids:
+                pos = mrope_position_id.transpose(1, 0).to(self.cos_ori.device, dtype=torch.long)
+                cos = self.cos_ori[pos]
+                sin = self.sin_ori[pos]
+                cos = torch.cat([chunk[:, i % 3] for i, chunk in enumerate(cos.split(sections, dim=-1))], dim=-1).unsqueeze(-1)
+                sin = torch.cat([chunk[:, i % 3] for i, chunk in enumerate(sin.split(sections, dim=-1))], dim=-1).unsqueeze(-1)
+                concat = torch.cat((cos, sin), dim=-1).reshape(cos.shape[0], -1)
+                concat_cos_sin_list.append(concat)
+            mrope_config['mrope_rotary_cos_sin'] = torch.cat(concat_cos_sin_list, dim=0).to(self.device)
@@
-        if mrope_position_deltas:
-            mrope_config['mrope_position_deltas'] = torch.cat(
-                mrope_position_deltas, dim=0)
+        if mrope_position_deltas:
+            mrope_config['mrope_position_deltas'] = torch.cat(mrope_position_deltas, dim=0).to(self.device, dtype=torch.int32)
Note: consider making sections configurable instead of hardcoding [16, 24, 24].
🧹 Nitpick comments (26)
examples/llm-api/quickstart_multimodal.py (1)
221-228: Avoid dumping full logits to stdout; print shapes/summaries to prevent massive logs and slowdown
Raw context/generation logits can be extremely large (seq_len × vocab_size × beams), which can flood stdout, slow runs, and OOM terminals. Summarize instead and be resilient to missing attributes.
-        if args.return_context_logits:
-            print(f"[{i}] Context logits: {output.context_logits}")
-        if args.return_generation_logits:
-            print(
-                f"[{i}] Generation logits: {output.outputs[0].generation_logits}"
-            )
-        if args.logprobs:
-            print(f"[{i}] Logprobs: {output.outputs[0].logprobs}")
+        if getattr(args, "return_context_logits", False):
+            ctx = getattr(output, "context_logits", None)
+            if ctx is None:
+                print(f"[{i}] Context logits: None")
+            else:
+                shape = getattr(ctx, "shape", None)
+                dtype = getattr(ctx, "dtype", type(ctx))
+                print(f"[{i}] Context logits: shape={tuple(shape) if shape is not None else 'unknown'}, dtype={dtype}")
+
+        if getattr(args, "return_generation_logits", False):
+            gen = getattr(output.outputs[0], "generation_logits", None)
+            if gen is None:
+                print(f"[{i}] Generation logits: None")
+            else:
+                shape = getattr(gen, "shape", None)
+                dtype = getattr(gen, "dtype", type(gen))
+                print(f"[{i}] Generation logits: shape={tuple(shape) if shape is not None else 'unknown'}, dtype={dtype}")
+
+        if getattr(args, "logprobs", False):
+            lps = getattr(output.outputs[0], "logprobs", None)
+            if isinstance(lps, list):
+                head = ", ".join(f"{x:.4f}" for x in lps[:5])
+                more = "..." if len(lps) > 5 else ""
+                print(f"[{i}] Logprobs: n={len(lps)}, head=[{head}{more}]")
+            else:
+                print(f"[{i}] Logprobs: {lps}")
tensorrt_llm/_torch/attention_backend/interface.py (1)
506-508: mRoPE section: add light validation and docstring clarity
Good addition. To make failures obvious when a wrong shape is passed from model wiring, add validation in __post_init__ for mrope_section when present (e.g., positive ints, non-empty, optionally sum <= rope.dim) and briefly document expected semantics (Qwen2/2.5-VL commonly uses three sections). Keep it optional to avoid over-constraining new models.
Example guard (illustrative; adapt location accordingly):
if self.type.is_rope() and self.mrope_section is not None:
    assert all(isinstance(x, int) and x > 0 for x in self.mrope_section), "mrope_section must be positive ints"
    if self.rope and isinstance(self.rope.dim, int):
        assert sum(self.mrope_section) <= self.rope.dim, "sum(mrope_section) must not exceed rope.dim"
tensorrt_llm/_torch/models/modeling_llama.py (2)
182-183: Forward call now passes fewer positional args; prefer keywords to avoid drift
You removed one positional None in the forward_impl(...) call. To make this resilient to future signature changes and to ensure correctness now, pass these trailing optionals by name (e.g., kv_cache_params=None, lora_params=None, skip_attn_scaling=..., attention_sinks=None). This avoids accidental misalignment if forward_impl's parameter order differs across backends.
Example pattern:
attn_output = self.forward_impl(
    q,
    k,
    v,
    attn_metadata,
    attention_mask=attention_mask,
    all_reduce_params=all_reduce_params,
    skip_attn_scaling=skip_attn_scaling,
    attention_sinks=None,
)
Please confirm the exact parameter names.
433-435: Min-latency NOPE path: keep call-site robust with keywords
Same concern as above: the parent _forward_nope changed and you removed the mrope_config positional. Pass the trailing parameters by name to guard against parameter drift and improve readability.
return super()._forward_nope(
    position_ids,
    hidden_states,
    attn_metadata,
    attention_mask=attention_mask,
    all_reduce_params=all_reduce_params,
    skip_attn_scaling=skip_attn_scaling,
)
tensorrt_llm/_torch/models/modeling_llama_min_latency.py (1)
433-435: Call into base _forward_nope: use keyword args for clarity
The removed mrope_config positional looks correct. To prevent subtle regressions if _forward_nope evolves, pass the last arguments by name.
- return super()._forward_nope(position_ids, hidden_states, attn_metadata,
-                              attention_mask, all_reduce_params,
-                              skip_attn_scaling)
+ return super()._forward_nope(
+     position_ids,
+     hidden_states,
+     attn_metadata,
+     attention_mask=attention_mask,
+     all_reduce_params=all_reduce_params,
+     skip_attn_scaling=skip_attn_scaling,
+ )
tensorrt_llm/inputs/multimodal.py (2)
388-403: Path-filtered device move: validate dict type; behavior on empty list
Good defensive check that multimodal_data is a dict when keyword is provided. Consider treating keyword=[] as a no-op explicitly (currently it is), and optionally log when a path is not found to ease debugging (silent skip is fine for "Chill").
405-448: Helper mutates input dict in place; document this or deep-copy
result = data and subsequent assignments update the original dict. This is efficient, but note it in the docstring to avoid surprises for callers reusing the dict elsewhere. Alternatively, shallow-copy the parent dict(s) before mutation if you want isolation, at a small cost.
Example shallow-copy at top:
- result = data
+ result = dict(data)
And copy nested parents before mutation if needed.
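For illustration, a caller-side sketch of how the dot-path filter is used; the payload keys below are placeholders and a CUDA device is assumed:
import torch
from tensorrt_llm.inputs.multimodal import MultimodalParams

params = MultimodalParams(multimodal_data={
    "mrope_config": {"mrope_position_deltas": torch.zeros(1, 1, dtype=torch.int32)},
    "image": {"pixel_values": torch.randn(2, 3, 336, 336)},  # stand-in tensor
})
# Only the addressed leaf is moved; the image tensor stays on the CPU.
params.to_device("multimodal_data", "cuda",
                 keyword=["mrope_config.mrope_position_deltas"])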
tests/unittest/_torch/multimodal/test_share_multiparams.py (1)
149-167: Selective to_device path filter works; consider minor robustness tweaks
The new test validates the dot-path filtering behavior and confirms non-targeted tensors remain on CPU. Looks good. Two optional hardening ideas:
- Add a skip guard for environments without CUDA to avoid false negatives on CPU-only CI.
- Consider adding one more assertion for a non-existent path (e.g., "image.not_there") to confirm it’s safely ignored.
Example (optional):
@unittest.skipIf(not torch.cuda.is_available(), "CUDA required for this test") def test_to_device_with_keyword(self): ...tensorrt_llm/_torch/models/checkpoints/__init__.py (1)
18-19: Keep __all__ tidy (optional)
Appending the new symbol is fine. Consider keeping __all__ sorted alphabetically to reduce churn in future diffs.
tensorrt_llm/_torch/modules/attention.py (1)
267-279: Initialize MRotaryEmbedding defensively when mrope_section may be absent
The MRope path looks correct. To avoid AttributeError with older configs that don't carry pos_embd_params.mrope_section, construct MRotaryEmbedding without the parameter when it's missing (class has a sensible default). This preserves backward compatibility.
Apply this diff within the block:
- if self.pos_embd_params.type.is_mrope(): - self.rotary_emb = MRotaryEmbedding( - self.pos_embd_params.rope, - head_dim=self.head_dim, - is_neox=self.pos_embd_params.is_neox, - mrope_section=self.pos_embd_params.mrope_section, - ) + if self.pos_embd_params.type.is_mrope(): + if hasattr(self.pos_embd_params, "mrope_section"): + self.rotary_emb = MRotaryEmbedding( + self.pos_embd_params.rope, + head_dim=self.head_dim, + is_neox=self.pos_embd_params.is_neox, + mrope_section=self.pos_embd_params.mrope_section, + ) + else: + self.rotary_emb = MRotaryEmbedding( + self.pos_embd_params.rope, + head_dim=self.head_dim, + is_neox=self.pos_embd_params.is_neox, + )tensorrt_llm/_torch/pyexecutor/resource_manager.py (2)
484-484: Fix typo and wrap long NOTE to satisfy Ruff E501There’s a minor typo (“mrop_config”) and the line exceeds 120 chars. Reflow into two lines and fix the spelling.
Apply this diff:
- # NOTE: Planning to get dummy_data from each model. Before that, we need to add dummy mrop_config to the request here. + # NOTE: Planning to get dummy_data from each model. + # Before that, add a dummy mrope_config to the request here.
493-498: Dummy mrope_position_deltas shape looks fine; consider setdefault if py_multimodal_data may exist laterFor dummy requests it’s safe to overwrite py_multimodal_data. If this function ever reuses existing requests, prefer setdefault to avoid clobbering other dummy modalities.
Alternative (for future-proofing):
req.py_multimodal_data = getattr(req, "py_multimodal_data", {}) or {} req.py_multimodal_data.setdefault("mrope_config", {})["mrope_position_ids"] = dummy_mrope_position_ids ...tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1)
61-69: Consider using a factory function for MultimodalParams creation.The multimodal_params initialization with default mrope_config could be extracted into a helper method for better readability and reusability.
Consider refactoring to improve readability:
+ def _create_mrope_multimodal_params(self, batch_size: int, device: str): + """Create MultimodalParams with default mrope configuration.""" + return [ + MultimodalParams( + multimodal_data={ + "mrope_config": { + "mrope_position_deltas": + torch.zeros((1, 1), device=device, dtype=torch.int32) + } + }) for _ in range(batch_size) + ] + def __init__(self, ...): ... - self.multimodal_params = [ - MultimodalParams( - multimodal_data={ - "mrope_config": { - "mrope_position_deltas": - torch.zeros((1, 1), device=device, dtype=torch.int32) - } - }) for _ in range(batch_size) - ] if self.use_mrope else [] + self.multimodal_params = self._create_mrope_multimodal_params( + batch_size, device) if self.use_mrope else []tensorrt_llm/_torch/modules/rotary_embedding.py (1)
167-170: Consider adding a warning for unexpected 2D position_ids in mRoPE mode.Since MRotaryEmbedding is specifically for mRoPE, receiving 2D position_ids might indicate a configuration issue.
167-170: Consider adding a warning for unexpected 2D position_ids in mRoPE mode.
Since MRotaryEmbedding is specifically for mRoPE, receiving 2D position_ids might indicate a configuration issue.
else: # Fallback to the original RoPE where position_ids is 2D for dummy requests + import warnings + warnings.warn( + "MRotaryEmbedding received 2D position_ids, falling back to standard RoPE. " + "This may indicate a configuration issue.", + RuntimeWarning, + stacklevel=2 + ) cos_sin = self.rotary_cos_sin[position_ids.view(-1)] cos, sin = cos_sin[:, 0, :], cos_sin[:, 1, :]tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
1287-1306: Fix line length issues for better readability.The code logic is correct, but several lines exceed the 120-character limit, affecting readability.
Apply this formatting fix:
if self.use_mrope: ctx_mrope_position_ids = multimodal_params.multimodal_data[ - 'mrope_config'][ - 'mrope_position_ids'][:, :, - begin_compute:begin_compute + - len(prompt_tokens)] + 'mrope_config']['mrope_position_ids'][ + :, :, begin_compute:begin_compute + len(prompt_tokens)] mrope_position_ids.append(ctx_mrope_position_ids) # TODO: Visit later to decide the appropriate position of sending multimodal data & selectively sending multimodal data multimodal_params.to_device( "multimodal_data", "cuda", pin_memory=True, - keyword=getattr(self.model, "multimodal_data_device_paths", - None)) + keyword=getattr( + self.model, "multimodal_data_device_paths", None))
1441-1465: Consider extracting mRoPE position ID generation logic.The mRoPE position ID generation and multimodal parameter handling could be extracted into a helper method for better maintainability.
Consider refactoring into a helper method:
+ def _prepare_generation_mrope_params( + self, past_seen_token_num: int, + multimodal_data: dict) -> Tuple[torch.Tensor, MultimodalParams]: + """Prepare mRoPE parameters for generation requests.""" + mrope_position_deltas = multimodal_data[ + 'mrope_config']['mrope_position_deltas'] + gen_mrope_position_ids = ( + past_seen_token_num + mrope_position_deltas + ).expand(3, 1, 1) + + multimodal_params = MultimodalParams(multimodal_data=multimodal_data) + multimodal_params.strip_for_generation() + if multimodal_params.has_content(): + multimodal_params.to_device( + "multimodal_data", "cuda", pin_memory=True, + keyword=["mrope_config.mrope_position_deltas"]) + return gen_mrope_position_ids, multimodal_params + # Multimodal - multimodal_params = MultimodalParams( - multimodal_data=request.py_multimodal_data) - multimodal_params.strip_for_generation() - if multimodal_params.has_content(): + if request.py_multimodal_data and self.use_mrope: + gen_mrope_position_ids, multimodal_params = ( + self._prepare_generation_mrope_params( + past_seen_token_num, request.py_multimodal_data)) + mrope_position_ids.append(gen_mrope_position_ids) + multimodal_params_list.append(multimodal_params)tests/unittest/_torch/modeling/test_modeling_qwen2_5vl.py (2)
101-102: Fix line length in repr method.The return statement exceeds the 120-character limit.
Split the string for better readability:
def __repr__(self) -> str: - return f"modality:{self.modality.lower()}-use_cuda_graph:{self.use_cuda_graph}-disable_fuse_rope:{self.disable_fuse_rope}" + return (f"modality:{self.modality.lower()}-" + f"use_cuda_graph:{self.use_cuda_graph}-" + f"disable_fuse_rope:{self.disable_fuse_rope}")
134-136: Consider adding error handling for tokenizer initialization.The tokenizer loading could fail if the model path is invalid.
Add error handling:
def get_inputprocessor(self, model_path): - tokenizer = AutoTokenizer.from_pretrained(model_path) - return create_input_processor(model_path, tokenizer=tokenizer) + try: + tokenizer = AutoTokenizer.from_pretrained(model_path) + return create_input_processor(model_path, tokenizer=tokenizer) + except Exception as e: + raise RuntimeError(f"Failed to load tokenizer from {model_path}: {e}")tensorrt_llm/_torch/models/modeling_qwen2vl.py (8)
62-89: Align docstring with return shape and ensure consistent dtype/shape.
- Docstring says
mrope_position_deltashas shape(batch_size), but the code returns(B, 1). Keep them consistent.- Consider using
torch.int32for deltas at the source to avoid later casting churn.Apply:
- Returns: - position_ids: A tensor of shape (3, batch_size, sequence_length) - mrope_position_deltas: A tensor of shape (batch_size) + Returns: + position_ids: torch.IntTensor of shape (3, batch_size, sequence_length) + mrope_position_deltas: torch.IntTensor of shape (batch_size, 1)And when returning in the no-vision path:
- mrope_position_deltas = torch.zeros( + mrope_position_deltas = torch.zeros( [input_ids.shape[0], 1], device=input_ids.device, - dtype=input_ids.dtype, + dtype=torch.int32, )Also applies to: 232-235, 296-313
135-146: Avoid shadowinginput_idsparameter inside loop.Reusing
input_idsfor the per-row tensor shadows the function parameter and risks confusion.Apply:
- for i, input_ids in enumerate(total_input_ids): - input_ids = input_ids[attention_mask[i] == 1] + for i, seq_input_ids in enumerate(total_input_ids): + seq_input_ids = seq_input_ids[attention_mask[i] == 1] image_nums, video_nums = 0, 0 vision_start_indices = torch.argwhere( - input_ids == vision_start_token_id).squeeze(1) - vision_tokens = input_ids[vision_start_indices + 1] + seq_input_ids == vision_start_token_id).squeeze(1) + vision_tokens = seq_input_ids[vision_start_indices + 1] - input_tokens = input_ids.tolist() + input_tokens = seq_input_ids.tolist()
303-313: Make CPU clones explicit and unify dtypes.
- Keeping
mrope_position_idsas int64 is fine for indexing later, but spell out intent.mrope_position_deltasis cast to int32; consider casting IDs to int64 explicitly to avoid implicit conversions.Apply:
- mrope_config['mrope_position_ids'] = mrope_position_ids.to( - 'cpu').clone() + mrope_config['mrope_position_ids'] = mrope_position_ids.to('cpu', dtype=torch.long).clone()
349-358: Remove commented debug prints and long lines (Ruff E501).Dead debug prints add noise and break line-length checks. Please delete them.
Apply:
- # print(f"fused_input_ids: {fused_input_ids}") - # print(f"multimodal_data['image']['pixel_values'].shape: {multimodal_data['image']['pixel_values'].shape}") - # print(f"multimodal_data['image']['image_grid_thw']: {multimodal_data['image']['image_grid_thw']}") - # print(f"multimodal_data['mrope_config']['mrope_position_ids'].shape: {multimodal_data['mrope_config']['mrope_position_ids'].shape}") - # print(f"multimodal_data['mrope_config']['mrope_position_deltas'].shape: {multimodal_data['mrope_config']['mrope_position_deltas'].shape}")
521-533: Replace debug prints with logger.debug
Use
logger.debug and optionally gate behind a verbosity flag; dumping entire weights.keys() is noisy.
Apply:
- if not DISAGG: - print(f"weights.keys(): {weights.keys()}") - vision_encoder_weights = filter_weights("visual", weights) - print( - f"vision_encoder_weights.keys(): {vision_encoder_weights.keys()}" - ) + if not DISAGG: + vision_encoder_weights = filter_vision_weights("visual", weights) + logger.debug("Loaded %d vision weights", len(vision_encoder_weights))
32-34: Global naming: prefer G_ prefix for globals.
Per guidelines, globals should be UPPER_SNAKE_CASE prefixed with
G_. Consider renaming DISAGG to G_DISAGG for consistency.
Sample:
-DISAGG = os.getenv('TLLM_MULTIMODAL_DISAGGREGATED', '0') == '1'
+G_DISAGG = os.getenv('TLLM_MULTIMODAL_DISAGGREGATED', '0') == '1'
Update references accordingly.
356-357: Resolve Ruff E501: wrap or reflow long lines.
A few lines exceed 120 characters. Please wrap them to satisfy linters.
You can break the long comment at Line 477 and remove the debug prints at Lines 356–357 as suggested earlier. For other occurrences, wrap strings/comments across lines.
Also applies to: 477-479, 566-569
296-313: Add minimal docstrings for public helpers.
get_mrope_config and prepare_mrope_config are externally consumed via MultimodalParams; add short Google-style docstrings for clarity.
Also applies to: 609-711
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (19)
examples/llm-api/quickstart_multimodal.py (1 hunks)
tensorrt_llm/_torch/attention_backend/interface.py (1 hunks)
tensorrt_llm/_torch/models/checkpoints/__init__.py (2 hunks)
tensorrt_llm/_torch/models/checkpoints/hf/qwen2vl_weight_mapper.py (1 hunks)
tensorrt_llm/_torch/models/modeling_gpt_oss.py (1 hunks)
tensorrt_llm/_torch/models/modeling_hunyuan_moe.py (0 hunks)
tensorrt_llm/_torch/models/modeling_llama.py (1 hunks)
tensorrt_llm/_torch/models/modeling_llama_min_latency.py (1 hunks)
tensorrt_llm/_torch/models/modeling_qwen.py (2 hunks)
tensorrt_llm/_torch/models/modeling_qwen2vl.py (12 hunks)
tensorrt_llm/_torch/models/modeling_qwen3.py (0 hunks)
tensorrt_llm/_torch/modules/attention.py (2 hunks)
tensorrt_llm/_torch/modules/rotary_embedding.py (1 hunks)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (4 hunks)
tensorrt_llm/_torch/pyexecutor/model_engine.py (11 hunks)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1 hunks)
tensorrt_llm/inputs/multimodal.py (2 hunks)
tests/unittest/_torch/modeling/test_modeling_qwen2_5vl.py (1 hunks)
tests/unittest/_torch/multimodal/test_share_multiparams.py (1 hunks)
💤 Files with no reviewable changes (2)
- tensorrt_llm/_torch/models/modeling_qwen3.py
- tensorrt_llm/_torch/models/modeling_hunyuan_moe.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tensorrt_llm/_torch/attention_backend/interface.py
tensorrt_llm/_torch/models/checkpoints/hf/qwen2vl_weight_mapper.py
examples/llm-api/quickstart_multimodal.py
tensorrt_llm/_torch/models/modeling_llama_min_latency.py
tests/unittest/_torch/multimodal/test_share_multiparams.py
tensorrt_llm/_torch/models/modeling_qwen.py
tensorrt_llm/_torch/models/modeling_llama.py
tensorrt_llm/_torch/modules/attention.py
tensorrt_llm/_torch/modules/rotary_embedding.py
tensorrt_llm/_torch/models/checkpoints/__init__.py
tensorrt_llm/_torch/models/modeling_gpt_oss.py
tensorrt_llm/_torch/pyexecutor/resource_manager.py
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
tensorrt_llm/inputs/multimodal.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
tests/unittest/_torch/modeling/test_modeling_qwen2_5vl.py
tensorrt_llm/_torch/models/modeling_qwen2vl.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
tensorrt_llm/_torch/attention_backend/interface.py
tensorrt_llm/_torch/models/checkpoints/hf/qwen2vl_weight_mapper.py
examples/llm-api/quickstart_multimodal.py
tensorrt_llm/_torch/models/modeling_llama_min_latency.py
tests/unittest/_torch/multimodal/test_share_multiparams.py
tensorrt_llm/_torch/models/modeling_qwen.py
tensorrt_llm/_torch/models/modeling_llama.py
tensorrt_llm/_torch/modules/attention.py
tensorrt_llm/_torch/modules/rotary_embedding.py
tensorrt_llm/_torch/models/checkpoints/__init__.py
tensorrt_llm/_torch/models/modeling_gpt_oss.py
tensorrt_llm/_torch/pyexecutor/resource_manager.py
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
tensorrt_llm/inputs/multimodal.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
tests/unittest/_torch/modeling/test_modeling_qwen2_5vl.py
tensorrt_llm/_torch/models/modeling_qwen2vl.py
🧠 Learnings (1)
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.
Applied to files:
tensorrt_llm/_torch/pyexecutor/resource_manager.py
🧬 Code graph analysis (12)
tensorrt_llm/_torch/models/checkpoints/hf/qwen2vl_weight_mapper.py (3)
tensorrt_llm/_torch/models/checkpoints/hf/weight_mapper.py (1)
HfWeightMapper (10-99)
tensorrt_llm/_torch/models/modeling_utils.py (1)
register_mapper (597-608)
tensorrt_llm/_torch/models/modeling_qwen2vl.py (1)
filter_weights(35-42)
examples/llm-api/quickstart_multimodal.py (2)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
generation_logits (208-217)
tensorrt_llm/scaffolding/task.py (1)
logprobs(99-100)
tests/unittest/_torch/multimodal/test_share_multiparams.py (1)
tensorrt_llm/inputs/multimodal.py (1)
to_device(359-403)
tensorrt_llm/_torch/models/modeling_qwen.py (2)
tensorrt_llm/_torch/models/modeling_utils.py (1)
config (500-501)
tensorrt_llm/_torch/models/modeling_bert.py (1)
config(266-285)
tensorrt_llm/_torch/modules/attention.py (2)
tensorrt_llm/_torch/modules/rotary_embedding.py (2)
MRotaryEmbedding (130-193), RotaryEmbedding (10-127)
tensorrt_llm/functional.py (1)
is_mrope(707-708)
tensorrt_llm/_torch/modules/rotary_embedding.py (1)
tensorrt_llm/_torch/attention_backend/interface.py (1)
RopeParams(342-494)
tensorrt_llm/_torch/models/checkpoints/__init__.py (1)
tensorrt_llm/_torch/models/checkpoints/hf/qwen2vl_weight_mapper.py (1)
Qwen2VLHfWeightMapper(7-21)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
use_mrope(482-490)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
tensorrt_llm/inputs/multimodal.py (1)
MultimodalParams (152-475)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
use_mrope(482-490)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (2)
add_dummy_requests (68-69), add_dummy_requests (430-499)
tensorrt_llm/inputs/multimodal.py (4)
to_device (359-403), MultimodalParams (152-475), strip_for_generation (449-471), has_content (473-475)
tests/unittest/_torch/modeling/test_modeling_qwen2_5vl.py (11)
tensorrt_llm/_torch/attention_backend/utils.py (1)
get_attention_backend(10-24)tensorrt_llm/_torch/metadata.py (1)
KVCacheParams(9-31)tensorrt_llm/_torch/models/checkpoints/hf/qwen2vl_weight_mapper.py (1)
Qwen2VLHfWeightMapper(7-21)tensorrt_llm/_torch/models/modeling_qwen2vl.py (1)
Qwen2_5_VLModel(802-815)tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (3)
DecodingCUDAGraphRunner (12-143), capture (78-104), run (109-143)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (6)
KVCacheManager (141-943), add_dummy_requests (68-69), add_dummy_requests (430-499), shutdown (80-81), shutdown (355-356), shutdown (1120-1121)
tensorrt_llm/inputs/registry.py (1)
create_input_processor(274-303)tensorrt_llm/inputs/utils.py (1)
default_multimodal_input_loader(442-602)tensorrt_llm/inputs/data.py (1)
prompt_inputs(55-69)tensorrt_llm/inputs/multimodal.py (2)
MultimodalParams (152-475), to_device (359-403)
tensorrt_llm/_torch/model_config.py (1)
torch_dtype(147-152)
tensorrt_llm/_torch/models/modeling_qwen2vl.py (5)
tensorrt_llm/_torch/models/checkpoints/base_weight_mapper.py (1)
BaseWeightMapper(10-165)tensorrt_llm/_torch/models/modeling_utils.py (2)
filter_weights (710-716), config (500-501)
tensorrt_llm/runtime/multimodal_model_runner.py (1)
get_rope_index(1802-1968)tensorrt_llm/_torch/model_config.py (1)
torch_dtype(147-152)tensorrt_llm/inputs/multimodal.py (1)
MultimodalParams(152-475)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/resource_manager.py
484-484: Line too long (130 > 120)
(E501)
tensorrt_llm/_torch/pyexecutor/model_engine.py
1296-1296: Line too long (135 > 120)
(E501)
1462-1462: Line too long (131 > 120)
(E501)
1583-1583: Line too long (145 > 120)
(E501)
1584-1584: Line too long (140 > 120)
(E501)
tests/unittest/_torch/modeling/test_modeling_qwen2_5vl.py
102-102: Line too long (130 > 120)
(E501)
tensorrt_llm/_torch/models/modeling_qwen2vl.py
356-356: Line too long (142 > 120)
(E501)
357-357: Line too long (148 > 120)
(E501)
477-477: Line too long (140 > 120)
(E501)
566-566: Line too long (126 > 120)
(E501)
🔇 Additional comments (22)
examples/llm-api/quickstart_multimodal.py (2)
223-226: Double-check the correct accessor for generation logits and expected shape
The internal API snippet shows generation_logits shaped [beam_width, seq_length, vocab_size] after transpose; here you access output.outputs[0].generation_logits. Confirm that the sample-level accessor is intended (vs a top-level output.generation_logits) and that its shape semantics match your expectations before printing/summarizing.
If needed, we can adjust the print path to whichever accessor is canonical in the updated executor.
221-228: No missing CLI flags—add_llm_args already defines them
The flags --return_context_logits, --return_generation_logits, and --logprobs are all added in examples/llm-api/quickstart_advanced.py (lines 143–149) with default=False and action='store_true', so the checks in quickstart_multimodal.py will never raise an AttributeError.
tensorrt_llm/inputs/multimodal.py (2)
360-364: API extension looks good; positional compatibility preserved
Adding
keyword: Optional[List[str]] = None after pin_memory preserves existing positional callers. Docstring updates clearly explain behavior. LGTM.
452-454: Generation strip behavior change: only keep mRoPE deltas
The narrowing to retain only
mrope_position_deltas under mrope_config is sensible and aligns with MRope-only needs during generation. Please confirm no downstream consumer expects mrope_rotary_cos_sin at generation time; if any path does, this will now drop it.
Also applies to: 458-471
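A small sketch of the narrowed payload this implies (illustrative values only; CPU tensors suffice):
import torch
from tensorrt_llm.inputs.multimodal import MultimodalParams

params = MultimodalParams(multimodal_data={
    "mrope_config": {
        "mrope_position_ids": torch.zeros(3, 1, 8, dtype=torch.long),   # dropped
        "mrope_position_deltas": torch.zeros(1, 1, dtype=torch.int32),  # kept
    },
})
params.strip_for_generation()
# Expected to keep only mrope_config.mrope_position_deltas for the generation phase.
print(params.multimodal_data)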
tensorrt_llm/_torch/models/checkpoints/__init__.py (1)
9-9: Expose Qwen2VL mapper in public API
Importing Qwen2VLHfWeightMapper here is correct and enables external access. LGTM.
tensorrt_llm/_torch/modules/attention.py (1)
27-27: Publicly exporting MRotaryEmbedding is appropriate
Bringing MRotaryEmbedding into this module's namespace matches the new MRope path usage. No issues.
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
486-488: Bug: invalid use of Tensor.expand with -1; shape creation for dummy mrope_position_ids will raise at runtime
torch.Tensor.expand does not accept -1 as a size. The current code will throw. Create the (3, 1, token_num) shape by first viewing to 3D, then expanding with concrete sizes.
Apply this fix:
- dummy_mrope_position_ids = torch.arange(
-     0, token_num, dtype=torch.int32).expand(3, 1, -1).clone()
+ base = torch.arange(token_num, dtype=torch.int32).view(1, 1, token_num)
+ dummy_mrope_position_ids = base.expand(3, 1, token_num).clone()
Likely an incorrect or invalid review comment.
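For reference, a quick standalone check suggests expand does accept -1 for an existing dimension, which supports the "likely invalid" note above (illustrative snippet):
import torch

base = torch.arange(10, dtype=torch.int32)  # shape (10,)
out = base.expand(3, 1, -1).clone()         # -1 keeps the existing size; new dims are prepended
assert out.shape == (3, 1, 10)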
tensorrt_llm/_torch/models/checkpoints/hf/qwen2vl_weight_mapper.py (1)
6-21: Mapper correctly normalizes 'model.language_model.' → 'model.' prefixes before delegating to HfWeightMapper
The transformation is correct and scoped, then reuses the base filter logic. Clean and minimal. LGTM.
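A hypothetical sketch of the prefix rewrite described above (the real mapper delegates to HfWeightMapper after this step; the helper name here is made up):
def normalize_keys(weights: dict) -> dict:
    renamed = {}
    prefix = "model.language_model."
    for name, tensor in weights.items():
        if name.startswith(prefix):
            name = "model." + name[len(prefix):]
        renamed[name] = tensor
    return renamed

assert list(normalize_keys({"model.language_model.layers.0.q_proj.weight": 0})) == [
    "model.layers.0.q_proj.weight"
]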
tensorrt_llm/_torch/models/modeling_qwen.py (2)
35-35: LGTM!
The addition of mrope_section parameter retrieval from the config is properly handled with a default of None.
49-49: LGTM!
The rope_fusion parameter is correctly derived from the disable_fuse_rope config option with proper negation logic.
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (2)
48-59: LGTM! Clean implementation of mRoPE position IDs handling.
The conditional initialization of position IDs based on use_mrope is well-structured, using the appropriate 3D shape for mRoPE and 2D shape for standard RoPE.
129-139: LGTM! Proper handling of mRoPE data transfer.
The non-blocking copy of mrope_position_deltas ensures efficient data transfer without blocking the compute stream.
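A minimal sketch of the non-blocking copy pattern, assuming a preallocated buffer captured by the CUDA graph (names and shapes are assumptions; needs a CUDA device):
import torch

static_deltas = torch.zeros(1, 1, dtype=torch.int32, device="cuda")

def feed_deltas(new_deltas: torch.Tensor) -> None:
    # Copy from pinned host memory into the captured buffer without
    # synchronizing the compute stream.
    static_deltas.copy_(new_deltas.pin_memory(), non_blocking=True)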
tensorrt_llm/_torch/modules/rotary_embedding.py (2)
130-142: LGTM! Well-structured MRotaryEmbedding class initialization.
The MRotaryEmbedding class properly extends RotaryEmbedding with mRoPE-specific parameters. The default mrope_section values [16, 24, 24] are appropriate for Qwen2/2.5-VL models.
153-165: LGTM! Elegant multi-section RoPE implementation.
The splitting and concatenation of cos/sin values based on mrope_section is well-implemented. The logic correctly handles the 3D position_ids for mRoPE while maintaining backward compatibility with 2D position_ids.
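A toy illustration of the section interleaving (shapes are assumptions, not the module's actual layout):
import torch

mrope_section = [16, 24, 24]    # sums to head_dim // 2 == 64
cos = torch.randn(3, 10, 64)    # (temporal/height/width axes, tokens, head_dim // 2)

# Split the last dim into sections and take axis i % 3 from the i-th chunk, so
# temporal, height and width frequencies alternate across the head dimension.
mixed = torch.cat(
    [chunk[i % 3] for i, chunk in enumerate(cos.split(mrope_section, dim=-1))],
    dim=-1)
assert mixed.shape == (10, 64)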
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
446-448: LGTM! Proper initialization of mRoPE position IDs buffer.
The 3D tensor allocation for mrope_position_ids_cuda with shape (3, 1, max_num_tokens) is correct for mRoPE.
1581-1600: LGTM! Proper handling of mRoPE vs standard position IDs.
The conditional logic correctly handles the 3D mRoPE position IDs when available and falls back to standard 2D position IDs otherwise.
tensorrt_llm/_torch/models/modeling_qwen2vl.py (6)
746-749: Only compute mRoPE config when needed; ensure it's complete.
Good gating via disable_fuse_rope. Ensure mrope_config includes both mrope_rotary_cos_sin and mrope_position_deltas on the correct device/dtypes as in the previous refactor.
730-759: End-to-end data flow looks solid. Minor consistency checks.
- mm_embeds computed from encoder or cached path, then fused—LGTM.
- Passing mrope_config into llm.forward(...) is consistent with the new flow. After the above fixes, device/dtype mismatches should be resolved.
750-761: Log level/shape logging looks appropriate.
Context/generation split logging and final output shape debug are helpful and not noisy.
374-376: Ensure stability by using the public from_config API or strict version pinning
The private method _from_config isn't part of the guaranteed public API and may change in future transformers releases. To avoid unexpected breakages:
• Replace self.visual = model_class._from_config(config).eval() with the public constructor: self.visual = model_class.from_config(config).eval()
• If from_config isn't available or doesn't meet your needs in Transformers 4.53.x, pin your dependency explicitly (e.g. in setup.py or requirements.txt): transformers>=4.53.0,<4.54.0 and add a comment noting why the pin is required.
Please verify that Qwen2VisionTransformerPretrainedModel.from_config exists and behaves as expected in your target version.
96-117: Filling masked positions with 1 is the established convention
A search across the codebase shows the exact same position_ids.masked_fill_(attention_mask == 0, 1) pattern in:
- tensorrt_llm/runtime/multimodal_model_runner.py:1950
- tensorrt_llm/runtime/generation.py:2898 and 2984
This confirms that using 1 for masked positions is intentional and consistent throughout. No change is needed here.
490-493: No unreachable early return in Qwen2VLModelBase.__init__
The if hasattr(self, "llm"): return guard in Qwen2VLModelBase.__init__ can never fire under normal usage, because:
- Subclasses (Qwen2VLModel and Qwen2_5_VLModel) do not set self.llm before calling super().__init__.
- Within Qwen2VLModelBase.__init__, self.llm is only assigned after the mm_encoder setup (lines 487–491).
Therefore, when DISAGG is False, self.mm_encoder will always be initialized. You can safely ignore the original concern.
Likely an incorrect or invalid review comment.
7095cdc to f7cd755 (Compare)
/bot run
PR_Github #17861 [ run ] triggered by Bot
PR_Github #17861 [ run ] completed with state
The fmha_v2 changes LGTM. I will let others review the rest of the code. Thanks!
f7cd755 to 805b4f6 (Compare)
/bot run
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
c83cbaf to b92fbf6 (Compare)
/bot run
PR_Github #19335 [ run ] triggered by Bot
PR_Github #19335 [ run ] completed with state
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
/bot run
PR_Github #19357 [ run ] triggered by Bot
/bot run
PR_Github #19384 [ run ] triggered by Bot
PR_Github #19384 [ run ] completed with state
/bot run
PR_Github #19410 [ run ] triggered by Bot
PR_Github #19410 [ run ] completed with state
/bot run
PR_Github #19442 [ run ] triggered by Bot
/bot kill
PR_Github #19492 [ kill ] triggered by Bot
PR_Github #19442 [ run ] completed with state
PR_Github #19492 [ kill ] completed with state
/bot run
PR_Github #19494 [ run ] triggered by Bot
/bot run
PR_Github #19534 [ run ] triggered by Bot
PR_Github #19494 [ run ] completed with state
PR_Github #19534 [ run ] completed with state
# NOTE: self.use_mrope is enough for differentiating whether to use mrope_position_ids but
# `_create_dummy_context_requests` from `kv_cache_creater` makes an exception that I can not add multimodal_data to the dummy_request
# so that we only replace position_ids with mrope_position_ids when it is not a dummy request and for models who is using mrope.
mrope_position_ids = torch.cat(mrope_position_ids,
I've identified a potential device mismatch issue in the code. The tensor ctx_mrope_position_ids might be on the CUDA device, while gen_mrope_position_ids remains on the CPU device. This discrepancy will cause an error when attempting to concatenate these tensors.
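For context, a minimal reproduction of the failure mode being described (device names are illustrative; needs a CUDA machine):
import torch

ctx_ids = torch.zeros(3, 1, 8, dtype=torch.int32, device="cuda")  # context-phase IDs
gen_ids = torch.zeros(3, 1, 1, dtype=torch.int32)                 # generation-phase IDs on CPU
try:
    torch.cat([ctx_ids, gen_ids], dim=-1)   # mixing devices raises a RuntimeError
except RuntimeError as err:
    print(err)
# Aligning devices first avoids it; the fix below instead keeps both on the CPU.
torch.cat([ctx_ids, gen_ids.to(ctx_ids.device)], dim=-1)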
Hi @Nekofish-L, thanks for raising this issue! Yes, as you pointed out, we found that this causes the memory location discrepancy. We are trying to address this in this PR.
The solution is that we removed mrope_position_ids from multimodal_data_device_paths, which keeps it in CPU memory.
Thanks again for the catch.
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Summary by CodeRabbit
New Features
Performance
Improvements
Tests