[TRTLLM-8579][feat] Support quantized model for nano-v2-vlm by Wanli-Jiang · Pull Request #8304 · NVIDIA/TensorRT-LLM · GitHub

Conversation

@Wanli-Jiang
Collaborator

@Wanli-Jiang commented on Oct 13, 2025

Features:

  • Support for FP8 models (tested locally on H100 and RTX PRO 6000).
  • Support for NVFP4 models (tested locally on RTX PRO 6000).
  • Remove the hard requirement for PIL image inputs, since the HF input processor also supports tensor inputs.

Summary by CodeRabbit

  • New Features
    • Configurable quantization: You can now adjust quantization settings for vision-language models via model configuration.
  • Refactor
    • Vision encoder runs with quantization disabled by default for more stable behavior.
    • Streamlined image/video preprocessing by sending inputs directly to the processor (removes unnecessary conversions).
  • Bug Fixes
    • Prevented errors when loading weights if certain normalization keys are absent.
  • Documentation
    • Clarified that quantization settings can be modified while other attributes remain frozen by default.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enables access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
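
For example, a run that reruns only a single test stage with fail-fast disabled (stage name taken from the examples above) looks like:

    /bot run --stage-list "A10-PyTorch-1" --disable-fail-fast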

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since insufficient care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since insufficient care and validation can break the top of tree.

* Support for FP8 model.

Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
@Wanli-Jiang force-pushed the user/williamj/support-nanov2vlm-quant branch from 9002587 to 7fc73f7 on October 13, 2025 07:23
@Wanli-Jiang marked this pull request as ready for review on October 13, 2025 07:23
@Wanli-Jiang requested review from a team as code owners on October 13, 2025 07:23
@Wanli-Jiang
Collaborator Author

/bot run

@coderabbitai
Contributor

coderabbitai bot commented Oct 13, 2025

📝 Walkthrough


Expands ModelConfig mutability to include quant_config. Updates RADIOVisionModel to optionally disable quantization by deep-copying and overriding quant_config while preserving kv_cache_quant_algo. Adjusts weight loading key removals to be conditional. Simplifies NanoV2VLM image/video preprocessing by removing tensor-to-PIL conversions and directly invoking the processor.

Changes

Cohort / File(s) — Summary of changes
  • ModelConfig mutability — tensorrt_llm/_torch/model_config.py: Allows modifying quant_config in an otherwise frozen ModelConfig; updates the docstring to reflect the permitted mutable fields.
  • NanoV2VLM preprocessing — tensorrt_llm/_torch/models/modeling_nanov2vlm.py: Instantiates RADIOVisionModel with disable_quantization=True; removes Tensor→PIL conversions for images/videos and calls the processor directly on the provided inputs; retains downstream multimodal handling.
  • RADIO vision model quantization & loading — tensorrt_llm/_torch/models/modeling_radio.py: Adds a disable_quantization flag (default True); deep-copies model_config and, when quantization is disabled, replaces quant_config with a new QuantConfig while preserving kv_cache_quant_algo; passes the adjusted config to submodules; makes load_weights key removals conditional on key presence (see the sketch below).
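
A minimal Python sketch of the quantization-disable pattern from the last row above, assuming QuantConfig from tensorrt_llm.models.modeling_utils; the class and attribute layout are illustrative, not the actual RADIOVisionModel implementation.

    import copy

    from tensorrt_llm.models.modeling_utils import QuantConfig

    class VisionEncoderSketch:
        """Illustrative vision wrapper that opts out of weight quantization."""

        def __init__(self, model_config, disable_quantization: bool = True):
            # Deep-copy so the caller's (LLM-side) config is never mutated.
            self.model_config = copy.deepcopy(model_config)
            if disable_quantization and self.model_config.quant_config is not None:
                # Keep only the KV-cache algorithm; drop weight/activation quant.
                kv_algo = self.model_config.quant_config.kv_cache_quant_algo
                self.model_config.quant_config = QuantConfig(
                    kv_cache_quant_algo=kv_algo)
            # Submodules (e.g. the vision transformer) are then built from
            # self.model_config rather than the caller's original config.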

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor App as Caller
  participant NVLM as NanoV2VLM
  participant RVM as RADIOVisionModel
  participant VT as VisionTransformer
  participant Proc as Processor

  App->>NVLM: init(...)
  note over NVLM: Create RADIOVisionModel with disable_quantization=true
  NVLM->>RVM: __init__(model_config, disable_quantization=true)
  activate RVM
  RVM->>RVM: model_config' = deepcopy(model_config)
  alt quant_config present AND disable_quantization
    RVM->>RVM: model_config'.quant_config = QuantConfig()<br/>(preserve kv_cache_quant_algo)
  end
  RVM->>VT: init(model_config')
  deactivate RVM
sequenceDiagram
  autonumber
  actor User as Caller
  participant NVLM as NanoV2VLM
  participant Proc as Processor
  participant RVM as RADIOVisionModel

  User->>NVLM: generate(images/videos, text, ...)
  note over NVLM: Directly pass inputs to processor<br/>No Tensor→PIL conversion
  NVLM->>Proc: __call__(images or video, text, ...)
  Proc-->>NVLM: processed batches (tensors & tokens)
  NVLM->>RVM: encode_vision(processed inputs)
  RVM-->>NVLM: visual features
  NVLM-->>User: outputs

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 33.33%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Description Check — ⚠️ Warning: The PR description does not adhere to the repository's required template: it omits the @coderabbitai summary or formatted title at the top, introduces a non-standard "## Features" section, and leaves both the ## Description and ## Test Coverage sections empty, preventing clear documentation of the issue, solution, and tests. Resolution: update the PR body to follow the template by including either the @coderabbitai summary or a correctly formatted title at the top, explaining the issue and solution under ## Description, and listing relevant tests under ## Test Coverage before confirming the checklist.
✅ Passed checks (1 passed)
  • Title Check — ✅ Passed: The title follows the repository's convention with a valid JIRA ticket and type tag and succinctly describes the primary feature: supporting quantized models for nano-v2-vlm.


@coderabbitai
Contributor

coderabbitai bot commented Oct 13, 2025

📝 Walkthrough


ModelConfig now permits updating quant_config when frozen. RADIOVisionModel gains a disable_quantization flag, deep-copies and conditionally adjusts model_config, and passes the adjusted config downstream while hardening weight-loading. VLM pipelines bypass tensor-to-PIL conversions, sending images/videos directly to the processor and adjusting video patch/pixel accumulation.

Changes

Cohort / File(s) — Summary of edits
  • Config mutability adjustment — tensorrt_llm/_torch/model_config.py: Allow quant_config updates on a frozen ModelConfig via __setattr__; docstring updated accordingly (see the sketch below).
  • Vision model quantization control — tensorrt_llm/_torch/models/modeling_radio.py: Add a disable_quantization parameter (default True); deep-copy model_config into self.model_config and, when disabling quantization, replace/adjust quant_config while preserving kv_cache_quant_algo; pass self.model_config to VisionTransformer and the base class; make weight-loading key removals conditional.
  • VLM image/video preprocessing simplification — tensorrt_llm/_torch/models/modeling_nanov2vlm.py: Construct RADIOVisionModel with disable_quantization=True; remove tensor→PIL conversions for image/video inputs and feed them directly to the processor; rework the video loop to process per-video outputs and accumulate num_patches and pixel_values from the processor.
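
A self-contained sketch of the freeze-with-exemption mechanism from the first row; the toy class, its fields, and the freeze trigger are illustrative, not the actual ModelConfig code.

    from dataclasses import dataclass, field
    from typing import Any, Optional

    @dataclass
    class FrozenishConfig:
        """Toy config: after freeze(), every field is read-only except quant_config."""

        quant_config: Optional[Any] = None
        hidden_size: int = 0
        _frozen: bool = field(default=False, repr=False)

        def freeze(self) -> None:
            object.__setattr__(self, "_frozen", True)

        def __setattr__(self, name: str, value: Any) -> None:
            if getattr(self, "_frozen", False) and name != "quant_config":
                raise AttributeError(f"Config is frozen; cannot set '{name}'")
            object.__setattr__(self, name, value)

    cfg = FrozenishConfig(hidden_size=1024)
    cfg.freeze()
    cfg.quant_config = "nvfp4-placeholder"  # still allowed after freezing
    # cfg.hidden_size = 2048                # would raise AttributeError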

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Caller
  participant RADIOVisionModel
  participant VisionTransformer
  participant Base as RADIOVisionModelBase

  Caller->>RADIOVisionModel: __init__(model_config, disable_quantization=True)
  activate RADIOVisionModel
  RADIOVisionModel->>RADIOVisionModel: deepcopy(model_config) -> self.model_config
  alt disable_quantization == True
    RADIOVisionModel->>RADIOVisionModel: Adjust self.model_config.quant_config (preserve kv_cache_quant_algo)
  else disable_quantization == False
    RADIOVisionModel->>RADIOVisionModel: Keep quant_config as-is
  end
  RADIOVisionModel->>VisionTransformer: init(self.model_config)
  RADIOVisionModel->>Base: init(self.model_config)
  RADIOVisionModel->>RADIOVisionModel: load_weights (safely ignore missing keys)
  deactivate RADIOVisionModel
sequenceDiagram
  autonumber
  participant App as VLM Caller
  participant VLM as NanoV2VLM
  participant Proc as Processor
  participant Vision as RADIOVisionModel

  App->>VLM: generate(images/videos, prompts)
  Note over VLM: No tensor→PIL conversion
  alt Images
    VLM->>Proc: process(images)
    Proc-->>VLM: pixel_values, num_patches
  else Videos
    loop each video
      VLM->>Proc: process(video)
      Proc-->>VLM: pixel_values_i, num_patches_i
      VLM->>VLM: accumulate pixel_values, num_patches
    end
  end
  VLM->>Vision: forward(pixel_values, num_patches)
  Vision-->>VLM: embeddings
  VLM-->>App: outputs
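
A rough sketch of the per-video accumulation loop in the diagram above; the processor call signature and the output keys pixel_values and num_patches are assumptions for illustration, not the exact names used in modeling_nanov2vlm.py.

    import torch

    def preprocess_videos(processor, videos):
        """Accumulate processor outputs video by video (illustrative only)."""
        pixel_values, num_patches = [], []
        for video in videos:
            # Frames go to the processor as tensors; no tensor->PIL conversion.
            out = processor(images=list(video), return_tensors="pt")
            pixel_values.append(out["pixel_values"])
            num_patches.extend(out["num_patches"])
        return torch.cat(pixel_values, dim=0), num_patches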

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Description Check — ⚠️ Warning: The PR description does not follow the repository's required template: it uses an unstructured "## Features:" section, leaves the "@coderabbitai summary" placeholder unpopulated, and provides no content under the ## Description and ## Test Coverage headings, making it unclear what issue is addressed, what solution is implemented, and how the changes are validated. Resolution: restructure the description to match the template by removing the "## Features:" section, populating the "@coderabbitai summary" line, explaining the problem and solution under ## Description, and listing relevant tests under ## Test Coverage; replace all placeholder comments with actual content before merging.
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 33.33%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
  • Title Check — ✅ Passed: The title includes the JIRA ticket identifier and the feature tag in the prescribed format and succinctly describes the main change, support for a quantized model in nano-v2-vlm, so it adheres to the repository's title guidelines.

📜 Recent review details

Configuration used: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9fe63dd and 7fc73f7.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/model_config.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_nanov2vlm.py (2 hunks)
  • tensorrt_llm/_torch/models/modeling_radio.py (6 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tensorrt_llm/_torch/models/modeling_nanov2vlm.py
  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/_torch/models/modeling_radio.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tensorrt_llm/_torch/models/modeling_nanov2vlm.py
  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/_torch/models/modeling_radio.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tensorrt_llm/_torch/models/modeling_nanov2vlm.py
  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/_torch/models/modeling_radio.py
🧬 Code graph analysis (2)
tensorrt_llm/_torch/models/modeling_nanov2vlm.py (2)
tensorrt_llm/_torch/models/modeling_radio.py (1)
  • RADIOVisionModel (772-919)
tensorrt_llm/runtime/multimodal_model_runner.py (1)
  • processor (680-683)
tensorrt_llm/_torch/models/modeling_radio.py (2)
tensorrt_llm/models/modeling_utils.py (1)
  • QuantConfig (131-271)
tensorrt_llm/_torch/model_config.py (1)
  • ModelConfig (110-594)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (10)
tensorrt_llm/_torch/model_config.py (2)

163-165: LGTM!

The docstring accurately reflects the new behavior allowing quant_config modification for VLM quantization control.


170-171: quant_config exemption in freeze guard is safe
Only the intended override in modeling_radio.py writes to model_config.quant_config when frozen; no other assignments were found.

tensorrt_llm/_torch/models/modeling_nanov2vlm.py (2)

73-74: LGTM!

Passing disable_quantization=True to RADIOVisionModel aligns with the new quantization control introduced in modeling_radio.py and is appropriate for vision-only models.


303-305: Approve direct tensor inputs to processor

The transformers AutoImageProcessor supports torch.Tensor inputs (C, H, W) for single images and batches, so passing video frames as tensors is valid.
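
A small sketch of this point, using a generic transformers image processor; the checkpoint name is only a placeholder and unrelated to this PR.

    import torch
    from transformers import AutoImageProcessor

    # Placeholder checkpoint; any model that ships an image processor behaves the same.
    processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")

    frame = torch.randint(0, 256, (3, 224, 224), dtype=torch.uint8)  # (C, H, W) tensor
    batch = processor(images=frame, return_tensors="pt")
    print(batch["pixel_values"].shape)  # e.g. torch.Size([1, 3, 224, 224])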

tensorrt_llm/_torch/models/modeling_radio.py (6)

5-5: LGTM!

The copy import is necessary for the deep-copy logic introduced in RADIOVisionModel.__init__.


25-25: LGTM!

The QuantConfig import is necessary for the conditional quantization control introduced in RADIOVisionModel.__init__.


826-826: LGTM!

Passing self.model_config (instead of the original model_config) to VisionTransformer ensures that the per-instance quantization control is propagated correctly.


865-865: LGTM!

Passing self.model_config (instead of the original model_config) to RADIOVisionModelBase ensures that the per-instance quantization control is propagated correctly.


878-881: LGTM!

Adding conditional checks before removing unexpected keys makes the weight-loading more robust and prevents ValueError when loading weights from models that don't have these specific keys.
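
A minimal sketch of that conditional-removal pattern; the key names are hypothetical, not the ones in modeling_radio.py.

    def drop_unexpected_keys(weights: dict) -> None:
        """Remove optional checkpoint keys only when they are present (hypothetical names)."""
        for key in ("vision.norm_mean", "vision.norm_std"):
            if key in weights:
                del weights[key]  # no error when a checkpoint lacks these keys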


775-793: Approve per-instance quantization change. model_config is deep-copied to avoid mutating the original, and preserving only kv_cache_quant_algo matches the behavior of apply_quant_config_exclude_modules.



@tensorrt-cicd
Collaborator

PR_Github #21171 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #21171 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15984 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@Wanli-Jiang merged commit ebf0e51 into NVIDIA:main on Oct 16, 2025
11 checks passed
govind-ramnarayan pushed a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request Oct 21, 2025

Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
