[#5860][feat] Add ModelOPT INT4 awq fake quant support in AutoDeploy by Fridah-nv · Pull Request #7770 · NVIDIA/TensorRT-LLM · GitHub
@Fridah-nv
Collaborator

@Fridah-nv Fridah-nv commented Sep 16, 2025

This PR does the following:

  • Add INT4AWQ support for the unified HF checkpoint produced by ModelOPT

Tests E2E:

python build_and_run_ad.py --model "/workspaces/tensorrt_llm/models/saved_models_Qwen2_5-0_5B-Instruct_int4_awq" --args.world-size 1 --args.compile-backend "torch-simple" --args.attn-backend "flashinfer" --benchmark.enabled False

Output

[09/29/2025-19:17:34] [TRT-LLM AUTO-DEPLOY] [I] [PROMPT 0] How big is the universe? : 1. The Universe: What Big Is? More than 91 trillion trillions galileo's star catalogues (2);
2. The universe is made of about 210 followed by digits years, and billions followed by star catalogues planets catalogues by about 0. So if you take everything the stars catalogues? it will take me 0s it will take 0s it takes us an incomprehensible giant a galaxy. To sum it up, It takes millions
[09/29/2025-19:17:34] [TRT-LLM AUTO-DEPLOY] [I] [PROMPT 1] In simple words and a single sentence, explain the concept of gravity: : 3 answers.
Explain it in a way a 5th grader could understand. How to visualize using toys and planets as examples.
How gravity works when the toy planets are viewed from above and let the toy planets going around Earth see an actual planet.
Isaac Newton. He found that the gravitational force that the planets apply to each other repels each other... And it holds all of the planets in orbit around the Sun.
What a man who was in charge of astronomy.

Summary by CodeRabbit

  • New Features

    • Added INT4 weight-only quantization support with graph fusion for linear layers (bias and bias-less).
    • Introduced an INT4 fake-quant linear path to enable export-safe builds.
    • Optional restoration from saved optimization state during model loading.
    • Improved export compatibility via a legacy tensor quantization operator.
  • Chores

    • Updated the default transform pipeline to include the INT4 quantization pass.
  • Tests

    • Added unit tests verifying INT4 fake-quant linear parity with a reference implementation across bias and scale layouts.

Description

Test Coverage

Tested with

python build_and_run_ad.py --model "/workspaces/tensorrt_llm/tmp/modelopt/examples/llm_ptq/saved_models_Qwen2_5-0_5B-Instruct_int4_awq_hf" --args.world-size 1 --args.compile-backend "torch-simple" --args.attn-backend "flashinfer" --benchmark.enabled False 

The checkpoint was produced with the ModelOpt llm_ptq example, without KV-cache quantization, and the model was saved with full_model.save_pretrained(export_path).

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option is always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@Fridah-nv Fridah-nv self-assigned this Sep 16, 2025
@Fridah-nv Fridah-nv force-pushed the user/fridah/int4 branch 3 times, most recently from 05c0de4 to 7a5647c Compare September 16, 2025 20:18
@Fridah-nv Fridah-nv marked this pull request as ready for review September 16, 2025 20:22
@Fridah-nv Fridah-nv requested a review from a team as a code owner September 16, 2025 20:22
@Fridah-nv Fridah-nv requested a review from lucaslie September 16, 2025 20:22
@coderabbitai
Contributor

coderabbitai bot commented Sep 16, 2025

📝 Walkthrough

Walkthrough

Adds an INT4 quantization path: a new graph-transform “quantize_int4_from_graph” fusing INT4-weighted linear patterns into a custom op, a new eager-compatible custom op torch_fake_quant_int4_linear with fake handler, optional ModelOpt restore in model build, a legacy tensor quant op/patch for ModelOpt export, and related tests. Also adds an INT4-AWQ backup module.

Changes

| Cohort | File(s) | Summary |
| --- | --- | --- |
| Config: add INT4 transform | tensorrt_llm/_torch/auto_deploy/config/default.yaml | Inserts transform step quantize_int4_from_graph into pattern_matcher stage after optimize_rope and before quantize_fp8_linear_from_config. |
| INT4 fake-quant op (runtime + tests) | tensorrt_llm/_torch/auto_deploy/custom_ops/torch_quant.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_quant.py | Adds custom op auto_deploy::torch_fake_quant_int4_linear (INT4 weight-only fake quant with pre-scale, blockwise amax), plus fake path. Tests compare against a reference implementation (with/without bias, scalar/vector scales). |
| Pattern-based INT4 fusion | tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py | Introduces INT4 graph rewrite: registers two ADPatternMatcher patterns (bias/no-bias) to replace detected subgraphs with torch_fake_quant_int4_linear. New transform INT4QuantizationFromGraph registered as quantize_int4_from_graph. |
| ModelOPT integration | tensorrt_llm/_torch/auto_deploy/models/hf.py, tensorrt_llm/_torch/auto_deploy/models/patches/modelopt.py | hf.py: optional modelopt_state.pth restore via modelopt.torch.opt. patches/modelopt.py: adds auto_deploy::tensor_quant_legacy custom op and patches modelopt.torch.quantization.tensor_quant._tensor_quant for export; includes fake variant and patch apply/remove. |
| Backup INT4 AWQ module | tensorrt_llm/_torch/auto_deploy/custom_ops/int4.py.bak | Adds backup file with INT4-AWQ helpers, PyTorch fallbacks, Int4LinearAWQ, Qwen2 INT4 attention/MLP modules, and a model patching function. |
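
For readers unfamiliar with the term, the "INT4 weight-only fake quant with pre-scale, blockwise amax" mentioned in the summary can be illustrated with a minimal sketch. This is not the PR's implementation: it is written in plain Python to stay dependency-free, and the scale rule (amax / 7), the clamp range [-8, 7], and the function names are assumptions chosen for illustration.

```python
from typing import List, Optional

# Assumed symmetric signed INT4 range; with scale = amax / 7 only [-7, 7] is reached.
INT4_MIN = -8
INT4_MAX = 7


def fake_quant_int4_blockwise(row: List[float], block_size: int) -> List[float]:
    """Quantize then dequantize one weight row, one block at a time."""
    if len(row) % block_size != 0:
        raise ValueError("row length must be a multiple of block_size")
    out: List[float] = []
    for start in range(0, len(row), block_size):
        block = row[start:start + block_size]
        amax = max(abs(v) for v in block)  # per-block absolute maximum
        if amax == 0.0:
            out.extend(0.0 for _ in block)
            continue
        scale = amax / INT4_MAX
        for v in block:
            q = max(INT4_MIN, min(INT4_MAX, round(v / scale)))
            out.append(q * scale)  # back to floating point ("fake" quant)
    return out


def fake_quant_int4_linear(
    x: List[float],
    weight: List[List[float]],
    pre_quant_scale: List[float],
    block_size: int,
    bias: Optional[List[float]] = None,
) -> List[float]:
    """Compute (x * pre_quant_scale) @ fake_quant(weight)^T + bias for one input vector."""
    x_scaled = [xi * si for xi, si in zip(x, pre_quant_scale)]
    y = []
    for j, w_row in enumerate(weight):
        w_fq = fake_quant_int4_blockwise(w_row, block_size)
        acc = sum(a * b for a, b in zip(x_scaled, w_fq))
        y.append(acc + (bias[j] if bias is not None else 0.0))
    return y
```

The weights stay in floating point throughout; only their values are snapped to the INT4 grid, which is what makes this path usable for export and parity testing without a real INT4 kernel.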

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant User
  participant AutoModelFactory as AutoModelForCausalLMFactory
  participant HF as HF Model Loader
  participant ModelOPT as modelopt.torch.opt

  User->>AutoModelFactory: _build_model(model_dir)
  AutoModelFactory->>HF: load model
  HF-->>AutoModelFactory: model
  AutoModelFactory->>AutoModelFactory: check model_dir/modelopt_state.pth
  alt modelopt_state.pth exists
    AutoModelFactory->>ModelOPT: import and torch.load(state)
    ModelOPT-->>AutoModelFactory: modelopt_state
    AutoModelFactory->>ModelOPT: restore_from_modelopt_state(model, state)
    ModelOPT-->>AutoModelFactory: restored model
  else
    Note over AutoModelFactory: Skip restore
  end
  AutoModelFactory-->>User: model (possibly restored)
sequenceDiagram
  autonumber
  participant Graph as FX GraphModule
  participant PM as ADPatternMatcherPass
  participant Rewriter as INT4QuantFromGraph
  participant Op as auto_deploy::torch_fake_quant_int4_linear

  Rewriter->>PM: register INT4 linear patterns (bias/no-bias)
  Rewriter->>Graph: apply patterns
  PM-->>Rewriter: matches found (count)
  loop for each match
    Rewriter->>Graph: replace subgraph with Op(...)
  end
  Graph-->>Rewriter: transformed GraphModule

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 27.78% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title Check | ✅ Passed | The title succinctly summarizes the primary feature addition by referencing the GitHub issue “[#5860]” and clearly stating that ModelOPT INT4 AWQ fake quant support is being added to AutoDeploy, matching the main change introduced in the pull request. |
| Description Check | ✅ Passed | The PR description includes all required sections from the template: a clear title following the format "[#5860][feat] Add ModelOPT INT4 awq fake quant support in AutoDeploy", a Description section explaining that the PR adds INT4AWQ support for unified HF checkpoints produced by ModelOPT with E2E test output examples, a Test Coverage section with specific test commands and checkpoint details, and a completed PR Checklist with the final checkbox marked. While the Description section could be more detailed about the implementation approach, it adequately explains what the PR does and provides concrete test evidence of functionality. The test coverage demonstrates both how to test and what checkpoint format is used, meeting the practical requirements for reviewers. |

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
tensorrt_llm/_torch/auto_deploy/models/hf.py (1)

1-1: Add NVIDIA Apache-2.0 header (2025).

Same header as suggested in the other Python files.

tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py (1)

1-1: Add NVIDIA Apache-2.0 header (2025).

Add the standard header at the top of this file.

tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_quant.py (1)

1-1: Add NVIDIA Apache-2.0 header (2025).

Add the standard header to this test file as well.

tensorrt_llm/_torch/auto_deploy/custom_ops/torch_quant.py (1)

1-1: Add NVIDIA Apache-2.0 header (2025).

Per repo guidelines, prepend the standard NVIDIA Apache-2.0 copyright header to all source files.

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
tensorrt_llm/_torch/auto_deploy/custom_ops/int4.py.bak (1)

1-331: Remove backup module from package: tensorrt_llm/_torch/auto_deploy/custom_ops/int4.py.bak

Backup .bak files inside the package namespace get picked up by tooling/linters and increase maintenance and dependency surface — remove from the PR or relocate outside the package.

Actions:

  • Delete the file from the package, or
  • Move to a non-packaged location, e.g. docs/examples/int4_awq/int4_awq_reference.py (excluded from packaging), or tests/helpers/int4_awq_reference.py guarded by an optional HF import.
🧹 Nitpick comments (11)
tensorrt_llm/_torch/auto_deploy/models/patches/modelopt.py (3)

57-67: Silence lint on unused fake-op parameters.

Keep signature but assign to underscores to avoid ARG001 noise.

Apply this diff:

 def tensor_quant_legacy_fake(
     inputs: torch.Tensor,
     amax: torch.Tensor,
     num_bits: int = 8,
     unsigned: bool = False,
     narrow_range: bool = True,
 ) -> Tuple[torch.Tensor, torch.Tensor]:
+    _ = num_bits, unsigned, narrow_range
     out = torch.empty_like(inputs)
     scl = torch.empty_like(amax)
     return out, scl

103-111: remove_patch should also handle missing ModelOpt import gracefully.

Mirror import-guard in remove path.

Apply this diff:

 def remove_patch() -> None:
     """Optional helper to restore the original _tensor_quant if needed."""
-    import modelopt.torch.quantization.tensor_quant as tq
+    try:
+        import modelopt.torch.quantization.tensor_quant as tq  # type: ignore[import-not-found]
+    except ImportError:
+        return
@@
-    if orig is not None:
-        setattr(tq, "_tensor_quant", orig)
+    if orig is not None:
+        tq._tensor_quant = orig

112-112: Avoid import‑time side effects; gate patching behind an opt‑in flag.

Auto‑patching at import can surprise downstream users. Recommend gating with an env var (e.g., AD_ENABLE_MODELOPT_EXPORT_PATCH=1).

Example:

-apply_patch()
+import os
+if os.getenv("AD_ENABLE_MODELOPT_EXPORT_PATCH") == "1":
+    apply_patch()
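
The opt-in gating suggested above, combined with the apply/remove pair, can be sketched as follows. This is an illustration only: the stand-in namespace replaces the real modelopt.torch.quantization.tensor_quant module, and the bookkeeping attribute name and the maybe_apply_patch helper are hypothetical. Only the environment variable name is taken from the suggestion.

```python
import os
import types

_ORIG_ATTR = "_ad_orig_tensor_quant"  # hypothetical bookkeeping attribute name


def apply_patch(module, replacement):
    """Install replacement as module._tensor_quant, saving the original once."""
    if getattr(module, _ORIG_ATTR, None) is None:
        setattr(module, _ORIG_ATTR, module._tensor_quant)
    module._tensor_quant = replacement


def remove_patch(module):
    """Restore the saved original if a patch is currently applied."""
    orig = getattr(module, _ORIG_ATTR, None)
    if orig is not None:
        module._tensor_quant = orig
        setattr(module, _ORIG_ATTR, None)


def maybe_apply_patch(module, replacement, env=None):
    """Patch only when the opt-in flag is set; return whether it was applied."""
    env = os.environ if env is None else env
    if env.get("AD_ENABLE_MODELOPT_EXPORT_PATCH") == "1":
        apply_patch(module, replacement)
        return True
    return False


# Stand-in for the ModelOpt module (not the real one), used to exercise the helpers.
stand_in = types.SimpleNamespace(_tensor_quant=lambda x: ("original", x))
```

Saving the original only on the first call keeps apply_patch idempotent, so a repeated import cannot overwrite the saved function with the already-patched one.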
tensorrt_llm/_torch/auto_deploy/models/hf.py (2)

185-185: Remove unused noqa.

# noqa: E402 is not needed inside a function scope and is flagged by Ruff.

Apply this diff:

-            import modelopt.torch.opt as mto  # noqa: E402
+            import modelopt.torch.opt as mto

182-192: Use logger, add import/IO guards, and avoid printing full model.

Prefer ad_logger, catch ImportError/IO errors, and log at debug to avoid huge dumps.

Apply this diff:

         # TODO: add to ModelOPT QuantConfigReader/graph transforms
         mto_ckpt_path = os.path.join(self.model, "modelopt_state.pth")
         if os.path.exists(mto_ckpt_path):
-            import modelopt.torch.opt as mto  # noqa: E402
-
-            print(f"Loading ModelOpt checkpoint from {mto_ckpt_path}")
-            modelopt_state = torch.load(mto_ckpt_path, weights_only=False)
-            model = mto.restore_from_modelopt_state(model, modelopt_state)
-            print("Restored model:")
-            print(model)
+            try:
+                import modelopt.torch.opt as mto
+            except ImportError:
+                ad_logger.warning("Found %s but modelopt is not installed; skipping restore.", mto_ckpt_path)
+            else:
+                ad_logger.info("Loading ModelOpt checkpoint from %s", mto_ckpt_path)
+                try:
+                    modelopt_state = torch.load(mto_ckpt_path, weights_only=False)
+                    model = mto.restore_from_modelopt_state(model, modelopt_state)
+                    ad_logger.debug("ModelOpt restore complete for %s", type(model).__name__)
+                except Exception as e:
+                    ad_logger.error("Failed to restore from %s: %s", mto_ckpt_path, e)
tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py (3)

539-559: Pattern may be overly specific due to detach(); consider removing or explicitly tolerating it.

If the input graph lacks aten.detach, the match can fail. Either drop the detach() in the pattern or add a rule to ignore it.

Apply this minimal change:

-    amax_det = amax.detach()
+    amax_det = amax  # keep the graph simpler for matching

Alternatively, add aten.detach.default to the matcher’s ignore list if supported by your matcher helper.


568-588: Same detach concern for bias pattern.

Mirror the change from the no‑bias variant to maximize match rate.

-    amax_det = amax.detach()
+    amax_det = amax

596-661: AOT pattern registration tweaks: consider ignoring aten.detach and mark unused args.

  • Add ignore for aten.detach.default (if your helper supports op ignores).
  • Prefix unused _apply args (cm, factory, shared_config) with underscores to quiet linters.

Example:

 class INT4QuantizationFromGraph(BaseTransform):
@@
-    def _apply(
-        self,
-        gm: GraphModule,
-        cm: CachedSequenceInterface,
-        factory: ModelFactory,
-        shared_config: SharedConfig,
-    ) -> Tuple[GraphModule, TransformInfo]:
+    def _apply(
+        self,
+        gm: GraphModule,
+        _cm: CachedSequenceInterface,
+        _factory: ModelFactory,
+        _shared_config: SharedConfig,
+    ) -> Tuple[GraphModule, TransformInfo]:
@@
         register_ad_pattern(
             search_fn=_int4_linear_pattern,
             replace_fn=_int4_linear_repl,
             patterns=patterns,
             dummy_args=dummy_args,
             op_ignore_types={
                 torch.ops.aten.reshape.default: (int,),
                 torch.ops.aten.to.dtype: (torch.dtype,),
+                # optionally ignore detach if present
+                # torch.ops.aten.detach.default: (type(None),),
             },
         )
@@
         register_ad_pattern(
             search_fn=_int4_linear_pattern_2,
             replace_fn=_int4_linear_repl_2,
             patterns=patterns,
             dummy_args=dummy_args_2,
             op_ignore_types={
                 torch.ops.aten.reshape.default: (int,),
                 torch.ops.aten.to.dtype: (torch.dtype,),
+                # torch.ops.aten.detach.default: (type(None),),
             },
         )
tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_quant.py (1)

234-277: Guard INT4 test for environments without CUDA/custom ops.

torch_fake_quant_int4_linear calls torch_linear_simple; some CI runners without CUDA/custom ops may fail. Consider a skip or capability check.

Example:

+from _torch_test_utils import trtllm_ops_available
@@
 @pytest.mark.parametrize("use_bias", [False, True])
 @pytest.mark.parametrize(
     "scale_layout", ["scalar", "vector"]
 )  # broadcast forms for pre_quant_scale
+@pytest.mark.skipif(not torch.cuda.is_available() or not trtllm_ops_available(), reason="Requires TRT-LLM custom ops on CUDA")
 def test_torch_fake_quant_int4_linear_matches_reference(use_bias, scale_layout):
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_quant.py (1)

330-341: Silence Ruff ARG001 for unused fake-path args.

Keep the signature for the dispatcher but explicitly discard unused args.

 @torch_fake_quant_int4_linear.register_fake
 def _fake(
     input: torch.Tensor,
     weight_quantized: torch.Tensor,
     bias: Optional[torch.Tensor],
     input_scale: List[torch.Tensor],
     weight_scale: List[torch.Tensor],
     input_zp: List[torch.Tensor],
     weight_zp: List[torch.Tensor],
 ) -> torch.Tensor:
+    # Discard unused args; the signature must stay intact for registration.
+    del bias, input_scale, weight_scale, input_zp, weight_zp
     N = weight_quantized.shape[-2]
     return torch.empty((*input.shape[:-1], N), dtype=input.dtype, device=input.device)

If you prefer cleaner code, configure Ruff to ignore ARG001 for these registered fake handlers.
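
The only job of the fake handler discussed above is to report the output shape without doing any math. A minimal sketch of that shape rule, in plain Python with a hypothetical helper name, assuming the weight's second-to-last dimension holds the output features as in the handler's `weight_quantized.shape[-2]`:

```python
from typing import Sequence, Tuple


def fake_linear_out_shape(
    input_shape: Sequence[int], weight_shape: Sequence[int]
) -> Tuple[int, ...]:
    """Return the linear output shape: input batch dims plus out-features."""
    if len(input_shape) < 1 or len(weight_shape) < 2:
        raise ValueError("expected input rank >= 1 and weight rank >= 2")
    n_out = weight_shape[-2]  # out-features dimension of the weight
    return (*input_shape[:-1], n_out)
```

The in-features dimensions are deliberately not compared here, since a packed INT4 weight may store fewer elements along its last dimension than the input has features.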

tensorrt_llm/_torch/auto_deploy/custom_ops/int4.py.bak (1)

71-104: Second fallback path: verify block reshaping logic and add scale clamp.

The view(-1, block_size // 2) assumes contiguous packing per block; confirm packing layout or reshape by (out, in//block_size, block_size//2) to avoid cross-row mixing. Also clamp scales.

-    first_half = first_half.view(-1, block_size // 2) / weight_scale.view(-1, 1)
-    second_half = second_half.view(-1, block_size // 2) / weight_scale.view(-1, 1)
+    ws = weight_scale.reshape(-1, 1)
+    eps = torch.finfo(ws.dtype).tiny
+    ws = torch.clamp(ws, min=eps)
+    first_half = first_half.view(-1, block_size // 2) / ws
+    second_half = second_half.view(-1, block_size // 2) / ws

If packing is by (out, in//block, block), prefer:

first_half = first_half.view(out_features, in_features // block_size, block_size // 2) / weight_scale.unsqueeze(-1)
second_half = second_half.view(out_features, in_features // block_size, block_size // 2) / weight_scale.unsqueeze(-1)

Please confirm layout with the checkpoint writer.
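
The cross-row concern above can be made concrete with a small sketch that indexes scales per row and per block, then packs two INT4 values per byte. It is plain Python rather than the module's tensor code, and the nibble order (low nibble first), the clamp range, and all function names are assumptions, since the actual packing layout is what the comment asks to confirm.

```python
from typing import List


def quantize_rows(
    weight: List[List[float]], scales: List[List[float]], block_size: int
) -> List[List[int]]:
    """Divide each element by the scale of its own row and block, then round and clamp."""
    q_rows = []
    for w_row, s_row in zip(weight, scales):
        if len(w_row) != len(s_row) * block_size:
            raise ValueError("each row needs one scale per block")
        q_rows.append(
            [max(-8, min(7, round(w / s_row[i // block_size]))) for i, w in enumerate(w_row)]
        )
    return q_rows


def pack_int4(values: List[int]) -> List[int]:
    """Pack pairs of signed INT4 values into bytes, low nibble first (assumed order)."""
    if len(values) % 2 != 0:
        raise ValueError("need an even number of values")
    return [((hi & 0x0F) << 4) | (lo & 0x0F) for lo, hi in zip(values[0::2], values[1::2])]


def unpack_int4(packed: List[int]) -> List[int]:
    """Invert pack_int4, sign-extending each 4-bit nibble."""
    out = []
    for byte in packed:
        for nibble in (byte & 0x0F, (byte >> 4) & 0x0F):
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out
```

Because the scale lookup is keyed by row first and block second, a row whose length is not a multiple of the flattened block width cannot borrow a scale from its neighbor, which is the failure mode a flat `view(-1, block_size // 2)` risks.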

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 471723b and 7a5647c.

📒 Files selected for processing (7)
  • tensorrt_llm/_torch/auto_deploy/config/default.yaml (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/int4.py.bak (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_quant.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/models/hf.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/models/patches/modelopt.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py (2 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_quant.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tensorrt_llm/_torch/auto_deploy/models/hf.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_quant.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_quant.py
  • tensorrt_llm/_torch/auto_deploy/models/patches/modelopt.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tensorrt_llm/_torch/auto_deploy/models/hf.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_quant.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_quant.py
  • tensorrt_llm/_torch/auto_deploy/models/patches/modelopt.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tensorrt_llm/_torch/auto_deploy/models/hf.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_quant.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_quant.py
  • tensorrt_llm/_torch/auto_deploy/models/patches/modelopt.py
🧬 Code graph analysis (3)
tensorrt_llm/_torch/auto_deploy/models/hf.py (1)
tensorrt_llm/_torch/auto_deploy/models/factory.py (1)
  • model (54-56)
tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_quant.py (1)
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_quant.py (1)
  • torch_fake_quant_int4_linear (282-327)
tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py (4)
tensorrt_llm/_torch/auto_deploy/utils/pattern_matcher.py (3)
  • ADPatternMatcherPass (61-67)
  • register_ad_pattern (99-182)
  • apply (64-67)
tensorrt_llm/_torch/auto_deploy/models/patches/modelopt.py (1)
  • tensor_quant_legacy (14-54)
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_quant.py (1)
  • torch_fake_quant_int4_linear (282-327)
tensorrt_llm/_torch/auto_deploy/transform/interface.py (3)
  • TransformRegistry (381-409)
  • register (387-394)
  • BaseTransform (139-378)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/auto_deploy/models/hf.py

185-185: Unused noqa directive (non-enabled: E402)

Remove unused noqa directive

(RUF100)

tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py

606-606: Unused method argument: cm

(ARG002)


607-607: Unused method argument: factory

(ARG002)


608-608: Unused method argument: shared_config

(ARG002)

tensorrt_llm/_torch/auto_deploy/custom_ops/torch_quant.py

288-288: Unused function argument: input_zp

(ARG001)


289-289: Unused function argument: weight_zp

(ARG001)


334-334: Unused function argument: bias

(ARG001)


335-335: Unused function argument: input_scale

(ARG001)


336-336: Unused function argument: weight_scale

(ARG001)


337-337: Unused function argument: input_zp

(ARG001)


338-338: Unused function argument: weight_zp

(ARG001)

tensorrt_llm/_torch/auto_deploy/models/patches/modelopt.py

30-30: Avoid specifying long messages outside the exception class

(TRY003)


61-61: Unused function argument: num_bits

(ARG001)


62-62: Unused function argument: unsigned

(ARG001)


63-63: Unused function argument: narrow_range

(ARG001)


78-83: try-except-pass detected, consider logging the exception

(S110)


78-78: Do not catch blind exception: Exception

(BLE001)


98-98: Do not call setattr with a constant attribute value. It is not any safer than normal property access.

Replace setattr with assignment

(B010)


100-100: Do not call setattr with a constant attribute value. It is not any safer than normal property access.

Replace setattr with assignment

(B010)


109-109: Do not call setattr with a constant attribute value. It is not any safer than normal property access.

Replace setattr with assignment

(B010)

🔇 Additional comments (3)
tensorrt_llm/_torch/auto_deploy/config/default.yaml (1)

48-49: INT4 pass placement looks fine; confirm desired ordering with other quant passes.

Runs after optimize_rope and before fp8/nvfp4 passes, which seems correct. Please confirm no unintended interactions with FP8/NVFP4 transforms for mixed‑algo graphs.

tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py (1)

561-565: Replacement op arguments: confirm layout invariants.

Ensure [pre_quant_scale] and [amax] match what torch_fake_quant_int4_linear expects (lists, not tensors). Looks consistent with custom op.

tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_quant.py (1)

199-231: Reference INT4 path matches custom op math. LGTM.

@Fridah-nv Fridah-nv force-pushed the user/fridah/int4 branch 3 times, most recently from ccf1244 to ba88144 Compare September 26, 2025 18:38
@lucaslie lucaslie moved this from Backlog to In review in AutoDeploy Board Sep 29, 2025
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

Delete tensorrt_llm/_torch/auto_deploy/models/patches/mxfp4.py

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

Delete tensorrt_llm/_torch/auto_deploy/config/default.bak.yaml

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

Delete tensorrt_llm/_torch/auto_deploy/custom_ops/int4.py

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

update torch_fake_quant_int4_linear to use standard interface

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

minor

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>

Delete tensorrt_llm/_torch/auto_deploy/custom_ops/int4.py.bak

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>

finalize int4 unified checkpoint e2e support

Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>

minor:update model kwarg to correctly set torch dtype

Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>

minor:remove unused util

Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>

minor:update comment

Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
@Fridah-nv
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #20301 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #20301 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15310 completed with status: 'FAILURE'

@Fridah-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #20310 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #20310 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15318 completed with status: 'FAILURE'

@Fridah-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #20329 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #20329 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #15333 completed with status: 'FAILURE'

@Fridah-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #20342 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #20342 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15346 completed with status: 'FAILURE'

@Fridah-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #20398 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #20398 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15392 completed with status: 'FAILURE'

@Fridah-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #20413 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #20413 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15404 completed with status: 'FAILURE'

@Fridah-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #20419 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #20419 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15409 completed with status: 'SUCCESS'

@Fridah-nv merged commit de99e23 into NVIDIA:main on Oct 1, 2025
5 checks passed
@github-project-automation (bot) moved this from In review to Done in AutoDeploy Board on Oct 1, 2025
faradawn pushed a commit to faradawn/TensorRT-LLM that referenced this pull request Oct 2, 2025
…eploy (NVIDIA#7770)

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Faradawn Yang <faradawny@gmail.com>
evezhier pushed a commit to evezhier/TensorRT-LLM that referenced this pull request Oct 3, 2025
…eploy (NVIDIA#7770)

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
faradawn pushed a commit to faradawn/TensorRT-LLM that referenced this pull request Oct 3, 2025
…eploy (NVIDIA#7770)

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Faradawn Yang <faradawny@gmail.com>
Labels

None yet

Projects

Status: Done

3 participants