[https://nvbugs/5394409][feat] Support Mistral Small 3.1 multimodal in Triton Backend by dbari · Pull Request #6714 · NVIDIA/TensorRT-LLM · GitHub

Conversation

@dbari
Collaborator

@dbari dbari commented Aug 7, 2025

Summary by CodeRabbit

  • New Features

    • Added end-to-end support for the “pixtral” multimodal model.
    • Accept multiple images via new optional image inputs; supports URLs, base64, and local files.
    • Dynamic image size handling in preprocessing and encoders.
  • Configuration

    • Prompt embedding table data type is now configurable.
    • Added model path parameter for Hugging Face assets.
  • Tools

    • Multimodal client supports “pixtral” and raw image-bytes input.
  • Tests

    • Added integration tests and engine preparation for Mistral Small 3.1 Pixtral.
  • Chores

    • Enabled Git LFS for test image assets.

Description

Implemented Mistral Small 3.1 in the Triton backend. It supports multiple images per prompt, and the images can have different resolutions. Each image is embedded into the prompt at the location of its respective [IMG] placeholder and contributes one token for every 28x28 patch (disregarding padding and line-break / end-of-image tokens). Also, since the vision encoder has a fixed max_batch_size but a request might contain an arbitrary number of images, the images are processed in batches of at most max_batch_size.
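
A minimal sketch of this chunking, assuming the encoder is a callable that produces per-image features (this is illustrative, not the actual multimodal_encoders model.py code):

import torch

def encode_in_chunks(encoder, pixel_values: torch.Tensor, max_batch_size: int) -> torch.Tensor:
    # Run the fixed-batch-size vision encoder over an arbitrary number of images
    # by slicing the stacked image tensor into chunks of at most max_batch_size.
    outputs = []
    for start in range(0, pixel_values.shape[0], max_batch_size):
        chunk = pixel_values[start:start + max_batch_size]
        outputs.append(encoder(chunk))
    return torch.cat(outputs, dim=0)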

After discussing with @Tabrizian (thanks!), I now understand the interaction between the number of requests (the Triton batch) and the batch size of each request. The code has been reworked to allow an arbitrary Triton batch size (number of requests per execute() call); the batch size within each individual request, however, should be 1.

Side note: The model runs successfully only when using the python backend for the tensorrt_llm step in the ensemble. This seems to be a bug or configuration difference in the C++ tensorrtllm backend that is independent of this PR. Unless there is another suggestion, I will open an issue for it.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
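
For example, using flags and the example stage name from the list above, a typical comment that runs a single stage with fail-fast disabled might look like:

/bot run --stage-list "A10-PyTorch-1" --disable-fail-fast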

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can cause the top of tree to break.

@coderabbitai
Contributor

coderabbitai bot commented Aug 7, 2025

📝 Walkthrough

Walkthrough

Adds end-to-end “pixtral” multimodal support: preprocessing updates (new IMAGES input, HF config loading, VisionPreProcessor changes, pixtral-specific processing), ensemble and encoder configs, encoder-side pixtral logic and outputs, client handling of raw image bytes, dynamic prompt embedding table dtype, engine build/test utilities, and CI/template updates.

Changes

Cohort / File(s) Summary
Preprocessing: Python backend logic
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
Adds “pixtral” model type, loads HF processor/config, extends inputs to include IMAGES, implements pixtral-specific image+text processing via VisionPreProcessor.pixtral_process, remaps image token IDs, and outputs tensors accordingly. VisionPreProcessor gains hf_config param and pixtral_process method.
Preprocessing: Config
triton_backend/all_models/inflight_batcher_llm/preprocessing/config.pbtxt
Adds optional string input IMAGES [-1]; IMAGE_SIZES dims changed to [-1, 2]; introduces hf_model_path parameter.
Ensemble config
triton_backend/all_models/multimodal/ensemble/config.pbtxt
Adds optional input images_input [-1] and maps it to preprocessing IMAGES in ensemble_scheduling.
Multimodal encoders: Python backend logic
triton_backend/all_models/multimodal/multimodal_encoders/1/model.py
Integrates “pixtral” handling: reads HF config for sizes/tokens, builds attention masks from image_sizes, postprocesses features into prompt embedding table and vocab size; notes TODO on batching across varying image counts.
Multimodal encoders: Config
triton_backend/all_models/multimodal/multimodal_encoders/config.pbtxt
Changes image_sizes dims [2] → [-1, 2]; OUT_PROMPT_EMBEDDING_TABLE data_type parameterized as ${prompt_embedding_table_data_type}.
tensorrt_llm: Python backend logic
triton_backend/all_models/inflight_batcher_llm/tensorrt_llm/1/model.py
Forces prompt_embedding_table tensor retrieval on torch with force_on_torch=True.
tensorrt_llm: Configs
triton_backend/all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt, .../tensorrt_llm_bls/config.pbtxt
Parameterizes prompt_embedding_table data type to ${prompt_embedding_table_data_type}.
Client tool
triton_backend/tools/multimodal/client.py
load_image now supports returning raw bytes (as_pil_image flag). Adds “pixtral” model_type path: sends raw image bytes via images_input as np.object_ array.
Integration: Engine build and config templating
tests/integration/defs/triton_server/build_engines.py, .../common.py, .../test.sh
Adds Pixtral engine preparation (LLM + visual encoder), plumbs prompt_embedding_table_data_type into template filling, passes hf_model_path to preprocessing, sets default dtype env var, and adds pixtral E2E test invocations.
Integration: Tests
tests/integration/defs/triton_server/test_triton_llm.py, .../conftest.py
Adds pixtral test (Mistral Small 3.1 24B) with varied params and streaming/BLS options; adds fixture for model root path.
CI
triton_backend/ci/L0_backend_trtllm/test.sh
Replaces ${prompt_embedding_table_data_type} with TYPE_FP16 during CI config prep.
Repo assets
.gitattributes
Enables Git LFS for PNG/JPG test assets under tests/integration/test_input_files.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant U as Client
  participant E as Ensemble Model
  participant P as Preprocessing (Python)
  participant V as Multimodal Encoders (Pixtral)
  participant L as TensorRT-LLM

  U->>E: text_input, images_input (bytes), params
  E->>P: TEXT, IMAGES, params (hf_model_path, model_type=pixtral)
  note right of P: Load HF processor/config<br/>Build input_ids with image token remap<br/>Pad/collect image_sizes
  P-->>E: INPUT_TEXT_TOKENS, IMAGE_SIZES, (other vision tensors)
  E->>V: vision inputs (pixel_values, image_sizes, masks)
  note right of V: Encode images<br/>Produce prompt_embedding_table + prompt_vocab_size
  V-->>E: OUT_PROMPT_EMBEDDING_TABLE, OUT_PROMPT_VOCAB_SIZE
  E->>L: input_ids, prompt_embedding_table, prompt_vocab_size
  note right of L: get_input_tensor_by_name(..., force_on_torch=True)
  L-->>U: generated_text

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related issues

Suggested labels

Community want to contribute

Suggested reviewers

  • Tabrizian
  • chzblych
  • Tom-Zheng
  • nv-guomingz
  • pcastonguay
  • Shixiaowei02


@dbari dbari changed the title [nvbugs/5394409][feat] Support Mistral Small 3.1 multimodal in Triton Backend [https://nvbugs/5394409][feat] Support Mistral Small 3.1 multimodal in Triton Backend Aug 7, 2025
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (7)
triton_backend/all_models/inflight_batcher_llm/preprocessing/config.pbtxt (1)

264-269: Consider making the HuggingFace model path configurable.

The hardcoded path /model_hf might not be suitable for all deployment scenarios. Consider using a parameterized value like ${hf_model_path} to allow runtime configuration.

 parameters: {
   key: "hf_model_path"
   value: {
-    string_value: "/model_hf"
+    string_value: "${hf_model_path}"
   }
 }
triton_backend/all_models/multimodal/multimodal_encoders/1/model.py (2)

294-316: Batch size limitation needs addressing as mentioned in PR objectives.

The current implementation restricts pixtral to batch size 1 due to variable image counts per request. This limitation is acknowledged in the PR objectives as needing resolution. The attention mask creation logic is correct for handling variable image sizes.

Consider implementing one of the proposed solutions from the PR objectives:

  1. Keep the restriction but document it clearly
  2. Allow encoder batches to be broken up for arbitrary image counts
  3. Rework batching logic for higher batch sizes (ideal but may impact other models)

666-666: Fix line length to comply with style guidelines.

Line 666 exceeds the 120 character limit. Please break the comment into multiple lines.

-                # Check if adding this request to the current batch would exceed the maximum batch size
+                # Check if adding this request to the current batch would exceed
+                # the maximum batch size
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py (4)

804-808: Consider validating required attributes for pixtral models

While using suppress(AttributeError) handles missing attributes gracefully, for pixtral models these attributes are required. Consider adding validation.

with suppress(AttributeError):
    # Only for pixtral models
    self.vocab_size = hf_config.text_config.vocab_size
    self.image_size = hf_config.vision_config.image_size
    self.image_token_index = hf_config.image_token_index

# Validate required attributes for pixtral
if vision_model_type == 'pixtral':
    if not all(hasattr(self, attr) for attr in ['vocab_size', 'image_size', 'image_token_index']):
        raise ValueError("[TensorRT-LLM][ERROR] Failed to extract required config attributes for pixtral model")

1065-1065: Fix comment formatting

The comment incorrectly shows "else:" which is redundant.

-        else:  # input_images is not None:
+        else:  # input_images is not None

1111-1116: Fix line length violation

Line 1115 exceeds the 120 character limit.

         for key, tensor_list in preprocessor_outputs.items():
             val = self.convert_tensor_list_to_tensor(tensor_list)
             if key in self.output_str_dtypes:
-                val = self.convert_tensor_to_str_dtype(
-                    val, self.output_str_dtypes[key])
+                val = self.convert_tensor_to_str_dtype(val, 
+                                                      self.output_str_dtypes[key])
             vision_processed_tensors[key] = val

1118-1129: Document the token replacement logic

The logic for replacing image tokens with unique IDs starting from vocab_size is critical for prompt table lookup. Consider adding a comment to explain this behavior.

         # Replace all image tokens with a unique token_id > vocab_size.
         # This shall be used to lookup the prompt table.
+        # Each image token in a batch gets a unique ID starting from vocab_size,
+        # incrementing sequentially. This ensures proper mapping to visual embeddings.
         for batch_id in range(batch_size):
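
As an illustration of that remapping, here is a minimal, self-contained sketch (not the actual preprocessing model.py code; the argument names simply mirror the HF config attributes discussed in this review):

import numpy as np

def remap_image_tokens(input_ids: np.ndarray, image_token_index: int, vocab_size: int) -> np.ndarray:
    # Replace every occurrence of the image placeholder token with a fresh id
    # starting at vocab_size, so each image token indexes a distinct row of the
    # prompt embedding table.
    remapped = input_ids.copy()
    next_id = vocab_size
    for pos in np.flatnonzero(remapped == image_token_index):
        remapped[pos] = next_id
        next_id += 1
    return remapped

# Example: remap_image_tokens(np.array([1, 7, 7, 2, 7]), image_token_index=7, vocab_size=10)
# returns array([ 1, 10, 11,  2, 12])
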
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4055b76 and 775af83.

📒 Files selected for processing (9)
  • triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py (10 hunks)
  • triton_backend/all_models/inflight_batcher_llm/preprocessing/config.pbtxt (3 hunks)
  • triton_backend/all_models/inflight_batcher_llm/tensorrt_llm/1/model.py (1 hunks)
  • triton_backend/all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt (1 hunks)
  • triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt (1 hunks)
  • triton_backend/all_models/multimodal/ensemble/config.pbtxt (2 hunks)
  • triton_backend/all_models/multimodal/multimodal_encoders/1/model.py (4 hunks)
  • triton_backend/all_models/multimodal/multimodal_encoders/config.pbtxt (2 hunks)
  • triton_backend/tools/multimodal/client.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • triton_backend/all_models/inflight_batcher_llm/tensorrt_llm/1/model.py
  • triton_backend/tools/multimodal/client.py
  • triton_backend/all_models/multimodal/multimodal_encoders/1/model.py
  • triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • triton_backend/all_models/inflight_batcher_llm/tensorrt_llm/1/model.py
  • triton_backend/tools/multimodal/client.py
  • triton_backend/all_models/multimodal/multimodal_encoders/1/model.py
  • triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
🧠 Learnings (4)
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Applied to files:

  • triton_backend/all_models/inflight_batcher_llm/tensorrt_llm/1/model.py
  • triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • triton_backend/all_models/inflight_batcher_llm/tensorrt_llm/1/model.py
  • triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
📚 Learning: in tensorrt-llm, examples directory can have different dependency versions than the root requirement...
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • triton_backend/all_models/inflight_batcher_llm/preprocessing/config.pbtxt
  • triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
📚 Learning: ministral is a valid model name from mistral ai, distinct from the regular mistral models. in tensor...
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6650
File: tests/integration/test_lists/qa/llm_perf_cluster.yml:33-37
Timestamp: 2025-08-06T03:47:16.802Z
Learning: Ministral is a valid model name from Mistral AI, distinct from the regular Mistral models. In TensorRT-LLM test configurations, "ministral_8b" and "ministral_8b_fp8" are correct model identifiers and should not be changed to "mistral_8b".

Applied to files:

  • triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
🪛 Ruff (0.12.2)
triton_backend/all_models/multimodal/multimodal_encoders/1/model.py

666-666: Line too long (132 > 120)

(E501)

triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py

1115-1115: Line too long (149 > 120)

(E501)

🔇 Additional comments (14)
triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt (1)

202-202: LGTM: Clean parameterization of prompt embedding table data type.

The change from hardcoded TYPE_FP16 to the configurable parameter ${prompt_embedding_table_data_type} enables flexible data type specification for the prompt embedding table input. This aligns well with the broader multimodal support requirements and maintains consistency across the related configuration files.

triton_backend/all_models/inflight_batcher_llm/tensorrt_llm/1/model.py (1)

283-285: LGTM: Proper tensor format enforcement for prompt embedding table.

Adding force_on_torch=True ensures the prompt_embedding_table tensor is consistently converted to PyTorch format regardless of its origin (CPU/GPU). This change complements the parameterized data type configuration and maintains robust tensor handling in the downstream processing logic.

triton_backend/all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt (1)

322-322: LGTM: Consistent parameterization across configuration files.

The change to ${prompt_embedding_table_data_type} maintains consistency with the identical parameterization in the tensorrt_llm_bls/config.pbtxt file. This coordinated approach across all related configuration files ensures uniform handling of the prompt embedding table data type throughout the system.

triton_backend/all_models/multimodal/ensemble/config.pbtxt (2)

60-66: LGTM! Proper configuration for pixtral multi-image support.

The new optional input correctly uses TYPE_STRING for raw image bytes and dynamic dimensions to handle an arbitrary number of images per request.


263-266: LGTM! Correct input mapping for preprocessing.

The mapping properly connects the ensemble's images_input to the preprocessing model's IMAGES input.

triton_backend/all_models/multimodal/multimodal_encoders/config.pbtxt (2)

79-79: LGTM! Dynamic dimensions support multiple images.

The change from [2] to [-1, 2] correctly enables handling multiple image sizes, which aligns with the Mistral Small 3.1 requirement for supporting multiple images with different resolutions per prompt.


117-117: LGTM! Parameterized data type provides flexibility.

Using ${prompt_embedding_table_data_type} allows runtime configuration of the precision, which is beneficial for supporting different model requirements and optimization scenarios.

triton_backend/all_models/inflight_batcher_llm/preprocessing/config.pbtxt (2)

61-67: LGTM! Proper configuration for pixtral image input.

The optional STRING input with dynamic dimensions correctly supports an arbitrary number of images for the pixtral model.


198-204: Verify IMAGE_SIZES dimension change for llava and update TODO

I didn’t find any llava‐specific code referencing the IMAGE_SIZES tensor—please double-check that the generic inflight_batcher_llm preprocessor handles dims: [-1, 2] correctly for llava_onevision before removing the TODO.

• Confirm in triton_backend/all_models/inflight_batcher_llm/preprocessing/config.pbtxt (and related preprocessing code) that llava_onevision accepts the new [-1, 2] shape
• If llava_onevision no longer breaks, remove or reword the # TODO: Unbreak llava! comment
• Add a regression test for llava_onevision’s image postprocessing to catch any dimension mismatches in future

triton_backend/tools/multimodal/client.py (3)

67-79: LGTM! Clean implementation for flexible image loading.

The refactored load_image function elegantly handles both raw bytes and PIL image outputs while maintaining backward compatibility through the default parameter.


242-242: LGTM! Pixtral model type properly added to choices.


257-261: LGTM! Correct handling of raw image bytes for pixtral.

The implementation properly loads images as raw bytes and prepares them as a numpy object array, which aligns with the pixtral model's requirements for processing multiple raw image inputs.

Also applies to: 311-313
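
For reference, a minimal sketch of preparing such an input on the client side (using tritonclient, which the client already depends on; the batch layout and file handling here are assumptions for illustration, not the actual client.py code):

import numpy as np
import tritonclient.grpc as grpcclient

def make_images_input(image_paths):
    # Read each image as raw bytes and pack them into a [1, num_images] object
    # array, matching the optional images_input tensor of the ensemble.
    raw_bytes = [open(path, "rb").read() for path in image_paths]
    arr = np.array(raw_bytes, dtype=np.object_).reshape(1, -1)
    tensor = grpcclient.InferInput("images_input", list(arr.shape), "BYTES")
    tensor.set_data_from_numpy(arr)
    return tensor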

triton_backend/all_models/multimodal/multimodal_encoders/1/model.py (2)

165-178: LGTM! Proper initialization for pixtral model.

The configuration loading and parameter extraction correctly sets up the necessary values for pixtral image processing.


541-581: LGTM! Postprocessing correctly handles pixtral visual features.

The implementation properly extracts and reshapes visual features according to the spatial merge size. The single batch assertion is consistent with the batch size limitation in get_requests.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (3)
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py (3)

139-142: Fix line length violation and update error message.

The error message doesn't include 'pixtral' in the list of supported models, creating an inconsistency with the validation list. Additionally, the line exceeds the 120-character limit.

 assert self.model_type in [
     'llava', 'blip2-opt', 'pixtral', 'vila', 'mllama',
     'llava_onevision', 'qwen2_vl'
-], f"[TensorRT-LLM][ERROR] Currently supported multi-modal models are llava, blip2-opt, vila, mllama, llava_onevision and qwen2_vl. Got {self.model_type}."
+], (f"[TensorRT-LLM][ERROR] Currently supported multi-modal models are "
+    f"llava, blip2-opt, pixtral, vila, mllama, llava_onevision and qwen2_vl. "
+    f"Got {self.model_type}.")

154-170: Add robust validation for hf_model_path parameter.

The code assumes hf_model_path exists without validation, which could lead to runtime errors. The parameter access is not safely handled.

 if self.model_type in [
         'mllama', 'llava_onevision', 'qwen2_vl', 'pixtral'
 ]:
-    # Read the correct path from the config
-    hf_model_path = model_config['parameters']['hf_model_path'][
-        'string_value']
+    # Read the correct path from the config
+    hf_model_path = model_config['parameters'].get('hf_model_path')
+    if not hf_model_path:
+        raise ValueError(f"[TensorRT-LLM][ERROR] 'hf_model_path' parameter is required for {self.model_type} model but not found in config")
+    hf_model_path = hf_model_path['string_value']
+    if not hf_model_path or hf_model_path == "${hf_model_path}":
+        raise ValueError(f"[TensorRT-LLM][ERROR] 'hf_model_path' parameter is not properly configured for {self.model_type} model")

335-346: Fix incorrect array shape for request_input_len.

The current implementation creates an array with shape (1, batch_size) instead of the expected (batch_size, 1) shape.

-                    request_input_len = np.array([[
-                        len(input_ids_for_batch)
-                        for input_ids_for_batch in pixtral_input_id_tensor
-                    ]])
+                    request_input_len = np.array([
+                        [len(input_ids_for_batch)]
+                        for input_ids_for_batch in pixtral_input_id_tensor
+                    ])
🧹 Nitpick comments (1)
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py (1)

1043-1131: Comprehensive pixtral processing with minor improvements needed.

The implementation is well-structured and handles multiple input types correctly. However, there are a few areas for improvement:

  1. Line length violation (line 1116): The line exceeds 120 characters
  2. Explicit error handling instead of assert (line 1067 equivalent - already handled with ValueError)
  3. Input validation could be enhanced
             if key in self.output_str_dtypes:
-                val = self.convert_tensor_to_str_dtype(
-                    val, self.output_str_dtypes[key])
+                val = self.convert_tensor_to_str_dtype(val, 
+                                                      self.output_str_dtypes[key])

The method demonstrates good practices with:

  • Proper error handling for invalid input dimensions
  • Consistent batch processing logic
  • Appropriate tensor manipulation
  • Clear separation of concerns between input types
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2925fd8 and ffe412b.

📒 Files selected for processing (1)
  • triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py (10 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
🧠 Learnings (5)
📚 Learning: ministral is a valid model name from mistral ai, distinct from the regular mistral models. in tensor...
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6650
File: tests/integration/test_lists/qa/llm_perf_cluster.yml:33-37
Timestamp: 2025-08-06T03:47:16.802Z
Learning: Ministral is a valid model name from Mistral AI, distinct from the regular Mistral models. In TensorRT-LLM test configurations, "ministral_8b" and "ministral_8b_fp8" are correct model identifiers and should not be changed to "mistral_8b".

Applied to files:

  • triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
📚 Learning: in tensorrt-llm, examples directory can have different dependency versions than the root requirement...
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Applied to files:

  • triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
🪛 Ruff (0.12.2)
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py

141-141: Line too long (176 > 120)

(E501)


1116-1116: Line too long (149 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (7)
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py (7)

38-38: LGTM: Import addition is appropriate.

The addition of AutoConfig and AutoProcessor imports is necessary for the new pixtral model support and follows the namespace preservation guideline.


299-303: LGTM: Input tensor handling is correctly implemented.

The addition of IMAGES input tensor handling is properly integrated with existing multimodal logic and follows the established pattern for other input tensors.


360-362: LGTM: Conditional input creation bypass is logical.

The bypass of _create_request for pixtral model is appropriate since the method handles its own input ID creation and request length calculation.


393-399: LGTM: Output tensor creation follows established patterns.

The pixtral-specific tensor creation properly converts the processed data to the expected Triton tensor format and uses appropriate data types.


754-755: LGTM: Constructor parameter addition is well-designed.

The addition of hf_config parameter to the VisionPreProcessor constructor is necessary for pixtral model support and maintains backward compatibility.


803-806: LGTM: Pixtral-specific configuration initialization is appropriate.

The initialization of vocab_size, image_size, and image_token_index from the HuggingFace config is correctly implemented for pixtral model support.


1065-1067: LGTM: Proper error handling instead of assert.

The code correctly uses explicit error handling with ValueError instead of assert statements, ensuring validation is always enforced regardless of Python optimization flags.

@dbari
Collaborator Author

dbari commented Aug 7, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #14492 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #14492 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #10947 completed with status: 'FAILURE'

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (7)
tests/integration/defs/triton_server/test.sh (1)

47-50: Use more idiomatic default assignment

Tiny shell-style nit: the same effect can be written in one line while preventing an accidental later re-assignment:

: "${PROMPT_EMBEDDING_TABLE_DTYPE:=TYPE_FP16}"
readonly PROMPT_EMBEDDING_TABLE_DTYPE

This is POSIX-compliant, shorter, and marks the value immutable.

tests/integration/defs/triton_server/build_engines.py (1)

1768-1781: Add a concise docstring per guidelines.

Improves readability and Sphinx parsing.

 def prepare_mistral3_pixtral_engine(tensorrt_llm_multimodal_example_root,
                                     tensorrt_llm_llama_example_root,
                                     mistral_small_model_root):
+    """Prepare Mistral Small 3.1 (Pixtral) engines.
+
+    Converts the HF checkpoint to local, builds the LLM engine, and builds the
+    Pixtral visual encoder engine.
+
+    Args:
+        tensorrt_llm_multimodal_example_root: Path to multimodal example root.
+        tensorrt_llm_llama_example_root: Path to llama example root for converter.
+        mistral_small_model_root: HF model directory for Mistral Small 3.1.
+
+    Returns:
+        Tuple[str, str]: (llm engine_dir, multimodal_engine_dir).
+    """
tests/integration/defs/triton_server/test_triton_llm.py (2)

3958-3962: Explicitly import prepare_mistral3_pixtral_engine to avoid F405 and improve clarity

Static analysis flags F405 due to star-imports. Prefer explicit imports for used symbols to reduce ambiguity and lint noise.

Outside this hunk, update imports at the top of the file:

-from .build_engines import *
-from .common import *
+from .build_engines import (
+    prepare_llama_v2_7b_engine,
+    prepare_llama_v2_70b_engine,
+    prepare_mistral_v1_7b_engine,
+    prepare_medusa_vicuna_7b_engine,
+    prepare_eagle_vicuna_7b_engine,
+    prepare_gpt_350m_engine,
+    prepare_gpt_175b_engine,
+    prepare_llama_v3_8b_engine,
+    prepare_llama_v3_70b_engine,
+    prepare_gpt_gather_logits_engine,
+    prepare_t5_small_engine,
+    prepare_whisper_large_engine,
+    prepare_gpt_2b_lora_engine,
+    prepare_tiny_llama_1b_engine,
+    prepare_mllama_engine,
+    prepare_llava_engine,
+    prepare_llava_onevision_engine,
+    prepare_mistral3_pixtral_engine,
+)
+from .common import (
+    prepare_ib_model_repo,
+    prepare_multimodal_model_repo,
+    prepare_custom_config,
+    prepare_disaggregated_serving_model_repo,
+    modify_ib_config_pbtxt,
+    modify_disaggregated_serving_config_pbtxt,
+    check_server_ready,
+    run_cpp_backend_tests,
+    run_cpp_streaming_backend_tests,
+    validate_by_keyword,
+    query_gpu_name,
+    get_gpu_full_name,
+    retrieve_latency_value,
+)

3909-3911: Document rationale for TRITON_MAX_BATCH_SIZE=1

Given current batching limitations with variable image counts, locking max batch size to 1 is intentional. Add a short comment here to prevent future “optimizations” that might reintroduce batching regressions.

Example:

NOTE: Pixtral currently supports variable #images per request; set max_batch_size=1 until batching is reworked.

tests/integration/defs/triton_server/common.py (3)

250-251: Document and validate the new dtype parameter (keep backward compatible).

Add a brief docstring entry for PROMPT_EMBEDDING_TABLE_DTYPE (allowed Triton types, default is TYPE_FP16). Optional: validate input early to catch typos.


278-278: Consistently propagate prompt_embedding_table_data_type (OK), consider DRY + quoting.

Propagation looks correct. To reduce duplication and avoid shell word-splitting, consider:

  • Define once: prompt_dtype_kv = f"prompt_embedding_table_data_type:{PROMPT_EMBEDDING_TABLE_DTYPE}"
  • Inject prompt_dtype_kv into each fill_template call
  • Quote the KV list (like other calls) to protect spaces in paths.

Also applies to: 310-310, 335-335, 356-356, 371-371
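
A minimal sketch of that suggestion (the helper and variable names here are illustrative, not the actual common.py symbols):

PROMPT_EMBEDDING_TABLE_DTYPE = "TYPE_FP16"  # default used elsewhere in the test scripts

prompt_dtype_kv = f"prompt_embedding_table_data_type:{PROMPT_EMBEDDING_TABLE_DTYPE}"

def build_fill_template_cmd(fill_template_py: str, config_pbtxt: str, extra_kv: str) -> str:
    # Define the dtype key:value once and quote the whole list so paths with
    # spaces do not split the argument when the command is run in a shell.
    return f'python3 {fill_template_py} -i {config_pbtxt} "{extra_kv},{prompt_dtype_kv}"'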


355-356: Fix long line (Ruff E501) by splitting KV assignments.

-        f"tensorrt_llm_model_name:{TENSORRT_LLM_TARGET_MODEL_NAME},tensorrt_llm_draft_model_name:{TENSORRT_LLM_DRAFT_MODEL_NAME},logits_datatype:TYPE_FP32," \
+        f"tensorrt_llm_model_name:{TENSORRT_LLM_TARGET_MODEL_NAME}," \
+        f"tensorrt_llm_draft_model_name:{TENSORRT_LLM_DRAFT_MODEL_NAME},logits_datatype:TYPE_FP32," \
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ffe412b and f125942.

📒 Files selected for processing (6)
  • tests/integration/defs/triton_server/build_engines.py (1 hunks)
  • tests/integration/defs/triton_server/common.py (6 hunks)
  • tests/integration/defs/triton_server/conftest.py (1 hunks)
  • tests/integration/defs/triton_server/test.sh (5 hunks)
  • tests/integration/defs/triton_server/test_triton_llm.py (1 hunks)
  • triton_backend/ci/L0_backend_trtllm/test.sh (1 hunks)
✅ Files skipped from review due to trivial changes (2)
  • triton_backend/ci/L0_backend_trtllm/test.sh
  • tests/integration/defs/triton_server/conftest.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tests/integration/defs/triton_server/build_engines.py
  • tests/integration/defs/triton_server/common.py
  • tests/integration/defs/triton_server/test_triton_llm.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tests/integration/defs/triton_server/build_engines.py
  • tests/integration/defs/triton_server/common.py
  • tests/integration/defs/triton_server/test_triton_llm.py
🧠 Learnings (7)
📓 Common learnings
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6650
File: tests/integration/test_lists/qa/llm_perf_cluster.yml:33-37
Timestamp: 2025-08-06T03:47:16.802Z
Learning: Ministral is a valid model name from Mistral AI, distinct from the regular Mistral models. In TensorRT-LLM test configurations, "ministral_8b" and "ministral_8b_fp8" are correct model identifiers and should not be changed to "mistral_8b".
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6650
File: tests/integration/test_lists/qa/llm_perf_cluster.yml:33-37
Timestamp: 2025-08-06T03:47:16.802Z
Learning: Ministral is a valid and distinct model family from Mistral AI, separate from their regular Mistral models. Ministral 8B is specifically designed for edge computing and on-device applications, released in October 2024. In TensorRT-LLM test configurations, "ministral_8b" and "ministral_8b_fp8" are correct model identifiers and should not be changed to "mistral_8b".
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/defs/triton_server/build_engines.py
  • tests/integration/defs/triton_server/test.sh
  • tests/integration/defs/triton_server/common.py
  • tests/integration/defs/triton_server/test_triton_llm.py
📚 Learning: 2025-08-06T03:47:16.802Z
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6650
File: tests/integration/test_lists/qa/llm_perf_cluster.yml:33-37
Timestamp: 2025-08-06T03:47:16.802Z
Learning: Ministral is a valid model name from Mistral AI, distinct from the regular Mistral models. In TensorRT-LLM test configurations, "ministral_8b" and "ministral_8b_fp8" are correct model identifiers and should not be changed to "mistral_8b".

Applied to files:

  • tests/integration/defs/triton_server/build_engines.py
  • tests/integration/defs/triton_server/test_triton_llm.py
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • tests/integration/defs/triton_server/build_engines.py
  • tests/integration/defs/triton_server/common.py
  • tests/integration/defs/triton_server/test_triton_llm.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tests/integration/defs/triton_server/test.sh
  • tests/integration/defs/triton_server/common.py
  • tests/integration/defs/triton_server/test_triton_llm.py
📚 Learning: 2025-07-22T09:22:14.726Z
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Applied to files:

  • tests/integration/defs/triton_server/test.sh
  • tests/integration/defs/triton_server/test_triton_llm.py
📚 Learning: 2025-08-06T03:47:16.802Z
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6650
File: tests/integration/test_lists/qa/llm_perf_cluster.yml:33-37
Timestamp: 2025-08-06T03:47:16.802Z
Learning: Ministral is a valid and distinct model family from Mistral AI, separate from their regular Mistral models. Ministral 8B is specifically designed for edge computing and on-device applications, released in October 2024. In TensorRT-LLM test configurations, "ministral_8b" and "ministral_8b_fp8" are correct model identifiers and should not be changed to "mistral_8b".

Applied to files:

  • tests/integration/defs/triton_server/test_triton_llm.py
🪛 Ruff (0.12.2)
tests/integration/defs/triton_server/common.py

341-341: Line too long (128 > 120)

(E501)


355-355: Line too long (158 > 120)

(E501)

tests/integration/defs/triton_server/test_triton_llm.py

3959-3959: prepare_mistral3_pixtral_engine may be undefined, or defined from star imports

(F405)


3965-3965: prepare_ib_model_repo may be undefined, or defined from star imports

(F405)


3968-3968: prepare_multimodal_model_repo may be undefined, or defined from star imports

(F405)


3970-3970: prepare_multimodal_model_repo may be undefined, or defined from star imports

(F405)


3975-3975: modify_ib_config_pbtxt may be undefined, or defined from star imports

(F405)


4012-4012: check_server_ready may be undefined, or defined from star imports

(F405)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (6)
tests/integration/defs/triton_server/test.sh (1)

189-190: Double-check template placeholders for the new key

All fill_template.py invocations now pass prompt_embedding_table_data_type. Make sure every affected config.pbtxt template actually contains the ${prompt_embedding_table_data_type} placeholder; otherwise the key will be silently ignored and the runtime will still default to TYPE_FP16.

Also applies to: 195-198, 208-208, 226-226

tests/integration/defs/triton_server/build_engines.py (1)

1803-1810: Verify visual encoder max_batch_size=2 is sufficient for “multiple images per prompt”.

Given the batching constraints discussed in the PR, this hard-cap may be too restrictive for realistic prompts. Other multimodal builders use 8–16.

If acceptable, leave as-is; otherwise consider:

-        "--max_batch_size=2",
+        "--max_batch_size=8",

Or parameterize it if tests need different limits.
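If parameterizing is preferred, a minimal sketch could look like this (the environment variable name is an assumption, not an existing knob in build_engines.py):

import os

# Hypothetical override: let tests choose the visual encoder batch size,
# falling back to the current hard-coded value of 2.
visual_max_batch_size = int(os.environ.get("PIXTRAL_VISUAL_MAX_BATCH_SIZE", "2"))
visual_engine_args = [f"--max_batch_size={visual_max_batch_size}"]
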

tests/integration/defs/triton_server/test_triton_llm.py (3)

3948-3951: Fixture Verified: mistral_small_3_1_24b_model_root Exists
The mistral_small_3_1_24b_model_root fixture is defined in

  • tests/integration/defs/triton_server/conftest.py:554

No further action required.


3919-3922: End-to-end dtype templating confirmed

modify_ib_config_pbtxt accepts ENCODER_INPUT_FEATURES_DTYPE and PROMPT_EMBEDDING_TABLE_DTYPE
– Integration tests invoke tools/fill_template.py with encoder_input_features_data_type:${ENCODER_INPUT_FEATURES_DTYPE} and prompt_embedding_table_data_type:${PROMPT_EMBEDDING_TABLE_DTYPE}
– All relevant triton_backend/all_models/**/config.pbtxt files declare ${encoder_input_features_data_type} and ${prompt_embedding_table_data_type} placeholders

No changes needed.


3999-4004: Confirm Python backend in generated Triton configs (manual verification needed)

I ran the provided script but didn’t find any config.pbtxt files under triton_backend/triton_repo—the model repo is generated at test time. Please:

  • Run the integration tests to create triton_backend/triton_repo
  • Inspect each config.pbtxt and verify it contains
    backend: "python"  # (or platform: "python")

This ensures we never hit the C++ backend path.
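
One way to automate that inspection, as a rough sketch (the repo path is whatever directory the tests generate; this is not an existing helper):

from pathlib import Path

def find_non_python_backends(repo_dir: str) -> list:
    # Return the generated config.pbtxt files that do not pin the python backend.
    # Note: the ensemble model legitimately uses platform: "ensemble" and would
    # need to be excluded from this check.
    offenders = []
    for cfg in Path(repo_dir).rglob("config.pbtxt"):
        text = cfg.read_text()
        if 'backend: "python"' not in text and 'platform: "python"' not in text:
            offenders.append(str(cfg))
    return offenders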

tests/integration/defs/triton_server/common.py (1)

341-341: Break long f-string into multiple lines and keep KV args consistent

• File: tests/integration/defs/triton_server/common.py, line 341
Apply the same splitting style as nearby calls:

-        f"python3 {fill_template_py} -i {preprocessing_config} tokenizer_dir:{TOKENIZER_PATH},hf_model_path:{TOKENIZER_PATH}," \
+        f"python3 {fill_template_py} -i {preprocessing_config} " \
+        f"tokenizer_dir:{TOKENIZER_PATH},hf_model_path:{TOKENIZER_PATH}," \

Placeholders for tokenizer_dir and hf_model_path are already present in each preprocessing/config.pbtxt, so the CI will continue to pass.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
tests/integration/defs/triton_server/test_triton_llm.py (1)

4025-4031: Quote the --text argument to preserve the embedded newline

TEXT can contain a \n; without quoting, it may be split into multiple argv tokens.
Use the same quoting style applied to the mllama test:

-        f"--text={TEXT}",
+        f"--text='{TEXT}'",
🧹 Nitpick comments (1)
tests/integration/defs/triton_server/test_triton_llm.py (1)

3968-3971: Consider asserting basic correctness

The test only checks that the call succeeds; no assertions are made on the response.
Adding a minimal keyword check (similar to other multimodal tests) would catch silent regressions.
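
For example, a hedged sketch of such a check (the client command and keywords are placeholders, not the test's actual values):

# Sketch: run the multimodal client and assert the response mentions an expected keyword.
import subprocess

client_cmd = ["python3", "client.py", "--model_type=pixtral",
              "--text=Describe the image"]          # placeholder command
result = subprocess.run(client_cmd, capture_output=True, text=True, check=True)
expected_keywords = ("city", "bridge")              # hypothetical keywords for the test image
assert any(kw in result.stdout.lower() for kw in expected_keywords), \
    f"none of {expected_keywords} found in: {result.stdout[:200]!r}"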

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f125942 and 863758f.

📒 Files selected for processing (1)
  • tests/integration/defs/triton_server/test_triton_llm.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tests/integration/defs/triton_server/test_triton_llm.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tests/integration/defs/triton_server/test_triton_llm.py
🧠 Learnings (6)
📓 Common learnings
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6650
File: tests/integration/test_lists/qa/llm_perf_cluster.yml:33-37
Timestamp: 2025-08-06T03:47:16.802Z
Learning: Ministral is a valid model name from Mistral AI, distinct from the regular Mistral models. In TensorRT-LLM test configurations, "ministral_8b" and "ministral_8b_fp8" are correct model identifiers and should not be changed to "mistral_8b".
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/defs/triton_server/test_triton_llm.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tests/integration/defs/triton_server/test_triton_llm.py
📚 Learning: 2025-08-06T03:47:16.802Z
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6650
File: tests/integration/test_lists/qa/llm_perf_cluster.yml:33-37
Timestamp: 2025-08-06T03:47:16.802Z
Learning: Ministral is a valid model name from Mistral AI, distinct from the regular Mistral models. In TensorRT-LLM test configurations, "ministral_8b" and "ministral_8b_fp8" are correct model identifiers and should not be changed to "mistral_8b".

Applied to files:

  • tests/integration/defs/triton_server/test_triton_llm.py
📚 Learning: 2025-08-06T03:47:16.802Z
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6650
File: tests/integration/test_lists/qa/llm_perf_cluster.yml:33-37
Timestamp: 2025-08-06T03:47:16.802Z
Learning: Ministral is a valid and distinct model family from Mistral AI, separate from their regular Mistral models. Ministral 8B is specifically designed for edge computing and on-device applications, released in October 2024. In TensorRT-LLM test configurations, "ministral_8b" and "ministral_8b_fp8" are correct model identifiers and should not be changed to "mistral_8b".

Applied to files:

  • tests/integration/defs/triton_server/test_triton_llm.py
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • tests/integration/defs/triton_server/test_triton_llm.py
🪛 Ruff (0.12.2)
tests/integration/defs/triton_server/test_triton_llm.py

3897-3897: Line too long (155 > 120)

(E501)


3968-3968: prepare_mistral3_pixtral_engine may be undefined, or defined from star imports

(F405)


3974-3974: prepare_ib_model_repo may be undefined, or defined from star imports

(F405)


3977-3977: prepare_multimodal_model_repo may be undefined, or defined from star imports

(F405)


3979-3979: prepare_multimodal_model_repo may be undefined, or defined from star imports

(F405)


3984-3984: modify_ib_config_pbtxt may be undefined, or defined from star imports

(F405)


4021-4021: check_server_ready may be undefined, or defined from star imports

(F405)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tests/integration/defs/triton_server/test_triton_llm.py (1)

3894-3899: Double-check client support for comma-separated multi-image input

The second parametrized case passes two image URLs as a single --image value separated by a comma.
Ensure client.py for pixtral actually expects this format (vs. multiple --image flags or JSON).
If not, adjust the test or extend the parser accordingly.
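
If the parser needs extending rather than the test, a small sketch (flag names assumed to match the existing client):

# Sketch: accept both repeated --image flags and a single comma-separated value.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--image", action="append", default=[],
                    help="Image URL or path; repeat the flag or pass a comma-separated list.")
args = parser.parse_args(["--image", "http://example.com/a.png,http://example.com/b.png"])

image_list = [item.strip() for value in args.image
              for item in value.split(",") if item.strip()]
print(image_list)  # -> ['http://example.com/a.png', 'http://example.com/b.png']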

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (2)
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py (2)

154-160: hf_model_path still not validated — comment already raised on previous commit.
Please add the missing presence / placeholder checks to avoid runtime failures before the path is dereferenced.
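
A minimal validation sketch, assuming the parameter follows the usual Triton parameters/string_value layout:

# Sketch: validate the hf_model_path parameter before it is dereferenced.
import os

def validate_hf_model_path(model_config: dict) -> str:
    """Return a usable hf_model_path or raise with a clear message (sketch only)."""
    value = model_config.get("parameters", {}).get("hf_model_path", {}).get("string_value", "")
    if not value or value.startswith("${"):
        raise ValueError("parameter 'hf_model_path' is unset or still a template placeholder")
    if not os.path.isdir(value):
        raise ValueError(f"hf_model_path does not exist: {value}")
    return value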


343-346: request_input_len shape is transposed

The construction gives shape (1, batch_size); downstream code expects (batch_size, 1) (see earlier review).
Swap the loops as previously suggested.

-request_input_len = np.array([[
-    len(ids) for ids in pixtral_input_id_tensor
-]])
+request_input_len = np.array(
+    [[len(ids)] for ids in pixtral_input_id_tensor]
+)
🧹 Nitpick comments (1)
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py (1)

299-304: Prefer explicit is not None checks for input-tensor presence

Using truthiness on pb_utils.Tensor objects (img_urls or image_bytes …) relies on the object’s __bool__ implementation and becomes brittle.
Consider:

has_vision_inputs = any(
    tensor is not None
    for tensor in (img_urls, image_bytes, input_images, video_bytes)
)
if self.is_multimodal and has_vision_inputs:
    …

This avoids accidental ValueError: ambiguous truth value issues when the underlying object turns into a NumPy array.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 863758f and 4bc8678.

📒 Files selected for processing (1)
  • triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py (10 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
🧠 Learnings (6)
📓 Common learnings
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6650
File: tests/integration/test_lists/qa/llm_perf_cluster.yml:33-37
Timestamp: 2025-08-06T03:47:16.802Z
Learning: Ministral is a valid model name from Mistral AI, distinct from the regular Mistral models. In TensorRT-LLM test configurations, "ministral_8b" and "ministral_8b_fp8" are correct model identifiers and should not be changed to "mistral_8b".
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6650
File: tests/integration/test_lists/qa/llm_perf_cluster.yml:33-37
Timestamp: 2025-08-06T03:47:16.802Z
Learning: Ministral is a valid and distinct model family from Mistral AI, separate from their regular Mistral models. Ministral 8B is specifically designed for edge computing and on-device applications, released in October 2024. In TensorRT-LLM test configurations, "ministral_8b" and "ministral_8b_fp8" are correct model identifiers and should not be changed to "mistral_8b".
📚 Learning: 2025-08-06T03:47:16.802Z
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6650
File: tests/integration/test_lists/qa/llm_perf_cluster.yml:33-37
Timestamp: 2025-08-06T03:47:16.802Z
Learning: Ministral is a valid model name from Mistral AI, distinct from the regular Mistral models. In TensorRT-LLM test configurations, "ministral_8b" and "ministral_8b_fp8" are correct model identifiers and should not be changed to "mistral_8b".

Applied to files:

  • triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
📚 Learning: 2025-07-17T09:01:27.402Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
📚 Learning: 2025-07-22T09:22:14.726Z
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Applied to files:

  • triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
🪛 Ruff (0.12.2)
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py

141-141: Line too long (176 > 120)

(E501)


1125-1125: Line too long (149 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py (1)

394-399: Potential device mismatch when converting Torch tensor to NumPy

.numpy() only works on CPU tensors. If upstream code ever moves INPUT_IDS to GPU, this line will raise.
Safer:

input_ids_np = pixtral_input_id_tensor.cpu().numpy().astype(self.input_id_dtype)
input_id_tensor = pb_utils.Tensor('INPUT_ID', input_ids_np)

Add .cpu() or ensure the tensor is guaranteed to reside on CPU beforehand.

@dbari
Copy link
Collaborator Author

dbari commented Aug 8, 2025

/bot run --stage-list "A30-Triton-Post-Merge-1, A30-Triton-Post-Merge-2, B200_PCIe-Triton-Post-Merge-1, A100X-Triton-Post-Merge-1, A100X-Triton-Post-Merge-2"

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (2)
tests/integration/defs/triton_server/build_model.sh (1)

716-723: Align trtllm-build flags with existing conventions and set plugin policy explicitly.

Other blocks explicitly configure GEMM/attention plugins and use space-separated flags. Recommend matching that for predictability.

Apply:

-    trtllm-build --checkpoint_dir ./c-model/Mistral-Small-3.1-24B-Instruct-2503/bf16 \
-                --max_batch_size=8 \
-                --max_input_len=4096 \
-                --max_seq_len=4096 \
-                --max_multimodal_len=32768 \
-                --use_paged_context_fmha=enable \
-                --output_dir trt_engines/Mistral-Small-3.1-24B-Instruct-2503/bf16/1-gpu
+    trtllm-build --checkpoint_dir ./c-model/Mistral-Small-3.1-24B-Instruct-2503/bf16 \
+                --gemm_plugin auto \
+                --max_batch_size 8 \
+                --max_input_len 4096 \
+                --max_seq_len 4096 \
+                --max_multimodal_len 32768 \
+                --use_paged_context_fmha enable \
+                --output_dir trt_engines/Mistral-Small-3.1-24B-Instruct-2503/bf16/1-gpu

Note: If you rely on default plugin selection, keep it but still switch to the consistent flag formatting.

tests/integration/defs/triton_server/test_triton.py (1)

476-479: Wrap long lines to satisfy E501 and improve readability.

Ruff flagged these lines (E501). Wrap the f-strings using parentheses.

Apply:

-    # Visual encoder path follows the multimodal tmp engine pattern similar to others
-    visual_encoder_path = f"{llm_root}/examples/models/core/multimodal/tmp/trt_engines/{os.path.basename(model_path)}/multimodal_encoder/"
-    run_shell_command(
-        f"cd {tritonserver_test_root} && ./test.sh {test_name} {engine_dir} {model_path} {tokenizer_type} skip skip skip {visual_encoder_path}",
-        llm_root)
+    # Visual encoder path follows the multimodal tmp engine pattern similar to others
+    visual_encoder_path = (
+        f"{llm_root}/examples/models/core/multimodal/tmp/trt_engines/"
+        f"{os.path.basename(model_path)}/multimodal_encoder/"
+    )
+    run_shell_command(
+        (
+            f"cd {tritonserver_test_root} && ./test.sh {test_name} "
+            f"{engine_dir} {model_path} {tokenizer_type} skip skip skip "
+            f"{visual_encoder_path}"
+        ),
+        llm_root,
+    )
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4bc8678 and f8bb8d6.

📒 Files selected for processing (3)
  • tests/integration/defs/triton_server/build_model.sh (1 hunks)
  • tests/integration/defs/triton_server/test.sh (6 hunks)
  • tests/integration/defs/triton_server/test_triton.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/integration/defs/triton_server/test.sh
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tests/integration/defs/triton_server/test_triton.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tests/integration/defs/triton_server/test_triton.py
🧠 Learnings (8)
📓 Common learnings
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6650
File: tests/integration/test_lists/qa/llm_perf_cluster.yml:33-37
Timestamp: 2025-08-06T03:47:16.802Z
Learning: Ministral is a valid model name from Mistral AI, distinct from the regular Mistral models. In TensorRT-LLM test configurations, "ministral_8b" and "ministral_8b_fp8" are correct model identifiers and should not be changed to "mistral_8b".
📚 Learning: 2025-08-06T03:47:16.802Z
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6650
File: tests/integration/test_lists/qa/llm_perf_cluster.yml:33-37
Timestamp: 2025-08-06T03:47:16.802Z
Learning: Ministral is a valid model name from Mistral AI, distinct from the regular Mistral models. In TensorRT-LLM test configurations, "ministral_8b" and "ministral_8b_fp8" are correct model identifiers and should not be changed to "mistral_8b".

Applied to files:

  • tests/integration/defs/triton_server/build_model.sh
  • tests/integration/defs/triton_server/test_triton.py
📚 Learning: 2025-08-06T03:47:16.802Z
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6650
File: tests/integration/test_lists/qa/llm_perf_cluster.yml:33-37
Timestamp: 2025-08-06T03:47:16.802Z
Learning: Ministral is a valid and distinct model family from Mistral AI, separate from their regular Mistral models. Ministral 8B is specifically designed for edge computing and on-device applications, released in October 2024. In TensorRT-LLM test configurations, "ministral_8b" and "ministral_8b_fp8" are correct model identifiers and should not be changed to "mistral_8b".

Applied to files:

  • tests/integration/defs/triton_server/build_model.sh
  • tests/integration/defs/triton_server/test_triton.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tests/integration/defs/triton_server/build_model.sh
  • tests/integration/defs/triton_server/test_triton.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/defs/triton_server/build_model.sh
  • tests/integration/defs/triton_server/test_triton.py
📚 Learning: 2025-08-08T04:10:18.987Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6728
File: cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp:966-966
Timestamp: 2025-08-08T04:10:18.987Z
Learning: TensorRT plugins currently don't support padding functionality, and TensorRT is not getting new features (in maintenance mode). This means that duplicating parameters like mExpertHiddenSize in function calls, even with TODO comments, can be acceptable as pragmatic solutions within these constraints.

Applied to files:

  • tests/integration/defs/triton_server/build_model.sh
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • tests/integration/defs/triton_server/build_model.sh
  • tests/integration/defs/triton_server/test_triton.py
📚 Learning: 2025-07-22T09:22:14.726Z
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Applied to files:

  • tests/integration/defs/triton_server/build_model.sh
🪛 Ruff (0.12.2)
tests/integration/defs/triton_server/test_triton.py

476-476: Line too long (138 > 120)

(E501)


478-478: Line too long (144 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (4)
tests/integration/defs/triton_server/build_model.sh (1)

720-721: Verify max_multimodal_len sizing.

32768 is quite large and may inflate memory and engine size. Ensure it is justified by the worst-case images-per-prompt × patches-per-image for Pixtral, given the “[IMG]” expansion policy.

Would you confirm the intended upper bound (images per request and maximum image resolution/patch grid), and whether 16384 would suffice for CI to reduce build time and memory?
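
For reference, a back-of-the-envelope sizing sketch (the 1540-pixel maximum edge is an assumption; the 28x28 patch size follows the PR description):

# Rough sizing: multimodal tokens per request ~= images * ceil(H/28) * ceil(W/28).
import math

patch = 28        # effective patch edge per the PR description
max_side = 1540   # assumed maximum image edge after preprocessing
tokens_per_image = math.ceil(max_side / patch) ** 2
for n_images in (1, 2, 4, 8):
    print(f"{n_images} image(s): {n_images * tokens_per_image} multimodal tokens")

Under these assumptions, eight maximum-size images come to roughly 24k tokens, so 32768 leaves headroom while 16384 would cover about five such images.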

tests/integration/defs/triton_server/test_triton.py (3)

65-67: Model mapping entry looks good.

The new "mistral_small_3_1""Mistral-Small-3.1-24B-Instruct-2503" mapping matches the build script’s MODEL_DIR. No issues.


127-129: Engine directory mapping LGTM.

Path is consistent with the bf16 engine output in the build script.


476-477: Confirm basename-based visual encoder path matches build output.

This relies on os.path.basename(model_path) being identical to the directory name used in build_model.sh. Today they match; if MODEL_DIR changes, this will drift. Consider deriving from a shared mapping or a helper to avoid string coupling.

If you prefer, I can factor a small helper to compute visual_encoder_path in one place used by all multimodal tests.
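
A possible shape for that helper (the function name is hypothetical; the path layout mirrors the f-string above):

import os

def multimodal_encoder_path(llm_root: str, model_path: str) -> str:
    """Hypothetical shared helper: tmp engine directory for a model's visual encoder."""
    return os.path.join(
        llm_root, "examples", "models", "core", "multimodal", "tmp", "trt_engines",
        os.path.basename(model_path), "multimodal_encoder",
    )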

@tensorrt-cicd
Copy link
Collaborator

PR_Github #14624 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #14624 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #11048 (Partly Tested) completed with status: 'FAILURE'

@dbari dbari force-pushed the feature/5394409-mistral-small-3.1 branch from f109060 to bf1dead on August 11, 2025 07:59
Copy link
Collaborator

@jaedeok-nvidia jaedeok-nvidia left a comment


Thanks a lot for adding Pixtral support to the trtllm backend of Triton. I’ve left several comments. Could you please take a look?

@dbari
Copy link
Collaborator Author

dbari commented Aug 12, 2025

/bot run --stage-list "A30-Triton-Post-Merge-1, A30-Triton-Post-Merge-2, B200_PCIe-Triton-Post-Merge-1, A100X-Triton-Post-Merge-1, A100X-Triton-Post-Merge-2"

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15010 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15010 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #11337 (Partly Tested) completed with status: 'FAILURE'

@dbari dbari force-pushed the feature/5394409-mistral-small-3.1 branch from abdd1c3 to a00d34d on August 13, 2025 07:27
@dbari
Copy link
Collaborator Author

dbari commented Aug 13, 2025

/bot run --stage-list "A30-Triton-Post-Merge-1, A30-Triton-Post-Merge-2, B200_PCIe-Triton-Post-Merge-1, A100X-Triton-Post-Merge-1, A100X-Triton-Post-Merge-2"

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15091 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15091 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11394 (Partly Tested) completed with status: 'FAILURE'

dbari and others added 7 commits August 19, 2025 09:40
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com>
Signed-off-by: Dimitrios Bariamis <dbari@users.noreply.github.com>
- The logger has no method logger.verbose()

Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
@dbari dbari force-pushed the feature/5394409-mistral-small-3.1 branch from 54d9e62 to 48bc0e6 on August 19, 2025 07:40
@dbari
Copy link
Collaborator Author

dbari commented Aug 19, 2025

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15734 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11824 completed with status: 'FAILURE'

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15744 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15744 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11833 completed with status: 'FAILURE'

@dbari
Copy link
Collaborator Author

dbari commented Aug 19, 2025

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15790 [ run ] triggered by Bot

@dbari
Copy link
Collaborator Author

dbari commented Aug 19, 2025

> What is the issue? Are we able to get correct outputs outside Triton with executor?

@Tabrizian I repeated the tests with both backends in an image built with the current main branch of TRT-LLM and the problem seems to have been solved in the meantime. The outputs for backend: "python" and backend: "tensorrtllm" are identical.

Seeing this, I will also switch the backend for this model to tensorrtllm.

Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
@tensorrt-cicd
Copy link
Collaborator

PR_Github #15790 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11868 completed with status: 'FAILURE'

@dbari
Copy link
Collaborator Author

dbari commented Aug 20, 2025

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15875 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #15875 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11934 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@dbari
Copy link
Collaborator Author

dbari commented Aug 21, 2025

Any comments by the reviewers on the two open conversations (one and two) or should I resolve them and merge? The CI (full build) is green.

…stead of URLs

Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
@dbari
Copy link
Collaborator Author

dbari commented Aug 21, 2025

/bot run --stage-list "A100X-Triton-Post-Merge-1, A100X-Triton-Post-Merge-2"

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
.gitattributes (1)

10-11: Handle .jpeg and uppercase extensions; simplify with character classes

Git attribute patterns are case-sensitive. Current rules won’t catch .JPEG/.PNG nor .jpeg. Recommend replacing these two lines with case-insensitive character-class patterns that also cover .jpeg.

Apply this diff:

- tests/integration/test_input_files/*.png filter=lfs diff=lfs merge=lfs -text
- tests/integration/test_input_files/*.jpg filter=lfs diff=lfs merge=lfs -text
+tests/integration/test_input_files/*.[Pp][Nn][Gg] filter=lfs diff=lfs merge=lfs -text
+tests/integration/test_input_files/*.[Jj][Pp][Gg] filter=lfs diff=lfs merge=lfs -text
+tests/integration/test_input_files/*.[Jj][Pp][Ee][Gg] filter=lfs diff=lfs merge=lfs -text

Notes:

  • The .jpg and .jpeg variants need separate rules because gitattributes globs have no optional-character syntax; ? matches exactly one character, so a single pattern cannot cover both extensions.
  • If you expect nested subfolders under test_input_files, consider adding an additional pair using **/ to cover recursion (supported in modern Git), or confirm you’ll keep a flat directory.
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between f8bb8d6 and 1024c47.

📒 Files selected for processing (1)
  • .gitattributes (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
.gitattributes (2)

10-11: Scoped LFS rules for test images look good

Limiting LFS to tests/integration/test_input_files prevents repo-wide PNG/JPG from being forced into LFS while still protecting large test assets. Nice, targeted change.


10-11: CI/LFS Sanity Check: Manual Verification Needed

The sandbox environment doesn’t support git lfs, so we can’t auto-verify here. Please confirm in an LFS-enabled clone (e.g. your CI agent with git lfs install) that all image assets under tests/integration/test_input_files are indeed tracked as LFS pointers before running your integration tests:

• Count and list all image files:

find tests/integration/test_input_files \
  -type f \( -iname "*.png" -o -iname "*.jpg" -o -iname "*.jpeg" \) | sort

— Expect 5 files.

• Verify they’re LFS-tracked:

git lfs ls-files -- tests/integration/test_input_files | awk '{print $3}' | sort

— Should also list the same 5 paths.

• Alternatively, check pointer content in your CI checkout:

grep -Il "version https://git-lfs.github.com/spec/v1" tests/integration/test_input_files/*

— Only true pointers will match.

If any image is missing:

  • Re-add it to pick up the new .gitattributes rule:
    git rm --cached <file> && git add <file>
  • Or rewrite history to migrate existing blobs:
    git lfs migrate import \
      --include="tests/integration/test_input_files/*.{png,jpg,jpeg}"

Also ensure your CI pipeline runs:

git lfs install --local
git lfs fetch --all
git lfs checkout

so that pointer files are replaced with real binaries before tests execute.

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16054 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16054 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12073 (Partly Tested) completed with status: 'SUCCESS'

@dbari
Copy link
Collaborator Author

dbari commented Aug 21, 2025

/bot reuse-pipeline

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16067 [ reuse-pipeline ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16067 [ reuse-pipeline ] completed with state SUCCESS
Reusing PR_Github #16054 (Partly Tested) for commit 1024c47

@MartinMarciniszyn MartinMarciniszyn merged commit f49dafe into NVIDIA:main Aug 21, 2025
5 checks passed