[https://nvbugs/5394409][feat] Support Mistral Small 3.1 multimodal in Triton Backend #6714
Conversation
📝 Walkthrough

Adds end-to-end "pixtral" multimodal support: preprocessing updates (new IMAGES input, HF config loading, VisionPreProcessor changes, pixtral-specific processing), ensemble and encoder configs, encoder-side pixtral logic and outputs, client handling of raw image bytes, dynamic prompt embedding table dtype, engine build/test utilities, and CI/template updates.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant U as Client
    participant E as Ensemble Model
    participant P as Preprocessing (Python)
    participant V as Multimodal Encoders (Pixtral)
    participant L as TensorRT-LLM
    U->>E: text_input, images_input (bytes), params
    E->>P: TEXT, IMAGES, params (hf_model_path, model_type=pixtral)
    note right of P: Load HF processor/config<br/>Build input_ids with image token remap<br/>Pad/collect image_sizes
    P-->>E: INPUT_TEXT_TOKENS, IMAGE_SIZES, (other vision tensors)
    E->>V: vision inputs (pixel_values, image_sizes, masks)
    note right of V: Encode images<br/>Produce prompt_embedding_table + prompt_vocab_size
    V-->>E: OUT_PROMPT_EMBEDDING_TABLE, OUT_PROMPT_VOCAB_SIZE
    E->>L: input_ids, prompt_embedding_table, prompt_vocab_size
    note right of L: get_input_tensor_by_name(..., force_on_torch=True)
    L-->>U: generated_text
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
Actionable comments posted: 5
🧹 Nitpick comments (7)
triton_backend/all_models/inflight_batcher_llm/preprocessing/config.pbtxt (1)
264-269: Consider making the HuggingFace model path configurable.

The hardcoded path `/model_hf` might not be suitable for all deployment scenarios. Consider using a parameterized value like `${hf_model_path}` to allow runtime configuration.

```diff
 parameters: {
   key: "hf_model_path"
   value: {
-    string_value: "/model_hf"
+    string_value: "${hf_model_path}"
   }
 }
```

triton_backend/all_models/multimodal/multimodal_encoders/1/model.py (2)
294-316: Batch size limitation needs addressing as mentioned in PR objectives.

The current implementation restricts pixtral to batch size 1 due to variable image counts per request. This limitation is acknowledged in the PR objectives as needing resolution. The attention mask creation logic is correct for handling variable image sizes (a padding/masking sketch follows the list below).
Consider implementing one of the proposed solutions from the PR objectives:
- Keep the restriction but document it clearly
- Allow encoder batches to be broken up for arbitrary image counts
- Rework batching logic for higher batch sizes (ideal but may impact other models)
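To illustrate the padding-and-masking pattern referenced above, here is a minimal, hypothetical sketch; it is not the repository implementation, and tensor names and shapes are assumptions:

```python
# Hypothetical sketch: pad a variable number of images per request to a common
# size and build boolean masks so the encoder can ignore padded pixels.
import torch

def pad_images_with_masks(images):
    """Pad a list of (C, H_i, W_i) image tensors and build per-pixel masks.

    Returns:
        pixel_values: (N, C, max_H, max_W) padded batch.
        attention_masks: (N, max_H, max_W) boolean masks, True for real pixels.
    """
    max_h = max(img.shape[1] for img in images)
    max_w = max(img.shape[2] for img in images)
    pixel_values = torch.zeros(len(images), images[0].shape[0], max_h, max_w)
    attention_masks = torch.zeros(len(images), max_h, max_w, dtype=torch.bool)
    for i, img in enumerate(images):
        _, h, w = img.shape
        pixel_values[i, :, :h, :w] = img
        attention_masks[i, :h, :w] = True
    return pixel_values, attention_masks

imgs = [torch.rand(3, 64, 48), torch.rand(3, 32, 80)]
batch, masks = pad_images_with_masks(imgs)
print(batch.shape, masks.shape)  # torch.Size([2, 3, 64, 80]) torch.Size([2, 64, 80])
```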
666-666: Fix line length to comply with style guidelines.

Line 666 exceeds the 120 character limit. Please break the comment into multiple lines.

```diff
-        # Check if adding this request to the current batch would exceed the maximum batch size
+        # Check if adding this request to the current batch would exceed
+        # the maximum batch size
```

triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py (4)
804-808: Consider validating required attributes for pixtral models.

While using `suppress(AttributeError)` handles missing attributes gracefully, for pixtral models these attributes are required. Consider adding validation.

```python
with suppress(AttributeError):
    # Only for pixtral models
    self.vocab_size = hf_config.text_config.vocab_size
    self.image_size = hf_config.vision_config.image_size
    self.image_token_index = hf_config.image_token_index

# Validate required attributes for pixtral
if vision_model_type == 'pixtral':
    if not all(
            hasattr(self, attr)
            for attr in ['vocab_size', 'image_size', 'image_token_index']):
        raise ValueError(
            "[TensorRT-LLM][ERROR] Failed to extract required config attributes for pixtral model"
        )
```
1065-1065: Fix comment formatting.

The trailing colon in the comment is redundant.

```diff
-        else:  # input_images is not None:
+        else:  # input_images is not None
```
1111-1116: Fix line length violation.

Line 1115 exceeds the 120 character limit.

```diff
 for key, tensor_list in preprocessor_outputs.items():
     val = self.convert_tensor_list_to_tensor(tensor_list)
     if key in self.output_str_dtypes:
-        val = self.convert_tensor_to_str_dtype(
-            val, self.output_str_dtypes[key])
+        val = self.convert_tensor_to_str_dtype(val,
+                                               self.output_str_dtypes[key])
     vision_processed_tensors[key] = val
```
1118-1129: Document the token replacement logic.

The logic for replacing image tokens with unique IDs starting from vocab_size is critical for prompt table lookup. Consider adding a comment to explain this behavior.

```diff
 # Replace all image tokens with a unique token_id > vocab_size.
 # This shall be used to lookup the prompt table.
+# Each image token in a batch gets a unique ID starting from vocab_size,
+# incrementing sequentially. This ensures proper mapping to visual embeddings.
 for batch_id in range(batch_size):
```
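For context, a minimal, hypothetical sketch of the remapping described above (function and variable names are illustrative, not the repository implementation):

```python
# Every occurrence of the image placeholder token is replaced by a fresh ID
# >= vocab_size so that the runtime can index into the prompt embedding table.
import numpy as np

def remap_image_tokens(input_ids, image_token_index, vocab_size):
    """Return a copy of input_ids where image tokens get unique IDs >= vocab_size."""
    remapped = np.array(input_ids, copy=True)
    next_id = vocab_size
    for i, token in enumerate(remapped):
        if token == image_token_index:
            remapped[i] = next_id
            next_id += 1
    return remapped

# Example: vocab_size=32000, image token id 10 appears twice.
print(remap_image_tokens([1, 10, 10, 42], image_token_index=10, vocab_size=32000))
# image tokens become 32000 and 32001; text tokens are unchanged
```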
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py (10 hunks)
- triton_backend/all_models/inflight_batcher_llm/preprocessing/config.pbtxt (3 hunks)
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm/1/model.py (1 hunks)
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt (1 hunks)
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt (1 hunks)
- triton_backend/all_models/multimodal/ensemble/config.pbtxt (2 hunks)
- triton_backend/all_models/multimodal/multimodal_encoders/1/model.py (4 hunks)
- triton_backend/all_models/multimodal/multimodal_encoders/config.pbtxt (2 hunks)
- triton_backend/tools/multimodal/client.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
triton_backend/all_models/inflight_batcher_llm/tensorrt_llm/1/model.py
triton_backend/tools/multimodal/client.py
triton_backend/all_models/multimodal/multimodal_encoders/1/model.py
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
triton_backend/all_models/inflight_batcher_llm/tensorrt_llm/1/model.py
triton_backend/tools/multimodal/client.py
triton_backend/all_models/multimodal/multimodal_encoders/1/model.py
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
🧠 Learnings (4)
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Applied to files:
triton_backend/all_models/inflight_batcher_llm/tensorrt_llm/1/model.py
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
triton_backend/all_models/inflight_batcher_llm/tensorrt_llm/1/model.py
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
📚 Learning: in tensorrt-llm, examples directory can have different dependency versions than the root requirement...
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Applied to files:
triton_backend/all_models/inflight_batcher_llm/preprocessing/config.pbtxt
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
📚 Learning: ministral is a valid model name from mistral ai, distinct from the regular mistral models. in tensor...
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6650
File: tests/integration/test_lists/qa/llm_perf_cluster.yml:33-37
Timestamp: 2025-08-06T03:47:16.802Z
Learning: Ministral is a valid model name from Mistral AI, distinct from the regular Mistral models. In TensorRT-LLM test configurations, "ministral_8b" and "ministral_8b_fp8" are correct model identifiers and should not be changed to "mistral_8b".
Applied to files:
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
🪛 Ruff (0.12.2)
triton_backend/all_models/multimodal/multimodal_encoders/1/model.py
666-666: Line too long (132 > 120)
(E501)
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
1115-1115: Line too long (149 > 120)
(E501)
🔇 Additional comments (14)
triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt (1)
202-202: LGTM: Clean parameterization of prompt embedding table data type.

The change from hardcoded `TYPE_FP16` to the configurable parameter `${prompt_embedding_table_data_type}` enables flexible data type specification for the prompt embedding table input. This aligns well with the broader multimodal support requirements and maintains consistency across the related configuration files.

triton_backend/all_models/inflight_batcher_llm/tensorrt_llm/1/model.py (1)
283-285: LGTM: Proper tensor format enforcement for prompt embedding table.

Adding `force_on_torch=True` ensures the `prompt_embedding_table` tensor is consistently converted to PyTorch format regardless of its origin (CPU/GPU). This change complements the parameterized data type configuration and maintains robust tensor handling in the downstream processing logic.

triton_backend/all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt (1)
322-322: LGTM: Consistent parameterization across configuration files.

The change to `${prompt_embedding_table_data_type}` maintains consistency with the identical parameterization in the `tensorrt_llm_bls/config.pbtxt` file. This coordinated approach across all related configuration files ensures uniform handling of the prompt embedding table data type throughout the system.

triton_backend/all_models/multimodal/ensemble/config.pbtxt (2)
60-66: LGTM! Proper configuration for pixtral multi-image support.

The new optional input correctly uses TYPE_STRING for raw image bytes and dynamic dimensions to handle an arbitrary number of images per request.
263-266: LGTM! Correct input mapping for preprocessing.

The mapping properly connects the ensemble's `images_input` to the preprocessing model's `IMAGES` input.

triton_backend/all_models/multimodal/multimodal_encoders/config.pbtxt (2)
79-79: LGTM! Dynamic dimensions support multiple images.

The change from `[2]` to `[-1, 2]` correctly enables handling multiple image sizes, which aligns with the Mistral Small 3.1 requirement for supporting multiple images with different resolutions per prompt.
117-117: LGTM! Parameterized data type provides flexibility.

Using `${prompt_embedding_table_data_type}` allows runtime configuration of the precision, which is beneficial for supporting different model requirements and optimization scenarios.

triton_backend/all_models/inflight_batcher_llm/preprocessing/config.pbtxt (2)
61-67: LGTM! Proper configuration for pixtral image input.

The optional STRING input with dynamic dimensions correctly supports an arbitrary number of images for the pixtral model.
198-204: Verify IMAGE_SIZES dimension change for llava and update TODO.

I didn't find any llava-specific code referencing the `IMAGE_SIZES` tensor; please double-check that the generic inflight_batcher_llm preprocessor handles `dims: [-1, 2]` correctly for llava_onevision before removing the TODO.

• Confirm in `triton_backend/all_models/inflight_batcher_llm/preprocessing/config.pbtxt` (and related preprocessing code) that llava_onevision accepts the new `[-1, 2]` shape
• If llava_onevision no longer breaks, remove or reword the `# TODO: Unbreak llava!` comment
• Add a regression test for llava_onevision's image postprocessing to catch any dimension mismatches in the future

triton_backend/tools/multimodal/client.py (3)
67-79: LGTM! Clean implementation for flexible image loading.

The refactored `load_image` function elegantly handles both raw bytes and PIL image outputs while maintaining backward compatibility through the default parameter.
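A minimal, hypothetical sketch of such a dual-mode loader (the real parameter name and internals in `client.py` may differ):

```python
# Returns raw encoded bytes for models like pixtral that consume encoded images,
# or a decoded PIL image otherwise. `return_bytes` is illustrative, not the actual API.
import io
import requests
from PIL import Image

def load_image(image_path, return_bytes=False):
    """Load an image from a URL or local path as raw bytes or a PIL image."""
    if image_path.startswith(("http://", "https://")):
        data = requests.get(image_path, timeout=30).content
    else:
        with open(image_path, "rb") as f:
            data = f.read()
    if return_bytes:
        return data
    return Image.open(io.BytesIO(data)).convert("RGB")
```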
242-242: LGTM! Pixtral model type properly added to choices.
257-261: LGTM! Correct handling of raw image bytes for pixtral.

The implementation properly loads images as raw bytes and prepares them as a numpy object array, which aligns with the pixtral model's requirements for processing multiple raw image inputs.
Also applies to: 311-313
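For illustration, a hedged sketch of packing raw image bytes into a NumPy object array for a Triton TYPE_STRING input (variable names and shapes are assumptions):

```python
import numpy as np

# Raw encoded bytes for each image in the request
# (e.g. obtained from a bytes-returning image loader).
image_bytes = [b"<jpeg bytes 0>", b"<png bytes 1>"]

# Triton TYPE_STRING inputs are sent as object arrays of bytes; shape
# (1, num_images) matches a batch of one request carrying two images.
images_input = np.array([image_bytes], dtype=np.object_)
print(images_input.shape)  # (1, 2)
```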
triton_backend/all_models/multimodal/multimodal_encoders/1/model.py (2)
165-178: LGTM! Proper initialization for pixtral model.

The configuration loading and parameter extraction correctly sets up the necessary values for pixtral image processing.
541-581: LGTM! Postprocessing correctly handles pixtral visual features.

The implementation properly extracts and reshapes visual features according to the spatial merge size. The single batch assertion is consistent with the batch size limitation in `get_requests`.
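As background, a hypothetical sketch of one way spatial merging of vision features can work (merge each `spatial_merge_size x spatial_merge_size` patch block into one token by concatenating channels); this is not the repository implementation, and shapes and names are assumptions:

```python
import torch

def merge_patches(features, grid_h, grid_w, spatial_merge_size=2):
    """(grid_h*grid_w, hidden) -> (grid_h*grid_w / merge^2, hidden * merge^2)."""
    hidden = features.shape[-1]
    m = spatial_merge_size
    x = features.view(grid_h, grid_w, hidden)
    # Split the patch grid into m x m blocks, then flatten each block's features.
    x = x.view(grid_h // m, m, grid_w // m, m, hidden)
    x = x.permute(0, 2, 1, 3, 4).reshape((grid_h // m) * (grid_w // m), m * m * hidden)
    return x

feats = torch.randn(16 * 16, 1024)
print(merge_patches(feats, 16, 16).shape)  # torch.Size([64, 4096])
```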
Actionable comments posted: 0
♻️ Duplicate comments (3)
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py (3)
139-142: Fix line length violation and update error message.

The error message doesn't include 'pixtral' in the list of supported models, creating an inconsistency with the validation list. Additionally, the line exceeds the 120-character limit.

```diff
 assert self.model_type in [
     'llava', 'blip2-opt', 'pixtral', 'vila', 'mllama', 'llava_onevision', 'qwen2_vl'
-], f"[TensorRT-LLM][ERROR] Currently supported multi-modal models are llava, blip2-opt, vila, mllama, llava_onevision and qwen2_vl. Got {self.model_type}."
+], (f"[TensorRT-LLM][ERROR] Currently supported multi-modal models are "
+    f"llava, blip2-opt, pixtral, vila, mllama, llava_onevision and qwen2_vl. "
+    f"Got {self.model_type}.")
```
154-170: Add robust validation for the hf_model_path parameter.

The code assumes `hf_model_path` exists without validation, which could lead to runtime errors. The parameter access is not safely handled.

```diff
 if self.model_type in [
         'mllama', 'llava_onevision', 'qwen2_vl', 'pixtral'
 ]:
     # Read the correct path from the config
-    hf_model_path = model_config['parameters']['hf_model_path'][
-        'string_value']
+    hf_model_path = model_config['parameters'].get('hf_model_path')
+    if not hf_model_path:
+        raise ValueError(f"[TensorRT-LLM][ERROR] 'hf_model_path' parameter is required for {self.model_type} model but not found in config")
+    hf_model_path = hf_model_path['string_value']
+    if not hf_model_path or hf_model_path == "${hf_model_path}":
+        raise ValueError(f"[TensorRT-LLM][ERROR] 'hf_model_path' parameter is not properly configured for {self.model_type} model")
```
335-346: Fix incorrect array shape for request_input_len.

The current implementation creates an array with shape (1, batch_size) instead of the expected (batch_size, 1) shape.

```diff
-        request_input_len = np.array([[
-            len(input_ids_for_batch)
-            for input_ids_for_batch in pixtral_input_id_tensor
-        ]])
+        request_input_len = np.array([
+            [len(input_ids_for_batch)]
+            for input_ids_for_batch in pixtral_input_id_tensor
+        ])
```
🧹 Nitpick comments (1)
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py (1)
1043-1131: Comprehensive pixtral processing with minor improvements needed.

The implementation is well-structured and handles multiple input types correctly. However, there are a few areas for improvement:
- Line length violation (line 1116): The line exceeds 120 characters
- Explicit error handling instead of assert (line 1067 equivalent - already handled with ValueError)
- Input validation could be enhanced
```diff
 if key in self.output_str_dtypes:
-    val = self.convert_tensor_to_str_dtype(
-        val, self.output_str_dtypes[key])
+    val = self.convert_tensor_to_str_dtype(val,
+                                           self.output_str_dtypes[key])
```

The method demonstrates good practices with:
- Proper error handling for invalid input dimensions
- Consistent batch processing logic
- Appropriate tensor manipulation
- Clear separation of concerns between input types
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py (10 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Files:
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Files:
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
🧠 Learnings (5)
📚 Learning: ministral is a valid model name from mistral ai, distinct from the regular mistral models. in tensor...
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6650
File: tests/integration/test_lists/qa/llm_perf_cluster.yml:33-37
Timestamp: 2025-08-06T03:47:16.802Z
Learning: Ministral is a valid model name from Mistral AI, distinct from the regular Mistral models. In TensorRT-LLM test configurations, "ministral_8b" and "ministral_8b_fp8" are correct model identifiers and should not be changed to "mistral_8b".
Applied to files:
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
📚 Learning: in tensorrt-llm, examples directory can have different dependency versions than the root requirement...
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Applied to files:
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Applied to files:
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
🪛 Ruff (0.12.2)
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
141-141: Line too long (176 > 120)
(E501)
1116-1116: Line too long (149 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (7)
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py (7)
38-38: LGTM: Import addition is appropriate.

The addition of `AutoConfig` and `AutoProcessor` imports is necessary for the new pixtral model support and follows the namespace preservation guideline.
299-303: LGTM: Input tensor handling is correctly implemented.

The addition of `IMAGES` input tensor handling is properly integrated with existing multimodal logic and follows the established pattern for other input tensors.
360-362: LGTM: Conditional input creation bypass is logical.

The bypass of `_create_request` for the pixtral model is appropriate since the method handles its own input ID creation and request length calculation.
393-399: LGTM: Output tensor creation follows established patterns.

The pixtral-specific tensor creation properly converts the processed data to the expected Triton tensor format and uses appropriate data types.
754-755: LGTM: Constructor parameter addition is well-designed.

The addition of the `hf_config` parameter to the VisionPreProcessor constructor is necessary for pixtral model support and maintains backward compatibility.
803-806: LGTM: Pixtral-specific configuration initialization is appropriate.

The initialization of vocab_size, image_size, and image_token_index from the HuggingFace config is correctly implemented for pixtral model support.
1065-1067: LGTM: Proper error handling instead of assert.

The code correctly uses explicit error handling with ValueError instead of assert statements, ensuring validation is always enforced regardless of Python optimization flags.
/bot run

PR_Github #14492 [ run ] triggered by Bot

PR_Github #14492 [ run ] completed with state
Actionable comments posted: 5
🧹 Nitpick comments (7)
tests/integration/defs/triton_server/test.sh (1)
47-50: Use more idiomatic default assignment.

Tiny shell-style nit: the same effect can be written in one line while preventing an accidental later re-assignment:

```sh
: "${PROMPT_EMBEDDING_TABLE_DTYPE:=TYPE_FP16}"
readonly PROMPT_EMBEDDING_TABLE_DTYPE
```

This is POSIX-compliant, shorter, and marks the value immutable.
tests/integration/defs/triton_server/build_engines.py (1)
1768-1781: Add a concise docstring per guidelines.

Improves readability and Sphinx parsing.

```diff
 def prepare_mistral3_pixtral_engine(tensorrt_llm_multimodal_example_root,
                                     tensorrt_llm_llama_example_root,
                                     mistral_small_model_root):
+    """Prepare Mistral Small 3.1 (Pixtral) engines.
+
+    Converts the HF checkpoint to local, builds the LLM engine, and builds the
+    Pixtral visual encoder engine.
+
+    Args:
+        tensorrt_llm_multimodal_example_root: Path to multimodal example root.
+        tensorrt_llm_llama_example_root: Path to llama example root for converter.
+        mistral_small_model_root: HF model directory for Mistral Small 3.1.
+
+    Returns:
+        Tuple[str, str]: (llm engine_dir, multimodal_engine_dir).
+    """
```

tests/integration/defs/triton_server/test_triton_llm.py (2)
3958-3962: Explicitly import prepare_mistral3_pixtral_engine to avoid F405 and improve clarity.

Static analysis flags F405 due to star-imports. Prefer explicit imports for used symbols to reduce ambiguity and lint noise.

Outside this hunk, update imports at the top of the file:

```diff
-from .build_engines import *
-from .common import *
+from .build_engines import (
+    prepare_llama_v2_7b_engine,
+    prepare_llama_v2_70b_engine,
+    prepare_mistral_v1_7b_engine,
+    prepare_medusa_vicuna_7b_engine,
+    prepare_eagle_vicuna_7b_engine,
+    prepare_gpt_350m_engine,
+    prepare_gpt_175b_engine,
+    prepare_llama_v3_8b_engine,
+    prepare_llama_v3_70b_engine,
+    prepare_gpt_gather_logits_engine,
+    prepare_t5_small_engine,
+    prepare_whisper_large_engine,
+    prepare_gpt_2b_lora_engine,
+    prepare_tiny_llama_1b_engine,
+    prepare_mllama_engine,
+    prepare_llava_engine,
+    prepare_llava_onevision_engine,
+    prepare_mistral3_pixtral_engine,
+)
+from .common import (
+    prepare_ib_model_repo,
+    prepare_multimodal_model_repo,
+    prepare_custom_config,
+    prepare_disaggregated_serving_model_repo,
+    modify_ib_config_pbtxt,
+    modify_disaggregated_serving_config_pbtxt,
+    check_server_ready,
+    run_cpp_backend_tests,
+    run_cpp_streaming_backend_tests,
+    validate_by_keyword,
+    query_gpu_name,
+    get_gpu_full_name,
+    retrieve_latency_value,
+)
```
3909-3911: Document rationale for TRITON_MAX_BATCH_SIZE=1.

Given current batching limitations with variable image counts, locking max batch size to 1 is intentional. Add a short comment here to prevent future "optimizations" that might reintroduce batching regressions.
Example:
NOTE: Pixtral currently supports variable #images per request; set max_batch_size=1 until batching is reworked.
tests/integration/defs/triton_server/common.py (3)
250-251: Document and validate the new dtype parameter (keep backward compatible).

Add a brief docstring entry for PROMPT_EMBEDDING_TABLE_DTYPE (allowed Triton types, default is TYPE_FP16). Optional: validate input early to catch typos.
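For instance, a minimal validation sketch; the allowed-type set below is an assumption and should be adjusted to whatever Triton types the backend actually supports:

```python
# Hypothetical early validation of the prompt embedding table dtype parameter.
ALLOWED_PROMPT_TABLE_DTYPES = {"TYPE_FP16", "TYPE_BF16", "TYPE_FP32"}

def validate_prompt_embedding_table_dtype(dtype: str) -> str:
    """Fail fast on typos instead of passing a bad value into fill_template."""
    if dtype not in ALLOWED_PROMPT_TABLE_DTYPES:
        raise ValueError(
            f"Unsupported PROMPT_EMBEDDING_TABLE_DTYPE '{dtype}'; "
            f"expected one of {sorted(ALLOWED_PROMPT_TABLE_DTYPES)}")
    return dtype

print(validate_prompt_embedding_table_dtype("TYPE_FP16"))  # TYPE_FP16
```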
278-278: Consistently propagate prompt_embedding_table_data_type (OK), consider DRY + quoting.

Propagation looks correct. To reduce duplication and avoid shell word-splitting, consider:
- Define once: prompt_dtype_kv = f"prompt_embedding_table_data_type:{PROMPT_EMBEDDING_TABLE_DTYPE}"
- Inject prompt_dtype_kv into each fill_template call
- Quote the KV list (like other calls) to protect spaces in paths.
Also applies to: 310-310, 335-335, 356-356, 371-371
355-356: Fix long line (Ruff E501) by splitting KV assignments.

```diff
-            f"tensorrt_llm_model_name:{TENSORRT_LLM_TARGET_MODEL_NAME},tensorrt_llm_draft_model_name:{TENSORRT_LLM_DRAFT_MODEL_NAME},logits_datatype:TYPE_FP32," \
+            f"tensorrt_llm_model_name:{TENSORRT_LLM_TARGET_MODEL_NAME}," \
+            f"tensorrt_llm_draft_model_name:{TENSORRT_LLM_DRAFT_MODEL_NAME},logits_datatype:TYPE_FP32," \
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- tests/integration/defs/triton_server/build_engines.py (1 hunks)
- tests/integration/defs/triton_server/common.py (6 hunks)
- tests/integration/defs/triton_server/conftest.py (1 hunks)
- tests/integration/defs/triton_server/test.sh (5 hunks)
- tests/integration/defs/triton_server/test_triton_llm.py (1 hunks)
- triton_backend/ci/L0_backend_trtllm/test.sh (1 hunks)
✅ Files skipped from review due to trivial changes (2)
- triton_backend/ci/L0_backend_trtllm/test.sh
- tests/integration/defs/triton_server/conftest.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Files:
tests/integration/defs/triton_server/build_engines.py
tests/integration/defs/triton_server/common.py
tests/integration/defs/triton_server/test_triton_llm.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Files:
tests/integration/defs/triton_server/build_engines.py
tests/integration/defs/triton_server/common.py
tests/integration/defs/triton_server/test_triton_llm.py
🧠 Learnings (7)
📓 Common learnings
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6650
File: tests/integration/test_lists/qa/llm_perf_cluster.yml:33-37
Timestamp: 2025-08-06T03:47:16.802Z
Learning: Ministral is a valid model name from Mistral AI, distinct from the regular Mistral models. In TensorRT-LLM test configurations, "ministral_8b" and "ministral_8b_fp8" are correct model identifiers and should not be changed to "mistral_8b".
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6650
File: tests/integration/test_lists/qa/llm_perf_cluster.yml:33-37
Timestamp: 2025-08-06T03:47:16.802Z
Learning: Ministral is a valid and distinct model family from Mistral AI, separate from their regular Mistral models. Ministral 8B is specifically designed for edge computing and on-device applications, released in October 2024. In TensorRT-LLM test configurations, "ministral_8b" and "ministral_8b_fp8" are correct model identifiers and should not be changed to "mistral_8b".
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/defs/triton_server/build_engines.py
tests/integration/defs/triton_server/test.sh
tests/integration/defs/triton_server/common.py
tests/integration/defs/triton_server/test_triton_llm.py
📚 Learning: 2025-08-06T03:47:16.802Z
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6650
File: tests/integration/test_lists/qa/llm_perf_cluster.yml:33-37
Timestamp: 2025-08-06T03:47:16.802Z
Learning: Ministral is a valid model name from Mistral AI, distinct from the regular Mistral models. In TensorRT-LLM test configurations, "ministral_8b" and "ministral_8b_fp8" are correct model identifiers and should not be changed to "mistral_8b".
Applied to files:
tests/integration/defs/triton_server/build_engines.py
tests/integration/defs/triton_server/test_triton_llm.py
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Applied to files:
tests/integration/defs/triton_server/build_engines.py
tests/integration/defs/triton_server/common.py
tests/integration/defs/triton_server/test_triton_llm.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tests/integration/defs/triton_server/test.sh
tests/integration/defs/triton_server/common.py
tests/integration/defs/triton_server/test_triton_llm.py
📚 Learning: 2025-07-22T09:22:14.726Z
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Applied to files:
tests/integration/defs/triton_server/test.sh
tests/integration/defs/triton_server/test_triton_llm.py
📚 Learning: 2025-08-06T03:47:16.802Z
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6650
File: tests/integration/test_lists/qa/llm_perf_cluster.yml:33-37
Timestamp: 2025-08-06T03:47:16.802Z
Learning: Ministral is a valid and distinct model family from Mistral AI, separate from their regular Mistral models. Ministral 8B is specifically designed for edge computing and on-device applications, released in October 2024. In TensorRT-LLM test configurations, "ministral_8b" and "ministral_8b_fp8" are correct model identifiers and should not be changed to "mistral_8b".
Applied to files:
tests/integration/defs/triton_server/test_triton_llm.py
🪛 Ruff (0.12.2)
tests/integration/defs/triton_server/common.py
341-341: Line too long (128 > 120)
(E501)
355-355: Line too long (158 > 120)
(E501)
tests/integration/defs/triton_server/test_triton_llm.py
3959-3959: prepare_mistral3_pixtral_engine may be undefined, or defined from star imports
(F405)
3965-3965: prepare_ib_model_repo may be undefined, or defined from star imports
(F405)
3968-3968: prepare_multimodal_model_repo may be undefined, or defined from star imports
(F405)
3970-3970: prepare_multimodal_model_repo may be undefined, or defined from star imports
(F405)
3975-3975: modify_ib_config_pbtxt may be undefined, or defined from star imports
(F405)
4012-4012: check_server_ready may be undefined, or defined from star imports
(F405)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (6)
tests/integration/defs/triton_server/test.sh (1)
189-190: Double-check template placeholders for the new key.

All `fill_template.py` invocations now pass `prompt_embedding_table_data_type`. Make sure every affected `config.pbtxt` template actually contains the `${prompt_embedding_table_data_type}` placeholder; otherwise the key will be silently ignored and the runtime will still default to `TYPE_FP16`.

Also applies to: 195-198, 208-208, 226-226
tests/integration/defs/triton_server/build_engines.py (1)
1803-1810: Verify visual encoder max_batch_size=2 is sufficient for "multiple images per prompt".

Given the batching constraints discussed in the PR, this hard-cap may be too restrictive for realistic prompts. Other multimodal builders use 8–16.
If acceptable, leave as-is; otherwise consider:
- "--max_batch_size=2", + "--max_batch_size=8",Or parameterize it if tests need different limits.
tests/integration/defs/triton_server/test_triton_llm.py (3)
3948-3951: Fixture verified: `mistral_small_3_1_24b_model_root` exists.

The `mistral_small_3_1_24b_model_root` fixture is defined in
- tests/integration/defs/triton_server/conftest.py:554
No further action required.
3919-3922: End-to-end dtype templating confirmed.

– `modify_ib_config_pbtxt` accepts `ENCODER_INPUT_FEATURES_DTYPE` and `PROMPT_EMBEDDING_TABLE_DTYPE`
– Integration tests invoke `tools/fill_template.py` with `encoder_input_features_data_type:${ENCODER_INPUT_FEATURES_DTYPE}` and `prompt_embedding_table_data_type:${PROMPT_EMBEDDING_TABLE_DTYPE}`
– All relevant `triton_backend/all_models/**/config.pbtxt` files declare `${encoder_input_features_data_type}` and `${prompt_embedding_table_data_type}` placeholders

No changes needed.
3999-4004: Confirm Python backend in generated Triton configs (manual verification needed).

I ran the provided script but didn't find any `config.pbtxt` files under `triton_backend/triton_repo`; the model repo is generated at test time. Please:

- Run the integration tests to create `triton_backend/triton_repo`
- Inspect each `config.pbtxt` and verify it contains `backend: "python"` (or `platform: "python"`)

This ensures we never hit the C++ backend path.
tests/integration/defs/triton_server/common.py (1)
341-341: Break long f-string into multiple lines and keep KV args consistent.

• File: tests/integration/defs/triton_server/common.py, line 341

Apply the same splitting style as nearby calls:

```diff
-        f"python3 {fill_template_py} -i {preprocessing_config} tokenizer_dir:{TOKENIZER_PATH},hf_model_path:{TOKENIZER_PATH}," \
+        f"python3 {fill_template_py} -i {preprocessing_config} " \
+        f"tokenizer_dir:{TOKENIZER_PATH},hf_model_path:{TOKENIZER_PATH}," \
```

Placeholders for `tokenizer_dir` and `hf_model_path` are already present in each `preprocessing/config.pbtxt`, so the CI will continue to pass.
Actionable comments posted: 0
♻️ Duplicate comments (1)
tests/integration/defs/triton_server/test_triton_llm.py (1)
4025-4031: Quote the `--text` argument to preserve the embedded newline.

`TEXT` can contain a `\n`; without quoting, it may be split into multiple argv tokens.

Use the same quoting style applied to the mllama test:

```diff
-        f"--text={TEXT}",
+        f"--text='{TEXT}'",
```
🧹 Nitpick comments (1)
tests/integration/defs/triton_server/test_triton_llm.py (1)
3968-3971: Consider asserting basic correctness.

The test only checks that the call succeeds; no assertions are made on the response.
Adding a minimal keyword check (similar to other multimodal tests) would catch silent regressions.
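A hedged sketch of such a check (the helper name, captured output, and expected keyword below are illustrative only, not the repository's test utilities):

```python
def assert_output_contains(output: str, expected_keyword: str) -> None:
    """Minimal keyword check on the client's captured stdout."""
    assert expected_keyword.lower() in output.lower(), (
        f"Expected keyword '{expected_keyword}' not found in client output")

# Example usage with a hypothetical captured client response:
assert_output_contains("The image shows the Merlion in Singapore.", "Singapore")
```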
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tests/integration/defs/triton_server/test_triton_llm.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Files:
tests/integration/defs/triton_server/test_triton_llm.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Files:
tests/integration/defs/triton_server/test_triton_llm.py
🧠 Learnings (6)
📓 Common learnings
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6650
File: tests/integration/test_lists/qa/llm_perf_cluster.yml:33-37
Timestamp: 2025-08-06T03:47:16.802Z
Learning: Ministral is a valid model name from Mistral AI, distinct from the regular Mistral models. In TensorRT-LLM test configurations, "ministral_8b" and "ministral_8b_fp8" are correct model identifiers and should not be changed to "mistral_8b".
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/defs/triton_server/test_triton_llm.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tests/integration/defs/triton_server/test_triton_llm.py
📚 Learning: 2025-08-06T03:47:16.802Z
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6650
File: tests/integration/test_lists/qa/llm_perf_cluster.yml:33-37
Timestamp: 2025-08-06T03:47:16.802Z
Learning: Ministral is a valid model name from Mistral AI, distinct from the regular Mistral models. In TensorRT-LLM test configurations, "ministral_8b" and "ministral_8b_fp8" are correct model identifiers and should not be changed to "mistral_8b".
Applied to files:
tests/integration/defs/triton_server/test_triton_llm.py
📚 Learning: 2025-08-06T03:47:16.802Z
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6650
File: tests/integration/test_lists/qa/llm_perf_cluster.yml:33-37
Timestamp: 2025-08-06T03:47:16.802Z
Learning: Ministral is a valid and distinct model family from Mistral AI, separate from their regular Mistral models. Ministral 8B is specifically designed for edge computing and on-device applications, released in October 2024. In TensorRT-LLM test configurations, "ministral_8b" and "ministral_8b_fp8" are correct model identifiers and should not be changed to "mistral_8b".
Applied to files:
tests/integration/defs/triton_server/test_triton_llm.py
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Applied to files:
tests/integration/defs/triton_server/test_triton_llm.py
🪛 Ruff (0.12.2)
tests/integration/defs/triton_server/test_triton_llm.py
3897-3897: Line too long (155 > 120)
(E501)
3968-3968: prepare_mistral3_pixtral_engine may be undefined, or defined from star imports
(F405)
3974-3974: prepare_ib_model_repo may be undefined, or defined from star imports
(F405)
3977-3977: prepare_multimodal_model_repo may be undefined, or defined from star imports
(F405)
3979-3979: prepare_multimodal_model_repo may be undefined, or defined from star imports
(F405)
3984-3984: modify_ib_config_pbtxt may be undefined, or defined from star imports
(F405)
4021-4021: check_server_ready may be undefined, or defined from star imports
(F405)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tests/integration/defs/triton_server/test_triton_llm.py (1)
3894-3899: Double-check client support for comma-separated multi-image input.

The second parametrized case passes two image URLs as a single `--image` value separated by a comma.

Ensure `client.py` for `pixtral` actually expects this format (vs. multiple `--image` flags or JSON).
If not, adjust the test or extend the parser accordingly.
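If the comma-separated form is the intended contract, a minimal argparse sketch could look like the following (hypothetical; the real flag handling in `client.py` may differ):

```python
import argparse

parser = argparse.ArgumentParser()
# Accept one URL or a comma-separated list of URLs in a single --image flag.
parser.add_argument("--image", type=str, required=True,
                    help="Image URL(s); separate multiple images with commas")
args = parser.parse_args(["--image", "http://a.example/1.jpg,http://b.example/2.jpg"])

image_urls = [url.strip() for url in args.image.split(",") if url.strip()]
print(image_urls)  # ['http://a.example/1.jpg', 'http://b.example/2.jpg']
```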
Actionable comments posted: 2
♻️ Duplicate comments (2)
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py (2)
154-160: `hf_model_path` still not validated; comment already raised on a previous commit.
Please add the missing presence / placeholder checks to avoid runtime failures before the path is dereferenced.
343-346: `request_input_len` shape is transposed.

The construction gives shape `(1, batch_size)`; downstream code expects `(batch_size, 1)` (see earlier review).

Swap the loops as previously suggested.

```diff
-request_input_len = np.array([[
-    len(ids) for ids in pixtral_input_id_tensor
-]])
+request_input_len = np.array(
+    [[len(ids)] for ids in pixtral_input_id_tensor]
+)
```
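To see the difference, a quick NumPy check (purely illustrative):

```python
import numpy as np

lengths = [3, 5]  # token counts for two requests in a batch
print(np.array([[n for n in lengths]]).shape)  # (1, 2): transposed, wrong
print(np.array([[n] for n in lengths]).shape)  # (2, 1): one length per request
```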
🧹 Nitpick comments (1)
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py (1)
299-304: Prefer explicit `is not None` checks for input-tensor presence.

Using truthiness on `pb_utils.Tensor` objects (`img_urls or image_bytes …`) relies on the object's `__bool__` implementation and becomes brittle.

Consider:

```python
has_vision_inputs = any(
    tensor is not None
    for tensor in (img_urls, image_bytes, input_images, video_bytes)
)
if self.is_multimodal and has_vision_inputs:
    …
```

This avoids accidental `ValueError: ambiguous truth value` issues when the underlying object turns into a NumPy array.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py (10 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Files:
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Files:
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
🧠 Learnings (6)
📓 Common learnings
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6650
File: tests/integration/test_lists/qa/llm_perf_cluster.yml:33-37
Timestamp: 2025-08-06T03:47:16.802Z
Learning: Ministral is a valid model name from Mistral AI, distinct from the regular Mistral models. In TensorRT-LLM test configurations, "ministral_8b" and "ministral_8b_fp8" are correct model identifiers and should not be changed to "mistral_8b".
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6650
File: tests/integration/test_lists/qa/llm_perf_cluster.yml:33-37
Timestamp: 2025-08-06T03:47:16.802Z
Learning: Ministral is a valid and distinct model family from Mistral AI, separate from their regular Mistral models. Ministral 8B is specifically designed for edge computing and on-device applications, released in October 2024. In TensorRT-LLM test configurations, "ministral_8b" and "ministral_8b_fp8" are correct model identifiers and should not be changed to "mistral_8b".
📚 Learning: 2025-08-06T03:47:16.802Z
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6650
File: tests/integration/test_lists/qa/llm_perf_cluster.yml:33-37
Timestamp: 2025-08-06T03:47:16.802Z
Learning: Ministral is a valid model name from Mistral AI, distinct from the regular Mistral models. In TensorRT-LLM test configurations, "ministral_8b" and "ministral_8b_fp8" are correct model identifiers and should not be changed to "mistral_8b".
Applied to files:
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Applied to files:
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
📚 Learning: 2025-07-17T09:01:27.402Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
📚 Learning: 2025-07-22T09:22:14.726Z
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Applied to files:
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
🪛 Ruff (0.12.2)
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
141-141: Line too long (176 > 120)
(E501)
1125-1125: Line too long (149 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py (1)
394-399: Potential device mismatch when converting Torch tensor to NumPy
.numpy() only works on CPU tensors. If upstream code ever moves INPUT_IDS to GPU, this line will raise.
Safer:

    input_ids_np = pixtral_input_id_tensor.cpu().numpy().astype(self.input_id_dtype)
    input_id_tensor = pb_utils.Tensor('INPUT_ID', input_ids_np)

Add .cpu() or ensure the tensor is guaranteed to reside on CPU beforehand.
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
Outdated
|
/bot run --stage-list "A30-Triton-Post-Merge-1, A30-Triton-Post-Merge-2, B200_PCIe-Triton-Post-Merge-1, A100X-Triton-Post-Merge-1, A100X-Triton-Post-Merge-2" |
Actionable comments posted: 3
🧹 Nitpick comments (2)
tests/integration/defs/triton_server/build_model.sh (1)
716-723: Align trtllm-build flags with existing conventions and set plugin policy explicitly. Other blocks explicitly configure GEMM/attention plugins and use space-separated flags. Recommend matching that for predictability.
Apply:
    - trtllm-build --checkpoint_dir ./c-model/Mistral-Small-3.1-24B-Instruct-2503/bf16 \
    -     --max_batch_size=8 \
    -     --max_input_len=4096 \
    -     --max_seq_len=4096 \
    -     --max_multimodal_len=32768 \
    -     --use_paged_context_fmha=enable \
    -     --output_dir trt_engines/Mistral-Small-3.1-24B-Instruct-2503/bf16/1-gpu
    + trtllm-build --checkpoint_dir ./c-model/Mistral-Small-3.1-24B-Instruct-2503/bf16 \
    +     --gemm_plugin auto \
    +     --max_batch_size 8 \
    +     --max_input_len 4096 \
    +     --max_seq_len 4096 \
    +     --max_multimodal_len 32768 \
    +     --use_paged_context_fmha enable \
    +     --output_dir trt_engines/Mistral-Small-3.1-24B-Instruct-2503/bf16/1-gpu

Note: If you rely on default plugin selection, keep it but still switch to the consistent flag formatting.
tests/integration/defs/triton_server/test_triton.py (1)
476-479: Wrap long lines to satisfy E501 and improve readability. Ruff flagged these lines (E501). Wrap the f-strings using parentheses.
Apply:
    -    # Visual encoder path follows the multimodal tmp engine pattern similar to others
    -    visual_encoder_path = f"{llm_root}/examples/models/core/multimodal/tmp/trt_engines/{os.path.basename(model_path)}/multimodal_encoder/"
    -    run_shell_command(
    -        f"cd {tritonserver_test_root} && ./test.sh {test_name} {engine_dir} {model_path} {tokenizer_type} skip skip skip {visual_encoder_path}",
    -        llm_root)
    +    # Visual encoder path follows the multimodal tmp engine pattern similar to others
    +    visual_encoder_path = (
    +        f"{llm_root}/examples/models/core/multimodal/tmp/trt_engines/"
    +        f"{os.path.basename(model_path)}/multimodal_encoder/"
    +    )
    +    run_shell_command(
    +        (
    +            f"cd {tritonserver_test_root} && ./test.sh {test_name} "
    +            f"{engine_dir} {model_path} {tokenizer_type} skip skip skip "
    +            f"{visual_encoder_path}"
    +        ),
    +        llm_root,
    +    )
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
tests/integration/defs/triton_server/build_model.sh (1 hunks)
tests/integration/defs/triton_server/test.sh (6 hunks)
tests/integration/defs/triton_server/test_triton.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/integration/defs/triton_server/test.sh
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tests/integration/defs/triton_server/test_triton.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tests/integration/defs/triton_server/test_triton.py
🧠 Learnings (8)
📓 Common learnings
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6650
File: tests/integration/test_lists/qa/llm_perf_cluster.yml:33-37
Timestamp: 2025-08-06T03:47:16.802Z
Learning: Ministral is a valid model name from Mistral AI, distinct from the regular Mistral models. In TensorRT-LLM test configurations, "ministral_8b" and "ministral_8b_fp8" are correct model identifiers and should not be changed to "mistral_8b".
📚 Learning: 2025-08-06T03:47:16.802Z
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6650
File: tests/integration/test_lists/qa/llm_perf_cluster.yml:33-37
Timestamp: 2025-08-06T03:47:16.802Z
Learning: Ministral is a valid model name from Mistral AI, distinct from the regular Mistral models. In TensorRT-LLM test configurations, "ministral_8b" and "ministral_8b_fp8" are correct model identifiers and should not be changed to "mistral_8b".
Applied to files:
tests/integration/defs/triton_server/build_model.sh
tests/integration/defs/triton_server/test_triton.py
📚 Learning: 2025-08-06T03:47:16.802Z
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6650
File: tests/integration/test_lists/qa/llm_perf_cluster.yml:33-37
Timestamp: 2025-08-06T03:47:16.802Z
Learning: Ministral is a valid and distinct model family from Mistral AI, separate from their regular Mistral models. Ministral 8B is specifically designed for edge computing and on-device applications, released in October 2024. In TensorRT-LLM test configurations, "ministral_8b" and "ministral_8b_fp8" are correct model identifiers and should not be changed to "mistral_8b".
Applied to files:
tests/integration/defs/triton_server/build_model.sh
tests/integration/defs/triton_server/test_triton.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tests/integration/defs/triton_server/build_model.sh
tests/integration/defs/triton_server/test_triton.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/defs/triton_server/build_model.sh
tests/integration/defs/triton_server/test_triton.py
📚 Learning: 2025-08-08T04:10:18.987Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6728
File: cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp:966-966
Timestamp: 2025-08-08T04:10:18.987Z
Learning: TensorRT plugins currently don't support padding functionality, and TensorRT is not getting new features (in maintenance mode). This means that duplicating parameters like mExpertHiddenSize in function calls, even with TODO comments, can be acceptable as pragmatic solutions within these constraints.
Applied to files:
tests/integration/defs/triton_server/build_model.sh
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Applied to files:
tests/integration/defs/triton_server/build_model.sh
tests/integration/defs/triton_server/test_triton.py
📚 Learning: 2025-07-22T09:22:14.726Z
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Applied to files:
tests/integration/defs/triton_server/build_model.sh
🪛 Ruff (0.12.2)
tests/integration/defs/triton_server/test_triton.py
476-476: Line too long (138 > 120)
(E501)
478-478: Line too long (144 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (4)
tests/integration/defs/triton_server/build_model.sh (1)
720-721: Verify max_multimodal_len sizing. 32768 is quite large and may inflate memory/engine size. Ensure it's justified by worst-case images-per-prompt × patches-per-image for Pixtral, given the "[IMG]" expansion policy.
Would you confirm the intended upper bound (images per request and maximum image resolution/patch grid), and whether 16384 would suffice for CI to reduce build time and memory?
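As a rough aid to that question, here is a back-of-envelope sketch based on the one-token-per-28x28-patch mapping described in this PR; the image size and images-per-request values below are illustrative assumptions, not engine settings:

```python
import math


def pixtral_tokens_per_image(height: int, width: int, patch: int = 28) -> int:
    """Approximate prompt-table slots per image: one token per 28x28 patch plus one break token per patch row."""
    rows = math.ceil(height / patch)
    cols = math.ceil(width / patch)
    return rows * cols + rows


# Illustrative worst case: 8 images of 1540x1540 pixels in a single request.
per_image = pixtral_tokens_per_image(1540, 1540)  # 55*55 + 55 = 3080
total = 8 * per_image                             # 24640, which fits under 32768
print(per_image, total)
```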
tests/integration/defs/triton_server/test_triton.py (3)
65-67: Model mapping entry looks good. The new "mistral_small_3_1" → "Mistral-Small-3.1-24B-Instruct-2503" mapping matches the build script’s MODEL_DIR. No issues.
127-129: Engine directory mapping LGTM. Path is consistent with the bf16 engine output in the build script.
476-477: Confirm basename-based visual encoder path matches build output. This relies on os.path.basename(model_path) being identical to the directory name used in build_model.sh. Today they match; if MODEL_DIR changes, this will drift. Consider deriving from a shared mapping or a helper to avoid string coupling.
If you prefer, I can factor a small helper to compute visual_encoder_path in one place used by all multimodal tests.
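For illustration, such a helper might look roughly like this (the function name and layout string are hypothetical, mirroring the path currently inlined in the test):

```python
import os


def get_visual_encoder_path(llm_root: str, model_path: str) -> str:
    """Single source of truth for the multimodal tmp-engine layout used by the build script."""
    model_name = os.path.basename(os.path.normpath(model_path))
    return os.path.join(
        llm_root,
        "examples/models/core/multimodal/tmp/trt_engines",
        model_name,
        "multimodal_encoder",
    )
```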
|
PR_Github #14624 [ run ] triggered by Bot |
|
PR_Github #14624 [ run ] completed with state |
triton_backend/all_models/inflight_batcher_llm/preprocessing/config.pbtxt
Outdated
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
Outdated
Force-pushed from f109060 to bf1dead
Thanks a lot for adding Pixtral support to the trtllm backend of Triton. I’ve left several comments. Could you please take a look?
triton_backend/all_models/multimodal/multimodal_encoders/1/model.py
Outdated
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
Outdated
triton_backend/all_models/inflight_batcher_llm/preprocessing/1/model.py
Outdated
triton_backend/all_models/multimodal/multimodal_encoders/1/model.py
Outdated
|
/bot run --stage-list "A30-Triton-Post-Merge-1, A30-Triton-Post-Merge-2, B200_PCIe-Triton-Post-Merge-1, A100X-Triton-Post-Merge-1, A100X-Triton-Post-Merge-2" |
|
PR_Github #15010 [ run ] triggered by Bot |
|
PR_Github #15010 [ run ] completed with state |
Force-pushed from abdd1c3 to a00d34d
|
/bot run --stage-list "A30-Triton-Post-Merge-1, A30-Triton-Post-Merge-2, B200_PCIe-Triton-Post-Merge-1, A100X-Triton-Post-Merge-1, A100X-Triton-Post-Merge-2" |
|
PR_Github #15091 [ run ] triggered by Bot |
|
PR_Github #15091 [ run ] completed with state |
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
Co-authored-by: Iman Tabrizian <10105175+Tabrizian@users.noreply.github.com> Signed-off-by: Dimitrios Bariamis <dbari@users.noreply.github.com>
- The logger has no method logger.verbose() Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
Force-pushed from 54d9e62 to 48bc0e6
|
/bot run |
|
PR_Github #15734 [ run ] completed with state |
|
PR_Github #15744 [ run ] triggered by Bot |
|
PR_Github #15744 [ run ] completed with state |
|
/bot run |
|
PR_Github #15790 [ run ] triggered by Bot |
@Tabrizian I repeated the tests with both backends in an image built with the current Seeing this, I will also switch the backend for this model to |
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
|
PR_Github #15790 [ run ] completed with state |
|
/bot run |
|
PR_Github #15875 [ run ] triggered by Bot |
|
PR_Github #15875 [ run ] completed with state |
…stead of URLs Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
|
/bot run --stage-list "A100X-Triton-Post-Merge-1, A100X-Triton-Post-Merge-2" |
Actionable comments posted: 0
🧹 Nitpick comments (1)
.gitattributes (1)
10-11: Handle .jpeg and uppercase extensions; simplify with character classes. Git attribute patterns are case-sensitive. The current rules won’t catch .JPEG/.PNG nor .jpeg. Recommend replacing these two lines with case-insensitive character-class patterns that also cover .jpeg.
Apply this diff:
    -tests/integration/test_input_files/*.png filter=lfs diff=lfs merge=lfs -text
    -tests/integration/test_input_files/*.jpg filter=lfs diff=lfs merge=lfs -text
    +tests/integration/test_input_files/*.[Pp][Nn][Gg] filter=lfs diff=lfs merge=lfs -text
    +tests/integration/test_input_files/*.[Jj][Pp][Gg] filter=lfs diff=lfs merge=lfs -text
    +tests/integration/test_input_files/*.[Jj][Pp][Ee][Gg] filter=lfs diff=lfs merge=lfs -text

Notes:
- The second and third added rules cover .jpg and .jpeg separately; in gitattributes globs, ? matches exactly one character, so a single pattern cannot make the "e" optional.
- If you expect nested subfolders under test_input_files, consider adding an additional set using **/ to cover recursion (supported in modern Git), or confirm you’ll keep a flat directory.
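As a quick local sanity check of the character-class globs, a small Python sketch using fnmatch as a stand-in (its bracket and * semantics match gitattributes for these simple, slash-free basename patterns; the filenames are made up):

```python
import fnmatch

# Basename parts of the proposed rules.
rules = ("*.[Pp][Nn][Gg]", "*.[Jj][Pp][Gg]", "*.[Jj][Pp][Ee][Gg]")

for name in ("cat.png", "cat.PNG", "dog.jpg", "dog.JPG", "fox.jpeg", "fox.JPEG", "notes.txt"):
    matched = any(fnmatch.fnmatchcase(name, rule) for rule in rules)
    print(f"{name}: {'LFS-tracked' if matched else 'not matched'}")
```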
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
.gitattributes(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
.gitattributes (2)
10-11: Scoped LFS rules for test images look good. Limiting LFS to tests/integration/test_input_files prevents repo-wide PNG/JPG from being forced into LFS while still protecting large test assets. Nice, targeted change.
10-11: CI/LFS Sanity Check: Manual Verification Needed. The sandbox environment doesn’t support git lfs, so we can’t auto-verify here. Please confirm in an LFS-enabled clone (e.g. your CI agent with git lfs install) that all image assets under tests/integration/test_input_files are indeed tracked as LFS pointers before running your integration tests:
• Count and list all image files:
    find tests/integration/test_input_files \
      -type f \( -iname "*.png" -o -iname "*.jpg" -o -iname "*.jpeg" \) | sort

— Expect 5 files.
• Verify they’re LFS-tracked:
    git lfs ls-files -- tests/integration/test_input_files | awk '{print $3}' | sort

— Should also list the same 5 paths.
• Alternatively, check pointer content in your CI checkout:
    grep -Il "version https://git-lfs.github.com/spec/v1" tests/integration/test_input_files/*

— Only true pointers will match.
If any image is missing:
- Re-add it to pick up the new .gitattributes rule:

      git rm --cached <file> && git add <file>

- Or rewrite history to migrate existing blobs:

      git lfs migrate import \
        --include="tests/integration/test_input_files/*.{png,jpg,jpeg}"

Also ensure your CI pipeline runs:
    git lfs install --local
    git lfs fetch --all
    git lfs checkout

so that pointer files are replaced with real binaries before tests execute.
|
PR_Github #16054 [ run ] triggered by Bot |
|
PR_Github #16054 [ run ] completed with state |
|
/bot reuse-pipeline |
|
PR_Github #16067 [ reuse-pipeline ] triggered by Bot |
|
PR_Github #16067 [ reuse-pipeline ] completed with state |
Summary by CodeRabbit
New Features
Configuration
Tools
Tests
Chores
Description
Implemented Mistral Small 3.1 in the Triton backend. It supports multiple images per prompt, and the images can have different resolutions. Each image is embedded into the prompt at the location of its respective [IMG] placeholder and has a length of one token for every 28x28 patch (disregarding padding and line break / end-of-image tokens). Also, since the vision encoder has a fixed max_batch_size but a request might contain an arbitrary number of images, the images are processed in batches of size max_batch_size, as sketched below.
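A minimal sketch of that chunking step (illustrative only; the encoder callable is a placeholder, not the actual TRT vision-encoder invocation):

```python
from typing import Callable, List, Sequence


def encode_in_chunks(images: Sequence, max_batch_size: int, encode: Callable) -> List:
    """Run the vision encoder over an arbitrary number of images, max_batch_size at a time."""
    embeddings: List = []
    for start in range(0, len(images), max_batch_size):
        chunk = images[start:start + max_batch_size]
        embeddings.extend(encode(chunk))  # placeholder for the engine call
    return embeddings


# Example: 11 images with max_batch_size=4 -> chunks of 4, 4, 3.
print(encode_in_chunks(list(range(11)), 4, lambda chunk: chunk))
```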
After discussing with @Tabrizian (thanks!), I understood the interaction between the number of requests (Triton batch) and the batch size of each request. The code has been reworked to allow an arbitrary Triton batch size (number of requests per execute() call), but the batch size in each request should be 1.
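For illustration, enforcing that per-request constraint inside a Triton Python-backend model could look roughly like this (the IMAGES tensor name and the error text are assumptions for the sketch, not the exact code in this PR):

```python
import triton_python_backend_utils as pb_utils


def reject_batched_request(request):
    """Return an error response if the request's leading (batch) dimension exceeds 1, else None."""
    images = pb_utils.get_input_tensor_by_name(request, "IMAGES")
    if images is not None and images.as_numpy().shape[0] > 1:
        return pb_utils.InferenceResponse(
            output_tensors=[],
            error=pb_utils.TritonError("pixtral supports a batch size of 1 per request"),
        )
    return None
```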
Side note: The model runs successfully only when using the python backend for the tensorrt_llm step in the ensemble. This seems to be a bug or configuration difference in the C++ tensorrtllm backend that is independent of this PR. Unless there is another suggestion, I will open an issue for it.
GitHub Bot Help
    /bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
    run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.
- --reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- --disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- --disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- --skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- --stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- --gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- --test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- --only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- --disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- --add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- --post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- --detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- --debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill
    kill

Kill all running builds associated with pull request.
skip
    skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

    reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.