[TRTLLM-6771][feat] Support MMMU for multimodal models by yechank-nvidia · Pull Request #6828 · NVIDIA/TensorRT-LLM

Conversation

@yechank-nvidia (Collaborator) commented Aug 12, 2025

Summary by CodeRabbit

  • New Features

    • MMMU added as a new multimodal evaluation command in the CLI; end-to-end multimodal evaluation enabled (image handling, multimodal prompts, chat-template support).
    • Image loaders accept direct image objects and normalize RGBA→RGB automatically; added utilities to apply chat templates and convert image modes.
  • Bug Fixes

    • Per-message multimodal placeholder handling improved for more reliable formatting and evaluation.

@yechank-nvidia yechank-nvidia requested a review from a team as a code owner August 12, 2025 13:06
@coderabbitai bot (Contributor) commented Aug 12, 2025

📝 Walkthrough

Adds MMMU as a public evaluator and CLI subcommand; implements multimodal evaluation support (MultimodalLmEvalWrapper, is_multimodal flag, LM_EVAL_DEFAULT_IMAGE_PLACEHOLDER); extends image utilities to accept PIL images and handle RGBA→RGB conversion; tracks per-message multimodal placeholder counts during chat parsing.

Changes

  • CLI wiring for MMMU (tensorrt_llm/commands/eval.py):
    Imports MMMU and registers MMMU.command with the main Click group in two locations to expose the new CLI subcommand.
  • Evaluate exports (tensorrt_llm/evaluate/__init__.py):
    Re-exports MMMU from lm_eval and adds MMMU to __all__.
  • Multimodal lm_eval integration + CLI (tensorrt_llm/evaluate/lm_eval.py):
    Adds LM_EVAL_DEFAULT_IMAGE_PLACEHOLDER, MultimodalLmEvalWrapper, the MMMU evaluator/CLI, _get_model_type, apply_chat_template, generate_until, an is_multimodal flag with wrapper selection in LmEvalEvaluator, and task-config adjustments for lm_eval.
  • Inputs API exports (tensorrt_llm/inputs/__init__.py):
    Re-exports apply_chat_template and convert_image_mode from inputs.utils and updates __all__.
  • Image and multimodal utilities (tensorrt_llm/inputs/utils.py):
    Adds rgba_to_rgb and convert_image_mode; load_image / async_load_image accept PIL.Image and return RGB; refactors conversion; normalizes modality "multiple_image" → "image"; handle_placeholder_exceptions now accepts per-message placeholder counts (list[dict[str, int]]).
  • Per-message placeholder tracking (tensorrt_llm/serve/chat_utils.py):
    Tracks per-message multimodal placeholder counts (mm_placeholder_counts), updates message content in-loop via add_multimodal_placeholders, and returns mm_placeholder_counts alongside the conversation and coroutines.
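
As a concrete reference for the new data shape, mm_placeholder_counts carries one dict per chat message. The values below are illustrative, not taken from the diff:

    # Illustrative only: the per-message placeholder-count shape,
    # list[dict[str, int]], one entry per chat message.
    mm_placeholder_counts = [
        {"image": 2},  # first message embedded two images
        {},            # second message was text-only
    ]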

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant CLI as trtllm eval
  participant MMMU as MMMU.command
  participant Harness as command_harness
  participant Eval as LmEvalEvaluator
  participant Wrapper as MultimodalLmEvalWrapper
  participant LLM as LLM

  User->>CLI: trtllm eval mmmu [options]
  CLI->>MMMU: invoke subcommand
  MMMU->>Harness: build evaluator_kwargs (is_multimodal=True, params)
  Harness->>Eval: instantiate evaluator(**kwargs)
  Eval->>Eval: choose MultimodalLmEvalWrapper if MULTIMODAL
  Eval->>Wrapper: construct wrapper with llm, sampling_params
  Wrapper->>LLM: generate_async(prompt + media)
  LLM-->>Wrapper: text outputs
  Wrapper-->>Eval: per-sample results
  Eval-->>MMMU: evaluation metrics/results
  MMMU-->>User: print/report
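
In practice the whole flow above starts from a single command, e.g. trtllm-eval --model <model_dir> mmmu; the exact flag set here is an assumption, see trtllm-eval --help for the real options.
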
sequenceDiagram
  participant Client
  participant Chat as parse_chat_messages_coroutines
  participant Utils as add_multimodal_placeholders

  Client->>Chat: messages with media
  loop for each message
    Chat->>Chat: load/parse media, compute mm_placeholder_count
    Chat->>Utils: update parsed_msg.content with per-message placeholders
    Chat->>Chat: append count to mm_placeholder_counts
  end
  Chat-->>Client: conversation, coroutines, mm_placeholder_counts

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~35 minutes

Suggested reviewers

  • QiJune
  • Wanli-Jiang
  • xinhe-nv
  • symphonylyh
  • StanleySun639


@yechank-nvidia yechank-nvidia requested a review from syuoni August 12, 2025 13:09
@yechank-nvidia yechank-nvidia self-assigned this Aug 12, 2025
@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 6

🧹 Nitpick comments (3)
tensorrt_llm/inputs/utils.py (1)

456-462: Handle potential empty placeholder counts gracefully.

The current implementation assumes all mm_placeholder_count entries have at least one key when creating image placeholders. If a message has no multimodal content, this could lead to unexpected behavior.

Consider adding a check for empty placeholder counts:

 def handle_placeholder_exceptions(model_type: str,
                                   conversation: list[ConversationMessage],
                                   mm_placeholder_counts: list[dict[str, int]]):
     if model_type == "llava_next":
         # we need to convert the flattened content back to conversation format
         for conv, mm_placeholder_count in zip(conversation,
                                               mm_placeholder_counts):
             conv["content"] = [{"type": "text", "text": conv["content"]}, \
-                *[{"type": "image"} for _ in mm_placeholder_count]]
+                *[{"type": "image"} for _ in mm_placeholder_count if mm_placeholder_count]]
tensorrt_llm/evaluate/lm_eval.py (2)

161-161: Revise docstring to accurately reflect TRT-LLM's unlimited image capability

The comment about "currently unlimited in TRT-LLM" contradicts the parameter's purpose and the default value of 999 from lm_eval. Consider clarifying the actual limitation.

-            max_images: Maximum number of images per prompt (currently unlimited in TRT-LLM), set to 999 from lm_eval's default value.
+            max_images: Maximum number of images per prompt (default: 999 from lm_eval)

321-344: Simplify nested function and improve readability

The _adjust_config function is defined inside __init__ but could be a class method or module-level function for better organization. Also, the recursive structure could be clearer.

Consider extracting this as a private class method:

-        # Adopted from https://github.com/EleutherAI/lm-evaluation-harness/blob/7f04db12d2f8e7a99a0830d99eb78130e1ba2122/lm_eval/evaluator.py#L290
-        def _adjust_config(task_dict, random_seed):
-            adjusted_task_dict = {}
-            for task_name, task_obj in task_dict.items():
-                if isinstance(task_obj, dict):
-                    adjusted_task_dict = {
-                        **adjusted_task_dict,
-                        **{
-                            task_name: _adjust_config(task_obj, random_seed)
-                        },
-                    }
-                else:
-                    # NOTE: Few-shot random seed
-                    task_obj.set_config(key="num_fewshot", value=0)
-                    task_obj.set_fewshot_seed(seed=random_seed)
-                    adjusted_task_dict[task_name] = task_obj
-
-                    # NOTE: Shuffle dataset
-                    data = adjusted_task_dict[task_name].dataset
-                    for split in data.keys():
-                        data[split] = data[split].shuffle(random_seed)
-
-            return adjusted_task_dict
-
-        self.task_dict = _adjust_config(self.task_dict, random_seed)
+        self.task_dict = self._adjust_config(self.task_dict, random_seed)

And add as a class method:

    def _adjust_config(self, task_dict, random_seed):
        """Adjust task configuration for evaluation.
        
        Adopted from: https://github.com/EleutherAI/lm-evaluation-harness/blob/7f04db12d2f8e7a99a0830d99eb78130e1ba2122/lm_eval/evaluator.py#L290
        """
        adjusted_task_dict = {}
        for task_name, task_obj in task_dict.items():
            if isinstance(task_obj, dict):
                # Recursively adjust nested task dictionaries
                adjusted_task_dict[task_name] = self._adjust_config(task_obj, random_seed)
            else:
                # Configure individual task
                task_obj.set_config(key="num_fewshot", value=0)
                task_obj.set_fewshot_seed(seed=random_seed)
                adjusted_task_dict[task_name] = task_obj
                
                # Shuffle dataset splits
                data = task_obj.dataset
                for split in data.keys():
                    data[split] = data[split].shuffle(random_seed)
                    
        return adjusted_task_dict
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a060e12 and 11d98c5.

📒 Files selected for processing (6)
  • tensorrt_llm/commands/eval.py (2 hunks)
  • tensorrt_llm/evaluate/__init__.py (1 hunks)
  • tensorrt_llm/evaluate/lm_eval.py (10 hunks)
  • tensorrt_llm/inputs/__init__.py (2 hunks)
  • tensorrt_llm/inputs/utils.py (5 hunks)
  • tensorrt_llm/serve/chat_utils.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/evaluate/__init__.py
  • tensorrt_llm/inputs/__init__.py
  • tensorrt_llm/serve/chat_utils.py
  • tensorrt_llm/commands/eval.py
  • tensorrt_llm/inputs/utils.py
  • tensorrt_llm/evaluate/lm_eval.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/evaluate/__init__.py
  • tensorrt_llm/inputs/__init__.py
  • tensorrt_llm/serve/chat_utils.py
  • tensorrt_llm/commands/eval.py
  • tensorrt_llm/inputs/utils.py
  • tensorrt_llm/evaluate/lm_eval.py
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tensorrt_llm/commands/eval.py
🪛 Ruff (0.12.2)
tensorrt_llm/evaluate/lm_eval.py

136-137: Module level import not at top of file

(E402)


158-158: Line too long (134 > 120)

(E501)

🔇 Additional comments (12)
tensorrt_llm/serve/chat_utils.py (2)

182-196: LGTM! Per-message multimodal placeholder tracking is well-implemented.

The implementation correctly:

  • Initializes mm_placeholder_counts list to track per-message counts
  • Computes placeholder counts for each message individually using the tracker
  • Updates message content with placeholders when counts exist
  • Accumulates all per-message counts for return

This approach enables proper multimodal content handling on a per-message basis rather than globally.
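
As a rough illustration of that pattern, here is a standalone sketch; it is not the actual chat_utils.py code, and the placeholder token and message format are assumptions:

    from typing import Any, Dict, List

    def parse_messages_sketch(messages: List[Dict[str, Any]]):
        """Illustrative sketch of the per-message tracking described above."""
        conversation: List[Dict[str, Any]] = []
        mm_placeholder_counts: List[Dict[str, int]] = []
        for msg in messages:
            counts: Dict[str, int] = {}
            text_parts: List[str] = []
            for part in msg.get("content", []):
                if part.get("type") == "text":
                    text_parts.append(part["text"])
                elif part.get("type") == "image":
                    counts["image"] = counts.get("image", 0) + 1
            # Stand-in for add_multimodal_placeholders(): one token per image.
            prefix = "<image>" * counts.get("image", 0)
            conversation.append({"role": msg["role"],
                                 "content": prefix + " ".join(text_parts)})
            mm_placeholder_counts.append(counts)
        return conversation, mm_placeholder_counts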


175-199: No further action required: callers already handle the new 3-tuple return

The only call site (in tensorrt_llm/serve/openai_server.py) destructures three values as expected:

  • conversation, mm_coroutines, mm_placeholder_counts = parse_chat_messages_coroutines(...)

No other callers were found. All consumers have been updated to match the new signature.

tensorrt_llm/inputs/utils.py (5)

27-43: LGTM! Well-documented RGBA to RGB conversion utility.

The function properly validates input mode, uses the alpha channel as a mask for proper transparency handling, and includes clear documentation about the background color choice. The default white background is well-justified.


45-53: LGTM! Generic image mode converter with proper RGBA handling.

The function correctly delegates RGBA→RGB conversion to the specialized rgba_to_rgb function and falls back to the standard convert method for other conversions. Early return for same-mode cases is efficient.
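
For readers without the diff open, here is a minimal sketch consistent with this description; the actual tensorrt_llm/inputs/utils.py code may differ in details such as the background parameter:

    from PIL import Image

    def rgba_to_rgb(image: Image.Image,
                    background: tuple = (255, 255, 255)) -> Image.Image:
        """Flatten an RGBA image onto a solid background (white by default),
        using the alpha channel as the paste mask."""
        if image.mode != "RGBA":
            raise ValueError(f"Expected an RGBA image, got {image.mode}")
        canvas = Image.new("RGB", image.size, background)
        canvas.paste(image, mask=image.getchannel("A"))
        return canvas

    def convert_image_mode(image: Image.Image, mode: str) -> Image.Image:
        """Return the image in the requested mode, special-casing RGBA -> RGB."""
        if image.mode == mode:
            return image
        if image.mode == "RGBA" and mode == "RGB":
            return rgba_to_rgb(image)
        return image.convert(mode)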


74-81: LGTM! Enhanced load_image function supports PIL Image inputs.

The addition of direct PIL Image support with early return for Image.Image instances is well-implemented. The RGB conversion ensures consistent output format.


98-106: LGTM! Consistent PIL Image support in async variant.

The async version maintains the same early-return pattern for PIL Image instances as the synchronous version, ensuring API consistency.


643-646: LGTM! Proper modality normalization for consistent handling.

The normalization of "multiple_image" to "image" ensures consistent modality handling throughout the multimodal pipeline. This prevents potential issues with placeholder generation and data tracking.
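
Conceptually this is a one-line mapping (illustrative):

    modality = "image" if modality == "multiple_image" else modality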

tensorrt_llm/inputs/__init__.py (1)

8-11: LGTM! Proper public API extension for new utilities.

The new utilities apply_chat_template and convert_image_mode are correctly exported and added to __all__, making them available as part of the public API at tensorrt_llm.inputs.*. This aligns with the multimodal enhancements in the PR.

Also applies to: 32-34

tensorrt_llm/commands/eval.py (2)

23-24: LGTM! MMMU import added for multimodal evaluation support.

The import correctly adds MMMU alongside existing evaluation classes, enabling multimodal evaluation capabilities through the CLI interface.


155-155: LGTM! MMMU command properly registered.

The MMMU command is correctly added to the CLI group, making it available as a subcommand for multimodal evaluation tasks.

tensorrt_llm/evaluate/__init__.py (1)

18-18: LGTM! MMMU properly exported from evaluate package.

MMMU is correctly imported from lm_eval and added to the public API via __all__, making it available for import by other modules including the CLI interface.

Also applies to: 23-23

tensorrt_llm/evaluate/lm_eval.py (1)

253-253: prompt_inputs supports string inputs as expected

The prompt_inputs function has a branch for isinstance(inputs, str) that wraps the string in a TextPrompt. Passing a string here is handled correctly, so no changes are needed.
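
A hedged usage sketch of that branch; the import path matches the PR diff, the prompt string is illustrative:

    from tensorrt_llm.inputs import prompt_inputs

    # A plain string is accepted and, per the comment above, wrapped in a TextPrompt.
    inputs = prompt_inputs("Describe the image in one sentence.")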

@syuoni (Collaborator) left a comment

Hi @yechank-nvidia,

Thanks for adding this new task to trtllm-eval!

How about adding a test in tests/integration/defs/accuracy/test_llm_api_pytorch.py? It should evaluate a multi-modal model using the new MMMU task.
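
For concreteness, such a test might look roughly like the following; this is a hypothetical sketch, and the accuracy-test scaffolding, model choice, and evaluator API are assumptions rather than code from this PR:

    from tensorrt_llm._torch import LLM  # assumed PyTorch-backend entry point
    from tensorrt_llm.evaluate import MMMU

    def test_multimodal_mmmu_sketch():
        llm = LLM(model="Qwen/Qwen2-VL-7B-Instruct")  # illustrative model choice
        MMMU().evaluate(llm)  # assumed evaluator interface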

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

♻️ Duplicate comments (6)
tensorrt_llm/evaluate/lm_eval.py (6)

651-658: Consider making default values more flexible

Both is_multimodal and apply_chat_template default to True, which creates redundancy since multimodal mode automatically enables chat templates (line 303). Consider keeping is_multimodal as True for MMMU but making apply_chat_template default to False to avoid confusion.

     @click.option("--apply_chat_template",
                   is_flag=True,
-                  default=True,
+                  default=False,
-                  help="Whether to apply chat template.")
+                  help="Whether to apply chat template (automatically enabled for multimodal).")

134-139: Move imports to the top of the file

These imports should be moved to the top of the file to follow Python's import organization conventions (PEP 8). Currently they're placed after class definitions, which violates standard practice.

-from tensorrt_llm.inputs import (ConversationMessage, MultimodalDataTracker,
-                                 add_multimodal_placeholders,
-                                 convert_image_mode)
-from tensorrt_llm.inputs.utils import \
-    apply_chat_template as trtllm_apply_chat_template

Add these imports at the top of the file after line 39:

from tensorrt_llm.inputs import (ConversationMessage, MultimodalDataTracker,
                                 add_multimodal_placeholders,
                                 convert_image_mode)
from tensorrt_llm.inputs.utils import apply_chat_template as trtllm_apply_chat_template

141-180: Consider improving model type extraction robustness

The _get_model_type method accesses a private attribute _hf_model_dir and has a catch-all exception handler. Consider checking for both private and public attribute variants, and catch only expected exceptions.

     def _get_model_type(self, llm: Union[LLM, PyTorchLLM]) -> str:
         """Extract model type from the model configuration."""
         try:
-            config_path = os.path.join(llm._hf_model_dir, 'config.json')
+            # Try to get the model directory path
+            if hasattr(llm, '_hf_model_dir'):
+                model_dir = llm._hf_model_dir
+            elif hasattr(llm, 'hf_model_dir'):
+                model_dir = llm.hf_model_dir
+            else:
+                logger.warning("Could not find model directory attribute")
+                return "unknown"
+                
+            config_path = os.path.join(model_dir, 'config.json')
             with open(config_path, 'r') as f:
                 config = json.load(f)
-            return config['model_type']
-        except (KeyError, FileNotFoundError, json.JSONDecodeError) as e:
+            return config.get('model_type', 'unknown')
+        except (FileNotFoundError, json.JSONDecodeError) as e:
             logger.warning(f"Could not determine model type: {e}")
             return "unknown"

200-204: Remove incomplete interleave implementation

The interleaved text and image logic is incomplete with just a split operation and ellipsis. This dead code should be removed or properly implemented.

-            if self.interleave:
-                # TODO: Implement interleaved text and image.
-                text.split(LM_EVAL_DEFAULT_IMAGE_PLACEHOLDER)
-                ...
-            else:
-                text = text.replace(LM_EVAL_DEFAULT_IMAGE_PLACEHOLDER, "")
+            # TODO: Implement interleaved text and image support
+            text = text.replace(LM_EVAL_DEFAULT_IMAGE_PLACEHOLDER, "")

256-259: Add error handling for missing or invalid image data

The code assumes media_data["visual"] exists and contains valid images. Add validation to handle cases where the visual data might be missing or malformed.

-            images = [
-                convert_image_mode(img, "RGB") for img in media_data["visual"]
-            ]
+            images = []
+            if media_data and "visual" in media_data:
+                try:
+                    images = [
+                        convert_image_mode(img, "RGB") for img in media_data["visual"]
+                    ]
+                except Exception as e:
+                    logger.error(f"Failed to process visual data: {e}")
+                    raise ValueError(f"Invalid visual data in request: {e}")

620-621: Pass is_multimodal=True explicitly for MMMU evaluator

The MMMU evaluator should explicitly pass is_multimodal=True to the parent class to ensure multimodal support is enabled.

     def __init__(self, **kwargs):
+        kwargs['is_multimodal'] = True
         super().__init__("mmmu_val", **kwargs)
🧹 Nitpick comments (2)
tensorrt_llm/inputs/utils.py (2)

74-81: Consider using convert_image_mode for consistency

The function currently uses image.convert('RGB') directly, while the rest of the codebase uses the new convert_image_mode helper. Consider using it here for consistency.

     if isinstance(image, Image.Image):
-        return image.convert('RGB')
+        return convert_image_mode(image, 'RGB')

98-106: Consider using convert_image_mode for consistency

Similar to the synchronous version, consider using convert_image_mode for consistency.

     if isinstance(image, Image.Image):
-        return image.convert('RGB')
+        return convert_image_mode(image, 'RGB')
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 11d98c5 and b8b1215.

📒 Files selected for processing (2)
  • tensorrt_llm/evaluate/lm_eval.py (10 hunks)
  • tensorrt_llm/inputs/utils.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md): the same **/*.py guidelines as listed in the first review.

Files:

  • tensorrt_llm/inputs/utils.py
  • tensorrt_llm/evaluate/lm_eval.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/inputs/utils.py
  • tensorrt_llm/evaluate/lm_eval.py
🪛 Ruff (0.12.2)
tensorrt_llm/evaluate/lm_eval.py

136-137: Module level import not at top of file

(E402)


158-158: Line too long (134 > 120)

(E501)

🔇 Additional comments (7)
tensorrt_llm/inputs/utils.py (4)

27-42: LGTM!

The rgba_to_rgb function is well-implemented with proper validation, error handling, and a clear explanation for using white as the default background color.


45-53: LGTM!

The convert_image_mode function provides a clean abstraction for image mode conversion with proper handling of the RGBA to RGB case.


456-465: LGTM!

The updated function correctly handles per-message placeholder counts for the llava_next model by zipping conversations with their corresponding placeholder counts.


643-646: LGTM!

The normalization of "multiple_image" to "image" ensures consistent handling of image data across different modality identifiers.

tensorrt_llm/evaluate/lm_eval.py (3)

301-306: LGTM!

Proper handling of multimodal mode that automatically enables chat templates with appropriate logging.
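
Roughly this guard, as a standalone sketch; names and the log message are illustrative, not the actual lm_eval.py code:

    def resolve_chat_template(is_multimodal: bool, apply_chat_template: bool) -> bool:
        """Multimodal evaluation always needs the chat template applied."""
        if is_multimodal and not apply_chat_template:
            print("Multimodal mode requires a chat template; enabling it.")
            return True
        return apply_chat_template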


321-345: LGTM!

The _adjust_config function properly handles dataset shuffling and few-shot configuration with the provided random seed.


380-383: LGTM!

Clean implementation that selects the appropriate wrapper class based on the multimodal flag.

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

♻️ Duplicate comments (5)
tensorrt_llm/evaluate/lm_eval.py (5)

36-38: Fix import ordering: Place all imports at the top of the file

These imports violate PEP 8 by appearing after class definitions. Move them to the top of the file with the other imports.

Remove lines 36-38 and add them after line 26:

 import tensorrt_llm.profiler as profiler
+from tensorrt_llm.inputs import (ConversationMessage, MultimodalDataTracker,
+                                 add_multimodal_placeholders, convert_image_mode,
+                                 prompt_inputs)
+from tensorrt_llm.inputs.utils import apply_chat_template as trtllm_apply_chat_template
-from tensorrt_llm.inputs import prompt_inputs

Also remove the duplicate import on line 26.


167-176: Add robust error handling for model type extraction

The method accesses a private attribute _hf_model_dir which could break with internal changes. Additionally, the generic exception handling might mask important errors.

 def _get_model_type(self, llm: Union[LLM, PyTorchLLM]) -> str:
     """Extract model type from the model configuration."""
     try:
-        config_path = os.path.join(llm._hf_model_dir, 'config.json')
+        # Try to get the model directory path
+        if hasattr(llm, 'hf_model_dir'):
+            model_dir = llm.hf_model_dir
+        elif hasattr(llm, '_hf_model_dir'):
+            model_dir = llm._hf_model_dir
+        else:
+            logger.warning("Could not find model directory attribute")
+            return "unknown"
+            
+        config_path = os.path.join(model_dir, 'config.json')
         with open(config_path, 'r') as f:
             config = json.load(f)
-        return config['model_type']
-    except (KeyError, FileNotFoundError, json.JSONDecodeError) as e:
+        return config.get('model_type', 'unknown')
+    except (FileNotFoundError, json.JSONDecodeError) as e:
         logger.warning(f"Could not determine model type: {e}")
         return "unknown"

196-201: Remove incomplete interleave implementation

The interleaved text and image logic contains dead code with just a split operation and ellipsis. This should be removed or properly implemented.

-            if self.interleave:
-                # TODO: Implement interleaved text and image.
-                text.split(LM_EVAL_DEFAULT_IMAGE_PLACEHOLDER)
-                ...
-            else:
-                text = text.replace(LM_EVAL_DEFAULT_IMAGE_PLACEHOLDER, "")
+            # TODO: Implement interleaved text and image support
+            text = text.replace(LM_EVAL_DEFAULT_IMAGE_PLACEHOLDER, "")

253-255: Add validation for visual data in multimodal requests

The code assumes media_data["visual"] exists and contains valid images without any validation.

-            images = [
-                convert_image_mode(img, "RGB") for img in media_data["visual"]
-            ]
+            images = []
+            if media_data and "visual" in media_data:
+                try:
+                    images = [
+                        convert_image_mode(img, "RGB") for img in media_data["visual"]
+                    ]
+                except Exception as e:
+                    logger.error(f"Failed to process visual data: {e}")
+                    raise ValueError(f"Invalid visual data in request: {e}")
+            else:
+                logger.warning("No visual data found in multimodal request")

606-607: MMMU should explicitly set multimodal flag

The MMMU evaluator should explicitly pass is_multimodal=True to ensure multimodal support is always enabled.

 def __init__(self, **kwargs):
+    kwargs['is_multimodal'] = True
     super().__init__("mmmu_val", **kwargs)
🧹 Nitpick comments (1)
tensorrt_llm/evaluate/lm_eval.py (1)

149-158: Fix line length issue in docstring

Line 157 exceeds the 120 character limit.

-            max_images: Maximum number of images per prompt (currently unlimited in TRT-LLM), set to 999 from lm_eval's default value.
+            max_images: Maximum number of images per prompt (currently unlimited in TRT-LLM),
+                set to 999 from lm_eval's default value.
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b8b1215 and 15b44ec.

📒 Files selected for processing (1)
  • tensorrt_llm/evaluate/lm_eval.py (10 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md): the same **/*.py guidelines as listed in the first review.

Files:

  • tensorrt_llm/evaluate/lm_eval.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/evaluate/lm_eval.py
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tensorrt_llm/evaluate/lm_eval.py
🪛 Ruff (0.12.2)
tensorrt_llm/evaluate/lm_eval.py

154-154: Line too long (134 > 120)

(E501)

🔇 Additional comments (3)
tensorrt_llm/evaluate/lm_eval.py (3)

44-46: LGTM! Clear documentation of the image placeholder constant

Good addition of the LM_EVAL_DEFAULT_IMAGE_PLACEHOLDER constant with proper documentation referencing the source.
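
For reference, the token is presumably lm_eval's default image placeholder; the constant name is from the PR, while the literal value below is an assumption:

    # Assumed value, matching lm_eval's default image placeholder token.
    LM_EVAL_DEFAULT_IMAGE_PLACEHOLDER = "<image>"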


317-341: Elegant implementation of configuration adjustment

The nested function _adjust_config provides a clean recursive approach for adjusting task configurations with proper handling of nested task dictionaries.


640-643: Good implementation ensuring MMMU multimodal configuration

The command method correctly forces multimodal and chat template flags for MMMU, ensuring the evaluator always runs in the correct mode.

@rakib-hasan (Collaborator) left a comment

Added some minor comments. But overall, LGTM.

@yechank-nvidia (Collaborator, Author) commented:

/bot run

@yechank-nvidia (Collaborator, Author) commented:

@syuoni, added a single test to verify it is working. I will add more tests for other models in follow-up PRs.

@tensorrt-cicd (Collaborator) commented:

PR_Github #15579 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator) commented:

PR_Github #15579 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11730 completed with status: 'FAILURE'

@syuoni (Collaborator) left a comment

LGTM, thanks! @yechank-nvidia

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
@yechank-nvidia (Collaborator, Author) commented:

/bot run

@tensorrt-cicd (Collaborator) commented:

PR_Github #15753 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator) commented:

PR_Github #15753 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11841 completed with status: 'FAILURE'

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
@yechank-nvidia (Collaborator, Author) commented:

/bot run

@tensorrt-cicd (Collaborator) commented:

PR_Github #15846 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator) commented:

PR_Github #15846 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11910 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@syuoni syuoni merged commit 0893afa into NVIDIA:main Aug 21, 2025
4 checks passed
zhou-yuxin pushed a commit to zhou-yuxin/TensorRT-LLM that referenced this pull request Aug 21, 2025
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: Yuxin <yuxinz@nvidia.com>