[TRTLLM-7208][feat] Implement basic functionalities for Responses API by JunyiXu-nv · Pull Request #7341 · NVIDIA/TensorRT-LLM · GitHub

Conversation

@JunyiXu-nv
Collaborator

@JunyiXu-nv JunyiXu-nv commented Aug 28, 2025

Summary by CodeRabbit

  • New Features

    • OpenAI-compatible Responses API at /v1/responses with streaming (SSE) and non‑streaming modes.
    • Stateful per-request streaming that emits complete OpenAI-style messages across calls and handles channel/tool transitions.
    • Enhanced tool handling: nested tool names supported and "none" option to disable tools.
    • Optional persistent conversation storage with retrieval by response ID; expanded request/response models and per-token usage details.
  • Tests

    • New unit and integration tests covering reasoning, multi‑turn chat, tool-calling, and streaming.

Description

New Features

  • Add basic Responses API implementation, including stateful multi-turn chat, function calling, reasoning-effort settings, and streaming of reasoning and message responses.
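
    A minimal client-side sketch (not part of this PR's diff), assuming a locally running trtllm-serve instance exposing the new /v1/responses route with a Harmony (gpt-oss) model, the conversation store enabled (the default unless TRTLLM_RESPONSES_API_DISABLE_STORE is set), and a recent openai Python SDK; the base URL, API key, and model name are placeholders:

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

    # Non-streaming request: the response carries "reasoning" and "message" output items.
    resp = client.responses.create(
        model="gpt-oss-20b",  # placeholder model name
        input="Briefly explain what a KV cache is.",
    )
    print([item.type for item in resp.output])  # e.g. ["reasoning", "message"]

    # Stateful multi-turn follow-up: reference the stored response by its ID.
    follow_up = client.responses.create(
        model="gpt-oss-20b",
        input="Now give a one-sentence summary.",
        previous_response_id=resp.id,
    )
    print(follow_up.output_text)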

Tests

  • Added unit and e2e tests covering chat, multi-turn chat, reasoning, streaming, and tool calls.

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
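
For example, to run only a single stage and keep the pipeline going past failures (the stage name is the example placeholder used above):

/bot run --stage-list "A10-PyTorch-1" --disable-fail-fast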

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since skipping tests without care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since reusing results without care and validation can break the top of tree.

@coderabbitai
Contributor

coderabbitai bot commented Aug 28, 2025

📝 Walkthrough

Walkthrough

Adds a new OpenAI Responses API with /v1/responses, stateful Harmony streaming (per-request stream state and token→OpenAI message conversion), response protocol models, conversation history storage, streaming utilities, orchestration glue, and tests (unit and integration).

Changes

Cohort / File(s) Summary
Harmony adapter: stateful streaming & tool handling
tensorrt_llm/serve/harmony_adapter.py
Adds per-request HarmonyStreamState parser access, process_token_batch_to_messages, get_parser, get_stream_state, and stateful_stream_harmony_tokens_to_openai_messages. Extends existing delta path to detect channel/recipient transitions, complete active tool calls on transitions, emit closing deltas, and broaden tool filtering (including tool_choice=="none").
OpenAI Responses protocol models
tensorrt_llm/serve/openai_protocol.py
Adds ResponsesRequest, ResponsesResponse, ResponsesStreamResponse, ResponseInputOutputItem alias, token/usage detail models, validators, and translation to SamplingParams (including guided decoding support).
OpenAI server: /v1/responses endpoint
tensorrt_llm/serve/openai_server.py
Registers /v1/responses, adds openai_responses handler, response-ID error helpers, conversation_store and enable_store attributes, and integrates responses_utils orchestration for streaming and non-stream flows.
Responses utilities module
tensorrt_llm/serve/responses_utils.py
New module adding token encoding/decoding, message builders, input parsing, ConversationHistoryStore (capacity/eviction/async locks), Harmony→OpenAI prompt construction, output parsing, SSE streaming event pipeline, request preprocessing and response creation, and finish-reason mapping.
Integration test wrapper
tests/integration/defs/test_e2e.py
Adds test_openai_responses wrapper to invoke the app test _test_openai_responses.py via pytest.
Test list update
tests/integration/test_lists/test-db/l0_h100.yml
Adds test_e2e.py::test_openai_responses to the l0_h100 PyTorch pre-merge test list.
Unit tests for Responses API
tests/unittest/llmapi/apps/_test_openai_responses.py
New async test suite exercising reasoning, chat, multi-turn, tool-calling, streaming, and streaming tool-call flows using a RemoteOpenAIServer and AsyncOpenAI client with tool mocks and streaming assertions.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant C as Client
  participant S as OpenAIServer (/v1/responses)
  participant U as responses_utils
  participant H as HarmonyAdapter
  participant L as LLM Engine

  C->>S: POST ResponsesRequest (stream=false)
  S->>U: request_preprocess(request, prev_response, H, store, enable_store)
  U->>H: construct_harmony_messages(...)
  S->>L: generate_async(inputs, sampling_params)
  L-->>S: generation_result
  S->>U: create_response(generator, request, sampling_params, model_name, store, generation_result)
  U-->>S: ResponsesResponse
  S-->>C: 200 ResponsesResponse
sequenceDiagram
  autonumber
  participant C as Client (SSE)
  participant S as OpenAIServer (/v1/responses)
  participant U as responses_utils
  participant H as HarmonyAdapter
  participant L as LLM Engine

  C->>S: POST ResponsesRequest (stream=true)
  S->>U: request_preprocess(...)
  U->>H: construct_harmony_messages(...)
  S->>L: generate_async(...)
  loop token chunks
    L-->>S: tokens/events
    S->>U: process_streaming_events(request, params, generator, H, model_name, store)
    U-->>S: SSE event (created/in_progress/deltas/tool_call/etc.)
    S-->>C: SSE event
  end
  U-->>S: final response.completed
  S-->>C: SSE final + [DONE]
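A client-side sketch of the streaming flow above, not code from this PR: it assumes a locally running server and a recent openai SDK, with event-type strings matching the OpenAI Responses streaming events referenced in this review (e.g. ResponseTextDeltaEvent, ResponseReasoningTextDeltaEvent); the base URL and model name are placeholders.

import asyncio

from openai import AsyncOpenAI


async def main() -> None:
    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
    stream = await client.responses.create(
        model="gpt-oss-20b",  # placeholder
        input="Stream a short haiku about GPUs.",
        stream=True,
    )
    async for event in stream:
        # Final-message text and reasoning text arrive as separate delta events.
        if event.type == "response.output_text.delta":
            print(event.delta, end="", flush=True)
        elif event.type == "response.reasoning_text.delta":
            pass  # reasoning deltas could be surfaced separately
        elif event.type == "response.completed":
            print()  # last event before the [DONE] sentinel


asyncio.run(main())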
sequenceDiagram
  autonumber
  participant U as responses_utils
  participant H as HarmonyAdapter
  participant P as HarmonyStreamState
  participant O as OpenAI Messages

  U->>H: start streaming with request_id
  H->>P: get_stream_state(request_id)
  loop tokens
    U->>H: stateful_stream_harmony_tokens_to_openai_messages(request_id, tokens, tools, tool_choice)
    H->>P: process_token_batch_to_messages(tokens)
    P-->>H: accumulated messages
    H-->>U: OpenAI messages
    U->>O: emit deltas/messages
  end
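An illustrative sketch of the loop above, not the code from this PR: it uses only the HarmonyAdapter method names and signatures quoted later in this review (stateful_stream_harmony_tokens_to_openai_messages and cleanup_stream_state). Here adapter is assumed to be an already-initialized HarmonyAdapter, and engine_token_chunks stands in for the per-request token deltas yielded by the LLM engine.

from typing import Any, Dict, Iterable, List, Optional


def convert_stream(adapter: Any,
                   request_id: str,
                   engine_token_chunks: Iterable[List[int]],
                   tools: Optional[List[Dict[str, Any]]] = None,
                   tool_choice: Optional[str] = None) -> List[Any]:
    messages: List[Any] = []
    try:
        for tokens in engine_token_chunks:
            # The adapter keeps per-request parser state across calls, so channel
            # transitions and tool-call boundaries are preserved between chunks.
            messages.extend(
                adapter.stateful_stream_harmony_tokens_to_openai_messages(
                    request_id, tokens, tools, tool_choice))
    finally:
        # Release the per-request stream state once the request finishes.
        adapter.cleanup_stream_state(request_id)
    return messages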

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60–90 minutes

Possibly related PRs

Suggested labels

Community want to contribute

Suggested reviewers

  • hlu1
  • QiJune
  • dongfengy
  • Superjomn
  • LinPoly


@JunyiXu-nv JunyiXu-nv force-pushed the dev-junyi-responses-api-impl branch from f77e55e to 1aa7242 on August 28, 2025 09:52
@JunyiXu-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16841 [ run ] triggered by Bot

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/serve/harmony_adapter.py (1)

1489-1563: Bug: uses generator.outputs after loop; should use last res.
Accessing generator.outputs will raise; use last iteration’s res and guard empty streams.

 async def handle_streaming_response(
@@
-    # Clean up state
-    harmony_adapter.cleanup_stream_state(request_id)
-
-    # Send final message with finish_reason
-    output = generator.outputs[0]
+    # Clean up state
+    harmony_adapter.cleanup_stream_state(request_id)
+
+    # Send final message with finish_reason
+    try:
+        output = res.outputs[0]  # use last yielded result
+        finish_reason = output.finish_reason
+        stop_reason = output.stop_reason
+    except UnboundLocalError:
+        # No chunks yielded; emit minimal completed chunk
+        finish_reason = None
+        stop_reason = None
     final_response = ChatCompletionStreamResponse(
         model=request.model,
         choices=[
             ChatCompletionResponseStreamChoice(
                 index=0,
-                delta=DeltaMessage(),
-                finish_reason=output.finish_reason,
-                stop_reason=output.stop_reason)
+                delta=DeltaMessage(),
+                finish_reason=finish_reason,
+                stop_reason=stop_reason)
         ])
🧹 Nitpick comments (24)
tests/unittest/llmapi/apps/_test_openai_responses.py (3)

1-1: Add NVIDIA copyright header.
Tests are Python files; guidelines require the NVIDIA header with current year.

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 import json

32-42: Fix typo in helper name and improve assertion messages.
Rename check_reponse -> check_response; assertion text reads better as “does not exist”.

-def check_reponse(response, prefix=""):
+def check_response(response, prefix=""):
     reasoning_exist, message_exist = False, False
     for output in response.output:
         if output.type == "reasoning":
             reasoning_exist = True
         elif output.type == "message":
             message_exist = True
 
-    assert reasoning_exist, f"{prefix}Reasoning content not exists!"
-    assert message_exist, f"{prefix}Message content not exists!"
+    assert reasoning_exist, f"{prefix}Reasoning content does not exist!"
+    assert message_exist, f"{prefix}Message content does not exist!"

Additional edits (outside range) to update call sites:

-    check_reponse(response, "test_reasoning: ")
+    check_response(response, "test_reasoning: ")
-        check_reponse(response, f"test_reasoning_effort_{effort}: ")
+        check_response(response, f"test_reasoning_effort_{effort}: ")
-    check_reponse(response, "test_chat: ")
+    check_response(response, "test_chat: ")
-    check_reponse(response, "test_multi_turn_chat_1: ")
+    check_response(response, "test_multi_turn_chat_1: ")
-    check_reponse(response_2, "test_multi_turn_chat_2: ")
+    check_response(response_2, "test_multi_turn_chat_2: ")

156-164: Be robust to function call id naming (call_id vs id).
OpenAI SDK types sometimes expose id under different names. Use fallback to avoid brittleness.

-    assert function_call.name == "get_current_weather"
-
-    args = json.loads(function_call.arguments)
-    answer = get_current_weather(**args)
-    messages.append({
-        "type": "function_call_output",
-        "call_id": function_call.call_id,
-        "output": json.dumps(answer),
-    })
+    assert function_call.name == "get_current_weather"
+    args = json.loads(function_call.arguments)
+    answer = get_current_weather(**args)
+    call_id = getattr(function_call, "call_id", getattr(function_call, "id", None))
+    assert call_id, "Function call id missing"
+    messages.append({
+        "type": "function_call_output",
+        "call_id": call_id,
+        "output": json.dumps(answer),
+    })
tensorrt_llm/serve/harmony_adapter.py (5)

1-3: Ensure NVIDIA copyright header is present.
File lacks the NVIDIA header required by guidelines. Keep existing attributions; prepend NVIDIA lines.

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project

132-167: Docstring punctuation and Python 3.8 typing compatibility.

  • End the summary line with a period (D415).
  • Replace builtin generics/union with typing.List/Optional to support Python 3.8.
-def process_token_batch_to_messages(self,
-                                    tokens: list[int]) -> list[Message]:
-    """
-    Process a batch of tokens while maintaining parsing state.
-    Returns OpenAI Messages for Responses API
-    """
+def process_token_batch_to_messages(self,
+                                    tokens: List[int]) -> List[Message]:
+    """
+    Process a batch of tokens while maintaining parsing state.
+    Returns OpenAI Messages for Responses API.
+    """

Add missing imports (top of file):

-from typing import Any, AsyncGenerator, Literal
+from typing import Any, AsyncGenerator, Literal, List, Optional

360-362: Return type uses Python 3.10 unions; switch to Optional for 3.8+.

-def get_stream_state(self, request_id: str) -> HarmonyStreamState | None:
+def get_stream_state(self, request_id: str) -> Optional[HarmonyStreamState]:
     return self._stream_states.get(request_id, None)

1260-1295: Docstring punctuation and Python 3.8 typing compatibility.
Mirror the earlier change: List/Optional and end the summary line with a period.

-def stateful_stream_harmony_tokens_to_openai_messages(
-        self,
-        request_id: str,
-        tokens: list[int],
-        available_tools: list[dict[str, Any]] | None = None,
-        tool_choice: str | None = None) -> list[Message]:
-    """
-    Process tokens using stateful parsing.
+def stateful_stream_harmony_tokens_to_openai_messages(
+        self,
+        request_id: str,
+        tokens: List[int],
+        available_tools: Optional[List[dict[str, Any]]] = None,
+        tool_choice: Optional[str] = None) -> List[Message]:
+    """
+    Process tokens using stateful parsing.
 
     This method maintains persistent state across multiple calls for the same request,
     ensuring proper channel transitions and tool call handling.
@@
-        Returns:
-            List of OpenAI Messages
+        Returns:
+            List of OpenAI Messages.
     """

77-81: Clarify sent_tool_arguments semantics.
Comment says “length” but code treats it as a boolean. Align for maintainability.

-        # Track sent arguments for tool call streaming deltas
-        self.sent_tool_arguments = {}  # tool_call_id -> sent_arguments_length
+        # Track whether we've sent the first chunk for a tool call (affects id/name nullability in deltas)
+        self.sent_tool_arguments = {}  # tool_call_id -> bool
tensorrt_llm/serve/openai_protocol.py (3)

1-1: Ensure NVIDIA copyright header appears at top.
Prepend the standard NVIDIA header per guidelines (retain attribution line that follows).

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # Adapted from
 # https://github.com/vllm-project/vllm/blob/4db5176d9758b720b05460c50ace3c01026eb158/vllm/entrypoints/openai/protocol.py

680-716: Use List[...] for Python 3.8 and tighten Optional types.
Replace builtin generics with typing.List and Optional where needed.

-class ResponsesRequest(OpenAIBaseModel):
+class ResponsesRequest(OpenAIBaseModel):
@@
-    include: Optional[list[
+    include: Optional[List[
         Literal[
             "code_interpreter_call.outputs",
             "computer_call_output.output.image_url",
             "file_search_call.results",
             "message.input_image.image_url",
             "message.output_text.logprobs",
             "reasoning.encrypted_content",
         ],
-    ]] = None
-    input: Union[str, list[ResponseInputOutputItem]]
+    ]] = None
+    input: Union[str, List[ResponseInputOutputItem]]
@@
-    tools: list[Tool] = Field(default_factory=list)
+    tools: List[Tool] = Field(default_factory=list)

799-865: Use List[...] in ResponsesResponse for Python 3.8.
Also consistent with earlier change in ResponsesRequest.

-class ResponsesResponse(OpenAIBaseModel):
+class ResponsesResponse(OpenAIBaseModel):
@@
-    output: list[ResponseOutputItem]
+    output: List[ResponseOutputItem]
@@
-    tools: list[Tool]
+    tools: List[Tool]
tensorrt_llm/serve/openai_server.py (7)

1-1: Add NVIDIA copyright header (2025).

Per coding guidelines, prepend the NVIDIA copyright header. Place it after the shebang.

 #!/usr/bin/env python
+#
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#

43-46: Prefer module-namespace imports per guidelines.

Newly added symbol imports (ModelList, ResponsesRequest, ResponsesResponse, UsageInfo) violate “preserve module namespaces”. Consider importing the module and referencing attributes.


50-56: Avoid multi-line symbol imports; import the module and alias functions.

To match the guideline and reduce import churn, import tensorrt_llm.serve.responses_utils as a module and reference functions via the module.

-from tensorrt_llm.serve.responses_utils import ConversationHistoryStore
-from tensorrt_llm.serve.responses_utils import \
-    create_response as responses_api_create_response
-from tensorrt_llm.serve.responses_utils import \
-    process_streaming_events as responses_api_process_streaming_events
-from tensorrt_llm.serve.responses_utils import \
-    request_preprocess as responses_api_request_preprocess
+from tensorrt_llm.serve import responses_utils as responses_api

And update call sites accordingly (see below diffs).


93-98: Make env-flag parsing robust; lazily initialize store.

  • Parse boolean envs safely (accept 1/true/yes).
  • Only construct ConversationHistoryStore when enabled.
-# Enable response storage for Responses API
-self.enable_store = True
-if len(os.getenv("TRTLLM_RESPONSES_API_DISABLE_STORE", "")) > 0:
-    self.enable_store = False
-self.conversation_store = ConversationHistoryStore()
+# Enable response storage for Responses API
+self.enable_store = os.getenv("TRTLLM_RESPONSES_API_DISABLE_STORE", "").lower() not in {"1", "true", "yes"}
+self.conversation_store = ConversationHistoryStore() if self.enable_store else None

182-195: Use precise error types and align ID prefix.

  • For 404, use a “NotFoundError” (vs InvalidRequestError).
  • Error text says “resp” while validation uses “resp_”. Align to “resp_”.
 def _create_invalid_response_id_error(self, response_id: str) -> Response:
     return self.create_error_response(
-        err_type="InvalidRequestError",
-        message=(f"Invalid 'response_id': '{response_id}'. "
-                 "Expected an ID that begins with 'resp'."),
+        err_type="InvalidRequestError",
+        message=(f"Invalid 'response_id': '{response_id}'. "
+                 "Expected an ID that begins with 'resp_'."),
     )

 def _create_response_id_not_found_error(self, response_id: str) -> Response:
     return self.create_error_response(
-        err_type="InvalidRequestError",
+        err_type="NotFoundError",
         message=f"Response with id '{response_id}' not found.",
         status_code=HTTPStatus.NOT_FOUND,
     )

799-811: Response ID validation logic is good; align message and handle disabled store.

  • Align error text to “resp_” (see earlier diff).
  • If store is disabled, skip prev-response lookup to avoid None derefs elsewhere.

844-845: Remove unreachable fallback return.

All code paths already return; this block is dead.

-        return JSONResponse(content={"detail": "None"})
-
tensorrt_llm/serve/responses_utils.py (6)

1-1: Add NVIDIA copyright header (2025).

Per guidelines, prepend header.

+#
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#

232-242: Return a copy of conversation history to avoid accidental external mutation.

-                return self.conversations.get(conversation_id, [])
+                return list(self.conversations.get(conversation_id, []))

271-276: Ensure mappings are fully cleaned when evicting responses.

Also remove stale conversation mappings if present.

     def _pop_response(self) -> None:
         responses_debug_log(f"responses type: {type(self.responses)}")
         resp_id, _ = self.responses.popitem(last=False)
-        if resp_id in self.response_to_conversation:
-            self.response_to_conversation.pop(resp_id)
+        conv_id = self.response_to_conversation.pop(resp_id, None)
+        if conv_id is not None and conv_id in self.conversation_to_response:
+            self.conversation_to_response.pop(conv_id, None)

376-383: Docstring (D200) nit: render_for_completion.

One-line docstring would satisfy Ruff, but optional.


488-500: Provide a default for unknown finish reasons instead of raising.

To avoid crashing on new/unknown reasons, default to “failed” or “incomplete”.

-    raise RuntimeError("Should never reach here!")
+    return 'failed'

390-486: parse_output_message: solid mapping; minor robustness.

Consider passing through model logprobs if available instead of None/[] placeholders; otherwise OK.

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between ae89163 and 1aa7242.

📒 Files selected for processing (7)
  • tensorrt_llm/serve/harmony_adapter.py (5 hunks)
  • tensorrt_llm/serve/openai_protocol.py (2 hunks)
  • tensorrt_llm/serve/openai_server.py (5 hunks)
  • tensorrt_llm/serve/responses_utils.py (1 hunks)
  • tests/integration/defs/test_e2e.py (1 hunks)
  • tests/integration/test_lists/test-db/l0_h100.yml (1 hunks)
  • tests/unittest/llmapi/apps/_test_openai_responses.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent with 4 spaces; do not use tabs
Preserve module namespaces in imports: import the subpackage/module, not the symbol (from package.subpackage import foo; foo.SomeClass())
Naming: files snake_case; classes PascalCase; functions/methods snake_case; local variables snake_case (k_ prefix if starting with a number); globals G_ + UPPER_SNAKE_CASE; constants UPPER_SNAKE_CASE
Avoid shadowing outer-scope variables; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; reserve comments for function-internal or file-local interfaces
Use Google-style docstrings for classes and functions; inline docstrings for attributes/variables are allowed
Avoid reflection when straightforward code suffices (e.g., prefer explicit parameters over dict(**locals()))
Use narrow except clauses (e.g., catch FileNotFoundError instead of bare except)
For duck-typing try/except, keep try body minimal and use else for the main logic

Files:

  • tests/integration/defs/test_e2e.py
  • tests/unittest/llmapi/apps/_test_openai_responses.py
  • tensorrt_llm/serve/openai_protocol.py
  • tensorrt_llm/serve/responses_utils.py
  • tensorrt_llm/serve/openai_server.py
  • tensorrt_llm/serve/harmony_adapter.py
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header with current year to all source files

Files:

  • tests/integration/defs/test_e2e.py
  • tests/unittest/llmapi/apps/_test_openai_responses.py
  • tensorrt_llm/serve/openai_protocol.py
  • tensorrt_llm/serve/responses_utils.py
  • tensorrt_llm/serve/openai_server.py
  • tensorrt_llm/serve/harmony_adapter.py
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/test_lists/test-db/l0_h100.yml
🧬 Code graph analysis (6)
tests/integration/defs/test_e2e.py (1)
tests/integration/defs/conftest.py (3)
  • llm_root (180-181)
  • llm_venv (707-723)
  • unittest_path (90-91)
tests/unittest/llmapi/apps/_test_openai_responses.py (1)
examples/models/core/gpt_oss/openai_chat_client_function_calling.py (1)
  • get_current_weather (59-60)
tensorrt_llm/serve/openai_protocol.py (2)
tensorrt_llm/llmapi/llm_args.py (1)
  • Field (69-96)
tensorrt_llm/sampling_params.py (2)
  • SamplingParams (125-486)
  • GuidedDecodingParams (14-36)
tensorrt_llm/serve/responses_utils.py (3)
tensorrt_llm/llmapi/llm.py (1)
  • RequestOutput (46-86)
tensorrt_llm/serve/openai_protocol.py (7)
  • OpenAIBaseModel (59-61)
  • ResponsesRequest (680-780)
  • ResponsesResponse (799-864)
  • to_sampling_params (267-309)
  • to_sampling_params (597-638)
  • to_sampling_params (730-766)
  • from_request (829-864)
tensorrt_llm/serve/harmony_adapter.py (4)
  • get_stop_tokens (363-369)
  • stateful_stream_harmony_tokens_to_openai_messages (1260-1294)
  • get_stream_state (360-361)
  • get_parser (82-83)
tensorrt_llm/serve/openai_server.py (2)
tensorrt_llm/serve/openai_protocol.py (2)
  • ResponsesRequest (680-780)
  • ResponsesResponse (799-864)
tensorrt_llm/serve/responses_utils.py (5)
  • ConversationHistoryStore (138-275)
  • create_response (547-604)
  • process_streaming_events (607-844)
  • request_preprocess (502-544)
  • load_response (153-156)
tensorrt_llm/serve/harmony_adapter.py (1)
tensorrt_llm/logger.py (2)
  • error (125-126)
  • debug (143-144)
🪛 Ruff (0.12.2)
tensorrt_llm/serve/responses_utils.py

340-341: One-line docstring should fit on one line

Reformat to one line

(D200)


390-391: One-line docstring should fit on one line

Reformat to one line

(D200)

tensorrt_llm/serve/harmony_adapter.py

132-134: 1 blank line required between summary line and description

(D205)


132-134: First line should end with a period, question mark, or exclamation point

Add closing punctuation

(D415)


359-362: 1 blank line required between summary line and description

(D205)

🔇 Additional comments (14)
tests/integration/test_lists/test-db/l0_h100.yml (1)

106-106: Wrapper for test_openai_responses verified: defined in tests/integration/defs/test_e2e.py at line 1507 and discoverable by pytest.

tests/integration/defs/test_e2e.py (1)

1516-1521: Wrapper for Responses API tests: LGTM.
Matches the surrounding OpenAI wrappers; no extra flags needed.

tests/unittest/llmapi/apps/_test_openai_responses.py (1)

173-192: Streaming assertions: LGTM.
Confirms both message and reasoning deltas are emitted.

tensorrt_llm/serve/harmony_adapter.py (2)

60-63: Tool name extraction logic: LGTM.
Handles both top-level name and nested function.name.


82-84: Accessor exposure: LGTM.
Returning the StreamableParser is useful for debugging and advanced integrations.

tensorrt_llm/serve/openai_protocol.py (2)

675-678: Type alias: LGTM.
Defines the exact union used by Responses API inputs/outputs.


867-873: Streaming response wrapper: LGTM.
Fields align with OpenAI Responses streaming signals.

tensorrt_llm/serve/openai_server.py (4)

212-214: Route registration LGTM.

New /v1/responses route is correctly wired.


786-789: Harmony-only guard LGTM.

Clear error for unsupported formats.


811-819: Prefer module alias and pass stop tokens via preprocess (already done).

No action required; just noting good reuse of preprocessing.


822-826: Streaming response LGTM.

Correct media type and generator handoff.

tensorrt_llm/serve/responses_utils.py (3)

502-545: request_preprocess: good use of stop tokens and store gating.

LGTM overall. Minor: consider logging when store is disabled to assist debugging.


547-605: create_response: LGTM; consider capturing perf metrics if available.

No blocker.


637-657: Event initialization LGTM.

Good use of created/in_progress events with sequence assignment handled centrally.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (8)
tensorrt_llm/serve/openai_server.py (2)

774-785: Streaming wrapper: signature and model id look correct.

Inner generator returns AsyncGenerator[str, None] and uses self.model. Good improvement and consistent with the utils’ API.


828-835: Return JSON for non-streaming Responses API (consistency with other handlers).

Wrap the Pydantic model in JSONResponse to match chat/completions handlers and ensure correct content-type.

Apply:

-            else:
-                return await responses_api_create_response(
-                    generator=promise,
-                    request=request,
-                    sampling_params=sampling_params,
-                    model_name=self.model,
-                    conversation_store=self.conversation_store,
-                    generation_result=None,
-                    enable_store=self.enable_store)
+            else:
+                rsp = await responses_api_create_response(
+                    generator=promise,
+                    request=request,
+                    sampling_params=sampling_params,
+                    model_name=self.model,
+                    conversation_store=self.conversation_store,
+                    generation_result=None,
+                    enable_store=self.enable_store)
+                return JSONResponse(content=rsp.model_dump())
tensorrt_llm/serve/responses_utils.py (6)

220-231: append_messages: trim the correct conversation.

Call shrinker with conversation_id; resp_id mapping isn’t relevant for trimming.

-            if len(self.conversations[conversation_id]
-                   ) > self.conversation_capacity:
-                self._pop_conversation(resp_id)
+            while len(self.conversations[conversation_id]) > self.conversation_capacity:
+                self._shrink_conversation(conversation_id)

338-344: Avoid mutable default arg and tighten docstring.

Use Optional[list] for prev_msgs and a true one-line docstring (ruff D200).

-def construct_harmony_messages(
+def construct_harmony_messages(
     request: ResponsesRequest,
     prev_response: Optional[ResponsesResponse],
-    prev_msgs: list[Message] = [],
+    prev_msgs: Optional[list[Message]] = None,
 ) -> list[Message]:
-    """Construct messages from request input, includes conversation history messages if exists."""
+    """Construct messages from request input, including conversation history when available."""
@@
-    else:
-        messages.extend(prev_msgs)
+    else:
+        messages.extend(prev_msgs or [])

158-167: Mutable default arg and capacity trim bug (uses resp_id before mapping).

  • Avoid default [] for resp_msgs.
  • When trimming, use the conversation_id you just computed; resp_id isn’t mapped yet, so _pop_conversation(resp_id) no-ops.
-    async def store_response(self,
-                             resp: ResponsesResponse,
-                             resp_msgs: Optional[list[Message]] = [],
-                             prev_resp_id: Optional[str] = None) -> None:
+    async def store_response(self,
+                             resp: ResponsesResponse,
+                             resp_msgs: Optional[list[Message]] = None,
+                             prev_resp_id: Optional[str] = None) -> None:
@@
-            if resp_id in self.response_to_conversation:
+            # Normalize msgs
+            resp_msgs = list(resp_msgs or [])
+            if resp_id in self.response_to_conversation:
                 conversation_id = self.response_to_conversation[resp_id]
                 self.conversations[conversation_id].extend(resp_msgs)
             elif prev_resp_id is not None:
                 conversation_id = self.response_to_conversation[prev_resp_id]
                 self.conversations[conversation_id].extend(resp_msgs)
-                while len(self.conversations[conversation_id]
-                          ) > self.conversation_capacity:
-                    self._pop_conversation(resp_id)
+                while len(self.conversations[conversation_id]) > self.conversation_capacity:
+                    self._shrink_conversation(conversation_id)
             else:
                 conversation_id = random_uuid()
-                self.conversations[conversation_id] = resp_msgs
+                self.conversations[conversation_id] = list(resp_msgs)
@@
             self.response_to_conversation[resp_id] = conversation_id
             self.conversation_to_response[conversation_id] = resp_id
             self._update_visited_conversation(conversation_id)

Also applies to: 169-190


197-214: store_messages overwrites conversation and trims via resp_id; append/trim by conversation_id.

Preserve history by extending and trim deterministically on the active conversation.

-            self.conversations[conversation_id] = msgs
-            if len(self.conversations[conversation_id]
-                   ) > self.conversation_capacity:
-                self._pop_conversation(resp_id)
+            if conversation_id in self.conversations:
+                self.conversations[conversation_id].extend(msgs)
+            else:
+                self.conversations[conversation_id] = list(msgs)
+            while len(self.conversations[conversation_id]) > self.conversation_capacity:
+                self._shrink_conversation(conversation_id)

258-273: Rewrite conversation trimmer: robust, conversation-id based, handle Role type.

Current _pop_conversation mixes Role enum vs str and can IndexError if bounds aren’t found. Replace with a safe _shrink_conversation(conversation_id) and update callers.

-    def _pop_conversation(self, resp_id) -> None:
-        conversation_id = self.response_to_conversation.get(resp_id, None)
-        if conversation_id is None:
-            return
-
-        conversation = self.conversations[conversation_id]
-        first_conversation_range = []
-        for i, msg in enumerate(conversation):
-            if msg.author.role == Role.USER:
-                first_conversation_range.append(i)
-            elif msg.channel == "final":
-                first_conversation_range.append(i)
-                break
-        del conversation[
-            first_conversation_range[0]:first_conversation_range[1] + 1]
+    def _shrink_conversation(self, conversation_id: str) -> None:
+        conversation = self.conversations.get(conversation_id, [])
+        if not conversation:
+            return
+        start_idx = end_idx = None
+        for i, msg in enumerate(conversation):
+            role = getattr(msg.author, "role", msg.author)
+            role = getattr(role, "value", role)
+            if start_idx is None and role == "user":
+                start_idx = i
+            if msg.channel == "final":
+                end_idx = i
+                break
+        if start_idx is not None and end_idx is not None and end_idx >= start_idx:
+            del conversation[start_idx:end_idx + 1]
+        else:
+            # Fallback: drop earliest message(s)
+            del conversation[0:min(2, len(conversation))]

633-637: Streaming IDs and indices: initialize item id and bump content_index on deltas.

  • current_item_id must be stable and non-empty.
  • Increment current_content_index after each delta to keep indices consistent.
-    current_content_index = 0  # FIXME: this number is never changed
+    current_content_index = 0
     current_output_index = 0
-    current_item_id = ""  # FIXME: this number is never changed
+    current_item_id = f"msg_{random_uuid()}"
@@
                 yield _send_event(
                     ResponseTextDeltaEvent(
@@
                     ))
+                current_content_index += 1
@@
                 yield _send_event(
                     ResponseReasoningTextDeltaEvent(
@@
                     ))
+                current_content_index += 1

Also applies to: 782-793, 823-831

🧹 Nitpick comments (5)
tensorrt_llm/serve/openai_server.py (3)

182-187: Fix prefix in error message.

You validate against "resp_" but the message says “resp”. Align the text to avoid confusing clients.

-            message=(f"Invalid 'response_id': '{response_id}'. "
-                     "Expected an ID that begins with 'resp'."),
+            message=(f"Invalid 'response_id': '{response_id}'. "
+                     "Expected an ID that begins with 'resp_'."),

844-845: Remove unreachable return.

All code paths in try/except return; this tail return is dead.

-        return JSONResponse(content={"detail": "None"})

773-773: Line too long (E501).

Break function signature to respect 120 cols (ruff E501).

-    async def openai_responses(self, request: ResponsesRequest, raw_request: Request) -> Response:
+    async def openai_responses(
+        self,
+        request: ResponsesRequest,
+        raw_request: Request,
+    ) -> Response:
tensorrt_llm/serve/responses_utils.py (2)

153-157: Type hint: load_response may return None.

Adjust return annotation to Optional[ResponsesResponse] to reflect OrderedDict.get semantics.

-    async def load_response(self, resp_id: str) -> ResponsesResponse:
+    async def load_response(self, resp_id: str) -> Optional[ResponsesResponse]:

391-394: Docstring style (D200).

Summary fits on one line; condense to satisfy ruff.

-def parse_output_message(message: Message) -> list[ResponseOutputItem]:
-    """
-    Parse a Harmony message into a list of output response items.
-    """
+def parse_output_message(message: Message) -> list[ResponseOutputItem]:
+    """Parse a Harmony message into a list of response output items."""
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 1aa7242 and 02aeb37.

📒 Files selected for processing (3)
  • tensorrt_llm/serve/openai_protocol.py (2 hunks)
  • tensorrt_llm/serve/openai_server.py (5 hunks)
  • tensorrt_llm/serve/responses_utils.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tensorrt_llm/serve/openai_protocol.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent with 4 spaces; do not use tabs
Preserve module namespaces in imports: import the subpackage/module, not the symbol (from package.subpackage import foo; foo.SomeClass())
Naming: files snake_case; classes PascalCase; functions/methods snake_case; local variables snake_case (k_ prefix if starting with a number); globals G_ + UPPER_SNAKE_CASE; constants UPPER_SNAKE_CASE
Avoid shadowing outer-scope variables; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; reserve comments for function-internal or file-local interfaces
Use Google-style docstrings for classes and functions; inline docstrings for attributes/variables are allowed
Avoid reflection when straightforward code suffices (e.g., prefer explicit parameters over dict(**locals()))
Use narrow except clauses (e.g., catch FileNotFoundError instead of bare except)
For duck-typing try/except, keep try body minimal and use else for the main logic

Files:

  • tensorrt_llm/serve/openai_server.py
  • tensorrt_llm/serve/responses_utils.py
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header with current year to all source files

Files:

  • tensorrt_llm/serve/openai_server.py
  • tensorrt_llm/serve/responses_utils.py
🧬 Code graph analysis (2)
tensorrt_llm/serve/openai_server.py (2)
tensorrt_llm/serve/openai_protocol.py (1)
  • ResponsesRequest (680-782)
tensorrt_llm/serve/responses_utils.py (5)
  • ConversationHistoryStore (138-278)
  • create_response (548-605)
  • process_streaming_events (608-845)
  • request_preprocess (503-545)
  • load_response (153-156)
tensorrt_llm/serve/responses_utils.py (3)
tensorrt_llm/llmapi/llm.py (1)
  • RequestOutput (46-86)
tensorrt_llm/serve/openai_protocol.py (7)
  • OpenAIBaseModel (59-61)
  • ResponsesRequest (680-782)
  • ResponsesResponse (801-866)
  • to_sampling_params (267-309)
  • to_sampling_params (597-638)
  • to_sampling_params (730-766)
  • from_request (831-866)
tensorrt_llm/serve/harmony_adapter.py (3)
  • get_stop_tokens (363-369)
  • stateful_stream_harmony_tokens_to_openai_messages (1260-1294)
  • get_parser (82-83)
🪛 Ruff (0.12.2)
tensorrt_llm/serve/openai_server.py

773-773: Line too long (125 > 120)

(E501)

tensorrt_llm/serve/responses_utils.py

392-393: One-line docstring should fit on one line

Reformat to one line

(D200)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@tensorrt-cicd
Collaborator

PR_Github #16841 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12644 completed with status: 'FAILURE'

Collaborator

@LinPoly LinPoly left a comment

LGTM! Shall we add license?

@LinPoly
Collaborator

LinPoly commented Aug 28, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16856 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16856 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12655 completed with status: 'FAILURE'

@JunyiXu-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16923 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16923 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #12714 completed with status: 'FAILURE'

@JunyiXu-nv
Collaborator Author

/bot run --disable-fail-fast

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (8)
tensorrt_llm/serve/responses_utils.py (8)

1-3: Use NVIDIA copyright header; remove vLLM attribution at file top.

This is a production source file; follow repo header policy.

Apply:

-# SPDX-License-Identifier: Apache-2.0
-# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
-
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+"""TensorRT-LLM Responses API utilities."""

161-170: Avoid mutable default arg in store_response and guard None.

Re-introduces a known bug.

Apply:

-    async def store_response(self,
-                             resp: ResponsesResponse,
-                             resp_msgs: Optional[list[Message]] = [],
-                             prev_resp_id: Optional[str] = None) -> None:
+    async def store_response(self,
+                             resp: ResponsesResponse,
+                             resp_msgs: Optional[list[Message]] = None,
+                             prev_resp_id: Optional[str] = None) -> None:
@@
-        async with self.conversations_lock:
+        if resp_msgs is None:
+            resp_msgs = []
+        async with self.conversations_lock:

Also applies to: 172-172


172-193: Conversation trimming calls wrong helper and misses a path.

  • Capacity trimming is invoked with resp_id (no mapping yet), so it’s a no-op.
  • The first branch (existing conversation) doesn’t trim at all.

Apply:

         if resp_id in self.response_to_conversation:
             conversation_id = self.response_to_conversation[resp_id]
-            self.conversations[conversation_id].extend(resp_msgs)
+            self.conversations[conversation_id].extend(resp_msgs)
+            while len(self.conversations[conversation_id]) > self.conversation_capacity:
+                self._shrink_conversation(conversation_id)
         elif prev_resp_id is not None:
             conversation_id = self.response_to_conversation[prev_resp_id]
             self.conversations[conversation_id].extend(resp_msgs)
-            while len(self.conversations[conversation_id]
-                      ) > self.conversation_capacity:
-                self._pop_conversation(resp_id)
+            while len(self.conversations[conversation_id]) > self.conversation_capacity:
+                self._shrink_conversation(conversation_id)
         else:
             conversation_id = random_uuid()
             self.conversations[conversation_id] = resp_msgs

194-217: store_messages overwrites history and trims via wrong call.

Append/init instead of replace; trim the correct conversation.

Apply:

-            self.conversations[conversation_id] = msgs
-            if len(self.conversations[conversation_id]
-                   ) > self.conversation_capacity:
-                self._pop_conversation(resp_id)
+            if conversation_id in self.conversations:
+                self.conversations[conversation_id].extend(msgs)
+            else:
+                self.conversations[conversation_id] = list(msgs)
+            while len(self.conversations[conversation_id]) > self.conversation_capacity:
+                self._shrink_conversation(conversation_id)

218-234: append_messages trims via wrong call; ensure loop until under capacity.

Use conversation_id, not resp_id.

Apply:

-            if len(self.conversations[conversation_id]
-                   ) > self.conversation_capacity:
-                self._pop_conversation(resp_id)
+            while len(self.conversations[conversation_id]) > self.conversation_capacity:
+                self._shrink_conversation(conversation_id)

261-276: Rewrite pop helper: accept conversation_id, normalize roles, handle edge cases.

Current code can IndexError and mixes Role enum vs string.

Apply:

-    def _pop_conversation(self, resp_id) -> None:
-        conversation_id = self.response_to_conversation.get(resp_id, None)
-        if conversation_id is None:
-            return
-
-        conversation = self.conversations[conversation_id]
-        first_conversation_range = []
-        for i, msg in enumerate(conversation):
-            if msg.author.role == Role.USER:
-                first_conversation_range.append(i)
-            elif msg.channel == "final":
-                first_conversation_range.append(i)
-                break
-        del conversation[
-            first_conversation_range[0]:first_conversation_range[1] + 1]
+    def _shrink_conversation(self, conversation_id: str) -> None:
+        conversation = self.conversations.get(conversation_id, [])
+        if not conversation:
+            return
+        start_idx = None
+        end_idx = None
+        for i, msg in enumerate(conversation):
+            role = getattr(getattr(msg, "author", None), "role", None)
+            role = getattr(role, "value", role)
+            if start_idx is None and role == "user":
+                start_idx = i
+            if msg.channel == "final":
+                end_idx = i
+                break
+        if start_idx is None or end_idx is None or end_idx < start_idx:
+            del conversation[0:min(2, len(conversation))]
+        else:
+            del conversation[start_idx:end_idx + 1]

341-347: Avoid mutable default arg in construct_harmony_messages; guard prev_msgs.

Also satisfies Ruff’s D200 one-line docstring guideline already present.

Apply:

-def construct_harmony_messages(
+def construct_harmony_messages(
     request: ResponsesRequest,
     prev_response: Optional[ResponsesResponse],
-    prev_msgs: list[Message] = [],
+    prev_msgs: Optional[list[Message]] = None,
 ) -> list[Message]:
@@
-    else:
-        messages.extend(prev_msgs)
+    else:
+        messages.extend(prev_msgs or [])

Also applies to: 357-358


611-665: Initialize stable item_id, reset per item, and advance content_index on deltas.

Empty item_id breaks clients; indices never advance.

Apply:

-    current_content_index = 0  # FIXME: this number is never changed
+    current_content_index = 0
     current_output_index = 0
-    current_item_id = ""  # FIXME: this number is never changed
+    current_item_id = f"msg_{random_uuid()}"
-        if parser.state == StreamState.EXPECT_START:
+        if parser.state == StreamState.EXPECT_START:
+            current_content_index = 0
+            current_item_id = f"msg_{random_uuid()}"
             current_output_index += 1
             sent_output_item_added = False
                 yield _send_event(
                     ResponseTextDeltaEvent(
@@
                     ))
+                current_content_index += 1
                 yield _send_event(
                     ResponseReasoningTextDeltaEvent(
@@
                     ))
+                current_content_index += 1

Also applies to: 676-680, 753-796, 826-835

🧹 Nitpick comments (1)
tensorrt_llm/serve/responses_utils.py (1)

492-503: finish_reason_mapping: add safe default instead of raising.

Prefer returning "incomplete" or "failed" to avoid crashing on new reasons.

Apply:

-    raise RuntimeError("Should never reach here!")
+    return "incomplete"
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 02aeb37 and d20deb3.

📒 Files selected for processing (1)
  • tensorrt_llm/serve/responses_utils.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cc,cxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hh,hxx,cpp,cc,cxx,cu,cuh,py}: If a constructor parameter name conflicts with a public member, add a trailing underscore to the parameter (e.g., foo_).
Use uppercase literal suffixes (e.g., 1234L not 1234l).
Use spaces, not tabs; indent by 4 spaces.

Files:

  • tensorrt_llm/serve/responses_utils.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Target Python 3.8+ for all Python code.
Indent with 4 spaces; do not use tabs.
Maintain module namespace on imports; import the module/submodule, not individual classes/functions (e.g., from package.subpackage import foo; foo.SomeClass()).
Python filenames use snake_case (e.g., some_file.py).
Class names use PascalCase.
Function and method names use snake_case.
Local variable names use snake_case; if starting with a number, prefix with k_ (e.g., k_99th_percentile).
Global variables use UPPER_SNAKE_CASE and prefix G_ (e.g., G_MY_GLOBAL).
Constants use UPPER_SNAKE_CASE.
Avoid shadowing outer-scope variables.
Initialize all externally visible members of a class in init.
Prefer docstrings for interfaces used outside a file; use comments for local code within functions or local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline with docstrings placed under the definition.
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns).
Limit except clauses to specific exceptions; avoid bare except.
For duck-typing try/except, keep try blocks minimal and use else for the main logic.

Files:

  • tensorrt_llm/serve/responses_utils.py
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA 2025 Apache-2.0 copyright header block at the top of all source files (.cpp, .h, .cu, .py).

Files:

  • tensorrt_llm/serve/responses_utils.py
🧠 Learnings (1)
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tensorrt_llm/serve/responses_utils.py
🧬 Code graph analysis (1)
tensorrt_llm/serve/responses_utils.py (3)
tensorrt_llm/llmapi/llm.py (1)
  • RequestOutput (46-86)
tensorrt_llm/serve/openai_protocol.py (7)
  • OpenAIBaseModel (59-61)
  • ResponsesRequest (680-782)
  • ResponsesResponse (801-866)
  • to_sampling_params (267-309)
  • to_sampling_params (597-638)
  • to_sampling_params (730-766)
  • from_request (831-866)
tensorrt_llm/serve/harmony_adapter.py (4)
  • get_stop_tokens (363-369)
  • stateful_stream_harmony_tokens_to_openai_messages (1260-1294)
  • get_stream_state (360-361)
  • get_parser (82-83)
🪛 Ruff (0.12.2)
tensorrt_llm/serve/responses_utils.py

395-396: One-line docstring should fit on one line

Reformat to one line

(D200)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@tensorrt-cicd
Collaborator

PR_Github #16939 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16939 [ run ] completed with state FAILURE

@JunyiXu-nv
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #16940 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16940 [ run ] completed with state FAILURE

@JunyiXu-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16943 [ run ] completed with state FAILURE

@JunyiXu-nv JunyiXu-nv force-pushed the dev-junyi-responses-api-impl branch from d20deb3 to ef912de on August 29, 2025 09:08
@JunyiXu-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16965 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16965 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #12734 completed with status: 'FAILURE'

@JunyiXu-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #17126 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #17126 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #12879 completed with status: 'FAILURE'

@JunyiXu-nv JunyiXu-nv force-pushed the dev-junyi-responses-api-impl branch from ef912de to a7da33a on September 1, 2025 02:24
@JunyiXu-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #17137 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #17137 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12886 completed with status: 'FAILURE'

@JunyiXu-nv JunyiXu-nv force-pushed the dev-junyi-responses-api-impl branch from a7da33a to 99663be on September 2, 2025 00:27
@JunyiXu-nv
Collaborator Author

/mbot run

@JunyiXu-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #17265 [ run ] triggered by Bot

@JunyiXu-nv JunyiXu-nv enabled auto-merge (squash) September 2, 2025 03:38
@tensorrt-cicd
Collaborator

PR_Github #17265 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12975 completed with status: 'FAILURE'

@JunyiXu-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #17317 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #17317 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13015 completed with status: 'SUCCESS'

@JunyiXu-nv JunyiXu-nv merged commit eefe5f2 into NVIDIA:main Sep 2, 2025
5 checks passed
JunyiXu-nv added a commit to JunyiXu-nv/TensorRT-LLM that referenced this pull request Sep 8, 2025
JunyiXu-nv added a commit to JunyiXu-nv/TensorRT-LLM that referenced this pull request Sep 8, 2025
…NVIDIA#7341)

Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com>