-
Notifications
You must be signed in to change notification settings - Fork 1.8k
[TRTLLM-7208][feat] Implement basic functionalities for Responses API #7341
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[TRTLLM-7208][feat] Implement basic functionalities for Responses API #7341
Conversation
📝 WalkthroughWalkthroughAdds a new OpenAI Responses API with /v1/responses, stateful Harmony streaming (per-request stream state and token→OpenAI message conversion), response protocol models, conversation history storage, streaming utilities, orchestration glue, and tests (unit and integration). Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant C as Client
participant S as OpenAIServer (/v1/responses)
participant U as responses_utils
participant H as HarmonyAdapter
participant L as LLM Engine
C->>S: POST ResponsesRequest (stream=false)
S->>U: request_preprocess(request, prev_response, H, store, enable_store)
U->>H: construct_harmony_messages(...)
S->>L: generate_async(inputs, sampling_params)
L-->>S: generation_result
S->>U: create_response(generator, request, sampling_params, model_name, store, generation_result)
U-->>S: ResponsesResponse
S-->>C: 200 ResponsesResponse
sequenceDiagram
autonumber
participant C as Client (SSE)
participant S as OpenAIServer (/v1/responses)
participant U as responses_utils
participant H as HarmonyAdapter
participant L as LLM Engine
C->>S: POST ResponsesRequest (stream=true)
S->>U: request_preprocess(...)
U->>H: construct_harmony_messages(...)
S->>L: generate_async(...)
loop token chunks
L-->>S: tokens/events
S->>U: process_streaming_events(request, params, generator, H, model_name, store)
U-->>S: SSE event (created/in_progress/deltas/tool_call/etc.)
S-->>C: SSE event
end
U-->>S: final response.completed
S-->>C: SSE final + [DONE]
sequenceDiagram
autonumber
participant U as responses_utils
participant H as HarmonyAdapter
participant P as HarmonyStreamState
participant O as OpenAI Messages
U->>H: start streaming with request_id
H->>P: get_stream_state(request_id)
loop tokens
U->>H: stateful_stream_harmony_tokens_to_openai_messages(request_id, tokens, tools, tool_choice)
H->>P: process_token_batch_to_messages(tokens)
P-->>H: accumulated messages
H-->>U: OpenAI messages
U->>O: emit deltas/messages
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60–90 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Tip 🔌 Remote MCP (Model Context Protocol) integration is now available!Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats. ✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
Status, Documentation and Community
|
f77e55e
to
1aa7242
Compare
/bot run |
PR_Github #16841 [ run ] triggered by Bot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 12
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/serve/harmony_adapter.py (1)
1489-1563
: Bug: uses generator.outputs after loop; should use last res.
Accessing generator.outputs will raise; use last iteration’s res and guard empty streams.async def handle_streaming_response( @@ - # Clean up state - harmony_adapter.cleanup_stream_state(request_id) - - # Send final message with finish_reason - output = generator.outputs[0] + # Clean up state + harmony_adapter.cleanup_stream_state(request_id) + + # Send final message with finish_reason + try: + output = res.outputs[0] # use last yielded result + finish_reason = output.finish_reason + stop_reason = output.stop_reason + except UnboundLocalError: + # No chunks yielded; emit minimal completed chunk + finish_reason = None + stop_reason = None final_response = ChatCompletionStreamResponse( model=request.model, choices=[ ChatCompletionResponseStreamChoice( index=0, - delta=DeltaMessage(), - finish_reason=output.finish_reason, - stop_reason=output.stop_reason) + delta=DeltaMessage(), + finish_reason=finish_reason, + stop_reason=stop_reason) ])
🧹 Nitpick comments (24)
tests/unittest/llmapi/apps/_test_openai_responses.py (3)
1-1
: Add NVIDIA copyright header.
Tests are Python files; guidelines require the NVIDIA header with current year.+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + import json
32-42
: Fix typo in helper name and improve assertion messages.
Rename check_reponse -> check_response; assertion text reads better as “does not exist”.-def check_reponse(response, prefix=""): +def check_response(response, prefix=""): reasoning_exist, message_exist = False, False for output in response.output: if output.type == "reasoning": reasoning_exist = True elif output.type == "message": message_exist = True - assert reasoning_exist, f"{prefix}Reasoning content not exists!" - assert message_exist, f"{prefix}Message content not exists!" + assert reasoning_exist, f"{prefix}Reasoning content does not exist!" + assert message_exist, f"{prefix}Message content does not exist!"Additional edits (outside range) to update call sites:
- check_reponse(response, "test_reasoning: ") + check_response(response, "test_reasoning: ") - check_reponse(response, f"test_reasoning_effort_{effort}: ") + check_response(response, f"test_reasoning_effort_{effort}: ") - check_reponse(response, "test_chat: ") + check_response(response, "test_chat: ") - check_reponse(response, "test_multi_turn_chat_1: ") + check_response(response, "test_multi_turn_chat_1: ") - check_reponse(response_2, "test_multi_turn_chat_2: ") + check_response(response_2, "test_multi_turn_chat_2: ")
156-164
: Be robust to function call id naming (call_id vs id).
OpenAI SDK types sometimes expose id under different names. Use fallback to avoid brittleness.- assert function_call.name == "get_current_weather" - - args = json.loads(function_call.arguments) - answer = get_current_weather(**args) - messages.append({ - "type": "function_call_output", - "call_id": function_call.call_id, - "output": json.dumps(answer), - }) + assert function_call.name == "get_current_weather" + args = json.loads(function_call.arguments) + answer = get_current_weather(**args) + call_id = getattr(function_call, "call_id", getattr(function_call, "id", None)) + assert call_id, "Function call id missing" + messages.append({ + "type": "function_call_output", + "call_id": call_id, + "output": json.dumps(answer), + })tensorrt_llm/serve/harmony_adapter.py (5)
1-3
: Ensure NVIDIA copyright header is present.
File lacks the NVIDIA header required by guidelines. Keep existing attributions; prepend NVIDIA lines.+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-License-Identifier: Apache-2.0 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
132-167
: Docstring punctuation and Python 3.8 typing compatibility.
- End the summary line with a period (D415).
- Replace builtin generics/union with typing.List/Optional to support Python 3.8.
-def process_token_batch_to_messages(self, - tokens: list[int]) -> list[Message]: - """ - Process a batch of tokens while maintaining parsing state. - Returns OpenAI Messages for Responses API - """ +def process_token_batch_to_messages(self, + tokens: List[int]) -> List[Message]: + """ + Process a batch of tokens while maintaining parsing state. + Returns OpenAI Messages for Responses API. + """Add missing imports (top of file):
-from typing import Any, AsyncGenerator, Literal +from typing import Any, AsyncGenerator, Literal, List, Optional
360-362
: Return type uses Python 3.10 unions; switch to Optional for 3.8+.-def get_stream_state(self, request_id: str) -> HarmonyStreamState | None: +def get_stream_state(self, request_id: str) -> Optional[HarmonyStreamState]: return self._stream_states.get(request_id, None)
1260-1295
: Docstring punctuation and Python 3.8 typing compatibility.
Mirror the earlier change: List/Optional and end the summary line with a period.-def stateful_stream_harmony_tokens_to_openai_messages( - self, - request_id: str, - tokens: list[int], - available_tools: list[dict[str, Any]] | None = None, - tool_choice: str | None = None) -> list[Message]: - """ - Process tokens using stateful parsing. +def stateful_stream_harmony_tokens_to_openai_messages( + self, + request_id: str, + tokens: List[int], + available_tools: Optional[List[dict[str, Any]]] = None, + tool_choice: Optional[str] = None) -> List[Message]: + """ + Process tokens using stateful parsing. This method maintains persistent state across multiple calls for the same request, ensuring proper channel transitions and tool call handling. @@ - Returns: - List of OpenAI Messages + Returns: + List of OpenAI Messages. """
77-81
: Clarify sent_tool_arguments semantics.
Comment says “length” but code treats it as a boolean. Align for maintainability.- # Track sent arguments for tool call streaming deltas - self.sent_tool_arguments = {} # tool_call_id -> sent_arguments_length + # Track whether we've sent the first chunk for a tool call (affects id/name nullability in deltas) + self.sent_tool_arguments = {} # tool_call_id -> booltensorrt_llm/serve/openai_protocol.py (3)
1-1
: Ensure NVIDIA copyright header appears at top.
Prepend the standard NVIDIA header per guidelines (retain attribution line that follows).+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # Adapted from # https://github.com/vllm-project/vllm/blob/4db5176d9758b720b05460c50ace3c01026eb158/vllm/entrypoints/openai/protocol.py
680-716
: Use List[...] for Python 3.8 and tighten Optional types.
Replace builtin generics with typing.List and Optional where needed.-class ResponsesRequest(OpenAIBaseModel): +class ResponsesRequest(OpenAIBaseModel): @@ - include: Optional[list[ + include: Optional[List[ Literal[ "code_interpreter_call.outputs", "computer_call_output.output.image_url", "file_search_call.results", "message.input_image.image_url", "message.output_text.logprobs", "reasoning.encrypted_content", ], - ]] = None - input: Union[str, list[ResponseInputOutputItem]] + ]] = None + input: Union[str, List[ResponseInputOutputItem]] @@ - tools: list[Tool] = Field(default_factory=list) + tools: List[Tool] = Field(default_factory=list)
799-865
: Use List[...] in ResponsesResponse for Python 3.8.
Also consistent with earlier change in ResponsesRequest.-class ResponsesResponse(OpenAIBaseModel): +class ResponsesResponse(OpenAIBaseModel): @@ - output: list[ResponseOutputItem] + output: List[ResponseOutputItem] @@ - tools: list[Tool] + tools: List[Tool]tensorrt_llm/serve/openai_server.py (7)
1-1
: Add NVIDIA copyright header (2025).Per coding guidelines, prepend the NVIDIA copyright header. Place it after the shebang.
#!/usr/bin/env python +# +# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +#
43-46
: Prefer module-namespace imports per guidelines.Newly added symbol imports (ModelList, ResponsesRequest, ResponsesResponse, UsageInfo) violate “preserve module namespaces”. Consider importing the module and referencing attributes.
50-56
: Avoid multi-line symbol imports; import the module and alias functions.To match the guideline and reduce import churn, import
tensorrt_llm.serve.responses_utils
as a module and reference functions via the module.-from tensorrt_llm.serve.responses_utils import ConversationHistoryStore -from tensorrt_llm.serve.responses_utils import \ - create_response as responses_api_create_response -from tensorrt_llm.serve.responses_utils import \ - process_streaming_events as responses_api_process_streaming_events -from tensorrt_llm.serve.responses_utils import \ - request_preprocess as responses_api_request_preprocess +from tensorrt_llm.serve import responses_utils as responses_apiAnd update call sites accordingly (see below diffs).
93-98
: Make env-flag parsing robust; lazily initialize store.
- Parse boolean envs safely (accept 1/true/yes).
- Only construct
ConversationHistoryStore
when enabled.-# Enable response storage for Responses API -self.enable_store = True -if len(os.getenv("TRTLLM_RESPONSES_API_DISABLE_STORE", "")) > 0: - self.enable_store = False -self.conversation_store = ConversationHistoryStore() +# Enable response storage for Responses API +self.enable_store = os.getenv("TRTLLM_RESPONSES_API_DISABLE_STORE", "").lower() not in {"1", "true", "yes"} +self.conversation_store = ConversationHistoryStore() if self.enable_store else None
182-195
: Use precise error types and align ID prefix.
- For 404, use a “NotFoundError” (vs InvalidRequestError).
- Error text says “resp” while validation uses “resp_”. Align to “resp_”.
def _create_invalid_response_id_error(self, response_id: str) -> Response: return self.create_error_response( - err_type="InvalidRequestError", - message=(f"Invalid 'response_id': '{response_id}'. " - "Expected an ID that begins with 'resp'."), + err_type="InvalidRequestError", + message=(f"Invalid 'response_id': '{response_id}'. " + "Expected an ID that begins with 'resp_'."), ) def _create_response_id_not_found_error(self, response_id: str) -> Response: return self.create_error_response( - err_type="InvalidRequestError", + err_type="NotFoundError", message=f"Response with id '{response_id}' not found.", status_code=HTTPStatus.NOT_FOUND, )
799-811
: Response ID validation logic is good; align message and handle disabled store.
- Align error text to “resp_” (see earlier diff).
- If store is disabled, skip prev-response lookup to avoid
None
derefs elsewhere.
844-845
: Remove unreachable fallback return.All code paths already return; this block is dead.
- return JSONResponse(content={"detail": "None"}) -tensorrt_llm/serve/responses_utils.py (6)
1-1
: Add NVIDIA copyright header (2025).Per guidelines, prepend header.
+# +# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +#
232-242
: Return a copy of conversation history to avoid accidental external mutation.- return self.conversations.get(conversation_id, []) + return list(self.conversations.get(conversation_id, []))
271-276
: Ensure mappings are fully cleaned when evicting responses.Also remove stale conversation mappings if present.
def _pop_response(self) -> None: responses_debug_log(f"responses type: {type(self.responses)}") resp_id, _ = self.responses.popitem(last=False) - if resp_id in self.response_to_conversation: - self.response_to_conversation.pop(resp_id) + conv_id = self.response_to_conversation.pop(resp_id, None) + if conv_id is not None and conv_id in self.conversation_to_response: + self.conversation_to_response.pop(conv_id, None)
376-383
: Docstring (D200) nit: render_for_completion.One-line docstring would satisfy Ruff, but optional.
488-500
: Provide a default for unknown finish reasons instead of raising.To avoid crashing on new/unknown reasons, default to “failed” or “incomplete”.
- raise RuntimeError("Should never reach here!") + return 'failed'
390-486
: parse_output_message: solid mapping; minor robustness.Consider passing through model logprobs if available instead of
None
/[]
placeholders; otherwise OK.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (7)
tensorrt_llm/serve/harmony_adapter.py
(5 hunks)tensorrt_llm/serve/openai_protocol.py
(2 hunks)tensorrt_llm/serve/openai_server.py
(5 hunks)tensorrt_llm/serve/responses_utils.py
(1 hunks)tests/integration/defs/test_e2e.py
(1 hunks)tests/integration/test_lists/test-db/l0_h100.yml
(1 hunks)tests/unittest/llmapi/apps/_test_openai_responses.py
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent with 4 spaces; do not use tabs
Preserve module namespaces in imports: import the subpackage/module, not the symbol (from package.subpackage import foo; foo.SomeClass())
Naming: files snake_case; classes PascalCase; functions/methods snake_case; local variables snake_case (k_ prefix if starting with a number); globals G_ + UPPER_SNAKE_CASE; constants UPPER_SNAKE_CASE
Avoid shadowing outer-scope variables; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; reserve comments for function-internal or file-local interfaces
Use Google-style docstrings for classes and functions; inline docstrings for attributes/variables are allowed
Avoid reflection when straightforward code suffices (e.g., prefer explicit parameters over dict(**locals()))
Use narrow except clauses (e.g., catch FileNotFoundError instead of bare except)
For duck-typing try/except, keep try body minimal and use else for the main logic
Files:
tests/integration/defs/test_e2e.py
tests/unittest/llmapi/apps/_test_openai_responses.py
tensorrt_llm/serve/openai_protocol.py
tensorrt_llm/serve/responses_utils.py
tensorrt_llm/serve/openai_server.py
tensorrt_llm/serve/harmony_adapter.py
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header with current year to all source files
Files:
tests/integration/defs/test_e2e.py
tests/unittest/llmapi/apps/_test_openai_responses.py
tensorrt_llm/serve/openai_protocol.py
tensorrt_llm/serve/responses_utils.py
tensorrt_llm/serve/openai_server.py
tensorrt_llm/serve/harmony_adapter.py
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/test_lists/test-db/l0_h100.yml
🧬 Code graph analysis (6)
tests/integration/defs/test_e2e.py (1)
tests/integration/defs/conftest.py (3)
llm_root
(180-181)llm_venv
(707-723)unittest_path
(90-91)
tests/unittest/llmapi/apps/_test_openai_responses.py (1)
examples/models/core/gpt_oss/openai_chat_client_function_calling.py (1)
get_current_weather
(59-60)
tensorrt_llm/serve/openai_protocol.py (2)
tensorrt_llm/llmapi/llm_args.py (1)
Field
(69-96)tensorrt_llm/sampling_params.py (2)
SamplingParams
(125-486)GuidedDecodingParams
(14-36)
tensorrt_llm/serve/responses_utils.py (3)
tensorrt_llm/llmapi/llm.py (1)
RequestOutput
(46-86)tensorrt_llm/serve/openai_protocol.py (7)
OpenAIBaseModel
(59-61)ResponsesRequest
(680-780)ResponsesResponse
(799-864)to_sampling_params
(267-309)to_sampling_params
(597-638)to_sampling_params
(730-766)from_request
(829-864)tensorrt_llm/serve/harmony_adapter.py (4)
get_stop_tokens
(363-369)stateful_stream_harmony_tokens_to_openai_messages
(1260-1294)get_stream_state
(360-361)get_parser
(82-83)
tensorrt_llm/serve/openai_server.py (2)
tensorrt_llm/serve/openai_protocol.py (2)
ResponsesRequest
(680-780)ResponsesResponse
(799-864)tensorrt_llm/serve/responses_utils.py (5)
ConversationHistoryStore
(138-275)create_response
(547-604)process_streaming_events
(607-844)request_preprocess
(502-544)load_response
(153-156)
tensorrt_llm/serve/harmony_adapter.py (1)
tensorrt_llm/logger.py (2)
error
(125-126)debug
(143-144)
🪛 Ruff (0.12.2)
tensorrt_llm/serve/responses_utils.py
340-341: One-line docstring should fit on one line
Reformat to one line
(D200)
390-391: One-line docstring should fit on one line
Reformat to one line
(D200)
tensorrt_llm/serve/harmony_adapter.py
132-134: 1 blank line required between summary line and description
(D205)
132-134: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
359-362: 1 blank line required between summary line and description
(D205)
🔇 Additional comments (14)
tests/integration/test_lists/test-db/l0_h100.yml (1)
106-106
: Wrapper for test_openai_responses verified Defined in tests/integration/defs/test_e2e.py at line 1507 and is discoverable by pytest.tests/integration/defs/test_e2e.py (1)
1516-1521
: Wrapper for Responses API tests: LGTM.
Matches the surrounding OpenAI wrappers; no extra flags needed.tests/unittest/llmapi/apps/_test_openai_responses.py (1)
173-192
: Streaming assertions: LGTM.
Confirms both message and reasoning deltas are emitted.tensorrt_llm/serve/harmony_adapter.py (2)
60-63
: Tool name extraction logic: LGTM.
Handles both top-level name and nested function.name.
82-84
: Accessor exposure: LGTM.
Returning the StreamableParser is useful for debugging and advanced integrations.tensorrt_llm/serve/openai_protocol.py (2)
675-678
: Type alias: LGTM.
Defines the exact union used by Responses API inputs/outputs.
867-873
: Streaming response wrapper: LGTM.
Fields align with OpenAI Responses streaming signals.tensorrt_llm/serve/openai_server.py (4)
212-214
: Route registration LGTM.New /v1/responses route is correctly wired.
786-789
: Harmony-only guard LGTM.Clear error for unsupported formats.
811-819
: Prefer module alias and pass stop tokens via preprocess (already done).No action required; just noting good reuse of preprocessing.
822-826
: Streaming response LGTM.Correct media type and generator handoff.
tensorrt_llm/serve/responses_utils.py (3)
502-545
: request_preprocess: good use of stop tokens and store gating.LGTM overall. Minor: consider logging when store is disabled to assist debugging.
547-605
: create_response: LGTM; consider capturing perf metrics if available.No blocker.
637-657
: Event initialization LGTM.Good use of created/in_progress events with sequence assignment handled centrally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (8)
tensorrt_llm/serve/openai_server.py (2)
774-785
: Streaming wrapper: signature and model id look correct.Inner generator returns AsyncGenerator[str, None] and uses self.model. Good improvement and consistent with the utils’ API.
828-835
: Return JSON for non-streaming Responses API (consistency with other handlers).Wrap the Pydantic model in JSONResponse to match chat/completions handlers and ensure correct content-type.
Apply:
- else: - return await responses_api_create_response( - generator=promise, - request=request, - sampling_params=sampling_params, - model_name=self.model, - conversation_store=self.conversation_store, - generation_result=None, - enable_store=self.enable_store) + else: + rsp = await responses_api_create_response( + generator=promise, + request=request, + sampling_params=sampling_params, + model_name=self.model, + conversation_store=self.conversation_store, + generation_result=None, + enable_store=self.enable_store) + return JSONResponse(content=rsp.model_dump())tensorrt_llm/serve/responses_utils.py (6)
220-231
: append_messages: trim the correct conversation.Call shrinker with conversation_id; resp_id mapping isn’t relevant for trimming.
- if len(self.conversations[conversation_id] - ) > self.conversation_capacity: - self._pop_conversation(resp_id) + while len(self.conversations[conversation_id]) > self.conversation_capacity: + self._shrink_conversation(conversation_id)
338-344
: Avoid mutable default arg and tighten docstring.Use Optional[list] for prev_msgs and a true one-line docstring (ruff D200).
-def construct_harmony_messages( +def construct_harmony_messages( request: ResponsesRequest, prev_response: Optional[ResponsesResponse], - prev_msgs: list[Message] = [], + prev_msgs: Optional[list[Message]] = None, ) -> list[Message]: - """Construct messages from request input, includes conversation history messages if exists.""" + """Construct messages from request input, including conversation history when available.""" @@ - else: - messages.extend(prev_msgs) + else: + messages.extend(prev_msgs or [])
158-167
: Mutable default arg and capacity trim bug (uses resp_id before mapping).
- Avoid default [] for resp_msgs.
- When trimming, use the conversation_id you just computed; resp_id isn’t mapped yet, so _pop_conversation(resp_id) no-ops.
- async def store_response(self, - resp: ResponsesResponse, - resp_msgs: Optional[list[Message]] = [], - prev_resp_id: Optional[str] = None) -> None: + async def store_response(self, + resp: ResponsesResponse, + resp_msgs: Optional[list[Message]] = None, + prev_resp_id: Optional[str] = None) -> None: @@ - if resp_id in self.response_to_conversation: + # Normalize msgs + resp_msgs = list(resp_msgs or []) + if resp_id in self.response_to_conversation: conversation_id = self.response_to_conversation[resp_id] self.conversations[conversation_id].extend(resp_msgs) elif prev_resp_id is not None: conversation_id = self.response_to_conversation[prev_resp_id] self.conversations[conversation_id].extend(resp_msgs) - while len(self.conversations[conversation_id] - ) > self.conversation_capacity: - self._pop_conversation(resp_id) + while len(self.conversations[conversation_id]) > self.conversation_capacity: + self._shrink_conversation(conversation_id) else: conversation_id = random_uuid() - self.conversations[conversation_id] = resp_msgs + self.conversations[conversation_id] = list(resp_msgs) @@ self.response_to_conversation[resp_id] = conversation_id self.conversation_to_response[conversation_id] = resp_id self._update_visited_conversation(conversation_id)Also applies to: 169-190
197-214
: store_messages overwrites conversation and trims via resp_id; append/trim by conversation_id.Preserve history by extending and trim deterministically on the active conversation.
- self.conversations[conversation_id] = msgs - if len(self.conversations[conversation_id] - ) > self.conversation_capacity: - self._pop_conversation(resp_id) + if conversation_id in self.conversations: + self.conversations[conversation_id].extend(msgs) + else: + self.conversations[conversation_id] = list(msgs) + while len(self.conversations[conversation_id]) > self.conversation_capacity: + self._shrink_conversation(conversation_id)
258-273
: Rewrite conversation trimmer: robust, conversation-id based, handle Role type.Current _pop_conversation mixes Role enum vs str and can IndexError if bounds aren’t found. Replace with a safe _shrink_conversation(conversation_id) and update callers.
- def _pop_conversation(self, resp_id) -> None: - conversation_id = self.response_to_conversation.get(resp_id, None) - if conversation_id is None: - return - - conversation = self.conversations[conversation_id] - first_conversation_range = [] - for i, msg in enumerate(conversation): - if msg.author.role == Role.USER: - first_conversation_range.append(i) - elif msg.channel == "final": - first_conversation_range.append(i) - break - del conversation[ - first_conversation_range[0]:first_conversation_range[1] + 1] + def _shrink_conversation(self, conversation_id: str) -> None: + conversation = self.conversations.get(conversation_id, []) + if not conversation: + return + start_idx = end_idx = None + for i, msg in enumerate(conversation): + role = getattr(msg.author, "role", msg.author) + role = getattr(role, "value", role) + if start_idx is None and role == "user": + start_idx = i + if msg.channel == "final": + end_idx = i + break + if start_idx is not None and end_idx is not None and end_idx >= start_idx: + del conversation[start_idx:end_idx + 1] + else: + # Fallback: drop earliest message(s) + del conversation[0:min(2, len(conversation))]
633-637
: Streaming IDs and indices: initialize item id and bump content_index on deltas.
- current_item_id must be stable and non-empty.
- Increment current_content_index after each delta to keep indices consistent.
- current_content_index = 0 # FIXME: this number is never changed + current_content_index = 0 current_output_index = 0 - current_item_id = "" # FIXME: this number is never changed + current_item_id = f"msg_{random_uuid()}" @@ yield _send_event( ResponseTextDeltaEvent( @@ )) + current_content_index += 1 @@ yield _send_event( ResponseReasoningTextDeltaEvent( @@ )) + current_content_index += 1Also applies to: 782-793, 823-831
🧹 Nitpick comments (5)
tensorrt_llm/serve/openai_server.py (3)
182-187
: Fix prefix in error message.You validate against "resp_" but the message says “resp”. Align the text to avoid confusing clients.
- message=(f"Invalid 'response_id': '{response_id}'. " - "Expected an ID that begins with 'resp'."), + message=(f"Invalid 'response_id': '{response_id}'. " + "Expected an ID that begins with 'resp_'."),
844-845
: Remove unreachable return.All code paths in try/except return; this tail return is dead.
- return JSONResponse(content={"detail": "None"})
773-773
: Line too long (E501).Break function signature to respect 120 cols (ruff E501).
- async def openai_responses(self, request: ResponsesRequest, raw_request: Request) -> Response: + async def openai_responses( + self, + request: ResponsesRequest, + raw_request: Request, + ) -> Response:tensorrt_llm/serve/responses_utils.py (2)
153-157
: Type hint: load_response may return None.Adjust return annotation to Optional[ResponsesResponse] to reflect OrderedDict.get semantics.
- async def load_response(self, resp_id: str) -> ResponsesResponse: + async def load_response(self, resp_id: str) -> Optional[ResponsesResponse]:
391-394
: Docstring style (D200).Summary fits on one line; condense to satisfy ruff.
-def parse_output_message(message: Message) -> list[ResponseOutputItem]: - """ - Parse a Harmony message into a list of output response items. - """ +def parse_output_message(message: Message) -> list[ResponseOutputItem]: + """Parse a Harmony message into a list of response output items."""
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
tensorrt_llm/serve/openai_protocol.py
(2 hunks)tensorrt_llm/serve/openai_server.py
(5 hunks)tensorrt_llm/serve/responses_utils.py
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tensorrt_llm/serve/openai_protocol.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent with 4 spaces; do not use tabs
Preserve module namespaces in imports: import the subpackage/module, not the symbol (from package.subpackage import foo; foo.SomeClass())
Naming: files snake_case; classes PascalCase; functions/methods snake_case; local variables snake_case (k_ prefix if starting with a number); globals G_ + UPPER_SNAKE_CASE; constants UPPER_SNAKE_CASE
Avoid shadowing outer-scope variables; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; reserve comments for function-internal or file-local interfaces
Use Google-style docstrings for classes and functions; inline docstrings for attributes/variables are allowed
Avoid reflection when straightforward code suffices (e.g., prefer explicit parameters over dict(**locals()))
Use narrow except clauses (e.g., catch FileNotFoundError instead of bare except)
For duck-typing try/except, keep try body minimal and use else for the main logic
Files:
tensorrt_llm/serve/openai_server.py
tensorrt_llm/serve/responses_utils.py
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header with current year to all source files
Files:
tensorrt_llm/serve/openai_server.py
tensorrt_llm/serve/responses_utils.py
🧬 Code graph analysis (2)
tensorrt_llm/serve/openai_server.py (2)
tensorrt_llm/serve/openai_protocol.py (1)
ResponsesRequest
(680-782)tensorrt_llm/serve/responses_utils.py (5)
ConversationHistoryStore
(138-278)create_response
(548-605)process_streaming_events
(608-845)request_preprocess
(503-545)load_response
(153-156)
tensorrt_llm/serve/responses_utils.py (3)
tensorrt_llm/llmapi/llm.py (1)
RequestOutput
(46-86)tensorrt_llm/serve/openai_protocol.py (7)
OpenAIBaseModel
(59-61)ResponsesRequest
(680-782)ResponsesResponse
(801-866)to_sampling_params
(267-309)to_sampling_params
(597-638)to_sampling_params
(730-766)from_request
(831-866)tensorrt_llm/serve/harmony_adapter.py (3)
get_stop_tokens
(363-369)stateful_stream_harmony_tokens_to_openai_messages
(1260-1294)get_parser
(82-83)
🪛 Ruff (0.12.2)
tensorrt_llm/serve/openai_server.py
773-773: Line too long (125 > 120)
(E501)
tensorrt_llm/serve/responses_utils.py
392-393: One-line docstring should fit on one line
Reformat to one line
(D200)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
PR_Github #16841 [ run ] completed with state |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Shall we add license?
/bot run |
PR_Github #16856 [ run ] triggered by Bot |
PR_Github #16856 [ run ] completed with state |
/bot run |
PR_Github #16923 [ run ] triggered by Bot |
PR_Github #16923 [ run ] completed with state |
/bot run --disable-fail-fast |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
♻️ Duplicate comments (8)
tensorrt_llm/serve/responses_utils.py (8)
1-3
: Use NVIDIA copyright header; remove vLLM attribution at file top.This is a production source file; follow repo header policy.
Apply:
-# SPDX-License-Identifier: Apache-2.0 -# SPDX-FileCopyrightText: Copyright contributors to the vLLM project - +# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +"""TensorRT-LLM Responses API utilities."""
161-170
: Avoid mutable default arg in store_response and guard None.Re-introduces a known bug.
Apply:
- async def store_response(self, - resp: ResponsesResponse, - resp_msgs: Optional[list[Message]] = [], - prev_resp_id: Optional[str] = None) -> None: + async def store_response(self, + resp: ResponsesResponse, + resp_msgs: Optional[list[Message]] = None, + prev_resp_id: Optional[str] = None) -> None: @@ - async with self.conversations_lock: + if resp_msgs is None: + resp_msgs = [] + async with self.conversations_lock:Also applies to: 172-172
172-193
: Conversation trimming calls wrong helper and misses a path.
- Capacity trimming is invoked with resp_id (no mapping yet), so it’s a no-op.
- The first branch (existing conversation) doesn’t trim at all.
Apply:
if resp_id in self.response_to_conversation: conversation_id = self.response_to_conversation[resp_id] - self.conversations[conversation_id].extend(resp_msgs) + self.conversations[conversation_id].extend(resp_msgs) + while len(self.conversations[conversation_id]) > self.conversation_capacity: + self._shrink_conversation(conversation_id) elif prev_resp_id is not None: conversation_id = self.response_to_conversation[prev_resp_id] self.conversations[conversation_id].extend(resp_msgs) - while len(self.conversations[conversation_id] - ) > self.conversation_capacity: - self._pop_conversation(resp_id) + while len(self.conversations[conversation_id]) > self.conversation_capacity: + self._shrink_conversation(conversation_id) else: conversation_id = random_uuid() self.conversations[conversation_id] = resp_msgs
194-217
: store_messages overwrites history and trims via wrong call.Append/init instead of replace; trim the correct conversation.
Apply:
- self.conversations[conversation_id] = msgs - if len(self.conversations[conversation_id] - ) > self.conversation_capacity: - self._pop_conversation(resp_id) + if conversation_id in self.conversations: + self.conversations[conversation_id].extend(msgs) + else: + self.conversations[conversation_id] = list(msgs) + while len(self.conversations[conversation_id]) > self.conversation_capacity: + self._shrink_conversation(conversation_id)
218-234
: append_messages trims via wrong call; ensure loop until under capacity.Use conversation_id, not resp_id.
Apply:
- if len(self.conversations[conversation_id] - ) > self.conversation_capacity: - self._pop_conversation(resp_id) + while len(self.conversations[conversation_id]) > self.conversation_capacity: + self._shrink_conversation(conversation_id)
261-276
: Rewrite pop helper: accept conversation_id, normalize roles, handle edge cases.Current code can IndexError and mixes Role enum vs string.
Apply:
- def _pop_conversation(self, resp_id) -> None: - conversation_id = self.response_to_conversation.get(resp_id, None) - if conversation_id is None: - return - - conversation = self.conversations[conversation_id] - first_conversation_range = [] - for i, msg in enumerate(conversation): - if msg.author.role == Role.USER: - first_conversation_range.append(i) - elif msg.channel == "final": - first_conversation_range.append(i) - break - del conversation[ - first_conversation_range[0]:first_conversation_range[1] + 1] + def _shrink_conversation(self, conversation_id: str) -> None: + conversation = self.conversations.get(conversation_id, []) + if not conversation: + return + start_idx = None + end_idx = None + for i, msg in enumerate(conversation): + role = getattr(getattr(msg, "author", None), "role", None) + role = getattr(role, "value", role) + if start_idx is None and role == "user": + start_idx = i + if msg.channel == "final": + end_idx = i + break + if start_idx is None or end_idx is None or end_idx < start_idx: + del conversation[0:min(2, len(conversation))] + else: + del conversation[start_idx:end_idx + 1]
341-347
: Avoid mutable default arg in construct_harmony_messages; guard prev_msgs.Also satisfies Ruff’s D200 one-line docstring guideline already present.
Apply:
-def construct_harmony_messages( +def construct_harmony_messages( request: ResponsesRequest, prev_response: Optional[ResponsesResponse], - prev_msgs: list[Message] = [], + prev_msgs: Optional[list[Message]] = None, ) -> list[Message]: @@ - else: - messages.extend(prev_msgs) + else: + messages.extend(prev_msgs or [])Also applies to: 357-358
611-665
: Initialize stable item_id, reset per item, and advance content_index on deltas.Empty item_id breaks clients; indices never advance.
Apply:
- current_content_index = 0 # FIXME: this number is never changed + current_content_index = 0 current_output_index = 0 - current_item_id = "" # FIXME: this number is never changed + current_item_id = f"msg_{random_uuid()}"- if parser.state == StreamState.EXPECT_START: + if parser.state == StreamState.EXPECT_START: + current_content_index = 0 + current_item_id = f"msg_{random_uuid()}" current_output_index += 1 sent_output_item_added = Falseyield _send_event( ResponseTextDeltaEvent( @@ )) + current_content_index += 1
yield _send_event( ResponseReasoningTextDeltaEvent( @@ )) + current_content_index += 1
Also applies to: 676-680, 753-796, 826-835
🧹 Nitpick comments (1)
tensorrt_llm/serve/responses_utils.py (1)
492-503
: finish_reason_mapping: add safe default instead of raising.Prefer returning "incomplete" or "failed" to avoid crashing on new reasons.
Apply:
- raise RuntimeError("Should never reach here!") + return "incomplete"
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
tensorrt_llm/serve/responses_utils.py
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cc,cxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cpp,cc,cxx,cu,cuh,py}
: If a constructor parameter name conflicts with a public member, add a trailing underscore to the parameter (e.g., foo_).
Use uppercase literal suffixes (e.g., 1234L not 1234l).
Use spaces, not tabs; indent by 4 spaces.
Files:
tensorrt_llm/serve/responses_utils.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Target Python 3.8+ for all Python code.
Indent with 4 spaces; do not use tabs.
Maintain module namespace on imports; import the module/submodule, not individual classes/functions (e.g., from package.subpackage import foo; foo.SomeClass()).
Python filenames use snake_case (e.g., some_file.py).
Class names use PascalCase.
Function and method names use snake_case.
Local variable names use snake_case; if starting with a number, prefix with k_ (e.g., k_99th_percentile).
Global variables use UPPER_SNAKE_CASE and prefix G_ (e.g., G_MY_GLOBAL).
Constants use UPPER_SNAKE_CASE.
Avoid shadowing outer-scope variables.
Initialize all externally visible members of a class in init.
Prefer docstrings for interfaces used outside a file; use comments for local code within functions or local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline with docstrings placed under the definition.
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns).
Limit except clauses to specific exceptions; avoid bare except.
For duck-typing try/except, keep try blocks minimal and use else for the main logic.
Files:
tensorrt_llm/serve/responses_utils.py
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA 2025 Apache-2.0 copyright header block at the top of all source files (.cpp, .h, .cu, .py).
Files:
tensorrt_llm/serve/responses_utils.py
🧠 Learnings (1)
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tensorrt_llm/serve/responses_utils.py
🧬 Code graph analysis (1)
tensorrt_llm/serve/responses_utils.py (3)
tensorrt_llm/llmapi/llm.py (1)
RequestOutput
(46-86)tensorrt_llm/serve/openai_protocol.py (7)
OpenAIBaseModel
(59-61)ResponsesRequest
(680-782)ResponsesResponse
(801-866)to_sampling_params
(267-309)to_sampling_params
(597-638)to_sampling_params
(730-766)from_request
(831-866)tensorrt_llm/serve/harmony_adapter.py (4)
get_stop_tokens
(363-369)stateful_stream_harmony_tokens_to_openai_messages
(1260-1294)get_stream_state
(360-361)get_parser
(82-83)
🪛 Ruff (0.12.2)
tensorrt_llm/serve/responses_utils.py
395-396: One-line docstring should fit on one line
Reformat to one line
(D200)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
PR_Github #16939 [ run ] triggered by Bot |
PR_Github #16939 [ run ] completed with state |
/bot run --disable-fail-fast |
PR_Github #16940 [ run ] triggered by Bot |
PR_Github #16940 [ run ] completed with state |
/bot run |
PR_Github #16943 [ run ] completed with state |
d20deb3
to
ef912de
Compare
/bot run |
PR_Github #16965 [ run ] triggered by Bot |
PR_Github #16965 [ run ] completed with state |
/bot run |
PR_Github #17126 [ run ] triggered by Bot |
PR_Github #17126 [ run ] completed with state |
ef912de
to
a7da33a
Compare
/bot run |
PR_Github #17137 [ run ] triggered by Bot |
PR_Github #17137 [ run ] completed with state |
Signed-off-by: Junyi Xu <junyix@nvidia.com>
Signed-off-by: Junyi Xu <junyix@nvidia.com>
Signed-off-by: Junyi Xu <junyix@nvidia.com>
Signed-off-by: Junyi Xu <junyix@nvidia.com>
a7da33a
to
99663be
Compare
/mbot run |
/bot run |
PR_Github #17265 [ run ] triggered by Bot |
PR_Github #17265 [ run ] completed with state |
/bot run |
PR_Github #17317 [ run ] triggered by Bot |
PR_Github #17317 [ run ] completed with state |
…NVIDIA#7341) Signed-off-by: Junyi Xu <junyix@nvidia.com>
…NVIDIA#7341) Signed-off-by: Junyi Xu <219237550+JunyiXu-nv@users.noreply.github.com>
Summary by CodeRabbit
New Features
Tests
Description
New Features
Tests
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help]
to print this help message.See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in thestage-list
parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.For guidance on mapping tests to stage names, see
docs/source/reference/ci-overview.md
and the
scripts/test_to_stage_mapping.py
helper.kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.