[TRTLLM-6549][feat] add perf metrics endpoint to openai server and openai disagg server #6985
Conversation
📝 Walkthrough
Adds per-request perf metrics capture to OpenAIServer, and cross-server correlation and aggregation in OpenAIDisaggServer, exposed through a new /perf_metrics endpoint.
Changes
Sequence Diagram(s)
```mermaid
sequenceDiagram
    participant Client
    participant OpenAIServer
    participant MetricsQueue as "perf_metrics (deque)"
    Client->>OpenAIServer: POST /v1/chat or /v1/completions
    OpenAIServer->>OpenAIServer: process request
    OpenAIServer->>MetricsQueue: _extract_metrics(res) -- acquire lock, append
    OpenAIServer-->>Client: response
    Client->>OpenAIServer: GET /perf_metrics
    OpenAIServer->>MetricsQueue: copy + reset (under lock)
    OpenAIServer-->>Client: { perf_metrics: [...] }
```
```mermaid
sequenceDiagram
    participant Client
    participant DisaggServer
    participant CtxServer
    participant GenServer
    Client->>DisaggServer: Disaggregated request
    DisaggServer->>DisaggServer: if ctx required and perf enabled: _add_perf_metrics_keys(ctx, gen, ctx_request_id)
    Client->>DisaggServer: GET /perf_metrics
    par Poll both servers
        DisaggServer->>CtxServer: GET /perf_metrics
        CtxServer-->>DisaggServer: ctx metrics
        DisaggServer->>GenServer: GET /perf_metrics
        GenServer-->>DisaggServer: gen metrics
    end
    DisaggServer->>DisaggServer: correlate by (ctx_server, gen_server, ctx_request_id) under lock
    DisaggServer-->>Client: aggregated entries with ctx_perf_metrics & gen_perf_metrics
```
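As a rough illustration of the flow above, a client could poll the aggregated endpoint like this (a minimal sketch; the server address is a placeholder and the exact per-request metric fields depend on the deployment):

```python
import json
import urllib.request

# Hypothetical disagg server address; in a real setup it comes from the YAML config.
SERVER_URL = "http://localhost:8000"

# Fetch whatever correlated entries the disagg server has aggregated so far.
with urllib.request.urlopen(f"{SERVER_URL}/perf_metrics", timeout=10) as resp:
    entries = json.load(resp)

for entry in entries:
    # Each aggregated entry pairs the context and generation sides of one request.
    print(entry["ctx_server"], entry["gen_server"])
    print(entry["ctx_perf_metrics"], entry["gen_perf_metrics"])
```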
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Possibly related PRs
Suggested labels
Suggested reviewers
Actionable comments posted: 3
🧹 Nitpick comments (9)
tensorrt_llm/serve/openai_server.py (2)
85-87: Bound the in-memory perf metrics buffer to avoid unbounded growth
If clients don't poll /perf_metrics frequently (or ever), self.perf_metrics can grow without bound. Consider a bounded deque to cap memory usage.
Apply within the shown range:
```diff
-        self.perf_metrics = []
+        self.perf_metrics = deque(maxlen=1000)
         self.perf_metrics_lock = asyncio.Lock()
```
Add the import at the top of the file (outside the shown range):
```python
from collections import deque
```
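For reference, deque(maxlen=N) silently discards the oldest entry once the cap is reached, which is exactly the drop-oldest retention wanted here (a small standalone illustration, not project code):

```python
from collections import deque

buf = deque(maxlen=3)
for i in range(5):
    buf.append(i)  # once full, appending evicts the oldest element

print(list(buf))  # [2, 3, 4] -- only the newest 3 entries are retained
```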
245-281: Harden serialization against optional/absent fields
speculative_decoding and some timing/KV-cache fields may be absent or None depending on backend/features. Accessing attributes unguarded can raise. Recommend using getattr to guard optional fields.
Apply within the shown range:
```diff
-        for metrics_dict in perf_metrics:
+        for metrics_dict in perf_metrics:
             metrics = metrics_dict["perf_metrics"]
             timing_metrics = metrics.timing_metrics
             kv_cache_metrics = metrics.kv_cache_metrics
             speculative_decoding = metrics.speculative_decoding
             # TODO: other metrics
             metrics_json = {
                 "first_iter": metrics.first_iter,
                 "last_iter": metrics.last_iter,
-                "arrival_time": timing_metrics.arrival_time.total_seconds(),
-                "first_scheduled_time": timing_metrics.first_scheduled_time.total_seconds(),
-                "first_token_time": timing_metrics.first_token_time.total_seconds(),
-                "last_token_time": timing_metrics.last_token_time.total_seconds(),
+                "arrival_time": getattr(timing_metrics, "arrival_time").total_seconds(),
+                "first_scheduled_time": getattr(timing_metrics, "first_scheduled_time").total_seconds(),
+                "first_token_time": getattr(timing_metrics, "first_token_time").total_seconds(),
+                "last_token_time": getattr(timing_metrics, "last_token_time").total_seconds(),
                 "num_total_allocated_blocks": kv_cache_metrics.num_total_allocated_blocks,
                 "num_new_allocated_blocks": kv_cache_metrics.num_new_allocated_blocks,
                 "num_reused_blocks": kv_cache_metrics.num_reused_blocks,
                 "num_missed_blocks": kv_cache_metrics.num_missed_blocks,
             }
-            if timing_metrics.kv_cache_size > 0:
+            kv_cache_size = getattr(timing_metrics, "kv_cache_size", 0)
+            if kv_cache_size and kv_cache_size > 0:
                 metrics_json.update({
-                    "kv_cache_size": timing_metrics.kv_cache_size,
-                    "kv_cache_transfer_start": timing_metrics.kv_cache_transfer_start.total_seconds(),
-                    "kv_cache_transfer_end": timing_metrics.kv_cache_transfer_end.total_seconds(),
+                    "kv_cache_size": kv_cache_size,
+                    **({"kv_cache_transfer_start": getattr(timing_metrics, "kv_cache_transfer_start").total_seconds()}
+                       if getattr(timing_metrics, "kv_cache_transfer_start", None) is not None else {}),
+                    **({"kv_cache_transfer_end": getattr(timing_metrics, "kv_cache_transfer_end").total_seconds()}
+                       if getattr(timing_metrics, "kv_cache_transfer_end", None) is not None else {}),
                 })
-            if speculative_decoding.total_draft_tokens > 0:
+            if getattr(speculative_decoding, "total_draft_tokens", 0) > 0:
                 metrics_json.update({
-                    "acceptance_rate": speculative_decoding.acceptance_rate,
-                    "total_accepted_draft_tokens": speculative_decoding.total_accepted_draft_tokens,
-                    "total_draft_tokens": speculative_decoding.total_draft_tokens,
+                    "acceptance_rate": getattr(speculative_decoding, "acceptance_rate"),
+                    "total_accepted_draft_tokens": getattr(speculative_decoding, "total_accepted_draft_tokens"),
+                    "total_draft_tokens": getattr(speculative_decoding, "total_draft_tokens"),
                 })
             metrics_dict["perf_metrics"] = metrics_json
         return JSONResponse(content=perf_metrics)
```
tests/unittest/llmapi/apps/_test_openai_perf_metrics.py (1)
61-66: Use equality (==) instead of identity (is) and add a short poll to avoid flakes
response.status is 200 relies on CPython small-int interning and is not the intended check; use ==. Also, adding a brief retry loop reduces flakiness if metrics are exposed slightly after the completion returns.
Apply within the shown range:
```diff
-    response = urlopen(f'{server.url_root}/perf_metrics')
-    assert response.status is 200
-
-    data_list = json.loads(response.read())
-    assert len(data_list) == 1
+    # Poll a few times in case metric extraction lags the response
+    data_list = []
+    for _ in range(10):
+        with urlopen(f'{server.url_root}/perf_metrics') as response:
+            assert response.status == 200
+            data_list = json.loads(response.read())
+        if len(data_list) >= 1:
+            break
+        time.sleep(0.2)
+    assert len(data_list) == 1
```
Add this import at the top of the file (outside the shown range):
```python
import time
```
tensorrt_llm/serve/openai_disagg_server.py (1)
137-146: Consider parallelizing fan-out GETs and handling non-200s gracefully
The current loop fetches each server's /perf_metrics sequentially and assumes success. For lower latency and resilience under partial failures, gather requests concurrently and skip/record servers that return non-200 or raise.
If helpful, I can provide a concurrent asyncio.gather version with per-server error handling.
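For illustration, such a gather-based fan-out could be shaped roughly like this (a minimal standalone sketch; the helper names and signatures are hypothetical, and the real change would live on OpenAIDisaggServer reusing its self.session and self.ctx_servers + self.gen_servers):

```python
import asyncio
from typing import Dict, List, Optional, Tuple

import aiohttp


async def fetch_one(session: aiohttp.ClientSession,
                    server: str) -> Tuple[str, Optional[list]]:
    """Fetch one server's /perf_metrics; return (server, payload) or (server, None) on failure."""
    try:
        async with session.get(f"{server}/perf_metrics") as response:
            if response.status != 200:
                return server, None
            return server, await response.json()
    except aiohttp.ClientError:
        return server, None


async def fetch_all_perf_metrics(session: aiohttp.ClientSession,
                                 servers: List[str]) -> Dict[str, list]:
    """Query every ctx/gen server concurrently, keeping only the successful responses."""
    results = await asyncio.gather(*(fetch_one(session, s) for s in servers))
    return {server: payload for server, payload in results if payload is not None}
```

tests/integration/defs/disaggregated/test_disaggregated.py (5)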
20-20: Import Optional for accurate type annotation usage
You assign None to a Callable-typed parameter later. Bring in Optional to annotate it correctly.
Apply this diff:
```diff
-from typing import Callable
+from typing import Callable, Optional
```
39-45: Add return type, docstring, and explicit UTF-8 decoding
Small polish to improve clarity and robustness when reading YAML.
Apply this diff:
```diff
-def get_disagg_server_url_from_cfg(config_file: str):
-    with open(config_file, 'r') as file:
+def get_disagg_server_url_from_cfg(config_file: str) -> str:
+    """Derive the disaggregated server base URL from a YAML config."""
+    with open(config_file, 'r', encoding='utf-8') as file:
         config = yaml.safe_load(file)
     server_host = config.get('hostname', 'localhost')
     server_port = config.get('port', 8000)
     return f"http://{server_host}:{server_port}"
```
165-166: Make extra_endpoints_test annotation Optional to match default None
Type checkers will flag Callable[[str], None] with a None default. Use Optional[Callable[…]].
Apply this diff:
```diff
-                            prompt_file="prompts.json",
-                            extra_endpoints_test: Callable[[str], None] = None):
+                            prompt_file="prompts.json",
+                            extra_endpoints_test: Optional[Callable[[str], None]] = None):
```
246-249: Avoid re-reading the YAML config on every iteration
Minor inefficiency: the server URL is constant for the test. Compute it once outside the loop.
Apply this diff within the shown lines:
```diff
-                if extra_endpoints_test is not None:
-                    extra_endpoints_test(
-                        get_disagg_server_url_from_cfg(config_file))
+                if extra_endpoints_test is not None:
+                    extra_endpoints_test(server_url)
```
And add this one-liner just before entering the workers/server context (or right after computing server_cmd), outside the per-iteration loop:
```python
server_url = get_disagg_server_url_from_cfg(config_file)
```
561-571: Relax strict temporal assertions to reduce flakiness across schedulers/clocks
Timing can be equal at boundaries due to clock resolution and ordering. Using < may intermittently fail; prefer <= for adjacent events. Keep strict ordering where causality must hold.
Apply this diff:
```diff
-        assert ctx_metrics["first_token_time"] == ctx_metrics["last_token_time"]
-        assert ctx_metrics["last_token_time"] < gen_metrics["arrival_time"]
+        assert ctx_metrics["first_token_time"] == ctx_metrics["last_token_time"]
+        assert ctx_metrics["last_token_time"] <= gen_metrics["arrival_time"]
@@
-        assert gen_metrics["arrival_time"] < gen_metrics[
+        assert gen_metrics["arrival_time"] <= gen_metrics[
             "kv_cache_transfer_start"]
-        assert gen_metrics["kv_cache_transfer_start"] < gen_metrics[
+        assert gen_metrics["kv_cache_transfer_start"] <= gen_metrics[
             "kv_cache_transfer_end"]
-        assert gen_metrics["kv_cache_transfer_end"] < gen_metrics[
+        assert gen_metrics["kv_cache_transfer_end"] <= gen_metrics[
             "first_scheduled_time"]
```
If ctx/gen servers ever run on different hosts with unsynchronized clocks, consider validating relative durations (non-negative deltas) instead of absolute cross-server timestamps.
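One way to phrase that delta-based check, as a hedged sketch reusing the metric field names from this test (the helper name and field ordering shown in the commented call are illustrative):

```python
def assert_non_negative_deltas(metrics: dict, ordered_fields: list) -> None:
    """Check that each adjacent pair of timestamps is non-decreasing within one server's clock."""
    for earlier, later in zip(ordered_fields, ordered_fields[1:]):
        delta = metrics[later] - metrics[earlier]
        assert delta >= 0, f"{later} precedes {earlier} by {-delta:.6f}s"

# Validate ordering only within the generation server's own timeline, avoiding
# comparisons across possibly unsynchronized ctx/gen clocks, e.g.:
# assert_non_negative_deltas(gen_metrics, [
#     "arrival_time", "kv_cache_transfer_start", "kv_cache_transfer_end",
#     "first_scheduled_time", "first_token_time", "last_token_time",
# ])
```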
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
tensorrt_llm/serve/openai_disagg_server.py (4 hunks)
tensorrt_llm/serve/openai_server.py (8 hunks)
tests/integration/defs/disaggregated/test_configs/disagg_config_metrics.yaml (1 hunks)
tests/integration/defs/disaggregated/test_disaggregated.py (6 hunks)
tests/integration/defs/test_e2e.py (1 hunks)
tests/unittest/llmapi/apps/_test_openai_perf_metrics.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tests/integration/defs/test_e2e.py
tests/unittest/llmapi/apps/_test_openai_perf_metrics.py
tensorrt_llm/serve/openai_disagg_server.py
tensorrt_llm/serve/openai_server.py
tests/integration/defs/disaggregated/test_disaggregated.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tests/integration/defs/test_e2e.py
tests/unittest/llmapi/apps/_test_openai_perf_metrics.py
tensorrt_llm/serve/openai_disagg_server.py
tensorrt_llm/serve/openai_server.py
tests/integration/defs/disaggregated/test_disaggregated.py
🧬 Code Graph Analysis (5)
tests/integration/defs/test_e2e.py (1)
tests/integration/defs/conftest.py (4)
llm_venv (707-723), test_root (2189-2190), llm_root (180-181), unittest_path (90-91)
tests/unittest/llmapi/apps/_test_openai_perf_metrics.py (1)
tests/integration/defs/test_e2e.py (1)
temp_extra_llm_api_options_file(724-759)
tensorrt_llm/serve/openai_disagg_server.py (1)
tests/unittest/disaggregated/test_router.py (1)
get(23-25)
tensorrt_llm/serve/openai_server.py (5)
tensorrt_llm/serve/openai_disagg_server.py (1)
perf_metrics (137-168)
tensorrt_llm/llmapi/llm.py (1)
RequestOutput (47-85)
tensorrt_llm/executor/result.py (3)
finished (531-532), outputs (193-208), request_id (508-509)
tensorrt_llm/metrics/collector.py (1)
log_metrics_dict (100-105)
tests/integration/defs/accuracy/test_disaggregated_serving.py (1)
outputs(41-42)
tests/integration/defs/disaggregated/test_disaggregated.py (2)
tests/integration/defs/conftest.py (3)
disaggregated_test_root (2339-2344), llm_venv (707-723), disaggregated_example_root (270-275)
tensorrt_llm/serve/openai_disagg_server.py (1)
perf_metrics(137-168)
🔇 Additional comments (13)
tests/integration/defs/disaggregated/test_configs/disagg_config_metrics.yaml (1)
8-25: Config looks good for exercising disaggregated perf-metrics
The configuration enables return_perf_metrics on both ctx/gen servers and provides distinct URLs, which is required for correlation in the disaggregated /perf_metrics flow. Nothing blocking from my side.
tensorrt_llm/serve/openai_server.py (6)
158-158: LGTM: new /perf_metrics route is correctly registered
The new endpoint is added consistently alongside existing metrics endpoints.
292-307: LGTM: metrics extraction gated on finish and guarded by a lock
The early return for unfinished results and the locked append prevent races and partial metrics from leaking into the endpoint.
323-324: LGTM: streaming path calls extractor per yielded chunk
Only finished results will be recorded; the call here is safe due to your finish-guard in _extract_metrics.
341-342: LGTM: non-streaming chat path records metrics once at completion
This ensures a single per-request metric entry.
434-435: LGTM: completions (non-streaming) path records metrics on completion
Consistent with chat path; good.
470-471: LGTM: completions streaming path calls extractor per chunk
Same rationale as chat streaming; safe with the finish-guard.
tests/integration/defs/test_e2e.py (1)
1500-1506: LGTM: adds the perf-metrics test to the e2e suite
This integrates the new module cleanly without impacting existing tests.
tests/unittest/llmapi/apps/_test_openai_perf_metrics.py (1)
87-92: LGTM: negative checks for disaggregation-only fields
These assertions correctly ensure the standalone server doesn’t emit disagg-only fields in its perf metrics.
tensorrt_llm/serve/openai_disagg_server.py (2)
122-123: LGTM: new /perf_metrics endpoint exposed
This aligns with the ctx/gen servers and enables aggregation at the disagg layer.
307-310: LGTM: pre-register correlation keys for requests requiring ctx
Capturing (ctx_server, gen_server, ctx_request_id) under a lock is the right primitive for later aggregation.
tests/integration/defs/disaggregated/test_disaggregated.py (2)
571-575: Nice wiring of the perf_metrics end-to-end check
Good reuse of run_disaggregated_test with a pluggable endpoint checker; this keeps the dataplane test flow consistent.
69-69: Confirmed config path and contents
The file tests/integration/defs/disaggregated/test_configs/disagg_config_metrics.yaml exists and includes the return_perf_metrics entries, so no changes are needed.
/bot run
PR_Github #15684 [ run ] triggered by Bot
Actionable comments posted: 1
🔭 Outside diff range comments (1)
tensorrt_llm/serve/openai_disagg_server.py (1)
418-445: Python 3.8 compatibility: replace list[str] with List[str]
Same PEP 585 generics issue here; update annotations for method and inner helper.
Apply this diff:
```diff
-    async def wait_for_all_servers_ready(cls, session: aiohttp.ClientSession,
-                                         ctx_servers: list[str],
-                                         gen_servers: list[str],
+    async def wait_for_all_servers_ready(cls, session: aiohttp.ClientSession,
+                                         ctx_servers: List[str],
+                                         gen_servers: List[str],
                                          server_start_timeout_secs: int = 180):
-        async def get_unready_servers(servers: list[str]) -> list[str]:
+        async def get_unready_servers(servers: List[str]) -> List[str]:
```
Ensure typing imports include List (already present).
♻️ Duplicate comments (3)
tensorrt_llm/serve/openai_disagg_server.py (2)
57-63: Python 3.8 compatibility: replace PEP 585 generics with typing.List/Dict/Tuple (duplicate of prior review)
Using list[...]/dict[...] requires Python 3.9+. The project targets 3.8+. Also tighten the nested mapping type for ctx_request_id keys.
Apply within the shown range:
```diff
-        # record corresponding keys of context and generation servers for perf metrics
-        self.perf_metrics_keys: list[tuple[str, str, int]] = []  # (ctx_server_key, gen_server_key, ctx_request_id)
+        # record corresponding keys of context and generation servers for perf metrics
+        self.perf_metrics_keys: List[Tuple[str, str, int]] = []  # (ctx_server_key, gen_server_key, ctx_request_id)
         self.perf_metrics_keys_lock = asyncio.Lock()
-        self.server_perf_metrics: dict[str, dict] = {}  # server_key -> {ctx_request_id: perf_metrics}
+        self.server_perf_metrics: Dict[str, Dict[int, dict]] = {}  # server_key -> {ctx_request_id: perf_metrics}
         for server in self.ctx_servers + self.gen_servers:
             self.server_perf_metrics[server] = {}
```
And update typing imports at the top of the file:
```diff
-from typing import List, Optional, Type, Union
+from typing import Dict, List, Optional, Tuple, Type, Union
```
145-166: Fix data loss and partial-pair emission when correlating ctx/gen metrics (duplicate of prior review)
Current logic pops gen first, then appends even if ctx is missing (ctx_perf_metrics can be None). This both drops ctx data prematurely and can emit incomplete pairs. Peek (get) until both present, then pop both.
Apply this diff:
```diff
-        remain_keys = []
-        for ctx_server, gen_server, ctx_request_id in self.perf_metrics_keys:
-            gen_perf_metrics = self.server_perf_metrics[gen_server].pop(ctx_request_id, None)
-            if gen_perf_metrics is None:
-                # generation not finished
-                remain_keys.append((ctx_server, gen_server, ctx_request_id))
-                continue
-            ctx_perf_metrics = self.server_perf_metrics[ctx_server].pop(ctx_request_id, None)
-            return_metrics.append({
-                "ctx_server": ctx_server,
-                "gen_server": gen_server,
-                "ctx_perf_metrics": ctx_perf_metrics,
-                "gen_perf_metrics": gen_perf_metrics})
-        self.perf_metrics_keys = remain_keys
+        remain_keys = []
+        for ctx_server, gen_server, ctx_request_id in self.perf_metrics_keys:
+            ctx_perf_metrics = self.server_perf_metrics.get(ctx_server, {}).get(ctx_request_id)
+            gen_perf_metrics = self.server_perf_metrics.get(gen_server, {}).get(ctx_request_id)
+            if not (ctx_perf_metrics and gen_perf_metrics):
+                # one side not finished yet
+                remain_keys.append((ctx_server, gen_server, ctx_request_id))
+                continue
+            # both present; remove from stores to avoid duplication
+            self.server_perf_metrics[ctx_server].pop(ctx_request_id, None)
+            self.server_perf_metrics[gen_server].pop(ctx_request_id, None)
+            return_metrics.append({
+                "ctx_server": ctx_server,
+                "gen_server": gen_server,
+                "ctx_perf_metrics": ctx_perf_metrics,
+                "gen_perf_metrics": gen_perf_metrics,
+            })
+        self.perf_metrics_keys = remain_keys
```
tests/integration/defs/disaggregated/test_disaggregated.py (1)
529-577: Complete prior fix: prefer json.load(resp) and verify gen metrics schema
You already fixed the urllib import and added a timeout. Finish by parsing JSON directly from the response. Also, confirm whether kv_cache_size resides at gen_perf_metrics top-level or under perf_metrics.
Apply this diff for JSON parsing:
```diff
-        with urllib.request.urlopen(f"{server_url}/perf_metrics",
-                                    timeout=10) as resp:
+        with urllib.request.urlopen(f"{server_url}/perf_metrics", timeout=10) as resp:
             assert resp.status == 200
-            perf_metrics = json.loads(resp.read())
+            perf_metrics = json.load(resp)
```
If kv_cache_size is a gen-level field (as described in the PR), adjust assertions accordingly:
- Use gen_item = item["gen_perf_metrics"]
- Keep gen_metrics = gen_item["perf_metrics"] for timing fields
- Assert "kv_cache_size" in gen_item and related ordering on gen_item fields.
To verify the schema location from the codebase, run:
```bash
#!/bin/bash
# Inspect where kv_cache_size is attached in server responses
rg -n -C3 --type=py -P '\bkv_cache_size\b'
```
This should show whether kv_cache_size is included inside "perf_metrics" or at the top level of the gen-server metrics object so the test can assert the right path.
🧹 Nitpick comments (3)
tensorrt_llm/serve/openai_disagg_server.py (2)
306-309: Avoid duplicate correlation keys
Repeated appends of the same (ctx_server, gen_server, ctx_request_id) can cause unnecessary work. Guard against duplicates.
Apply this diff:
```diff
-        if need_ctx:
-            async with self.perf_metrics_keys_lock:
-                self.perf_metrics_keys.append((ctx_server, gen_server, req.disaggregated_params.ctx_request_id))
+        if need_ctx:
+            async with self.perf_metrics_keys_lock:
+                key = (ctx_server, gen_server, req.disaggregated_params.ctx_request_id)
+                if key not in self.perf_metrics_keys:
+                    self.perf_metrics_keys.append(key)
```
1-1: Missing NVIDIA copyright header
Per repository guidelines, prepend the NVIDIA SPDX header to this source file.
Example:
```python
# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
```
tests/integration/defs/disaggregated/test_disaggregated.py (1)
39-45: Make YAML read robust and add return type annotation
Add explicit return type, set encoding, and guard against empty files.
Apply this diff:
```diff
-def get_disagg_server_url_from_cfg(config_file: str):
-    with open(config_file, 'r') as file:
-        config = yaml.safe_load(file)
+def get_disagg_server_url_from_cfg(config_file: str) -> str:
+    with open(config_file, 'r', encoding='utf-8') as file:
+        config = yaml.safe_load(file) or {}
     server_host = config.get('hostname', 'localhost')
     server_port = config.get('port', 8000)
     return f"http://{server_host}:{server_port}"
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
tensorrt_llm/serve/openai_disagg_server.py (4 hunks)
tests/integration/defs/disaggregated/test_disaggregated.py (6 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/serve/openai_disagg_server.py
tests/integration/defs/disaggregated/test_disaggregated.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/serve/openai_disagg_server.py
tests/integration/defs/disaggregated/test_disaggregated.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (5)
tensorrt_llm/serve/openai_disagg_server.py (1)
122-122: Route registration looks good
New /perf_metrics endpoint is correctly registered via FastAPI.
tests/integration/defs/disaggregated/test_disaggregated.py (4)
20-20: LGTM: precise typing for callback
Importing Callable and using a typed callback improves clarity.
69-69: Config mapping addition looks good
perf_metrics is correctly wired to the dedicated test config.
165-167: run_disaggregated_test extensibility is clean
Adding prompt_file and extra_endpoints_test parameters is a tidy way to inject endpoint checks.
246-249: Nice hook for endpoint validation
Deferred call to extra_endpoints_test using the same config aligns the test with the server under test.
PR_Github #15684 [ run ] completed with state
/bot run |
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tensorrt_llm/commands/serve.py (2)
354-357: Wrong exception type caught for Enum lookup; invalid server_role won’t be handled as intended.
Enum item access via ServerRole[name] raises KeyError, not ValueError. The except block won’t execute, and an unhandled KeyError will bubble.
Apply:
```diff
-    except ValueError:
+    except KeyError:
         raise ValueError(f"Invalid server role: {server_role}. " \
                          f"Must be one of: {', '.join([role.name for role in ServerRole])}")
```
321-341: Dropped kwargs from get_llm_args: moe_cluster_parallel_size (and any extras) are ignored.
serve() passes moe_cluster_parallel_size=cluster_size to get_llm_args, which lands in the function’s **llm_args_extra_dict, but that return value is discarded. As a result, cluster_size never reaches update_llm_args_with_extra_dict and is silently ignored.
Patch to preserve and merge the extras returned by get_llm_args with YAML overrides:
```diff
-    llm_args, _ = get_llm_args(
+    llm_args, llm_args_extra_dict = get_llm_args(
         model=model,
         tokenizer=tokenizer,
         backend=backend,
         max_beam_width=max_beam_width,
         max_batch_size=max_batch_size,
         max_num_tokens=max_num_tokens,
         max_seq_len=max_seq_len,
         tensor_parallel_size=tp_size,
         pipeline_parallel_size=pp_size,
         moe_expert_parallel_size=ep_size,
         moe_cluster_parallel_size=cluster_size,
         gpus_per_node=gpus_per_node,
         free_gpu_memory_fraction=kv_cache_free_gpu_memory_fraction,
         mamba_ssm_cache_dtype=mamba_ssm_cache_dtype,
         num_postprocess_workers=num_postprocess_workers,
         trust_remote_code=trust_remote_code,
         reasoning_parser=reasoning_parser,
         fail_fast_on_attention_window_too_large=
         fail_fast_on_attention_window_too_large)

-    llm_args_extra_dict = {}
     if extra_llm_api_options is not None:
         with open(extra_llm_api_options, 'r') as f:
-            llm_args_extra_dict = yaml.safe_load(f)
+            file_overrides = yaml.safe_load(f) or {}
+            if not isinstance(file_overrides, dict):
+                raise ValueError("extra_llm_api_options must parse to a mapping (YAML dict)")
+            llm_args_extra_dict.update(file_overrides)
     llm_args = update_llm_args_with_extra_dict(llm_args, llm_args_extra_dict)
```
This fixes cluster_size being dropped and makes future extras resilient.
Also applies to: 342-347
♻️ Duplicate comments (2)
tests/integration/defs/disaggregated/test_disaggregated.py (1)
529-577: Tighten JSON parsing and set a slightly more forgiving timeout
Use json.load on the response file-like object and keep the explicit urllib.request import. Your structure assertions are good.
Apply this diff:
```diff
-    import json
-    import urllib.request
+    import json
+    import urllib.request
@@
-    with urllib.request.urlopen(f"{server_url}/perf_metrics",
-                                timeout=10) as resp:
+    with urllib.request.urlopen(f"{server_url}/perf_metrics",
+                                timeout=30) as resp:
         assert resp.status == 200
-        perf_metrics = json.loads(resp.read())
+        perf_metrics = json.load(resp)
```
tensorrt_llm/serve/openai_disagg_server.py (1)
11-11: Use typing generics for Python 3.8 compatibility
The codebase targets Python 3.8+, where PEP 585 built-in generics (list[str], dict[str, …], tuple[...]) are not supported. Replace them with typing.List/Dict/Tuple and import the needed symbols.
Apply this diff to the imports and changed annotations in this file:
```diff
-from typing import Optional, Type, Union
+from typing import Dict, List, Optional, Tuple, Type, Union
```
And update usages within the shown ranges:
```diff
-        self.server_perf_metrics: dict[str, dict[int, dict]] = {}
+        self.server_perf_metrics: Dict[str, Dict[int, dict]] = {}
```
For other occurrences in this file (e.g., method annotations using list[str]), similarly convert to List[str]/Tuple[...].
🧹 Nitpick comments (9)
tensorrt_llm/serve/openai_server.py (4)
88-100: Bounded queue for metrics is good; guard against None config
Deque with maxlen addresses unbounded growth nicely. To avoid surprises if the config remains Optional in some environments, coerce None to 0 before comparison.
Apply this diff:
```diff
-        max_perf_metrics = self.llm.args.perf_metrics_max_requests
+        max_perf_metrics = int(self.llm.args.perf_metrics_max_requests or 0)
```
169-169: Endpoint naming is fine; consider consistency with config knob
Route name /perf_metrics matches functionality. If you want stronger alignment with the controlling flag (return_perf_metrics), leaving as-is is ok. A rename to /request_perf_metrics could be considered later, but not required now.
266-309: Safe swap-and-drain pattern; minor structural suggestion
Swapping out the deque under lock and then formatting outside the lock is correct. A minor ergonomics improvement: instead of mutating elements in-place (metrics_dict["perf_metrics"] = metrics_json), build the final list to avoid mixing raw and formatted entries if future code reads the deque during formatting.
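A minimal standalone sketch of that shape (the function name, signature, and the format_one callable are illustrative stand-ins, not the server's actual API):

```python
import asyncio
from typing import Any, Callable, Deque, Dict, List


async def drain_and_format(buffer: Deque[Dict[str, Any]],
                           lock: asyncio.Lock,
                           format_one: Callable[[Any], Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Swap the raw entries out under the lock, then build a brand-new formatted list
    instead of rewriting entries in place, so raw and formatted items never mix."""
    async with lock:
        raw_entries = list(buffer)
        buffer.clear()

    return [{
        "request_id": entry["request_id"],
        "perf_metrics": format_one(entry["perf_metrics"]),
    } for entry in raw_entries]
```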
320-336: Ensure request_perf_metrics is present before accessing
Accessing res.outputs[0].request_perf_metrics assumes metrics were collected. If return_perf_metrics is true but upstream didn’t populate request_perf_metrics, this will raise. Add a guard to skip appending when it’s None.
Apply this diff:
```diff
-        item = {
-            "request_id": res.request_id,
-            "perf_metrics": res.outputs[0].request_perf_metrics
-        }
+        req_perf = res.outputs[0].request_perf_metrics
+        if req_perf is None:
+            return
+        item = {
+            "request_id": res.request_id,
+            "perf_metrics": req_perf,
+        }
```
tensorrt_llm/serve/openai_disagg_server.py (3)
146-157: Harden remote fetches: check HTTP status and skip failures
A single failing upstream /perf_metrics currently bubbles up as an exception after buffering. Consider skipping servers that return non-200 or raise client errors, logging a warning, so the endpoint remains robust.
Apply this diff:
- try: - for server in self.ctx_servers + self.gen_servers: - async with self.session.get(f"{server}/perf_metrics") as response: - server_perf_metrics = await response.json() - perf_metrics[server] = server_perf_metrics - except Exception as e: - # Keep the exception to raise it after saving perf metrics - exc = e + try: + for server in self.ctx_servers + self.gen_servers: + try: + async with self.session.get(f"{server}/perf_metrics") as response: + if response.status != 200: + logger.warning(f"Skipping {server}/perf_metrics: HTTP {response.status}") + continue + server_perf_metrics = await response.json() + perf_metrics[server] = server_perf_metrics + except aiohttp.ClientError as e: + logger.error(f"Failed to fetch {server}/perf_metrics: {e}") + continue + except Exception as e: + # Keep the exception to raise it after saving perf metrics + exc = e
177-191: Avoid data loss when correlating: pop only after both sides present
You already pop the gen side only after checking its presence; consider mirroring the get-then-pop pattern for both sides to avoid races. This is fine currently but can be made more explicit and symmetric.
Apply this diff:
```diff
-            gen_perf_metrics = self.server_perf_metrics[gen_server].pop(ctx_request_id, None)
-            if gen_perf_metrics is None:
+            gen_perf_metrics = self.server_perf_metrics[gen_server].get(ctx_request_id)
+            if gen_perf_metrics is None:
                 # generation not finished
                 remain_keys.append((ctx_server, gen_server, ctx_request_id))
                 continue
-            ctx_perf_metrics = self.server_perf_metrics[ctx_server].pop(ctx_request_id, None)
+            ctx_perf_metrics = self.server_perf_metrics[ctx_server].get(ctx_request_id)
+            # both present; remove to avoid duplication next time
+            self.server_perf_metrics[gen_server].pop(ctx_request_id, None)
+            self.server_perf_metrics[ctx_server].pop(ctx_request_id, None)
```
118-123: create_error_response type is misleading and always raises
Signature says -> ErrorResponse, but it unconditionally raises HTTPException. Either return a JSONResponse with an ErrorResponse body or change the signature/name to reflect raising.
Apply this diff to return a proper response:
```diff
-    def create_error_response(
-            message: str,
-            status_code: HTTPStatus = HTTPStatus.BAD_REQUEST) -> ErrorResponse:
-        raise HTTPException(status_code=500, detail=f"Internal server error {message}")
+    def create_error_response(
+            message: str,
+            status_code: HTTPStatus = HTTPStatus.BAD_REQUEST) -> JSONResponse:
+        payload = {"error": message, "code": status_code.value}
+        return JSONResponse(content=payload, status_code=status_code.value)
```
And ensure call sites use this response (or keep raising where appropriate).
tensorrt_llm/commands/serve.py (2)
1-1: Missing NVIDIA copyright header.
Per repository guidelines, prepend the current year NVIDIA header to all source files.
Add at the very top:
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
206-210: Help text contradicts the default backend.
Help says “Default is cpp path” while default is "pytorch".
Consider:
```diff
-    help="Set to 'pytorch' for pytorch path. Default is cpp path.")
+    help="Set to 'pytorch' for PyTorch path. Default is 'pytorch'.")
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (9)
tensorrt_llm/commands/serve.py (3 hunks)
tensorrt_llm/llmapi/disagg_utils.py (4 hunks)
tensorrt_llm/llmapi/llm_args.py (1 hunks)
tensorrt_llm/serve/openai_disagg_server.py (6 hunks)
tensorrt_llm/serve/openai_server.py (9 hunks)
tests/integration/defs/disaggregated/test_configs/disagg_config_metrics.yaml (1 hunks)
tests/integration/defs/disaggregated/test_disaggregated.py (6 hunks)
tests/integration/defs/test_e2e.py (1 hunks)
tests/unittest/llmapi/apps/_test_openai_perf_metrics.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- tests/integration/defs/disaggregated/test_configs/disagg_config_metrics.yaml
- tests/integration/defs/test_e2e.py
- tests/unittest/llmapi/apps/_test_openai_perf_metrics.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/llmapi/llm_args.py
tests/integration/defs/disaggregated/test_disaggregated.py
tensorrt_llm/llmapi/disagg_utils.py
tensorrt_llm/serve/openai_server.py
tensorrt_llm/commands/serve.py
tensorrt_llm/serve/openai_disagg_server.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/llmapi/llm_args.py
tests/integration/defs/disaggregated/test_disaggregated.py
tensorrt_llm/llmapi/disagg_utils.py
tensorrt_llm/serve/openai_server.py
tensorrt_llm/commands/serve.py
tensorrt_llm/serve/openai_disagg_server.py
🧬 Code Graph Analysis (5)
tensorrt_llm/llmapi/llm_args.py (1)
tensorrt_llm/builder.py (1)
default(50-58)
tests/integration/defs/disaggregated/test_disaggregated.py (2)
tests/integration/defs/conftest.py (4)
disaggregated_test_root (2339-2344), llm_venv (707-723), disaggregated_example_root (270-275), llama_model_root (964-1039)
tensorrt_llm/serve/openai_disagg_server.py (1)
perf_metrics(146-192)
tensorrt_llm/serve/openai_server.py (5)
tensorrt_llm/serve/openai_disagg_server.py (1)
perf_metrics (146-192)
tensorrt_llm/_utils.py (1)
set_prometheus_multiproc_dir (1118-1129)
tensorrt_llm/metrics/collector.py (2)
MetricsCollector (10-105), log_metrics_dict (100-105)
tensorrt_llm/llmapi/llm.py (1)
RequestOutput (47-87)
tensorrt_llm/executor/result.py (3)
finished (544-545), outputs (197-212), request_id (521-522)
tensorrt_llm/commands/serve.py (1)
tensorrt_llm/llmapi/disagg_utils.py (2)
MetadataServerConfig (59-64), ServerRole (19-22)
tensorrt_llm/serve/openai_disagg_server.py (3)
tensorrt_llm/llmapi/disagg_utils.py (3)
DisaggServerConfig (47-55), MetadataServerConfig (59-64), get_ctx_gen_server_urls (67-77)
tensorrt_llm/serve/metadata_server.py (7)
create_metadata_server (80-95), get (17-18), get (38-39), get (66-68), keys (29-30), keys (47-58), keys (76-77)
tensorrt_llm/serve/router.py (2)
create_router (615-653), KvCacheAwareRouter (512-612)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (10)
tensorrt_llm/llmapi/disagg_utils.py (2)
55-56: Config knob addition looks good
DisaggServerConfig now carries perf_metrics_max_requests with a sane default. This aligns with the OpenAIDisaggServer usage.
131-135: Constructor wiring is consistent
Passing perf_metrics_max_requests into DisaggServerConfig is correct and consistent with the new config flow.
tests/integration/defs/disaggregated/test_disaggregated.py (4)
20-20: Import Callable is appropriate
Using Callable for the new hook type improves clarity and type hints for editors.
39-45: Helper to derive server URL from YAML config looks good
Simple and robust defaults when hostname/port are not present. No issues.
165-166: Extending test harness with extra_endpoints_test is a nice pattern
The injection point cleanly decouples endpoint assertions from the core E2E loop.
246-249: Hook usage is correct
Calling the endpoint verifier after each iteration ensures the metrics endpoint gets exercised. Looks good.
tensorrt_llm/serve/openai_disagg_server.py (1)
55-67: Good: opt-in correlation buffers and locks are only instantiated when enabled
The gating on perf_metrics_max_requests > 0 avoids unnecessary overhead. No issues.
tensorrt_llm/commands/serve.py (3)
6-6: Typing import cleanup is correct.
Removing unused typing imports and keeping only Any, Optional aligns with the current usage.
22-24: serve.py import cleanup verified
No lingering references to CtxGenServerConfig in tensorrt_llm/commands/serve.py — the class remains in disagg_utils (and its tests) by design. Dropping the import here is correct.
473-476: Perf metrics retention is properly bounded
The DisaggServerConfig (llmapi/disagg_utils.py) wires perf_metrics_max_requests (default 0), and both servers enforce limits:
- In openai_disagg_server.py:
  - self.perf_metrics_keys = deque(maxlen=self.perf_metrics_max_requests)
  - self.server_perf_metrics prunes oldest entries when its length exceeds perf_metrics_max_requests.
  - The /perf_metrics endpoint is registered in register_routes.
- In openai_server.py:
  - self.perf_metrics = deque(maxlen=self.llm.args.perf_metrics_max_requests)
  - Reinitialized on each fetch to keep only the last N entries.
  - The /perf_metrics endpoint is registered.
Defaults, wiring, and bounded retention are all correctly implemented; no changes needed.
PR_Github #15890 [ run ] triggered by Bot
/bot run
PR_Github #15893 [ run ] triggered by Bot
PR_Github #15890 [ run ] completed with state
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tensorrt_llm/llmapi/disagg_utils.py (1)
1-6: Add NVIDIA copyright header
This source file is missing the required NVIDIA copyright/SPDX header.
Apply at the top of the file:
```diff
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
```
tensorrt_llm/serve/openai_disagg_server.py (1)
445-451: Fix Python 3.8 typing in method annotations
Replace list[str] with List[str] and import List from typing (already added above).
Apply this diff:
```diff
-    async def wait_for_all_servers_ready(cls, session: aiohttp.ClientSession,
-                                         ctx_servers: list[str],
-                                         gen_servers: list[str],
+    async def wait_for_all_servers_ready(cls, session: aiohttp.ClientSession,
+                                         ctx_servers: List[str],
+                                         gen_servers: List[str],
                                          server_start_timeout_secs: int = 180):
-        async def get_unready_servers(servers: list[str]) -> list[str]:
+        async def get_unready_servers(servers: List[str]) -> List[str]:
```
♻️ Duplicate comments (3)
tensorrt_llm/serve/openai_disagg_server.py (3)
150-160: Harden cross-server fetch: check HTTP status and skip failures (don’t raise for a single bad server)
A failing server currently bubbles an exception after caching metrics, breaking the whole endpoint. Skip failed/non-200 servers and continue aggregating.
Apply this diff:
```diff
-        perf_metrics = {}
-        exc = None
-        try:
-            for server in self.ctx_servers + self.gen_servers:
-                async with self.session.get(f"{server}/perf_metrics") as response:
-                    server_perf_metrics = await response.json()
-                    perf_metrics[server] = server_perf_metrics
-        except Exception as e:
-            # Keep the exception to raise it after saving perf metrics
-            exc = e
+        perf_metrics = {}
+        for server in self.ctx_servers + self.gen_servers:
+            try:
+                async with self.session.get(f"{server}/perf_metrics") as response:
+                    if response.status != 200:
+                        logger.warning(f"Skipping {server}/perf_metrics: HTTP {response.status}")
+                        continue
+                    server_perf_metrics = await response.json()
+                    perf_metrics[server] = server_perf_metrics
+            except aiohttp.ClientError as e:
+                logger.error(f"Failed to fetch {server}/perf_metrics: {e}")
+                continue
```
And remove the exc handling below since we no longer re-raise.
11-12: Use typing.Dict/List/Tuple for Python 3.8 compatibility
PEP 585 built-in generics in type hints aren’t supported on 3.8. Import and use typing equivalents.
Apply this diff:
```diff
-from typing import Optional, Type, Union
+from typing import Dict, List, Optional, Tuple, Type, Union
```
Additionally, update annotations below as suggested in related diffs.
178-191: Avoid data loss when correlating ctx/gen metrics (don’t pop until both present)
If gen is ready but ctx isn’t, current code pops gen and drops the pair. Peek for both; only pop when both are present.
Apply this diff:
```diff
-        remain_keys = []
-        for ctx_server, gen_server, ctx_request_id in self.perf_metrics_keys:
-            gen_perf_metrics = self.server_perf_metrics[gen_server].pop(ctx_request_id, None)
-            if gen_perf_metrics is None:
-                # generation not finished
-                remain_keys.append((ctx_server, gen_server, ctx_request_id))
-                continue
-            ctx_perf_metrics = self.server_perf_metrics[ctx_server].pop(ctx_request_id, None)
-            return_metrics.append({
-                "ctx_server": ctx_server,
-                "gen_server": gen_server,
-                "ctx_perf_metrics": ctx_perf_metrics,
-                "gen_perf_metrics": gen_perf_metrics})
+        remain_keys = []
+        for ctx_server, gen_server, ctx_request_id in self.perf_metrics_keys:
+            ctx_perf_metrics = self.server_perf_metrics.get(ctx_server, {}).get(ctx_request_id)
+            gen_perf_metrics = self.server_perf_metrics.get(gen_server, {}).get(ctx_request_id)
+            if not (ctx_perf_metrics and gen_perf_metrics):
+                # one side not finished yet
+                remain_keys.append((ctx_server, gen_server, ctx_request_id))
+                continue
+            # both present; remove and return
+            self.server_perf_metrics[ctx_server].pop(ctx_request_id, None)
+            self.server_perf_metrics[gen_server].pop(ctx_request_id, None)
+            return_metrics.append({
+                "ctx_server": ctx_server,
+                "gen_server": gen_server,
+                "ctx_perf_metrics": ctx_perf_metrics,
+                "gen_perf_metrics": gen_perf_metrics})
```
🧹 Nitpick comments (3)
tensorrt_llm/llmapi/disagg_utils.py (1)
71-76: Validate server configs to avoid invalid URLs (http://None:None)
If hostname/port are unset, this builds invalid URLs. Fail fast with a clear error.
Apply this diff:
```diff
     ctx_server_urls = []
     gen_server_urls = []
     for cfg in server_configs:
-        if cfg.type == "ctx":
-            ctx_server_urls.append(f"http://{cfg.hostname}:{cfg.port}")
-        else:
-            gen_server_urls.append(f"http://{cfg.hostname}:{cfg.port}")
+        if cfg.hostname is None or cfg.port is None:
+            raise ValueError(f"Missing hostname/port for server config: {cfg}")
+        if cfg.type == "ctx":
+            ctx_server_urls.append(f"http://{cfg.hostname}:{cfg.port}")
+        elif cfg.type == "gen":
+            gen_server_urls.append(f"http://{cfg.hostname}:{cfg.port}")
+        else:
+            raise ValueError(f"Unknown server type: {cfg.type}")
```
tensorrt_llm/serve/openai_disagg_server.py (2)
36-36: Fix comment typo
Nit: “enale” -> “enable”.
Apply this diff:
```diff
-# yapf: enale
+# yapf: enable
```
401-401: Use typing.Callable for create_generator type hint
Minor typing cleanup; also import Callable from typing.
Apply this diff:
```diff
-            response_type: Type[Union[CompletionResponse, ChatCompletionResponse]],
-            create_generator: callable) -> Union[CompletionResponse, ChatCompletionResponse, StreamingResponse]:
+            response_type: Type[Union[CompletionResponse, ChatCompletionResponse]],
+            create_generator: Callable[[str, Union[CompletionRequest, ChatCompletionRequest]], AsyncIterator[bytes]]
+            ) -> Union[CompletionResponse, ChatCompletionResponse, StreamingResponse]:
```
Outside this range (imports):
from typing import Callable
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
tensorrt_llm/llmapi/disagg_utils.py (4 hunks)
tensorrt_llm/llmapi/llm_args.py (1 hunks)
tensorrt_llm/serve/openai_disagg_server.py (6 hunks)
tests/integration/defs/disaggregated/test_disaggregated.py (6 hunks)
tests/integration/test_lists/test-db/l0_a10.yml (1 hunks)
tests/integration/test_lists/test-db/l0_h100.yml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tensorrt_llm/llmapi/llm_args.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/llmapi/disagg_utils.py
tests/integration/defs/disaggregated/test_disaggregated.py
tensorrt_llm/serve/openai_disagg_server.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/llmapi/disagg_utils.py
tests/integration/defs/disaggregated/test_disaggregated.py
tensorrt_llm/serve/openai_disagg_server.py
🧬 Code Graph Analysis (2)
tests/integration/defs/disaggregated/test_disaggregated.py (2)
tests/integration/defs/conftest.py (4)
disaggregated_test_root (2339-2344), llm_venv (707-723), disaggregated_example_root (270-275), llama_model_root (964-1039)
tensorrt_llm/serve/openai_disagg_server.py (1)
perf_metrics(146-193)
tensorrt_llm/serve/openai_disagg_server.py (3)
tensorrt_llm/llmapi/disagg_utils.py (3)
DisaggServerConfig (47-55), MetadataServerConfig (59-64), get_ctx_gen_server_urls (67-78)
tensorrt_llm/serve/metadata_server.py (7)
create_metadata_server (80-95), get (17-18), get (38-39), get (66-68), keys (29-30), keys (47-58), keys (76-77)
tensorrt_llm/serve/router.py (2)
create_router (615-653), KvCacheAwareRouter (512-612)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (8)
tests/integration/test_lists/test-db/l0_a10.yml (1)
28-28: Add perf-metrics E2E to A10 pre-merge: LGTM
The new entry properly wires the OpenAI perf-metrics E2E test into the A10 list.
tensorrt_llm/llmapi/disagg_utils.py (1)
55-55: New config field perf_metrics_max_requests: LGTM
The addition is consistent with the server’s retention policy and defaulting to 0 disables buffering.
tests/integration/defs/disaggregated/test_disaggregated.py (4)
39-45: Helper to resolve disagg server URL from YAML: LGTM
Simple and correct; aligns with the test configs’ hostname/port keys.
165-166: Extend runner with prompt_file and extra_endpoints_test: LGTM
The new parameters integrate cleanly with existing client invocations and post-iteration hooks.
246-249: Post-run endpoint checks hook: LGTM
Calling the extra test after each iteration is useful for validating /perf_metrics without interfering with output verification.
529-576: Disaggregated /perf_metrics test: LGTM
Good use of urlopen with timeout and json.load; field checks and temporal ordering assertions look solid.
tests/integration/test_lists/test-db/l0_h100.yml (1)
67-67: Add perf-metrics disagg test to H100 pre-merge: LGTM
Appropriately placed alongside other disaggregated tests.
tensorrt_llm/serve/openai_disagg_server.py (1)
127-128: Expose /perf_metrics route: LGTM
Public endpoint registration is correct and matches the tests’ expectations.
PR_Github #15893 [ run ] completed with state
/bot run
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tensorrt_llm/serve/openai_disagg_server.py (2)
415-419: Bug: aiohttp ClientResponse has no .ok; use status/raise_for_status instead
response.ok is a requests/HTTPX-ism; in aiohttp this will raise AttributeError at runtime.
```diff
-        if not response.ok:
-            logger.error(f"Received failed response {response_dict}")
-            response.raise_for_status()
+        if response.status >= 400:
+            logger.error(f"Received failed response {response.status}: {response_dict}")
+            response.raise_for_status()
```
466-469: Timeout handling catches wrong exception type
asyncio.wait_for raises asyncio.TimeoutError, not CancelledError. As-is, timeout won’t be turned into the intended TimeoutError.
```diff
-        try:
-            await asyncio.wait_for(check_all_servers_ready(), timeout=server_start_timeout_secs)
-        except asyncio.CancelledError:
-            raise TimeoutError("Timeout waiting for context and generation servers to be ready")
+        try:
+            await asyncio.wait_for(check_all_servers_ready(), timeout=server_start_timeout_secs)
+        except asyncio.TimeoutError:
+            raise TimeoutError("Timeout waiting for context and generation servers to be ready")
```
♻️ Duplicate comments (7)
tensorrt_llm/serve/openai_disagg_server.py (7)
446-453: Python 3.8 typing: replace list[str] with List[str]
Adhere to the 3.8+ target and repo policy.
```diff
-    async def wait_for_all_servers_ready(cls, session: aiohttp.ClientSession,
-                                         ctx_servers: list[str],
-                                         gen_servers: list[str],
+    async def wait_for_all_servers_ready(cls, session: aiohttp.ClientSession,
+                                         ctx_servers: List[str],
+                                         gen_servers: List[str],
                                          server_start_timeout_secs: int = 180):
-        async def get_unready_servers(servers: list[str]) -> list[str]:
+        async def get_unready_servers(servers: List[str]) -> List[str]:
```
11-11: Typing imports incomplete; add typing generics for 3.8 compatibility
You use Dict/List/Tuple/Deque/Callable/AsyncIterator in suggested fixes and to avoid PEP 585 built-ins. Extend imports.
```diff
-from typing import Optional, Type, Union
+from typing import AsyncIterator, Callable, Deque, Dict, List, Optional, Tuple, Type, Union
```
59-63: Python 3.8 compatibility: avoid PEP 585 built-in generics and add precise annotations
Replace built-in generics and annotate the deque precisely. Also keep server_perf_metrics’ value type stable.
```diff
-        # (ctx_server, gen_server, ctx_request_id)
-        self.perf_metrics_keys = deque(maxlen=self.perf_metrics_max_requests)
+        # (ctx_server, gen_server, ctx_request_id)
+        self.perf_metrics_keys: Deque[Tuple[str, str, int]]
+        self.perf_metrics_keys = deque(maxlen=self.perf_metrics_max_requests)
         self.perf_metrics_keys_lock = asyncio.Lock()
         # server_key -> {ctx_request_id: perf_metrics}
-        self.server_perf_metrics: dict[str, dict[int, dict]] = {}
+        self.server_perf_metrics: Dict[str, Dict[int, dict]] = {}
```
180-193: Fix data loss when correlating ctx/gen metrics: only pop after both sides are present
Current flow pops gen first; if ctx isn’t ready yet, you emit a record with ctx=None and drop gen, losing correlation forever. Peek with get(), only pop when both exist, and retain the key otherwise.
```diff
-        remain_keys = []
-        for ctx_server, gen_server, ctx_request_id in self.perf_metrics_keys:
-            gen_perf_metrics = self.server_perf_metrics[gen_server].pop(ctx_request_id, None)
-            if gen_perf_metrics is None:
-                # generation not finished
-                remain_keys.append((ctx_server, gen_server, ctx_request_id))
-                continue
-            ctx_perf_metrics = self.server_perf_metrics[ctx_server].pop(ctx_request_id, None)
-            return_metrics.append({
-                "ctx_server": ctx_server,
-                "gen_server": gen_server,
-                "ctx_perf_metrics": ctx_perf_metrics,
-                "gen_perf_metrics": gen_perf_metrics})
+        remain_keys = []
+        for ctx_server, gen_server, ctx_request_id in self.perf_metrics_keys:
+            ctx_store = self.server_perf_metrics.get(ctx_server, {})
+            gen_store = self.server_perf_metrics.get(gen_server, {})
+            ctx_perf_metrics = ctx_store.get(ctx_request_id)
+            gen_perf_metrics = gen_store.get(ctx_request_id)
+            if not (ctx_perf_metrics and gen_perf_metrics):
+                # one side not finished yet; keep the key
+                remain_keys.append((ctx_server, gen_server, ctx_request_id))
+                continue
+            # both present; now remove from stores to avoid duplication
+            ctx_store.pop(ctx_request_id, None)
+            gen_store.pop(ctx_request_id, None)
+            return_metrics.append({
+                "ctx_server": ctx_server,
+                "gen_server": gen_server,
+                "ctx_perf_metrics": ctx_perf_metrics,
+                "gen_perf_metrics": gen_perf_metrics,
+            })
```
1-1: Add NVIDIA SPDX header per repo guidelines

Source files must include the NVIDIA copyright/SPDX header.

```diff
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 #!/usr/bin/env python
```
59-63: Audit and update PEP 585 built-in generics for Python 3.8 compatibility

The repository currently uses built-in generics (e.g. `list[str]`, `dict[str, …]`, `tuple[int, …]`, `deque[int]`) which are only supported on Python 3.9+. Since our target is Python 3.8, these must be replaced with the corresponding types from the `typing` module.

• In `tensorrt_llm/serve/openai_disagg_server.py`
  – Lines 59–63:

```diff
-        self.server_perf_metrics: dict[str, dict[int, dict]] = {}
+        from typing import Dict
+        self.server_perf_metrics: Dict[str, Dict[int, Dict]] = {}
```

  – Lines 446–453 (method signature):

```diff
-        async def get_unready_servers(servers: list[str]) -> list[str]:
+        from typing import List
+        async def get_unready_servers(servers: List[str]) -> List[str]:
```
• Apply the same conversion across the codebase wherever you see:
  - `list[...]` → `typing.List[...]`
  - `dict[...]` → `typing.Dict[...]`
  - `set[...]` → `typing.Set[...]`
  - `tuple[...]` → `typing.Tuple[...]`
  - `deque[...]` → `typing.Deque[...]`

• Ensure you add the necessary imports, e.g.

```python
from typing import List, Dict, Tuple, Set, Deque
from collections import deque  # if using deque instances
```

This refactor is mandatory to maintain compatibility with Python 3.8.
146-160: Harden /perf_metrics aggregation: check HTTP status, handle errors per-server, don't raise global exception

As-is, any exception in the loop defers raising and then 500s the endpoint after partially updating caches. Prefer skipping failed servers and proceeding. Also check `response.status` before parsing JSON.

```diff
-        perf_metrics = {}
-        exc = None
-        try:
-            for server in self.ctx_servers + self.gen_servers:
-                async with self.session.get(f"{server}/perf_metrics") as response:
-                    server_perf_metrics = await response.json()
-                    perf_metrics[server] = server_perf_metrics
-        except Exception as e:
-            # Keep the exception to raise it after saving perf metrics
-            exc = e
+        perf_metrics: Dict[str, list] = {}
+        for server in self.ctx_servers + self.gen_servers:
+            try:
+                async with self.session.get(f"{server}/perf_metrics") as response:
+                    if response.status != 200:
+                        logger.warning(f"Skipping {server}/perf_metrics: HTTP {response.status}")
+                        continue
+                    try:
+                        server_perf_metrics = await response.json()
+                    except Exception as je:
+                        logger.warning(f"Invalid JSON from {server}/perf_metrics: {je}")
+                        continue
+                    perf_metrics[server] = server_perf_metrics
+            except aiohttp.ClientError as ce:
+                logger.error(f"Failed to fetch {server}/perf_metrics: {ce}")
+                continue
```

Optional (latency): fetch all servers concurrently with `asyncio.gather` and per-task error handling. Want a suggested patch?
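A rough sketch of that concurrent variant, for reference only. It reuses the names from the suggestion above; the session and server lists are passed in explicitly here rather than read from `self`:

```python
import asyncio
from typing import Dict, List, Optional

import aiohttp


async def _fetch_one(session: aiohttp.ClientSession, server: str) -> Optional[list]:
    """Fetch /perf_metrics from a single server; return None on any failure."""
    try:
        async with session.get(f"{server}/perf_metrics") as response:
            if response.status != 200:
                return None
            return await response.json()
    except aiohttp.ClientError:
        return None


async def fetch_all(session: aiohttp.ClientSession, servers: List[str]) -> Dict[str, list]:
    """Poll every server concurrently; keep only the servers that answered successfully."""
    results = await asyncio.gather(*(_fetch_one(session, s) for s in servers))
    return {server: metrics
            for server, metrics in zip(servers, results)
            if metrics is not None}
```

Because each per-server failure is swallowed inside `_fetch_one`, one unreachable worker cannot fail the whole aggregation.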
🧹 Nitpick comments (5)
tensorrt_llm/serve/openai_disagg_server.py (5)
399-405: Tighten type of `create_generator` parameter

Use `typing.Callable` with an explicit signature that matches your generator helpers.

```diff
-    async def send_request(self, url: str,
+    async def send_request(self, url: str,
                            request: Union[CompletionRequest, ChatCompletionRequest],
                            endpoint: str,
                            response_type: Type[Union[CompletionResponse, ChatCompletionResponse]],
-                           create_generator: callable) -> Union[CompletionResponse, ChatCompletionResponse, StreamingResponse]:
+                           create_generator: Callable[[str, Union[CompletionRequest, ChatCompletionRequest]], AsyncIterator[bytes]]
+                           ) -> Union[CompletionResponse, ChatCompletionResponse, StreamingResponse]:
```
334-337: Consider awaiting key registration to reduce race window

Using `asyncio.create_task` makes correlation key insertion eventual. If `/perf_metrics` is called immediately, that request may be missed until the next poll. If acceptable, ignore; otherwise `await self._add_perf_metrics_keys(...)` here. Minor.

If you want to quantify any impact, I can script a stress test that triggers `/perf_metrics` concurrently with request dispatch to measure missed correlations.
36-36: Nit: fix typo in yapf directive

Spelling: "enable".

```diff
-# yapf: enale
+# yapf: enable
```
118-123: `create_error_response` ignores provided status and always raises 500

Either return a proper `JSONResponse` with the given status or raise `HTTPException(status_code=status_code, ...)`. Otherwise callers can't control status codes.

```diff
-    def create_error_response(
-            message: str,
-            status_code: HTTPStatus = HTTPStatus.BAD_REQUEST) -> ErrorResponse:
-        raise HTTPException(status_code=500, detail=f"Internal server error {message}")
+    def create_error_response(
+            message: str,
+            status_code: HTTPStatus = HTTPStatus.BAD_REQUEST) -> ErrorResponse:
+        raise HTTPException(status_code=int(status_code), detail=str(message))
```
146-196: Add tests for `/perf_metrics` error-paths and pruning
- One server returns non-200; verify endpoint still returns aggregated results.
- JSON parse error from a server; verify skip.
- Correlation path when ctx arrives before gen and vice versa; verify no data loss.
- Pruning: exceed `perf_metrics_max_requests`; verify oldest entries are dropped deterministically.

I can draft pytest cases mirroring `tests/unittest/llmapi/apps/_test_openai_perf_metrics.py` style.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- tensorrt_llm/serve/openai_disagg_server.py (6 hunks)
- tests/integration/test_lists/test-db/l0_h100.yml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/integration/test_lists/test-db/l0_h100.yml
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/serve/openai_disagg_server.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/serve/openai_disagg_server.py
🧠 Learnings (1)
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tensorrt_llm/serve/openai_disagg_server.py
🧬 Code Graph Analysis (1)
tensorrt_llm/serve/openai_disagg_server.py (3)
- tensorrt_llm/llmapi/disagg_utils.py (3): `DisaggServerConfig` (47-55), `MetadataServerConfig` (59-64), `get_ctx_gen_server_urls` (67-78)
- tensorrt_llm/serve/metadata_server.py (7): `create_metadata_server` (80-95), `get` (17-18), `get` (38-39), `get` (66-68), `keys` (29-30), `keys` (47-58), `keys` (76-77)
- tensorrt_llm/serve/router.py (2): `create_router` (615-653), `KvCacheAwareRouter` (512-612)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
PR_Github #15995 [ run ] triggered by Bot
PR_Github #15995 [ run ] completed with state
/bot run
PR_Github #16100 [ run ] triggered by Bot
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/unittest/api_stability/references/llm.yaml (1)
134-137: Consider grouping with the Stats section for readability

`perf_metrics_max_requests` is conceptually tied to `return_perf_metrics` and other stats controls. For maintainers scanning this reference, placing it under the "# Stats" block (near `return_perf_metrics`) would improve discoverability. No functional change; purely organizational.

Proposed edit (move block near the Stats entries):

```diff
@@
-  print_iter_log:
-    annotation: bool
-    default: False
-    status: beta
-  perf_metrics_max_requests:
-    annotation: int
-    default: 0
-    status: prototype
+  print_iter_log:
+    annotation: bool
+    default: False
+    status: beta
@@
   return_perf_metrics:
     annotation: bool
     default: False
     status: prototype
+  perf_metrics_max_requests:
+    annotation: int
+    default: 0
+    status: prototype
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
- tests/unittest/api_stability/references/llm.yaml (1 hunks)
PR_Github #16100 [ run ] completed with state
/bot run
PR_Github #16143 [ run ] triggered by Bot
Actionable comments posted: 1
♻️ Duplicate comments (1)
tests/integration/defs/disaggregated/test_disaggregated.py (1)
543-551: Past feedback addressed: correct urllib import, timeout, and JSON parsing

Thanks for switching to urllib.request, adding a timeout, and using json.load on the response object. This resolves the previously raised runtime and robustness issues.
🧹 Nitpick comments (3)
tests/integration/defs/disaggregated/test_disaggregated.py (3)
39-45: Add return type annotation and a brief docstring; open YAML with explicit encoding

Minor polish to improve readability and consistency with the rest of the file where files are opened with encoding.

```diff
-def get_disagg_server_url_from_cfg(config_file: str):
-    with open(config_file, 'r') as file:
+def get_disagg_server_url_from_cfg(config_file: str) -> str:
+    """Return the disaggregated server base URL (http://host:port) from a YAML config."""
+    with open(config_file, 'r', encoding='utf-8') as file:
         config = yaml.safe_load(file)
         server_host = config.get('hostname', 'localhost')
         server_port = config.get('port', 8000)
         return f"http://{server_host}:{server_port}"
```
246-249: Avoid re-parsing YAML each loop iteration; compute server_url once

We open and parse the config file on every iteration just to derive the URL. Cache it once for slight speedup and less IO.

```diff
-            if extra_endpoints_test is not None:
-                extra_endpoints_test(
-                    get_disagg_server_url_from_cfg(config_file))
+            if extra_endpoints_test is not None:
+                extra_endpoints_test(server_url)
```

Add this initialization near where config_file is determined (outside the loop):

```python
# After obtaining config_file and before entering the for-loop
server_url = get_disagg_server_url_from_cfg(config_file)
```
547-552: Add a retry loop when polling `/perf_metrics` to reduce flakes

I've verified that all metric producers emit both `timing_metrics` and `kv_cache_metrics` (see `tensorrt_llm/serve/openai_server.py` around lines 273–298), so it's safe to wait briefly for the aggregator to combine them.

Please consider updating the test at `tests/integration/defs/disaggregated/test_disaggregated.py` (around lines 547–552) as follows:

```diff
-        with urllib.request.urlopen(f"{server_url}/perf_metrics",
-                                    timeout=10) as resp:
-            assert resp.status == 200
-            perf_metrics = json.load(resp)
-        assert len(perf_metrics) > 0
+        import time
+        perf_metrics = []
+        for _ in range(10):  # up to ~10s total
+            with urllib.request.urlopen(f"{server_url}/perf_metrics", timeout=10) as resp:
+                assert resp.status == 200
+                perf_metrics = json.load(resp)
+            if perf_metrics:
+                break
+            time.sleep(1)
+        assert perf_metrics, "No perf metrics returned after retries"
```
- Location to update:
- tests/integration/defs/disaggregated/test_disaggregated.py lines 547–552
- Verified producers include `kv_cache_metrics`:
- tensorrt_llm/serve/openai_server.py lines 273–298 [run_scripts output]
Optional refactor to improve test stability without major changes to production code.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- tests/integration/defs/disaggregated/test_disaggregated.py (6 hunks)
- tests/integration/defs/test_e2e.py (1 hunks)
- tests/integration/test_lists/test-db/l0_h100.yml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/integration/test_lists/test-db/l0_h100.yml
- tests/integration/defs/test_e2e.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tests/integration/defs/disaggregated/test_disaggregated.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tests/integration/defs/disaggregated/test_disaggregated.py
🧬 Code graph analysis (1)
tests/integration/defs/disaggregated/test_disaggregated.py (2)
- tests/integration/defs/conftest.py (4): `disaggregated_test_root` (2339-2344), `llm_venv` (707-723), `disaggregated_example_root` (270-275), `llama_model_root` (964-1039)
- tensorrt_llm/serve/openai_disagg_server.py (1): `perf_metrics` (146-195)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tests/integration/defs/disaggregated/test_disaggregated.py (3)
20-20: Type annotations import looks good

Adding Callable is appropriate for the new callback parameter.

165-166: Signature extension is clear and typed

Adding prompt_file default and the typed extra_endpoints_test callback reads well.

69-69: perf_metrics config reference verified – LGTM

Verified that `disagg_config_metrics.yaml` exists under `tests/integration/defs/disaggregated/test_configs/` and that the mapping in `tests/integration/defs/disaggregated/test_disaggregated.py` (line 69) correctly references it. No further changes needed.
PR_Github #16143 [ run ] completed with state
Some timings are obtained from
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
The steady clock and the python
ec90693 to 0f06d5c
/bot run
PR_Github #16349 [ run ] triggered by Bot
Actionable comments posted: 0
♻️ Duplicate comments (4)
tests/integration/defs/disaggregated/test_disaggregated.py (2)
544-551: Fix urllib import and add timeout for robustness

The code uses `urllib.request.urlopen` but only imports `urllib`, which will cause an `AttributeError` at runtime. Also, add a timeout to prevent indefinite blocking.

Apply this diff:

```diff
     def extra_endpoints_test(server_url: str):
         import json
-        import urllib.request
+        import urllib.request
         with urllib.request.urlopen(f"{server_url}/perf_metrics",
-                                    timeout=10) as resp:
+                                    timeout=30) as resp:
             assert resp.status == 200
             perf_metrics = json.load(resp)
```
559-571: Fix: kv_cache_size is not in timing_metrics and timestamps need robust checks

The test incorrectly tries to access `kv_cache_size` from `gen_metrics` (which are timing_metrics). The field is actually in `kv_cache_metrics`. Also, the strict timing comparisons could be flaky.

Apply this diff:

```diff
         ctx_metrics = item["ctx_perf_metrics"]["perf_metrics"]["timing_metrics"]
         gen_metrics = item["gen_perf_metrics"]["perf_metrics"]["timing_metrics"]
+        gen_kv_metrics = item["gen_perf_metrics"]["perf_metrics"].get("kv_cache_metrics", {})
         # only one token is generated in ctx
         assert ctx_metrics["last_token_time"] - ctx_metrics[
             "first_token_time"] < 1e-3
         assert ctx_metrics["last_token_time"] < gen_metrics["arrival_time"]
-        assert gen_metrics["kv_cache_size"] > 0
+        # Check kv_cache_metrics if available
+        if gen_kv_metrics:
+            # Note: kv_cache_size might be in timing_metrics for legacy reasons
+            if "kv_cache_size" in gen_metrics:
+                assert gen_metrics["kv_cache_size"] > 0
         assert gen_metrics["arrival_time"] < gen_metrics[
             "kv_cache_transfer_start"]
         assert gen_metrics["kv_cache_transfer_start"] < gen_metrics[
             "kv_cache_transfer_end"]
         assert gen_metrics["kv_cache_transfer_end"] < gen_metrics[
             "first_scheduled_time"]
```
55-66: Fix type annotations for Python 3.8 compatibilityThe code uses PEP 585 built-in generic
dict[str, dict[int, dict]]which is not supported in Python 3.8. Replace with typing module equivalents.Apply this diff:
+from typing import Dict # server_key -> {ctx_request_id: perf_metrics} - self.server_perf_metrics: dict[str, dict[int, dict]] = {} + self.server_perf_metrics: Dict[str, Dict[int, dict]] = {}
150-159: Add error handling for individual server failures

If any server's `/perf_metrics` endpoint fails or returns non-200, the entire aggregation fails. Consider skipping failed servers to keep the endpoint robust.

Apply this diff:

```diff
         perf_metrics = {}
         exc = None
         try:
             for server in self.ctx_servers + self.gen_servers:
-                async with self.session.get(f"{server}/perf_metrics") as response:
-                    server_perf_metrics = await response.json()
-                    perf_metrics[server] = server_perf_metrics
+                try:
+                    async with self.session.get(f"{server}/perf_metrics") as response:
+                        if response.status != 200:
+                            logger.warning(f"Skipping {server}/perf_metrics: HTTP {response.status}")
+                            continue
+                        server_perf_metrics = await response.json()
+                        perf_metrics[server] = server_perf_metrics
+                except Exception as e:
+                    logger.error(f"Failed to fetch {server}/perf_metrics: {e}")
+                    # Don't set exc here - allow partial aggregation
         except Exception as e:
             # Keep the exception to raise it after saving perf metrics
             exc = e
```
🧹 Nitpick comments (2)
tensorrt_llm/serve/openai_server.py (1)
266-308: Consider adding error handling for missing perf_metrics fields

The function accesses nested attributes without checking if they exist. Consider adding defensive checks to handle cases where some metrics might be None or missing.

Add defensive checks:

```diff
         for metrics_dict in perf_metrics:
             metrics = metrics_dict["perf_metrics"]
+            if metrics is None:
+                continue
             timing_metrics = metrics.timing_metrics
             kv_cache_metrics = metrics.kv_cache_metrics
             speculative_decoding = metrics.speculative_decoding
```

tensorrt_llm/serve/openai_disagg_server.py (1)
177-178: Consider not re-raising the exception to allow partial results

Re-raising the exception here causes the endpoint to return 500 even when some metrics were successfully collected. Consider returning partial results instead.

```diff
-        if exc is not None:
-            raise exc
+        # Log the exception but continue with partial results
+        if exc is not None:
+            logger.error(f"Error during perf metrics collection: {exc}")
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (12)
- tensorrt_llm/commands/serve.py (3 hunks)
- tensorrt_llm/llmapi/disagg_utils.py (4 hunks)
- tensorrt_llm/llmapi/llm_args.py (1 hunks)
- tensorrt_llm/serve/openai_disagg_server.py (6 hunks)
- tensorrt_llm/serve/openai_server.py (9 hunks)
- tests/integration/defs/disaggregated/test_configs/disagg_config_metrics.yaml (1 hunks)
- tests/integration/defs/disaggregated/test_disaggregated.py (7 hunks)
- tests/integration/defs/test_e2e.py (1 hunks)
- tests/integration/test_lists/test-db/l0_a10.yml (1 hunks)
- tests/integration/test_lists/test-db/l0_h100.yml (1 hunks)
- tests/unittest/api_stability/references/llm.yaml (1 hunks)
- tests/unittest/llmapi/apps/_test_openai_perf_metrics.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
- tests/integration/test_lists/test-db/l0_h100.yml
- tests/integration/test_lists/test-db/l0_a10.yml
- tensorrt_llm/llmapi/llm_args.py
- tests/unittest/api_stability/references/llm.yaml
- tests/unittest/llmapi/apps/_test_openai_perf_metrics.py
- tensorrt_llm/llmapi/disagg_utils.py
- tests/integration/defs/disaggregated/test_configs/disagg_config_metrics.yaml
- tests/integration/defs/test_e2e.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
- tensorrt_llm/serve/openai_server.py
- tests/integration/defs/disaggregated/test_disaggregated.py
- tensorrt_llm/serve/openai_disagg_server.py
- tensorrt_llm/commands/serve.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
- tensorrt_llm/serve/openai_server.py
- tests/integration/defs/disaggregated/test_disaggregated.py
- tensorrt_llm/serve/openai_disagg_server.py
- tensorrt_llm/commands/serve.py
🧠 Learnings (1)
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tensorrt_llm/serve/openai_disagg_server.py
🧬 Code graph analysis (4)
tensorrt_llm/serve/openai_server.py (5)
- tensorrt_llm/serve/openai_disagg_server.py (1): `perf_metrics` (146-195)
- tensorrt_llm/_utils.py (1): `set_prometheus_multiproc_dir` (1118-1129)
- tensorrt_llm/metrics/collector.py (2): `MetricsCollector` (10-105), `log_metrics_dict` (100-105)
- tensorrt_llm/llmapi/llm.py (1): `RequestOutput` (47-87)
- tensorrt_llm/executor/result.py (3): `finished` (544-545), `outputs` (197-212), `request_id` (521-522)

tests/integration/defs/disaggregated/test_disaggregated.py (2)
- tests/integration/defs/conftest.py (4): `disaggregated_test_root` (2339-2344), `llm_venv` (707-723), `disaggregated_example_root` (270-275), `llama_model_root` (964-1039)
- tensorrt_llm/serve/openai_disagg_server.py (1): `perf_metrics` (146-195)

tensorrt_llm/serve/openai_disagg_server.py (4)
- cpp/tensorrt_llm/common/envUtils.h (1): `tensorrt_llm` (23-119)
- tensorrt_llm/llmapi/disagg_utils.py (3): `DisaggServerConfig` (47-55), `MetadataServerConfig` (59-64), `get_ctx_gen_server_urls` (67-78)
- tensorrt_llm/serve/metadata_server.py (7): `create_metadata_server` (80-95), `get` (17-18), `get` (38-39), `get` (66-68), `keys` (29-30), `keys` (47-58), `keys` (76-77)
- tensorrt_llm/serve/router.py (2): `create_router` (615-653), `KvCacheAwareRouter` (512-612)

tensorrt_llm/commands/serve.py (2)
- tensorrt_llm/llmapi/disagg_utils.py (2): `MetadataServerConfig` (59-64), `ServerRole` (19-22)
- tensorrt_llm/serve/openai_disagg_server.py (1): `OpenAIDisaggServer` (39-472)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (23)
tests/integration/defs/disaggregated/test_disaggregated.py (7)
20-20: LGTM! The import of `Callable` from the `typing` module is correctly added and follows Python 3.8+ compatibility requirements.

39-44: LGTM! Good helper function for extracting server URL from config. The function correctly extracts hostname and port from the YAML config with sensible defaults (localhost:8000), which aligns with the disaggregated server defaults.

69-69: LGTM! The addition of the perf_metrics test configuration entry follows the established pattern for test configurations.

165-166: LGTM! Clean API extension for extra endpoint testing. The optional `extra_endpoints_test` parameter is well-designed, using `Optional[Callable[[str], None]]`, which is compatible with Python 3.8+.

186-186: LGTM! Correctly derives the server URL from the config file before starting the test.

247-248: LGTM! Good integration point for additional testing. The extra endpoints test is invoked at the right time - after the client runs but before output verification, allowing validation of metrics collected during the test.

573-577: LGTM! The test function is correctly integrated with the test harness, passing the extra_endpoints_test callback to validate the perf_metrics endpoint.
tensorrt_llm/serve/openai_server.py (5)
7-7: LGTM! The import of `deque` from the `collections` module is correctly added for the perf metrics queue implementation.

88-99: LGTM! Well-structured initialization of perf metrics state. The initialization properly sets up the perf metrics collection infrastructure only when both `return_perf_metrics` is enabled and `perf_metrics_max_requests > 0`. The bounded deque ensures memory is controlled.

169-169: LGTM! The `/perf_metrics` endpoint is correctly registered in the routes.

320-335: LGTM! Clean extraction and storage of metrics. The `_extract_metrics` method properly checks if the request is finished before processing, logs metrics if a collector is available, and safely appends to the perf_metrics queue with proper locking.

352-352: LGTM! Consistent metrics extraction across all completion paths. The metrics extraction is correctly integrated at all completion points - streaming chat, non-streaming chat, streaming completion, and non-streaming completion.
Also applies to: 370-370, 539-539, 575-575
tensorrt_llm/serve/openai_disagg_server.py (7)
4-4: LGTM! The imports for `itertools` and `deque` are correctly added for the perf metrics implementation.

Also applies to: 8-8

11-11: LGTM! The typing import update correctly removes unused `List` and keeps only what's needed.

22-24: LGTM! The imports are correctly updated to use the new config-driven API with `DisaggServerConfig` and `get_ctx_gen_server_urls`.

41-54: LGTM! Clean config-driven initialization. The constructor is well refactored to use the single `DisaggServerConfig` object, deriving servers and routers from it. The implementation is cleaner and more maintainable than the previous multi-parameter approach.

127-127: LGTM! The `/perf_metrics` endpoint is correctly registered in the routes.

142-144: LGTM! Thread-safe helper for adding perf metrics keys. The helper method properly uses async locking to safely append correlation keys to the deque.

334-336: LGTM! Correct integration of perf metrics tracking. The perf metrics keys are properly registered when both context processing is needed and perf metrics tracking is enabled. Using `asyncio.create_task` ensures non-blocking execution.

tensorrt_llm/commands/serve.py (4)
3-3: LGTM! The `signal` module import is correctly added for signal handling functionality.

6-6: LGTM! The typing import update correctly removes unused `List` and keeps only `Optional` and `Any`.

23-25: LGTM! Clean import update for config-driven API. The imports are correctly updated to support the new config-driven disaggregated server setup, removing the now-unused `CtxGenServerConfig`.

483-486: LGTM! Clean config-driven server instantiation. The disaggregated server instantiation is cleanly updated to use the single config object approach, which is more maintainable than the previous multi-parameter approach.
PR_Github #16349 [ run ] completed with state
```python
continue
server_metrics[ctx_request_id] = request_perf_metrics

if len(server_metrics) > self.perf_metrics_max_requests:
```
If possible, you could use a ring buffer with size = perf_metrics_max_requests instead. Would be faster and easier to understand.
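For illustration only, the ring-buffer idea maps naturally onto `collections.deque` with a `maxlen` (a sketch, not the PR's code):

```python
from collections import deque

# Keep only the newest N metric entries; appending beyond maxlen evicts the oldest.
perf_metrics = deque(maxlen=100)

for request_id in range(150):
    perf_metrics.append({"request_id": request_id})

print(len(perf_metrics))   # 100
print(perf_metrics[0])     # {'request_id': 50} -- oldest surviving entry
```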
ctx_request_id is used as a look-up key in later metrics aggregation, so server_metrics here has to be a map with max entries.
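A minimal sketch of the keyed, bounded alternative being described here (hypothetical helper class; the PR's code keeps a plain dict plus an explicit size check, as in the snippet above):

```python
from collections import OrderedDict


class BoundedMetricsMap:
    """Insertion-ordered map that evicts the oldest entry once max_entries is exceeded."""

    def __init__(self, max_entries: int):
        self._max_entries = max_entries
        self._entries = OrderedDict()

    def add(self, ctx_request_id: int, perf_metrics: dict) -> None:
        self._entries[ctx_request_id] = perf_metrics
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # drop the oldest request

    def pop(self, ctx_request_id: int):
        """Remove and return metrics for a request, or None if unknown or already evicted."""
        return self._entries.pop(ctx_request_id, None)
```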
```python
return JSONResponse(content=[])
async with self.perf_metrics_lock:
    perf_metrics = self.perf_metrics
    self.perf_metrics = deque(maxlen=self.llm.args.perf_metrics_max_requests)
```
This is a bug. Subsequent calls to this /perf_metrics API will return an empty list [] rather than returning the last perf_metrics_max_requests number of requests. Right now only the first caller gets the metrics for the queued requests. See my output below:
```text
root@f269356a9f20:/dynamo/build# curl localhost:8000/perf_metrics | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   348  100   348    0     0   175k      0 --:--:-- --:--:-- --:--:--  339k
[
  {
    "request_id": 3,
    "perf_metrics": {
      "first_iter": 821,
      "last_iter": 1810,
      "timing_metrics": {
        "arrival_time": 1946966.715181,
        "first_scheduled_time": 1946966.715351,
        "first_token_time": 1946966.744874,
        "last_token_time": 1946976.857435
      },
      "kv_cache_metrics": {
        "num_total_allocated_blocks": 1,
        "num_new_allocated_blocks": 0,
        "num_reused_blocks": 2,
        "num_missed_blocks": 0
      }
    }
  }
]

# Immediately calling the same endpoint again.
root@f269356a9f20:/dynamo/build# curl localhost:8000/perf_metrics | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100     2  100     2    0     0   1098      0 --:--:-- --:--:-- --:--:--  2000
[]
```
I would expect the second call to this API endpoint to return the data for request_id: 3.

My recommendation for fixing this is to keep a list (or my aforementioned ring buffer with a lock) that holds the previous perf_metrics_max_requests number of items to return to the caller. This way the API is idempotent, and the race condition around who calls this API first is removed.
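As a sketch of that alternative (hypothetical names, not the PR's implementation), the handler would copy the buffer instead of draining it:

```python
import asyncio
from collections import deque

MAX_PERF_ITEMS = 4096          # hard upper bound regardless of user configuration
user_max_perf_items = 100      # hypothetical; would come from server configuration

perf_metrics = deque(maxlen=min(user_max_perf_items, MAX_PERF_ITEMS))
perf_metrics_lock = asyncio.Lock()


async def get_perf_metrics() -> list:
    """Idempotent read: every caller sees the last N entries; nothing is drained."""
    async with perf_metrics_lock:
        return list(perf_metrics)
```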
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This behavior is consistent with the /metrics endpoint, in that entries are cached and retrieved in batches. After a call to the endpoint, the cached entries are returned and cleared. perf_metrics_max_requests controls how many entries are kept at most if the endpoint is not polled frequently, not how many are returned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I disagree with the structure of this API then. In the event the caller who is processing these metrics goes down, there is no recovery for the metrics that were already handed out. If this is a large number, then a large chunk of data is lost. This also limits us to a single data processor, even if we want multiple or redundant data processors.

If you are worried about the memory footprint of a large number of perf_metrics items, then you can hard-code an upper bound like: user_max_perf_items > MAX_PERF_ITEMS ? MAX_PERF_ITEMS : user_max_perf_items.

I want to understand the fundamental reason why the queue must be used, other than "this is what the spec says."
Summary by CodeRabbit
New Features
Tests
Chores
Description
Add a `/perf_metrics` endpoint to openai server. When `return_perf_metrics` is set to `true` and `perf_metrics_max_requests` is larger than 0, the endpoint will return a list of per-request metrics for finished requests. Previously this was only available via LLMAPI. This is similar to the existing `/metrics` endpoint, which returns per-iteration statistics. The server keeps at most `perf_metrics_max_requests` entries to avoid OOM and returns all of them when requested. Each metrics item is returned at most once.

For disaggregated serving, the endpoint collects perf metrics from all servers, matches corresponding context and generation requests, and forms a new perf metrics list. This requires `return_perf_metrics` and `perf_metrics_max_requests` to be set on both context and generation servers, and `perf_metrics_max_requests` on the disagg server.

IFB output:
Disaggregated serving output:
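A minimal client-side sketch of consuming the endpoint (hypothetical host/port; assumes the server was started with `return_perf_metrics: true` and a positive `perf_metrics_max_requests`):

```python
import json
import time
import urllib.request

SERVER_URL = "http://localhost:8000"  # hypothetical address of the (disagg or single) server


def poll_perf_metrics(interval_s: float = 5.0) -> None:
    """Periodically drain /perf_metrics; each entry is returned by the server at most once."""
    while True:
        with urllib.request.urlopen(f"{SERVER_URL}/perf_metrics", timeout=10) as resp:
            entries = json.load(resp)
        for entry in entries:
            print(json.dumps(entry, indent=2))
        time.sleep(interval_s)
```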
Test Coverage
- tests/unittest/llmapi/apps/_test_openai_perf_metrics.py
- tests/integration/defs/disaggregated/test_disaggregated.py::test_disaggregated_perf_metrics

GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user-friendly way for developers to interact with a Jenkins server.

Run `/bot [-h|--help]` to print this help message.

See details below for each supported subcommand.
run

`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.

- `--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill

`kill`

Kill all running builds associated with pull request.
skip
`skip --comment COMMENT`

Skip testing for latest commit on pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline`

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
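As a concrete illustration (not part of the official help text above), a PR comment that combines the documented optional flags of `run` might look like:

```text
/bot run --disable-fail-fast --stage-list "A10-PyTorch-1"
```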