[TRTLLM-6549][feat] add perf metrics endpoint to openai server and openai disagg server by zhengd-nv · Pull Request #6985 · NVIDIA/TensorRT-LLM · GitHub

Conversation

@zhengd-nv
Collaborator

@zhengd-nv zhengd-nv commented Aug 18, 2025

Summary by CodeRabbit

  • New Features

    • Added a /perf_metrics endpoint exposing per-request performance metrics for chat, completions, and streaming.
    • Cross-server correlation and aggregation of perf metrics for disaggregated deployments with bounded history and configurable max entries.
    • Server startup now accepts a single config object with new perf-metrics flags.
  • Tests

    • Added unit and integration tests and a disaggregated test config validating /perf_metrics payloads and correlations.
  • Chores

    • Updated startup/config handling to use the new config-driven server construction.

Description

Add a /perf_metrics endpoint to the OpenAI server. When return_perf_metrics is set to true and perf_metrics_max_requests is larger than 0, the endpoint returns a list of per-request metrics for finished requests. Previously these metrics were only available via the LLMAPI. This is similar to the existing /metrics endpoint, which returns per-iteration statistics. The server keeps at most perf_metrics_max_requests entries to avoid OOM and returns all of them when requested. Each metrics item is returned at most once.

For disaggregated serving, the endpoint collects perf metrics from all servers, matches the corresponding context and generation requests, and forms a new perf metrics list. This requires return_perf_metrics and perf_metrics_max_requests to be set on both the context and generation servers, and perf_metrics_max_requests to be set on the disagg server.
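
For reference, a minimal polling snippet (illustrative only: it assumes a server started with the flags above and listening on localhost:8000; the field names follow the sample output below):

import json
from urllib.request import urlopen

with urlopen("http://localhost:8000/perf_metrics", timeout=10) as resp:
    entries = json.load(resp)

# Each entry is returned at most once; later polls only see newly finished requests.
for entry in entries:
    timing = entry["perf_metrics"]["timing_metrics"]
    ttft = timing["first_token_time"] - timing["arrival_time"]
    print(entry["request_id"], f"TTFT: {ttft:.3f}s")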

IFB output:

[{
  "request_id": 2,
  "perf_metrics": {
    "first_iter": 7,
    "last_iter": 16,
    "timing_metrics": {
      "arrival_time": 19786.758976,
      "first_scheduled_time": 19786.75926,
      "first_token_time": 19786.809404,
      "last_token_time": 19786.864166
    },
    "kv_cache_metrics": {
      "num_total_allocated_blocks": 65,
      "num_new_allocated_blocks": 65,
      "num_reused_blocks": 0,
      "num_missed_blocks": 0
    }
  }
}, ...]

Disaggregated serving output:

[{
  "ctx_server": "http://localhost:8001",
  "gen_server": "http://localhost:8003",
  "ctx_perf_metrics": {
    "request_id": 2,
    "perf_metrics": {
      "first_iter": 7,
      "last_iter": 7,
      "timing_metrics": {
        "arrival_time": 22561.15635,
        "first_scheduled_time": 22561.156587,
        "first_token_time": 22561.204958,
        "last_token_time": 22561.204958
      },
      "kv_cache_metrics": {
        "num_total_allocated_blocks": 65,
        "num_new_allocated_blocks": 65,
        "num_reused_blocks": 0,
        "num_missed_blocks": 0
      }
    },
    "ctx_request_id": 1
  },
  "gen_perf_metrics": {
    "request_id": 2,
    "perf_metrics": {
      "first_iter": 75,
      "last_iter": 83,
      "timing_metrics": {
        "arrival_time": 22561.231106,
        "first_scheduled_time": 22561.490031,
        "first_token_time": 22561.497665,
        "last_token_time": 22561.539704,
        "kv_cache_size": 59637760,
        "kv_cache_transfer_start": 22561.231877,
        "kv_cache_transfer_end": 22561.48959
      },
      "kv_cache_metrics": {
        "num_total_allocated_blocks": 65,
        "num_new_allocated_blocks": 65,
        "num_reused_blocks": 0,
        "num_missed_blocks": 0
      }
    },
    "ctx_request_id": 1
  }
}, ...]

Test Coverage

  • tests/unittest/llmapi/apps/_test_openai_perf_metrics.py
  • tests/integration/defs/disaggregated/test_disaggregated.py::test_disaggregated_perf_metrics

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since a lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since a lack of user care and validation can cause the top of tree to break.

@coderabbitai
Contributor

coderabbitai bot commented Aug 18, 2025

📝 Walkthrough

Walkthrough

Adds per-request perf metrics capture to OpenAIServer, cross-server correlation and aggregation in OpenAIDisaggServer with a new /perf_metrics endpoint, introduces config-driven disaggregated server setup and perf-related config fields, and adds unit/integration tests and test-list entries exercising the metrics flows.

Changes

Cohort / File(s) Summary
Disaggregated server metrics & routing
tensorrt_llm/serve/openai_disagg_server.py
Replaces ctor with config-driven DisaggServerConfig; derives ctx/gen servers via get_ctx_gen_server_urls(...); adds perf_metrics_max_requests, perf_metrics_keys (deque), perf_metrics_keys_lock, and server_perf_metrics; registers GET /perf_metrics which polls child servers, stores/prunes results, and returns correlated (ctx_server, gen_server, ctx_request_id) metric pairs when both sides present; adds _add_perf_metrics_keys and synchronization.
Monolithic server metrics & endpoint
tensorrt_llm/serve/openai_server.py
Adds bounded perf_metrics deque with perf_metrics_lock; adds _extract_metrics to collect/enqueue per-request metrics (guarded by lock); instruments chat/completion/stream flows to call _extract_metrics; adds GET /perf_metrics (get_perf_metrics) which copies+resets queue safely and serializes metrics.
Disagg config, args & utils
tensorrt_llm/llmapi/disagg_utils.py, tensorrt_llm/llmapi/llm_args.py
Adds perf_metrics_max_requests to DisaggServerConfig and TorchLlmArgs; introduces get_ctx_gen_server_urls(server_configs) returning (ctx_urls, gen_urls); wires perf field through extract_disagg_cfg.
Serve command & imports
tensorrt_llm/commands/serve.py
Updates disaggregated server instantiation to OpenAIDisaggServer(config=...); removes local get_ctx_gen_server_urls usage and CtxGenServerConfig import; adjusts typing imports and startup args.
Integration test configs & tests
tests/integration/defs/disaggregated/test_configs/disagg_config_metrics.yaml, tests/integration/defs/disaggregated/test_disaggregated.py
Adds YAML enabling perf metrics for ctx/gen servers and top-level perf_metrics_max_requests; adds get_disagg_server_url_from_cfg, extends run_disaggregated_test(..., extra_endpoints_test), and adds test_disaggregated_perf_metrics which fetches/validates aggregated /perf_metrics.
E2E / unit tests
tests/integration/defs/test_e2e.py, tests/unittest/llmapi/apps/_test_openai_perf_metrics.py
Adds test_openai_perf_metrics runner entry and a new unittest that spins up RemoteOpenAIServer with perf metrics, issues a completion, queries /perf_metrics, and validates metric structure and timestamp/field relationships.
Test lists
tests/integration/test_lists/test-db/l0_a10.yml, tests/integration/test_lists/test-db/l0_h100.yml
Adds test_openai_perf_metrics to pre_merge list and inserts test_disaggregated_perf_metrics[...] into disaggregated test list to run perf-metrics tests.
API stability reference
tests/unittest/api_stability/references/llm.yaml
Adds perf_metrics_max_requests parameter to API YAML for init (type int, default 0, status prototype).

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant OpenAIServer
  participant MetricsQueue as "perf_metrics (deque)"

  Client->>OpenAIServer: POST /v1/chat or /v1/completions
  OpenAIServer->>OpenAIServer: process request
  OpenAIServer->>MetricsQueue: _extract_metrics(res)  -- acquire lock, append
  OpenAIServer-->>Client: response

  Client->>OpenAIServer: GET /perf_metrics
  OpenAIServer->>MetricsQueue: copy + reset (under lock)
  OpenAIServer-->>Client: { perf_metrics: [...] }
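
A minimal sketch of the copy-and-reset step shown above, assuming an asyncio.Lock and a bounded deque (consistent with the described behavior, but not copied from the PR's code):

import asyncio
from collections import deque


class PerfMetricsBuffer:
    """Toy illustration of the swap-and-drain pattern in the diagram above."""

    def __init__(self, max_entries: int = 1000):
        self.perf_metrics = deque(maxlen=max_entries)
        self.perf_metrics_lock = asyncio.Lock()

    async def record(self, item: dict) -> None:
        async with self.perf_metrics_lock:
            self.perf_metrics.append(item)

    async def drain(self) -> list:
        # Swap the buffer under the lock, then serialize outside the lock,
        # so each entry is handed out at most once.
        async with self.perf_metrics_lock:
            drained, self.perf_metrics = self.perf_metrics, deque(
                maxlen=self.perf_metrics.maxlen)
        return list(drained)
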
sequenceDiagram
  participant Client
  participant DisaggServer
  participant CtxServer
  participant GenServer

  Client->>DisaggServer: Disaggregated request
  DisaggServer->>DisaggServer: if ctx required and perf enabled: _add_perf_metrics_keys(ctx, gen, ctx_request_id)

  Client->>DisaggServer: GET /perf_metrics
  par Poll both servers
    DisaggServer->>CtxServer: GET /perf_metrics
    CtxServer-->>DisaggServer: ctx metrics
    DisaggServer->>GenServer: GET /perf_metrics
    GenServer-->>DisaggServer: gen metrics
  end
  DisaggServer->>DisaggServer: correlate by (ctx_server, gen_server, ctx_request_id) under lock
  DisaggServer-->>Client: aggregated entries with ctx_perf_metrics & gen_perf_metrics

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

Community want to contribute

Suggested reviewers

  • chuangz0
  • pcastonguay
  • Tabrizian
  • litaotju


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (9)
tensorrt_llm/serve/openai_server.py (2)

85-87: Bound the in-memory perf metrics buffer to avoid unbounded growth

If clients don’t poll /perf_metrics frequently (or ever), self.perf_metrics can grow without bound. Consider a bounded deque to cap memory usage.

Apply within the shown range:

-        self.perf_metrics = []
+        self.perf_metrics = deque(maxlen=1000)
         self.perf_metrics_lock = asyncio.Lock()

Add the import at the top of the file (outside the shown range):

from collections import deque

245-281: Harden serialization against optional/absent fields

speculative_decoding and some timing/KV-cache fields may be absent or None depending on backend/features. Accessing attributes unguarded can raise. Recommend using getattr to guard optional fields.

Apply within the shown range:

-        for metrics_dict in perf_metrics:
+        for metrics_dict in perf_metrics:
             metrics = metrics_dict["perf_metrics"]
             timing_metrics = metrics.timing_metrics
             kv_cache_metrics = metrics.kv_cache_metrics
             speculative_decoding = metrics.speculative_decoding
             # TODO: other metrics
             metrics_json = {
                 "first_iter": metrics.first_iter,
                 "last_iter": metrics.last_iter,
-                "arrival_time": timing_metrics.arrival_time.total_seconds(),
-                "first_scheduled_time": timing_metrics.first_scheduled_time.total_seconds(),
-                "first_token_time": timing_metrics.first_token_time.total_seconds(),
-                "last_token_time": timing_metrics.last_token_time.total_seconds(),
+                "arrival_time": getattr(timing_metrics, "arrival_time").total_seconds(),
+                "first_scheduled_time": getattr(timing_metrics, "first_scheduled_time").total_seconds(),
+                "first_token_time": getattr(timing_metrics, "first_token_time").total_seconds(),
+                "last_token_time": getattr(timing_metrics, "last_token_time").total_seconds(),
                 "num_total_allocated_blocks": kv_cache_metrics.num_total_allocated_blocks,
                 "num_new_allocated_blocks": kv_cache_metrics.num_new_allocated_blocks,
                 "num_reused_blocks": kv_cache_metrics.num_reused_blocks,
                 "num_missed_blocks": kv_cache_metrics.num_missed_blocks,
             }
-            if timing_metrics.kv_cache_size > 0:
+            kv_cache_size = getattr(timing_metrics, "kv_cache_size", 0)
+            if kv_cache_size and kv_cache_size > 0:
                 metrics_json.update({
-                    "kv_cache_size": timing_metrics.kv_cache_size,
-                    "kv_cache_transfer_start": timing_metrics.kv_cache_transfer_start.total_seconds(),
-                    "kv_cache_transfer_end": timing_metrics.kv_cache_transfer_end.total_seconds(),
+                    "kv_cache_size": kv_cache_size,
+                    **(
+                        {"kv_cache_transfer_start": getattr(timing_metrics, "kv_cache_transfer_start").total_seconds()}
+                        if getattr(timing_metrics, "kv_cache_transfer_start", None) is not None else {}
+                    ),
+                    **(
+                        {"kv_cache_transfer_end": getattr(timing_metrics, "kv_cache_transfer_end").total_seconds()}
+                        if getattr(timing_metrics, "kv_cache_transfer_end", None) is not None else {}
+                    ),
                 })
-            if speculative_decoding.total_draft_tokens > 0:
+            if getattr(speculative_decoding, "total_draft_tokens", 0) > 0:
                 metrics_json.update({
-                    "acceptance_rate": speculative_decoding.acceptance_rate,
-                    "total_accepted_draft_tokens": speculative_decoding.total_accepted_draft_tokens,
-                    "total_draft_tokens": speculative_decoding.total_draft_tokens,
+                    "acceptance_rate": getattr(speculative_decoding, "acceptance_rate"),
+                    "total_accepted_draft_tokens": getattr(speculative_decoding, "total_accepted_draft_tokens"),
+                    "total_draft_tokens": getattr(speculative_decoding, "total_draft_tokens"),
                 })
             metrics_dict["perf_metrics"] = metrics_json
         return JSONResponse(content=perf_metrics)
tests/unittest/llmapi/apps/_test_openai_perf_metrics.py (1)

61-66: Use equality (==) instead of identity (is) and add a short poll to avoid flakes

response.status is 200 relies on CPython small-int interning and is not the intended check; use ==. Also, adding a brief retry loop reduces flakiness if metrics are exposed slightly after the completion returns.

Apply within the shown range:

-    response = urlopen(f'{server.url_root}/perf_metrics')
-    assert response.status is 200
-
-    data_list = json.loads(response.read())
-    assert len(data_list) == 1
+    # Poll a few times in case metric extraction lags the response
+    data_list = []
+    for _ in range(10):
+        with urlopen(f'{server.url_root}/perf_metrics') as response:
+            assert response.status == 200
+            data_list = json.loads(response.read())
+        if len(data_list) >= 1:
+            break
+        time.sleep(0.2)
+    assert len(data_list) == 1

Add this import at the top of the file (outside the shown range):

import time
tensorrt_llm/serve/openai_disagg_server.py (1)

137-146: Consider parallelizing fan-out GETs and handling non-200s gracefully

The current loop fetches each server’s /perf_metrics sequentially and assumes success. For lower latency and resilience under partial failures, gather requests concurrently and skip/record servers that return non-200 or raise.

If helpful, I can provide a concurrent asyncio.gather version with per-server error handling.
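
For illustration, a rough sketch of what that could look like (assumes an aiohttp.ClientSession and a plain list of server URLs; the function name is hypothetical):

import asyncio
from typing import Dict, List

import aiohttp


async def fetch_all_perf_metrics(session: aiohttp.ClientSession,
                                 servers: List[str]) -> Dict[str, list]:
    """Fetch /perf_metrics from all servers concurrently, skipping failures."""

    async def fetch_one(server: str):
        try:
            async with session.get(f"{server}/perf_metrics") as resp:
                if resp.status != 200:
                    return server, None  # non-200: treat this server as unavailable
                return server, await resp.json()
        except aiohttp.ClientError:
            return server, None  # connection/timeout errors are skipped

    results = await asyncio.gather(*(fetch_one(s) for s in servers))
    return {server: metrics for server, metrics in results if metrics is not None}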

tests/integration/defs/disaggregated/test_disaggregated.py (5)

20-20: Import Optional for accurate type annotation usage

You assign None to a Callable-typed parameter later. Bring in Optional to annotate it correctly.

Apply this diff:

-from typing import Callable
+from typing import Callable, Optional

39-45: Add return type, docstring, and explicit UTF-8 decoding

Small polish to improve clarity and robustness when reading YAML.

Apply this diff:

-def get_disagg_server_url_from_cfg(config_file: str):
-    with open(config_file, 'r') as file:
+def get_disagg_server_url_from_cfg(config_file: str) -> str:
+    """Derive the disaggregated server base URL from a YAML config."""
+    with open(config_file, 'r', encoding='utf-8') as file:
         config = yaml.safe_load(file)
     server_host = config.get('hostname', 'localhost')
     server_port = config.get('port', 8000)
     return f"http://{server_host}:{server_port}"

165-166: Make extra_endpoints_test annotation Optional to match default None

Type checkers will flag Callable[[str], None] with a None default. Use Optional[Callable[…]].

Apply this diff:

-                           prompt_file="prompts.json",
-                           extra_endpoints_test: Callable[[str], None] = None):
+                           prompt_file="prompts.json",
+                           extra_endpoints_test: Optional[Callable[[str], None]] = None):

246-249: Avoid re-reading the YAML config on every iteration

Minor inefficiency: the server URL is constant for the test. Compute it once outside the loop.

Apply this diff within the shown lines:

-                if extra_endpoints_test is not None:
-                    extra_endpoints_test(
-                        get_disagg_server_url_from_cfg(config_file))
+                if extra_endpoints_test is not None:
+                    extra_endpoints_test(server_url)

And add this one-liner just before entering the workers/server context (or right after computing server_cmd), outside the per-iteration loop:

server_url = get_disagg_server_url_from_cfg(config_file)

561-571: Relax strict temporal assertions to reduce flakiness across schedulers/clocks

Timing can be equal at boundaries due to clock resolution and ordering. Using < may intermittently fail; prefer <= for adjacent events. Keep strict ordering where causality must hold.

Apply this diff:

-        assert ctx_metrics["first_token_time"] == ctx_metrics["last_token_time"]
-        assert ctx_metrics["last_token_time"] < gen_metrics["arrival_time"]
+        assert ctx_metrics["first_token_time"] == ctx_metrics["last_token_time"]
+        assert ctx_metrics["last_token_time"] <= gen_metrics["arrival_time"]
@@
-        assert gen_metrics["arrival_time"] < gen_metrics[
+        assert gen_metrics["arrival_time"] <= gen_metrics[
             "kv_cache_transfer_start"]
-        assert gen_metrics["kv_cache_transfer_start"] < gen_metrics[
+        assert gen_metrics["kv_cache_transfer_start"] <= gen_metrics[
             "kv_cache_transfer_end"]
-        assert gen_metrics["kv_cache_transfer_end"] < gen_metrics[
+        assert gen_metrics["kv_cache_transfer_end"] <= gen_metrics[
             "first_scheduled_time"]

If ctx/gen servers ever run on different hosts with unsynchronized clocks, consider validating relative durations (non-negative deltas) instead of absolute cross-server timestamps.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 55f4f2d and 611f92b.

📒 Files selected for processing (6)
  • tensorrt_llm/serve/openai_disagg_server.py (4 hunks)
  • tensorrt_llm/serve/openai_server.py (8 hunks)
  • tests/integration/defs/disaggregated/test_configs/disagg_config_metrics.yaml (1 hunks)
  • tests/integration/defs/disaggregated/test_disaggregated.py (6 hunks)
  • tests/integration/defs/test_e2e.py (1 hunks)
  • tests/unittest/llmapi/apps/_test_openai_perf_metrics.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tests/integration/defs/test_e2e.py
  • tests/unittest/llmapi/apps/_test_openai_perf_metrics.py
  • tensorrt_llm/serve/openai_disagg_server.py
  • tensorrt_llm/serve/openai_server.py
  • tests/integration/defs/disaggregated/test_disaggregated.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tests/integration/defs/test_e2e.py
  • tests/unittest/llmapi/apps/_test_openai_perf_metrics.py
  • tensorrt_llm/serve/openai_disagg_server.py
  • tensorrt_llm/serve/openai_server.py
  • tests/integration/defs/disaggregated/test_disaggregated.py
🧬 Code Graph Analysis (5)
tests/integration/defs/test_e2e.py (1)
tests/integration/defs/conftest.py (4)
  • llm_venv (707-723)
  • test_root (2189-2190)
  • llm_root (180-181)
  • unittest_path (90-91)
tests/unittest/llmapi/apps/_test_openai_perf_metrics.py (1)
tests/integration/defs/test_e2e.py (1)
  • temp_extra_llm_api_options_file (724-759)
tensorrt_llm/serve/openai_disagg_server.py (1)
tests/unittest/disaggregated/test_router.py (1)
  • get (23-25)
tensorrt_llm/serve/openai_server.py (5)
tensorrt_llm/serve/openai_disagg_server.py (1)
  • perf_metrics (137-168)
tensorrt_llm/llmapi/llm.py (1)
  • RequestOutput (47-85)
tensorrt_llm/executor/result.py (3)
  • finished (531-532)
  • outputs (193-208)
  • request_id (508-509)
tensorrt_llm/metrics/collector.py (1)
  • log_metrics_dict (100-105)
tests/integration/defs/accuracy/test_disaggregated_serving.py (1)
  • outputs (41-42)
tests/integration/defs/disaggregated/test_disaggregated.py (2)
tests/integration/defs/conftest.py (3)
  • disaggregated_test_root (2339-2344)
  • llm_venv (707-723)
  • disaggregated_example_root (270-275)
tensorrt_llm/serve/openai_disagg_server.py (1)
  • perf_metrics (137-168)
🔇 Additional comments (13)
tests/integration/defs/disaggregated/test_configs/disagg_config_metrics.yaml (1)

8-25: Config looks good for exercising disaggregated perf-metrics

The configuration enables return_perf_metrics on both ctx/gen servers and provides distinct URLs, which is required for correlation in the disaggregated /perf_metrics flow. Nothing blocking from my side.

tensorrt_llm/serve/openai_server.py (6)

158-158: LGTM: new /perf_metrics route is correctly registered

The new endpoint is added consistently alongside existing metrics endpoints.


292-307: LGTM: metrics extraction gated on finish and guarded by a lock

The early return for unfinished results and the locked append prevent races and partial metrics from leaking into the endpoint.


323-324: LGTM: streaming path calls extractor per yielded chunk

Only finished results will be recorded; the call here is safe due to your finish-guard in _extract_metrics.


341-342: LGTM: non-streaming chat path records metrics once at completion

This ensures a single per-request metric entry.


434-435: LGTM: completions (non-streaming) path records metrics on completion

Consistent with chat path; good.


470-471: LGTM: completions streaming path calls extractor per chunk

Same rationale as chat streaming; safe with the finish-guard.

tests/integration/defs/test_e2e.py (1)

1500-1506: LGTM: adds the perf-metrics test to the e2e suite

This integrates the new module cleanly without impacting existing tests.

tests/unittest/llmapi/apps/_test_openai_perf_metrics.py (1)

87-92: LGTM: negative checks for disaggregation-only fields

These assertions correctly ensure the standalone server doesn’t emit disagg-only fields in its perf metrics.

tensorrt_llm/serve/openai_disagg_server.py (2)

122-123: LGTM: new /perf_metrics endpoint exposed

This aligns with the ctx/gen servers and enables aggregation at the disagg layer.


307-310: LGTM: pre-register correlation keys for requests requiring ctx

Capturing (ctx_server, gen_server, ctx_request_id) under a lock is the right primitive for later aggregation.

tests/integration/defs/disaggregated/test_disaggregated.py (2)

571-575: Nice wiring of the perf_metrics end-to-end check

Good reuse of run_disaggregated_test with a pluggable endpoint checker; this keeps the dataplane test flow consistent.


69-69: Confirmed config path and contents
The file tests/integration/defs/disaggregated/test_configs/disagg_config_metrics.yaml exists and includes the return_perf_metrics entries, so no changes are needed.

@zhengd-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15684 [ run ] triggered by Bot

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🔭 Outside diff range comments (1)
tensorrt_llm/serve/openai_disagg_server.py (1)

418-445: Python 3.8 compatibility: replace list[str] with List[str]

Same PEP 585 generics issue here; update annotations for method and inner helper.

Apply this diff:

-    async def wait_for_all_servers_ready(cls, session: aiohttp.ClientSession,
-                                         ctx_servers: list[str],
-                                         gen_servers: list[str],
+    async def wait_for_all_servers_ready(cls, session: aiohttp.ClientSession,
+                                         ctx_servers: List[str],
+                                         gen_servers: List[str],
                                          server_start_timeout_secs: int = 180):
-        async def get_unready_servers(servers: list[str]) -> list[str]:
+        async def get_unready_servers(servers: List[str]) -> List[str]:

Ensure typing imports include List (already present).

♻️ Duplicate comments (3)
tensorrt_llm/serve/openai_disagg_server.py (2)

57-63: Python 3.8 compatibility: replace PEP 585 generics with typing.List/Dict/Tuple (duplicate of prior review)

Using list[...]/dict[...] requires Python 3.9+. The project targets 3.8+. Also tighten the nested mapping type for ctx_request_id keys.

Apply within the shown range:

-        # record corresponding keys of context and generation servers for perf metrics
-        self.perf_metrics_keys: list[tuple[str, str, int]] = [] # (ctx_server_key, gen_server_key, ctx_request_id)
+        # record corresponding keys of context and generation servers for perf metrics
+        self.perf_metrics_keys: List[Tuple[str, str, int]] = []  # (ctx_server_key, gen_server_key, ctx_request_id)
         self.perf_metrics_keys_lock = asyncio.Lock()
-        self.server_perf_metrics: dict[str, dict] = {} # server_key -> {ctx_request_id: perf_metrics}
+        self.server_perf_metrics: Dict[str, Dict[int, dict]] = {}  # server_key -> {ctx_request_id: perf_metrics}
         for server in self.ctx_servers + self.gen_servers:
             self.server_perf_metrics[server] = {}

And update typing imports at the top of the file:

-from typing import List, Optional, Type, Union
+from typing import Dict, List, Optional, Tuple, Type, Union

145-166: Fix data loss and partial-pair emission when correlating ctx/gen metrics (duplicate of prior review)

Current logic pops gen first, then appends even if ctx is missing (ctx_perf_metrics can be None). This both drops ctx data prematurely and can emit incomplete pairs. Peek (get) until both present, then pop both.

Apply this diff:

-            remain_keys = []
-            for ctx_server, gen_server, ctx_request_id in self.perf_metrics_keys:
-                gen_perf_metrics = self.server_perf_metrics[gen_server].pop(ctx_request_id, None)
-                if gen_perf_metrics is None:
-                    # generation not finished
-                    remain_keys.append((ctx_server, gen_server, ctx_request_id))
-                    continue
-                ctx_perf_metrics = self.server_perf_metrics[ctx_server].pop(ctx_request_id, None)
-                return_metrics.append({
-                    "ctx_server": ctx_server,
-                    "gen_server": gen_server,
-                    "ctx_perf_metrics": ctx_perf_metrics,
-                    "gen_perf_metrics": gen_perf_metrics})
-            self.perf_metrics_keys = remain_keys
+            remain_keys = []
+            for ctx_server, gen_server, ctx_request_id in self.perf_metrics_keys:
+                ctx_perf_metrics = self.server_perf_metrics.get(ctx_server, {}).get(ctx_request_id)
+                gen_perf_metrics = self.server_perf_metrics.get(gen_server, {}).get(ctx_request_id)
+                if not (ctx_perf_metrics and gen_perf_metrics):
+                    # one side not finished yet
+                    remain_keys.append((ctx_server, gen_server, ctx_request_id))
+                    continue
+                # both present; remove from stores to avoid duplication
+                self.server_perf_metrics[ctx_server].pop(ctx_request_id, None)
+                self.server_perf_metrics[gen_server].pop(ctx_request_id, None)
+                return_metrics.append({
+                    "ctx_server": ctx_server,
+                    "gen_server": gen_server,
+                    "ctx_perf_metrics": ctx_perf_metrics,
+                    "gen_perf_metrics": gen_perf_metrics,
+                })
+            self.perf_metrics_keys = remain_keys
tests/integration/defs/disaggregated/test_disaggregated.py (1)

529-577: Complete prior fix: prefer json.load(resp) and verify gen metrics schema

You already fixed the urllib import and added a timeout. Finish by parsing JSON directly from the response. Also, confirm whether kv_cache_size resides at gen_perf_metrics top-level or under perf_metrics.

Apply this diff for JSON parsing:

-        with urllib.request.urlopen(f"{server_url}/perf_metrics",
-                                    timeout=10) as resp:
+        with urllib.request.urlopen(f"{server_url}/perf_metrics", timeout=10) as resp:
             assert resp.status == 200
-            perf_metrics = json.loads(resp.read())
+            perf_metrics = json.load(resp)

If kv_cache_size is a gen-level field (as described in the PR), adjust assertions accordingly:

  • Use gen_item = item["gen_perf_metrics"]
  • Keep gen_metrics = gen_item["perf_metrics"] for timing fields
  • Assert "kv_cache_size" in gen_item and related ordering on gen_item fields.

To verify the schema location from the codebase, run:

#!/bin/bash
# Inspect where kv_cache_size is attached in server responses
rg -n -C3 --type=py -P '\bkv_cache_size\b' 

This should show whether kv_cache_size is included inside "perf_metrics" or at the top level of the gen-server metrics object so the test can assert the right path.

🧹 Nitpick comments (3)
tensorrt_llm/serve/openai_disagg_server.py (2)

306-309: Avoid duplicate correlation keys

Repeated appends of the same (ctx_server, gen_server, ctx_request_id) can cause unnecessary work. Guard against duplicates.

Apply this diff:

-            if need_ctx:
-                async with self.perf_metrics_keys_lock:
-                    self.perf_metrics_keys.append((ctx_server, gen_server, req.disaggregated_params.ctx_request_id))
+            if need_ctx:
+                async with self.perf_metrics_keys_lock:
+                    key = (ctx_server, gen_server, req.disaggregated_params.ctx_request_id)
+                    if key not in self.perf_metrics_keys:
+                        self.perf_metrics_keys.append(key)

1-1: Missing NVIDIA copyright header

Per repository guidelines, prepend the NVIDIA SPDX header to this source file.

Example:

# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
tests/integration/defs/disaggregated/test_disaggregated.py (1)

39-45: Make YAML read robust and add return type annotation

Add explicit return type, set encoding, and guard against empty files.

Apply this diff:

-def get_disagg_server_url_from_cfg(config_file: str):
-    with open(config_file, 'r') as file:
-        config = yaml.safe_load(file)
+def get_disagg_server_url_from_cfg(config_file: str) -> str:
+    with open(config_file, 'r', encoding='utf-8') as file:
+        config = yaml.safe_load(file) or {}
     server_host = config.get('hostname', 'localhost')
     server_port = config.get('port', 8000)
     return f"http://{server_host}:{server_port}"
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 611f92b and 848605a.

📒 Files selected for processing (2)
  • tensorrt_llm/serve/openai_disagg_server.py (4 hunks)
  • tests/integration/defs/disaggregated/test_disaggregated.py (6 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/serve/openai_disagg_server.py
  • tests/integration/defs/disaggregated/test_disaggregated.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/serve/openai_disagg_server.py
  • tests/integration/defs/disaggregated/test_disaggregated.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (5)
tensorrt_llm/serve/openai_disagg_server.py (1)

122-122: Route registration looks good

New /perf_metrics endpoint is correctly registered via FastAPI.

tests/integration/defs/disaggregated/test_disaggregated.py (4)

20-20: LGTM: precise typing for callback

Importing Callable and using a typed callback improves clarity.


69-69: Config mapping addition looks good

perf_metrics is correctly wired to the dedicated test config.


165-167: run_disaggregated_test extensibility is clean

Adding prompt_file and extra_endpoints_test parameters is a tidy way to inject endpoint checks.


246-249: Nice hook for endpoint validation

Deferred call to extra_endpoints_test using the same config aligns the test with the server under test.

@reasonsolo
Copy link
Collaborator

reasonsolo commented Aug 19, 2025

I think we should keep only a limited number of metrics on record if the metrics are not returned for each request; otherwise you either get an OOM or FastAPI hangs while processing large JSON data.
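
For reference, the bounded retention being asked for can be illustrated with a plain collections.deque (illustrative only; the names and the limit are not from the PR):

from collections import deque

# Keep at most N of the most recent per-request metric entries; older ones are
# evicted automatically, which bounds memory regardless of how often clients poll.
perf_metrics = deque(maxlen=1000)

for request_id in range(5000):
    perf_metrics.append({"request_id": request_id})

assert len(perf_metrics) == 1000             # only the newest 1000 survive
assert perf_metrics[0]["request_id"] == 4000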

@tensorrt-cicd
Collaborator

PR_Github #15684 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11798 completed with status: 'SUCCESS'

@zhengd-nv zhengd-nv changed the title [None][feat] add perf metrics endpoint to openai server and openai disagg server [TRTLLM-6549][feat] add perf metrics endpoint to openai server and openai disagg server Aug 20, 2025
@zhengd-nv zhengd-nv requested a review from a team as a code owner August 20, 2025 07:44
@zhengd-nv zhengd-nv requested a review from Superjomn August 20, 2025 07:44
@zhengd-nv
Collaborator Author

/bot run

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tensorrt_llm/commands/serve.py (2)

354-357: Wrong exception type caught for Enum lookup; invalid server_role won’t be handled as intended.

Enum item access via ServerRole[name] raises KeyError, not ValueError. The except block won’t execute, and an unhandled KeyError will bubble.

Apply:

-        except ValueError:
+        except KeyError:
             raise ValueError(f"Invalid server role: {server_role}. " \
                              f"Must be one of: {', '.join([role.name for role in ServerRole])}")

321-341: Dropped kwargs from get_llm_args: moe_cluster_parallel_size (and any extras) are ignored.

serve() passes moe_cluster_parallel_size=cluster_size to get_llm_args, which lands in the function’s **llm_args_extra_dict, but that return value is discarded. As a result, cluster_size never reaches update_llm_args_with_extra_dict and is silently ignored.

Patch to preserve and merge the extras returned by get_llm_args with YAML overrides:

-    llm_args, _ = get_llm_args(
+    llm_args, llm_args_extra_dict = get_llm_args(
         model=model,
         tokenizer=tokenizer,
         backend=backend,
         max_beam_width=max_beam_width,
         max_batch_size=max_batch_size,
         max_num_tokens=max_num_tokens,
         max_seq_len=max_seq_len,
         tensor_parallel_size=tp_size,
         pipeline_parallel_size=pp_size,
         moe_expert_parallel_size=ep_size,
         moe_cluster_parallel_size=cluster_size,
         gpus_per_node=gpus_per_node,
         free_gpu_memory_fraction=kv_cache_free_gpu_memory_fraction,
         mamba_ssm_cache_dtype=mamba_ssm_cache_dtype,
         num_postprocess_workers=num_postprocess_workers,
         trust_remote_code=trust_remote_code,
         reasoning_parser=reasoning_parser,
         fail_fast_on_attention_window_too_large=
         fail_fast_on_attention_window_too_large)
 
-    llm_args_extra_dict = {}
     if extra_llm_api_options is not None:
         with open(extra_llm_api_options, 'r') as f:
-            llm_args_extra_dict = yaml.safe_load(f)
+            file_overrides = yaml.safe_load(f) or {}
+            if not isinstance(file_overrides, dict):
+                raise ValueError("extra_llm_api_options must parse to a mapping (YAML dict)")
+            llm_args_extra_dict.update(file_overrides)
     llm_args = update_llm_args_with_extra_dict(llm_args, llm_args_extra_dict)

This fixes cluster_size being dropped and makes future extras resilient.

Also applies to: 342-347

♻️ Duplicate comments (2)
tests/integration/defs/disaggregated/test_disaggregated.py (1)

529-577: Tighten JSON parsing and set a slightly more forgiving timeout

Use json.load on the response file-like object and keep the explicit urllib.request import. Your structure assertions are good.

Apply this diff:

-        import json
-        import urllib.request
+        import json
+        import urllib.request
@@
-        with urllib.request.urlopen(f"{server_url}/perf_metrics",
-                                    timeout=10) as resp:
+        with urllib.request.urlopen(f"{server_url}/perf_metrics",
+                                    timeout=30) as resp:
             assert resp.status == 200
-            perf_metrics = json.loads(resp.read())
+            perf_metrics = json.load(resp)
tensorrt_llm/serve/openai_disagg_server.py (1)

11-11: Use typing generics for Python 3.8 compatibility

The codebase targets Python 3.8+, where PEP 585 built-in generics (list[str], dict[str, …], tuple[...]) are not supported. Replace them with typing.List/Dict/Tuple and import the needed symbols.

Apply this diff to the imports and changed annotations in this file:

-from typing import Optional, Type, Union
+from typing import Dict, List, Optional, Tuple, Type, Union

And update usages within the shown ranges:

-            self.server_perf_metrics: dict[str, dict[int, dict]] = {}
+            self.server_perf_metrics: Dict[str, Dict[int, dict]] = {}

For other occurrences in this file (e.g., method annotations using list[str]), similarly convert to List[str]/Tuple[...].


🧹 Nitpick comments (9)
tensorrt_llm/serve/openai_server.py (4)

88-100: Bounded queue for metrics is good; guard against None config

Deque with maxlen addresses unbounded growth nicely. To avoid surprises if the config remains Optional in some environments, coerce None to 0 before comparison.

Apply this diff:

-            max_perf_metrics = self.llm.args.perf_metrics_max_requests
+            max_perf_metrics = int(self.llm.args.perf_metrics_max_requests or 0)

169-169: Endpoint naming is fine; consider consistency with config knob

Route name /perf_metrics matches functionality. If you want stronger alignment with the controlling flag (return_perf_metrics), leaving as-is is ok. A rename to /request_perf_metrics could be considered later, but not required now.


266-309: Safe swap-and-drain pattern; minor structural suggestion

Swapping out the deque under lock and then formatting outside the lock is correct. A minor ergonomics improvement: instead of mutating elements in-place (metrics_dict["perf_metrics"] = metrics_json), build the final list to avoid mixing raw and formatted entries if future code reads the deque during formatting.
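
A sketch of the suggested shape (hypothetical helper and a reduced field subset, not the PR's code):

def serialize_perf_metrics(raw_entries: list) -> list:
    """Build a new list of JSON-friendly dicts instead of rewriting entries in place."""
    serialized = []
    for entry in raw_entries:
        metrics = entry["perf_metrics"]
        serialized.append({
            "request_id": entry["request_id"],
            "perf_metrics": {
                "first_iter": metrics.first_iter,
                "last_iter": metrics.last_iter,
                # ... remaining timing / KV-cache fields go here ...
            },
        })
    return serialized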


320-336: Ensure request_perf_metrics is present before accessing

Accessing res.outputs[0].request_perf_metrics assumes metrics were collected. If return_perf_metrics is true but upstream didn’t populate request_perf_metrics, this will raise. Add a guard to skip appending when it’s None.

Apply this diff:

-            item = {
-                "request_id": res.request_id,
-                "perf_metrics": res.outputs[0].request_perf_metrics
-            }
+            req_perf = res.outputs[0].request_perf_metrics
+            if req_perf is None:
+                return
+            item = {
+                "request_id": res.request_id,
+                "perf_metrics": req_perf,
+            }
tensorrt_llm/serve/openai_disagg_server.py (3)

146-157: Harden remote fetches: check HTTP status and skip failures

A single failing upstream /perf_metrics currently bubbles up as an exception after buffering. Consider skipping servers that return non-200 or raise client errors, logging a warning, so the endpoint remains robust.

Apply this diff:

-        try:
-            for server in self.ctx_servers + self.gen_servers:
-                async with self.session.get(f"{server}/perf_metrics") as response:
-                    server_perf_metrics = await response.json()
-                    perf_metrics[server] = server_perf_metrics
-        except Exception as e:
-            # Keep the exception to raise it after saving perf metrics
-            exc = e
+        try:
+            for server in self.ctx_servers + self.gen_servers:
+                try:
+                    async with self.session.get(f"{server}/perf_metrics") as response:
+                        if response.status != 200:
+                            logger.warning(f"Skipping {server}/perf_metrics: HTTP {response.status}")
+                            continue
+                        server_perf_metrics = await response.json()
+                        perf_metrics[server] = server_perf_metrics
+                except aiohttp.ClientError as e:
+                    logger.error(f"Failed to fetch {server}/perf_metrics: {e}")
+                    continue
+        except Exception as e:
+            # Keep the exception to raise it after saving perf metrics
+            exc = e

177-191: Avoid data loss when correlating: pop only after both sides present

You already pop the gen side only after checking its presence; consider mirroring the get-then-pop pattern for both sides to avoid races. This is fine currently but can be made more explicit and symmetric.

Apply this diff:

-                gen_perf_metrics = self.server_perf_metrics[gen_server].pop(ctx_request_id, None)
-                if gen_perf_metrics is None:
+                gen_perf_metrics = self.server_perf_metrics[gen_server].get(ctx_request_id)
+                if gen_perf_metrics is None:
                     # generation not finished
                     remain_keys.append((ctx_server, gen_server, ctx_request_id))
                     continue
-                ctx_perf_metrics = self.server_perf_metrics[ctx_server].pop(ctx_request_id, None)
+                ctx_perf_metrics = self.server_perf_metrics[ctx_server].get(ctx_request_id)
+                # both present; remove to avoid duplication next time
+                self.server_perf_metrics[gen_server].pop(ctx_request_id, None)
+                self.server_perf_metrics[ctx_server].pop(ctx_request_id, None)

118-123: create_error_response type is misleading and always raises

Signature says -> ErrorResponse, but it unconditionally raises HTTPException. Either return a JSONResponse with an ErrorResponse body or change the signature/name to reflect raising.

Apply this diff to return a proper response:

-    def create_error_response(
-            message: str,
-            status_code: HTTPStatus = HTTPStatus.BAD_REQUEST) -> ErrorResponse:
-        raise HTTPException(status_code=500, detail=f"Internal server error {message}")
+    def create_error_response(
+            message: str,
+            status_code: HTTPStatus = HTTPStatus.BAD_REQUEST) -> JSONResponse:
+        payload = {"error": message, "code": status_code.value}
+        return JSONResponse(content=payload, status_code=status_code.value)

And ensure call sites use this response (or keep raising where appropriate).

tensorrt_llm/commands/serve.py (2)

1-1: Missing NVIDIA copyright header.

Per repository guidelines, prepend the current year NVIDIA header to all source files.

Add at the very top:

# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

206-210: Help text contradicts the default backend.

Help says “Default is cpp path” while default is "pytorch".

Consider:

-              help="Set to 'pytorch' for pytorch path. Default is cpp path.")
+              help="Set to 'pytorch' for PyTorch path. Default is 'pytorch'.")
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 848605a and 733fd6d.

📒 Files selected for processing (9)
  • tensorrt_llm/commands/serve.py (3 hunks)
  • tensorrt_llm/llmapi/disagg_utils.py (4 hunks)
  • tensorrt_llm/llmapi/llm_args.py (1 hunks)
  • tensorrt_llm/serve/openai_disagg_server.py (6 hunks)
  • tensorrt_llm/serve/openai_server.py (9 hunks)
  • tests/integration/defs/disaggregated/test_configs/disagg_config_metrics.yaml (1 hunks)
  • tests/integration/defs/disaggregated/test_disaggregated.py (6 hunks)
  • tests/integration/defs/test_e2e.py (1 hunks)
  • tests/unittest/llmapi/apps/_test_openai_perf_metrics.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • tests/integration/defs/disaggregated/test_configs/disagg_config_metrics.yaml
  • tests/integration/defs/test_e2e.py
  • tests/unittest/llmapi/apps/_test_openai_perf_metrics.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/llmapi/llm_args.py
  • tests/integration/defs/disaggregated/test_disaggregated.py
  • tensorrt_llm/llmapi/disagg_utils.py
  • tensorrt_llm/serve/openai_server.py
  • tensorrt_llm/commands/serve.py
  • tensorrt_llm/serve/openai_disagg_server.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/llmapi/llm_args.py
  • tests/integration/defs/disaggregated/test_disaggregated.py
  • tensorrt_llm/llmapi/disagg_utils.py
  • tensorrt_llm/serve/openai_server.py
  • tensorrt_llm/commands/serve.py
  • tensorrt_llm/serve/openai_disagg_server.py
🧬 Code Graph Analysis (5)
tensorrt_llm/llmapi/llm_args.py (1)
tensorrt_llm/builder.py (1)
  • default (50-58)
tests/integration/defs/disaggregated/test_disaggregated.py (2)
tests/integration/defs/conftest.py (4)
  • disaggregated_test_root (2339-2344)
  • llm_venv (707-723)
  • disaggregated_example_root (270-275)
  • llama_model_root (964-1039)
tensorrt_llm/serve/openai_disagg_server.py (1)
  • perf_metrics (146-192)
tensorrt_llm/serve/openai_server.py (5)
tensorrt_llm/serve/openai_disagg_server.py (1)
  • perf_metrics (146-192)
tensorrt_llm/_utils.py (1)
  • set_prometheus_multiproc_dir (1118-1129)
tensorrt_llm/metrics/collector.py (2)
  • MetricsCollector (10-105)
  • log_metrics_dict (100-105)
tensorrt_llm/llmapi/llm.py (1)
  • RequestOutput (47-87)
tensorrt_llm/executor/result.py (3)
  • finished (544-545)
  • outputs (197-212)
  • request_id (521-522)
tensorrt_llm/commands/serve.py (1)
tensorrt_llm/llmapi/disagg_utils.py (2)
  • MetadataServerConfig (59-64)
  • ServerRole (19-22)
tensorrt_llm/serve/openai_disagg_server.py (3)
tensorrt_llm/llmapi/disagg_utils.py (3)
  • DisaggServerConfig (47-55)
  • MetadataServerConfig (59-64)
  • get_ctx_gen_server_urls (67-77)
tensorrt_llm/serve/metadata_server.py (7)
  • create_metadata_server (80-95)
  • get (17-18)
  • get (38-39)
  • get (66-68)
  • keys (29-30)
  • keys (47-58)
  • keys (76-77)
tensorrt_llm/serve/router.py (2)
  • create_router (615-653)
  • KvCacheAwareRouter (512-612)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (10)
tensorrt_llm/llmapi/disagg_utils.py (2)

55-56: Config knob addition looks good

DisaggServerConfig now carries perf_metrics_max_requests with a sane default. This aligns with the OpenAIDisaggServer usage.


131-135: Constructor wiring is consistent

Passing perf_metrics_max_requests into DisaggServerConfig is correct and consistent with new config flow.

tests/integration/defs/disaggregated/test_disaggregated.py (4)

20-20: Import Callable is appropriate

Using Callable for the new hook type improves clarity and type hints for editors.


39-45: Helper to derive server URL from YAML config looks good

Simple and robust defaults when hostname/port are not present. No issues.


166-167: Extending test harness with extra_endpoints_test is a nice pattern

The injection point cleanly decouples endpoint assertions from the core E2E loop.


246-249: Hook usage is correct

Calling the endpoint verifier after each iteration ensures the metrics endpoint gets exercised. Looks good.

tensorrt_llm/serve/openai_disagg_server.py (1)

55-67: Good: opt-in correlation buffers and locks are only instantiated when enabled

The gating on perf_metrics_max_requests > 0 avoids unnecessary overhead. No issues.

tensorrt_llm/commands/serve.py (3)

6-6: Typing import cleanup is correct.

Removing unused typing imports and keeping only Any, Optional aligns with the current usage.


22-24: serve.py import cleanup verified

No lingering references to CtxGenServerConfig in tensorrt_llm/commands/serve.py — the class remains in disagg_utils (and its tests) by design. Dropping the import here is correct.


473-476: Perf metrics retention is properly bounded

The DisaggServerConfig (llmapi/disagg_utils.py) wires perf_metrics_max_requests (default 0), and both servers enforce limits:

  • In openai_disagg_server.py
    self.perf_metrics_keys = deque(maxlen=self.perf_metrics_max_requests)
    self.server_perf_metrics prunes oldest entries when its length exceeds perf_metrics_max_requests.
    /perf_metrics endpoint is registered in register_routes.
  • In openai_server.py
    self.perf_metrics = deque(maxlen=self.llm.args.perf_metrics_max_requests)
    • Reinitialized on each fetch to keep only the last N entries.
    /perf_metrics endpoint is registered.

Defaults, wiring, and bounded retention are all correctly implemented—no changes needed.

@tensorrt-cicd
Collaborator

PR_Github #15890 [ run ] triggered by Bot

@zhengd-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15893 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15890 [ run ] completed with state ABORTED

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tensorrt_llm/llmapi/disagg_utils.py (1)

1-6: Add NVIDIA copyright header

This source file is missing the required NVIDIA copyright/SPDX header.

Apply at the top of the file:

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
tensorrt_llm/serve/openai_disagg_server.py (1)

445-451: Fix Python 3.8 typing in method annotations

Replace list[str] with List[str] and import List from typing (already added above).

Apply this diff:

-    async def wait_for_all_servers_ready(cls, session: aiohttp.ClientSession,
-                                         ctx_servers: list[str],
-                                         gen_servers: list[str],
+    async def wait_for_all_servers_ready(cls, session: aiohttp.ClientSession,
+                                         ctx_servers: List[str],
+                                         gen_servers: List[str],
                                          server_start_timeout_secs: int = 180):
-        async def get_unready_servers(servers: list[str]) -> list[str]:
+        async def get_unready_servers(servers: List[str]) -> List[str]:
♻️ Duplicate comments (3)
tensorrt_llm/serve/openai_disagg_server.py (3)

150-160: Harden cross-server fetch: check HTTP status and skip failures (don’t raise for a single bad server)

A failing server currently bubbles an exception after caching metrics, breaking the whole endpoint. Skip failed/non-200 servers and continue aggregating.

Apply this diff:

-        perf_metrics = {}
-        exc = None
-        try:
-            for server in self.ctx_servers + self.gen_servers:
-                async with self.session.get(f"{server}/perf_metrics") as response:
-                    server_perf_metrics = await response.json()
-                    perf_metrics[server] = server_perf_metrics
-        except Exception as e:
-            # Keep the exception to raise it after saving perf metrics
-            exc = e
+        perf_metrics = {}
+        for server in self.ctx_servers + self.gen_servers:
+            try:
+                async with self.session.get(f"{server}/perf_metrics") as response:
+                    if response.status != 200:
+                        logger.warning(f"Skipping {server}/perf_metrics: HTTP {response.status}")
+                        continue
+                    server_perf_metrics = await response.json()
+                    perf_metrics[server] = server_perf_metrics
+            except aiohttp.ClientError as e:
+                logger.error(f"Failed to fetch {server}/perf_metrics: {e}")
+                continue

And remove the exc handling below since we no longer re-raise.


11-12: Use typing.Dict/List/Tuple for Python 3.8 compatibility

PEP 585 built-in generics in type hints aren’t supported on 3.8. Import and use typing equivalents.

Apply this diff:

-from typing import Optional, Type, Union
+from typing import Dict, List, Optional, Tuple, Type, Union

Additionally, update annotations below as suggested in related diffs.


178-191: Avoid data loss when correlating ctx/gen metrics (don’t pop until both present)

If gen is ready but ctx isn’t, current code pops gen and drops the pair. Peek for both; only pop when both are present.

Apply this diff:

-            remain_keys = []
-            for ctx_server, gen_server, ctx_request_id in self.perf_metrics_keys:
-                gen_perf_metrics = self.server_perf_metrics[gen_server].pop(ctx_request_id, None)
-                if gen_perf_metrics is None:
-                    # generation not finished
-                    remain_keys.append((ctx_server, gen_server, ctx_request_id))
-                    continue
-                ctx_perf_metrics = self.server_perf_metrics[ctx_server].pop(ctx_request_id, None)
-                return_metrics.append({
-                    "ctx_server": ctx_server,
-                    "gen_server": gen_server,
-                    "ctx_perf_metrics": ctx_perf_metrics,
-                    "gen_perf_metrics": gen_perf_metrics})
+            remain_keys = []
+            for ctx_server, gen_server, ctx_request_id in self.perf_metrics_keys:
+                ctx_perf_metrics = self.server_perf_metrics.get(ctx_server, {}).get(ctx_request_id)
+                gen_perf_metrics = self.server_perf_metrics.get(gen_server, {}).get(ctx_request_id)
+                if not (ctx_perf_metrics and gen_perf_metrics):
+                    # one side not finished yet
+                    remain_keys.append((ctx_server, gen_server, ctx_request_id))
+                    continue
+                # both present; remove and return
+                self.server_perf_metrics[ctx_server].pop(ctx_request_id, None)
+                self.server_perf_metrics[gen_server].pop(ctx_request_id, None)
+                return_metrics.append({
+                    "ctx_server": ctx_server,
+                    "gen_server": gen_server,
+                    "ctx_perf_metrics": ctx_perf_metrics,
+                    "gen_perf_metrics": gen_perf_metrics})
🧹 Nitpick comments (3)
tensorrt_llm/llmapi/disagg_utils.py (1)

71-76: Validate server configs to avoid invalid URLs (http://None:None)

If hostname/port are unset, this builds invalid URLs. Fail fast with a clear error.

Apply this diff:

     ctx_server_urls = []
     gen_server_urls = []
     for cfg in server_configs:
-        if cfg.type == "ctx":
-            ctx_server_urls.append(f"http://{cfg.hostname}:{cfg.port}")
-        else:
-            gen_server_urls.append(f"http://{cfg.hostname}:{cfg.port}")
+        if cfg.hostname is None or cfg.port is None:
+            raise ValueError(f"Missing hostname/port for server config: {cfg}")
+        if cfg.type == "ctx":
+            ctx_server_urls.append(f"http://{cfg.hostname}:{cfg.port}")
+        elif cfg.type == "gen":
+            gen_server_urls.append(f"http://{cfg.hostname}:{cfg.port}")
+        else:
+            raise ValueError(f"Unknown server type: {cfg.type}")
tensorrt_llm/serve/openai_disagg_server.py (2)

36-36: Fix comment typo

Nit: “enale” -> “enable”.

Apply this diff:

-# yapf: enale
+# yapf: enable

401-401: Use typing.Callable for create_generator type hint

Minor typing cleanup; also import Callable from typing.

Apply this diff:

-                           response_type: Type[Union[CompletionResponse, ChatCompletionResponse]],
-                           create_generator: callable) -> Union[CompletionResponse, ChatCompletionResponse, StreamingResponse]:
+                           response_type: Type[Union[CompletionResponse, ChatCompletionResponse]],
+                           create_generator: Callable) -> Union[CompletionResponse, ChatCompletionResponse, StreamingResponse]:

Outside this range (imports):

from typing import Callable
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 733fd6d and d5f818a.

📒 Files selected for processing (6)
  • tensorrt_llm/llmapi/disagg_utils.py (4 hunks)
  • tensorrt_llm/llmapi/llm_args.py (1 hunks)
  • tensorrt_llm/serve/openai_disagg_server.py (6 hunks)
  • tests/integration/defs/disaggregated/test_disaggregated.py (6 hunks)
  • tests/integration/test_lists/test-db/l0_a10.yml (1 hunks)
  • tests/integration/test_lists/test-db/l0_h100.yml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tensorrt_llm/llmapi/llm_args.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/llmapi/disagg_utils.py
  • tests/integration/defs/disaggregated/test_disaggregated.py
  • tensorrt_llm/serve/openai_disagg_server.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/llmapi/disagg_utils.py
  • tests/integration/defs/disaggregated/test_disaggregated.py
  • tensorrt_llm/serve/openai_disagg_server.py
🧬 Code Graph Analysis (2)
tests/integration/defs/disaggregated/test_disaggregated.py (2)
tests/integration/defs/conftest.py (4)
  • disaggregated_test_root (2339-2344)
  • llm_venv (707-723)
  • disaggregated_example_root (270-275)
  • llama_model_root (964-1039)
tensorrt_llm/serve/openai_disagg_server.py (1)
  • perf_metrics (146-193)
tensorrt_llm/serve/openai_disagg_server.py (3)
tensorrt_llm/llmapi/disagg_utils.py (3)
  • DisaggServerConfig (47-55)
  • MetadataServerConfig (59-64)
  • get_ctx_gen_server_urls (67-78)
tensorrt_llm/serve/metadata_server.py (7)
  • create_metadata_server (80-95)
  • get (17-18)
  • get (38-39)
  • get (66-68)
  • keys (29-30)
  • keys (47-58)
  • keys (76-77)
tensorrt_llm/serve/router.py (2)
  • create_router (615-653)
  • KvCacheAwareRouter (512-612)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (8)
tests/integration/test_lists/test-db/l0_a10.yml (1)

28-28: Add perf-metrics E2E to A10 pre-merge: LGTM

The new entry properly wires the OpenAI perf-metrics E2E test into the A10 list.

tensorrt_llm/llmapi/disagg_utils.py (1)

55-55: New config field perf_metrics_max_requests: LGTM

The addition is consistent with the server’s retention policy and defaulting to 0 disables buffering.

tests/integration/defs/disaggregated/test_disaggregated.py (4)

39-45: Helper to resolve disagg server URL from YAML: LGTM

Simple and correct; aligns with the test configs’ hostname/port keys.


165-166: Extend runner with prompt_file and extra_endpoints_test: LGTM

The new parameters integrate cleanly with existing client invocations and post-iteration hooks.


246-249: Post-run endpoint checks hook: LGTM

Calling the extra test after each iteration is useful for validating /perf_metrics without interfering with output verification.


529-576: Disaggregated /perf_metrics test: LGTM

Good use of urlopen with timeout and json.load; field checks and temporal ordering assertions look solid.

tests/integration/test_lists/test-db/l0_h100.yml (1)

67-67: Add perf-metrics disagg test to H100 pre-merge: LGTM

Appropriately placed alongside other disaggregated tests.

tensorrt_llm/serve/openai_disagg_server.py (1)

127-128: Expose /perf_metrics route: LGTM

Public endpoint registration is correct and matches the tests’ expectations.

@tensorrt-cicd
Collaborator

PR_Github #15893 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11946 completed with status: 'FAILURE'

@zhengd-nv
Collaborator Author

/bot run

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tensorrt_llm/serve/openai_disagg_server.py (2)

415-419: Bug: aiohttp ClientResponse has no .ok; use status/raise_for_status instead

response.ok is a requests/HTTPX-ism; in aiohttp this will raise AttributeError at runtime.

-                        if not response.ok:
-                            logger.error(f"Received failed response {response_dict}")
-                            response.raise_for_status()
+                        if response.status >= 400:
+                            logger.error(f"Received failed response {response.status}: {response_dict}")
+                            response.raise_for_status()

466-469: Timeout handling catches wrong exception type

asyncio.wait_for raises asyncio.TimeoutError, not CancelledError. As-is, timeout won’t be turned into the intended TimeoutError.

-        try:
-            await asyncio.wait_for(check_all_servers_ready(), timeout=server_start_timeout_secs)
-        except asyncio.CancelledError:
-            raise TimeoutError("Timeout waiting for context and generation servers to be ready")
+        try:
+            await asyncio.wait_for(check_all_servers_ready(), timeout=server_start_timeout_secs)
+        except asyncio.TimeoutError:
+            raise TimeoutError("Timeout waiting for context and generation servers to be ready")
♻️ Duplicate comments (7)
tensorrt_llm/serve/openai_disagg_server.py (7)

446-453: Python 3.8 typing: replace list[str] with List[str]

Adhere to the 3.8+ target and repo policy.

-    async def wait_for_all_servers_ready(cls, session: aiohttp.ClientSession,
-                                         ctx_servers: list[str],
-                                         gen_servers: list[str],
+    async def wait_for_all_servers_ready(cls, session: aiohttp.ClientSession,
+                                         ctx_servers: List[str],
+                                         gen_servers: List[str],
                                          server_start_timeout_secs: int = 180):
-        async def get_unready_servers(servers: list[str]) -> list[str]:
+        async def get_unready_servers(servers: List[str]) -> List[str]:

11-11: Typing imports incomplete; add typing generics for 3.8 compatibility

You use Dict/List/Tuple/Deque/Callable/AsyncIterator in suggested fixes and to avoid PEP 585 built-ins. Extend imports.

-from typing import Optional, Type, Union
+from typing import AsyncIterator, Callable, Deque, Dict, List, Optional, Tuple, Type, Union

59-63: Python 3.8 compatibility: avoid PEP 585 built-in generics and add precise annotations

Replace built-in generics and annotate the deque precisely. Also keep server_perf_metrics’ value type stable.

-            # (ctx_server, gen_server, ctx_request_id)
-            self.perf_metrics_keys = deque(maxlen=self.perf_metrics_max_requests)
+            # (ctx_server, gen_server, ctx_request_id)
+            self.perf_metrics_keys: Deque[Tuple[str, str, int]]
+            self.perf_metrics_keys = deque(maxlen=self.perf_metrics_max_requests)
             self.perf_metrics_keys_lock = asyncio.Lock()
             # server_key -> {ctx_request_id: perf_metrics}
-            self.server_perf_metrics: dict[str, dict[int, dict]] = {}
+            self.server_perf_metrics: Dict[str, Dict[int, dict]] = {}

180-193: Fix data loss when correlating ctx/gen metrics: only pop after both sides are present

Current flow pops gen first; if ctx isn’t ready yet, you emit a record with ctx=None and drop gen, losing correlation forever. Peek with get(), only pop when both exist, and retain the key otherwise.

-            remain_keys = []
-            for ctx_server, gen_server, ctx_request_id in self.perf_metrics_keys:
-                gen_perf_metrics = self.server_perf_metrics[gen_server].pop(ctx_request_id, None)
-                if gen_perf_metrics is None:
-                    # generation not finished
-                    remain_keys.append((ctx_server, gen_server, ctx_request_id))
-                    continue
-                ctx_perf_metrics = self.server_perf_metrics[ctx_server].pop(ctx_request_id, None)
-                return_metrics.append({
-                    "ctx_server": ctx_server,
-                    "gen_server": gen_server,
-                    "ctx_perf_metrics": ctx_perf_metrics,
-                    "gen_perf_metrics": gen_perf_metrics})
+            remain_keys = []
+            for ctx_server, gen_server, ctx_request_id in self.perf_metrics_keys:
+                ctx_store = self.server_perf_metrics.get(ctx_server, {})
+                gen_store = self.server_perf_metrics.get(gen_server, {})
+                ctx_perf_metrics = ctx_store.get(ctx_request_id)
+                gen_perf_metrics = gen_store.get(ctx_request_id)
+                if not (ctx_perf_metrics and gen_perf_metrics):
+                    # one side not finished yet; keep the key
+                    remain_keys.append((ctx_server, gen_server, ctx_request_id))
+                    continue
+                # both present; now remove from stores to avoid duplication
+                ctx_store.pop(ctx_request_id, None)
+                gen_store.pop(ctx_request_id, None)
+                return_metrics.append({
+                    "ctx_server": ctx_server,
+                    "gen_server": gen_server,
+                    "ctx_perf_metrics": ctx_perf_metrics,
+                    "gen_perf_metrics": gen_perf_metrics,
+                })

1-1: Add NVIDIA SPDX header per repo guidelines

Source files must include the NVIDIA copyright/SPDX header.

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 #!/usr/bin/env python

59-63: Audit and update PEP 585 built-in generics for Python 3.8 compatibility

The repository currently uses built-in generics (e.g. list[str], dict[str, …], tuple[int, …], deque[int]) which are only supported on Python 3.9+. Since our target is Python 3.8, these must be replaced with the corresponding types from the typing module.

• In tensorrt_llm/serve/openai_disagg_server.py
  – Lines 59–63 (add from typing import Dict at the top of the file):

-            self.server_perf_metrics: dict[str, dict[int, dict]] = {}
+            self.server_perf_metrics: Dict[str, Dict[int, Dict]] = {}

  – Lines 446–453 (method signature; add from typing import List):

-        async def get_unready_servers(servers: list[str]) -> list[str]:
+        async def get_unready_servers(servers: List[str]) -> List[str]:

• Apply the same conversion across the codebase wherever you see:

  • list[...]typing.List[...]
  • dict[...]typing.Dict[...]
  • set[...]typing.Set[...]
  • tuple[...]typing.Tuple[...]
  • deque[...]typing.Deque[...]

• Ensure you add the necessary imports, e.g.

from typing import List, Dict, Tuple, Set, Deque
from collections import deque  # if using deque instances

This refactor is mandatory to maintain compatibility with Python 3.8.


146-160: Harden /perf_metrics aggregation: check HTTP status, handle errors per-server, don’t raise global exception

As-is, any exception in the loop defers raising and then 500s the endpoint after partially updating caches. Prefer skipping failed servers and proceeding. Also check response.status before parsing JSON.

-        perf_metrics = {}
-        exc = None
-        try:
-            for server in self.ctx_servers + self.gen_servers:
-                async with self.session.get(f"{server}/perf_metrics") as response:
-                    server_perf_metrics = await response.json()
-                    perf_metrics[server] = server_perf_metrics
-        except Exception as e:
-            # Keep the exception to raise it after saving perf metrics
-            exc = e
+        perf_metrics: Dict[str, list] = {}
+        for server in self.ctx_servers + self.gen_servers:
+            try:
+                async with self.session.get(f"{server}/perf_metrics") as response:
+                    if response.status != 200:
+                        logger.warning(f"Skipping {server}/perf_metrics: HTTP {response.status}")
+                        continue
+                    try:
+                        server_perf_metrics = await response.json()
+                    except Exception as je:
+                        logger.warning(f"Invalid JSON from {server}/perf_metrics: {je}")
+                        continue
+                    perf_metrics[server] = server_perf_metrics
+            except aiohttp.ClientError as ce:
+                logger.error(f"Failed to fetch {server}/perf_metrics: {ce}")
+                continue

Optional (latency): fetch all servers concurrently with asyncio.gather and per-task error handling. Want a suggested patch?
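
For reference, a rough sketch of that concurrent variant (illustrative only, not the code in this PR; it assumes the aiohttp session and the ctx/gen server lists already held by the class):

import asyncio
from typing import Dict, List, Optional

import aiohttp


async def fetch_all_perf_metrics(session: aiohttp.ClientSession,
                                 servers: List[str]) -> Dict[str, list]:
    """Fetch /perf_metrics from every server concurrently, skipping failures."""

    async def fetch_one(server: str) -> Optional[list]:
        try:
            async with session.get(f"{server}/perf_metrics") as response:
                if response.status != 200:
                    return None
                return await response.json()
        except aiohttp.ClientError:
            return None

    results = await asyncio.gather(*(fetch_one(s) for s in servers))
    return {s: r for s, r in zip(servers, results) if r is not None}

asyncio.gather preserves input order, so zipping the results with servers keeps the mapping stable even when some fetches fail.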

🧹 Nitpick comments (5)
tensorrt_llm/serve/openai_disagg_server.py (5)

399-405: Tighten type of create_generator parameter

Use typing.Callable with an explicit signature that matches your generator helpers.

-    async def send_request(self, url: str,
+    async def send_request(self, url: str,
                            request: Union[CompletionRequest, ChatCompletionRequest],
                            endpoint: str,
                            response_type: Type[Union[CompletionResponse, ChatCompletionResponse]],
-                           create_generator: callable) -> Union[CompletionResponse, ChatCompletionResponse, StreamingResponse]:
+                           create_generator: Callable[[str, Union[CompletionRequest, ChatCompletionRequest]], AsyncIterator[bytes]]
+                           ) -> Union[CompletionResponse, ChatCompletionResponse, StreamingResponse]:

334-337: Consider awaiting key registration to reduce race window

Using asyncio.create_task makes correlation key insertion eventual. If /perf_metrics is called immediately, that request may be missed until the next poll. If acceptable, ignore; otherwise await self._add_perf_metrics_keys(...) here. Minor.

If you want to quantify any impact, I can script a stress test that triggers /perf_metrics concurrently with request dispatch to measure missed correlations.
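
To make the "eventual" part concrete, a tiny self-contained toy (names are illustrative) shows that a create_task'd append is not visible until the event loop runs the task, while an awaited call completes immediately:

import asyncio
from collections import deque


async def main():
    keys = deque(maxlen=8)

    async def add_key(key):
        keys.append(key)

    # Fire-and-forget: the task has not run yet, so a concurrent
    # /perf_metrics-style reader at this point would miss the key.
    asyncio.create_task(add_key("fire-and-forget"))
    print("after create_task:", list(keys))  # []

    # Awaiting completes the append before continuing.
    await add_key("awaited")
    print("after await:", list(keys))  # ['awaited']

    # Yield once to the loop; the scheduled task has now run as well.
    await asyncio.sleep(0)
    print("after yielding:", list(keys))  # ['awaited', 'fire-and-forget']


asyncio.run(main())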


36-36: Nit: fix typo in yapf directive

Spelling: “enable”.

-# yapf: enale
+# yapf: enable

118-123: create_error_response ignores provided status and always raises 500

Either return a proper JSONResponse with the given status or raise HTTPException(status_code=status_code, ...). Otherwise callers can’t control status codes.

-    def create_error_response(
-            message: str,
-            status_code: HTTPStatus = HTTPStatus.BAD_REQUEST) -> ErrorResponse:
-        raise HTTPException(status_code=500, detail=f"Internal server error {message}")
+    def create_error_response(
+            message: str,
+            status_code: HTTPStatus = HTTPStatus.BAD_REQUEST) -> ErrorResponse:
+        raise HTTPException(status_code=int(status_code), detail=str(message))

146-196: Add tests for /perf_metrics error-paths and pruning

  • One server returns non-200; verify endpoint still returns aggregated results.
  • JSON parse error from a server; verify skip.
  • Correlation path when ctx arrives before gen and vice versa; verify no data loss.
  • Pruning: exceed perf_metrics_max_requests; verify oldest entries are dropped deterministically.

I can draft pytest cases mirroring tests/unittest/llmapi/apps/_test_openai_perf_metrics.py style.
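
As a starting point, a minimal fixture-free sketch of the pruning/drain cases on plain data structures (illustrative only; the real tests would exercise the HTTP endpoint like _test_openai_perf_metrics.py does):

from collections import deque


def test_pruning_drops_oldest_deterministically():
    max_requests = 3
    buf = deque(maxlen=max_requests)
    for request_id in range(5):
        buf.append({"request_id": request_id})
    # Exceeding the limit drops the oldest entries first.
    assert [m["request_id"] for m in buf] == [2, 3, 4]


def test_each_entry_is_returned_at_most_once():
    max_requests = 3
    buf = deque([{"request_id": 1}], maxlen=max_requests)
    # The endpoint swaps in a fresh deque on read, so a second read is empty.
    returned, buf = list(buf), deque(maxlen=max_requests)
    assert [m["request_id"] for m in returned] == [1]
    assert list(buf) == []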

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between d5f818a and 3efaca3.

📒 Files selected for processing (2)
  • tensorrt_llm/serve/openai_disagg_server.py (6 hunks)
  • tests/integration/test_lists/test-db/l0_h100.yml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/integration/test_lists/test-db/l0_h100.yml
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/serve/openai_disagg_server.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/serve/openai_disagg_server.py
🧠 Learnings (1)
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tensorrt_llm/serve/openai_disagg_server.py
🧬 Code Graph Analysis (1)
tensorrt_llm/serve/openai_disagg_server.py (3)
tensorrt_llm/llmapi/disagg_utils.py (3)
  • DisaggServerConfig (47-55)
  • MetadataServerConfig (59-64)
  • get_ctx_gen_server_urls (67-78)
tensorrt_llm/serve/metadata_server.py (7)
  • create_metadata_server (80-95)
  • get (17-18)
  • get (38-39)
  • get (66-68)
  • keys (29-30)
  • keys (47-58)
  • keys (76-77)
tensorrt_llm/serve/router.py (2)
  • create_router (615-653)
  • KvCacheAwareRouter (512-612)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@tensorrt-cicd
Collaborator

PR_Github #15995 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15995 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12022 completed with status: 'FAILURE'

@zhengd-nv zhengd-nv requested a review from a team as a code owner August 22, 2025 02:13
@zhengd-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16100 [ run ] triggered by Bot

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/unittest/api_stability/references/llm.yaml (1)

134-137: Consider grouping with the Stats section for readability

perf_metrics_max_requests is conceptually tied to return_perf_metrics and other stats controls. For maintainers scanning this reference, placing it under the “# Stats” block (near return_perf_metrics) would improve discoverability. No functional change—purely organizational.

Proposed edit (move block near the Stats entries):

@@
-      print_iter_log:
-        annotation: bool
-        default: False
-        status: beta
-      perf_metrics_max_requests:
-        annotation: int
-        default: 0
-        status: prototype
+      print_iter_log:
+        annotation: bool
+        default: False
+        status: beta
@@
       return_perf_metrics:
         annotation: bool
         default: False
         status: prototype
+      perf_metrics_max_requests:
+        annotation: int
+        default: 0
+        status: prototype
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 3efaca3 and 275f920.

📒 Files selected for processing (1)
  • tests/unittest/api_stability/references/llm.yaml (1 hunks)

@tensorrt-cicd
Collaborator

PR_Github #16100 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12108 completed with status: 'FAILURE'

@zhengd-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16143 [ run ] triggered by Bot

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
tests/integration/defs/disaggregated/test_disaggregated.py (1)

543-551: Past feedback addressed: correct urllib import, timeout, and JSON parsing

Thanks for switching to urllib.request, adding a timeout, and using json.load on the response object. This resolves the previously raised runtime and robustness issues.

🧹 Nitpick comments (3)
tests/integration/defs/disaggregated/test_disaggregated.py (3)

39-45: Add return type annotation and a brief docstring; open YAML with explicit encoding

Minor polish to improve readability and consistency with the rest of the file where files are opened with encoding.

-def get_disagg_server_url_from_cfg(config_file: str):
-    with open(config_file, 'r') as file:
+def get_disagg_server_url_from_cfg(config_file: str) -> str:
+    """Return the disaggregated server base URL (http://host:port) from a YAML config."""
+    with open(config_file, 'r', encoding='utf-8') as file:
         config = yaml.safe_load(file)
     server_host = config.get('hostname', 'localhost')
     server_port = config.get('port', 8000)
     return f"http://{server_host}:{server_port}"

246-249: Avoid re-parsing YAML each loop iteration; compute server_url once

We open and parse the config file on every iteration just to derive the URL. Cache it once for slight speedup and less IO.

-                if extra_endpoints_test is not None:
-                    extra_endpoints_test(
-                        get_disagg_server_url_from_cfg(config_file))
+                if extra_endpoints_test is not None:
+                    extra_endpoints_test(server_url)

Add this initialization near where config_file is determined (outside the loop):

# After obtaining config_file and before entering the for-loop
server_url = get_disagg_server_url_from_cfg(config_file)

547-552: Add a retry loop when polling /perf_metrics to reduce flakes

I’ve verified that all metric producers emit both timing_metrics and kv_cache_metrics (see tensorrt_llm/serve/openai_server.py around lines 273–298), so it’s safe to wait briefly for the aggregator to combine them.

Please consider updating the test at tests/integration/defs/disaggregated/test_disaggregated.py (around lines 547–552) as follows:

-        with urllib.request.urlopen(f"{server_url}/perf_metrics",
-                                    timeout=10) as resp:
-            assert resp.status == 200
-            perf_metrics = json.load(resp)
-        assert len(perf_metrics) > 0
+        import time
+        perf_metrics = []
+        for _ in range(10):  # up to ~10s total
+            with urllib.request.urlopen(f"{server_url}/perf_metrics", timeout=10) as resp:
+                assert resp.status == 200
+                perf_metrics = json.load(resp)
+            if perf_metrics:
+                break
+            time.sleep(1)
+        assert perf_metrics, "No perf metrics returned after retries"
  • Location to update:
    • tests/integration/defs/disaggregated/test_disaggregated.py lines 547–552
  • Verified producers include kv_cache_metrics:
    • tensorrt_llm/serve/openai_server.py lines 273–298 [run_scripts output]

Optional refactor to improve test stability without major changes to production code.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 275f920 and ec90693.

📒 Files selected for processing (3)
  • tests/integration/defs/disaggregated/test_disaggregated.py (6 hunks)
  • tests/integration/defs/test_e2e.py (1 hunks)
  • tests/integration/test_lists/test-db/l0_h100.yml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/integration/test_lists/test-db/l0_h100.yml
  • tests/integration/defs/test_e2e.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tests/integration/defs/disaggregated/test_disaggregated.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tests/integration/defs/disaggregated/test_disaggregated.py
🧬 Code graph analysis (1)
tests/integration/defs/disaggregated/test_disaggregated.py (2)
tests/integration/defs/conftest.py (4)
  • disaggregated_test_root (2339-2344)
  • llm_venv (707-723)
  • disaggregated_example_root (270-275)
  • llama_model_root (964-1039)
tensorrt_llm/serve/openai_disagg_server.py (1)
  • perf_metrics (146-195)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tests/integration/defs/disaggregated/test_disaggregated.py (3)

20-20: Type annotations import looks good

Adding Callable is appropriate for the new callback parameter.


165-166: Signature extension is clear and typed

Adding prompt_file default and the typed extra_endpoints_test callback reads well.


69-69: perf_metrics config reference verified – LGTM

Verified that disagg_config_metrics.yaml exists under tests/integration/defs/disaggregated/test_configs/ and that the mapping in tests/integration/defs/disaggregated/test_disaggregated.py (line 69) correctly references it. No further changes needed.

@tensorrt-cicd
Collaborator

PR_Github #16143 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12142 completed with status: 'FAILURE'

@chuangz0
Collaborator

Some timings are taken from std::chrono::steady_clock::now(), which cannot be compared across different nodes.
Metrics that need to be correlated across nodes may need to use system_clock or some other way to align them.

Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
Signed-off-by: zhengd-nv <200704041+zhengd-nv@users.noreply.github.com>
@zhengd-nv
Collaborator Author

zhengd-nv commented Aug 25, 2025

Some timings are taken from std::chrono::steady_clock::now(), which cannot be compared across different nodes. Metrics that need to be correlated across nodes may need to use system_clock or some other way to align them.

The steady clock and Python's time.monotonic() both generally measure time since boot, so as a workaround we can run python -c "import time; print(time.time() - time.monotonic())" on each node to get the per-node offset. The system clock is not recommended for measuring durations, so what we really need is a synchronization point to align the steady clocks across nodes, which is beyond the scope of this PR.
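
For illustration, the per-node offset workaround can be wrapped like this (a sketch only; the helper names are made up, and accuracy is bounded by how well the nodes' wall clocks are NTP-synchronized):

import time


def monotonic_offset() -> float:
    """Run on each node: wall-clock time minus monotonic (steady-clock) time."""
    return time.time() - time.monotonic()


def to_wallclock(monotonic_ts: float, node_offset: float) -> float:
    """Map a node-local steady-clock timestamp to an approximate wall-clock time."""
    return monotonic_ts + node_offset


# Comparing a ctx timestamp from node A with a gen timestamp from node B:
#   to_wallclock(t_ctx, offset_a)  vs.  to_wallclock(t_gen, offset_b)
if __name__ == "__main__":
    print(f"offset on this node: {monotonic_offset():.6f} s")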

@zhengd-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16349 [ run ] triggered by Bot

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (4)
tests/integration/defs/disaggregated/test_disaggregated.py (2)

544-551: Fix urllib import and add timeout for robustness

The code uses urllib.request.urlopen but only imports urllib, which will cause an AttributeError at runtime. Also, add a timeout to prevent indefinite blocking.

Apply this diff:

     def extra_endpoints_test(server_url: str):
         import json
-        import urllib.request
+        import urllib.request

         with urllib.request.urlopen(f"{server_url}/perf_metrics",
-                                    timeout=10) as resp:
+                                    timeout=30) as resp:
             assert resp.status == 200
             perf_metrics = json.load(resp)

559-571: Fix: kv_cache_size is not in timing_metrics and timestamps need robust checks

The test incorrectly tries to access kv_cache_size from gen_metrics (which are timing_metrics). The field is actually in kv_cache_metrics. Also, the strict timing comparisons could be flaky.

Apply this diff:

         ctx_metrics = item["ctx_perf_metrics"]["perf_metrics"]["timing_metrics"]
         gen_metrics = item["gen_perf_metrics"]["perf_metrics"]["timing_metrics"]
+        gen_kv_metrics = item["gen_perf_metrics"]["perf_metrics"].get("kv_cache_metrics", {})
         # only one token is generated in ctx
         assert ctx_metrics["last_token_time"] - ctx_metrics[
             "first_token_time"] < 1e-3
         assert ctx_metrics["last_token_time"] < gen_metrics["arrival_time"]
-        assert gen_metrics["kv_cache_size"] > 0
+        # Check kv_cache_metrics if available
+        if gen_kv_metrics:
+            # Note: kv_cache_size might be in timing_metrics for legacy reasons
+            if "kv_cache_size" in gen_metrics:
+                assert gen_metrics["kv_cache_size"] > 0
         assert gen_metrics["arrival_time"] < gen_metrics[
             "kv_cache_transfer_start"]
         assert gen_metrics["kv_cache_transfer_start"] < gen_metrics[
             "kv_cache_transfer_end"]
         assert gen_metrics["kv_cache_transfer_end"] < gen_metrics[
             "first_scheduled_time"]
tensorrt_llm/serve/openai_disagg_server.py (2)

55-66: Fix type annotations for Python 3.8 compatibility

The code uses PEP 585 built-in generic dict[str, dict[int, dict]] which is not supported in Python 3.8. Replace with typing module equivalents.

Apply this diff:

+from typing import Dict
 
             # server_key -> {ctx_request_id: perf_metrics}
-            self.server_perf_metrics: dict[str, dict[int, dict]] = {}
+            self.server_perf_metrics: Dict[str, Dict[int, dict]] = {}

150-159: Add error handling for individual server failures

If any server's /perf_metrics endpoint fails or returns non-200, the entire aggregation fails. Consider skipping failed servers to keep the endpoint robust.

Apply this diff:

         perf_metrics = {}
         exc = None
         try:
             for server in self.ctx_servers + self.gen_servers:
-                async with self.session.get(f"{server}/perf_metrics") as response:
-                    server_perf_metrics = await response.json()
-                    perf_metrics[server] = server_perf_metrics
+                try:
+                    async with self.session.get(f"{server}/perf_metrics") as response:
+                        if response.status != 200:
+                            logger.warning(f"Skipping {server}/perf_metrics: HTTP {response.status}")
+                            continue
+                        server_perf_metrics = await response.json()
+                        perf_metrics[server] = server_perf_metrics
+                except Exception as e:
+                    logger.error(f"Failed to fetch {server}/perf_metrics: {e}")
+                    # Don't set exc here - allow partial aggregation
         except Exception as e:
             # Keep the exception to raise it after saving perf metrics
             exc = e
🧹 Nitpick comments (2)
tensorrt_llm/serve/openai_server.py (1)

266-308: Consider adding error handling for missing perf_metrics fields

The function accesses nested attributes without checking if they exist. Consider adding defensive checks to handle cases where some metrics might be None or missing.

Add defensive checks:

     for metrics_dict in perf_metrics:
         metrics = metrics_dict["perf_metrics"]
+        if metrics is None:
+            continue
         timing_metrics = metrics.timing_metrics
         kv_cache_metrics = metrics.kv_cache_metrics
         speculative_decoding = metrics.speculative_decoding
tensorrt_llm/serve/openai_disagg_server.py (1)

177-178: Consider not re-raising the exception to allow partial results

Re-raising the exception here causes the endpoint to return 500 even when some metrics were successfully collected. Consider returning partial results instead.

-            if exc is not None:
-                raise exc
+            # Log the exception but continue with partial results
+            if exc is not None:
+                logger.error(f"Error during perf metrics collection: {exc}")
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between ec90693 and 0f06d5c.

📒 Files selected for processing (12)
  • tensorrt_llm/commands/serve.py (3 hunks)
  • tensorrt_llm/llmapi/disagg_utils.py (4 hunks)
  • tensorrt_llm/llmapi/llm_args.py (1 hunks)
  • tensorrt_llm/serve/openai_disagg_server.py (6 hunks)
  • tensorrt_llm/serve/openai_server.py (9 hunks)
  • tests/integration/defs/disaggregated/test_configs/disagg_config_metrics.yaml (1 hunks)
  • tests/integration/defs/disaggregated/test_disaggregated.py (7 hunks)
  • tests/integration/defs/test_e2e.py (1 hunks)
  • tests/integration/test_lists/test-db/l0_a10.yml (1 hunks)
  • tests/integration/test_lists/test-db/l0_h100.yml (1 hunks)
  • tests/unittest/api_stability/references/llm.yaml (1 hunks)
  • tests/unittest/llmapi/apps/_test_openai_perf_metrics.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
  • tests/integration/test_lists/test-db/l0_h100.yml
  • tests/integration/test_lists/test-db/l0_a10.yml
  • tensorrt_llm/llmapi/llm_args.py
  • tests/unittest/api_stability/references/llm.yaml
  • tests/unittest/llmapi/apps/_test_openai_perf_metrics.py
  • tensorrt_llm/llmapi/disagg_utils.py
  • tests/integration/defs/disaggregated/test_configs/disagg_config_metrics.yaml
  • tests/integration/defs/test_e2e.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/serve/openai_server.py
  • tests/integration/defs/disaggregated/test_disaggregated.py
  • tensorrt_llm/serve/openai_disagg_server.py
  • tensorrt_llm/commands/serve.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/serve/openai_server.py
  • tests/integration/defs/disaggregated/test_disaggregated.py
  • tensorrt_llm/serve/openai_disagg_server.py
  • tensorrt_llm/commands/serve.py
🧠 Learnings (1)
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tensorrt_llm/serve/openai_disagg_server.py
🧬 Code graph analysis (4)
tensorrt_llm/serve/openai_server.py (5)
tensorrt_llm/serve/openai_disagg_server.py (1)
  • perf_metrics (146-195)
tensorrt_llm/_utils.py (1)
  • set_prometheus_multiproc_dir (1118-1129)
tensorrt_llm/metrics/collector.py (2)
  • MetricsCollector (10-105)
  • log_metrics_dict (100-105)
tensorrt_llm/llmapi/llm.py (1)
  • RequestOutput (47-87)
tensorrt_llm/executor/result.py (3)
  • finished (544-545)
  • outputs (197-212)
  • request_id (521-522)
tests/integration/defs/disaggregated/test_disaggregated.py (2)
tests/integration/defs/conftest.py (4)
  • disaggregated_test_root (2339-2344)
  • llm_venv (707-723)
  • disaggregated_example_root (270-275)
  • llama_model_root (964-1039)
tensorrt_llm/serve/openai_disagg_server.py (1)
  • perf_metrics (146-195)
tensorrt_llm/serve/openai_disagg_server.py (4)
cpp/tensorrt_llm/common/envUtils.h (1)
  • tensorrt_llm (23-119)
tensorrt_llm/llmapi/disagg_utils.py (3)
  • DisaggServerConfig (47-55)
  • MetadataServerConfig (59-64)
  • get_ctx_gen_server_urls (67-78)
tensorrt_llm/serve/metadata_server.py (7)
  • create_metadata_server (80-95)
  • get (17-18)
  • get (38-39)
  • get (66-68)
  • keys (29-30)
  • keys (47-58)
  • keys (76-77)
tensorrt_llm/serve/router.py (2)
  • create_router (615-653)
  • KvCacheAwareRouter (512-612)
tensorrt_llm/commands/serve.py (2)
tensorrt_llm/llmapi/disagg_utils.py (2)
  • MetadataServerConfig (59-64)
  • ServerRole (19-22)
tensorrt_llm/serve/openai_disagg_server.py (1)
  • OpenAIDisaggServer (39-472)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (23)
tests/integration/defs/disaggregated/test_disaggregated.py (7)

20-20: LGTM!

The import of Callable from the typing module is correctly added and follows Python 3.8+ compatibility requirements.


39-44: LGTM! Good helper function for extracting server URL from config.

The function correctly extracts hostname and port from the YAML config with sensible defaults (localhost:8000), which aligns with the disaggregated server defaults.


69-69: LGTM!

The addition of the perf_metrics test configuration entry follows the established pattern for test configurations.


165-166: LGTM! Clean API extension for extra endpoint testing.

The optional extra_endpoints_test parameter is well-designed, using Optional[Callable[[str], None]] which is compatible with Python 3.8+.


186-186: LGTM!

Correctly derives the server URL from the config file before starting the test.


247-248: LGTM! Good integration point for additional testing.

The extra endpoints test is invoked at the right time - after the client runs but before output verification, allowing validation of metrics collected during the test.


573-577: LGTM!

The test function is correctly integrated with the test harness, passing the extra_endpoints_test callback to validate the perf_metrics endpoint.

tensorrt_llm/serve/openai_server.py (5)

7-7: LGTM!

The import of deque from the collections module is correctly added for the perf metrics queue implementation.


88-99: LGTM! Well-structured initialization of perf metrics state.

The initialization properly sets up the perf metrics collection infrastructure only when both return_perf_metrics is enabled and perf_metrics_max_requests > 0. The bounded deque ensures memory is controlled.


169-169: LGTM!

The /perf_metrics endpoint is correctly registered in the routes.


320-335: LGTM! Clean extraction and storage of metrics.

The _extract_metrics method properly checks if the request is finished before processing, logs metrics if a collector is available, and safely appends to the perf_metrics queue with proper locking.


352-352: LGTM! Consistent metrics extraction across all completion paths.

The metrics extraction is correctly integrated at all completion points - streaming chat, non-streaming chat, streaming completion, and non-streaming completion.

Also applies to: 370-370, 539-539, 575-575

tensorrt_llm/serve/openai_disagg_server.py (7)

4-4: LGTM!

The imports for itertools and deque are correctly added for the perf metrics implementation.

Also applies to: 8-8


11-11: LGTM!

The typing import update correctly removes unused List and keeps only what's needed.


22-24: LGTM!

The imports are correctly updated to use the new config-driven API with DisaggServerConfig and get_ctx_gen_server_urls.


41-54: LGTM! Clean config-driven initialization.

The constructor is well refactored to use the single DisaggServerConfig object, deriving servers and routers from it. The implementation is cleaner and more maintainable than the previous multi-parameter approach.


127-127: LGTM!

The /perf_metrics endpoint is correctly registered in the routes.


142-144: LGTM! Thread-safe helper for adding perf metrics keys.

The helper method properly uses async locking to safely append correlation keys to the deque.


334-336: LGTM! Correct integration of perf metrics tracking.

The perf metrics keys are properly registered when both context processing is needed and perf metrics tracking is enabled. Using asyncio.create_task ensures non-blocking execution.
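Putting the two notes above together, a self-contained toy of the locked append plus the fire-and-forget registration (URLs, key shape, and the maxlen value are placeholders):

import asyncio
from collections import deque

async def main() -> None:
    perf_metrics_keys: deque = deque(maxlen=8)   # stands in for perf_metrics_max_requests
    lock = asyncio.Lock()

    async def add_perf_metrics_keys(key) -> None:
        async with lock:
            perf_metrics_keys.append(key)

    # Fire-and-forget: the request path is not blocked on lock acquisition.
    asyncio.create_task(
        add_perf_metrics_keys(("http://localhost:8001", "http://localhost:8003", 1)))
    await asyncio.sleep(0)  # yield once so the task can run
    print(perf_metrics_keys)

asyncio.run(main())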

tensorrt_llm/commands/serve.py (4)

3-3: LGTM!

The signal module import is correctly added for signal handling functionality.


6-6: LGTM!

The typing import update correctly removes unused List and keeps only Optional and Any.


23-25: LGTM! Clean import update for config-driven API.

The imports are correctly updated to support the new config-driven disaggregated server setup, removing the now-unused CtxGenServerConfig.


483-486: LGTM! Clean config-driven server instantiation.

The disaggregated server instantiation is cleanly updated to use the single config object approach, which is more maintainable than the previous multi-parameter approach.

@tensorrt-cicd (Collaborator)

PR_Github #16349 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12287 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@Shixiaowei02 Shixiaowei02 merged commit cf50ba2 into NVIDIA:main Aug 26, 2025
5 checks passed
@zhengd-nv zhengd-nv deleted the perf-metrics branch August 26, 2025 08:20

The following inline review thread is anchored on this hunk from the disaggregated perf-metrics aggregation code:

        continue
    server_metrics[ctx_request_id] = request_perf_metrics

if len(server_metrics) > self.perf_metrics_max_requests:
Contributor:

If possible, you could use a ring buffer with size = perf_metrics_max_requests instead. Would be faster and easier to understand.

Collaborator Author:

ctx_request_id is used as a look-up key in later metrics aggregation, so server_metrics here has to be a map with max entries.
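To make the trade-off concrete, one way to keep a lookup map bounded while preserving access by ctx_request_id is to evict the oldest key once the cap is exceeded; a sketch under assumed names, not the merged code:

from collections import OrderedDict
from typing import Any

def record_metrics(server_metrics: "OrderedDict[int, Any]",
                   ctx_request_id: int,
                   request_perf_metrics: Any,
                   max_requests: int) -> None:
    server_metrics[ctx_request_id] = request_perf_metrics
    if len(server_metrics) > max_requests:
        # Drop the oldest entry; lookups by ctx_request_id stay O(1).
        server_metrics.popitem(last=False)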

A second inline review thread is anchored on this hunk from the /perf_metrics handler in openai_server.py:

        return JSONResponse(content=[])
    async with self.perf_metrics_lock:
        # Swap the queue out under the lock: the entries collected so far go to
        # this caller, and a fresh bounded deque starts accumulating.
        perf_metrics = self.perf_metrics
        self.perf_metrics = deque(maxlen=self.llm.args.perf_metrics_max_requests)
Contributor @nv-kmcgill53 (Sep 5, 2025):

This is a bug. Subsequent calls to this /perf_metrics API return an empty list [] rather than the last perf_metrics_max_requests requests; right now only the first caller gets the metrics for the queued requests. See my output below:

root@f269356a9f20:/dynamo/build# curl localhost:8000/perf_metrics | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   348  100   348    0     0   175k      0 --:--:-- --:--:-- --:--:--  339k
[
  {
    "request_id": 3,
    "perf_metrics": {
      "first_iter": 821,
      "last_iter": 1810,
      "timing_metrics": {
        "arrival_time": 1946966.715181,
        "first_scheduled_time": 1946966.715351,
        "first_token_time": 1946966.744874,
        "last_token_time": 1946976.857435
      },
      "kv_cache_metrics": {
        "num_total_allocated_blocks": 1,
        "num_new_allocated_blocks": 0,
        "num_reused_blocks": 2,
        "num_missed_blocks": 0
      }
    }
  }
]

# Immediately calling the same endpoint again.
root@f269356a9f20:/dynamo/build# curl localhost:8000/perf_metrics | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100     2  100     2    0     0   1098      0 --:--:-- --:--:-- --:--:--  2000
[]

I would expect the second call to this API endpoint to return the data for request_id: 3.

My recommendation for fixing this is to keep a list (or my aforementioned ring buffer with a lock) that holds the most recent perf_metrics_max_requests items to return to the caller. That way the API is idempotent and the race over which caller reaches the API first goes away.

Collaborator Author @zhengd-nv (Sep 5, 2025):

This behavior is consistent with the /metrics endpoint: entries are cached and retrieved in batches. After a call to the endpoint, the cached entries are returned and cleared. perf_metrics_max_requests controls how many entries are kept at most if the endpoint is not polled frequently, not how many are returned per call.
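Given these drain-on-read semantics, a metrics consumer is expected to poll and accumulate on its side; a minimal client-side sketch (URL and storage are illustrative):

import requests

def poll_perf_metrics(base_url: str, store: list) -> None:
    # Each entry is returned at most once, so the caller accumulates them.
    resp = requests.get(f"{base_url}/perf_metrics", timeout=5)
    resp.raise_for_status()
    store.extend(resp.json())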

Contributor:

I think I disagree with the structure of this API then. If the caller that is processing these metrics goes down, there is no way to recover the metrics that were already handed out; if that batch is large, a large chunk of data is lost. It also limits us to a single data processor, ruling out multiple or redundant processors.

If you are worried about the memory footprint of a large number of perf_metrics items, you can hard-code an upper bound, e.g. user_max_perf_items > MAX_PERF_ITEMS ? MAX_PERF_ITEMS : user_max_perf_items (see the Python sketch after this comment).

I want to understand the fundamental reason why the queue must be used, other than "this is what the spec says."
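The suggested cap, expressed as a Python sketch (MAX_PERF_ITEMS is a hypothetical hard limit, not an existing constant in the codebase):

MAX_PERF_ITEMS = 10_000        # hypothetical hard upper bound
user_max_perf_items = 50_000   # example user-provided setting

effective_max = min(user_max_perf_items, MAX_PERF_ITEMS)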
