[TRTLLM-7846][feat] Http disagg-cluster management implemention #7869

reasonsolo · 2025-09-19T09:12:32Z

This is the first PR for the whole epic, only including the basic tools for service discovery.

Summary by CodeRabbit

New Features
- Distributed autoscaling for clusters with dynamic worker registration, heartbeats, and readiness checks.
- Pluggable cluster storage: HTTP in-memory server/client and optional etcd backend with TTL/expiration and change notifications.
Refactor
- Server role enumeration now uses integer-based values for improved interoperability.
Tests
- Comprehensive tests for storage backends and autoscaling workflows, covering set/get, expiration, watch events, and readiness transitions.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

coderabbitai · 2025-09-26T06:40:33Z

Walkthrough

Introduces a new autoscaling subsystem and a cluster storage layer with HTTP server/client and watch semantics; adds tests. Also changes ServerRole to inherit from IntEnum. New components include ClusterManager, ClusterWorker, storage abstractions and implementations, and comprehensive unit tests for storage and autoscaling behavior.

Changes

Cohort / File(s)	Change summary
Server role enum change `tensorrt_llm/llmapi/disagg_utils.py`	Switches ServerRole base from Enum to IntEnum; updates import accordingly.
Autoscaling components `tensorrt_llm/serve/auto_scaling.py`	Adds WorkerInfo dataclass; utilities for worker keys; ClusterManager for tracking workers, watching storage events, readiness checks; ClusterWorker for registration, deregistration, and heartbeat with TTL.
Cluster storage (HTTP + interface) `tensorrt_llm/serve/cluster_storage.py`	Adds storage models (StorageItem, WatchEventType, WatchEvent), watch queue, abstract ClusterStorage API; HTTP in-memory server with TTL and watch; HTTP client; factories for storage/server creation; background expiration task; FastAPI route wiring.
Tests: storage and autoscaling `tests/unittest/serve/test_cluster_manager_worker.py`, `tests/unittest/serve/test_cluster_storage.py`	Adds async tests for storage CRUD, TTL expiration, prefix queries, watch notifications; tests for ClusterManager/ClusterWorker registration, event watching, readiness transitions, and heartbeat behavior across HTTP and etcd backends.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Worker as ClusterWorker
  participant Storage as ClusterStorage
  participant Manager as ClusterManager

  Note over Worker,Storage: Registration
  Worker->>Storage: set(worker_key, worker_info, ttl)
  Storage-->>Worker: ack

  Note over Manager,Storage: Watch worker registry
  Manager->>Storage: watch(worker_key_prefix)
  Storage-->>Manager: WatchEventQueue

  loop Heartbeat (periodic)
    Worker->>Storage: expire(worker_key, ttl)
    Storage-->>Worker: ack
  end

  Note over Storage,Manager: Change notifications
  Storage-->>Manager: WatchEvent(SET/DELETE)
  Manager->>Manager: update internal worker maps and counts

sequenceDiagram
  autonumber
  participant Test as Test Suite
  participant HTTP as HttpClusterStorageServer
  participant Client as HttpClusterStorageClient

  Test->>HTTP: start()
  Test->>Client: set(key, value, ttl)
  Client->>HTTP: POST /set
  HTTP-->>Client: ok

  Test->>Client: watch(prefix)
  Client->>HTTP: GET /watch (stub/queue)
  HTTP-->>Client: WatchEventQueue

  Test->>Client: delete(key)
  Client->>HTTP: POST /delete
  HTTP-->>Client: ok
  HTTP-->>Client: notify(DELETE)
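
The two flows above can be sketched with a toy in-memory store. This is a minimal sketch under stated assumptions, not the PR's implementation: `ToyStorageBase`, `ToyStorage`, `ToyEvent`, `purge_expired`, and the key `workers/ctx-0` are hypothetical names, only the operations named in the diagrams (`set`, `expire`, `delete`, `watch`) are modeled, and reporting an expired key as a `DELETE` event is an assumption.

```python
import asyncio
import time
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ToyEvent:
    # Hypothetical stand-in for a watch event; kind is "SET" or "DELETE".
    kind: str
    key: str


class ToyStorageBase(ABC):
    # Hypothetical interface. Abstract methods force subclasses to implement them.
    @abstractmethod
    async def set(self, key: str, value: str, ttl: float) -> None: ...

    @abstractmethod
    async def expire(self, key: str, ttl: float) -> bool: ...

    @abstractmethod
    async def delete(self, key: str) -> bool: ...


class ToyStorage(ToyStorageBase):
    def __init__(self) -> None:
        self._items: Dict[str, Tuple[str, float]] = {}  # key -> (value, deadline)
        self._watchers: List[Tuple[str, asyncio.Queue]] = []

    def watch(self, prefix: str) -> asyncio.Queue:
        # Each watcher gets its own queue of events for keys under `prefix`.
        queue: asyncio.Queue = asyncio.Queue()
        self._watchers.append((prefix, queue))
        return queue

    def _notify(self, kind: str, key: str) -> None:
        for prefix, queue in self._watchers:
            if key.startswith(prefix):
                queue.put_nowait(ToyEvent(kind, key))

    async def set(self, key: str, value: str, ttl: float) -> None:
        self._items[key] = (value, time.monotonic() + ttl)
        self._notify("SET", key)

    async def expire(self, key: str, ttl: float) -> bool:
        # Heartbeat: push the deadline forward without changing the value.
        if key not in self._items:
            return False
        value, _ = self._items[key]
        self._items[key] = (value, time.monotonic() + ttl)
        return True

    async def delete(self, key: str) -> bool:
        if self._items.pop(key, None) is None:
            return False
        self._notify("DELETE", key)
        return True

    async def purge_expired(self) -> None:
        # Assumption: an expired key is removed and reported as DELETE.
        now = time.monotonic()
        expired = [k for k, (_, deadline) in self._items.items() if deadline <= now]
        for key in expired:
            await self.delete(key)


async def demo() -> List[Tuple[str, str]]:
    storage = ToyStorage()
    events = storage.watch("workers/")
    await storage.set("workers/ctx-0", "{}", ttl=0.05)  # registration
    await storage.expire("workers/ctx-0", ttl=0.05)     # one heartbeat
    await asyncio.sleep(0.1)                            # worker stops heartbeating
    await storage.purge_expired()
    seen = []
    while not events.empty():
        event = events.get_nowait()
        seen.append((event.kind, event.key))
    return seen
```

Running `asyncio.run(demo())` yields one `SET` and one `DELETE` event for the key; the heartbeat itself produces no event in this sketch.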

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60–90 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.
Description Check	⚠️ Warning	The pull request description is largely incomplete and retains the template placeholders for the Description and Test Coverage sections without any substantive content, and it does not provide a properly formatted PR title or summary according to the repository’s template.	Please fill in the Description section with a concise explanation of the changes and rationale, list relevant tests under Test Coverage, and ensure the PR title follows the “[JIRA ticket][type] Summary” format as specified by the template.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title Check	✅ Passed	The title follows the required format by including the JIRA ticket, change type, and a concise description of the main feature—implementing HTTP-based disaggregation cluster management—so it accurately reflects the primary change in the pull request.

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (2)

tests/unittest/serve/test_cluster_storage.py (1)
92-95: Swap the blocking sleep for asyncio.sleep.

This coroutine is running on the event loop; the synchronous time.sleep call blocks the loop and can delay other async tasks (like the expiry check) for the full duration. Replace it with await asyncio.sleep(...) so the loop stays cooperative.
-        time.sleep(1)
+        await asyncio.sleep(1)
-        time.sleep(2)
+        await asyncio.sleep(2)
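
The effect described in this comment can be reproduced in isolation. The following is a minimal sketch, not code from the PR: `ticker` is a hypothetical background task standing in for something like a periodic expiry check, and `count_ticks` counts how often it runs while the caller sleeps either way.

```python
import asyncio
import time


async def ticker(ticks: list) -> None:
    # Hypothetical background task; records a timestamp every 10 ms.
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(0.01)


async def count_ticks(blocking: bool) -> int:
    ticks: list = []
    task = asyncio.create_task(ticker(ticks))
    await asyncio.sleep(0)  # yield once so the ticker records its first tick
    before = len(ticks)
    if blocking:
        time.sleep(0.1)           # blocks the event loop; the ticker cannot run
    else:
        await asyncio.sleep(0.1)  # yields to the loop; the ticker keeps running
    after = len(ticks)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return after - before
```

`asyncio.run(count_ticks(True))` reports no ticks during the sleep, while `asyncio.run(count_ticks(False))` reports at least one.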
tensorrt_llm/serve/cluster_storage.py (1)
293-297: Make the session cleanup resilient when no event loop is running.

asyncio.get_event_loop() can raise RuntimeError when no current event loop is set (for example when the destructor runs in a non-main thread, and possibly late in interpreter shutdown), so the destructor can emit unhandled exceptions instead of closing the session. Wrap the lookup and only schedule a close when a running loop is available:
-        if asyncio.get_event_loop():
-            asyncio.run_coroutine_threadsafe(self._session.close(),
-                                             asyncio.get_event_loop())
+        try:
+            loop = asyncio.get_running_loop()
+        except RuntimeError:
+            return
+        asyncio.run_coroutine_threadsafe(self._session.close(), loop)
This keeps teardown quiet and still closes the client session when a loop is running; when no loop is running, the close is skipped.
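
The suggested pattern can be tried outside the PR's code. This is a minimal sketch under stated assumptions: `DemoSession` is a hypothetical stand-in for an HTTP client session with an async `close()`, and `schedule_close` is a hypothetical helper that mirrors the suggested destructor body.

```python
import asyncio


class DemoSession:
    # Hypothetical stand-in for a client session with an async close().
    def __init__(self) -> None:
        self.closed = False

    async def close(self) -> None:
        self.closed = True


def schedule_close(session: DemoSession) -> bool:
    """Schedule session.close() if a loop is running in this thread.

    Returns True when the close was scheduled, False when no loop is running.
    """
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        # No running loop: nothing to schedule on, so skip quietly.
        return False
    # The coroutine is created only after a loop was found, which avoids
    # a "coroutine was never awaited" warning on the early-return path.
    asyncio.run_coroutine_threadsafe(session.close(), loop)
    return True


async def close_inside_loop() -> bool:
    session = DemoSession()
    scheduled = schedule_close(session)
    await asyncio.sleep(0.01)  # give the scheduled close a chance to run
    return scheduled and session.closed
```

Called from plain synchronous code, `schedule_close` returns `False` without raising; inside a running loop it schedules the close, which completes once the loop gets control again.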

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2db22fb and 93ab23c.

📒 Files selected for processing (5)

tensorrt_llm/llmapi/disagg_utils.py (2 hunks)
tensorrt_llm/serve/auto_scaling.py (1 hunks)
tensorrt_llm/serve/cluster_storage.py (1 hunks)
tests/unittest/serve/test_cluster_manager_worker.py (1 hunks)
tests/unittest/serve/test_cluster_storage.py (1 hunks)

🧰 Additional context used

📓 Path-based instructions (3)

**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

tensorrt_llm/llmapi/disagg_utils.py
tensorrt_llm/serve/auto_scaling.py
tests/unittest/serve/test_cluster_storage.py
tests/unittest/serve/test_cluster_manager_worker.py
tensorrt_llm/serve/cluster_storage.py

**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

tensorrt_llm/llmapi/disagg_utils.py
tensorrt_llm/serve/auto_scaling.py
tests/unittest/serve/test_cluster_storage.py
tests/unittest/serve/test_cluster_manager_worker.py
tensorrt_llm/serve/cluster_storage.py

**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

tensorrt_llm/llmapi/disagg_utils.py
tensorrt_llm/serve/auto_scaling.py
tests/unittest/serve/test_cluster_storage.py
tests/unittest/serve/test_cluster_manager_worker.py
tensorrt_llm/serve/cluster_storage.py

🧬 Code graph analysis (4)

tensorrt_llm/serve/auto_scaling.py (2)

tensorrt_llm/llmapi/disagg_utils.py (1)

ServerRole (19-22)

tensorrt_llm/serve/cluster_storage.py (25)

ClusterStorage (57-96)

StorageItem (18-23)

WatchEvent (32-34)

WatchEventType (26-28)

key_time (130-131)

start (63-64)

start (156-159)

stop (67-68)

stop (161-164)

get_prefix (93-96)

get_prefix (217-224)

get_prefix (354-359)

watch (87-88)

watch (226-235)

watch (370-372)

unwatch (90-91)

unwatch (237-244)

unwatch (374-376)

drain (44-54)

delete (84-85)

delete (207-215)

delete (361-368)

expire (78-79)

expire (200-205)

expire (345-349)

tests/unittest/serve/test_cluster_storage.py (1)

tensorrt_llm/serve/cluster_storage.py (24)

HttpClusterStorageServer (134-281)

StorageItem (18-23)

WatchEvent (32-34)

WatchEventType (26-28)

create_cluster_storage (99-105)

create_cluster_storage_client (108-114)

start (63-64)

start (156-159)

get (81-82)

get (188-198)

get (351-352)

get_prefix (93-96)

get_prefix (217-224)

get_prefix (354-359)

watch (87-88)

watch (226-235)

watch (370-372)

drain (44-54)

delete (84-85)

delete (207-215)

delete (361-368)

stop (67-68)

stop (161-164)

add_routes (145-154)

tests/unittest/serve/test_cluster_manager_worker.py (4)

tensorrt_llm/llmapi/disagg_utils.py (1)

ServerRole (19-22)

tensorrt_llm/serve/auto_scaling.py (13)

ClusterManager (33-178)

ClusterWorker (181-283)

start (45-46)

stop (48-49)

register_worker (212-247)

watch_workers (82-96)

worker_id (196-197)

is_ready (173-174)

current_ctx_worker_num (71-72)

current_gen_worker_num (75-76)

get_worker_events (102-118)

worker_info (200-205)

deregister_worker (249-258)

tensorrt_llm/serve/cluster_storage.py (7)

WatchEventType (26-28)

create_cluster_storage (99-105)

create_cluster_storage_client (108-114)

start (63-64)

start (156-159)

stop (67-68)

stop (161-164)

tests/unittest/serve/test_cluster_storage.py (5)

http_server_storage (174-187)

storage_server (194-198)

storage_server (205-212)

run_in_thread (36-45)

storage_client (52-54)

tensorrt_llm/serve/cluster_storage.py (1)

tensorrt_llm/serve/auto_scaling.py (2)

start (45-46)

stop (48-49)

🪛 Ruff (0.13.1)

tensorrt_llm/serve/auto_scaling.py

29-29: Unused function argument: role

(ARG001)

113-113: Do not catch blind exception: Exception

(BLE001)

131-131: Avoid specifying long messages outside the exception class

(TRY003)

139-139: Avoid specifying long messages outside the exception class

(TRY003)

161-161: Do not catch blind exception: Exception

(BLE001)

167-167: Avoid specifying long messages outside the exception class

(TRY003)

193-193: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)

tests/unittest/serve/test_cluster_storage.py

70-70: Unused method argument: storage_server

(ARG002)

78-78: Unused method argument: storage_server

(ARG002)

86-86: Unused method argument: storage_server

(ARG002)

99-99: Unused method argument: storage_server

(ARG002)

116-116: Unused method argument: storage_client

(ARG002)

178-178: Unused function argument: app

(ARG001)

207-207: subprocess call: check for execution of untrusted input

(S603)

208-208: Starting a process with a partial executable path

(S607)

tests/unittest/serve/test_cluster_manager_worker.py

33-33: f-string without any placeholders

Remove extraneous f prefix

(F541)

35-35: f-string without any placeholders

Remove extraneous f prefix

(F541)

37-37: Avoid specifying long messages outside the exception class

(TRY003)

60-60: subprocess call: check for execution of untrusted input

(S603)

61-61: Starting a process with a partial executable path

(S607)

68-68: Avoid specifying long messages outside the exception class

(TRY003)

73-73: Unpacked variable cluster_uri is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

104-104: Avoid equality comparisons to True; use await cluster_manager.is_ready(): for truth checks

Replace with await cluster_manager.is_ready()

(E712)

113-113: Pointless comparison. Did you mean to assign a value? Otherwise, prepend assert or remove it.

(B015)

114-114: Pointless comparison. Did you mean to assign a value? Otherwise, prepend assert or remove it.

(B015)

118-118: Avoid equality comparisons to False; use not await cluster_manager.is_ready(): for false checks

Replace with not await cluster_manager.is_ready()

(E712)

128-128: Avoid equality comparisons to False; use not await cluster_manager.is_ready(): for false checks

Replace with not await cluster_manager.is_ready()

(E712)

137-137: Avoid equality comparisons to True; use await cluster_manager.is_ready(): for truth checks

Replace with await cluster_manager.is_ready()

(E712)

144-144: Avoid equality comparisons to False; use not await cluster_manager.is_ready(): for false checks

Replace with not await cluster_manager.is_ready()

(E712)

151-151: Avoid equality comparisons to False; use not await cluster_manager.is_ready(): for false checks

Replace with not await cluster_manager.is_ready()

(E712)

200-200: Avoid equality comparisons to True; use await cluster_manager.is_ready(): for truth checks

Replace with await cluster_manager.is_ready()

(E712)

203-203: Avoid equality comparisons to True; use await cluster_manager.is_ready(): for truth checks

Replace with await cluster_manager.is_ready()

(E712)

211-211: Avoid equality comparisons to False; use not await cluster_manager.is_ready(): for false checks

Replace with not await cluster_manager.is_ready()

(E712)
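
The E712 and B015 findings above come down to how readiness is asserted in async tests. A minimal sketch, assuming a hypothetical `DemoManager` whose `is_ready()` coroutine returns a bool (the PR's `ClusterManager` is not reproduced here):

```python
import asyncio


class DemoManager:
    # Hypothetical manager that is ready once enough workers have registered.
    def __init__(self, required: int) -> None:
        self.required = required
        self.workers = 0

    async def is_ready(self) -> bool:
        return self.workers >= self.required


async def check_readiness() -> list:
    manager = DemoManager(required=1)
    results = []

    # B015: a bare comparison is evaluated and discarded, so it can never
    # fail a test. Here it is False, yet nothing happens.
    manager.workers == 1  # noqa: B015

    # E712 style: assert truthiness directly instead of `== True` / `== False`.
    assert not await manager.is_ready()
    results.append(await manager.is_ready())

    manager.workers = 1
    assert await manager.is_ready()
    results.append(await manager.is_ready())
    return results
```

The bare comparison passes silently even though it is false, which is why Ruff suggests prepending `assert` or removing it.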

tensorrt_llm/serve/cluster_storage.py

57-57: ClusterStorage is an abstract base class, but it has no abstract methods or properties

(B024)

59-60: ClusterStorage.__init__ is an empty method in an abstract base class, but has no abstract decorator

(B027)

63-64: ClusterStorage.start is an empty method in an abstract base class, but has no abstract decorator

(B027)

67-68: ClusterStorage.stop is an empty method in an abstract base class, but has no abstract decorator

(B027)

70-75: ClusterStorage.set is an empty method in an abstract base class, but has no abstract decorator

(B027)

77-77: Comment contains ambiguous ’ (RIGHT SINGLE QUOTATION MARK). Did you mean ``` (GRAVE ACCENT)?

(RUF003)

78-79: ClusterStorage.expire is an empty method in an abstract base class, but has no abstract decorator

(B027)

81-82: ClusterStorage.get is an empty method in an abstract base class, but has no abstract decorator

(B027)

84-85: ClusterStorage.delete is an empty method in an abstract base class, but has no abstract decorator

(B027)

87-88: ClusterStorage.watch is an empty method in an abstract base class, but has no abstract decorator

(B027)

90-91: ClusterStorage.unwatch is an empty method in an abstract base class, but has no abstract decorator

(B027)

93-96: ClusterStorage.get_prefix is an empty method in an abstract base class, but has no abstract decorator

(B027)

105-105: Avoid specifying long messages outside the exception class

(TRY003)

114-114: Avoid specifying long messages outside the exception class

(TRY003)

136-136: Unused method argument: cluster_uri

(ARG002)

136-136: Unused method argument: cluster_name

(ARG002)

243-244: Avoid specifying long messages outside the exception class

(TRY003)

280-280: Do not catch blind exception: Exception

(BLE001)

281-281: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

304-304: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)
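
The B006 finding refers to a well-known Python pitfall: a mutable default is created once and shared across calls. The flagged argument at line 304 is not shown in this thread, so the sketch below uses a hypothetical `extra` parameter and hypothetical function names to illustrate the problem and the suggested `None` fix.

```python
from typing import Dict, Optional


def build_headers_shared(extra: Dict[str, str] = {}) -> Dict[str, str]:  # noqa: B006
    # Bug: the same dict object is reused by every call that omits `extra`,
    # so state leaks from one call to the next.
    extra.setdefault("x-call-count", "0")
    extra["x-call-count"] = str(int(extra["x-call-count"]) + 1)
    return extra


def build_headers(extra: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    # Fix: default to None and create a fresh dict inside the function.
    headers = dict(extra) if extra is not None else {}
    headers["x-call-count"] = "1"
    return headers
```

Two consecutive calls to `build_headers_shared()` return the same dict with an incrementing counter, while `build_headers()` returns an independent dict each time.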

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Pre-commit Check

🔇 Additional comments (1)

tensorrt_llm/llmapi/disagg_utils.py (1)

3-23: IntEnum switch looks good.

Moving ServerRole to inherit from IntEnum keeps existing semantics but now plays nicely with JSON serialization in the autoscaler/storage paths. ✅
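
The serialization point can be checked with the standard library alone. A minimal sketch, assuming hypothetical enums `PlainRole` and `IntRole` with illustrative member names (the actual members of `ServerRole` are not shown in this thread):

```python
import json
from enum import Enum, IntEnum


class PlainRole(Enum):
    # Hypothetical enum with a plain Enum base.
    CONTEXT = 0
    GENERATION = 1


class IntRole(IntEnum):
    # Same members, but IntEnum members are also ints.
    CONTEXT = 0
    GENERATION = 1


def encode(role) -> str:
    # json.dumps handles int subclasses, but not plain Enum members.
    try:
        return json.dumps({"role": role})
    except TypeError:
        return "not serializable"


def decode(payload: str) -> IntRole:
    # The integer read back from JSON maps to the enum member again.
    return IntRole(json.loads(payload)["role"])
```

With these definitions, `encode(IntRole.GENERATION)` produces a JSON object holding the integer value, `encode(PlainRole.GENERATION)` hits the `TypeError` branch, and `decode` recovers the original member.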

tensorrt_llm/serve/auto_scaling.py

tests/unittest/serve/test_cluster_manager_worker.py

tests/unittest/serve/test_cluster_storage.py

reasonsolo · 2025-09-26T07:08:25Z

/bot run

tensorrt-cicd · 2025-09-26T07:15:54Z

PR_Github #20060 [ run ] triggered by Bot

tensorrt-cicd · 2025-09-26T08:51:58Z

PR_Github #20060 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #15113 completed with status: 'FAILURE'

reasonsolo · 2025-10-07T01:07:45Z

/bot run

tensorrt-cicd · 2025-10-07T01:15:58Z

PR_Github #20692 [ run ] triggered by Bot

reasonsolo · 2025-10-07T02:21:05Z

/bot run

tensorrt-cicd · 2025-10-07T02:27:07Z

PR_Github #20698 [ run ] triggered by Bot

tensorrt-cicd · 2025-10-07T02:27:09Z

PR_Github #20692 [ run ] completed with state ABORTED
LLM/main/L0_MergeRequest_PR #15631 (Blue Ocean) completed with status: ABORTED

tensorrt-cicd · 2025-10-07T04:12:42Z

PR_Github #20698 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15636 completed with status: 'FAILURE'

reasonsolo · 2025-10-07T04:49:33Z

/bot run

tensorrt-cicd · 2025-10-07T04:54:43Z

PR_Github #20708 [ run ] triggered by Bot

tensorrt-cicd · 2025-10-07T07:56:04Z

PR_Github #20708 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15645 completed with status: 'FAILURE'

reasonsolo · 2025-10-07T08:12:21Z

/bot run

tensorrt-cicd · 2025-10-07T08:17:25Z

PR_Github #20718 [ run ] triggered by Bot

tensorrt-cicd · 2025-10-07T10:11:22Z

PR_Github #20718 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15654 completed with status: 'FAILURE'

tensorrt_llm/llmapi/disagg_utils.py

tests/unittest/disaggregated/__init__.py

tensorrt_llm/serve/auto_scaling.py

tensorrt_llm/serve/disagg_auto_scaling.py

tensorrt_llm/serve/cluster_storage.py

tests/unittest/disaggregated/test_cluster_storage.py

tensorrt_llm/serve/cluster_storage.py

tensorrt_llm/serve/auto_scaling.py

tensorrt_llm/serve/disagg_auto_scaling.py

tests/unittest/disaggregated/test_disagg_cluster_manager_worker.py

tensorrt_llm/serve/disagg_auto_scaling.py

tests/unittest/disaggregated/test_cluster_manager_worker.py

tensorrt_llm/serve/disagg_auto_scaling.py

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>

reasonsolo · 2025-10-08T04:34:03Z

/bot run

tensorrt-cicd · 2025-10-08T04:40:16Z

PR_Github #20766 [ run ] triggered by Bot

tensorrt-cicd · 2025-10-08T07:20:17Z

PR_Github #20766 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15694 completed with status: 'FAILURE'

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>

reasonsolo · 2025-10-08T12:26:27Z

/bot run

tensorrt-cicd · 2025-10-08T12:31:45Z

PR_Github #20795 Bot args parsing error: usage: /bot [-h]
{run,kill,skip,submit,reviewers,reuse-pipeline,reuse-review} ...
/bot: error: unrecognized arguments: run

tensorrt-cicd · 2025-10-08T12:32:03Z

PR_Github #20796 [ run ] triggered by Bot

tensorrt-cicd · 2025-10-08T17:34:13Z

PR_Github #20796 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15721 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

tensorrt_llm/serve/disagg_auto_scaling.py

…IA#7869) Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>

reasonsolo force-pushed the tllm7846_sdhttp branch 6 times, most recently from 97215ab to 93ab23c Compare September 26, 2025 06:29

reasonsolo marked this pull request as ready for review September 26, 2025 06:29

reasonsolo requested a review from a team as a code owner September 26, 2025 06:29

reasonsolo requested a review from hchings September 26, 2025 06:29

coderabbitai bot reviewed Sep 26, 2025

tensorrt_llm/serve/auto_scaling.py (outdated, resolved)

tests/unittest/serve/test_cluster_manager_worker.py (outdated, resolved)

tests/unittest/serve/test_cluster_storage.py (outdated, resolved)

reasonsolo force-pushed the tllm7846_sdhttp branch 2 times, most recently from 71be18b to bcfcd6d Compare October 7, 2025 02:18

reasonsolo force-pushed the tllm7846_sdhttp branch from bcfcd6d to d37f425 Compare October 7, 2025 04:49

reasonsolo force-pushed the tllm7846_sdhttp branch from d37f425 to fc43659 Compare October 7, 2025 08:11

pcastonguay reviewed Oct 7, 2025

hchings reviewed Oct 7, 2025

reasonsolo added 2 commits October 7, 2025 19:44

tested

5ba27ba

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>

add tests defines

da42ff2

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>

reasonsolo force-pushed the tllm7846_sdhttp branch 2 times, most recently from 71f6dc5 to e5271c6 Compare October 8, 2025 04:28

reasonsolo requested review from hchings and pcastonguay October 8, 2025 04:30

fix by review comments

62c250d

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>

reasonsolo force-pushed the tllm7846_sdhttp branch from e5271c6 to 62c250d Compare October 8, 2025 12:23

pcastonguay approved these changes Oct 8, 2025

tensorrt_llm/serve/disagg_auto_scaling.py (resolved)

hchings approved these changes Oct 8, 2025

reasonsolo merged commit fdf29ab into NVIDIA:main Oct 9, 2025
5 checks passed

kris1025 pushed a commit to kris1025/TensorRT-LLM that referenced this pull request Oct 14, 2025

[TRTLLM-7846][feat] Http disagg-cluster management implemention (NVID…

e8834bc

…IA#7869) Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
