[TRTLLM-7349][feat] Adding new orchestrator type -- ray #7520

joyang-nv · 2025-09-04T06:15:29Z

Summary by CodeRabbit

New Features
- Optional Ray-based orchestration for inference (multi-node, disaggregated, async).
- Non-MPI path using PyTorch ProcessGroup, DeviceMesh-based mapping, and PG-backed collectives (allgather/allreduce/reducescatter).
- Ray executor and GPU worker with queue-based result delivery.
- LLM API: collective RPC and async/sync weight update helpers.
Documentation
- New READMEs for Ray workflows and disaggregated serving.
Examples
- Ray inference scripts, multi-node cluster launcher, disagg-serving CLI.
Tests
- Ray-enabled unit/integration tests and pytest markers/flag.
Chores
- Added ray dependency; build/packaging updates for new utilities.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

requirements.txt

coderabbitai · 2025-09-04T06:33:45Z

📝 Walkthrough

Walkthrough

Introduces a ProcessGroup-based distributed path alongside MPI across C++ and Python: new pg_utils library, CacheTransceiverComm abstraction, PG-backed collectives in thop ops, MPI gating via TLLM_DISABLE_MPI, and Ray-based orchestration (executor, workers, examples, tests). Adds bindings, build targets, packaging entries, and tests for PG/Ray flows.

Changes

Cohort / File(s)	Summary
Build & Packaging `cpp/tensorrt_llm/runtime/CMakeLists.txt`, `cpp/tensorrt_llm/runtime/utils/CMakeLists.txt`, `cpp/tensorrt_llm/batch_manager/CMakeLists.txt`, `cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/CMakeLists.txt`, `cpp/tensorrt_llm/thop/CMakeLists.txt`, `scripts/build_wheel.py`, `setup.py`, `requirements.txt`, `.gitignore`	Adds pg_utils library and bindings, links Torch/Torch Python where needed, packages `libpg_utils.so` and `pgBroker.cpp`, includes Ray dependency, ignores `pg_utils_bindings.*.so`.
Runtime Utils: PG/MPI `cpp/include/tensorrt_llm/runtime/utils/pgUtils.h`, `cpp/tensorrt_llm/runtime/utils/pgUtils.cpp`, `cpp/tensorrt_llm/runtime/utils/pgUtilsBindings.cpp`, `tensorrt_llm/_torch/distributed/pg_utils.py`, `tensorrt_llm/_torch/distributed/pgBroker.cpp`, `cpp/include/tensorrt_llm/runtime/utils/mpiUtils.h`, `cpp/tensorrt_llm/runtime/utils/mpiUtils.cpp`, `tensorrt_llm/_utils.py`	Adds ProcessGroup helper APIs, global PG broker init from Python, bindings, and Python-side `split`. Introduces `MpiComm::couldUseMPI()` and environment-driven `mpi_disabled()` with rank/trace adjustments.
Batch Manager Comm Abstraction `cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h`, `cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp`, `cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp`	Introduces `CacheTransceiverComm` (MPI/PG wrapper), refactors transceiver to use unified comm, implements PG path (split, allgather/allgatherv), and adds temporary Python bindings.
UCX Cache Transmission `cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.{h,cpp}`, `cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/connection.cpp`	Makes rank/world-size backend-agnostic (MPI or PG), updates logs to use manager rank, adds PG-based gather paths.
Torch Ops PG Variants `cpp/tensorrt_llm/thop/allgatherOp.cpp`, `cpp/tensorrt_llm/thop/reducescatterOp.cpp`, `cpp/tensorrt_llm/thop/allreduceOp.cpp`	Adds ProcessGroup-backed implementations and Torch bindings for allgather/reducescatter/allreduce; dispatches between NCCL and PG.
Distributed Communicator (Python) `tensorrt_llm/_torch/distributed/communicator.py`, `tensorrt_llm/_torch/distributed/ops.py`, `tensorrt_llm/_ipc_utils.py`, `tensorrt_llm/_torch/pyexecutor/{py_executor.py,executor_request_queue.py,py_executor_creator.py,resource_manager.py,sampler.py}`	Introduces TorchDist with cluster/local groups, PG-based pp-comm alternative, PG collectives when MPI disabled, Ray-friendly recv/send paths, timeouts, and async object transfers.
Device Mesh & Mapping `tensorrt_llm/_torch/device_mesh.py`, `tensorrt_llm/mapping.py`	Adds DeviceMeshTopology and Mapping wrapper selecting MPI vs DeviceMesh at runtime; exposes PG group accessors and lazy mesh build.
Executor & Ray Orchestration `tensorrt_llm/executor/executor.py`, `tensorrt_llm/executor/ray_executor.py`, `tensorrt_llm/executor/ray_gpu_worker.py`, `tensorrt_llm/executor/result.py`	Adds RayExecutor, Ray GPU worker/actor wrapper, Ray-backed result queues, and factory routing via orchestrator_type.
LLM API Extensions `tensorrt_llm/llmapi/llm.py`, `tensorrt_llm/llmapi/llm_args.py`	Adds `orchestrator_type` arg, enables Ray path when set or MPI disabled, exposes `collective_rpc` and async/sync weight update methods.
Examples (Ray) `examples/ray/`, `examples/ray/disaggregated/`, `examples/ray/multi_nodes/`, `examples/ray/to_delete/`	Adds Ray-based examples, scripts, and experimental tools for async/distributed inference, disaggregated serving, benchmarks, and tests.
Tests (Ray/PG) `tests/unittest/_torch/multi_gpu/`, `tests/unittest/_torch/ray/`, `tests/unittest/conftest.py`, `tests/integration/defs/{conftest.py,trt_test_alternative.py}`, `tests/unittest/_torch/executor/test_overlap_scheduler.py`, `tests/unittest/llmapi/*`, `tests/unittest/utils/util.py`	Adds PG/Ray multi-GPU tests, Ray test controls/markers, MPI↔Ray parity parametrization, and utility for free port discovery.
Minor Edits `tensorrt_llm/_torch/models/modeling_{phi4mm,utils}.py`, `tensorrt_llm/_torch/auto_deploy/distributed/common.py`	Formatting tweaks and internal port utility redirection.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant LLM as LLM.create()
  participant Exec as GenerationExecutor
  participant Ray as RayExecutor
  participant PG as Torch Dist (PG)
  participant W as RayGPUWorker[*]

  User->>LLM: create(model, orchestrator_type="ray", tp_size)
  LLM->>Exec: GenerationExecutor.create(**args)
  alt orchestrator_type == "ray"
    Exec->>Ray: _create_ray_executor(worker_kwargs, world_size, tp_size, ...)
    Ray->>Ray: init Ray cluster / placement group
    Ray->>PG: initialize process group(s)
    Ray->>W: create actors (world_size)
  else
    Exec->>Exec: fallback (MPI)
  end
  User->>LLM: generate/request
  LLM->>Ray: submit(request)
  Ray->>W: enqueue_request(leader)
  W-->>Ray: result stream/items
  Ray-->>LLM: GenerationResult
  LLM-->>User: outputs

sequenceDiagram
  autonumber
  participant CT as CacheTransceiver
  participant Comm as CacheTransceiverComm
  participant MPI as MPI Comm
  participant PG as ProcessGroup
  note over CT: Initialize
  CT->>Comm: construct (from MPI or PG)
  alt MPI path
    Comm->>MPI: split(color,key)
    Comm->>MPI: allgather/allgatherv
  else PG path
    Comm->>PG: split via Python helper
    Comm->>PG: allgather/allgatherv (PgHelper)
  end

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~150 minutes

Possibly related PRs

[None][chore] Create PyExecutor from TorchLlmArgs Part 1 #7105 — Adjusts GenerationExecutor creation/workers; overlaps with new RayExecutor factory routing.
[TRTLLM-5508][feat] check input tokens + improve error handling #5170 — Modifies PyExecutor APIs; related to enqueue signature and recv/send changes here.
[None][feat] Add NCCL Symmetric Integration for All Reduce #4500 — Alters Allreduce operator infrastructure; intersects with added PG allreduce path.

Suggested labels

Community want to contribute

✨ Finishing Touches

📝 Generate Docstrings

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

Add @coderabbitai ignore or @coderabbit ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai or @coderabbitai title anywhere in the PR title to generate the title automatically.

Status, Documentation and Community

Visit our Status Page to check the current availability of CodeRabbit.
Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 93

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (10)

tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
973-996: Preserve MetaInit path when ModelConfig.clone() is unavailable.

Switching from deepcopy to clone() risks skipping the MetaInit path if clone() isn't implemented (AttributeError) or not universally supported across configs, causing a fallback to full materialization and potential OOM/perf regressions. Prefer clone() but fall back to deepcopy only for the copy step to retain the MetaInit flow.
             try:
-                # config will be modified in-place for some models, like Qwen2
-                config_copy = config.clone()
+                # config will be modified in-place for some models, like Qwen2
+                # Prefer lightweight clone; fall back to deepcopy to preserve MetaInit path if clone is unavailable.
+                try:
+                    if hasattr(config, "clone"):
+                        config_copy = config.clone()
+                    else:
+                        import copy as _copy
+                        config_copy = _copy.deepcopy(config)
+                except Exception:
+                    import copy as _copy
+                    config_copy = _copy.deepcopy(config)
                 with MetaInitMode():
                     model = AutoModelForCausalLM.from_config(config_copy)
 
                 memo = dict()
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1)
255-256: Fix logger call to avoid formatting error at runtime.

logger.info("ATTENTION RUNTIME FEATURES: ", attn_runtime_features) will be formatted with % under std logging and can raise. Use %s or f-string.
-    logger.info("ATTENTION RUNTIME FEATURES: ", attn_runtime_features)
+    logger.info("ATTENTION RUNTIME FEATURES: %s", attn_runtime_features)
cpp/tensorrt_llm/runtime/utils/mpiUtils.cpp (2)
303-316: Fix incorrect printf specifiers for size_t in log messages.

size is size_t but logged with %d, which is UB on LP64 and truncates on 64-bit. Use %zu (or cast to unsigned long long with %llu).

Apply:
-    TLLM_LOG_DEBUG("start MPI_Isend with dest %d, tag %d, size %d", dest, static_cast<int>(tag), size);
+    TLLM_LOG_DEBUG("start MPI_Isend with dest %d, tag %d, size %zu", dest, static_cast<int>(tag), size);
@@
-    TLLM_LOG_DEBUG("end MPI_Isend with dest %d, tag %d, size %d", dest, static_cast<int>(tag), size);
+    TLLM_LOG_DEBUG("end MPI_Isend with dest %d, tag %d, size %zu", dest, static_cast<int>(tag), size);
@@
-    TLLM_LOG_DEBUG("start MPI_Send with dest %d, tag %d, size %d", dest, tag, size);
+    TLLM_LOG_DEBUG("start MPI_Send with dest %d, tag %d, size %zu", dest, tag, size);
@@
-    TLLM_LOG_DEBUG("end MPI_Send with dest %d, tag %d, size %d", dest, tag, size);
+    TLLM_LOG_DEBUG("end MPI_Send with dest %d, tag %d, size %zu", dest, tag, size);
@@
-    TLLM_LOG_DEBUG("start MPI_Recv with source %d, tag %d, size %d", source, tag, size);
+    TLLM_LOG_DEBUG("start MPI_Recv with source %d, tag %d, size %zu", source, tag, size);
@@
-    TLLM_LOG_DEBUG("end MPI_Recv with source %d, tag %d, size %d", source, tag, size);
+    TLLM_LOG_DEBUG("end MPI_Recv with source %d, tag %d, size %zu", source, tag, size);
Also applies to: 324-334, 348-360

429-466: Gate probe APIs with couldUseMPI() for consistent MPI-disable behavior.

mprobe/improbe/iprobe bypass the new runtime MPI guard. They should early-guard like other ops.
 void MpiComm::mprobeRawTag(int source, int tag, MPI_Message* msg, MPI_Status* status) const
 {
+    couldUseMPI();
 #if ENABLE_MULTI_DEVICE
@@
 bool MpiComm::improbe(int source, MpiTag tag, MPI_Message* msg, MPI_Status* status) const
 {
+    couldUseMPI();
 #if ENABLE_MULTI_DEVICE
@@
 bool MpiComm::iprobe(int source, MpiTag tag, MPI_Status* status) const
 {
+    couldUseMPI();
 #if ENABLE_MULTI_DEVICE
setup.py (2)
175-187: Fix undefined exception variable 'e' when raising SetupError.

e is not defined in this scope; this raises UnboundLocalError instead of the intended SetupError.
-            else:
-                raise SetupError(
-                    f"Failed to get wheel file from {precompiled_path}.") from e
+            else:
+                raise SetupError(
+                    f"Failed to get wheel file from {precompiled_path}.")
206-211: Align python_requires with actual syntax usage (PEP 604 unions).

The file uses type hints like str | None, which require Python 3.10+. Current python_requires is ">=3.7", which will break installs.
-    python_requires=">=3.7, <4")
+    python_requires=">=3.10, <4")
Also applies to: 264-264
cpp/tensorrt_llm/batch_manager/CMakeLists.txt (1)
112-116: Ensure Python3::Python target exists before linking
In cpp/tensorrt_llm/batch_manager/CMakeLists.txt before line 112, wrap the Python target usage with a guarded find_package:
+if(NOT TARGET Python3::Python)
+  find_package(Python3 REQUIRED COMPONENTS Interpreter Development)
+endif()
 find_library(TORCH_PYTHON_LIB torch_python REQUIRED
              HINTS ${TORCH_INSTALL_PREFIX}/lib)
 target_link_libraries(${BATCH_MANAGER_STATIC_TARGET}
                       PUBLIC ${TORCH_PYTHON_LIB} Python3::Python pg_utils)
tests/unittest/_torch/ray/test_placement.py (1)
54-67: Test cleanup and potential race condition concerns

The test modifies and deletes CUDA_VISIBLE_DEVICES but doesn't ensure proper cleanup if the test fails. Additionally, there's no Ray cleanup.

Apply this diff to ensure proper cleanup:
 @pytest.mark.gpu2
 def test_cuda_visible_device():
     """Placement via cuda_visible_device"""
+    original_cuda_visible = os.environ.get("CUDA_VISIBLE_DEVICES")
     os.environ["CUDA_VISIBLE_DEVICES"] = "1"
-
-    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
-              orchestrator_type="ray")
-
-    infer_actor_uuids = llm.collective_rpc("report_device_id")
-
-    del os.environ["CUDA_VISIBLE_DEVICES"]
-    assert infer_actor_uuids[0] == get_device_uuid(1)
-    print(f"{infer_actor_uuids=}")
+    try:
+        ray.init()
+        llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+                  orchestrator_type="ray")
+        
+        infer_actor_uuids = llm.collective_rpc("report_device_id")
+        assert infer_actor_uuids[0] == get_device_uuid(1)
+        print(f"{infer_actor_uuids=}")
+    finally:
+        # Restore original environment
+        if original_cuda_visible is not None:
+            os.environ["CUDA_VISIBLE_DEVICES"] = original_cuda_visible
+        else:
+            os.environ.pop("CUDA_VISIBLE_DEVICES", None)
+        ray.shutdown()
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)

862-871: Disjoint tag namespaces for different inter-PP messages.

Tokens and logits both use tag=prev_microbatch_id between the same src/dst. Per prior incident, these must not share a tag space to avoid message collisions.

I used the retrieved learning from PR #7455. Suggest using distinct, documented offsets, e.g., kTOKENS_TAG_BASE=0, kLOGITS_TAG_BASE=100000, then tag=k*_BASE+prev_microbatch_id.

Also applies to: 1849-1863

518-546: Initialize or remove self.global_rank
tensorrt_llm/_torch/pyexecutor/py_executor.py logs self.global_rank (around lines 552–556) but PyExecutor.__init__ never sets it, causing an AttributeError at runtime. Either add self.global_rank = dist.rank in the constructor or drop it from the log.

cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/connection.cpp

cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp

cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.h

tensorrt_llm/executor/ray_gpu_worker.py

tensorrt_llm/llmapi/llm_args.py

coderabbitai

Review continued from previous batch...

cpp/include/tensorrt_llm/runtime/utils/mpiUtils.h

cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp

cpp/tensorrt_llm/runtime/utils/CMakeLists.txt

tests/unittest/_torch/ray/test_mapping.py

tests/unittest/_torch/ray/test_placement.py

tests/unittest/conftest.py

coderabbitai

Review continued from previous batch...

cpp/include/tensorrt_llm/runtime/utils/pgUtils.h

cpp/tensorrt_llm/thop/allreduceOp.cpp

tensorrt_llm/executor/ray_executor.py

tensorrt_llm/mapping.py

tongyuantongyu · 2025-09-04T08:17:46Z

/bot run --stage-list "H100_PCIe-PyTorch-1,H100_PCIe-PyTorch-Ray-1" --disable-fail-fast

tongyuantongyu · 2025-09-04T09:13:28Z

/bot run --stage-list "H100_PCIe-PyTorch-1, H100_PCIe-PyTorch-Ray-1" --disable-fail-fast

tensorrt-cicd · 2025-09-04T09:18:24Z

PR_Github #17661 [ run ] triggered by Bot

tensorrt-cicd · 2025-09-04T11:56:10Z

PR_Github #17676 [ run ] triggered by Bot

tensorrt-cicd · 2025-10-03T08:56:21Z

PR_Github #20574 [ run ] triggered by Bot

tensorrt-cicd · 2025-10-03T10:34:31Z

PR_Github #20574 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15529 (Partly Tested) completed with status: 'SUCCESS'

joyang-nv · 2025-10-03T10:35:18Z

/bot run --disable-reuse-test --stage-list "DGX_B200-4_GPUs-PyTorch-Ray-1,DGX_H100-2_GPUs-PyTorch-Ray-1,H100_PCIe-PyTorch-Ray-1"

joyang-nv · 2025-10-03T11:46:04Z

/bot kill

joyang-nv · 2025-10-03T11:46:55Z

/bot run --disable-reuse-test --stage-list "DGX_B200-4_GPUs-PyTorch-Ray-1,DGX_H100-2_GPUs-PyTorch-Ray-1,H100_PCIe-PyTorch-Ray-1"

tensorrt-cicd · 2025-10-03T11:51:54Z

PR_Github #20579 [ kill ] triggered by Bot

tensorrt-cicd · 2025-10-03T11:51:56Z

PR_Github #20579 [ kill ] completed with state SUCCESS
Successfully killed previous jobs for commit efc9b16

tensorrt-cicd · 2025-10-03T11:52:27Z

PR_Github #20580 [ run ] triggered by Bot

joyang-nv · 2025-10-03T13:32:40Z

/bot kill

joyang-nv · 2025-10-03T13:33:21Z

/bot run --disable-reuse-test

tensorrt-cicd · 2025-10-03T13:37:30Z

PR_Github #20580 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15534 (Partly Tested) completed with status: 'SUCCESS'

tensorrt-cicd · 2025-10-03T13:38:21Z

PR_Github #20587 [ kill ] triggered by Bot

tensorrt-cicd · 2025-10-03T13:38:23Z

PR_Github #20587 [ kill ] completed with state SUCCESS
Successfully killed previous jobs for commit efc9b16

tensorrt-cicd · 2025-10-03T13:38:33Z

PR_Github #20588 [ run ] triggered by Bot

joyang-nv · 2025-10-03T14:48:29Z

/bot run

tensorrt-cicd · 2025-10-03T14:54:08Z

PR_Github #20594 [ run ] triggered by Bot

tensorrt-cicd · 2025-10-03T14:54:11Z

PR_Github #20588 [ run ] completed with state ABORTED
/LLM/main/L0_MergeRequest_PR pipeline #15541 completed with status: 'FAILURE'

joyang-nv · 2025-10-03T17:53:49Z

/bot kill

Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> Add ray cleanup fixture Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> Add Ray test cases Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> address comments and fixes Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> add tests for ray examples, refactor to BaseWorker Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> Add ray example tests to CI list Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> Fix CUDA Graph + PG NCCL Coalescing Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> Unify WorkerExit Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> review cleanup Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> Make sub_pg inherit backend from global pg and some cleanup Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com> Cleanup Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com> CI fix Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> fix single gpu and api stability tests Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> Fix fake ray import Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com> Fix device mesh on single gpu Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> build & CI review Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> Move ray requirements declaration Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> Fix error response handling Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> Remove failing case for now Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> skip get stats tests Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> skip test_fp8_block_scales_4gpus Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> Resolve rebase conflict and revert result_wait_queue cleanup Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com> fix test_disaggregated_ctx**_gen** Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> Remove empty case Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> fix ci tests Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> Back result_wait_queue cleanup Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com> Update jenkins/L0_Test.groovy Co-authored-by: Yanchao Lu <yanchaol@nvidia.com> Signed-off-by: shuyixiong <219646547+shuyixiong@users.noreply.github.com> Move ray stage using 4 gpus to b200 Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com> Add orchestrator mpi to other test groups in b200 yaml Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com> Skip get perf metrics tests Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com> Skip unsupported tests in ray stage Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com> minor test fix & nit Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> skip newly added test Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> skip test_disaggregated_serving.py in ray stage until we add cleanup

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

joyang-nv · 2025-10-03T17:58:17Z

/bot run

tensorrt-cicd · 2025-10-03T18:03:39Z

PR_Github #20608 [ run ] triggered by Bot

tensorrt-cicd · 2025-10-03T18:03:40Z

PR_Github #20594 [ run ] completed with state ABORTED
LLM/main/L0_MergeRequest_PR #15547 (Blue Ocean) completed with status: ABORTED

tensorrt-cicd · 2025-10-03T22:38:12Z

PR_Github #20608 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15557 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

joyang-nv requested review from a team as code owners September 4, 2025 06:15

joyang-nv requested review from amukkara, chuangz0, galagam, kaiyux, laikhtewari, pcastonguay and syuoni September 4, 2025 06:15

hchings changed the title ~~Adding ray for new orchestrator type~~ [TRTLLM-7349][feat] Adding new Ray orchestrator Sep 4, 2025

hchings requested review from Superjomn, leslie-fang25 and lowsfer September 4, 2025 06:17

hchings reviewed Sep 4, 2025

View reviewed changes

requirements.txt Outdated Show resolved Hide resolved

coderabbitai bot reviewed Sep 4, 2025

View reviewed changes

joyang-nv changed the title ~~[TRTLLM-7349][feat] Adding new Ray orchestrator~~ [TRTLLM-7349][feat] Adding new orchestrator type -- ray Sep 4, 2025

joyang-nv marked this pull request as draft September 4, 2025 07:42

joyang-nv disabled auto-merge October 3, 2025 10:34

joyang-nv enabled auto-merge (squash) October 3, 2025 13:33

tongyuantongyu and others added 2 commits October 4, 2025 01:55

force local cluster

b6a3abd

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

joyang-nv force-pushed the user/joyang/ray branch from efc9b16 to b6a3abd Compare October 3, 2025 17:57

Merge branch 'main' into user/joyang/ray

df503a7

joyang-nv disabled auto-merge October 4, 2025 00:12

joyang-nv merged commit 88ea2c4 into NVIDIA:main Oct 4, 2025
4 checks passed

[TRTLLM-7349][feat] Adding new orchestrator type -- ray #7520

[TRTLLM-7349][feat] Adding new orchestrator type -- ray #7520

Uh oh!

Conversation

joyang-nv commented Sep 4, 2025 • edited by tongyuantongyu Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Description

Test Coverage

PR Checklist

GitHub Bot Help

kill

skip

reuse-pipeline

Uh oh!

Uh oh!

coderabbitai bot commented Sep 4, 2025

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Chat

Support

CodeRabbit Commands (Invoked using PR/Issue comments)

Other keywords and placeholders

Status, Documentation and Community

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

tongyuantongyu commented Sep 4, 2025

Uh oh!

tongyuantongyu commented Sep 4, 2025

Uh oh!

tensorrt-cicd commented Sep 4, 2025

Uh oh!

tensorrt-cicd commented Sep 4, 2025

Uh oh!

tensorrt-cicd commented Oct 3, 2025

Uh oh!

tensorrt-cicd commented Oct 3, 2025

Uh oh!

joyang-nv commented Sep 4, 2025 •

edited by tongyuantongyu

Loading