[TRTLLM-7349][feat] Adding new orchestrator type -- ray by joyang-nv · Pull Request #7520 · NVIDIA/TensorRT-LLM · GitHub

Conversation

@joyang-nv
Collaborator

@joyang-nv joyang-nv commented Sep 4, 2025

Summary by CodeRabbit

  • New Features
    • Optional Ray-based orchestration for inference (multi-node, disaggregated, async).
    • Non-MPI path using PyTorch ProcessGroup, DeviceMesh-based mapping, and PG-backed collectives (allgather/allreduce/reducescatter).
    • Ray executor and GPU worker with queue-based result delivery.
    • LLM API: collective RPC and async/sync weight update helpers.
  • Documentation
    • New READMEs for Ray workflows and disaggregated serving.
  • Examples
    • Ray inference scripts, multi-node cluster launcher, disagg-serving CLI.
  • Tests
    • Ray-enabled unit/integration tests and pytest markers/flag.
  • Chores
    • Added ray dependency; build/packaging updates for new utilities.
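The "collective RPC" helper mentioned in the LLM API bullet can be illustrated with a small stand-in: a coordinator invokes the same method on every worker and gathers the results in rank order. This is a hedged, stdlib-only sketch — `Worker` and `collective_rpc` here are illustrative names, not TensorRT-LLM's actual classes.

```python
from concurrent.futures import ThreadPoolExecutor

class Worker:
    """Stand-in for a per-GPU worker actor (illustrative, not the real class)."""
    def __init__(self, rank: int):
        self.rank = rank

    def report_device_id(self) -> str:
        return f"GPU-{self.rank}"

def collective_rpc(workers, method: str, *args):
    # Invoke `method` on every worker concurrently and return the results
    # in rank order, mirroring the collective-RPC semantics in the summary.
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        futures = [pool.submit(getattr(w, method), *args) for w in workers]
        return [f.result() for f in futures]

workers = [Worker(r) for r in range(4)]
print(collective_rpc(workers, "report_device_id"))
# → ['GPU-0', 'GPU-1', 'GPU-2', 'GPU-3']
```

In the real Ray path the workers would be remote actors and the calls Ray remote-method invocations, but the fan-out/gather shape is the same.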

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with the Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline, ensuring that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since skipping tests without care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since reusing a pipeline without care and validation can break the top of tree.

@joyang-nv joyang-nv requested review from a team as code owners September 4, 2025 06:15
@hchings hchings changed the title Adding ray for new orchestrator type [TRTLLM-7349][feat] Adding new Ray orchestrator Sep 4, 2025
@coderabbitai
Contributor

coderabbitai bot commented Sep 4, 2025

📝 Walkthrough

Walkthrough

Introduces a ProcessGroup-based distributed path alongside MPI across C++ and Python: new pg_utils library, CacheTransceiverComm abstraction, PG-backed collectives in thop ops, MPI gating via TLLM_DISABLE_MPI, and Ray-based orchestration (executor, workers, examples, tests). Adds bindings, build targets, packaging entries, and tests for PG/Ray flows.
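The MPI gating described above boils down to an environment check keyed on TLLM_DISABLE_MPI. A minimal sketch of such a helper — the real `mpi_disabled()` may parse the variable differently, and the accepted truthy spellings here are an assumption:

```python
import os

def mpi_disabled() -> bool:
    # Environment-driven MPI gate keyed on TLLM_DISABLE_MPI, as described in
    # the walkthrough. Which spellings count as "true" is an assumption here.
    return os.environ.get("TLLM_DISABLE_MPI", "0").strip().lower() in ("1", "true", "yes")

os.environ["TLLM_DISABLE_MPI"] = "1"
print(mpi_disabled())  # → True
```

On the C++ side, MpiComm::couldUseMPI() plays the complementary role: it guards MPI entry points so they fail loudly when this flag is set.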

Changes

Cohort / File(s) Summary
Build & Packaging
cpp/tensorrt_llm/runtime/CMakeLists.txt, cpp/tensorrt_llm/runtime/utils/CMakeLists.txt, cpp/tensorrt_llm/batch_manager/CMakeLists.txt, cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/CMakeLists.txt, cpp/tensorrt_llm/thop/CMakeLists.txt, scripts/build_wheel.py, setup.py, requirements.txt, .gitignore
Adds pg_utils library and bindings, links Torch/Torch Python where needed, packages libpg_utils.so and pgBroker.cpp, includes Ray dependency, ignores pg_utils_bindings.*.so.
Runtime Utils: PG/MPI
cpp/include/tensorrt_llm/runtime/utils/pgUtils.h, cpp/tensorrt_llm/runtime/utils/pgUtils.cpp, cpp/tensorrt_llm/runtime/utils/pgUtilsBindings.cpp, tensorrt_llm/_torch/distributed/pg_utils.py, tensorrt_llm/_torch/distributed/pgBroker.cpp, cpp/include/tensorrt_llm/runtime/utils/mpiUtils.h, cpp/tensorrt_llm/runtime/utils/mpiUtils.cpp, tensorrt_llm/_utils.py
Adds ProcessGroup helper APIs, global PG broker init from Python, bindings, and Python-side split. Introduces MpiComm::couldUseMPI() and environment-driven mpi_disabled() with rank/trace adjustments.
Batch Manager Comm Abstraction
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h, cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp, cpp/tensorrt_llm/pybind/batch_manager/cacheTransceiver.cpp
Introduces CacheTransceiverComm (MPI/PG wrapper), refactors transceiver to use unified comm, implements PG path (split, allgather/allgatherv), and adds temporary Python bindings.
UCX Cache Transmission
cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.{h,cpp}, cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/connection.cpp
Makes rank/world-size backend-agnostic (MPI or PG), updates logs to use manager rank, adds PG-based gather paths.
Torch Ops PG Variants
cpp/tensorrt_llm/thop/allgatherOp.cpp, cpp/tensorrt_llm/thop/reducescatterOp.cpp, cpp/tensorrt_llm/thop/allreduceOp.cpp
Adds ProcessGroup-backed implementations and Torch bindings for allgather/reducescatter/allreduce; dispatches between NCCL and PG.
Distributed Communicator (Python)
tensorrt_llm/_torch/distributed/communicator.py, tensorrt_llm/_torch/distributed/ops.py, tensorrt_llm/_ipc_utils.py, tensorrt_llm/_torch/pyexecutor/{py_executor.py,executor_request_queue.py,py_executor_creator.py,resource_manager.py,sampler.py}
Introduces TorchDist with cluster/local groups, PG-based pp-comm alternative, PG collectives when MPI disabled, Ray-friendly recv/send paths, timeouts, and async object transfers.
Device Mesh & Mapping
tensorrt_llm/_torch/device_mesh.py, tensorrt_llm/mapping.py
Adds DeviceMeshTopology and Mapping wrapper selecting MPI vs DeviceMesh at runtime; exposes PG group accessors and lazy mesh build.
Executor & Ray Orchestration
tensorrt_llm/executor/executor.py, tensorrt_llm/executor/ray_executor.py, tensorrt_llm/executor/ray_gpu_worker.py, tensorrt_llm/executor/result.py
Adds RayExecutor, Ray GPU worker/actor wrapper, Ray-backed result queues, and factory routing via orchestrator_type.
LLM API Extensions
tensorrt_llm/llmapi/llm.py, tensorrt_llm/llmapi/llm_args.py
Adds orchestrator_type arg, enables Ray path when set or MPI disabled, exposes collective_rpc and async/sync weight update methods.
Examples (Ray)
examples/ray/*, examples/ray/disaggregated/*, examples/ray/multi_nodes/*, examples/ray/to_delete/*
Adds Ray-based examples, scripts, and experimental tools for async/distributed inference, disaggregated serving, benchmarks, and tests.
Tests (Ray/PG)
tests/unittest/_torch/multi_gpu/*, tests/unittest/_torch/ray/*, tests/unittest/conftest.py, tests/integration/defs/{conftest.py,trt_test_alternative.py}, tests/unittest/_torch/executor/test_overlap_scheduler.py, tests/unittest/llmapi/*, tests/unittest/utils/util.py
Adds PG/Ray multi-GPU tests, Ray test controls/markers, MPI↔Ray parity parametrization, and utility for free port discovery.
Minor Edits
tensorrt_llm/_torch/models/modeling_{phi4mm,utils}.py, tensorrt_llm/_torch/auto_deploy/distributed/common.py
Formatting tweaks and internal port utility redirection.
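The "utility for free port discovery" noted in the Tests cohort is conventionally implemented by binding to port 0 and letting the OS assign an ephemeral port. A stdlib sketch — the actual helper in tests/unittest/utils/util.py may differ:

```python
import socket

def get_free_port() -> int:
    # Bind to port 0 so the OS picks an unused ephemeral port, read the
    # assigned port number, then release the socket and return the number.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

port = get_free_port()
assert 0 < port < 65536
```

Note the inherent race: the port is released before the caller rebinds it, so another process can grab it in between. For tests this is usually acceptable.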

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant LLM as LLM.create()
  participant Exec as GenerationExecutor
  participant Ray as RayExecutor
  participant PG as Torch Dist (PG)
  participant W as RayGPUWorker[*]

  User->>LLM: create(model, orchestrator_type="ray", tp_size)
  LLM->>Exec: GenerationExecutor.create(**args)
  alt orchestrator_type == "ray"
    Exec->>Ray: _create_ray_executor(worker_kwargs, world_size, tp_size, ...)
    Ray->>Ray: init Ray cluster / placement group
    Ray->>PG: initialize process group(s)
    Ray->>W: create actors (world_size)
  else
    Exec->>Exec: fallback (MPI)
  end
  User->>LLM: generate/request
  LLM->>Ray: submit(request)
  Ray->>W: enqueue_request(leader)
  W-->>Ray: result stream/items
  Ray-->>LLM: GenerationResult
  LLM-->>User: outputs
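The request/response flow in the diagram above — submit a request to a leader worker, then stream result items back through a queue — can be mimicked without Ray using a thread and a `queue.Queue`. Everything here (`GpuWorker`, `submit`, the chunked results) is a hypothetical stand-in for the Ray actors and result stream in the diagram:

```python
import queue
import threading

class GpuWorker:
    """Thread-backed stand-in for a RayGPUWorker actor."""
    def __init__(self):
        # Queue-based result delivery, as in the Ray executor design.
        self.results: "queue.Queue" = queue.Queue()

    def enqueue_request(self, prompt: str, n_chunks: int = 3) -> None:
        def run():
            for i in range(n_chunks):          # pretend to decode step by step
                self.results.put(f"{prompt}:chunk{i}")
            self.results.put(None)             # sentinel: stream finished
        threading.Thread(target=run, daemon=True).start()

def submit(worker: GpuWorker, prompt: str):
    # Executor side: enqueue on the leader worker, then drain its result queue.
    worker.enqueue_request(prompt)
    out = []
    while (item := worker.results.get()) is not None:
        out.append(item)
    return out

print(submit(GpuWorker(), "hello"))
# → ['hello:chunk0', 'hello:chunk1', 'hello:chunk2']
```

The real implementation replaces the thread with a Ray actor and the in-process queue with a Ray-backed queue, but the producer/consumer shape is the same.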
sequenceDiagram
  autonumber
  participant CT as CacheTransceiver
  participant Comm as CacheTransceiverComm
  participant MPI as MPI Comm
  participant PG as ProcessGroup
  note over CT: Initialize
  CT->>Comm: construct (from MPI or PG)
  alt MPI path
    Comm->>MPI: split(color,key)
    Comm->>MPI: allgather/allgatherv
  else PG path
    Comm->>PG: split via Python helper
    Comm->>PG: allgather/allgatherv (PgHelper)
  end
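The `split(color, key)` step in both branches of the diagram follows MPI_Comm_split semantics: ranks that pass the same color end up in the same new group, ordered by key (ties broken by original rank). A pure-Python sketch of that grouping rule, for illustration only:

```python
from collections import defaultdict

def comm_split(ranks_color_key):
    # ranks_color_key: list of (rank, color, key) triples, one per rank.
    # Returns {color: [ranks ordered by (key, rank)]}, mirroring the
    # MPI_Comm_split grouping rule used by both the MPI and PG paths.
    groups = defaultdict(list)
    for rank, color, key in ranks_color_key:
        groups[color].append((key, rank))
    return {c: [r for _, r in sorted(members)] for c, members in groups.items()}

# Four ranks split into two groups of two (e.g. context vs. generation servers).
print(comm_split([(0, 0, 0), (1, 0, 1), (2, 1, 0), (3, 1, 1)]))
# → {0: [0, 1], 1: [2, 3]}
```

On the PG path there is no native split primitive, which is why the diagram routes it through a Python helper that builds new sub-ProcessGroups from the computed membership.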

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~150 minutes

Suggested labels

Community want to contribute


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 93

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (10)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

973-996: Preserve MetaInit path when ModelConfig.clone() is unavailable.

Switching from deepcopy to clone() risks skipping the MetaInit path if clone() isn't implemented (AttributeError) or not universally supported across configs, causing a fallback to full materialization and potential OOM/perf regressions. Prefer clone() but fall back to deepcopy only for the copy step to retain the MetaInit flow.

             try:
-                # config will be modified in-place for some models, like Qwen2
-                config_copy = config.clone()
+                # config will be modified in-place for some models, like Qwen2
+                # Prefer lightweight clone; fall back to deepcopy to preserve MetaInit path if clone is unavailable.
+                try:
+                    if hasattr(config, "clone"):
+                        config_copy = config.clone()
+                    else:
+                        import copy as _copy
+                        config_copy = _copy.deepcopy(config)
+                except Exception:
+                    import copy as _copy
+                    config_copy = _copy.deepcopy(config)
                 with MetaInitMode():
                     model = AutoModelForCausalLM.from_config(config_copy)
 
                 memo = dict()
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1)

255-256: Fix logger call to avoid formatting error at runtime.

logger.info("ATTENTION RUNTIME FEATURES: ", attn_runtime_features) will be formatted with % under std logging and can raise. Use %s or f-string.

-    logger.info("ATTENTION RUNTIME FEATURES: ", attn_runtime_features)
+    logger.info("ATTENTION RUNTIME FEATURES: %s", attn_runtime_features)
cpp/tensorrt_llm/runtime/utils/mpiUtils.cpp (2)

303-316: Fix incorrect printf specifiers for size_t in log messages.

size is size_t but logged with %d, which is UB on LP64 and truncates on 64-bit. Use %zu (or cast to unsigned long long with %llu).

Apply:

-    TLLM_LOG_DEBUG("start MPI_Isend with dest %d, tag %d, size %d", dest, static_cast<int>(tag), size);
+    TLLM_LOG_DEBUG("start MPI_Isend with dest %d, tag %d, size %zu", dest, static_cast<int>(tag), size);
@@
-    TLLM_LOG_DEBUG("end MPI_Isend with dest %d, tag %d, size %d", dest, static_cast<int>(tag), size);
+    TLLM_LOG_DEBUG("end MPI_Isend with dest %d, tag %d, size %zu", dest, static_cast<int>(tag), size);
@@
-    TLLM_LOG_DEBUG("start MPI_Send with dest %d, tag %d, size %d", dest, tag, size);
+    TLLM_LOG_DEBUG("start MPI_Send with dest %d, tag %d, size %zu", dest, tag, size);
@@
-    TLLM_LOG_DEBUG("end MPI_Send with dest %d, tag %d, size %d", dest, tag, size);
+    TLLM_LOG_DEBUG("end MPI_Send with dest %d, tag %d, size %zu", dest, tag, size);
@@
-    TLLM_LOG_DEBUG("start MPI_Recv with source %d, tag %d, size %d", source, tag, size);
+    TLLM_LOG_DEBUG("start MPI_Recv with source %d, tag %d, size %zu", source, tag, size);
@@
-    TLLM_LOG_DEBUG("end MPI_Recv with source %d, tag %d, size %d", source, tag, size);
+    TLLM_LOG_DEBUG("end MPI_Recv with source %d, tag %d, size %zu", source, tag, size);

Also applies to: 324-334, 348-360


429-466: Gate probe APIs with couldUseMPI() for consistent MPI-disable behavior.

mprobe/improbe/iprobe bypass the new runtime MPI guard. They should early-guard like other ops.

 void MpiComm::mprobeRawTag(int source, int tag, MPI_Message* msg, MPI_Status* status) const
 {
+    couldUseMPI();
 #if ENABLE_MULTI_DEVICE
@@
 bool MpiComm::improbe(int source, MpiTag tag, MPI_Message* msg, MPI_Status* status) const
 {
+    couldUseMPI();
 #if ENABLE_MULTI_DEVICE
@@
 bool MpiComm::iprobe(int source, MpiTag tag, MPI_Status* status) const
 {
+    couldUseMPI();
 #if ENABLE_MULTI_DEVICE
setup.py (2)

175-187: Fix undefined exception variable 'e' when raising SetupError.

e is not defined in this scope; this raises UnboundLocalError instead of the intended SetupError.

-            else:
-                raise SetupError(
-                    f"Failed to get wheel file from {precompiled_path}.") from e
+            else:
+                raise SetupError(
+                    f"Failed to get wheel file from {precompiled_path}.")

206-211: Align python_requires with actual syntax usage (PEP 604 unions).

The file uses type hints like str | None, which require Python 3.10+. Current python_requires is ">=3.7", which will break installs.

-    python_requires=">=3.7, <4")
+    python_requires=">=3.10, <4")

Also applies to: 264-264

cpp/tensorrt_llm/batch_manager/CMakeLists.txt (1)

112-116: Ensure Python3::Python target exists before linking
In cpp/tensorrt_llm/batch_manager/CMakeLists.txt before line 112, wrap the Python target usage with a guarded find_package:

+if(NOT TARGET Python3::Python)
+  find_package(Python3 REQUIRED COMPONENTS Interpreter Development)
+endif()
 find_library(TORCH_PYTHON_LIB torch_python REQUIRED
              HINTS ${TORCH_INSTALL_PREFIX}/lib)
 target_link_libraries(${BATCH_MANAGER_STATIC_TARGET}
                       PUBLIC ${TORCH_PYTHON_LIB} Python3::Python pg_utils)
tests/unittest/_torch/ray/test_placement.py (1)

54-67: Test cleanup and potential race condition concerns

The test modifies and deletes CUDA_VISIBLE_DEVICES but doesn't ensure proper cleanup if the test fails. Additionally, there's no Ray cleanup.

Apply this diff to ensure proper cleanup:

 @pytest.mark.gpu2
 def test_cuda_visible_device():
     """Placement via cuda_visible_device"""
+    original_cuda_visible = os.environ.get("CUDA_VISIBLE_DEVICES")
     os.environ["CUDA_VISIBLE_DEVICES"] = "1"
-
-    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
-              orchestrator_type="ray")
-
-    infer_actor_uuids = llm.collective_rpc("report_device_id")
-
-    del os.environ["CUDA_VISIBLE_DEVICES"]
-    assert infer_actor_uuids[0] == get_device_uuid(1)
-    print(f"{infer_actor_uuids=}")
+    try:
+        ray.init()
+        llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+                  orchestrator_type="ray")
+        
+        infer_actor_uuids = llm.collective_rpc("report_device_id")
+        assert infer_actor_uuids[0] == get_device_uuid(1)
+        print(f"{infer_actor_uuids=}")
+    finally:
+        # Restore original environment
+        if original_cuda_visible is not None:
+            os.environ["CUDA_VISIBLE_DEVICES"] = original_cuda_visible
+        else:
+            os.environ.pop("CUDA_VISIBLE_DEVICES", None)
+        ray.shutdown()
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)

862-871: Disjoint tag namespaces for different inter-PP messages.

Tokens and logits both use tag=prev_microbatch_id between the same src/dst. Per prior incident, these must not share a tag space to avoid message collisions.

I used the retrieved learning from PR #7455. Suggest using distinct, documented offsets, e.g., kTOKENS_TAG_BASE=0, kLOGITS_TAG_BASE=100000, then tag=k*_BASE+prev_microbatch_id.

Also applies to: 1849-1863
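The suggested fix can be sketched as two disjoint tag bases. The constant names and the 100000 offset come from the comment above, and the sketch assumes microbatch ids stay below the offset:

```python
KTOKENS_TAG_BASE = 0        # tags 0 .. 99_999 reserved for token messages
KLOGITS_TAG_BASE = 100_000  # tags 100_000 .. reserved for logits messages

def tokens_tag(prev_microbatch_id: int) -> int:
    # Assumes microbatch ids never reach the logits base.
    assert 0 <= prev_microbatch_id < KLOGITS_TAG_BASE
    return KTOKENS_TAG_BASE + prev_microbatch_id

def logits_tag(prev_microbatch_id: int) -> int:
    assert 0 <= prev_microbatch_id < KLOGITS_TAG_BASE
    return KLOGITS_TAG_BASE + prev_microbatch_id

# The two namespaces can never collide for valid microbatch ids.
assert tokens_tag(7) != logits_tag(7)
```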


518-546: Initialize or remove self.global_rank
tensorrt_llm/_torch/pyexecutor/py_executor.py logs self.global_rank (around lines 552–556) but PyExecutor.__init__ never sets it, causing an AttributeError at runtime. Either add self.global_rank = dist.rank in the constructor or drop it from the log.

Contributor

@coderabbitai coderabbitai bot left a comment


Review continued from previous batch...

Contributor

@coderabbitai coderabbitai bot left a comment


Review continued from previous batch...

@joyang-nv joyang-nv changed the title [TRTLLM-7349][feat] Adding new Ray orchestrator [TRTLLM-7349][feat] Adding new orchestrator type -- ray Sep 4, 2025
@joyang-nv joyang-nv marked this pull request as draft September 4, 2025 07:42
@tongyuantongyu
Member

/bot run --stage-list "H100_PCIe-PyTorch-1,H100_PCIe-PyTorch-Ray-1" --disable-fail-fast

@tongyuantongyu
Member

/bot run --stage-list "H100_PCIe-PyTorch-1, H100_PCIe-PyTorch-Ray-1" --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #17661 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #17676 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #20574 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #20574 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15529 (Partly Tested) completed with status: 'SUCCESS'

@joyang-nv joyang-nv disabled auto-merge October 3, 2025 10:34
@joyang-nv
Collaborator Author

/bot run --disable-reuse-test --stage-list "DGX_B200-4_GPUs-PyTorch-Ray-1,DGX_H100-2_GPUs-PyTorch-Ray-1,H100_PCIe-PyTorch-Ray-1"

@joyang-nv
Collaborator Author

/bot kill

@joyang-nv
Collaborator Author

/bot run --disable-reuse-test --stage-list "DGX_B200-4_GPUs-PyTorch-Ray-1,DGX_H100-2_GPUs-PyTorch-Ray-1,H100_PCIe-PyTorch-Ray-1"

@tensorrt-cicd
Collaborator

PR_Github #20579 [ kill ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #20579 [ kill ] completed with state SUCCESS
Successfully killed previous jobs for commit efc9b16

@tensorrt-cicd
Collaborator

PR_Github #20580 [ run ] triggered by Bot

@joyang-nv
Collaborator Author

/bot kill

@joyang-nv joyang-nv enabled auto-merge (squash) October 3, 2025 13:33
@joyang-nv
Collaborator Author

/bot run --disable-reuse-test

@tensorrt-cicd
Collaborator

PR_Github #20580 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15534 (Partly Tested) completed with status: 'SUCCESS'

@tensorrt-cicd
Collaborator

PR_Github #20587 [ kill ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #20587 [ kill ] completed with state SUCCESS
Successfully killed previous jobs for commit efc9b16

@tensorrt-cicd
Collaborator

PR_Github #20588 [ run ] triggered by Bot

@joyang-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #20594 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #20588 [ run ] completed with state ABORTED
/LLM/main/L0_MergeRequest_PR pipeline #15541 completed with status: 'FAILURE'

@joyang-nv
Collaborator Author

/bot kill

tongyuantongyu and others added 2 commits October 4, 2025 01:55
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>

Add ray cleanup fixture

Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>

Add Ray test cases

Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>

address comments and fixes

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

add tests for ray examples, refactor to BaseWorker

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

Add ray example tests to CI list

Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>

Fix CUDA Graph + PG NCCL Coalescing

Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>

Unify WorkerExit

Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>

review cleanup

Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>

Make sub_pg inherit backend from global pg and some cleanup

Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com>

Cleanup

Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com>

CI fix

Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>

fix single gpu and api stability tests

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

Fix fake ray import

Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com>

Fix device mesh on single gpu

Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>

build & CI review

Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>

Move ray requirements declaration

Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>

Fix error response handling

Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>

Remove failing case for now

Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>

skip get stats tests

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

skip test_fp8_block_scales_4gpus

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

Resolve rebase conflict and revert result_wait_queue cleanup

Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com>

fix test_disaggregated_ctx**_gen**

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

Remove empty case

Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com>

fix ci tests

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

Back result_wait_queue cleanup

Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com>

Update jenkins/L0_Test.groovy

Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: shuyixiong <219646547+shuyixiong@users.noreply.github.com>

Move ray stage using 4 gpus to b200

Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com>

Add orchestrator mpi to other test groups in b200 yaml

Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com>

Skip get perf metrics tests

Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com>

Skip unsupported tests in ray stage

Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com>

minor test fix & nit

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

skip newly added test

Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>

skip test_disaggregated_serving.py in ray stage until we add cleanup
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
@joyang-nv
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #20608 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #20594 [ run ] completed with state ABORTED
LLM/main/L0_MergeRequest_PR #15547 (Blue Ocean) completed with status: ABORTED

@tensorrt-cicd
Collaborator

PR_Github #20608 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15557 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@joyang-nv joyang-nv disabled auto-merge October 4, 2025 00:12
@joyang-nv joyang-nv merged commit 88ea2c4 into NVIDIA:main Oct 4, 2025
4 checks passed