-
Notifications
You must be signed in to change notification settings - Fork 1.8k
[TRTLLM-7349][feat] Adding new orchestrator type -- ray #7520
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
📝 WalkthroughWalkthroughIntroduces a ProcessGroup-based distributed path alongside MPI across C++ and Python: new pg_utils library, CacheTransceiverComm abstraction, PG-backed collectives in thop ops, MPI gating via TLLM_DISABLE_MPI, and Ray-based orchestration (executor, workers, examples, tests). Adds bindings, build targets, packaging entries, and tests for PG/Ray flows. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor User
participant LLM as LLM.create()
participant Exec as GenerationExecutor
participant Ray as RayExecutor
participant PG as Torch Dist (PG)
participant W as RayGPUWorker[*]
User->>LLM: create(model, orchestrator_type="ray", tp_size)
LLM->>Exec: GenerationExecutor.create(**args)
alt orchestrator_type == "ray"
Exec->>Ray: _create_ray_executor(worker_kwargs, world_size, tp_size, ...)
Ray->>Ray: init Ray cluster / placement group
Ray->>PG: initialize process group(s)
Ray->>W: create actors (world_size)
else
Exec->>Exec: fallback (MPI)
end
User->>LLM: generate/request
LLM->>Ray: submit(request)
Ray->>W: enqueue_request(leader)
W-->>Ray: result stream/items
Ray-->>LLM: GenerationResult
LLM-->>User: outputs
sequenceDiagram
autonumber
participant CT as CacheTransceiver
participant Comm as CacheTransceiverComm
participant MPI as MPI Comm
participant PG as ProcessGroup
note over CT: Initialize
CT->>Comm: construct (from MPI or PG)
alt MPI path
Comm->>MPI: split(color,key)
Comm->>MPI: allgather/allgatherv
else PG path
Comm->>PG: split via Python helper
Comm->>PG: allgather/allgatherv (PgHelper)
end
Estimated code review effort🎯 5 (Critical) | ⏱️ ~150 minutes Possibly related PRs
Suggested labels
✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
Status, Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 93
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (10)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
973-996
: Preserve MetaInit path whenModelConfig.clone()
is unavailable.Switching from
deepcopy
toclone()
risks skipping the MetaInit path ifclone()
isn't implemented (AttributeError) or not universally supported across configs, causing a fallback to full materialization and potential OOM/perf regressions. Preferclone()
but fall back todeepcopy
only for the copy step to retain the MetaInit flow.try: - # config will be modified in-place for some models, like Qwen2 - config_copy = config.clone() + # config will be modified in-place for some models, like Qwen2 + # Prefer lightweight clone; fall back to deepcopy to preserve MetaInit path if clone is unavailable. + try: + if hasattr(config, "clone"): + config_copy = config.clone() + else: + import copy as _copy + config_copy = _copy.deepcopy(config) + except Exception: + import copy as _copy + config_copy = _copy.deepcopy(config) with MetaInitMode(): model = AutoModelForCausalLM.from_config(config_copy) memo = dict()tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1)
255-256
: Fix logger call to avoid formatting error at runtime.
logger.info("ATTENTION RUNTIME FEATURES: ", attn_runtime_features)
will be formatted with%
under std logging and can raise. Use%s
or f-string.- logger.info("ATTENTION RUNTIME FEATURES: ", attn_runtime_features) + logger.info("ATTENTION RUNTIME FEATURES: %s", attn_runtime_features)cpp/tensorrt_llm/runtime/utils/mpiUtils.cpp (2)
303-316
: Fix incorrect printf specifiers for size_t in log messages.size is size_t but logged with %d, which is UB on LP64 and truncates on 64-bit. Use %zu (or cast to unsigned long long with %llu).
Apply:
- TLLM_LOG_DEBUG("start MPI_Isend with dest %d, tag %d, size %d", dest, static_cast<int>(tag), size); + TLLM_LOG_DEBUG("start MPI_Isend with dest %d, tag %d, size %zu", dest, static_cast<int>(tag), size); @@ - TLLM_LOG_DEBUG("end MPI_Isend with dest %d, tag %d, size %d", dest, static_cast<int>(tag), size); + TLLM_LOG_DEBUG("end MPI_Isend with dest %d, tag %d, size %zu", dest, static_cast<int>(tag), size); @@ - TLLM_LOG_DEBUG("start MPI_Send with dest %d, tag %d, size %d", dest, tag, size); + TLLM_LOG_DEBUG("start MPI_Send with dest %d, tag %d, size %zu", dest, tag, size); @@ - TLLM_LOG_DEBUG("end MPI_Send with dest %d, tag %d, size %d", dest, tag, size); + TLLM_LOG_DEBUG("end MPI_Send with dest %d, tag %d, size %zu", dest, tag, size); @@ - TLLM_LOG_DEBUG("start MPI_Recv with source %d, tag %d, size %d", source, tag, size); + TLLM_LOG_DEBUG("start MPI_Recv with source %d, tag %d, size %zu", source, tag, size); @@ - TLLM_LOG_DEBUG("end MPI_Recv with source %d, tag %d, size %d", source, tag, size); + TLLM_LOG_DEBUG("end MPI_Recv with source %d, tag %d, size %zu", source, tag, size);Also applies to: 324-334, 348-360
429-466
: Gate probe APIs with couldUseMPI() for consistent MPI-disable behavior.mprobe/improbe/iprobe bypass the new runtime MPI guard. They should early-guard like other ops.
void MpiComm::mprobeRawTag(int source, int tag, MPI_Message* msg, MPI_Status* status) const { + couldUseMPI(); #if ENABLE_MULTI_DEVICE @@ bool MpiComm::improbe(int source, MpiTag tag, MPI_Message* msg, MPI_Status* status) const { + couldUseMPI(); #if ENABLE_MULTI_DEVICE @@ bool MpiComm::iprobe(int source, MpiTag tag, MPI_Status* status) const { + couldUseMPI(); #if ENABLE_MULTI_DEVICEsetup.py (2)
175-187
: Fix undefined exception variable 'e' when raising SetupError.e is not defined in this scope; this raises UnboundLocalError instead of the intended SetupError.
- else: - raise SetupError( - f"Failed to get wheel file from {precompiled_path}.") from e + else: + raise SetupError( + f"Failed to get wheel file from {precompiled_path}.")
206-211
: Align python_requires with actual syntax usage (PEP 604 unions).The file uses type hints like
str | None
, which require Python 3.10+. Current python_requires is ">=3.7", which will break installs.- python_requires=">=3.7, <4") + python_requires=">=3.10, <4")Also applies to: 264-264
cpp/tensorrt_llm/batch_manager/CMakeLists.txt (1)
112-116
: Ensure Python3::Python target exists before linking
Incpp/tensorrt_llm/batch_manager/CMakeLists.txt
before line 112, wrap the Python target usage with a guarded find_package:+if(NOT TARGET Python3::Python) + find_package(Python3 REQUIRED COMPONENTS Interpreter Development) +endif() find_library(TORCH_PYTHON_LIB torch_python REQUIRED HINTS ${TORCH_INSTALL_PREFIX}/lib) target_link_libraries(${BATCH_MANAGER_STATIC_TARGET} PUBLIC ${TORCH_PYTHON_LIB} Python3::Python pg_utils)tests/unittest/_torch/ray/test_placement.py (1)
54-67
: Test cleanup and potential race condition concernsThe test modifies and deletes
CUDA_VISIBLE_DEVICES
but doesn't ensure proper cleanup if the test fails. Additionally, there's no Ray cleanup.Apply this diff to ensure proper cleanup:
@pytest.mark.gpu2 def test_cuda_visible_device(): """Placement via cuda_visible_device""" + original_cuda_visible = os.environ.get("CUDA_VISIBLE_DEVICES") os.environ["CUDA_VISIBLE_DEVICES"] = "1" - - llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", - orchestrator_type="ray") - - infer_actor_uuids = llm.collective_rpc("report_device_id") - - del os.environ["CUDA_VISIBLE_DEVICES"] - assert infer_actor_uuids[0] == get_device_uuid(1) - print(f"{infer_actor_uuids=}") + try: + ray.init() + llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", + orchestrator_type="ray") + + infer_actor_uuids = llm.collective_rpc("report_device_id") + assert infer_actor_uuids[0] == get_device_uuid(1) + print(f"{infer_actor_uuids=}") + finally: + # Restore original environment + if original_cuda_visible is not None: + os.environ["CUDA_VISIBLE_DEVICES"] = original_cuda_visible + else: + os.environ.pop("CUDA_VISIBLE_DEVICES", None) + ray.shutdown()tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
862-871
: Disjoint tag namespaces for different inter-PP messages.Tokens and logits both use
tag=prev_microbatch_id
between the same src/dst. Per prior incident, these must not share a tag space to avoid message collisions.I used the retrieved learning from PR #7455. Suggest using distinct, documented offsets, e.g.,
kTOKENS_TAG_BASE=0
,kLOGITS_TAG_BASE=100000
, thentag=k*_BASE+prev_microbatch_id
.Also applies to: 1849-1863
518-546
: Initialize or removeself.global_rank
tensorrt_llm/_torch/pyexecutor/py_executor.py
logsself.global_rank
(around lines 552–556) butPyExecutor.__init__
never sets it, causing an AttributeError at runtime. Either addself.global_rank = dist.rank
in the constructor or drop it from the log.
cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp
Show resolved
Hide resolved
cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.cpp
Show resolved
Hide resolved
cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.h
Show resolved
Hide resolved
cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/ucxCacheCommunicator.h
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review continued from previous batch...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review continued from previous batch...
/bot run --stage-list "H100_PCIe-PyTorch-1,H100_PCIe-PyTorch-Ray-1" --disable-fail-fast |
/bot run --stage-list "H100_PCIe-PyTorch-1, H100_PCIe-PyTorch-Ray-1" --disable-fail-fast |
PR_Github #17661 [ run ] triggered by Bot |
PR_Github #17676 [ run ] triggered by Bot |
PR_Github #20574 [ run ] triggered by Bot |
PR_Github #20574 [ run ] completed with state |
/bot run --disable-reuse-test --stage-list "DGX_B200-4_GPUs-PyTorch-Ray-1,DGX_H100-2_GPUs-PyTorch-Ray-1,H100_PCIe-PyTorch-Ray-1" |
/bot kill |
/bot run --disable-reuse-test --stage-list "DGX_B200-4_GPUs-PyTorch-Ray-1,DGX_H100-2_GPUs-PyTorch-Ray-1,H100_PCIe-PyTorch-Ray-1" |
PR_Github #20579 [ kill ] triggered by Bot |
PR_Github #20579 [ kill ] completed with state |
PR_Github #20580 [ run ] triggered by Bot |
/bot kill |
/bot run --disable-reuse-test |
PR_Github #20580 [ run ] completed with state |
PR_Github #20587 [ kill ] triggered by Bot |
PR_Github #20587 [ kill ] completed with state |
PR_Github #20588 [ run ] triggered by Bot |
/bot run |
PR_Github #20594 [ run ] triggered by Bot |
PR_Github #20588 [ run ] completed with state |
/bot kill |
Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> Add ray cleanup fixture Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> Add Ray test cases Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> address comments and fixes Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> add tests for ray examples, refactor to BaseWorker Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> Add ray example tests to CI list Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> Fix CUDA Graph + PG NCCL Coalescing Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> Unify WorkerExit Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> review cleanup Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> Make sub_pg inherit backend from global pg and some cleanup Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com> Cleanup Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com> CI fix Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> fix single gpu and api stability tests Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> Fix fake ray import Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com> Fix device mesh on single gpu Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> build & CI review Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> Move ray requirements declaration Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> Fix error response handling Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> Remove failing case for now Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> skip get stats tests Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> skip test_fp8_block_scales_4gpus Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> Resolve rebase conflict and revert result_wait_queue cleanup Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com> fix test_disaggregated_ctx**_gen** Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> Remove empty case Signed-off-by: Yuan Tong <13075180+tongyuantongyu@users.noreply.github.com> fix ci tests Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> Back result_wait_queue cleanup Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com> Update jenkins/L0_Test.groovy Co-authored-by: Yanchao Lu <yanchaol@nvidia.com> Signed-off-by: shuyixiong <219646547+shuyixiong@users.noreply.github.com> Move ray stage using 4 gpus to b200 Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com> Add orchestrator mpi to other test groups in b200 yaml Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com> Skip get perf metrics tests Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com> Skip unsupported tests in ray stage Signed-off-by: shuyix <219646547+shuyixiong@users.noreply.github.com> minor test fix & nit Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> skip newly added test Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com> skip test_disaggregated_serving.py in ray stage until we add cleanup
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
efc9b16
to
b6a3abd
Compare
/bot run |
PR_Github #20608 [ run ] triggered by Bot |
PR_Github #20594 [ run ] completed with state |
PR_Github #20608 [ run ] completed with state |
Summary by CodeRabbit
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help]
to print this help message.See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in thestage-list
parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.For guidance on mapping tests to stage names, see
docs/source/reference/ci-overview.md
and the
scripts/test_to_stage_mapping.py
helper.kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.