[TRTLLM-5195][feat] Multimodal Disagg Support in TRTLLM by chang-l · Pull Request #5000 · NVIDIA/TensorRT-LLM · GitHub

Conversation


@chang-l chang-l commented Jun 6, 2025

Multimodal Disagg Support in TRTLLM

This is a POC enabling disaggregated (disagg) serving support for multimodal inputs in the PyTorch flow.

To-do before merging:

  • Add example/doc/benchmark script
  • Add functional test for the torch.Tensor CUDA IPC/shared-memory utility (see the sketch after this list)
  • Add accuracy test for multimodal model engine/executor
  • Add e2e test
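
As a rough illustration of what the shared-tensor functional test above could exercise (a minimal sketch, not the PR's actual utility; the producer/consumer helpers and the tensor shape are hypothetical), a CUDA tensor put on a torch.multiprocessing queue is serialized by ForkingPickler as a CUDA IPC handle rather than copied:

```python
# Minimal sketch, not the PR's utility: pass a CUDA tensor between processes
# via torch.multiprocessing, which pickles a CUDA IPC handle instead of the data.
import torch
import torch.multiprocessing as mp


def producer(queue, done):
    emb = torch.randn(1, 576, 4096, device="cuda")  # stand-in for an mm embedding
    queue.put(emb)   # ForkingPickler serializes a CUDA IPC handle to the storage
    done.wait()      # keep the source tensor alive until the consumer has read it


def consumer(queue, done):
    emb = queue.get()  # re-maps the same device memory in this process
    assert emb.is_cuda and emb.shape == (1, 576, 4096)
    done.set()


if __name__ == "__main__":
    mp.set_start_method("spawn")  # required when CUDA is used in child processes
    q, done = mp.Queue(), mp.Event()
    procs = [mp.Process(target=producer, args=(q, done)),
             mp.Process(target=consumer, args=(q, done))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```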

Some preliminary results

Model: llava-hf/llava-v1.6-mistral-7b-hf
Tool: genai-perf
Setup: 1 LLM server (TP1) and/or 1 MM server (TP1)

| Concurrency | Request Cnt/Rate | ISL | OSL | Image Size | TRT-LLM Type | Latency p75 (ms) | TTFT p75 (ms) | ITL p75 (ms) | Throughput (tokens/sec) |
|---|---|---|---|---|---|---|---|---|---|
| N/A | 100/10   | 64 | 64 | (512, 512) | Disagg  | 867  | 76   | 12.6 | 603  |
| N/A | 100/10   | 64 | 64 | (512, 512) | PyTorch | 1573 | 181  | 22   | 597  |
| 1   | 50/None  | 64 | 64 | (512, 512) | Disagg  | 691  | 61   | 10   | 91   |
| 1   | 50/None  | 64 | 64 | (512, 512) | PyTorch | 844  | 216  | 10   | 75   |
| 10  | 50/None  | 64 | 64 | (512, 512) | Disagg  | 1021 | 300  | 14   | 621  |
| 10  | 50/None  | 64 | 64 | (512, 512) | PyTorch | 1937 | 1183 | 18   | 345  |
| 100 | 500/None | 64 | 64 | (512, 512) | Disagg  | 4705 | 538  | 66   | 1289 |
| 100 | 500/None | 64 | 64 | (512, 512) | PyTorch | 9969 | 9948 | 0.07 | 630  |

Test Coverage

  • accuracy test (see README)
  • genai-perf benchmarks:
 ./test_client_disag_mm.sh --concurrency 2 --port 8003
 ./test_client_disag_mm.sh --request-rate 15 --port 8001

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".
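
For example, a valid invocation combining only the flags documented above (illustrative, not taken from this PR) would be:

/bot run --disable-fail-fast --gpu-type "A30, H100_PCIe"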

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

chang-l added 2 commits June 5, 2025 18:15
Initial commit to add standalone encoder engine

Add encoder server

Add ForkingPickler to enable sharemem transfer

Add cudaIPC support and intermed to disagg e2e

Enable E2E in disagg mode

[1/N] Refactor: Relocate mm_encoder and MultiModalParams

[2/N] Refactor: move MM request/response/result to dedicated files

[3/N] Refactor: move shared IPC tensor to a dedicated place

[4/N] Refactor: Enable multigpu on llm server + Disable sharedtensor pool

[5/N] Cleanup: Remove unnecessary authkey sync

Genai-perf script + Delay decre sharetensor ref + Port image load fix from gh-main
@chang-l chang-l self-assigned this Jun 6, 2025
@chang-l chang-l marked this pull request as ready for review June 7, 2025 01:36
@chang-l chang-l requested review from a team as code owners June 7, 2025 01:36
@chang-l chang-l changed the title feat: [POC] Multimodal Disagg Support in TRTLLM [TRTLLM-5195][feat] Multimodal Disagg Support in TRTLLM Jun 7, 2025
tensor_pool = get_handle_buffer()
tensor_pool.add_handle(str(request.py_request_id), shared_tensor)
multimodal_embedding.copy_(shared_tensor)
self.mm_emb_dist.broadcast(multimodal_embedding)
Collaborator

Not sure this is the right place to broadcast. Should we broadcast in _fetch_new_requests instead?

I got a comment that adding work in prepare_tp_inputs could degrade the performance of the overlap scheduler.

Collaborator Author

I see. In fact, this is an NCCL broadcast and should not block the CPU.

One concern is that since we need to broadcast the mm_embedding CUDA tensor for every request, moving this to _fetch_new_requests would require looping over all requests in the batch to broadcast their mm_embedding in one place. I'm not sure about the perf implications compared to the current flow, i.e., each forward pass broadcasts mm_embed and consumes it immediately.
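
For reference, a minimal sketch of the point above (assuming torch.distributed is already initialized with the NCCL backend; the function name is hypothetical): the broadcast is enqueued on the current CUDA stream, so the Python call returns without blocking the CPU, and the forward pass can consume the tensor on the same stream.

```python
# Minimal sketch, assuming dist.init_process_group("nccl") was already called.
import torch
import torch.distributed as dist


def broadcast_mm_embedding(mm_embedding: torch.Tensor, src: int = 0) -> torch.Tensor:
    # Every rank passes a pre-allocated CUDA tensor of identical shape/dtype.
    # The NCCL kernel is enqueued on the current CUDA stream; the host thread
    # does not synchronize here.
    dist.broadcast(mm_embedding, src=src)
    return mm_embedding  # consumed later by the forward pass on the same stream
```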

@jaedeokk

Let me add @Shunkangz for visibility.

@chang-l chang-l requested a review from dongxuy04 June 13, 2025 02:55
Collaborator

@pcastonguay pcastonguay left a comment

The PR is quite large (>3000 new lines). Can it be broken into multiple smaller PRs with unit tests to make it easier to review? Also, right now we only have 1 e2e test and 1 unit test for the shared tensor, which is not sufficient for 3000+ new lines of code.

Collaborator

This multimodal executor is meant to execute the standalone vision encoder, right?
I think replicating/inheriting most of the code from PyExecutor might not be ideal, as it is harder to maintain one more PyExecutor class.
Should we consider reusing the no-KV path in PyExecutor (e.g., this path can already run BERT, which is also an encoder model; see the previous PR https://gitlab-master.nvidia.com/ftp/tekit/-/merge_requests/8280) and extending PyExecutor without duplicating the class?

@chang-l
Copy link
Collaborator Author

chang-l commented Jun 21, 2025

Thanks @pcastonguay.
I agree this PR can be split into smaller, interdependent ones to ease the review process. To keep each part self-contained, I’ll split it into: one PR for shared tensor support, one for the multimodal PyExecutor, and one for the multimodal disagg serving support.
