-
Notifications
You must be signed in to change notification settings - Fork 1.8k
[TRTLLM-6453][feat] Support chunked prefill on spec decode 2 model #6104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughThe changes introduce a new attribute, Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant PyExecutor
participant LlmRequest
participant ModelDrafter
User->>PyExecutor: Submit request
PyExecutor->>LlmRequest: Initialize (py_last_context_chunk = (None, None))
loop For each context chunk
PyExecutor->>LlmRequest: Update py_last_context_chunk (start, end)
PyExecutor->>ModelDrafter: Prepare draft batch (with chunk info)
ModelDrafter->>LlmRequest: Create/Update context request with chunk info
ModelDrafter->>ModelDrafter: Process decoded tokens (synchronize with target)
end
Estimated code review effort2 (~20 minutes) Possibly related PRs
Suggested labels
Suggested reviewers
Poem
📜 Recent review detailsConfiguration used: .coderabbit.yaml 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
✨ Finishing Touches
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
840fb61
to
95ab84c
Compare
/bot run |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
tests/unittest/_torch/speculative/test_eagle3.py (1)
78-90
: Address line length violation and improve prompt readability.The long prompt string on line 81 exceeds the 120 character limit flagged by static analysis.
- prompts = [ - "The capital of France is a city of romance, art, fashion, and cuisine. Paris is a must-visit destination for anyone who loves history, architecture, and culture. From the iconic Eiffel Tower to the world-famous Louvre Museum, Paris has something to offer for every interest and age.\nThe city is divided into 20 arrondissements, each with its own unique character and charm. The Latin Quarter is a popular area for students and young travelers, while the Champs-Élysées is a hub for shopping and dining. The Montmartre neighborhood is famous for its bohemian vibe and stunning views of the city.\nParis is also known for its beautiful parks and gardens, such as the Luxembourg Gardens and the Tuileries Garden. The city has a rich history, with landmarks like the Notre-Dame Cathedral and the Arc de Triomphe. Visitors can also explore the city's many museums, including the Musée d'Orsay and the Musée Rodin.\nIn addition to its cultural and historical attractions, Paris is also a great destination for foodies. The city is famous for its cuisine, including croissants, baguettes, and cheese. Visitors can sample the city's famous dishes at one of the many restaurants, cafes, and " - ] + prompts = [ + ("The capital of France is a city of romance, art, fashion, and cuisine. " + "Paris is a must-visit destination for anyone who loves history, architecture, and culture. " + "From the iconic Eiffel Tower to the world-famous Louvre Museum, Paris has something to offer " + "for every interest and age.\nThe city is divided into 20 arrondissements, each with its own " + "unique character and charm. The Latin Quarter is a popular area for students and young travelers, " + "while the Champs-Élysées is a hub for shopping and dining. The Montmartre neighborhood is famous " + "for its bohemian vibe and stunning views of the city.\nParis is also known for its beautiful " + "parks and gardens, such as the Luxembourg Gardens and the Tuileries Garden. The city has a rich " + "history, with landmarks like the Notre-Dame Cathedral and the Arc de Triomphe. Visitors can also " + "explore the city's many museums, including the Musée d'Orsay and the Musée Rodin.\nIn addition " + "to its cultural and historical attractions, Paris is also a great destination for foodies. The " + "city is famous for its cuisine, including croissants, baguettes, and cheese. Visitors can sample " + "the city's famous dishes at one of the many restaurants, cafes, and ") + ]
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
tensorrt_llm/_torch/pyexecutor/llm_request.py
(1 hunks)tensorrt_llm/_torch/pyexecutor/py_executor.py
(1 hunks)tensorrt_llm/_torch/speculative/model_drafter.py
(5 hunks)tests/unittest/_torch/speculative/test_eagle3.py
(3 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- tensorrt_llm/_torch/pyexecutor/llm_request.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
tests/unittest/_torch/speculative/test_eagle3.py (2)
tensorrt_llm/llmapi/llm.py (2)
tokenizer
(657-661)tokenizer
(664-665)tests/unittest/llmapi/test_llm.py (1)
encode
(308-309)
🪛 Ruff (0.12.2)
tests/unittest/_torch/speculative/test_eagle3.py
81-81: Line too long (1197 > 120)
(E501)
🔇 Additional comments (7)
tests/unittest/_torch/speculative/test_eagle3.py (2)
17-31
: Test coverage for chunked prefill looks comprehensive.The parametrize decorator appropriately adds
enable_chunked_prefill
parameter with test cases covering both chunked and non-chunked scenarios across different configurations.
62-66
: Configuration for chunked prefill is correctly implemented.The conditional logic properly enables chunked prefill and reduces
max_num_tokens
to 64 to trigger the chunked prefill code path, which aligns with the test objectives.tensorrt_llm/_torch/speculative/model_drafter.py (5)
79-88
: Context request creation properly handles chunked prefill boundaries.The method correctly extracts chunk boundaries from
py_last_context_chunk
and setscontext_current_position
andcontext_chunk_size
appropriately for chunked prefill scenarios.
103-118
: Method rename improves clarity and maintains correct logic.The rename from
_create_chunked_context_request
to_create_accepted_tokens_request
better describes the method's purpose. The logic for handling accepted tokens in chunked context remains correct.
180-194
: Chunked prefill handling in draft batch preparation is well-implemented.The logic correctly:
- Skips requests with
context_current_position == 0
(still need target model processing)- Handles chunked prefill by reconstructing input tokens and creating context requests
- Properly integrates with the existing draft batch workflow
285-289
: Token processing correctly defers draft token addition for chunked prefill.The logic appropriately checks if the target model request is not in
GENERATION_IN_PROGRESS
state and defers adding draft tokens until the entire prompt is processed, while properly freeing resources.
142-143
: Method call update aligns with the renamed method.The call to
_create_accepted_tokens_request
correctly reflects the method rename and maintains the same parameters.
PR_Github #12222 [ run ] triggered by Bot |
PR_Github #12222 [ run ] completed with state |
/bot run |
PR_Github #12227 [ run ] triggered by Bot |
PR_Github #12227 [ run ] completed with state |
a8c5d0b
to
92b4a83
Compare
/bot run |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
tests/unittest/_torch/speculative/test_eagle3.py (1)
78-90
: Fix the line length violation for better readability.The prompt selection logic is well-implemented and appropriate for testing chunked prefill functionality. However, the long prompt string on line 81 exceeds the 120-character limit.
Consider breaking the long prompt into multiple lines for better readability:
- "The capital of France is a city of romance, art, fashion, and cuisine. Paris is a must-visit destination for anyone who loves history, architecture, and culture. From the iconic Eiffel Tower to the world-famous Louvre Museum, Paris has something to offer for every interest and age.\nThe city is divided into 20 arrondissements, each with its own unique character and charm. The Latin Quarter is a popular area for students and young travelers, while the Champs-Élysées is a hub for shopping and dining. The Montmartre neighborhood is famous for its bohemian vibe and stunning views of the city.\nParis is also known for its beautiful parks and gardens, such as the Luxembourg Gardens and the Tuileries Garden. The city has a rich history, with landmarks like the Notre-Dame Cathedral and the Arc de Triomphe. Visitors can also explore the city's many museums, including the Musée d'Orsay and the Musée Rodin.\nIn addition to its cultural and historical attractions, Paris is also a great destination for foodies. The city is famous for its cuisine, including croissants, baguettes, and cheese. Visitors can sample the city's famous dishes at one of the many restaurants, cafes, and " + ("The capital of France is a city of romance, art, fashion, and cuisine. " + "Paris is a must-visit destination for anyone who loves history, architecture, and culture. " + "From the iconic Eiffel Tower to the world-famous Louvre Museum, Paris has something to offer for every interest and age.\n" + "The city is divided into 20 arrondissements, each with its own unique character and charm. " + "The Latin Quarter is a popular area for students and young travelers, while the Champs-Élysées is a hub for shopping and dining. " + "The Montmartre neighborhood is famous for its bohemian vibe and stunning views of the city.\n" + "Paris is also known for its beautiful parks and gardens, such as the Luxembourg Gardens and the Tuileries Garden. " + "The city has a rich history, with landmarks like the Notre-Dame Cathedral and the Arc de Triomphe. " + "Visitors can also explore the city's many museums, including the Musée d'Orsay and the Musée Rodin.\n" + "In addition to its cultural and historical attractions, Paris is also a great destination for foodies. " + "The city is famous for its cuisine, including croissants, baguettes, and cheese. " + "Visitors can sample the city's famous dishes at one of the many restaurants, cafes, and ")
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
tensorrt_llm/_torch/pyexecutor/llm_request.py
(1 hunks)tensorrt_llm/_torch/pyexecutor/py_executor.py
(1 hunks)tensorrt_llm/_torch/speculative/model_drafter.py
(5 hunks)tests/unittest/_torch/speculative/test_eagle3.py
(3 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- tensorrt_llm/_torch/pyexecutor/llm_request.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tensorrt_llm/_torch/speculative/model_drafter.py
🧰 Additional context used
🪛 Ruff (0.12.2)
tests/unittest/_torch/speculative/test_eagle3.py
81-81: Line too long (1197 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tests/unittest/_torch/speculative/test_eagle3.py (3)
16-27
: LGTM! Comprehensive test coverage for chunked prefill feature.The parameterization correctly adds the new
enable_chunked_prefill
parameter and includes test cases for both single-model and two-model scenarios with chunked prefill enabled. The existing test cases are preserved to maintain backward compatibility.
31-31
: Function signature properly updated.The function signature correctly includes the new
enable_chunked_prefill
parameter with proper type annotation.
62-66
: Well-implemented chunked prefill configuration.The configuration correctly enables chunked prefill and sets
max_num_tokens
to 64 to ensure the chunked prefill code path is exercised during testing. The comment provides clear context for this choice.
PR_Github #12330 [ run ] triggered by Bot |
PR_Github #12330 [ run ] completed with state |
/bot run |
PR_Github #12345 [ run ] triggered by Bot |
/bot run |
PR_Github #12719 [ run ] triggered by Bot |
PR_Github #12719 [ run ] completed with state |
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
/bot run |
PR_Github #12740 [ run ] triggered by Bot |
PR_Github #12740 [ run ] completed with state |
/bot run --disable-fail-fast |
PR_Github #12866 [ run ] triggered by Bot |
PR_Github #12866 [ run ] completed with state |
/bot run |
PR_Github #12888 [ run ] triggered by Bot |
PR_Github #12888 [ run ] completed with state |
…VIDIA#6104) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>
…VIDIA#6104) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> Signed-off-by: Ransiki Zhang <ransikiz@nvidia.com>
…VIDIA#6104) Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com> Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
Description
This PR adds chunked prefill support to the 2-model spec decode flow. In this design, prefill chunks are sent to the draft model immediately after they are processed by the target.
One consequence of this setup is that we'll have to load the draft model on prefill workers for disagg scenarios.
Test Coverage
Added new unit test for both one model and 2 model. Manually verified that AR is the same on a set of long prompts after enabling chunked prefill.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help]
to print this help message.See details below for each supported subcommand.
run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]
Launch build/test pipelines. All previously running jobs will be killed.
--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.--stage-list "A10-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.--extra-stage "H100_PCIe-[Post-Merge]-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".For guidance on mapping tests to stage names, see
docs/source/reference/ci-overview.md
.kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
Summary by CodeRabbit