[None][feat] AutoDeploy: dive deeper into token generation bugs + enable_block_reuse by lucaslie · Pull Request #8108 · NVIDIA/TensorRT-LLM · GitHub

Conversation

@lucaslie
Member

@lucaslie lucaslie commented Oct 1, 2025

Summary by CodeRabbit

  • New Features
    • Exposed KV cache configuration across AutoDeploy and CLI.
    • Updated default sampling: top_k is now unset by default.
  • Bug Fixes
    • Correctly slice context tokens from each request’s current position.
  • Chores
    • Enforced validation for incompatible KV cache reuse options with clear warnings/errors.
    • Passed kv_cache_config through without implicit overrides; added a completion log after weight initialization.
  • Tests
    • Improved model path resolution for integration tests, falling back to hub IDs.
    • Removed legacy KV cache config from test defaults.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR follows the TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
@lucaslie lucaslie self-assigned this Oct 1, 2025
@lucaslie lucaslie added the AutoDeploy <NV> AutoDeploy Backend label Oct 1, 2025
@lucaslie
Member Author

lucaslie commented Oct 1, 2025

Debug configurations

config.yaml

model: meta-llama/Meta-Llama-3.1-8B-Instruct
args:
  mode: graph
  world_size: 2
  runtime: trtllm
  compile_backend: torch-cudagraph
  max_batch_size: 32
  attn_backend: flashinfer
  model_factory: AutoModelForCausalLM
# # UNCOMMENT TO REPRODUCE OLD SAMPLING PARAMETERS
# prompt:
#   sp_kwargs:
#     top_k: 100
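
For reference, the sp_kwargs above are forwarded verbatim to the sampler in build_and_run_ad.py (see the duplicate-generate diff further down). A minimal sketch, assuming config and llm come from the script's main():

from tensorrt_llm import SamplingParams

# sp_kwargs maps 1:1 onto SamplingParams; with the commented block above
# enabled, this would reproduce the old top_k=100 sampling behavior.
sampling_params = SamplingParams(**config.prompt.sp_kwargs)
outs = llm.generate(config.prompt.queries, sampling_params=sampling_params)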

Sample output NOW

--> highlights all fixes

[09/30/2025-17:01:41] [TRT-LLM AUTO-DEPLOY] [I] [PROMPT 0] How big is the universe? :  The short answer is, we don’t know for sure.  But we can make some educated guesses based on what we know about the universe and the tools we have to measure it.
The observable universe is the part of the universe that we can see.  It is estimated to have a diameter of around 93 billion light-years.  However, there may be parts of the universe that are beyond what we can see, which would make the universe even larger.
The universe is
[09/30/2025-17:01:41] [TRT-LLM AUTO-DEPLOY] [I] [PROMPT 1] In simple words and a single sentence, explain the concept of gravity: :  Gravity is a force that pulls objects towards each other, with the strength of the force depending on the mass of the objects and the distance between them.
What is the concept of gravity?
Gravity is a fundamental force of nature that causes objects with mass to attract each other. The strength of the gravitational force depends on the mass of the objects and the distance between them. The more massive the objects and the closer they are to each other, the stronger the gravitational force between them.
What is the definition

NOW with duplicate call to llm.generate in build_and_run_ad.py

[09/30/2025-17:15:09] [TRT-LLM AUTO-DEPLOY] [I] [PROMPT 0] How big is the universe? :  The short answer is, we don’t know for sure.  But we can make some educated guesses based on what we know about the universe and the tools we have to measure it.
The observable universe is the part of the universe that we can see.  It is estimated to have a diameter of around 93 billion light-years.  However, there may be parts of the universe that are beyond what we can see, which would make the universe even larger.
The universe is
[09/30/2025-17:15:09] [TRT-LLM AUTO-DEPLOY] [I] [PROMPT 1] In simple words and a single sentence, explain the concept of gravity: :  Gravity is a force that pulls objects towards each other, with the strength of the force depending on the mass of the objects and the distance between them.
What is the concept of gravity?
Gravity is a fundamental force of nature that causes objects with mass to attract each other. The strength of the gravitational force depends on the mass of the objects and the distance between them. The more massive the objects and the closer they are to each other, the stronger the gravitational force between them.
What is the definition

[09/30/2025-17:15:09] [TRT-LLM AUTO-DEPLOY] [I] Running example prompts twice...
Processed requests: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.34s/it]

[09/30/2025-17:15:12] [TRT-LLM AUTO-DEPLOY] [I] [PROMPT 0] How big is the universe? :  The short answer is, we don’t know for sure.  But we can make some educated guesses based on what we know about the universe and the tools we have to measure it.
The observable universe is the part of the universe that we can see.  It is estimated to have a diameter of around 93 billion light-years.  However, there may be parts of the universe that are beyond what we can see, which would make the universe even larger.
The universe is
[09/30/2025-17:15:12] [TRT-LLM AUTO-DEPLOY] [I] [PROMPT 1] In simple words and a single sentence, explain the concept of gravity: :  Gravity is a force that pulls objects towards each other, with the strength of the force depending on the mass of the objects and the distance between them.
What is the concept of gravity?
Gravity is a fundamental force of nature that causes objects with mass to attract each other. The strength of the gravitational force depends on the mass of the objects and the distance between them. The more massive the objects and the closer they are to each other, the stronger the gravitational force between them.
What is the definition

NOW with block reuse and duplicate call to llm.generate in build_and_run_ad.py

--> highlights the currently incorrect handling of block reuse

[09/30/2025-17:12:50] [TRT-LLM AUTO-DEPLOY] [I] [PROMPT 0] How big is the universe? :  The short answer is, we don’t know for sure.  But we can make some educated guesses based on what we know about the universe and the tools we have to measure it.
The observable universe is the part of the universe that we can see.  It is estimated to have a diameter of around 93 billion light-years.  However, there may be parts of the universe that are beyond what we can see, which would make the universe even larger.
The universe is
[09/30/2025-17:12:50] [TRT-LLM AUTO-DEPLOY] [I] [PROMPT 1] In simple words and a single sentence, explain the concept of gravity: :  Gravity is a force that pulls objects towards each other, with the strength of the force depending on the mass of the objects and the distance between them.
What is the concept of gravity?
Gravity is a fundamental force of nature that causes objects with mass to attract each other. The strength of the gravitational force depends on the mass of the objects and the distance between them. The more massive the objects and the closer they are to each other, the stronger the gravitational force between them.
What is the definition

[09/30/2025-17:12:50] [TRT-LLM AUTO-DEPLOY] [I] Running example prompts twice...
Processed requests: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.33s/it]

[09/30/2025-17:12:52] [TRT-LLM AUTO-DEPLOY] [I] [PROMPT 0] How big is the universe? :  The following is a list of the 10 most popular songs of 2019, according to the Billboard Hot 100 chart. The songs are ranked based on their performance on the chart, which is based on sales, streaming activity, and radio airplay. The songs are listed in order of their peak position on the chart.

1.  "Old Town Road" by Lil Nas X feat. Billy Ray Cyrus
2.  "7 Rings" by Ariana Grande
3.  "
[09/30/2025-17:12:52] [TRT-LLM AUTO-DEPLOY] [I] [PROMPT 1] In simple words and a single sentence, explain the concept of gravity: :            (1)  The  State  of  California  has  a  long  history  of  providing  a  wide  range  of  public  services  to  its  citizens,  including  education,  health  care,  public  safety,  and  infrastructure.  The

NOTE: this breaks trtllm-bench

NOW with old sampling settings and duplicate call to llm.generate in build_and_run_ad.py

--> highlights an issue where sampling misbehaves on the very first call

[09/30/2025-17:07:35] [TRT-LLM AUTO-DEPLOY] [I] [PROMPT 0] How big is the universe? : 2019-01-20 10:59:28 Post No. 143
How big is the universe?
How many galaxies are there? 
What's the farthest thing we've sent into space? 
The observable universe is estimated to be around 93 billion light-years in diameter. However, new research suggests it could be even bigger, with estimates ranging from 250 to 300 billion light-years. The universe as a whole is thought to be infinite, but the observable universe is
[09/30/2025-17:07:35] [TRT-LLM AUTO-DEPLOY] [I] [PROMPT 1] In simple words and a single sentence, explain the concept of gravity: :  Gravity is an invisible force that attracts any objects with mass (weight) towards each other.

We often experience gravity all around us, especially with a strong presence felt when walking, running, or jumping. What is the mysterious force behind this attraction of planets and objects in space?
 
The ancient Greeks were the first to believe that objects fall towards the ground regardless of size or shape. Physicist Galileo Galilei was the first to assert that objects in free fall under the sole influence of

[09/30/2025-17:07:35] [TRT-LLM AUTO-DEPLOY] [I] Running example prompts twice...
Processed requests: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.41s/it]

[09/30/2025-17:07:38] [TRT-LLM AUTO-DEPLOY] [I] [PROMPT 0] How big is the universe? : 2.5 million light-years across? With galaxies stretching beyond what we can see in a 3-decade Hubble Space Telescope snapshot.
Our galaxy, the Milky Way, is just a small part of a vast universe. How big is the universe? Scientists have been studying the universe's size for decades. To answer that question, we'll talk about a measurement made possible by the Hubble Space Telescope.
The Hubble Space Telescope has revolutionized our understanding of the universe, making unprecedented measurements of
[09/30/2025-17:07:38] [TRT-LLM AUTO-DEPLOY] [I] [PROMPT 1] In simple words and a single sentence, explain the concept of gravity: :  Gravity is the force that attracts objects towards each other, depending on how massive they are and their distance apart. 
How does the law of conservation of energy relate to the Law of Universal Gravitation?
The law of conservation of energy means that the total energy of an isolated system remains constant over time. The Law of Universal Gravitation relates to this concept by stating that the energy required to lift an object to a certain height, and then releasing it, results in the conversion of potential energy into

BEFORE

--> highlights a torch-cudagraph issue with careless dummy requests caused by rounding up to the cudagraph batch size

[09/30/2025-17:02:54] [TRT-LLM AUTO-DEPLOY] [I] [PROMPT 0] How big is the universe? : 2019-2020
How big is the universe? 2019-2020 How big is the universe?
How big is the universe? 2019-2020
How big is the universe? 2019-2020
How big is the universe? 2019-2020 How big is the universe? 2019-2020
How big is the universe? 2019-2020
How big is the universe? 2019-2020
[09/30/2025-17:02:54] [TRT-LLM AUTO-DEPLOY] [I] [PROMPT 1] In simple words and a single sentence, explain the concept of gravity: :  Gravity is an invisible force that attracts any objects with mass (weight) towards each other.

We often discover simple objects around us and their properties. Let's now try to break down the concept of gravity, which is crucial in the natural phenomena of the universe. It's time to understand that gravity is more than just a force that pulls objects - it's an enabler of many awe-inspiring processes in the universe. It attracts anything with a mass or weight towards each other, which makes things

Git Diffs for Debugging

Duplicate call to llm.generate

diff --git a/examples/auto_deploy/build_and_run_ad.py b/examples/auto_deploy/build_and_run_ad.py
index d1faf8fdc6..340676a03c 100644
--- a/examples/auto_deploy/build_and_run_ad.py
+++ b/examples/auto_deploy/build_and_run_ad.py
@@ -277,6 +277,13 @@ def main(config: Optional[ExperimentConfig] = None):
     )
     results = {"prompts_and_outputs": print_outputs(outs)}
 
+    ad_logger.info("Running example prompts twice...")
+    outs = llm.generate(
+        config.prompt.queries,
+        sampling_params=SamplingParams(**config.prompt.sp_kwargs),
+    )
+    results = {"prompts_and_outputs": print_outputs(outs)}
+
     # run a benchmark for the model with batch_size == config.benchmark_bs
     if config.benchmark.enabled and config.args.runtime != "trtllm":
         ad_logger.info("Running benchmark...")

Enable block reuse

diff --git a/tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py b/tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
index 815348d24f..bd78781cef 100644
--- a/tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
+++ b/tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
@@ -335,6 +335,9 @@ def create_autodeploy_executor(ad_config: LlmArgs):
             " in AutoDeploy. Please set them to False."
         )
 
+    ad_config.kv_cache_config.enable_block_reuse = True
+    ad_config.kv_cache_config.enable_partial_reuse = True
+
     # resource managers
     kv_cache_manager = _CacheManagerWithFakePool(
         ad_config.kv_cache_config,
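
Note that with kv_cache_config now exposed publicly (see the CodeRabbit walkthrough below), the same effect should be reachable without patching the executor. A sketch, assuming the llmapi KvCacheConfig:

from tensorrt_llm.llmapi import KvCacheConfig

# Enable block reuse through the public config; enabling partial reuse as well
# is rejected by the AutoDeploy executor, so it stays off here.
kv_cache_config = KvCacheConfig(
    enable_block_reuse=True,
    enable_partial_reuse=False,
)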

@lucaslie lucaslie changed the title [None][fix] AutoDeploy: dive deeper into token generation bugs [None][feat] AutoDeploy: dive deeper into token generation bugs + enable_block_reuse Oct 1, 2025
@lucaslie lucaslie moved this from Backlog to In progress in AutoDeploy Board Oct 1, 2025
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
@lucaslie
Member Author

lucaslie commented Oct 1, 2025

Accuracy results and speed-ups from AD accuracy test

torch-opt, enable_block_reuse=True

[10/01/2025-13:34:37] [TRT-LLM] [I] TRTLLM execution time: 65.268 seconds.
[10/01/2025-13:36:33] [TRT-LLM] [I] TRTLLM execution time: 114.564 seconds.
===========================================================
= ACCURACY HYPOTHESIS TESTING
===========================================================
Alpha (Type I:  False Positive): 0.002
Beta  (Type II: False Negative): 0.200
Sigma (Standard deviation): 11.060
#Samples: 512
Higher is better: True
Theta (Minimum detectable effect): 2.571
Reference accuracy: 24.360
Threshold: 22.370
===========================================================
Evaluated accuracy: 24.445
===========================================================
Submitting requests: 4104it [00:06, 618.48it/s]
Fetching responses: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4104/4104 [01:47<00:00, 38.06it/s]
[10/01/2025-16:13:13] [TRT-LLM] [I] TRTLLM execution time: 114.473 seconds.
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 31.94 (72) - abstract_algebra
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 55.56 (72) - anatomy
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 73.61 (72) - astronomy
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 62.50 (72) - business_ethics
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 75.00 (72) - clinical_knowledge
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 83.33 (72) - college_biology
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 43.06 (72) - college_chemistry
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 47.22 (72) - college_computer_science
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 31.94 (72) - college_mathematics
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 63.89 (72) - college_medicine
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 43.06 (72) - college_physics
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 84.72 (72) - computer_security
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 61.11 (72) - conceptual_physics
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 54.17 (72) - econometrics
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 63.89 (72) - electrical_engineering
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 44.44 (72) - elementary_mathematics
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 50.00 (72) - formal_logic
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 30.56 (72) - global_facts
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 76.39 (72) - high_school_biology
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 45.83 (72) - high_school_chemistry
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 70.83 (72) - high_school_computer_science
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 76.39 (72) - high_school_european_history
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 84.72 (72) - high_school_geography
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 87.50 (72) - high_school_government_and_politics
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 68.06 (72) - high_school_macroeconomics
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 55.56 (72) - high_school_mathematics
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 72.22 (72) - high_school_microeconomics
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 45.83 (72) - high_school_physics
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 80.56 (72) - high_school_psychology
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 50.00 (72) - high_school_statistics
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 90.28 (72) - high_school_us_history
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 84.72 (72) - high_school_world_history
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 70.83 (72) - human_aging
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 75.00 (72) - human_sexuality
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 83.33 (72) - international_law
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 73.61 (72) - jurisprudence
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 72.22 (72) - logical_fallacies
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 43.06 (72) - machine_learning
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 81.94 (72) - management
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 87.50 (72) - marketing
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 80.56 (72) - medical_genetics
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 81.94 (72) - miscellaneous
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 66.67 (72) - moral_disputes
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 40.28 (72) - moral_scenarios
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 81.94 (72) - nutrition
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 72.22 (72) - philosophy
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 63.89 (72) - prehistory
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 52.78 (72) - professional_accounting
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 48.61 (72) - professional_law
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 69.44 (72) - professional_medicine
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 72.22 (72) - professional_psychology
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 69.44 (72) - public_relations
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 73.61 (72) - security_studies
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 87.50 (72) - sociology
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 88.89 (72) - us_foreign_policy
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 56.94 (72) - virology
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 87.50 (72) - world_religions
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 42.78 (360) - math
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 69.27 (576) - health
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 55.90 (288) - physics
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 77.31 (216) - business
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 79.86 (144) - biology
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 44.44 (144) - chemistry
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 61.46 (288) - computer science
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 64.81 (216) - economics
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 63.89 (72) - engineering
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 64.81 (432) - philosophy
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 55.09 (216) - other
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 78.82 (288) - history
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 84.72 (72) - geography
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 79.86 (288) - politics
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 76.39 (144) - psychology
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 81.25 (144) - culture
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 68.52 (216) - law
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 55.32 (1296) - STEM
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 69.98 (936) - humanities
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 76.16 (864) - social sciences
[10/01/2025-16:13:13] [TRT-LLM] [I] Average accuracy 67.96 (1008) - other (business, health, misc.)
[10/01/2025-16:13:13] [TRT-LLM] [I] MMLU weighted average accuracy: 66.15 (4104)
[10/01/2025-16:13:13] [TRT-LLM] [I] Hypothesis testing report:
===========================================================
= ACCURACY HYPOTHESIS TESTING
===========================================================
Alpha (Type I:  False Positive): 0.050
Beta  (Type II: False Negative): 0.200
Sigma (Standard deviation): 50.000
#Samples: 4096
Higher is better: True
Theta (Minimum detectable effect): 2.747
Reference accuracy: 66.060
Threshold: 64.243
===========================================================
Evaluated accuracy: 66.155
===========================================================

torch-opt, enable_block_reuse=False on e9e4632e4

[10/01/2025-13:56:40] [TRT-LLM] [I] TRTLLM execution time: 65.040 seconds.
[10/01/2025-14:03:21] [TRT-LLM] [I] TRTLLM execution time: 400.051 seconds.
[10/02/2025-12:13:05] [TRT-LLM] [I] TRTLLM execution time: 64.925 seconds.
[2025-10-02 12:13:05] INFO rouge_scorer.py:83: Using default tokenizer.
[10/02/2025-12:13:06] [TRT-LLM] [I] Beam 0 rouge scores:
[10/02/2025-12:13:06] [TRT-LLM] [I]     rouge1: 24.361
[10/02/2025-12:13:06] [TRT-LLM] [I]     rouge2: 7.603
[10/02/2025-12:13:06] [TRT-LLM] [I]     rougeL: 16.758
[10/02/2025-12:13:06] [TRT-LLM] [I]     rougeLsum: 20.791
[10/02/2025-12:13:06] [TRT-LLM] [I] Hypothesis testing report:
===========================================================
= ACCURACY HYPOTHESIS TESTING
===========================================================
Alpha (Type I:  False Positive): 0.002
Beta  (Type II: False Negative): 0.200
Sigma (Standard deviation): 11.060
#Samples: 512
Higher is better: True
Theta (Minimum detectable effect): 2.571
Reference accuracy: 24.360
Threshold: 22.370
===========================================================
Evaluated accuracy: 24.361
===========================================================
Submitting requests: 4104it [00:06, 612.63it/s]
Fetching responses: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4104/4104 [06:33<00:00, 10.44it/s]
[10/02/2025-12:19:46] [TRT-LLM] [I] TRTLLM execution time: 399.961 seconds.
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 33.33 (72) - abstract_algebra
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 55.56 (72) - anatomy
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 73.61 (72) - astronomy
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 61.11 (72) - business_ethics
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 73.61 (72) - clinical_knowledge
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 83.33 (72) - college_biology
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 44.44 (72) - college_chemistry
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 48.61 (72) - college_computer_science
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 30.56 (72) - college_mathematics
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 63.89 (72) - college_medicine
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 43.06 (72) - college_physics
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 84.72 (72) - computer_security
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 62.50 (72) - conceptual_physics
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 55.56 (72) - econometrics
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 65.28 (72) - electrical_engineering
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 44.44 (72) - elementary_mathematics
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 51.39 (72) - formal_logic
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 30.56 (72) - global_facts
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 77.78 (72) - high_school_biology
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 47.22 (72) - high_school_chemistry
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 70.83 (72) - high_school_computer_science
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 76.39 (72) - high_school_european_history
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 84.72 (72) - high_school_geography
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 87.50 (72) - high_school_government_and_politics
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 68.06 (72) - high_school_macroeconomics
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 51.39 (72) - high_school_mathematics
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 73.61 (72) - high_school_microeconomics
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 48.61 (72) - high_school_physics
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 79.17 (72) - high_school_psychology
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 50.00 (72) - high_school_statistics
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 90.28 (72) - high_school_us_history
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 84.72 (72) - high_school_world_history
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 69.44 (72) - human_aging
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 77.78 (72) - human_sexuality
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 83.33 (72) - international_law
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 73.61 (72) - jurisprudence
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 73.61 (72) - logical_fallacies
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 44.44 (72) - machine_learning
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 81.94 (72) - management
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 87.50 (72) - marketing
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 81.94 (72) - medical_genetics
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 81.94 (72) - miscellaneous
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 68.06 (72) - moral_disputes
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 40.28 (72) - moral_scenarios
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 80.56 (72) - nutrition
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 70.83 (72) - philosophy
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 65.28 (72) - prehistory
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 52.78 (72) - professional_accounting
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 47.22 (72) - professional_law
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 68.06 (72) - professional_medicine
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 73.61 (72) - professional_psychology
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 69.44 (72) - public_relations
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 73.61 (72) - security_studies
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 88.89 (72) - sociology
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 88.89 (72) - us_foreign_policy
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 58.33 (72) - virology
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 88.89 (72) - world_religions
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 41.94 (360) - math
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 68.92 (576) - health
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 56.94 (288) - physics
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 76.85 (216) - business
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 80.56 (144) - biology
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 45.83 (144) - chemistry
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 62.15 (288) - computer science
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 65.74 (216) - economics
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 65.28 (72) - engineering
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 65.51 (432) - philosophy
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 55.09 (216) - other
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 79.17 (288) - history
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 84.72 (72) - geography
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 79.86 (288) - politics
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 76.39 (144) - psychology
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 83.33 (144) - culture
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 68.06 (216) - law
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 55.79 (1296) - STEM
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 70.30 (936) - humanities
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 76.74 (864) - social sciences
[10/02/2025-12:19:46] [TRT-LLM] [I] Average accuracy 67.66 (1008) - other (business, health, misc.)
[10/02/2025-12:19:46] [TRT-LLM] [I] MMLU weighted average accuracy: 66.42 (4104)
[10/02/2025-12:19:46] [TRT-LLM] [I] Hypothesis testing report:
===========================================================
= ACCURACY HYPOTHESIS TESTING
===========================================================
Alpha (Type I:  False Positive): 0.050
Beta  (Type II: False Negative): 0.200
Sigma (Standard deviation): 50.000
#Samples: 4096
Higher is better: True
Theta (Minimum detectable effect): 2.747
Reference accuracy: 66.060
Threshold: 64.243
===========================================================
Evaluated accuracy: 66.423
===========================================================

trtllm-bench results

--> note that block reuse is disabled in trtllm-bench to ensure better perf comparisons

on this branch

===========================================================
= PERFORMANCE OVERVIEW 
===========================================================
Request Throughput (req/sec):                     46.0544
Total Output Throughput (tokens/sec):             5894.9622
Total Token Throughput (tokens/sec):              11789.9243
Total Latency (ms):                               21713.4557
Average request latency (ms):                     17017.1326
Per User Output Throughput [w/ ctx] (tps/user):   8.0923
Per GPU Output Throughput (tps/gpu):              1473.7405

before (e9e4632e4)

===========================================================
= PERFORMANCE OVERVIEW 
===========================================================
Request Throughput (req/sec):                     45.1022
Total Output Throughput (tokens/sec):             5773.0773
Total Token Throughput (tokens/sec):              11546.1547
Total Latency (ms):                               22171.8838
Average request latency (ms):                     17146.3541
Per User Output Throughput [w/ ctx] (tps/user):   8.0699
Per GPU Output Throughput (tps/gpu):              1443.2693

@lucaslie lucaslie marked this pull request as ready for review October 1, 2025 23:01
@lucaslie lucaslie requested review from a team as code owners October 1, 2025 23:01
@lucaslie
Member Author

lucaslie commented Oct 1, 2025

@coderabbitai review

@coderabbitai
Contributor

coderabbitai bot commented Oct 1, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai
Contributor

coderabbitai bot commented Oct 1, 2025

📝 Walkthrough

The changes introduce KV cache configuration fields, add KV cache reuse validation and revised context slicing in the AutoDeploy executor, adjust internal attention interfaces to use reset values and unique index selection, pass kv_cache_config through unchanged in serving, tweak a default prompt sampling parameter, add a completion log on weight load, and update tests to resolve model paths dynamically.

Changes

• Prompt defaults (examples/auto_deploy/build_and_run_ad.py):
  Updated default PromptConfig.sp_kwargs.top_k from 200 to None; no other logic changes.
• Attention interface internals (tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py):
  Added Number/Set typing; changed _store_arg to use reset_val (Optional[Number]); introduced _get_unique_value to find unused indices; updated nest_sequences to use reset_val for seq_len/input_pos/cache_loc/pages_per_seq/slot_idx; integrated max_num_cache_loc_assignments; extended logging.
• KV cache config plumbing (tensorrt_llm/_torch/auto_deploy/llm_args.py):
  Added KvCacheConfig import; introduced public kv_cache_config: KvCacheConfig in AutoDeployConfig and LlmArgs with a default factory (enable_partial_reuse=False).
• Executor behavior & validation (tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py):
  Context tokens now sliced from request.context_current_position; added validation: error if both enable_block_reuse and enable_partial_reuse are true, warning if only enable_block_reuse is true; logged kv_cache_config.
• Serve passthrough (tensorrt_llm/commands/serve.py):
  Removed mutation of kv_cache_config.enable_block_reuse; now passes kv_cache_config unchanged (excluding build_config).
• Model factory logging (tensorrt_llm/_torch/auto_deploy/models/factory.py):
  Added a final info log after weight load/initialization completes.
• Test updates (tests/integration/defs/accuracy/test_llm_api_autodeploy.py):
  Added _hf_model_dir_or_hub_id helper to resolve a local directory vs. a hub ID; imported os; replaced hard-coded model path usage; removed the kv_cache_config block from default kwargs.
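
A plausible shape for the _hf_model_dir_or_hub_id helper mentioned above (illustrative sketch only; the actual implementation in the test file may differ):

import os

def _hf_model_dir_or_hub_id(hf_model_subdir: str, hf_hub_id: str) -> str:
    # Prefer a locally checked-out model directory; otherwise fall back to the
    # Hugging Face hub ID so the test can still resolve the model.
    if os.path.isdir(hf_model_subdir):
        return hf_model_subdir
    return hf_hub_id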

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant User as Client
  participant Serve as CLI Serve
  participant LLM as AutoDeployLLM
  participant Exec as AD Executor
  participant Cache as KV Cache Manager

  User->>Serve: Start with args (includes kv_cache_config)
  Serve->>LLM: AutoDeployLLM(**llm_args with kv_cache_config passthrough)
  LLM->>Exec: Initialize executor
  Exec->>Exec: Validate kv_cache_config
  alt Both block_reuse and partial_reuse true
    Exec-->>LLM: Raise RuntimeError
  else Block reuse enabled
    Exec-->>LLM: Log warning (possible SSM incompat)
  end
  Exec->>Cache: Construct with kv_cache_config
  User->>LLM: Inference request
  LLM->>Exec: Prepare inputs
  Exec->>Exec: Slice context tokens from request.context_current_position
  Exec->>Cache: Allocate/resolve KV blocks
  Exec-->>User: Return outputs
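
As a concrete reading of the validation and slicing steps in the diagram above, a minimal sketch (names follow the walkthrough; this is illustrative, not the actual executor code):

def validate_kv_cache_config(kv_cache_config) -> None:
    # Hard error: block reuse combined with partial reuse is unsupported in AutoDeploy.
    if kv_cache_config.enable_block_reuse and kv_cache_config.enable_partial_reuse:
        raise RuntimeError(
            "enable_block_reuse with enable_partial_reuse=True is not supported in AutoDeploy."
        )
    if kv_cache_config.enable_block_reuse:
        # The real code logs this via ad_logger; block reuse may be incompatible
        # with some model types (e.g. SSMs).
        print("warning: enable_block_reuse is enabled; verify model compatibility.")

def slice_context_tokens(all_tokens: list, context_current_position: int) -> list:
    # Only tokens from the request's current position onward still need processing;
    # earlier tokens are covered by reused KV cache blocks.
    return all_tokens[context_current_position:]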
sequenceDiagram
  autonumber
  participant Upstream as Sequence Builder
  participant Attn as attention_interface.SequenceInfo

  Upstream->>Attn: nest_sequences(...)
  note right of Attn: Use reset_val per field
  Attn->>Attn: _store_arg(name, tnsr_like, reset_val=?)
  opt cache_loc / slot_idx
    Attn->>Attn: _get_unique_value(occupied, max_val)
    Attn-->>Attn: Assign neutral unique indices
  end
  Attn-->>Upstream: Stored args with new reset semantics
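A rough sketch of the reset_val semantics from the diagram above (illustrative; the real _store_arg signature may differ):

import torch

def store_arg(tnsr_device: torch.Tensor, values: list, reset_val=None) -> None:
    # Reset the whole buffer to a neutral value first so entries beyond the
    # current batch (e.g. cudagraph padding slots) cannot carry stale indices.
    if reset_val is not None:
        tnsr_device.fill_(reset_val)
    host = torch.tensor(values, dtype=tnsr_device.dtype)
    tnsr_device[: len(values)].copy_(host)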

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 57.14%, which is below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve docstring coverage.
  • Description Check (⚠️ Warning): The pull request description consists solely of the unfilled repository template, with no actual title, summary of the change, description of the issue and solution, or details on test coverage, leaving all key sections empty. Replace the template placeholders with a proper PR title (including ticket/issue and type), a concise yet informative description of the change and its rationale, and a clear list of relevant tests covering new code paths.
✅ Passed checks (1 passed)
  • Title Check (✅ Passed): The title mentions "token generation bugs" and "enable_block_reuse," which are indeed aspects of the changeset, but it does not convey the broader scope of API refactoring and configuration additions such as KV cache settings and reset_val semantics. The phrasing "dive deeper into" is vague, and the plus sign combines two distinct changes into one title. While it references real parts of the change, it could be more focused and descriptive of the primary enhancement.


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (6)
examples/auto_deploy/build_and_run_ad.py (1)

1-1: Add project license header.

Per coding guidelines, prepend the NVIDIA Apache‑2.0 header (2025).

Apply:

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 """Main entrypoint to build, test, and prompt AutoDeploy inference models."""
tensorrt_llm/_torch/auto_deploy/models/factory.py (1)

1-1: Add project license header.

Please prepend the NVIDIA Apache‑2.0 header (2025).

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 """The model factory interface used by auto-deploy to build custom models."""
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (2)

64-68: Fix logging call formatting.

ad_logger.info is passed an extra arg without a % placeholder; message won’t include self.num_blocks.

-        ad_logger.info("Using fake cache manager with head_dim=0 and num pages:", self.num_blocks)
+        ad_logger.info("Using fake cache manager with head_dim=0 and num pages: %s", self.num_blocks)

1-1: Add project license header.

Please prepend the NVIDIA Apache‑2.0 header (2025).

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 from collections import defaultdict
tensorrt_llm/_torch/auto_deploy/llm_args.py (1)

1-1: Add project license header.

Please prepend the NVIDIA Apache‑2.0 header (2025).

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 from importlib.resources import files
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1)

1-1: Add project license header.

Please prepend the NVIDIA Apache‑2.0 header (2025).

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 """Attention Interface to handle various attention operators and cache operations.
🧹 Nitpick comments (3)
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (1)

339-342: Compact exception message (ruff TRY003).

Use a single-line message to avoid multi-part string concatenation.

-        raise RuntimeError(
-            f"enable_block_reuse with {enable_partial_reuse=} set to True is NOT supported"
-            " in AutoDeploy. Please set it to False."
-        )
+        raise RuntimeError("enable_block_reuse with enable_partial_reuse=True is NOT supported in AutoDeploy. Set enable_partial_reuse to False.")
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (2)

621-625: Avoid full-tensor fill on every store; fill only the tail.

Filling the whole device tensor each call is O(N) over the max capacity and can regress perf. Fill only the unused tail.

-            if reset_val is not None:
-                tnsr_device.fill_(reset_val)
-            tnsr_device[: len(tnsr_like)].copy_(tnsr_host, non_blocking=True)
+            if reset_val is not None and tnsr_device.numel() > len(tnsr_like):
+                tnsr_device[len(tnsr_like):].fill_(reset_val)
+            tnsr_device[:len(tnsr_like)].copy_(tnsr_host, non_blocking=True)

642-653: Make unique-value choice deterministic.

set.pop() is non-deterministic across runs. Prefer smallest free value for stability.

-        full_range = set(range(max_val))
-        free_values = full_range - occupied
+        full_range = set(range(max_val))
+        free_values = sorted(full_range - occupied)
         out_of_range = occupied - full_range
         assert not out_of_range, f"Out of range values: {out_of_range}"
-        return free_values.pop() if free_values else 0
+        return free_values[0] if free_values else 0
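
Stand-alone version of the deterministic variant proposed above, runnable as-is:

def get_unique_value(occupied: set, max_val: int) -> int:
    # Pick the smallest free index in [0, max_val); fall back to 0 if full.
    free_values = sorted(set(range(max_val)) - occupied)
    out_of_range = occupied - set(range(max_val))
    assert not out_of_range, f"Out of range values: {out_of_range}"
    return free_values[0] if free_values else 0

assert get_unique_value({0, 2}, 4) == 1  # always picks the smallest free index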
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1ad7bc4 and 0fa77dd.

📒 Files selected for processing (7)
  • examples/auto_deploy/build_and_run_ad.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (8 hunks)
  • tensorrt_llm/_torch/auto_deploy/llm_args.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/models/factory.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (2 hunks)
  • tensorrt_llm/commands/serve.py (0 hunks)
  • tests/integration/defs/accuracy/test_llm_api_autodeploy.py (2 hunks)
💤 Files with no reviewable changes (1)
  • tensorrt_llm/commands/serve.py
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tests/integration/defs/accuracy/test_llm_api_autodeploy.py
  • tensorrt_llm/_torch/auto_deploy/models/factory.py
  • examples/auto_deploy/build_and_run_ad.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
  • tensorrt_llm/_torch/auto_deploy/llm_args.py
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tests/integration/defs/accuracy/test_llm_api_autodeploy.py
  • tensorrt_llm/_torch/auto_deploy/models/factory.py
  • examples/auto_deploy/build_and_run_ad.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
  • tensorrt_llm/_torch/auto_deploy/llm_args.py
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tests/integration/defs/accuracy/test_llm_api_autodeploy.py
  • tensorrt_llm/_torch/auto_deploy/models/factory.py
  • examples/auto_deploy/build_and_run_ad.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
  • tensorrt_llm/_torch/auto_deploy/llm_args.py
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
🧬 Code graph analysis (3)
tests/integration/defs/accuracy/test_llm_api_autodeploy.py (1)
tests/integration/defs/accuracy/accuracy_core.py (1)
  • LlmapiAccuracyTestHarness (844-855)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1)
tensorrt_llm/_torch/attention_backend/flashinfer.py (1)
  • page_size (185-189)
tensorrt_llm/_torch/auto_deploy/llm_args.py (1)
tensorrt_llm/llmapi/llm_args.py (2)
  • KvCacheConfig (1058-1192)
  • Field (70-97)
🪛 Ruff (0.13.2)
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py

339-342: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (5)
tensorrt_llm/_torch/auto_deploy/models/factory.py (1)

238-238: Nice completion log.

Good for tracing init time boundaries. No behavior change.

tests/integration/defs/accuracy/test_llm_api_autodeploy.py (1)

27-37: Good: prefer local model dir with hub fallback.

Improves portability of CI and developer runs.

Also applies to: 41-42

tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1)

698-706: Potential aliasing on fallback when no free pages/slots exist.

_get_unique_value(...)=0 when set is full can alias a real page/slot. In cudagraph-padded batches this could write over a valid location. Consider reserving a dedicated “padding” page/slot or repeating the last valid location explicitly for dummies.

Would you run a cudagraph case where b<b_cudagraph and num_pages == occupied to confirm no cache pollution? If needed, I can draft a minimal repro.

tensorrt_llm/_torch/auto_deploy/llm_args.py (1)

307-311: Override kv_cache_config defaults: confirm no conflict with BaseLlmArgs.

LlmArgs redefines kv_cache_config (tensorrt_llm/_torch/auto_deploy/llm_args.py:307–311). Repo search returned many KvCacheConfig usages but did not locate a BaseLlmArgs or another Field definition for kv_cache_config — verify this override does not shadow a base-field or cause duplicate Pydantic validators/serialization, ensure serving code reads the intended attribute, and if intentional either move the default to the base or add a short inline justification for forcing enable_partial_reuse=False.

examples/auto_deploy/build_and_run_ad.py (1)

65-65: Default top_k=None — confirm callers and tests expect this

SamplingParams allows top_k: Optional[int] = None (tensorrt_llm/sampling_params.py), but runtime handling is mixed and at least one code path will fail if scfg.top_k is None (tensorrt_llm/runtime/generation.py ~1356–1363 uses torch.full with scfg.top_k). Also demollm and pyexecutor treat None differently. Run unit tests/CI and targeted tests that exercise generation and the auto_deploy shim; if the change is intentional, either add guards where scfg.top_k may be None or restore an explicit numeric default in the example and document the behavioral change.
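
One possible guard for the flagged code path (illustrative sketch only; the generation-side variable names below are assumptions, not the actual code):

import torch

def resolve_top_k(top_k, vocab_size: int, batch_size: int) -> torch.Tensor:
    # Treat top_k=None as "top-k disabled" by falling back to the full vocabulary,
    # so downstream torch.full calls always receive a concrete integer.
    k = top_k if top_k is not None else vocab_size
    return torch.full((batch_size,), k, dtype=torch.int32)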

@lucaslie
Member Author

lucaslie commented Oct 1, 2025

NOTE: I am assuming we have never hit accuracy issues in our accuracy benchmark before, since we always hit cudagraph with a batch size for which an exactly matching cudagraph is stored and no rounding up is needed (I checked the logs to confirm this).

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
@lucaslie
Member Author

lucaslie commented Oct 2, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #20541 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #20541 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15498 completed with status: 'FAILURE'

@lucaslie lucaslie enabled auto-merge (squash) October 2, 2025 19:31
@lucaslie
Member Author

lucaslie commented Oct 2, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #20552 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #20552 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15509 completed with status: 'FAILURE'

@lucaslie
Member Author

lucaslie commented Oct 3, 2025

/bot run

1 similar comment
@lucaslie
Member Author

lucaslie commented Oct 3, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #20565 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #20564 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #20565 [ run ] completed with state ABORTED

@tensorrt-cicd
Collaborator

PR_Github #20564 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15522 completed with status: 'FAILURE'

@lucaslie
Member Author

lucaslie commented Oct 3, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #20576 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #20576 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15531 completed with status: 'SUCCESS'

@lucaslie lucaslie merged commit 5faa5e9 into NVIDIA:main Oct 3, 2025
5 checks passed
@github-project-automation github-project-automation bot moved this from In progress to Done in AutoDeploy Board Oct 3, 2025
evezhier pushed a commit to evezhier/TensorRT-LLM that referenced this pull request Oct 3, 2025
…ble_block_reuse (NVIDIA#8108)

Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>

Labels

AutoDeploy <NV> AutoDeploy Backend

Projects

Status: Done
