[None][doc] Update kvcache part by nv-guomingz · Pull Request #7549 · NVIDIA/TensorRT-LLM · GitHub

Conversation

@nv-guomingz
Collaborator

@nv-guomingz nv-guomingz commented Sep 5, 2025

Cherry-pick #7382 into 1.0 branch

Summary by CodeRabbit

  • New Features

    • Added a KV cache retention configuration for per-token-range priority and configurable decode retention during generation.
  • Documentation

    • Clarified KV cache defaults (memory fraction applies after weights load) and improved Python quick-start examples.
    • Added examples showing KV cache configuration and disabling block reuse.
    • Clarified retention policy semantics (priority ordering, duration/None behavior) and refined speculative-decoding guidance (overlap scheduler auto-disabled for two-model setups).
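For quick reference, here is a minimal sketch of the usage pattern the updated kvcacheconfig.md describes. The 0.7 memory fraction and the post-creation enable_block_reuse toggle come from this PR's changes; the model name, prompts, and exact import paths are illustrative assumptions and may vary between TensorRT LLM releases.

from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

# Cap the KV cache at 70% of the GPU memory left free after the model weights are loaded.
kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.7)

# Properties can also be changed after construction, for example to disable block reuse.
kv_cache_config.enable_block_reuse = False

# Pass the config when creating the LLM (the model name is a placeholder).
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", kv_cache_config=kv_cache_config)

prompts = ["Hello, my name is", "The capital of France is"]
for output in llm.generate(prompts, SamplingParams(max_tokens=32)):
    print(output.outputs[0].text)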

@nv-guomingz nv-guomingz requested a review from a team as a code owner September 5, 2025 01:27
@coderabbitai
Contributor

coderabbitai bot commented Sep 5, 2025

📝 Walkthrough

Walkthrough

Documentation updates clarifying KV cache configuration and retention semantics, adding KvCacheRetentionConfig API examples, reformatting code blocks, and removing/relaxing speculative-decoding constraints (overlap scheduler auto-disabled for two-model setups). Minor wording and capitalization edits across related docs.

Changes

Cohort / File(s) — Summary of changes

Examples: KV cache config (docs/source/examples/kvcacheconfig.md)
Clarified default free GPU memory fraction (applies after weights load), emphasized passing kv_cache_config when creating the LLM, added KvCacheConfig example (free_gpu_memory_fraction=0.7) and post-creation example (enable_block_reuse = False), converted code blocks to Python fences, minor wording/capitalization edits.

Examples: KV cache retention config & API (docs/source/examples/kvcacheretentionconfig.md)
Introduced KvCacheRetentionConfig and nested TokenRangeRetentionConfig usage in examples; shows token-range priority (tokens 0–4 → priority 100), decode_retention_priority=35, decode_duration_ms=None, and passing kv_cache_retention_config to llm.generate; clarified defaults and single-vs-list config usage.

Features: KV cache behavior (docs/source/features/kvcache.md)
Clarified retention semantics (lower-priority blocks freed first; priority reverts to default after duration_ms from first reuse; None disables expiry for that part), updated example references and wording, minor editorial fixes.

Features: Speculative decoding (docs/source/features/speculative-decoding.md)
Removed explicit KV cache reuse / overlap-scheduling constraints in quick start and MTP-specific notes; consolidated two-model limitation to “overlap scheduler disabled automatically.”
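To make the retention example summarized above concrete, the sketch below combines the documented values (tokens 0–4 at priority 100, decode_retention_priority=35, decode_duration_ms=None). The import path for KvCacheRetentionConfig and the keyword name passed to llm.generate are assumptions based on the docs updated here and may differ by release; the model name and prompts are placeholders.

from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheRetentionConfig  # import path assumed

# Keep the blocks holding the first four prompt tokens (e.g. a shared system prompt)
# at the highest priority; tokens generated during decode fall back to priority 35.
token_range = KvCacheRetentionConfig.TokenRangeRetentionConfig(0, 4, 100, None)
kv_cache_retention_config = KvCacheRetentionConfig(
    token_range_retention_configs=[token_range],
    decode_retention_priority=35,
    decode_duration_ms=None,  # None: the decode priority does not expire
)

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # placeholder model
outputs = llm.generate(
    ["Hello, my name is", "The capital of France is"],
    SamplingParams(max_tokens=32),
    # One config applies to all prompts; a list of per-prompt configs is also accepted.
    kv_cache_retention_config=kv_cache_retention_config,
)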

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant U as User
  participant L as LLM API
  participant E as Engine
  participant K as KV Cache Manager

  U->>L: llm.generate(prompts, kv_cache_config, kv_cache_retention_config)
  L->>E: initialize/dispatch with configs
  E->>K: allocate/tag KV blocks (respect kv_cache_config)
  rect rgb(220,235,255)
    note over K: Apply retention policy<br/>token-range priorities & default priority
    E->>K: tag blocks with priorities/durations
  end
  alt Reuse eligible
    E->>K: reuse high-priority blocks
  else Eviction needed
    K-->K: evict lower-priority blocks first
  end
  E-->>L: token stream
  L-->>U: generated output

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested labels

Documentation

Suggested reviewers

  • laikhtewari
  • kaiyux
@nv-guomingz nv-guomingz changed the title from "[None][Doc] Update kvcache part and rename TensorRT-LLM to TensorRT LLM." to "[None][Doc] Update kvcache part" on Sep 5, 2025
@nv-guomingz nv-guomingz force-pushed the user/guomingz/1.0_doc_minor_update branch from 94a6315 to 9b44c22 on September 5, 2025 01:28
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (8)
docs/source/features/speculative-decoding.md (3)

188-189: Grammar: add article and tighten phrasing.
“Two-model based speculation implementations do not support overlap scheduler. It will be disabled automatically.” → improve clarity.

-Two-model based speculation implementations do not support overlap scheduler. It will be disabled automatically.
+Two‑model–based speculation implementations do not support the overlap scheduler; it is disabled automatically.

43-44: Make overlap-scheduler guidance consistent with the “auto-disabled” statement.
Examples still pass disable_overlap_scheduler=True unconditionally. Either remove it for two‑model setups or gate it on eagle3_one_model.

-llm = LLM("/path/to/target_model", speculative_config=speculative_config, disable_overlap_scheduler=True)
+llm = LLM("/path/to/target_model", speculative_config=speculative_config)
-# Only need to disable overlap scheduler if eagle3_one_model is False.
-llm = LLM("/path/to/target_model", speculative_config=speculative_config, disable_overlap_scheduler=True)
+# Disable overlap scheduler only for the two‑model path.
+llm = LLM("/path/to/target_model",
+          speculative_config=speculative_config,
+          disable_overlap_scheduler=not eagle3_one_model)
-llm = LLM("/path/to/target_model", speculative_config=speculative_config, disable_overlap_scheduler=True)
+llm = LLM("/path/to/target_model", speculative_config=speculative_config)

Also consider updating the YAML example (Lines 134–140) to reflect this nuance.

Also applies to: 64-66, 86-87


228-233: Typos and grammar in acceptance description.
Fix missing word and “drat” typo.

-Currently, only greedy sampling is supported for speculative decoding. A draft token is accepted if
-matches the previously decoded token exactly.
+Currently, only greedy sampling is supported for speculative decoding. A draft token is accepted if
+it matches the previously decoded token exactly.
@@
-`[t, d1, d2, d3]`, where `d1`, `d2`, and `d3` are drat tokens.
+`[t, d1, d2, d3]`, where `d1`, `d2`, and `d3` are draft tokens.
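
As a side note for readers unfamiliar with the rule being edited here, this is an illustrative-only Python sketch of greedy draft-token acceptance: drafts are accepted left to right while they match the target model's own greedy choices, and verification stops at the first mismatch. It is not TensorRT LLM code; all names are hypothetical.

from typing import List

def accepted_prefix(draft_tokens: List[int], target_greedy_tokens: List[int]) -> List[int]:
    # target_greedy_tokens[i] is the token the target model would emit greedily
    # at the position of draft_tokens[i].
    accepted: List[int] = []
    for draft, target in zip(draft_tokens, target_greedy_tokens):
        if draft != target:
            break  # first mismatch: the remaining draft tokens are discarded
        accepted.append(draft)
    return accepted

# Drafts [d1, d2, d3] = [5, 7, 9]; the target greedily decodes [5, 7, 2]:
# d1 and d2 match and are accepted, d3 is rejected.
print(accepted_prefix([5, 7, 9], [5, 7, 2]))  # [5, 7]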
docs/source/examples/kvcacheconfig.md (2)

3-3: Inline code formatting: don’t include “argument” inside backticks.

-Set KV cache behavior by providing the optional ```kv_cache_config argument``` when you create the LLM engine.
+Set KV cache behavior by providing the optional `kv_cache_config` argument when you create the LLM engine.

3-3: Consistency: “quickstart” vs. “quick start”.
Use one form throughout (“quick start” is used elsewhere).

-Consider the quickstart example found in ```examples/pytorch/quickstart.py```:
+Consider the quick start example found in `examples/pytorch/quickstart.py`:
-This code disables block reuse for the quick start example.
+This code disables block reuse for the quick start example.

(Apply the same spelling uniformly across the docs.)

Also applies to: 47-47

docs/source/examples/kvcacheretentionconfig.md (2)

31-31: Clarify token-range bounds (inclusive/exclusive).
You say “first four tokens” and pass (0, 4). If end is exclusive, that’s correct; please state it explicitly to avoid ambiguity.

-Assume you know that the first four tokens of each prompt represent a system prompt ...
+Assume you know that the first four tokens of each prompt (token indices 0–3; end index is exclusive) represent a system prompt ...

49-56: Variable naming style in example.
Prefer snake_case for Python variables.

-    tokenRangeRetentionConfig = KvCacheRetentionConfig.TokenRangeRetentionConfig(0, 4, 100, None)
-    kv_cache_retention_config = KvCacheRetentionConfig(
-        token_range_retention_configs=[tokenRangeRetentionConfig],
+    token_range_retention_config = KvCacheRetentionConfig.TokenRangeRetentionConfig(0, 4, 100, None)
+    kv_cache_retention_config = KvCacheRetentionConfig(
+        token_range_retention_configs=[token_range_retention_config],
docs/source/features/kvcache.md (1)

45-46: Class name casing.
Use KvCacheConfig (matching the API and link target), not “KVCacheConfig”.

-Many of the features in the KV cache system are optional or have user defined properties that alter how they work. Users can control KV cache features through class [KVCacheConfig](https://nvidia.github.io/TensorRT-LLM/llm-api/reference.html#tensorrt_llm.llmapi.KvCacheConfig).
+Many of the features in the KV cache system are optional or have user‑defined properties that alter how they work. Users can control KV cache features through class [KvCacheConfig](https://nvidia.github.io/TensorRT-LLM/llm-api/reference.html#tensorrt_llm.llmapi.KvCacheConfig).
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between e07fa9d and 9b44c22.

📒 Files selected for processing (4)
  • docs/source/examples/kvcacheconfig.md (2 hunks)
  • docs/source/examples/kvcacheretentionconfig.md (4 hunks)
  • docs/source/features/kvcache.md (2 hunks)
  • docs/source/features/speculative-decoding.md (1 hunks)
🔇 Additional comments (1)
docs/source/features/kvcache.md (1)

33-34: Cross‑doc consistency: reuse + speculative decoding.
This now claims “reuse across requests is supported by all speculative decoding models.” Please verify against the current backend behavior, and align examples in speculative‑decoding.md (which still set disable_overlap_scheduler=True in two‑model examples).

Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
@nv-guomingz nv-guomingz force-pushed the user/guomingz/1.0_doc_minor_update branch from 9b44c22 to 7a7f006 on September 5, 2025 06:47
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (11)
docs/source/features/speculative-decoding.md (5)

188-188: Polish grammar: add article and hyphenation.

Suggested: “Two-model-based speculation implementations do not support the overlap scheduler; it will be disabled automatically.”

-Two-model based speculation implementations do not support overlap scheduler. It will be disabled automatically.
+Two-model-based speculation implementations do not support the overlap scheduler; it will be disabled automatically.

43-44: Align examples with auto-disable behavior: remove redundant flag.

Since two-model setups auto-disable overlap scheduling, drop the explicit disable_overlap_scheduler=True to avoid confusion.

-llm = LLM("/path/to/target_model", speculative_config=speculative_config, disable_overlap_scheduler=True)
+llm = LLM("/path/to/target_model", speculative_config=speculative_config)

65-66: Make EAGLE 3 snippet reflect conditional need.

Either omit the flag entirely (recommended) or show it conditionally only when using two-model. Example below removes it for clarity.

-# Only need to disable overlap scheduler if eagle3_one_model is False.
-llm = LLM("/path/to/target_model", speculative_config=speculative_config, disable_overlap_scheduler=True)
+# Two-model setups auto-disable overlap scheduling.
+llm = LLM("/path/to/target_model", speculative_config=speculative_config)

86-87: NGram example: drop disable_overlap_scheduler=True.

Auto-disable applies to two-model algorithms; keep the example minimal.

-llm = LLM("/path/to/target_model", speculative_config=speculative_config, disable_overlap_scheduler=True)
+llm = LLM("/path/to/target_model", speculative_config=speculative_config)

134-140: YAML sample: remove disable_overlap_scheduler or note it’s auto/ignored.

To reduce user confusion, either delete the key or add a brief comment that it’s auto-disabled for two-model setups.

-disable_overlap_scheduler: true
 speculative_config:
   decoding_type: Eagle
   max_draft_len: 4
   speculative_model: /path/to/draft/model
docs/source/examples/kvcacheconfig.md (3)

3-3: Inline code formatting: remove triple backticks and “argument” from code span.

Use single backticks and keep prose outside the code span.

-Set KV cache behavior by providing the optional ```kv_cache_config argument``` when you create the LLM engine. Consider the quickstart example found in ```examples/pytorch/quickstart.py```:
+Set KV cache behavior by providing the optional `kv_cache_config` argument when you create the LLM engine. Consider the quick start example found in `examples/pytorch/quickstart.py`:

31-31: Consistent terminology: “quick start” (two words).

Matches usage elsewhere in the docs.

-You can reduce this value to 0.7 by adding the following lines to the quickstart example:
+You can reduce this value to 0.7 by adding the following lines to the quick start example:

39-39: Inline code formatting for class name.

Use single backticks, not triple.

-You can also set properties after you create ```KvCacheConfig```. For example:
+You can also set properties after you create `KvCacheConfig`. For example:
docs/source/examples/kvcacheretentionconfig.md (3)

3-3: Inline code formatting: use single backticks.

Applies to both occurrences on this line.

-You can change block priority by providing the optional ```kv_cache_retention_config``` argument when you submit a request to the LLM engine. Consider the quick start example found in ```examples/pytorch/quickstart.py```:
+You can change block priority by providing the optional `kv_cache_retention_config` argument when you submit a request to the LLM engine. Consider the quick start example found in `examples/pytorch/quickstart.py`:

49-56: PEP 8 naming + clarify token range bounds.

  • Use snake_case for variables in Python examples.
  • Please clarify whether the end index is inclusive or exclusive to prevent off-by-one errors.
-    tokenRangeRetentionConfig = KvCacheRetentionConfig.TokenRangeRetentionConfig(0, 4, 100, None)
-    kv_cache_retention_config = KvCacheRetentionConfig(
-        token_range_retention_configs=[tokenRangeRetentionConfig],
+    token_range_retention_config = KvCacheRetentionConfig.TokenRangeRetentionConfig(0, 4, 100, None)
+    kv_cache_retention_config = KvCacheRetentionConfig(
+        token_range_retention_configs=[token_range_retention_config],
         decode_retention_priority=35, # Set generated tokens to default priority
         decode_duration_ms=None)

Follow-up: If the constructor expects a half-open range [start, end), consider adding a note like “Indices are [start, end) (end exclusive).”
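
If the constructor does use a half-open range (an assumption this comment asks the authors to confirm), a one-line note in the example would settle the ambiguity, for instance:

from tensorrt_llm.llmapi import KvCacheRetentionConfig  # import path assumed, as above

# Assuming a half-open [start, end) range, (0, 4) covers token indices 0, 1, 2 and 3.
system_prompt_range = KvCacheRetentionConfig.TokenRangeRetentionConfig(0, 4, 100, None)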


68-68: Inline code formatting: single backticks.

-This example uses a single ```kv_cache_retention_config``` object for all the prompts. You can also provide a list that must have the same length as the list of prompts.
+This example uses a single `kv_cache_retention_config` object for all the prompts. You can also provide a list that must have the same length as the list of prompts.
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 9b44c22 and 7a7f006.

📒 Files selected for processing (4)
  • docs/source/examples/kvcacheconfig.md (2 hunks)
  • docs/source/examples/kvcacheretentionconfig.md (4 hunks)
  • docs/source/features/kvcache.md (2 hunks)
  • docs/source/features/speculative-decoding.md (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • docs/source/features/kvcache.md

@nv-guomingz nv-guomingz changed the title from "[None][Doc] Update kvcache part" to "[None][doc] Update kvcache part" on Sep 5, 2025
@nv-guomingz
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #17758 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #17758 [ run ] completed with state DISABLED
L0 testing is limited to prioritized users. User nv-guomingz is not in the prioritized list. L0 testing cannot be triggered.

@nv-guomingz
Collaborator Author

/bot skip --comment "docs only change"

@nv-guomingz nv-guomingz enabled auto-merge (squash) September 5, 2025 07:25
@tensorrt-cicd
Collaborator

PR_Github #17759 [ skip ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #17759 [ skip ] completed with state SUCCESS
Skipping testing for commit 7a7f006

@nv-guomingz nv-guomingz merged commit f9187b2 into NVIDIA:release/1.0 Sep 5, 2025
4 of 6 checks passed
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Sep 8, 2025
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Sep 8, 2025
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Sep 8, 2025
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Sep 8, 2025
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Sep 8, 2025
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Sep 9, 2025
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Sep 9, 2025
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Sep 9, 2025
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Sep 9, 2025
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
nv-guomingz added a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Sep 9, 2025
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Sep 9, 2025
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
nv-guomingz added a commit that referenced this pull request Sep 9, 2025
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
Wong4j pushed a commit to Wong4j/TensorRT-LLM that referenced this pull request Sep 20, 2025
Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
@nv-guomingz nv-guomingz deleted the user/guomingz/1.0_doc_minor_update branch September 30, 2025 07:57