[https://nvbugs/5416501][doc] add known issues to llmapi doc #7560

Superjomn · 2025-09-05T08:16:42Z

Summary by CodeRabbit

Documentation
- Promoted troubleshooting items to clear subsections for faster scanning.
- Added a new “Single node hanging when using docker run --net=host” section with root-cause notes and workarounds.
- Provided alternatives (--ipc=host), recommended MPI environment variables, and an example mpirun invocation to avoid stalls.
- Clarified steps to prevent related runtime issues; no code or API changes.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option is always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
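
As a rough illustration of how the `run` options above combine into a single PR comment, here is a minimal sketch. The helper name `bot_run_comment` and its parameters are hypothetical and not part of the bot; only the flags themselves (`--disable-fail-fast`, `--stage-list`, `--gpu-type`) come from the help text.

```python
def bot_run_comment(stage_list=None, gpu_types=None, disable_fail_fast=False):
    """Compose a `/bot run` comment from a few of the documented options.

    Hypothetical helper for illustration only; it just builds a string.
    """
    parts = ["/bot", "run"]
    if disable_fail_fast:
        parts.append("--disable-fail-fast")
    if stage_list:
        # The help text shows stage names as one quoted, comma-separated value.
        parts += ["--stage-list", '"' + ", ".join(stage_list) + '"']
    if gpu_types:
        parts += ["--gpu-type", '"' + ", ".join(gpu_types) + '"']
    return " ".join(parts)


print(bot_run_comment(stage_list=["A10-PyTorch-1"], disable_fail_fast=True))
```

Note that, per the help text, `--stage-list` and `--gpu-type` runs do not update the GitHub check status.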

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

coderabbitai · 2025-09-05T08:16:48Z

📝 Walkthrough

Walkthrough

Documentation update in docs/source/llm-api/index.md: promoted several troubleshooting items to H3 headings and added a new subsection "Single node hanging when using docker run --net=host" describing mpi4py-related causes, suggested environment variables, and an alternative mpirun command. No code or API changes.

Changes

| Cohort / File(s) | Summary of changes |
| --- | --- |
| Docs: Troubleshooting reformat + new guidance `docs/source/llm-api/index.md` | - Converted multiple troubleshooting entries into H3 headings: “RuntimeError: only rank 0 can start multi-node session, got 1”, “Hang issue on Slurm Node”, “MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode 1.”, “Cannot quit after generation”.<br>- Added subsection “Single node hanging when using `docker run --net=host`” attributing root cause to `mpi4py`, recommending switching to `--ipc=host`, providing env vars `OMPI_MCA_btl_tcp_if_include=lo` and `OMPI_MCA_oob_tcp_if_include=lo`, and suggesting `mpirun -n 1 --oversubscribe --allow-run-as-root python my_llm_task.py`. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested labels

1.0_doc

Suggested reviewers

litaotju
nv-guomingz
QiJune

Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>

Superjomn · 2025-09-05T08:18:52Z

/bot skip --comment "only single md file change"

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

docs/source/llm-api/index.md (2)

62-67: Fix minor grammar: “add a prefix”.

Change “add prefix” → “add a prefix” for readability.

-  If you experience a hang or other issue on a node managed with Slurm, add prefix `mpirun -n 1 --oversubscribe --allow-run-as-root` to your launch script.
+  If you experience a hang or other issue on a node managed with Slurm, add a prefix `mpirun -n 1 --oversubscribe --allow-run-as-root` to your launch script.

68-73: Remove trailing punctuation in heading; add explicit main guard example.

The heading ends with a period (MD026). Also, provide a minimal code snippet to illustrate the advice.

-### MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode 1.
+### MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode 1
@@
   Because the LLM API relies on the `mpi4py` library, put the LLM class in a function and protect the main entrypoint to the program under the `__main__` namespace to avoid a [recursive spawn](https://mpi4py.readthedocs.io/en/stable/mpi4py.futures.html#mpipoolexecutor) process in `mpi4py`.
 
   This limitation is applicable for multi-GPU inference only.
+```python
+def main():
+    # create and use LLM() inside a function
+    pass
+
+if __name__ == "__main__":
+    main()
+```

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between f9187b2 and 512d48e.

📒 Files selected for processing (1)

docs/source/llm-api/index.md (1 hunks)

🧰 Additional context used

🧠 Learnings (1)

📓 Common learnings

Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6029
File: .github/pull_request_template.md:45-53
Timestamp: 2025-08-27T17:50:13.264Z
Learning: For PR templates in TensorRT-LLM, avoid suggesting changes that would increase developer overhead, such as converting plain bullets to mandatory checkboxes. The team prefers guidance-style bullets that don't require explicit interaction to reduce friction in the PR creation process.

🪛 LanguageTool

docs/source/llm-api/index.md

[grammar] ~56-~56: There might be a mistake here.
Context: ...LLM: ### RuntimeError: only rank 0 can start multi-node session, got 1 There is n...

(QB_NEW_EN)

[grammar] ~64-~64: There might be a mistake here.
Context: ...her issue on a node managed with Slurm, add prefix `mpirun -n 1 --oversubscribe --a...

(QB_NEW_EN)

[grammar] ~78-~78: Ensure spelling is correct
Context: ...ith LLM(...) as llm: ...`, the shutdown methed will be invoked automatically once it g...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🪛 markdownlint-cli2 (0.17.2)

docs/source/llm-api/index.md

68-68: Trailing punctuation in heading
Punctuation: '.'

(MD026, no-trailing-punctuation)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Pre-commit Check

🔇 Additional comments (2)

docs/source/llm-api/index.md (2)

56-61: LGTM on promoting this tip to a heading.

Clear and concise; the example command is helpful.

80-97: The verification script to extract lines 80–97 from docs/source/llm-api/index.md is running. I’ll update once we have the actual content.

docs/source/llm-api/index.md

tensorrt-cicd · 2025-09-05T08:24:06Z

PR_Github #17765 [ skip ] triggered by Bot

tensorrt-cicd · 2025-09-05T08:40:05Z

PR_Github #17765 [ skip ] completed with state SUCCESS
Skipping testing for commit 512d48e

Co-authored-by: Ryan McCormick <mccormick.codes@gmail.com> Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (1)

docs/source/llm-api/index.md (1)

74-79: Fix typos and grammar (“context manager”, “method”); remove awkward hyphenation.

Same as the earlier bot note; proposing the corrected sentence.

   The LLM instance manages threads and processes, which may prevent its reference count from reaching zero. To address this issue, there are two common solutions:
   1. Wrap the LLM instance in a function, as demonstrated in the quickstart guide. This will reduce the reference count and trigger the shutdown process.
-  2. Use LLM as an contextmanager, with the following code: `with LLM(...) as llm: ...`, the shutdown methed will be invoked automatically once it goes out of the `with`-statement block.
+  2. Use LLM as a context manager: `with LLM(...) as llm: ...`. The shutdown method is invoked automatically when exiting the with block.
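
To make the second option concrete, here is a minimal sketch of the context-manager behavior described above. `StandInLLM` is a hypothetical stand-in written for this example, not the real `LLM` class; it assumes only what the doc states, namely that the shutdown method runs when the `with` block exits.

```python
class StandInLLM:
    """Hypothetical stand-in for the LLM class; it only tracks shutdown state."""

    def __init__(self):
        self.is_shut_down = False

    def shutdown(self):
        # The real class would stop its threads and processes here.
        self.is_shut_down = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.shutdown()
        return False  # do not suppress exceptions raised inside the block


def main():
    with StandInLLM() as llm:
        assert not llm.is_shut_down  # still usable inside the block
    return llm.is_shut_down  # shutdown has run by the time the block exits


if __name__ == "__main__":
    print(main())
```

Keeping the `with` block inside `main()` also follows the first option, since the instance goes out of scope when the function returns.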

🧹 Nitpick comments (5)

docs/source/llm-api/index.md (5)

58-61: Tighten wording for the single-node note.

Clarify phrasing and explicitly say “inference.”

-  There is no need to add an `mpirun` prefix for launching single node multi-GPU inference with the LLM API.
+  Do not prefix single-node multi-GPU runs with `mpirun` when using the LLM API.

-  For example, you can run `python llm_inference_distributed.py` to perform multi-GPU on a single node.
+  For example, run `python llm_inference_distributed.py` to perform multi-GPU inference on a single node.

62-67: Normalize heading and improve phrasing for Slurm guidance.

Lowercase “node,” and use “prepend” for clarity.

-### Hang issue on Slurm Node
+### Hang on Slurm node

-  If you experience a hang or other issue on a node managed with Slurm, add prefix `mpirun -n 1 --oversubscribe --allow-run-as-root` to your launch script.
+  If you experience a hang on a Slurm-managed node, prepend `mpirun -n 1 --oversubscribe --allow-run-as-root` to your launch command.

-  For example, try `mpirun -n 1 --oversubscribe --allow-run-as-root python llm_inference_distributed.py`.
+  For example: `mpirun -n 1 --oversubscribe --allow-run-as-root python llm_inference_distributed.py`.
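
If the launch is scripted from Python rather than typed in a shell, the same prefix can be prepended programmatically. This is a sketch under that assumption; `with_mpirun_prefix` is a hypothetical helper, and only the `mpirun` flags and the script name come from the doc text.

```python
import shlex

# Prefix taken verbatim from the Slurm guidance above.
MPIRUN_PREFIX = ["mpirun", "-n", "1", "--oversubscribe", "--allow-run-as-root"]


def with_mpirun_prefix(command):
    """Return the launch command with the single-process mpirun prefix prepended."""
    return MPIRUN_PREFIX + list(command)


cmd = with_mpirun_prefix(["python", "llm_inference_distributed.py"])
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a host where `mpirun` is installed.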

68-68: Fix markdownlint: remove trailing punctuation in heading.

Removes the period to satisfy MD026.

-### MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode 1.
+### MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode 1

70-72: Crisp guidance on avoiding recursive spawn.

Shorten and name the guard explicitly.

-  Because the LLM API relies on the `mpi4py` library, put the LLM class in a function and protect the main entrypoint to the program under the `__main__` namespace to avoid a [recursive spawn](https://mpi4py.readthedocs.io/en/stable/mpi4py.futures.html#mpipoolexecutor) process in `mpi4py`.
+  Because the LLM API relies on `mpi4py`, place LLM construction inside a function and guard the entry point with `if __name__ == "__main__":` to avoid a [recursive spawn](https://mpi4py.readthedocs.io/en/stable/mpi4py.futures.html#mpipoolexecutor).

80-96: Polish the new docker/mpi4py subsection; add single-node caveat for env vars.

Hyphenate “Single-node,” tighten wording, and note that the env vars restrict comms to loopback.

-### Single node hanging when using `docker run --net=host`
+### Single-node hang when using `docker run --net=host`

-The root cause may be related to `mpi4py`. There is a [workaround](https://github.com/mpi4py/mpi4py/discussions/491#discussioncomment-12660609) suggesting a change from `--net=host` to `--ipc=host`, or setting the following environment variables:
+This may be related to `mpi4py`. As a [workaround](https://github.com/mpi4py/mpi4py/discussions/491#discussioncomment-12660609), prefer `--ipc=host` instead of `--net=host`, or set the following environment variables (single-node only; they restrict Open MPI to the loopback interface):

-Another option to improve compatibility with `mpi4py` is to launch the task using:
+Alternatively, launch the task with:

-This command can help avoid related runtime issues.
+This can help avoid the hang.

Please confirm the “loopback-only” note matches your intent and won’t be copied into multi-node guidance.
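
For reference, the environment-variable workaround could be applied from a Python launcher as sketched below. The helper `loopback_env` is hypothetical; the two variable names and the `lo` value are the ones quoted in this PR, and the single-node restriction is the reviewer's reading rather than something confirmed by the author.

```python
import os

# Variables quoted in the new doc subsection; assumed to apply to single-node runs only.
LOOPBACK_ENV = {
    "OMPI_MCA_btl_tcp_if_include": "lo",
    "OMPI_MCA_oob_tcp_if_include": "lo",
}


def loopback_env(base=None):
    """Return a copy of the environment with the two Open MPI variables set to `lo`."""
    env = dict(os.environ if base is None else base)
    env.update(LOOPBACK_ENV)
    return env


env = loopback_env({"PATH": "/usr/bin"})
print(sorted(env))
```

A child process can then be started with `subprocess.run(["python", "my_llm_task.py"], env=loopback_env(), check=True)`, which leaves the parent environment untouched.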

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 512d48e and 84924a8.

📒 Files selected for processing (1)

docs/source/llm-api/index.md (1 hunks)

Superjomn · 2025-09-08T08:20:07Z

/bot skip --comment "only changed a single MD file"

tensorrt-cicd · 2025-09-08T08:25:53Z

PR_Github #18009 [ skip ] triggered by Bot

tensorrt-cicd · 2025-09-08T08:42:51Z

PR_Github #18009 [ skip ] completed with state SUCCESS
Skipping testing for commit 84924a8

…7560) Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com> Co-authored-by: Ryan McCormick <mccormick.codes@gmail.com>

…7560) Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com> Co-authored-by: Ryan McCormick <mccormick.codes@gmail.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>

Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com> Co-authored-by: Ryan McCormick <mccormick.codes@gmail.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>

Superjomn requested a review from a team as a code owner September 5, 2025 08:16

Superjomn requested a review from nv-guomingz September 5, 2025 08:17

init

512d48e

Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>

Superjomn force-pushed the add-known-issues-to-doc branch from 106e8b3 to 512d48e Compare September 5, 2025 08:17

coderabbitai bot reviewed Sep 5, 2025

rmccorm4 reviewed Sep 5, 2025

nv-guomingz approved these changes Sep 5, 2025

Superjomn requested a review from litaotju September 5, 2025 23:48

Update docs/source/llm-api/index.md

84924a8

Co-authored-by: Ryan McCormick <mccormick.codes@gmail.com> Signed-off-by: Yan Chunwei <328693+Superjomn@users.noreply.github.com>

coderabbitai bot reviewed Sep 5, 2025

litaotju approved these changes Sep 6, 2025

Superjomn enabled auto-merge (squash) September 8, 2025 08:19

Superjomn merged commit 1204133 into NVIDIA:release/1.0 Sep 8, 2025
5 checks passed

coderabbitai bot mentioned this pull request Sep 8, 2025

[None][doc] Fix a invalid link. #7617

Merged

1 task

litaotju mentioned this pull request Sep 9, 2025

[TRTLLM-4629] [feat] Add support of CUDA13 and sm103 devices #7568

Merged

1 task

dominicshanshan mentioned this pull request Sep 9, 2025

[None][chore] Mass integration of release/1.0 - 5th #7640

Merged

1 task
