[fix] Remove SpecConfig and fix thread leak issues by mikeiovine · Pull Request #5931 · NVIDIA/TensorRT-LLM · GitHub

@mikeiovine
Collaborator

Description

In #5639, we had to keep the unused SpecConfig class around. Removing this line would inexplicably cause thread leak issues in certain seemingly unrelated tests:

from tensorrt_llm._torch.speculative import SpecConfig

The root cause is:

  1. When you import tensorrt_llm._torch.speculative, torch._inductor.lowering gets imported implicitly.
  2. This import starts a new thread:

       import threading

       def get_threads():
           return [t.name for t in threading.enumerate()]

       print("Before", get_threads())
       from torch._inductor import lowering
       print("After", get_threads())

       # Prints the following
       # Before ['MainThread']
       # After ['MainThread', 'Thread-1 (_read_thread)']

  3. If the import occurs before the test starts, there are no issues. But if the import occurs after the test starts (e.g. by a lazy import in the LLM construction), pytest will think you leaked the thread and the test will fail.
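
To make the timing argument concrete, here is a minimal stdlib-only sketch of the failure mode. It assumes a simplified leak check that snapshots live threads before and after a test body. That is a model for illustration, not the actual checker used in the test suite. The names lazy_start_background_thread, leaked_threads and simulated_read_thread are hypothetical stand-ins for the import side effect described above.

```python
import threading

_stop = threading.Event()
_started = False


def lazy_start_background_thread():
    """Hypothetical stand-in for an import that spawns a thread once."""
    global _started
    if not _started:
        worker = threading.Thread(
            target=_stop.wait, name="simulated_read_thread", daemon=True
        )
        worker.start()
        _started = True


def leaked_threads(test_fn):
    """Simplified leak check: names of threads that appear during test_fn."""
    before = set(threading.enumerate())
    test_fn()
    after = set(threading.enumerate())
    return sorted(t.name for t in after - before)


def test_body():
    # Models a lazy import that happens inside the test.
    lazy_start_background_thread()


# First run: the thread is created inside the test, so it looks leaked.
leaked_first = leaked_threads(test_body)
# Second run: the thread already existed before the snapshot, so nothing is reported.
leaked_second = leaked_threads(test_body)

print(leaked_first)   # ['simulated_read_thread']
print(leaked_second)  # []

_stop.set()  # let the background thread exit
```

The second call corresponds to the case where the import happened before the test started: the thread is already in the "before" snapshot, so the check has nothing to report.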

This PR fixes the issue by adding import torch._inductor.lowering to __init__.py in our testing folder. I don't think it makes sense to add this to TRTLLM itself since this issue is strictly confined to testing with thread leak checks.
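
As a rough illustration of the eager-import idea, the sketch below preloads a list of modules up front so that any threads they start exist before a test's leak check takes its snapshot. The helper preload_modules is hypothetical and is not part of the PR, which adds a single import line instead. The demo uses a stdlib module so it runs without torch; in this PR's setting the list would contain torch._inductor.lowering.

```python
import importlib
import threading


def preload_modules(module_names):
    """Import each module eagerly; return the names that imported successfully.

    Hypothetical helper for illustration only. The PR itself just adds
    `import torch._inductor.lowering` to the tests' __init__.py.
    """
    loaded = []
    for name in module_names:
        try:
            importlib.import_module(name)
            loaded.append(name)
        except ImportError:
            # Module not available in this environment; skip it.
            pass
    return loaded


# Demo with a stdlib module and a deliberately missing one.
loaded = preload_modules(["json", "no_such_module_for_demo"])
print("Preloaded:", loaded)
print("Threads now alive:", [t.name for t in threading.enumerate()])
```

Skipping missing modules keeps the sketch runnable anywhere; a real test package would more likely import the module unconditionally, so that a broken import fails loudly.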

Also cleaned up some leftovers from #5639 (remove SpecConfig and get rid of incorrect type annotations).

Test Coverage

Existing tests.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@mikeiovine mikeiovine requested a review from QiJune July 10, 2025 20:03
@mikeiovine mikeiovine requested a review from a team as a code owner July 10, 2025 20:03
@mikeiovine mikeiovine requested a review from litaotju July 10, 2025 20:03
Signed-off-by: Mike Iovine <6158008+mikeiovine@users.noreply.github.com>
@mikeiovine
Collaborator Author

/bot run

@mikeiovine mikeiovine requested a review from wili-65535 July 10, 2025 20:08
@tensorrt-cicd
Collaborator

PR_Github #11570 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #11570 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8566 completed with status: 'FAILURE'

Collaborator

@wili-65535 wili-65535 left a comment

Great work!
The phenomenon is reproduced locally with Michael's script.
This improvement makes the code much clearer.

Collaborator

@QiJune QiJune left a comment

LGTM

@mikeiovine
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #11661 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #11661 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8636 completed with status: 'SUCCESS'

@mikeiovine mikeiovine requested a review from a team as a code owner July 12, 2025 11:45
@mikeiovine mikeiovine requested a review from juney-nvidia July 12, 2025 11:45
@mikeiovine
Collaborator Author

/bot reuse-pipeline --comment "Pipeline passed before rebase"

@mikeiovine mikeiovine enabled auto-merge (squash) July 12, 2025 11:47
@tensorrt-cicd
Collaborator

PR_Github #11705 [ reuse-pipeline ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #11705 [ reuse-pipeline ] completed with state SUCCESS
Reusing PR_Github #11661 for commit 141be7c

@mikeiovine mikeiovine merged commit 8950223 into NVIDIA:main Jul 12, 2025
3 checks passed
@mikeiovine mikeiovine deleted the fix-thread-leak branch July 23, 2025 17:58
4 participants