[None][ci] Some improvements for Slurm CI by chzblych · Pull Request #7689 · NVIDIA/TensorRT-LLM · GitHub

Conversation

@chzblych
Collaborator

@chzblych chzblych commented Sep 11, 2025

Summary by CodeRabbit

  • New Features
    • Robust multi-node launcher for LLM API with MPI-aware orchestration and proxy IPC support.
    • Per-stage test results tarballed and uploaded; junit accepts empty results.
  • Bug Fixes
    • Hardened error handling, timeouts, retries and SSH/login-node selection; added random sleeps to improve sequencing.
    • Docker runtime args now logged for debugging.
  • Tests
    • Expanded matrix with 5-GPU-per-node variants.
  • Chores
    • Added retries for container registry logins.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with the Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. This ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous; without careful validation it can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous; without careful validation it can break the top of tree.

@chzblych chzblych requested review from a team as code owners September 11, 2025 14:33
@chzblych chzblych requested review from poweiw and ruodil September 11, 2025 14:33
@chzblych
Collaborator Author

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-5, GB200-PyTorch-1, GB200-4_GPUs-PyTorch-1, DGX_B200-4_GPUs-PyTorch-1" --disable-fail-fast

@coderabbitai
Contributor

coderabbitai bot commented Sep 11, 2025

📝 Walkthrough

Walkthrough

Jenkins pipeline and Slurm execution flow updated: tightened SSH options and randomized login-node selection for SSH transfers/cleanup; docker GPU handling refactored into assembled dockerArgs with logging; Slurm Docker runs wrapped in a timeout and broader exception handling; results tarballed and uploaded with junit allowing empty results; new trtllm-llmapi-launch MPI launcher script added; docker login steps use a retry wrapper; Slurm run script hardened.

Changes

Cohort / File(s) Summary of Changes
Jenkins pipeline and Slurm orchestration
jenkins/L0_Test.groovy
Tighten COMMON_SSH_OPTIONS; derive remote targets via SlurmConfig.getRandomLoginNode(cluster.host) for uploads, downloads and cleanup; introduce dockerArgs assembled from NV_GPU/CUDA_VISIBLE_DEVICES or gpuCount and log final args; wrap docker multi-stage runs with timeout (SlurmConfig.DEFAULT_TIMEOUT) and broaden catch to Exception; tarball per-stage results and upload; junit calls use allowEmptyResults: true; add random sleeps and test-matrix tweaks (add 5‑GPU-per-node variant, gate/comment other variants).
Slurm run script hardening
jenkins/scripts/slurm_run.sh
Add set -Eeuo pipefail and an ERR trap to print failing command and exit code; robust hostNodeName resolution; replace launcher invocation with jenkins/scripts/trtllm-llmapi-launch; expand pytest invocation flags (timeouts, coverage, perf flags, outputs); set and export adjusted LD_LIBRARY_PATH; capture and propagate pytest exit code.
Docker login retry wrapper
jenkins/BuildDockerImage.groovy
Replace raw docker login shell calls with trtllm_utils.llmExecStepWithRetry(this, script: "...") for URM and self-hosted registry logins to add retry behavior.
New TRT‑LLM MPI launcher
jenkins/scripts/trtllm-llmapi-launch
New launcher that determines MPI rank/size, provisions proxy IPC addr and HMAC key (port discovery), exports TLLM_SPAWN_PROXY_PROCESS* and tllm_mpi_size, sanitizes MPI env for child task, runs task in background, runs mgmn_leader_node on rank 0 and mgmn_worker_node on non-zero ranks, coordinates stop and exit codes.
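The rank dispatch and size export described above can be reduced to a small runnable sketch (hypothetical; only the standard `OMPI_COMM_WORLD_*` variables come from a real MPI environment, and `dispatch_role` is an illustrative stand-in for running `mgmn_leader_node` vs `mgmn_worker_node`):

```shell
# Sketch of the launcher's role selection; not the actual script.
# Outside mpirun, rank and size default to a single-process run.
export tllm_mpi_size="${OMPI_COMM_WORLD_SIZE:-1}"

dispatch_role() {
    # Rank 0 hosts the leader; every other rank hosts a worker.
    if [ "${1:-0}" -eq 0 ]; then
        echo "leader"
    else
        echo "worker"
    fi
}

dispatch_role "${OMPI_COMM_WORLD_RANK:-0}"
```

Under `mpirun -n 4`, rank 0 would print `leader` and ranks 1–3 `worker`; the real launcher additionally provisions the proxy IPC address and HMAC key before dispatching.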

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Jenkins as Jenkins Pipeline
  participant Slurm as Slurm Scheduler
  participant Login as Random Login Node
  participant Node as Compute Node
  participant Docker as Docker Container
  participant Results as Results Store

  Jenkins->>Slurm: Submit job (with pre-sleep/backoff)
  Slurm-->>Login: Use SlurmConfig.getRandomLoginNode(...)
  Login->>Node: SSH / dispatch job
  Note over Jenkins,Node: Compute and log Final dockerArgs (GPU selection)
  Jenkins->>Node: Start Docker run (timeout = SlurmConfig.DEFAULT_TIMEOUT)
  activate Docker
  Docker->>Node: Run tests (pytest/coverage/perf)
  Docker-->>Jenkins: Return status (normal/timeout/exception)
  deactivate Docker
  Node->>Login: Tarball stage results
  Login->>Jenkins: Upload results tar.gz
  Jenkins->>Results: Publish junit (allowEmptyResults:true)
Loading
sequenceDiagram
  autonumber
  participant Rank0 as Rank 0 (launcher)
  participant RankN as Rank >0
  participant Task as Background Task (pytest/cmd)
  participant Leader as mgmn_leader_node
  participant Worker as mgmn_worker_node
  participant Proxy as TRTLLM Proxy (IPC)

  Note over Rank0,RankN: `trtllm-llmapi-launch` sets IPC addr/key and `tllm_mpi_size`
  Rank0->>Proxy: Export TLLM_SPAWN_PROXY_PROCESS variables
  Rank0->>Task: Start task in background (MPI env sanitized)
  Rank0->>Leader: Start leader (foreground)
  RankN->>Worker: Start worker (foreground)
  Task-->>Rank0: Exit code captured
  Rank0->>Leader: Request stop
  Leader-->>Rank0: Stop status
  Rank0-->>Rank0: Exit with combined status
  RankN-->>RankN: Exit with worker status
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60–90 minutes

Possibly related PRs

Suggested reviewers

  • venkywonka

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Description Check — ⚠️ Warning
    Explanation: The PR description is largely the unchanged template with no filled "Description" or "Test Coverage" sections; it contains only template comments and a checklist, so it does not explain what was changed or how to validate it. The detailed changes appear in the raw_summary and pr_objectives but were not copied into the PR description, leaving reviewers without the necessary context, so the description fails the repository's required template and is incomplete for review.
    Resolution: Please populate the PR description with a concise summary of the functional changes (mention key files and behavioral changes such as login-node selection, the dockerArgs refactor, and slurm_run.sh hardening), add a Test Coverage section listing relevant tests and how to run or reproduce them, and complete the PR checklist items (or explain why items do not apply). Also include related issue/ticket links and identifiers for representative CI runs that exercise the changes, to help reviewers validate the change quickly.
✅ Passed checks (2 passed)
  • Title Check — ✅ Passed: The title "[None][ci] Some improvements for Slurm CI" follows the repository template and clearly identifies the primary scope of the changes (Slurm CI and related Jenkins scripts). It is concise and directly related to the main changes in the changeset as summarized in the raw_summary and pr_objectives, making it appropriate for quick scanning in the commit/PR history.
  • Docstring Coverage — ✅ Passed: No functions found in the changes; docstring coverage check skipped.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@tensorrt-cicd
Collaborator

PR_Github #18442 [ run ] triggered by Bot

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (6)
jenkins/scripts/slurm_run.sh (1)

3-5: Strengthen ERR trap and avoid noisy xtrace

Good addition. Two tweaks will make failures clearer and propagate traps into functions/subshells.

Apply this diff:

-# Set up error handling
-set -exuo pipefail
-trap 'echo "Script failed on line $LINENO"; exit 1' ERR
+# Set up error handling
+set -Eeuo pipefail
+trap 'rc=$?; echo "Error on line $LINENO: $BASH_COMMAND (exit $rc)"; exit $rc' ERR

Optional: temporarily disable set -x around any env | sort or secrets-related commands to reduce credential leakage in logs.
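That suggestion amounts to the following pattern (illustrative; `FAKE_SECRET` is a placeholder, and the `{ set +x; } 2>/dev/null` grouping hides the `set +x` line itself from the trace):

```shell
set -x
echo "traced step"
# Drop xtrace so the next assignment never reaches the log.
{ set +x; } 2>/dev/null
FAKE_SECRET="hunter2"  # placeholder, not a real credential
set -x
echo "tracing resumed"
```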

jenkins/L0_Test.groovy (5)

108-108: SSH resilience: add a connect timeout

New keepalive settings look good. Add an explicit connect timeout to fail fast on dead frontends.

-COMMON_SSH_OPTIONS = "-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o TCPKeepAlive=no -o ServerAliveInterval=30 -o ServerAliveCountMax=20"
+COMMON_SSH_OPTIONS = "-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o TCPKeepAlive=no -o ServerAliveInterval=30 -o ServerAliveCountMax=20 -o ConnectTimeout=30"

114-118: Avoid per-call random login-node selection to reduce flakiness

Using SlurmConfig.getRandomLoginNode(cluster.host) at every call can produce different frontends within a single job. If home dirs or Slurm state are not perfectly shared, result fetch/cleanup can sporadically fail.

Apply these in-function changes:

  • Compute once and reuse:
// Near where `cluster` is defined in each function
def loginHost = SlurmConfig.getRandomLoginNode(cluster.host)
  • Replace the map entries in the shown ranges:
- host         : SlurmConfig.getRandomLoginNode(cluster.host),
+ host         : loginHost,

Confirm the GB/DGX login nodes share the same home directory and Slurm controller view; if not, this refactor is recommended.

Also applies to: 146-152, 206-214, 354-361, 471-478


392-401: Simplify --gpus quoting to avoid shell/inside quoting edge cases

The current echo with nested quotes can surface as literal quotes depending on shell and Jenkins docker.inside wrapping. Docker accepts unquoted device lists.

-                        if [ -n "$NV_GPU" ]; then
-                            echo "--gpus '\"device=$NV_GPU\"'"
-                        elif [ -n "$CUDA_VISIBLE_DEVICES" ]; then
-                            echo "--gpus '\"device=$CUDA_VISIBLE_DEVICES\"'"
+                        if [ -n "$NV_GPU" ]; then
+                            echo "--gpus device=$NV_GPU"
+                        elif [ -n "$CUDA_VISIBLE_DEVICES" ]; then
+                            echo "--gpus device=$CUDA_VISIBLE_DEVICES"
                         else
                             echo "--gpus ${gpuCount}"
                         fi

Please sanity-check one run where NV_GPU is a comma list and another where it's unset to ensure docker.inside() accepts the resulting arg unchanged.


403-418: Docker args: small hardening nits

Looks good. Two small suggestions:

  • Ensure volume paths are quoted-safe if envs ever contain spaces (unlikely here).
  • Consider adding --pull=missing or keeping the explicit pull you already have; both are fine.

No code change required if you prefer current style.


543-545: Mirror bash strict mode improvements in the multi-node launcher

Same recommendation as slurm_run.sh for better diagnostics and trap propagation.

-    set -exuo pipefail
-    trap 'echo "Script failed on line $LINENO"; exit 1' ERR
+    set -Eeuo pipefail
+    trap 'rc=$?; echo "Error on line $LINENO: $BASH_COMMAND (exit $rc)"; exit $rc' ERR
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ef676fc and 61d3379.

📒 Files selected for processing (2)
  • jenkins/L0_Test.groovy (13 hunks)
  • jenkins/scripts/slurm_run.sh (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
jenkins/L0_Test.groovy (2)

1-1: Shared lib pinned to a feature branch—confirm stability

Switching to bloom-jenkins-shared-lib@yanchaol-dlcluster can break pipelines if that branch is deleted/force-pushed. If this is intentional for a temporary rollout, OK; otherwise consider pinning to a tag or @main.

Would you like me to raise a follow-up to move this to a stable ref once validated?


1921-1927: Timeout wrapper around docker.inside is a solid guard

This prevents Slurm/Docker hangs from blocking the pipeline indefinitely. Good change.

@chzblych
Collaborator Author

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-5, GB200-PyTorch-1, GB200-4_GPUs-PyTorch-1, DGX_B200-4_GPUs-PyTorch-1" --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #18446 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #18442 [ run ] completed with state ABORTED
LLM/main/L0_MergeRequest_PR #13835 (Blue Ocean) completed with status: ABORTED

@tensorrt-cicd
Collaborator

PR_Github #18446 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13839 (Partly Tested) completed with status: 'FAILURE'

@chzblych
Collaborator Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai bot commented Sep 12, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@chzblych
Collaborator Author

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-5, GB200-4_GPUs-PyTorch-1, DGX_B200-4_GPUs-PyTorch-1" --disable-fail-fast

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
jenkins/scripts/slurm_run.sh (4)

77-85: Wheel path extraction is brittle; avoid space-mangling and parsing pip output.

Use Python to resolve the package path reliably; avoid sed/cut with spaces.

Apply this diff:

-trtllmWhlPath=$(pip3 show tensorrt_llm | grep Location | cut -d ' ' -f 2)
-trtllmWhlPath=$(echo "$trtllmWhlPath" | sed 's/[[:space:]]+/_/g')
-echo "TRTLLM WHEEL PATH: $trtllmWhlPath"
-sed -i "s|---wheel_path---|$trtllmWhlPath|g" "$coverageConfigFile"
+trtllmWhlPath="$(python3 - <<'PY'
+import importlib.util, pathlib
+spec = importlib.util.find_spec("tensorrt_llm")
+print(str(pathlib.Path(spec.origin).parent))
+PY
+)"
+echo "TRTLLM WHEEL PATH: $trtllmWhlPath"
+sed -i "s|---wheel_path---|$trtllmWhlPath|g" "$coverageConfigFile"

88-95: LD_LIBRARY_PATH injection: path building strips spaces; compute libs dir robustly.

Don’t remove spaces from paths; derive libs/ via Python and prepend if missing.

Apply this diff:

-containerPipLLMLibPath=$(pip3 show tensorrt_llm | grep "Location" | awk -F ":" '{ gsub(/ /, "", $2); print $2"/tensorrt_llm/libs"}')
-containerPipLLMLibPath=$(echo "$containerPipLLMLibPath" | sed 's/[[:space:]]+/_/g')
-containerLDLibPath=$LD_LIBRARY_PATH
-containerLDLibPath=$(echo "$containerLDLibPath" | sed 's/[[:space:]]+/_/g')
+containerPipLLMLibPath="$(python3 - <<'PY'
+import importlib.util, pathlib
+spec = importlib.util.find_spec("tensorrt_llm")
+print(str(pathlib.Path(spec.origin).parent / "libs"))
+PY
+)"
+containerLDLibPath="${LD_LIBRARY_PATH-}"

97-100: Avoid dumping full environment; risk of secrets leakage in CI logs.

Print a whitelist or redact sensitive keys.

Apply this diff:

-echo "Library Path:"
-echo "$LD_LIBRARY_PATH"
-env | sort
+echo "Library Path:"
+echo "$LD_LIBRARY_PATH"
+# Redact potential secrets from environment dump
+env | grep -Ev '(PASSWORD|TOKEN|SECRET|KEY|CREDENTIAL|AWS_|AZURE_|GCP_|COOKIE)' | sort

If possible, prefer whitelisting specific variables instead of regex redaction.


100-103: Avoid eval; execute the array directly and log a shell-escaped preview.

This removes injection/word-splitting pitfalls.

Apply this diff:

-fullCmd="${testCmdLines[*]}"
-echo "Running: $testCase"
-echo "Full Command: $fullCmd"
-eval $fullCmd
+printf -v fullCmd '%q ' "${testCmdLines[@]}"
+echo "Running: $testCase"
+echo "Full Command: $fullCmd"
+"${testCmdLines[@]}"
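Stand-alone, the array pattern from this diff looks like the following (illustrative command; `%q` shell-escapes each element so the logged line can be pasted back into a shell):

```shell
# Build the command as an array so multi-word arguments stay intact.
testCmdLines=(printf '%s\n' "hello world" "a;b")
# Log a shell-escaped preview instead of eval'ing a flat string.
printf -v fullCmd '%q ' "${testCmdLines[@]}"
echo "Full Command: $fullCmd"
# Execute the array directly: no re-splitting, no injection.
"${testCmdLines[@]}"
```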
🧹 Nitpick comments (7)
jenkins/scripts/slurm_run.sh (7)

3-6: Good hardening; fix ShellCheck SC2154 and minor trap robustness.

Predeclare rc (or disable the warning) to satisfy ShellCheck with set -u, and keep the trap minimal.

Apply this diff:

 # Set up error handling
-set -Exeuo pipefail
-trap 'rc=$?; echo "Error on line $LINENO: $BASH_COMMAND (exit $rc)"; exit $rc' ERR
+set -Exeuo pipefail
+# shellcheck disable=SC2154 # rc is set from $? inside the trap
+rc=0  # predeclare for nounset + shellcheck
+trap 'rc=$?; echo "Error on line $LINENO: $BASH_COMMAND (exit $rc)"; exit "$rc"' ERR

7-12: Quote path variables to avoid word-splitting/globbing.

Paths from CI envs can contain spaces. Quote cd and file paths.

Apply this diff:

-cd $resourcePathNode
-llmSrcNode=$resourcePathNode/TensorRT-LLM/src
+cd "$resourcePathNode"
+llmSrcNode="$resourcePathNode/TensorRT-LLM/src"
@@
-coverageConfigFile="$jobWorkspace/.coveragerc"
+coverageConfigFile="$jobWorkspace/.coveragerc"

24-35: Ensure apt-get is available; avoid flakiness.

Running apt-get on shared nodes can fail without update or permissions. Gate it and update indexes.

Apply this diff:

-    apt-get install -y libffi-dev
+    if command -v apt-get >/dev/null 2>&1; then
+        apt-get update -y && apt-get install -y --no-install-recommends libffi-dev
+    else
+        echo "apt-get not found; skipping libffi-dev install"
+    fi

35-41: Host resolution: quote and add a more reliable fallback.

getent hosts is safer across distros than relying on hostname -f only.

Apply this diff:

-    if [ -n "$HOST_NODE_NAME" ]; then
-        hostNodeName="$HOST_NODE_NAME"
-    else
-        hostNodeName=$(hostname -f || hostname)
-    fi
-    echo "HOST_NODE_NAME = $hostNodeName ; GPU_UUIDS = $gpuUuids ; STAGE_NAME = $stageName"
+    if [ -n "${HOST_NODE_NAME:-}" ]; then
+        hostNodeName="$HOST_NODE_NAME"
+    else
+        hostNodeName="$(getent hosts "$(hostname -f 2>/dev/null || hostname)" 2>/dev/null | awk '{print $2}' | head -1)"
+        hostNodeName="${hostNodeName:-$(hostname -f 2>/dev/null || hostname)}"
+    fi
+    echo "HOST_NODE_NAME = $hostNodeName ; GPU_UUIDS = $gpuUuids ; STAGE_NAME = $stageName"
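Stripped of the getent step, the fallback chain is plain parameter expansion (sketch; `HOST_NODE_NAME` is the variable Slurm is assumed to provide):

```shell
# Use the provided name if set and non-empty; otherwise ask the system,
# falling back to the short hostname when `hostname -f` is unavailable.
hostNodeName="${HOST_NODE_NAME:-$(hostname -f 2>/dev/null || hostname)}"
echo "resolved host: $hostNodeName"
```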

47-47: Unused or unclear variable recomposition.

testList is recomputed but not used below (the script uses testListPathNode). Remove or use consistently.

Apply this diff:

-testList="$testList_$splitId"
+# testList is unused; remove if not required elsewhere
+# testList="${testList}_${splitId}"

25-34: General quoting of command arguments.

Please quote variable expansions in wget, tar, pip, and cd to avoid surprises.

Apply this diff:

-    wget -nv $llmTarfile
-    tar -zxf $tarName
+    wget -nv "$llmTarfile"
+    tar -zxf "$tarName"
@@
-    cd $llmSrcNode && pip3 install --retries 1 -r requirements-dev.txt
-    cd $resourcePathNode &&  pip3 install --force-reinstall --no-deps TensorRT-LLM/tensorrt_llm-*.whl
+    cd "$llmSrcNode" && pip3 install --retries 1 -r requirements-dev.txt
+    cd "$resourcePathNode" && pip3 install --force-reinstall --no-deps TensorRT-LLM/tensorrt_llm-*.whl

27-28: Prefer command -v over which.

which can be a shell builtin or external; command -v is more portable.

Apply this diff:

-    which python3
+    command -v python3
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 61d3379 and abf1c60.

📒 Files selected for processing (2)
  • jenkins/L0_Test.groovy (15 hunks)
  • jenkins/scripts/slurm_run.sh (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • jenkins/L0_Test.groovy
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6029
File: .github/pull_request_template.md:45-53
Timestamp: 2025-08-27T17:50:13.264Z
Learning: For PR templates in TensorRT-LLM, avoid suggesting changes that would increase developer overhead, such as converting plain bullets to mandatory checkboxes. The team prefers guidance-style bullets that don't require explicit interaction to reduce friction in the PR creation process.
🪛 Shellcheck (0.10.0)
jenkins/scripts/slurm_run.sh

[warning] 5-5: rc is referenced but not assigned.

(SC2154)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
jenkins/scripts/slurm_run.sh (1)

54-69: Arg pairing: pass options and values as separate array elements.

You already do this in most places; ensure every "--opt value" pair is split into two elements for consistent logging/execution.

If any element still contains spaces (e.g., "--rootdir $llmSrcNode/tests/..."), consider splitting them into two entries in testCmdLines for clarity: "--rootdir" and "$llmSrcNode/tests/...".

@tensorrt-cicd
Collaborator

PR_Github #18455 [ run ] triggered by Bot

@chzblych
Collaborator Author

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-5, GB200-4_GPUs-PyTorch-1, DGX_B200-4_GPUs-PyTorch-1" --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #18455 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #13845 (Partly Tested) completed with status: 'FAILURE'

@tensorrt-cicd
Collaborator

PR_Github #18459 [ run ] triggered by Bot

@chzblych
Collaborator Author

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-5, GB200-4_GPUs-PyTorch-1, DGX_B200-4_GPUs-PyTorch-1" --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #18464 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #18459 [ run ] completed with state ABORTED
/LLM/main/L0_MergeRequest_PR pipeline #13848 (Partly Tested) completed with status: 'FAILURE'

@chzblych
Collaborator Author

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-5, GB200-4_GPUs-PyTorch-1, DGX_B200-4_GPUs-PyTorch-1" --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #18467 [ run ] triggered by Bot

@chzblych
Collaborator Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai bot commented Sep 13, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@chzblych
Collaborator Author

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-5, DGX_B200-4_GPUs-PyTorch-1" --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #18490 [ run ] triggered by Bot

@coderabbitai
Contributor

coderabbitai bot commented Sep 13, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
jenkins/scripts/slurm_run.sh (1)

96-102: Make pytest exit-code capture reliable under set -e

With set -e, a failing eval exits before recording $?. Temporarily disable -e to capture and then re-enable.

-fullCmd="${testCmdLines[*]}"
-echo "Full Command: $fullCmd"
-eval $fullCmd
-
-exitCode=$?
-echo "Pytest exit code: $exitCode"
-exit $exitCode
+fullCmd="${testCmdLines[*]}"
+echo "Full Command: $fullCmd"
+set +e
+eval "$fullCmd"
+exitCode=$?
+set -e
+echo "Pytest exit code: $exitCode"
+exit $exitCode
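The capture window can be exercised with a stand-in command (a subshell exiting 3 substitutes for the real `eval "$fullCmd"`):

```shell
set -e
# Suspend errexit only around the command whose status we record.
set +e
( exit 3 )  # stand-in for the failing pytest invocation
exitCode=$?
set -e
echo "Pytest exit code: $exitCode"
```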
jenkins/L0_Test.groovy (3)

407-414: Simplify dockerArgs quoting to avoid nested quotes

Nested quoting can break when passed to Docker Pipeline. No spaces in device list; quotes are unnecessary.

-                    dockerArgs = sh(script: """
-                        if [ -n "\$NV_GPU" ]; then
-                            echo "--gpus '\\"device=\$NV_GPU\\"'"
-                        elif [ -n "\$CUDA_VISIBLE_DEVICES" ]; then
-                            echo "--gpus '\\"device=\$CUDA_VISIBLE_DEVICES\\"'"
-                        else
-                            echo "--gpus ${gpuCount}"
-                        fi
-                    """, returnStdout: true).trim()
+                    dockerArgs = sh(script: """
+                        if [ -n "\$NV_GPU" ]; then
+                            echo "--gpus device=\$NV_GPU"
+                        elif [ -n "\$CUDA_VISIBLE_DEVICES" ]; then
+                            echo "--gpus device=\$CUDA_VISIBLE_DEVICES"
+                        else
+                            echo "--gpus ${gpuCount}"
+                        fi
+                    """, returnStdout: true).trim()

1033-1034: git safe.directory wildcard in doc build

Limit scope to the checked-out repo root instead of "*".

-    trtllm_utils.llmExecStepWithRetry(pipeline, script: "git config --global --add safe.directory \"*\"")
+    def repoRoot = sh(script: "git -C ${LLM_ROOT} rev-parse --show-toplevel || echo ''", returnStdout: true).trim()
+    if (repoRoot) {
+        trtllm_utils.llmExecStepWithRetry(pipeline, script: "git config --global --add safe.directory \"${repoRoot}\"")
+    }

1559-1559: git safe.directory wildcard in test runner container

Same concern in the test path; restrict to the repo root.

-        trtllm_utils.llmExecStepWithRetry(pipeline, script: "git config --global --add safe.directory \"*\"")
+        def repoRoot = sh(script: "git -C ${llmSrc} rev-parse --show-toplevel || echo ''", returnStdout: true).trim()
+        if (repoRoot) {
+            trtllm_utils.llmExecStepWithRetry(pipeline, script: "git config --global --add safe.directory \"${repoRoot}\"")
+        }
♻️ Duplicate comments (2)
jenkins/scripts/slurm_run.sh (1)

33-33: git safe.directory scope is overly broad

Avoid *; restrict to the repo path to reduce attack surface.

-    git config --global --add safe.directory "*"
+    if git -C "$resourcePathNode/TensorRT-LLM" rev-parse --show-toplevel >/dev/null 2>&1; then
+      git config --global --add safe.directory "$resourcePathNode/TensorRT-LLM"
+    fi
jenkins/L0_Test.groovy (1)

2149-2157: Split-count mismatch for GB200 2-node stages (pre-merge vs YAML)

Stage 1–4 use split_count=4; Stage 5 introduces a 5th split, which will be empty if the YAML still defines 4 pre_merge tests. Align counts or extend YAML.

Run a quick check locally to confirm actual splits before merging:

#!/bin/bash
python3 scripts/test_to_stage_mapping.py --list-stages | grep 'GB200-8_GPUs-2_Nodes-PyTorch-' || true
🧹 Nitpick comments (4)
tensorrt_llm/llmapi/trtllm-llmapi-launch (1)

2-2: Optional: strengthen error handling

Consider set -Eeo pipefail to improve propagation through functions/pipelines without enabling -u (which would require guarding env reads).

-set -e
+set -Eeo pipefail
jenkins/scripts/slurm_run.sh (1)

3-6: Silence SC2154 and keep robust ERR trap

ShellCheck warns rc may be unassigned in the trap. Predeclare rc (no functional change) to satisfy static analysis.

-# Set up error handling
-set -Exeuo pipefail
-trap 'rc=$?; echo "Error in file ${BASH_SOURCE[0]} on line $LINENO: $BASH_COMMAND (exit $rc)"; exit $rc' ERR
+# Set up error handling
+rc=0  # for ShellCheck; overwritten in trap
+set -Exeuo pipefail
+trap 'rc=$?; echo "Error in file ${BASH_SOURCE[0]} on line $LINENO: $BASH_COMMAND (exit $rc)"; exit $rc' ERR
jenkins/L0_Test.groovy (2)

112-116: Stabilize login-node selection across steps

SlurmConfig.getRandomLoginNode(...) is called repeatedly; if nodes differ across steps, paths under /home/svc_tensorrt/bloom/scripts/... may not exist on the new frontend (depending on FS layout). Prefer selecting once and reusing for the whole job (and pass into helpers).

Would you like me to draft a patch that threads a single loginNode through uploadResults, cleanUpNodeResources*, and the Slurm run functions?

Also applies to: 153-160, 216-223, 305-313, 367-374, 485-492
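
A minimal shell sketch of the select-once idea (the node list and helper name are hypothetical; the real list lives in SlurmConfig):

```shell
# Hypothetical sketch: choose one login node up front and reuse it in
# every later step, instead of re-rolling per call. The node names
# below are illustrative, not the real SlurmConfig data.
LOGIN_NODES="login-01 login-02 login-03"

pick_login_node() {
    # Word-split the list into positional args, then pick one at random.
    # shellcheck disable=SC2086
    set -- $LOGIN_NODES
    shift $(( RANDOM % $# ))
    printf '%s\n' "$1"
}

# Select once; every upload/cleanup/run helper reuses $LOGIN_NODE, so
# paths like /home/svc_tensorrt/bloom/scripts/... stay on one frontend.
LOGIN_NODE="${LOGIN_NODE:-$(pick_login_node)}"
echo "Using login node: $LOGIN_NODE"
```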


558-560: Shell trap in generated slurm_launch.sh: SC2154 nit

Mirror the earlier ShellCheck fix to predeclare rc in the generated script.

-                    set -Exeuo pipefail
-                    trap 'rc=\$?; echo "Error in file \${BASH_SOURCE[0]} on line \$LINENO: \$BASH_COMMAND (exit \$rc)"; exit \$rc' ERR
+                    rc=0
+                    set -Exeuo pipefail
+                    trap 'rc=\$?; echo "Error in file \${BASH_SOURCE[0]} on line \$LINENO: \$BASH_COMMAND (exit \$rc)"; exit \$rc' ERR
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between abf1c60 and b39a9b7.

📒 Files selected for processing (4)
  • jenkins/BuildDockerImage.groovy (2 hunks)
  • jenkins/L0_Test.groovy (15 hunks)
  • jenkins/scripts/slurm_run.sh (3 hunks)
  • tensorrt_llm/llmapi/trtllm-llmapi-launch (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6029
File: .github/pull_request_template.md:45-53
Timestamp: 2025-08-27T17:50:13.264Z
Learning: For PR templates in TensorRT-LLM, avoid suggesting changes that would increase developer overhead, such as converting plain bullets to mandatory checkboxes. The team prefers guidance-style bullets that don't require explicit interaction to reduce friction in the PR creation process.
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
PR: NVIDIA/TensorRT-LLM#7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.

Applied to files:

  • jenkins/scripts/slurm_run.sh
🪛 Shellcheck (0.10.0)
jenkins/scripts/slurm_run.sh

[warning] 5-5: rc is referenced but not assigned.

(SC2154)

🔇 Additional comments (2)
jenkins/scripts/slurm_run.sh (1)

35-36: LGTM: resilient host name resolution

Fallback to hostname if FQDN fails and capture via HOST_NODE_NAME is solid.

jenkins/L0_Test.groovy (1)

1956-1962: LGTM: add timeout around docker.inside to prevent hung Slurm runs

This guards against stuck containers without changing success paths.

@chzblych
Collaborator Author

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-5, DGX_B200-4_GPUs-PyTorch-1" --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #18491 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #18490 [ run ] completed with state ABORTED
/LLM/main/L0_MergeRequest_PR pipeline #13877 (Partly Tested) completed with status: 'FAILURE'

@tensorrt-cicd
Collaborator

PR_Github #18491 [ run ] completed with state ABORTED
LLM/main/L0_MergeRequest_PR #13878 (Blue Ocean) completed with status: ABORTED

@chzblych
Collaborator Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai bot commented Sep 13, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
jenkins/L0_Test.groovy (1)

406-415: Quote handling in dockerArgs may be off

Echo yields --gpus '"device=…"', which can confuse docker.image(...).inside(...). Emit simple double-quoted args.

-                        if [ -n "\$NV_GPU" ]; then
-                            echo "--gpus '\\"device=\$NV_GPU\\"'"
-                        elif [ -n "\$CUDA_VISIBLE_DEVICES" ]; then
-                            echo "--gpus '\\"device=\$CUDA_VISIBLE_DEVICES\\"'"
+                        if [ -n "\$NV_GPU" ]; then
+                            echo "--gpus \"device=\$NV_GPU\""
+                        elif [ -n "\$CUDA_VISIBLE_DEVICES" ]; then
+                            echo "--gpus \"device=\$CUDA_VISIBLE_DEVICES\""
♻️ Duplicate comments (2)
jenkins/scripts/slurm_run.sh (1)

33-33: Don’t mark all repos as safe.directory

Wildcarding “*” weakens git safety in shared runners. Restrict to the actual repo root.

-    git config --global --add safe.directory "*"
+    repo_root="$(cd "$resourcePathNode/TensorRT-LLM/src" && git rev-parse --show-toplevel 2>/dev/null || true)"
+    if [ -n "$repo_root" ]; then
+      git config --global --add safe.directory "$repo_root"
+    fi
jenkins/L0_Test.groovy (1)

2147-2158: Pre-merge split-count mismatch: stage 5 likely empty

You added “GB200-8_GPUs-2_Nodes-PyTorch-5” with split_count=5, but the test-db pre_merge section has 4 tests; split 5 will be empty. Drop stage 5 or add a 5th pre_merge test.

-        "GB200-8_GPUs-2_Nodes-PyTorch-5": ["gb200-multi-node", "l0_gb200_multi_nodes", 5, 5, 8, 2],
+        // Pre-merge currently has 4 splits in l0_gb200_multi_nodes.yml
+        // "GB200-8_GPUs-2_Nodes-PyTorch-5": ["gb200-multi-node", "l0_gb200_multi_nodes", 5, 5, 8, 2],

To verify quickly:

#!/bin/bash
# Expect 4 pre-merge splits to have tests; 5th should be empty today.
rg -n "GB200-8_GPUs-2_Nodes-PyTorch-[1-5]" jenkins/L0_Test.groovy
python3 scripts/test_to_stage_mapping.py --list-stages | grep 'GB200-8_GPUs-2_Nodes-PyTorch-' | wc -l || true
🧹 Nitpick comments (6)
jenkins/scripts/slurm_run.sh (2)

3-5: Good: proactive error diagnostics; minor caution on logging commands

Trap prints $BASH_COMMAND, which may include secrets. Consider gating it behind a DEBUG flag or redacting known tokens.
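
One possible shape for the gating, as a hedged sketch (CI_DEBUG is an assumed flag name, not an existing variable in this repo):

```shell
# Sketch: print the failing command only when CI_DEBUG is set, so
# secrets that appear on command lines stay out of default CI logs.
# CI_DEBUG is a hypothetical flag name for illustration.
set -Eeo pipefail
trap 'rc=$?;
      if [ -n "${CI_DEBUG:-}" ]; then
          echo "Error on line $LINENO: $BASH_COMMAND (exit $rc)" >&2
      else
          echo "Error on line $LINENO (exit $rc); re-run with CI_DEBUG=1 to see the command" >&2
      fi
      exit $rc' ERR
```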


49-49: Ensure launcher is executable before use

Tarball perms can vary. Safer to chmod the new launcher.

-    "$llmSrcNode/jenkins/scripts/trtllm-llmapi-launch"
+    "chmod +x $llmSrcNode/jenkins/scripts/trtllm-llmapi-launch" \
+    "$llmSrcNode/jenkins/scripts/trtllm-llmapi-launch"
jenkins/scripts/trtllm-llmapi-launch (3)

2-2: Enable -u for unset vars

Keep consistency with other scripts and fail fast on typos/unset env.

-set -Exeo pipefail
+set -Eeuo pipefail

26-41: Port/HMAC setup: add fallbacks and reduce race window

  • Port probe can race; prefer ephemeral ports or keep the socket bound until export completes.
  • Openssl might be missing; fall back to Python secrets.
-function export_free_tcp_addr_for_spawn_proxy_process {
-    # find free port starting from 10012
-    local free_port=$(python -c 'import socket; s=socket.socket();
-port = 10012
-while True:
-    try:
-        s.bind(("", port))
-        break
-    except OSError:
-        port += 1
-print(port); s.close()')
-    export TLLM_SPAWN_PROXY_PROCESS_IPC_ADDR="tcp://127.0.0.1:${free_port}"
-    log_stderr "TLLM_SPAWN_PROXY_PROCESS_IPC_ADDR: $TLLM_SPAWN_PROXY_PROCESS_IPC_ADDR"
-
-    export TLLM_SPAWN_PROXY_PROCESS_IPC_HMAC_KEY=$(openssl rand -hex 32)
-}
+function export_free_tcp_addr_for_spawn_proxy_process {
+    # Prefer ephemeral port to minimize races; fall back to scan from 10012
+    local free_port
+    free_port=$(python3 - <<'PY'
+import socket
+s=socket.socket()
+try:
+    s.bind(("127.0.0.1", 0))
+    print(s.getsockname()[1])
+finally:
+    s.close()
+PY
+    ) || free_port=""
+    if [ -z "$free_port" ]; then
+        free_port=$(python3 - <<'PY'
+import socket
+s=socket.socket(); port=10012
+while True:
+    try:
+        s.bind(("127.0.0.1", port)); print(port); break
+    except OSError:
+        port += 1
+s.close()
+PY
+        )
+    fi
+    export TLLM_SPAWN_PROXY_PROCESS_IPC_ADDR="tcp://127.0.0.1:${free_port}"
+    log_stderr "TLLM_SPAWN_PROXY_PROCESS_IPC_ADDR: $TLLM_SPAWN_PROXY_PROCESS_IPC_ADDR"
+    if command -v openssl >/dev/null 2>&1; then
+        export TLLM_SPAWN_PROXY_PROCESS_IPC_HMAC_KEY="$(openssl rand -hex 32)"
+    else
+        export TLLM_SPAWN_PROXY_PROCESS_IPC_HMAC_KEY="$(python3 - <<'PY'
+import secrets; print(secrets.token_hex(32))
+PY
+        )"
+    fi
+}

5-12: Remove unused vars

native_mpi_rank and pid are never used.

-native_mpi_rank=$OMPI_COMM_WORLD_RANK
...
-pid=$(ps -o pid= -p $$ | tr -d ' ')
jenkins/L0_Test.groovy (1)

326-327: Nit: fix log grammar

“Sleeping to before …” → “Sleeping before …”.

-Utils.exec(pipeline, script: "echo Sleeping to before Slurm job submission; sleep \$((RANDOM % 29 + 1))")
+Utils.exec(pipeline, script: "echo Sleeping before Slurm job submission; sleep \$((RANDOM % 29 + 1))")

Also applies to: 585-586

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b39a9b7 and 65d26f0.

📒 Files selected for processing (3)
  • jenkins/L0_Test.groovy (15 hunks)
  • jenkins/scripts/slurm_run.sh (3 hunks)
  • jenkins/scripts/trtllm-llmapi-launch (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
PR: NVIDIA/TensorRT-LLM#7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.

Applied to files:

  • jenkins/scripts/slurm_run.sh
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • jenkins/scripts/slurm_run.sh
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • jenkins/scripts/slurm_run.sh
🪛 Shellcheck (0.10.0)
jenkins/scripts/slurm_run.sh

[warning] 5-5: rc is referenced but not assigned.

(SC2154)


[warning] 43-43: pytestTestTimeout is referenced but not assigned.

(SC2154)

🔇 Additional comments (7)
jenkins/scripts/slurm_run.sh (2)

35-36: LGTM: robust host name resolution and logging

Using HOST_NODE_NAME fallback + GPU UUIDs improves traceability.


43-44: Verify pytestTestTimeout is exported by the caller

This script references pytestTestTimeout to set CPP_TEST_TIMEOUT_OVERRIDDEN. Confirm the wrapper always exports it (multi-node path does).

jenkins/scripts/trtllm-llmapi-launch (1)

58-71: Confirm unsetting UCX_* is intended for the test subprocess

You’re clearing UCX_* in the test subshell. If tests rely on UCX for IB/SHM, this may alter behavior. Validate that only MPI pollution is removed.

jenkins/L0_Test.groovy (4)

108-108: LGTM: tighter SSH keepalive settings

Reasonable defaults to reduce flaky lingering sessions.


112-119: LGTM: random login-node selection

Good for load-spreading and resilience.


431-431: LGTM: emitting final dockerArgs

Helpful for debugging GPU assignment logic.


1956-1962: LGTM: timeout wrapper around docker-inside

Prevents stuck Slurm jobs; error handling remains intact.

@chzblych
Collaborator Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai bot commented Sep 14, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
jenkins/scripts/slurm_run.sh (1)

25-33: Do not run apt-get in the CI runtime — move the install to the image or guard it.

apt-get is absent in the CI environment (verification: "apt-get not present"). Remove or guard the apt-get install -y libffi-dev in jenkins/scripts/slurm_run.sh (lines 25–33): install libffi-dev during the image build, add cross-distro fallbacks (apt/apk/dnf), or use prebuilt wheels so the runtime doesn't depend on apt.
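
A hedged sketch of the cross-distro guard (package names differ per distro; the fallback message assumes the image may already bake the dependency in):

```shell
# Sketch: only call a package manager that actually exists on the host,
# and otherwise assume the image pre-installed the dependency.
install_libffi_dev() {
    if command -v apt-get >/dev/null 2>&1; then
        apt-get install -y libffi-dev
    elif command -v apk >/dev/null 2>&1; then
        apk add --no-cache libffi-dev
    elif command -v dnf >/dev/null 2>&1; then
        dnf install -y libffi-devel   # Fedora/RHEL spell it -devel
    else
        echo "No supported package manager; assuming libffi is preinstalled" >&2
    fi
}
```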

♻️ Duplicate comments (2)
jenkins/scripts/slurm_run.sh (2)

33-33: git safe.directory scope is overly broad.

-    git config --global --add safe.directory "*"
+    repoRoot="$(git -C "$resourcePathNode/TensorRT-LLM" rev-parse --show-toplevel 2>/dev/null || true)"
+    if [[ -n "$repoRoot" ]]; then
+      git config --global --add safe.directory "$repoRoot"
+    fi

98-108: Drop eval; execute argv array and preserve exit code under set -e.

Current eval loses quoting and can mis-handle args; also brittle with set -e.

-fullCmd="${testCmdLines[*]}"
-echo "Full Command: $fullCmd"
-
-# Turn off "exit on error" so the following lines always run
-set +e
-trap - ERR
-
-eval $fullCmd
-exitCode=$?
-echo "Pytest exit code: $exitCode"
-exit $exitCode
+echo "Full Command: ${testCmdLines[*]}"
+# Turn off "exit on error" so the following lines always run
+set +e
+trap - ERR
+"${testCmdLines[@]}"
+exitCode=$?
+set -e
+echo "Pytest exit code: $exitCode"
+exit "$exitCode"

Companion fix: convert options currently encoded as a single token with a space to either “--opt=value” or two separate tokens, otherwise the array form will pass them as one argument.

-    "--rootdir $llmSrcNode/tests/integration/defs"
-    "--splits $splits"
-    "--group $splitId"
-    "--junit-xml $resultsPath/results.xml"
+    "--rootdir=$llmSrcNode/tests/integration/defs"
+    "--splits=$splits"
+    "--group=$splitId"
+    "--junit-xml=$resultsPath/results.xml"

If perf mode is used:

-        "--perf-log-formats csv"
-        "--perf-log-formats yaml"
+        "--perf-log-formats=csv"
+        "--perf-log-formats=yaml"
🧹 Nitpick comments (10)
jenkins/scripts/trtllm-llmapi-launch (5)

2-2: Enable nounset; current script references unset envs.

Add -u; we already guard mpi_rank via default expansions.

-set -Eeo pipefail
+set -Eeuo pipefail

8-9: Harden logger; avoid echo -e and unquoted $@.

-log_stderr() { echo -e "\033[33m$@\033[0m" >&2; }
+log_stderr() { printf '\033[33m%s\033[0m\n' "$*" >&2; }

5-5: Remove unused variable.

-native_mpi_rank=$OMPI_COMM_WORLD_RANK

11-11: Remove unused PID capture.

-pid=$(ps -o pid= -p $$ | tr -d ' ')

49-52: Possible race: task may start before leader is ready.

You spawn the task, then start the leader. If the client connects early, it may fail/flap. Prefer starting the leader, waiting until ready, then running the task; or add client-side retry.

I can provide a small “wait-ready” loop around the leader if desired.
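
If a retry/wait is the preferred route, a client-side readiness poll could look like this (uses bash's /dev/tcp; the host/port arguments and default timeout are assumptions, not the launcher's real interface):

```shell
# Sketch: poll the leader's TCP endpoint before launching the task so
# an early client connection doesn't flap. Relies on bash's /dev/tcp.
wait_for_leader() {
    local host=$1 port=$2 timeout=${3:-30}
    local deadline=$(( $(date +%s) + timeout ))
    while [ "$(date +%s)" -lt "$deadline" ]; do
        # Subshell: the fd closes automatically when the probe exits.
        if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
            return 0   # leader is accepting connections
        fi
        sleep 1
    done
    echo "leader at $host:$port not ready after ${timeout}s" >&2
    return 1
}
```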

jenkins/scripts/slurm_run.sh (5)

48-49: Verify launcher path before chmod.

-llmapiLaunchScript="$llmSrcNode/jenkins/scripts/trtllm-llmapi-launch"
-chmod +x $llmapiLaunchScript
+llmapiLaunchScript="$llmSrcNode/jenkins/scripts/trtllm-llmapi-launch"
+if [[ ! -x "$llmapiLaunchScript" ]]; then
+  [[ -f "$llmapiLaunchScript" ]] || { echo "Missing: $llmapiLaunchScript"; exit 2; }
+  chmod +x "$llmapiLaunchScript"
+fi

86-94: Handle empty LD_LIBRARY_PATH under -u; drop no-op sed.

Under set -u, an unset LD_LIBRARY_PATH aborts the script. Also, sed '[[:space:]]+/_/g' doesn’t match without -E and isn’t needed.

-containerLDLibPath=$LD_LIBRARY_PATH
-containerLDLibPath=$(echo "$containerLDLibPath" | sed 's/[[:space:]]+/_/g')
+containerLDLibPath="${LD_LIBRARY_PATH-}"

5-5: ShellCheck warning about rc is benign; optional tidy.

-trap 'rc=$?; echo "Error in file ${BASH_SOURCE[0]} on line $LINENO: $BASH_COMMAND (exit $rc)"; exit $rc' ERR
+trap 'st=$?; echo "Error in file ${BASH_SOURCE[0]} on line $LINENO: $BASH_COMMAND (exit ${st})"; exit ${st}' ERR

59-66: Minor: prefer --rootdir= over split tokens for clarity.

Already covered in the array-exec change; keeping a single token with a space is fragile.


43-43: Unset envs with -u can abort the script.

CPP_TEST_TIMEOUT_OVERRIDDEN reads pytestTestTimeout, which may be unset depending on the Jenkins job.

Consider defaults:

-export CPP_TEST_TIMEOUT_OVERRIDDEN=$pytestTestTimeout
+export CPP_TEST_TIMEOUT_OVERRIDDEN="${pytestTestTimeout:-600}"

Repeat this pattern for MODEL_CACHE_DIR, STAGE_NAME, etc., or fail fast with a clear error if they’re required.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 65d26f0 and df4c913.

📒 Files selected for processing (3)
  • jenkins/L0_Test.groovy (14 hunks)
  • jenkins/scripts/slurm_run.sh (3 hunks)
  • jenkins/scripts/trtllm-llmapi-launch (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • jenkins/L0_Test.groovy
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
PR: NVIDIA/TensorRT-LLM#7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.

Applied to files:

  • jenkins/scripts/slurm_run.sh
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • jenkins/scripts/slurm_run.sh
🪛 Shellcheck (0.10.0)
jenkins/scripts/slurm_run.sh

[warning] 5-5: rc is referenced but not assigned.

(SC2154)


[warning] 43-43: pytestTestTimeout is referenced but not assigned.

(SC2154)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (3)
jenkins/scripts/trtllm-llmapi-launch (1)

58-60: Sanitizing SLURM_* may bite tests; confirm.

Some pytest plugins/tests rely on SLURM_* (e.g., node counts, tmp dirs). Consider dropping SLURM_* from the blacklist or whitelisting needed vars.

Would you confirm no tests depend on SLURM_*? If they do, I can propose a minimal whitelist.

jenkins/scripts/slurm_run.sh (2)

35-37: Host/GPU log: OK; minor nit on variable names.

LGTM functionally. Consider consistent labels (HOST_NODE_NAME/GPU_UUIDS/STAGE_NAME already fine).


52-67: Good: launcher integrated as argv[0] of the test command.

This is the right insertion point for the MPI-aware wrapper.

Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
@chzblych
Collaborator Author

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-5, DGX_B200-4_GPUs-PyTorch-1, GB200-4_GPUs-PyTorch-1" --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #18504 [ run ] triggered by Bot

Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
@tensorrt-cicd
Collaborator

PR_Github #18504 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13889 (Partly Tested) completed with status: 'FAILURE'

@chzblych chzblych merged commit 89fc136 into NVIDIA:main Sep 14, 2025
4 checks passed
@chzblych chzblych deleted the yanchaol-ci-improve branch September 14, 2025 08:56
@chzblych
Collaborator Author

PR_Github #18504 [ run ] completed with state SUCCESS /LLM/main/L0_MergeRequest_PR pipeline #13889 (Partly Tested) completed with status: 'FAILURE'

Partial negative testing is sufficient.

Wong4j pushed a commit to Wong4j/TensorRT-LLM that referenced this pull request Sep 20, 2025
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
MrGeva pushed a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request Sep 21, 2025
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>