-
Notifications
You must be signed in to change notification settings - Fork 1.8k
[None][ci] Some improvements for Slurm CI #7689
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
e8a3c05
to
61d3379
Compare
/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-5, GB200-PyTorch-1, GB200-4_GPUs-PyTorch-1, DGX_B200-4_GPUs-PyTorch-1" --disable-fail-fast |
📝 WalkthroughWalkthroughJenkins pipeline and Slurm execution flow updated: tightened SSH options and randomized login-node selection for SSH transfers/cleanup; docker GPU handling refactored into assembled Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Jenkins as Jenkins Pipeline
participant Slurm as Slurm Scheduler
participant Login as Random Login Node
participant Node as Compute Node
participant Docker as Docker Container
participant Results as Results Store
Jenkins->>Slurm: Submit job (with pre-sleep/backoff)
Slurm-->>Login: Use SlurmConfig.getRandomLoginNode(...)
Login->>Node: SSH / dispatch job
Note over Jenkins,Node: Compute and log Final dockerArgs (GPU selection)
Jenkins->>Node: Start Docker run (timeout = SlurmConfig.DEFAULT_TIMEOUT)
activate Docker
Docker->>Node: Run tests (pytest/coverage/perf)
Docker-->>Jenkins: Return status (normal/timeout/exception)
deactivate Docker
Node->>Login: Tarball stage results
Login->>Jenkins: Upload results tar.gz
Jenkins->>Results: Publish junit (allowEmptyResults:true)
sequenceDiagram
autonumber
participant Rank0 as Rank 0 (launcher)
participant RankN as Rank >0
participant Task as Background Task (pytest/cmd)
participant Leader as mgmn_leader_node
participant Worker as mgmn_worker_node
participant Proxy as TRTLLM Proxy (IPC)
Note over Rank0,RankN: `trtllm-llmapi-launch` sets IPC addr/key and `tllm_mpi_size`
Rank0->>Proxy: Export TLLM_SPAWN_PROXY_PROCESS variables
Rank0->>Task: Start task in background (MPI env sanitized)
Rank0->>Leader: Start leader (foreground)
RankN->>Worker: Start worker (foreground)
Task-->>Rank0: Exit code captured
Rank0->>Leader: Request stop
Leader-->>Rank0: Stop status
Rank0-->>Rank0: Exit with combined status
RankN-->>RankN: Exit with worker status
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60–90 minutes Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
PR_Github #18442 [ run ] triggered by Bot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (6)
jenkins/scripts/slurm_run.sh (1)
3-5
: Strengthen ERR trap and avoid noisy xtraceGood addition. Two tweaks will make failures clearer and propagate traps into functions/subshells.
Apply this diff:
-# Set up error handling -set -exuo pipefail -trap 'echo "Script failed on line $LINENO"; exit 1' ERR +# Set up error handling +set -Eeuo pipefail +trap 'rc=$?; echo "Error on line $LINENO: $BASH_COMMAND (exit $rc)"; exit $rc' ERROptional: temporarily disable
set -x
around anyenv | sort
or secrets-related commands to reduce credential leakage in logs.jenkins/L0_Test.groovy (5)
108-108
: SSH resilience: add a connect timeoutNew keepalive settings look good. Add an explicit connect timeout to fail fast on dead frontends.
-COMMON_SSH_OPTIONS = "-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o TCPKeepAlive=no -o ServerAliveInterval=30 -o ServerAliveCountMax=20" +COMMON_SSH_OPTIONS = "-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o TCPKeepAlive=no -o ServerAliveInterval=30 -o ServerAliveCountMax=20 -o ConnectTimeout=30"
114-118
: Avoid per-call random login-node selection to reduce flakinessUsing
SlurmConfig.getRandomLoginNode(cluster.host)
at every call can produce different frontends within a single job. If home dirs or Slurm state are not perfectly shared, result fetch/cleanup can sporadically fail.Apply these in-function changes:
- Compute once and reuse:
// Near where `cluster` is defined in each function def loginHost = SlurmConfig.getRandomLoginNode(cluster.host)
- Replace the map entries in the shown ranges:
- host : SlurmConfig.getRandomLoginNode(cluster.host), + host : loginHost,Confirm GB/ DGX login nodes share the same homedir and Slurm ctlview; if not, this refactor is recommended.
Also applies to: 146-152, 206-214, 354-361, 471-478
392-401
: Simplify --gpus quoting to avoid shell/inside quoting edge casesThe current echo with nested quotes can surface as literal quotes depending on shell and Jenkins docker.inside wrapping. Docker accepts unquoted device lists.
- if [ -n "$NV_GPU" ]; then - echo "--gpus '\"device=$NV_GPU\"'" - elif [ -n "$CUDA_VISIBLE_DEVICES" ]; then - echo "--gpus '\"device=$CUDA_VISIBLE_DEVICES\"'" + if [ -n "$NV_GPU" ]; then + echo "--gpus device=$NV_GPU" + elif [ -n "$CUDA_VISIBLE_DEVICES" ]; then + echo "--gpus device=$CUDA_VISIBLE_DEVICES" else echo "--gpus ${gpuCount}" fiPlease sanity-check one run where
NV_GPU
is a comma list and another where it's unset to ensuredocker.inside()
accepts the resulting arg unchanged.
403-418
: Docker args: small hardening nitsLooks good. Two small suggestions:
- Ensure volume paths are quoted-safe if envs ever contain spaces (unlikely here).
- Consider adding
--pull=missing
or keeping the explicit pull you already have; both are fine.No code change required if you prefer current style.
543-545
: Mirror bash strict mode improvements in the multi-node launcherSame recommendation as slurm_run.sh for better diagnostics and trap propagation.
- set -exuo pipefail - trap 'echo "Script failed on line $LINENO"; exit 1' ERR + set -Eeuo pipefail + trap 'rc=$?; echo "Error on line $LINENO: $BASH_COMMAND (exit $rc)"; exit $rc' ERR
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
jenkins/L0_Test.groovy
(13 hunks)jenkins/scripts/slurm_run.sh
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
jenkins/L0_Test.groovy (2)
1-1
: Shared lib pinned to a feature branch—confirm stabilitySwitching to
bloom-jenkins-shared-lib@yanchaol-dlcluster
can break pipelines if that branch is deleted/force-pushed. If this is intentional for a temporary rollout, OK; otherwise consider pinning to a tag or@main
.Would you like me to raise a follow-up to move this to a stable ref once validated?
1921-1927
: Timeout wrapper around docker.inside is a solid guardThis prevents Slurm/Docker hangs from blocking the pipeline indefinitely. Good change.
/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-5, GB200-PyTorch-1, GB200-4_GPUs-PyTorch-1, DGX_B200-4_GPUs-PyTorch-1" --disable-fail-fast |
PR_Github #18446 [ run ] triggered by Bot |
PR_Github #18442 [ run ] completed with state |
PR_Github #18446 [ run ] completed with state |
61d3379
to
abf1c60
Compare
@coderabbitai review |
✅ Actions performedReview triggered.
|
/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-5, GB200-4_GPUs-PyTorch-1, DGX_B200-4_GPUs-PyTorch-1" --disable-fail-fast |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
jenkins/scripts/slurm_run.sh (4)
77-85
: Wheel path extraction is brittle; avoid space-mangling and parsing pip output.Use Python to resolve the package path reliably; avoid sed/cut with spaces.
Apply this diff:
-trtllmWhlPath=$(pip3 show tensorrt_llm | grep Location | cut -d ' ' -f 2) -trtllmWhlPath=$(echo "$trtllmWhlPath" | sed 's/[[:space:]]+/_/g') -echo "TRTLLM WHEEL PATH: $trtllmWhlPath" -sed -i "s|---wheel_path---|$trtllmWhlPath|g" "$coverageConfigFile" +trtllmWhlPath="$(python3 - <<'PY' +import importlib.util, pathlib +spec = importlib.util.find_spec("tensorrt_llm") +print(str(pathlib.Path(spec.origin).parent)) +PY +)" +echo "TRTLLM WHEEL PATH: $trtllmWhlPath" +sed -i "s|---wheel_path---|$trtllmWhlPath|g" "$coverageConfigFile"
88-95
: LD_LIBRARY_PATH injection: path building strips spaces; compute libs dir robustly.Don’t remove spaces from paths; derive libs/ via Python and prepend if missing.
Apply this diff:
-containerPipLLMLibPath=$(pip3 show tensorrt_llm | grep "Location" | awk -F ":" '{ gsub(/ /, "", $2); print $2"/tensorrt_llm/libs"}') -containerPipLLMLibPath=$(echo "$containerPipLLMLibPath" | sed 's/[[:space:]]+/_/g') -containerLDLibPath=$LD_LIBRARY_PATH -containerLDLibPath=$(echo "$containerLDLibPath" | sed 's/[[:space:]]+/_/g') +containerPipLLMLibPath="$(python3 - <<'PY' +import importlib.util, pathlib +spec = importlib.util.find_spec("tensorrt_llm") +print(str(pathlib.Path(spec.origin).parent / "libs")) +PY +)" +containerLDLibPath="${LD_LIBRARY_PATH-}"
97-100
: Avoid dumping full environment; risk of secrets leakage in CI logs.Print a whitelist or redact sensitive keys.
Apply this diff:
-echo "Library Path:" -echo "$LD_LIBRARY_PATH" -env | sort +echo "Library Path:" +echo "$LD_LIBRARY_PATH" +# Redact potential secrets from environment dump +env | grep -Ev '(PASSWORD|TOKEN|SECRET|KEY|CREDENTIAL|AWS_|AZURE_|GCP_|COOKIE)' | sortIf possible, prefer whitelisting specific variables instead of regex redaction.
100-103
: Avoid eval; execute the array directly and log a shell-escaped preview.This removes injection/word-splitting pitfalls.
Apply this diff:
-fullCmd="${testCmdLines[*]}" -echo "Running: $testCase" -echo "Full Command: $fullCmd" -eval $fullCmd +printf -v fullCmd '%q ' "${testCmdLines[@]}" +echo "Running: $testCase" +echo "Full Command: $fullCmd" +"${testCmdLines[@]}"
🧹 Nitpick comments (7)
jenkins/scripts/slurm_run.sh (7)
3-6
: Good hardening; fix ShellCheck SC2154 and minor trap robustness.Predeclare rc (or disable the warning) to satisfy ShellCheck with set -u, and keep the trap minimal.
Apply this diff:
# Set up error handling -set -Exeuo pipefail -trap 'rc=$?; echo "Error on line $LINENO: $BASH_COMMAND (exit $rc)"; exit $rc' ERR +set -Exeuo pipefail +# shellcheck disable=SC2154 # rc is set from $? inside the trap +rc=0 # predeclare for nounset + shellcheck +trap 'rc=$?; echo "Error on line $LINENO: $BASH_COMMAND (exit '"$rc"')"; exit "$rc"' ERR
7-12
: Quote path variables to avoid word-splitting/globbing.Paths from CI envs can contain spaces. Quote cd and file paths.
Apply this diff:
-cd $resourcePathNode -llmSrcNode=$resourcePathNode/TensorRT-LLM/src +cd "$resourcePathNode" +llmSrcNode="$resourcePathNode/TensorRT-LLM/src" @@ -coverageConfigFile="$jobWorkspace/.coveragerc" +coverageConfigFile="$jobWorkspace/.coveragerc"
24-35
: Ensure apt-get is available; avoid flakiness.Running apt-get on shared nodes can fail without update or permissions. Gate it and update indexes.
Apply this diff:
- apt-get install -y libffi-dev + if command -v apt-get >/dev/null 2>&1; then + apt-get update -y && apt-get install -y --no-install-recommends libffi-dev + else + echo "apt-get not found; skipping libffi-dev install" + fi
35-41
: Host resolution: quote and add a more reliable fallback.getent hosts is safer across distros than relying on hostname -f only.
Apply this diff:
- if [ -n "$HOST_NODE_NAME" ]; then - hostNodeName="$HOST_NODE_NAME" - else - hostNodeName=$(hostname -f || hostname) - fi - echo "HOST_NODE_NAME = $hostNodeName ; GPU_UUIDS = $gpuUuids ; STAGE_NAME = $stageName" + if [ -n "${HOST_NODE_NAME:-}" ]; then + hostNodeName="$HOST_NODE_NAME" + else + hostNodeName="$(getent hosts "$(hostname -f 2>/dev/null || hostname)" 2>/dev/null | awk '{print $2}' | head -1)" + hostNodeName="${hostNodeName:-$(hostname -f 2>/dev/null || hostname)}" + fi + echo "HOST_NODE_NAME = $hostNodeName ; GPU_UUIDS = $gpuUuids ; STAGE_NAME = $stageName"
47-47
: Unused or unclear variable recomposition.testList is recomputed but not used below (the script uses testListPathNode). Remove or use consistently.
Apply this diff:
-testList="$testList_$splitId" +# testList is unused; remove if not required elsewhere +# testList="${testList}_${splitId}"
25-34
: General quoting of command arguments.Please quote variable expansions in wget, tar, pip, and cd to avoid surprises.
Apply this diff:
- wget -nv $llmTarfile - tar -zxf $tarName + wget -nv "$llmTarfile" + tar -zxf "$tarName" @@ - cd $llmSrcNode && pip3 install --retries 1 -r requirements-dev.txt - cd $resourcePathNode && pip3 install --force-reinstall --no-deps TensorRT-LLM/tensorrt_llm-*.whl + cd "$llmSrcNode" && pip3 install --retries 1 -r requirements-dev.txt + cd "$resourcePathNode" && pip3 install --force-reinstall --no-deps TensorRT-LLM/tensorrt_llm-*.whl
27-28
: Prefer command -v over which.which can be a shell builtin or external; command -v is more portable.
Apply this diff:
- which python3 + command -v python3
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
jenkins/L0_Test.groovy
(15 hunks)jenkins/scripts/slurm_run.sh
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- jenkins/L0_Test.groovy
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6029
File: .github/pull_request_template.md:45-53
Timestamp: 2025-08-27T17:50:13.264Z
Learning: For PR templates in TensorRT-LLM, avoid suggesting changes that would increase developer overhead, such as converting plain bullets to mandatory checkboxes. The team prefers guidance-style bullets that don't require explicit interaction to reduce friction in the PR creation process.
🪛 Shellcheck (0.10.0)
jenkins/scripts/slurm_run.sh
[warning] 5-5: rc is referenced but not assigned.
(SC2154)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
jenkins/scripts/slurm_run.sh (1)
54-69
: Arg pairing: pass options and values as separate array elements.You already do this in most places; ensure every "--opt value" pair is split into two elements for consistent logging/execution.
If any element still contains spaces (e.g., "--rootdir $llmSrcNode/tests/..."), consider splitting them into two entries in testCmdLines for clarity: "--rootdir" and "$llmSrcNode/tests/...".
PR_Github #18455 [ run ] triggered by Bot |
abf1c60
to
469b2a6
Compare
/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-5, GB200-4_GPUs-PyTorch-1, DGX_B200-4_GPUs-PyTorch-1" --disable-fail-fast |
PR_Github #18455 [ run ] completed with state |
PR_Github #18459 [ run ] triggered by Bot |
469b2a6
to
c036eae
Compare
/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-5, GB200-4_GPUs-PyTorch-1, DGX_B200-4_GPUs-PyTorch-1" --disable-fail-fast |
PR_Github #18464 [ run ] triggered by Bot |
PR_Github #18459 [ run ] completed with state |
c036eae
to
141059c
Compare
/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-5, GB200-4_GPUs-PyTorch-1, DGX_B200-4_GPUs-PyTorch-1" --disable-fail-fast |
PR_Github #18467 [ run ] triggered by Bot |
@coderabbitai review |
✅ Actions performedReview triggered.
|
/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-5, DGX_B200-4_GPUs-PyTorch-1" --disable-fail-fast |
PR_Github #18490 [ run ] triggered by Bot |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
jenkins/scripts/slurm_run.sh (1)
96-102
: Make pytest exit-code capture reliable underset -e
With
set -e
, a failing eval exits before recording$?
. Temporarily disable-e
to capture and then re-enable.-fullCmd="${testCmdLines[*]}" -echo "Full Command: $fullCmd" -eval $fullCmd - -exitCode=$? -echo "Pytest exit code: $exitCode" -exit $exitCode +fullCmd="${testCmdLines[*]}" +echo "Full Command: $fullCmd" +set +e +eval "$fullCmd" +exitCode=$? +set -e +echo "Pytest exit code: $exitCode" +exit $exitCodejenkins/L0_Test.groovy (3)
407-414
: Simplify dockerArgs quoting to avoid nested quotesNested quoting can break when passed to Docker Pipeline. No spaces in device list; quotes are unnecessary.
- dockerArgs = sh(script: """ - if [ -n "\$NV_GPU" ]; then - echo "--gpus '\\"device=\$NV_GPU\\"'" - elif [ -n "\$CUDA_VISIBLE_DEVICES" ]; then - echo "--gpus '\\"device=\$CUDA_VISIBLE_DEVICES\\"'" - else - echo "--gpus ${gpuCount}" - fi - """, returnStdout: true).trim() + dockerArgs = sh(script: """ + if [ -n "\$NV_GPU" ]; then + echo "--gpus device=\$NV_GPU" + elif [ -n "\$CUDA_VISIBLE_DEVICES" ]; then + echo "--gpus device=\$CUDA_VISIBLE_DEVICES" + else + echo "--gpus ${gpuCount}" + fi + """, returnStdout: true).trim()
1033-1034
: git safe.directory wildcard in doc buildLimit scope to the checked-out repo root instead of
"*"
.- trtllm_utils.llmExecStepWithRetry(pipeline, script: "git config --global --add safe.directory \"*\"") + def repoRoot = sh(script: "git -C ${LLM_ROOT} rev-parse --show-toplevel || echo ''", returnStdout: true).trim() + if (repoRoot) { + trtllm_utils.llmExecStepWithRetry(pipeline, script: "git config --global --add safe.directory \"${repoRoot}\"") + }
1559-1559
: git safe.directory wildcard in test runner containerSame concern in the test path; restrict to the repo root.
- trtllm_utils.llmExecStepWithRetry(pipeline, script: "git config --global --add safe.directory \"*\"") + def repoRoot = sh(script: "git -C ${llmSrc} rev-parse --show-toplevel || echo ''", returnStdout: true).trim() + if (repoRoot) { + trtllm_utils.llmExecStepWithRetry(pipeline, script: "git config --global --add safe.directory \"${repoRoot}\"") + }
♻️ Duplicate comments (2)
jenkins/scripts/slurm_run.sh (1)
33-33
: git safe.directory scope is overly broadAvoid
*
; restrict to the repo path to reduce attack surface.- git config --global --add safe.directory "*" + if git -C "$resourcePathNode/TensorRT-LLM" rev-parse --show-toplevel >/dev/null 2>&1; then + git config --global --add safe.directory "$resourcePathNode/TensorRT-LLM" + fijenkins/L0_Test.groovy (1)
2149-2157
: Split-count mismatch for GB200 2-node stages (pre-merge vs YAML)Stage 1–4 use split_count=4; Stage 5 introduces a 5th split, which will be empty if the YAML still defines 4 pre_merge tests. Align counts or extend YAML.
Run a quick check locally to confirm actual splits before merging:
#!/bin/bash python3 scripts/test_to_stage_mapping.py --list-stages | grep 'GB200-8_GPUs-2_Nodes-PyTorch-' || true
🧹 Nitpick comments (4)
tensorrt_llm/llmapi/trtllm-llmapi-launch (1)
2-2
: Optional: strengthen error handlingConsider
set -Eeo pipefail
to improve propagation through functions/pipelines without enabling-u
(which would require guarding env reads).-set -e +set -Eeo pipefailjenkins/scripts/slurm_run.sh (1)
3-6
: Silence SC2154 and keep robust ERR trapShellCheck warns rc may be unassigned in the trap. Predeclare rc (no functional change) to satisfy static analysis.
-# Set up error handling -set -Exeuo pipefail -trap 'rc=$?; echo "Error in file ${BASH_SOURCE[0]} on line $LINENO: $BASH_COMMAND (exit $rc)"; exit $rc' ERR +# Set up error handling +rc=0 # for ShellCheck; overwritten in trap +set -Exeuo pipefail +trap 'rc=$?; echo "Error in file ${BASH_SOURCE[0]} on line $LINENO: $BASH_COMMAND (exit $rc)"; exit $rc' ERRjenkins/L0_Test.groovy (2)
112-116
: Stabilize login-node selection across steps
SlurmConfig.getRandomLoginNode(...)
is called repeatedly; if nodes differ across steps, paths under/home/svc_tensorrt/bloom/scripts/...
may not exist on the new frontend (depending on FS layout). Prefer selecting once and reusing for the whole job (and pass into helpers).Would you like me to draft a patch that threads a single
loginNode
throughuploadResults
,cleanUpNodeResources*
, and the Slurm run functions?Also applies to: 153-160, 216-223, 305-313, 367-374, 485-492
558-560
: Shell trap in generated slurm_launch.sh: SC2154 nitMirror the earlier ShellCheck fix to predeclare rc in the generated script.
- set -Exeuo pipefail - trap 'rc=\$?; echo "Error in file \${BASH_SOURCE[0]} on line \$LINENO: \$BASH_COMMAND (exit \$rc)"; exit \$rc' ERR + rc=0 + set -Exeuo pipefail + trap 'rc=\$?; echo "Error in file \${BASH_SOURCE[0]} on line \$LINENO: \$BASH_COMMAND (exit \$rc)"; exit \$rc' ERR
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
jenkins/BuildDockerImage.groovy
(2 hunks)jenkins/L0_Test.groovy
(15 hunks)jenkins/scripts/slurm_run.sh
(3 hunks)tensorrt_llm/llmapi/trtllm-llmapi-launch
(1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6029
File: .github/pull_request_template.md:45-53
Timestamp: 2025-08-27T17:50:13.264Z
Learning: For PR templates in TensorRT-LLM, avoid suggesting changes that would increase developer overhead, such as converting plain bullets to mandatory checkboxes. The team prefers guidance-style bullets that don't require explicit interaction to reduce friction in the PR creation process.
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
PR: NVIDIA/TensorRT-LLM#7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.
Applied to files:
jenkins/scripts/slurm_run.sh
🪛 Shellcheck (0.10.0)
jenkins/scripts/slurm_run.sh
[warning] 5-5: rc is referenced but not assigned.
(SC2154)
🔇 Additional comments (2)
jenkins/scripts/slurm_run.sh (1)
35-36
: LGTM: resilient host name resolutionFallback to
hostname
if FQDN fails and capture via HOST_NODE_NAME is solid.jenkins/L0_Test.groovy (1)
1956-1962
: LGTM: add timeout around docker.inside to prevent hung Slurm runsThis guards against stuck containers without changing success paths.
b39a9b7
to
b334aea
Compare
/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-5, DGX_B200-4_GPUs-PyTorch-1" --disable-fail-fast |
PR_Github #18491 [ run ] triggered by Bot |
PR_Github #18490 [ run ] completed with state |
PR_Github #18491 [ run ] completed with state |
b334aea
to
65d26f0
Compare
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
jenkins/L0_Test.groovy (1)
406-415
: Quote handling indockerArgs
may be offEcho yields
--gpus '"device=…"'
, which can confusedocker.image(...).inside(...)
. Emit simple double-quoted args.- if [ -n "\$NV_GPU" ]; then - echo "--gpus '\\"device=\$NV_GPU\\"'" - elif [ -n "\$CUDA_VISIBLE_DEVICES" ]; then - echo "--gpus '\\"device=\$CUDA_VISIBLE_DEVICES\\"'" + if [ -n "\$NV_GPU" ]; then + echo "--gpus \"device=\$NV_GPU\"" + elif [ -n "\$CUDA_VISIBLE_DEVICES" ]; then + echo "--gpus \"device=\$CUDA_VISIBLE_DEVICES\""
♻️ Duplicate comments (2)
jenkins/scripts/slurm_run.sh (1)
33-33
: Don’t mark all repos as safe.directoryWildcarding “*” weakens git safety in shared runners. Restrict to the actual repo root.
- git config --global --add safe.directory "*" + repo_root="$(cd "$resourcePathNode/TensorRT-LLM/src" && git rev-parse --show-toplevel 2>/dev/null || true)" + if [ -n "$repo_root" ]; then + git config --global --add safe.directory "$repo_root" + fijenkins/L0_Test.groovy (1)
2147-2158
: Pre-merge split-count mismatch: stage 5 likely emptyYou added “GB200-8_GPUs-2_Nodes-PyTorch-5” with split_count=5, but the test-db pre_merge section has 4 tests; split 5 will be empty. Drop stage 5 or add a 5th pre_merge test.
- "GB200-8_GPUs-2_Nodes-PyTorch-5": ["gb200-multi-node", "l0_gb200_multi_nodes", 5, 5, 8, 2], + // Pre-merge currently has 4 splits in l0_gb200_multi_nodes.yml + // "GB200-8_GPUs-2_Nodes-PyTorch-5": ["gb200-multi-node", "l0_gb200_multi_nodes", 5, 5, 8, 2],To verify quickly:
#!/bin/bash # Expect 4 pre-merge splits to have tests; 5th should be empty today. rg -n "GB200-8_GPUs-2_Nodes-PyTorch-[1-5]" jenkins/L0_Test.groovy python3 scripts/test_to_stage_mapping.py --list-stages | grep 'GB200-8_GPUs-2_Nodes-PyTorch-' | wc -l || true
🧹 Nitpick comments (6)
jenkins/scripts/slurm_run.sh (2)
3-5
: Good: proactive error diagnostics; minor caution on logging commandsTrap prints $BASH_COMMAND which may include secrets. Consider gating it behind a DEBUG flag or redacting known tokens.
49-49
: Ensure launcher is executable before useTarball perms can vary. Safer to chmod the new launcher.
- "$llmSrcNode/jenkins/scripts/trtllm-llmapi-launch" + "chmod +x $llmSrcNode/jenkins/scripts/trtllm-llmapi-launch" \ + "$llmSrcNode/jenkins/scripts/trtllm-llmapi-launch"jenkins/scripts/trtllm-llmapi-launch (3)
2-2
: Enable-u
for unset varsKeep consistency with other scripts and fail fast on typos/unset env.
-set -Exeo pipefail +set -Eeuo pipefail
26-41
: Port/HMAC setup: add fallbacks and reduce race window
- Port probe can race; prefer ephemeral ports or keep the socket bound until export completes.
- Openssl might be missing; fall back to Python secrets.
-function export_free_tcp_addr_for_spawn_proxy_process { - # find free port starting from 10012 - local free_port=$(python -c 'import socket; s=socket.socket(); -port = 10012 -while True: - try: - s.bind(("", port)) - break - except OSError: - port += 1 -print(port); s.close()') - export TLLM_SPAWN_PROXY_PROCESS_IPC_ADDR="tcp://127.0.0.1:${free_port}" - log_stderr "TLLM_SPAWN_PROXY_PROCESS_IPC_ADDR: $TLLM_SPAWN_PROXY_PROCESS_IPC_ADDR" - - export TLLM_SPAWN_PROXY_PROCESS_IPC_HMAC_KEY=$(openssl rand -hex 32) -} +function export_free_tcp_addr_for_spawn_proxy_process { + # Prefer ephemeral port to minimize races; fall back to scan from 10012 + local free_port + free_port=$(python3 - <<'PY' +import socket +s=socket.socket() +try: + s.bind(("127.0.0.1", 0)) + print(s.getsockname()[1]) +finally: + s.close() +PY + ) || free_port="" + if [ -z "$free_port" ]; then + free_port=$(python3 - <<'PY' +import socket +s=socket.socket(); port=10012 +while True: + try: + s.bind(("127.0.0.1", port)); print(port); break + except OSError: + port += 1 +s.close() +PY + ) + fi + export TLLM_SPAWN_PROXY_PROCESS_IPC_ADDR="tcp://127.0.0.1:${free_port}" + log_stderr "TLLM_SPAWN_PROXY_PROCESS_IPC_ADDR: $TLLM_SPAWN_PROXY_PROCESS_IPC_ADDR" + if command -v openssl >/dev/null 2>&1; then + export TLLM_SPAWN_PROXY_PROCESS_IPC_HMAC_KEY="$(openssl rand -hex 32)" + else + export TLLM_SPAWN_PROXY_PROCESS_IPC_HMAC_KEY="$(python3 - <<'PY' +import secrets; print(secrets.token_hex(32)) +PY + )" + fi +}
5-12
: Remove unused vars
native_mpi_rank
andpid
are never used.-native_mpi_rank=$OMPI_COMM_WORLD_RANK ... -pid=$(ps -o pid= -p $$ | tr -d ' ')jenkins/L0_Test.groovy (1)
326-327
: Nit: fix log grammar“Sleeping to before …” → “Sleeping before …”.
-Utils.exec(pipeline, script: "echo Sleeping to before Slurm job submission; sleep \$((RANDOM % 29 + 1))") +Utils.exec(pipeline, script: "echo Sleeping before Slurm job submission; sleep \$((RANDOM % 29 + 1))")Also applies to: 585-586
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
jenkins/L0_Test.groovy
(15 hunks)jenkins/scripts/slurm_run.sh
(3 hunks)jenkins/scripts/trtllm-llmapi-launch
(1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
PR: NVIDIA/TensorRT-LLM#7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.
Applied to files:
jenkins/scripts/slurm_run.sh
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
jenkins/scripts/slurm_run.sh
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
jenkins/scripts/slurm_run.sh
🪛 Shellcheck (0.10.0)
jenkins/scripts/slurm_run.sh
[warning] 5-5: rc is referenced but not assigned.
(SC2154)
[warning] 43-43: pytestTestTimeout is referenced but not assigned.
(SC2154)
🔇 Additional comments (7)
jenkins/scripts/slurm_run.sh (2)
35-36
: LGTM: robust host name resolution and loggingUsing HOST_NODE_NAME fallback + GPU UUIDs improves traceability.
43-44
: VerifypytestTestTimeout
is exported by the callerThis script references
pytestTestTimeout
to setCPP_TEST_TIMEOUT_OVERRIDDEN
. Confirm the wrapper always exports it (multi-node path does).jenkins/scripts/trtllm-llmapi-launch (1)
58-71
: Confirm unsetting UCX_ is intended for test subprocess*You’re clearing UCX_ in the test subshell. If tests rely on UCX for IB/SHM, this may alter behavior. Validate that only MPI pollution is removed.
jenkins/L0_Test.groovy (4)
108-108
: LGTM: tighter SSH keepalive settingsReasonable defaults to reduce flaky lingering sessions.
112-119
: LGTM: random login-node selectionGood for load-spreading and resilience.
431-431
: LGTM: emitting final dockerArgsHelpful for debugging GPU assignment logic.
1956-1962
: LGTM: timeout wrapper around docker-insidePrevents stuck Slurm jobs; error handling remains intact.
65d26f0
to
df4c913
Compare
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
jenkins/scripts/slurm_run.sh (1)
25-33
: Do not run apt-get in the CI runtime — move the install to the image or guard it.apt-get is absent in the CI environment (verification: "apt-get not present"). Remove or guard the
apt-get install -y libffi-dev
in jenkins/scripts/slurm_run.sh (lines 25–33); install libffi-dev during image build, or add cross-distro fallbacks (apt/apk/dnf) or use prebuilt wheels so the runtime doesn't depend on apt.
♻️ Duplicate comments (2)
jenkins/scripts/slurm_run.sh (2)
33-33
: git safe.directory scope is overly broad.- git config --global --add safe.directory "*" + repoRoot="$(git -C "$resourcePathNode/TensorRT-LLM" rev-parse --show-toplevel 2>/dev/null || true)" + if [[ -n "$repoRoot" ]]; then + git config --global --add safe.directory "$repoRoot" + fi
98-108
: Drop eval; execute argv array and preserve exit code under set -e.Current eval loses quoting and can mis-handle args; also brittle with set -e.
-fullCmd="${testCmdLines[*]}" -echo "Full Command: $fullCmd" - -# Turn off "exit on error" so the following lines always run -set +e -trap - ERR - -eval $fullCmd -exitCode=$? -echo "Pytest exit code: $exitCode" -exit $exitCode +echo "Full Command: ${testCmdLines[*]}" +# Turn off "exit on error" so the following lines always run +set +e +trap - ERR +"${testCmdLines[@]}" +exitCode=$? +set -e +echo "Pytest exit code: $exitCode" +exit "$exitCode"Companion fix: convert options currently encoded as a single token with a space to either “--opt=value” or two separate tokens, otherwise the array form will pass them as one argument.
- "--rootdir $llmSrcNode/tests/integration/defs" - "--splits $splits" - "--group $splitId" - "--junit-xml $resultsPath/results.xml" + "--rootdir=$llmSrcNode/tests/integration/defs" + "--splits=$splits" + "--group=$splitId" + "--junit-xml=$resultsPath/results.xml"If perf mode is used:
- "--perf-log-formats csv" - "--perf-log-formats yaml" + "--perf-log-formats=csv" + "--perf-log-formats=yaml"
🧹 Nitpick comments (10)
jenkins/scripts/trtllm-llmapi-launch (5)
2-2
: Enable nounset; current script references unset envs.Add -u; we already guard mpi_rank via default expansions.
-set -Eeo pipefail +set -Eeuo pipefail
8-9
: Harden logger; avoid echo -e and unquoted $@.-log_stderr() { echo -e "\033[33m$@\033[0m" >&2; } +log_stderr() { printf '\033[33m%s\033[0m\n' "$*" >&2; }
5-5
: Remove unused variable.-native_mpi_rank=$OMPI_COMM_WORLD_RANK
11-11
: Remove unused PID capture.-pid=$(ps -o pid= -p $$ | tr -d ' ')
49-52
: Possible race: task may start before leader is ready.You spawn the task, then start the leader. If the client connects early, it may fail/flap. Prefer starting the leader, waiting until ready, then running the task; or add client-side retry.
I can provide a small “wait-ready” loop around the leader if desired.
jenkins/scripts/slurm_run.sh (5)
48-49
: Verify launcher path before chmod.-llmapiLaunchScript="$llmSrcNode/jenkins/scripts/trtllm-llmapi-launch" -chmod +x $llmapiLaunchScript +llmapiLaunchScript="$llmSrcNode/jenkins/scripts/trtllm-llmapi-launch" +if [[ ! -x "$llmapiLaunchScript" ]]; then + [[ -f "$llmapiLaunchScript" ]] || { echo "Missing: $llmapiLaunchScript"; exit 2; } + chmod +x "$llmapiLaunchScript" +fi
86-94
: Handle empty LD_LIBRARY_PATH under -u; drop no-op sed.Under set -u, an unset LD_LIBRARY_PATH aborts the script. Also, sed '[[:space:]]+/_/g' doesn’t match without -E and isn’t needed.
-containerLDLibPath=$LD_LIBRARY_PATH -containerLDLibPath=$(echo "$containerLDLibPath" | sed 's/[[:space:]]+/_/g') +containerLDLibPath="${LD_LIBRARY_PATH-}"
5-5
: ShellCheck warning about rc is benign; optional tidy.-trap 'rc=$?; echo "Error in file ${BASH_SOURCE[0]} on line $LINENO: $BASH_COMMAND (exit $rc)"; exit $rc' ERR +trap 'st=$?; echo "Error in file ${BASH_SOURCE[0]} on line $LINENO: $BASH_COMMAND (exit ${st})"; exit ${st}' ERR
59-66
: Minor: prefer --rootdir= over split tokens for clarity.Already covered in the array-exec change; keeping a single token with a space is fragile.
43-43
: Unset envs with -u can abort the script.CPP_TEST_TIMEOUT_OVERRIDDEN reads pytestTestTimeout, which may be unset depending on the Jenkins job.
Consider defaults:
-export CPP_TEST_TIMEOUT_OVERRIDDEN=$pytestTestTimeout +export CPP_TEST_TIMEOUT_OVERRIDDEN="${pytestTestTimeout:-600}"Repeat this pattern for MODEL_CACHE_DIR, STAGE_NAME, etc., or fail fast with a clear error if they’re required.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
jenkins/L0_Test.groovy
(14 hunks)jenkins/scripts/slurm_run.sh
(3 hunks)jenkins/scripts/trtllm-llmapi-launch
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- jenkins/L0_Test.groovy
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
PR: NVIDIA/TensorRT-LLM#7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.
Applied to files:
jenkins/scripts/slurm_run.sh
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
jenkins/scripts/slurm_run.sh
🪛 Shellcheck (0.10.0)
jenkins/scripts/slurm_run.sh
[warning] 5-5: rc is referenced but not assigned.
(SC2154)
[warning] 43-43: pytestTestTimeout is referenced but not assigned.
(SC2154)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (3)
jenkins/scripts/trtllm-llmapi-launch (1)
58-60
: Sanitizing SLURM_ may bite tests; confirm.*Some pytest plugins/tests rely on SLURM_* (e.g., node counts, tmp dirs). Consider dropping SLURM_ from the blacklist or whitelisting needed vars.
Would you confirm no tests depend on SLURM_*? If they do, I can propose a minimal whitelist.
jenkins/scripts/slurm_run.sh (2)
35-37
: Host/GPU log: OK; minor nit on variable names.LGTM functionally. Consider consistent labels (HOST_NODE_NAME/GPU_UUIDS/STAGE_NAME already fine).
52-67
: Good: launcher integrated as argv[0] of the test command.This is the right insertion point for the MPI-aware wrapper.
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
df4c913
to
14f1977
Compare
/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-5, DGX_B200-4_GPUs-PyTorch-1, GB200-4_GPUs-PyTorch-1" --disable-fail-fast |
PR_Github #18504 [ run ] triggered by Bot |
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
PR_Github #18504 [ run ] completed with state |
Partial negative testing is sufficient. |
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Summary by CodeRabbit
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help]
to print this help message.See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in thestage-list
parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.For guidance on mapping tests to stage names, see
docs/source/reference/ci-overview.md
and the
scripts/test_to_stage_mapping.py
helper.kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.