[None][infra] Cherry-pick #6836 from main branch and improve SSH connection #6971

chzblych · 2025-08-18T01:06:55Z

Summary by CodeRabbit

Chores
- Improved CI stability with randomized delays and retries for container pulls/builds.
- Clearer logs via sorted environment output.
- Stricter image argument parsing and automatic routing to internal registries/mirrors when configured.
- Conditional redirection of third‑party dependency fetches to an internal mirror when the mirror is set.
Refactor
- Centralized SSH options across test workflows for consistent host‑key handling and easier maintenance.

…ailures (NVIDIA#6836) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>

coderabbitai · 2025-08-18T01:07:01Z

📝 Walkthrough

Adds an environment-driven mirror redirection for UCXX’s rapids-cmake fetch in CMake. Tightens and mirrors image ARG extraction in the Jenkins Docker build, adds randomized sleeps, retries, and sorted env prints. Centralizes SSH options in the L0 test Jenkins script via a new COMMON_SSH_OPTIONS variable.

Changes

Cohort / File(s)	Summary of Changes
UCXX fetch mirror override `cpp/CMakeLists.txt`	If UCX is found and `GITHUB_MIRROR` is set and `${3RDPARTY_DIR}/ucxx/fetch_rapids.cmake` exists, read and replace the rapids-cmake raw.githubusercontent URL with the mirror path, write file back, and emit a warning. No other build steps modified.
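Docker build pipeline hardening `jenkins/BuildDockerImage.groovy`	Use `env | sort` for sorted environment logging; anchor the image ARG extraction and rewrite nvcr.io references to the internal mirror; add randomized sleeps and retries around image pulls and builds.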
Centralized SSH options for L0 tests `jenkins/L0_Test.groovy`	Add public `COMMON_SSH_OPTIONS = "-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null"` and replace scattered SSH/SCP flags across upload, SLURM, port-forwarding, port checks, and other remote operations; minor mkdir `-p` and spacing adjustments.

Sequence Diagram(s)

sequenceDiagram
  participant CMake as CMake Configure
  participant Env as Environment
  participant FS as Filesystem

  CMake->>Env: Check UCX found and GITHUB_MIRROR set
  alt Conditions met
    CMake->>FS: Read 3RDPARTY_DIR/ucxx/fetch_rapids.cmake
    FS-->>CMake: File content
    CMake->>CMake: Replace rapids-cmake URL with $GITHUB_MIRROR path
    CMake->>FS: Write updated file
    CMake->>CMake: Warn: replaced fetch_rapids.cmake URL
  else
    CMake->>CMake: No change
  end

sequenceDiagram
  participant Jenkins as Jenkins Pipeline
  participant Make as Make/Dockerfile
  participant Registry as Docker Registry
  participant Docker as Docker

  Jenkins->>Make: Extract BASE/ARG images (anchored grep)
  Jenkins->>Jenkins: Replace nvcr.io with internal mirror
  Jenkins->>Jenkins: Compute randomSleep (300–599s)
  Jenkins->>Jenkins: env | sort (log)
  Jenkins->>Jenkins: Sleep randomSleep
  Jenkins->>Registry: docker pull base/triton image
  loop up to 3 retries
    Jenkins->>Docker: docker build (with sleepInSecs: randomSleep)
    alt build fails
      Jenkins->>Jenkins: Retry
    else build succeeds
      Jenkins-->>Jenkins: Done
    end
  end
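
The mirror step in the second diagram ("Replace nvcr.io with internal mirror") can be illustrated with a minimal Python sketch. This is not the Groovy code from the PR: the function name `route_to_mirror`, the prefix-based matching, and the host `registry.example.internal` are all assumptions made for illustration.

```python
from typing import Optional


def route_to_mirror(image: str, mirror: Optional[str], upstream: str = "nvcr.io") -> str:
    """Rewrite an image reference to a mirror registry when one is configured.

    Hypothetical helper: the PR's actual replacement logic is not shown in the
    conversation, so prefix matching is an assumption.
    """
    if not mirror:
        # No mirror configured: leave the reference untouched.
        return image
    prefix = upstream + "/"
    if image.startswith(prefix):
        # Swap only the registry host; keep the repository path as-is.
        return mirror.rstrip("/") + "/" + image[len(prefix):]
    # Images from other registries are not rerouted.
    return image


# "registry.example.internal" is a placeholder host, not a real mirror.
rerouted = route_to_mirror("nvcr.io/nvidia/pytorch", "registry.example.internal")
```

Matching on the registry prefix rather than on any occurrence of the string keeps references from other registries unchanged.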

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

[TRTLLM-7141][infra] Use repo mirrors to avoid intermittent network failures #6836 — Implements the same UCXX rapids-cmake raw.githubusercontent -> internal GITHUB_MIRROR replacement in CMake (directly related).

Suggested reviewers

niukuo
Shixiaowei02
yuanjingx87
ZhanruiSunCh

📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 85759ab and 347fc44.

📒 Files selected for processing (1)

jenkins/L0_Test.groovy (11 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

jenkins/L0_Test.groovy

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Pre-commit Check

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (7)

cpp/CMakeLists.txt (1)

498-508: Mirror rewrite logic is sound; add a guard to avoid unnecessary writes and aid diagnostics

The env-gated mirror redirection looks good and is safely scoped. To improve robustness:

Only write the file if a replacement actually occurred.
Emit a STATUS message when no match was found (helps when UCXX updates the fetch script format).

Apply within this block:

-        file(READ "${3RDPARTY_DIR}/ucxx/fetch_rapids.cmake" FILE_CONTENTS)
-        string(
-          REPLACE "https://raw.githubusercontent.com/rapidsai/rapids-cmake"
-                  "$ENV{GITHUB_MIRROR}/rapidsai/rapids-cmake/raw/refs/heads"
-                  FILE_CONTENTS "${FILE_CONTENTS}")
-        file(WRITE "${3RDPARTY_DIR}/ucxx/fetch_rapids.cmake" "${FILE_CONTENTS}")
-        message(WARNING "Replace UCXX fetch_rapids.cmake with internal mirror")
+        file(READ "${3RDPARTY_DIR}/ucxx/fetch_rapids.cmake" FILE_CONTENTS)
+        set(_OLD_URL "https://raw.githubusercontent.com/rapidsai/rapids-cmake")
+        set(_NEW_URL "$ENV{GITHUB_MIRROR}/rapidsai/rapids-cmake/raw/refs/heads")
+        if(FILE_CONTENTS MATCHES "${_OLD_URL}")
+          string(REPLACE "${_OLD_URL}" "${_NEW_URL}" FILE_CONTENTS "${FILE_CONTENTS}")
+          file(WRITE "${3RDPARTY_DIR}/ucxx/fetch_rapids.cmake" "${FILE_CONTENTS}")
+          message(WARNING "Replace UCXX fetch_rapids.cmake with internal mirror: ${_NEW_URL}")
+        else()
+          message(STATUS "UCXX fetch_rapids.cmake contains no '${_OLD_URL}' occurrences. Skipping mirror rewrite.")
+        endif()
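
The guard suggested above (write only when a replacement occurred, report otherwise) can be sketched outside CMake as follows. This is an illustrative Python analogue, not the CMake code itself; the function name, the sample script line, and the mirror host `https://mirror.example.com` are assumptions.

```python
from typing import Optional, Tuple

OLD_URL = "https://raw.githubusercontent.com/rapidsai/rapids-cmake"


def rewrite_fetch_script(contents: str, mirror: Optional[str]) -> Tuple[str, bool]:
    """Return (contents, changed). Rewrites only when a mirror is set and the URL is present."""
    if not mirror or OLD_URL not in contents:
        # Nothing to do: the caller can skip the file write and log a status message.
        return contents, False
    new_url = f"{mirror}/rapidsai/rapids-cmake/raw/refs/heads"
    return contents.replace(OLD_URL, new_url), True


# Hypothetical stand-in for the contents of fetch_rapids.cmake.
sample = "file(DOWNLOAD " + OLD_URL + "/some-branch/RAPIDS.cmake out.cmake)"
rewritten, changed = rewrite_fetch_script(sample, "https://mirror.example.com")
```

Returning the `changed` flag lets the caller decide between writing the file with a warning and skipping the write with a status message, which mirrors the two branches in the suggested diff.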

jenkins/BuildDockerImage.groovy (4)

284-287: ARG extraction: works; consider simpler, more resilient parsing

Your anchored grep avoids incidental matches. For readability and fewer processes, awk can extract the value in one pass and trim quotes:

-        def BASE_IMAGE = sh(script: "cd ${LLM_ROOT} && grep '^ARG BASE_IMAGE=' docker/Dockerfile.multi | grep -o '=.*' | tr -d '=\"'", returnStdout: true).trim()
-        def TRITON_IMAGE = sh(script: "cd ${LLM_ROOT} && grep '^ARG TRITON_IMAGE=' docker/Dockerfile.multi | grep -o '=.*' | tr -d '=\"'", returnStdout: true).trim()
-        def TRITON_BASE_TAG = sh(script: "cd ${LLM_ROOT} && grep '^ARG TRITON_BASE_TAG=' docker/Dockerfile.multi | grep -o '=.*' | tr -d '=\"'", returnStdout: true).trim()
+        def BASE_IMAGE = sh(script: "cd ${LLM_ROOT} && awk -F= '/^ARG BASE_IMAGE=/ {gsub(/\"/, \"\", \$2); print \$2}' docker/Dockerfile.multi", returnStdout: true).trim()
+        def TRITON_IMAGE = sh(script: "cd ${LLM_ROOT} && awk -F= '/^ARG TRITON_IMAGE=/ {gsub(/\"/, \"\", \$2); print \$2}' docker/Dockerfile.multi", returnStdout: true).trim()
+        def TRITON_BASE_TAG = sh(script: "cd ${LLM_ROOT} && awk -F= '/^ARG TRITON_BASE_TAG=/ {gsub(/\"/, \"\", \$2); print \$2}' docker/Dockerfile.multi", returnStdout: true).trim()

Also consider asserting non-empty outputs and failing early with a clear message if any are missing.
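
A minimal Python sketch of the same idea, anchored extraction plus an early failure on a missing or empty value, is shown below. It is not the pipeline's code: `extract_arg` is a hypothetical helper, and the sample Dockerfile text and its image names are made up for illustration.

```python
import re


def extract_arg(dockerfile_text: str, name: str) -> str:
    """Return the default value of `ARG name=...`, matching only at the start of a line."""
    pattern = re.compile(rf"^ARG {re.escape(name)}=(.*)$", re.MULTILINE)
    match = pattern.search(dockerfile_text)
    if match is None:
        raise ValueError(f"ARG {name} not found")
    # Drop surrounding quotes and whitespace, similar in spirit to `tr -d '"'`.
    value = match.group(1).replace('"', "").strip()
    if not value:
        raise ValueError(f"ARG {name} is empty")
    return value


# Hypothetical Dockerfile fragment; the values are placeholders.
SAMPLE = """\
# ARG BASE_IMAGE=commented/out
ARG BASE_IMAGE=nvcr.io/nvidia/pytorch
ARG TRITON_IMAGE="nvcr.io/nvidia/tritonserver"
ARG TRITON_BASE_TAG=
FROM ${BASE_IMAGE} AS base
"""

base_image = extract_arg(SAMPLE, "BASE_IMAGE")
```

Because the pattern is anchored at the line start, the commented-out `ARG` line is ignored, and an empty value raises instead of flowing into a later `docker pull`.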

289-290: RockyLinux8 Makefile extraction: anchored pattern is good; add a fallback check

Anchoring to ^jenkins-rockylinux8_%: BASE_IMAGE = is safer. Add a guard to fail fast if the variable is empty to prevent pulling "null" images later.

         if (target == "rockylinux8") {
-            BASE_IMAGE = sh(script: "cd ${LLM_ROOT} && grep '^jenkins-rockylinux8_%: BASE_IMAGE =' docker/Makefile | grep -o '=.*' | tr -d '=\"'", returnStdout: true).trim()
+            BASE_IMAGE = sh(script: "cd ${LLM_ROOT} && awk -F= '/^jenkins-rockylinux8_%: BASE_IMAGE =/ {gsub(/\"/, \"\", \$2); print \$2}' docker/Makefile", returnStdout: true).trim()
+            if (!BASE_IMAGE) {
+                error "Failed to resolve BASE_IMAGE from docker/Makefile for rockylinux8"
+            }
         }

298-309: Retry docker pull as well; keep retry policy consistent

You added retries to the long make step; mirror the same for the preceding docker pull to smooth transient registry hiccups.

-                def randomSleep = (Math.random() * 300 + 300).toInteger()
-                trtllm_utils.llmExecStepWithRetry(this, script: "docker pull ${TRITON_IMAGE}:${TRITON_BASE_TAG}", sleepInSecs: randomSleep, shortCommondRunTimeMax: 7200)
+                def randomSleep = (Math.random() * 300 + 300).toInteger()
+                trtllm_utils.llmExecStepWithRetry(this, script: "docker pull ${TRITON_IMAGE}:${TRITON_BASE_TAG}", sleepInSecs: randomSleep, numRetries: 3, shortCommondRunTimeMax: 7200)

326-337: Also retry the main-stage docker pull

Same rationale as the dependent stage; keep retry semantics consistent.

-            sh "env | sort"
-            def randomSleep = (Math.random() * 300 + 300).toInteger()
-            trtllm_utils.llmExecStepWithRetry(this, script: "docker pull ${TRITON_IMAGE}:${TRITON_BASE_TAG}", sleepInSecs: randomSleep, shortCommondRunTimeMax: 7200)
+            sh "env | sort"
+            def randomSleep = (Math.random() * 300 + 300).toInteger()
+            trtllm_utils.llmExecStepWithRetry(this, script: "docker pull ${TRITON_IMAGE}:${TRITON_BASE_TAG}", sleepInSecs: randomSleep, numRetries: 3, shortCommondRunTimeMax: 7200)
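
The randomized delay and bounded retry discussed in these comments can be sketched in Python as below. The behaviour of `trtllm_utils.llmExecStepWithRetry` is not shown in this conversation, so `run_with_retry` is only an assumed approximation of "retry up to N times, sleeping between attempts"; the parameter names are illustrative.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def random_sleep_seconds(rng: Callable[[], float] = random.random) -> int:
    """Python analogue of (Math.random() * 300 + 300).toInteger(): an integer in 300..599."""
    return int(rng() * 300 + 300)


def run_with_retry(
    step: Callable[[], T],
    num_retries: int = 3,
    sleep_in_secs: int = 0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `step`, retrying on any exception up to `num_retries` attempts in total."""
    last_error: Optional[Exception] = None
    for attempt in range(1, num_retries + 1):
        try:
            return step()
        except Exception as error:  # broad on purpose: any failure triggers a retry
            last_error = error
            if attempt < num_retries:
                sleep(sleep_in_secs)
    raise RuntimeError(f"step failed after {num_retries} attempts") from last_error
```

The random delay spreads concurrent jobs so they do not all hit the registry at once, and passing `sleep` as a parameter makes the helper testable without real waiting.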

jenkins/L0_Test.groovy (2)

102-103: Centralized SSH options improve consistency

Defining COMMON_SSH_OPTIONS is a solid cleanup. Consider appending -o BatchMode=yes to avoid interactive prompts in CI.
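
As a rough illustration of why a single options constant helps, here is a small Python sketch that builds every ssh invocation from one shared string. The helper `ssh_command` and the user and host values are hypothetical; only the option string comes from the PR, and `-o BatchMode=yes` is the reviewer's optional suggestion.

```python
import shlex
from typing import List

# Option string as described in the PR walkthrough.
COMMON_SSH_OPTIONS = "-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null"


def ssh_command(user: str, host: str, remote_command: str, extra_options: str = "") -> List[str]:
    """Build an ssh argv list that always carries the shared options."""
    options = shlex.split(f"{COMMON_SSH_OPTIONS} {extra_options}")
    return ["ssh", *options, f"{user}@{host}", remote_command]


# Placeholder user and host, with the suggested non-interactive option appended.
argv = ssh_command("ci", "node.example.com", "hostname", extra_options="-o BatchMode=yes")
```

With the options defined once, a change such as adding `BatchMode=yes` touches a single place instead of every ssh and scp call site.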

1378-1379: Port-forwarding sequence: validate duplicate remote forwards

You establish remote forwarding twice (first via ssh with -NR ${userPort}:... and then via autossh with -NR ${userPort}:...). The second may fail with "address already in use" or leave the first redundant.

If autossh is the intended long-lived tunnel, drop the -NR ${userPort}:... on the preceding ssh command and keep the monitor -R ${monitorPort}:.../-L as needed. Example:
-                sh "ssh -f ${COMMON_SSH_OPTIONS} -L 1111:127.0.0.1:${monitorPort} -R ${monitorPort}:127.0.0.1:1112 -NR ${userPort}:localhost:22 ${USERNAME}@${HOST_NAME}"
+                sh "ssh -f ${COMMON_SSH_OPTIONS} -L 1111:127.0.0.1:${monitorPort} -R ${monitorPort}:127.0.0.1:1112 -N ${USERNAME}@${HOST_NAME}"
Then let autossh own the ${userPort} reverse tunnel.

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 7f7a301 and 85759ab.

📒 Files selected for processing (3)

cpp/CMakeLists.txt (1 hunks)
jenkins/BuildDockerImage.groovy (6 hunks)
jenkins/L0_Test.groovy (10 hunks)

🔇 Additional comments (2)

jenkins/BuildDockerImage.groovy (1)

262-262: Sorted environment printout is a good logging enhancement

Using env | sort improves reproducibility of logs and diagnostics. No issues.

jenkins/L0_Test.groovy (1)

118-121: LGTM: Replacing scattered SSH/SCP flags with COMMON_SSH_OPTIONS

The substitutions are correct for scp/ssh invocations in these places, and help future maintenance.

Also applies to: 244-245, 333-333, 341-342, 345-346, 355-355, 399-399

jenkins/L0_Test.groovy

…tion Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>

chzblych · 2025-08-18T06:06:20Z

/bot run --stage-list "GB200-4_GPUs-PyTorch-1, DGX_B200-4_GPUs-PyTorch-Post-Merge-1, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1"

tensorrt-cicd · 2025-08-18T06:12:01Z

PR_Github #15598 [ run ] triggered by Bot

tensorrt-cicd · 2025-08-18T11:22:42Z

PR_Github #15598 [ run ] completed with state SUCCESS
/LLM/release-1.0/L0_MergeRequest_PR pipeline #173 (Partly Tested) completed with status: 'SUCCESS'

…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>

…ection (#6971) (#7005) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>

…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>

[TRTLLM-7141][infra] Use repo mirrors to avoid intermittent network f…

47c9d89

…ailures (NVIDIA#6836) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>

chzblych requested a review from a team as a code owner August 18, 2025 01:06

coderabbitai bot reviewed Aug 18, 2025

[None][infra] Revert a WAR on main branch only and improve SSH connec…

347fc44

…tion Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>

chzblych force-pushed the cherry-pick-from-main branch from 85759ab to 347fc44 Compare August 18, 2025 06:03

yiqingy0 approved these changes Aug 18, 2025

chzblych merged commit 6fda8dd into NVIDIA:release/1.0 Aug 18, 2025
4 of 5 checks passed

chzblych deleted the cherry-pick-from-main branch August 18, 2025 17:11

coderabbitai bot mentioned this pull request Aug 18, 2025

[None][infra] Cherry-pick #6836 from main branch and improve SSH connection (#6971) #7005

Merged

coderabbitai bot mentioned this pull request Sep 5, 2025

[None][ci] Improve SSH connection stability #7567

Merged

1 task
