-
Notifications
You must be signed in to change notification settings - Fork 1.8k
[None][infra] Cherry-pick #6836 from main branch and improve SSH connection #6971
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ailures (NVIDIA#6836) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
📝 WalkthroughWalkthroughAdds an environment-driven mirror redirection for UCXX’s rapids-cmake fetch in CMake. Tightens and mirrors image ARG extraction in the Jenkins Docker build, adds randomized sleeps, retries, and sorted env prints. Centralizes SSH options in the L0 test Jenkins script via a new COMMON_SSH_OPTIONS variable. Changes
Sequence Diagram(s)sequenceDiagram
participant CMake as CMake Configure
participant Env as Environment
participant FS as Filesystem
CMake->>Env: Check UCX found and GITHUB_MIRROR set
alt Conditions met
CMake->>FS: Read 3RDPARTY_DIR/ucxx/fetch_rapids.cmake
FS-->>CMake: File content
CMake->>CMake: Replace rapids-cmake URL with $GITHUB_MIRROR path
CMake->>FS: Write updated file
CMake->>CMake: Warn: replaced fetch_rapids.cmake URL
else
CMake->>CMake: No change
end
sequenceDiagram
participant Jenkins as Jenkins Pipeline
participant Make as Make/Dockerfile
participant Registry as Docker Registry
participant Docker as Docker
Jenkins->>Make: Extract BASE/ARG images (anchored grep)
Jenkins->>Jenkins: Replace nvcr.io with internal mirror
Jenkins->>Jenkins: Compute randomSleep (300–599s)
Jenkins->>Jenkins: env | sort (log)
Jenkins->>Jenkins: Sleep randomSleep
Jenkins->>Registry: docker pull base/triton image
loop up to 3 retries
Jenkins->>Docker: docker build (with sleepInSecs: randomSleep)
alt build fails
Jenkins->>Jenkins: Retry
else build succeeds
Jenkins-->>Jenkins: Done
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Tip 🔌 Remote MCP (Model Context Protocol) integration is now available!Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats. 📜 Recent review detailsConfiguration used: .coderabbit.yaml 💡 Knowledge Base configuration:
You can enable these sources in your CodeRabbit configuration. 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
✨ Finishing Touches🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
Status, Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (7)
cpp/CMakeLists.txt (1)
498-508: Mirror rewrite logic is sound; add a guard to avoid unnecessary writes and aid diagnosticsThe env-gated mirror redirection looks good and is safely scoped. To improve robustness:
- Only write the file if a replacement actually occurred.
- Emit a STATUS message when no match was found (helps when UCXX updates the fetch script format).
Apply within this block:
- file(READ "${3RDPARTY_DIR}/ucxx/fetch_rapids.cmake" FILE_CONTENTS) - string( - REPLACE "https://raw.githubusercontent.com/rapidsai/rapids-cmake" - "$ENV{GITHUB_MIRROR}/rapidsai/rapids-cmake/raw/refs/heads" - FILE_CONTENTS "${FILE_CONTENTS}") - file(WRITE "${3RDPARTY_DIR}/ucxx/fetch_rapids.cmake" "${FILE_CONTENTS}") - message(WARNING "Replace UCXX fetch_rapids.cmake with internal mirror") + file(READ "${3RDPARTY_DIR}/ucxx/fetch_rapids.cmake" FILE_CONTENTS) + set(_OLD_URL "https://raw.githubusercontent.com/rapidsai/rapids-cmake") + set(_NEW_URL "$ENV{GITHUB_MIRROR}/rapidsai/rapids-cmake/raw/refs/heads") + if(FILE_CONTENTS MATCHES "${_OLD_URL}") + string(REPLACE "${_OLD_URL}" "${_NEW_URL}" FILE_CONTENTS "${FILE_CONTENTS}") + file(WRITE "${3RDPARTY_DIR}/ucxx/fetch_rapids.cmake" "${FILE_CONTENTS}") + message(WARNING "Replace UCXX fetch_rapids.cmake with internal mirror: ${_NEW_URL}") + else() + message(STATUS "UCXX fetch_rapids.cmake contains no '${_OLD_URL}' occurrences. Skipping mirror rewrite.") + endif()jenkins/BuildDockerImage.groovy (4)
284-287: ARG extraction: works; consider simpler, more resilient parsingYour anchored grep avoids incidental matches. For readability and fewer processes, awk can extract the value in one pass and trim quotes:
- def BASE_IMAGE = sh(script: "cd ${LLM_ROOT} && grep '^ARG BASE_IMAGE=' docker/Dockerfile.multi | grep -o '=.*' | tr -d '=\"'", returnStdout: true).trim() - def TRITON_IMAGE = sh(script: "cd ${LLM_ROOT} && grep '^ARG TRITON_IMAGE=' docker/Dockerfile.multi | grep -o '=.*' | tr -d '=\"'", returnStdout: true).trim() - def TRITON_BASE_TAG = sh(script: "cd ${LLM_ROOT} && grep '^ARG TRITON_BASE_TAG=' docker/Dockerfile.multi | grep -o '=.*' | tr -d '=\"'", returnStdout: true).trim() + def BASE_IMAGE = sh(script: "cd ${LLM_ROOT} && awk -F= '/^ARG BASE_IMAGE=/ {gsub(/\"/, \"\", \$2); print \$2}' docker/Dockerfile.multi", returnStdout: true).trim() + def TRITON_IMAGE = sh(script: "cd ${LLM_ROOT} && awk -F= '/^ARG TRITON_IMAGE=/ {gsub(/\"/, \"\", \$2); print \$2}' docker/Dockerfile.multi", returnStdout: true).trim() + def TRITON_BASE_TAG = sh(script: "cd ${LLM_ROOT} && awk -F= '/^ARG TRITON_BASE_TAG=/ {gsub(/\"/, \"\", \$2); print \$2}' docker/Dockerfile.multi", returnStdout: true).trim()Also consider asserting non-empty outputs and failing early with a clear message if any are missing.
289-290: RockyLinux8 Makefile extraction: anchored pattern is good; add a fallback checkAnchoring to
^jenkins-rockylinux8_%: BASE_IMAGE =is safer. Add a guard to fail fast if the variable is empty to prevent pulling "null" images later.if (target == "rockylinux8") { - BASE_IMAGE = sh(script: "cd ${LLM_ROOT} && grep '^jenkins-rockylinux8_%: BASE_IMAGE =' docker/Makefile | grep -o '=.*' | tr -d '=\"'", returnStdout: true).trim() + BASE_IMAGE = sh(script: "cd ${LLM_ROOT} && awk -F= '/^jenkins-rockylinux8_%: BASE_IMAGE =/ {gsub(/\"/, \"\", \$2); print \$2}' docker/Makefile", returnStdout: true).trim() + if (!BASE_IMAGE) { + error "Failed to resolve BASE_IMAGE from docker/Makefile for rockylinux8" + } }
298-309: Retry docker pull as well; keep retry policy consistentYou added retries to the long make step; mirror the same for the preceding docker pull to smooth transient registry hiccups.
- def randomSleep = (Math.random() * 300 + 300).toInteger() - trtllm_utils.llmExecStepWithRetry(this, script: "docker pull ${TRITON_IMAGE}:${TRITON_BASE_TAG}", sleepInSecs: randomSleep, shortCommondRunTimeMax: 7200) + def randomSleep = (Math.random() * 300 + 300).toInteger() + trtllm_utils.llmExecStepWithRetry(this, script: "docker pull ${TRITON_IMAGE}:${TRITON_BASE_TAG}", sleepInSecs: randomSleep, numRetries: 3, shortCommondRunTimeMax: 7200)
326-337: Also retry the main-stage docker pullSame rationale as the dependent stage; keep retry semantics consistent.
- sh "env | sort" - def randomSleep = (Math.random() * 300 + 300).toInteger() - trtllm_utils.llmExecStepWithRetry(this, script: "docker pull ${TRITON_IMAGE}:${TRITON_BASE_TAG}", sleepInSecs: randomSleep, shortCommondRunTimeMax: 7200) + sh "env | sort" + def randomSleep = (Math.random() * 300 + 300).toInteger() + trtllm_utils.llmExecStepWithRetry(this, script: "docker pull ${TRITON_IMAGE}:${TRITON_BASE_TAG}", sleepInSecs: randomSleep, numRetries: 3, shortCommondRunTimeMax: 7200)jenkins/L0_Test.groovy (2)
102-103: Centralized SSH options improve consistencyDefining COMMON_SSH_OPTIONS is a solid cleanup. Consider appending
-o BatchMode=yesto avoid interactive prompts in CI.
1378-1379: Port-forwarding sequence: validate duplicate remote forwardsYou establish remote forwarding twice (first via ssh with
-NR ${userPort}:...and then via autossh with-NR ${userPort}:...). The second may fail with "address already in use" or leave the first redundant.If autossh is the intended long-lived tunnel, drop the
-NR ${userPort}:...on the preceding ssh command and keep the monitor-R ${monitorPort}:.../-Las needed. Example:- sh "ssh -f ${COMMON_SSH_OPTIONS} -L 1111:127.0.0.1:${monitorPort} -R ${monitorPort}:127.0.0.1:1112 -NR ${userPort}:localhost:22 ${USERNAME}@${HOST_NAME}" + sh "ssh -f ${COMMON_SSH_OPTIONS} -L 1111:127.0.0.1:${monitorPort} -R ${monitorPort}:127.0.0.1:1112 -N ${USERNAME}@${HOST_NAME}"Then let autossh own the
${userPort}reverse tunnel.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
cpp/CMakeLists.txt(1 hunks)jenkins/BuildDockerImage.groovy(6 hunks)jenkins/L0_Test.groovy(10 hunks)
🔇 Additional comments (2)
jenkins/BuildDockerImage.groovy (1)
262-262: Sorted environment printout is a good logging enhancementUsing
env | sortimproves reproducibility of logs and diagnostics. No issues.jenkins/L0_Test.groovy (1)
118-121: LGTM: Replacing scattered SSH/SCP flags with COMMON_SSH_OPTIONSThe substitutions are correct for scp/ssh invocations in these places, and help future maintenance.
Also applies to: 244-245, 333-333, 341-342, 345-346, 355-355, 399-399
…tion Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
85759ab to
347fc44
Compare
|
/bot run --stage-list "GB200-4_GPUs-PyTorch-1, DGX_B200-4_GPUs-PyTorch-Post-Merge-1, GB200-8_GPUs-2_Nodes-PyTorch-Post-Merge-1" |
|
PR_Github #15598 [ run ] triggered by Bot |
|
PR_Github #15598 [ run ] completed with state |
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
…H connection (NVIDIA#6971) Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Zhanrui Sun <184402041+ZhanruiSunCh@users.noreply.github.com> Signed-off-by: Wangshanshan <30051912+dominicshanshan@users.noreply.github.com>
Summary by CodeRabbit
Chores
Refactor