Use amazon linux 2023 runners for Docker builds by atalman · Pull Request #136544 · pytorch/pytorch · GitHub
atalman (Contributor) commented Sep 24, 2024

Migrate these builds to Amazon Linux 2023. We want to build and test the Docker images in CD.

It looks like we are hitting docker/buildx#379 when trying to build Docker images on Amazon Linux 2023.

The Conda Docker build is timing out, while the Manywheel build runs but fails because BuildKit is turned off: https://github.com/pytorch/pytorch/actions/runs/11036043157/job/30653543264?pr=136544

The proposed solution is to fix it in user_data. Please see: pytorch/test-infra#5712

I see the Docker builds executing successfully here: https://github.com/pytorch/pytorch/actions/runs/11040149229/job/30667448668?pr=136544

Work around the timeout problem (reported in https://bugzilla.redhat.com/show_bug.cgi?id=1537564) by setting the number of open files per container to 1048576.
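
The workaround described here amounts to replacing a single directive in the Docker systemd unit file. Below is a minimal sketch of that edit, written as a function so it can be tried on a copy of the unit file first. The function name `patch_nofile_limit` and its parameters are illustrative and not part of this PR, and the `systemctl` steps mentioned in the comments are an assumption about how the change would be activated on a runner.

```shell
#!/bin/sh
set -eu

# Hypothetical helper (not from the PR): rewrite LimitNOFILE=infinity to a
# finite value in the given systemd unit file. Other directives are untouched.
patch_nofile_limit() {
  pnl_unit_file="$1"
  pnl_limit="${2:-1048576}"   # default matches the value used in this PR
  pnl_tmp="$(mktemp)"
  sed "s/LimitNOFILE=infinity/LimitNOFILE=${pnl_limit}/" "$pnl_unit_file" > "$pnl_tmp"
  # Write back through the original path so ownership and mode are preserved.
  cat "$pnl_tmp" > "$pnl_unit_file"
  rm -f "$pnl_tmp"
}

# Assumed usage on a runner, as root (not executed here):
#   patch_nofile_limit /usr/lib/systemd/system/docker.service
#   systemctl daemon-reload
#   systemctl restart docker
```

Taking the unit file path as a parameter keeps the edit testable without root access; systemd only picks up the new value after a daemon reload and a restart of the Docker service.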

@atalman atalman requested a review from a team as a code owner September 24, 2024 16:43
@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Sep 24, 2024
pytorch-bot bot commented Sep 24, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136544

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (4 Unrelated Failures)

As of commit be8ec13 with merge base eac04fe:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ZainRizvi previously approved these changes Sep 24, 2024
@atalman atalman requested a review from jeffdaily as a code owner September 25, 2024 00:36
@ZainRizvi ZainRizvi dismissed their stale review September 25, 2024 16:36

dismissing until the docker builds actually work

test (13 commits with this message)

fix

restart_docker

Add description

ZainRizvi (Contributor) left a comment

Why not resolve the fix in test-infra first and then merge the pytorch PR after it starts consuming those test-infra changes? That would help test the changes in a clean environment


nit: remove

malfet (Contributor) commented Sep 25, 2024

@pytorchbot merge -f "Lint is green, let's test it in prod..."

set -x
# TODO: Remove LimitNOFILE=1048576 patch once https://github.com/pytorch/test-infra/issues/5712
# is resolved. This patch is required in order to fix timing out of Docker build on Amazon Linux 2023.
sudo sed -i s/LimitNOFILE=infinity/LimitNOFILE=1048576/ /usr/lib/systemd/system/docker.service

nit: please add a link to the issue that you discovered which explains the problem in more detail

pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


BoyuanFeng pushed a commit to BoyuanFeng/pytorch that referenced this pull request Sep 25, 2024
Pull Request resolved: pytorch#136544
Approved by: https://github.com/ZainRizvi

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
4 participants