KEMBAR78
[CI] Introduces experiment `awsa100` to `inductor-perf-compare.yml` workflow using `_runner-determinator.yml` by jeanschmidt · Pull Request #138204 · pytorch/pytorch · GitHub
Skip to content

Conversation

@jeanschmidt
Copy link
Contributor

@jeanschmidt jeanschmidt commented Oct 17, 2024

Adds the job get-test-label-type in .github/workflows/inductor-perf-compare.yml checking for the experiment awsa100.

It is then used by the job linux-focal-cuda12_1-py3_10-gcc9-inductor-build to define the prefix for the runners that will run the benchmark.

Those runners temporarily accept the labels awsa100.linux.gcp.a100 and linux.aws.a100. This is used so we can migrate via experimentation from linux.gcp.a100. After successfully experiment with those instances we will remove those labels and update the workflows to use linux.aws.a100 and decomisson the gcp fleet.

@jeanschmidt jeanschmidt requested a review from a team as a code owner October 17, 2024 12:41
@pytorch-bot
Copy link

pytorch-bot bot commented Oct 17, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138204

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 970eb24 with merge base 354bc3a (image):

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@jeanschmidt
Copy link
Contributor Author

@pytorchbot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased jeanschmidt/perf_compare_aws onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout jeanschmidt/perf_compare_aws && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the jeanschmidt/perf_compare_aws branch from 68e70f2 to 0688c84 Compare October 18, 2024 14:59
jobs:
get-label-type:
name: get-label-type
get-build-label-type:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Feels kinda weird to call this the build label since this is technically the default, cross-fleet label. It makes sense in the context of this file, but this isn't a convention that can be copied over to other workflows

Naming things is hard

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we just keep the job name as get-label-type?

Feels confusing to me TBH

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps default-label-prefix could be a new convention?

Copy link
Contributor

@huydhn huydhn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yay, this is finally here. Let me know if you plan to switch to a different runner label as awsa100.linux.gpu.a100 sounds weird

@jeanschmidt
Copy link
Contributor Author

@huydhn indeed it sounds very weird, this is done so we can leverage the current experiment stack without the need to perform some complex rewrite of the tooling.

The idea is to migrate to linux.aws.a100 as the final version.

@jeanschmidt
Copy link
Contributor Author

@pytorchbot merge -f "changes can't impact trunk"

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@jeanschmidt jeanschmidt deleted the jeanschmidt/perf_compare_aws branch October 23, 2024 14:03
SamGinzburg pushed a commit that referenced this pull request Oct 28, 2024
…orkflow using `_runner-determinator.yml` (#138204)

Adds the job `get-test-label-type` in `.github/workflows/inductor-perf-compare.yml` checking for the experiment `awsa100`.

It is then used by the job `linux-focal-cuda12_1-py3_10-gcc9-inductor-build` to define the prefix for the runners that will run the benchmark.

Those runners temporarily accept the labels `awsa100.linux.gcp.a100` and `linux.aws.a100`. This is used so we can migrate via experimentation from `linux.gcp.a100`. After successfully experiment with those instances we will remove those labels and update the workflows to use `linux.aws.a100` and decomisson the gcp fleet.
Pull Request resolved: #138204
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants