[Reland]Use cpuinfo to determine c10::ThreadPool thread number by cyyever · Pull Request #107339 · pytorch/pytorch

Conversation

@cyyever
Collaborator

@cyyever cyyever commented Aug 17, 2023

Relands PR #107010 and fixes BUCK builds.

cc @ezyang @izaitsevfb

@pytorch-bot

pytorch-bot bot commented Aug 17, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/107339

Note: Links to docs will display an error until the docs builds have been completed.

❌ 130 New Failures, 6 Unrelated Failures

As of commit 59b327c with merge base 5531a23:

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and have been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions
Copy link
Contributor

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@cyyever
Collaborator Author

cyyever commented Aug 17, 2023

@pytorchbot label ciflow/periodic

@pytorch-bot pytorch-bot bot added the ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR label Aug 17, 2023
@pytorch-bot

pytorch-bot bot commented Aug 17, 2023

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: argument command: invalid choice: 'tag' (choose from 'merge', 'revert', 'rebase', 'label', 'drci')

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci} ...

Try @pytorchbot --help for more info.

@cyyever
Collaborator Author

cyyever commented Aug 17, 2023

@pytorchbot label ciflow/slow

@cyyever
Collaborator Author

cyyever commented Aug 17, 2023

@pytorchbot label ciflow/trunk

@pytorch-bot pytorch-bot bot added ciflow/slow ciflow/trunk Trigger trunk jobs on your pull request labels Aug 17, 2023
@cyyever
Collaborator Author

cyyever commented Aug 17, 2023

@pytorchbot label ciflow/inductor

@soulitzer soulitzer requested a review from ezyang August 17, 2023 03:52
@soulitzer soulitzer added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Aug 17, 2023
@facebook-github-bot
Contributor

@izaitsevfb has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@izaitsevfb
Contributor

I've imported the PR to get the signal from internal meta builds ahead of the merge. Will let you know the results as soon as they are ready.

@izaitsevfb
Contributor

Unfortunately, the internal builds are still failing. I'm trying to see how to forward-fix them on Meta's side.

@izaitsevfb
Contributor

izaitsevfb commented Aug 18, 2023

Please don't merge yet, I'm still working on the internal build fix.

@cyyever
Collaborator Author

cyyever commented Aug 30, 2023

@izaitsevfb What is the situation?

@izaitsevfb
Contributor

@izaitsevfb What is the situation?

I'm sorry for the delay; unfortunately, the fix is not trivial. We don't yet have a precedent for including third-party libraries in the c10 core internally, and my attempts so far have been unsuccessful.

But I'm still on it, and I'll try to solve this EoW. If not, I'll escalate.

@cyyever
Collaborator Author

cyyever commented Sep 9, 2023

@pytorchbot label ciflow/binaries

@pytorch-bot pytorch-bot bot added the ciflow/binaries Trigger all binary build and upload jobs on the PR label Sep 9, 2023
@cyyever
Collaborator Author

cyyever commented Sep 10, 2023

@izaitsevfb If it is hard to add cpuinfo to c10, could we remove the cpuinfo test temporarily and try again in another PR?

izaitsevfb added a commit to izaitsevfb/pytorch that referenced this pull request Sep 11, 2023
…r + internal patch

Summary: Testing pytorch#107339 combined with internal patches.

Differential Revision: D49109231
@izaitsevfb
Contributor

@cyyever

@izaitsevfb If it is hard to add cpuinfo to c10, may we remove the cpuinfo test temporarily and try that in another PR?

I think I finally found everything that needs to be patched internally.

The following changes should be made to your PR (I can do this myself if you're OK with that; see #109079):

  • in c10/ovrsource_defs.bzl "//third-party/cpuinfo:cpuinfo", should be added.
  • in c10/core/build.bzl I changed the dependency to ":cpuinfo", but your original variant might also work (I'm testing this now).

I'll try to merge this PR tomorrow. Apologies again for the delay.

@cyyever
Collaborator Author

cyyever commented Sep 12, 2023

@izaitsevfb Thank you so much for the effort. I will add the changes.

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased cpu_core_count2 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout cpu_core_count2 && git pull --rebase)

@facebook-github-bot
Contributor

@izaitsevfb has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@izaitsevfb
Contributor

Ok, unfortunately :cpuinfo in c10/core/build.bzl doesn't work in the OSS bazel build (I missed the build failures on my test PR, as they are suppressed as flaky). I'm trying to figure out a way to get a consolidated dependency syntax that works both internally and in OSS.

@cyyever
Collaborator Author

cyyever commented Sep 13, 2023

@izaitsevfb I ran a local bazel build, and "//third_party/cpuinfo" works.

@izaitsevfb
Contributor

izaitsevfb commented Sep 13, 2023

@izaitsevfb I ran a local bazel build, and "//third_party/cpuinfo" works.

Yep, that's the way to go. I found a way to support this variant internally as well, and all internal tests now pass on the patched version.

So the plan is as follows (mostly for the record):

  • import the PR
  • patch the diff with internal-only changes (they are already prepared)
  • land both versions simultaneously in a co-dev style

@facebook-github-bot
Contributor

@izaitsevfb has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@izaitsevfb
Contributor

The corresponding diff was merged internally, so our tooling should merge this PR shortly (if it doesn't, I'll force-merge it).

@izaitsevfb
Contributor

@pytorchbot merge -f 'Landed internally as D48443523'

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@cyyever cyyever deleted the cpu_core_count2 branch September 14, 2023 23:46
@cyyever
Collaborator Author

cyyever commented Sep 14, 2023

@izaitsevfb Thank you!

@izaitsevfb
Contributor

@cyyever thank you for your patience, that was quite an adventure :)

huydhn added a commit to pytorch/test-infra that referenced this pull request Sep 25, 2023
This addresses a confusing bug on HUD and Dr.CI where a bunch of
unrelated cancelled signals show up, forcing people to force merge.
For example,

* Dr.CI pytorch/pytorch#107339
* HUD https://hud.pytorch.org/pr/107339
* Dr.CI pytorch/pytorch#105055
* HUD https://hud.pytorch.org/pr/105055 

They are cancelled signals from the previous workflow run that had been
retried successfully. The cancelled signals were there because the names
differ, i.e. `manywheel-py3_10-cuda11_8-test (cancel)` became
`manywheel-py3_10-cuda11_8-test / test (success)` after retrying.

The fix I have here is to use a trie search to check whether a cancelled
job has been retried successfully and remove it from the list accordingly.

### Testing

*
https://torchci-git-fork-huydhn-remove-wrong-cancel-948947-fbopensource.vercel.app/pr/107339
*
https://torchci-git-fork-huydhn-remove-wrong-cancel-948947-fbopensource.vercel.app/pr/105055

* **Dr.CI  #107339**
<!-- drci-comment-start -->

## 🔗 Helpful Links
### 🧪 See artifacts and rendered test results at
[hud.pytorch.org/pr/107339](https://hud.pytorch.org/pr/107339)
* 📄 Preview [Python docs built from this
PR](https://docs-preview.pytorch.org/pytorch/pytorch/107339/index.html)
* 📄 Preview [C++ docs built from this
PR](https://docs-preview.pytorch.org/pytorch/pytorch/107339/cppdocs/index.html)
* ❓ Need help or want to give feedback on the CI? Visit the
[bot commands
wiki](https://github.com/pytorch/pytorch/wiki/Bot-commands) or our
[office
hours](https://github.com/pytorch/pytorch/wiki/Dev-Infra-Office-Hours)

Note: Links to docs will display an error until the docs builds have
been completed.


## ❌ 1 New Failure, 6 Unrelated Failures
As of commit 59b327cdb6891318111fe98c0dbc72c7da0e7b95 with merge base
5531a23b204b4daa2c0bb3c52610e9a0ba79dacf (<sub><sub><img alt="image"
width=70
src="https://img.shields.io/date/1694521453?label=&color=FFFFFF&style=flat-square"></sub></sub>):
<details open><summary><b>NEW FAILURE</b> - The following job has
failed:</summary><p>

*
[wheel-py3_10-cpu-test](https://hud.pytorch.org/pr/pytorch/pytorch/107339#16727706927)
([gh](https://github.com/pytorch/pytorch/actions/runs/6163427856/job/16727706927))
</p></details>
<details ><summary><b>FLAKY</b> - The following job failed but was
likely due to flakiness present on trunk:</summary><p>

* [macos-12-py3-x86-64 / test (default, 1, 4,
macos-12)](https://hud.pytorch.org/pr/pytorch/pytorch/107339#16727674082)
([gh](https://github.com/pytorch/pytorch/actions/runs/6163427777/job/16727674082))
</p></details>
<details ><summary><b>UNSTABLE</b> - The following jobs failed but were
likely due to flakiness present on trunk and have been marked as
unstable:</summary><p>

* [linux-focal-cuda11.8-py3.9-gcc9 / test (multigpu, 1, 1,
linux.g5.12xlarge.nvidia.gpu,
unstable)](https://hud.pytorch.org/pr/pytorch/pytorch/107339#16727757788)
([gh](https://github.com/pytorch/pytorch/actions/runs/6163427777/job/16727757788))
* [linux-focal-py3.11-clang10 / test (dynamo, 1, 2, linux.2xlarge,
unstable)](https://hud.pytorch.org/pr/pytorch/pytorch/107339#16724273623)
([gh](https://github.com/pytorch/pytorch/actions/runs/6162428606/job/16724273623))
* [linux-focal-py3.11-clang10 / test (dynamo, 2, 2, linux.2xlarge,
unstable)](https://hud.pytorch.org/pr/pytorch/pytorch/107339#16724273805)
([gh](https://github.com/pytorch/pytorch/actions/runs/6162428606/job/16724273805))
* [linux-focal-py3.8-clang10 / test (dynamo, 1, 2, linux.2xlarge,
unstable)](https://hud.pytorch.org/pr/pytorch/pytorch/107339#16724267602)
([gh](https://github.com/pytorch/pytorch/actions/runs/6162428606/job/16724267602))
* [linux-focal-py3.8-clang10 / test (dynamo, 2, 2, linux.2xlarge,
unstable)](https://hud.pytorch.org/pr/pytorch/pytorch/107339#16724267789)
([gh](https://github.com/pytorch/pytorch/actions/runs/6162428606/job/16724267789))
</p></details>


This comment was automatically generated by Dr. CI and updates every 15
minutes.
<!-- drci-comment-end -->

* **Dr.CI #105055**
<!-- drci-comment-start -->

## 🔗 Helpful Links
### 🧪 See artifacts and rendered test results at
[hud.pytorch.org/pr/105055](https://hud.pytorch.org/pr/105055)
* 📄 Preview [Python docs built from this
PR](https://docs-preview.pytorch.org/pytorch/pytorch/105055/index.html)
* 📄 Preview [C++ docs built from this
PR](https://docs-preview.pytorch.org/pytorch/pytorch/105055/cppdocs/index.html)
* ❓ Need help or want to give feedback on the CI? Visit the
[bot commands
wiki](https://github.com/pytorch/pytorch/wiki/Bot-commands) or our
[office
hours](https://github.com/pytorch/pytorch/wiki/Dev-Infra-Office-Hours)

Note: Links to docs will display an error until the docs builds have
been completed.


## ✅ You can merge normally! (2 Unrelated Failures)
As of commit 1f762dfc92b46323950c3e6e95d99d9687741451 with merge base
d0f8ee45bdd3d68895dfecf38b39c363ebf82483 (<sub><sub><img alt="image"
width=70
src="https://img.shields.io/date/1692769911?label=&color=FFFFFF&style=flat-square"></sub></sub>):
<details ><summary><b>FLAKY</b> - The following job failed but was
likely due to flakiness present on trunk:</summary><p>

* [cuda12.1-py3.10-gcc9-sm86-periodic-dynamo-benchmarks / test
(aot_eager_torchbench, 1, 1,
linux.g5.4xlarge.nvidia.gpu)](https://hud.pytorch.org/pr/pytorch/pytorch/105055#16135412390)
([gh](https://github.com/pytorch/pytorch/actions/runs/5949203435/job/16135412390))
</p></details>
<details ><summary><b>UNSTABLE</b> - The following job failed but was
likely due to flakiness present on trunk and has been marked as
unstable:</summary><p>

* [linux-focal-rocm5.6-py3.8 / test (default, 1, 3, linux.rocm.gpu,
unstable)](https://hud.pytorch.org/pr/pytorch/pytorch/105055#16135257231)
([gh](https://github.com/pytorch/pytorch/actions/runs/5949203290/job/16135257231))
</p></details>


This comment was automatically generated by Dr. CI and updates every 15
minutes.
<!-- drci-comment-end -->

Labels

ciflow/binaries Trigger all binary build and upload jobs on the PR
ciflow/inductor
ciflow/periodic Trigger jobs run periodically on master (periodic.yml) on the PR
ciflow/slow
ciflow/trunk Trigger trunk jobs on your pull request
Merged
open source
triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module


7 participants