catch tensor.numel() == 0 in nan detector by HarounH · Pull Request #140741 · pytorch/pytorch · GitHub

catch tensor.numel() == 0 in nan detector #140741

Closed

HarounH wants to merge 1 commit into pytorch:main from HarounH:export-D65956095

Contributor

HarounH commented Nov 14, 2024

Context: we are now sometimes passing an empty tensor through the system (it's an edge case), and it seems to cause all_reduce to segfault, which is unexpected to me.

Deep Shah and Pavan identified the issue, I'm just pushing for a fix :)

Test Plan: idk what i'm doing here, someone help

Reviewed By: shuqiangzhang

Differential Revision: D65956095

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

pytorch-bot bot commented Nov 14, 2024

This appears to be a diff that was exported from Phabricator, but the PR author does not have sufficient permissions to run CI. @HarounH, please do step 2 of the internal wiki to get write access so you do not need to get CI approvals in the future. If you think this is a mistake, please contact the PyTorch Dev Infra team.

linux-foundation-easycla bot commented Nov 14, 2024

The committers listed above are authorized under a signed CLA.

✅ login: HarounH / name: Haroun H (2b117c7)

pytorch-bot bot added oncall: distributed release notes: distributed (c10d) labels

pytorch-bot bot commented Nov 14, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140741

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

[DomainsOnly] Jobs fail with GLIBC version not found

✅ No Failures

As of commit 2b117c7 with merge base 27c7caf:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Contributor

facebook-github-bot commented Nov 14, 2024

This pull request was exported from Phabricator. Differential Revision: D65956095

facebook-github-bot added the fb-exported label

HarounH force-pushed the export-D65956095 branch from 34eb847 to 9575ada Compare

November 14, 2024 20:45

HarounH force-pushed the export-D65956095 branch from 9575ada to f2c96df Compare

November 14, 2024 20:48

shuqiangzhang approved these changes

HarounH force-pushed the export-D65956095 branch from f2c96df to e076a8f Compare

November 14, 2024 21:05


catch tensor.numel() == 0 in nan detector (pytorch#140741)

2b117c7

Summary:

Pull Request resolved: pytorch#140741

Test Plan: idk what i'm doing here, someone help

Reviewed By: shuqiangzhang

Differential Revision: D65956095

HarounH force-pushed the export-D65956095 branch from e076a8f to 2b117c7 Compare

November 14, 2024 21:06

Contributor

shuqiangzhang commented Nov 15, 2024

@pytorchbot merge -f "merging"

pytorch-bot bot commented Nov 15, 2024

You need to provide a reason for using force merge, in the format @pytorchbot merge -f 'Explanation'.
The explanation needs to be clear on why this is needed. Here are some good examples:

Bypass checks due to unrelated upstream failures from ...
This is a minor fix to ..., which shouldn't break anything
This is pre-tested in a previous CI run
Bypass flaky ... check

Contributor

shuqiangzhang commented Nov 15, 2024

@pytorchbot merge -f "no CI failure"

pytorchmergebot added the merging label

Collaborator

pytorchmergebot commented Nov 15, 2024

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorchmergebot closed this in 8043e67

pytorchmergebot added Merged and removed merging labels

pobin6 pushed a commit to pobin6/pytorch that referenced this pull request


catch tensor.numel() == 0 in nan detector (pytorch#140741)

8ee3158

Pull Request resolved: pytorch#140741
Approved by: https://github.com/shuqiangzhang

Labels

fb-exported Merged oncall: distributed release notes: distributed (c10d)