catch tensor.numel() == 0 in nan detector by HarounH · Pull Request #140741 · pytorch/pytorch · GitHub

Conversation

@HarounH
Contributor

@HarounH HarounH commented Nov 14, 2024

Context: we now sometimes pass an empty tensor through the system (it's an edge case), and it seems to cause all_reduce to segfault, which is unexpected to me.

Deep Shah and Pavan identified the issue, I'm just pushing for a fix :)
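
For readers unfamiliar with the shape of the fix, here is a minimal sketch of the guard the PR title describes: return early when there are zero elements, before any scanning work starts. It is plain Python, not the actual c10d NaN detector code, and the function name `check_for_nan` and its list-based input are hypothetical stand-ins for illustration only.

```python
import math
from typing import Sequence


def check_for_nan(values: Sequence[float]) -> None:
    """Raise ValueError if any element is NaN; do nothing for empty input.

    Hypothetical stand-in for a NaN detector; not PyTorch's implementation.
    """
    # Guard analogous to `tensor.numel() == 0`: with no elements there is
    # nothing to scan, so return before touching the data at all.
    if len(values) == 0:
        return

    for index, value in enumerate(values):
        if math.isnan(value):
            raise ValueError(f"NaN found at index {index}")


check_for_nan([])          # empty input: returns immediately
check_for_nan([1.0, 2.0])  # no NaN: returns normally
```

The point of placing the check first is that every later step can assume at least one element exists.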

Test Plan: idk what i'm doing here, someone help

Reviewed By: shuqiangzhang

Differential Revision: D65956095

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

@pytorch-bot

pytorch-bot bot commented Nov 14, 2024

This appears to be a diff that was exported from phabricator, but the PR author does not have sufficient permissions to run CI. @HarounH, please do step 2 of internal wiki to get write access so you do not need to get CI approvals in the future. If you think this is a mistake, please contact the Pytorch Dev Infra team.

@linux-foundation-easycla

linux-foundation-easycla bot commented Nov 14, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: HarounH / name: Haroun H (2b117c7)

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Nov 14, 2024
@pytorch-bot

pytorch-bot bot commented Nov 14, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140741

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit 2b117c7 with merge base 27c7caf:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D65956095

Summary:

Pull Request resolved: pytorch#140741

Test Plan: idk what i'm doing here, someone help

Reviewed By: shuqiangzhang

Differential Revision: D65956095

@shuqiangzhang
Contributor

@pytorchbot merge -f "merging"

@pytorch-bot

pytorch-bot bot commented Nov 15, 2024

You need to provide a reason for using force merge, in the format @pytorchbot merge -f 'Explanation'.
The explanation needs to be clear on why this is needed. Here are some good examples:

  • Bypass checks due to unrelated upstream failures from ...
  • This is a minor fix to ..., which shouldn't break anything
  • This is pre-tested in a previous CI run
  • Bypass flaky ... check

@shuqiangzhang
Contributor

@pytorchbot merge -f "no CI failure"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
Context: we now sometimes pass an empty tensor through the system (it's an edge case), and it seems to cause all_reduce to segfault, which is unexpected to me.

Deep Shah and Pavan identified the issue, I'm just pushing for a fix :)

Test Plan: idk what i'm doing here, someone help

Reviewed By: shuqiangzhang

Differential Revision: D65956095

Pull Request resolved: pytorch#140741
Approved by: https://github.com/shuqiangzhang

Labels

fb-exported Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category

4 participants