Add Bfloat16 scalar support to gloo backend #113557
BFloat16 uses c10::BFloat16, which defines conversions to and from float, so calculations are performed on floats.
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/113557
✅ No failures as of commit abb0a4c with merge base 115da02. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
@XilunWu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Thanks for adding BFloat16 support to the gloo PG. I'm not sure this is the best way to support BFloat16, because we already have float16 support in gloo (Half, i.e. gloo::float16).
@XilunWu are you suggesting to add |
Please add unit tests.
@jgong5 That seems to be a reasonable call, right? But to unblock quickly, we can merge this PR once the Windows part is fixed, and add bfloat16 to gloo later.
Yes, that sounds reasonable.
Just a reminder that the Windows part has an issue that needs to be fixed.
Fix Windows and add unit tests for bfloat.
I added two test cases for bfloat16. I think this is enough and will not add much to the testing time.
LGTM as long as CI passes.
@XilunWu can we go forward with this change?
@XilunWu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
LGTM! Really appreciate the effort of adding BFloat16 scalar support!
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).
Support for bfloat16 scalars was missing. When I use the gloo backend
torch.distributed.init_process_group(backend='gloo')
and run
torch.nn.parallel.DistributedDataParallel(model)
and the model has BFloat16 features, I receive the following error:
RuntimeError: Invalid scalar type
This change fixes the issue.
c10::BFloat16 defines conversions to and from float, so calculations for bfloat16 are performed in float.
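To illustrate the "compute in float" idea, here is a minimal pure-Python sketch, not the code from this PR. It assumes bfloat16 is the upper 16 bits of an IEEE float32 with round-to-nearest-even on narrowing; the helper names (`float_to_bf16_bits`, `bf16_bits_to_float`, `bf16_sum`) are hypothetical and exist only for this example.

```python
import struct


def float_to_bf16_bits(x: float) -> int:
    """Narrow a float to bfloat16 and return its 16-bit pattern.

    Assumption: round-to-nearest-even on the dropped 16 mantissa bits.
    Finite values outside the float32 range are not handled in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    if (bits & 0x7FFFFFFF) > 0x7F800000:  # NaN: return a quiet NaN pattern
        return 0x7FC0
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_float(b: int) -> float:
    """Widen a bfloat16 bit pattern to float by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]


def bf16_sum(values_bits):
    """Sum bfloat16 values: widen to float, add, narrow the result back."""
    acc = 0x0000  # bit pattern of +0.0
    for b in values_bits:
        total = bf16_bits_to_float(acc) + bf16_bits_to_float(b)
        acc = float_to_bf16_bits(total)
    return acc


inputs = [float_to_bf16_bits(v) for v in (1.0, 2.0, 3.0)]
result = bf16_bits_to_float(bf16_sum(inputs))
```

Each addition happens on widened float values and only the result is narrowed back to 16 bits, which is the same pattern the description attributes to c10::BFloat16.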
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu