Change constant torch.tensor to torch.full #20061
Conversation
The documentation is not available anymore as the PR was closed or merged.
LGTM, thanks for your PR!
@michaelbenayoun would this cause any issue for our torch FX/ONNX conversions?
For FX, I think this is already tested in the CI, so I guess it does not break things.
Following the change, the training with ONNX Runtime breaks as …
What does this PR do?
Change `torch.tensor` to `torch.full` in GPT-2 to avoid CPU-GPU synchronization.

Benchmarks with PyTorch Profiler
Here's a trace of a single GPT-2 training iteration with 12 GPT-2 blocks, 2 GPUs, and DDP.
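A trace like this can be captured with `torch.profiler`. The sketch below is a minimal, hypothetical setup (the model is a stand-in, not GPT-2); on a GPU machine, `cudaStreamSynchronize` calls would show up among the CUDA runtime events in the resulting table.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Hypothetical stand-in for a GPT-2 block; the real benchmark profiles a
# full training iteration with 12 blocks, 2 GPUs, and DDP.
model = torch.nn.Linear(64, 64)
x = torch.randn(8, 64)

# Collect CPU events, and CUDA events when a GPU is available.
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    model(x).sum().backward()

# Inspect the summary; on GPU, look for cudaStreamSynchronize entries.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```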
In the `_attn` function, there are two `torch.tensor` calls. Those invoke CPU-to-GPU memory movement, thus calling `cudaStreamSynchronize`.

How to fix
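The pattern in question looks roughly like this (a simplified sketch of the scaling and masking constants in `_attn`, not the exact upstream code):

```python
import torch

def attn_scores_before(query, key):
    # Simplified sketch: both constants below are built with torch.tensor.
    # Each call materializes a scalar on the CPU first, so on GPU the
    # subsequent device placement forces a cudaStreamSynchronize.
    attn_weights = torch.matmul(query, key.transpose(-1, -2))
    attn_weights = attn_weights / torch.tensor(
        key.size(-1) ** 0.5, dtype=attn_weights.dtype, device=attn_weights.device
    )
    mask_value = torch.tensor(
        torch.finfo(attn_weights.dtype).min, dtype=attn_weights.dtype
    ).to(attn_weights.device)
    return attn_weights, mask_value
```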
Following the PyTorch Recipes, we can avoid CPU-GPU synchronization by calling `torch.full` directly instead of `torch.tensor` or `Tensor.to`. Since the two `torch.tensor` calls create constant tensors, we can change them into `torch.full([], ...)`, and they will behave the same way.

After the patch, every `cudaStreamSynchronize` is gone, and the duration of a single iteration is reduced by 0.5%.

Fixes # (issue)
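The equivalence is easy to check: `torch.full([], v, ...)` produces the same 0-dim constant as `torch.tensor(v, ...)`, but in a single call that can place the value directly on the target device (shown on CPU here so the sketch runs anywhere).

```python
import torch

value = torch.finfo(torch.float32).min

# Old style: build a CPU scalar tensor (a later .to(device) would sync).
old = torch.tensor(value, dtype=torch.float32)

# New style: 0-dim tensor in one call; pass device=... to skip the CPU hop.
new = torch.full([], value, dtype=torch.float32)

assert old.shape == new.shape == torch.Size([])
assert torch.equal(old, new)
```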
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@LysandreJik