[JIT][AD]matmul/dropout #17523
Conversation
The non-convolution changes look good. In person: let's merge the non-convolution parts of this PR but hold off on changing _convolution. It is not clear that dropping the really complicated _convolution op into differentiation will allow any backend to do meaningful optimization.
aten/src/ATen/native/Convolution.cpp (outdated)
```cpp
// impl_index 6 - 12 are in _convolution_nogroup
auto returned_tuple = at::_convolution_nogroup(
    input, weight, bias, params.stride, params.padding, params.dilation, params.transposed, params.output_padding);
output = std::get<0>(returned_tuple);
```
std::tie would be nicer
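For readers unfamiliar with the suggestion: `std::tie` binds each tuple element to a named variable in one statement instead of indexing with `std::get`. A minimal, self-contained illustration of the pattern (a generic sketch, not the PR's code):

```cpp
#include <iostream>
#include <string>
#include <tuple>

// Stand-in for any function that returns a tuple of results.
std::tuple<int, std::string> make_result() {
  return std::make_tuple(42, std::string("output"));
}

int main() {
  // std::get<N> style: keep the tuple around and index each element.
  auto returned_tuple = make_result();
  int code = std::get<0>(returned_tuple);

  // std::tie style: bind every element to a named variable in one
  // statement, which is what the review comment suggests.
  std::string name;
  std::tie(code, name) = make_result();

  std::cout << code << " " << name << "\n";
  return 0;
}
```

Applied to the snippet above, the outputs of at::_convolution_nogroup would be bound directly to named tensors instead of going through returned_tuple.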
@ailzhang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary:
- Add AD formula for _convolution & matmul & dropout
- add prim::range, fixes #17483

Example:
```
dim = 3
x = range(dim)
```

Pull Request resolved: pytorch/pytorch#17523
Differential Revision: D14254002
Pulled By: ailzhang
fbshipit-source-id: ba60d77b047db347929b72beca2623fb26aec957
```cpp
if (n->matches(
        "aten::dropout(Tensor input, float p, bool train) -> Tensor")) {
  auto train = n->get<bool>(attr::train).value();
```
You never checked that this is a constant. If train is a computed value, this will throw. You have to do n->matches("...", attr::train) above.
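A sketch of the adjustment this comment asks for, applied to the snippet above (assuming the `Node::matches(signature, const_inputs)` overload the reviewer refers to; not the exact committed code):

```cpp
// Passing attr::train as a const-input makes matches() succeed only when
// `train` is a compile-time constant, so the get<bool>(...).value() call
// below cannot throw on a computed value.
if (n->matches(
        "aten::dropout(Tensor input, float p, bool train) -> Tensor",
        /*const_inputs=*/attr::train)) {
  bool train = n->get<bool>(attr::train).value();
  // ... build the appropriate gradient for train / eval mode ...
}
```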
also nit: no need to assign to a local just to return
```cpp
      };
    }),
    Operator(
        "prim::range(int n) -> int[]",
```
Hmm, I'm a bit ambivalent about adding this, since those objects have completely different characteristics in Python.
```python
for i in range(2 ** 64):
    ...
```
is fine in Python 3, but we're using Python 2 semantics in TorchScript? That doesn't seem like the best choice.
Yeah, I wasn't sure about this, but it seems very useful at the moment since it saves me from writing unnecessary for loops in TorchScript.
An alternative I thought about was list comprehensions, which I think would also be super useful in general; I went for prim::range because it's quicker to implement. Possibly we could leave this feature undocumented and remove it once list comprehensions are ready?
I'm fine with adding a special primitive, just don't call it range. If we use Python names we should match Python semantics.
Yeah, that makes sense, I will rename it in a follow-up PR. :) Thanks!
```python
           train: bool):
    mask = torch.empty_like(input)
    mask.bernoulli_(1 - p)
    res = mask * input / (1.0 - p)
```
Why are we replacing the forward code with this? It's not necessary, and avoiding it might open up more optimization opportunities and be less bug-prone.
When I wrote this I had two options:
- changing https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/Dropout.cpp#L54 to return the mask as well
- writing a simple dropout implementation in TorchScript.
I went for the latter mainly because it's straightforward to do in a few lines.
I think the design question here is: do we want to keep an AD formula as close as possible to its ATen counterpart (for easier future deduplication and consistency), or do we want to keep AD formulas math-only and simple? In some cases our ATen functions consider backends/special sizes etc., and I'm not certain whether we would want to include those in AD formulas. Let me know your thoughts on this; I would love to have some guidelines on it. Thanks!
Fair enough, I forgot that you need the mask! Let's keep it like that.
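For readers following the thread, the math the replacement forward expresses, and why the mask has to be produced in the forward pass, is roughly the following. This is a standalone ATen/C++ sketch for illustration, not the TorchScript code in the PR:

```cpp
#include <ATen/ATen.h>
#include <utility>

// Inverted dropout forward: sample a keep-mask with probability (1 - p)
// and rescale so the expected value of the output matches the input.
std::pair<at::Tensor, at::Tensor> dropout_forward(
    const at::Tensor& input, double p) {
  at::Tensor mask = at::empty_like(input);
  mask.bernoulli_(1 - p);
  at::Tensor output = input * mask / (1.0 - p);
  return {output, mask};  // the mask must be kept around for backward
}

// Backward: once the mask is fixed, dropout is linear in the input, so
// the gradient is simply the incoming gradient masked and rescaled.
at::Tensor dropout_backward(
    const at::Tensor& grad_output, const at::Tensor& mask, double p) {
  return grad_output * mask / (1.0 - p);
}
```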
```diff
  auto element = getBoolItem(list->elements(), idx);
- push(stack, std::move(element));
+ push(stack, element);
```
Why did we lose the move here? Seems like a regression?
Hmm, IIRC this was done by the clang-tidy hook. It was suggesting that moving a bool doesn't make a perf difference, and I thought that made sense. I can bring it back if this might cause a regression.
Ohhh, I hadn't noticed it's a bool. Yeah, in that case it shouldn't be there.
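For context on the clang-tidy suggestion: moving a trivially-copyable type such as bool is just a copy, so the std::move adds nothing. A tiny standalone illustration (not the interpreter code):

```cpp
#include <utility>
#include <vector>

int main() {
  std::vector<bool> stack;
  bool element = true;

  // For a trivially-copyable type like bool, std::move does not buy
  // anything: both calls end up copying the value, which is why the
  // clang-tidy hook suggested dropping the std::move.
  stack.push_back(std::move(element));
  stack.push_back(element);
  return 0;
}
```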
@zdevito FWIW adding convolution might still be nice because we could wrap the whole CNN in a single DifferentiableGraph and save on constructing autograd graphs (assuming our …)