OpInfo: `zero_` by krshrimali · Pull Request #58731 · pytorch/pytorch · GitHub

Conversation

@krshrimali (Contributor)

See #54261

@facebook-github-bot (Contributor)

facebook-github-bot commented May 21, 2021

💊 CI failures summary and remediations

As of commit 79781e3 (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 1/2 non-scanned failure(s)

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build docker-pytorch-linux-xenial-py3.6-clang7 (1/1)

Step: "Check if image should be built" (full log | diagnosis details | 🔁 rerun)

ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch
+ docker manifest inspect 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3.6-clang7:2152ba6309b12251994fcdcd4ff234bfbe646d06
unsupported manifest format: &{{{2 application/vnd.docker.distribution.manifest.v2+json} {application/vnd.docker.container.image.v1+json 25631 sha256:98cde2da1075205bab984ae16bb71248bdf53d3406c8e91346cd830b28d32a78 [] map[] <nil>} [{application/vnd.docker.image.rootfs.diff.tar.gzip 46254451 sha256:92473f7ef45574f608989888a6cfc8187d3a1425e3a63f974434acab03fed068 [] map[] <nil>} {application/vnd.docker.image.rootfs.diff.tar.gzip 851 sha256:fb52bde70123ac7f3a1b88fee95e74f4bdcdbd81917a91a35b56a52ec7671947 [] map[] <nil>} {application/vnd.docker.image.rootfs.diff.tar.gzip 527 sha256:64788f86be3fd71809b5de602deff9445f3de18d2f44a49d0a053dfc9a2008ae [] map[] <nil>} {application/vnd.docker.image.rootfs.diff.tar.gzip 169 sha256:33f6d5f2e001ababe3ddac4731d9c33121e1148ef32a87a83a5b470cb401abef [] map[] <nil>} {application/vnd.docker.image.rootfs.diff.tar.gzip 1390 sha256:ed9900d32ae781b33daf435f0133a89f512e19232f9cfc61bbfd99168591262d [] map[] <nil>} {application/vnd.docker.image.rootfs.diff.tar.gzip 241532322 sha256:7813b469b93fa781c42122f271b265b0080cf399716afae3eeca88f62fe6caf0 [] map[] <nil>} {application/vnd.docker.image.rootfs.diff.tar.gzip 809 sha256:a6b44d42c4c4855b0c54aec5ca5aa8a1e3f78fa76d124bb6287b4ca20b1279f1 [] map[] <nil>} {application/vnd.docker.image.rootfs.diff.tar.gzip 64903309 sha256:bac60e667e389be9028f2aca24f78def91dd821475cd027d55288a25bc9e4fbf [] map[] <nil>} {application/vnd.docker.image.rootfs.diff.tar.gzip 408 sha256:d66ba4a66f6a91a00e123f5b683aae0417a14355fff5054da54ec0e3961bc0ee [] map[] <nil>} {application/vnd.docker.image.rootfs.diff.tar.gzip 32 sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1 [] map[] <nil>} {application/vnd.docker.image.rootfs.diff.tar.gzip 107 sha256:8c5de0402787795ea46c6246b0e1dd2eef28e5ddfb7d0ac164a02b2abdf3c0ca [] map[] <nil>} {application/vnd.docker.image.rootfs.diff.tar.gzip 361 sha256:6069f28cdeff49b9650af7fb5a957841984c5f9de13d773d8e1f30a0f9de58ec [] map[] <nil>} {application/vnd.docker.image.rootfs.diff.tar.gzip 1154 sha256:17909504ffa51d49dae642ea2460e5d8fc588203d42ff72ca4d6be58ae403d66 [] map[] <nil>} {application/vnd.docker.image.rootfs.diff.tar.gzip 407 sha256:38ed1a4ccda8be1f22832cf05543049781c2b4b82f5857d5d2c4726d04f2d248 [] map[] <nil>} {application/vnd.docker.image.rootfs.diff.tar.gzip 106 sha256:9b52e58a368e65b9ea16d5324d16ecd0582b42fed0706e153a6631c6332b200c [] map[] <nil>} {application/vnd.docker.image.rootfs.diff.tar.gzip 1965 sha256:94fce42716a874c973a30dd96754ddab6f656631cf4581a2bf0ee7687d1d7d9d [] map[] <nil>} {application/vnd.docker.image.rootfs.diff.tar.gzip 1245973803 sha256:0bbcf9b41635af953e95d2682508175fcbb34dbb61c3c2f5e5f3e0cfa335c6d1 [] map[] <nil>} {application/vnd.docker.image.rootfs.diff.tar.gzip 440 sha256:e38ce17947a8e5c33aa3af336ea07d57e391fc2eb837470aeab6b27a041bc7b1 [] map[] <nil>} {application/vnd.docker.image.rootfs.diff.tar.gzip 102 sha256:cb18224d181d07322ce87b6877e179e51767b3735d9c3c4eb3acbfcf8fef3b2e [] map[] <nil>} {application/vnd.docker.image.rootfs.diff.tar.gzip 231 sha256:cd8b5f60cf974e6b6fcb4af6af4ae2f1dd014077378a7a2b5d104140f15f6510 [] map[] <nil>} {application/vnd.docker.image.rootfs.diff.tar.gzip 987375 sha256:f7e5cc4ba07fc190e24f7f8e5b0775bf72c870d1d990ad2625a14db1c348b0ce [] map[] <nil>} {application/vnd.docker.image.rootfs.diff.tar.gzip 914 sha256:436c832c4c47bef7b5fc231f07ca9a89e836f3b1ab959a6da2761b8b8714b28e [] map[] <nil>} {application/vnd.docker.image.rootfs.diff.tar.gzip 2007661 sha256:5a73a911fe436975bb865209a6411000e8856025e8c7e6adb1be9f380139360b [] map[] <nil>} 
{application/vnd.docker.image.rootfs.diff.tar.gzip 109 sha256:5945f0ebca6c82de0749ca0f8716860bb3bce7f5a489ea8c2c962cfc2e85bed3 [] map[] <nil>} {application/vnd.docker.image.rootfs.diff.tar.gzip 875 sha256:d007f582ecf863cd0f528c2fd225bbfa385e138e7e6b6e1d9251dc4b043159d6 [] map[] <nil>} {application/vnd.docker.image.rootfs.diff.tar.gzip 988062 sha256:159337f3e54858592f2eb89882be8d34476617408569ead4536cc660f56bf9f9 [] map[] <nil>} {application/vnd.docker.image.rootfs.diff.tar.gzip 103 sha256:eedbb2fa6a0a759935d61e00ac9f39ffd2f77f06cf26dc72b1f1378e166dc6e8 [] map[] <nil>} {application/vnd.docker.image.rootfs.diff.tar.gzip 861 sha256:239b2bf0e12bc4725688abe5fa531965d4bd43e58f1efa50f6e29debd4fb2c32 [] map[] <nil>} {application/vnd.docker.image.rootfs.diff.tar.gzip 96973035 sha256:1be86537bd9316f33b62b3fc5ec5808ce1d672ce779a411aa0cb127b3b472bee [] map[] <nil>} {application/vnd.docker.image.rootfs.diff.tar.gzip 107 sha256:1d858cc47c7f5c551a73cadea9a519e64c6fef6f843628d6203fd69745686d9d [] map[] <nil>} {application/vnd.docker.image.rootfs.diff.tar.gzip 1292 sha256:67d965b9db7bea40827815509462732ae1ead923fb6f3542b99aa5b80edfc5cd [] map[] <nil>} {application/vnd.docker.image.rootfs.diff.tar.gzip 151 sha256:1b35e2ff20fb17f39b8866b45ec335ebcb612007114fdc77588eaf7918f57786 [] map[] <nil>} {application/vnd.docker.image.rootfs.diff.tar.gzip 642 sha256:d21740c6aafa264b6aa3a9cfb514e7ef2e39f3a93e8d9105af0e54082324b547 [] map[] <nil>} {applica
++ git merge-base HEAD aaccdc39965ade4b61b7852329739e777e244c25
+ git rev-parse aaccdc39965ade4b61b7852329739e777e244c25:.circleci/docker
2152ba6309b12251994fcdcd4ff234bfbe646d06
+++ git merge-base HEAD aaccdc39965ade4b61b7852329739e777e244c25
++ git rev-parse aaccdc39965ade4b61b7852329739e777e244c25:.circleci/docker
+ PREVIOUS_DOCKER_TAG=2152ba6309b12251994fcdcd4ff234bfbe646d06
+ [[ 2152ba6309b12251994fcdcd4ff234bfbe646d06 = \2\1\5\2\b\a\6\3\0\9\b\1\2\2\5\1\9\9\4\f\c\d\c\d\4\f\f\2\3\4\b\f\b\e\6\4\6\d\0\6 ]]
+ echo 'ERROR: Something has gone wrong and the previous image isn'\''t available for the merge-base of your branch'
ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch
+ echo '       contact the PyTorch team to restore the original images'
       contact the PyTorch team to restore the original images
+ exit 1


Exited with code exit status 1


ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI.

@kshitij12345 (Collaborator) left a comment

Thanks @krshrimali. I've asked a few questions.

       sample_inputs_func=sample_inputs_xlogy),
OpInfo('zero_',
       dtypes=all_types_and_complex_and(torch.bfloat16, torch.half, torch.bool),
       supports_autograd=False,
Collaborator:

This should support autograd; zero_ has an entry in derivatives.yaml:

- name: zero_(Tensor(a!) self) -> Tensor(a!)

Contributor Author:

This is the error I got for almost all the tests:

RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.

Do you have any idea what the solution to this is? @kshitij12345
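
For context, a minimal repro of that error (shapes illustrative): in-place ops are rejected on leaf tensors that require grad, but are allowed on a non-leaf clone.

import torch

x = torch.randn(3, requires_grad=True)  # a leaf tensor
try:
    x.zero_()  # in-place op directly on a leaf that requires grad
except RuntimeError as e:
    # "a leaf Variable that requires grad is being used in an in-place operation."
    print(e)

# Cloning first produces a non-leaf tensor, so the in-place op is allowed.
y = x.clone().zero_()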

Collaborator:

Do you still get an error after wrapping the operator in a lambda?

@kshitij12345 (Collaborator), May 21, 2021:

I think the tests always assume that the operator's function variant doesn't modify the input in place (which was true for every operator until this one).

The eager consistency test has some mechanism to deal with this:

cloned = clone_input_helper(sample.input) if variant in inplace_ops else sample.input

Need to dig a bit further.
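
For illustration, a simplified sketch of what the eager variant-consistency check does with that line (clone_input_helper is paraphrased here as a plain clone; names follow test_ops.py but the code is illustrative):

import torch

def check_variant_consistency(op, variants, inplace_ops, sample):
    expected = op(sample.input)
    for variant in variants:
        # An in-place variant would clobber sample.input, so test it on a clone.
        cloned = sample.input.clone() if variant in inplace_ops else sample.input
        actual = variant(cloned)
        assert torch.equal(actual, expected)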

Contributor Author:

Re: #58731 (comment)

Yes, still getting the same errors after wrapping the operator in lambda.

@kshitij12345 (Collaborator), May 21, 2021:

Inspired by

cloned = clone_input_helper(sample.input) if variant in inplace_ops else sample.input

I tried:

OpInfo('zero_',
       op=lambda x: torch.zero_(x.clone()),
       dtypes=all_types_and_complex_and(torch.bfloat16, torch.half, torch.bool),
       supports_autograd=True,
       supports_out=False,
       sample_inputs_func=sample_inputs_zero_),

which works for all tests except jit and eager consistency.
The jit consistency test has issues with the op being wrapped in a lambda.

The eager test fails for more subtle reasons.

method_variant = getattr(torch.Tensor, name, None)
# attributes like real, imag are not callable
self.method_variant = method_variant if callable(method_variant) else None
inplace_name = name + "_"
self.inplace_variant = getattr(torch.Tensor, inplace_name, None)

We get the method variant and inplace variant in the above way.

For this case, the method variant is Tensor.zero_ and the inplace variant is None.
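
Concretely, following the lookup scheme above:

import torch

name = 'zero_'
method_variant = getattr(torch.Tensor, name, None)         # Tensor.zero_, which is actually in-place
inplace_variant = getattr(torch.Tensor, name + '_', None)  # 'zero__' doesn't exist, so None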

Thus in the eager tests, at the point below (test/test_ops.py, lines 321 to 323 at d88d321):

cloned = clone_input_helper(sample.input) if variant in inplace_ops else sample.input
if variant in inplace_ops and sample.broadcasts_input:

We incorrectly don't clone the inputs for the method variant (which is actually the inplace variant), and hence we get:

RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.

To mitigate this, I propose allowing method_variant and inplace_variant to be set via the OpInfo constructor.
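
A rough sketch of what that could look like (the sentinel and fallback logic here are illustrative, not the final implementation):

import torch

_NOTHING = object()  # sentinel meaning "argument not passed"

class OpInfo:
    def __init__(self, name, *, method_variant=_NOTHING, inplace_variant=_NOTHING, **kwargs):
        self.name = name
        # Fall back to attribute lookup only when the caller didn't override the variant.
        if method_variant is _NOTHING:
            method_variant = getattr(torch.Tensor, name, None)
        # Attributes like real, imag are not callable.
        self.method_variant = method_variant if callable(method_variant) else None
        if inplace_variant is _NOTHING:
            inplace_variant = getattr(torch.Tensor, name + '_', None)
        self.inplace_variant = inplace_variant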

cc: @mruberry

@krshrimali, can you take a look at how this is handled by method_tests? Thanks!

Contributor Author:

Thanks, @kshitij12345, for digging into this. Looking at test/test_autograd.py to understand how autograd is tested for method_tests, it seems requires_grad is reset to False for inplace ops, and there is a comment noting that this needs to be fixed. Please see the relevant snippet below.

is_magic_method = name[:2] == '__' and name[-2:] == '__'
is_inplace = name[-1] == "_" and not is_magic_method
self_variable = create_input((self_size,), dtype=dtype, device=device)[0][0]
# FixMe: run grad checks on inplace self
if is_inplace:
    self_variable.requires_grad = False

Reference: https://github.com/pytorch/pytorch/blob/master/test/test_autograd.py#L5474

@krshrimali (Contributor Author), May 21, 2021:

For the scope of this PR, I suggest skipping the jit and eager consistency checks; we can file an issue (on allowing method_variant and inplace_variant to be set via the OpInfo constructor) separately and take the discussion there.

Looking forward to hearing your views, @mruberry and @kshitij12345.
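
A sketch of what those skips could look like in the OpInfo entry (test-class and test names follow test_ops.py at the time, but treat the snippet as illustrative):

OpInfo('zero_',
       dtypes=all_types_and_complex_and(torch.bfloat16, torch.half, torch.bool),
       supports_autograd=False,
       sample_inputs_func=sample_inputs_zero_,
       skips=(
           # The in-place semantics confuse the variant-consistency tests; see discussion above.
           SkipInfo('TestCommon', 'test_variant_consistency_eager'),
           SkipInfo('TestCommon', 'test_variant_consistency_jit'),
       )),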

Collaborator:

Skipping those tests and following up by allowing the method and inplace variants to be specified sounds good.

cases_scalar = ((),)
cases_non_scalar = ((S, S, S),
                    (S, S),
                    (S, ))
Collaborator:

I think it is okay merge these two. It won't be hard to read.

@krshrimali krshrimali requested a review from mruberry May 21, 2021 11:55

@H-Huang H-Huang added the triaged ("This issue has been looked at by a team member, and triaged and prioritized into an appropriate module") label May 21, 2021
def sample_inputs_zero_(op_info, device, dtype, requires_grad, **kwargs):
    make_arg = partial(make_tensor, device=device, dtype=dtype, requires_grad=requires_grad)

    cases = ((), (S, S, S), (S, S), (S,))
Collaborator:

Nit: this probably doesn't need the (S, S) case

@mruberry (Collaborator) left a comment

Overall this looks good, but per the discussion some follow-up is needed.

For now, actually, I'd keep "supports_autograd=False" and add a comment linking to an issue explaining that this isn't correct, but that our tests don't correctly deal with this case.

Then I'd file another issue for setting the method variant and inplace variants for operators directly, AND we should support ops that don't have function variants but only have method and/or inplace variants, too. Solving this latter issue should let us address the first issue.

How does that sound @krshrimali, @kshitij12345?

@kshitij12345 (Collaborator)

> For now, actually, I'd keep "supports_autograd=False" and add a comment linking to an issue explaining that this isn't correct, but that our tests don't correctly deal with this case.

I think that is fine, since even method_tests don't do the grad tests.

> Then I'd file another issue for setting the method variant and inplace variants for operators directly, AND we should support ops that don't have function variants but only have method and/or inplace variants, too. Solving this latter issue should let us address the first issue.

On this: it actually has a function variant, torch.zero_, which is weird and very rarely used. Should it be deprecated? It has very few actual usages across the code base, and they should probably be replaced with tensor.zero_().

[Screenshots: code-search results showing the handful of torch.zero_ usages]

@mruberry what do you think?
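
For reference, both spellings do the same thing, zeroing the tensor in place:

import torch

t = torch.ones(2, 2)
torch.zero_(t)      # the rarely-used function variant
assert t.eq(0).all()

t2 = torch.ones(2, 2)
t2.zero_()          # the idiomatic method form
assert t2.eq(0).all()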

@mruberry (Collaborator)

> For now, actually, I'd keep "supports_autograd=False" and add a comment linking to an issue explaining that this isn't correct, but that our tests don't correctly deal with this case.
>
> I think that is fine, since even method_tests don't do the grad tests.
>
> Then I'd file another issue for setting the method variant and inplace variants for operators directly, AND we should support ops that don't have function variants but only have method and/or inplace variants, too. Solving this latter issue should let us address the first issue.
>
> On this: it actually has a function variant, torch.zero_, which is weird and very rarely used. Should it be deprecated? It has very few actual usages across the code base, and they should probably be replaced with tensor.zero_().
>
> [Screenshots: code-search results showing the handful of torch.zero_ usages]
>
> @mruberry what do you think?

Unfortunately, we're inconsistent today in how the in-place versions of operators are exposed. Your question is a good one, but this is one discrepancy we probably don't want to worry about at the moment, as it depends on some future decisions about how we represent operators.

@krshrimali (Contributor Author)

Updates:

Adding method_variant and inplace_variant args to the OpInfo class is being done in #59138. I think this PR can wait for a few days until that PR is merged. The jit issue will still be there, though, and an issue will be raised for it soon. Thanks for the patience!

@kshitij12345 (Collaborator) left a comment

LGTM! Thanks!

OpInfo('zero_',
       op=lambda x: torch.zero_(x.clone()),
       method_variant=None,
       inplace_variant=torch.Tensor.zero_,
Collaborator:

Nice use of the new architecture

@mruberry (Collaborator) left a comment

Nice work, @krshrimali!

@facebook-github-bot (Contributor)

@mruberry has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor)

@mruberry merged this pull request in ef40757.

deniskokarev pushed a commit to deniskokarev/pytorch that referenced this pull request Jun 9, 2021
Summary:
See pytorch#54261

Pull Request resolved: pytorch#58731

Reviewed By: ngimel

Differential Revision: D28784083

Pulled By: mruberry

fbshipit-source-id: f06de8045afd3728b1fedc014c091d8fd1955a9f

Labels: cla signed, Merged, open source, triaged