KEMBAR78
Ensure NativeFunctions.h codegen output is deterministic by bdhirsh · Pull Request #58889 · pytorch/pytorch · GitHub
Skip to content

Conversation

@bdhirsh
Copy link
Contributor

@bdhirsh bdhirsh commented May 25, 2021

fixes #58796

Planning on re-testing locally tomorrow morning to confirm, but this change should fix the non-determinism in the codegen output that was causing ccache not to re-use its cached output.

I built from the commit referenced in #58796 a few times and ran diff -Naur on the codegen output in build/aten/src/ATen. After a few tries, NativeFunctions.h had a few diffs. The diffs were all related to the ordering of functional/inplace/out variants of a NativeFunctionGroup, which looked non-deterministic.

That looks like it's coming from my calling set() to filter out duplicate NativeFunction declarations. The earlier version of the codegen also called set() to filter out duplicates, but it did so individually for each NativeFunction object, before merging the groups (I'm not too sure why this didn't introduce non-determinism before. though). With the refactor from #57361, we're calling set() on the declarations from every operator for a given DispatchKey, which is probably what introduced the nondeterminism.

Stack from ghstack:

Differential Revision: D28675941

@facebook-github-bot
Copy link
Contributor

facebook-github-bot commented May 25, 2021

💊 CI failures summary and remediations

As of commit d02b4c3 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

bdhirsh added a commit that referenced this pull request May 25, 2021
@bdhirsh bdhirsh requested a review from ezyang May 25, 2021 00:38
Copy link
Contributor

@ezyang ezyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the prompt fix!

@ezyang
Copy link
Contributor

ezyang commented May 25, 2021

Old code never iterates on the set: https://github.com/pytorch/pytorch/pull/57361/files#diff-7657db4c86fbd16b4a22f6cc49fa1614025a2fd5e5c82c5548a83335c82e7c45L48 but you do in the code you fix here (via list).

We need to figure out some sort of lint rules to detect this sort of nondeterminism.

@bdhirsh
Copy link
Contributor Author

bdhirsh commented May 25, 2021

@bdhirsh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@bdhirsh
Copy link
Contributor Author

bdhirsh commented May 25, 2021

We need to figure out some sort of lint rules to detect this sort of nondeterminism.

A lint rule seems difficult, since there are plenty of places in the codegen where we use stuff like set() in a context that doesn't directly impact the codegen output.

It might not hurt to try specifying PYTHONHASHSEED in our codegen runs? https://stackoverflow.com/a/32529871. It looks like the nondeterminism is because python by default randomizes its hash seed on each invocation.

As a more heavy-handed test we could also add a unit test that runs the codegen multiple times and diffs the output. But that would probably be way too slow of a test, since a single codegen run takes ~5-10 seconds (and it's not guaranteed to detect nondeterminism).

@ezyang
Copy link
Contributor

ezyang commented May 25, 2021

in 1:1 we agreed PYTHONHASHSEED is a good idea but needs a little executing as it needs to be applied as an envvar, which means that we need to apply the envvar from cmake, or make a little stub sh file that sets the env var and calls into the Python. This fix is fine for now.

@facebook-github-bot
Copy link
Contributor

@bdhirsh merged this pull request in 32273e8.

@facebook-github-bot facebook-github-bot deleted the gh/bdhirsh/122/head branch May 29, 2021 14:17
deniskokarev pushed a commit to deniskokarev/pytorch that referenced this pull request Jun 9, 2021
Summary:
Pull Request resolved: pytorch#58889

fixes pytorch#58796

Planning on re-testing locally tomorrow morning to confirm, but this change should fix the non-determinism in the codegen output that was causing `ccache` not to re-use its cached output.

I built from the commit referenced in pytorch#58796 a few times and ran `diff -Naur` on the codegen output in `build/aten/src/ATen`. After a few tries, `NativeFunctions.h` had a few diffs. The diffs were all related to the ordering of functional/inplace/out variants of a NativeFunctionGroup, which looked non-deterministic.

That looks like it's coming from my calling `set()` to filter out duplicate NativeFunction declarations. The earlier version of the codegen also called `set()` to filter out duplicates, but it did so individually for each `NativeFunction` object, before merging the groups (I'm not too sure why this didn't introduce non-determinism before. though). With the refactor from pytorch#57361, we're calling `set()` on the declarations from every operator for a given DispatchKey, which is probably what introduced the nondeterminism.

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D28675941

Pulled By: bdhirsh

fbshipit-source-id: bb66de00aafeeb9720d85e8156ac9f7539aed0d6
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants