-
Notifications
You must be signed in to change notification settings - Fork 25.7k
Ensure NativeFunctions.h codegen output is deterministic #58889
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
[ghstack-poisoned]
💊 CI failures summary and remediationsAs of commit d02b4c3 (more details on the Dr. CI page): 💚 💚 Looks good so far! There are no failures yet. 💚 💚 This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.Please report bugs/suggestions to the (internal) Dr. CI Users group. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the prompt fix!
|
Old code never iterates on the set: https://github.com/pytorch/pytorch/pull/57361/files#diff-7657db4c86fbd16b4a22f6cc49fa1614025a2fd5e5c82c5548a83335c82e7c45L48 but you do in the code you fix here (via list). We need to figure out some sort of lint rules to detect this sort of nondeterminism. |
|
@bdhirsh has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
A lint rule seems difficult, since there are plenty of places in the codegen where we use stuff like It might not hurt to try specifying PYTHONHASHSEED in our codegen runs? https://stackoverflow.com/a/32529871. It looks like the nondeterminism is because python by default randomizes its hash seed on each invocation. As a more heavy-handed test we could also add a unit test that runs the codegen multiple times and diffs the output. But that would probably be way too slow of a test, since a single codegen run takes ~5-10 seconds (and it's not guaranteed to detect nondeterminism). |
|
in 1:1 we agreed PYTHONHASHSEED is a good idea but needs a little executing as it needs to be applied as an envvar, which means that we need to apply the envvar from cmake, or make a little stub sh file that sets the env var and calls into the Python. This fix is fine for now. |
Summary: Pull Request resolved: pytorch#58889 fixes pytorch#58796 Planning on re-testing locally tomorrow morning to confirm, but this change should fix the non-determinism in the codegen output that was causing `ccache` not to re-use its cached output. I built from the commit referenced in pytorch#58796 a few times and ran `diff -Naur` on the codegen output in `build/aten/src/ATen`. After a few tries, `NativeFunctions.h` had a few diffs. The diffs were all related to the ordering of functional/inplace/out variants of a NativeFunctionGroup, which looked non-deterministic. That looks like it's coming from my calling `set()` to filter out duplicate NativeFunction declarations. The earlier version of the codegen also called `set()` to filter out duplicates, but it did so individually for each `NativeFunction` object, before merging the groups (I'm not too sure why this didn't introduce non-determinism before. though). With the refactor from pytorch#57361, we're calling `set()` on the declarations from every operator for a given DispatchKey, which is probably what introduced the nondeterminism. Test Plan: Imported from OSS Reviewed By: gchanan Differential Revision: D28675941 Pulled By: bdhirsh fbshipit-source-id: bb66de00aafeeb9720d85e8156ac9f7539aed0d6
fixes #58796
Planning on re-testing locally tomorrow morning to confirm, but this change should fix the non-determinism in the codegen output that was causing
ccachenot to re-use its cached output.I built from the commit referenced in #58796 a few times and ran
diff -Nauron the codegen output inbuild/aten/src/ATen. After a few tries,NativeFunctions.hhad a few diffs. The diffs were all related to the ordering of functional/inplace/out variants of a NativeFunctionGroup, which looked non-deterministic.That looks like it's coming from my calling
set()to filter out duplicate NativeFunction declarations. The earlier version of the codegen also calledset()to filter out duplicates, but it did so individually for eachNativeFunctionobject, before merging the groups (I'm not too sure why this didn't introduce non-determinism before. though). With the refactor from #57361, we're callingset()on the declarations from every operator for a given DispatchKey, which is probably what introduced the nondeterminism.Stack from ghstack:
Differential Revision: D28675941