Enables previously "slow" `gradgrad` checks on CUDA #57802

krshrimali · 2021-05-07T06:38:45Z

Earlier, a few CUDA `gradgrad` checks (see the list of ops below) were disabled because they were too slow. There have been improvements since (see #57508 for reference), and this PR aims to:

1. Measure the time taken by `gradgrad` checks on CUDA for the ops listed below.
2. Re-enable the tests if the times are reasonable.

Ops considered: addbmm, baddbmm, bmm, cholesky, symeig, inverse, linalg.cholesky, linalg.cholesky_ex, linalg.eigh, linalg.qr, lu, qr, solve, triangular_solve, linalg.pinv, svd, linalg.svd, pinverse, linalg.householder_product, linalg.solve.

For timing numbers from a separate CI run, see #57802 (comment).
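The timing utilities used in this PR are not shown in the description. As a rough sketch of how a per-check wall-clock measurement could be taken, the helper below (`time_call` is a hypothetical name, not part of PyTorch's test suite) times any callable and keeps the best of a few runs:

```python
import time


def time_call(fn, *args, repeats=3, **kwargs):
    """Call fn(*args, **kwargs) `repeats` times.

    Returns (result_of_last_call, best_wall_clock_seconds).
    `time_call` and `repeats` are illustrative names, not PyTorch APIs.
    """
    if repeats < 1:
        raise ValueError("repeats must be >= 1")
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        # Keep the fastest run to reduce noise from warm-up effects.
        best = min(best, time.perf_counter() - start)
    return result, best
```

With PyTorch available, one could pass `torch.autograd.gradgradcheck` as `fn`, together with an op such as `torch.bmm` and double-precision inputs created with `requires_grad=True`. For CUDA work, calling `torch.cuda.synchronize()` before reading the clock is a common precaution.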

cc: @mruberry @albanD @pmeier

facebook-github-bot · 2021-05-07T06:38:51Z

💊 CI failures summary and remediations

As of commit 22dd008 (more details on the Dr. CI page):

1/1 failures possibly* introduced in this PR
- 1/1 non-scanned failure(s)

ci.pytorch.org: 1 failed

Failed: pr/pytorch-linux-bionic-rocm4.2-py3.6

codecov · 2021-05-07T23:52:01Z

Codecov Report

Merging #57802 (d58e146) into master (a427820) will increase coverage by 0.37%.
The diff coverage is n/a.

❗ Current head d58e146 differs from pull request most recent head 22dd008. Consider uploading reports for the commit 22dd008 to get more accurate results

@@            Coverage Diff             @@
##           master   #57802      +/-   ##
==========================================
+ Coverage   76.44%   76.81%   +0.37%     
==========================================
  Files        2022     1980      -42     
  Lines      202303   196861    -5442     
==========================================
- Hits       154652   151228    -3424     
+ Misses      47651    45633    -2018

krshrimali · 2021-05-08T13:46:37Z

Closing this: I just realized that to run all CI jobs, I need to create a branch on upstream with the ci-all/ prefix. (The issue will now be addressed in #57895.)

mruberry · 2021-05-08T23:22:01Z

@krshrimali Why do you want to run all the CI jobs?

krshrimali · 2021-05-10T04:52:50Z

Update: Results from profiling only CUDA gradgrad checks which were earlier disabled. The list of such ops:

addbmm, baddbmm, bmm, cholesky, symeig, inverse, linalg.cholesky, linalg.cholesky_ex, linalg.eigh, linalg.qr, lu, qr, solve, triangular_solve, linalg.pinv, svd, linalg.svd, pinverse, householder_product, linalg.solve

Below is the table with profiling results (times are in seconds):

| Function Name | Time Taken (`complex128`) | Time Taken (`float64`) |
| --- | --- | --- |
| `addbmm` | 1.416 | 0.379 |
| `baddbmm` | 1.354 | 0.369 |
| `bmm` | 0.167 | 0.132 |
| `cholesky` | 0.449 | 0.227 |
| `symeig` | 0.790 | 0.391 |
| `inverse` | 0.335 | 0.185 |
| `linalg.cholesky` | 0.511 | 0.250 |
| `linalg.cholesky_ex` | 0.576 | 0.280 |
| `linalg.eigh` | 0.522 | 0.271 |
| `linalg.qr` | 2.086 | 0.767 |
| `lu` | 1.546 | 0.665 |
| `qr` | 2.077 | 0.786 |
| `solve` | 5.226 | 2.127 |
| `triangular_solve` | 1.056 | 0.462 |
| `linalg.pinv` | 1.167 | 0.336 |
| `svd` | 3.619 | 1.004 |
| `linalg.svd` | 3.667 | 1.018 |
| `pinverse` | 1.297 | 0.336 |
| `householder_product` | 1.841 | 0.794 |
| `linalg.solve` | 6.571 | 2.838 |

cc: @mruberry
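To make the table above easier to reason about, here is a small sketch that loads the reported numbers and finds the slowest op per dtype, plus any ops above a cutoff. The helper names and the cutoff values used in the usage note are illustrative assumptions, not part of the PR.

```python
# Times in seconds, copied from the table above: op -> (complex128, float64).
TIMES = {
    "addbmm": (1.416, 0.379),
    "baddbmm": (1.354, 0.369),
    "bmm": (0.167, 0.132),
    "cholesky": (0.449, 0.227),
    "symeig": (0.790, 0.391),
    "inverse": (0.335, 0.185),
    "linalg.cholesky": (0.511, 0.250),
    "linalg.cholesky_ex": (0.576, 0.280),
    "linalg.eigh": (0.522, 0.271),
    "linalg.qr": (2.086, 0.767),
    "lu": (1.546, 0.665),
    "qr": (2.077, 0.786),
    "solve": (5.226, 2.127),
    "triangular_solve": (1.056, 0.462),
    "linalg.pinv": (1.167, 0.336),
    "svd": (3.619, 1.004),
    "linalg.svd": (3.667, 1.018),
    "pinverse": (1.297, 0.336),
    "householder_product": (1.841, 0.794),
    "linalg.solve": (6.571, 2.838),
}


def slowest(times, column):
    """Return (op, seconds) for the largest entry in a column (0 or 1)."""
    op = max(times, key=lambda name: times[name][column])
    return op, times[op][column]


def over_threshold(times, threshold_s):
    """Ops whose time in either dtype exceeds threshold_s, sorted by name."""
    return sorted(op for op, pair in times.items() if max(pair) > threshold_s)
```

For example, `slowest(TIMES, 0)` picks out `linalg.solve` for `complex128`, and `over_threshold(TIMES, 5.0)` lists only `linalg.solve` and `solve`; with a 10-second cutoff the list is empty.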

mruberry · 2021-05-11T05:43:52Z

Thanks for the updated numbers @krshrimali; I think these times are acceptable, cc @albanD and @soulitzer in case they find them egregious

This is going to conflict with a PR in @albanD's forward-mode AD stack. @albanD, do you want to remove the skip additions for slow grad/gradgrad in that PR, too? If so, I think this PR can remove the current slow skips.

krshrimali · 2021-05-20T07:14:04Z

Hi, @mruberry, @albanD - gentle ping. Should we disable the gradgrad checks for all the ops tested in this PR? Or has this been taken in a separate PR already? Thanks!

mruberry · 2021-05-20T07:34:05Z

> Hi, @mruberry, @albanD - gentle ping. Should we disable the gradgrad checks for all the ops tested in this PR? Or has this been taken in a separate PR already? Thanks!

Thanks for the ping, @krshrimali! Let me add this to our review on Friday so we can let @albanD review, too. We should have an answer for you then.

krshrimali · 2021-05-26T04:52:46Z

> > Hi, @mruberry, @albanD - gentle ping. Should we disable the gradgrad checks for all the ops tested in this PR? Or has this been taken in a separate PR already? Thanks!
>
> Thanks for the ping, @krshrimali! Let me add this to our review on Friday so we can let @albanD review, too. We should have an answer for you then.

Gentle ping @mruberry! Has this been taken up separately?

albanD · 2021-05-26T14:25:27Z

Hey!
We did discuss it last week but did not update here, sorry.
There were two things:

- All these tests are fast enough that we can indeed re-enable them.
- It would be nice to get a rule for when a test is slow in this context, for example anything >5 or 10s, and then time all the tests and apply that rule.
  We could automatically enforce that rule by failing if a user adds a test that runs longer than the given threshold without marking it as slow.
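A minimal sketch of how such a rule might be enforced is shown below. This is only an illustration of the idea: `mark_slow`, `enforce_time_budget`, and the 5-second default are hypothetical and do not correspond to PyTorch's actual `slowTest` machinery.

```python
import functools
import time

SLOW_THRESHOLD_S = 5.0  # assumed cutoff; the thread suggests ">5 or 10s"


def mark_slow(fn):
    """Hypothetical marker: flag a test as known-slow."""
    fn._is_slow = True
    return fn


def enforce_time_budget(threshold_s=SLOW_THRESHOLD_S):
    """Fail a test that runs longer than threshold_s unless it is marked slow."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            elapsed = time.perf_counter() - start
            # Only unmarked tests are held to the budget.
            if elapsed > threshold_s and not getattr(fn, "_is_slow", False):
                raise AssertionError(
                    f"{fn.__name__} took {elapsed:.3f}s (> {threshold_s}s) "
                    "but is not marked as slow"
                )
            return result

        return wrapper

    return decorator
```

In this sketch the marker must be applied before the budget check, that is, `@enforce_time_budget()` goes above `@mark_slow` on the test function. A real implementation would also need to decide how to handle timing noise across CI machines, which is part of the additional design work.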

mruberry · 2021-05-30T04:18:18Z

Just wanted to ping on this. The action here (following @albanD's write-up) is to:

- merge this PR with master
- remove the [DO NOT MERGE] text
- let's land this!

Sound good, @krshrimali? Thanks for helping us with this analysis. Separately, I owe @albanD a more general issue for how we handle slow tests.

krshrimali · 2021-05-30T04:32:26Z

> Just wanted to ping on this. The action here (following @albanD's write-up) is to:
>
> - merge this PR with master
> - remove the [DO NOT MERGE] text
> - let's land this!
>
> Sound good, @krshrimali? Thanks for helping us with this analysis. Separately, I owe @albanD a more general issue for how we handle slow tests.

Thanks, @albanD for the write-up. @mruberry - Thank you for the ping, just confirming, we want to remove all the tests from slowTest for now, right? In the scope of this PR, I'm guessing that we don't want to add any rule to test (if it's slow, as mentioned by Alban) when a test is added, right? We can probably take this up in another PR.

Please let me know if this sounds good, and I'll make the changes!

mruberry · 2021-05-30T04:34:33Z

> > Just wanted to ping on this. The action here (following @albanD's write-up) is to:
> >
> > - merge this PR with master
> > - remove the [DO NOT MERGE] text
> > - let's land this!
> >
> > Sound good, @krshrimali? Thanks for helping us with this analysis. Separately, I owe @albanD a more general issue for how we handle slow tests.
>
> Thanks, @albanD for the write-up. @mruberry - Thank you for the ping, just confirming, we want to remove all the tests from slowTest for now, right?

All the tests that this PR timed, yes.

> In the scope of this PR, I'm guessing that we don't want to add any rule to test (if it's slow, as mentioned by Alban) when a test is added, right? We can probably take this up in another PR.

Correct; that's a separate issue that requires additional design.

> Please let me know if this sounds good, and I'll make the changes!

Sounds great! Let me know if you have any questions

krshrimali · 2021-05-30T04:36:13Z

> > > Just wanted to ping on this. The action here (following @albanD's write-up) is to:
> > >
> > > - merge this PR with master
> > > - remove the [DO NOT MERGE] text
> > > - let's land this!
> > >
> > > Sound good, @krshrimali? Thanks for helping us with this analysis. Separately, I owe @albanD a more general issue for how we handle slow tests.
> >
> > Thanks, @albanD for the write-up. @mruberry - Thank you for the ping, just confirming, we want to remove all the tests from slowTest for now, right?
>
> All the tests that this PR timed, yes.
>
> > In the scope of this PR, I'm guessing that we don't want to add any rule to test (if it's slow, as mentioned by Alban) when a test is added, right? We can probably take this up in another PR.
>
> Correct; that's a separate issue that requires additional design.
>
> > Please let me know if this sounds good, and I'll make the changes!
>
> Sounds great! Let me know if you have any questions

Thanks, @mruberry! On it. :)

mruberry

Nice investigation, @krshrimali, and cool that these skips are now gone.

cc @soulitzer

facebook-github-bot · 2021-05-31T01:20:35Z

@mruberry has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2021-05-31T05:18:03Z

@mruberry merged this pull request in 6d45d7a.

Summary:
Fixes pytorch#57508

Earlier, a few CUDA `gradgrad` checks (see the list of ops below) were disabled because of them being too slow. There have been improvements (see pytorch#57508 for reference) and this PR aimed on:

1. Time taken by `gradgrad` checks on CUDA for the ops listed below.
2. Enabling the tests again if the times sound reasonable

Ops considered: `addbmm, baddbmm, bmm, cholesky, symeig, inverse, linalg.cholesky, linalg.cholesky_ex, linalg.eigh, linalg.qr, lu, qr, solve, triangular_solve, linalg.pinv, svd, linalg.svd, pinverse, linalg.householder_product, linalg.solve`.

For numbers (on time taken) on a separate CI run: pytorch#57802 (comment).

cc: mruberry albanD pmeier

Pull Request resolved: pytorch#57802

Reviewed By: ngimel

Differential Revision: D28784106

Pulled By: mruberry

fbshipit-source-id: 9b15238319f143c59f83d500e831d66d98542ff8

Summary:
**Description:** `SpectralFuncInfo` defines decorator mentioning: "gradgrad is quite slow". This PR re-analyses that statement since things have changed with gradient tests.

**Test times:** #60435 (comment)

**Follow-up** of #57802

cc: mruberry

Pull Request resolved: #60435

Reviewed By: gchanan

Differential Revision: D29707444

Pulled By: mruberry

fbshipit-source-id: 444b4863bac8556c7e8fcc8ff58d81a91bd96a21

krshrimali added 9 commits May 6, 2021 15:11

Utils for recording time of ALL slow tests

58460b0

Minor edits in decorators to record time

ed0f273

Record time for all tests skipped

153c24b

Minor edits

13b78e8

merge

890032e

Recorded time on local machine

ca30054

disable temp_enable, use slowTest decorator

8a2d582

remove temp_enable, not needed anymore

962408e

Removing text file from the PR

fbfab92

facebook-github-bot added the cla signed label May 7, 2021

krshrimali added 3 commits May 7, 2021 06:41

Minor cleanup, remove useless code added before

4306ca2

remove slowTest import and brackets

f4eab0e

Merge branch 'master' into origin/record-time

3d0afed

pytorchbot added the open source label May 7, 2021

flake8 minor corrections

681fc0d

krshrimali changed the title ~~[DO NOT MERGE] Recording time taken for slowTest enabled tests (CUDA grad/gradgrad checks)~~ [DO NOT MERGE] [WIP] Recording time taken for slowTest enabled tests (CUDA grad/gradgrad checks) May 7, 2021

krshrimali added 8 commits May 7, 2021 16:43

Profile tests which are disabled for gradgradchecks

22b56d5

Merge commit

ef31dc8

Fixing editor issues

8d1326e

Just enable all gradgrad tests

d3324fb

Rollback, remove time tracking, and set gradgradchecks to False again

f1c5796

Remove brackets, just rolling back

297f8f3

Rolling back once again, my bad

84870c8

whitespace removed, just common stuff

e114dde

krshrimali closed this May 8, 2021

krshrimali reopened this May 9, 2021

krshrimali changed the title ~~[DO NOT MERGE] [WIP] Recording time taken for slowTest enabled tests (CUDA grad/gradgrad checks)~~ [DO NOT MERGE] Recording time taken for slowTest enabled tests (CUDA grad/gradgrad checks) May 10, 2021

krshrimali changed the title ~~[DO NOT MERGE] Recording time taken for slowTest enabled tests (CUDA grad/gradgrad checks)~~ [DO NOT MERGE] Recording time taken for slowTest enabled tests (CUDA gradgrad checks) May 10, 2021

krshrimali requested a review from mruberry May 10, 2021 08:35

mruberry added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label May 12, 2021

mruberry requested a review from albanD May 12, 2021 03:48

krshrimali added 4 commits May 30, 2021 13:51

Enable gradgrad checks (on CUDA) for listed ops

d3ff89d

Remove comments for triangular_solve as well

873eaeb

Merge branch 'master' into origin/record-time

438273c

Minor typo, fixed

22dd008

krshrimali changed the title ~~[DO NOT MERGE] Recording time taken for slowTest enabled tests (CUDA gradgrad checks)~~ Enabling gradgrad checks on CUDA (earlier disabled for being too slow) May 30, 2021

mruberry changed the title ~~Enabling gradgrad checks on CUDA (earlier disabled for being too slow)~~ Enables previously "slow" gradgrad checks on CUDA May 31, 2021

mruberry approved these changes May 31, 2021

facebook-github-bot closed this in 6d45d7a May 31, 2021

facebook-github-bot added the Merged label May 31, 2021

krshrimali mentioned this pull request Jun 22, 2021

Analysing time taken by gradgrad checks for Spectral Functions #60435

Closed
