KEMBAR78
Migrate crossKernel from THC to ATen (CUDA) by peterbell10 · Pull Request #60039 · pytorch/pytorch · GitHub
Skip to content

Conversation

@peterbell10
Copy link
Collaborator

Ref #24507 (There doesn't seem to be an actual issue for cross)

This also moves the remaining operator functors in THCTensorMathPointwise.cuh to SparseCUDATensorMath.cu which is the only file using them.

@peterbell10 peterbell10 added module: porting Issues related to porting TH/THNN legacy to ATen native open source labels Jun 15, 2021
@peterbell10 peterbell10 requested a review from ngimel June 15, 2021 21:07
@facebook-github-bot
Copy link
Contributor

facebook-github-bot commented Jun 15, 2021

💊 CI failures summary and remediations

As of commit 697c739 (more details on the Dr. CI page and at hud.pytorch.org/pr/60039):


  • 3/3 failures introduced in this PR

🕵️ 3 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_build (1/3)

Step: "(Optional) Merge target branch" (full log | diagnosis details | 🔁 rerun)

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .circleci/scripts/windows_cuda_install.sh
Auto-merging .circleci/scripts/windows_cuda_install.sh
CONFLICT (add/add): Merge conflict in .circleci/scripts/setup_ci_environment.sh
Auto-merging .circleci/scripts/setup_ci_environment.sh
CONFLICT (add/add): Merge conflict in .circleci/docker/build.sh
Auto-merging .circleci/docker/build.sh
CONFLICT (add/add): Merge conflict in .circleci/config.yml
Auto-merging .circleci/config.yml
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/windows_build_definitions.py
Auto-merging .circleci/cimodel/data/windows_build_definitions.py
Automatic merge failed; fix conflicts and then commit the result.


Exited with code exit status 1

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_build (2/3)

Step: "(Optional) Merge target branch" (full log | diagnosis details | 🔁 rerun)

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .circleci/scripts/windows_cuda_install.sh
Auto-merging .circleci/scripts/windows_cuda_install.sh
CONFLICT (add/add): Merge conflict in .circleci/scripts/setup_ci_environment.sh
Auto-merging .circleci/scripts/setup_ci_environment.sh
CONFLICT (add/add): Merge conflict in .circleci/docker/build.sh
Auto-merging .circleci/docker/build.sh
CONFLICT (add/add): Merge conflict in .circleci/config.yml
Auto-merging .circleci/config.yml
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/windows_build_definitions.py
Auto-merging .circleci/cimodel/data/windows_build_definitions.py
Automatic merge failed; fix conflicts and then commit the result.


Exited with code exit status 1

See CircleCI build pytorch_macos_10_13_py3_test (3/3)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

Jun 23 00:20:37 ERROR [0.004s]: test_poisson_sample (__main__.TestDistributions)
Jun 23 00:20:37   File "distributions/test_distributions.py", line 805, in _check_sampler_discrete
Jun 23 00:20:37     chisq, p = scipy.stats.chisquare(counts[msk], pmf[msk] * num_samples)
Jun 23 00:20:37   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/scipy/stats/stats.py", line 6853, in chisquare
Jun 23 00:20:37     lambda_="pearson")
Jun 23 00:20:37   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/scipy/stats/stats.py", line 6694, in power_divergence
Jun 23 00:20:37     raise ValueError(msg)
Jun 23 00:20:37 ValueError: For each axis slice, the sum of the observed frequencies must agree with the sum of the expected frequencies to a relative tolerance of 1e-08, but the percent differences are:
Jun 23 00:20:37 0.008265582255680495
Jun 23 00:20:37 
Jun 23 00:20:37 ======================================================================
Jun 23 00:20:37 ERROR [0.004s]: test_poisson_sample (__main__.TestDistributions)
Jun 23 00:20:37 ----------------------------------------------------------------------
Jun 23 00:20:37 Traceback (most recent call last):
Jun 23 00:20:37   File "distributions/test_distributions.py", line 1333, in test_poisson_sample
Jun 23 00:20:37     failure_rate=1e-3)
Jun 23 00:20:37   File "distributions/test_distributions.py", line 805, in _check_sampler_discrete
Jun 23 00:20:37     chisq, p = scipy.stats.chisquare(counts[msk], pmf[msk] * num_samples)
Jun 23 00:20:37   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/scipy/stats/stats.py", line 6853, in chisquare
Jun 23 00:20:37     lambda_="pearson")
Jun 23 00:20:37   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/scipy/stats/stats.py", line 6694, in power_divergence
Jun 23 00:20:37     raise ValueError(msg)

This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

@peterbell10
Copy link
Collaborator Author

I measure a performance improvement here:

Shape Before After
(100, 100, 100, 3) 99.2 us 93.9 us
(100, 100, 3) 4.91 us 3.91 us

@anjali411 anjali411 added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Jun 16, 2021
int64_t x1stride, int64_t x2stride) {
const auto N = iter.numel();
auto offset_calculator = make_element_offset_calculator<3>(iter);
TORCH_INTERNAL_ASSERT(N > 0 && N <= std::numeric_limits<int32_t>::max());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this check enough? e.g. if you have a tensors with <INT_MAX element, but it's not memory dense, so maximum offset is >INT_MAX (or UINT_MAX)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could assert iter.can_use_32bit_indexing() but that's a more costly check and is guarunteed by cross_impl using with_32bit_indexing anyway.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then with_32bit_indexing should guarantee that N is less than INT_MAX? Or it's here to protect people directly calling launch_cross_kernel, not going through cross_impl? In this case probably should be TORCH_INTERNAL_ASSERT_DEBUG_ONLY (unlikely someone will call launch_cross_kernel directly).

@facebook-github-bot
Copy link
Contributor

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@ngimel merged this pull request in 3f3fd57.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla signed Merged module: porting Issues related to porting TH/THNN legacy to ATen native open source triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants