int32 indexing for Tensor Iterator Reduction by jjsjann123 · Pull Request #17428 · pytorch/pytorch · GitHub

Conversation

@jjsjann123
Collaborator

Summary:

  1. Enabling int32 indexing for cases where TI cannot accumulate in the output due to
     incompatible data types (e.g. Welford).
  2. Updating the Welford kernel to use int32 instead of int64 indexing on the GPU.

This change improves performance for torch.var / torch.std.
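For reference, the calls that hit this path are dimension reductions on CUDA float tensors; a minimal usage example (standard torch API, nothing specific to this PR):

import torch

# Reductions of this form exercise the Welford reduction that this PR touches.
x = torch.randn(1024, 512, device='cuda', dtype=torch.float)
var = torch.var(x, 0)  # per-column variance
std = torch.std(x, 0)  # per-column standard deviation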

Implementation:

  1. Allocated an extra buffer to handle accumulation between sub Tensor Iterators.
  2. Removed int64 indexing in gpu_reduce_kernel.
  3. WelfordOps now takes the index type / combination type as template parameters.
     The GPU uses int32_t and float, while the CPU implementation uses int64_t and double
     (a sketch of the Welford combine step follows below).
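To make the index/combination split concrete, here is a minimal Python sketch of the Welford "combine" step those template parameters govern: the count corresponds to the index type (int32_t on GPU, int64_t on CPU) and the running mean / M2 to the combination type (float on GPU, double on CPU). This is illustrative only, not the actual WelfordOps implementation:

def welford_combine(a, b):
    # Merge two partial Welford accumulators (count, mean, m2).
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b  # the count is what int32 indexing bounds on the GPU
    if n == 0:
        return (0, 0.0, 0.0)
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return (n, mean, m2)

# After combining all partials: sample variance = m2 / (n - 1), std = variance ** 0.5.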

@jjsjann123
Collaborator Author

Ping @ngimel @umanwizard @colesbury for visibility.
Benchmark script and perf numbers will follow shortly.

@jjsjann123
Collaborator Author

The test failure seems to be a machine error.

@jjsjann123
Collaborator Author

[image: benchmark table of torch.std bandwidth]

Benchmark torch.std
Red shows regression and green is speedup.

Benchmark script:

import time

import torch

nrep = 100


def bench(size, fn):
    # Average time of nrep reductions along dim 0 of a (size/512, 512) CUDA float tensor.
    x = torch.ones([size], device='cuda', dtype=torch.float).view(-1, 512)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(nrep):
        fn(x, 0)
    torch.cuda.synchronize()
    end = time.time()
    return (end - start) / nrep


stdbws = []
sizes = []
size = 102400
while size < 1024000000:
    elapsed = bench(size, torch.std)
    bw = size * 4 / elapsed * 1e-9  # effective read bandwidth in GB/s (4 bytes per float)
    stdbws.append(bw)
    sizes.append(size)
    size *= 2
for size, stdbw in zip(sizes, stdbws):
    print(size, stdbw)
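The red/green coloring in the table compares bandwidth before and after the change; a hypothetical post-processing step along those lines (the bandwidth numbers below are placeholders, not measured results) could be:

sizes = [102400, 204800, 409600]
baseline_bws = [250.0, 300.0, 340.0]  # GB/s from a baseline build (placeholder values)
int32_bws = [260.0, 330.0, 400.0]     # GB/s with int32 indexing (placeholder values)

for size, before, after in zip(sizes, baseline_bws, int32_bws):
    ratio = after / before
    label = "speedup" if ratio >= 1.0 else "regression"
    print("%d: %.2fx (%s)" % (size, ratio, label))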

@jjsjann123
Collaborator Author

[image: table of relative speedup over int32 indexing with the follow-up loop-unrolling PR]

I have another PR that resolves the dependent loop unrolling in the current Tensor Iterator. It gives a further speedup (the table above shows speedup relative to int32 indexing).
Perf fluctuates between runs (especially at smaller sizes), but I do see noticeably better perf and proper SASS instructions.

The other PR depends on this one; I'll open it once this one is merged.

Contributor

@facebook-github-bot left a comment

@umanwizard has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

updating fraction reduction function;
change shared_ptr to unique_ptr for recursive gpu_reduce_kernel call;
Contributor

@facebook-github-bot left a comment

@umanwizard has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@umanwizard
Contributor

Hi @jjsjann123, could you address @apaszke's comments, and also rebase to see if the failing tests pass then?

Contributor

@facebook-github-bot left a comment

@umanwizard has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@jjsjann123
Collaborator Author

Failure looks like an infra issue. How do I trigger the CircleCI tests?

zdevito pushed a commit to zdevito/ATen that referenced this pull request Mar 4, 2019
Pull Request resolved: pytorch/pytorch#17428

Differential Revision: D14264608

Pulled By: umanwizard

fbshipit-source-id: 3eb54451de925b469dbc1127e5ea7443c4431036
