int32 indexing for Tensor Iterator Reduction by jjsjann123 · Pull Request #17428 · pytorch/pytorch · GitHub

Conversation

@jjsjann123
Collaborator

Summary:

  1. Enabling int32 indexing for cases where TI cannot accumulate in the output due to
     incompatible data types (e.g. Welford).
  2. Updating the Welford kernel to use int32 instead of int64 indexing on the GPU.

This change improves performance for torch.var / torch.std.
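For reference, the calls that hit this path are dimension reductions on CUDA float tensors; a minimal usage example (standard torch API, nothing specific to this PR):

import torch

# Reductions of this form exercise the Welford reduction that this PR touches.
x = torch.randn(1024, 512, device='cuda', dtype=torch.float)
var = torch.var(x, 0)  # per-column variance
std = torch.std(x, 0)  # per-column standard deviation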

Implementation:

  1. Allocated an extra buffer to handle accumulation between sub Tensor Iterators.
  2. Removed int64 indexing in gpu_reduce_kernel.
  3. WelfordOps now takes the index type / combination type as template parameters.
     The GPU uses int32_t and float, while the CPU implementation uses int64_t and double
     (a sketch of the Welford combine step follows below).
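To make the index/combination split concrete, here is a minimal Python sketch of the Welford "combine" step those template parameters govern: the count corresponds to the index type (int32_t on GPU, int64_t on CPU) and the running mean / M2 to the combination type (float on GPU, double on CPU). This is illustrative only, not the actual WelfordOps implementation:

def welford_combine(a, b):
    # Merge two partial Welford accumulators (count, mean, m2).
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b  # the count is what int32 indexing bounds on the GPU
    if n == 0:
        return (0, 0.0, 0.0)
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return (n, mean, m2)

# After combining all partials: sample variance = m2 / (n - 1), std = variance ** 0.5.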

@jjsjann123
Collaborator Author

Ping @ngimel @umanwizard @colesbury for visibility.
Benchmark script and perf numbers will follow shortly.

@jjsjann123
Collaborator Author

The test failure seems to be a machine error.

@jjsjann123
Collaborator Author

[image: benchmark table of torch.std bandwidth]

Benchmark torch.std
Red shows regression and green is speedup.

Benchmark script:

import time

import torch

nrep = 100


def bench(size, fn):
    # Average time of nrep reductions along dim 0 of a (size/512, 512) CUDA float tensor.
    x = torch.ones([size], device='cuda', dtype=torch.float).view(-1, 512)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(nrep):
        fn(x, 0)
    torch.cuda.synchronize()
    end = time.time()
    return (end - start) / nrep


stdbws = []
sizes = []
size = 102400
while size < 1024000000:
    elapsed = bench(size, torch.std)
    bw = size * 4 / elapsed * 1e-9  # effective read bandwidth in GB/s (4 bytes per float)
    stdbws.append(bw)
    sizes.append(size)
    size *= 2
for size, stdbw in zip(sizes, stdbws):
    print(size, stdbw)
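The red/green coloring in the table compares bandwidth before and after the change; a hypothetical post-processing step along those lines (the bandwidth numbers below are placeholders, not measured results) could be:

sizes = [102400, 204800, 409600]
baseline_bws = [250.0, 300.0, 340.0]  # GB/s from a baseline build (placeholder values)
int32_bws = [260.0, 330.0, 400.0]     # GB/s with int32 indexing (placeholder values)

for size, before, after in zip(sizes, baseline_bws, int32_bws):
    ratio = after / before
    label = "speedup" if ratio >= 1.0 else "regression"
    print("%d: %.2fx (%s)" % (size, ratio, label))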

@jjsjann123
Collaborator Author

[image: table of relative speedup over int32 indexing with the follow-up loop-unrolling PR]

I have another PR that resolves the dependent loop unrolling in the current Tensor Iterator. It gives a further speedup (the table above shows speedup relative to int32 indexing).
Perf fluctuates between runs (especially at smaller sizes), but I do see noticeably better perf and proper SASS instructions.

The other PR depends on this one; I'll open it once this one is merged.

Contributor

@facebook-github-bot left a comment

@umanwizard has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

updating fraction reduction function;
change shared_ptr to unique_ptr for recursive gpu_reduce_kernel call;
Contributor

@facebook-github-bot left a comment

@umanwizard has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@umanwizard
Contributor

Hi @jjsjann123, could you address @apaszke's comments, and also rebase to see if the failing tests pass then?

Contributor

@facebook-github-bot left a comment

@umanwizard has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@jjsjann123
Collaborator Author

Failure looks like an infra issue. How do I trigger the CircleCI tests?

zdevito pushed a commit to zdevito/ATen that referenced this pull request Mar 4, 2019
Pull Request resolved: pytorch/pytorch#17428

Differential Revision: D14264608

Pulled By: umanwizard

fbshipit-source-id: 3eb54451de925b469dbc1127e5ea7443c4431036
