Bool tensor. Part 0: Boolean storage implementation by izdeby · Pull Request #16810 · pytorch/pytorch

Conversation

@izdeby (Contributor) commented Feb 6, 2019

This is the first commit in a series of planned changes to add boolean tensors to PyTorch. The overall plan looks like this:

  1. Storage Implementation (this change)
  2. Tensor Creation.
  3. Tensor Conversions.
  4. Tensor Indexing.
  5. Tensor Operations.
  6. Backward-compatibility-related changes.

This feature was requested by the community:
#4764
#4219
#4288

Change:
Added boolean type to the Storage class for CPU and CUDA backends.

Tested via:

  1. unit tests
  2. running this:

    >>> import torch
    >>> torch.BoolStorage
    <class 'torch.BoolStorage'>
    >>> torch.cuda.BoolStorage
    <class 'torch.cuda.BoolStorage'>
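
A slightly fuller smoke test is sketched below (a sketch of the era's API; element_size() is the standard Storage method and should report one byte per element, matching the NumPy discussion in the comments that follow):

    >>> import torch
    >>> s = torch.BoolStorage(4)       # uninitialized 4-element bool storage
    >>> type(s)
    <class 'torch.BoolStorage'>
    >>> s.element_size()               # one byte per bool element
    1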

@fmassa (Member) commented Feb 6, 2019

Question: do we want to consider the bool tensor as holding 1-bit data? E.g., backing it with uint8_t and packing 8 elements into each byte?

@albanD (Collaborator) commented Feb 6, 2019

From a quick look, why didn't THCGenerateBoolType.h make it into THGenerateAllTypes.h? Are there cases where you want to generate all the other types but not bool?

@izdeby (Contributor, Author) commented Feb 6, 2019

> From a quick look, why didn't THCGenerateBoolType.h make it into THGenerateAllTypes.h? Are there cases where you want to generate all the other types but not bool?

This was done to minimize the amount of changes needed. If THCGenerateBoolType.h/THGenerateBoolType.h were included in THCGenerateAllTypes.h/THGenerateAllTypes.h, this change would be significantly bigger and harder to review.
As an example, take a look at aten/src/TH/THTensor.h. If I had included the bool type in *GenerateAllTypes, I would have had to implement all the Math, Random, and Convolution functionality. That will be done, but later.

izdeby changed the title from "Bool tensor" to "[WIP] Bool tensor" on Feb 6, 2019
@izdeby (Contributor, Author) commented Feb 7, 2019

> Question: do we want to consider the bool tensor as holding 1-bit data? E.g., backing it with uint8_t and packing 8 elements into each byte?

@fmassa, it's 1 byte per element, which is the same as NumPy.
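
For reference, a quick check of the matching layouts (a sketch; the torch line assumes the bool tensor API from the later parts of this series):

    >>> import numpy as np
    >>> np.dtype(np.bool_).itemsize            # NumPy bools are one byte each
    1
    >>> import torch
    >>> torch.tensor([True, False]).element_size()
    1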

izdeby changed the title from "[WIP] Bool tensor" to "Bool tensor. Part 0: Boolean storage implementation" on Feb 7, 2019
izdeby changed the title from "Bool tensor. Part 0: Boolean storage implementation" to "[Not ready] Bool tensor. Part 0: Boolean storage implementation" on Feb 7, 2019
@apaszke (Contributor) commented Feb 7, 2019

I understand that it might be simpler to use uint8_t to represent a single element for now, but if we wanted to change the memory representation of those tensors in the future, it would be a breaking change (people might have kernels that depend on it), so it would be good to get it right sooner rather than later.

@albanD (Collaborator) commented Feb 7, 2019

@apaszke, but wouldn't that mean we could not do from_numpy on numpy.bool arrays into this type without a copy? Which is what at least one of the issues is about, no?
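
A sketch of the zero-copy path at stake here (assumes the bool tensor support from the later parts of this series; from_numpy aliases the NumPy buffer rather than copying it):

    >>> import numpy as np
    >>> import torch
    >>> a = np.array([True, False, True])
    >>> t = torch.from_numpy(a)    # shares memory with a, no copy
    >>> a[0] = False               # mutate the NumPy side...
    >>> t[0]                       # ...and the tensor sees the change
    tensor(False)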

@fmassa (Member) commented Feb 7, 2019

A couple more thoughts:

  • If we actually want binary masks to be part of autograd (e.g., in order to solve problems like "Log and indexing do not commute correctly with respect to gradient" (#12986)), or for the mask in batched tensors, it might be worth considering a 1-bit representation for the memory savings.
  • Having some kind of support for numeric representations smaller than 1 byte could be interesting, as it could potentially be a way of implementing low-precision numbers by packing 2/4/8 elements into a single byte.

But I do agree that supporting 1-bit (or 2-, 3-, or 4-bit) elements would be a much bigger endeavor. In particular, it would also mean changing all the support we currently have in TensorIterator and the like, which would not be easy.
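
To make the memory argument concrete, a sketch of packing 8 booleans into a single byte (plain NumPy; not a PyTorch feature):

    >>> import numpy as np
    >>> mask = [True, False, True, True, False, False, True, False]
    >>> np.packbits(mask)              # 8 elements -> one byte
    array([178], dtype=uint8)
    >>> format(178, "08b")             # MSB-first: bit j holds element j
    '10110010'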

@gchanan (Contributor) commented Feb 7, 2019

I buy @albanD's argument; there are use cases for 1-byte bools (even if we could magically also have 1-bit bools), and given our NumPy affinity, it seems like the right choice to call 1-byte bools torch.bool.

There's nothing precluding us from coming back later and supporting 1-bit bools via a packed representation or something similar (e.g., 1-bit quantization is a thing) in a way that wouldn't break backwards compatibility.

@apaszke (Contributor) commented Feb 7, 2019

@albanD, yeah, in that case from_numpy won't work on those tensors (it doesn't work today either, and I don't think it's a terrible loss). I really think the memory savings are quite nice, and otherwise we're just duplicating torch.ByteTensor with no clear distinction.

@fmassa (Member) commented Feb 7, 2019

There is an advantage though, @apaszke: we can have clear semantics for indexing with a bool tensor. Right now, indexing with a byte tensor containing [0, 1, 2] works the same as if it were [0, 1, 1], which can be confusing.
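
A sketch of the ambiguity (byte-mask indexing treated every nonzero entry as True; newer PyTorch versions deprecate byte masks in favor of torch.bool, so the explicit != 0 form is used here):

    >>> import torch
    >>> x = torch.tensor([10, 20, 30])
    >>> mask = torch.tensor([0, 1, 2], dtype=torch.uint8)
    >>> x[mask != 0]       # [0, 1, 2] selects the same elements as [0, 1, 1]
    tensor([20, 30])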

@albanD (Collaborator) commented Feb 8, 2019

If the goal is just to be compatible with NumPy bools and to allow for better advanced indexing, do we actually need to generate code for a completely new scalar type (which is what this PR looks like it's doing)? Could it be a small wrapper around ByteTensor, with custom functions for the NumPy interface and for use by indexing?

@gchanan (Contributor) commented Feb 8, 2019

@albanD I don't really see how you'd make that work nicely at the C++ level.

@albanD (Collaborator) commented Feb 8, 2019

I'm not familiar enough with the way ATen works, so that was just a suggestion in case it was possible.
Couldn't one implementation inherit from another?

@gchanan (Contributor) commented Feb 8, 2019

@albanD, well, C++ is a little more complicated :). I.e., you'd have to make sure everything you need is virtual, which would slow down the mainline path. It seems very likely to be a leaky abstraction.

@albanD (Collaborator) commented Feb 8, 2019

Ok, good to know!
Larger binaries it is, then!

izdeby changed the title from "[Not ready] Bool tensor. Part 0: Boolean storage implementation" to "Bool tensor. Part 0: Boolean storage implementation" on Feb 8, 2019
@gchanan (Contributor) commented Feb 8, 2019

@albanD I think the right thing to do is to define only minimal mathematical operators, so the code-size difference shouldn't be too big; people can .to(torch.uint8) if they really want to do math.
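
The escape hatch gchanan describes would look like this (a sketch, assuming the bool tensor API from the later parts of this series):

    >>> import torch
    >>> mask = torch.tensor([True, False, True])
    >>> mask.to(torch.uint8) + 1       # arithmetic via explicit conversion
    tensor([2, 1, 2], dtype=torch.uint8)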

@gchanan (Contributor) commented Feb 8, 2019

Looks like you may have a clang-tidy issue.

izdeby changed the title from "Bool tensor. Part 0: Boolean storage implementation" to "[Not ready] Bool tensor. Part 0: Boolean storage implementation" on Feb 9, 2019
izdeby changed the title from "[Not ready] Bool tensor. Part 0: Boolean storage implementation" to "Bool tensor. Part 0: Boolean storage implementation" on Feb 11, 2019
Review comment on the diff:

    };

    template <>
    struct CopyOp <bool, bool> {

Contributor: I wouldn't fully specialize this unless you rewrite the above to have only one template parameter. You can just partially specialize this.

@gchanan commented Feb 12, 2019: The reason being that the above won't actually work if someone were to instantiate <non_bool, bool>.

Review comment on the diff:

    PyInt_Check(object) ? PyInt_AsLong(object) : \
    (throw std::runtime_error("Could not parse real"), 0))

    #define THPUtils_unpackReal_BOOL(object) \

Contributor: You don't need to define a version for py2 vs. py3; bools are the same across versions. Just put this below, outside of the #if/#endif.

Review comment on the diff:

    #if defined(TH_REAL_IS_BYTE) || defined(TH_REAL_IS_CHAR)
    memcpy(THWStorage_(data)(storage), src + offset, count);
    #elif defined(TH_REAL_IS_BOOL)
    // Because of ASAN checks, we have to manually move bytes instead of

Contributor: You aren't moving bytes though, you are changing the values. How about including the ASAN check that tripped this?

@gchanan (Contributor) left a review:

Changes look good, nice work!

I think the test failures will go away if you merge in master.

@facebook-github-bot (Contributor) left a comment:

@izdeby has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor) left a comment:

@izdeby is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Summary: (duplicates the PR description above)

Pull Request resolved: pytorch#16810

Differential Revision: D14087246

fbshipit-source-id: 12f22b897c33defddad1c967e4cf5ae764b85d13
zdevito pushed a commit to zdevito/ATen that referenced this pull request Feb 19, 2019
Summary: (duplicates the PR description above)

Pull Request resolved: pytorch/pytorch#16810

Reviewed By: gchanan

Differential Revision: D14087246

Pulled By: izdeby

fbshipit-source-id: 042642ced1cb0fd1bb6bff05f9ca871a5c54ee5e
ezyang added the merged label on Jun 25, 2019