Closed
Description
Recently I profiled one of my models with nvprof and was surprised to find that the softmax layers for an attention mechanism were the most expensive entries in the cuda_total_time column.
My original code looked something like this:
self.softmax = nn.Softmax(dim=-1)
# ...
# attn is a 3D tensor (batch_size x length x length). Length is in the 30-200 range
attn = self.softmax(attn)
After some experimentation, I tried adding some transposes to the code:
self.softmax = nn.Softmax(dim=1)
attn = self.softmax(attn.transpose(1, 2)).transpose(1, 2)
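The two formulations are mathematically equivalent; a quick sanity check along these lines (illustrative shapes, not my actual code) confirms they produce the same output:

import torch
import torch.nn as nn

# Quick sanity check (illustrative, not my real code): the transposed
# formulation computes exactly the same result as softmax over the last dim.
attn = torch.randn(4, 50, 50)  # batch_size x length x length
out_last = nn.Softmax(dim=-1)(attn)
out_transposed = nn.Softmax(dim=1)(attn.transpose(1, 2)).transpose(1, 2)
print(torch.allclose(out_last, out_transposed, atol=1e-6))  # prints True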
This change increased my overall model speed by around 10%, and now matrix multiplication is at the top of the nvprof output (which is what I would expect).
Could this be considered a performance bug? I wonder if there is some way for the softmax CUDA code to achieve comparable speed regardless of the softmax dimension.
(Sorry I don't have sample code at the moment; my actual code is deeply embedded in the current project I'm doing.)
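Here is a rough sketch of the kind of standalone benchmark I have in mind (the sizes, device handling, and timing helper below are placeholders for illustration, not my actual code):

import torch
import torch.nn as nn

# Illustrative benchmark sketch (not my real code): time softmax over the last
# dim vs. softmax over dim=1 with surrounding transposes on a
# batch_size x length x length attention tensor.
def bench(fn, attn, iters=100):
    # Warm-up so kernel launches and caching don't distort the measurement.
    for _ in range(10):
        fn(attn)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(attn)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call

batch_size, length = 64, 128  # illustrative sizes within the 30-200 length range
attn = torch.randn(batch_size, length, length).cuda()

softmax_last = nn.Softmax(dim=-1)
softmax_dim1 = nn.Softmax(dim=1)

t_last = bench(lambda x: softmax_last(x), attn)
t_transposed = bench(lambda x: softmax_dim1(x.transpose(1, 2)).transpose(1, 2), attn)

print("softmax over dim=-1:               %.4f ms" % t_last)
print("softmax over dim=1 with transpose: %.4f ms" % t_transposed)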
System info
- OS: Linux
- PyTorch version: 0.3.0
- How you installed PyTorch (conda, pip, source): conda
- Python version: 3.6
- CUDA/cuDNN version: CUDA 9
- GPU models and configuration: K80