Closed
Description
Recently I profiled one of my models with nvprof and was surprised to find that the softmax layers for an attention mechanism were the most expensive entries in the cuda_total_time column.
My original code looked something like this:
self.softmax = nn.Softmax(dim=-1)
# ...
# attn is a 3D tensor (batch_size x length x length). Length is in the 30-200 range
attn = self.softmax(attn)
After some experimentation, I tried adding some transposes to the code:
self.softmax = nn.Softmax(dim=1)
attn = self.softmax(attn.transpose(1, 2)).transpose(1, 2)
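The two formulations are mathematically equivalent; a quick sanity check along these lines (illustrative shapes, not my actual code) confirms they produce the same output:

import torch
import torch.nn as nn

# Quick sanity check (illustrative, not my real code): the transposed
# formulation computes exactly the same result as softmax over the last dim.
attn = torch.randn(4, 50, 50)  # batch_size x length x length
out_last = nn.Softmax(dim=-1)(attn)
out_transposed = nn.Softmax(dim=1)(attn.transpose(1, 2)).transpose(1, 2)
print(torch.allclose(out_last, out_transposed, atol=1e-6))  # prints True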
This change increased my overall model speed by around 10%, and now matrix multiplication is at the top of the nvprof output (which is what I would expect).
Could this be considered a performance bug? I wonder if there is some way for the softmax CUDA code to achieve comparable speed regardless of the softmax dimension.
(Sorry I don't have sample code at the moment; my actual code is deeply embedded in the current project I'm doing.)
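Here is a rough sketch of the kind of standalone benchmark I have in mind (the sizes, device handling, and timing helper below are placeholders for illustration, not my actual code):

import torch
import torch.nn as nn

# Illustrative benchmark sketch (not my real code): time softmax over the last
# dim vs. softmax over dim=1 with surrounding transposes on a
# batch_size x length x length attention tensor.
def bench(fn, attn, iters=100):
    # Warm-up so kernel launches and caching don't distort the measurement.
    for _ in range(10):
        fn(attn)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(attn)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call

batch_size, length = 64, 128  # illustrative sizes within the 30-200 length range
attn = torch.randn(batch_size, length, length).cuda()

softmax_last = nn.Softmax(dim=-1)
softmax_dim1 = nn.Softmax(dim=1)

t_last = bench(lambda x: softmax_last(x), attn)
t_transposed = bench(lambda x: softmax_dim1(x.transpose(1, 2)).transpose(1, 2), attn)

print("softmax over dim=-1:               %.4f ms" % t_last)
print("softmax over dim=1 with transpose: %.4f ms" % t_transposed)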
System info
- OS: Linux
- PyTorch version: 0.3.0
- How you installed PyTorch (conda, pip, source): conda
- Python version: 3.6
- CUDA/cuDNN version: CUDA 9
- GPU models and configuration: K80