at::native batch norm kernel launch config update by jjsjann123 · Pull Request #17047 · pytorch/pytorch · GitHub

Conversation

@jjsjann123
Collaborator

limit block dimension to avoid configuration error on batch norm kernel launch

This should resolve #16998
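For context, the failure this addresses (issue #16998) surfaces at the Python level when batch norm runs over a very large batch on CUDA and the kernel launch fails with a configuration error. A minimal sketch of that kind of workload, with a placeholder batch size rather than the exact sizes from the issue:

```python
import torch
import torch.nn as nn

# Hedged repro sketch: the batch size here is a placeholder standing in for
# "very large batch", not the exact size from issue #16998. Before this change,
# workloads shaped like this could fail at the CUDA batch norm kernel launch;
# with the block dimension capped, the launch stays within device limits.
if torch.cuda.is_available():
    bn = nn.BatchNorm2d(1).cuda().float()
    data = torch.rand(1_000_000, 1, 1, 1, device="cuda", dtype=torch.float32)
    bn(data).sum().backward()
```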

@ezyang
Contributor

ezyang commented Feb 13, 2019

Failure is real

Feb 13 09:00:38 ======================================================================
Feb 13 09:00:38 ERROR: test_batchnorm_large_batch (__main__.TestNN)
Feb 13 09:00:38 ----------------------------------------------------------------------
Feb 13 09:00:38 Traceback (most recent call last):
Feb 13 09:00:38   File "test_nn.py", line 5469, in test_batchnorm_large_batch
Feb 13 09:00:38     out = bn(data).sum().backward()
Feb 13 09:00:38   File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
Feb 13 09:00:38     result = self.forward(*input, **kwargs)
Feb 13 09:00:38   File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 76, in forward
Feb 13 09:00:38     exponential_average_factor, self.eps)
Feb 13 09:00:38   File "/opt/conda/lib/python3.6/site-packages/torch/nn/functional.py", line 1667, in batch_norm
Feb 13 09:00:38     training, momentum, eps, torch.backends.cudnn.enabled
Feb 13 09:00:38 RuntimeError: expected scalar type Float but found Double
Feb 13 09:00:38 
Feb 13 09:00:38 ----------------------------------------------------------------------

@jjsjann123
Collaborator Author

Of course, I only updated the separate test inside the container and never pushed that back to the main repo...
Let me update float to float32.

@jjsjann123
Collaborator Author

Hmmm, it's a surprise that test_nn defaults the layer to double...
I was missing a type conversion for the layer rather than the input. Fixed in the last commit.
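A sketch of what that test-side fix looks like, assuming the test harness defaults tensors to double: the layer itself, not only the input, has to be converted so parameters and data agree on float32 (shapes below are illustrative):

```python
import torch
import torch.nn as nn

# Hedged sketch of the fix being described: if the harness defaults to double,
# building the layer without an explicit dtype leaves its parameters in double,
# and passing float32 data then raises
# "expected scalar type Float but found Double". Converting the layer as well
# as the input keeps everything in float32.
torch.set_default_dtype(torch.double)    # stand-in for the test_nn default
bn = nn.BatchNorm2d(3).float()           # convert the layer's parameters to float32
data = torch.rand(8, 3, 4, 4, dtype=torch.float32)
bn(data).sum().backward()                # dtypes now agree, so this runs
torch.set_default_dtype(torch.float)     # restore the usual default
```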

@jjsjann123
Collaborator Author

@pytorchbot retest this please

@jjsjann123
Collaborator Author

Bumped the master merge just to rerun the failed ci/circleci tests.

Contributor

@facebook-github-bot left a comment


@soumith is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Feb 20, 2019
Summary:
limit block dimension to avoid configuration error on batch norm kernel launch

This should resolve #16998
Pull Request resolved: pytorch/pytorch#17047

Differential Revision: D14142132

Pulled By: soumith

fbshipit-source-id: 9c8c52dcd1d108cda1f65f5227e625b8fe6e12a0

Successfully merging this pull request may close these issues.

Batchnormalization fails with CUDA on very large batches
