cc @fmassa, as he introduced these in #9038.
Looking into the initialisation of Linear and Convolution layers, we have the following:
Linear:
pytorch/torch/nn/modules/linear.py
Lines 58 to 63 in 3df79f4
```python
def reset_parameters(self):
    init.kaiming_uniform_(self.weight, a=math.sqrt(5))
    if self.bias is not None:
        fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
        bound = 1 / math.sqrt(fan_in)
        init.uniform_(self.bias, -bound, bound)
```
Convolution:
pytorch/torch/nn/modules/conv.py
Lines 45 to 51 in 3df79f4
```python
def reset_parameters(self):
    n = self.in_channels
    init.kaiming_uniform_(self.weight, a=math.sqrt(5))
    if self.bias is not None:
        fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
        bound = 1 / math.sqrt(fan_in)
        init.uniform_(self.bias, -bound, bound)
```
Notice the a = sqrt(5) passed to kaiming_uniform_ in both cases.
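For reference, here is what fan_in (and therefore the bias bound) works out to for both layer types, using the same private helper quoted above; the concrete layer sizes are just an illustration:

```python
import math
from torch import nn
from torch.nn import init

lin = nn.Linear(100, 10)                 # fan_in = in_features = 100
conv = nn.Conv2d(16, 32, kernel_size=3)  # fan_in = in_channels * 3 * 3 = 144

for name, w in [("linear", lin.weight), ("conv", conv.weight)]:
    fan_in, _ = init._calculate_fan_in_and_fan_out(w)
    # bias is drawn from U(-1/sqrt(fan_in), 1/sqrt(fan_in))
    print(name, fan_in, 1 / math.sqrt(fan_in))
# linear 100 0.1
# conv   144 0.0833...
```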
Kaiming paper
https://arxiv.org/abs/1502.01852
The standard deviation should be sqrt(2 / fan_in)

Using the same principle as the Glorot et al. paper, for a uniform distribution we should use bounds of ±√3 * sqrt(2 / fan_in).
This is what is done in kaiming_uniform_ (torch/nn/init.py):
Lines 288 to 293 in 700271d
```python
fan = _calculate_correct_fan(tensor, mode)
gain = calculate_gain(nonlinearity, a)
std = gain / math.sqrt(fan)
bound = math.sqrt(3.0) * std  # Calculate uniform bounds from standard deviation
with torch.no_grad():
    return tensor.uniform_(-bound, bound)
```
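Plugging numbers into these few lines makes the effect of a concrete (a hand calculation; fan_in = 100 is arbitrary):

```python
import math

fan_in = 100
for a in (0.0, math.sqrt(5)):
    gain = math.sqrt(2.0 / (1 + a ** 2))  # leaky_relu gain, same formula as calculate_gain
    std = gain / math.sqrt(fan_in)
    bound = math.sqrt(3.0) * std
    print(a, round(bound, 4))
# a = 0       -> bound = sqrt(6 / fan_in)  ~ 0.2449
# a = sqrt(5) -> bound = 1 / sqrt(fan_in)  = 0.1
```

So with a = √5 the weight bound collapses to 1/sqrt(fan_in) instead of the sqrt(6/fan_in) that ReLU-style Kaiming would give.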
Diving deeper into the implementation
It seems like the a = √5 is used in calculate_gain (torch/nn/init.py):
Lines 8 to 47 in 700271d
```python
def calculate_gain(nonlinearity, param=None):
    r"""Return the recommended gain value for the given nonlinearity function.
    The values are as follows:

    ================= ====================================================
    nonlinearity      gain
    ================= ====================================================
    Linear / Identity :math:`1`
    Conv{1,2,3}D      :math:`1`
    Sigmoid           :math:`1`
    Tanh              :math:`\frac{5}{3}`
    ReLU              :math:`\sqrt{2}`
    Leaky Relu        :math:`\sqrt{\frac{2}{1 + \text{negative\_slope}^2}}`
    ================= ====================================================

    Args:
        nonlinearity: the non-linear function (`nn.functional` name)
        param: optional parameter for the non-linear function

    Examples:
        >>> gain = nn.init.calculate_gain('leaky_relu')
    """
    linear_fns = ['linear', 'conv1d', 'conv2d', 'conv3d', 'conv_transpose1d', 'conv_transpose2d', 'conv_transpose3d']
    if nonlinearity in linear_fns or nonlinearity == 'sigmoid':
        return 1
    elif nonlinearity == 'tanh':
        return 5.0 / 3
    elif nonlinearity == 'relu':
        return math.sqrt(2.0)
    elif nonlinearity == 'leaky_relu':
        if param is None:
            negative_slope = 0.01
        elif not isinstance(param, bool) and isinstance(param, int) or isinstance(param, float):
            # True/False are instances of int, hence check above
            negative_slope = param
        else:
            raise ValueError("negative_slope {} not a valid number".format(param))
        return math.sqrt(2.0 / (1 + negative_slope ** 2))
    else:
        raise ValueError("Unsupported nonlinearity {}".format(nonlinearity))
```
The a parameter is only used for leaky_relu, which is actually the default nonlinearity if we don't pass one to kaiming_uniform_:
Line 261 in 700271d
```python
def kaiming_uniform_(tensor, a=0, mode='fan_in', nonlinearity='leaky_relu'):
```
Furthermore, this √5 factor conflicts with the recommended gain sqrt(2.0 / (1 + negative_slope ** 2)) in calculate_gain, and I suspect this is unintentional.
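To see the difference in numbers (values approximate):

```python
import math
from torch import nn

print(nn.init.calculate_gain('relu'))                      # sqrt(2)             ~ 1.414
print(nn.init.calculate_gain('leaky_relu'))                # default slope 0.01  ~ 1.414
print(nn.init.calculate_gain('leaky_relu', math.sqrt(5)))  # slope sqrt(5)       ~ 0.577
```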
Docs
Whether the √5 factor is intentional or not, the documentation for the weights is misleading.
Linear
While k = 1/in_features is true for the bias, for the weight it would be k = 6/in_features under pure Kaiming (ReLU gain √2, i.e. a = 0). With the current a = √5 the gain becomes sqrt(2 / (1 + 5)) = sqrt(1/3), so the weight bound works out to sqrt(1/in_features), i.e. k = 1/in_features; this matches the documented value, but only because of the undocumented √5 choice.
Convolution
The same remark applies to the convolution layers.
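A quick check of what a freshly constructed layer actually gets (illustrative; in_features = 256 is arbitrary and the exact max depends on the random draw):

```python
import math
import torch
from torch import nn

torch.manual_seed(0)
fan_in = 256
layer = nn.Linear(fan_in, 128)

print(layer.weight.abs().max().item())  # close to the actual weight bound of the default init
print(1 / math.sqrt(fan_in))            # documented bound sqrt(k), k = 1/in_features -> 0.0625
print(math.sqrt(6 / fan_in))            # bound pure Kaiming (ReLU gain) would give   -> ~0.153
```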
Closing thoughts
Plenty of tutorials use ReLU and not LeakyReLU; having the default initialisation of kaiming_uniform_ assume a leaky ReLU would create suboptimal training for those.
At the very least it should be noted in the documentation that the initialisation of Linear and Conv layers is done assuming they are followed by a leaky ReLU activation.
Finally, the √5 should be explained.
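In the meantime, a minimal sketch of how one can override the default and use the ReLU gain explicitly (the helper name and layer sizes are just an example):

```python
import torch.nn as nn

def init_for_relu(module):
    # Re-initialise with the ReLU gain instead of the default a=sqrt(5) leaky_relu setting.
    if isinstance(module, (nn.Linear, nn.Conv1d, nn.Conv2d, nn.Conv3d)):
        nn.init.kaiming_uniform_(module.weight, nonlinearity='relu')
        if module.bias is not None:
            nn.init.constant_(module.bias, 0.0)

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
model.apply(init_for_relu)
```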

