cc @fmassa, as he introduced these in #9038.
Looking into the initialisation of Linear and Convolution layers, we have the following:
Linear:
pytorch/torch/nn/modules/linear.py
Lines 58 to 63 in 3df79f4
```python
def reset_parameters(self):
    init.kaiming_uniform_(self.weight, a=math.sqrt(5))
    if self.bias is not None:
        fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
        bound = 1 / math.sqrt(fan_in)
        init.uniform_(self.bias, -bound, bound)
```
Convolution:
pytorch/torch/nn/modules/conv.py
Lines 45 to 51 in 3df79f4
```python
def reset_parameters(self):
    n = self.in_channels
    init.kaiming_uniform_(self.weight, a=math.sqrt(5))
    if self.bias is not None:
        fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
        bound = 1 / math.sqrt(fan_in)
        init.uniform_(self.bias, -bound, bound)
```
Notice the a = sqrt(5) passed to kaiming_uniform_ in both cases.
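For reference, here is what fan_in (and therefore the bias bound) works out to for both layer types, using the same private helper quoted above; the concrete layer sizes are just an illustration:

```python
import math
from torch import nn
from torch.nn import init

lin = nn.Linear(100, 10)                 # fan_in = in_features = 100
conv = nn.Conv2d(16, 32, kernel_size=3)  # fan_in = in_channels * 3 * 3 = 144

for name, w in [("linear", lin.weight), ("conv", conv.weight)]:
    fan_in, _ = init._calculate_fan_in_and_fan_out(w)
    # bias is drawn from U(-1/sqrt(fan_in), 1/sqrt(fan_in))
    print(name, fan_in, 1 / math.sqrt(fan_in))
# linear 100 0.1
# conv   144 0.0833...
```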
Kaiming paper
https://arxiv.org/abs/1502.01852
The standard deviation should be sqrt(2 / fan_in)

Using the same principle as the Glorot et al. paper, for a uniform distribution we should use bounds of ±√3 * sqrt(2 / fan_in).
This is what is done in kaiming_uniform_ (torch/nn/init.py):
Lines 288 to 293 in 700271d
```python
fan = _calculate_correct_fan(tensor, mode)
gain = calculate_gain(nonlinearity, a)
std = gain / math.sqrt(fan)
bound = math.sqrt(3.0) * std  # Calculate uniform bounds from standard deviation
with torch.no_grad():
    return tensor.uniform_(-bound, bound)
```
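Plugging numbers into these few lines makes the effect of a concrete (a hand calculation; fan_in = 100 is arbitrary):

```python
import math

fan_in = 100
for a in (0.0, math.sqrt(5)):
    gain = math.sqrt(2.0 / (1 + a ** 2))  # leaky_relu gain, same formula as calculate_gain
    std = gain / math.sqrt(fan_in)
    bound = math.sqrt(3.0) * std
    print(a, round(bound, 4))
# a = 0       -> bound = sqrt(6 / fan_in)  ~ 0.2449
# a = sqrt(5) -> bound = 1 / sqrt(fan_in)  = 0.1
```

So with a = √5 the weight bound collapses to 1/sqrt(fan_in) instead of the sqrt(6/fan_in) that ReLU-style Kaiming would give.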
Diving deeper into the implementation
It seems like the a = √5 is used in calculate_gain (torch/nn/init.py):
Lines 8 to 47 in 700271d
```python
def calculate_gain(nonlinearity, param=None):
    r"""Return the recommended gain value for the given nonlinearity function.
    The values are as follows:

    ================= ====================================================
    nonlinearity      gain
    ================= ====================================================
    Linear / Identity :math:`1`
    Conv{1,2,3}D      :math:`1`
    Sigmoid           :math:`1`
    Tanh              :math:`\frac{5}{3}`
    ReLU              :math:`\sqrt{2}`
    Leaky Relu        :math:`\sqrt{\frac{2}{1 + \text{negative\_slope}^2}}`
    ================= ====================================================

    Args:
        nonlinearity: the non-linear function (`nn.functional` name)
        param: optional parameter for the non-linear function

    Examples:
        >>> gain = nn.init.calculate_gain('leaky_relu')
    """
    linear_fns = ['linear', 'conv1d', 'conv2d', 'conv3d', 'conv_transpose1d', 'conv_transpose2d', 'conv_transpose3d']
    if nonlinearity in linear_fns or nonlinearity == 'sigmoid':
        return 1
    elif nonlinearity == 'tanh':
        return 5.0 / 3
    elif nonlinearity == 'relu':
        return math.sqrt(2.0)
    elif nonlinearity == 'leaky_relu':
        if param is None:
            negative_slope = 0.01
        elif not isinstance(param, bool) and isinstance(param, int) or isinstance(param, float):
            # True/False are instances of int, hence check above
            negative_slope = param
        else:
            raise ValueError("negative_slope {} not a valid number".format(param))
        return math.sqrt(2.0 / (1 + negative_slope ** 2))
    else:
        raise ValueError("Unsupported nonlinearity {}".format(nonlinearity))
```
The a parameter is only used for leaky_relu, which is actually the default nonlinearity if we don't pass one to kaiming_uniform_:
Line 261 in 700271d
```python
def kaiming_uniform_(tensor, a=0, mode='fan_in', nonlinearity='leaky_relu'):
```
Furthermore, this √5 factor conflicts with the recommended gain sqrt(2.0 / (1 + negative_slope ** 2)) in calculate_gain, and I suspect this is unintentional.
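To see the difference in numbers (values approximate):

```python
import math
from torch import nn

print(nn.init.calculate_gain('relu'))                      # sqrt(2)             ~ 1.414
print(nn.init.calculate_gain('leaky_relu'))                # default slope 0.01  ~ 1.414
print(nn.init.calculate_gain('leaky_relu', math.sqrt(5)))  # slope sqrt(5)       ~ 0.577
```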
Docs
Whether the √5 factor is intentional or not, the documentation for the weights is misleading.
Linear
While k = 1/in_features is true for the bias, for the weight it would be k = 6/in_features under pure Kaiming (ReLU gain √2, i.e. a = 0). With the current a = √5 the gain becomes sqrt(2 / (1 + 5)) = sqrt(1/3), so the weight bound works out to sqrt(1/in_features), i.e. k = 1/in_features; this matches the documented value, but only because of the undocumented √5 choice.
Convolution
The same remark applies to the convolution layers.
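A quick check of what a freshly constructed layer actually gets (illustrative; in_features = 256 is arbitrary and the exact max depends on the random draw):

```python
import math
import torch
from torch import nn

torch.manual_seed(0)
fan_in = 256
layer = nn.Linear(fan_in, 128)

print(layer.weight.abs().max().item())  # close to the actual weight bound of the default init
print(1 / math.sqrt(fan_in))            # documented bound sqrt(k), k = 1/in_features -> 0.0625
print(math.sqrt(6 / fan_in))            # bound pure Kaiming (ReLU gain) would give   -> ~0.153
```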
Closing thoughts
Plenty of tutorials use ReLU and not LeakyReLU; having the default initialisation of kaiming_uniform_ assume a leaky ReLU would create suboptimal training for those.
At the very least it should be noted in the documentation that the initialisation of Linear and Conv layers is done assuming they are followed by a leaky ReLU activation.
Finally, the √5 should be explained.
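In the meantime, a minimal sketch of how one can override the default and use the ReLU gain explicitly (the helper name and layer sizes are just an example):

```python
import torch.nn as nn

def init_for_relu(module):
    # Re-initialise with the ReLU gain instead of the default a=sqrt(5) leaky_relu setting.
    if isinstance(module, (nn.Linear, nn.Conv1d, nn.Conv2d, nn.Conv3d)):
        nn.init.kaiming_uniform_(module.weight, nonlinearity='relu')
        if module.bias is not None:
            nn.init.constant_(module.bias, 0.0)

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
model.apply(init_for_relu)
```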

