Kaiming Initialization

Kaiming Initialization, also known as He Initialization, is a weight initialization method for neural networks. It takes into account non-linear activation functions, such as ReLU, to avoid reducing or magnifying input signals exponentially as they pass through the network. The method keeps the variance of the signal roughly constant from layer to layer, making the network easier to optimize.

Why Initialize Neural Networks?

Neural networks, at their core, are just a collection of mathematical functions. Each function takes several inputs, applies a set of weights and biases, and produces an output. Those weights and biases have to be initialized to some value before the network can be trained. Without proper initialization, the signals and gradients flowing through the network can shrink or grow out of control during training, causing it to converge very slowly or not at all.

Initializing a neural network means choosing starting values for every weight and bias in the network. If the values are chosen poorly, the network may not be able to learn effectively. For example, if all the weights are initialized to zero, every neuron in a layer performs the same computation and receives the same gradient, so the network never breaks symmetry and cannot learn distinct features.

The Importance of Kaiming Initialization

The Kaiming Initialization method, also known as He Initialization, addresses the problem of choosing these starting values. It is particularly useful for deep networks with ReLU activations, where a poorly chosen weight scale can make the signal shrink or grow exponentially with depth. By matching the weight scale to the width of each layer, it ensures that initialization neither magnifies nor reduces the input signals exponentially, leading to better convergence.

Kaiming Initialization involves drawing the initial weights from a zero-centered Gaussian distribution with standard deviation $\sqrt{2 / n_{l-1}}$, where $n_{l-1}$ is the number of neurons in the previous layer. This keeps the variance of the activations roughly constant from layer to layer, making the network easier to optimize.
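As a concrete illustration, here is a minimal NumPy sketch of drawing one weight matrix this way; the layer sizes and the helper name `kaiming_normal` are illustrative choices, not part of any particular library.

```python
import numpy as np

def kaiming_normal(fan_in, fan_out, rng=np.random.default_rng(0)):
    """Draw a (fan_in, fan_out) weight matrix from N(0, 2 / fan_in)."""
    std = np.sqrt(2.0 / fan_in)  # standard deviation = sqrt(2 / n_{l-1})
    return rng.normal(loc=0.0, scale=std, size=(fan_in, fan_out))

W = kaiming_normal(fan_in=784, fan_out=256)  # e.g. the first hidden layer of an MLP
print(W.std())  # close to sqrt(2 / 784) ≈ 0.05
```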

The Kaiming Initialization Algorithm

Given a neural network with $L$ layers, the Kaiming Initialization algorithm involves the following steps:

  1. Set $n_l$ to be the number of neurons in layer $l$.
  2. Set $w_l$ to be the weights for layer $l$.
  3. Calculate the variance of the weights as follows: $$\text{Var}(w_l) = \frac{2}{n_{l-1}}$$
  4. Draw each weight in $w_l$ from a zero-centered Gaussian distribution with standard deviation $\sqrt{\text{Var}(w_l)}$.
  5. Set the biases to zero.

The formula used in step 3 ensures that the signal passing through the network does not explode or vanish, because the variance of the weights is scaled inversely with the number of neurons in the previous layer. The resulting weights can then be used to initialize the neural network before training.
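To make these steps concrete, the sketch below initializes a whole multi-layer perceptron following the algorithm above, assuming plain NumPy arrays and a hypothetical list of layer widths.

```python
import numpy as np

def kaiming_init(layer_sizes, rng=np.random.default_rng(0)):
    """Initialize weights and biases for an MLP with widths [n_0, n_1, ..., n_L]."""
    params = []
    for n_prev, n_curr in zip(layer_sizes[:-1], layer_sizes[1:]):
        var = 2.0 / n_prev                                        # step 3: Var(w_l) = 2 / n_{l-1}
        W = rng.normal(0.0, np.sqrt(var), size=(n_prev, n_curr))  # step 4: zero-centered Gaussian
        b = np.zeros(n_curr)                                      # step 5: zero biases
        params.append((W, b))
    return params

params = kaiming_init([784, 512, 256, 10])  # hypothetical layer widths for illustration
```

Deep learning frameworks ship the same scheme built in; in PyTorch, for example, it is available as `torch.nn.init.kaiming_normal_`.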

How Kaiming Initialization Works

The Kaiming Initialization method works by ensuring that each layer of the neural network receives approximately the same amount of variance. This matters because, in a poorly initialized network, the variance of the activations can grow or shrink from layer to layer, leaving some layers with gradients that are much larger or smaller than others and therefore harder to optimize.

By using a Gaussian distribution centered at zero with a variance that is inversely proportional to the number of neurons in the previous layer, Kaiming Initialization ensures that each layer receives approximately the same amount of variance. This makes it easier to optimize the network because the gradients will be more consistent across layers.
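This effect can be checked empirically. The sketch below pushes a random batch through a stack of 20 ReLU layers and compares the activation scale under the Kaiming weight scale, $\sqrt{2/n}$, with a smaller scale of $\sqrt{1/n}$; the depth, width, and batch size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def final_activation_std(weight_std, depth=20, width=512, batch=1000):
    """Std of the activations after `depth` ReLU layers whose weights use weight_std(fan_in)."""
    h = rng.normal(size=(batch, width))  # a batch of random inputs
    for _ in range(depth):
        W = rng.normal(0.0, weight_std(h.shape[1]), size=(h.shape[1], h.shape[1]))
        h = np.maximum(h @ W, 0.0)       # linear layer followed by ReLU
    return h.std()

print(final_activation_std(lambda n: np.sqrt(2.0 / n)))  # Kaiming scale: stays on the order of 1
print(final_activation_std(lambda n: np.sqrt(1.0 / n)))  # smaller scale: shrinks toward 0 with depth
```

Under these assumptions the Kaiming-scaled network keeps the activation magnitude roughly constant with depth, while the smaller scale loses roughly a factor of $\sqrt{2}$ per ReLU layer.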

Another advantage of Kaiming Initialization is that it works well with ReLU activation functions. These functions can cause the signal to explode or vanish if the weights are initialized improperly. By ensuring that each layer receives roughly the same amount of variance, Kaiming Initialization helps to prevent this problem from occurring.

Kaiming Initialization, or He Initialization, is an essential weight initialization method for deep neural networks. It helps to ensure that each layer of the network receives approximately the same amount of variance, making the network easier to optimize, and it works well with ReLU activation functions, which can cause problems when weights are initialized poorly. By using a zero-centered Gaussian distribution whose variance is inversely proportional to the number of neurons in the previous layer, Kaiming Initialization gives every layer a similar starting point, leading to faster convergence and better performance.
