Gaussian Error Linear Units

The Gaussian Error Linear Unit, or GELU, is an activation function that is commonly used in artificial neural networks. It was first introduced in the 2016 paper "Gaussian Error Linear Units (GELUs)" by Hendrycks and Gimpel.

What is an activation function?

An activation function is a mathematical function applied to the output of a neuron in a neural network. It introduces non-linearity into the model, which allows the network to represent more complex relationships between the input and output variables; without it, a stack of layers would collapse into a single linear transformation. Activation functions are applied at each layer of a neural network to transform that layer's output before it is passed to the next layer.
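As a minimal illustration (the layer sizes and weights here are made up for the example), a dense layer computes a linear combination of its inputs and then passes the result through an activation function:

```python
import numpy as np

def relu(z):
    # ReLU: a common activation that zeroes out negative values
    return np.maximum(0.0, z)

def dense_layer(x, W, b, activation):
    # Linear transformation followed by a non-linear activation
    return activation(W @ x + b)

# Toy example: 3 inputs -> 2 outputs, with random weights
rng = np.random.default_rng(0)
x = rng.normal(size=3)
W = rng.normal(size=(2, 3))
b = np.zeros(2)
print(dense_layer(x, W, b, relu))
```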

How does the GELU activation function work?

The GELU activation function is defined as x times the standard Gaussian cumulative distribution function of x, which can be written as x times the probability that a random variable from a normal distribution with mean 0 and variance 1 is less than or equal to x. Mathematically, the GELU function can be written as:

GELU(x) = x * Φ(x) = x * (1/2)*(1 + erf(x/sqrt(2)))

Here, erf is the Gauss error function, and (1/2)*(1 + erf(x/sqrt(2))) is exactly Φ(x), the cumulative distribution function of the standard normal distribution. Essentially, the GELU activation weights inputs by their percentile under a standard normal distribution, rather than gating them purely by their sign as ReLU, another commonly used activation function, does. This makes GELU a smoother non-linearity than ReLU, which can be useful in some cases.
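As a sketch, this definition can be computed directly in plain Python using math.erf for the error function:

```python
import math

def gelu_exact(x: float) -> float:
    # GELU(x) = x * Phi(x), where Phi is the standard normal CDF
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi

# GELU passes large positive inputs almost unchanged and shrinks negative ones toward zero
for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"GELU({x:+.1f}) = {gelu_exact(x):+.4f}")
```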

Why is the GELU activation function useful?

The smoothness of the GELU function can be advantageous in deep neural networks with many layers, as it can help mitigate the "dying ReLU" problem: when a ReLU unit's input is consistently negative, its output and its gradient are both zero, so the unit stops learning. The GELU function can also help improve the performance of a neural network on certain tasks, such as natural language processing and speech recognition.
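To make this concrete, the sketch below (using PyTorch autograd; any recent PyTorch version should work) compares the gradients of ReLU and GELU at negative inputs, where ReLU's gradient is exactly zero while GELU's is small but non-zero:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-1.0, -0.1, 0.5], requires_grad=True)

# ReLU: gradient is exactly zero for negative inputs
F.relu(x).sum().backward()
print("ReLU gradients:", x.grad)

x.grad = None  # reset gradients before the second backward pass

# GELU: gradient is small but non-zero for negative inputs
F.gelu(x).sum().backward()
print("GELU gradients:", x.grad)
```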

How is the GELU activation function implemented?

The GELU function can be approximated using the hyperbolic tangent function and a cubic polynomial term, as shown below:

GELU(x) ≈ 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
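A plain-Python sketch of this tanh approximation, compared against the exact erf-based form:

```python
import math

def gelu_exact(x: float) -> float:
    # Exact GELU via the standard normal CDF
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    # tanh approximation with the 0.044715 cubic correction term
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

for x in (-2.0, -0.5, 0.5, 2.0):
    print(f"x={x:+.1f}  exact={gelu_exact(x):+.4f}  tanh={gelu_tanh(x):+.4f}")
```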

Alternatively, the function can be approximated using the logistic sigmoid function σ and a scaling constant, as shown below:

GELU(x) ≈ x * σ(1.702 * x)
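And a corresponding sketch of the sigmoid approximation:

```python
import math

def sigmoid(z: float) -> float:
    # Logistic sigmoid function
    return 1.0 / (1.0 + math.exp(-z))

def gelu_sigmoid(x: float) -> float:
    # Sigmoid approximation: x * sigma(1.702 * x)
    return x * sigmoid(1.702 * x)

for x in (-2.0, -0.5, 0.5, 2.0):
    print(f"x={x:+.1f}  sigmoid approx = {gelu_sigmoid(x):+.4f}")
```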

These approximations can be useful when computational efficiency is a concern, but PyTorch's exact, erf-based implementation of the GELU function is already fast enough for most applications, so the approximations are generally unnecessary.
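In PyTorch, for example, GELU is available as both a module and a functional call; in recent versions the approximate argument selects between the exact erf-based form and the tanh approximation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.linspace(-3.0, 3.0, steps=7)

# Exact (erf-based) GELU, the default
print(nn.GELU()(x))

# tanh approximation (available in recent PyTorch versions)
print(nn.GELU(approximate='tanh')(x))

# Functional form
print(F.gelu(x, approximate='none'))
```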

Where is the GELU activation function used?

The GELU activation function is used in many deep learning models, including GPT-3 and BERT, powerful natural language processing models developed by OpenAI and Google, respectively. It is also used in many other Transformer-based models, deep neural networks that are commonly used for natural language processing, speech recognition, and other tasks.

The GELU activation function is a powerful tool for deep learning models that can help prevent the problems associated with "dying ReLU" and improve performance on certain tasks. While there are approximations of the function that can be used for computational efficiency, PyTorch's exact implementation is already fast enough for most applications.
