Sparsemax: A New Type of Activation Function with Sparse Probability Output

Activation functions are an essential component of deep learning models, introducing non-linear transformations between layers. One commonly used activation function is the Softmax, which transforms a vector of scores into normalized probabilities. However, Softmax always produces a fully dense distribution: every element receives some non-zero probability, which can be computationally wasteful and can obscure which elements actually matter. Sparsemax addresses this by producing a sparse probability output, which is beneficial in many applications.

What is Sparsemax?

Sparsemax is an activation function, introduced by Martins and Astudillo (2016), that maps a score vector to a probability distribution, much like Softmax. Unlike Softmax, however, Sparsemax can assign exactly zero probability to low-scoring elements: the number of non-zero entries is not fixed in advance but is determined by the input itself. This makes the output more efficient to work with and easier to interpret than the fully dense output of the traditional Softmax.

The formula for Sparsemax is given as follows:

$$ \text{sparsemax}(\mathbf{z}) = \underset{\mathbf{p} \in \Delta^{K-1}}{\arg\min} \, \lVert \mathbf{p} - \mathbf{z} \rVert^{2} $$

Essentially, Sparsemax computes the Euclidean projection of the input vector onto the probability simplex $\Delta^{K-1}$, the set of all non-negative vectors whose entries sum to one. In other words, Sparsemax finds the point on the probability simplex closest to the input vector, which guarantees the output is always a valid probability distribution. When the input is far from the centre of the simplex, the projection lands on a face of the simplex, and the coordinates on the excluded faces become exactly zero; this is where the sparsity comes from.
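This projection has a simple closed form: sort the scores, find how many of the top scores belong in the support, compute a threshold τ from them, and clip everything below τ to zero. A minimal NumPy sketch of this algorithm (the function name and example scores are illustrative):

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of z onto the probability simplex.

    Returns a probability vector in which low-scoring entries
    can be exactly zero.
    """
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]              # scores in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum      # which sorted coords stay non-zero
    k_z = k[support][-1]                     # size of the support
    tau = (cumsum[k_z - 1] - 1) / k_z        # threshold
    return np.maximum(z - tau, 0.0)

p = sparsemax([1.2, 0.8, -1.0])
print(p)        # [0.7 0.3 0. ] -- the lowest score is driven to exactly zero
print(p.sum())  # 1.0
```

Note that the output still sums to one, but unlike Softmax the third entry is exactly zero, not merely small.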

The Advantages of Sparsemax

Sparsemax has many advantages over traditional activation functions, such as:

  • Sparsity: Sparsemax produces a probability distribution in which many elements can be exactly zero, with only the highest-scoring elements receiving non-zero probability. This sparsity reduces the cost of downstream vector operations and concentrates probability mass on the important features.
  • Interpretability: Because most entries are exactly zero, Sparsemax outputs are easy to read: the handful of non-zero entries directly identifies the most important features. This is particularly useful in applications where you need to single out a few critical features.
  • Stability: Sparsemax behaves predictably on inputs with large magnitudes: as the scores are scaled up, the output simply concentrates on the top-scoring elements, whereas Softmax saturates and a naive implementation of its exponentials can overflow.
  • Robustness: The loss function naturally associated with Sparsemax generalizes the multiclass hinge loss and has a margin property: once the correct class is separated by a sufficient margin, the loss is exactly zero. This can make classifiers trained with it less sensitive to small perturbations of the scores.
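The sparsity point can be seen concretely by running Softmax and Sparsemax side by side on the same score vector. The following self-contained NumPy sketch uses made-up scores for illustration:

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())                  # shift by max so exp never overflows
    return e / e.sum()

def sparsemax(z):
    # Euclidean projection onto the probability simplex
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum
    k_z = k[support][-1]
    tau = (cumsum[k_z - 1] - 1) / k_z
    return np.maximum(z - tau, 0.0)

scores = np.array([3.0, 2.5, 0.1, -1.0])
print(softmax(scores))    # dense: every entry is strictly positive
print(sparsemax(scores))  # [0.75 0.25 0.   0.  ] -- two entries exactly zero
```

Softmax spreads some probability onto every element, however implausible; Sparsemax assigns exact zeros to the two lowest scores while still producing a valid distribution.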

Applications of Sparsemax

Sparsemax can be used in a variety of applications where sparse probability distributions are beneficial. Some of the most common applications include:

  • Semantic Segmentation in Image Processing: Semantic segmentation assigns a class label to every pixel of an image. Sparsemax can serve as the per-pixel output activation, assigning exactly zero probability to implausible classes and yielding sharper, more interpretable label maps.
  • Classification Problems with High-Dimensional Inputs: In high-dimensional classification problems, Sparsemax can be used to select only the most important features of the input vector, reducing the computational overhead of dense vector operations.
  • Neuron Activation in Neural Networks: Sparsemax can be used as an activation function in neural networks, where it can help reduce the overfitting of the network by emphasizing only the most important features.
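As a sketch of the feature-selection idea above, consider hypothetical relevance scores for six input features: Sparsemax keeps only the strongest ones with non-zero weight. All names and numbers here are illustrative, not taken from the original paper.

```python
import numpy as np

def sparsemax(z):
    # Euclidean projection onto the probability simplex
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum
    k_z = k[support][-1]
    tau = (cumsum[k_z - 1] - 1) / k_z
    return np.maximum(z - tau, 0.0)

# Hypothetical relevance scores for six input features
scores = np.array([0.1, 2.0, -0.5, 1.8, 0.0, -1.2])
weights = sparsemax(scores)
selected = np.flatnonzero(weights)   # indices with non-zero weight
print(selected)                      # [1 3] -- only two features survive
```

Because the other four weights are exactly zero, downstream computation can skip those features entirely; with Softmax they would merely be small.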

Summary

Sparsemax is an activation function that produces sparse probability distributions, making it more efficient, interpretable, stable, and robust than the traditional Softmax. It can be used in a variety of applications where sparse probability distributions are helpful, such as semantic segmentation, high-dimensional classification problems, and neuron activation in neural networks.

While still relatively new, Sparsemax has already shown significant potential and may well play an important role in future machine learning and AI systems.
