gMLP is a model that has been proposed as an alternative to Transformers in Natural Language Processing (NLP). Instead of self-attention, it relies on simple Multi-Layer Perceptron (MLP) layers combined with gating. The model is organized as a stack of blocks, each defined by the same small set of equations.

The Structure of gMLP

The gMLP model is composed of a stack of identical blocks, each of which has the following structure:

  1. A linear projection along the channel dimension (the matrix U)
  2. An activation function such as GeLU
  3. A Spatial Gating Unit, which mixes information across tokens
  4. A second linear projection along the channel dimension (the matrix V)

This structure lets the model process a sequence of token representations in the same overall arrangement as a Transformer, but without self-attention. A token representation is a vector that represents a token (typically a word or subword) in a sentence or document; a sequence of n tokens embedded into d-dimensional vectors therefore forms an n × d matrix.

In gMLP, the token representations are stored in a matrix called X. Each block in the model takes X as input and produces Y as output. The process of generating Y consists of several steps:

  1. Channel Projection: A linear projection U is applied to X along the channel dimension, producing a matrix Z.
  2. Activation Function: An activation function such as GeLU is applied to Z.
  3. Spatial Gating Unit: A spatial gating unit is applied to the activations to capture spatial interactions among the tokens.
  4. Channel Projection: A second linear projection V is applied to the output of the spatial gating unit, again along the channel dimension, to produce the final output Y.
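
In the notation of the original gMLP paper ("Pay Attention to MLPs"), these steps can be written compactly as:

  Z = σ(X U)
  Z̃ = s(Z)
  Y = Z̃ V

where σ is the activation function, s(·) denotes the Spatial Gating Unit, and U and V are the two learned channel-projection matrices.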

The model contains L such blocks, each with identical structure and size. The output of one block becomes the input of the next, so the process described above is applied L times before the final output is obtained.
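
To make this concrete, the following is a minimal PyTorch sketch of a gMLP block stack. It is an illustration rather than the authors' implementation: the module and parameter names (SpatialGatingUnit, gMLPBlock, d_model, d_ffn, seq_len) are our own, and the layer normalization and residual connection are assumptions that follow common practice for blocks of this kind.

```python
# Minimal, illustrative gMLP block stack (not the official implementation).
import torch
import torch.nn as nn


class SpatialGatingUnit(nn.Module):
    """Gates half of the channels with a linear projection across the token (spatial) dimension."""

    def __init__(self, d_ffn: int, seq_len: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_ffn // 2)          # d_ffn assumed even
        # Spatial projection: mixes information across the seq_len tokens.
        self.spatial_proj = nn.Linear(seq_len, seq_len)
        # Initialize near zero weights and unit bias so s(Z) ~ Z1 at the start of training.
        nn.init.zeros_(self.spatial_proj.weight)
        nn.init.ones_(self.spatial_proj.bias)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, seq_len, d_ffn)
        z1, z2 = z.chunk(2, dim=-1)                    # split along the channel dimension
        z2 = self.norm(z2)
        z2 = self.spatial_proj(z2.transpose(1, 2)).transpose(1, 2)  # project across tokens
        return z1 * z2                                 # element-wise gating


class gMLPBlock(nn.Module):
    def __init__(self, d_model: int, d_ffn: int, seq_len: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.channel_proj_in = nn.Linear(d_model, d_ffn)        # U: channel projection
        self.activation = nn.GELU()
        self.sgu = SpatialGatingUnit(d_ffn, seq_len)
        self.channel_proj_out = nn.Linear(d_ffn // 2, d_model)  # V: channel projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x
        z = self.activation(self.channel_proj_in(self.norm(x)))  # Z = sigma(X U)
        z = self.sgu(z)                                           # Z~ = s(Z)
        y = self.channel_proj_out(z)                              # Y = Z~ V
        return y + residual                                       # residual connection


# Usage: stack L identical blocks and run a batch of token representations through them.
blocks = nn.Sequential(*[gMLPBlock(d_model=256, d_ffn=1024, seq_len=128) for _ in range(6)])
x = torch.randn(2, 128, 256)   # (batch, seq_len, d_model)
y = blocks(x)                  # same shape as x
```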

Spatial Gating Unit

The Spatial Gating Unit is a key ingredient of the gMLP model. It is a gating mechanism that captures spatial interactions among tokens: it learns how tokens should influence one another and uses that information to modulate each token's representation.

The Spatial Gating Unit is a modified linear gating mechanism. Its input is split along the channel dimension into two halves, Z1 and Z2. A linear projection is applied to Z2 across the spatial (token) dimension, so that every token can exchange information with every other token, and the result is used to gate Z1 through element-wise multiplication.

The spatial projection is parameterized by a weight matrix W of size n × n (where n is the sequence length) and a bias b, both of which are learned during training. Because this projection acts across tokens rather than channels, it plays the role that self-attention plays in a Transformer, but with static learned weights instead of input-dependent attention scores.
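
Concretely, the gMLP paper formulates the unit as:

  s(Z) = Z1 ⊙ f(Z2),   where   f(Z2) = W Z2 + b

Here Z is split along the channel dimension into the two halves Z1 and Z2, W is an n × n matrix acting across the n tokens of the sequence, b is a bias, and ⊙ denotes element-wise multiplication. W is initialized close to zero and b close to one, so that s(Z) ≈ Z1 early in training and each block initially behaves like an ordinary feed-forward layer.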

Advantages of gMLP

One of the main advantages of gMLP is that it does not require position embeddings. Transformers need them because self-attention is permutation-invariant and has no inherent notion of token order. In gMLP, the spatial projection inside the Spatial Gating Unit is itself position-dependent, so information about token order is captured directly in the learned weights.

Another advantage of gMLP is that it works well on tasks such as text classification, sentiment analysis, and language modeling. It has been shown to achieve comparable, and in some cases better, performance than Transformer-based models on these tasks.

Moreover, gMLP can be computationally efficient compared to Transformers. Much of a Transformer's cost comes from its self-attention mechanism, whereas gMLP uses plain MLP layers with gating, which can be cheaper to compute.

gMLP is a promising model in NLP that offers an alternative to Transformers. Its stack of identical blocks processes sequences of token representations much as a Transformer does, with the Spatial Gating Unit taking over the role of self-attention in capturing interactions among tokens. Compared with Transformer-based models, gMLP does not require position embeddings, can be computationally efficient, and still reaches comparable performance.
