Compressive Transformer

The Compressive Transformer is a type of neural network that is an extension of the Transformer model. It works by mapping past hidden activations, also known as memories, to a smaller set of compressed representations called compressed memories. This allows the network to better process information over time and use both short-term and long-term memory.

Compressive Transformer vs. Transformer-XL

The Compressive Transformer builds on the ideas of the Transformer-XL, which is another type of Transformer model. The Transformer-XL maintains a memory of past activations at each layer to preserve a longer history of context. However, the Transformer-XL discards past activations when they become sufficiently old. This is controlled by the size of the memory. In contrast, the Compressive Transformer compresses these old memories and stores them in an additional compressed memory.

How the Compressive Transformer Works

At each time step, the oldest compressed memories are discarded in a first-in, first-out (FIFO) order. Then, the oldest n states from the ordinary memory are compressed and shifted to the new slot in the compressed memory. During training, the compressive memory component is optimized separately from the main language model, which helps the network learn to better process information over time.

Benefits of the Compressive Transformer

The Compressive Transformer has several benefits over the Transformer model. By compressing old memories instead of discarding them, the network can better retain information over time. This allows the network to better understand context and make more accurate predictions. Additionally, because the compressive memory component is optimized separately from the main language model during training, the network can learn more quickly and accurately.

Overall, the Compressive Transformer is an innovative approach to improving the performance of neural networks. By compressing old memories and maintaining both short-term and long-term memory, this advanced model has the potential to revolutionize the field of natural language processing and other areas of artificial intelligence.