Feedback Memory

Feedback Memory in the Feedback Transformer Architecture

Feedback Memory is a memory mechanism used in the Feedback Transformer architecture. It exposes the most abstract representations from the past directly as inputs to the current timestep. Instead of computing all token representations in parallel, the model processes the sequence token by token, and Feedback Memory replaces the context inputs to the attention modules with memory vectors computed over the past. As a result, each new representation can draw on past representations from any layer, not just the layer below, which yields richer representations of the input. A minimal sketch of this idea follows.
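The sketch below illustrates the core change, assuming standard PyTorch multi-head attention: the query is the current token's state at some layer, while the keys and values are the shared memory vectors from earlier steps rather than that layer's own past hidden states. The class name, dimensions, and toy tensors are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn as nn

class FeedbackAttention(nn.Module):
    """Sketch of one attention layer in a Feedback Transformer: the query is
    the current token's state, but keys/values come from the shared memory
    vectors of previous steps instead of this layer's own past states."""

    def __init__(self, d_model: int = 8, n_heads: int = 2):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x_t: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # x_t:    (batch, 1, d_model)  current-token representation at this layer
        # memory: (batch, t, d_model)  memory vectors from past (and current) steps
        out, _ = self.attn(query=x_t, key=memory, value=memory)
        return out

# Toy usage: attend from the current token over 5 past memory vectors.
layer = FeedbackAttention()
x_t = torch.randn(1, 1, 8)
memory = torch.randn(1, 5, 8)
print(layer(x_t, memory).shape)  # torch.Size([1, 1, 8])
```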

How Feedback Memory Works

In the Feedback Transformer architecture, a memory vector is computed at each time step as a weighted sum of the representations from every layer at that step. The weights come from a softmax over per-layer scalars, which gives the model flexibility: it can average the layers or concentrate on a single one. Because the Transformer's computation is adapted from parallel to sequential, the attention at every layer can condition on the memory vectors of earlier steps, so each new representation can build on past representations from any layer. Exposing all previous computation to future computation in this way produces better representations of the input.
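A small sketch of the memory-vector computation just described, under the assumption that the per-layer mixing scalars are learnable parameters normalized with a softmax; the function name and toy shapes are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn as nn

def compute_memory_vector(layer_outputs, layer_logits):
    """Softmax-weighted sum of the per-layer representations at one time step.

    layer_outputs: list of (batch, d_model) tensors, one per layer.
    layer_logits:  learnable scalars, one per layer; the softmax lets the
                   model average the layers or select a single one.
    """
    weights = torch.softmax(layer_logits, dim=0)           # (n_layers,)
    stacked = torch.stack(layer_outputs, dim=0)            # (n_layers, batch, d_model)
    return (weights[:, None, None] * stacked).sum(dim=0)   # (batch, d_model)

# Toy usage: 4 layers, batch of 2, model width 8.
layer_logits = nn.Parameter(torch.zeros(4))                # learnable mixing scalars
layer_outputs = [torch.randn(2, 8) for _ in range(4)]      # stand-in layer states
memory_t = compute_memory_vector(layer_outputs, layer_logits)
print(memory_t.shape)  # torch.Size([2, 8])
```

During sequential decoding, one such memory vector would be appended to the memory after each token, and the attention sketch above would read from that growing memory at every layer of the next step.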

The Benefits of Feedback Memory

Feedback Memory offers several benefits over the standard Transformer architecture. Because every layer can read the most abstract representations of past tokens, a model with Feedback Memory can capture the same level of abstraction as a much deeper standard architecture, so shallower models suffice, saving computational resources and time. It also supports more accurate predictions, since past high-level representations feed directly into the current timestep. Overall, Feedback Memory improves model performance while keeping the architecture shallow and efficient.

In summary, Feedback Memory is the memory mechanism at the heart of the Feedback Transformer architecture. It allows the most abstract representations from the past to be used directly as inputs at the current timestep, improving the performance of the model. With Feedback Memory, shallow models can capture the same level of abstraction as deeper architectures while remaining computationally efficient.
