DeLighT Block

The DeLighT block is the building block of the transformer architecture of DeLighT, a machine learning model that applies DExTra transformations to the input vectors of a single-headed attention module. The block replaces multi-head attention with single-head attention, relying on the DExTra transformations to learn wider representations of the input across different layers.

What is DeLighT Block?

The DeLighT block is a vital component of the DeLighT transformer architecture. Its purpose is to reduce the dimensionality of the vectors that the attention module has to process. It does this by applying a DExTra transformation to the input vectors of a single-headed attention module. The result is that the model learns wider representations of the input across different layers while keeping the attention computation small, which makes the model effective at its task.
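
To make the ordering of the pieces concrete, here is a minimal structural sketch of the block, assuming PyTorch (DeLighT's reference implementation uses PyTorch, but this code is not taken from it). The placeholder sub-modules, names, and dimensions are illustrative, and residual connections, normalization, and the output projection are omitted; the DExTra transformation, single-headed attention, and light-weight FFN are each described in more detail below.

import torch
import torch.nn as nn

class DeLighTBlockSketch(nn.Module):
    """Structural sketch: DExTra -> single-headed attention -> light-weight FFN."""
    def __init__(self, d_model: int, d_attn: int):
        super().__init__()
        self.dextra = nn.Linear(d_model, d_attn)  # placeholder for the DExTra transformation
        self.attn = nn.MultiheadAttention(d_attn, num_heads=1, batch_first=True)  # one head only
        self.ffn = nn.Sequential(                 # placeholder for the light-weight FFN
            nn.Linear(d_attn, d_attn // 4),
            nn.GELU(),
            nn.Linear(d_attn // 4, d_attn),
        )

    def forward(self, x):
        h = self.dextra(x)         # widen/deepen the token-wise representation, reduce its size
        a, _ = self.attn(h, h, h)  # single-headed self-attention over the reduced vectors
        return self.ffn(a)         # squeeze-then-restore feed-forward network

x = torch.randn(2, 10, 256)                    # (batch, sequence, model dimension)
print(DeLighTBlockSketch(256, 128)(x).shape)   # torch.Size([2, 10, 128])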

Single-headed attention by itself is not new to transformer architectures. What distinguishes the DeLighT block is the DExTra transformation applied inside it, which is one of the main innovations behind the design of DeLighT.

How DeLighT Block uses DExTra transformations

The DeLighT block integrates the DExTra transformation directly into its input path. DExTra is a deep, light-weight stack of group linear transformations that first expands the input vector into a wider representation and then reduces it to a lower-dimensional output. Because the features have already been widened and mixed by the time they reach the attention layer, the single-headed attention module can process these smaller vectors while still benefiting from the wide representations.
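
The following is a simplified sketch of this expand-then-reduce idea, assuming PyTorch. The GroupLinear and DExTraSketch classes, the number of groups, and the widths are all illustrative assumptions; the actual DExTra transformation stacks several such grouped layers and also shuffles features across groups and mixes in the original input, which is omitted here.

import torch
import torch.nn as nn

class GroupLinear(nn.Module):
    """Splits the features into groups and applies an independent linear map to each group."""
    def __init__(self, in_dim: int, out_dim: int, groups: int):
        super().__init__()
        assert in_dim % groups == 0 and out_dim % groups == 0
        self.groups = groups
        self.maps = nn.ModuleList(
            nn.Linear(in_dim // groups, out_dim // groups) for _ in range(groups)
        )

    def forward(self, x):
        chunks = x.chunk(self.groups, dim=-1)
        return torch.cat([m(c) for m, c in zip(self.maps, chunks)], dim=-1)

class DExTraSketch(nn.Module):
    """Expand-then-reduce stack of group linear transformations (heavily simplified)."""
    def __init__(self, d_in: int = 256, d_max: int = 512, d_out: int = 128, groups: int = 4):
        super().__init__()
        self.expand = GroupLinear(d_in, d_max, groups)   # expansion phase: widen the features
        self.reduce = GroupLinear(d_max, d_out, groups)  # reduction phase: shrink for attention
        self.act = nn.GELU()

    def forward(self, x):
        return self.reduce(self.act(self.expand(x)))

x = torch.randn(2, 10, 256)        # (batch, sequence, features)
print(DExTraSketch()(x).shape)     # torch.Size([2, 10, 128])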

The use of DExTra transformations is desirable in the DeLighT block because it enables the model to learn wider representations of the input across different layers. These wider representations improve the model's performance by allowing it to capture more subtle and complex relationships among the features, so the model can identify patterns more effectively than if its view of the input were narrower.

One important aspect of DExTra transformations is that, although they are applied to each input vector individually, their grouped layers do not work in isolation: features are shuffled across groups between layers, so information learned in one group is shared with the others. This mixing is what lets a stack of small, grouped layers stand in for a single wide, expensive layer.

Single-headed attention vs. multi-head attention

In transformer architectures, attention modules let the model focus on relevant portions of the input. Traditional transformers, like the original Transformer architecture, use multi-head attention: several attention heads operate in parallel, each attending to the input in its own learned subspace, so that different heads can capture different kinds of relationships.

One of the unique features of the DeLighT Block is its use of single-headed attention rather than multi-head attention. This decision was made because the DExTra transformation already provides the model with wider representations of input across different layers, making multi-head attention redundant.

Furthermore, single-headed attention makes the model more computationally efficient. It requires fewer parameters and fewer operations, especially because the attention runs on the lower-dimensional DExTra output, which leads to faster training and inference.
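
For reference, a generic single-headed scaled dot-product self-attention can be sketched as below, assuming PyTorch; the class name, dimensions, and the absence of masking, dropout, and normalization are simplifications rather than details of DeLighT's implementation.

import math
import torch
import torch.nn as nn

class SingleHeadAttention(nn.Module):
    """Plain scaled dot-product self-attention with a single head."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # similarity between positions
        return self.out(torch.softmax(scores, dim=-1) @ v)        # weighted sum of values

x = torch.randn(2, 10, 128)                 # e.g. the reduced-dimension DExTra output
print(SingleHeadAttention(128)(x).shape)    # torch.Size([2, 10, 128])

Because there is only one head, there is no need to split, reshape, and recombine the projections across heads, which is where much of the bookkeeping and parameter cost of multi-head attention comes from.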

Light-weight FFN

Following the attention module in the DeLighT block, there is a light-weight feed-forward network (FFN). This network first squeezes the dimensionality of its input and then projects it back, the opposite of the FFN in traditional transformer architectures, which first expands the hidden dimension (typically by a factor of four) and then reduces it.

The reasoning is similar to the choice of a single-headed attention module over multi-head attention: the DExTra transformation has already produced wide representations, so the FFN does not need a wide inner layer, and squeezing the dimensions keeps the block light-weight without hurting performance.
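
A minimal sketch of such a squeeze-then-restore FFN is shown below, assuming PyTorch; the reduction factor and activation are illustrative assumptions rather than DeLighT's exact settings.

import torch
import torch.nn as nn

class LightweightFFN(nn.Module):
    """Feed-forward network that first reduces the hidden size, then restores it."""
    def __init__(self, d_model: int, reduction: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_model // reduction),  # squeeze (a standard FFN would expand here)
            nn.GELU(),
            nn.Linear(d_model // reduction, d_model),  # project back to the model dimension
        )

    def forward(self, x):
        return self.net(x)

x = torch.randn(2, 10, 128)
print(LightweightFFN(128)(x).shape)   # torch.Size([2, 10, 128])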

In summary, the DeLighT block is a vital component of the DeLighT transformer architecture. Its use of DExTra transformations and single-headed attention makes the model more computationally efficient, faster, and more effective at processing complex data. The light-weight FFN that follows the attention module builds on the wide representations the block has already learned by squeezing rather than expanding its input. Together, these techniques make DeLighT an innovative and efficient machine learning model.
