Understanding Neighborhood Attention

Neighborhood Attention is a self-attention pattern used in hierarchical vision transformers, in which each token's receptive field is restricted to its nearest neighboring pixels. It was proposed as an alternative to other local attention mechanisms. The idea behind Neighborhood Attention is that a token attends only to the pixels directly surrounding it, rather than to all of the pixels in the image.
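To make this concrete, below is a minimal, unoptimized sketch of the idea in PyTorch for a single head on a 2D feature map. The function name is illustrative, and the usual query/key/value projections are omitted for brevity; this is a sketch of the pattern, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def neighborhood_attention(x, k=3):
    # x: (H, W, C) feature map; k: neighborhood (window) size.
    # Each position attends only to the k x k window around it; at the
    # borders the window is shifted inward so it always covers k x k pixels.
    H, W, C = x.shape
    r = k // 2
    out = torch.empty_like(x)
    for i in range(H):
        for j in range(W):
            top = min(max(i - r, 0), H - k)   # clamp window inside the image
            left = min(max(j - r, 0), W - k)
            nbhd = x[top:top + k, left:left + k].reshape(-1, C)  # (k*k, C)
            q = x[i, j]                                          # (C,)
            attn = F.softmax(nbhd @ q / C ** 0.5, dim=0)         # (k*k,)
            out[i, j] = attn @ nbhd          # weighted sum over the window
    return out

y = neighborhood_attention(torch.randn(8, 8, 16), k=3)
print(y.shape)  # torch.Size([8, 8, 16])
```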

This concept is similar to Stand-Alone Self-Attention (SASA), and both can be implemented as a sliding-window operation over the keys and values. Neighborhood Attention differs in how it handles corner pixels: rather than letting the window run off the edge of the image, it repositions the window so that it stays inside, which maintains a fixed receptive field size and increases the number of possible relative positions.
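The corner-pixel difference is easiest to see in one dimension. The sketch below contrasts a SASA-style centered window, where out-of-bounds positions would be zero-padded, with the NA-style rule that shifts the window back inside the sequence; the function names are illustrative, not code from either paper.

```python
def sasa_window(i, length, k):
    # Window centered on i; out-of-bounds positions are dropped here
    # (they would be zero-padded in practice), so corner windows shrink.
    r = k // 2
    return [j for j in range(i - r, i + r + 1) if 0 <= j < length]

def na_window(i, length, k):
    # Window shifted inward at the borders: every token attends to
    # exactly k real neighbors, so the receptive field size stays fixed.
    start = min(max(i - k // 2, 0), length - k)
    return list(range(start, start + k))

for i in (0, 1, 4):
    print(i, sasa_window(i, 9, 5), na_window(i, 9, 5))
# 0 [0, 1, 2]       [0, 1, 2, 3, 4]
# 1 [0, 1, 2, 3]    [0, 1, 2, 3, 4]
# 4 [2, 3, 4, 5, 6] [2, 3, 4, 5, 6]
```

The shifted windows are also where the extra relative positions come from: an interior token sees offsets -2 through +2, while the token at position 0 sees offsets 0 through +4, so the set of possible relative positions per axis grows from k to 2k - 1.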

The Challenges of Neighborhood Attention

The primary challenge in experimenting with Neighborhood Attention and SASA is computation. Explicitly extracting the key-value pairs for each query consumes a large amount of memory and eventually becomes intractable at scale. To address this, Neighborhood Attention was implemented through a new CUDA extension for PyTorch called NATTEN. The extension makes the computation faster and more memory-efficient, which makes large-scale experimentation far more accessible.
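The memory cost comes from materializing the neighborhoods: a 56 x 56 feature map with a 7 x 7 window means storing 56 * 56 * 49 (roughly 154K) key vectors per head just to compute the attention scores, and this grows with resolution. NATTEN instead computes the sliding-window attention directly in CUDA. Below is a usage sketch based on the NeighborhoodAttention2D module exposed in earlier NATTEN releases; the exact module name and arguments may differ between versions, so treat them as assumptions and check the project's documentation.

```python
import torch
from natten import NeighborhoodAttention2D  # pip install natten

# Argument names follow earlier NATTEN releases; treat them as assumptions.
na = NeighborhoodAttention2D(dim=64, num_heads=4, kernel_size=7)

x = torch.randn(1, 56, 56, 64)  # NATTEN modules take channels-last input
y = na(x)
print(y.shape)  # torch.Size([1, 56, 56, 64])
```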

Benefits of Neighborhood Attention

One benefit of Neighborhood Attention is that it produces a more granular attention pattern than other local attention mechanisms: each token attends to a window centered on itself rather than a shared block. Because each token only attends to its neighboring pixels, the model is less likely to learn irrelevant features, which helps it achieve better accuracy on image recognition tasks.
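A quick back-of-the-envelope calculation shows how much the restriction prunes. With global self-attention, every token scores every other token; with Neighborhood Attention, each token scores only its k x k window. The sizes below are illustrative.

```python
H = W = 56   # feature map resolution (illustrative)
k = 7        # neighborhood size
tokens = H * W

global_scores = tokens * tokens  # every token attends to every token
na_scores = tokens * k * k       # every token attends to its k x k window

print(f"global self-attention:  {global_scores:,} scores")  # 9,834,496
print(f"neighborhood attention: {na_scores:,} scores")      # 153,664
print(f"reduction: {global_scores // na_scores}x")          # 64x
```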

Another advantage of Neighborhood Attention over SASA is that the receptive field size remains fixed even at the image borders, allowing the model to maintain a consistent sense of spatial awareness. This is particularly helpful when dealing with non-rigid objects, where handling rotations and translations of object parts consistently can improve the performance of the model.

Neighborhood Attention is a restricted self-attention pattern that can improve the accuracy of vision transformer models. It allows tokens to attend only to their neighboring pixels, reducing the risk of learning irrelevant features. While experimentation with both Neighborhood Attention and SASA has been challenging due to their computational cost, new tools such as NATTEN make the process faster and more efficient.

By utilizing Neighborhood Attention, vision transformer models can maintain an accurate sense of spatial awareness, which helps improve their performance when dealing with non-rigid objects.
