Introduction to NesT
NesT is a neural network architecture that is used for image recognition tasks. It has gained a lot of popularity due to its superior performance compared to other state-of-the-art networks such as ResNet and VGG. NesT stands for Nested Scale-Transformers, and it is built using a combination of transformer layers and "nesting" hierarchies.
How NesT Works
One of the unique features of NesT is that it conducts local self-attention on every image block independently, and then "nests" them hierarchically. This means that instead of processing the entire image at once, it breaks it down into smaller blocks and works on them individually. Coupling of processed information between spatially adjacent blocks is achieved through a proposed block aggregation between every two hierarchies. The overall hierarchical structure can be determined by two key hyper-parameters: patch size $S × S$ and number of block hierarchies $T_d$.
How NesT Layers Work
The layers used in NesT consist of canonical transformer layers. Each transformer layer is composed of a multi-head self-attention (MSA) layer followed by a feed-forward fully-connected network (FFN) with skip-connection and Layer normalization. Positional embeddings are added to encode spatial information before feeding into the block. Finally, a nested hierarchy with block aggregation is built -- every four spatially connected blocks are merged into one. All blocks inside each hierarchy share one set of parameters.
Training and Performance
When training NesT, each image is linearly projected to an embedding. All embeddings are partitioned to blocks and flattened to generate final input. The network is trained using backpropagation with a loss function that depends on the task at hand.
NesT has been shown to outperform other state-of-the-art networks such as ResNet and VGG on a variety of classification tasks, including the popular ImageNet dataset. It has achieved state-of-the-art performance on image classification, object detection, and segmentation tasks.
NesT is a powerful neural network architecture that has shown to be superior to other state-of-the-art networks. It is built using a combination of transformer layers and "nesting" hierarchies, allowing it to conduct local self-attention on every image block independently. This approach has proven to be very effective on a variety of image recognition tasks, achieving state-of-the-art results on datasets such as ImageNet.