At its core, visual question answering (VQA) is the task of answering natural-language questions about images. It is an important problem with applications in fields such as robotics and image search engines. VQA systems are trained on datasets of images annotated with question-answer pairs.
The Problem with Image-Based Attention
One approach to VQA uses image-based attention: the system focuses on a specific part of the image while answering the question, much as humans do. However, previous image-based attention systems have attended to regions that are only weakly correlated with the regions humans focus on when answering the same questions, and this mismatch limits their accuracy.
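To make the idea concrete, here is a minimal sketch of question-guided soft attention over image region features. The shapes, variable names, and dot-product scoring are illustrative assumptions, not the specific model from the paper:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def question_guided_attention(region_feats, q_feat):
    """Score each image region against the question embedding,
    then pool the regions with the resulting attention weights."""
    scores = region_feats @ q_feat      # (R,) relevance of each region
    weights = softmax(scores)           # attention distribution over regions
    attended = weights @ region_feats   # (D,) attention-weighted image feature
    return weights, attended

# Toy example: 14 region features of dimension 512 and a (hypothetical,
# precomputed) question embedding.
rng = np.random.default_rng(0)
regions = rng.normal(size=(14, 512))
q = rng.normal(size=512)
w, v = question_guided_attention(regions, q)
```

The attended feature `v` would then be combined with the question representation to predict an answer; the criticism above is that the learned `w` often peaks on regions humans would not look at.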
The Solution: Differential Attention
In the paper, the authors propose a new method, called differential attention, to improve VQA accuracy. It takes an exemplar-based approach: for a given image-question pair, the system retrieves supporting exemplars (similar examples) and opposing exemplars (dissimilar ones) and contrasts them to produce a differential attention region that is closer to human attention than the regions produced by other image-based attention methods.
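The contrast between exemplars can be sketched as follows. This is a loose illustration of the idea, not the paper's exact formulation (which trains the attention with a triplet-style objective): the target's attention is shifted toward what a supporting exemplar attends to and away from what an opposing exemplar attends to. The `alpha` weight and log-space mixing are assumptions made for the sketch:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention(region_feats, q_feat):
    # Plain question-guided attention over region features.
    return softmax(region_feats @ q_feat)

def differential_attention(target, supporting, opposing, q_feat, alpha=0.5):
    """Illustrative differential attention: sharpen the target's
    attention map toward the supporting exemplar's map and away
    from the opposing exemplar's map, then renormalize."""
    a_t = attention(target, q_feat)
    a_s = attention(supporting, q_feat)
    a_o = attention(opposing, q_feat)
    # Shift log-attention toward supporting, away from opposing.
    logits = np.log(a_t + 1e-12) + alpha * (np.log(a_s + 1e-12) - np.log(a_o + 1e-12))
    return softmax(logits)

# Toy example: the supporting exemplar is a perturbed copy of the
# target's regions; the opposing exemplar is unrelated.
rng = np.random.default_rng(1)
q = rng.normal(size=64)
t = rng.normal(size=(10, 64))
s = t + 0.1 * rng.normal(size=(10, 64))
o = rng.normal(size=(10, 64))
a_diff = differential_attention(t, s, o, q)
```

The output `a_diff` is still a valid attention distribution over the target's regions, but regions that the opposing exemplar would also attend to are suppressed.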
Differential attention matters because it helps the system focus on the image regions most relevant to the question, which improves accuracy. Evaluated on challenging benchmark datasets, the method outperforms other image-based attention approaches and is competitive with state-of-the-art methods that attend over both the image and the question.
Overall, the differential attention method proposed in the paper offers a promising way to improve VQA accuracy. By grounding attention in exemplars, the system attends to image regions more the way a human would, and the approach has the potential to carry over to the many fields where VQA is important.