Explanation vs Attention: A Two-Player Game to Obtain Attention for VQA

Visual Question Answering (VQA) is a challenging task that requires a machine to answer natural-language questions about images. A key ingredient in VQA systems is attention, which determines which parts of an image the model should focus on to answer a given question. Supervising attention directly is difficult, however, because ground-truth human attention annotations are scarce and expensive to collect. In this paper, the authors propose using visual explanations, obtained through class activation mappings, as a source of supervision to improve attention in VQA.
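
To make the notion of attention concrete, here is a minimal sketch of question-guided soft attention over image regions. This is a generic formulation in the spirit of attention-based VQA models, not the authors' exact architecture; all module and dimension names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    """Generic question-guided soft attention over K image regions.

    A sketch, not the paper's architecture: the question embedding
    scores each regional image feature, and a softmax over regions
    produces the attention map.
    """
    def __init__(self, img_dim, ques_dim, hidden_dim):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden_dim)
        self.ques_proj = nn.Linear(ques_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, img_feats, ques_emb):
        # img_feats: (batch, K, img_dim); ques_emb: (batch, ques_dim)
        joint = torch.tanh(self.img_proj(img_feats)
                           + self.ques_proj(ques_emb).unsqueeze(1))
        attn = F.softmax(self.score(joint).squeeze(-1), dim=1)   # (batch, K)
        attended = (attn.unsqueeze(-1) * img_feats).sum(dim=1)   # (batch, img_dim)
        return attended, attn
```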

The Role of Visual Explanations

Visual explanations obtained through class activation mappings (CAMs) highlight the image regions most responsible for a network's prediction. Since these explanations indicate which parts of an image were important for the network to correctly classify an object or answer a question, they can serve as a form of attention supervision. However, the distributions of attention maps and CAMs differ, so CAMs cannot simply be used as direct supervision targets.
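
As a reminder of how a basic CAM is computed (the paper builds on this family of explanation methods), here is a minimal sketch: the class activation map is a weighted sum of the final convolutional feature maps, using the linear classifier weights of the target class.

```python
import torch

def class_activation_map(feature_maps, fc_weights, class_idx):
    """Basic CAM: weight the final conv feature maps by the
    classifier weights of the target class.

    feature_maps: (C, H, W) activations from the last conv layer
    fc_weights:   (num_classes, C) weights of the linear classifier
                  that follows global average pooling
    class_idx:    index of the class to explain
    """
    w = fc_weights[class_idx]                       # (C,)
    cam = torch.einsum('c,chw->hw', w, feature_maps)
    cam = torch.relu(cam)                           # keep positive evidence only
    cam = cam / (cam.max() + 1e-8)                  # normalize to [0, 1]
    return cam
```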

The Proposed Solution

To address this issue, the authors propose a discriminator that distinguishes visual explanations from attention maps. The attention network and the discriminator are trained in an adversarial two-player game: the discriminator learns to tell the two apart, while the attention network learns to produce maps the discriminator cannot distinguish from explanations. This pulls the distribution of attention maps toward that of the visual explanations, with the goal of producing attention that is more closely related to human attention and improves VQA performance.
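
A minimal sketch of this adversarial objective follows. It is our own GAN-style simplification under assumed shapes (maps flattened to K region scores), not the authors' exact losses or discriminator architecture.

```python
import torch
import torch.nn as nn

K = 196  # assumed number of image regions per map

# Simple MLP discriminator: outputs probability that a map is an explanation.
discriminator = nn.Sequential(
    nn.Linear(K, 128), nn.ReLU(),
    nn.Linear(128, 1), nn.Sigmoid(),
)
bce = nn.BCELoss()

def discriminator_loss(expl_maps, attn_maps):
    # D is trained to output 1 for explanations, 0 for attention maps.
    real = discriminator(expl_maps)
    fake = discriminator(attn_maps.detach())  # don't backprop into attention here
    return bce(real, torch.ones_like(real)) + bce(fake, torch.zeros_like(fake))

def attention_adv_loss(attn_maps):
    # The attention network is trained to make its maps look like explanations.
    fake = discriminator(attn_maps)
    return bce(fake, torch.ones_like(fake))
```

In a training loop, this adversarial term would be added to the usual VQA answer-classification loss, alternating updates between the discriminator and the attention network as in standard GAN training.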

The Results

The authors observe that this two-player game approach to supervision yields attention maps that correlate more closely with human attention. Specifically, it gives a substantial improvement over baseline stacked attention network (SAN) models and a notable improvement in the rank-correlation metric against human attention maps on the VQA task. The method can also be combined with recent multimodal compact bilinear pooling (MCB) based methods, where it brings consistent gains. Comparisons with other ways of learning the attention maps, based on Correlation Alignment (CORAL), Maximum Mean Discrepancy (MMD), and Mean Squared Error (MSE) losses, show that the adversarial loss outperforms these alternatives.
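
For reference, one of the baseline losses the adversarial objective is compared against, Maximum Mean Discrepancy (MMD), can be sketched as below. This is a standard Gaussian-kernel MMD estimator, assumed here for illustration rather than taken from the paper; the kernel bandwidth `sigma` is a hypothetical choice.

```python
import torch

def gaussian_kernel(x, y, sigma=1.0):
    # Pairwise Gaussian kernel between rows of x and rows of y.
    dist_sq = torch.cdist(x, y) ** 2
    return torch.exp(-dist_sq / (2 * sigma ** 2))

def mmd_loss(attn_maps, expl_maps, sigma=1.0):
    """Biased MMD^2 estimate between attention maps and explanation
    maps, each of shape (batch, K): small when the two batches look
    like samples from the same distribution."""
    k_aa = gaussian_kernel(attn_maps, attn_maps, sigma).mean()
    k_ee = gaussian_kernel(expl_maps, expl_maps, sigma).mean()
    k_ae = gaussian_kernel(attn_maps, expl_maps, sigma).mean()
    return k_aa + k_ee - 2 * k_ae
```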

In addition, qualitative visualizations support the authors' hypothesis that attention maps improve under this form of supervision. Overall, the paper presents a novel approach to using visual explanations for attention supervision in VQA, resulting in improved performance and attention maps that are more closely related to human attention.
