Spatio-Temporal Attention LSTM

In human action recognition, each type of action generally depends on only a few specific kinematic joints. Furthermore, a single sequence may contain several actions performed over time. Motivated by these observations, Song et al. proposed a joint spatial and temporal attention network based on LSTM, called STA-LSTM, that adaptively finds discriminative features and key frames. The network combines a spatial attention sub-network, which selects important joints, with a temporal attention sub-network, which selects key frames.

What is STA-LSTM?

STA-LSTM is a joint spatiotemporal attention method used for human action recognition that focuses on important joints and keyframes. This method is designed to learn which kinematic joints are important for recognizing specific actions and to identify the essential frames for recognizing a sequence of actions. This network is composed of two attention-related components: the spatial attention sub-network and the temporal attention sub-network.

How does the Spatial Attention Sub-Network Work?

The spatial attention sub-network selects the joints that are most relevant for recognizing a specific action. At each time step, it computes one attention score per joint from the current input and the hidden state of the previous time step, using a set of learnable parameters. These scores are passed through a softmax function across joints, so the resulting attention weights are positive and sum to one. Each joint's input features are then scaled by its weight. The output of this sub-network is a set of attention-weighted input features that emphasize the relevant joints for action recognition.
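
As a rough illustration of the steps above, the following plain-Python sketch computes one score per joint from the current frame and the previous hidden state, normalizes the scores with a softmax, and scales each joint by its weight. The function name spatial_attention, the parameter names W_x, W_h, b and U, the tanh scoring layer, and all numeric values are assumptions made for this example; in a real model the parameters would be learned and the hidden state would come from an LSTM.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_attention(joints, h_prev, W_x, W_h, b, U):
    """Return (weights, weighted_joints) for one frame.

    joints: K joints, each a list of coordinates.
    h_prev: hidden state from the previous time step.
    W_x, W_h, b, U: hypothetical learnable parameters (assumed shapes).
    """
    x_flat = [c for joint in joints for c in joint]
    # Hidden scoring layer: tanh(W_x x_t + W_h h_{t-1} + b).
    hidden = [
        math.tanh(a + c + bias)
        for a, c, bias in zip(matvec(W_x, x_flat), matvec(W_h, h_prev), b)
    ]
    scores = matvec(U, hidden)   # one raw score per joint
    alpha = softmax(scores)      # weights are positive and sum to 1
    # Scale every coordinate of a joint by that joint's weight.
    weighted = [[a * c for c in joint] for a, joint in zip(alpha, joints)]
    return alpha, weighted


# Toy frame: 3 joints with 2 coordinates each (made-up values).
joints = [[0.5, 1.0], [2.0, -1.0], [0.0, 0.3]]
h_prev = [0.1, -0.2]
W_x = [[0.1, 0.0, 0.2, 0.0, -0.1, 0.3],
       [0.0, 0.2, -0.1, 0.1, 0.0, 0.0]]
W_h = [[0.5, 0.0], [0.0, 0.5]]
b = [0.0, 0.0]
U = [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]]

alpha, weighted = spatial_attention(joints, h_prev, W_x, W_h, b, U)
print("joint weights:", alpha)
print("weighted joints:", weighted)
```

The weighted joints keep the same shape as the input frame, so they can be passed to the main recognition network in place of the raw joint coordinates.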

How does the Temporal Attention Sub-Network Work?

The temporal attention sub-network selects the key frames that matter most for recognizing a sequence of actions. It follows a similar approach to the spatial attention sub-network, but produces a single attention score per frame rather than one per joint. This score is passed through a rectified linear unit (ReLU) function, which ensures non-negativity and facilitates optimization; unlike softmax weights, the resulting values are not normalized to sum to one. The output of this sub-network is a set of attention weights, one per frame, that indicate the importance of each frame for recognizing the entire sequence of actions.
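
The frame-weighting idea can be sketched in plain Python as follows. Each frame receives a non-negative weight from a ReLU, and the weights are then used to form a weighted sum of per-frame outputs. The function names temporal_weight and aggregate, the parameters w_x, w_h and bias, the use of a weighted sum to combine frames, and all numeric values are assumptions made for this example rather than details given in the text.

```python
def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def temporal_weight(x_t, h_prev, w_x, w_h, bias):
    """Non-negative importance weight for one frame.

    x_t: input features of the current frame.
    h_prev: hidden state from the previous time step.
    w_x, w_h, bias: hypothetical learnable parameters.
    """
    score = dot(w_x, x_t) + dot(w_h, h_prev) + bias
    return max(0.0, score)  # ReLU: negative scores become 0


def aggregate(frame_outputs, betas):
    """Weighted sum of per-frame output vectors (an assumed way to use the weights)."""
    dim = len(frame_outputs[0])
    return [
        sum(beta * out[i] for beta, out in zip(betas, frame_outputs))
        for i in range(dim)
    ]


# Toy sequence of 3 frames (made-up values). For simplicity the previous
# hidden state is fixed at zero instead of being produced by an LSTM.
frames = [[1.0, 0.0], [0.0, 2.0], [-1.0, 0.0]]
h_prev = [0.0]
w_x, w_h, bias = [1.0, 1.0], [0.5], 0.0

betas = [temporal_weight(x, h_prev, w_x, w_h, bias) for x in frames]

# Hypothetical per-frame outputs of the main network.
frame_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
summary = aggregate(frame_outputs, betas)

print("frame weights:", betas)      # [1.0, 2.0, 0.0]
print("sequence summary:", summary)  # [1.0, 2.0]
```

In this toy run the third frame gets a weight of zero, so it contributes nothing to the sequence summary, while the second frame counts twice as much as the first.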

Why is STA-LSTM Effective?

STA-LSTM is effective because it focuses on the parts of the input that are most relevant for recognizing specific actions. By learning which kinematic joints matter for each action and which frames are essential for recognizing a sequence of actions, the method improves classification accuracy over an LSTM without attention. Because uninformative joints and frames receive small weights, the attention mechanism also reduces the effect of noisy or irrelevant information in the input data.

In summary, STA-LSTM is a joint spatiotemporal attention method designed to improve human action recognition by highlighting the most relevant parts of the input data. Song et al. reported state-of-the-art performance on challenging action recognition datasets at the time of publication. The spatial attention sub-network focuses on the most important joints in the input data, while the temporal attention sub-network identifies the key frames for recognizing a sequence of actions. Together, these components improve classification accuracy and help reduce the impact of noise or irrelevant information in the input data.