Overview of Video-Audio-Text Transformer (VATT)
The Video-Audio-Text Transformer (VATT) is a framework for learning multimodal representations from unlabeled…
Overview of CTAL: Pre-Training Framework for Audio-and-Language Representations
CTAL is a pre-training framework for creating strong audio-and-language representations with a…