VL-BERT: A Game-Changing Approach to Visual-Linguistic Downstream Tasks
The advancements in natural language processing (NLP) and computer vision (CV) have…
What is LXMERT?
LXMERT (Learning Cross-Modality Encoder Representations from Transformers) is a model used for learning vision-and-language cross-modality representations. The…
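The core of a cross-modality encoder is attention that lets tokens from one modality (say, language) attend over features from the other (say, detected visual regions). The following is a minimal, illustrative sketch of single-head cross-attention in pure Python with toy vectors; it is not LXMERT's actual implementation, and all names and values here are made up for illustration.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(queries, keys, values):
    """Each query (e.g. a language token) attends over the other
    modality's keys/values (e.g. visual region features) and returns
    a weighted mix of the values."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [dot(q, k) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

# Toy example: 2 "language" tokens attending over 3 "visual" regions.
lang_q = [[1.0, 0.0], [0.0, 1.0]]
vis_k = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
vis_v = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
attended = cross_attention(lang_q, vis_k, vis_v)
```

In the toy example, the first language token is most similar to the first visual region, so its output leans toward that region's value vector. A real model would stack these layers, use learned projections and multiple heads, and interleave them with self-attention within each modality.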
Unified VLP: An Overview of the Unified Encoder-Decoder Model for General Vision-Language Pre-Training
The Unified VLP (Vision-Language Pre-training) model…
Understanding WenLan: A Cross-Modal Pre-Training Model
WenLan is a two-tower pre-training model proposed within the cross-modal contrastive learning framework. The…
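In a two-tower contrastive setup, separate image and text encoders produce embeddings, and training pulls each image toward its paired caption while pushing it away from the other captions in the batch. Below is a minimal pure-Python sketch of an InfoNCE-style contrastive loss over toy embeddings; it illustrates the general framework only, not WenLan's specific architecture or loss, and all values here are invented for illustration.

```python
import math

def l2_normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))

def contrastive_loss(img_embs, txt_embs, temperature=0.1):
    """InfoNCE-style loss: each image should score highest against its
    own paired text, with the batch's other texts as negatives."""
    loss = 0.0
    n = len(img_embs)
    for i in range(n):
        logits = [cosine(img_embs[i], t) / temperature for t in txt_embs]
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += -(logits[i] - log_denom)  # cross-entropy on the true pair
    return loss / n

# Toy batch: aligned image/text pairs. In a real two-tower model these
# embeddings would come from separate image and text encoders.
imgs = [[1.0, 0.0], [0.0, 1.0]]
txts = [[0.9, 0.1], [0.1, 0.9]]
loss_aligned = contrastive_loss(imgs, txts)
loss_shuffled = contrastive_loss(imgs, list(reversed(txts)))
```

Shuffling the captions breaks the pairing, so `loss_shuffled` comes out much larger than `loss_aligned`; that gap is the learning signal that shapes both towers.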