Modular vision-language models (Vision-LLMs) align pretrained image enco...
Vision-and-language (VL) models with separate encoders for each modality...
Current multimodal models, aimed at solving Vision and Language (V+L) ta...
Recent advances in NLP and information retrieval have given rise to a di...
Recent advances in multimodal vision and language modeling have predomin...
Question answering systems should help users to access knowledge on a br...
Current state-of-the-art approaches to cross-modal retrieval process tex...
Massively pre-trained transformer models are computationally expensive t...