Webpages have been a rich resource for language and vision-language task...
Webpages have been a rich, scalable resource for vision-language and lan...
We propose a self-supervised approach for learning to perform audio sour...
We introduce Mobile app Tasks with Iterative Feedback (MoTIF), a new dat...
Disentangled visual representations have largely been studied with gener...
In recent years, vision-language research has shifted to study tasks whi...
Current multilingual vision-language models either require a large numbe...
Shouldn't language and vision features be treated equally in vision-lang...