Being able to perceive the semantics and the spatial structure of the
en...
Composed image retrieval searches for a target image based on a multi-mo...
Pre-training has been adopted in a few of recent works for
Vision-and-La...
Most existing works in vision-and-language navigation (VLN) focus on eit...
Compared with expensive pixel-wise annotations, image-level labels make ...
Accuracy of many visiolinguistic tasks has benefited significantly from ...
Vision-and-Language Navigation (VLN) requires an agent to navigate in a
...
Vision-and-language navigation requires an agent to navigate through a r...