We propose MM-REACT, a system paradigm that integrates ChatGPT with a po...
The canonical approach to video captioning dictates a caption generation...
In this paper, we propose UNICORN, a vision-language (VL) model that uni...
Joint image-text embedding is the bedrock for most Vision-and-Language (...
We present a new algorithm that significantly improves the efficiency of...
This paper proposes KB-InfoBot -- a multi-turn dialogue agent which help...