We present Kosmos-2.5, a multimodal literate model for machine reading o...
Self-supervised pre-training techniques have achieved remarkable progres...
A creative image-and-text generative AI system mimics humans' extraordin...
We study the joint learning of image-to-text and text-to-image generatio...
Vision-Language Pre-training (VLP) aims to learn multi-modal representat...
We study joint learning of Convolutional Neural Network (CNN) and Transf...
Due to the compelling efficiency in retrieval and storage,
similarity-pr...
Video temporal action detection aims to temporally localize and recogniz...