We study how vision-language models trained on Internet-scale data can b...
Recent contrastive language image pre-training has led to learning highl...
By transferring knowledge from large, diverse, task-agnostic datasets, m...
Large foundation models can exhibit unique capabilities depending on the...
In this paper, we propose self-supervised training for video transformer...
A common strategy to video understanding is to incorporate spatial and m...
The prevalence of Internet of things (IoT) devices and abundance of sens...