Building joint representations across images and text is an essential st...
Image-language learning has made unprecedented progress in visual
unders...
The development of language models have moved from encoder-decoder to
de...
We present a simple approach which can turn a ViT encoder into an effici...
We present an effective method for fusing visual-and-language representa...
We present F-VLM, a simple open-vocabulary object detection method built...
Effective scaling and a flexible task interface enable large language mo...
We present a pre-training approach for vision and language transformer
m...
Video question answering is a challenging task that requires understandi...
We present Answer-Me, a task-aware multi-task framework which unifies a
...
We propose FindIt, a simple and versatile framework that unifies a varie...
We present 4D-Net, a 3D object detection approach, which utilizes 3D Poi...
In this paper we address the problem of automatically discovering atomic...
In this paper, we introduce a novel visual representation learning which...
In this paper we address the problem of automatically discovering atomic...
A common strategy to video understanding is to incorporate spatial and m...
Standard methods for video recognition use large CNNs designed to captur...
We create a family of powerful video models which are able to: (i) learn...
In this paper we propose an adversarial generative grammar model for fut...
Convolutional operations have two limitations: (1) do not explicitly mod...
We introduce a new public video dataset for action recognition: Anonymiz...
We present a new method to learn video representations from large-scale
...
Video understanding is a challenging problem with great impact on the
ab...
We present a visual imitation learning framework that enables learning o...
We present a new method to learn video representations from unlabeled da...
Learning to represent videos is a very challenging task both algorithmic...
Injuries are a major cost in sports. Teams spend millions of dollars eve...
This paper proposes a novel algorithm which learns a formal regular gram...
In this paper, we present a new method for evolving video CNN models to ...
In this paper, we propose a convolutional layer inspired by optical flow...
In this paper, we propose a method to learn a joint multimodal embedding...
Learning to control robots directly based on images is a primary challen...
In this paper, we introduce a challenging new dataset, MLB-YouTube, desi...
In this paper, we introduce a new convolutional layer named the Temporal...
In this paper, we introduce the concept of learning latent
super-events ...
In this paper, we newly introduce the concept of temporal attention filt...