Vision and Language Models (VLMs), such as CLIP, have enabled visual rec...
Vision and Language (VL) models offer an effective method for aligning r...
Recently, large-scale pre-trained Vision and Language (VL) models have s...
Recent models such as XLS-R and Whisper have made multilingual speech te...
Vision and Language (VL) models have demonstrated remarkable zero-shot p...
Large-scale pre-trained Vision Language (VL) models have shown remar...
Spatio-temporal grounding describes the task of localizing events in spa...
Building object detectors that are robust to domain shifts is critical f...
Large-scale Vision-Language (VL) models have shown tremendous success in...
Prompt tuning, in which a base pretrained model is adapted to each task ...
Scaling transformers has led to significant breakthroughs in many domain...
Pre-training is an effective technique for ensuring robust performance o...
Learning image representations using synthetic data allows training neur...
While transformers have greatly boosted performance in semantic segmenta...
Computer vision models suffer from a phenomenon known as catastrophic fo...
Recently, large-scale pre-trained Vision-and-Language (VL) foundation mo...
Multilingual text-video retrieval methods have improved significantly in...
Foundation Models (FMs) have demonstrated unprecedented capabilities inc...
Designing better machine translation systems by considering auxiliary in...
Existing work on VQA explores data augmentation to achieve better genera...
Multi-modal learning from video data has seen increased attention recent...
The ability to generalize learned representations across significantly d...
In this paper, we explore self-supervised audio-visual models that learn...
Deep convolutional networks have recently achieved great success in vide...
Generalization to out-of-distribution data has been a problem for Visual...
The self-attention-based transformer model has recently become the le...
Most existing works in few-shot learning rely on meta-learning the netwo...
Multi-modal learning, which focuses on utilizing various modalities to i...
When people observe events, they are able to abstract key information an...
Multimodal self-supervised learning is getting more and more attention a...
Nowadays, there is an abundance of data involving images and surrounding...
Tremendous progress has been made in visual representation learning, not...
Network quantization has rapidly become one of the most widely used meth...
Performing inference on deep learning models for videos remains a challe...
Temporal modelling is the key to efficient video action recognition. Wh...
Learning to recognize actions from only a handful of labeled videos is a...
As machine learning algorithms grow in popularity and diversify to many ...
Few-shot learning methods offer pre-training techniques optimized for ea...
Partial domain adaptation, which assumes that the unknown target label sp...
Existing Visual Question Answering (VQA) models are often fragile and se...
Neural Architecture Search (NAS) is a powerful tool to automatically des...
In recent years, a number of approaches based on 2D CNNs and 3D CNNs hav...
Identifying common patterns among events is a key ability in human and m...
Supervised deep learning methods are enjoying enormous success in many p...
Action recognition is an open and challenging problem in computer vision...
Data augmentation is one of the most important tools in training modern ...
Neural Architecture Search (NAS) is an open and challenging problem in m...
Current methods for learning visually grounded language from videos ofte...
In this paper, we propose a new few-shot learning method called StarNet,...