Referring image segmentation (RIS) aims to find a segmentation mask give...
Audiovisual automatic speech recognition (AV-ASR) aims to improve the
ro...
Vision-language (VL) pre-training has recently gained much attention for...
Existing works on open-vocabulary semantic segmentation have utilized
la...
In this work, we introduce Vid2Seq, a multi-modal single-stage dense eve...
In this report, we describe our submission to the Ego4D AudioVisual (AV)...
Audio-visual automatic speech recognition (AV-ASR) is an extension of AS...
A major challenge in text-video and text-audio retrieval is the lack of
...
Recent video and language pretraining frameworks lack the ability to gen...
While most conversational AI systems focus on textual dialogue only,
con...
Human ratings are currently the most accurate way to assess the quality ...
We introduce a novel stochastic regularization technique for deep neural...
We propose a generic framework to calibrate accuracy and confidence (sco...
Image geolocalization is the task of identifying the location depicted i...
Semantic correspondence is the problem of establishing correspondences a...
Visual dialog is a task of answering a series of inter-dependent questio...
We present a framework to analyze various aspects of models for video
qu...
We propose a novel attention model that can accurately attend to target
...
We tackle image question answering (ImageQA) problem by learning a
convo...