Audio visual segmentation (AVS) aims to segment the sounding objects for...
Domain shifts such as sensor type changes and geographical situation var...
Vehicle-to-everything (V2X) autonomous driving opens up a promising dire...
Audio-visual navigation is an audio-targeted wayfinding task where a rob...
When hearing music, it is natural for people to dance to its rhythm. Aut...
We introduce LyricWhiz, a robust, multilingual, and zero-shot automatic ...
In the era of extensive intersection between art and Artificial Intellig...
The robustness of deep neural networks (DNNs) is crucial to the hosting ...
This paper presents a DETR-based method for cross-domain weakly supervis...
With the prevalence of multimodal learning, camera-LiDAR fusion has gain...
This paper proposes a novel, abstraction-based, certified training metho...
Open-vocabulary object detection aims to provide object detectors traine...
One-to-one matching is a crucial design in DETR-like object detection fr...
Snapshot isolation (SI) is a prevalent weak isolation level that avoids ...
Monocular 3D lane detection is a challenging task due to its lack of dep...
3D object detection from multi-view images has drawn much attention over...
We present a simple yet effective end-to-end Video-language Pre-training...
Recently, Vehicle-to-Everything (V2X) cooperative perception has attracte...
Music is essential when editing videos, but selecting music manually is ...
The robustness of neural networks is fundamental to the hosting system's...
Reachability analysis is a promising technique to automatically prove or...
As one of the fundamental functions of an autonomous driving system, freesp...
Existing methods for human mesh recovery mainly focus on single-view fra...
The robustness of deep neural networks is crucial to modern AI-enabled s...
Human pose estimation aims to accurately estimate a wide variety of huma...
Panoptic Narrative Grounding (PNG) is an emerging task whose goal is to ...
Vision-language navigation is the task of directing an embodied agent to...
Conventional knowledge distillation (KD) methods for object detection ma...
Referring video object segmentation aims to predict foreground labels fo...
Vision-and-language Navigation (VLN) task requires an embodied agent to ...
3D visual grounding aims to locate the referred target object in 3D poin...
Multi-object Tracking (MOT) generally can be split into two sub-tasks, i...
The task of Human-Object Interaction (HOI) detection could be divided in...
In this paper, we present a novel Distribution-Aware Single-stage (DAS) ...
In this work, we address the task of video background music generation. ...
3D object detection with LiDAR point clouds plays an important role in a...
Two-stage methods have dominated Human-Object Interaction (HOI) detectio...
Recently proposed fine-grained 3D visual grounding is an essential and c...
In this paper, we are tackling the weakly-supervised referring expressio...
In this paper, we address the makeup transfer and removal tasks simultan...
Vision and language understanding techniques have achieved remarkable pr...
Given a natural language expression and an image/video, the goal of refe...
Language-queried video actor segmentation aims to predict the pixel-leve...
To address the challenging task of instance-aware human part parsing, a ...
In recent years, knowledge distillation has been proved to be an effecti...
Video relation detection problem refers to the detection of the relation...
Learning to capture dependencies between spatial positions is essential ...
When describing an image, reading text in the visual scene is crucial to...
In this work, we introduce a novel task - Human-centric Spatio-Temporal V...
Referring image segmentation aims to predict the foreground mask of the ...