The recent advances in Convolutional Neural Networks (CNNs) and Vision
T...
A reliable and comprehensive evaluation metric that aligns with manual
p...
We introduce a new conversation head generation benchmark for synthesizi...
Multimodal fusion integrates the complementary information present in
mu...
Dynamically synthesizing talking speech that actively responds to a list...
Passage ranking involves two stages: passage retrieval and passage
re-ra...
Recent advances in 3D scene representation and novel view synthesis have...
Deep neural networks (DNNs) usually fail to generalize well to outside o...
Recent advances in text-to-image generation have witnessed the rise of
d...
Video temporal dynamics is conventionally modeled with 3D spatial-tempor...
In this paper, we propose a novel deep architecture tailored for 3D poin...
The recent advances in deep learning predominantly construct models in t...
Recent progress in 2D object detection has featured Cascade RCNN, which
...
Outlier detection tasks have been playing a critical role in AI safety. ...
Human face images usually appear with a wide range of visual scales. The
e...
The adaptation of a Generative Adversarial Network (GAN) aims to transfer a
...
Deep prompt tuning (DPT) has gained great success in most natural langua...
Visual question answering is an important task in both natural language ...
Multi-scale learning frameworks have been regarded as a capable class of...
Multi-scale Vision Transformer (ViT) has emerged as a powerful backbone ...
Prior works have proposed several strategies to reduce the computational...
Leveraging large volumes of web videos paired with the searched que...
Motion, as the unique property of video, has been critical to the developme...
Comprehending the rich semantics in an image and ordering them in lingui...
Convolutional Neural Networks (CNNs) have been regarded as the go-to mod...
Recent high-performing Human-Object Interaction (HOI) detection techniqu...
This paper presents an overview and comparative analysis of our systems
...
Vision Transformer (ViT) has become a leading tool in various computer v...
Motion, as the most distinct phenomenon in a video to involve the change...
Human actions are typically of combinatorial structures or patterns, i.e...
Vision-language pre-training has been an emerging and fast-developing
re...
Live video broadcasting normally requires a multitude of skills and expe...
Video content is multifaceted, consisting of objects, scenes, interactio...
Video is complex due to large variations in motion and rich content in
f...
It is not trivial to optimally learn 3D Convolutional Neural Networks ...
Our work reveals a structured shortcoming of the existing mainstream
sel...
Mainstream state-of-the-art domain generalization algorithms tend to
pri...
Self-supervised learning (SSL) has recently become the favorite among fe...
BERT-type structure has led to the revolution of vision-language pre-tra...
Localizing text instances in natural scenes is regarded as a fundamental...
With the rise and development of deep learning over the past decade, the...
Unsupervised learning is just at a tipping point where it could really t...
Transformer with self-attention has led to the revolutionizing of natura...
Despite having impressive vision-language (VL) pretraining with BERT-bas...
With the development of deep learning techniques and large scale dataset...
This paper explores useful modifications of the recent development in
co...
With the knowledge of action moments (i.e., trimmed video clips that eac...
A steady momentum of innovations and breakthroughs has convincingly push...
Single shot detectors that are potentially faster and simpler than two-s...
In this work, we present Auto-captions on GIF, which is a new large-scal...