This paper introduces the HumTrans dataset, which is publicly available ...
Background music (BGM) can enhance a video's emotion. However, selecti...
Synthesizing realistic videos according to a given speech is still an op...
This paper presents a LoRA-free method for stylized image generation tha...
Single-speaker singing voice synthesis (SVS) usually underperforms a...
Transfer learning has become crucial in computer vision tasks due to the...
Reconstructing 3D objects from extremely sparse views is a long-standing...
Text-to-music generation (T2M-Gen) faces a major obstacle due to the sca...
Though the success of CLIP-based training recipes in vision-language mod...
Recently, text-to-image generation has exhibited remarkable advancements...
Omnidirectional images (ODIs) have become increasingly popular, as their...
The demand for efficient 3D model generation techniques has grown expone...
We present SEED, an elaborate image tokenizer that empowers Large Langua...
Rendering photorealistic and dynamically moving human heads is crucial f...
3D facial avatar reconstruction has been a significant research topic in...
Image super-resolution (SR) with generative adversarial networks (GAN) h...
Despite the ability of existing large-scale text-to-image (T2I) models t...
Given sparse views of an object, estimating their camera poses is a long...
This paper introduces DreamDiffusion, a novel method for generating high...
Detecting adversarial samples that are carefully crafted to fool the mod...
Adversarial training is one of the best-performing methods in improving ...
Scene Graph Generation (SGG) aims to structurally and comprehensively re...
Visual foundation models like CLIP excel in learning feature representat...
Enhancing AI systems to perform tasks following human instructions can s...
Stickers have become a ubiquitous part of modern-day communication, conv...
As an important and challenging problem in computer vision, PAnoramic Se...
Creating a vivid video from the event or scenario in our imagination is ...
Exquisite demand exists for customizing the pretrained large text-to-ima...
Public large-scale text-to-image diffusion models, such as Stable Diffus...
The ultimate goal for foundation models is realizing task-agnostic, i.e....
Existing models for named entity recognition (NER) are mainly based on l...
We empirically investigate proper pre-training methods to build good vis...
We study the generation of novel views of indoor scenes given sparse input vie...
Foundation models have achieved great advances in multi-task learning wi...
We introduce HOSNeRF, a novel 360° free-viewpoint rendering method that r...
Online reconstruction and rendering of large-scale indoor scenes is a lo...
Despite the success in large-scale text-to-image generation and text-con...
The main challenge in domain generalization (DG) is to handle the distri...
Tags are pivotal in facilitating the effective distribution of multimedi...
We present DreamAvatar, a text-and-shape guided framework for generating...
This paper proposes an anchor-based deformation model, namely AnchorDEF,...
In this paper, we study masked autoencoder (MAE) pretraining on videos f...
Recently, diffusion models have achieved great success in image synthesi...
The state of the art in vision-language pretraining (VLP) achieves exem...
This paper presents a novel approach for estimating human body shape and...
Depth estimation from a monocular 360° image is a burgeoning problem owin...
We present a simple yet effective method for skeleton-free motion retarg...
Good motion retargeting cannot be achieved without reasonable considera...
Large-scale embedding-based retrieval (EBR) is the cornerstone of search...
The incredible generative ability of large-scale text-to-image (T2I) mod...