Direct speech-to-speech translation (S2ST) with discrete self-supervised...
Recently, there has been a growing interest in the field of controllable...
3D scene understanding has gained significant attention due to its wide ...
3D visual grounding aims to localize the target object in a 3D point clo...
In the field of music information retrieval (MIR), cover song identifica...
3D visual grounding involves finding a target object in a 3D scene that ...
Most sign language translation (SLT) methods to date require the use of ...
Zero-shot text-to-speech aims at synthesizing voices with unseen speech ...
Unsupervised text style transfer task aims to rewrite a text into target...
Speech Recognition builds a bridge between the multimedia streaming (aud...
Scaling text-to-speech to a large and wild dataset has been proven to be...
We are interested in a novel task, namely low-resource text-to-talking a...
Diffusion models have demonstrated impressive performance in text-to-ima...
Various applications of voice synthesis have been developed independentl...
Large diffusion models have been successful in text-to-audio (T2A) synth...
Direct speech-to-speech translation (S2ST) aims to convert speech from o...
Stutter removal is an essential scenario in the field of speech editing....
Multi-modal Contrastive Representation (MCR) learning aims to encode dif...
Text-to-speech (TTS) has undergone remarkable improvements in performance...
Speech-to-SQL (S2SQL) aims to convert spoken questions into SQL queries ...
Improving text representation has attracted much attention to achieve ex...
We are interested in a challenging task, Realistic-Music-Score based Sin...
The speech-to-singing (STS) voice conversion task aims to generate singi...
Building benchmarks to systematically analyze different capabilities of vi...
Generating talking person portraits with arbitrary speech audio is a cru...
Large language models (LLMs) have exhibited remarkable capabilities acro...
Set-based face recognition (SFR) aims to recognize the face sets in the ...
Product Retrieval (PR) and Grounding (PG), aiming to seek image and obje...
Listening to long video/audio recordings from video conferencing and onl...
ICASSP2023 General Meeting Understanding and Generation Challenge (MUG) ...
Multi-media communications facilitate global interaction among people. H...
Diffusion models have recently exhibited remarkable abilities to synthes...
Generating photo-realistic video portraits with arbitrary speech audio is...
Large-scale multimodal generative modeling has created milestones in tex...
Diffusion Probabilistic Models (DPMs) have shown a powerful capacity of ...
Out-of-distribution (OOD) detection is an important task to ensure the r...
Video to sound generation aims to generate realistic and natural sound g...
Multi-modal video question answering aims to predict the correct answer and ...
In this paper, we introduce a new task, spoken video grounding (SVG), wh...
The task of argument mining aims to detect all possible argumentative co...
Micro-video recommender systems suffer from the ubiquitous noises in use...
Effectively representing users lies at the core of modern recommender sys...
Video Object Grounding (VOG) is the problem of associating spatial objec...
Denoising diffusion probabilistic models (DDPMs) have recently achieved ...
Unconstrained lip-to-speech synthesis aims to generate corresponding spe...
In recent days, streaming technology has greatly promoted the developmen...
Polyphone disambiguation aims to capture accurate pronunciation knowledg...
Direct speech-to-speech translation (S2ST) systems leverage recent progr...
Style transfer for out-of-domain (OOD) speech synthesis aims to generate...
The recent progress in non-autoregressive text-to-speech (NAR-TTS) has m...