Zhou Zhao

research

∙ 09/14/2023

Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer

Direct speech-to-speech translation (S2ST) with discrete self-supervised...

0 Yongqi Wang, et al. ∙

research

∙ 08/28/2023

TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models

Recently, there has been a growing interest in the field of controllable...

0 Shengpeng Ji, et al. ∙

research

∙ 08/17/2023

Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes

3D scene understanding has gained significant attention due to its wide ...

0 Zehan Wang, et al. ∙

research

∙ 07/25/2023

3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding

3D visual grounding aims to localize the target object in a 3D point clo...

0 Zehan Wang, et al. ∙

research

∙ 07/19/2023

DisCover: Disentangled Music Representation Learning for Cover Song Identification

In the field of music information retrieval (MIR), cover song identifica...

0 Jiahao Xun, et al. ∙

research

∙ 07/18/2023

Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding

3D visual grounding involves finding a target object in a 3D scene that ...

0 Zehan Wang, et al. ∙

research

∙ 07/14/2023

Gloss Attention for Gloss-free Sign Language Translation

Most sign language translation (SLT) methods to date require the use of ...

0 Aoxiong Yin, et al. ∙

research

∙ 07/14/2023

Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts

Zero-shot text-to-speech aims at synthesizing voices with unseen speech ...

0 Ziyue Jiang, et al. ∙

research

∙ 06/12/2023

MSSRNet: Manipulating Sequential Style Representation for Unsupervised Text Style Transfer

Unsupervised text style transfer task aims to rewrite a text into target...

0 Yazheng Yang, et al. ∙

research

∙ 06/10/2023

OpenSR: Open-Modality Speech Recognition via Maintaining Multi-Modality Alignment

Speech Recognition builds a bridge between the multimedia streaming (aud...

0 Xize Cheng, et al. ∙

research

∙ 06/06/2023

Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias

Scaling text-to-speech to a large and wild dataset has been proven to be...

0 Ziyue Jiang, et al. ∙

research

∙ 06/06/2023

Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis

We are interested in a novel task, namely low-resource text-to-talking a...

0 Zhenhui Ye, et al. ∙

research

∙ 06/04/2023

Detector Guidance for Multi-Object Text-to-Image Generation

Diffusion models have demonstrated impressive performance in text-to-ima...

0 Luping Liu, et al. ∙

research

∙ 05/30/2023

Make-A-Voice: Unified Voice Synthesis With Discrete Representation

Various applications of voice synthesis have been developed independentl...

0 Rongjie Huang, et al. ∙

research

∙ 05/29/2023

Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation

Large diffusion models have been successful in text-to-audio (T2A) synth...

0 Jiawei Huang, et al. ∙

research

∙ 05/24/2023

AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation

Direct speech-to-speech translation (S2ST) aims to convert speech from o...

0 Rongjie Huang, et al. ∙

research

∙ 05/23/2023

FluentSpeech: Stutter-Oriented Automatic Speech Editing with Context-Aware Diffusion Models

Stutter removal is an essential scenario in the field of speech editing....

0 Ziyue Jiang, et al. ∙

research

∙ 05/22/2023

Connecting Multi-modal Contrastive Representations

Multi-modal Contrastive Representation (MCR) learning aims to encode dif...

0 Zehan Wang, et al. ∙

research

∙ 05/22/2023

ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer

Text-to-speech (TTS) has undergone remarkable improvements in performance...

0 Huadai Liu, et al. ∙

research

∙ 05/21/2023

Wav2SQL: Direct Generalizable Speech-To-SQL Parsing

Speech-to-SQL (S2SQL) aims to convert spoken questions into SQL queries ...

0 Huadai Liu, et al. ∙

research

∙ 05/18/2023

CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-training

Improving text representation has attracted much attention to achieve ex...

0 Zhenhui Ye, et al. ∙

research

∙ 05/18/2023

RMSSinger: Realistic-Music-Score based Singing Voice Synthesis

We are interested in a challenging task, Realistic-Music-Score based Sin...

0 Jinzheng He, et al. ∙

research

∙ 05/08/2023

AlignSTS: Speech-to-Singing Conversion via Cross-Modal Alignment

The speech-to-singing (STS) voice conversion task aims to generate singi...

0 Ruiqi Li, et al. ∙

research

∙ 05/04/2023

ANetQA: A Large-scale Benchmark for Fine-grained Compositional Reasoning over Untrimmed Videos

Building benchmarks to systemically analyze different capabilities of vi...

0 Zhou Yu, et al. ∙

research

∙ 05/01/2023

GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking Face Generation

Generating talking person portraits with arbitrary speech audio is a cru...

8 Zhenhui Ye, et al. ∙

research

∙ 04/25/2023

AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

Large language models (LLMs) have exhibited remarkable capabilities acro...

7 Rongjie Huang, et al. ∙

research

∙ 04/13/2023

Set-Based Face Recognition Beyond Disentanglement: Burstiness Suppression With Variance Vocabulary

Set-based face recognition (SFR) aims to recognize the face sets in the ...

0 Jiong Wang, et al. ∙

research

∙ 04/07/2023

DATE: Domain Adaptive Product Seeker for E-commerce

Product Retrieval (PR) and Grounding (PG), aiming to seek image and obje...

0 Haoyuan Li, et al. ∙

research

∙ 03/24/2023

MUG: A General Meeting Understanding and Generation Benchmark

Listening to long video/audio recordings from video conferencing and onl...

5 Qinglin Zhang, et al. ∙

research

∙ 03/24/2023

Overview of the ICASSP 2023 General Meeting Understanding and Generation Challenge (MUG)

ICASSP2023 General Meeting Understanding and Generation Challenge (MUG) ...

0 Qinglin Zhang, et al. ∙

research

∙ 03/09/2023

MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition

Multi-media communications facilitate global interaction among people. H...

0 Xize Cheng, et al. ∙

research

∙ 02/05/2023

ShiftDDPMs: Exploring Conditional Diffusion Models by Shifting Diffusion Trajectories

Diffusion models have recently exhibited remarkable abilities to synthes...

0 Zijian Zhang, et al. ∙

research

∙ 01/31/2023

GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis

Generating photo-realistic video portrait with arbitrary speech audio is...

3 Zhenhui Ye, et al. ∙

research

∙ 01/30/2023

Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models

Large-scale multimodal generative modeling has created milestones in tex...

1 Rongjie Huang, et al. ∙

research

∙ 12/26/2022

Unsupervised Representation Learning from Pre-trained Diffusion Probabilistic Models

Diffusion Probabilistic Models (DPMs) have shown a powerful capacity of ...

0 Zijian Zhang, et al. ∙

research

∙ 11/21/2022

Diffusion Denoising Process for Perceptron Bias in Out-of-distribution Detection

Out-of-distribution (OOD) detection is an important task to ensure the r...

0 Luping Liu, et al. ∙

research

∙ 11/19/2022

VarietySound: Timbre-Controllable Video to Sound Generation via Unsupervised Information Disentanglement

Video to sound generation aims to generate realistic and natural sound g...

0 Chenye Cui, et al. ∙

research

∙ 09/08/2022

Frame-Subtitle Self-Supervision for Multi-Modal Video Question Answering

Multi-modal video question answering aims to predict correct answer and ...

0 Jiong Wang, et al. ∙

research

∙ 09/01/2022

Video-Guided Curriculum Learning for Spoken Video Grounding

In this paper, we introduce a new task, spoken video grounding (SVG), wh...

0 Yan Xia, et al. ∙

research

∙ 08/20/2022

AntCritic: Argument Mining for Free-Form and Visually-Rich Financial Comments

The task of argument mining aims to detect all possible argumentative co...

0 fcq, et al. ∙

research

∙ 08/17/2022

CCL4Rec: Contrast over Contrastive Learning for Micro-video Recommendation

Micro-video recommender systems suffer from the ubiquitous noises in use...

0 Shengyu Zhang, et al. ∙

research

∙ 08/17/2022

Re4: Learning to Re-contrast, Re-attend, Re-construct for Multi-interest Recommendation

Effectively representing users lies at the core of modern recommender sys...

0 Shengyu Zhang, et al. ∙

research

∙ 08/11/2022

HERO: HiErarchical spatio-tempoRal reasOning with Contrastive Action Correspondence for End-to-End Video Object Grounding

Video Object Grounding (VOG) is the problem of associating spatial objec...

0 Mengze Li, et al. ∙

research

∙ 07/13/2022

ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech

Denoising diffusion probabilistic models (DDPMs) have recently achieved ...

0 Rongjie Huang, et al. ∙

research

∙ 07/08/2022

FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis

Unconstrained lip-to-speech synthesis aims to generate corresponding spe...

0 Yongqi Wang, et al. ∙

research

∙ 06/10/2022

AntPivot: Livestream Highlight Detection via Hierarchical Attention Mechanism

In recent days, streaming technology has greatly promoted the developmen...

0 fcq, et al. ∙

research

∙ 06/05/2022

Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech

Polyphone disambiguation aims to capture accurate pronunciation knowledg...

0 Ziyue Jiang, et al. ∙

research

∙ 05/25/2022

TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation

Direct speech-to-speech translation (S2ST) systems leverage recent progr...

0 Rongjie Huang, et al. ∙

research

∙ 05/15/2022

GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis

Style transfer for out-of-domain (OOD) speech synthesis aims to generate...

0 Rongjie Huang, et al. ∙

research

∙ 04/25/2022

SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech

The recent progress in non-autoregressive text-to-speech (NAR-TTS) has m...

0 Zhenhui Ye, et al. ∙
