This paper explores the imperative need and methodology for developing a...
Current speaker recognition systems primarily rely on supervised approac...
Text-based speech editing (TSE) techniques are designed to enable users ...
Prosodic phrasing is crucial to the naturalness and intelligibility of
e...
Speaker extraction and diarization are two crucial enabling techniques f...
Brain-inspired spiking neural networks (SNNs) have demonstrated great
po...
Target speaker extraction aims to extract the speech of a specific speak...
Conversational recommender systems (CRS) generate recommendations throug...
It is common in everyday spoken communication that we look at the turnin...
Objective: Conventional EEG-based auditory attention detection (AAD)
is ...
The identification of sensory cues associated with potential opportuniti...
Large Language Models (LLMs) provide a possibility to make a great
break...
Grammatical error correction aims to correct ungrammatical sentences
aut...
The remarkable capabilities of large-scale language models, such as Chat...
The identification of sensory cues associated with potential opportuniti...
The goal of Automatic Voice Over (AVO) is to generate speech in sync wit...
Representing visual data using compact binary codes is attracting increa...
The biological neural systems evolved to adapt to ecological environment...
Audio Deepfake Detection (ADD) aims to detect the fake audio generated b...
In this paper, we present HuatuoGPT, a large language model (LLM) for me...
Topic segmentation and outline generation strive to divide a document in...
Audio deepfake detection is an emerging topic in the artificial intellig...
Discourse parsing, the task of analyzing the internal rhetorical structu...
In active speaker detection (ASD), we would like to detect whether an
on...
Despite much success in natural language processing (NLP), pre-trained
l...
The use of Transformer represents a recent success in speech enhancement...
This paper presents our efforts to democratize ChatGPT across language. ...
Talking face generation, also known as speech-to-lip generation, reconst...
Chatbots are expected to be knowledgeable across multiple domains, e.g. ...
We present Relational Sentence Embedding (RSE), a new paradigm to furthe...
Model-based deep learning has achieved astounding successes due in part ...
The current lyrics transcription approaches heavily rely on supervised
l...
The speaker extraction technique seeks to single out the voice of a targ...
Most sentence embedding techniques heavily rely on expensive human-annot...
Self-supervised pre-training has been successful in both text and speech...
Neural network-based speaker recognition has achieved significant improv...
We study a novel neural architecture and its training strategies of spea...
Accented text-to-speech (TTS) synthesis seeks to generate speech with an...
Conversational Text-to-Speech (TTS) aims to synthesis an utterance with ...
Multimodal emotion recognition leverages complementary information acros...
Recent model-based reference-free metrics for open-domain dialogue evalu...
Emotional voice conversion (EVC) aims to convert the emotional state of ...
Dialogue summarization is abstractive in nature, making it suffer from
f...
Spiking neural networks (SNNs) are shown to be more biologically plausib...
Speech is the surface form of a finite set of phonetic units, which can ...
Output length is critical to dialogue summarization systems. The dialogu...
This technical report describes our system for track 1, 2 and 4 of the
V...
Accented text-to-speech (TTS) synthesis seeks to generate speech with an...
Audio and visual signals complement each other in human speech perceptio...
Emotional speech synthesis aims to synthesize human voices with various
...