Previous Multimodal Information based Speech Processing (MISP) challenge...
Existing approaches to modeling associations between visual stimuli and ...
The Multi-modal Information based Speech Processing (MISP) challenge aim...
Nonnegative Tucker Factorization (NTF) minimizes the Euclidean distance ...
Recently, self-supervised representation learning has drawn considerable
...
In Scene Graph Generation (SGG), coarse and fine predicates mix i...
The development of scene text recognition (STR) in the era of deep learn...
Consistent performance gains through exploring more effective network
st...
The Visual Information Extraction (VIE) task aims to extract key information...
Malicious application of deepfakes (i.e., technologies that can generate targ...
In this paper, we present AISHELL-4, a sizable real-recorded Mandarin sp...
Deep embedding based text-independent speaker verification has demonstra...
Recently, phase processing is attracting increasing interest in speech
en...
We show that an end-to-end deep learning approach can be used to recogni...