Never having seen an object and heard its sound simultaneously, can the ...
One primary topic of multi-modal learning is to jointly incorporate
hete...
Audio-Visual Question Answering (AVQA) task aims to answer questions abo...
We live in a world filled with never-ending streams of multimodal
inform...
Novel class discovery (NCD) aims to infer novel categories in an unlabel...
Audio question answering (AQA), acting as a widely used proxy task to ex...
Cross-modal distillation has been widely used to transfer knowledge acro...
Audio-visual learning helps to comprehensively understand the world by f...
The imbalance problem is widespread in the field of machine learning, wh...
Pre-training technique has gained tremendous success in enhancing model
...
Novel class discovery (NCD) aims to infer novel categories in an unlabel...
Sight and hearing are two senses that play a vital role in human
communi...
Both visual and auditory information are valuable to determine the salie...
Multimodal learning helps to comprehensively understand the world, by
in...
In this paper, we focus on the Audio-Visual Question Answering (AVQA) ta...
Recent years have witnessed the success of deep learning on the visual s...
Pre-training has been a popular learning paradigm in deep learning era,
...
The task of audio-visual sound source localization has been well studied...
Audiovisual scenes are pervasive in our daily life. It is commonplace fo...
Model web services provide an approach for implementing and facilitating...
File reading is the basis for data sharing and scientific computing. How...
Many current deep learning approaches make extensive use of backbone net...
Mutual knowledge distillation (MKD) improves a model by distilling knowl...
Unsupervised domain adaptation (UDA) methods for person re-identificatio...
There are rich synchronized audio and visual events in our daily life. I...
Temporal relational modeling in video is essential for human action
unde...
Fine-tuning deep neural networks pre-trained on large scale datasets is ...
Discriminatively localizing sounding objects in cocktail-party, i.e., mi...
How to visually localize multiple sound sources in unconstrained videos ...
Aerial scene recognition is a fundamental task in remote sensing and has...
Aerial scene recognition is a fundamental task in remote sensing and has...
Visual crowd counting has been recently studied as a way to enable peopl...
Associating sound and its producer in complex audiovisual scene is a
cha...
Visual-to-auditory sensory substitution devices can assist the blind in
...
Multiple modalities can provide more valuable information than single on...
The conventional supervised hashing methods based on classification do n...
The seen birds twitter, the running cars accompany with noise, people ta...
Image is usually taken for expressing some kinds of emotions or purposes...
With the increasing demand of massive multimodal data storage and
organi...