RECAP: Retrieval Augmented Music Captioner

12/21/2022
by Zihao He, et al.

With the prevalence of streaming media platforms that offer music search and recommendation, interpreting music by jointly understanding its audio and lyrics has become an important and challenging task. However, many previous works focus on refining individual components of the encoder-decoder architecture that maps music to caption tokens, and ignore the potential use of the correspondence between audio and lyrics. In this paper, we propose to explicitly learn the multi-modal alignment with retrieval augmentation through contrastive learning. By learning audio-lyrics correspondence, the model is guided to learn better cross-modal attention weights and thus to generate high-quality caption words. We provide both theoretical and empirical results that demonstrate the advantage of the proposed method.
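
To make the idea of learning audio-lyrics correspondence by contrastive learning concrete, the following is a minimal sketch of a generic symmetric InfoNCE-style loss over paired embeddings. It is an illustration under stated assumptions, not the paper's objective: the abstract does not give the loss, so the function name `audio_lyrics_contrastive_loss`, the temperature value, and the embedding shapes are all hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def audio_lyrics_contrastive_loss(audio_emb, lyrics_emb, temperature=0.07):
    # Hypothetical helper: row i of audio_emb and row i of lyrics_emb are
    # assumed to come from the same song (the positive pair); every other
    # row in the batch serves as a negative.
    a = l2_normalize(audio_emb)
    t = l2_normalize(lyrics_emb)
    logits = a @ t.T / temperature           # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])
    # Audio-to-lyrics direction: each audio clip should pick its own lyrics.
    loss_a2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Lyrics-to-audio direction: each lyric should pick its own audio clip.
    loss_t2a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a2t + loss_t2a)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(size=(4, 8))
    lyrics = audio + 0.01 * rng.normal(size=(4, 8))   # nearly aligned pairs
    print("aligned loss:", audio_lyrics_contrastive_loss(audio, lyrics))
    print("shuffled loss:", audio_lyrics_contrastive_loss(audio, lyrics[[1, 2, 3, 0]]))
```

A loss of this kind is small when matching audio and lyrics embeddings are closer to each other than to the other items in the batch, which is one common way to encourage the cross-modal alignment the abstract describes.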
