DisCover: Disentangled Music Representation Learning for Cover Song Identification

by Jiahao Xun, et al.
HUAWEI Technologies Co., Ltd.
Zhejiang University

In the field of music information retrieval (MIR), cover song identification (CSI) is a challenging task that aims to identify cover versions of a query song from a massive collection. Existing methods still suffer from high intra-song variance and inter-song correlation, because version-specific and version-invariant factors are entangled in their modeling. In this work, we aim to disentangle version-specific and version-invariant factors, which could make it easier for the model to learn invariant music representations for unseen query songs. We analyze the CSI task from a disentanglement view using a causal graph, and identify the intra-version and inter-version effects that bias invariant learning. To block these effects, we propose the disentangled music representation learning framework (DisCover) for CSI. DisCover consists of two critical components: (1) the Knowledge-guided Disentanglement Module (KDM) and (2) the Gradient-based Adversarial Disentanglement Module (GADM), which block the intra-version and inter-version biased effects, respectively. KDM minimizes the mutual information between the learned representations and version-variant factors that are identified with prior domain knowledge. GADM identifies version-variant factors by simulating the representation transitions between intra-song versions, and exploits adversarial distillation for effect blocking. Extensive comparisons with best-performing methods and in-depth analysis demonstrate the effectiveness of DisCover and the necessity of disentanglement for CSI.
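To make the retrieval setting concrete, the following is a minimal sketch of the generic CSI ranking step: given an embedding for the query and one per track in the collection, tracks are ordered by cosine similarity. It is not DisCover's model. The embeddings are hand-written toy vectors, and the names `cosine`, `rank_candidates`, and the track ids are hypothetical, introduced only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(query, collection):
    """Return track ids ordered from most to least similar to the query embedding."""
    scored = [(cosine(query, emb), track_id) for track_id, emb in collection.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [track_id for _, track_id in scored]


# Toy, hand-made embeddings (assumption): two versions of "song_a" point in a
# similar direction, while "song_b" points elsewhere.
collection = {
    "song_a_live": [0.9, 0.1, 0.0],
    "song_b_studio": [0.0, 1.0, 0.2],
    "song_a_acoustic": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]  # stands in for an unseen version of "song_a"

ranking = rank_candidates(query, collection)
print(ranking)
```

In this picture, version-specific factors that leak into the embeddings would push versions of the same song apart (intra-song variance) or pull different songs together (inter-song correlation), which is the failure mode the abstract attributes to entangled representations.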

