Improving Neural Cross-Lingual Summarization via Employing Optimal Transport Distance for Knowledge Distillation

by Thong Nguyen, et al.

Current state-of-the-art cross-lingual summarization models employ a multi-task learning paradigm, which works on a shared vocabulary module and relies on the self-attention mechanism to attend among tokens in two languages. However, the correlation learned by self-attention is often loose and implicit, and it captures crucial cross-lingual representations inefficiently. The problem worsens for languages with distinct morphological or structural features, which makes cross-lingual alignment more challenging and leads to a drop in performance. To overcome this problem, we propose a novel Knowledge-Distillation-based framework for Cross-Lingual Summarization that explicitly constructs cross-lingual correlation by distilling the knowledge of a monolingual summarization teacher into a cross-lingual summarization student. Since the representations of the teacher and the student lie in two different vector spaces, we further propose a Knowledge Distillation loss that uses Sinkhorn Divergence, an Optimal-Transport distance, to estimate the discrepancy between the teacher and student representations. Owing to the geometric nature of Sinkhorn Divergence, the student model can learn to align its cross-lingual hidden states with the teacher's monolingual hidden states, which yields a strong correlation between distant languages. Experiments on cross-lingual summarization datasets in pairs of distant languages demonstrate that our method outperforms state-of-the-art models under both high- and low-resource settings.
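To make the distillation loss concrete, the following is a minimal NumPy sketch of a Sinkhorn divergence between two sets of hidden states. It is an illustration under stated assumptions, not the authors' implementation: it assumes the teacher and student hidden states have already been brought to the same dimensionality, that every token carries uniform weight, and that the ground cost is half the squared Euclidean distance. The function names `entropic_ot` and `sinkhorn_divergence` and the parameters `eps` and `n_iters` are hypothetical.

```python
import numpy as np


def _logsumexp(a, axis):
    # Numerically stable log-sum-exp along one axis.
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))


def entropic_ot(x, y, eps=0.1, n_iters=200):
    """Entropy-regularized OT value between point clouds x (n, d) and y (m, d).

    Assumptions (not from the source): uniform token weights and a
    cost of 0.5 * squared Euclidean distance.
    """
    n, m = x.shape[0], y.shape[0]
    cost = 0.5 * np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    log_a = np.full(n, -np.log(n))  # log of uniform weights on x
    log_b = np.full(m, -np.log(m))  # log of uniform weights on y
    f = np.zeros(n)  # dual potential on x
    g = np.zeros(m)  # dual potential on y
    for _ in range(n_iters):
        # Log-domain Sinkhorn updates of the two dual potentials.
        f = -eps * _logsumexp(log_b[None, :] + (g[None, :] - cost) / eps, axis=1)
        g = -eps * _logsumexp(log_a[:, None] + (f[:, None] - cost) / eps, axis=0)
    # Dual objective under uniform weights.
    return float(np.mean(f) + np.mean(g))


def sinkhorn_divergence(student, teacher, eps=0.1, n_iters=200):
    """Debiased divergence: OT(s, t) - 0.5 * OT(s, s) - 0.5 * OT(t, t)."""
    return (
        entropic_ot(student, teacher, eps, n_iters)
        - 0.5 * entropic_ot(student, student, eps, n_iters)
        - 0.5 * entropic_ot(teacher, teacher, eps, n_iters)
    )
```

In a training setup, a term of this kind would be computed with an automatic-differentiation framework and added to the summarization loss so that gradients reach the student; the NumPy version here only evaluates the quantity.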


Learning Cross-Lingual IR from an English Retriever

We present a new cross-lingual information retrieval (CLIR) model traine...

Mutually-paced Knowledge Distillation for Cross-lingual Temporal Knowledge Graph Reasoning

This paper investigates cross-lingual temporal knowledge graph reasoning...

Empowering Dual-Encoder with Query Generator for Cross-Lingual Dense Retrieval

In monolingual dense retrieval, lots of works focus on how to distill kn...

D^2TV: Dual Knowledge Distillation and Target-oriented Vision Modeling for Many-to-Many Multimodal Summarization

Many-to-many multimodal summarization (M^3S) task aims to generate summa...

Multi-stage Distillation Framework for Cross-Lingual Semantic Similarity Matching

Previous studies have proved that cross-lingual knowledge distillation c...

ProKD: An Unsupervised Prototypical Knowledge Distillation Network for Zero-Resource Cross-Lingual Named Entity Recognition

For named entity recognition (NER) in zero-resource languages, utilizing...

Research on Multilingual News Clustering Based on Cross-Language Word Embeddings

Classifying the same event reported by different countries is of signifi...
