MixCE: Training Autoregressive Language Models by Mixing Forward and Reverse Cross-Entropies

05/26/2023
by Shiyue Zhang, et al.

Autoregressive language models are trained by minimizing the cross-entropy of the model distribution Q relative to the data distribution P; that is, minimizing the forward cross-entropy, which is equivalent to maximum likelihood estimation (MLE). We have observed that models trained in this way may "over-generalize", in the sense that they produce non-human-like text. Moreover, we believe that reverse cross-entropy, i.e., the cross-entropy of P relative to Q, is a better reflection of how a human would evaluate text generated by a model. Hence, we propose learning with MixCE, an objective that mixes the forward and reverse cross-entropies. We evaluate models trained with this objective on synthetic data settings (where P is known) and real data, and show that the resulting models yield better generated text without complex decoding strategies. Our code and models are publicly available at https://github.com/bloomberg/mixce-acl2023.
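To make the mixing concrete, the sketch below shows one way such an objective could be written in PyTorch. The abstract specifies only that forward and reverse cross-entropies are mixed; the mixing weight eta, the function name mixce_loss, and the single-sample surrogate for the reverse term (re-weighting each token's negative log-likelihood by the model's own detached probability of the gold token) are illustrative assumptions, not the paper's exact formulation, which can be found in the linked repository.

    import torch
    import torch.nn.functional as F

    def mixce_loss(logits, targets, eta=0.5, ignore_index=-100):
        """Illustrative mix of forward and (approximated) reverse cross-entropy.

        logits:  (batch, seq_len, vocab) unnormalized next-token scores from Q
        targets: (batch, seq_len) gold next-token ids drawn from P
        eta:     mixing weight; eta = 1.0 recovers plain MLE
        """
        log_q = F.log_softmax(logits, dim=-1)
        safe_targets = targets.clamp(min=0)  # avoid negative padding ids in gather
        # Forward term: per-token NLL, -log Q(x_t | x_<t)
        nll = -log_q.gather(-1, safe_targets.unsqueeze(-1)).squeeze(-1)
        mask = (targets != ignore_index).float()
        # Assumed single-sample surrogate for the reverse term: scale each
        # token's NLL by the model's own (detached) probability of the gold
        # token, so continuations Q already favors dominate the gradient.
        q_gold = nll.detach().neg().exp()  # Q(x_t | x_<t), no gradient flows here
        per_token = (eta + (1.0 - eta) * q_gold) * nll
        return (per_token * mask).sum() / mask.sum().clamp(min=1.0)

With eta = 1 this reduces to ordinary cross-entropy training; smaller values increasingly emphasize tokens the model itself assigns high probability, which is the self-reinforcing behavior a reverse cross-entropy term introduces in this sketch.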

Related research

06/09/2021 · Order-Agnostic Cross Entropy for Non-Autoregressive Machine Translation
We propose a new training objective named order-agnostic cross entropy (...

04/03/2020 · Aligned Cross Entropy for Non-Autoregressive Machine Translation
Non-autoregressive machine translation models significantly speed up dec...

09/21/2023 · The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"
We expose a surprising failure of generalization in auto-regressive larg...

09/01/2018 · Dual Conditional Cross-Entropy Filtering of Noisy Parallel Corpora
In this work we introduce dual conditional cross-entropy filtering for n...

05/17/2023 · FACE: Evaluating Natural Language Generation with Fourier Analysis of Cross-Entropy
Measuring the distance between machine-produced and human language is a ...

09/12/2023 · Stochastic LLMs do not Understand Language: Towards Symbolic, Explainable and Ontologically Based LLMs
In our opinion the exuberance surrounding the relative success of data-d...

09/20/2023 · Construction of Paired Knowledge Graph-Text Datasets Informed by Cyclic Evaluation
Datasets that pair Knowledge Graphs (KG) and text together (KG-T) can be...
