Where's the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation

by Benjamin Minixhofer, et al.

Many NLP pipelines split text into sentences as one of the crucial preprocessing steps. Prior sentence segmentation tools either rely on punctuation or require a considerable amount of sentence-segmented training data: both central assumptions might fail when porting sentence segmenters to diverse languages on a massive scale. In this work, we thus introduce a multilingual punctuation-agnostic sentence segmentation method, currently covering 85 languages, trained in a self-supervised fashion on unsegmented text, by making use of newline characters which implicitly perform segmentation into paragraphs. We further propose an approach that adapts our method to the segmentation in a given corpus by using only a small number (64-256) of sentence-segmented examples. The main results indicate that our method outperforms all the prior best sentence-segmentation tools by an average of 6.1% F1 points. Furthermore, we demonstrate that proper sentence segmentation has a point: the use of a (powerful) sentence segmenter makes a considerable difference for a downstream application such as machine translation (MT). By using our method to match sentence segmentation to the segmentation used during training of MT models, we achieve an average improvement of 2.3 BLEU points over the best prior segmentation tool, as well as massive gains over a trivial segmenter that splits text into equally sized blocks.
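To make the self-supervised signal described in the abstract concrete, the following minimal sketch shows one way newline characters could be turned into per-character boundary labels. It is an illustration only, not the authors' implementation: the function names `newline_labels` and `split_at_labels` are hypothetical, and the sketch covers only label construction, not the model that would be trained to predict these labels.

```python
from typing import List, Tuple


def newline_labels(text: str) -> Tuple[str, List[int]]:
    """Remove newlines from `text` and label the character before each one.

    Returns the newline-free text and a 0/1 label per remaining character,
    where 1 means "a newline followed this character in the raw text".
    """
    chars: List[str] = []
    labels: List[int] = []
    for ch in text:
        if ch == "\n":
            # Mark the previous kept character as a boundary.
            # A leading newline has no previous character and is skipped.
            if labels:
                labels[-1] = 1
        else:
            chars.append(ch)
            labels.append(0)
    return "".join(chars), labels


def split_at_labels(text: str, labels: List[int]) -> List[str]:
    """Split `text` after every character whose label is 1."""
    segments: List[str] = []
    start = 0
    for i, label in enumerate(labels):
        if label == 1:
            segments.append(text[start:i + 1])
            start = i + 1
    if start < len(text):
        segments.append(text[start:])  # trailing segment without a boundary
    return segments


raw = "First paragraph\nSecond one\nThird"
stripped, labels = newline_labels(raw)
segments = split_at_labels(stripped, labels)
```

In this sketch, a model would see only `stripped` as input and `labels` as targets, so no punctuation or hand-segmented data is required to obtain supervision; `split_at_labels` shows how predicted boundaries would be applied at inference time.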


A Sentence Meaning Based Alignment Method for Parallel Text Corpora Preparation

Text alignment is crucial to the accuracy of Machine Translation (MT) sy...

Cross-lingual Retrieval for Iterative Self-Supervised Training

Recent studies have demonstrated the cross-lingual alignment ability of ...

Subword Segmental Machine Translation: Unifying Segmentation and Target Sentence Generation

Subword segmenters like BPE operate as a preprocessing step in neural ma...

Unsupervised Bitext Mining and Translation via Self-trained Contextual Embeddings

We describe an unsupervised method to create pseudo-parallel corpora for...

Scalable Multilingual Frontend for TTS

This paper describes progress towards making a Neural Text-to-Speech (TT...

Exploring Segmentation Approaches for Neural Machine Translation of Code-Switched Egyptian Arabic-English Text

Data sparsity is one of the main challenges posed by Code-switching (CS)...
