RetVec: Resilient and Efficient Text Vectorizer

02/18/2023
by Elie Bursztein, et al.

This paper describes RetVec, a resilient multilingual embedding scheme designed for neural-based text processing, including small-text classification and large language models. RetVec combines a novel character encoding with an optional small model to embed words into a 256-dimensional vector space. These embeddings enable training competitive multilingual text models that are resilient to typos and adversarial attacks. We evaluate and compare RetVec to state-of-the-art tokenizers and word embeddings on common model architectures. These comparisons demonstrate that RetVec leads to competitive models that are significantly more resilient to text perturbations across a variety of common tasks. RetVec is available under the Apache 2 license at <https://github.com/[anonymized]>.
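
To illustrate why a character-level encoding can tolerate typos, here is a minimal sketch. It is not RetVec's actual encoding, which the abstract does not specify. The function names, the 21-bit width per character, and the 16-character cap are all assumptions made for illustration. Each character's Unicode code point is written out as bits, so a one-character substitution changes only a few entries of the vector, whereas a vocabulary lookup would map the misspelled word to an unrelated or unknown token.

```python
from typing import List

# Assumptions for illustration only (not taken from the paper):
BITS_PER_CHAR = 21  # enough bits to cover every Unicode code point
MAX_CHARS = 16      # words are truncated or zero-padded to this length


def encode_word(word: str,
                max_chars: int = MAX_CHARS,
                bits: int = BITS_PER_CHAR) -> List[int]:
    """Encode a word as a fixed-length binary vector of code-point bits."""
    vec: List[int] = []
    for i in range(max_chars):
        # Pad with code point 0 when the word is shorter than max_chars.
        code_point = ord(word[i]) if i < len(word) else 0
        # Emit the code point's bits, least significant bit first.
        vec.extend((code_point >> b) & 1 for b in range(bits))
    return vec


def hamming(a: List[int], b: List[int]) -> int:
    """Count the positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    clean = encode_word("hello")
    typo = encode_word("hallo")   # one-character substitution
    other = encode_word("world")  # a different word
    print(len(clean), hamming(clean, typo), hamming(clean, other))
```

In this sketch a typo stays close to the original word in the encoded space, while a different word lands further away. The optional small model mentioned in the abstract, which would map such an encoding to the 256-dimensional embedding, is not shown here.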

research
10/26/2017

ALL-IN-1: Short Text Classification with One Model for All Languages

We present ALL-IN-1, a simple model for multilingual text classification...
research
05/29/2019

Learning Multilingual Word Embeddings Using Image-Text Data

There has been significant interest recently in learning multilingual wo...
research
03/27/2019

Image search using multilingual texts: a cross-modal learning approach between image and text

Multilingual (or cross-lingual) embeddings represent several languages i...
research
03/15/2017

Character-based Neural Embeddings for Tweet Clustering

In this paper we show how the performance of tweet clustering can be imp...
research
11/09/2020

Text Classification through Glyph-aware Disentangled Character Embedding and Semantic Sub-character Augmentation

We propose a new character-based text classification framework for non-a...
research
04/06/2017

MRA - Proof of Concept of a Multilingual Report Annotator Web Application

MRA (Multilingual Report Annotator) is a web application that translates...
