Bootstrapping Text Anonymization Models with Distant Supervision

by Anthi Papadopoulou, et al.

We propose a novel method to bootstrap text anonymization models based on distant supervision. Instead of requiring manually labeled training data, the approach relies on a knowledge graph expressing the background information assumed to be publicly available about various individuals. This knowledge graph is employed to automatically annotate text documents including personal data about a subset of those individuals. More precisely, the method determines which text spans ought to be masked in order to guarantee k-anonymity, assuming an adversary with access to both the text documents and the background information expressed in the knowledge graph. The resulting collection of labeled documents is then used as training data to fine-tune a pre-trained language model for text anonymization. We illustrate this approach using a knowledge graph extracted from Wikidata and short biographical texts from Wikipedia. Evaluation results with a RoBERTa-based model and a manually annotated collection of 553 summaries showcase the potential of the approach, but also unveil a number of issues that may arise if the knowledge graph is noisy or incomplete. The results also illustrate that, contrary to most sequence labeling problems, the text anonymization task may admit several alternative solutions.
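The masking step described above can be pictured with a small sketch. This is not the authors' algorithm: it is a simple greedy heuristic under the assumptions that the knowledge graph is reduced to a mapping from each individual to a set of attribute strings, and that the spans detected in a document have already been linked to those attribute strings. The names `KG`, `SPANS`, `matching_individuals`, and `spans_to_mask`, as well as the toy data, are hypothetical and serve only as illustration.

```python
from typing import Dict, List, Set

# Toy background knowledge (invented for illustration): individual -> known attributes.
KG: Dict[str, Set[str]] = {
    "P1": {"physicist", "born in Oslo", "Nobel laureate"},
    "P2": {"physicist", "born in Oslo"},
    "P3": {"physicist", "born in Bergen"},
    "P4": {"painter", "born in Oslo"},
}

# Spans found in a document about P1, already linked to attribute strings.
SPANS: List[str] = ["physicist", "born in Oslo", "Nobel laureate"]


def matching_individuals(kg: Dict[str, Set[str]], visible: Set[str]) -> Set[str]:
    """Return the individuals whose known attributes include every visible span."""
    return {person for person, attrs in kg.items() if visible <= attrs}


def spans_to_mask(kg: Dict[str, Set[str]], spans: List[str], k: int) -> List[str]:
    """Greedily mask spans until at least k individuals remain compatible.

    At each step, mask the span whose removal yields the largest set of
    compatible individuals. Ties are broken alphabetically, so the result
    is deterministic but not necessarily the only valid solution.
    """
    visible = set(spans)
    masked: List[str] = []
    while visible and len(matching_individuals(kg, visible)) < k:
        best = max(
            sorted(visible),
            key=lambda s: len(matching_individuals(kg, visible - {s})),
        )
        visible.remove(best)
        masked.append(best)
    return masked


if __name__ == "__main__":
    print(spans_to_mask(KG, SPANS, k=2))  # ['Nobel laureate']
    print(spans_to_mask(KG, SPANS, k=3))  # ['Nobel laureate', 'born in Oslo']
```

With k=3, masking "physicist" instead of "born in Oslo" would also leave three compatible individuals, which is a small instance of the abstract's observation that the task may admit several alternative solutions. In the approach described above, the spans selected in this way would serve as the automatic labels used to fine-tune the language model.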




