Modeling "Newsworthiness" for Lead-Generation Across Corpora

04/19/2021
by Alexander Spangher, et al.

Journalists obtain "leads", or story ideas, by reading large corpora of government records: court cases, proposed bills, etc. However, only a small percentage of such records are interesting. We propose a model of "newsworthiness" aimed at surfacing interesting documents. We train models on automatically labeled corpora (published newspaper articles) to predict whether each article was a front-page article (i.e., newsworthy) or not (i.e., less newsworthy). We then transfer these models to unlabeled corpora (court cases, bills, city-council meeting minutes) to rank the documents in these corpora by "newsworthiness". A fine-tuned RoBERTa model achieves 0.93 AUC on held-out labeled documents and 0.88 AUC on expert-validated unlabeled corpora. We provide interpretation and visualization for our models.
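The train-then-transfer pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' method: the paper fine-tunes RoBERTa, whereas the sketch below swaps in a simple smoothed log-odds word scorer so that it runs with the Python standard library alone. All function names and the toy documents are hypothetical and are here only to show the three steps: fit on front-page labels, evaluate with AUC, and rank an unlabeled corpus by score.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


def train_log_odds(docs, labels, alpha=1.0):
    """Smoothed per-word log-odds of front-page (1) vs. other (0) articles.

    Stand-in for the fine-tuned RoBERTa classifier; an assumption of this sketch.
    """
    pos, neg = Counter(), Counter()
    for text, y in zip(docs, labels):
        (pos if y == 1 else neg).update(tokenize(text))
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + alpha * len(vocab)
    neg_total = sum(neg.values()) + alpha * len(vocab)
    return {
        w: math.log((pos[w] + alpha) / pos_total)
        - math.log((neg[w] + alpha) / neg_total)
        for w in vocab
    }


def score(weights, text):
    """Mean log-odds over tokens; words unseen in training contribute 0."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(weights.get(t, 0.0) for t in tokens) / len(tokens)


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


def rank_documents(weights, docs):
    """Order unlabeled documents from most to least 'newsworthy'."""
    return sorted(docs, key=lambda d: score(weights, d), reverse=True)


# Hypothetical toy data: label 1 = front page, 0 = not front page.
train_docs = [
    "mayor indicted in corruption probe",
    "council approves corruption investigation budget",
    "local bakery wins pie contest",
    "library extends weekend hours",
]
train_labels = [1, 1, 0, 0]

# Hypothetical unlabeled government records to be ranked.
unlabeled_docs = [
    "zoning board minutes on library hours",
    "court case alleging corruption by mayor",
]

weights = train_log_odds(train_docs, train_labels)
ranked = rank_documents(weights, unlabeled_docs)
```

In a realistic setting the AUC would be computed on held-out labeled articles rather than on the training set, and the ranking on the unlabeled corpus would be checked against expert judgments, as the abstract describes.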
