Higher Criticism for Discriminating Word-Frequency Tables and Testing Authorship

10/30/2019
by Alon Kipnis, et al.

We adapt the Higher Criticism (HC) goodness-of-fit test to detect changes between word frequency tables. We apply the test to authorship attribution, where the goal is to identify the author of a document using other documents whose authorship is known. The method is simple yet performs well without handcrafting and tuning. As an inherent side effect, the HC calculation identifies a subset of discriminating words. In practice, the identified words have low variance across documents belonging to a corpus of homogeneous authorship. We conclude that in testing a new document against the corpus of an author, HC is mostly affected by words characteristic of that author and is relatively unaffected by topic structure.
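
To make the calculation described above concrete, the following is a minimal sketch of an HC comparison between two word-frequency tables. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how per-word p-values are obtained, so this sketch assumes an exact two-sided binomial test in which each word's occurrences split between the two tables in proportion to the table sizes. The function names `binom_two_sided_pvalue` and `higher_criticism` and the parameter `gamma` (the fraction of smallest p-values searched) are hypothetical choices made for this example.

```python
import math
from collections import Counter


def binom_two_sided_pvalue(x, n, p):
    """Exact two-sided binomial p-value: total probability of all
    outcomes that are no more likely than the observed count x."""
    pmf = [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    cutoff = pmf[x] * (1 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= cutoff))


def higher_criticism(counts1, counts2, gamma=0.4):
    """Return (HC score, discriminating words) for two word-count tables.

    Assumption: under the null hypothesis, a word seen x times in table 1
    and y times in table 2 has x ~ Binomial(x + y, share1), where share1
    is table 1's share of all word occurrences.
    """
    total1, total2 = sum(counts1.values()), sum(counts2.values())
    if total1 == 0 or total2 == 0:
        raise ValueError("both tables must contain at least one word")
    share1 = total1 / (total1 + total2)

    # One p-value per word in the joint vocabulary.
    pvals = {}
    for word in set(counts1) | set(counts2):
        x, y = counts1.get(word, 0), counts2.get(word, 0)
        pvals[word] = binom_two_sided_pvalue(x, x + y, share1)

    ranked = sorted(pvals.items(), key=lambda kv: kv[1])
    n = len(ranked)
    if n < 2:
        raise ValueError("need at least two distinct words")

    # Search the smallest p-values only; stop before i == n so that
    # the denominator u * (1 - u) stays positive.
    limit = min(n - 1, max(1, int(gamma * n)))
    best_score, best_i = -math.inf, 1
    for i in range(1, limit + 1):
        u = i / n
        p_i = ranked[i - 1][1]
        score = math.sqrt(n) * (u - p_i) / math.sqrt(u * (1 - u))
        if score > best_score:
            best_score, best_i = score, i

    # Words whose p-value falls at or below the HC threshold.
    threshold = ranked[best_i - 1][1]
    words = [w for w, p in ranked if p <= threshold]
    return best_score, words


# Toy tables that differ only in the words "upon" and "while".
doc_a = Counter({"the": 20, "of": 10, "and": 10, "to": 10, "upon": 10})
doc_b = Counter({"the": 20, "of": 10, "and": 10, "to": 10, "while": 10})
score, words = higher_criticism(doc_a, doc_b)
print(sorted(words))  # ['upon', 'while']
```

In this toy case the words shared at equal rates receive p-values of 1 and do not influence the maximum, while the two words used by only one table set the HC threshold. This mirrors the side effect noted above: the same pass that yields the test statistic also returns the subset of discriminating words.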
