Higher Criticism for Discriminating Word-Frequency Tables and Testing Authorship
We adapt the Higher Criticism (HC) goodness-of-fit test to detect changes between word-frequency tables. We apply the test to authorship attribution, where the goal is to identify the author of a document using other documents whose authorship is known. The method is simple yet performs well without handcrafting or tuning. As an inherent side effect, the HC calculation identifies a subset of discriminating words. In practice, the identified words have low variance across documents belonging to a corpus of homogeneous authorship. We conclude that in testing a new document against the corpus of an author, HC is mostly affected by words characteristic of that author and is relatively unaffected by topic structure.
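To make the procedure concrete, the following is a minimal Python sketch of an HC comparison between two word-frequency tables. It is an illustration under stated assumptions and not necessarily the exact procedure of the paper: the abstract does not say how per-word p-values are obtained, so the sketch assumes an exact two-sided binomial test per word, with a null success probability equal to the first table's share of all word occurrences. It then uses the standard HC statistic, the maximum over the smallest gamma fraction of the sorted p-values p_(1) <= ... <= p_(N) of sqrt(N) * (i/N - p_(i)) / sqrt(p_(i) * (1 - p_(i))). The function names and the `gamma` parameter are hypothetical, chosen for this example.

```python
import math


def binom_pmf(k, n, p):
    """Binomial probability mass, computed in log space for stability."""
    if p <= 0.0:
        return 1.0 if k == 0 else 0.0
    if p >= 1.0:
        return 1.0 if k == n else 0.0
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1.0 - p)
    )
    return math.exp(log_pmf)


def binom_two_sided_pvalue(k, n, p):
    """Exact two-sided p-value: total mass of outcomes no more likely than k."""
    observed = binom_pmf(k, n, p)
    total = sum(
        q for q in (binom_pmf(j, n, p) for j in range(n + 1))
        if q <= observed * (1.0 + 1e-7)  # tolerance for floating-point ties
    )
    return min(1.0, total)


def higher_criticism(counts_a, counts_b, gamma=0.2):
    """Return (HC score, discriminating words) for two word-count mappings.

    Assumption of this sketch: each word is tested with an exact binomial
    test whose null probability is table A's share of all occurrences.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    p_null = total_a / (total_a + total_b)

    # One p-value per word in the joint vocabulary, sorted ascending.
    pvals = []
    for word in set(counts_a) | set(counts_b):
        a = counts_a.get(word, 0)
        b = counts_b.get(word, 0)
        pvals.append((binom_two_sided_pvalue(a, a + b, p_null), word))
    pvals.sort()

    n = len(pvals)
    limit = max(1, int(gamma * n))  # only the smallest gamma fraction is scanned
    best_hc, best_i = None, 0
    for i, (p, _) in enumerate(pvals[:limit], start=1):
        if p <= 0.0 or p >= 1.0:
            continue  # the statistic is undefined at the endpoints
        hc = math.sqrt(n) * (i / n - p) / math.sqrt(p * (1.0 - p))
        if best_hc is None or hc > best_hc:
            best_hc, best_i = hc, i

    if best_hc is None:
        return 0.0, []
    # Words whose p-value falls at or below the HC-maximizing p-value.
    return best_hc, [word for _, word in pvals[:best_i]]
```

In this sketch, a larger HC score indicates stronger evidence that the two tables differ, and the returned word list plays the role of the discriminating subset that the abstract describes as a side effect of the HC calculation. For attribution, one would compute the score of a new document's table against each candidate author's pooled table and prefer the author with the smallest score; that decision rule is likewise an assumption of the sketch.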