FOMO: Topics versus documents in legal eDiscovery

by Herbert Roitblat, et al.

In the United States, the parties to a lawsuit are required to search through their electronically stored information to find documents that are relevant to the specific case and produce them to the opposing party. Negotiations over the scope of these searches often reflect a fear that something will be missed (Fear of Missing Out: FOMO). A Recall level of 80%, for example, means that 20% of the relevant documents will be left unproduced. This paper makes the argument that eDiscovery is the process of identifying responsive information, not identifying documents. Documents are the carriers of the information; they are not the direct targets of the process. A given document may contain one or more topics or factoids, and a factoid may appear in more than one document. The coupon collector's problem, Heaps' law, and other analyses provide ways to model the problem of finding information from among documents. In eDiscovery, however, the parties do not know how many factoids there might be in a collection or their probabilities. This paper describes a simple model that estimates the confidence that a fact will be omitted from the produced set (the identified set), while being contained in the missed set. Two data sets are then analyzed: a small set involving microaggressions and a larger set involving the classification of web pages. Both show that it is possible to discover at least one example of each available topic within a relatively small number of documents, meaning that further effort will not return additional novel information. The smaller data set is also used to investigate whether the non-random order of searching for responsive documents commonly used in eDiscovery (called continuous active learning) affects the distribution of topics; it does not.
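
The distinction between missing documents and missing information can be illustrated with a minimal sketch. This is not the model described in the paper; it assumes, purely for illustration, that topics are equally likely (the textbook coupon collector setting) and that each relevant document is found independently with probability equal to Recall. The function names are hypothetical.

```python
from fractions import Fraction


def expected_draws_uniform(n_topics: int) -> float:
    """Expected number of random draws needed to see every one of
    n_topics equally likely topics at least once (n * H_n, the
    classic coupon collector result). Uniform topics are an assumption."""
    harmonic = sum(Fraction(1, k) for k in range(1, n_topics + 1))
    return float(n_topics * harmonic)


def prob_topic_entirely_missed(recall: float, n_docs_with_topic: int) -> float:
    """Probability that every document carrying a given topic falls in
    the unproduced (missed) set, assuming each relevant document is
    found independently with probability `recall`."""
    return (1.0 - recall) ** n_docs_with_topic
```

Under these assumptions, a topic that appears in three relevant documents would be missed entirely at 80% Recall with probability 0.2 × 0.2 × 0.2 = 0.008, which is far lower than the 20% of documents left unproduced. Real collections have unknown and unequal topic probabilities, so these figures are illustrative only.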


