Probability: The Mathematics of Document Review and How to Dramatically Reduce Costs
by Andy Kraftsow
When is litigation like Las Vegas? The more cynical among us might answer “Always,” arguing that the outcome of any trial is a crapshoot. But for the rest of us, “always” is not the right answer. Most believe there is a relationship between the outcome of a trial and the hard work that went into preparing for it. Most believe that our actions influence the outcome. But there is an aspect of litigation that is almost exactly like Las Vegas, at least from a mathematical perspective, and that aspect is first pass document review.
Las Vegas is at once the land of probability and a haven for the hopeful denial of probability. I am playing blackjack, and I am winning. I have fourteen points in my hand, and the dealer is showing a ten. Do I hold or ask for another card? The odds tell me one thing, but my gut tells me another. Or maybe I am playing craps and wondering if I should double down since the person rolling the dice seems to be hot. Can we really beat the odds? What is the probability that the roulette wheel will make my number a winner? If I close my eyes and cross my toes, will that improve my chances?
All of these questions deal with the probable occurrence of random events, events upon which my actions have no impact. Surprisingly, the same mathematics that determines winners and losers in Vegas impacts first pass review. Consider an example: I have collected 1,000,000 documents, and I suspect that about 15% of them are relevant to the case issues. I want to read all of the relevant documents and none of the nonrelevant ones. Query: How many documents will I have to examine before I find all of the relevant ones? Or stated as an event threshold, how likely is it that the next document I look at will be responsive?
The answer to this question is of great economic significance because first pass review has become the most expensive part of e-discovery. We are drowning in a deluge of documents! It costs about 20 times as much to review a document using a contract attorney ($1) as it does to collect and process the document ($0.05). So the question of how to avoid reading nonrelevant documents is very timely. In the example above, I would waste $850,000 (850,000 nonrelevant documents at $1 each) if I were forced to read all of the nonrelevant documents in order to find all the relevant ones.
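The arithmetic behind those figures can be checked with a short Python sketch. The dollar amounts and percentages come from the example above; the function and variable names are illustrative only.

```python
# Figures from the example in the text; all names here are illustrative.
COLLECTION_SIZE = 1_000_000
RELEVANT_FRACTION = 0.15
REVIEW_COST_PER_DOC = 1.00      # contract attorney, in dollars
PROCESSING_COST_PER_DOC = 0.05  # collect and process, in dollars


def wasted_review_cost(total_docs, relevant_fraction, cost_per_doc):
    """Dollars spent reading nonrelevant documents if every document is reviewed."""
    nonrelevant_docs = round(total_docs * (1 - relevant_fraction))
    return nonrelevant_docs * cost_per_doc


cost_ratio = REVIEW_COST_PER_DOC / PROCESSING_COST_PER_DOC
wasted = wasted_review_cost(COLLECTION_SIZE, RELEVANT_FRACTION, REVIEW_COST_PER_DOC)
print(f"Review costs {cost_ratio:.0f}x processing; wasted spend: ${wasted:,.0f}")
```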
From a mathematical perspective, there are two attributes of first pass review that make it expensive. First, the relevant documents that I am seeking are relatively rare (15%) within the set collected. Second, I have no idea how they are distributed. These factors compound my retrieval problems – I do not know where in the collection to look and since the ones I want are rare, I may look for a long time without seeing any relevant documents at all. Anyone who has participated in a large document review has experienced these issues. A reviewer can work for hours without seeing a single relevant document, especially if the documents are ordered for review chronologically.
The mathematical concept that describes how long we must wait for a random event to occur is called the geometric random variable, and it is at the heart of every Las Vegas triumph and disaster. It is also the factor that predicts how long it will take a reviewer to see the next relevant document. Understanding the geometric random variable is worthwhile because knowledge of its consequences will allow us to reduce first pass review costs by 80% or more. The math is not hard, even for those who are not fond of math.
To get a sense of the geometric random variable, let’s consider an example:
How many times must you roll a single die before you are 95% certain of rolling a 2?
Examining the problem, one notes that there are six numbers on a die so the probability that any particular roll will show a 2 is one-in-six. But we have all had the experience of rolling the same number many times in a row; we know that can happen. If I want to be 95% certain that at least one roll is a 2, the geometric random variable tells me that I have to roll the die 17 times. The chance of seeing no 2 at all in k rolls is (5/6)^k, and I need that chance to fall to 5% or less. Solving for k, the number of rolls, gives:
k >= ln(0.05) / ln(5/6) ≈ 16.431
What stands out is that even though the probability of a 2 showing on any one roll is one-in-six, to be 95% certain that at least one roll is a 2, I need to roll the die 17 times. To be very certain takes a lot of rolls.
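For readers who prefer code to formulas, here is a minimal Python sketch of the same calculation. The function name trials_needed is an illustrative choice, and the sketch assumes every trial is independent with the same chance of success.

```python
import math


def trials_needed(success_probability, confidence=0.95):
    """Smallest whole k such that 1 - (1 - p)**k >= confidence."""
    if not 0 < success_probability < 1:
        raise ValueError("success_probability must be between 0 and 1")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be between 0 and 1")
    failure_probability = 1 - success_probability
    # k >= ln(1 - confidence) / ln(failure probability), rounded up to a whole trial
    return math.ceil(math.log(1 - confidence) / math.log(failure_probability))


print(trials_needed(1 / 6))  # the dice example: prints 17
```

Raising the confidence argument shows how quickly the number of rolls grows as one demands more certainty.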
Now let’s think about document review. Suppose I have collected 1 million documents, and 15% of them are relevant. How many documents do I have to examine before I am 95% certain to see one relevant document? It is the dice problem translated into document review. Once again, the geometric random variable answers the question:
k >= ln(0.05) / ln(850,000/1,000,000) ≈ 18.433
I have to look at 19 documents to be 95% certain of finding the first relevant one. That is a lot of documents to go through just to find one that is relevant – especially since I have to find 149,999 more – but at least it is predictable. At least I know my worst case. The predictability comes from randomness. Rolling dice creates random outcomes.
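The 19-document figure can be checked from the other direction by asking how likely it is that at least one relevant document turns up within a given number of looks. The Python sketch below treats each look as an independent draw with a 15% chance of success, which is an approximation that seems reasonable for a collection of this size. The function name is illustrative.

```python
def chance_of_at_least_one(success_probability, trials):
    """Probability of at least one success in the given number of independent trials."""
    return 1 - (1 - success_probability) ** trials


for looks in (18, 19):
    probability = chance_of_at_least_one(0.15, looks)
    print(f"{looks} documents: {probability:.1%} chance of at least one relevant")
```

Eighteen documents fall just short of the 95% threshold and nineteen clear it, which is why 18.433 is rounded up.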
The 19th-century French mathematician Siméon Denis Poisson discovered that using randomness is also the best way to find needles in a haystack; and what is document review if not looking for needles in a haystack? So we need to introduce randomness into the document review process. We do this by deploying a random “next document” algorithm. When the reviewer asks for the next document, the system picks one at random. It is that simple.
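A random "next document" algorithm could be sketched in Python as follows. This is only an illustration under stated assumptions, not a description of any particular review platform: the class name, method names, and integer document IDs are all hypothetical. Shuffling the queue once up front has the same effect as picking an unreviewed document at random on each request.

```python
import random


class RandomReviewQueue:
    """Hands out unreviewed document IDs in random order, each exactly once."""

    def __init__(self, document_ids, seed=None):
        self._pending = list(document_ids)
        # A seed is optional; it only makes the order repeatable for testing.
        random.Random(seed).shuffle(self._pending)

    def next_document(self):
        """Return a randomly ordered unreviewed document ID, or None when done."""
        return self._pending.pop() if self._pending else None

    def remaining(self):
        """Number of documents not yet handed to a reviewer."""
        return len(self._pending)


# Hypothetical usage with ten made-up document IDs.
queue = RandomReviewQueue(range(1, 11), seed=42)
served = []
while (doc_id := queue.next_document()) is not None:
    served.append(doc_id)
print(served)
```

Because each document leaves the queue as soon as it is served, no reviewer sees the same document twice.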
Random examination overcomes the “bunchiness” that one sees in a review organized by custodian or date. It makes sense to review known hot custodians and hot date ranges first, but once those areas are mined, switching to a random algorithm pinpoints where you are and how much more there is to be done.
The geometric random variable tells us that, at the 95% certainty level, we may see as many as 18 nonrelevant documents before each relevant one we come across. After a document is examined, it is removed from the review queue. Thus we are removing more nonrelevant documents from the set than relevant ones. In fact, we may be removing them at a ratio of as much as 18 to 1. As a result, the relevant documents are getting “denser” in the collection, which means that they will begin to appear more frequently: first once in every seventeen documents, then once in every sixteen and so on. If you run out the sequence, the geometric random variable predicts that about 90% of the collection (900,000 documents) must be read before one can be 95% certain of having found every relevant document. Reviewing 750,000 nonrelevant documents (the 900,000 read, less the 150,000 relevant ones) is better than reading 850,000 nonrelevant documents, but it is still an expensive and wasteful proposition. What else can be done to cut the costs?
One answer is to increase the percentage (density) of relevant documents in the collection by removing documents that cannot possibly be relevant. If we could double the density of the set so that at the beginning of review 30% of the documents were relevant, then reviewers would only have to read 45% of the documents to be 95% certain of finding all the relevant ones. That is a big improvement, and there are technologies available to accomplish just this task.
Increasing the density of relevant documents is the goal of predictive coding and related technologies. The idea is to use machine algorithms or, in the case of RenewData’s Language-Based Analytics, human language comprehension, to identify and remove documents that are deemed not possibly relevant. Generally these techniques identify about 50% of the collection as not worth reviewing.
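To see what removing half the collection does to density, consider this small Python sketch. It assumes an idealized cull in which every removed document really is nonrelevant; real tools will not be perfect. The function name is illustrative, and the numbers come from the running example.

```python
import math


def density_after_culling(total_docs, relevant_docs, culled_nonrelevant_docs):
    """Fraction of relevant documents left after removing nonrelevant ones."""
    remaining_docs = total_docs - culled_nonrelevant_docs
    return relevant_docs / remaining_docs


# 1,000,000 collected, 150,000 relevant, 500,000 removed as not worth reviewing.
density = density_after_culling(1_000_000, 150_000, 500_000)

# Same 95% formula as before, applied to the denser set.
looks_to_first_hit = math.ceil(math.log(0.05) / math.log(1 - density))
print(f"Density: {density:.0%}; looks needed for a first hit: {looks_to_first_hit}")
```

Under that assumption the density doubles from 15% to 30%, and the 95% worst case for finding the first relevant document drops from 19 looks to 9.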
By reviewing randomly and using linguistic analysis to increase the density of the set, we have decreased the cost of first pass review by about 50%. Good but not great. In our example we are still wasting $265,000 and almost a thousand man-days reviewing nonrelevant documents. Can we do better?
The geometric random variable says “yes.” There is one more very simple strategy we can deploy, and it has perhaps the greatest impact of all on the cost of first pass review.
The strategy relies on leveraging the redundancy of language, specifically the redundancy of relevant language. Anyone who has participated in a document review knows that the language in a collection is terribly redundant. People say the same things to each other over and over again. Our experience is that every “string” of relevant language identified by a reviewer appears in at least 24 other documents.
Bulk-tagging every document that contains the same identified relevant language string reduces the number of documents that must be read by 50%. That is an astounding result from introducing nothing more than a highlighter and a Boolean bulk-tagger. (RenewData has built a highlight-driven bulk-tagger that sits on top of Relativity.) Grabbing every document that contains an identified relevant language string is definitely the way to go.
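The idea behind a bulk-tagger can be illustrated with a few lines of Python. This is a simplified sketch and not RenewData's implementation: the function name, the case-insensitive matching rule, and the three sample documents are all invented for illustration.

```python
def bulk_tag(documents, highlighted_phrase):
    """Return the IDs of documents whose text contains the highlighted phrase.

    Matching ignores letter case and differences in whitespace.
    """
    needle = " ".join(highlighted_phrase.lower().split())
    if not needle:
        raise ValueError("highlighted_phrase must contain text")
    tagged_ids = []
    for doc_id, text in documents.items():
        haystack = " ".join(text.lower().split())
        if needle in haystack:
            tagged_ids.append(doc_id)
    return tagged_ids


# Invented sample documents, for illustration only.
sample_documents = {
    "doc-1": "Please shred the Q3 pricing memo before the audit.",
    "doc-2": "Lunch on Friday?",
    "doc-3": "As discussed, SHRED the   Q3 pricing memo today.",
}
tagged = bulk_tag(sample_documents, "shred the Q3 pricing memo")
print(tagged)  # prints ['doc-1', 'doc-3']
```

Every document tagged this way leaves the review queue without being read, which is where the savings come from.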
To summarize: The geometric random variable suggests three strategies to reduce first pass review costs:
1. Use a random next document algorithm to overcome uneven distribution.
2. Increase the density of the set by removing documents that could not possibly be relevant.
3. Use a bulk-tagger to grab every document that contains the relevant language highlighted by the reviewer.
By combining all three strategies, one can avoid reviewing 80% or more of the collection and still be 95% confident of finding every relevant document.
About the Author: RenewData’s Chief Scientist Andy Kraftsow leads the company’s efforts to develop groundbreaking technologies. He joined the company from Digital Mandate via acquisition in 2009. While at Digital Mandate, he spearheaded the company’s efforts to build the Vestigate legal review solution, which remains an integral piece of RDC Analytics’ content analytics portfolio.
Over the past 25 years, Kraftsow has founded and been CEO of three software companies, each of which was ultimately sold to a public company. In 1999, he founded DolphinSearch Inc to bring the power of advanced neural networks mathematics to the legal profession. Trained as a mathematician, his specialty has been in the field of information analysis using various aspects of applied mathematics.
Note on the dice formula: Essentially what this says is that the number of rolls must be at least the natural log of the complement of the certainty level (1 - 0.95 = 0.05) divided by the natural log of the ratio of unwanted results to all results (5/6), rounded up to the next whole roll.
For collections containing more than 100,000 documents.