Data Reduction and Document Review

Litigation Preparedness in the Age of E-Discovery

Dr. Andy Cobb, PhD, CCE

Part 2: Data Reduction and Document Review



In Part 1 of this series, we discussed the proper preservation of data, including when the duty to preserve arises, litigation holds, and the repercussions of not properly preserving data when litigation is possible. Once data is properly preserved and/or collected, the focus shifts to review of the data. It is important to remember that not all data is created equal in terms of relevance to the matter. While a large amount of data may have been properly preserved in previous phases, the challenge now becomes separating the wheat from the chaff in a cost-effective way. In a 2012 study, the RAND Corporation found that over 70% of eDiscovery costs were in the document review phase. Thus, reducing the number of potentially relevant documents to review has a large impact on the overall cost of eDiscovery.

Several approaches can be applied to narrow down the amount of data to be reviewed, ranging from technical best practices that can and should be applied to almost any data set to focused, case-specific tactical solutions. Two general approaches for data reduction are De-NISTing and De-Duplication, and both should almost always be employed. De-NISTing is the process of culling known files from the data set, typically by matching them against the reference list of known software files maintained by the National Institute of Standards and Technology (NIST). Windows system files are examples of known files. When De-NISTing is applied, these known files are “ignored” or removed from the review set.

De-Duplication is the process of culling out documents that have the same content.  De-duping can be helpful so that reviewers are not seeing and coding the same document two or more times, which saves time and money.
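Both techniques are commonly implemented by comparing file hashes rather than file names. The following is a minimal sketch of that idea, not a description of any particular review platform. The function name `cull_known_and_duplicates` and the in-memory list of files are assumptions made for illustration, and the set of known hashes stands in for a real reference list such as the one NIST publishes.

```python
import hashlib


def sha1_hex(content: bytes) -> str:
    """Return the SHA-1 digest of a file's content as a hex string."""
    return hashlib.sha1(content).hexdigest()


def cull_known_and_duplicates(files, known_hashes):
    """Split files into a review set, known files, and duplicates.

    files: iterable of (path, content_bytes) pairs (hypothetical input shape).
    known_hashes: set of hex digests for known files (stand-in for a
    real reference hash list).
    """
    seen = set()
    review_set, known, duplicates = [], [], []
    for path, content in files:
        digest = sha1_hex(content)
        if digest in known_hashes:
            known.append(path)        # De-NISTing: drop known files
        elif digest in seen:
            duplicates.append(path)   # De-Duplication: same content already kept
        else:
            seen.add(digest)
            review_set.append(path)
    return review_set, known, duplicates
```

Because the comparison is on content hashes, two copies of the same document are caught even when they have different file names, while two documents that differ by a single character are treated as distinct.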

Other document culling techniques can be applied that depend on the nature of the case. A few examples of case-specific techniques are:

  • Filtering documents by custodian.  Many cases involve key custodians of interest. One widely used practice is to review emails to and from particular individuals of interest first, then expand the scope of review as needed.
  • Filtering by dates of interest.  Eliminating documents outside a particular date range can be a very effective method of reducing data size.
  • Keyword Searches. This method involves searching for relevant documents using keywords. The first – and often most difficult – aspect of this approach is settling on a set of keywords that return relevant data, rather than false positives. 
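The three filters above can be combined into a single pass over the data. The sketch below is illustrative only: the `Email` record, its field names, and the `cull_emails` function are hypothetical, and the keyword test is a plain case-insensitive substring match, which is exactly the kind of matching that tends to produce false positives.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Email:
    # Hypothetical record shape, for illustration only.
    sender: str
    recipients: tuple
    sent: date
    body: str


def cull_emails(emails, custodians, start, end, keywords):
    """Keep emails that involve a custodian, fall in the date range,
    and contain at least one keyword (case-insensitive substring match)."""
    custodians = {c.lower() for c in custodians}
    keywords = [k.lower() for k in keywords]
    kept = []
    for email in emails:
        people = {email.sender.lower(), *(r.lower() for r in email.recipients)}
        if not (people & custodians):
            continue  # no custodian of interest sent or received it
        if not (start <= email.sent <= end):
            continue  # outside the agreed date range
        body = email.body.lower()
        if any(k in body for k in keywords):
            kept.append(email)
    return kept
```

Applying the cheaper custodian and date filters before the keyword search keeps the text matching to a smaller set, and each filter can be loosened on its own if the scope of review later needs to expand.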

Unless the document review is for an internal investigation and not discovery, the criteria used to reduce documents will most likely need to be agreed upon by both parties. Courts are generally agreeable to, and may even order, reasonable methods of reducing the number of documents for review.

Document Review

Document review is the process by which documents are coded or categorized, and it can be overwhelming. But having the right review platform and the right people managing and performing the review process can dramatically reduce the heartburn. Look for a review platform that is efficient and has been time-tested by professional litigators who perform review routinely. Outside counsel may be a good resource for this.

Experienced reviewers and review managers can greatly improve the efficiency of the review process – they’ve got the battle scars and know what can go wrong and how to address the typical problems that arise. And they usually have a well-defined process by which to efficiently perform review for large or complex projects.

Document review, which is the most costly phase of the eDiscovery process, requires preparation of the documents to help reduce the overall costs of discovery. The phases leading up to document review are critical since they set the stage for both defensibility and lowering costs.

Technology-Assisted Review (TAR)

One other set of techniques, which might be considered a hybrid between data reduction and document review, uses software to aid in the review process and is known as Technology-Assisted Review, or TAR. Predictive coding (now called TAR 1.0) was introduced a few years ago as a technique in which reviewers “train” and test the software until it can accurately predict how documents should be coded.

Predictive coding evolved into the latest form of TAR called continuous learning, or TAR 2.0.  In this technique, the software automatically learns as the reviewers code documents. When the software reaches a certain confidence level, it “takes over” and begins to automatically code the remaining documents as long as the confidence level is maintained.  TAR techniques have been accepted in court under certain circumstances, especially for extremely large document sets.
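To make the workflow described above concrete, here is a toy sketch of a learn-as-you-code loop. It is not how any commercial TAR product works: a simple word-weight model updated on each mistake is swapped in for a real statistical classifier, the "confidence level" is assumed to be the model's accuracy over the last few human-coded documents, and the names `review_with_continuous_learning`, `window`, and `threshold` are hypothetical. The sketch also omits the ongoing quality checks a real project would use to confirm that confidence is maintained.

```python
from collections import deque


def score(weights, text):
    """Sum the learned weights of the distinct words in a document."""
    return sum(weights.get(w, 0) for w in set(text.lower().split()))


def review_with_continuous_learning(docs, reviewer, window=3, threshold=1.0):
    """Code documents, letting the model take over once it is trusted.

    reviewer: callable returning True (relevant) or False for a document;
    it stands in for a human reviewer.
    Returns (codes, human_count), where codes[i] is the code for docs[i].
    """
    weights = {}
    recent = deque(maxlen=window)  # was the model right on recent human-coded docs?
    codes = []
    human_count = 0
    for text in docs:
        predicted = score(weights, text) > 0
        trusted = len(recent) == window and sum(recent) / window >= threshold
        if trusted:
            codes.append(predicted)  # software codes the document itself
            continue
        label = reviewer(text)       # human codes the document
        human_count += 1
        codes.append(label)
        recent.append(predicted == label)
        if predicted != label:
            # Learn from the mistake: nudge each word toward the human's code.
            step = 1 if label else -1
            for w in set(text.lower().split()):
                weights[w] = weights.get(w, 0) + step
    return codes, human_count
```

In this sketch the savings come from `human_count` being smaller than the number of documents; the trade-off is that every automatically coded document is only as reliable as the model was on the last few human-coded ones.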


In this article, we’ve discussed several best practices that can be employed to reduce the volume of documents that need review. These techniques can be instrumental in reducing the overall cost of eDiscovery. As TAR is increasingly accepted in courts for large document sets, the costs of document review for those cases will also be dramatically reduced.

In Parts 1 and 2 of this series, we’ve focused on the scenarios where attorneys handle the review of documents for discovery.  In the final part of this series of articles, we’ll tackle digital forensics investigations, in which a digital forensics expert is needed to perform a deep dive into devices to find the story the data tells.
