Whether or not you watch true crime shows, you probably know that matching a suspect to a DNA profile is one of the most reliable ways to identify them. According to Wikipedia, when Restriction Fragment Length Polymorphism (RFLP) is used to construct a DNA profile, the theoretical risk of a coincidental DNA match is 1 in 100 billion (100,000,000,000). That’s about 12 times the population of the earth! No wonder law enforcement uses DNA evidence to obtain convictions in criminal cases: as an identifier, it’s unique enough to tie suspects to the crime.
Hash values are even more distinctive than DNA, and they can be used not only to forensically authenticate electronic evidence, but also to significantly reduce the burden associated with eDiscovery!
What are Hash Values?
A hash value is a fixed-length numeric value that, for practical purposes, uniquely identifies data. That data can range from a single character to a single file of up to a default size of 2 GB. Hash values represent large amounts of data as much smaller numeric values, so they are used as digital signatures to identify every electronic file in an ESI collection. An industry-standard algorithm is used to create the hash value for each electronic file.
Hash values are typically represented as hexadecimal numbers, and the length of that number depends on the hash algorithm being used. A 32-digit hexadecimal number representing the contents of a file might look something like this: ec55d3e698d289f2afd663725127bace. With that many digits, two different files are extremely unlikely to share a hash value by coincidence.
How unique? A 32-digit hexadecimal number like the one above has 340,282,366,920,938,463,463,374,607,431,768,211,456 potential combinations. That’s 340 undecillion 282 decillion 366 nonillion 920 octillion 938 septillion 463 sextillion 463 quintillion 374 quadrillion 607 trillion 431 billion 768 million 211 thousand 456!
Unique enough for you?
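If you'd like to check that figure yourself, here is a small sketch of the arithmetic. Each hexadecimal digit can take 16 values, so a 32-digit number has 16 to the 32nd power possible combinations, which is the same as 2 to the 128th power. The variable names are just for illustration.

```python
# Count the distinct values a 32-digit hexadecimal number can take.
hex_digits = 32
combinations = 16 ** hex_digits  # each hex digit has 16 possible values

# Each hex digit encodes 4 bits, so 32 digits cover 128 bits.
bits = hex_digits * 4
same_count_in_bits = 2 ** bits

print(bits)                                # 128
print(combinations == same_count_in_bits)  # True
print(f"{combinations:,}")
# 340,282,366,920,938,463,463,374,607,431,768,211,456
```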
Types of Hash Values Typically Used in Discovery
There are many hash algorithms out there that can be used to represent data. Two algorithms have become standard within the eDiscovery industry:
Message-Digest algorithm 5 (MD5 Hash): Results in a 128-bit hash value, which is represented as a 32-digit hexadecimal number (like the example above).
Secure Hash Algorithm 1 (SHA-1): Results in a 160-bit hash value, which is represented as a 40-digit hexadecimal number.
It’s important to note that the format of a file matters. Files with the same content but different formats (e.g., a Word document printed to PDF) will have different hash values. And, while the algorithm may be industry standard, the manner in which an eDiscovery solution calculates either an MD5 hash or a SHA-1 hash varies widely, based on the implementation of the algorithm and the data and metadata used in generating the hash value. For example, emails have several metadata fields that could be used in generating the hash value, including: SentDate, From, To, CC, BCC, Subject, Attachments (including embedded images) and the text of the email.
This means that if you receive a native production from opposing counsel that includes a separate metadata production with hash value as one of the metadata fields, and you load it into your own eDiscovery solution, don’t expect the hash values to match (unless you’re both using the same solution, that is).
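Here is a rough sketch of why that happens. It assumes two hypothetical tools that both use MD5 but feed it different sets of email fields. The function `email_hash`, the field selections, and the sample email are all made up for illustration and don't reflect how any real product works.

```python
import hashlib


def email_hash(email, fields, separator="|"):
    """Hash the selected email fields, joined in the order given."""
    joined = separator.join(str(email.get(name, "")) for name in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


# Made-up sample email using the field names mentioned above.
email = {
    "SentDate": "2020-01-15T09:30:00Z",
    "From": "alice@example.com",
    "To": "bob@example.com",
    "CC": "",
    "BCC": "",
    "Subject": "Quarterly report",
    "Body": "Please see the attached report.",
}

# Hypothetical field selections for two different tools.
tool_a_fields = ["SentDate", "From", "To", "Subject", "Body"]
tool_b_fields = ["From", "To", "CC", "BCC", "Subject", "Body"]

hash_a = email_hash(email, tool_a_fields)
hash_b = email_hash(email, tool_b_fields)

print(hash_a == hash_b)  # False: same email, same algorithm, different inputs
```

Same email, same algorithm, different hash values. The only thing that changed is which fields went into the calculation.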
How Hash Values are Used in Discovery
Hash values have two primary functions in electronic discovery:
Just like law enforcement uses DNA to authenticate physical evidence at a crime scene, eDiscovery and forensic professionals use hash values to authenticate electronic evidence, which can be vitally important if there are disputes regarding the authenticity of the evidence in your case!
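As a minimal sketch of what authentication looks like in practice, assume a hash value was recorded when the evidence was collected and is recomputed later to confirm that nothing changed. The function names and the sample bytes are hypothetical, and SHA-1 is used only because it is one of the two algorithms discussed above.

```python
import hashlib


def sha1_hex(data):
    """Return the SHA-1 hash of the given bytes as 40 hexadecimal digits."""
    return hashlib.sha1(data).hexdigest()


def is_unaltered(data, recorded_hash):
    """Recompute the hash and compare it with the value recorded earlier."""
    return sha1_hex(data) == recorded_hash.lower()


# Hash recorded at collection time (sample content is made up).
original = b"Contents of the collected file."
recorded = sha1_hex(original)

# Later: the untouched copy verifies, while a one-character change does not.
altered = b"Contents of the collected file!"

print(is_unaltered(original, recorded))  # True
print(is_unaltered(altered, recorded))   # False
```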