Computer Forensics, Malware Analysis & Digital Investigations: File Entropy explained

I posted a quick EnScript yesterday that calculates the entropy of any selected object in EnCase. One of the comments I received asked for more information about what entropy is and what the values represent. This post provides a little more detail about what entropy is and how it can be useful.

The term entropy comes from thermodynamics, where it is central to the second law. In computing terms, it comes from Shannon's information theory. Simply put, entropy as it relates to digital information is a measurement of the randomness in a given set of values (data).

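The equation used by Shannon, applied to data one byte at a time, produces a value between zero (0) and eight (8), measured in bits per byte. The closer the number is to zero, the more orderly or non-random the data is. The closer the number is to eight, the more random the data is, meaning the byte values are spread more evenly across all 256 possibilities. The formula takes the probability of each byte value, multiplies it by the base-2 logarithm of that probability, sums the results over all byte values, and negates the sum.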
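
Written out for byte data, the standard Shannon entropy looks like the equation below. The symbol names are assumptions chosen for illustration: p_i is the fraction of bytes in the data that have the value i, and any term where p_i is zero is treated as zero.

$$H = -\sum_{i=0}^{255} p_i \log_2 p_i$$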

Here is a simple example. Imagine you have a file that is 100 bytes in length and that is filled with the value of zero (0).

Using the above formula, you get a result of zero: every byte is zero, so the probability of any other value appearing is zero and there is no uncertainty about the next byte. Now consider the same 100-byte file filled with half zeros and half ones.

Using the same formula, the result is one: with two equally likely values, each byte carries exactly one bit of information.
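
The two results above are easy to check. The following is a minimal Python sketch, not the EnScript itself; the function name `shannon_entropy` and the sample buffers are made up for illustration, and it assumes "zeros" and "ones" mean the byte values 0x00 and 0x01.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    # Sum p * log2(1/p) over the byte values that actually occur,
    # where p = count / total. This equals -sum(p * log2(p)).
    return sum((n / total) * math.log2(total / n) for n in Counter(data).values())


all_zeros = bytes(100)                        # 100 bytes of 0x00
half_and_half = b"\x00" * 50 + b"\x01" * 50   # 50 zeros followed by 50 ones

print(shannon_entropy(all_zeros))       # 0.0
print(shannon_entropy(half_and_half))   # 1.0
```

Note that the order of the bytes does not matter here: this measure only looks at how often each value occurs, so 50 zeros followed by 50 ones scores the same as the two values interleaved.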

Take this same file and compress it using RAR:

Entropy = 5.0592468625650353

Take the same message (half zeros and half ones) and encrypt it with PGP and you get this:

Entropy = 7.8347915272089166

Encrypt a blank volume in TrueCrypt and you get this:

Entropy = 7.9999997904861599

The closer you get to truly random data, the closer the entropy value will be to the maximum value of eight, meaning there is no pattern that would help you guess what the next value might be.
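
To see where the maximum of eight comes from, take the standard Shannon entropy over byte values and assume the ideal case where all 256 byte values are equally likely, so each has a probability of 1/256:

$$H_{\max} = -\sum_{i=0}^{255} \frac{1}{256} \log_2 \frac{1}{256} = \log_2 256 = 8$$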

How is this useful?

Entropy can be used in many different ways, but it is most commonly used to detect encryption and compression, since truly random data is uncommon in typical user data. This is especially true of executables that have purposely been encrypted and that decrypt themselves at run time. The encryption prevents an AV engine from seeing "inside" the executable as it sits on the disk, so the engine cannot detect strings or patterns. Entropy is also very helpful in identifying files that have a high amount of randomness (as illustrated above), which could indicate an encrypted container or volume that might otherwise go unnoticed.
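
As a rough illustration of that kind of triage, here is a Python sketch that walks a directory and lists files whose entropy is at or above a cutoff. The names `file_entropy` and `high_entropy_files` and the cutoff of 7.5 are assumptions for the example, not values taken from EnCase or any other tool. Ordinary compressed formats (ZIP, JPEG, and so on) will also score high, so treat a hit as a lead to follow up rather than proof of encryption.

```python
import math
from collections import Counter
from pathlib import Path

THRESHOLD = 7.5  # assumed cutoff in bits per byte; tune it for your own data


def file_entropy(path, chunk_size=1 << 20):
    """Return the entropy of a file in bits per byte, reading it in chunks."""
    counts = Counter()
    total = 0
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            counts.update(chunk)   # iterating bytes yields ints 0..255
            total += len(chunk)
    if total == 0:
        return 0.0
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def high_entropy_files(root, threshold=THRESHOLD):
    """Return (path, entropy) pairs for files under root at or above threshold."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            value = file_entropy(path)
            if value >= threshold:
                hits.append((path, value))
    return hits
```

Calling `high_entropy_files("/some/export/folder")` (a placeholder path) returns the flagged files along with their scores, which can then be sorted or reviewed by hand.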

Following the original post, there was some discussion on a forensic message board about using entropy to detect the use of a wiping tool against unallocated space. In that case, you would be looking for repeating patterns that occur over a large area of unallocated space, which show up as low entropy values. Again, the higher the entropy value, the more random the data; the lower the value, the more uniform the data.
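
One way to look for that kind of pattern is to compute entropy block by block rather than over the whole area, then report long runs of low-entropy blocks. The Python sketch below assumes the unallocated space has already been exported into a bytes object. The function names, the 4096-byte block size, the threshold of 1.0, and the minimum run of 8 blocks are all arbitrary choices for illustration.

```python
import math
from collections import Counter


def block_entropies(data: bytes, block_size: int = 4096):
    """Yield (offset, entropy in bits per byte) for each block of data."""
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        total = len(block)
        entropy = sum(
            (n / total) * math.log2(total / n) for n in Counter(block).values()
        )
        yield offset, entropy


def low_entropy_runs(data, block_size=4096, threshold=1.0, min_blocks=8):
    """Return (start, end) byte offsets of runs of consecutive low-entropy blocks."""
    runs = []
    start, count = None, 0
    for offset, entropy in block_entropies(data, block_size):
        if entropy <= threshold:
            if start is None:
                start = offset      # a new candidate run begins here
            count += 1
        else:
            if start is not None and count >= min_blocks:
                runs.append((start, offset))
            start, count = None, 0
    # Close out a run that extends to the end of the data.
    if start is not None and count >= min_blocks:
        runs.append((start, len(data)))
    return runs
```

A single fill byte gives an entropy of zero for every block, and a two-byte repeating pattern gives one, so both fall at or below the assumed threshold while ordinary leftover file data usually does not.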

Reference: "File Entropy", McCreight, Shawn