Context Triggered Piecewise Hashes (CTPH) and SSDEEP

It has been a while since I posted anything here, but life had other plans for me over the past three years. Anyway, cutting the crap short, today I will discuss hashes. To be more precise, Context Triggered Piecewise Hashes (CTPH).

This term came to mind when I was going through the Pyramid of Pain, a simple diagram that shows the relationship between the types of indicators you might use to detect an adversary’s activities and how much pain it causes them when you are able to deny those indicators to them. We can discuss the Pyramid of Pain some other day. Let’s talk about hashes first.

We have all used cryptographic hashes to verify the integrity of files; they are used widely in any data forensics investigation. If even a single bit of the input changes, the hashed output changes drastically. That property cuts both ways: an attacker can change a single bit of a piece of malware so that it no longer matches the cryptographic hash a forensic professional is looking for, while keeping the malware’s functionality intact.
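To make the "single bit changes everything" point concrete, here is a minimal sketch using Python’s standard hashlib. The sample sentence is just a placeholder I picked for illustration; any input behaves the same way.

```python
import hashlib

original = b"The quick brown fox jumps over the lazy dog"
# Flip the lowest bit of the first byte: 'T' (0x54) becomes 'U' (0x55).
modified = bytes([original[0] ^ 0x01]) + original[1:]

md5_original = hashlib.md5(original).hexdigest()
md5_modified = hashlib.md5(modified).hexdigest()

# Count how many of the 32 hex digits are identical at the same position.
same_positions = sum(a == b for a, b in zip(md5_original, md5_modified))

print(md5_original)
print(md5_modified)
print(f"{same_positions} of {len(md5_original)} hex digits agree")
```

The two inputs differ in exactly one bit, yet the digests have nothing useful in common, which is exactly why an exact-match hash lookup is so easy to evade.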

To overcome such attacks, we will look at a concept called Context Triggered Piecewise Hashes (CTPH) and at how to use an application called ssdeep for threat attribution.

What is CTPH

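Unlike a cryptographic hash, which produces a single hash for the entire file, CTPH computes a traditional hash for each of many segments (pieces) of the file. The segments are not fixed-size: a rolling hash over the input decides where each piece ends, so the boundaries are triggered by the content itself (the “context”).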

A few lines from the associated paper (reference 1):

A rolling hash algorithm produces a pseudo-random value based only on the current context of the input. The rolling hash works by maintaining a state based solely on the last few bytes from the input. Each byte is added to the state as it is processed and removed from the state after a set number of other bytes have been processed.

The current context can be imagined as a moving window across the input. The window length (number of bytes) depends on the implementation of CTPH.
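Here is a minimal sketch of that moving window. It is not ssdeep’s actual rolling hash, which is more elaborate; I am assuming a plain sum of the last few bytes as the state, and the 7-byte default window is just an illustrative choice.

```python
from collections import deque


def rolling_sums(data: bytes, window: int = 7):
    """Yield one rolling value per input byte, based only on the last `window` bytes."""
    recent = deque()  # the bytes currently inside the window
    state = 0
    for byte in data:
        recent.append(byte)
        state += byte                    # add the new byte to the state
        if len(recent) > window:
            state -= recent.popleft()    # remove the byte that fell out of the window
        yield state


a = bytes(range(1, 21))
b = a[:5] + b"\xff" + a[6:]              # change only the byte at index 5
values_a = list(rolling_sums(a, window=3))
values_b = list(rolling_sums(b, window=3))
differing = [i for i, (x, y) in enumerate(zip(values_a, values_b)) if x != y]
print(differing)
```

Only the positions whose window still contains the changed byte report a different value. Everything before it and everything after the window has moved on is untouched, which is the property CTPH relies on.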

Each recorded value in the CTPH signature depends only on part of the input, and changes to the input will result in only localized changes in the CTPH signature.

Two files similar to each other will have large sequences of identical bits in the same order. The main aim of CTPH is to find similarity between binaries.

If a byte of the input is changed, at most two, and in many cases, only one of the traditional hash values will be changed; the majority of the CTPH signature will remain the same. Because the majority of the signature remains the same, files with modifications can still be associated with the CTPH signatures of known files.
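The idea can be sketched as a toy CTPH. To be clear about the assumptions: this is not ssdeep’s algorithm. The rolling hash is a plain sum over a 7-byte window, the per-piece "traditional hash" is the first byte of an MD5 digest, and the block size of 16 is arbitrary; toy_ctph and changed_span are names I made up for this post.

```python
import hashlib

WINDOW = 7  # bytes of context seen by the rolling hash (illustrative choice)
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def piece_char(piece: bytes) -> str:
    """Traditional hash of one piece, reduced to a single signature character."""
    return ALPHABET[hashlib.md5(piece).digest()[0] % 64]


def toy_ctph(data: bytes, block_size: int = 16) -> str:
    """End a piece wherever the rolling sum hits the trigger value; hash each piece."""
    window = bytearray(WINDOW)
    rolling = 0
    start = 0
    signature = []
    for i, byte in enumerate(data):
        slot = i % WINDOW
        rolling += byte - window[slot]   # add the new byte, drop the oldest one
        window[slot] = byte
        if rolling % block_size == block_size - 1:   # trigger point reached
            signature.append(piece_char(data[start:i + 1]))
            start = i + 1
    if start < len(data):                # hash whatever is left after the last trigger
        signature.append(piece_char(data[start:]))
    return "".join(signature)


def changed_span(a: str, b: str) -> int:
    """Characters of `a` left over after stripping the prefix and suffix shared with `b`."""
    limit = min(len(a), len(b))
    prefix = 0
    while prefix < limit and a[prefix] == b[prefix]:
        prefix += 1
    suffix = 0
    while suffix < limit - prefix and a[-1 - suffix] == b[-1 - suffix]:
        suffix += 1
    return len(a) - prefix - suffix


original = (b"Context triggered piecewise hashing splits a file wherever a rolling "
            b"hash hits a trigger value, then hashes every piece separately. ") * 4
modified = original[:100] + b"X" + original[101:]   # change a single byte

sig_original = toy_ctph(original)
sig_modified = toy_ctph(modified)
print(sig_original)
print(sig_modified)
print("changed characters:", changed_span(sig_original, sig_modified))
```

Only the pieces around the edited byte can change, so the two signatures keep a long common prefix and suffix. A cryptographic hash of the same two inputs would share nothing.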

Let’s see how it works. I will use ssdeep for the demo.

Installing ssdeep

I am installing it on macOS, so I set up Homebrew first.

ssuman@iosec testing % /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
ssuman@iosec testing % brew install wget
ssuman@iosec testing % wget https://github.com/ssdeep-project/ssdeep/releases/download/release-2.14.1/ssdeep-2.14.1.tar.gz
ssuman@iosec testing % tar xf ssdeep-2.14.1.tar.gz
ssuman@iosec testing % cd ssdeep-2.14.1/  
ssuman@iosec testing % ./configure && make && sudo make install
ssuman@iosec testing % ssdeep -h   
ssdeep version 2.14.1 by Jesse Kornblum and the ssdeep Project
For copyright information, see man page or README.TXT.
Usage: ssdeep [-m file] [-k file] [-dpgvrsblcxa] [-t val] [-h|-V] [FILES]
-m - Match FILES against known hashes in file
-k - Match signatures in FILES against signatures in file
-d - Directory mode, compare all files in a directory
-p - Pretty matching mode. Similar to -d but includes all matches
-g - Cluster matches together
-v - Verbose mode. Displays filename as its being processed
-r - Recursive mode
-s - Silent mode; all errors are suppressed
-b - Uses only the bare name of files; all path information omitted
-l - Uses relative paths for filenames
-c - Prints output in CSV format
-x - Compare FILES as signature files
-a - Display all matches, regardless of score
-t - Only displays matches above the given threshold
-h - Display this help message
-V - Display version number and exit

I will use sample text files to see how the tool works. I have created three text files: one.txt, onecopy.txt (which has a minor change from one.txt), and different.txt, which has completely different content.

First, let’s see what conventional MD5 hashing gives us.

ssuman@iosec testing % md5 *
MD5 (different.txt) = 7b0bfac65305e7af9fbec4a15a112507
MD5 (one.txt) = 6ec167bb837f4c12caabae2278a291b7
MD5 (onecopy.txt) = 34cbd235f514ea8509f481d3abf0b762

As we can see, each hash is completely different from the others, even though one.txt and onecopy.txt are nearly identical. The ssdeep output below tells a different story: the signatures of those two files differ only in the last character of each hash part.

ssuman@INM1F52WRMD6M testing % ssdeep *
ssdeep,1.1--blocksize:hash:hash,filename
24:N/L3JL8tXjEn3fKGrmKadY6RH6r+jyJdEKt7r5GUcy:N/V8tX03fuR76JftH5GFy,"/Users/ssuman/Downloads/testing/different.txt"
24:FfJ7sC3g00ZHtXUs+wDnYDBTibJKDGCtsOoWm5GGHvqpufseEgQ:FfJwkgdZHd+wDnYtT+dPOo5fouUlgQ,"/Users/ssuman/Downloads/testing/one.txt"
24:FfJ7sC3g00ZHtXUs+wDnYDBTibJKDGCtsOoWm5GGHvqpufseEgP:FfJwkgdZHd+wDnYtT+dPOo5fouUlgP,"/Users/ssuman/Downloads/testing/onecopy.txt"
ssdeep: Did not process files large enough to produce meaningful results
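The header line above gives the layout of each record as blocksize:hash:hash,filename. Here is a small sketch that splits a record into its fields. The Signature type and parse_signature name are my own, and my understanding is that the second hash is computed at twice the block size of the first. The sketch also assumes the filename contains no escaped quotes.

```python
from typing import NamedTuple


class Signature(NamedTuple):
    block_size: int
    hash1: str      # signature at block_size
    hash2: str      # signature at 2 * block_size (my understanding of the format)
    filename: str


def parse_signature(line: str) -> Signature:
    """Split one 'blocksize:hash:hash,"filename"' record into its fields."""
    fuzzy, _, name = line.strip().partition(",")   # split on the first comma only
    block_size, hash1, hash2 = fuzzy.split(":")
    return Signature(int(block_size), hash1, hash2, name.strip('"'))


line = ('24:FfJ7sC3g00ZHtXUs+wDnYDBTibJKDGCtsOoWm5GGHvqpufseEgQ:'
        'FfJwkgdZHd+wDnYtT+dPOo5fouUlgQ,'
        '"/Users/ssuman/Downloads/testing/one.txt"')
sig = parse_signature(line)
print(sig.block_size, sig.filename)
```

Having the block size as a separate field is handy later, because it determines which signatures can be compared with each other at all.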

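The warning on the last line is ssdeep telling us that the sample files are too small to produce meaningful results, so treat the scores below as an illustration. Next, instead of comparing the signatures by eye, we can let ssdeep compare them and score the matches.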

ssuman@INM1F52WRMD6M testing % ssdeep * > test.ssd
ssdeep: Did not process files large enough to produce meaningful results
ssuman@INM1F52WRMD6M testing % cat test.ssd
ssdeep,1.1--blocksize:hash:hash,filename
24:N/L3JL8tXjEn3fKGrmKadY6RH6r+jyJdEKt7r5GUcy:N/V8tX03fuR76JftH5GFy,"/Users/ssuman/Downloads/testing/different.txt"
24:FfJ7sC3g00ZHtXUs+wDnYDBTibJKDGCtsOoWm5GGHvqpufseEgQ:FfJwkgdZHd+wDnYtT+dPOo5fouUlgQ,"/Users/ssuman/Downloads/testing/one.txt"
24:FfJ7sC3g00ZHtXUs+wDnYDBTibJKDGCtsOoWm5GGHvqpufseEgP:FfJwkgdZHd+wDnYtT+dPOo5fouUlgP,"/Users/ssuman/Downloads/testing/onecopy.txt"
ssuman@INM1F52WRMD6M testing % ssdeep -m test.ssd -s *
/Users/ssuman/Downloads/testing/different.txt matches test.ssd:/Users/ssuman/Downloads/testing/different.txt (100)
/Users/ssuman/Downloads/testing/one.txt matches test.ssd:/Users/ssuman/Downloads/testing/one.txt (100)
/Users/ssuman/Downloads/testing/one.txt matches test.ssd:/Users/ssuman/Downloads/testing/onecopy.txt (99)
/Users/ssuman/Downloads/testing/onecopy.txt matches test.ssd:/Users/ssuman/Downloads/testing/one.txt (99)
/Users/ssuman/Downloads/testing/onecopy.txt matches test.ssd:/Users/ssuman/Downloads/testing/onecopy.txt (100)
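Every file trivially matches itself with a score of 100, which adds noise once the directory grows. Here is a rough sketch for keeping only the matches between different files. The regular expression is my own guess at the line layout based on the output above, not a documented format, and cross_matches is a made-up helper name.

```python
import re

# Layout assumed from the output above: "<file> matches <db>:<known file> (<score>)"
MATCH_LINE = re.compile(
    r"^(?P<file>.+) matches (?P<db>[^:]+):(?P<known>.+) \((?P<score>\d+)\)$"
)


def cross_matches(lines, threshold=0):
    """Return (file, known, score) for matches between two different files."""
    results = []
    for line in lines:
        m = MATCH_LINE.match(line.strip())
        if m is None:
            continue                      # skip anything that is not a match line
        score = int(m["score"])
        if m["file"] != m["known"] and score >= threshold:
            results.append((m["file"], m["known"], score))
    return results


base = "/Users/ssuman/Downloads/testing"
sample = [
    f"{base}/different.txt matches test.ssd:{base}/different.txt (100)",
    f"{base}/one.txt matches test.ssd:{base}/one.txt (100)",
    f"{base}/one.txt matches test.ssd:{base}/onecopy.txt (99)",
    f"{base}/onecopy.txt matches test.ssd:{base}/one.txt (99)",
    f"{base}/onecopy.txt matches test.ssd:{base}/onecopy.txt (100)",
]
for file, known, score in cross_matches(sample):
    print(score, file, "~", known)
```

For the five lines above this leaves just the two one.txt / onecopy.txt pairings, both with a score of 99.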

ssdeep gave one.txt and onecopy.txt a match score of 99 (out of 100), which means the two files share almost all of their sequences; different.txt matched only itself. I tried the same with images. I took the same image in two formats, one .jpg and one .png.

I then made a couple of changes to the .png image by adding white rectangles. The modified copies appear as IMG-20210425-WA00101.png and IMG-20210425-WA00102.png in the output below.

Now, let’s repeat the commands.

ssuman@INM1F52WRMD6M image % ssdeep *               
ssdeep,1.1--blocksize:hash:hash,filename
3072:JF+i5nmCLo4wNtjONjulhKwz0lV2UOfe8bTrBLGT3WjRUZ+zsN1q86cq+AG:6i5nm6o4wvjO9ubKw8WfeSNLGTGjR4I4,"/Users/ssuman/Downloads/testing/image/IMG-20210425-WA0010.jpg"
24576:HFrD0+J6kRCRmyUc4nUe6j0Xst6y1Bm9nQW7v5Ltgq18PhmVmrliFzv7Zvjb:HF8+J6kRC6zUe6j0XDcQQWr5L51whmVN,"/Users/ssuman/Downloads/testing/image/IMG-20210425-WA0010.png"
24576:HFrD0+J6kRCRmyUc4nd8JT99dt+4YNRWcZP9eNyg7IpB0ixpupq:HF8+J6kRC6zd8JR1ERW8lewZL0ixIpq,"/Users/ssuman/Downloads/testing/image/IMG-20210425-WA00101.png"
24576:HFrD0+J6L9mkLB4bzuz086zV52SLDjq/q9S2AyZcEzs8bM40jYin5+G:HF8+J6pmkLBO808wbjOeSrgbn0UCb,"/Users/ssuman/Downloads/testing/image/IMG-20210425-WA00102.png"
ssuman@INM1F52WRMD6M image % ssdeep * > image.ssd   
ssuman@INM1F52WRMD6M image % ssdeep -m image.ssd -s *
/Users/ssuman/Downloads/testing/image/IMG-20210425-WA0010.jpg matches image.ssd:/Users/ssuman/Downloads/testing/image/IMG-20210425-WA0010.jpg (100)
/Users/ssuman/Downloads/testing/image/IMG-20210425-WA0010.png matches image.ssd:/Users/ssuman/Downloads/testing/image/IMG-20210425-WA0010.png (100)
/Users/ssuman/Downloads/testing/image/IMG-20210425-WA0010.png matches image.ssd:/Users/ssuman/Downloads/testing/image/IMG-20210425-WA00101.png (43)
/Users/ssuman/Downloads/testing/image/IMG-20210425-WA0010.png matches image.ssd:/Users/ssuman/Downloads/testing/image/IMG-20210425-WA00102.png (30)
/Users/ssuman/Downloads/testing/image/IMG-20210425-WA00101.png matches image.ssd:/Users/ssuman/Downloads/testing/image/IMG-20210425-WA0010.png (43)
/Users/ssuman/Downloads/testing/image/IMG-20210425-WA00101.png matches image.ssd:/Users/ssuman/Downloads/testing/image/IMG-20210425-WA00101.png (100)
/Users/ssuman/Downloads/testing/image/IMG-20210425-WA00101.png matches image.ssd:/Users/ssuman/Downloads/testing/image/IMG-20210425-WA00102.png (30)
/Users/ssuman/Downloads/testing/image/IMG-20210425-WA00102.png matches image.ssd:/Users/ssuman/Downloads/testing/image/IMG-20210425-WA0010.png (30)
/Users/ssuman/Downloads/testing/image/IMG-20210425-WA00102.png matches image.ssd:/Users/ssuman/Downloads/testing/image/IMG-20210425-WA00101.png (30)
/Users/ssuman/Downloads/testing/image/IMG-20210425-WA00102.png matches image.ssd:/Users/ssuman/Downloads/testing/image/IMG-20210425-WA00102.png (100)

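As seen above, the .jpg does not match the .png at all, even though both show the same picture: the two formats encode the pixels into entirely different bytes, and ssdeep compares bytes, not pictures. So ssdeep is not a reliable tool for comparing images. It did somewhat better between the .png files, but even a small visual change dropped the score to 43 or 30. The signatures hint at why: the three .png signatures share the same leading characters and then diverge completely, which is what you would expect if an edit changes the compressed data from that point onward.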
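There is a second reason the .jpg never shows up as a match. As far as I understand ssdeep’s matching, two signatures are only compared when their block sizes are equal or differ by a factor of two, and the .jpg came out with a block size of 3072 against 24576 for the .png files. Here is a small sketch of that check, with comparable as a made-up helper name:

```python
def comparable(block_size_a: int, block_size_b: int) -> bool:
    """True when two block sizes are equal or one is exactly double the other."""
    return (
        block_size_a == block_size_b
        or block_size_a == 2 * block_size_b
        or block_size_b == 2 * block_size_a
    )


# Block sizes taken from the ssdeep output above.
block_sizes = {
    "IMG-20210425-WA0010.jpg": 3072,
    "IMG-20210425-WA0010.png": 24576,
    "IMG-20210425-WA00101.png": 24576,
    "IMG-20210425-WA00102.png": 24576,
}

jpg = block_sizes["IMG-20210425-WA0010.jpg"]
for name, size in block_sizes.items():
    print(name, "comparable with the .jpg:", comparable(jpg, size))
```

Since 24576 is eight times 3072, the .jpg is only ever comparable with itself here, whatever the pixels look like.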

Feel free to drop your questions in the comment section.

References:

  1. https://www.sciencedirect.com/science/article/pii/S1742287606000764 
  2. https://www.youtube.com/watch?v=xY4YggSTnD8