First Part Of Why Edit Distance Is Important – Chuck Leaver

Written By Jesse Sampson And Presented By Chuck Leaver CEO Ziften


Why do attackers keep using the same tricks? The simple answer is that they still work. For instance, Cisco’s 2017 Cyber Security Report tells us that after years of decline, spam email with malicious attachments is on the rise again. In that traditional attack vector, malware authors often hide their activities by using a filename similar to that of a common system process.

There is not necessarily a relationship between a file’s path name and its contents: anyone who has tried to hide sensitive information by giving it a boring name like “taxes”, or changed the extension on a file attachment to get around email rules, knows this principle. Malware authors know it too, and will often name malware to resemble common system processes. For example, “iexplore.exe” is Internet Explorer, but “iexplorer.exe” with an extra “r” could be anything. Even experts can easily overlook this small difference.

The opposite problem, known .exe files running in unusual locations, is easy to solve using string functions and SQL sets.
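As a rough illustration of that idea outside the database, the sketch below splits each observed path into directory and file name, then uses a set lookup to flag known names running from unexpected directories. The allowlist `EXPECTED_DIRS`, the function name, and the sample paths are hypothetical and only show the shape of the check; they are not Ziften's actual query.

```python
import ntpath

# Hypothetical allowlist: executable name -> directories where it normally lives.
EXPECTED_DIRS = {
    "svchost.exe": {r"c:\windows\system32", r"c:\windows\syswow64"},
    "explorer.exe": {r"c:\windows"},
}

def unusual_locations(observed_paths):
    """Return paths whose file name is known but whose directory is not expected."""
    flagged = []
    for path in observed_paths:
        # Windows paths are case-insensitive, so compare in lower case.
        directory, name = ntpath.split(path.lower())
        expected = EXPECTED_DIRS.get(name)
        if expected is not None and directory not in expected:
            flagged.append(path)
    return flagged

# Made-up sample data for illustration only.
observed = [
    r"C:\Windows\System32\svchost.exe",
    r"C:\Users\alice\AppData\Local\Temp\svchost.exe",
    r"C:\Windows\explorer.exe",
]
print(unusual_locations(observed))
```

In SQL the same logic would be a string split on the last path separator followed by a set difference against the table of expected locations.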


What about the other case, finding close matches to the executable name? Most people start their search for near string matches by sorting data and visually scanning for discrepancies. This usually works well for a small set of data, maybe even a single system. Finding these patterns at scale, however, requires an algorithmic approach. One established technique for “fuzzy matching” is to use Edit Distance.
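For readers who have not met the idea before, here is a minimal sketch of edit distance in its most common form, Levenshtein distance: the smallest number of single-character insertions, deletions, and substitutions needed to turn one string into another. The post does not say which variant is used in production, so treat the choice of Levenshtein as an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (unit cost for every edit)."""
    # Row 0: turning the empty string into b[:j] takes j insertions.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        # Column 0: turning a[:i] into the empty string takes i deletions.
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]

print(edit_distance("iexplorer.exe", "iexplore.exe"))
```

Only two rows of the dynamic-programming table are kept in memory at a time, which is enough because each cell depends only on the current and previous rows.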

What’s the best way to calculate edit distance? For Ziften, our technology stack includes HP Vertica, which makes this job easy. The web is full of data scientists and data engineers singing Vertica’s praises, so it will suffice to mention that Vertica makes it easy to build custom functions that take full advantage of its power, from C++ power tools to statistical modeling scalpels in R and Java.

This Git repo is maintained by Vertica enthusiasts working in industry. It’s not an official offering, but the Vertica team is definitely aware of it, and moreover is thinking every day about how to make Vertica better for data scientists, so it is a great space to watch. Best of all, it contains a function to compute edit distance! There are also some other tools for natural language processing here, such as word tokenizers and stemmers.

By using edit distance on the top executable paths, we can quickly find the closest match to each of our top hits. This is an interesting dataset, as we can sort by distance to find the nearest matches over the whole dataset, or we can sort by frequency of the top path to see what is the closest match to our most commonly used processes. This data can also appear on contextual “report card” pages, to show, e.g., the top 5 nearest strings for a given path. Below is a toy example to give a sense of usage, based on real data ZiftenLabs observed in a customer environment.


Setting an upper limit of 0.2 seems to give good results in our experience, but the takeaway is that these thresholds can be adapted to fit specific use cases. Did we find any malware? We notice that “teamviewer_.exe” (should be just “teamviewer.exe”), “iexplorer.exe” (should be “iexplore.exe”), and “cvshost.exe” (should be “svchost.exe”, unless maybe you work for CVS pharmacy…) all look strange. Since we’re already in our database, it’s also trivial to pull the associated MD5 hashes, Ziften suspicion scores, and other attributes to do a deeper dive.
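To make the 0.2 cutoff concrete, the sketch below pairs each observed name with its closest known name and keeps only the near misses. It assumes the threshold applies to Levenshtein distance divided by the length of the longer name; the post does not spell out the normalization, so that formula, the function names, and the extra sample name “notepad.exe” are illustrative assumptions.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance, kept compact so this block is self-contained."""
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        new_row = [i]
        for j, cb in enumerate(b, 1):
            new_row.append(min(row[j] + 1, new_row[j - 1] + 1, row[j - 1] + (ca != cb)))
        row = new_row
    return row[-1]

def normalized_distance(a, b):
    """Assumed normalization: edits divided by the longer name's length (0.0 to 1.0)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0

def near_misses(observed, known, limit=0.2):
    """Pair each observed name with its closest known name when 0 < distance <= limit."""
    hits = []
    for name in observed:
        best = min(known, key=lambda k: normalized_distance(name, k))
        score = normalized_distance(name, best)
        # A score of 0 is an exact match, which is not suspicious by name alone.
        if 0 < score <= limit:
            hits.append((name, best, round(score, 3)))
    return sorted(hits, key=lambda hit: hit[2])

known = ["teamviewer.exe", "iexplore.exe", "svchost.exe"]
observed = ["teamviewer_.exe", "iexplorer.exe", "cvshost.exe", "svchost.exe", "notepad.exe"]
for hit in near_misses(observed, known):
    print(hit)
```

Under this normalization the three odd names from the example fall below the limit, while the exact match and the unrelated name are dropped. Raising or lowering `limit` trades more candidates for more noise.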


In this particular real-life environment, it turned out that teamviewer_.exe and iexplorer.exe were portable applications, not known malware. We helped the customer investigate the user and system where we observed the portable applications further, since use of portable apps on a USB drive could be evidence of suspicious activity. The more troubling find was cvshost.exe. Ziften’s intelligence feeds indicate that this is a suspect file. Looking up the MD5 hash for this file on VirusTotal confirms the Ziften data, indicating that this is a potentially serious Trojan that may be part of a botnet or doing something even more malicious. Once the malware was found, however, it was easy to fix the problem and make sure it stays fixed using Ziften’s ability to kill and persistently block processes by MD5 hash.

Even as we develop advanced predictive analytics to detect malicious patterns, it is important that we keep improving our ability to hunt for known patterns and old tricks. Just because new threats emerge doesn’t mean the old ones go away!

If you liked this post, watch this space for the second part of this series where we will apply this technique to hostnames to find malware droppers and other malicious websites.
