Written by Jesse Sampson and Presented by Chuck Leaver, CEO, Ziften
In the first post on edit distance, we looked at hunting for malicious executables with edit distance (i.e., the number of character edits it takes to make two text strings match). Now let’s look at how we can use edit distance to hunt for malicious domains, and how we can build edit distance features that can be combined with other domain name features to pinpoint suspicious activity.
Case Study Background
What are bad actors trying to do with malicious domains? It may be as simple as using a close misspelling of a popular domain name to trick careless users into viewing ads or picking up adware. Legitimate sites are slowly catching on to this technique, sometimes called typosquatting.
Other malicious domains are the product of domain generation algorithms, which can be used to do all kinds of nefarious things, such as evading countermeasures that block known compromised sites or overwhelming domain name servers in a distributed denial-of-service (DDoS) attack. Older variants use randomly generated strings, while more advanced ones add tricks like injecting common words, further confusing defenders.
Edit distance can help with both use cases; here is how. First, we’ll exclude common domains, since these are generally safe. A list of common domains also provides a baseline for detecting anomalies. One good source is Quantcast. For this discussion, we will stick to domain names and leave out subdomains (e.g. ziften.com, not www.ziften.com).
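The cleaning step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the case study: the helper names `registered_domain` and `candidate_domains` are hypothetical, and keeping the last two labels is a naive shortcut that breaks on multi-part suffixes such as co.uk (a real pipeline would likely consult the Public Suffix List).

```python
def registered_domain(hostname: str) -> str:
    """Naively reduce a hostname to its last two labels.

    Assumption: the registered domain is always the final two labels.
    This is wrong for suffixes like co.uk and is for illustration only.
    """
    labels = hostname.strip().lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def candidate_domains(observed, common):
    """Return observed domains that are not in the common-domain list.

    Subdomains are stripped, duplicates are dropped, and the order of
    first appearance is preserved.
    """
    common_set = {registered_domain(d) for d in common}
    seen = set()
    result = []
    for host in observed:
        domain = registered_domain(host)
        if domain not in common_set and domain not in seen:
            seen.add(domain)
            result.append(domain)
    return result
```

For example, `candidate_domains(["www.ziften.com", "WWW.Gooogle.com", "gooogle.com"], ["ziften.com"])` drops the common domain and collapses the two spellings of the lookalike into a single candidate.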
After data cleaning, we compare each candidate domain name (input data observed in the wild by Ziften) to its potential neighbors in the same top-level domain (the last part of a domain name: classically .com, .org, and so on, though now it can be almost anything). The basic task is to find the nearest neighbor in terms of edit distance. By finding domains that are one edit away from their nearest neighbor, we can easily spot typo-ed domains. By finding domains far from their nearest neighbor (the normalized edit distance we introduced in Part 1 is useful here), we can also find anomalous domains in the edit distance space.
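A minimal Python sketch of the nearest-neighbor search follows. It uses the standard dynamic-programming Levenshtein distance. The function names are hypothetical, and the normalization shown (distance divided by the length of the longer string) is an assumption; the definition from Part 1 may differ.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def tld(domain: str) -> str:
    """Return the last label of a domain name."""
    return domain.rsplit(".", 1)[-1]


def nearest_neighbor(candidate: str, common):
    """Find the closest common domain within the candidate's TLD.

    Returns (neighbor, distance, normalized_distance), or None when no
    common domain shares the TLD. Normalizing by the longer string's
    length is an assumption made for this sketch.
    """
    same_tld = [d for d in common if tld(d) == tld(candidate)]
    if not same_tld:
        return None
    best = min(same_tld, key=lambda d: edit_distance(candidate, d))
    dist = edit_distance(candidate, best)
    return best, dist, dist / max(len(candidate), len(best))
```

A candidate whose distance is 1 is a likely typo of its neighbor, while a candidate with a large normalized distance sits far from everything in the baseline. The brute-force scan compares each candidate against every common domain in its TLD, which is fine for a sketch but may need indexing or blocking at scale.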
What Were the Results?
Let’s look at how these results appear in real life. Be careful navigating to these domains, since they could contain malicious content!
Here are a few potential typos. Typosquatters target well-known domains because there is a better chance someone will visit them. Several of these are suspicious according to our threat feed partners, but there are some false positives as well, with charming names like “wikipedal”.
Here are some odd-looking domains far from their nearest neighbors.
So now we have created two useful edit distance metrics for hunting. Not only that, we have three features to potentially add to a machine-learning model: the rank of the nearest neighbor, the distance from that neighbor, and a flag for being exactly one edit away from it, which indicates a risk of typosquatting. Other features that could work well alongside these include lexical features such as word and n-gram distributions, entropy, and string length, as well as network features like the number of failed DNS requests.
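To make the feature list concrete, here is a hedged Python sketch that builds one feature row per candidate domain. The names `domain_features` and `shannon_entropy` and the dictionary keys are hypothetical, and the sketch assumes the common-domain list is ordered by popularity so that list position can stand in for rank. Character-level Shannon entropy is shown as one possible entropy feature.

```python
import math
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def shannon_entropy(s: str) -> float:
    """Character-level Shannon entropy in bits per character."""
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in Counter(s).values())


def domain_features(candidate: str, ranked_common):
    """Build a feature dict for one candidate domain.

    Assumption: ranked_common is ordered most popular first, so the
    1-based position in the list serves as the neighbor's rank.
    Returns None when no common domain shares the candidate's TLD.
    """
    suffix = candidate.rsplit(".", 1)[-1]
    best = None  # (rank, distance) of the nearest neighbor so far
    for rank, domain in enumerate(ranked_common, 1):
        if domain.rsplit(".", 1)[-1] != suffix:
            continue
        dist = edit_distance(candidate, domain)
        if best is None or dist < best[1]:
            best = (rank, dist)
    if best is None:
        return None
    rank, dist = best
    return {
        "neighbor_rank": rank,
        "neighbor_distance": dist,
        "is_one_edit_away": dist == 1,
        "entropy": shannon_entropy(candidate),
        "length": len(candidate),
    }
```

Ties in distance go to the more popular neighbor because the scan runs in rank order and only replaces the best match on a strictly smaller distance.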
Simplified Code You Can Play Around With
Here is a simplified version of the code to play with! It was developed on HP Vertica, but this SQL should run on most advanced databases. Note that the Vertica editDistance function may have a different name in other implementations (e.g. levenshtein in Postgres or UTL_MATCH.EDIT_DISTANCE in Oracle).