This blog is the final part of the Insider Threat Hunt: The Series, a collection of blogs where we demonstrate how to analyze large synthetic data sets for insider threat patterns. In part one of the series, we identified a user who attempted to access company systems after leaving the organization. Part two focused on identifying a user who had uploaded company data to an external site before resigning. Parts three and this part identified an employee who was either fired or resigned but not without first conducting unethical, inappropriate, and even illegal acts.
This blog will identify an employee who snoops around another employee's computer to locate interesting information, before then emailing it to their personal email address. The scenario and analysis presented here are based on Carnegie Mellon University's Insider Threat Dataset (available for public download at KiltHub). All the data presented here is fictitious, including email addresses, company names, and individuals' names.
The scenario presented in this blog involves a user, presumably an employee, who exfiltrates company files obtained from another user's computer via their personal email address. The alleged snooper performs this behavior repeatedly over three (3) months.
The dataset used in this analysis is approximately 22 GB compressed, containing eight (8) data sources. However, for this demonstration, we only use the following five (5):
The most natural place to start the analysis is the email.csv file, a 7.53 GB file containing over 10 million rows of details about sent and viewed email messages.
The scenario states that a user sent repeated email messages to their personal email account, presumably from a company email address to a free email service provider like Gmail or Yahoo.
Ten million rows are a lot of data to analyze; therefore, we can start by applying filters on various columns to reduce the data. We can safely filter out all emails sent internally (from and to a dtaa.com email address) because the user exfiltrated the data via an external email domain. By applying the following filters to the TO and FROM columns, we can significantly reduce the number of logs from 10 million to almost 70k.
Next, we need to identify all unique external domains to which internal users sent emails, and one way to do this is by using the Split Column function. We follow these steps:
Before applying the Split Column function, we need to identify the email messages sent to more than one recipient and separate each email address into a different column. We can apply the Character Count function to the TO column, which, as the name implies, counts the number of characters in each cell within the TO column, placing the results in a new column named "TO - LENGTH."
We then sort the "TO - LENGTH" column in ascending order to find the cells with the most number of characters, which correspond to multiple email destinations.
Looking at the TO column above, we see that a semi-colon separates each recipient's email address. Therefore, we can apply the Split Column function to the TO column using the semi-colon as a separator, effectively adding twelve (12) new columns named "TO - TO_SPLIT_1" through "TO - TO_SPLIT_12", each with one email address per cell.
We can now apply the Split Column function to each "TO - TO_SPLIT_#" using the "@" symbol as a separator to isolate the email address local-part from the email domain, effectively adding two new columns per each "TO - TO_SPLIT_#".
By grouping the newly added column containing the domains, we can identify the unique external domains to which users sent emails, revealing one non-business domain: comcast.net
We can further group the FROM and TO columns to identify the unique sender and recipient addresses within the comcast.net group. The screenshot below reveals that a user named Carlos Dieter Ewing sent 101 emails to Ewing_Carlos@comcast.net from his company email address.
The scenario noted that the user repeatedly sent emails to his personal email account over three (3) months. To validate this, we can perform the following steps:
After performing the steps outlined above, we can verify that the user sent most of the emails between March and May.