horizontal lines
Gigasheet Primary logo
  • Luciana Obregon

Insider Threat Hunt Series: Part 4, Finding the Snooper

This blog is the final part of the Insider Threat Hunt: The Series, a collection of blogs where we demonstrate how to analyze large synthetic data sets for insider threat patterns. In part one of the series, we identified a user who attempted to access company systems after leaving the organization. Part two focused on identifying a user who had uploaded company data to an external site before resigning. Parts three and this part identified an employee who was either fired or resigned but not without first conducting unethical, inappropriate, and even illegal acts.

This blog will identify an employee who snoops around another employee's computer to locate interesting information, before then emailing it to their personal email address. The scenario and analysis presented here are based on Carnegie Mellon University's Insider Threat Dataset (available for public download at KiltHub). All the data presented here is fictitious, including email addresses, company names, and individuals' names.

If you would like to follow along, create a free Gigasheet account, download a copy of the dataset, and get hunting.

The Scenario

The scenario presented in this blog involves a user, presumably an employee, who exfiltrates company files obtained from another user's computer via their personal email address. The alleged snooper performs this behavior repeatedly over three (3) months.

The Dataset

The dataset used in this analysis is approximately 22 GB compressed, containing eight (8) data sources. However, for this demonstration, we only use the following five (5):

  • devices.csv: records device activity, including power on and off and removable media events

  • file.csv: contains details of file transfers to and from removable media

  • email.csv: includes details of email communications, including sender, recipient, attachments, and email content

  • logon.csv: records user logon and logoff activity

  • YYYY-MM.csv: holds employee records, including on and offboarding events

The Analysis

The most natural place to start the analysis is the email.csv file, a 7.53 GB file containing over 10 million rows of details about sent and viewed email messages.

The scenario states that a user sent repeated email messages to their personal email account, presumably from a company email address to a free email service provider like Gmail or Yahoo.

Ten million rows are a lot of data to analyze; therefore, we can start by applying filters on various columns to reduce the data. We can safely filter out all emails sent internally (from and to a dtaa.com email address) because the user exfiltrated the data via an external email domain. By applying the following filters to the TO and FROM columns, we can significantly reduce the number of logs from 10 million to almost 70k.

Next, we need to identify all unique external domains to which internal users sent emails, and one way to do this is by using the Split Column function. We follow these steps:

  1. Identify emails sent to multiple recipients and separate individual recipient email addresses into different columns

  2. Separate the email address local-part (the characters that appear before the @) from the domain

Before applying the Split Column function, we need to identify the email messages sent to more than one recipient and separate each email address into a different column. We can apply the Character Count function to the TO column, which, as the name implies, counts the number of characters in each cell within the TO column, placing the results in a new column named "TO - LENGTH."

We then sort the "TO - LENGTH" column in ascending order to find the cells with the most number of characters, which correspond to multiple email destinations.

Looking at the TO column above, we see that a semi-colon separates each recipient's email address. Therefore, we can apply the Split Column function to the TO column using the semi-colon as a separator, effectively adding twelve (12) new columns named "TO - TO_SPLIT_1" through "TO - TO_SPLIT_12", each with one email address per cell.

We can now apply the Split Column function to each "TO - TO_SPLIT_#" using the "@" symbol as a separator to isolate the email address local-part from the email domain, effectively adding two new columns per each "TO - TO_SPLIT_#".

By grouping the newly added column containing the domains, we can identify the unique external domains to which users sent emails, revealing one non-business domain: comcast.net

We can further group the FROM and TO columns to identify the unique sender and recipient addresses within the comcast.net group. The screenshot below reveals that a user named Carlos Dieter Ewing sent 101 emails to Ewing_Carlos@comcast.net from his company email address.

The scenario noted that the user repeatedly sent emails to his personal email account over three (3) months. To validate this, we can perform the following steps:

  1. Filter out all external email domains except comcast.net from the column containing the domains

  2. Apply the Split Column function to the DATE column using the space as a delimiter to separate the date from the time

  3. Apply the Split Column function to the column containing the YYYY-MM-DD, for example, 2011-03-03, using the dash (-) as a delimiter to separate the year from the month and day

After performing the steps outlined above, we can verify that the user sent most of the emails between March and May.

Recent Posts