Insider Threat Hunt: The Series

Insider threats arise when people inside an organization misuse their access to information systems and, knowingly or unknowingly, harm their employers, customers, or partners. Unlike external attackers, insiders are employees, contractors, vendors, or business partners with legitimate access to systems or data. Some insiders act deliberately, driven by their own motivations; others become threats without realizing it, through carelessness or negligence.

This blog is the first part of the Insider Threat Hunt series, where we analyze a collection of synthetic datasets for signs of insider threat. The datasets used in all demonstrations are based on Carnegie Mellon University's Insider Threat Dataset, available for public download at KiltHub.

This post analyzes the r1.tar.bz2 dataset for signs of unauthorized access following personnel terminations or resignations. Our goal is to determine whether any user attempted to access company resources after being offboarded.

The Dataset

The r1.tar.bz2 dataset is approximately 84 MB and contains four data sources: devices.csv, http.csv, logon.csv, and eighteen LDAP CSV files. We use only logon.csv and the LDAP files in this first demonstration.

The LDAP CSV files provide eighteen months of employee LDAP data. Each file corresponds to one calendar month and lists the users who were active at the end of that month. Because each snapshot is taken at month end, a user who leaves partway through a month appears in the previous month's LDAP file but not in the file for the month in which employment ended. For example, users departing in June appear in May's LDAP file but not in June's.

The logon.csv file contains eighteen months' worth of logon/logoff activity for all users, including user ID, computer name, login activity, and date.

Additional context and background for this dataset include the following:

The company has 1000 users, 12 of whom are system administrators. Each user is assigned a unique computer.
There are 100 shared computers.

Analyzing LDAP Data

The first step was to upload the eighteen LDAP CSV files to Gigasheet and combine them into a single file using the built-in combine function, which produced a file of roughly 17,000 rows.
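For readers who prefer scripting, the combine step can be sketched in Python. This is a minimal illustration, not Gigasheet's implementation: the two tiny snapshots, the column names employee_name and user_id, and the ldap_month tag are all hypothetical stand-ins for the real files.

```python
import csv
import io

# Hypothetical miniature snapshots keyed by "YYYY-MM".
# The real LDAP files hold roughly 1000 rows each and more columns.
snapshots = {
    "2010-05": "employee_name,user_id\nAlice Example,AAA0001\nBob Example,BBB0002\n",
    "2010-06": "employee_name,user_id\nAlice Example,AAA0001\n",
}

def combine_snapshots(snapshots):
    """Stack monthly snapshots into one list of rows, tagging each row with its month."""
    combined = []
    for month in sorted(snapshots):
        for row in csv.DictReader(io.StringIO(snapshots[month])):
            row["ldap_month"] = month  # plays the role of the date column (F)
            combined.append(row)
    return combined

ldap_rows = combine_snapshots(snapshots)
for row in ldap_rows:
    print(row["user_id"], row["ldap_month"])
```

Tagging every row with its snapshot month is what makes the later grouping and cross-checking possible once the files are merged.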

The resulting LDAP file contains six columns, including:

Employee name (A)
User ID (B)
Domain (C)
Email address (D)
Role or Title (E)
Date (year and month) for the LDAP data in the file (F)

Grouping by the date column (F) reveals an eighteen-month span, from December 2009 to May 2011. The group counts show that the company started with 1000 users in December 2009 and then lost and gained users over the following months.

Similarly, grouping by the user ID column (B) shows how many monthly snapshots each user appears in, which approximates the length of employment. For example, the user with ID IGB0310 was employed for the entire eighteen months (IGB0310 (18)).

Sorting the counts in ascending order reveals the users with the shortest employment. For example, the user with ID CIB0489 appears in only one month (CIB0489 (1)).
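The two groupings above amount to counting rows per month and rows per user. A minimal sketch with Python's collections.Counter, using made-up user IDs and months rather than values from the dataset:

```python
from collections import Counter

# Hypothetical (user_id, ldap_month) pairs from a combined LDAP table.
ldap_rows = [
    ("AAA0001", "2010-04"), ("AAA0001", "2010-05"), ("AAA0001", "2010-06"),
    ("BBB0002", "2010-04"), ("BBB0002", "2010-05"),
    ("CCC0003", "2010-06"),
]

# Group by date: how many active users each snapshot contains.
users_per_month = Counter(month for _, month in ldap_rows)

# Group by user ID: how many snapshots each user appears in.
months_per_user = Counter(user_id for user_id, _ in ldap_rows)

# Ascending sort puts the shortest-tenured users first.
shortest_first = sorted(months_per_user.items(), key=lambda item: (item[1], item[0]))

print(users_per_month)
print(shortest_first)
```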

To determine which users were active in which month, we combined the user ID (B) and date (F) columns using a hyphen as a separator and stored the result in a new column named user-last-month. Each value in that column pairs a user ID with one month in which the user was present in LDAP.
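In script form, the combined column is a string key per row. The sketch below uses hypothetical IDs and assumes the month is formatted as YYYY-MM; collecting the keys in a set makes later membership checks cheap.

```python
# Hypothetical (user_id, ldap_month) pairs; the month format YYYY-MM is an assumption.
ldap_rows = [
    ("AAA0001", "2010-05"),
    ("AAA0001", "2010-06"),
    ("BBB0002", "2010-05"),
]

# One key per row, joined with a hyphen like the user-last-month column.
ldap_keys = {f"{user_id}-{month}" for user_id, month in ldap_rows}

print(sorted(ldap_keys))
```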

Analyzing the Logon Data

The logon data contains details of logon and logoff activity for all users, including:

A unique identifier (A)
Timestamp (B)
User ID and domain (C)
Computer name (D)
Activity (E)

To identify unauthorized login activity after offboarding, we need to find users who logged on or off in the months following their last month of employment. For example, a user who leaves in June should be offboarded in June, so there should be no activity for that user from July onward. Cross-checking the logon and LDAP datasets should, in theory, reveal users who attempted to access company resources after being removed from LDAP. Keep in mind that users who depart in the middle of a month are not included in the LDAP file for that month, whereas logon.csv still records their activity during that final month of employment.
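The rule described here can be written down directly. The sketch below is one way to express it, assuming months are given as YYYY-MM strings; the function names and labels are ours, not part of the dataset or of Gigasheet. Activity in the month right after the last LDAP snapshot is the expected final month; anything later counts as activity after offboarding.

```python
def month_index(year_month):
    """Convert 'YYYY-MM' into a running month count so months can be subtracted."""
    year, month = year_month.split("-")
    return int(year) * 12 + int(month) - 1

def classify_activity(last_ldap_month, activity_month):
    """Label logon activity relative to the last month the user appears in LDAP."""
    gap = month_index(activity_month) - month_index(last_ldap_month)
    if gap <= 0:
        return "employed"           # user still listed in LDAP for this month
    if gap == 1:
        return "final month"        # left mid-month, so missing from that snapshot
    return "after offboarding"      # no legitimate activity expected

# A user last listed in June 2010:
print(classify_activity("2010-06", "2010-07"))
print(classify_activity("2010-06", "2010-08"))
```

Converting months to a single integer avoids special cases at year boundaries, such as a user last listed in December who leaves in January.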

The next step was to split the logon.csv user column (C) to separate the user ID from the domain. We used Gigasheet's split column function with a forward slash as the separator, which produced two new columns: one for the domain (USER-DATE_SPLIT_1) and one for the user ID (USER-DATE_SPLIT_2).

Similarly, we split the date column (B) using a hyphen as a separator, placing the year, month, and date/time in three columns.

To cross-check the logon.csv and ldap.csv files, we need a way to tie the user ID to the year and month in which the logon activity occurred, similarly to how we combined the user ID and date in the LDAP file. To do this, we used Gigasheet's combine columns function, connecting the user ID, year, and month columns using a hyphen as a separator and placing the results in a new column named user-last-month.
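The split and combine steps on the logon side can be sketched as a single helper. This assumes the user field has the form DOMAIN/USERID and the timestamp begins with YYYY-MM-, as implied by the slash and hyphen separators described above; the sample values are hypothetical.

```python
def logon_key(user_field, timestamp):
    """Build a user-month key from a logon row's user and timestamp fields."""
    # "DOMAIN/USERID" -> domain, user ID (split on the first forward slash)
    _domain, user_id = user_field.split("/", 1)
    # "YYYY-MM-DD HH:MM:SS" -> year, month, remainder (split on the first two hyphens)
    year, month, _day_and_time = timestamp.split("-", 2)
    return f"{user_id}-{year}-{month}"

# Hypothetical row values, not taken from the dataset.
print(logon_key("EXAMPLE/AAA0001", "2010-08-02 09:15:00"))
```

Because the key has the same shape as the one built from the LDAP table, the two files can be compared with a simple lookup.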

Next, we performed a cross-file lookup of the logon.csv user-last-month column against the ldap.csv user-last-month column, resulting in an additional column displaying the lookup results.

Grouping by the user-last-month cross-lookup column reveals two values, true and false. True represents a positive match between the cross-checked values in logon.csv and ldap.csv files, while false represents the opposite.
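Conceptually, the cross-file lookup is a membership test: does each logon key exist among the LDAP keys? A minimal sketch with hypothetical keys follows; it illustrates the idea, not how Gigasheet performs the lookup internally.

```python
# Hypothetical user-month keys from the LDAP table.
ldap_keys = {"AAA0001-2010-05", "AAA0001-2010-06", "BBB0002-2010-05"}

# Hypothetical user-month keys from the logon table.
logon_keys = ["AAA0001-2010-06", "BBB0002-2010-06", "BBB0002-2010-07"]

# True when the user was listed in LDAP for the month of the logon activity.
lookup = [(key, key in ldap_keys) for key in logon_keys]

for key, matched in lookup:
    print(key, matched)
```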

In this demonstration, we are only concerned with the false values: logon activity in a month for which the user has no LDAP entry. Remember that departing users still show up in logon.csv during their final month of employment, even though that month's LDAP file no longer lists them, but they should not show up after that. Therefore, if no one logged on after termination, each user ID in the false group should have at most one month of logon activity.

We used Gigasheet's pivot table to reveal the number of months per user ID in the false group by counting the unique values in the month column against each user ID.

We then sorted the results in ascending order and found one user (BHH0460) who had two (2) months' worth of login activity (following the user's removal from LDAP).
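The pivot step counts distinct months per user within the false group and flags anyone with more than one. A minimal sketch under the same assumptions as before (hypothetical user IDs, months as YYYY-MM):

```python
from collections import defaultdict

# Hypothetical (user_id, month) pairs for logon rows with no matching LDAP key.
unmatched = [
    ("AAA0001", "2010-03"), ("AAA0001", "2010-03"),  # repeated logons, same month
    ("BBB0002", "2010-06"), ("BBB0002", "2010-07"),  # activity in two months
]

# Collect the distinct months of unmatched activity per user.
months_by_user = defaultdict(set)
for user_id, month in unmatched:
    months_by_user[user_id].add(month)

# More than one distinct month means activity beyond the final month of employment.
suspects = sorted(user for user, months in months_by_user.items() if len(months) > 1)

print(suspects)
```

Using a set per user matters here: a user who logs on many times in their final month still counts as one month, not many.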

Using Gigasheet's built-in filters, we quickly traced the unauthorized connection to August 2, 2010.

Finally, we cross-checked our findings by searching for the user ID in the ldap.csv file and found that the user last appeared in LDAP in June 2010 and was offboarded in July 2010.

Stay tuned for the next blog in the Insider Threat Hunt series, where we will continue to uncover more insider threat activity.

Luciana Obregon
