Insider Threat Hunt: The Series
Insider threats are the risk that people with legitimate access to information systems will, knowingly or unknowingly, misuse that access to harm their employers, customers, or partners. Unlike external attacks, insider threats originate within an organization, from employees, contractors, vendors, or business partners with legitimate access to systems or data. Some insiders act deliberately, driven by their own motivations; others become threats unwittingly through carelessness or negligence.
This blog is the first part of the Insider Threat Hunt series, where we analyze a collection of synthetic datasets for signs of insider threat. The datasets used in all demonstrations are based on Carnegie Mellon University's Insider Threat Dataset, available for public download at KiltHub.
This blog analyzes dataset r1.tar.bz2 for signs of unauthorized access following personnel terminations or resignations. Our goal is to determine whether any user attempted to access company resources after being offboarded.
The r1.tar.bz2 dataset is approximately 84 MB and contains four data sources: devices.csv, http.csv, logon.csv, and eighteen monthly LDAP.csv files. We will use only logon.csv and the LDAP files for this first demonstration.
The LDAP.csv files provide eighteen months' worth of employee LDAP data. Each file corresponds to one calendar month and lists the users who were active at the end of that month. Because each LDAP snapshot is taken at the end of the month, users who leave in the middle of a month appear in the previous month's LDAP file but not in the file for the month in which their employment ended. For example, users departing in June will appear in May's LDAP file but not in June's.
The logon.csv file contains eighteen months' worth of logon/logoff activity for all users, including user ID, computer name, login activity, and date.
Additional context and background for this dataset include the following:
The company has 1000 users, 12 of whom are system administrators. Each user is assigned a unique computer.
There are 100 shared computers.
Analyzing LDAP Data
The first step was to upload the eighteen LDAP CSV files to Gigasheet and combine them into a single file using the built-in combine function, resulting in a file of roughly 17,000 rows.
The resulting LDAP file contains six columns, including:
Employee name (A)
User ID (B)
Email address (D)
Role or Title (E)
Date (year and month) for the LDAP data in the file (F)
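For readers who want to reproduce the combine step outside Gigasheet, the following is a minimal Python sketch. It assumes each monthly file is named after its month (for example 2009-12.csv) and that the date column is derived from that name; the column names and sample rows are hypothetical stand-ins, not the actual contents of r1.tar.bz2.

```python
import csv
import io
import os

# Hypothetical stand-ins for two of the monthly LDAP exports. The real file
# names, column names, and rows in r1.tar.bz2 may differ.
ldap_files = {
    "2009-12.csv": (
        "employee_name,user_id,email,role\n"
        "Ann Example,IGB0310,ann@example.com,Engineer\n"
        "Bob Example,CIB0489,bob@example.com,Analyst\n"
    ),
    "2010-01.csv": (
        "employee_name,user_id,email,role\n"
        "Ann Example,IGB0310,ann@example.com,Engineer\n"
    ),
}


def combine_ldap(files):
    """Stack the monthly exports and tag every row with its file's month."""
    combined = []
    for name in sorted(files):
        month = os.path.splitext(name)[0]  # "2009-12.csv" -> "2009-12"
        for row in csv.DictReader(io.StringIO(files[name])):
            row["date"] = month  # assumed source of the date column
            combined.append(row)
    return combined


ldap_rows = combine_ldap(ldap_files)
print(len(ldap_rows), "rows in the combined file")
```

With the real files, the in-memory strings would be replaced by opening each CSV from disk; the stacking logic stays the same.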
Grouping by the date column (F) reveals an eighteen-month span, from December 2009 to May 2011. The group counts show that the company started with 1000 users in December 2009 and then lost and gained users over the following months.
Similarly, grouping by the user ID column (B) shows each user's length of employment. For example, the user with user ID IGB0310 was employed for the entire eighteen months (IGB0310(18)).
Sorting the groups in ascending order reveals the users with the shortest employment spans. For example, the user with user ID CIB0489 was employed for only one month (CIB0489(1)).
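The two groupings described above can be sketched in Python with collections.Counter. The rows below are made-up (user ID, month) pairs for illustration only; they are not taken from the dataset.

```python
from collections import Counter

# Made-up (user_id, month) pairs standing in for rows of the combined LDAP file.
ldap_rows = [
    ("IGB0310", "2009-12"),
    ("CIB0489", "2009-12"),
    ("IGB0310", "2010-01"),
    ("IGB0310", "2010-02"),
]

# Group by date: how many users were active in each month.
users_per_month = Counter(month for _, month in ldap_rows)

# Group by user ID: in how many monthly files each user appears.
months_per_user = Counter(user for user, _ in ldap_rows)

# An ascending sort puts the shortest employment spans first.
shortest_first = sorted(months_per_user.items(), key=lambda item: item[1])

print(users_per_month)
print(shortest_first)
```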
To determine which users were active in which month, we combined the user ID (B) and date (F) columns using a hyphen as a separator, storing the result in a new column named user-last-month. Each row of the user-last-month column now pairs a user ID with one month in which that user was employed.
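As a rough Python equivalent of this combine-columns step, the sketch below joins the two values with a hyphen. It assumes the date column holds values shaped like 2009-12; the helper name user_month_key and the sample rows are hypothetical.

```python
def user_month_key(user_id, year_month):
    """Join a user ID and a year-month value with a hyphen."""
    return f"{user_id}-{year_month}"


# Made-up (user_id, month) pairs standing in for columns B and F.
ldap_rows = [
    ("IGB0310", "2009-12"),
    ("CIB0489", "2009-12"),
    ("IGB0310", "2010-01"),
]

# One key per LDAP row; a set makes later membership checks cheap.
ldap_keys = {user_month_key(user, month) for user, month in ldap_rows}
print(sorted(ldap_keys))
```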
Analyzing the Logon Data
The logon data contains details of logon and logoff activity for all users, including:
A unique identifier (A)
User ID and domain (C)
Computer name (D)
To identify unauthorized login activity following user offboarding, we need to find users who logged in or out of their computers in the months after their last month of employment. For example, a user who leaves in June should be offboarded in June, so there should be no activity for that user from July onwards. Cross-checking the logon and LDAP datasets should, in theory, reveal users who attempted to access company resources after being removed from LDAP. Keep in mind that users who depart in the middle of a month are not included in the LDAP file for that month, whereas logon.csv still records their activity during that final month of employment.
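The month arithmetic in this reasoning can be made concrete with a small Python sketch. It is an illustration under stated assumptions rather than the procedure used in this post: months are written as YYYY-MM strings, and both function names are hypothetical.

```python
def month_index(year_month):
    """Convert a "YYYY-MM" string to a running month count for comparison."""
    year, month = year_month.split("-")
    return int(year) * 12 + int(month) - 1


def is_after_offboarding(last_ldap_month, logon_month):
    """Return True if a logon falls after the user's departure month.

    The month right after the last LDAP snapshot is treated as the departure
    month, so activity there is still expected. Anything later is suspicious.
    """
    return month_index(logon_month) > month_index(last_ldap_month) + 1


# A user last listed in May's LDAP file left during June.
print(is_after_offboarding("2010-05", "2010-06"))  # June: departure month
print(is_after_offboarding("2010-05", "2010-07"))  # July: after offboarding
```

Converting each month to a single integer keeps the comparison correct across year boundaries, for example from December 2010 to January 2011.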
The next step was to split the logon.csv user column (C) to separate the user ID from the domain. For this, we used Gigasheet's split column function with a forward slash as the separator, producing two new columns: one for the domain (USER-DATE_SPLIT_1) and one for the user ID (USER-DATE_SPLIT_2).
Similarly, we split the date column (B) using a hyphen as a separator, placing the year, month, and date/time in three columns.
To cross-check the logon.csv and ldap.csv files, we need a way to tie the user ID to the year and month in which the logon activity occurred, similarly to how we combined the user ID and date in the LDAP file. To do this, we used Gigasheet's combine columns function, connecting the user ID, year, and month columns using a hyphen as a separator and placing the results in a new column named user-last-month.
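The split and combine steps for the logon data can be sketched in Python as follows. The sketch assumes the user column looks like DOMAIN/USERID and the date column looks like YYYY-MM-DD followed by a time; the function name, the domain EXAMPLE, and the sample timestamp are hypothetical.

```python
def logon_key(user_field, date_field):
    """Build a user-year-month key from raw logon.csv fields."""
    # Split "DOMAIN/USERID" on the first forward slash.
    _domain, user_id = user_field.split("/", 1)
    # Split the date on hyphens into year, month, and the day/time remainder.
    year, month, _day_and_time = date_field.split("-", 2)
    # Recombine with hyphens, matching the key format used for the LDAP file.
    return f"{user_id}-{year}-{month}"


print(logon_key("EXAMPLE/IGB0310", "2010-01-04 07:12:31"))
```

Building the key with the same separator and field order on both sides is what makes the two files directly comparable.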
Next, we performed a cross-file lookup of the logon.csv user-last-month column against the ldap.csv user-last-month column, resulting in an additional column displaying the lookup results.
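Outside Gigasheet, the cross-file lookup amounts to a membership test of each logon key against the set of LDAP keys. The Python sketch below uses made-up keys in the user-last-month format purely for illustration.

```python
# Made-up keys in the user-last-month format described above.
ldap_keys = {
    "IGB0310-2009-12",
    "IGB0310-2010-01",
    "CIB0489-2009-12",
}
logon_keys = [
    "IGB0310-2010-01",
    "CIB0489-2009-12",
    "CIB0489-2010-02",
]

# Lookup column: True if the logon key has a matching LDAP key.
lookup = [(key, key in ldap_keys) for key in logon_keys]

# Rows with no match are candidates for activity outside LDAP membership.
unmatched = [key for key, found in lookup if not found]
print(unmatched)
```

A miss is a lead rather than proof: as noted earlier, users who depart mid-month are absent from that month's LDAP file even though their logons in that month are still legitimate.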