  • Syed Hasan

Data Science for Incident Response

Incident responders frequently battle inadequate logging, short response timeframes, and a high volume of endpoint logs. Scale this to a larger organization with a couple hundred compromised endpoints, and it’s the quality of the analysis that takes a direct hit. It doesn’t have to.

Event-by-event analysis has always been a time-consuming activity. Rather than hunting for compromises that way, responders can borrow a helping hand from statistics and data science. By uncovering hidden patterns, trends, and potential indicators of misuse in a plethora of data, responders can make more educated guesses and respond effectively.



Analytics for Incident Responders: Drawing Quick Conclusions

In short, data analytic techniques allow us to:

“...inspect, cleanse, transform, and model data with the goal of discovering useful information, informing conclusions, and supporting decision-making.”

Incident response (IR) analysts often deal with raw data (in the shape of event logs) and want to sift the valuable information from the junk, so the benefits of data science carry over to security as well. Whether you’re preventing, detecting, or responding to an incident, the ability to organize and triage data effectively leads to better conclusions.

However, the current tooling and techniques for data science demand an immense amount of time or command-line expertise, which presents a major trade-off. Libraries like Pandas, as we’ll see later, are great for exploring data on a massive scale but require expertise to use. Let’s explore the traditional means of applying these techniques and then put them up against Gigasheet, a new way of exploring data and the relationships within it.

Traditional Means of Data Analysis

Python is used extensively in artificial intelligence, machine learning, and data science. Pandas, a Python package for real-world data analysis, is also a favorite of many expert IR analysts and threat hunters for swift exploration of data. Take a look at Roberto Rodriguez’s amazing analysis of Windows-based logs using a variety of data analytic techniques.

Security datasets almost always require some form of pre-processing before they can be imported into a Pandas dataframe. For example, you might need to flatten nested fields, normalize frames, and run other generic operations (grouping, filtering, etc.) to get basic information about the imported dataset. This can be time-consuming and isn’t always trivial. Even if you’ve memorized all the specific commands to fulfil your objectives, you still need a specialized environment configured for this purpose. PowerShell and its extremely handy cmdlets can also serve as an alternative to a full-fledged SIEM. Take a look; here, an analyst (on Twitter) has performed a quick frequency analysis of the binaries executed on a system (via Event ID ‘4688’).

Using PowerShell for Frequency Analysis

What hasn’t gone away is the chunky command needed to do so. Although you can compile dozens of such recipes, the problem still stands.

Get-WinEvent -FilterHashtable @{LogName='Security'; Id=4688} | Where-Object { $_.Properties[5].Value -ne "Registry" } | Select-Object @{Name="Process"; Expression={ $_.Properties[5].Value }} | Group-Object -Property Process | Select-Object Count, Name | Sort-Object Count
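
For comparison, here is a minimal Python sketch of the same frequency analysis using only the standard library. It assumes the 4688 events have already been exported into a list of dictionaries; the `NewProcessName` key and the sample records are hypothetical stand-ins for whatever your export actually produces.

```python
from collections import Counter


def process_frequency(events, exclude=("Registry",)):
    """Count executions per process name, least frequent first."""
    counts = Counter(
        event["NewProcessName"]
        for event in events
        if event.get("NewProcessName") and event["NewProcessName"] not in exclude
    )
    # Ascending order (like Sort-Object Count) puts rare binaries at the top.
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))


# Hypothetical sample of exported 4688 events.
sample_events = [
    {"NewProcessName": r"C:\Windows\System32\cmd.exe"},
    {"NewProcessName": r"C:\Windows\System32\cmd.exe"},
    {"NewProcessName": "Registry"},
    {"NewProcessName": r"C:\Windows\System32\whoami.exe"},
]

frequency = process_frequency(sample_events)
for name, count in frequency:
    print(count, name)
```

It is shorter to read than the one-liner, but it still assumes a working Python environment and an export step, which is exactly the overhead discussed here.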

Cybersecurity data science is not easy for beginners and is often just a little too inconvenient for seasoned veterans. If you’re only somewhat familiar with Python, or aren’t a PowerShell expert, these simple yet powerful data science techniques are cumbersome and often overlooked. How can we actually solve this? Gigasheet!

Incident Response Data Analysis With Gigasheet

Gigasheet offers a data analysis workbench for sifting through millions of logs to surface meaningful relationships. With manual configuration, command lines, and extensive scripting out of the equation, you’re a few steps away from swift analysis of data.

To showcase how Gigasheet can assist you during your incident response engagements, consider this example. Rather than focusing on a single log channel, we’ve taken help from Samir’s (@sbousseaden) GitHub repository, EVTX-ATTACK-SAMPLES, which contains logs (evidence) from some of the most notorious attack techniques to date. Using this dataset, let’s see what relationships or facts we can derive to help us prioritize strategies for particular threats.


View EVTX attack samples in Gigasheet ➡️

No account? No problem. It’s free to sign up here.


Let’s begin. We’ve imported the CSV sheet (containing details from each attack simulation) into Gigasheet. Rather than downloading the file locally, we can simply pass the link to Gigasheet, which ingests and processes the file for us. Here’s a fun fact while the file ingests: Gigasheet can easily process files up to 50GB (or more for Enterprise users), and it parses EVTX files automatically. This makes it well suited to exploratory threat hunts, directed incident response, and regular analysis.

Options for Uploading Datasets in Gigasheet

Without querying Pandas attributes like shape to retrieve the dimensions of the data, I can see the number of rows and columns in my dataset from the Files menu. With the data upload and metadata check out of the way, let’s take a look at the data and see what we can uncover.
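
For reference, the scripted version of that metadata check looks roughly like this. It is a sketch using Python’s built-in csv module rather than Pandas (where the equivalent is the `shape` attribute of a dataframe), and the column names and rows in the sample are made up for illustration.

```python
import csv
import io


def csv_shape(handle):
    """Return (data_rows, columns) for a CSV file object with a header row."""
    reader = csv.reader(handle)
    header = next(reader, [])
    row_count = sum(1 for _ in reader)
    return row_count, len(header)


# Hypothetical three-column export with two data rows.
sample_csv = io.StringIO(
    "EventID,Channel,Image\n"
    "1,Microsoft-Windows-Sysmon/Operational,cmd.exe\n"
    "5145,Security,\n"
)

shape = csv_shape(sample_csv)
print(shape)
```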

Log Channels

The ‘Security’ and ‘Sysmon’ log channels are perhaps the first priority of every IR analyst. However, several other log channels, with far lower volume, contain logs that are useful in analysis. Let’s take a look at a few examples in our dataset.

I’ve simply grouped the rows by the log channel column and sorted the groups in descending order. We can see that the Application, Terminal Services, and Remote Desktop Services log channels also hold data. Take failed Remote Desktop Protocol (RDP) login attempts as an example: although the Security log channel offers value via event ID 4625 (4624 covers successful logons), the dedicated RDP channels enrich the picture with more valuable information. Such events, once put on a timeline, can also help map the incident from start to finish.

Aggregating Logs Based on 'Log Channel'
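
If you wanted to reproduce this grouping in a script, a minimal sketch could look like the following. The `Channel` column name and the sample rows are assumptions for illustration; the real export may name the column differently.

```python
from collections import Counter


def count_by(rows, column):
    """Group rows by one column and return (value, count) pairs, largest first."""
    counts = Counter(row[column] for row in rows if row.get(column))
    return counts.most_common()


# Hypothetical rows standing in for the imported dataset.
sample_rows = [
    {"Channel": "Security"},
    {"Channel": "Security"},
    {"Channel": "Security"},
    {"Channel": "Microsoft-Windows-Sysmon/Operational"},
    {"Channel": "Microsoft-Windows-Sysmon/Operational"},
    {"Channel": "Application"},
]

channel_counts = count_by(sample_rows, "Channel")
for channel, count in channel_counts:
    print(count, channel)
```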

Event IDs

Evidence of execution is also logged in the Security log channel, albeit not by default. Beyond that, several other Windows event IDs offer value and help connect the dots as to what might’ve happened on the endpoint.

Rather than scrolling through your columns, you can open the sidebar and search for the column you want to group by. Here, I simply pick up the EventID column and drop it into the Row Groups field.

Grouping Rows Based on 'EventID'

Let’s sort the data again and see which Event IDs offer valuable insights into the attack data.

Aggregating Logs Based on 'EventID'

At the top, we have Event ID 5145, which logs access to shares exposed by a particular endpoint. Event ID 1 is the classic process execution event registered by the Sysmon agent on the endpoint, and ID 7 refers to Image Load operations. You can dig further into these events to see which fields are consistent across IDs or might help drive your incident response efforts.

Another way we can use this dataset is to improve our detection and monitoring. Say you have no clue about Event ID 4663. A simple Google search for the ID reveals its link to object access auditing and how to enable it using the local security policy. These events can generate a very high volume of logs, though, so depending on your environment, you might even use this dataset to prioritize which Event IDs require logging (or which filters can reduce the noise).
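
One rough way to reason about that noise is to look at each Event ID’s share of the total volume. The sketch below does this for a made-up list of IDs; the counts are purely illustrative and are not taken from the dataset.

```python
from collections import Counter


def volume_share(event_ids):
    """Return (event_id, count, percent_of_total) tuples, noisiest first."""
    counts = Counter(event_ids)
    total = sum(counts.values())
    return [
        (event_id, count, round(100 * count / total, 1))
        for event_id, count in counts.most_common()
    ]


# Illustrative volumes only.
sample_ids = [5145] * 5 + [1] * 3 + [7] * 2

shares = volume_share(sample_ids)
for event_id, count, percent in shares:
    print(f"Event ID {event_id}: {count} events ({percent}%)")
```

An ID that dominates the volume is a candidate for tighter filtering rather than for being switched off outright; which way to go depends on your environment.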

Suspicious Processes

Frequency analysis of processes can also help us spot anomalies in the binaries being executed on the system. Beyond that, unusual patterns such as net, ipconfig, and other utilities being run within a short span of time might indicate active reconnaissance on the endpoint.

To perform the frequency analysis, let’s filter the Event ID field on the value ‘1’ (to keep only process execution events).
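
The "short span of time" idea can be sketched as a sliding window over execution events. Everything specific here is an assumption for illustration: the tool list, the two-minute window, the threshold of three distinct tools, and the `time` and `image` keys on each event.

```python
from datetime import datetime, timedelta

# Illustrative list only; tune it to the utilities you care about.
RECON_TOOLS = {"net.exe", "ipconfig.exe", "whoami.exe", "systeminfo.exe"}


def basename(path):
    """Lower-cased file name from a Windows-style path."""
    return path.replace("/", "\\").rsplit("\\", 1)[-1].lower()


def find_recon_bursts(events, window=timedelta(minutes=2), min_distinct=3):
    """Report windows in which several distinct recon tools ran close together."""
    hits = sorted(
        (event["time"], basename(event["image"]))
        for event in events
        if basename(event["image"]) in RECON_TOOLS
    )
    bursts = []
    start = 0
    for end in range(len(hits)):
        # Shrink the window from the left until it spans at most `window`.
        while hits[end][0] - hits[start][0] > window:
            start += 1
        tools = {name for _, name in hits[start:end + 1]}
        if len(tools) >= min_distinct:
            bursts.append((hits[start][0], hits[end][0], sorted(tools)))
            start = end + 1  # start a fresh window after a reported burst
    return bursts


t0 = datetime(2021, 1, 1, 10, 0, 0)
sample_events = [
    {"time": t0, "image": r"C:\Windows\System32\cmd.exe"},
    {"time": t0 + timedelta(seconds=10), "image": r"C:\Windows\System32\net.exe"},
    {"time": t0 + timedelta(seconds=20), "image": r"C:\Windows\System32\ipconfig.exe"},
    {"time": t0 + timedelta(seconds=45), "image": r"C:\Windows\System32\whoami.exe"},
    {"time": t0 + timedelta(hours=3), "image": r"C:\Windows\System32\ipconfig.exe"},
]

bursts = find_recon_bursts(sample_events)
for first, last, tools in bursts:
    print(first, last, tools)
```

This complements the frequency analysis rather than replacing it, and the walkthrough that follows sticks to the simpler frequency approach.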

Applying Filters in Gigasheet

Now that the ID is filtered, we have 843 logs to work with. Let’s group the remaining logs based on the Image field and sort the list.
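
Scripted, the same filter-then-group step might look like the sketch below. The `EventID` and `Image` keys mirror the fields named above, but the exact column names in an export, and the sample rows, are assumptions.

```python
from collections import Counter


def executions_by_image(rows, event_id="1"):
    """Keep rows with the given Event ID, then count them per image file name."""
    filtered = [row for row in rows if row.get("EventID") == event_id]
    counts = Counter(
        row["Image"].rsplit("\\", 1)[-1].lower()
        for row in filtered
        if row.get("Image")
    )
    return len(filtered), counts.most_common()


# Hypothetical rows; values are strings, as they would be when read from a CSV.
sample_rows = [
    {"EventID": "1", "Image": r"C:\Windows\System32\cmd.exe"},
    {"EventID": "1", "Image": r"C:\Windows\System32\cmd.exe"},
    {"EventID": "1", "Image": r"C:\Windows\System32\svchost.exe"},
    {"EventID": "7", "Image": r"C:\Windows\System32\rundll32.exe"},
    {"EventID": "5145", "Image": ""},
]

matched, image_counts = executions_by_image(sample_rows)
print(matched, image_counts)
```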

Some of the entries on the list are quite obvious. cmd.exe is the primary process behind anything executed on the system via the console. It’s followed by svchost (which hosts services on the system) and appcmd.exe (one of the many powerful binaries shipped with IIS to configure servers).

Using Gigasheet, you can also filter out processes with no command line, or filter on a different field that might mark a process as non-malicious. Simply head over to the Filters tab and select “is not empty”, and the filtered results should appear.

Applying 'N/A' Filters

We do have some other contenders on the list worth discussing. Rundll32 is used to execute DLLs or their exports on the endpoint, so although it can’t be blacklisted, its execution and command-line patterns should definitely be watched. Consent is used by User Account Control (UAC) to seek the user’s consent before performing administrative tasks.

If you take a look at the bottom, there’s an interesting process. UACME is an open-source project used to bypass UAC on endpoints. Processes of this kind reveal important information if found early during threat hunts and help draft a timeline (based on execution time).

What’s Next?

Data science has already proven its mettle in other fields. It's high time incident response analysts used these techniques to produce value from raw data sources. We've shown how Gigasheet can help you, as an analyst, quickly upload logs, then filter, group, and identify patterns using simple data analysis techniques to validate your initial hypothesis and draw conclusions.

That's it, folks. If you're interested in the beta run of Gigasheet, click here to sign up, add sample data to your account, and get started!