  • Syed Hasan

Data Science for Incident Response

Incident responders frequently battle inadequate logging, short response timeframes, and a high volume of endpoint logs. Scale this up to a larger organization with a couple hundred compromised endpoints, and it’s the quality of the analysis that takes a direct hit. It doesn’t have to.

Event-based analysis has always been a time-consuming activity. Rather than hunting for compromises one event at a time, responders can borrow a helping hand from statistics and data science. By uncovering hidden patterns, trends, and potential indicators of misuse in a plethora of data, responders can make more educated guesses and respond effectively.



Analytics for Incident Responders: Drawing Quick Conclusions

In short, data analytic techniques allow us to:

“...inspect, cleanse, transform, and model data with the goal of discovering useful information, informing conclusions, and supporting decision-making.”

Incident response (IR) analysts often deal with raw data (in the shape of event logs) and want to sift the valuable information from the junk. The benefits of data science carry over to the field of security as well. Whether it’s preventing, detecting, or responding to an incident, the ability to organize and triage data effectively leads to better conclusions.

However, the tooling and techniques currently used for data science demand an immense amount of time or command-line expertise, which presents a major trade-off. Libraries like Pandas, as we’ll see later, are great for exploring data on a massive scale but require expertise to use. Let’s explore the traditional means of applying these techniques and then put them up against Gigasheet, a new way of exploring data and the relationships within it.

Traditional Means of Data Analysis

Python is used extensively in the fields of artificial intelligence, machine learning, and data science. Pandas, a Python package for real-world data analysis, is also a favorite of many expert IR analysts and threat hunters for swift exploration of data. Take a look at Roberto Rodriguez’s amazing analysis of Windows-based logs, which uses a variety of data analytic techniques.

Security datasets almost always require some form of pre-processing before they can be imported into a Pandas dataframe. For example, you might need to flatten nested fields, normalize frames, and run other generic commands (grouping, filtering, etc.) to get basic information about the imported dataset. This can be time-consuming and isn’t always trivial. Even if you’ve memorized all the specific commands needed to fulfil your objectives, you still need a specialized environment configured for this purpose.

PowerShell and its extremely handy cmdlets can also serve as an alternative to a full-fledged SIEM. Take a look: here, an analyst (on Twitter) has performed a quick frequency analysis of the binaries executed on a system (via Event ID 4688).

Using PowerShell for Frequency Analysis

What remains, however, is the chunky command needed to do so. Even if you compile dozens of such recipes, the problem still stands.

Get-WinEvent -FilterHashtable @{LogName='Security'; Id=4688} | Where-Object { $_.Properties[5].Value -ne "Registry" } | Select-Object @{Name="Process"; Expression={ $_.Properties[5].Value }} | Group-Object -Property Process | Select-Object Count, Name | Sort-Object Count
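For comparison, here is a minimal Pandas sketch of the same idea: flatten nested event records, keep the process-creation events, and count how often each binary appears. The record layout and sample values are hypothetical, hand-made stand-ins for an exported Security log, and the helper name process_frequency is ours; real exports will likely need different field names.

```python
import pandas as pd

# Hypothetical, hand-made records standing in for exported Security events.
events = [
    {"EventID": 4688, "EventData": {"NewProcessName": r"C:\Windows\System32\cmd.exe"}},
    {"EventID": 4688, "EventData": {"NewProcessName": r"C:\Windows\System32\cmd.exe"}},
    {"EventID": 4688, "EventData": {"NewProcessName": r"C:\Windows\System32\whoami.exe"}},
    {"EventID": 4688, "EventData": {"NewProcessName": "Registry"}},
    {"EventID": 4624, "EventData": {"TargetUserName": "alice"}},
]


def process_frequency(records):
    """Count process-creation events per binary, rarest first."""
    # Flatten nested fields into columns such as "EventData.NewProcessName".
    df = pd.json_normalize(records)
    # Keep only process-creation events (Event ID 4688).
    created = df[df["EventID"] == 4688]
    names = created["EventData.NewProcessName"]
    # Mirror the PowerShell filter that drops the "Registry" entry.
    names = names[names != "Registry"]
    # Ascending order puts the least frequent binaries at the top.
    return names.value_counts(ascending=True)


freq = process_frequency(events)
print(freq)
```

Sorting in ascending order matches the PowerShell pipeline above, and it puts rarely executed binaries, which are often the interesting ones during triage, at the top of the output.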

Cybersecurity data science is not easy for beginners and is often a little too inconvenient even for seasoned veterans. If you’re only somewhat familiar with Python, or aren’t a PowerShell expert, these simple yet powerful data science techniques are cumbersome and often overlooked. How can we actually solve this? Gigasheet!

Incident Response Data Analysis With Gigasheet

Gigasheet offers a data analysis workbench for sifting through millions of logs and surfacing meaningful relationships. With manual configuration, command lines, and extensive scripting out of the equation, you’re a few steps away from swift analysis of your data.

To showcase how Gigasheet can assist you during your incident response engagements, consider this example. Rather than focusing on a single type of log channel, we’ve drawn on Samir’s (@sbousseaden) GitHub repository, EVTX-ATTACK-SAMPLES, which contains logs (evidence) from some of the most notorious attack techniques to date. Using this dataset, let’s see what relationships or facts we can derive to help us prioritize strategies for particular threats.


View EVTX attack samples in Gigasheet ➡️



Let’s begin. We’ve imported the CSV sheet (containing details from each attack simulation) into Gigasheet. Rather than downloading the file locally, we can simply pass the link to Gigasheet, which ingests and processes the file for us. Here’s a fun fact while the file ingests: Gigasheet can easily process files up to 50GB (or more for Enterprise users), and it parses EVTX files automatically. This makes it well suited to exploratory threat hunts, directed incident response, and regular analysis.

Options for Uploading Datasets in Gigasheet

Without reaching for Pandas attributes like shape to retrieve the dimensions of the data, I can see the number of rows and columns in my dataset from the Files menu. Now that the data upload and metadata check are out of the way, let’s take a look at the data and see what we can uncover.
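For reference, the Pandas step being skipped here looks roughly like the sketch below. The tiny dataframe and its column names are made up purely for illustration; shape itself is a standard DataFrame attribute that returns a (rows, columns) tuple.

```python
import pandas as pd

# Made-up miniature dataset; a real one would be loaded from the exported CSV.
df = pd.DataFrame(
    {
        "EventID": [4688, 4624, 4625],
        "Channel": ["Security", "Security", "Security"],
    }
)

# shape is an attribute (not a method) holding (number of rows, number of columns).
rows, cols = df.shape
print(f"{rows} rows x {cols} columns")
```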

Log Channels

The ‘Security’ and ‘Sysmon’ log channels are perhaps the first priority of every IR analyst. However, there are several other log channels, with far lower volume, whose logs are also useful in analysis. Let’s take a look at a few examples in our dataset.

I’ve simply grouped the rows by the log channel column and sorted the groups in descending order of size. We can see that the Application, Terminal Services, and Remote Desktop Services log channels also contain data. Take failed Remote Desktop Protocol (RDP) login attempts as an example: although the Security log channel offers value via event ID 4625 (failed logon), the dedicated RDP channel enriches the picture with more valuable information. Such events, once put on a timeline, can also help map the incident from start to finish.
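For readers who want to reproduce the grouping and the timeline outside Gigasheet, a rough Pandas sketch follows. The column names (Channel, EventID, TimeCreated) and every sample row are assumptions made for illustration, not taken from the EVTX-ATTACK-SAMPLES export, whose headers may differ.

```python
import pandas as pd

# Made-up sample rows; column names are assumed, not taken from the real dataset.
logs = pd.DataFrame(
    {
        "Channel": [
            "Security",
            "Microsoft-Windows-Sysmon/Operational",
            "Security",
            "Application",
            "Microsoft-Windows-TerminalServices-RemoteConnectionManager/Operational",
            "Microsoft-Windows-Sysmon/Operational",
            "Security",
        ],
        "EventID": [4625, 1, 4625, 1000, 1149, 3, 4624],
        "TimeCreated": [
            "2021-03-01 10:05:00",
            "2021-03-01 10:01:00",
            "2021-03-01 10:06:00",
            "2021-03-01 09:59:00",
            "2021-03-01 10:07:00",
            "2021-03-01 10:02:00",
            "2021-03-01 10:08:00",
        ],
    }
)

# Group by log channel and sort the group sizes in descending order.
channel_counts = logs.groupby("Channel").size().sort_values(ascending=False)
print(channel_counts)

# Build a simple timeline: parse the timestamps, then order events chronologically.
logs["TimeCreated"] = pd.to_datetime(logs["TimeCreated"])
timeline = logs.sort_values("TimeCreated").reset_index(drop=True)
print(timeline[["TimeCreated", "Channel", "EventID"]])
```

The low-volume channels end up at the bottom of channel_counts, which makes them easy to spot and review alongside the Security and Sysmon entries on the timeline.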