Data exploration, often called exploratory data analysis, is the first and most critical phase in the data analysis pipeline. How so?
This is where analysts give the data an initial review: understanding its shape, deriving patterns, and spotting anomalies (we’ll uncover examples of each later). Most analysts take a hybrid approach to exploration, combining manual inspection with automated tools.
Pandas, NumPy, and Matplotlib are three of the most commonly used Python libraries for this purpose. Gigasheet enables you to do all the same without writing a single line of code! 🤯
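For readers who prefer code, here’s a rough sketch of what that first manual pass looks like in pandas – on a tiny made-up frame, since with a real dataset you’d load your own file:

```python
import pandas as pd

# A tiny made-up sample standing in for a real dataset
df = pd.DataFrame({
    "airline": ["United", "Delta", "United", None],
    "airline_sentiment": ["negative", "positive", "negative", "neutral"],
})

df.info()                           # column types and non-null counts
print(df.describe(include="all"))   # quick summary statistics
print(df.isna().sum())              # null values per column
```

Three lines of inspection code per question adds up fast – which is exactly the friction the no-code workflow below removes.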
Let’s explore a data set together and see Gigasheet in action!
We’ll be breaking our analysis workflow into three steps:
Have your data in flat files like CSV or JSON? We parse these files into spreadsheets in seconds. Gigasheet currently supports the following file formats:
Keep your data sets stashed in personal cloud storage platforms like Google Drive or OneDrive? No worries!
Have your data in cloud services like AWS S3, AWS Redshift, or ActiveCampaign? Our data connectors are ready to pull your data into our platform! We’ve got over 100 integrations and a lot more in the pipeline to support you data analysts.
On a test drive but don’t have sample data sets to work with? Gigasheet recently launched the Data Community from which you can easily open up a dataset in our web-based Spreadsheet and get to work.
To showcase data exploration with Gigasheet, I’ll be using the US Airline Sentiment dataset from Kaggle, which was scraped from Twitter – these are tweets from February 2015 by customers of US-based airlines. A few objectives here might be to find:
Using the file upload functionality, I’ll simply upload my CSV sheet to Gigasheet and let it do the processing. Wait for a few seconds and your sheet should be ready for viewing. Let’s move on to the next phase.
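For comparison, the code equivalent of this step is a single pandas call. A minimal sketch, using an in-memory CSV so the snippet is self-contained – with a real export you’d pass the file path instead:

```python
import pandas as pd
from io import StringIO

# Stand-in for the uploaded file; with a real file you'd pass its path
csv_data = StringIO("airline,airline_sentiment\nUnited,negative\nDelta,positive\n")
df = pd.read_csv(csv_data)
print(df.shape)  # (2, 2)
```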
Before anything, I can see several columns which are not necessary for analysis. So, let’s start by removing them from our data set. Here’s what I’ve removed:
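In pandas, the same clean-up is one `drop` call. The column names below are placeholders for illustration, not necessarily the exact ones removed above:

```python
import pandas as pd

df = pd.DataFrame({
    "tweet_id": [1, 2],
    "airline": ["United", "Delta"],
    "airline_sentiment": ["negative", "positive"],
    "tweet_coord": [None, None],  # example of a column not needed for analysis
})

# Drop columns not needed for analysis; errors="ignore" skips absent names
df = df.drop(columns=["tweet_id", "tweet_coord"], errors="ignore")
print(list(df.columns))  # ['airline', 'airline_sentiment']
```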
Next, a good practice is to check for null values, especially in columns that are important for analysis. A quick and easy way to check is to group on a column and see whether any of its values are empty (or another form of null). Here’s an example of me grouping the airline column:
If grouping is not what you’re after, you can also pick a column and look for null values by applying the is empty filter on the desired column. Here’s me doing the same for the airline column with filters:
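Both null checks have rough pandas equivalents: grouping with `dropna=False` surfaces empty values as their own group, and a boolean mask plays the role of the is empty filter. A sketch on a made-up airline column:

```python
import pandas as pd

df = pd.DataFrame({"airline": ["United", "Delta", None, "United"]})

# Grouping with dropna=False shows null values as their own group
print(df.groupby("airline", dropna=False).size())

# The filter-style equivalent of an "is empty" check
empty_rows = df[df["airline"].isna()]
print(len(empty_rows))  # 1
```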
Fun Fact: You can also save your filters to avoid rewriting filters for analysis at a later stage. Just save the filter and you can easily select it later from the dropdown!
Next, let’s map the sentiments against their counts for a quick check on how our data is distributed. To do so, I’ll first group the data based on airline_sentiment, then use the dropdown by the airline column to select Count.
Once selected, let’s select the data, right click, and choose Chart Range to render it as a column chart:
Here’s how the sentiments look via a chart:
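The code equivalent of this group-and-count is `value_counts`. A minimal sketch on made-up sentiment labels – the commented line would draw the same column chart with matplotlib:

```python
import pandas as pd

df = pd.DataFrame({
    "airline_sentiment": ["negative", "negative", "neutral", "positive", "negative"],
})

# Equivalent of grouping on airline_sentiment and aggregating with Count
counts = df["airline_sentiment"].value_counts()
print(counts)
# counts.plot(kind="bar")  # renders the same column chart via matplotlib
```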
That’s a lot of negative sentiments. Next, let’s break these down by airline and see how the data looks. Start off by grouping the data on the airline column, and then on the sentiment column. You can now apply the Count aggregation to any non-numeric field to use in our graph.
Here, I’ve selected the Count aggregation on the name field. Now let’s select all grouped data again and stack it to see how the data looks. Here’s my selection:
And here’s how the grouped data looks in a chart:
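In pandas, this two-level grouping is a `groupby` on both columns followed by `unstack`. A sketch with made-up rows, counting via the name field as above:

```python
import pandas as pd

df = pd.DataFrame({
    "airline": ["United", "United", "Delta", "Delta", "United"],
    "airline_sentiment": ["negative", "negative", "positive", "negative", "neutral"],
    "name": ["a", "b", "c", "d", "e"],  # any non-numeric field works for Count
})

# Group by airline, then sentiment, counting rows via the name field
grouped = df.groupby(["airline", "airline_sentiment"])["name"].count()
table = grouped.unstack(fill_value=0)  # airlines as rows, sentiments as columns
print(table)
# table.plot(kind="bar", stacked=True)  # the stacked column chart
```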
Fun Fact: You can further modify the graph by selecting the arrow on the top right (not visible in the screenshot). Add a title, remove the legends, change the colors – practically anything you could do with matplotlib and plotly – you can do here.
You might notice that United, US Airways, and American have a lot more ‘negative’ sentiment tweets than other airlines. This makes for a skewed dataset, and you might want to work with a more balanced sample for rigorous results. However, we’ll continue with our data exploration as this is simply a test run.
One thing we could do is to take a look at ‘why’ these negative sentiments are high. I’ll start by removing positive and neutral values from the dataset. Group the negativereason field and aggregate any field to the Count type. Here’s how the data looks in a bar graph:
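The same filter-then-group step, sketched in pandas on a few made-up rows:

```python
import pandas as pd

df = pd.DataFrame({
    "airline_sentiment": ["negative", "negative", "positive", "negative"],
    "negativereason": ["Customer Service Issue", "Late Flight", None,
                       "Customer Service Issue"],
})

# Keep only negative tweets, then count each negative reason
negative = df[df["airline_sentiment"] == "negative"]
reasons = negative["negativereason"].value_counts()
print(reasons)
# reasons.plot(kind="barh")  # the same bar graph, via matplotlib
```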
Now if you were a firm looking to improve a particular business function – just this simple graph on Gigasheet can help identify a weak point. Here, Customer Service Issue seems to be a recurring problem.
You can further break this data down by individual airline, e.g. look for negative reasons against United airlines. To do so, first group the data by airline and then by negativereason. Again, Customer Service Issue comes out to be the root cause of these negative tweets on Twitter.
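For the per-airline breakdown, pandas’ `crosstab` gives the same two-level view. A sketch with made-up rows:

```python
import pandas as pd

df = pd.DataFrame({
    "airline": ["United", "United", "Delta"],
    "negativereason": ["Customer Service Issue", "Customer Service Issue",
                       "Late Flight"],
})

# Two-level grouping: negative reasons broken down per airline
per_airline = pd.crosstab(df["airline"], df["negativereason"])
print(per_airline)
print(per_airline.loc["United"].idxmax())  # United's most common reason
```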
There’s a lot more you can do to further explore this data and see what models you’d apply in the next phase of your Data Analysis process. This is where exploration ends for me and this write-up. What other exploration use-cases can you think of here?
Frustrated with how much code you have to write to analyze your data? That’s exactly why Gigasheet promotes no-code data science. Load your data, explore it, and get to analysis without writing hundreds of lines of code!
If you’re looking for more fun blogs on using Gigasheet for Data Science and Data Analysis, give these a read: