Python is the go-to language for most data analysts processing big CSV files. But it requires a fair amount of coding and doesn’t guarantee results on large datasets. If you’ve ever pointed Python’s NumPy at a CSV file too big to open in a traditional spreadsheet, you’ll know the library can fail as well. Luckily, Gigasheet’s #NoCode parser can help you sift through the data in no time.
Let’s see NumPy in action and later switch the #NoCode mode on Gigasheet to see a comparison of the two in opening big CSV files for analysis.
NumPy, short for Numerical Python, is a Python library for working with numerical data and large arrays. Python’s standard lists are quite inefficient, especially when it comes to large datasets.
Does NumPy really solve the issue? To some extent, yes it does. However, you’re still likely to hit a bottleneck when it comes to loading big files as some functions in NumPy attempt to load entire datasets to memory. That’s only going to work if you’ve got the memory to hold all that data in.
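To get a rough sense of why memory becomes the bottleneck, here’s a quick back-of-envelope estimate (assuming 64-bit floats; the row and column counts are made up for illustration) of how much RAM a numeric dataset occupies once fully loaded:

```python
import numpy as np

# Rough estimate: a 10-million-row, 10-column array of 64-bit floats
rows, cols = 10_000_000, 10
arr_bytes = rows * cols * np.dtype(np.float64).itemsize
print(f"{arr_bytes / 1024**3:.1f} GiB")  # ~0.7 GiB, before any parsing overhead
```

And that’s just the final array – the parsing step itself typically allocates intermediate Python objects on top of it.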
Let’s look at NumPy in action and see if it’s able to open big CSV files!
NumPy supports loading and operating on CSV files through two main functions: loadtxt and genfromtxt.
So, which one should you use?
Both functions work well for loading smaller datasets, matrices, and multidimensional arrays. However, as mentioned earlier, NumPy is likely to fail when it comes to opening big CSV files.
Why’s that? Let’s take an objective look at both the functions.
The purpose of loadtxt is simple – load data from a text file with a specific format, where every row has the same number of values.
Here’s a code snippet loading a CSV file with an integer data type (since NumPy arrays only hold homogeneous data) using NumPy’s loadtxt function:
import numpy
numpy.loadtxt("sample.csv", dtype=int, delimiter=',')
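If you’d like to try loadtxt end to end, here’s a minimal, self-contained sketch that uses an in-memory CSV (via io.StringIO) in place of a real file:

```python
import io
import numpy as np

# A small in-memory CSV stands in for sample.csv
csv_data = io.StringIO("1,2,3\n4,5,6\n7,8,9")

arr = np.loadtxt(csv_data, dtype=int, delimiter=',')
print(arr.shape)  # (3, 3)
print(arr.sum())  # 45
```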
It’s faster than the genfromtxt function (which we’ll explore next) and is geared toward simpler text files. It is also a bit more memory-efficient, but it can still exhaust your memory and crash if the file doesn’t fit.
By contrast, the genfromtxt function can handle missing values while loading datasets. That helps preprocess the data and saves time in later stages of the data analysis pipeline.
Here’s a sample code snippet that loads a CSV file with NumPy’s genfromtxt function, using the value 2 to fill in columns with missing values:
import numpy
numpy.genfromtxt("sample.csv", delimiter=",", dtype=int, filling_values=2)
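Here’s a self-contained sketch of the same idea, using a tiny in-memory CSV with two missing values so you can see filling_values at work:

```python
import io
import numpy as np

# The second row is missing its last two values
csv_data = io.StringIO("1,2,3\n4,,\n7,8,9")

arr = np.genfromtxt(csv_data, delimiter=",", dtype=int, filling_values=2)
print(arr)
# [[1 2 3]
#  [4 2 2]
#  [7 8 9]]
```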
There’s just one big downside to the genfromtxt function: it loads all of the CSV’s data into memory! So if you’ve got a gigabyte or two of data in a sheet, loading it will consume a heavy chunk of your overall memory.
Here’s an excellent StackOverflow post comparing the genfromtxt and loadtxt functions, with a graphical view of memory usage while loading data.
If you have a similar dataset with gigabytes of data to analyze, NumPy won’t be much help unless your system packs a punch and has enough resources to let the script work in peace.
That’s precisely where Gigasheet’s cloud-based Spreadsheet swoops in for the win. Let’s take a look at it next.
To demonstrate the same operations with Gigasheet, I’ve got a sample dataset of COVID news articles from Kaggle. It’s a 2.5GB+ dataset with three columns – title, content, and category – and roughly 477K rows of article data.
Once you’ve uploaded the dataset, Gigasheet takes anywhere from a few seconds to a few minutes to load your data into an online spreadsheet. Once that’s done, you can do practically anything you want with it – slice, dice, enrich, select, flag, and so much more!
For a test run, I’ve modified the code shared above to use the genfromtxt function to load string data from the CSV dataset. Here it is:
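For reference, here’s a sketch of what a genfromtxt string loader looks like. The tiny in-memory CSV, the column names, and the sample rows below are stand-ins for illustration, not the actual 2.5GB Kaggle file:

```python
import io
import numpy as np

# Stand-in for the real CSV file; swap in a file path for actual data
csv_data = io.StringIO(
    "title,content,category\n"
    "Vaccine update,Long article text,health\n"
    "Markets react,More article text,business\n"
)

# dtype=str loads every field as text; the whole file is still
# materialized in memory at once
data = np.genfromtxt(csv_data, delimiter=",", dtype=str, skip_header=1)
print(data.shape)  # (2, 3)
```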
Using the following command, I’ve also checked how much memory is available on my Mac:
vm_stat | perl -ne '/page size of (\d+)/ and $size=$1; /Pages\s+([^:]+)[^\d]+(\d+)/ and printf("%-16s % 16.2f Mi\n", "$1:", $2 * $size / 1048576);'
Take a look at the system resources on an idle system, with ample memory for the code to execute:
Soon after I execute the code, the loader function consumes all the free memory, leaving nothing available to finish loading the data – let alone perform operations on top of it (which my script doesn’t even attempt yet).
That’s the best part about Gigasheet.
There’s no limitation on how big the CSV file is when using Gigasheet.
Got gigabytes’ worth of data? Or perhaps a dataset with a billion rows? Upload it to Gigasheet, wait a few minutes, and the data is ready for processing, analysis, and more.
Go ahead, upload your data and test the platform out. The platform is stress-tested beyond these numbers and is ready to process Big Data like no other out there.
Size aside, you can work on heterogeneous datasets in Gigasheet – different data types in the same CSV sheet – a feature that isn’t possible with NumPy’s multidimensional arrays.
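You can see that homogeneity constraint for yourself: when you ask genfromtxt to infer mixed column types (dtype=None), NumPy falls back to a one-dimensional structured array rather than a normal 2-D table. A minimal sketch, using a made-up in-memory CSV:

```python
import io
import numpy as np

# A mixed-type CSV: string, integer, float columns
csv_data = io.StringIO("alice,30,5.5\nbob,25,6.1")

# dtype=None asks genfromtxt to guess each column's type
arr = np.genfromtxt(csv_data, delimiter=',', dtype=None, encoding='utf-8')
print(arr.dtype)  # a structured dtype, e.g. [('f0', '<U5'), ('f1', '<i8'), ('f2', '<f8')]
print(arr.ndim)   # 1 -- one record per row, not a 2-D matrix
```

Each row becomes a record, so the usual 2-D slicing and whole-array math no longer apply the way they do for homogeneous arrays.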
Quick note – NumPy isn’t the ideal library for string-type operations, though it does support them, and this particular demonstration uses a very string-heavy dataset. The real comparison, however, is how the two solutions handle large CSV files.
Do you regularly have to analyze big datasets in the good ol’ CSV format? Other code libraries have their advantages, but Python’s NumPy and the rest are no match for our #NoCode parser. Gigasheet can crunch your data in no time and have you started with analysis in mere minutes! Why write code when you can simplify your work with Gigasheet?
Don’t believe me? Sign up for a free account on Gigasheet today and see for yourself!
Interested in reading more about what you can achieve with Gigasheet? Here are a few more articles for your reading list: