Image by Editor | Midjourney
Let’s learn how to perform operations in Pandas with large datasets.
Preparation
As this tutorial is about the Pandas package, you should have it installed. We will also use the NumPy package, so install both.
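For example, both can be installed with pip:

pip install pandas numpy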
Then, let’s get into the central part of the tutorial.
Perform Memory-Efficient Operations with Pandas
Pandas is typically not known for processing large datasets, as memory-intensive operations with the package can take too much time or even consume all of your RAM. However, there are ways to improve the efficiency of Pandas operations.
In this tutorial, we will walk you through ways to enhance your experience when working with large datasets in Pandas.
First, try loading the dataset with memory-optimization parameters. Also, change data types to more memory-friendly ones where possible, and drop any unnecessary columns.
import pandas as pd

# Read only the columns we need and specify a compact dtype up front
df = pd.read_csv('some_large_dataset.csv', low_memory=True, dtype={'col1': 'int32'}, usecols=['col1', 'col2'])
Downcasting integers and floats to the smallest suitable type helps reduce the memory footprint. Converting categorical columns with a small number of unique values to the category type also helps, as does keeping fewer columns overall.
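As an illustration, here is a minimal sketch of downcasting and category conversion (the column names are hypothetical placeholders):

# Downcast numeric columns to the smallest type that fits the data
df['int_column'] = pd.to_numeric(df['int_column'], downcast='integer')
df['float_column'] = pd.to_numeric(df['float_column'], downcast='float')

# Convert a low-cardinality column to the memory-friendly category type
df['categorical_column'] = df['categorical_column'].astype('category')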
Next, we can process the data in chunks to avoid loading everything into memory at once; it is more efficient to process the file iteratively. For example, suppose we want a column's mean, but the dataset is too big to load. We can process 100,000 rows at a time and combine the results.
chunk_results = []

def column_stats(chunk):
    # Return the chunk's sum and row count so the overall mean stays exact
    # even when the last chunk is smaller than the others
    return chunk['target_column'].sum(), chunk['target_column'].count()

chunksize = 100000
for chunk in pd.read_csv('some_large_dataset.csv', chunksize=chunksize):
    chunk_results.append(column_stats(chunk))

total_sum = sum(s for s, _ in chunk_results)
total_count = sum(c for _, c in chunk_results)
final_result = total_sum / total_count
Additionally, avoid using the apply method with lambda functions, as it can be slow and memory-intensive. It is better to use vectorized operations or, when that is not possible, the .apply method with a regular named function.
# Vectorized multiplication instead of df['existing_column'].apply(lambda x: x * 2)
df['new_column'] = df['existing_column'] * 2
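For transformations that are hard to vectorize, a named function passed to .apply is a cleaner alternative to an inline lambda. Here is a minimal sketch (extract_domain, 'email_column', and 'domain' are hypothetical placeholders):

def extract_domain(email):
    # A plain named function used with .apply instead of an inline lambda
    return email.split('@')[-1]

df['domain'] = df['email_column'].apply(extract_domain)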
For conditional operations in Pandas, it is also faster to use np.where rather than a lambda function with .apply:
import numpy as np

# Vectorized conditional instead of df['existing_column'].apply(lambda x: 1 if x > 0 else 0)
df['new_column'] = np.where(df['existing_column'] > 0, 1, 0)
Then, using inplace=True in many Pandas operations is more memory-efficient than assigning the result back to the DataFrame. Assigning it back creates a separate DataFrame before it is stored in the same variable, whereas inplace=True modifies the existing one.
# Drop the column in place instead of df = df.drop(columns=['column_to_drop'])
df.drop(columns=['column_to_drop'], inplace=True)
Lastly, filter the data as early as possible, before any heavy operations. This limits the amount of data we need to process.
# `threshold` is a placeholder for whatever cutoff value fits your data
df = df[df['filter_column'] > threshold]
Master these tips to improve your Pandas experience with large datasets.
Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.