import pandas as pd
import numpy as np
import time
# Create a large DataFrame with random data
np.random.seed(0)
n_rows = 1_000_000
df = pd.DataFrame({
    'A': np.random.rand(n_rows),
    'B': np.random.rand(n_rows),
    'C': np.random.rand(n_rows),
    'D': np.random.rand(n_rows)
})
# Regular Pandas syntax for a new column (without eval)
start_time = time.time()
df['Result1'] = df['A'] + df['B'] * df['C'] - df['D']
regular_time = time.time() - start_time
print(f"Regular computation time: {regular_time:.4f} seconds")
# Use eval for the same operation
start_time = time.time()
df.eval('Result2 = A + B * C - D', inplace=True)
eval_time = time.time() - start_time
print(f"eval() computation time: {eval_time:.4f} seconds")
# Verify both results are (almost) equal
comparison = np.allclose(df['Result1'], df['Result2'])
print(f"Results are equal: {comparison}")
Output:
Regular computation time: 0.1152 seconds
eval() computation time: 0.0487 seconds
Results are equal: True
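Beyond assignment expressions, the top-level pd.eval() applies the same one-pass evaluation across multiple DataFrames. A minimal sketch (df1 and df2 here are illustrative, not from the example above):
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.rand(1000, 3), columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.random.rand(1000, 3), columns=['a', 'b', 'c'])
# The whole expression is evaluated in one pass (via numexpr when installed),
# rather than allocating an intermediate array for each operator
result = pd.eval('df1 + df2 * 2')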
Common Challenges with Large Datasets in Pandas:
1. High Memory Usage
• Pandas loads the entire dataset into memory (RAM).
• Large CSV/Excel files (GBs in size) can cause crashes or slowdowns; see the memory-inspection sketch after this list.
2. Slow Computations
• Operations like groupby(), merge(), or sorting can be slow on millions of rows.
3. Data Loading Bottlenecks
• Reading large files (CSV, Excel) can take a long time.
4. Inefficient Data Types
• By default, Pandas may use memory-inefficient data types (float64, object).
5. Limited Parallelism
• Pandas is mostly single-threaded, limiting speed on multi-core CPUs.
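As a quick check on challenge 1, a minimal sketch for inspecting a DataFrame's true memory footprint (the column names are illustrative):
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'id': np.arange(1_000_000),
    'label': np.random.choice(['alpha', 'beta', 'gamma'], size=1_000_000),
})
# memory_usage='deep' also counts the Python string objects in 'label'
df.info(memory_usage='deep')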
Solutions and Best Practices:
1. Use Efficient Data Types
• Convert to smaller types (float32, int8, category) to reduce memory.
df['id'] = df['id'].astype('int32')                 # 64-bit -> 32-bit integers
df['category'] = df['category'].astype('category')  # repeated strings -> integer codes
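A minimal sketch of the savings such conversions give, assuming an id column of small integers and a low-cardinality category column:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'id': np.arange(1_000_000),
    'category': np.random.choice(['red', 'green', 'blue'], size=1_000_000),
})
before = df.memory_usage(deep=True).sum()
df['id'] = df['id'].astype('int32')
df['category'] = df['category'].astype('category')
after = df.memory_usage(deep=True).sum()
print(f"Before: {before / 1e6:.1f} MB, after: {after / 1e6:.1f} MB")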
2. Load Data in Chunks
• Use chunksize to read large files in pieces.
chunks = pd.read_csv('large_file.csv', chunksize=100000)
for chunk in chunks:
    process(chunk)  # process() stands in for your own per-chunk logic
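A self-contained sketch of per-chunk aggregation, assuming large_file.csv has a numeric value column:
import pandas as pd
total = 0.0
n = 0
for chunk in pd.read_csv('large_file.csv', chunksize=100_000):
    # Only one chunk is held in memory at a time
    total += chunk['value'].sum()
    n += len(chunk)
print(f"Mean of 'value': {total / n:.4f}")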
3. Use dtype Argument While Reading
• Explicitly define data types during import to save memory.
df = pd.read_csv('data.csv', dtype={'id': 'int32', 'flag': 'bool'})
4. Use .query() and .eval() for Speed
• Faster filtering and arithmetic expressions, using the numexpr backend when it is installed.
df.query('value > 100')
df.eval('total = price * quantity', inplace=True)
5. Downcast Numeric Columns
• Reduce memory for integer/float columns using pd.to_numeric() with downcast.
df['value'] = pd.to_numeric(df['value'], downcast='float')  # float64 -> float32 where values fit
6. Use Dask or Vaex for Out-of-Core Processing
• Libraries such as Dask and Vaex process data lazily and in chunks, allowing datasets larger than RAM.
import dask.dataframe as dd
ddf = dd.read_csv('large_file.csv')  # lazy: no data is loaded until you compute
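Dask builds a lazy task graph and runs it only on .compute(); a minimal sketch (the key and value columns are illustrative):
import dask.dataframe as dd
ddf = dd.read_csv('large_file.csv')
# The groupby below only extends the task graph; no data moves yet
mean_by_key = ddf.groupby('key')['value'].mean()
result = mean_by_key.compute()  # triggers chunked, parallel execution
print(result)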
7. Filter Early and Often
• Apply filters as early as possible to reduce data before performing expensive operations.
df = df[df['date'] >= '2023-01-01']
8. Use Indexing for Speed
• Set indexes when filtering or joining data.
df.set_index('id', inplace=True)
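With id as the index, label lookups use pandas' index structure instead of scanning the whole column; a minimal sketch (the id values are illustrative):
import pandas as pd
import numpy as np
df = pd.DataFrame({'id': np.arange(1_000_000), 'value': np.random.rand(1_000_000)})
df.set_index('id', inplace=True)
row = df.loc[12_345]           # single-label lookup via the index
subset = df.loc[[10, 20, 30]]  # several labels at once
print(row, subset, sep='\n')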
Python Script: Time-Series Analysis in Pandas
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# 1. Generate synthetic time-series data
date_rng = pd.date_range(start='2024-01-01', end='2024-03-31', freq='D')
np.random.seed(42)
data = np.random.normal(loc=100, scale=10, size=len(date_rng))
df = pd.DataFrame(data, columns=['Sales'])
df['Date'] = date_rng
df.set_index('Date', inplace=True)
print("First 5 rows of time-series data:")
print(df.head())
# 2. Plot the raw time-series data
df.plot(title='Daily Sales Over Time', figsize=(10, 4))
plt.xlabel("Date")
plt.ylabel("Sales")
plt.grid(True)
plt.show()
# 3. Resample to monthly frequency and compute the mean
monthly_avg = df.resample('M').mean()  # on pandas >= 2.2, prefer 'ME' ('M' is deprecated)
print("\nMonthly Average Sales:")
print(monthly_avg)
# 4. Calculate rolling average (7-day window)
df['7-day MA'] = df['Sales'].rolling(window=7).mean()
# 5. Plot original and rolling average
df[['Sales', '7-day MA']].plot(title='7-Day Rolling Average of Sales', figsize=(10, 4))
plt.xlabel("Date")
plt.ylabel("Sales")
plt.grid(True)
plt.show()
# 6. Time-based filtering via partial-string indexing on the DatetimeIndex
print("\nSales in February 2024:")
print(df.loc['2024-02'])
# 7. Add lag feature (shifted sales)
df['Yesterday Sales'] = df['Sales'].shift(1)
print("\nData with lag feature:")
print(df.head(10))
What This Script Demonstrates:
Step  Feature        Description
1     DatetimeIndex  Setting a time-based index
2     Plotting       Line plot of the time series
3     Resampling     Aggregating daily data to monthly
4     Rolling        Calculating a 7-day moving average
5     Filtering      Time-based data slicing
6     Shifting       Creating lag features