Module 3

import pandas as pd
import numpy as np
import time

# Create a large DataFrame with random data
np.random.seed(0)
n_rows = 1_000_000
df = pd.DataFrame({
    'A': np.random.rand(n_rows),
    'B': np.random.rand(n_rows),
    'C': np.random.rand(n_rows),
    'D': np.random.rand(n_rows)
})

# Regular Pandas syntax for a new column (without eval)
start_time = time.time()
df['Result1'] = df['A'] + df['B'] * df['C'] - df['D']
regular_time = time.time() - start_time
print(f"Regular computation time: {regular_time:.4f} seconds")

# Use eval for the same operation
start_time = time.time()
df.eval('Result2 = A + B * C - D', inplace=True)
eval_time = time.time() - start_time
print(f"eval() computation time: {eval_time:.4f} seconds")

# Verify both results are (almost) equal
comparison = np.allclose(df['Result1'], df['Result2'])
print(f"Results are equal: {comparison}")

Output:

Regular computation time: 0.1152 seconds
eval() computation time: 0.0487 seconds
Results are equal: True


Common Challenges with Large Datasets in Pandas:

1. High Memory Usage

• Pandas loads the entire dataset into memory (RAM).

• Large CSV/Excel files (GBs in size) can cause crashes or slowdowns.

2. Slow Computations

• Operations like groupby(), merge(), or sorting can be slow on millions of rows.

3. Data Loading Bottlenecks

• Reading large files (CSV, Excel) can take a long time.

4. Inefficient Data Types

• By default, Pandas may use memory-inefficient data types (float64, object).

5. Limited Parallelism

• Pandas is mostly single-threaded, limiting speed on multi-core CPUs.

Solutions and Best Practices:

1. Use Efficient Data Types

• Convert to smaller types (float32, int8, category) to reduce memory.

df['id'] = df['id'].astype('int32')

df['category'] = df['category'].astype('category')
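
To see the effect of such conversions, memory_usage(deep=True) reports the per-column footprint. A minimal, self-contained sketch (the column names and sizes here are made up for illustration):

import pandas as pd
import numpy as np

demo = pd.DataFrame({
    'id': np.arange(1_000_000, dtype='int64'),                 # 64-bit integers
    'category': np.random.choice(['A', 'B', 'C'], 1_000_000)   # object (string) dtype
})
print("Before:", demo.memory_usage(deep=True).sum() / 1e6, "MB")

demo['id'] = demo['id'].astype('int32')
demo['category'] = demo['category'].astype('category')
print("After: ", demo.memory_usage(deep=True).sum() / 1e6, "MB")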

2. Load Data in Chunks

• Use chunksize to read large files in pieces.

chunks = pd.read_csv('large_file.csv', chunksize=100000)

for chunk in chunks:
    process(chunk)  # process() stands in for whatever per-chunk work is needed
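
If the goal is a simple aggregate, it can be accumulated chunk by chunk so the full file never sits in memory at once. A sketch, assuming a numeric 'value' column in the (placeholder) large_file.csv:

import pandas as pd

total = 0.0
rows = 0
for chunk in pd.read_csv('large_file.csv', chunksize=100000):
    total += chunk['value'].sum()   # aggregate each piece as it arrives
    rows += len(chunk)
print("Overall mean of 'value':", total / rows)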

3. Use dtype Argument While Reading

• Explicitly define data types during import to save memory.

df = pd.read_csv('data.csv', dtype={'id': 'int32', 'flag': 'bool'})

4. Use .query() and .eval() for Speed

• Faster filtering and arithmetic expressions using the NumExpr backend.

df.query('value > 100')

df.eval('total = price * quantity', inplace=True)
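
query() can also reference local Python variables with the @ prefix. A small self-contained sketch (the columns and threshold are illustrative, not from the module):

import pandas as pd

df = pd.DataFrame({'price': [10.0, 25.0, 40.0], 'quantity': [3, 1, 2]})
threshold = 20
expensive = df.query('price > @threshold')          # filter using a local variable
df.eval('total = price * quantity', inplace=True)   # add a derived column
print(expensive)
print(df)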


5. Downcast Numeric Columns

• Reduce memory for integer/float columns using .to_numeric().

df['value'] = pd.to_numeric(df['value'], downcast='float')
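
A quick sketch of what downcasting does to the dtype (the values are made up):

import pandas as pd

s = pd.Series([1.0, 2.5, 3.75])           # float64 by default
print(s.dtype)                             # float64
s = pd.to_numeric(s, downcast='float')
print(s.dtype)                             # float32, i.e. half the memory per value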

6. Use Dask or Vaex for Out-of-Core Processing

• Libraries such as Dask or Vaex provide Pandas-like APIs and can process data larger than RAM.

import dask.dataframe as dd

df = dd.read_csv('large_file.csv')
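
Dask evaluates lazily: dd.read_csv() only builds a task graph, and nothing is actually read or computed until .compute() is called. A sketch, with the file name and column names as placeholders:

import dask.dataframe as dd

df = dd.read_csv('large_file.csv')                  # lazy: no data loaded yet
result = df.groupby('category')['value'].mean()     # still lazy
print(result.compute())                             # triggers the real work, chunk by chunk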

7. Filter Early and Often

• Apply filters as early as possible to reduce data before performing expensive operations.

df = df[df['date'] >= '2023-01-01']
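
For example, restricting rows by date before an expensive groupby keeps the working set small. A sketch, assuming a hypothetical sales.csv with 'date', 'region' and 'amount' columns:

import pandas as pd

df = pd.read_csv('sales.csv', parse_dates=['date'])
df = df[df['date'] >= '2023-01-01']                  # shrink the data first
summary = df.groupby('region')['amount'].sum()       # then do the expensive aggregation
print(summary)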

8. Use Indexing for Speed

• Set indexes when filtering or joining data.

df.set_index('id', inplace=True)
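
Once a (sorted) index is in place, label lookups with .loc avoid scanning every row with a boolean mask. A sketch with synthetic data:

import pandas as pd
import numpy as np

df = pd.DataFrame({'id': np.arange(1_000_000), 'value': np.random.rand(1_000_000)})
df.set_index('id', inplace=True)
df.sort_index(inplace=True)
print(df.loc[42])   # fast label lookup instead of masking the unindexed frame with df['id'] == 42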
Python Script: Time-Series Analysis in Pandas

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# 1. Generate synthetic time-series data
date_rng = pd.date_range(start='2024-01-01', end='2024-03-31', freq='D')
np.random.seed(42)
data = np.random.normal(loc=100, scale=10, size=len(date_rng))
df = pd.DataFrame(data, columns=['Sales'])
df['Date'] = date_rng
df.set_index('Date', inplace=True)
print("First 5 rows of time-series data:")
print(df.head())

# 2. Plot the raw time-series data
df.plot(title='Daily Sales Over Time', figsize=(10, 4))
plt.xlabel("Date")
plt.ylabel("Sales")
plt.grid(True)
plt.show()

# 3. Resample to monthly frequency and compute the mean
monthly_avg = df.resample('M').mean()
print("\nMonthly Average Sales:")
print(monthly_avg)

# 4. Calculate rolling average (7-day window)
df['7-day MA'] = df['Sales'].rolling(window=7).mean()

# 5. Plot original and rolling average
df[['Sales', '7-day MA']].plot(title='7-Day Rolling Average of Sales', figsize=(10, 4))
plt.xlabel("Date")
plt.ylabel("Sales")
plt.grid(True)
plt.show()

# 6. Time-based filtering (partial-string indexing on the DatetimeIndex via .loc)
print("\nSales in February 2024:")
print(df.loc['2024-02'])

# 7. Add lag feature (shifted sales)
df['Yesterday Sales'] = df['Sales'].shift(1)
print("\nData with lag feature:")
print(df.head(10))

What This Script Demonstrates:

Step  Feature         Description
1     DatetimeIndex   Setting a time-based index
2     Plotting        Line plot of the time series
3     Resampling      Aggregating daily data to monthly
4     Rolling         Calculating a 7-day moving average
5     Filtering       Time-based data slicing
6     Shifting        Creating lag features
