
Pandas & NumPy for Tabular Data (Cleaning & Reshaping)
Data cleaning and reshaping are core tasks in data analysis. Pandas (built on NumPy) provides intuitive
methods to handle missing values, duplicates, and data types; as well as functions for reshaping, merging,
grouping, and window calculations. Below we illustrate key operations with clear code examples and real-
world scenarios (e.g. freight shipments), plus pointers to practice resources. Citations reference the official
pandas documentation and expert tutorials for further depth.

Data Cleaning in Pandas


Real datasets often have missing values, inconsistent types, or duplicates. Pandas offers methods like
isnull() / notnull() , dropna() , fillna() , astype() , drop_duplicates() , etc., to clean
data. For example, to remove rows with missing required fields:

import pandas as pd

# Example shipments data (freight invoices)
df = pd.DataFrame({
    'shipment_id': [101, 102, 103, 104, 105],
    'weight_kg': [200, None, 340, 285, 500],
    'cost_usd': [1200, 1500, 1700, None, 2200],
    'route': ['A-B', 'A-B', 'B-C', 'A-B', 'C-D'],
    'date': ['2025-01-10', '2025-01-11', '', '2025-01-11', '2025-01-12']
})

# Identify missing values per column
print(df.isnull().sum())

# Drop rows missing critical columns (e.g., weight or cost)
df_clean = df.dropna(subset=['weight_kg', 'cost_usd'])

This drops any shipments lacking weight or cost. Pandas dropna() removes rows (or columns) with nulls;
it supports a subset argument to target specific columns 1 . You can also fill missing values instead of
dropping: e.g.,

df['weight_kg'] = df['weight_kg'].fillna(df['weight_kg'].mean())
df['date'] = pd.to_datetime(df['date'], errors='coerce') # parse dates, invalid to NaT

Here, missing weights are filled with the average weight, and dates are converted to datetime (invalid
formats become NaT ). The pd.to_datetime() function converts strings or columns of year/month/day
into datetime objects 2 .

Pandas also lets you remove duplicate rows easily. For example:

# After cleaning, remove duplicate shipments (keep first occurrence)
df_nodup = df_clean.drop_duplicates()

The drop_duplicates() method “returns DataFrame with duplicate rows removed” (by default keeping
the first) 3 . This can be limited to certain columns using subset . Data types can be coerced using
astype() . For example, ensure numeric columns are numbers:

df_nodup['weight_kg'] = pd.to_numeric(df_nodup['weight_kg'], errors='coerce')
df_nodup['cost_usd'] = pd.to_numeric(df_nodup['cost_usd'], errors='coerce')

Or explicitly convert to integers/categories: df['col'] = df['col'].astype(int) or 'category' .
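
For instance, a short sketch on the toy data above (the nullable Int64 dtype is one option when a column may contain NaN):

df_nodup['route'] = df_nodup['route'].astype('category')        # compact storage for repeated labels
df_nodup['shipment_id'] = df_nodup['shipment_id'].astype('Int64')  # nullable integer dtype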


Cleaning often also involves string normalization (e.g. trimming whitespace, lowercasing, replacing bad
values) using string methods or replace() . As one guide notes, data cleaning “often involves handling
missing values, correcting data types, removing duplicates, and normalizing data” 4 .
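
For example, a quick sketch normalizing the route labels (the bad value being mapped is hypothetical):

df_nodup['route'] = (
    df_nodup['route']
    .str.strip()                # trim surrounding whitespace
    .str.upper()                # unify case
    .replace({'A - B': 'A-B'})  # map a known-bad label (hypothetical)
)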

Key cleaning steps:


- Identify missing data: df.isnull().sum() to count NAs.
- Drop or fill nulls: df.dropna() , df.fillna(value) , or interpolation.
- Remove duplicates: df.drop_duplicates() 3 .
- Convert types: pd.to_numeric , pd.to_datetime , astype() .
- Normalize values: e.g. mapping categorical labels, stripping strings.

Exercises/Practice: Try cleaning Kaggle’s "Dirty Data" dataset or the Pandas Practice dataset on Kaggle.
DataCamp’s Cleaning Data in Python course covers many of these techniques interactively. Consult the
pandas Missing Data guide for details 5 .

Reshaping & Tabulating Data (Pivot, Melt, GroupBy, Sorting, Indexing)
Once data is clean, you often restructure it to summarize or transform formats. Pandas supports pivoting
and melting data, grouping with aggregation, sorting, and setting indexes.

• Pivot (wide format): df.pivot(index=..., columns=..., values=...) reorganizes data into a table. For example, to see total cost per route by date:

# Assume df_nodup has date parsed to datetime
df_nodup['date'] = pd.to_datetime(df_nodup['date'])
pivot_table = df_nodup.pivot(index='date', columns='route', values='cost_usd')
print(pivot_table)

This creates a DataFrame with dates as rows and routes as columns, filling costs. According to
pandas docs, DataFrame.pivot “returns reshaped DataFrame organized by given index/column
values” 6 . If multiple entries share the same index/column, use pivot_table with an
aggregation function (e.g. aggfunc='sum' ).

• Melt (long format): pd.melt() reverses a pivot – it “unpivot a DataFrame from wide to long
format” 7 . For instance, to turn the pivot table above back into a tall format:

melted = pd.melt(pivot_table.reset_index(), id_vars=['date'],
                 var_name='route', value_name='cost_usd')
print(melted.head())

This yields columns date , route , cost_usd again.

• Stack/Unstack: These are similar to pivot/melt but operate on MultiIndex. df.stack() pivots
columns into index, df.unstack() does the reverse. They are useful for reshaping grouped data.
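
For example, round-tripping the pivot table above (a short sketch):

stacked = pivot_table.stack()         # route columns move into the index (long Series)
unstacked = stacked.unstack('route')  # back to one column per route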

• Grouping & aggregation: A powerful way to tabulate data is using groupby() . For example, to
compute total shipments per route or average cost:

grouped = df_nodup.groupby('route').agg(
    total_weight=('weight_kg', 'sum'),
    avg_cost=('cost_usd', 'mean'),
    count=('shipment_id', 'count')
)
print(grouped)

This yields one row per route with aggregated columns. The aggregate() function “uses one or
more operations” on each group 8 . Common aggregations are sum() , mean() , min() ,
max() , etc. You can also sort or filter on groups.

• Sorting: Use df.sort_values(by='col') to sort by a column, or df.sort_index() to sort by index. After grouping, you might sort results by size or cost. For example:

# Sort routes by descending total cost
route_cost = df_nodup.groupby('route')['cost_usd'].sum().reset_index()
print(route_cost.sort_values(by='cost_usd', ascending=False))

• Indexing: Often, setting a meaningful index (or multi-index) speeds lookups and makes pivoting
easier. E.g. df.set_index(['date','route'], inplace=True) creates a multi-index by date
and route. You can then unstack on one level.
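
A short sketch using the toy data:

df_idx = df_nodup.set_index(['date', 'route']).sort_index()
print(df_idx.loc[(pd.Timestamp('2025-01-10'), 'A-B')])  # fast label-based lookup
print(df_idx['cost_usd'].unstack('route'))              # routes become columns again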

Example – freight invoices: Suppose each row in df is a shipment that also carries a carrier column (added, for instance, via the merge shown in the next section). You can groupby(['route','carrier']) to tabulate metrics per route-carrier combination. Pivot tables can aggregate (like an Excel pivot table). For example:

pivot_summary = pd.pivot_table(
    df_nodup, index='route', columns='carrier',
    values='cost_usd', aggfunc='sum', fill_value=0
)
print(pivot_summary)

This creates a matrix of total cost by route (rows) and carrier (columns). Pandas notes that for aggregating
pivot tables, one can use pivot_table 9 or DataFrame.groupby .

Exercises/Practice: Pandas’ official Reshaping and Pivot Tables guide is a great reference. Try out Kaggle’s
Logistics Fleet Data (freight) dataset or any Kaggle “sales” dataset that requires pivoting, such as NYC Taxi
Trip Data. The LeetCode “30 Days of Pandas” plan includes tasks on reshaping and grouping.

Merging, Joining & Concatenating Tables


Real workflows often involve combining multiple tables. Pandas provides SQL-style joins and
concatenation:

• pd.merge() : like SQL joins. For example, if you have a df_carriers table with carrier details ( carrier_id , name , email ) and your shipments table has a carrier_id column, you can merge to add carrier info:

df_carriers = pd.DataFrame({
    'carrier_id': [1, 2],
    'name': ['CarrierA', 'CarrierB']
})
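# Assumes the shipments frame has a 'carrier_id' column (not in the toy data above)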
merged = pd.merge(df_nodup, df_carriers, on='carrier_id', how='left')

This performs a left join on carrier_id . Pandas user guide notes that merge() “combines
DataFrame with SQL-style joining” 10 .

• DataFrame.join() : simpler, for joining on index. For example, set carrier_id as the index on the carriers table; a shipments frame with a carrier_id column can then be joined via df_shipments.join(df_carriers, on='carrier_id') , as sketched below.
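
A minimal sketch (again assuming the shipments frame carries a carrier_id column):

carriers_idx = df_carriers.set_index('carrier_id')
joined = df_nodup.join(carriers_idx, on='carrier_id')  # left join by default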

• pd.concat() : stack DataFrames vertically (axis=0) or horizontally (axis=1). For example, if you
have shipment data split by month into separate DataFrames, you can combine them:

df_jan = pd.read_csv('shipments_jan.csv')
df_feb = pd.read_csv('shipments_feb.csv')
all_shipments = pd.concat([df_jan, df_feb], ignore_index=True)

This merges rows. Concatenation along columns can assemble wide tables.

According to pandas documentation: pandas has methods like concat() , DataFrame.join() , and
merge() to combine data 11 . Use how='inner'/'outer'/'left'/'right' in merge to control
joins.

Exercises/Practice: Practice joining on Kaggle by combining two related tables. For example, join an
Orders table with a Customers table on a common key. Kaggle’s E-Commerce Dataset (orders, customers,
products) is good for merge/join exercises. Also see pandas Merging documentation 12 .

Working with Date/Time Fields


Dates and times need special handling. Pandas offers to_datetime() , datetime indexing, and the .dt
accessor to extract components. For example:

df_nodup['date'] = pd.to_datetime(df_nodup['date'])
# Extract year, month, day, weekday
df_nodup['year'] = df_nodup['date'].dt.year
df_nodup['month'] = df_nodup['date'].dt.month
# Filter by date
jan_shipments = df_nodup[df_nodup['date'] >= '2025-01-01']
# Compute monthly totals ('ME' is month-end; pandas < 2.2 uses 'M')
monthly_cost = df_nodup.set_index('date').resample('ME')['cost_usd'].sum()

Here, to_datetime converts strings to datetime64[ns] dtype (non-parseable values become NaT ). As the pandas docs note, pd.to_datetime() “converts a scalar, array-like, Series or DataFrame/dict-like to a pandas datetime object” 13 . Once in datetime form, operations like resampling (grouping by time frequency) or time-based rolling windows become easy.

Exercise: Use the Kaggle Time Series Data or any “orders by date” dataset to practice parsing dates and
grouping by time (daily, monthly trends). DataCamp’s Working with Dates and Times in Python covers these
techniques.

Aggregation & Window Functions
Beyond simple grouping, you often compute aggregates (sum, mean) or windowed statistics (rolling
averages, expanding sums, EWMA). Examples:

• Aggregation with groupby (see above) yields summary statistics per group. You can also use
df.agg() on whole DataFrame for overall stats or on groups, and pivot_table with aggfunc.
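
For instance, a short sketch of whole-frame aggregation:

print(df_nodup[['weight_kg', 'cost_usd']].agg(['sum', 'mean', 'max']))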

• Rolling windows: For time or sequential data, df.rolling(window=N) creates a rolling object.
For example, to compute a 7-day moving average of cost:

df_nodup = df_nodup.sort_values('date')
df_nodup['cost_7d_avg'] = df_nodup['cost_usd'].rolling(window=7, min_periods=1).mean()
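# Note: window=7 counts 7 rows, not 7 calendar days. Given a clean, sorted
# datetime column (no NaT), a true 7-day window could instead be computed as:
# df_nodup['cost_7d_avg'] = df_nodup.rolling('7D', on='date')['cost_usd'].mean()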

The .rolling() example in pandas docs shows summing over a window of 2 observations:
“ df.rolling(2, min_periods=1).sum() ” 14 . This yields a new Series of rolling sums
(min_periods=1 ensures the first window isn’t NaN).

• Expanding windows: .expanding() gives cumulative aggregates. E.g. cumulative sum up to each
point:

df_nodup['cum_cost'] = df_nodup['cost_usd'].expanding().sum()

As the docs illustrate, df.expanding(1).sum() accumulates values with a minimum of 1 period 15 , so even the first row produces a value rather than NaN.

• Exponentially weighted (EWMA): .ewm(span=N).mean() gives a smoothing effect that weights recent observations more heavily.
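
For example (a short sketch):

df_nodup['cost_ewma'] = df_nodup['cost_usd'].ewm(span=12, min_periods=1).mean()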

These window functions let you compute moving averages, trends, or detect anomalies. For example, a
rolling 30-day sum of shipments or a 12-month EWMA of costs. Pandas provides full Windowing operations
(rolling, expanding, ewm) with many options (centered windows, custom windows, etc.) 14 15 .

Exercises/Practice: On a time series dataset (e.g. stock prices or sales), compute rolling statistics. Kaggle’s
Daily Stock Prices is excellent for rolling/ewm practice.

Reshaping and Transforming Data


Beyond pivoting, you may need to transform values. Techniques include:

• apply , map , replace : Apply functions to columns/rows. E.g. categorize shipments:

df_nodup['size'] = df_nodup['weight_kg'].apply(lambda x: 'heavy' if x>300 else 'light')
df_nodup['route'] = df_nodup['route'].replace({'A-B':'North Route', 'B-C':'East Route'})

• assign : chain the creation of new columns (combined with normalization in the sketch below).

• Normalization using NumPy: e.g. scale numeric columns with (x - min) / (max - min) . As
one guide shows, you can apply lambda or vectorized operations to normalize features (e.g. medal
counts) 16 .
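
A combined sketch of assign with min-max scaling (column names from the toy data):

w = df_nodup['weight_kg']
df_scaled = df_nodup.assign(
    weight_scaled=(w - w.min()) / (w.max() - w.min()),  # scale to [0, 1]
    cost_per_kg=df_nodup['cost_usd'] / w,
)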
• Vectorized NumPy: Pandas leverages NumPy under the hood. You can convert to NumPy arrays via
df.to_numpy() for heavy numeric ops, or use np.where for conditional logic:

import numpy as np
df_nodup['high_cost'] = np.where(df_nodup['cost_usd'] > 2000, True, False)

These transformations change a table's values rather than its structure.

Exercises/Practice: Use apply / map on a dataset to create new features. Kaggle’s Healthcare Dataset
(patients with health metrics) is great for feature transformation (categorizing risk, scaling values). Pandas
docs on transformations are useful.

Pandas & NumPy in Django Workflows


Experienced Django developers often use Pandas/NumPy to preprocess data before or after interacting
with the database. Common use cases:

• Reading uploaded CSVs: In a Django view or management command, read an uploaded file into
pandas, clean it, then insert. For example, in a view handling a file upload:

import pandas as pd
from .models import Shipment  # your shipment model

def upload_shipments(request):
    csv_file = request.FILES['file']
    df = pd.read_csv(csv_file)
    # Clean df as above...
    df = df.dropna(subset=['shipment_id', 'cost_usd'])
    # Convert DataFrame rows to model instances
    records = df.to_dict('records')
    objs = [Shipment(**rec) for rec in records]
    Shipment.objects.bulk_create(objs)

Here, bulk_create efficiently inserts many rows in one query 17 . The Django ORM’s
bulk_create() is ideal for loading cleaned pandas records into the DB in bulk.

• Using django-pandas : The django-pandas library provides helpers like read_frame(qs) to convert a QuerySet to a DataFrame 18 . For example:

from django_pandas.io import read_frame
qs = Shipment.objects.filter(date__year=2025)
df_ship = read_frame(qs) # get a pandas DataFrame from QuerySet

This simplifies analytics on model data. (Be cautious: it can load large datasets into memory.)

• Data analytics and reports: Use Pandas for computing stats or generating charts, then pass results
to templates or export as JSON. For instance, calculate monthly shipment volumes with Pandas and
display in a Django dashboard.
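
A hedged sketch of such a view (the template name, app path, and model fields are assumptions):

import pandas as pd
from django.shortcuts import render
from django_pandas.io import read_frame
from yourapp.models import Shipment  # hypothetical app/model

def shipments_dashboard(request):
    df = read_frame(Shipment.objects.all())
    df['date'] = pd.to_datetime(df['date'])
    monthly = df.set_index('date')['cost_usd'].resample('ME').sum()  # 'M' on pandas < 2.2
    return render(request, 'dashboard.html', {'monthly_costs': monthly.to_dict()})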

• Management commands: Many use Django’s custom commands to run pandas scripts (as shown
by one tutorial) 19 . This decouples data loading/cleaning from web requests.
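
A minimal sketch of such a command (the file path, app name, and model fields are assumptions):

# yourapp/management/commands/load_shipments.py
import pandas as pd
from django.core.management.base import BaseCommand
from yourapp.models import Shipment  # hypothetical model

class Command(BaseCommand):
    help = 'Load a shipments CSV, clean it, and bulk-insert the rows.'

    def add_arguments(self, parser):
        parser.add_argument('csv_path')

    def handle(self, *args, **options):
        df = pd.read_csv(options['csv_path'])
        df = df.dropna(subset=['shipment_id', 'cost_usd'])
        df['date'] = pd.to_datetime(df['date'], errors='coerce')
        objs = [Shipment(**rec) for rec in df.to_dict('records')]
        Shipment.objects.bulk_create(objs)
        self.stdout.write(self.style.SUCCESS(f'Loaded {len(objs)} shipments'))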

Pandas fits smoothly into Django for preprocessing user data or postprocessing DB exports. Just
remember to convert between DataFrames and Django models carefully (often via to_dict('records')
or django-pandas ), and use bulk_create for speed 17 .

Exercises/Practice: Try implementing a Django command that loads a Kaggle CSV into a DataFrame, cleans
it, and bulk-creates model instances. See Alex Kirkup’s example for guidance 19 17 . For hands-on
exercises, LeetCode’s 30 Days of Pandas and DataCamp courses on Pandas can solidify these skills.

Summary: Pandas (and underlying NumPy) provides a rich toolkit for cleaning (missing values, duplicates,
types), reshaping (pivot/melt, sorting, indexing), merging/joining, and performing aggregations or window
calculations on tabular data. By practicing with real datasets (sales, healthcare, logistics), you’ll learn to
apply these tools effectively. When using Django, integrate Pandas in data import/export pipelines: load
CSVs into DataFrames for cleaning, then push to the database with bulk operations; or pull QuerySets into
DataFrames for analysis. The combination of Pandas and NumPy (vectorized operations) greatly accelerates
data prep and analytics in your Django projects.

Resources: Official pandas docs (e.g. Working with Missing Data 5 , Reshaping Guide, Merging Guide 12 ),
Kaggle datasets (Logistics/Freight data), DataCamp courses (Cleaning Data in Python, Working with Dates),
and LeetCode’s Pandas study plan for practice problems. Use these to test your understanding of each
concept above.

1, 19 Clean a Kaggle dataset with Pandas and insert into a Django database using Python | Alex Kirkup | Medium
https://medium.com/@alex.kirkup/clean-a-kaggle-dataset-with-pandas-and-insert-into-a-django-database-using-python-3e2ecbcbdc7f

2, 13 pandas.to_datetime — pandas 2.2.3 documentation
https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html

3 pandas.DataFrame.drop_duplicates — pandas 2.2.3 documentation
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html

4, 16 Practical Examples of Data Cleaning Using Pandas and Numpy | Rajat Sharma | The Pythoneers | Medium
https://medium.com/pythoneers/practical-examples-of-data-cleaning-using-pandas-and-numpy-5f59021f0144

5 Working with missing data — pandas documentation (dev)
https://pandas.pydata.org/pandas-docs/dev/user_guide/missing_data.html

6, 9 pandas.DataFrame.pivot — pandas 2.2.3 documentation
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html

7 pandas.melt — pandas 2.2.3 documentation
https://pandas.pydata.org/docs/reference/api/pandas.melt.html

8 pandas.core.groupby.DataFrameGroupBy.aggregate — pandas 2.2.3 documentation
https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.aggregate.html

10, 11, 12 Merge, join, concatenate and compare — pandas 2.2.3 documentation
https://pandas.pydata.org/docs/user_guide/merging.html

14 pandas.DataFrame.rolling — pandas 2.2.3 documentation
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html

15 pandas.DataFrame.expanding — pandas 2.2.3 documentation
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.expanding.html

17 How to write a Pandas DataFrame to Django model | Stack Overflow
https://stackoverflow.com/questions/34425607/how-to-write-a-pandas-dataframe-to-django-model/39644304

18 Converting Django QuerySet to pandas DataFrame | Stack Overflow
https://stackoverflow.com/questions/11697887/converting-django-queryset-to-pandas-dataframe
