
Pandas & NumPy for Tabular Data (Cleaning & Reshaping)
Data cleaning and reshaping are core tasks in data analysis. Pandas (built on NumPy) provides intuitive
methods to handle missing values, duplicates, and data types; as well as functions for reshaping, merging,
grouping, and window calculations. Below we illustrate key operations with clear code examples and real-
world scenarios (e.g. freight shipments), plus pointers to practice resources. Citations reference the official
pandas documentation and expert tutorials for further depth.

Data Cleaning in Pandas


Real datasets often have missing values, inconsistent types, or duplicates. Pandas offers methods like
isnull() / notnull() , dropna() , fillna() , astype() , drop_duplicates() , etc., to clean
data. For example, to remove rows with missing required fields:

import pandas as pd

# Example shipments data (freight invoices)
df = pd.DataFrame({
    'shipment_id': [101, 102, 103, 104, 105],
    'weight_kg': [200, None, 340, 285, 500],
    'cost_usd': [1200, 1500, 1700, None, 2200],
    'route': ['A-B', 'A-B', 'B-C', 'A-B', 'C-D'],
    'date': ['2025-01-10', '2025-01-11', '', '2025-01-11', '2025-01-12']
})

# Identify missing values per column
print(df.isnull().sum())

# Drop rows missing critical columns (e.g., weight or cost)
df_clean = df.dropna(subset=['weight_kg', 'cost_usd'])

This drops any shipments lacking weight or cost. Pandas dropna() removes rows (or columns) with nulls;
it supports a subset argument to target specific columns 1 . You can also fill missing values instead of
dropping: e.g.,

df['weight_kg'] = df['weight_kg'].fillna(df['weight_kg'].mean())
df['date'] = pd.to_datetime(df['date'], errors='coerce') # parse dates, invalid to NaT

Here, missing weights are filled with the average weight, and dates are converted to datetime (invalid
formats become NaT ). The pd.to_datetime() function converts strings or columns of year/month/day
into datetime objects 2 .

Pandas also lets you remove duplicate rows easily. For example:

# After cleaning, remove duplicate shipments (keep first occurrence)
df_nodup = df_clean.drop_duplicates()

The drop_duplicates() method “returns DataFrame with duplicate rows removed” (by default keeping
the first) 3 . This can be limited to certain columns using subset . Data types can be coerced using
astype() . For example, ensure numeric columns are numbers:

df_nodup['weight_kg'] = pd.to_numeric(df_nodup['weight_kg'], errors='coerce')
df_nodup['cost_usd'] = pd.to_numeric(df_nodup['cost_usd'], errors='coerce')

Or explicitly convert to integers/categories: df['col'] = df['col'].astype(int) or 'category' .
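
For instance, a short sketch on the toy data above (the nullable Int64 dtype is one option when a column may contain NaN):

df_nodup['route'] = df_nodup['route'].astype('category')        # compact storage for repeated labels
df_nodup['shipment_id'] = df_nodup['shipment_id'].astype('Int64')  # nullable integer dtype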


Cleaning often also involves string normalization (e.g. trimming whitespace, lowercasing, replacing bad
values) using string methods or replace() . As one guide notes, data cleaning “often involves handling
missing values, correcting data types, removing duplicates, and normalizing data” 4 .
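
For example, a quick sketch normalizing the route labels (the bad value being mapped is hypothetical):

df_nodup['route'] = (
    df_nodup['route']
    .str.strip()                # trim surrounding whitespace
    .str.upper()                # unify case
    .replace({'A - B': 'A-B'})  # map a known-bad label (hypothetical)
)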

Key cleaning steps:


- Identify missing data: df.isnull().sum() to count NAs.
- Drop or fill nulls: df.dropna() , df.fillna(value) , or interpolation.
- Remove duplicates: df.drop_duplicates() 3 .
- Convert types: pd.to_numeric , pd.to_datetime , astype() .
- Normalize values: e.g. mapping categorical labels, stripping strings.

Exercises/Practice: Try cleaning Kaggle’s "Dirty Data" dataset or the Pandas Practice dataset on Kaggle.
DataCamp’s Cleaning Data in Python course covers many of these techniques interactively. Consult the
pandas Missing Data guide for details 5 .

Reshaping & Tabulating Data (Pivot, Melt, GroupBy, Sorting, Indexing)
Once data is clean, you often restructure it to summarize or transform formats. Pandas supports pivoting
and melting data, grouping with aggregation, sorting, and setting indexes.

• Pivot (wide format): df.pivot(index=..., columns=..., values=...) reorganizes data into a table. For example, to see total cost per route by date:

# Assume df_nodup has date parsed to datetime
df_nodup['date'] = pd.to_datetime(df_nodup['date'])
pivot_table = df_nodup.pivot(index='date', columns='route', values='cost_usd')
print(pivot_table)

This creates a DataFrame with dates as rows and routes as columns, filling costs. According to
pandas docs, DataFrame.pivot “returns reshaped DataFrame organized by given index/column
values” 6 . If multiple entries share the same index/column, use pivot_table with an
aggregation function (e.g. aggfunc='sum' ).

• Melt (long format): pd.melt() reverses a pivot – it “unpivot a DataFrame from wide to long
format” 7 . For instance, to turn the pivot table above back into a tall format:

melted = pd.melt(pivot_table.reset_index(), id_vars=['date'],
                 var_name='route', value_name='cost_usd')
print(melted.head())

This yields columns date , route , cost_usd again.

• Stack/Unstack: These are similar to pivot/melt but operate on MultiIndex. df.stack() pivots
columns into index, df.unstack() does the reverse. They are useful for reshaping grouped data.
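
For example, round-tripping the pivot table above (a short sketch):

stacked = pivot_table.stack()         # route columns move into the index (long Series)
unstacked = stacked.unstack('route')  # back to one column per route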

• Grouping & aggregation: A powerful way to tabulate data is using groupby() . For example, to
compute total shipments per route or average cost:

grouped = df_nodup.groupby('route').agg(
    total_weight=('weight_kg', 'sum'),
    avg_cost=('cost_usd', 'mean'),
    count=('shipment_id', 'count')
)
print(grouped)

This yields one row per route with aggregated columns. The aggregate() function “uses one or
more operations” on each group 8 . Common aggregations are sum() , mean() , min() ,
max() , etc. You can also sort or filter on groups.

• Sorting: Use df.sort_values(by='col') to sort by a column, or df.sort_index() to sort by index. After grouping, you might sort results by size or cost. For example:

# Sort routes by descending total cost
route_cost = df_nodup.groupby('route')['cost_usd'].sum().reset_index()
print(route_cost.sort_values(by='cost_usd', ascending=False))

• Indexing: Often, setting a meaningful index (or multi-index) speeds lookups and makes pivoting
easier. E.g. df.set_index(['date','route'], inplace=True) creates a multi-index by date
and route. You can then unstack on one level.
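
A short sketch using the toy data:

df_idx = df_nodup.set_index(['date', 'route']).sort_index()
print(df_idx.loc[(pd.Timestamp('2025-01-10'), 'A-B')])  # fast label-based lookup
print(df_idx['cost_usd'].unstack('route'))              # routes become columns again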

Example – freight invoices: Suppose each row in df is a shipment that also carries a carrier column (added, for instance, via the merge shown in the next section). You can groupby(['route','carrier']) to tabulate metrics per route-carrier combination. Pivot tables can aggregate (like an Excel pivot table). For example:

pivot_summary = pd.pivot_table(
    df_nodup, index='route', columns='carrier',
    values='cost_usd', aggfunc='sum', fill_value=0
)
print(pivot_summary)

This creates a matrix of total cost by route (rows) and carrier (columns). Pandas notes that for aggregating
pivot tables, one can use pivot_table 9 or DataFrame.groupby .

Exercises/Practice: Pandas’ official Reshaping and Pivot Tables guide is a great reference. Try out Kaggle’s
Logistics Fleet Data (freight) dataset or any Kaggle “sales” dataset that requires pivoting, such as NYC Taxi
Trip Data. The LeetCode “30 Days of Pandas” plan includes tasks on reshaping and grouping.

Merging, Joining & Concatenating Tables


Real workflows often involve combining multiple tables. Pandas provides SQL-style joins and
concatenation:

• pd.merge() : like SQL joins. For example, if you have a df_carriers table with carrier details ( carrier_id , name , email ) and your shipments table has a carrier_id column, you can merge to add carrier info:

df_carriers = pd.DataFrame({
    'carrier_id': [1, 2],
    'name': ['CarrierA', 'CarrierB']
})
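# Assumes the shipments frame has a 'carrier_id' column (not in the toy data above)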
merged = pd.merge(df_nodup, df_carriers, on='carrier_id', how='left')

This performs a left join on carrier_id . Pandas user guide notes that merge() “combines
DataFrame with SQL-style joining” 10 .

• DataFrame.join() : simpler, for joining on index. For example, set carrier_id as the index on the carriers table; a shipments frame with a carrier_id column can then be joined via df_shipments.join(df_carriers, on='carrier_id') , as sketched below.
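
A minimal sketch (again assuming the shipments frame carries a carrier_id column):

carriers_idx = df_carriers.set_index('carrier_id')
joined = df_nodup.join(carriers_idx, on='carrier_id')  # left join by default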

• pd.concat() : stack DataFrames vertically (axis=0) or horizontally (axis=1). For example, if you
have shipment data split by month into separate DataFrames, you can combine them:

df_jan = pd.read_csv('shipments_jan.csv')
df_feb = pd.read_csv('shipments_feb.csv')
all_shipments = pd.concat([df_jan, df_feb], ignore_index=True)

This merges rows. Concatenation along columns can assemble wide tables.

According to pandas documentation: pandas has methods like concat() , DataFrame.join() , and
merge() to combine data 11 . Use how='inner'/'outer'/'left'/'right' in merge to control
joins.

Exercises/Practice: Practice joining on Kaggle by combining two related tables. For example, join an
Orders table with a Customers table on a common key. Kaggle’s E-Commerce Dataset (orders, customers,
products) is good for merge/join exercises. Also see pandas Merging documentation 12 .

Working with Date/Time Fields


Dates and times need special handling. Pandas offers to_datetime() , datetime indexing, and the .dt
accessor to extract components. For example:

df_nodup['date'] = pd.to_datetime(df_nodup['date'])
# Extract year, month, day, weekday
df_nodup['year'] = df_nodup['date'].dt.year
df_nodup['month'] = df_nodup['date'].dt.month
# Filter by date
jan_shipments = df_nodup[df_nodup['date'] >= '2025-01-01']
# Compute monthly totals ('ME' is month-end; pandas < 2.2 uses 'M')
monthly_cost = df_nodup.set_index('date').resample('ME')['cost_usd'].sum()

Here, to_datetime converts strings to datetime64[ns] dtype (non-parseable values become NaT ). As the pandas docs note, pd.to_datetime() “converts a scalar, array-like, Series or DataFrame/dict-like to a pandas datetime object” 13 . Once in datetime form, operations like resampling (grouping by time frequency) or time-based rolling windows become easy.

Exercise: Use the Kaggle Time Series Data or any “orders by date” dataset to practice parsing dates and
grouping by time (daily, monthly trends). DataCamp’s Working with Dates and Times in Python covers these
techniques.

Aggregation & Window Functions
Beyond simple grouping, you often compute aggregates (sum, mean) or windowed statistics (rolling
averages, expanding sums, EWMA). Examples:

• Aggregation with groupby (see above) yields summary statistics per group. You can also use
df.agg() on whole DataFrame for overall stats or on groups, and pivot_table with aggfunc.
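
For instance, a short sketch of whole-frame aggregation:

print(df_nodup[['weight_kg', 'cost_usd']].agg(['sum', 'mean', 'max']))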

• Rolling windows: For time or sequential data, df.rolling(window=N) creates a rolling object.
For example, to compute a 7-day moving average of cost:

df_nodup = df_nodup.sort_values('date')
df_nodup['cost_7d_avg'] = df_nodup['cost_usd'].rolling(window=7, min_periods=1).mean()
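# Note: window=7 counts 7 rows, not 7 calendar days. Given a clean, sorted
# datetime column (no NaT), a true 7-day window could instead be computed as:
# df_nodup['cost_7d_avg'] = df_nodup.rolling('7D', on='date')['cost_usd'].mean()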

The .rolling() example in pandas docs shows summing over a window of 2 observations:
“ df.rolling(2, min_periods=1).sum() ” 14 . This yields a new Series of rolling sums
(min_periods=1 ensures the first window isn’t NaN).

• Expanding windows: .expanding() gives cumulative aggregates. E.g. cumulative sum up to each
point:

df_nodup['cum_cost'] = df_nodup['cost_usd'].expanding().sum()

As the docs illustrate, df.expanding(1).sum() accumulates values with a minimum of 1 period 15 , so even the first row produces a value rather than NaN.

• Exponentially weighted (EWMA): .ewm(span=N).mean() gives a smoothing effect that weights recent observations more heavily.
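
For example (a short sketch):

df_nodup['cost_ewma'] = df_nodup['cost_usd'].ewm(span=12, min_periods=1).mean()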

These window functions let you compute moving averages, trends, or detect anomalies. For example, a
rolling 30-day sum of shipments or a 12-month EWMA of costs. Pandas provides full Windowing operations
(rolling, expanding, ewm) with many options (centered windows, custom windows, etc.) 14 15 .

Exercises/Practice: On a time series dataset (e.g. stock prices or sales), compute rolling statistics. Kaggle’s
Daily Stock Prices is excellent for rolling/ewm practice.

Reshaping and Transforming Data


Beyond pivoting, you may need to transform values. Techniques include:

• apply , map , replace : Apply functions to columns/rows. E.g. categorize shipments:

df_nodup['size'] = df_nodup['weight_kg'].apply(lambda x: 'heavy' if x>300 else 'light')
df_nodup['route'] = df_nodup['route'].replace({'A-B':'North Route', 'B-C':'East Route'})

• assign : chain the creation of new columns (combined with normalization in the sketch below).

• Normalization using NumPy: e.g. scale numeric columns with (x - min) / (max - min) . As
one guide shows, you can apply lambda or vectorized operations to normalize features (e.g. medal
counts) 16 .
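
A combined sketch of assign with min-max scaling (column names from the toy data):

w = df_nodup['weight_kg']
df_scaled = df_nodup.assign(
    weight_scaled=(w - w.min()) / (w.max() - w.min()),  # scale to [0, 1]
    cost_per_kg=df_nodup['cost_usd'] / w,
)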
• Vectorized NumPy: Pandas leverages NumPy under the hood. You can convert to NumPy arrays via
df.to_numpy() for heavy numeric ops, or use np.where for conditional logic:

import numpy as np
df_nodup['high_cost'] = np.where(df_nodup['cost_usd'] > 2000, True, False)

These transformations change a table's values rather than its structure.

Exercises/Practice: Use apply / map on a dataset to create new features. Kaggle’s Healthcare Dataset
(patients with health metrics) is great for feature transformation (categorizing risk, scaling values). Pandas
docs on transformations are useful.

Pandas & NumPy in Django Workflows


Experienced Django developers often use Pandas/NumPy to preprocess data before or after interacting
with the database. Common use cases:

• Reading uploaded CSVs: In a Django view or management command, read an uploaded file into
pandas, clean it, then insert. For example, in a view handling a file upload:

import pandas as pd
from .models import Shipment  # your shipment model

def upload_shipments(request):
    csv_file = request.FILES['file']
    df = pd.read_csv(csv_file)
    # Clean df as above...
    df = df.dropna(subset=['shipment_id', 'cost_usd'])
    # Convert DataFrame rows to model instances
    records = df.to_dict('records')
    objs = [Shipment(**rec) for rec in records]
    Shipment.objects.bulk_create(objs)

Here, bulk_create efficiently inserts many rows in one query 17 . The Django ORM’s
bulk_create() is ideal for loading cleaned pandas records into the DB in bulk.

• Using django-pandas : The django-pandas library provides helpers like read_frame(qs) to convert a QuerySet to a DataFrame 18 . For example:

from django_pandas.io import read_frame
qs = Shipment.objects.filter(date__year=2025)
df_ship = read_frame(qs) # get a pandas DataFrame from QuerySet

This simplifies analytics on model data. (Be cautious: it can load large datasets into memory.)

• Data analytics and reports: Use Pandas for computing stats or generating charts, then pass results
to templates or export as JSON. For instance, calculate monthly shipment volumes with Pandas and
display in a Django dashboard.
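
A hedged sketch of such a view (the template name, app path, and model fields are assumptions):

import pandas as pd
from django.shortcuts import render
from django_pandas.io import read_frame
from yourapp.models import Shipment  # hypothetical app/model

def shipments_dashboard(request):
    df = read_frame(Shipment.objects.all())
    df['date'] = pd.to_datetime(df['date'])
    monthly = df.set_index('date')['cost_usd'].resample('ME').sum()  # 'M' on pandas < 2.2
    return render(request, 'dashboard.html', {'monthly_costs': monthly.to_dict()})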

• Management commands: Many use Django’s custom commands to run pandas scripts (as shown
by one tutorial) 19 . This decouples data loading/cleaning from web requests.
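
A minimal sketch of such a command (the file path, app name, and model fields are assumptions):

# yourapp/management/commands/load_shipments.py
import pandas as pd
from django.core.management.base import BaseCommand
from yourapp.models import Shipment  # hypothetical model

class Command(BaseCommand):
    help = 'Load a shipments CSV, clean it, and bulk-insert the rows.'

    def add_arguments(self, parser):
        parser.add_argument('csv_path')

    def handle(self, *args, **options):
        df = pd.read_csv(options['csv_path'])
        df = df.dropna(subset=['shipment_id', 'cost_usd'])
        df['date'] = pd.to_datetime(df['date'], errors='coerce')
        objs = [Shipment(**rec) for rec in df.to_dict('records')]
        Shipment.objects.bulk_create(objs)
        self.stdout.write(self.style.SUCCESS(f'Loaded {len(objs)} shipments'))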

Pandas fits smoothly into Django for preprocessing user data or postprocessing DB exports. Just
remember to convert between DataFrames and Django models carefully (often via to_dict('records')
or django-pandas ), and use bulk_create for speed 17 .

Exercises/Practice: Try implementing a Django command that loads a Kaggle CSV into a DataFrame, cleans
it, and bulk-creates model instances. See Alex Kirkup’s example for guidance 19 17 . For hands-on
exercises, LeetCode’s 30 Days of Pandas and DataCamp courses on Pandas can solidify these skills.

Summary: Pandas (and underlying NumPy) provides a rich toolkit for cleaning (missing values, duplicates,
types), reshaping (pivot/melt, sorting, indexing), merging/joining, and performing aggregations or window
calculations on tabular data. By practicing with real datasets (sales, healthcare, logistics), you’ll learn to
apply these tools effectively. When using Django, integrate Pandas in data import/export pipelines: load
CSVs into DataFrames for cleaning, then push to the database with bulk operations; or pull QuerySets into
DataFrames for analysis. The combination of Pandas and NumPy (vectorized operations) greatly accelerates
data prep and analytics in your Django projects.

Resources: Official pandas docs (e.g. Working with Missing Data 5 , Reshaping Guide, Merging Guide 12 ),
Kaggle datasets (Logistics/Freight data), DataCamp courses (Cleaning Data in Python, Working with Dates),
and LeetCode’s Pandas study plan for practice problems. Use these to test your understanding of each
concept above.

1, 19 Clean a Kaggle dataset with Pandas and insert into a Django database using Python | Alex Kirkup | Medium
https://medium.com/@alex.kirkup/clean-a-kaggle-dataset-with-pandas-and-insert-into-a-django-database-using-python-3e2ecbcbdc7f

2, 13 pandas.to_datetime — pandas 2.2.3 documentation
https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html

3 pandas.DataFrame.drop_duplicates — pandas 2.2.3 documentation
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html

4, 16 Practical Examples of Data Cleaning Using Pandas and Numpy | Rajat Sharma | The Pythoneers | Medium
https://medium.com/pythoneers/practical-examples-of-data-cleaning-using-pandas-and-numpy-5f59021f0144

5 Working with missing data — pandas documentation (dev)
https://pandas.pydata.org/pandas-docs/dev/user_guide/missing_data.html

6, 9 pandas.DataFrame.pivot — pandas 2.2.3 documentation
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html

7 pandas.melt — pandas 2.2.3 documentation
https://pandas.pydata.org/docs/reference/api/pandas.melt.html

8 pandas.core.groupby.DataFrameGroupBy.aggregate — pandas 2.2.3 documentation
https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.aggregate.html

10, 11, 12 Merge, join, concatenate and compare — pandas 2.2.3 documentation
https://pandas.pydata.org/docs/user_guide/merging.html

14 pandas.DataFrame.rolling — pandas 2.2.3 documentation
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html

15 pandas.DataFrame.expanding — pandas 2.2.3 documentation
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.expanding.html

17 How to write a Pandas DataFrame to Django model | Stack Overflow
https://stackoverflow.com/questions/34425607/how-to-write-a-pandas-dataframe-to-django-model/39644304

18 Converting Django QuerySet to pandas DataFrame | Stack Overflow
https://stackoverflow.com/questions/11697887/converting-django-queryset-to-pandas-dataframe
