MODULE – 3
Pandas Data Structures
What is Pandas?
• The name pandas is derived from "panel data"; the library is commonly described as the "Python Data Analysis Library."
• It is an open-source library that provides high-performance, easy-to-use data
structures and data analysis tools for Python.
• It is built on top of NumPy and is designed for practical, real-world data
manipulation.
• Pandas offers two primary data structures — Series and DataFrame — optimized
for fast and flexible data operations.
Feature Description
Easy Handling of Missing Data Functions like .fillna(), .dropna()
Flexible Data Alignment Automatic data alignment by labels
Powerful Grouping Operations Group by, pivot tables, aggregation
Efficient Data Filtering Boolean indexing, slicing, subsetting
High Performance Optimized with C and NumPy under the hood
Rich IO support Read/write to CSV, Excel, SQL, JSON, HDF5, etc.
Time-Series Analysis Built-in functions for date range generation, moving averages
Visualization Integrated plotting with Matplotlib
Why Use Pandas?
• Real-world datasets are often messy, and pandas provides powerful tools to clean,
transform, and prepare data.
• Works seamlessly with big data when combined with Dask or Vaex.
• Essential for Machine Learning, Business Intelligence, Statistical Modeling, and
ETL processes (Extract, Transform, Load).
• Reduces thousands of lines of manual coding to just a few lines.
How Pandas Works Internally
• Built on NumPy arrays: it inherits the speed and functionality of NumPy.
• Data is stored in blocks: Homogeneous types (like integers or floats) are grouped
together in memory for efficiency.
• Indexing is optimized for fast lookups and manipulations.
Important Concepts in Pandas
1. Indexing and Slicing
• Every Series or DataFrame has an Index for labeling.
• Indexing can be positional (iloc) or label-based (loc).
df.loc[0] # label-based
df.iloc[0] # position-based
2. Handling Missing Data
• Real-world data is often incomplete.
• Pandas makes it easy to detect (isnull()) and handle (fillna(), dropna()).
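A short sketch (assuming df has an 'Age' column with gaps):
df.isnull().sum() # count missing values per column
df['Age'] = df['Age'].fillna(df['Age'].mean()) # fill the gaps with the column mean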
3. GroupBy Operations
• Perform operations like sum, mean, count per group (e.g., total sales by city).
df.groupby('City')['Sales'].sum()
4. Merging and Joining DataFrames
• Combine multiple datasets like SQL JOINs (merge(), join(), concat()).
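A short sketch (df1 and df2 are assumed to share an 'ID' column):
pd.merge(df1, df2, on='ID', how='inner') # SQL-style inner join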
5. Pivot Tables and Reshaping
• Summarize data and rearrange tables easily.
df.pivot_table(values='Sales', index='Region', columns='Product', aggfunc='sum')
6. Time-Series Data
• Special tools for date indexing, resampling, moving averages, etc.
df.resample('M').mean() # monthly average
1. Series (1-Dimensional Data Structure)
• A Series is a one-dimensional labeled array that can hold any type of data
(integers, strings, floats, Python objects, etc.).
• It is like a column in a spreadsheet or a database table.
• It has two main components:
o Index (labels for rows)
o Values (actual data)
import pandas as pd
# Creating a Series
s = pd.Series([10, 20, 30, 40, 50])
print(s)
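A Series can also carry a custom index; a small sketch with illustrative labels:
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s.index)  # Index(['a', 'b', 'c'], dtype='object')
print(s.values) # [10 20 30]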
2. DataFrame (2-Dimensional Data Structure)
• A DataFrame is a two-dimensional labeled data structure with columns of
potentially different types.
• You can think of it like an Excel spreadsheet or a SQL table or a dictionary of
Series objects.
• It has:
o Row Index (label for each row)
o Column Index (label for each column)
o Values (table of data)
import pandas as pd
# Creating a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Paris', 'London']
}
df = pd.DataFrame(data)
print(df)
Creating DataFrame from Lists:
df = pd.DataFrame([
[1, 'A'],
[2, 'B'],
[3, 'C']
], columns=['ID', 'Letter'])
print(df)
Pandas DataFrame and All Operations
What is a DataFrame?
• A DataFrame is a 2-dimensional, size-mutable, tabular data structure in pandas.
• It is made up of:
o Rows (horizontal entries)
o Columns (vertical entries)
o Indexes (labels for rows and columns)
• Columns can hold different types of data (int, float, string, boolean, object, etc.).
Creating a DataFrame
From Dictionary
import pandas as pd
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Paris', 'London']
}
df = pd.DataFrame(data)
print(df)
1. Viewing Data
Function Description Example
head() Show first N rows df.head(3)
tail() Show last N rows df.tail(2)
shape Get (rows, columns) df.shape
info() Summary (non-null counts, dtypes) df.info()
describe() Statistical summary df.describe()
columns List of column names df.columns
2. Selecting Data
Operation Example Output
Column selection df['Name'] Series (single column)
Multiple columns df[['Name', 'City']] DataFrame
Row by label df.loc[0] Row as Series
Row by position df.iloc[1] Row as Series
Slice rows df[0:2] First two rows
3. Filtering / Conditional Selection
# People aged above 25
df[df['Age'] > 25]
4. Adding and Modifying Columns
# Add new column
df['Country'] = 'USA'
# Modify existing column
df['Age'] = df['Age'] + 1
print(df)
5. Deleting Columns and Rows
# Delete column
df.drop('Country', axis=1, inplace=True)
# Delete row
df.drop(1, axis=0, inplace=True)
6. Renaming Columns
df.rename(columns={'Name': 'FullName'}, inplace=True)
7. Sorting
# Sort by column
df.sort_values(by='Age', ascending=False, inplace=True)
8. Handling Missing Data
# Detect missing
df.isnull()
# Drop missing rows
df.dropna(inplace=True)
# Fill missing
df.fillna('Unknown', inplace=True)
9. Aggregations
Function Example
Sum df['Age'].sum()
Mean df['Age'].mean()
Count df['City'].count()
Max/Min df['Age'].max() / df['Age'].min()
10. GroupBy Operations
# Group by city and calculate average age
df.groupby('City')['Age'].mean()
11. Merging, Joining, and Concatenating
• Merging DataFrames (like SQL JOIN):
pd.merge(df1, df2, on='ID')
• Concatenating DataFrames (stacking vertically or horizontally):
pd.concat([df1, df2], axis=0) # vertically
pd.concat([df1, df2], axis=1) # horizontally
12. Pivot Tables
df.pivot_table(values='Age', index='City', aggfunc='mean')
13. Reshaping
• Melt: Convert wide DataFrame to long format.
pd.melt(df, id_vars=['City'], value_vars=['Age'])
• Stack/Unstack: Pivot the levels of the index.
df.stack()
df.unstack()
Data Preprocessing
What is Data Preprocessing?
Data Preprocessing is the process of cleaning, transforming, and organizing raw data
into a usable format for machine learning models.
Real-world data is often incomplete, inconsistent, noisy, or unstructured, and
preprocessing improves the quality of the data.
Without preprocessing, models can give poor predictions, lower accuracy, or even fail to
run.
Main Steps in Data Preprocessing:
• Handling Missing Data
• Removing Duplicates
• Normalizing Data
• Encoding Categorical Variables
• Data Transformation
1) Handling Missing Data
Missing values occur when some data points are not recorded.
If not handled properly, they can lead to biased models or errors during training.
Common ways to handle missing data:
• Removal: Drop rows or columns with missing data.
• Imputation: Fill missing values using mean, median, mode, or a constant.
🛠 Ways to Handle Missing Data
Method When to Use
Remove rows/columns If too much data is missing
Fill with mean/median For numerical columns
Fill with mode For categorical columns
import numpy as np
import pandas as pd
# Create data with missing values
data = {'Age': [25, np.nan, 30, 22, np.nan],
'Salary': [50000, 60000, np.nan, 52000, 58000]}
df = pd.DataFrame(data)
# Fill missing Age with mean
df['Age'].fillna(df['Age'].mean(), inplace=True)
# Fill missing Salary with median
df['Salary'].fillna(df['Salary'].median(), inplace=True)
print(df)
2) Handling Duplicates
Duplicate entries are repeated rows in the dataset.
They bias the model and misrepresent the actual distribution of data.
Duplicates should be identified and removed to maintain the integrity of the dataset.
🛠 Ways to Handle Duplicates
Method Code
Find duplicates df.duplicated()
Remove duplicates df.drop_duplicates()
import pandas as pd
data = {'Name': ['John', 'Anna', 'John', 'Mike'],
'Age': [25, 22, 25, 30]}
df = pd.DataFrame(data)
# Remove duplicate rows
df = df.drop_duplicates()
print(df)
3) Normalizing Data
Normalization means scaling numerical data into a specific range (like 0–1).
It ensures that all features contribute equally to the result.
Especially important for models based on distance (e.g., KNN, SVM).
Two common types:
• Min-Max Normalization: Scale values between 0 and 1.
• Z-score Standardization: Center the data around mean 0, std 1.
🛠 Ways to Normalize Data
Method Description
Min-Max Scaling (value - min) / (max - min)
Standardization (value - mean) / std deviation
from sklearn.preprocessing import MinMaxScaler
data = [[10], [20], [30]]
scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(data)
print(normalized_data)
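Standardization works the same way; a minimal sketch using scikit-learn's StandardScaler:
from sklearn.preprocessing import StandardScaler
data = [[10], [20], [30]]
scaler = StandardScaler()
standardized_data = scaler.fit_transform(data)  # mean 0, standard deviation 1
print(standardized_data)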
4) Encoding Categorical Data
Machine learning models only understand numbers — not text.
Categorical (text) data must be converted into numbers.
Techniques:
• Label Encoding: Assigns each category a unique number.
• One-Hot Encoding: Creates binary columns for each category.
🛠 Ways to Encode Categorical Data
Method Use When
Label Encoding Categories have order (e.g., Small, Medium, Large)
One-Hot Encoding Categories are unordered (e.g., Red, Blue, Green)
from sklearn.preprocessing import LabelEncoder
data = {'Color': ['Red', 'Blue', 'Green', 'Blue']}
df = pd.DataFrame(data)
# Label Encoding
encoder = LabelEncoder()
df['Color_encoded'] = encoder.fit_transform(df['Color'])
print(df)
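One-Hot Encoding can be done with pandas; a minimal sketch on the same 'Color' column:
df_onehot = pd.get_dummies(df, columns=['Color'])  # one binary column per category
print(df_onehot)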
5) Data Transformation
Transformation is changing the structure or distribution of data.
It makes the data more suitable for modeling, by reducing skewness, improving linearity,
etc.
Common transformations:
• Log Transformation (reduce right skew)
• Square Root Transformation
• Box-Cox Transformation
🛠 Ways to Transform Data
Transformation Purpose
Log Reduce large differences between values
Square Root Handle moderate skew
Box-Cox Make data more normal (Gaussian)
import numpy as np
data = np.array([1, 10, 100, 1000])
log_transformed = np.log(data)
print(log_transformed)
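The other transformations listed above can be applied similarly; a minimal sketch (Box-Cox requires strictly positive values and uses SciPy, an extra dependency here):
from scipy import stats
sqrt_transformed = np.sqrt(data)              # square root transformation
boxcox_transformed, lam = stats.boxcox(data)  # Box-Cox transformed values and fitted lambda
print(sqrt_transformed)
print(boxcox_transformed)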
Data Wrangling
Introduction to Data Wrangling:
Data wrangling, also known as data munging, is the process of collecting, organizing,
cleaning, enriching, validating, and publishing data for analysis.
It is a critical step in data science because real-world data is often messy, incomplete, and
inconsistent.
Effective data wrangling ensures that the data is usable, reliable, and insightful for further
analysis, reporting, or machine learning.
Steps of Data Wrangling:
1. Discovery (Data Gathering)
• This step involves identifying and collecting raw data from multiple sources such as
databases, CSV files, APIs, web scraping, or manual records.
• The goal is to bring all relevant data into a single working environment.
import pandas as pd
# Load data from a CSV file
data = pd.read_csv('sales_data.csv')
Ways to do:
• Identify data sources (internal databases, APIs, spreadsheets, etc.).
• Use pandas, requests, sqlalchemy in Python to import the data.
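A minimal sketch of pulling data from a web API with requests (the URL is a hypothetical placeholder returning JSON records):
import requests
import pandas as pd
response = requests.get('https://api.example.com/sales')  # hypothetical endpoint
api_data = pd.DataFrame(response.json())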
2. Organization
• Understand the dataset's structure, inspect columns, data types, null values, and
basic statistics.
• Helps in planning what cleaning and transformation are needed.
# Check the first 5 rows
print(data.head())
# Summary statistics
print(data.describe())
# Data types and null counts
print(data.info())
Ways to do:
• Use .head(), .describe(), .info() methods.
• Visual inspection or exploratory data analysis (EDA) using libraries like matplotlib,
seaborn.
3. Cleaning
• Handle missing values, duplicate records, incorrect formats, and inconsistent
categories.
• This step makes the data reliable for analysis.
# Drop rows with missing values
data_cleaned = data.dropna()
# Remove duplicate rows
data_cleaned = data_cleaned.drop_duplicates()
# Fill missing values with mean
data['price'] = data['price'].fillna(data['price'].mean())
Ways to do:
• dropna(), fillna(), drop_duplicates() in pandas.
• Replace incorrect values manually or through conditional logic.
4. Data Enrichment
• Improve the dataset by adding new features, transforming variables, normalizing
data, and aggregating information.
# Normalize a column
data['normalized_sales'] = (data['sales'] - data['sales'].min()) / (data['sales'].max() - data['sales'].min())
# Create a new feature
data['revenue'] = data['sales'] * data['price']
# One-hot encode categorical data
data = pd.get_dummies(data, columns=['region'])
Ways to do:
• Feature engineering.
• Encoding categorical variables (LabelEncoder, OneHotEncoder).
• Data normalization (MinMaxScaler, StandardScaler from sklearn).
5. Validation
• Validate the data to ensure it meets business requirements, logical rules, and
statistical thresholds.
# Check if any sales value is negative
invalid_sales = data[data['sales'] < 0]
print(invalid_sales)
Ways to do:
• Logical tests (assert, conditional checks).
• Automated validation scripts.
• Use business rules and thresholds to test consistency.
6. Publish
• Save the clean, validated dataset in appropriate formats (CSV, JSON, database) and
document the wrangling process for reproducibility.
# Save cleaned data
data.to_csv('cleaned_sales_data.csv', index=False)
Ways to do:
• Export using to_csv(), to_json(), database exports.
• Maintain logs or scripts of data wrangling steps.
Data Wrangling Formats:
Format Explanation
Transactional Data Detailed records of business activities like invoices, receipts, customer interactions.
Analytical Base Table (ABT) A structured table where each row is an entity (e.g., customer) and each column is a feature, used in AI/ML.
Time-Series Data Data collected over time intervals (daily sales, stock prices) to detect trends and patterns.
Document Library Data Text-based data stored in documents; needs techniques like text mining and NLP for analysis.
Data Wrangling Benefits:
• Data Consistency:
Uniform and error-free data enhances analysis accuracy.
• Improved Insights:
Clean and enriched data helps in generating actionable business insights.
• Cost Efficiency:
Reduces time and computational resources required for downstream data tasks.
Imagine you have a messy sales dataset:
Product,Price,Sales,Region
A,100,10,North
B,200,,South
C,150,20,North
A,100,10,North
D,300,15,
Step-by-step Wrangling in Python:
import pandas as pd
# Step 1: Discovery
data = pd.read_csv('sales_data.csv')
# Step 2: Organization
print(data.head())
print(data.info())
# Step 3: Cleaning
data = data.drop_duplicates()
data['Sales'] = data['Sales'].fillna(0)
data['Region'] = data['Region'].fillna('Unknown')
# Step 4: Data Enrichment
data['Revenue'] = data['Price'] * data['Sales']
# Step 5: Validation
assert data['Sales'].min() >= 0
# Step 6: Publish
data.to_csv('cleaned_sales_data.csv', index=False)
Data Acquisition with Python
Loading Data from CSV Files, Importing, Cleaning, and
Describing Data in Python
Loading data from CSV (Comma-Separated Values) files is one of the most common ways
to acquire data in Python, especially for structured, tabular datasets.
Once loaded, the data often requires cleaning to address issues like missing values,
duplicates, or inconsistencies.
Describing the data is then essential to understand its structure, content, and
characteristics. This process is foundational in data science and machine learning
workflows.
What is a CSV File?
A CSV file is a plain-text file that stores tabular data, where each row represents a record,
and columns are separated by commas (or other delimiters like tabs or semicolons). It is
widely used due to its simplicity, compatibility with tools like Excel, and support in
programming languages like Python.
Why Use CSV Files?
• Simplicity: Easy to read, write, and share.
• Compatibility: Supported by databases, spreadsheets, and programming libraries.
• Lightweight: Smaller file size compared to Excel or JSON for tabular data.
• Structured Data: Ideal for datasets with rows and columns (e.g., sales records,
customer data).
Why is This Process Important?
• Loading: Imports data into Python for analysis or modeling.
• Cleaning: Ensures data quality by fixing errors, missing values, and inconsistencies.
• Describing: Provides insights into the data’s structure, distribution, and potential
issues, guiding further preprocessing.
1. Loading and Importing Data from CSV Files
What is Loading/Importing?
Loading or importing involves reading a CSV file into a Python environment, typically as a
DataFrame (a tabular data structure) for easy manipulation. Python’s pandas library is the
most popular tool for this task due to its flexibility and functionality.
Methods to Load CSV Files
1. Using pandas.read_csv():
o The primary method to load CSV files into a Pandas DataFrame.
o Key Parameters:
▪ filepath: Path to the CSV file (e.g., 'data.csv').
▪ nrows: Number of rows to read (e.g., nrows=100).
▪ chunksize: Read large files in chunks (e.g., chunksize=1000).
o Example:
import pandas as pd
df = pd.read_csv('data.csv', sep=',', encoding='utf-8', index_col='id')
o Use Case: Loading a customer dataset with an ID column as the index.
2. Using Python’s Built-in csv Module:
o Reads CSV files as lists or dictionaries without creating a DataFrame.
o Useful for lightweight or custom processing.
o Methods:
▪ csv.reader: Reads rows as lists.
▪ csv.DictReader: Reads rows as dictionaries with column names as
keys.
o Example:
import csv
with open('data.csv', 'r') as file:
    reader = csv.DictReader(file)
    data = [row for row in reader]  # List of dictionaries
o Use Case: Reading small CSV files for simple processing.
Handling Common Issues During Import
• File Not Found: Ensure the file path is correct (use absolute paths or check the
working directory with os.getcwd()).
• Encoding Errors: Specify the correct encoding (e.g., 'latin1' for non-UTF-8 files).
• Delimiter Mismatch: Use the correct sep parameter (e.g., sep=';' for semicolon-separated files).
• Large Files: Use chunksize to read in chunks or nrows to limit rows (see the sketch below).
• Malformed CSV: Skip bad lines with on_bad_lines='skip' (the older error_bad_lines=False argument is deprecated in recent Pandas versions).
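A minimal sketch of reading a large CSV in chunks (the file name is a placeholder):
import pandas as pd
pieces = [chunk for chunk in pd.read_csv('large_data.csv', chunksize=1000)]  # hypothetical file
df = pd.concat(pieces, ignore_index=True)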
2. Cleaning Data Loaded from CSV Files
What is Data Cleaning?
Data cleaning involves fixing errors, inconsistencies, and missing values in the loaded
dataset to ensure it is accurate and usable. CSV files often contain issues like missing
data, duplicates, incorrect formats, or outliers that must be addressed.
Common Issues in CSV Data
• Missing values (e.g., empty cells, 'NA', or 'null').
• Duplicate rows or records.
• Inconsistent formats (e.g., mixed date formats, varying capitalization).
• Outliers or erroneous values (e.g., negative ages).
• Incorrect data types (e.g., numbers stored as strings).
Cleaning Techniques
1. Handling Missing Data:
o Identify Missing Values:
df.isnull().sum() # Count missing values per column
o Methods:
▪ Drop Missing Values: Remove rows or columns with missing data.
df.dropna() # Drop rows with any missing values
df.dropna(subset=['age'], inplace=True) # Drop rows where 'age' is missing
▪ Impute Missing Values: Replace missing values with a statistic or
rule.
df['age'].fillna(df['age'].mean(), inplace=True) # Impute with mean
df['city'].fillna('Unknown', inplace=True) # Impute with a constant
▪ Forward/Backward Fill: Use previous or next values (for time-series
data).
df['sales'].fillna(method='ffill', inplace=True) # Forward fill
2. Removing Duplicates:
o Identify Duplicates:
df.duplicated().sum() # Count duplicate rows
o Remove Duplicates:
df.drop_duplicates(inplace=True) # Remove all duplicate rows
df.drop_duplicates(subset=['customer_id'], keep='last', inplace=True) # Keep last occurrence
3. Correcting Inconsistent Formats:
o Standardize Text:
df['name'] = df['name'].str.lower().str.strip() # Convert to lowercase, remove whitespace
o Standardize Dates:
df['date'] = pd.to_datetime(df['date'], errors='coerce') # Convert to datetime
o Fix Typos:
df['city'].replace({'New Yrok': 'New York', 'NY': 'New York'}, inplace=True)
3. Describing Data: Methods to Understand the Dataset
What is Data Description?
Describing data involves summarizing and exploring the dataset to understand its
structure, content, distribution, and potential issues. This step is critical for identifying
patterns, anomalies, or preprocessing needs.
Methods to Describe Data
1. Basic Information with df.info():
o Displays column names, data types, non-null counts, and memory usage.
o Example:
df.info()
o Use Case: Check for missing values and incorrect data types.
2. Summary Statistics with df.describe():
o Provides statistical measures for numerical columns (count, mean, std, min,
max, quartiles).
o Example:
df.describe()
3. Value Counts with df.value_counts():
o Counts unique values in a column, useful for categorical data.
o Example:
df['city'].value_counts()
4. Unique Values with df.nunique() and df.unique():
o df.nunique(): Counts unique values per column.
df.nunique()
5. Missing Values with df.isnull():
o Quantifies missing data per column.
df.isnull().sum()
6. Correlation Matrix with df.corr():
o Measures pairwise correlations between numerical columns (e.g., Pearson
correlation).
df.corr()
7. GroupBy Summaries with df.groupby():
o Aggregates data by categories to compute statistics.
df.groupby('city')['salary'].mean()
Accessing SQL Databases in Python
1⃣ Accessing SQL Database using sqlite3 (Built-in)
• sqlite3 is a built-in Python library to work with SQLite databases (no installation
needed).
• SQLite is a lightweight, disk-based database that doesn't require a separate
server.
• You use SQL commands (like SELECT, INSERT, UPDATE) through Python to interact
with the database.
Steps to Access using sqlite3
1. Import the sqlite3 library.
2. Connect to a database (creates file if it doesn't exist).
3. Create a Cursor object to execute SQL queries.
4. Execute SQL commands.
5. Commit changes (if modifying data).
6. Close the connection.
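A minimal sketch following these steps (the database file and table names are illustrative):
import sqlite3
conn = sqlite3.connect('example.db')   # connect (creates the file if it doesn't exist)
cursor = conn.cursor()                 # create a cursor
cursor.execute('CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)')
cursor.execute('INSERT INTO users (name) VALUES (?)', ('Alice',))
conn.commit()                          # save the changes
print(cursor.execute('SELECT * FROM users').fetchall())
conn.close()                           # close the connection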
2⃣ Accessing SQL Database using SQLAlchemy
• SQLAlchemy is a powerful ORM (Object-Relational Mapper) for Python.
• It abstracts away SQL complexity by allowing you to interact with databases using
Python classes.
• Supports many database engines: SQLite, PostgreSQL, MySQL, etc.
• Safer, more scalable than direct SQL queries.
Steps to Access using SQLAlchemy
1. Install SQLAlchemy (pip install sqlalchemy).
2. Import necessary modules.
3. Create an Engine (database connection).
4. Connect and execute queries.
5. Use ORM models or raw SQL.
6. Close the connection (optional).
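A minimal sketch using SQLAlchemy with SQLite (the users table is assumed to exist; wrapping raw SQL in text() is required in SQLAlchemy 1.4+):
from sqlalchemy import create_engine, text
engine = create_engine('sqlite:///example.db')           # create an Engine
with engine.connect() as conn:                           # connect
    result = conn.execute(text('SELECT * FROM users'))   # execute a query
    print(result.fetchall())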
3⃣ Accessing SQL Database using pandas (read_sql_query, to_sql)
• pandas can directly query SQL databases.
• Useful for reading entire tables or custom queries into DataFrames.
• You can also write DataFrames back into SQL tables.
Steps to Access using pandas
1. Install pandas (pip install pandas).
2. Import pandas and sqlite3/SQLAlchemy.
3. Connect to the database.
4. Use read_sql_query to fetch data into a DataFrame.
5. Use to_sql to write DataFrame into database.
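A minimal sketch combining pandas with sqlite3 (table names are illustrative):
import sqlite3
import pandas as pd
conn = sqlite3.connect('example.db')
df = pd.read_sql_query('SELECT * FROM users', conn)               # read a query result into a DataFrame
df.to_sql('users_copy', conn, if_exists='replace', index=False)   # write a DataFrame back to the database
conn.close()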
Database Operations in Python (10 Marks)
When working with databases in Python (like SQLite, MySQL, PostgreSQL), the basic
operations are:
Operation Meaning
connect Establish a connection to the database.
create Create new tables or structures.
insert Insert records/data into tables.
execute Run SQL queries (like create, insert, update, select, delete).
fetchall Retrieve all rows from a query result.
fetchone Retrieve one row at a time.
commit Save (commit) changes made to the database.
close Close the database connection properly.
1⃣ connect() — Connecting to the Database
• It is the first step.
• Using sqlite3.connect('database_name.db') creates or connects to an existing
database.
import sqlite3
conn = sqlite3.connect('my_database.db') # Connects to database
If the file my_database.db doesn’t exist, it will be created automatically.
import sqlite3
conn = sqlite3.connect('example.db') # Creates/opens a database file
2⃣ execute() — Executing SQL Queries
• After connecting, we execute SQL queries using a cursor.
• A cursor allows you to send SQL commands to the database.
cursor = conn.cursor()
cursor.execute('''CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)''')
The SQL inside execute() can be CREATE, INSERT, UPDATE, DELETE, SELECT, etc.
3⃣ create — Creating Tables
• Creating a table is done via an SQL CREATE TABLE statement.
• Make sure to use IF NOT EXISTS to avoid errors if the table already exists.
cursor.execute('''
CREATE TABLE IF NOT EXISTS employees (
    id INTEGER PRIMARY KEY,
    name TEXT,
    department TEXT
)
''')
Tables define the structure of how your data will be stored.
cursor = conn.cursor()
cursor.execute('''
CREATE TABLE IF NOT EXISTS employees (
    id INTEGER PRIMARY KEY,
    name TEXT,
    age INTEGER
)
''')
4⃣ insert — Inserting Data into Tables
• We use SQL INSERT INTO statements to add data into tables.
cursor.execute("INSERT INTO employees (name, department) VALUES (?, ?)", ("John Doe",
"Finance"))
Using ? and a tuple (("John Doe", "Finance")) protects against SQL Injection attacks
(safer practice).
cursor.execute('INSERT INTO employees (name, age) VALUES (?, ?)', ('Alice', 30))
conn.commit()
cursor.execute('SELECT * FROM employees')
rows = cursor.fetchall()
for row in rows:
    print(row)
5⃣ commit() — Saving Changes
• After making changes (inserting, updating, deleting), you need to commit them.
• If you forget commit(), changes will not be saved permanently.
conn.commit()
commit() finalizes your changes in the database.
6⃣ fetchall() — Fetching All Records
• After a SELECT query, you can fetch all rows at once into a list.
cursor.execute("SELECT * FROM employees")
rows = cursor.fetchall()
for row in rows:
    print(row)
fetchall() returns a list of tuples, each tuple being a row.
7⃣ fetchone() — Fetching One Record
• Instead of fetching all at once, you can fetch one row at a time.
cursor.execute("SELECT * FROM employees")
row = cursor.fetchone()
print(row)
Useful when you know the query returns only one or a few rows.
8⃣ close() — Closing the Connection
• After finishing database operations, you must close the connection properly.
conn.close()
Prevents memory leaks and keeps your application efficient.
Cleaning Data in Python and Stripping Out Extraneous
Information
Data cleaning is a critical step in the data science pipeline, involving the identification and
correction of errors, inconsistencies, and irrelevant information in a dataset to ensure it is
accurate, consistent, and suitable for analysis or modeling.
Stripping out extraneous information is a subset of data cleaning, focusing on removing
unnecessary or redundant data that does not contribute to the analysis.
Python, with its powerful libraries like Pandas, NumPy, and others, offers numerous
methods to clean data and remove extraneous information.
What is Data Cleaning?
Data cleaning, also known as data cleansing or scrubbing, is the process of detecting and
correcting (or removing) errors, inconsistencies, and inaccuracies in a dataset. It
addresses issues such as missing values, duplicates, incorrect formats, outliers, and
irrelevant data to improve data quality. Clean data is essential for reliable analysis,
visualization, and machine learning model performance.
What is Stripping Out Extraneous Information?
Stripping out extraneous information involves removing unnecessary, redundant, or
irrelevant data from a dataset. This could include:
• Unneeded columns or rows.
• Redundant text (e.g., extra spaces, prefixes, or suffixes).
• Metadata or formatting artifacts (e.g., HTML tags, comments).
• Data unrelated to the analysis goal (e.g., irrelevant features).
Why is Data Cleaning Important?
• Improves Data Quality: Ensures accuracy and consistency.
• Enhances Analysis: Clean data leads to reliable insights and model performance.
• Reduces Errors: Minimizes biases or incorrect conclusions.
• Saves Resources: Removes unnecessary data to improve processing efficiency.
• Meets Algorithm Requirements: Many machine learning algorithms require clean,
numerical data.
Why Focus on Stripping Extraneous Information?
• Simplifies Datasets: Reduces complexity and improves interpretability.
• Optimizes Performance: Decreases memory usage and computation time.
• Focuses Analysis: Keeps only relevant data for the task at hand.
Common Data Issues Requiring Cleaning
Before diving into cleaning methods, it’s important to understand the typical issues found
in datasets:
1. Missing Values: Empty cells or placeholders (e.g., 'NA', 'null').
2. Duplicates: Identical or near-identical records.
3. Inconsistent Formats: Mixed date formats, varying capitalization, or units.
4. Outliers: Extreme values that skew analysis.
5. Incorrect Data Types: Numbers stored as strings or vice versa.
6. Extraneous Information: Irrelevant columns, redundant text, or formatting
artifacts.
7. Typos and Errors: Misspellings or invalid entries (e.g., 'New Yrok' instead of 'New
York').
8. Noise: Random or irrelevant data points.
Handling Missing Values in Python
Missing data is a common problem in real-world datasets. Before applying any data
analysis or machine learning models, it’s essential to understand and handle missing
values properly. Poor handling can lead to inaccurate models or misleading results.
Python, especially with libraries like pandas, numpy, and scikit-learn, provides many
tools to identify, visualize, and handle missing data.
1. Identifying Missing Values
First, we need to detect missing values.
In pandas:
• Missing values are usually represented as NaN (numpy.nan) or None.
Common methods:
import pandas as pd
# Create a simple DataFrame
df = pd.DataFrame({
'Name': ['Alice', 'Bob', None, 'Dave'],
'Age': [25, None, 22, 23],
'City': ['NY', 'LA', 'NY', None]
})
# Check for missing values
print(df.isnull())
# Check the number of missing values in each column
print(df.isnull().sum())
Other functions:
• df.info() → shows non-null counts.
• df.describe() → can reveal suspicious counts if some data is missing.
2. Ways to Handle Missing Values
There are multiple strategies depending on the problem and dataset size:
A. Removing Data
1. Remove rows with missing values
If the dataset is large and only a few rows have missing data, dropping them may be
acceptable.
# Drop rows with any missing value
df.dropna(inplace=True)
2. Remove columns with missing values
If a column has too many missing values (e.g., >70%), you might drop the column.
# Drop columns with missing data
df.dropna(axis=1, inplace=True)
Caution: Dropping data may lead to loss of important information.
B. Imputing Data (Filling Missing Values)
Instead of removing, we can fill missing values with plausible data.
1. Fill with a fixed value
# Fill NaN with a specified value
df.fillna('Unknown', inplace=True)
2. Fill with statistical values
• Mean, Median, or Mode (most frequent value) are commonly used for numeric
data.
# Fill missing 'Age' with the mean age
df['Age'].fillna(df['Age'].mean(), inplace=True)
# Fill missing 'City' with mode
df['City'].fillna(df['City'].mode()[0], inplace=True)
3. Forward Fill (ffill)
• Propagate the previous value forward.
# Forward fill
df.fillna(method='ffill', inplace=True)
4. Backward Fill (bfill)
• Use the next value to fill the missing one.
# Backward fill
df.fillna(method='bfill', inplace=True)
Removing Duplicates
• Duplicate rows are identical copies of data.
• Duplicates can affect analysis, cause bias, or inflate numbers.
• Use drop_duplicates() to remove them.
df.drop_duplicates(inplace=True)
• To remove duplicates based only on specific columns:
df.drop_duplicates(subset=['column1', 'column2'], inplace=True)
• Keep first occurrence:
df.drop_duplicates(keep='first', inplace=True)
Correcting Data Types
• Correct data types are crucial: e.g., date columns should be datetime, numbers
should be int/float.
• Use .astype() to convert types.
• Wrong types cause errors during analysis or modeling.
df['age'] = df['age'].astype(int)
df['date'] = pd.to_datetime(df['date'])
• Convert multiple columns:
df = df.astype({'age': 'float64', 'salary': 'int64'})
Handling Outliers
• Outliers are extreme values that differ greatly from most data points.
• Use statistical techniques to detect them (IQR, Z-Score).
• Remove or treat outliers to make models perform better.
Using IQR (Interquartile Range):
Q1 = df['column'].quantile(0.25)
Q3 = df['column'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df = df[(df['column'] > lower_bound) & (df['column'] < upper_bound)]
Standardizing and Normalizing Data
• Standardization: Center around mean (mean = 0, std = 1).
• Normalization: Rescale between 0 and 1.
• Useful for ML models sensitive to scale.
Standardization using StandardScaler:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df['normalized'] = scaler.fit_transform(df[['column']])
Normalization using MinMaxScaler:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df['normalized'] = scaler.fit_transform(df[['column']])
Fixing Formatting Issues
• Formatting problems include inconsistent dates, extra spaces, wrong text case.
• Fix them to avoid hidden issues in analysis.
# Remove extra whitespace
df['column'] = df['column'].str.strip()
# Standardize date format
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
• Text standardization:
df['name'] = df['name'].str.lower()
Encoding Categorical Variables
• ML algorithms can't work with text. Categories must be converted into numbers.
• Techniques: Label Encoding, One-Hot Encoding.
df = pd.get_dummies(df, columns=['category'])
Label Encoding:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['category_encoded'] = le.fit_transform(df['category'])
Saving Cleaned Data
• After cleaning, save the dataset for later use without losing progress.
• Formats: CSV, Excel, SQL Database, etc.
df.to_csv('cleaned_data.csv', index=False)
df.to_excel('cleaned_data.xlsx', index=False)
Stripping Out Extraneous Information
Step 1: Removing Unnecessary Columns
df.drop(columns=['unnecessary_col1', 'unnecessary_col2'], inplace=True)
Step 2: Removing Extra Whitespace
df['column_name'] = df['column_name'].str.strip()
Step 3: Removing Duplicates
Already explained above with drop_duplicates().
Step 4: Filtering Irrelevant Data
df = df[df['column_name'] != 'irrelevant_value']
Can also filter based on multiple conditions.
df = df[(df['age'] > 18) & (df['status'] == 'active')]
Step 5: Fixing Inconsistent Data
df['column_name'] = df['column_name'].str.lower() # or str.upper()
Step 6: Removing Special Characters
df['column_name'] = df['column_name'].str.replace(r'[^\w\s]', '', regex=True)
Use regex=True for powerful text cleaning.
Normalizing and Formatting Data in Python
Normalizing and formatting data are essential steps in data preprocessing, ensuring that
data is structured, consistent, and suitable for analysis or machine learning.
Normalization involves scaling numerical data to a standard range or distribution to
ensure fair comparisons across features, while formatting focuses on standardizing data
representations (e.g., text, dates, or categorical data) to improve consistency and usability.
What is Data Normalization?
Normalization is the process of transforming numerical data to a common scale, typically
[0, 1] or another standard range, to eliminate the influence of differing units or magnitudes.
It ensures that features with larger ranges (e.g., salary in thousands vs. age in years)
contribute equally to analysis or machine learning models.
Why Normalize Data?
• Equal Feature Contribution: Prevents features with larger scales from dominating
models (e.g., in distance-based algorithms like KNN or SVM).
• Improves Convergence: Speeds up optimization in gradient-based algorithms (e.g.,
neural networks, linear regression).
• Enhances Comparability: Allows fair comparisons across variables.
• Meets Algorithm Requirements: Some algorithms assume normalized data for
optimal performance.
Types of Normalization
Normalization techniques adjust data based on its distribution or range. The main methods
are described below.
What is Data Formatting?
Formatting involves standardizing the representation of data to ensure consistency and
compatibility with analysis or modeling tasks. It addresses issues like inconsistent text
cases, date formats, or categorical encodings, making data easier to process and interpret.
Why Format Data?
• Consistency: Ensures uniform data representation (e.g., all dates in YYYY-MM-DD).
• Compatibility: Prepares data for algorithms that require specific formats (e.g.,
numerical inputs).
• Readability: Improves human interpretability for reporting or visualization.
• Error Prevention: Reduces errors caused by inconsistent or ambiguous data.
Types of Formatting
Formatting applies to numerical, categorical, text, and temporal data, with techniques
tailored to each type.
Python Libraries for Normalizing and Formatting Data
Python offers several libraries for these tasks:
• Pandas: For data manipulation, formatting, and basic normalization.
• NumPy: For numerical operations and advanced normalization.
• Scikit-learn: For machine learning-friendly normalization and encoding.
• re (Regular Expressions): For text formatting.
• datetime: For date and time formatting.
• String: For basic text manipulation.
Methods for Normalizing Data
Normalization techniques scale numerical data to a common range or distribution. Below
are the key methods, with theory, use cases, and concise code examples.
1. Min-Max Normalization
• Scales data to a fixed range, typically [0, 1].
• Formula: X_norm = (X - X_min) / (X_max - X_min)
• Preserves the original data distribution but is sensitive to outliers.
import pandas as pd
df = pd.read_csv('data.csv')
df['salary_norm'] = (df['salary'] - df['salary'].min()) / (df['salary'].max() - df['salary'].min())
Scikit-learn Method:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df['salary_norm'] = scaler.fit_transform(df[['salary']])
2. Z-Score Normalization (Standardization)
• Centers data around the mean (μ) with a standard deviation (σ) of 1.
• Formula: X_norm = (X - μ) / σ
• Assumes data follows a normal distribution but is less affected by outliers than min-
max.
df['age_norm'] = (df['age'] - df['age'].mean()) / df['age'].std()
Scikit-learn Method:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df['age_norm'] = scaler.fit_transform(df[['age']])
3. Robust Scaling
• Scales data using the median and interquartile range (IQR) instead of mean and
standard deviation.
• Formula: X_norm = (X - median) / IQR, where IQR = Q3 - Q1.
• Highly robust to outliers.
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
df['salary_norm'] = (df['salary'] - df['salary'].median()) / IQR
Scikit-learn Method:
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
df['salary_norm'] = scaler.fit_transform(df[['salary']])
Methods for Formatting Data
Formatting ensures data is consistently represented across the dataset. Below are the key
methods, with theory, use cases, and concise code examples.
1. Formatting Text Data
• Standardizes text to ensure consistency in case, spacing, or content.
• Common tasks include converting case, removing whitespace, and cleaning special
characters.
Methods:
• Convert Case:
df['name'] = df['name'].str.lower() # Lowercase
df['name'] = df['name'].str.title() # Title case
• Remove Whitespace:
df['name'] = df['name'].str.strip() # Remove leading/trailing spaces
• Replace Substrings:
df['city'] = df['city'].str.replace('NY', 'New York') # Standardize names
2. Formatting Dates and Times
• Converts date and time data to a consistent format (e.g., YYYY-MM-DD).
• Ensures compatibility with time-series analysis or database storage.
Methods:
• Convert to Datetime:
df['date'] = pd.to_datetime(df['date'], errors='coerce') # Handle invalid dates
• Standardize Format:
df['date'] = df['date'].dt.strftime('%Y-%m-%d') # Format as YYYY-MM-DD
• Extract Components:
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
3. Formatting Numerical Data
• Ensures numerical data is in the correct format (e.g., float, integer) and precision.
• Handles issues like strings stored as numbers or excessive decimal places.
Methods:
• Convert to Numeric:
df['price'] = pd.to_numeric(df['price'], errors='coerce') # Convert to float, NaN for invalid
• Round Numbers:
df['price'] = df['price'].round(2) # Round to 2 decimals
• Convert to Integer:
df['age'] = df['age'].astype(int) # Convert to integer
4. Formatting Categorical Data
• Standardizes categorical data by encoding or cleaning categories for analysis or
modeling.
• Ensures categories are consistent and machine-readable.
Methods:
• Standardize Categories:
df['department'] = df['department'].str.lower().str.strip()
df['department'] = df['department'].replace({'hr': 'Human Resources'})
• Label Encoding:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['department_code'] = le.fit_transform(df['department'])
• One-Hot Encoding:
df = pd.get_dummies(df, columns=['city'], prefix='city') # Create binary columns
• Handle Rare Categories:
threshold = 10
counts = df['category'].value_counts()
df['category'] = df['category'].apply(lambda x: x if counts[x] >= threshold else 'Other')
Reading and Writing Data in Text Format
1. Introduction to File Handling in Python
File handling allows a program to persist data — saving outputs, reading inputs, managing
configurations, or exchanging information between systems.
In Python, files are handled with built-in functions and modules.
Key concepts:
• Opening a file.
• Reading from a file.
• Writing to a file.
• Closing the file after use.
2. Opening a File
Use the built-in open() function:
open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)
• file: path to the file.
• mode: defines the operation — read, write, append, etc.
• encoding: specifies character encoding (like 'utf-8').
Example:
f = open('example.txt', 'r') # Open for reading
Important: Always close the file after operations using .close() OR use with open(...) as ...:
to auto-close.
3. Modes of Opening Files
Mode Meaning
'r' Read-only mode (default). File must exist.
'w' Write mode. Creates or overwrites the file.
'a' Append mode. Adds data to the end of file.
'r+' Read and write mode. File must exist.
't' Text mode (default).
'b' Binary mode (e.g., images, executable files).
Modes can be combined: e.g., 'rb' (read binary).
4. Reading from a Text File
Once a file is opened for reading ('r' mode):
Methods:
• read(): Reads the whole file as one string.
• readline(): Reads one line at a time.
• readlines(): Reads all lines into a list.
Example:
with open('example.txt', 'r') as file:
    content = file.read()
    print(content)
Looping through lines:
with open('example.txt', 'r') as file:
    for line in file:
        print(line.strip())
Tip: strip() removes \n (newline) characters at the end.
5. Writing to a Text File
Writing happens in 'w' or 'a' mode:
• 'w' will overwrite existing content.
• 'a' will append content to existing content.
Writing a single string:
with open('example.txt', 'w') as file:
    file.write('Hello World\n')
Writing multiple lines:
lines = ['Line 1\n', 'Line 2\n']
with open('example.txt', 'w') as file:
    file.writelines(lines)
Note: You need to add \n manually after each line!
8. Advanced: File Positioning
• file.tell(): Returns the current file cursor position.
• file.seek(offset, whence): Changes the file cursor position.
Example:
file.seek(0) # Move cursor to the start of the file
whence can be:
• 0 – beginning of file (default)
• 1 – current position
• 2 – end of file
Useful when doing partial reads or rewinds.
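A small sketch combining tell() and seek() (assumes example.txt exists):
with open('example.txt', 'r') as f:
    first_line = f.readline()
    print(f.tell())   # cursor position after reading the first line
    f.seek(0)         # rewind to the beginning of the file
    print(f.read())   # read the whole file again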
9. Reading and Writing Structured Text (CSV, JSON)
Many times, data is not just plain text but structured in formats like CSV or JSON.
Reading/Writing CSV files:
import csv
# Writing
with open('people.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['Name', 'Age'])
    writer.writerow(['Alice', 25])
# Reading
with open('people.csv', 'r') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)
Reading/Writing JSON files:
import json
# Writing JSON
with open('data.json', 'w') as f:
    json.dump({'name': 'Alice', 'age': 25}, f)
# Reading JSON
with open('data.json', 'r') as f:
    data = json.load(f)
Combining and Merging Data in Python
Combining and merging data are critical processes in data preprocessing that involve
integrating multiple datasets into a unified structure for analysis or modeling.
These techniques are essential when data is sourced from different files, databases, or
systems, and need to be consolidated to provide a comprehensive view. Python,
particularly with the Pandas library, offers powerful tools for combining and merging data
efficiently.
What is Combining and Merging Data?
Combining data refers to stacking or concatenating datasets vertically (row-wise) or
horizontally (column-wise) to create a single dataset. It is typically used when datasets
have similar structures (e.g., same columns) or complementary information (e.g.,
additional columns).
Merging data involves joining datasets based on common keys or indices, similar to SQL
joins, to integrate related information from different sources. It is used when datasets
share a common column or index (e.g., customer ID) but contain different attributes.
Why Combine and Merge Data?
• Unified Analysis: Integrates data from multiple sources for comprehensive insights.
• Enriched Datasets: Combines complementary information (e.g., customer
demographics with purchase history).
• Data Consolidation: Simplifies workflows by creating a single dataset from
fragmented sources.
• Improved Model Performance: Provides more features or records for machine
learning.
• Real-World Relevance: Most real-world data is stored in multiple tables or files,
requiring integration.
Key Concepts in Combining and Merging Data
1. Combining (Concatenation):
o Vertical Concatenation: Stacks datasets row-wise (appending rows).
o Horizontal Concatenation: Stacks datasets column-wise (adding columns).
o Requires datasets to have compatible structures (e.g., same columns for
vertical, same rows for horizontal).
2. Merging (Joining):
o Combines datasets based on a common key (e.g., customer_id) or index.
o Types of merges (similar to SQL joins):
▪ Inner: Keeps only matching records.
▪ Left: Keeps all records from the left dataset, with NaNs for non-
matching right records.
▪ Right: Keeps all records from the right dataset, with NaNs for non-
matching left records.
▪ Outer (Full): Keeps all records, with NaNs where there’s no match.
Methods for Combining Data
Combining data involves stacking datasets vertically or horizontally using concatenation.
Below are the key methods, with theory, use cases, and concise code examples.
1. Vertical Concatenation (Appending Rows)
• Stacks datasets row-wise, adding rows from one dataset to another.
• Requires datasets to have the same columns (or a subset) for alignment.
• Used when datasets represent different records of the same structure (e.g., sales
data from different months).
Code Example:
import pandas as pd
df1 = pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'id': [3, 4], 'name': ['Charlie', 'David']})
df_combined = pd.concat([df1, df2], axis=0, ignore_index=True)
Key Parameters (pd.concat):
• axis=0: Concatenate vertically (rows).
• ignore_index=True: Reset index to avoid duplicate indices.
• join='outer': Include all columns (default); use 'inner' for common columns.
• keys: Add a hierarchical index to distinguish sources (e.g., keys=['df1', 'df2']).
2. Horizontal Concatenation (Appending Columns)
• Stacks datasets column-wise, adding columns from one dataset to another.
• Requires datasets to have the same number of rows or aligned indices.
• Used when datasets contain different attributes for the same records (e.g.,
customer demographics and purchase amounts).
Code Example:
df1 = pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'age': [25, 30], 'city': ['New York', 'Chicago']})
df_combined = pd.concat([df1, df2], axis=1)
Key Parameters (pd.concat):
• axis=1: Concatenate horizontally (columns).
• join='outer': Include all rows; use 'inner' for matching rows.
• ignore_index=False: Retain column names (default).
Methods for Merging Data
Merging data involves joining datasets based on a common key or index, similar to SQL
joins. Below are the key methods, with theory, use cases, and concise code examples.
1. Inner Merge
• Keeps only records with matching keys in both datasets.
• Equivalent to SQL INNER JOIN.
• Used when only complete matches are needed (e.g., customers with orders).
Code Example:
df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'id': [1, 2, 4], 'age': [25, 30, 35]})
df_merged = pd.merge(df1, df2, on='id', how='inner')
Use Case:
• Matching customer profiles with their purchase history.
Pros:
• Produces a clean dataset with no missing values.
• Efficient for focused analysis.
Cons:
• Excludes non-matching records, reducing dataset size.
2. Left Merge
• Keeps all records from the left dataset, with NaNs for non-matching records from
the right dataset.
• Equivalent to SQL LEFT OUTER JOIN.
• Used when the left dataset is the primary source (e.g., all customers, even without
orders).
Code Example:
df_merged = pd.merge(df1, df2, on='id', how='left')
Use Case:
• Including all customer records, with optional order data.
Pros:
• Preserves all left dataset records.
• Useful for comprehensive datasets.
Cons:
• Introduces NaNs for non-matches.
3. Right Merge
• Keeps all records from the right dataset, with NaNs for non-matching records from
the left dataset.
• Equivalent to SQL RIGHT OUTER JOIN.
• Used when the right dataset is the priority (e.g., all orders, even for unknown
customers).
Code Example:
df_merged = pd.merge(df1, df2, on='id', how='right')
Use Case:
• Ensuring all order records are included, even for missing customers.
Pros:
• Preserves all right dataset records.
• Useful for right-centric analysis.
Cons:
• Introduces NaNs for non-matches.
4. Outer (Full) Merge
• Keeps all records from both datasets, with NaNs where there’s no match.
• Equivalent to SQL FULL OUTER JOIN.
• Used when all data is needed, regardless of matches (e.g., complete customer and
order data).
Code Example:
df_merged = pd.merge(df1, df2, on='id', how='outer')
Use Case:
• Combining all available data for exploratory analysis.
Pros:
• Retains all information.
• Comprehensive for data integration.
Cons:
• Introduces many NaNs, requiring further cleaning.
5. Merging on Index
• Joins datasets based on their index instead of a column.
• Useful when datasets are indexed by a common identifier (e.g., time series data).
Code Example:
df1 = pd.DataFrame({'name': ['Alice', 'Bob']}, index=[1, 2])
df2 = pd.DataFrame({'age': [25, 30]}, index=[1, 2])
df_merged = df1.merge(df2, left_index=True, right_index=True)
6. Merging on Multiple Keys
• Joins datasets using multiple columns as keys for precise matching.
• Used when a single key is insufficient (e.g., matching by customer_id and date).
Code Example:
df1 = pd.DataFrame({'id': [1, 1], 'date': ['2023-01-01', '2023-01-02'], 'name': ['Alice', 'Alice']})
df2 = pd.DataFrame({'id': [1, 1], 'date': ['2023-01-01', '2023-01-03'], 'age': [25, 26]})
df_merged = pd.merge(df1, df2, on=['id', 'date'], how='inner')
Additional Methods for Combining and Merging
1. Joining DataFrames
• A Pandas-specific method to merge DataFrames based on indices or columns.
• Similar to merge but optimized for index-based joins.
Code Example:
df1 = pd.DataFrame({'name': ['Alice', 'Bob']}, index=[1, 2])
df2 = pd.DataFrame({'age': [25, 30]}, index=[1, 2])
df_joined = df1.join(df2)
2. Appending DataFrames (Deprecated)
• A legacy method to append rows (similar to vertical concatenation).
• Replaced by pd.concat in modern Pandas versions.
Code Example:
df1 = pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'id': [3, 4], 'name': ['Charlie', 'David']})
df_appended = df1.append(df2, ignore_index=True) # Deprecated (removed in pandas 2.0)
# Use pd.concat instead:
df_appended = pd.concat([df1, df2], ignore_index=True)
Reshaping and Pivoting Data in Python
Reshaping and pivoting are essential data manipulation techniques in Python used to
reorganize datasets to make them more suitable for analysis, visualization, or modeling.
These processes transform the structure of data, such as changing its layout from wide to
long format or creating summary tables.
Python’s Pandas library provides powerful tools like melt, pivot, pivot_table, and others to
perform these tasks efficiently.
What is Reshaping and Pivoting?
Reshaping involves changing the structure or layout of a dataset without altering its
content. It typically transforms data between wide format (where each variable has its own
column) and long format (where variables are stacked into a single column with
corresponding values).
Pivoting is a specific type of reshaping that creates a summary table by spreading data
across rows and columns based on key variables. It is often used to aggregate data or
create cross-tabulations for easier interpretation.
Why Reshape and Pivot Data?
• Improved Analysis: Reorganizes data to match the requirements of statistical
methods or visualizations.
• Flexibility: Converts data between formats (e.g., long for time-series, wide for
summaries).
• Simplified Summaries: Pivoting creates concise tables for reporting or
dashboards.
• Data Compatibility: Prepares data for tools or algorithms that expect specific
structures.
• Real-World Relevance: Many datasets require reshaping to align with analytical
goals (e.g., converting survey data for analysis).
Key Concepts in Reshaping and Pivoting
1. Wide vs. Long Format:
o Wide Format: Each variable is a column, and each row represents a record
(e.g., columns for sales_2021, sales_2022).
o Long Format: Variables are stacked into a single column, with another
column indicating the variable name (e.g., columns for year and sales).
Methods for Reshaping and Pivoting Data
Below are the key methods for reshaping and pivoting data, with theory, use cases, and
concise code examples.
1. Melting (Wide to Long Format)
• Converts a wide-format dataset to a long format by stacking multiple columns into a
single column.
• Used when data needs to be reformatted for time-series analysis, visualization, or
statistical modeling.
• Function: pd.melt.
Code Example:
import pandas as pd
df = pd.DataFrame({'id': [1, 2], 'sales_2021': [100, 150], 'sales_2022': [120, 180]})
df_melted = pd.melt(df, id_vars=['id'], value_vars=['sales_2021', 'sales_2022'], var_name='year', value_name='sales')
Key Parameters (pd.melt):
• id_vars: Columns to keep as identifiers (not melted).
• value_vars: Columns to melt into a single column (default: all non-id_vars).
• var_name: Name of the column storing variable names (e.g., year).
• value_name: Name of the column storing values (e.g., sales).
2. Pivoting (Long to Wide Format)
• Converts a long-format dataset to a wide format by spreading a column’s values
across multiple columns.
• Creates a table with specified rows, columns, and values.
• Function: pd.pivot.
• Assumes no duplicate entries for the index-column combination (otherwise, use
pivot_table).
Code Example:
df = pd.DataFrame({'id': [1, 1, 2], 'year': [2021, 2022, 2021], 'sales': [100, 120, 150]})
df_pivoted = pd.pivot(df, index='id', columns='year', values='sales')
Key Parameters (pd.pivot):
• index: Column to use as rows.
• columns: Column to spread into new columns.
• values: Column containing values to populate the table.
3. Pivot Table (Aggregated Pivoting)
• Similar to pivot, but allows aggregation of duplicate entries using a function (e.g.,
mean, sum).
• Function: pd.pivot_table.
• Used for summarizing data or handling datasets with multiple values for the same
index-column pair.
Code Example:
df = pd.DataFrame({'id': [1, 1, 2], 'year': [2021, 2021, 2022], 'sales': [100, 110, 150]})
df_pivot_table = pd.pivot_table(df, index='id', columns='year', values='sales', aggfunc='mean')
Key Parameters (pd.pivot_table):
• index: Column for rows.
• columns: Column for new columns.
• values: Column to aggregate.
• aggfunc: Aggregation function (e.g., 'mean', 'sum', 'count').
• fill_value: Replace NaNs with a value (e.g., fill_value=0).
4. Stacking and Unstacking
• Stacking: Moves column labels into a row index, converting wide data to a long
format (multi-level index).
• Unstacking: Moves row index levels to columns, converting long data to a wide
format.
• Functions: stack, unstack.
• Used for hierarchical indices or multi-level data.
Code Example (Stack):
df = pd.DataFrame({'id': [1, 2], 'sales_2021': [100, 150], 'sales_2022': [120, 180]})
df_stacked = df.set_index('id').stack().reset_index(name='sales')
Code Example (Unstack):
df = pd.DataFrame({'id': [1, 1, 2], 'year': [2021, 2022, 2021], 'sales': [100, 120, 150]})
df_unstacked = df.set_index(['id', 'year'])['sales'].unstack()
Data Transformation in Python
Data transformation is a critical step in data preprocessing that involves modifying,
restructuring, or reformatting a dataset to make it more suitable for analysis, visualization,
or machine learning.
It encompasses a wide range of techniques, such as scaling, encoding, aggregating,
reshaping, and deriving new features, to ensure data meets the requirements of the
intended task.
Python, particularly with the Pandas, NumPy, and Scikit-learn libraries, provides powerful
tools for data transformation.
What is Data Transformation?
Data transformation refers to the process of converting raw or unprocessed data into a
format that is more appropriate for analysis, modeling, or reporting.
It involves applying mathematical, statistical, or logical operations to change the structure,
scale, or representation of data. The goal is to enhance data quality, compatibility, and
interpretability.
Why Transform Data?
• Improves Analysis: Transformed data is easier to analyze or visualize (e.g.,
normalized values for comparisons).
• Meets Algorithm Requirements: Many machine learning algorithms require
specific data formats (e.g., numerical inputs, scaled features).
• Enhances Interpretability: Derived features or aggregated data provide clearer
insights.
• Handles Inconsistencies: Standardizes data to resolve issues like mixed formats or
scales.
• Optimizes Performance: Reduces computational complexity or improves model
accuracy.
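A minimal sketch of a few common transformations on a hypothetical sales DataFrame (column names are assumptions):
import numpy as np
import pandas as pd
df = pd.DataFrame({'region': ['N', 'S', 'N'], 'sales': [100, 250, 400]})
df['log_sales'] = np.log(df['sales'])  # change the distribution (reduce skew)
df['sales_scaled'] = (df['sales'] - df['sales'].min()) / (df['sales'].max() - df['sales'].min())  # rescale to [0, 1]
region_totals = df.groupby('region')['sales'].sum()  # aggregate by category
print(df)
print(region_totals)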
String Manipulation in Python
Introduction to Strings
• A string is a sequence of Unicode characters.
• Strings are immutable — once created, they cannot be changed.
• Python provides many methods for creating, transforming, and analyzing strings.
• String manipulation is essential in data cleaning, processing user inputs, text
analysis, etc.
Categories of String Manipulation in Python
String Manipulation can be grouped into different functional categories based on the
kind of operation performed.
1. Accessing Characters and Substrings
Operations where you retrieve part(s) of a string.
• Indexing: Getting a specific character.
• Slicing: Getting a substring (a section of the string).
• Negative indexing: Access from the end.
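Example:
s = "Python"
print(s[0])    # 'P'   (indexing)
print(s[1:4])  # 'yth' (slicing)
print(s[-1])   # 'n'   (negative indexing)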
2. Modifying String Content
Operations that change how the string looks without changing the original.
• Changing case: .lower(), .upper(), .title(), .capitalize(), .swapcase()
• Replacing substrings: .replace(old, new)
• Trimming whitespace: .strip(), .lstrip(), .rstrip()
• Padding strings: .zfill(width)
Example:
s = " hello "
print(s.strip()) # 'hello'
print(s.upper()) # ' HELLO '
3. Combining or Splitting Strings
Operations that join or break strings apart.
• Concatenation: Using +
• Repetition: Using *
• Splitting: .split(separator), .splitlines()
• Joining: .join(list)
Example:
s = "apple,banana,grape"
print(s.split(',')) # ['apple', 'banana', 'grape']
print("-".join(["a", "b", "c"])) # 'a-b-c'
4. Searching and Analyzing Strings
Finding parts of the string or checking properties.
• Searching for substrings: .find(), .rfind(), .index()
• Counting occurrences: .count(substring)
• Checking start/end: .startswith(), .endswith()
• Property checks: .isalpha(), .isdigit(), .isalnum(), .isspace(), etc.
Example:
s = "hello world"
print(s.find("world")) # 6
print(s.isalpha()) # False
5. Formatting Strings
Inserting variables or formatting numbers inside a string.
• Old style: "Hello %s" % name
• str.format() method: "Hello {}".format(name)
• f-strings (modern Python): f"Hello {name}"
Example:
name = "Alice"
print(f"Hello {name}") # 'Hello Alice'
6. Regular Expressions (Advanced String Matching)
Powerful pattern-based searching and transformations.
• Matching (re.match, re.search)
• Finding all matches (re.findall)
• Replacing (re.sub)
• Splitting (re.split)
Example:
import re
print(re.findall(r'\d+', 'abc123xyz456')) # ['123', '456']
Basic String Operations
1. String Creation
s = "Hello World"
print(s)
Strings can be single or multi-line.
2. String Indexing and Slicing
• Strings are indexed (first character at index 0).
• Slicing extracts parts of a string.
• Syntax: string[start:end:step].
• Negative indexing starts from the end.
• Slicing does not modify the original string.
Example:
s = "Python"
print(s[0]) # Output: P
print(s[1:4]) # Output: yth
print(s[-1]) # Output: n
3. String Concatenation and Repetition
• + operator joins two strings.
• * operator repeats the string multiple times.
• Efficient for formatting dynamic content.
• Must ensure both are strings (use str() if needed).
• Because strings are immutable, these operations create new string objects.
Example:
s1 = "Hello"
s2 = "World"
print(s1 + " " + s2) # Hello World
print(s1 * 3) # HelloHelloHello
4. String Methods for Changing Case
• lower() → converts to lowercase.
• upper() → converts to uppercase.
• capitalize() → first letter capital.
• title() → first letter of each word capital.
• swapcase() → switches case (lower ↔ upper).
Example:
s = "hello WORLD"
print(s.lower()) # hello world
print(s.title()) # Hello World
5. Searching for Substrings
• find() → returns first index of substring or -1.
• rfind() → last index from right.
• index() → like find() but raises error if not found.
• Helps in text searching and parsing.
• Very fast and efficient in huge texts.
Example:
s = "hello world"
print(s.find('world')) # Output: 6
6. String Replacement
• replace(old, new) replaces substrings.
• Returns a new string.
• Can replace multiple instances at once.
• Useful for cleaning unwanted characters.
• Supports chain replacements.
Example:
s = "bad dog"
print(s.replace('bad', 'good')) # good dog
7. Splitting and Joining Strings
• split(separator) → splits string into list.
• join(iterable) → joins list into string.
• Useful in parsing CSVs, logs, texts.
• splitlines() for line breaks.
• Powerful for preprocessing.
Example:
s = "apple,banana,grape"
fruits = s.split(",")
print(fruits) # ['apple', 'banana', 'grape']
print(",".join(fruits)) # apple,banana,grape
8. Checking String Properties
• isalpha() → all letters.
• isdigit() → all numbers.
• isalnum() → letters and numbers.
• isspace() → only whitespace.
• Good for data validation.
Example:
s = "Python3"
print(s.isalnum()) # True
9. String Formatting
• .format() method.
• f-strings (Python 3.6+).
• % old-style formatting.
• Powerful for inserting dynamic data.
• Readable, clean, and recommended.
Example:
name = "Alice"
print(f"Hello {name}") # Hello Alice
10. Trimming Strings
• strip() removes leading/trailing spaces.
• lstrip(), rstrip() remove from left or right.
• Important for cleaning messy input data.
• Returns a new string.
• Custom characters can also be trimmed.
Example:
s = " Hello World "
print(s.strip()) # Hello World
Regular Expressions in Python