Pandas for Machine Learning
Data Analysis and Manipulation with Python
Introduction to Pandas
What is Pandas?
Pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool built on top of the Python programming language.
Key Features
• DataFrame object for data manipulation with integrated indexing
• Tools for reading and writing data between in-memory data structures and different file formats
• Data alignment and integrated handling of missing data
• Reshaping and pivoting of datasets
• Label-based slicing, fancy indexing, and subsetting of large datasets
Why Essential for Machine Learning?
• Simplifies data preprocessing and cleaning
• Provides efficient data structures for analysis
Core Data Structures
Series

A Series is a one-dimensional labeled array capable of holding any data type.

import pandas as pd

# Creating a Series
s = pd.Series([1, 3, 5, 7, 9],
              index=['a', 'b', 'c', 'd', 'e'])

a    1
b    3
c    5
d    7
e    9
dtype: int64

Key Series Features
• Labeled index for fast lookup
• Vectorized operations
• Can handle mixed data types

DataFrame

A DataFrame is a two-dimensional labeled data structure with columns of potentially different types.

import pandas as pd

# Creating a DataFrame
data = {
    'Name': ['John', 'Anna', 'Peter'],
    'Age': [28, 34, 29],
    'City': ['New York', 'Paris', 'Berlin']
}
df = pd.DataFrame(data)

    Name  Age      City
0   John   28  New York
1   Anna   34     Paris
2  Peter   29    Berlin

Key DataFrame Features
• Column and row indexing
• SQL-like joins and merges
• Handling of missing data
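Two of the Series features listed above — label-based lookup and vectorized operations — can be sketched on the same example data (the second Series here is added purely for illustration):

```python
import pandas as pd

# Series with a labeled index, as in the example above
s = pd.Series([1, 3, 5, 7, 9],
              index=['a', 'b', 'c', 'd', 'e'])

# Label-based lookup: the index works like a dict key
print(s['c'])        # 5

# Vectorized arithmetic applies element-wise, no loop needed
doubled = s * 2
print(doubled['e'])  # 18

# Labels align automatically between Series; labels present in
# only one operand produce NaN in the result
other = pd.Series([10, 20], index=['a', 'e'])
summed = s + other
print(summed['a'])   # 11.0
```

Automatic index alignment is what distinguishes Series arithmetic from plain NumPy arrays: values are matched by label, not by position.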
Data Loading and Inspection
Reading Data

Pandas provides functions to read data from various file formats:

# CSV files
df = pd.read_csv('data.csv')

# Excel files
df = pd.read_excel('data.xlsx')

# SQL databases
df = pd.read_sql('SELECT * FROM table', conn)

# JSON files
df = pd.read_json('data.json')

Common Parameters
• index_col: Column to use as the index
• usecols: List of columns to read
• nrows: Number of rows to read
• skiprows: Rows to skip
• na_values: Additional values to treat as NaN

Basic Inspection Methods

After loading data, use these methods to understand your dataset:

# View first/last n rows
df.head(5)   # First 5 rows
df.tail(3)   # Last 3 rows

# Get summary information
df.info()    # Data types and non-null counts

# Statistical summary
df.describe()  # Count, mean, std, min, max, etc.

# Column data types
df.dtypes

Quick Checks
• df.shape: Dimensions (rows, columns)
• df.columns: Column names
• df.index: Row index
• df.isnull().sum(): Count missing values per column
• df['column'].value_counts(): Frequency counts for a column
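The reading parameters and quick checks above can be exercised without a file on disk by wrapping CSV text in io.StringIO (the column names and the 'missing' marker here are invented for the example):

```python
import io
import pandas as pd

# In-memory CSV standing in for a file on disk
csv_text = """id,name,age,score
1,Alice,34,88
2,Bob,,91
3,Carol,29,missing
"""

# usecols limits which columns are parsed; na_values adds
# custom markers to the default set of NaN strings
df = pd.read_csv(io.StringIO(csv_text),
                 index_col='id',
                 usecols=['id', 'name', 'age', 'score'],
                 na_values=['missing'])

print(df.shape)           # (3, 3)
print(df.isnull().sum())  # one missing age, one missing score
```

The same pd.read_csv call works unchanged with a real file path in place of the StringIO object.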
Data Cleaning and Preprocessing
Handling Missing Values

Missing data is a common issue in real-world datasets. Pandas provides several methods to detect and handle missing values.

# Check for missing values
df.isnull().sum()

# Drop rows with missing values
df.dropna()

# Fill missing values (note: fillna returns a new object,
# so assign the result back)
df.fillna(value=0)
df['column'] = df['column'].fillna(df['column'].mean())

Data Type Conversion

Converting data to the correct types is crucial for proper analysis and memory efficiency.

# Check data types
df.dtypes

# Convert column to numeric
df['numeric_column'] = pd.to_numeric(df['column'])

# Convert column to datetime
df['date_column'] = pd.to_datetime(df['column'])

# Convert column to category (memory efficient)
df['category_column'] = df['column'].astype('category')

Removing Duplicates

Duplicate data can skew analysis results. Pandas makes it easy to identify and remove duplicates.

# Count duplicate rows
df.duplicated().sum()

# Remove duplicate rows (returns a new DataFrame)
df = df.drop_duplicates()
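The cleaning steps above can be combined on a small made-up DataFrame (names and values are invented for the example):

```python
import pandas as pd
import numpy as np

# Made-up dataset with one missing value and one duplicate row
df = pd.DataFrame({
    'name': ['Ann', 'Ben', 'Ben', 'Cara'],
    'age': [25.0, 30.0, 30.0, np.nan],
})

# Fill the missing age with the column mean
df['age'] = df['age'].fillna(df['age'].mean())

# Drop the duplicated row; drop_duplicates returns a new frame
df = df.drop_duplicates().reset_index(drop=True)

print(len(df))           # 3 rows remain
print(df.loc[2, 'age'])  # Cara's age, filled with the mean
```

Order matters here: filling before deduplicating means the mean is computed over the data as loaded, duplicates included — in a real pipeline you may want the opposite order.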
Exploratory Data Analysis
Statistical Summaries

Pandas provides powerful methods to quickly understand your data:

# Basic statistics
df.describe()

# Correlation between numeric features
df.corr()

Grouping & Aggregation

Group data and apply functions to each group:

# Group by category and calculate mean
df.groupby('category')['value'].mean()

# Multiple aggregations
df.groupby('category').agg({
    'value': ['min', 'max', 'mean']
})

Visualization Capabilities

Pandas integrates with matplotlib for quick visualizations:

# Basic plotting
df['column'].plot(kind='hist')

# Box plot by category
df.boxplot(by='category')

Key EDA Functions
• df.head() / df.tail(): View first/last rows
• df.info(): Summary of the DataFrame
• df.isnull().sum(): Count missing values
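The grouping patterns above run as-is on a toy DataFrame (the data is made up for the example):

```python
import pandas as pd

# Toy data for the grouping examples
df = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B', 'B'],
    'value': [10, 20, 5, 15, 25],
})

# Mean of 'value' per group
means = df.groupby('category')['value'].mean()
print(means['A'])  # 15.0

# Several aggregations at once; the result has a
# MultiIndex on its columns: ('value', 'min'), etc.
stats = df.groupby('category').agg({'value': ['min', 'max', 'mean']})
print(stats.loc['B', ('value', 'max')])  # 25
```

The column MultiIndex produced by agg with a list of functions is a common stumbling block; stats.columns can be flattened with a list comprehension if single-level names are needed downstream.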
Feature Engineering with Pandas
What is Feature Engineering?

The process of creating new features or transforming existing ones to improve model performance.

Common Techniques
• Creating new features from existing ones
• Transforming features (scaling, normalization)
• Encoding categorical variables
• Handling date/time features

Feature Creation Example

# Sample sales data
df = pd.DataFrame({
    'price': [10, 15, 20],
    'quantity': [5, 3, 2]
})

# Create new feature
df['total_revenue'] = df['price'] * df['quantity']

One-Hot Encoding Example

# Sample data with a categorical variable
df = pd.DataFrame({
    'city': ['New York', 'Paris', 'Berlin']
})

# One-hot encoding
pd.get_dummies(df, columns=['city'])

Date Feature Extraction

# Sample date data
df = pd.DataFrame({
    'date': pd.date_range('2023-01-01', periods=3)
})

# Extract date components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek
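It is worth seeing what the encoding and date-extraction calls above actually return; a runnable sketch using the same sample data:

```python
import pandas as pd

# One-hot encoding: one indicator column per category,
# named '<column>_<value>' and sorted alphabetically
df = pd.DataFrame({'city': ['New York', 'Paris', 'Berlin']})
encoded = pd.get_dummies(df, columns=['city'])
print(list(encoded.columns))
# ['city_Berlin', 'city_New York', 'city_Paris']
print(int(encoded.loc[1, 'city_Paris']))  # 1

# Date extraction: dt.dayofweek counts Monday as 0, Sunday as 6
dates = pd.DataFrame({'date': pd.date_range('2023-01-01', periods=3)})
dates['day_of_week'] = dates['date'].dt.dayofweek
print(dates['day_of_week'].iloc[0])  # 6 (2023-01-01 was a Sunday)
```

Note that recent pandas versions return boolean indicator columns from get_dummies; cast with .astype(int) if a model requires numeric 0/1 input.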
Pandas in ML Workflow
Data Preparation for ML

Pandas is essential for preparing data before feeding it to machine learning algorithms:

• Data cleaning and preprocessing
• Feature selection and engineering
• Handling categorical variables

Integration with ML Libraries

Pandas DataFrames work seamlessly with popular machine learning libraries:

• scikit-learn: Most functions accept pandas DataFrames
• TensorFlow/Keras: Convert with tf.convert_to_tensor(df)
• PyTorch: Convert with torch.tensor(df.values)

Train-Test Split Example

import pandas as pd
from sklearn.model_selection import train_test_split

# Load data and split
df = pd.read_csv('data.csv')
X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

Model Evaluation with Pandas

import pandas as pd
from sklearn.metrics import confusion_matrix

# Create DataFrame from predictions
results = pd.DataFrame({
    'actual': y_test,
    'predicted': model.predict(X_test)
})

# Analyze results
results['correct'] = results['actual'] == results['predicted']
accuracy = results['correct'].mean()

# Confusion matrix of actual vs. predicted labels
cm = confusion_matrix(results['actual'], results['predicted'])
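The evaluation pattern above needs nothing beyond pandas itself once predictions exist; here is a self-contained sketch with made-up stand-ins for y_test and the model's predictions, using pd.crosstab in place of scikit-learn's confusion_matrix:

```python
import pandas as pd

# Stand-ins for y_test and model.predict(X_test)
y_test = pd.Series([1, 0, 1, 1, 0])
y_pred = pd.Series([1, 0, 0, 1, 0])

results = pd.DataFrame({'actual': y_test, 'predicted': y_pred})

# Accuracy as the fraction of matching rows
results['correct'] = results['actual'] == results['predicted']
accuracy = results['correct'].mean()
print(accuracy)  # 0.8

# Confusion matrix via crosstab: rows are actual, columns predicted
cm = pd.crosstab(results['actual'], results['predicted'])
print(cm.loc[1, 0])  # 1: one positive predicted as negative
```

Keeping predictions in a DataFrame like this makes follow-up analysis (per-class accuracy, error inspection with boolean filters) straightforward.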
Best Practices and Performance Tips
Vectorization Over Loops

Always use pandas' vectorized operations instead of Python loops for better performance.

# Slow: looping row by row
for i in range(len(df)):
    df.loc[i, 'new_col'] = df.loc[i, 'col1'] * 2

# Fast: vectorized operation
df['new_col'] = df['col1'] * 2

Efficient Data Processing
• Use .loc and .iloc for indexing instead of chained indexing
• Prefer query() or boolean indexing for filtering large DataFrames
• Use apply() with the appropriate axis for custom operations, but only when no vectorized alternative exists

Memory Optimization
• Use appropriate data types (e.g., category for categorical data)
• Don't rely on inplace=True to save memory; it usually still copies internally and is discouraged in modern pandas
• Consider the chunksize parameter when reading large files

Method Chaining

Chain methods together for cleaner, more readable code.

# Instead of multiple steps
df = df.dropna()
df = df.reset_index()

# Use method chaining
df = (df
      .dropna()
      .reset_index()
      .rename(columns={'old': 'new'}))
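A runnable version of the vectorization and chaining patterns above, on a small made-up DataFrame:

```python
import pandas as pd

# Made-up data: one missing value in 'col1'
df = pd.DataFrame({
    'col1': [1.0, 2.0, None, 4.0],
    'old': ['w', 'x', 'y', 'z'],
})

# Vectorized: one expression instead of a Python loop;
# NaN propagates through the arithmetic automatically
df['new_col'] = df['col1'] * 2

# Method chaining: one pipeline, no intermediate variables
clean = (df
         .dropna()
         .reset_index(drop=True)
         .rename(columns={'old': 'new'}))

print(len(clean))           # 3 rows after dropping the NaN row
print(list(clean.columns))  # ['col1', 'new', 'new_col']
```

Each step in the chain returns a new DataFrame, so the pipeline reads top-to-bottom and leaves the original df untouched.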
Resources and Next Steps
Learning Resources
• Official Pandas Documentation: pandas.pydata.org
• Python for Data Analysis by Wes McKinney
• Kaggle Learn: kaggle.com/learn/pandas
• Data School YouTube Channel

Community and Support
• Stack Overflow: stackoverflow.com/questions/tagged/pandas
• GitHub: github.com/pandas-dev/pandas
• PyData Community

Advanced Pandas Topics
• MultiIndex and advanced indexing
• Time series analysis
• Categorical data
• Performance optimization

Next Steps
• Practice with real-world datasets
• Build end-to-end data analysis projects
• Combine pandas with visualization libraries
• Explore scikit-learn integration