Pandas For Machine Learning

Pandas is an open-source data analysis and manipulation tool for Python, featuring DataFrame and Series structures for efficient data handling. It is essential for machine learning as it simplifies data preprocessing, cleaning, and feature engineering. The document covers key functionalities, data loading, cleaning, exploratory data analysis, and best practices for using Pandas effectively.


Pandas for Machine Learning
Data Analysis and Manipulation with Python
Introduction to Pandas

What is Pandas?
Pandas is a fast, powerful, flexible, and easy-to-use open-
source data analysis and manipulation tool built on top of the
Python programming language.

Key Features
• DataFrame object for data manipulation with integrated
indexing
• Tools for reading and writing data between in-memory data
structures and different file formats
• Data alignment and integrated handling of missing data
• Reshaping and pivoting of datasets
• Label-based slicing, fancy indexing, and subsetting of large
datasets
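The label-based slicing and boolean ("fancy") indexing mentioned above can be sketched on a small, hypothetical DataFrame (the column names and values here are purely illustrative):

```python
import pandas as pd

# Hypothetical DataFrame with a labeled index
df = pd.DataFrame(
    {'price': [10, 15, 20, 25], 'qty': [5, 3, 2, 4]},
    index=['a', 'b', 'c', 'd']
)

# Label-based slicing with .loc includes BOTH endpoints
subset = df.loc['b':'c', ['price']]

# Boolean indexing subsets rows by a condition
cheap = df[df['price'] < 20]
```

Note that `.loc` slices are inclusive of the end label, unlike ordinary Python slices.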

Why Essential for Machine Learning?


• Simplifies data preprocessing and cleaning
• Provides efficient data structures for analysis
Core Data Structures

Series

A Series is a one-dimensional labeled array capable of holding any data type.

import pandas as pd

# Creating a Series
s = pd.Series([1, 3, 5, 7, 9],
              index=['a', 'b', 'c', 'd', 'e'])

a    1
b    3
c    5
d    7
e    9
dtype: int64

Key Series Features
• Labeled index for fast lookup
• Vectorized operations
• Can handle mixed data types

DataFrame

A DataFrame is a two-dimensional labeled data structure with columns of potentially different types.

import pandas as pd

# Creating a DataFrame
data = {
    'Name': ['John', 'Anna', 'Peter'],
    'Age': [28, 34, 29],
    'City': ['New York', 'Paris', 'Berlin']
}
df = pd.DataFrame(data)

    Name  Age      City
0   John   28  New York
1   Anna   34     Paris
2  Peter   29    Berlin

Key DataFrame Features
• Column and row indexing
• SQL-like joins and merges
• Handling of missing data
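The vectorized operations and labeled indexes listed above combine in one distinctive behavior: arithmetic between two Series aligns by label, not by position. A minimal sketch with hypothetical values:

```python
import pandas as pd

# Two Series with partially overlapping labeled indexes
s1 = pd.Series([1, 3, 5], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Addition is vectorized and aligned by label;
# labels present in only one Series produce NaN
total = s1 + s2
```

Here 'b' and 'c' get summed values, while 'a' and 'd' become NaN because they exist in only one of the two Series.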
Data Loading and Inspection

Reading Data

Pandas provides functions to read data from various file formats:

# CSV files
df = pd.read_csv('data.csv')

# Excel files
df = pd.read_excel('data.xlsx')

# SQL databases
df = pd.read_sql('SELECT * FROM table', conn)

# JSON files
df = pd.read_json('data.json')

Common Parameters
• index_col : Column to use as index
• usecols : List of columns to read
• nrows : Number of rows to read
• skiprows : Rows to skip
• na_values : Values to treat as NaN

Basic Inspection Methods

After loading data, use these methods to understand your dataset:

# View first/last n rows
df.head(5)     # First 5 rows
df.tail(3)     # Last 3 rows

# Get summary information
df.info()      # Data types and non-null values

# Statistical summary
df.describe()  # Count, mean, std, min, max, etc.

# Column data types
df.dtypes

Quick Checks
• df.shape : Dimensions (rows, columns)
• df.columns : Column names
• df.index : Row indices
• df.isnull().sum() : Count missing values per column
• df['column'].value_counts() : Frequency counts for a column
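The read parameters listed above can be combined in a single read_csv call. A runnable sketch using an inline string in place of a file on disk (the CSV content is hypothetical):

```python
import io
import pandas as pd

# Inline CSV standing in for a file on disk
csv_text = """id,name,age,city
1,John,28,New York
2,Anna,NA,Paris
3,Peter,29,Berlin
"""

# Several of the common parameters at once
df = pd.read_csv(
    io.StringIO(csv_text),
    index_col='id',                  # use 'id' as the row index
    usecols=['id', 'name', 'age'],   # read only these columns
    na_values=['NA'],                # treat the string 'NA' as missing
)
```

Note that when both are given, the index_col column must also appear in usecols.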
Data Cleaning and Preprocessing

Handling Missing Values

Missing data is a common issue in real-world datasets. Pandas provides several methods to detect and handle missing values.

# Check for missing values
df.isnull().sum()

# Drop rows with missing values
df.dropna()

# Fill missing values
df.fillna(value=0)
df['column'].fillna(df['column'].mean())

Data Type Conversion

Converting data to the correct types is crucial for proper analysis and memory efficiency.

# Check data types
df.dtypes

# Convert column to numeric
df['numeric_column'] = pd.to_numeric(df['column'])

# Convert column to datetime
df['date_column'] = pd.to_datetime(df['column'])

# Convert column to category (memory efficient)
df['category_column'] = df['column'].astype('category')

Removing Duplicates

Duplicate data can skew analysis results. Pandas makes it easy to identify and remove duplicates.

# Count duplicate rows
df.duplicated().sum()

# Remove duplicate rows
df.drop_duplicates()
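The cleaning steps above can be chained into a small end-to-end example. The dataset here is hypothetical, constructed to contain one missing value and one exact duplicate row:

```python
import pandas as pd
import numpy as np

# Hypothetical data: one missing age, one duplicate row
df = pd.DataFrame({
    'name': ['John', 'Anna', 'Anna', 'Peter'],
    'age': [28.0, 34.0, 34.0, np.nan],
})

# Fill the missing age with the column mean (mean of 28, 34, 34 = 32)
df['age'] = df['age'].fillna(df['age'].mean())

# Drop the exact duplicate row
df = df.drop_duplicates()
```

After these two steps the frame has three rows and no missing values.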
Exploratory Data Analysis

Statistical Summaries

Pandas provides powerful methods to quickly understand your data:

# Basic statistics
df.describe()

# Correlation between numeric features
df.corr(numeric_only=True)

Visualization Capabilities

Pandas integrates with matplotlib for quick visualizations:

# Basic plotting
df['column'].plot(kind='hist')

# Box plot by category
df.boxplot(by='category')

Grouping & Aggregation

Group data and apply functions to each group:

# Group by category and calculate mean
df.groupby('category')['value'].mean()

# Multiple aggregations
df.groupby('category').agg({
    'value': ['min', 'max', 'mean']
})

Key EDA Functions
• df.head()/df.tail() : View first/last rows
• df.info() : Summary of DataFrame
• df.isnull().sum() : Count missing values
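The grouping snippets above can be run end to end on a small, hypothetical dataset; note that the multi-aggregation form returns a frame with a two-level column index:

```python
import pandas as pd

# Hypothetical data for grouping
df = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B'],
    'value': [10, 20, 30, 50],
})

# Mean per group: A -> 15, B -> 40
means = df.groupby('category')['value'].mean()

# Multiple aggregations; columns become a MultiIndex
# like ('value', 'min'), ('value', 'max'), ('value', 'mean')
stats = df.groupby('category').agg({'value': ['min', 'max', 'mean']})
```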
Feature Engineering with Pandas

What is Feature Engineering?

The process of creating new features or transforming existing ones to improve model performance.

Common Techniques
• Creating new features from existing ones
• Transforming features (scaling, normalization)
• Encoding categorical variables
• Handling date/time features

Feature Creation Example

# Sample sales data
df = pd.DataFrame({
    'price': [10, 15, 20],
    'quantity': [5, 3, 2]
})

# Create new feature
df['total_revenue'] = df['price'] * df['quantity']

One-Hot Encoding Example

# Sample data with categorical variable
df = pd.DataFrame({
    'city': ['New York', 'Paris', 'Berlin']
})

# One-hot encoding
pd.get_dummies(df, columns=['city'])

Date Feature Extraction

# Sample date data
df = pd.DataFrame({
    'date': pd.date_range('2023-01-01', periods=3)
})

# Extract date components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek
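It is worth being explicit about what get_dummies produces: one indicator column per category, named by joining the original column name with the category value. A quick check on the same three-city example:

```python
import pandas as pd

# Same hypothetical categorical column as above
df = pd.DataFrame({'city': ['New York', 'Paris', 'Berlin']})

# One indicator column per distinct city value
encoded = pd.get_dummies(df, columns=['city'])
```

Row 1 ('Paris') is marked only in the city_Paris column; each row has exactly one indicator set.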
Pandas in ML Workflow

Data Preparation for ML

Pandas is essential for preparing data before feeding it to machine learning algorithms:
• Data cleaning and preprocessing
• Feature selection and engineering
• Handling categorical variables

Integration with ML Libraries

Pandas DataFrames work seamlessly with popular machine learning libraries:
• scikit-learn : Most functions accept pandas DataFrames
• TensorFlow/Keras : Convert with tf.convert_to_tensor(df)
• PyTorch : Convert with torch.tensor(df.values)

Train-Test Split Example

import pandas as pd
from sklearn.model_selection import train_test_split

# Load data and split
df = pd.read_csv('data.csv')
X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

Model Evaluation with Pandas

import pandas as pd

# Create DataFrame from predictions
results = pd.DataFrame({
    'actual': y_test,
    'predicted': model.predict(X_test)
})

# Analyze results
results['correct'] = results['actual'] == results['predicted']
accuracy = results['correct'].mean()
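As an aside, when scikit-learn is not on hand, a simple random holdout split in the same spirit can be sketched with pandas alone using df.sample (the DataFrame here is hypothetical; train_test_split above remains the standard tool, since it also handles stratification):

```python
import pandas as pd

# Hypothetical dataset of 10 rows
df = pd.DataFrame({'feature': range(10), 'target': [0, 1] * 5})

# Random 80% of rows for training, fixed seed for reproducibility
train = df.sample(frac=0.8, random_state=42)

# The remaining 20% (rows not drawn above) for testing
test = df.drop(train.index)
```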
Best Practices and Performance Tips

Vectorization Over Loops

Always use pandas' vectorized operations instead of Python loops for better performance.

# Slow: Using loops
for i in range(len(df)):
    df.loc[i, 'new_col'] = df.loc[i, 'col1'] * 2

# Fast: Vectorized operation
df['new_col'] = df['col1'] * 2

Efficient Data Processing
• Use .loc and .iloc for indexing instead of chained indexing
• Prefer query() or boolean indexing for filtering large DataFrames
• Use apply() with the appropriate axis only when no vectorized equivalent exists; it runs a Python function per row or column

Memory Optimization
• Use appropriate data types (e.g., category for categorical data)
• Don't rely on inplace=True for performance; it rarely avoids copies and is discouraged in modern pandas
• Consider the chunksize parameter when reading large files

Method Chaining

Chain methods together for cleaner, more readable code.

# Instead of multiple steps
df = df.dropna()
df = df.reset_index()

# Use method chaining
df = (df
      .dropna()
      .reset_index()
      .rename(columns={'old': 'new'}))
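The category-dtype tip above can be verified directly with memory_usage on a low-cardinality string column (the column here is hypothetical):

```python
import pandas as pd

# 4000 rows but only two distinct string values:
# a good candidate for the 'category' dtype
df = pd.DataFrame({'city': ['Paris', 'Berlin'] * 2000})

# Memory footprint before conversion (deep=True counts the strings)
before = df['city'].memory_usage(deep=True)

# Convert: values are replaced by small integer codes
# plus one stored copy of each distinct string
df['city'] = df['city'].astype('category')
after = df['city'].memory_usage(deep=True)
```

The savings grow with row count, since the per-row cost drops from a full Python string to a small integer code.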
Resources and Next Steps

Learning Resources
• Official Pandas Documentation: pandas.pydata.org
• Python for Data Analysis by Wes McKinney
• Kaggle Learn: kaggle.com/learn/pandas
• Data School YouTube Channel

Community and Support
• Stack Overflow: stackoverflow.com/questions/tagged/pandas
• GitHub: github.com/pandas-dev/pandas
• PyData Community

Advanced Pandas Topics
• MultiIndex and advanced indexing
• Time series analysis
• Categorical data
• Performance optimization

Next Steps
• Practice with real-world datasets
• Build end-to-end data analysis projects
• Combine pandas with visualization libraries
• Explore scikit-learn integration
