Introduction to Data Handling with Pandas and NumPy
Debasish Dutta
July 4, 2024
Contents
1 Pandas
1.1 Basic Operations and Data Structures
1.1.1 Creating Series and DataFrames
1.1.2 Basic Indexing and Slicing
1.2 Data Exploration and Manipulation
1.2.1 Data Overview
1.2.2 Handling Missing Data
1.2.3 Sorting
1.2.4 Filtering
1.2.5 Applying Functions
1.3 Data Cleaning
1.3.1 Removing Duplicates
1.3.2 String Manipulation
1.3.3 Changing Data Types
1.4 Data Aggregation and Grouping
1.4.1 Grouping Data
1.4.2 Aggregation Functions
1.4.3 Pivot Tables
1.5 Combining DataFrames
1.5.1 Concatenation
1.5.2 Merging and Joining
1.6 Time Series Data
1.6.1 Resampling
1.6.2 Date/Time Indexing
1.7 Visualization with Pandas
1.7.1 Plotting
1.8 Exporting Data
1.8.1 Saving to CSV
2 NumPy
2.1 Basic Operations and Data Structures
2.1.1 Creating Arrays
2.1.2 Basic Indexing and Slicing
2.2 Mathematical Operations
2.2.1 Element-wise Operations
2.2.2 Matrix Operations
2.3 Statistical Functions
2.3.1 Descriptive Statistics
2.4 Linear Algebra
2.4.1 Eigenvalues and Eigenvectors
2.4.2 Solving Linear Equations
2.5 Random Sampling
2.5.1 Generating Random Numbers
2.6 Saving and Loading Data
2.6.1 Saving Arrays
1 Pandas
1.1 Basic Operations and Data Structures
1.1.1 Creating Series and DataFrames
Explanation: A Series is a one-dimensional labeled array capable of holding any
data type. It can be created from a list, dictionary, or scalar value. A DataFrame is a
two-dimensional labeled data structure whose columns may have different types. It
can be created from a dictionary of lists, a list of dictionaries, or other data structures.
Usage:
import pandas as pd
# Series with an explicit label index
s = pd.Series([1, 3, 5, 7, 9], index=['a', 'b', 'c', 'd', 'e'])
print(s)
# DataFrame from a dictionary of lists
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)
print(df)
# DataFrame from a list of dictionaries
data = [{'A': 1, 'B': 4}, {'A': 2, 'B': 5}, {'A': 3, 'B': 6}]
df = pd.DataFrame(data)
print(df)
1.1.2 Basic Indexing and Slicing
Explanation: Indexing and slicing are used to access specific elements or subsets of
data from Series or DataFrames. This can be done using labels or integer positions.
Usage:
# Series indexing by label
print(s['a'])
# Series indexing by integer position (use .iloc when the index holds labels)
print(s.iloc[0])
# DataFrame column selection by label
print(df['A'])
# DataFrame indexing by integer position
print(df.iloc[0, 1])  # Row 0, Column 1
# DataFrame indexing by label
print(df.loc[0, 'B'])  # Row 0, Column 'B'
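The difference between label-based and position-based access is easiest to see on slices. The following is a minimal sketch; the variable names by_label and by_position are chosen here for illustration only.

```python
import pandas as pd

s = pd.Series([1, 3, 5, 7, 9], index=['a', 'b', 'c', 'd', 'e'])

# Label-based slice with .loc: both endpoints are included
by_label = s.loc['b':'d']

# Position-based slice with .iloc: the stop position is excluded
by_position = s.iloc[1:3]

print(by_label.tolist())     # [3, 5, 7]
print(by_position.tolist())  # [3, 5]
```

Using .loc and .iloc explicitly avoids ambiguity about whether a key is a label or a position.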
1.2 Data Exploration and Manipulation
1.2.1 Data Overview
Explanation: Pandas provides methods to get a quick overview of the dataset, such
as the first few rows, summary of the DataFrame, and descriptive statistics.
Usage:
# First few rows
print(df.head())
# Summary of the DataFrame (info() prints directly and returns None)
df.info()
# Descriptive statistics
print(df.describe())
1.2.2 Handling Missing Data
Explanation: Handling missing data is crucial in data analysis. Pandas provides
methods to detect, remove, or fill missing values.
Usage:
# Detect missing values
print(df.isna())
# Option 1: drop rows that contain missing values (returns a new DataFrame)
df_dropped = df.dropna()
# Option 2: fill missing values with 0 instead (returns a new DataFrame)
df_filled = df.fillna(0)
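As a small worked example of detecting, dropping, and filling, the sketch below builds a DataFrame that actually contains missing values. The frame df_missing and the choice of filling with the column mean are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column
df_missing = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                           'B': [4.0, 5.0, np.nan]})

# Count missing values per column
missing_per_column = df_missing.isna().sum()

# Keep only rows without any missing value
dropped = df_missing.dropna()

# Fill each column's gaps with that column's mean
filled = df_missing.fillna(df_missing.mean())

print(missing_per_column)
print(dropped)
print(filled)
```

Dropping keeps only the first row here, while filling keeps all three rows and replaces the gaps with 2.0 in column A and 4.5 in column B.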
1.2.3 Sorting
Explanation: Sorting organizes the data and makes it easier to analyze. Pandas
allows sorting by values or by index.
Usage:
# Sort by values in column 'A'
df_sorted_by_values = df.sort_values(by='A')
# Sort by index
df_sorted_by_index = df.sort_index()
1.2.4 Filtering
Explanation: Filtering data based on conditions allows extracting subsets of data
that meet specific criteria.
Usage:
# Filter rows where column 'A' is greater than 1
filtered_df = df[df['A'] > 1]
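Conditions can also be combined. The sketch below uses a hypothetical frame df_example; note that each condition needs its own parentheses and that pandas uses & and | rather than the Python keywords and/or.

```python
import pandas as pd

# Hypothetical example data
df_example = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40]})

# Rows where both conditions hold
both = df_example[(df_example['A'] > 1) & (df_example['B'] < 40)]

# Rows where at least one condition holds
either = df_example[(df_example['A'] == 1) | (df_example['B'] == 40)]

# Rows whose 'A' value is in a given set
in_set = df_example[df_example['A'].isin([2, 4])]

print(both['A'].tolist())    # [2, 3]
print(either['A'].tolist())  # [1, 4]
print(in_set['A'].tolist())  # [2, 4]
```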
1.2.5 Applying Functions
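Explanation: Applying functions to data allows transforming data or performing
operations on it. Pandas provides apply (per column or row of a DataFrame) and
map (per element of a Series). DataFrame.applymap applies a function to every
element of a DataFrame; it was renamed DataFrame.map in pandas 2.1.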
Usage:
# Apply a function to each column
df_applied = df.apply(lambda x: x * 2)
# Apply a function element-wise to column 'A'
df['A'] = df['A'].map(lambda x: x * 2)
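To illustrate how apply differs by axis, and how map can also take a dictionary, here is a short sketch. The frame df_example and the result names are hypothetical.

```python
import pandas as pd

# Hypothetical example data
df_example = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# axis=0 (default): the function receives each column as a Series
col_range = df_example.apply(lambda col: col.max() - col.min())

# axis=1: the function receives each row as a Series
row_sum = df_example.apply(lambda row: row['A'] + row['B'], axis=1)

# Series.map also accepts a dictionary that maps old values to new ones
labels = df_example['A'].map({1: 'one', 2: 'two', 3: 'three'})

print(col_range.tolist())  # [2, 2]
print(row_sum.tolist())    # [5, 7, 9]
print(labels.tolist())     # ['one', 'two', 'three']
```

For simple arithmetic such as row['A'] + row['B'], the vectorized expression df_example['A'] + df_example['B'] gives the same result and is usually faster than apply.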
1.3 Data Cleaning
1.3.1 Removing Duplicates
Explanation: Removing duplicate rows from the DataFrame ensures data integrity
and consistency.
Usage:
# Remove duplicate rows
df_no_duplicates = df.drop_duplicates()
1.3.2 String Manipulation
Explanation: Using string methods to manipulate text data is essential for cleaning
and transforming textual data.
Usage:
# Convert strings to lowercase (column 'A' must hold strings)
df['A'] = df['A'].str.lower()
# Replace a substring
df['A'] = df['A'].str.replace('old', 'new')
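String methods can be chained through the .str accessor. The sketch below uses a hypothetical Series of messy text; regex=False is passed so that the pattern is treated as a literal substring.

```python
import pandas as pd

# Hypothetical text data with stray whitespace and mixed case
names = pd.Series(['  Alice ', 'BOB', 'old_value'])

# Strip whitespace, lowercase, then replace a literal substring
cleaned = (names.str.strip()
                .str.lower()
                .str.replace('old', 'new', regex=False))

print(cleaned.tolist())  # ['alice', 'bob', 'new_value']
```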
1.3.3 Changing Data Types
Explanation: Converting data types of DataFrame columns is necessary for ensuring
data types are appropriate for analysis.
Usage:
# Convert column 'A' to float
df['A'] = df['A'].astype('float')
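astype raises an error if a value cannot be converted. When a column may contain unparseable entries, pd.to_numeric with errors='coerce' is one alternative: it turns those entries into NaN. The Series raw below is a hypothetical example.

```python
import pandas as pd

# Hypothetical column read as text, with one unparseable entry
raw = pd.Series(['1', '2.5', 'n/a'])

# Unparseable values become NaN instead of raising an error
numeric = pd.to_numeric(raw, errors='coerce')

print(numeric)
print(numeric.isna().tolist())  # [False, False, True]
```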
1.4 Data Aggregation and Grouping
1.4.1 Grouping Data
Explanation: Grouping data by one or more columns and applying aggregation
functions helps in summarizing and analyzing data.
Usage:
# Group by column 'A' and calculate the sum
grouped_df = df.groupby('A').sum()
1.4.2 Aggregation Functions
Explanation: Aggregation functions can be applied to grouped data to calculate
summary statistics.
Usage:
# Group by column 'A' and calculate sum and mean
aggregated_df = df.groupby('A').agg(['sum', 'mean'])
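Named aggregation is another way to write this; it gives each output column an explicit name. The sketch below assumes a hypothetical sales table with columns region and amount.

```python
import pandas as pd

# Hypothetical sales data
sales = pd.DataFrame({'region': ['east', 'east', 'west', 'west'],
                      'amount': [10, 20, 30, 50]})

# Each keyword names an output column: (input column, aggregation)
summary = sales.groupby('region').agg(
    total=('amount', 'sum'),
    average=('amount', 'mean'),
)

print(summary)
```

The result has one row per region, with total 30 and average 15.0 for east, and total 80 and average 40.0 for west.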
1.4.3 Pivot Tables
Explanation: Creating pivot tables allows summarizing data in a matrix format,
which is useful for data analysis and reporting.
Usage:
# Create a pivot table
pivot_table = df.pivot_table(values='B', index='A', aggfunc='mean')
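The matrix layout becomes clearer when a columns argument is also given. The following sketch uses a hypothetical sales table with columns region, quarter, and amount.

```python
import pandas as pd

# Hypothetical sales data
sales = pd.DataFrame({
    'region': ['east', 'east', 'west', 'west'],
    'quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
    'amount': [10, 20, 30, 50],
})

# Rows are regions, columns are quarters, cells hold summed amounts
pivot = sales.pivot_table(values='amount', index='region',
                          columns='quarter', aggfunc='sum')

print(pivot)
```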
1.5 Combining DataFrames
1.5.1 Concatenation
Explanation: Concatenating DataFrames along rows or columns combines multiple
DataFrames into one.
Usage:
# Concatenate along columns
concatenated_df = pd.concat([df, df], axis=1)
# Concatenate along rows
concatenated_df = pd.concat([df, df], axis=0)
1.5.2 Merging and Joining
Explanation: Merging DataFrames using a key column allows combining data based
on common columns.
Usage:
# Merge DataFrames on column 'A'
merged_df = df.merge(df, on='A')
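Merging is more informative with two different tables. The sketch below uses hypothetical frames left and right that share a key column, and shows how the how parameter changes which rows are kept.

```python
import pandas as pd

# Hypothetical tables sharing the column 'key'
left = pd.DataFrame({'key': [1, 2, 3], 'left_val': ['a', 'b', 'c']})
right = pd.DataFrame({'key': [2, 3, 4], 'right_val': ['x', 'y', 'z']})

# Inner join (default): only keys present in both tables
inner = left.merge(right, on='key')

# Left join: all keys from the left table; unmatched rows get NaN
left_join = left.merge(right, on='key', how='left')

# Outer join: all keys from either table
outer = left.merge(right, on='key', how='outer')

print(inner['key'].tolist())  # [2, 3]
print(left_join)
print(len(outer))             # 4
```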
1.6 Time Series Data
1.6.1 Resampling
Explanation: Resampling time series data involves changing the frequency of the
time series, such as converting daily data to monthly data.
Usage:
# Resample to monthly frequency and calculate the mean.
# The DataFrame must have a DatetimeIndex; pandas 2.2+ prefers the alias 'ME' over 'M'.
resampled_df = df.resample('M').mean()
1.6.2 Date/Time Indexing
Explanation: Indexing data by date/time allows performing time series analysis.
Usage:
# Set column 'date' as the index (the column should hold datetime values)
df.set_index('date', inplace=True)
# Select data for a specific date range (both endpoints are included)
selected_data = df.loc['2023-01-01':'2023-12-31']
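The following self-contained sketch combines date indexing and resampling on generated data. The frame ts and its column value are hypothetical. It uses the alias 'MS' (month start), chosen here so that the example does not depend on which month-end alias the installed pandas version accepts.

```python
import pandas as pd

# Hypothetical daily data: 90 days starting on 2023-01-01
dates = pd.date_range('2023-01-01', periods=90, freq='D')
ts = pd.DataFrame({'value': range(90)}, index=dates)

# Select a date range by label (both endpoints are included)
january = ts.loc['2023-01-01':'2023-01-31']

# Downsample to one row per month, labeled by the month's first day
monthly_mean = ts.resample('MS').mean()

print(len(january))  # 31
print(monthly_mean)
```

The 90 days cover January through March 2023, so the resampled frame has three rows.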
1.7 Visualization with Pandas
1.7.1 Plotting
Explanation: Pandas provides built-in plotting methods for quick data visualization.
Usage:
# Line plot
df.plot(kind='line', x='A', y='B')
# Scatter plot
df.plot(kind='scatter', x='A', y='B')
# Histogram
df['A'].plot(kind='hist')
1.8 Exporting Data
1.8.1 Saving to CSV
Explanation: Saving DataFrame to CSV format allows exporting data for use in other
applications.
Usage:
# Save DataFrame to CSV file
df.to_csv('data.csv', index=False)
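To check what an export contains, the data can be written to an in-memory buffer and read back. This is a minimal sketch; df_out and the use of io.StringIO instead of a file on disk are choices made here for illustration.

```python
import io

import pandas as pd

# Hypothetical data to export
df_out = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Write CSV text into an in-memory buffer instead of a file
buffer = io.StringIO()
df_out.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Read the CSV text back into a DataFrame
df_back = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(df_back.equals(df_out))  # True
```

With index=False the row index is not written, so the first line of the output is the header A,B.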
2 NumPy
2.1 Basic Operations and Data Structures
2.1.1 Creating Arrays
Explanation: NumPy arrays are used for storing and manipulating data efficiently.
Usage:
import numpy as np
# Create a NumPy array from a list
arr = np.array([1, 2, 3, 4, 5])
# Create a 3x3 NumPy array of zeros
zeros_arr = np.zeros((3, 3))
# Create a 2x2 NumPy array of ones
ones_arr = np.ones((2, 2))
2.1.2 Basic Indexing and Slicing
Explanation: Indexing and slicing NumPy arrays allows accessing specific elements
or subsets of data.
Usage:
# Indexing
print(arr[0])  # First element
# Slicing
print(arr[1:4])  # Elements at indices 1 to 3 (the stop index is excluded)
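The same ideas extend to two-dimensional arrays and to boolean masks. The array grid in this sketch is a hypothetical example.

```python
import numpy as np

# Hypothetical 3x3 array
grid = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])

element = grid[1, 2]         # row 1, column 2
first_column = grid[:, 0]    # all rows, column 0
sub_block = grid[:2, 1:]     # rows 0-1, columns 1-2
evens = grid[grid % 2 == 0]  # boolean mask selects matching elements

print(element)       # 6
print(first_column)  # [1 4 7]
print(sub_block)
print(evens)         # [2 4 6 8]
```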
2.2 Mathematical Operations
2.2.1 Element-wise Operations
Explanation: NumPy arrays support element-wise operations, such as addition,
subtraction, multiplication, and division.
Usage:
# Example arrays of equal shape
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
# Element-wise addition
result = arr1 + arr2
# Element-wise multiplication
result = arr1 * arr2
2.2.2 Matrix Operations
Explanation: NumPy supports matrix operations, such as matrix multiplication and
dot product.
Usage:
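matrix1 = np.array([[1, 2], [3, 4]])
matrix2 = np.array([[5, 6], [7, 8]])
vector1 = np.array([1, 2, 3])
vector2 = np.array([4, 5, 6])
# Matrix multiplication
result = np.matmul(matrix1, matrix2)
# Dot product
result = np.dot(vector1, vector2)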
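A common source of confusion is that * on arrays is element-wise, not a matrix product. The sketch below contrasts the two using hypothetical 2x2 matrices A and B; the @ operator is equivalent to np.matmul for these inputs.

```python
import numpy as np

# Hypothetical 2x2 matrices
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# Element-wise product: multiplies matching entries
elementwise = A * B

# Matrix product: rows of A times columns of B
matrix_product = A @ B

print(elementwise)     # [[ 5 12] [21 32]]
print(matrix_product)  # [[19 22] [43 50]]
```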
2.3 Statistical Functions
2.3.1 Descriptive Statistics
Explanation: NumPy provides functions for calculating descriptive statistics, such as
mean, median, standard deviation, and variance.
Usage:
# Mean
mean_value = np.mean(arr)
# Median
median_value = np.median(arr)
# Standard deviation
std_deviation = np.std(arr)
# Variance
variance = np.var(arr)
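Two parameters are worth knowing here: axis, which selects the direction of the reduction, and ddof, which switches between the population and sample formulas (np.std and np.var default to ddof=0, the population version). The arrays in this sketch are hypothetical examples.

```python
import numpy as np

# Hypothetical 2x3 array
data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])

overall_mean = data.mean()        # over all elements
column_means = data.mean(axis=0)  # one value per column
row_means = data.mean(axis=1)     # one value per row

values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
population_std = np.std(values)      # divides by n
sample_std = np.std(values, ddof=1)  # divides by n - 1

print(overall_mean)    # 3.5
print(column_means)    # [2.5 3.5 4.5]
print(row_means)       # [2. 5.]
print(population_std)  # 2.0
print(sample_std)
```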
2.4 Linear Algebra
2.4.1 Eigenvalues and Eigenvectors
Explanation: NumPy allows computing eigenvalues and eigenvectors of a square
matrix.
Usage:
# Compute eigenvalues and eigenvectors of a square matrix
matrix = np.array([[2, 0], [0, 3]])
eigenvalues, eigenvectors = np.linalg.eig(matrix)
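The result can be checked against the defining relation M v = λ v. In the sketch below, M is a hypothetical symmetric matrix; each column of eigenvectors pairs with the eigenvalue at the same position.

```python
import numpy as np

# Hypothetical symmetric 2x2 matrix (its eigenvalues are 1 and 3)
M = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(M)

# Column i of eigenvectors belongs to eigenvalues[i]
checks = []
for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]
    checks.append(np.allclose(M @ v, eigenvalues[i] * v))

print(eigenvalues)
print(checks)  # [True, True]
```

The order in which np.linalg.eig returns the eigenvalues is not guaranteed to be sorted, so sort them explicitly if the order matters.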
2.4.2 Solving Linear Equations
Explanation: NumPy provides functions for solving systems of linear equations.
Usage:
# Solve the linear system coeff_matrix @ x = const_vector
coeff_matrix = np.array([[3, 1], [1, 2]])
const_vector = np.array([9, 8])
solution = np.linalg.solve(coeff_matrix, const_vector)
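As a worked example, consider the system 3x + y = 9 and x + 2y = 8, whose solution is x = 2, y = 3. The names A, b, and x in this sketch are chosen for illustration; the final check substitutes the solution back into the system.

```python
import numpy as np

# Coefficients and right-hand side of:
#   3x +  y = 9
#    x + 2y = 8
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

x = np.linalg.solve(A, b)

# Verify by substituting the solution back into the system
residual_ok = np.allclose(A @ x, b)

print(x)            # [2. 3.]
print(residual_ok)  # True
```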
2.5 Random Sampling
2.5.1 Generating Random Numbers
Explanation: NumPy allows generating arrays of random numbers from various
probability distributions.
Usage:
# Generate 5 random numbers from the uniform distribution on [0, 1)
random_numbers = np.random.rand(5)
# Generate 5 random integers from 1 to 99 (the upper bound 100 is excluded)
random_integers = np.random.randint(1, 100, size=5)
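NumPy also offers the Generator interface, created with np.random.default_rng, which the NumPy documentation recommends for new code. The sketch below is one way to get reproducible draws; the seed value 42 is an arbitrary choice.

```python
import numpy as np

# A seeded generator makes the random draws reproducible
rng = np.random.default_rng(seed=42)

uniform = rng.random(5)                          # uniform on [0, 1)
integers = rng.integers(1, 100, size=5)          # 1 to 99 (100 excluded)
normal = rng.normal(loc=0.0, scale=1.0, size=5)  # standard normal

# A second generator with the same seed repeats the same sequence
rng_again = np.random.default_rng(seed=42)
same = np.array_equal(uniform, rng_again.random(5))

print(uniform)
print(integers)
print(normal)
print(same)  # True
```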
2.6 Saving and Loading Data
2.6.1 Saving Arrays
Explanation: NumPy arrays can be saved to and loaded from binary files.
Usage:
# Save array to binary file
np.save('array.npy', arr)
# Load array from binary file
loaded_arr = np.load('array.npy')
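Several arrays can be stored together in one file with np.savez and retrieved by name. The sketch below writes to a temporary directory so that it leaves no files behind; the file name arrays.npz and the keys numbers and squares are hypothetical.

```python
import os
import tempfile

import numpy as np

arr = np.array([1, 2, 3, 4, 5])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'arrays.npz')

    # Save two arrays under the names 'numbers' and 'squares'
    np.savez(path, numbers=arr, squares=arr ** 2)

    # Load the archive and read each array back by name
    with np.load(path) as archive:
        numbers = archive['numbers']
        squares = archive['squares']

print(numbers)  # [1 2 3 4 5]
print(squares)  # [ 1  4  9 16 25]
```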