Pandas Guide for Data Science
Nayeem Islam
What will we Learn today?
Introduction to Pandas
1.1 What is Pandas?
1.2 Why Use Pandas in Data Science?
1.3 Core Data Structures: Series and DataFrame
1.4 What You'll Learn in This Tutorial
Getting Started with Pandas
2.1 Installation and Setup
2.2 Importing Pandas
2.3 Setting Up Your Development Environment
2.4 Creating Your First Pandas Program
Core Data Structures
3.1 Introduction to Series
3.2 Operations on Series
3.3 Introduction to DataFrame
3.4 Accessing Data in a DataFrame
Data Manipulation with Pandas
4.1 Loading Data into Pandas
4.2 Exploring Data
4.3 Data Cleaning
4.4 Data Transformation
Data Aggregation and Grouping
5.1 Grouping Data
5.2 Applying Multiple Aggregate Functions
5.3 Pivot Tables
5.4 Cross Tabulations
Working with Time Series Data
6.1 Date and Time in Pandas
6.2 Resampling and Frequency Conversion
6.3 Time Series Analysis
6.4 Handling Time Zones
Advanced Data Operations
7.1 Merging and Joining DataFrames
7.2 Concatenating DataFrames
7.3 Advanced Transformations
Visualization with Pandas
8.1 Introduction to Data Visualization
8.2 Creating Basic Plots
8.3 Advanced Plotting
8.4 Plotting Directly with DataFrames
8.5 Saving Plots
Conclusion
9.1 Recap of What You've Learned
9.2 Practical Applications of Pandas
9.3 Next Steps
9.4 Additional Resources
Introduction to Pandas
1.1 What is Pandas?
Pandas is a powerful open-source Python library used primarily for data manipulation and
analysis. Built on top of the NumPy library, Pandas provides data structures and functions needed
to work with structured data seamlessly, making it a crucial tool in a data scientist’s arsenal.
Pandas is designed to make data manipulation intuitive and straightforward, enabling tasks like
cleaning, transforming, aggregating, and analyzing data. Whether you are working with datasets
from CSV files, Excel sheets, SQL databases, or JSON files, Pandas simplifies the process of
loading, manipulating, and preparing data for analysis.
● Data Structures: Offers two primary data structures: Series (for one-dimensional data) and
DataFrame (for two-dimensional data).
● Data Handling: Easily handles missing data, merges datasets, and reshapes data.
● Integration: Works seamlessly with other Python libraries such as Matplotlib for data
visualization and SciPy for scientific computing.
1.2 Why Use Pandas in Data Science?
Data manipulation is a critical step in the data science pipeline. Before data can be analyzed or
modeled, it often needs to be cleaned, transformed, and aggregated. Pandas excels at these
tasks, allowing you to:
● Handle Large Datasets: Process sizable in-memory datasets efficiently through vectorized
operations built on NumPy.
● Perform Complex Operations: Execute complex data transformations and operations with
just a few lines of code.
● Maintain Readability: Write clear, readable code that is easy to understand and maintain.
Whether you're performing simple tasks like filtering rows based on conditions or complex tasks
like merging multiple datasets, Pandas provides a consistent and powerful API.
1.3 Core Data Structures: Series and DataFrame
● Series: A Pandas Series is a one-dimensional labeled array capable of holding any data
type (integers, strings, floats, etc.). Think of it as a column in a spreadsheet or a single list
of values. Each value in a Series is associated with a label (often an integer index),
allowing for quick access to elements.
Code Example:
import pandas as pd
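A minimal sketch of the Series described in the explanation that follows. The variable name `series` is an assumption; the five values and the name "Sample Series" match the output shown in this section.

```python
import pandas as pd

# Five integers; pandas assigns the default index 0..4
series = pd.Series([10, 20, 30, 40, 50], name="Sample Series")
print(series)
```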
Explanation:
● This example creates a Series named "Sample Series" containing five integer values. The
Series object automatically assigns an index to each value, making it easy to reference
specific elements.
Output:
0 10
1 20
2 30
3 40
4 50
Name: Sample Series, dtype: int64
● DataFrame: A Pandas DataFrame is a two-dimensional labeled table whose columns can hold
different data types. Think of it as a whole spreadsheet or a SQL table.
Code Example:
data_frame = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [24, 27, 22, 32],
'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
})
print(data_frame)
Explanation:
● This code snippet creates a DataFrame with three columns: 'Name', 'Age', and 'City'. Each
column contains data relevant to an individual, organized in rows.
Output:
      Name  Age         City
0    Alice   24     New York
1      Bob   27  Los Angeles
2  Charlie   22      Chicago
3    David   32      Houston
1.4 What You'll Learn in This Tutorial
This tutorial will guide you through the essential features of Pandas, starting from the basics of
data structures like Series and DataFrame to more advanced operations such as data
aggregation, grouping, and time series analysis. You will also learn how to visualize data directly
with Pandas, making it a versatile tool in your data science toolkit.
This foundational knowledge will empower you to tackle real-world data science problems with
confidence.
Getting Started with Pandas
2.1 Installation and Setup
Before you can start using Pandas, you need to install it. Pandas requires Python 3 (releases from
2.1 onward need Python 3.9 or newer), and it can be installed using Python’s package manager, pip,
or through Anaconda, which is a popular distribution for data science.
Option 1: Installing with pip
● The simplest way to install Pandas is using pip, which comes pre-installed with Python.
Run the following command in your terminal or command prompt:
Command:
pip install pandas
Explanation:
● This command will install Pandas and its dependencies. If you don’t have pip installed,
you can install it by following the instructions from the official Python documentation.
Option 2: Installing with Anaconda
● Anaconda is a popular choice for data science projects because it simplifies the
installation of packages and manages environments. To install Pandas with Anaconda, use
the following command:
Command:
conda install pandas
Explanation:
● This command will install Pandas along with all its dependencies through the Conda
package manager, which is included with Anaconda. This method is particularly useful if
you are working in a data science environment that requires other packages like NumPy,
Matplotlib, or SciPy.
2.2 Importing Pandas
Once Pandas is installed, the first step in any data science workflow using Pandas is to import the
library. It’s a common practice to import Pandas with the alias pd, which is short and easy to type.
Code Example:
import pandas as pd
Explanation:
● pd is a commonly used alias for Pandas, making it quicker to reference in your code. This
convention is widely adopted in the Python community, so it’s recommended to follow it
to maintain consistency and readability in your projects.
2.3 Setting Up Your Development Environment
For the best experience when working with Pandas, you might want to set up an interactive
development environment. Some popular choices include:
● Jupyter Notebook: Ideal for data analysis as it allows you to write and execute code in
cells, view data frames, and include visualizations directly in the notebook.
Setup:
○ To install Jupyter Notebook, you can use pip:
pip install notebook
○ Then start it from your terminal:
jupyter notebook
● VS Code: A powerful text editor that supports Python development with extensions like
Python and Jupyter for a more integrated experience.
Setup:
○ Install the Python extension for VS Code.
○ If you prefer, you can also install the Jupyter extension to run notebooks directly in
VS Code.
● PyCharm: An IDE that offers advanced code editing, debugging, and profiling tools for
Python development.
Setup:
○ Install the Pandas plugin for better data frame support in PyCharm.
These tools provide a rich environment for exploring data and developing data science projects.
Choose the one that best fits your workflow.
2.4 Creating Your First Pandas Program
Now that you have Pandas installed and your environment set up, let’s create a simple Pandas
program. We will create a DataFrame from scratch and display it.
Code Example:
import pandas as pd
# Creating a DataFrame
data = {
'Product': ['Laptop', 'Tablet', 'Smartphone'],
'Price': [1000, 700, 500],
'Quantity': [10, 20, 15]
}
df = pd.DataFrame(data)
print(df)
Explanation:
● This example creates a dictionary containing product data and then converts it into a
Pandas DataFrame.
● The DataFrame is a fundamental structure in Pandas, allowing you to organize and
manipulate your data in a tabular format.
Output:
      Product  Price  Quantity
0      Laptop   1000        10
1      Tablet    700        20
2  Smartphone    500        15
This simple script demonstrates how to create a DataFrame, a core component of data
manipulation in Pandas.
Core Data Structures
3.1 Introduction to Series
A Pandas Series is a one-dimensional array-like object that can hold data of any type (integers,
strings, floats, etc.). It is similar to a column in a spreadsheet or a SQL table. Each element in a
Series is associated with an index, allowing for quick and easy data retrieval.
● Homogeneous Data: A Series stores its values under a single data type (dtype). Mixed values are
allowed, but they are then stored with the generic object dtype.
● Index: Each data point in a Series is associated with an index, which can be either
automatically generated or manually assigned.
● Labeling: Series can be labeled, meaning each data point can have a unique identifier.
Code Example:
import pandas as pd
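A minimal sketch of the Series this section describes. The variable name `temperatures` is the one used in the explanation and in section 3.2; the values and the name "Temperature" match the output shown here.

```python
import pandas as pd

# Five temperature readings; name= labels the Series
temperatures = pd.Series([72, 85, 90, 78, 65], name="Temperature")
print(temperatures)
```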
Explanation:
● The above example creates a Series named temperatures with five integer values
representing temperature readings. The name parameter assigns a label to the Series,
which is useful when printing or analyzing the Series.
Output:
0 72
1 85
2 90
3 78
4 65
Name: Temperature, dtype: int64
3.2 Operations on Series
You can perform various operations on Series, such as arithmetic operations, filtering, and
applying functions.
Code Example:
# Arithmetic operations
temperatures_celsius = (temperatures - 32) * 5.0/9.0
print(temperatures_celsius)
# Filtering
high_temps = temperatures[temperatures > 80]
print(high_temps)
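Explanation:
● Arithmetic on a Series is applied element-wise, so a single expression converts every reading
from Fahrenheit to Celsius.
● The comparison temperatures > 80 produces a boolean mask, and indexing with it keeps only the
readings above 80.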
Output:
0 22.222222
1 29.444444
2 32.222222
3 25.555556
4 18.333333
Name: Temperature, dtype: float64
1 85
2 90
Name: Temperature, dtype: int64
3.3 Introduction to DataFrame
A Pandas DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular
data structure with labeled axes (rows and columns). It is one of the most commonly used data
structures in data science for organizing and analyzing data.
● Heterogeneous Data: Unlike a Series, a DataFrame can contain columns of different data
types (e.g., integers, floats, strings).
● Labeled Axes: Rows and columns in a DataFrame can be labeled, allowing for easy
access and manipulation of data.
● Size-Mutable: The size of a DataFrame can be changed by adding or removing rows or
columns.
Code Example:
# Creating a DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)
Explanation:
● This code snippet creates a DataFrame with three columns: Name, Age, and City. Each
column contains data relevant to an individual, organized in rows.
Output:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
3.4 Accessing Data in a DataFrame
Accessing data in a DataFrame is simple, thanks to Pandas’ intuitive indexing and selection
capabilities.
Code Example:
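A minimal sketch of the three access patterns listed in the explanation, assuming the df from section 3.3 (recreated here so the snippet runs on its own). The names `names`, `name_city`, and `first_row` are assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
})

# Single column: returns a Series
names = df['Name']
print(names)

# Multiple columns: pass a list of names, returns a DataFrame
name_city = df[['Name', 'City']]
print(name_city)

# One row by integer position: returns a Series
first_row = df.iloc[0]
print(first_row)
```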
Explanation:
● Single Column: Access a single column using the column name in square brackets.
● Multiple Columns: Access multiple columns by passing a list of column names.
● Rows: Use .iloc[] to access rows by their index position or .loc[] to access rows by
their label (if the DataFrame has a custom index).
Output:
0 Alice
1 Bob
2 Charlie
Name: Name, dtype: object
Name City
0 Alice New York
1 Bob Los Angeles
2 Charlie Chicago
Name Alice
Age 25
City New York
Name: 0, dtype: object
Data Manipulation with Pandas
4.1 Loading Data into Pandas
Data in the real world comes in various formats, such as CSV, Excel, JSON, or SQL databases.
Pandas makes it incredibly easy to load data from these formats into a DataFrame, which is the
core data structure used for data manipulation.
● CSV (Comma-Separated Values) is one of the most common data formats used in data
science.
Code Example:
import pandas as pd
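A minimal sketch of loading a CSV file with pd.read_csv() and previewing it with head(). The file name data.csv and its contents are hypothetical; the snippet writes a small file first so that it runs on its own.

```python
import os
import tempfile

import pandas as pd

# Write a small CSV so the example is runnable (hypothetical data)
csv_path = os.path.join(tempfile.mkdtemp(), 'data.csv')
with open(csv_path, 'w') as f:
    f.write('Column1,Column2,Column3\n1,0.5,a\n2,0.7,b\n3,0.9,c\n')

# Load the file into a DataFrame and preview up to the first five rows
df = pd.read_csv(csv_path)
print(df.head())
```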
Explanation:
● The pd.read_csv() function is used to load data from a CSV file into a DataFrame. The
head() method then prints the first five rows of the DataFrame, giving you a quick
preview of the data.
Output:
3 ... ... ...
4 ... ... ...
● Excel files are widely used in business and academia. Pandas can read data from Excel
spreadsheets using the read_excel() function.
Code Example:
● Explanation:
○ The pd.read_excel() function loads data from an Excel file. You can specify
the sheet name using the sheet_name parameter if your workbook contains
multiple sheets.
● JSON (JavaScript Object Notation) is a lightweight format for storing and transporting
data, often used in web applications.
Code Example:
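A minimal sketch of pd.read_json(). The JSON text and the name `json_text` are hypothetical; wrapping the text in StringIO stands in for a file, and passing a file path works the same way.

```python
from io import StringIO

import pandas as pd

# A JSON array of records (hypothetical data)
json_text = '[{"Name": "Alice", "Age": 25}, {"Name": "Bob", "Age": 30}]'

# Each record becomes a row, each key a column
df = pd.read_json(StringIO(json_text))
print(df)
```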
● Explanation:
○ The pd.read_json() function reads data from a JSON file and converts it into a
DataFrame.
4.2 Exploring Data
Once your data is loaded into a DataFrame, the next step is to explore and understand the
structure of the dataset. Pandas provides several methods for this purpose.
● The info() method gives you a concise summary of the DataFrame, including the
number of entries, column names, data types, and memory usage.
Code Example:
df.info()  # info() prints the summary itself and returns None, so no print() is needed
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Column1 100 non-null int64
1 Column2 100 non-null float64
2 Column3 100 non-null object
dtypes: float64(1), int64(1), object(1)
memory usage: 2.5 KB
2. Statistical Summary
● The describe() method generates descriptive statistics that summarize the central
tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values.
Code Example:
print(df.describe())
Output:
Column1 Column2
count 100.000000 100.000000
mean 50.500000 0.478522
std 29.011492 0.295722
min 1.000000 0.001000
25% 25.750000 0.250500
50% 50.500000 0.483000
75% 75.250000 0.707250
max 100.000000 0.995000
● Explanation:
○ describe() provides a quick statistical summary of numeric columns, helping
you to understand the distribution of your data.
● You can view specific parts of your DataFrame by using indexing and slicing.
Code Example:
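A minimal sketch of the two selections named in the explanation. The sample values and the names `column` and `first_five` are assumptions; the column names follow this section.

```python
import pandas as pd

# Hypothetical data using the column names from this section
df = pd.DataFrame({
    'Column1': range(1, 11),
    'Column3': list('abcdefghij')
})

column = df['Column1']      # one column, as a Series
first_five = df.iloc[0:5]   # rows at positions 0 through 4
print(column)
print(first_five)
```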
● Explanation:
○ df['Column1'] accesses a specific column by name.
○ df.iloc[0:5] slices the DataFrame to view the first five rows.
4.3 Data Cleaning
Data cleaning is one of the most critical steps in data preprocessing. Real-world data often
contains missing values, duplicates, or incorrect data types that need to be addressed.
● Missing data can be handled by either removing rows/columns or filling them with specific
values.
Code Example:
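A minimal sketch of the two approaches mentioned above, removing rows with dropna() or filling gaps with fillna(). The sample data, the fill values, and the names `df_dropped` and `df_filled` are assumptions.

```python
import pandas as pd

# None in a numeric column becomes NaN (hypothetical data)
df = pd.DataFrame({
    'Column1': [1.0, None, 3.0],
    'Column2': ['a', 'b', None]
})

# Option 1: drop every row that contains at least one missing value
df_dropped = df.dropna()

# Option 2: fill missing values with a per-column default
df_filled = df.fillna({'Column1': 0, 'Column2': 'unknown'})

print(df_dropped)
print(df_filled)
```

Dropping is the simpler choice when few rows are affected; filling keeps every row at the cost of introducing placeholder values.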
Explanation:
2. Removing Duplicates
df_unique = df.drop_duplicates()
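Explanation:
● drop_duplicates() returns a new DataFrame without duplicate rows, keeping the first occurrence
of each by default.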
● Sometimes, data might be stored in the wrong format (e.g., numbers stored as strings).
Pandas allows for easy conversion of data types.
Code Example:
df['Column1'] = df['Column1'].astype(int)
Explanation:
● astype(int) converts Column1 to integers and raises an error if a value cannot be converted.
4.4 Data Transformation
Transforming data is often necessary for analysis, and Pandas offers several ways to reshape,
aggregate, and manipulate your data.
1. Filtering Data
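A minimal sketch of the filter described in the explanation that follows. The sample values and the name `filtered_df` are assumptions.

```python
import pandas as pd

df = pd.DataFrame({'Column1': [10, 60, 45, 90]})  # hypothetical values

# Boolean mask: True where Column1 is greater than 50
filtered_df = df[df['Column1'] > 50]
print(filtered_df)
```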
Explanation:
● This filters the DataFrame to only include rows where Column1 is greater than 50.
2. Sorting Data
● Sorting data can help you arrange your DataFrame based on the values of a specific
column.
Code Example:
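A minimal sketch of sorting with sort_values(). The sample values and the name `sorted_df` are assumptions.

```python
import pandas as pd

df = pd.DataFrame({'Column1': [30, 10, 20]})  # hypothetical values

# Sort rows by Column1; ascending=False would reverse the order
sorted_df = df.sort_values(by='Column1', ascending=True)
print(sorted_df)
```

The original index labels travel with their rows, so the sorted frame's index reads 1, 2, 0 here.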
Explanation:
3. Applying Functions
● You can apply custom functions to each element in a Series or DataFrame using the
apply() method.
Code Example:
df['Column3_length'] = df['Column3'].apply(len)
Explanation:
● This applies the len function to each element in Column3 to create a new column that
contains the length of each string.
Data Aggregation and Grouping
5.1 Grouping Data
One of the most powerful features of Pandas is the ability to group data and perform aggregate
operations on these groups. This is particularly useful when you need to analyze data by
categories or calculate summary statistics for different subsets of your dataset.
● The groupby() method is used to group data based on the values of one or more
columns. Once grouped, you can apply aggregate functions such as sum(), mean(),
count(), etc.
Code Example:
import pandas as pd
# Sample DataFrame
data = {
'Department': ['HR', 'Finance', 'IT', 'HR', 'Finance', 'IT'],
'Employee': ['John', 'Sarah', 'Mike', 'Anna', 'Tom', 'Sam'],
'Salary': [50000, 60000, 70000, 48000, 61000, 72000]
}
df = pd.DataFrame(data)
# Grouping by 'Department'
department_group = df.groupby('Department')['Salary'].mean()
print(department_group)
Explanation:
● This code groups the DataFrame by the Department column and then calculates the
average salary for each department using the mean() function.
Output:
Department
Finance 60500.0
HR 49000.0
IT 71000.0
Name: Salary, dtype: float64
● You can also group by multiple columns to perform more complex groupings and
aggregations.
Code Example:
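A minimal sketch of grouping by two columns, assuming the employee df from the previous example (recreated here). Using sum() is an assumption that is consistent with the int64 output shown below; `multi_group` is a hypothetical name.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'Finance', 'IT', 'HR', 'Finance', 'IT'],
    'Employee': ['John', 'Sarah', 'Mike', 'Anna', 'Tom', 'Sam'],
    'Salary': [50000, 60000, 70000, 48000, 61000, 72000]
})

# Grouping by 'Department' and 'Employee' yields a two-level index
multi_group = df.groupby(['Department', 'Employee'])['Salary'].sum()
print(multi_group)
```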
Explanation:
● This groups the DataFrame by both Department and Employee, allowing you to
perform aggregations at a more granular level.
Output:
Department Employee
Finance Sarah 60000
Tom 61000
HR Anna 48000
John 50000
IT Mike 70000
Sam 72000
Name: Salary, dtype: int64
5.2 Applying Multiple Aggregate Functions
Pandas allows you to apply multiple aggregate functions at once, which can be particularly useful
for generating comprehensive summaries of your data.
Code Example:
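A minimal sketch of agg() with the three functions named in the explanation, assuming the employee df from section 5.1 (recreated here). The name `summary` is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'Finance', 'IT', 'HR', 'Finance', 'IT'],
    'Employee': ['John', 'Sarah', 'Mike', 'Anna', 'Tom', 'Sam'],
    'Salary': [50000, 60000, 70000, 48000, 61000, 72000]
})

# One row per department, one column per aggregate function
summary = df.groupby('Department')['Salary'].agg(['mean', 'sum', 'count'])
print(summary)
```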
Explanation:
● In this example, the agg() method is used to apply multiple aggregate functions (mean,
sum, and count) to the Salary column for each department. This approach provides a
quick overview of key statistics for each group.
Output:
5.3 Pivot Tables
A pivot table is a powerful tool in data analysis that allows you to reorganize and summarize data.
With Pandas, creating pivot tables is straightforward using the pivot_table() method.
Code Example:
# Creating a pivot table
pivot = df.pivot_table(values='Salary', index='Department',
columns='Employee', aggfunc='sum')
print(pivot)
Explanation:
● The pivot_table() function is used to create a pivot table, where values specify the
data to aggregate, index specifies the rows, columns specifies the columns, and
aggfunc determines the aggregation function to apply.
Output:
5.4 Cross Tabulations
Cross tabulation (or crosstab) is another way to aggregate data. It is particularly useful for
categorical data, allowing you to see the frequency distribution of different combinations of
categories.
Code Example:
# Creating a cross-tabulation
crosstab = pd.crosstab(df['Department'], df['Employee'])
print(crosstab)
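Explanation:
● pd.crosstab() counts how often each combination of Department and Employee occurs. Each
employee belongs to one department, so every cell is 0 or 1.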
Output:
Employee Anna John Mike Sam Sarah Tom
Department
Finance 0 0 0 0 1 1
HR 1 1 0 0 0 0
IT 0 0 1 1 0 0
Working with Time Series Data
6.1 Date and Time in Pandas
Handling date and time data is a common task in data science, especially when working with time
series data. Pandas provides powerful tools to work with dates, times, and time-indexed data.
1. Converting to DateTime
● One of the first steps when working with time series data is converting your data to
Pandas' datetime format, which enables time-based operations and analysis.
Code Example:
import pandas as pd
# Sample data
data = {
'Date': ['2024-01-01', '2024-01-02', '2024-01-03'],
'Value': [100, 200, 300]
}
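df = pd.DataFrame(data)
# Converting the 'Date' column from strings to datetime
df['Date'] = pd.to_datetime(df['Date'])
print(df)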
Explanation:
● The pd.to_datetime() function converts a column of strings representing dates into
Pandas' datetime objects, which allows for more advanced time series operations.
Output:
Date Value
0 2024-01-01 100
1 2024-01-02 200
2 2024-01-03 300
● A common practice in time series analysis is to set the date or time column as the index of
the DataFrame. This enables easy slicing and manipulation based on time periods.
Code Example:
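A minimal sketch of setting the Date column as the index, assuming the three-row df from the previous example (recreated here). The final slice is an added illustration of what a time index allows.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03']),
    'Value': [100, 200, 300]
})

# Use the Date column as the index
df = df.set_index('Date')
print(df)

# Label-based slicing now works with date strings (both ends inclusive)
print(df.loc['2024-01-02':'2024-01-03'])
```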
Explanation:
● By setting the Date column as the index, the DataFrame is now time-indexed, which
makes it easier to access and analyze data over specific time intervals.
Output:
Value
Date
2024-01-01 100
2024-01-02 200
2024-01-03 300
6.2 Resampling and Frequency Conversion
Resampling involves changing the frequency of your time series data, such as converting daily
data to monthly data. Pandas makes this process straightforward with the resample() method.
1. Downsampling
● Downsampling reduces the frequency of the data, such as aggregating daily data to
monthly data.
Code Example:
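A minimal sketch of downsampling daily data to month-end totals, assuming the Date-indexed df from section 6.1 (rebuilt here). The string alias for month end is 'M' in older Pandas releases and 'ME' from Pandas 2.2 onward, so this sketch passes an offset object instead. The name `monthly_df` is an assumption.

```python
import pandas as pd

df = pd.DataFrame(
    {'Value': [100, 200, 300]},
    index=pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03'])
)
df.index.name = 'Date'

# Group rows into month-end bins and sum the values in each bin
monthly_df = df.resample(pd.offsets.MonthEnd()).sum()
print(monthly_df)
```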
Explanation:
Output:
Value
Date
2024-01-31 600
2. Upsampling
● Upsampling increases the frequency of your data, which often requires filling in missing
values.
Code Example:
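A minimal sketch of upsampling with forward fill, assuming the Date-indexed df used in this chapter (rebuilt here). Because the sample data is already daily, the result has the same three rows; with gaps between dates, ffill() would copy the last known value into the new rows. The name `daily_df` is an assumption.

```python
import pandas as pd

df = pd.DataFrame(
    {'Value': [100, 200, 300]},
    index=pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03'])
)
df.index.name = 'Date'

# Resample to daily frequency and forward-fill any newly created rows
daily_df = df.resample('D').ffill()
print(daily_df)
```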
Explanation:
● The resample('D') method is used to upsample the data to a daily frequency, and
ffill() (forward fill) fills in missing values by propagating the last valid observation
forward.
Output:
Value
Date
2024-01-01 100
2024-01-02 200
2024-01-03 300
6.3 Time Series Analysis
Pandas also supports a variety of time series-specific operations, such as calculating moving
averages, rolling windows, and shifting data.
● A moving average smooths time series data by averaging over a specified window of
time. This can help identify trends by reducing short-term fluctuations.
Code Example:
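A minimal sketch of a two-day moving average, assuming the Date-indexed df used in this chapter (rebuilt here). The column name '2-day MA' comes from the output shown below.

```python
import pandas as pd

df = pd.DataFrame(
    {'Value': [100, 200, 300]},
    index=pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03'])
)
df.index.name = 'Date'

# Average each value with the one before it; the first row has no
# predecessor, so its result is NaN
df['2-day MA'] = df['Value'].rolling(window=2).mean()
print(df)
```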
Explanation:
Output:
Value 2-day MA
Date
2024-01-01 100 NaN
2024-01-02 200 150.0
2024-01-03 300 250.0
2. Shifting Data
● Shifting data is useful for comparing time periods, such as calculating the difference
between today’s value and yesterday’s value.
Code Example:
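A minimal sketch of shift(1) and a day-over-day difference, assuming the Date-indexed df used in this chapter (rebuilt here). The column names 'Previous Day' and 'Difference' are assumptions.

```python
import pandas as pd

df = pd.DataFrame(
    {'Value': [100, 200, 300]},
    index=pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03'])
)
df.index.name = 'Date'

# Move every value down one row so each date sees the previous day's value
df['Previous Day'] = df['Value'].shift(1)

# Change from the previous day; NaN for the first row
df['Difference'] = df['Value'] - df['Previous Day']
print(df)
```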
Explanation:
● The shift(1) method shifts the data by one time period (in this case, one day), which is
then used to calculate the difference from the previous day.
Output:
6.4 Handling Time Zones
Pandas provides robust support for handling time zones, which is crucial when working with time
series data from different regions.
1. Converting Time Zones
● You can convert a time series from one time zone to another using the tz_convert()
method.
Code Example:
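A minimal sketch of the two calls named in the explanation, assuming the Date-indexed df used in this chapter (rebuilt here). The names `df_utc` and `df_eastern` are assumptions.

```python
import pandas as pd

df = pd.DataFrame(
    {'Value': [100, 200, 300]},
    index=pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03'])
)
df.index.name = 'Date'

# Attach UTC to the naive index, then convert to US Eastern time
df_utc = df.tz_localize('UTC')
df_eastern = df_utc.tz_convert('US/Eastern')
print(df_eastern)
```

Midnight UTC on 2024-01-01 falls on the evening of 2023-12-31 in US Eastern time, so the index labels shift while the values stay the same.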
● Explanation:
○ tz_localize('UTC') sets the time zone of the DateTime index to UTC, and
tz_convert('US/Eastern') converts it to US Eastern time.
● If your time series data does not have time zone information, you can localize it.
Code Example:
# Localizing to UTC
df = df.tz_localize('UTC')
print(df)
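Explanation:
● tz_localize('UTC') attaches the UTC time zone to a time-zone-naive DatetimeIndex without
shifting the clock times.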
Advanced Data Operations
7.1 Merging and Joining DataFrames
In many real-world scenarios, data comes from multiple sources and needs to be combined.
Pandas provides powerful tools for merging and joining DataFrames, similar to SQL operations.
1. Merging DataFrames
● The merge() function in Pandas works similarly to SQL joins, allowing you to merge two
DataFrames based on a key or set of keys.
Code Example:
import pandas as pd
# Sample DataFrames
df1 = pd.DataFrame({
'EmployeeID': [1, 2, 3, 4],
'Name': ['Alice', 'Bob', 'Charlie', 'David']
})
df2 = pd.DataFrame({
'EmployeeID': [3, 4, 5, 6],
'Department': ['HR', 'Finance', 'IT', 'Marketing']
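})
# Merging the two DataFrames on 'EmployeeID' with an inner join
merged_df = pd.merge(df1, df2, on='EmployeeID', how='inner')
print(merged_df)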
Explanation:
● The pd.merge() function merges df1 and df2 on the EmployeeID column. The
how='inner' parameter specifies an inner join, meaning only rows with matching
EmployeeID values in both DataFrames will be included in the result.
Output:
2. Types of Joins
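A minimal sketch of the left join described in the explanation that follows, assuming df1 and df2 from the merge example (recreated here). The name `left_df` is an assumption.

```python
import pandas as pd

df1 = pd.DataFrame({
    'EmployeeID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})
df2 = pd.DataFrame({
    'EmployeeID': [3, 4, 5, 6],
    'Department': ['HR', 'Finance', 'IT', 'Marketing']
})

# Left join: keep every row of df1, fill Department where a match exists
left_df = pd.merge(df1, df2, on='EmployeeID', how='left')
print(left_df)
```

Other values of how, such as 'right' and 'outer', keep all rows from df2 or from both frames respectively.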
Explanation:
● In a left join, all rows from df1 are retained, and where df2 has a matching EmployeeID,
the Department column is filled. Where there is no match, NaN is used.
Output:
7.2 Concatenating DataFrames
● You can concatenate DataFrames row-wise using the concat() function, effectively
stacking them on top of each other.
Code Example:
# Sample DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']})
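A sketch of the remaining steps, with df1 repeated so the snippet runs on its own. The contents of df2 are inferred from the output shown below, and the name `concatenated_df` follows the column-wise example later in this section.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2']})
df2 = pd.DataFrame({'A': ['A3', 'A4', 'A5'],
                    'B': ['B3', 'B4', 'B5']})

# Concatenating DataFrames row-wise (axis=0 is the default)
concatenated_df = pd.concat([df1, df2], axis=0)
print(concatenated_df)
```

Each frame keeps its own index labels, which is why 0, 1, 2 appear twice; passing ignore_index=True would renumber the rows from 0 to 5.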
Explanation:
● The pd.concat() function is used to concatenate df1 and df2 along rows (axis=0),
effectively stacking df2 below df1.
Output:
A B
0 A0 B0
1 A1 B1
2 A2 B2
0 A3 B3
1 A4 B4
2 A5 B5
# Concatenating DataFrames column-wise
concatenated_df = pd.concat([df1, df2], axis=1)
print(concatenated_df)
Explanation:
● With axis=1, pd.concat() places df2 to the right of df1, aligning rows on their index labels.
Output:
A B A B
0 A0 B0 A3 B3
1 A1 B1 A4 B4
2 A2 B2 A5 B5
7.3 Advanced Transformations
Pandas allows for complex transformations that involve reshaping your data, such as pivoting and
melting.
1. Pivoting DataFrames
● Pivoting is the process of reshaping data where you turn unique values from one column
into multiple columns.
Code Example:
# Sample DataFrame
data = {
'Date': ['2024-01-01', '2024-01-02', '2024-01-01', '2024-01-02'],
'City': ['New York', 'New York', 'Los Angeles', 'Los Angeles'],
'Temperature': [32, 30, 75, 78]
}
df = pd.DataFrame(data)
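# Pivoting the DataFrame: one row per Date, one column per City
pivot_df = df.pivot(index='Date', columns='City', values='Temperature')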
print(pivot_df)
Explanation:
● The pivot() function reshapes the DataFrame, creating a new DataFrame where Date
is the index, City is the columns, and Temperature is the data.
Output:
2. Melting DataFrames
● Melting is the opposite of pivoting, where you unpivot your DataFrame, turning columns
into rows.
Code Example:
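A minimal sketch of melt(), starting from a wide table like the one the pivot example produces. The names `wide_df` and `melted_df` and the choice of var_name and value_name are assumptions.

```python
import pandas as pd

# Wide table: one row per date, one column per city
wide_df = pd.DataFrame({
    'Date': ['2024-01-01', '2024-01-02'],
    'New York': [32, 30],
    'Los Angeles': [75, 78]
})

# Long format: one row per (Date, City) pair
melted_df = wide_df.melt(id_vars='Date', var_name='City',
                         value_name='Temperature')
print(melted_df)
```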
Explanation:
● The melt() function transforms the DataFrame from wide format to long format, useful
for making the data tidy or ready for plotting.
Output:
Visualization with Pandas
8.1 Introduction to Data Visualization
Data visualization is a crucial aspect of data analysis, allowing you to communicate insights and
trends effectively. While there are dedicated libraries like Matplotlib and Seaborn for creating
detailed visualizations, Pandas provides built-in plotting capabilities that are quick and easy to
use, especially for exploratory data analysis.
Pandas’ plotting functionality is built on top of Matplotlib, which means that it offers a simple
interface for creating common types of plots, such as line plots, bar charts, histograms, and more.
8.2 Creating Basic Plots
Pandas makes it straightforward to create basic visualizations directly from DataFrames and
Series.
1. Line Plot
● A line plot is ideal for visualizing trends over time or continuous data.
Code Example:
import pandas as pd
# Sample DataFrame
data = {
'Date': pd.date_range(start='2024-01-01', periods=10, freq='D'),
'Value': [10, 12, 14, 15, 18, 20, 19, 21, 24, 26]
}
df = pd.DataFrame(data)
# Plotting a line plot
df.plot(x='Date', y='Value', kind='line', title='Line Plot Example')
● Explanation:
○ This code creates a simple line plot of Value over Date. The plot() function
automatically generates the plot and labels the axes based on the DataFrame
columns.
● Output:
○ A line plot showing the trend of Value over the given dates.
2. Bar Plot
● Bar plots are useful for comparing quantities across different categories.
Code Example:
# Sample DataFrame
data = {
'Category': ['A', 'B', 'C', 'D'],
'Values': [10, 20, 15, 25]
}
df = pd.DataFrame(data)
# Plotting a bar plot
df.plot(x='Category', y='Values', kind='bar', title='Bar Plot Example')
● Explanation:
○ This code generates a bar plot showing the values for each category. The
kind='bar' parameter specifies the type of plot.
● Output:
○ A bar plot displaying the values for categories A, B, C, and D.
3. Histogram
# Sample DataFrame
df = pd.DataFrame({
'Values': [10, 15, 20, 20, 25, 30, 35, 35, 40, 45, 50]
})
# Plotting a histogram
df['Values'].plot(kind='hist', title='Histogram Example', bins=5)
● Explanation:
○ This code generates a histogram to show the distribution of values in the
DataFrame. The bins parameter controls the number of bins in the histogram.
● Output:
○ A histogram displaying the frequency distribution of values.
8.3 Advanced Plotting
Pandas also supports more advanced plotting features, such as creating multiple plots in a single
figure and customizing plot styles.
1. Subplots
● You can create multiple subplots in a single figure using the subplots parameter.
Code Example:
# Sample DataFrame
data = {
'Category': ['A', 'B', 'C', 'D'],
'Values1': [10, 20, 15, 25],
'Values2': [15, 25, 10, 30]
}
df = pd.DataFrame(data)
# Creating subplots
df.plot(x='Category', y=['Values1', 'Values2'], kind='bar', subplots=True,
layout=(2, 1), title='Subplots Example')
● Explanation:
○ This code generates two bar plots in a single figure, arranged vertically in two
rows. The subplots=True parameter creates the separate plots, and layout
specifies the arrangement.
● Output:
○ Two bar plots, one for Values1 and one for Values2, displayed in a vertical
layout.
● You can customize the appearance of plots using various parameters in the plot()
function, such as color, line style, and more.
Code Example:
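A minimal sketch of the customization described in the explanation (green dashed line, circular markers), assuming Matplotlib is installed and reusing the date data from the line plot example. The title text is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.date_range(start='2024-01-01', periods=10, freq='D'),
    'Value': [10, 12, 14, 15, 18, 20, 19, 21, 24, 26]
})

# Extra keyword arguments are passed through to Matplotlib
ax = df.plot(x='Date', y='Value', kind='line', color='green',
             linestyle='--', marker='o', title='Customized Line Plot')
```

plot() returns a Matplotlib Axes object, so further adjustments such as ax.set_ylabel() can be made on `ax` afterwards.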
● Explanation:
○ This line plot is customized with a green dashed line and circular markers.
Customizing plots helps in making the visualization more informative and
aesthetically pleasing.
● Output:
○ A customized line plot with specified color, line style, and markers.
8.4 Plotting Directly with DataFrames
In addition to the basic plotting functions, you can plot directly from DataFrames, which allows for
more complex visualizations such as scatter plots, box plots, and area plots.
1. Scatter Plot
● Scatter plots are useful for visualizing the relationship between two variables.
Code Example:
# Sample DataFrame
data = {
'X': [1, 2, 3, 4, 5],
'Y': [2, 3, 5, 7, 11]
}
df = pd.DataFrame(data)
# Creating a scatter plot
df.plot(x='X', y='Y', kind='scatter', title='Scatter Plot Example')
● Explanation:
○ This code creates a scatter plot to show the relationship between X and Y.
● Output:
○ A scatter plot displaying the correlation between X and Y values.
2. Box Plot
● Box plots are useful for visualizing the distribution of data based on quartiles, including
potential outliers.
Code Example:
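A minimal sketch of a box plot, assuming Matplotlib is installed. The data reuses the values from the histogram example, and the title text is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    'Values': [10, 15, 20, 20, 25, 30, 35, 35, 40, 45, 50]
})

# One box per numeric column: median, quartiles, whiskers, and outliers
ax = df.plot(kind='box', title='Box Plot Example')
```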
● Explanation:
○ This code generates a box plot to show the distribution and variability of data.
● Output:
○ A box plot displaying the distribution of values.
8.5 Saving Plots
Finally, you can save your plots as image files for use in reports or presentations.
1. Saving a Plot
● You can save any plot to a file using the savefig() function from Matplotlib's pyplot module.
Code Example:
import matplotlib.pyplot as plt
# Plot the data, then save the current figure to a file
df.plot(x='Date', y='Value', kind='line', title='Line Plot Example')
plt.savefig('line_plot.png')
Explanation:
● The savefig() method saves the current plot as an image file. You can specify the
format by changing the file extension (e.g., .png, .jpg, .pdf).
Output:
● The plot is written to line_plot.png in the current working directory.
Conclusion
9.1 Recap of What You've Learned
Throughout this tutorial, you have explored the powerful capabilities of Pandas, a vital library for
data manipulation and analysis in Python. Here's a quick recap of what we covered:
1. Introduction to Pandas:
○ Learned about Pandas' significance in data science and its core data structures:
Series and DataFrame.
2. Getting Started with Pandas:
○ Installed Pandas, set up your development environment, and created your first
DataFrame.
3. Core Data Structures:
○ Delved deeper into Series and DataFrame, exploring how to create, access, and
manipulate data within these structures.
4. Data Manipulation:
○ Mastered the essentials of loading, cleaning, and transforming data, including
handling missing data and filtering rows.
5. Data Aggregation and Grouping:
○ Learned to group data, apply aggregate functions, and create pivot tables for
summarizing data.
6. Working with Time Series Data:
○ Explored handling date and time data, resampling, and performing time series
analysis.
7. Advanced Data Operations:
○ Discovered advanced techniques such as merging, joining, concatenating, and
reshaping DataFrames.
8. Visualization with Pandas:
○ Created various plots directly with Pandas, from basic line plots to advanced
visualizations, and saved them for further use.
9.2 Practical Applications of Pandas
With the knowledge gained from this tutorial, you are now equipped to handle real-world data
science projects. Some practical applications include:
9.3 Next Steps
To further enhance your skills and deepen your understanding of Pandas, consider the following
next steps:
○ Apply what you’ve learned by working on real-world datasets available from
sources like Kaggle, UCI Machine Learning Repository, or government open data
portals.
4. Build a Project:
○ Create a full-fledged data analysis or machine learning project from scratch,
leveraging Pandas for data manipulation, analysis, and visualization.
5. Join the Community:
○ Engage with the Pandas community by contributing to the library, asking questions
on forums like Stack Overflow, or participating in data science meetups.