Introduction to Pandas and Matplotlib
Dataset analysis and visualization
Content:
Introduction to Pandas
• What is Pandas?
• Core Features of Pandas
• Pandas Data Structures: Series & DataFrame
• Importing Data
• Basic Data Manipulation Operations
• Example Use Cases
• Hands-on Examples with Code
Introduction to Matplotlib
• What is Matplotlib?
• Why Matplotlib for Data Visualization?
• Basic Components of a Plot
• Common Plot Types
• Customizing Plots
• Hands-on Examples with Code
Introduction to pandas
• Pandas is a Python library for working with data sets; it is widely used for data analysis and manipulation.
• It is built on top of NumPy and provides easy-to-use data structures and operations.
• It has functions for analysing, cleaning, exploring, and manipulating data.
• The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.
pandas core feature…
Key Features:
• Loading data into DataFrame and Series structures
• Handling missing data
• Data manipulation (filtering, aggregation, imputation, removal)
• Filtering data based on conditions
• Creating new columns based on existing data
• It is an open-source library
Key structures:
• Series (1D labeled array)
• DataFrame (2D labeled data, like a table)
Pandas codebase: https://github.com/pandas-dev/pandas
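The two key structures can be tried out directly. The following is a minimal sketch; the names and values are made up for illustration and do not come from any dataset used in these slides.

```python
import pandas as pd

# A Series is a one-dimensional labeled array: values plus an index.
salaries = pd.Series([12000, 10000, 4500], index=["ahmad", "mahmood", "khalil"])
print(salaries["mahmood"])  # label-based lookup

# A DataFrame is a table whose columns are Series sharing one index.
# The column names and values here are hypothetical.
employees = pd.DataFrame(
    {
        "name": ["ahmad", "mahmood", "khalil"],
        "salary": [12000, 10000, 4500],
    }
)
print(employees.shape)            # (rows, columns)
print(type(employees["salary"]))  # selecting one column yields a Series
```

Selecting a single column of a DataFrame gives back a Series, which is why most Series operations also work column by column on a table.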
Importing and installation
Installation
• Open command prompt
• Run the command `pip install pandas`
Importing pandas
• To use the library, import it into the project with one of the following statements
• `import pandas` or `import pandas as pd`
Importing example
Importing data to pandas
Data Importing:
• Pandas can easily import data from different file formats, such as CSV, Excel, and JSON.
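As a small sketch of the importing functions, the example below reads CSV and JSON text held in memory, so it runs without any files on disk. The sample records and column names are hypothetical; with real data you would pass a file path instead of the `io.StringIO` object.

```python
import io
import pandas as pd

# Inline text stands in for files on disk (hypothetical sample data).
csv_text = "EEID,Name,salary\nE01,Ali,1200\nE02,Sara,1500\n"
json_text = '[{"EEID": "E01", "salary": 1200}, {"EEID": "E02", "salary": 1500}]'

# Each reader returns a DataFrame, whatever the source format was.
from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv)
print(from_json)
```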
Exploring dataset
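import pandas as pd
# data = pd.read_csv('file-name.csv')
# data = pd.read_excel('file-name.xlsx')  # the path to the file
data = pd.read_excel('ESD.xlsx')  # reading .xlsx files typically requires the openpyxl package
# print(data.head())          # first 5 rows
# print(data.tail())          # last 5 rows
# data.info()                 # column types and non-null counts (prints its report directly)
# print(data.describe())      # summary statistics for numeric columns
# print(data.isnull().sum())  # number of missing values per column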
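The exploration calls can also be tried on a small in-memory table. This is only a sketch: the DataFrame below is a hypothetical stand-in for the spreadsheet, with two missing cells added on purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the spreadsheet, so the calls can be tried without a file.
data = pd.DataFrame(
    {
        "EEID": ["E01", "E02", "E03", "E04"],
        "Age": [25, 31, np.nan, 47],
        "Department": ["IT", "Sales", "IT", None],
    }
)

print(data.head(2))     # first two rows
print(data.shape)       # (number of rows, number of columns)
data.info()             # dtypes and non-null counts; prints its report directly
print(data.describe())  # statistics for the numeric columns only

# Number of missing values in each column
missing_per_column = data.isnull().sum()
print(missing_per_column)
```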
Handling duplicate data
import pandas as pd
# Handling duplicate values
data = pd.read_csv('company1.csv')
# Flag rows whose EEID already appeared earlier (True = duplicate)
print(data['EEID'].duplicated())
print(data)
# Count the duplicated EEID values
print(data['EEID'].duplicated().sum())
# Count the non-null values in the salary column
print(data['salary'].count())
# Drop rows with a duplicated EEID (returns a new DataFrame; data itself is unchanged)
print(data.drop_duplicates('EEID'))
# Alternatively, keep the rows but replace the duplicated EEID values with a missing-value marker
data['EEID'] = data['EEID'].where(~data['EEID'].duplicated(), other=pd.NA)
# Equivalent: data.loc[data['EEID'].duplicated(), 'EEID'] = pd.NA
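Which copy of a duplicated row survives depends on the `keep` argument. The sketch below uses a hypothetical four-row table in which one EEID appears twice, to show the difference between keeping the first and the last occurrence.

```python
import pandas as pd

# Hypothetical records; "E02" appears twice.
data = pd.DataFrame(
    {
        "EEID": ["E01", "E02", "E02", "E03"],
        "salary": [1000, 2000, 2500, 3000],
    }
)

# duplicated() marks every repeat after the first occurrence by default.
dup_mask = data["EEID"].duplicated()
print(dup_mask.tolist())

# keep="first" (the default) retains the earliest row, keep="last" the latest.
first_kept = data.drop_duplicates(subset="EEID", keep="first")
last_kept = data.drop_duplicates(subset="EEID", keep="last")
print(first_kept)
print(last_kept)
```

Keeping the last occurrence is a common choice when later rows are corrections of earlier ones.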
Handling missing data
import numpy as np
import pandas as pd
hmd = pd.read_csv('company1.csv')
# show missing data (True = missing)
hmd.isnull()
# count missing data per column
hmd.isnull().sum()
# drop rows with null values (returns a new DataFrame; hmd is unchanged)
hmd.dropna()
# replace null values with a custom value (also returns a new DataFrame)
hmd.replace(np.nan, 'default_value')
# replace null values in a specific column
hmd['Name'] = hmd['Name'].replace(np.nan, 'no-name')
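An alternative to `replace(np.nan, ...)` is `fillna`, which accepts a dictionary so that each column gets its own replacement value. This is a sketch on a hypothetical three-row table, not the contents of company1.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing name and one missing salary.
hmd = pd.DataFrame(
    {
        "Name": ["Ali", np.nan, "Sara"],
        "salary": [1000.0, 2000.0, np.nan],
    }
)

# A dict maps each column to the value used for its missing cells.
filled = hmd.fillna({"Name": "no-name", "salary": 0})
print(filled)

# dropna() removes every row that has at least one missing cell.
dropped = hmd.dropna()
print(dropped)

# Both calls return new DataFrames; the original still has its gaps.
print(hmd.isnull().sum())
```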
Handling missing data continue…
Sometimes you can't just drop the missing data. Instead, you can replace it with a representative value such as the mean, median, or mode.
Mean = the average value (the sum of all values divided by number of values).
Median = the value in the middle, after you have sorted all values ascending.
Mode = the value that appears most frequently.
mean_salary = hmd['salary'].mean() # Calculate the mean
median_salary = hmd['salary'].median() # Calculate the median
mode_salary = hmd['salary'].mode()[0] # Calculate the mode (most frequent salary value)
# Print the calculated values
print(f"\nMean Salary: {mean_salary}")
print(f"Median Salary: {median_salary}")
print(f"Mode Salary: {mode_salary}")
hmd['salary'] = hmd['salary'].replace(np.nan, mode_salary) # Replace NaN with mode in the 'salary' column
print(hmd)
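The choice between mean, median, and mode matters when the data contains outliers. The sketch below uses a hypothetical salary column with one very large value to show how the mean is pulled upward while the median and mode are not.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries: one outlier (12000) and one missing value.
salary = pd.Series([1000, 2000, 2000, 3000, 12000, np.nan])

# All three statistics skip missing values by default.
mean_salary = salary.mean()      # (1000 + 2000 + 2000 + 3000 + 12000) / 5
median_salary = salary.median()  # middle of the five sorted values
mode_salary = salary.mode()[0]   # most frequent value

print(mean_salary, median_salary, mode_salary)

# The median is less sensitive to the outlier, so it is used for the gap here.
filled = salary.fillna(median_salary)
print(filled)
```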
Handling missing data continue…
Filling missing data
• Forward filling
• Backward filling
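For example, we cannot compute the mean or median of a categorical column such as gender, so we fill each gap from a neighbouring row instead.
hmd['gender'] = hmd['gender'].bfill() # backward fill: copy the next valid value upward
hmd['gender'] = hmd['gender'].ffill() # forward fill: copy the previous valid value downward (covers any gap left at the end)
print(hmd)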
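The direction of the fill decides which gaps remain. In this sketch, on a hypothetical gender column, forward filling cannot fill a gap at the start, backward filling cannot fill one at the end, and chaining the two fills both.

```python
import pandas as pd

# Hypothetical column with gaps at the start, in the middle, and at the end.
gender = pd.Series([None, "Male", None, "Female", None])

forward = gender.ffill()           # copies the previous valid value downward
backward = gender.bfill()          # copies the next valid value upward
combined = gender.ffill().bfill()  # forward first, then backward for the leading gap

print(forward.tolist())
print(backward.tolist())
print(combined.tolist())
```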
Data Transformation
esd = pd.read_excel('ESD.xlsx')
esd.loc[esd['Bonus %'] == 0 , "GetBonus"] = "No bonus"
esd.loc[esd['Bonus %'] > 0 , "GetBonus"] = "bonus"
print(esd.head(10))
# another example
esd = pd.read_excel('ESD.xlsx')
esd['describe_employee'] = esd['EEID'] + ' ' + esd['Full Name'].str.upper() + ' ' + esd['Job Title']
# 10% tax on the annual salary, and the salary left after deducting it
esd['tax'] = esd['Annual Salary'] * 0.10
esd['salary_after_tax'] = esd['Annual Salary'] - esd['tax']
print(esd.head())
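A two-way condition like the bonus label can also be written in one step with `np.where`. The sketch below uses a hypothetical two-row table and assumes that "Bonus %" holds whole-number percentages; the real spreadsheet may store the values differently.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; "Bonus %" is assumed to hold whole-number percentages.
esd = pd.DataFrame(
    {
        "Full Name": ["Ali Khan", "Sara Noor"],
        "Annual Salary": [50000, 80000],
        "Bonus %": [0, 15],
    }
)

# np.where(condition, value_if_true, value_if_false) works row by row.
esd["GetBonus"] = np.where(esd["Bonus %"] > 0, "bonus", "No bonus")

# A new numeric column derived from two existing ones.
esd["Bonus Amount"] = esd["Annual Salary"] * esd["Bonus %"] / 100

print(esd)
```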
Dataset summarization
sum(): Returns the sum of the values.
mean(): Returns the average of the values.
count(): Returns the number of non-NA/null observations.
max(): Returns the maximum value.
min(): Returns the minimum value.
median(): Returns the median value.
esd = pd.read_excel('ESD.xlsx')
agg_1 = esd.groupby(['Department' , 'Gender']).agg({"EEID":"count"})
agg_2 = esd.groupby(['Department' , 'Ethnicity']).agg({"EEID":"count"})
print(agg_1)
print(agg_2)
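Several of the summary functions listed above can be combined in one `agg` call. The sketch below uses named aggregation on a hypothetical five-row table; the output column names (`employees`, `mean_salary`, `max_salary`) are chosen only for this example.

```python
import pandas as pd

# Hypothetical employee table.
esd = pd.DataFrame(
    {
        "EEID": ["E01", "E02", "E03", "E04", "E05"],
        "Department": ["IT", "IT", "Sales", "Sales", "Sales"],
        "Annual Salary": [50000, 70000, 40000, 60000, 80000],
    }
)

# Named aggregation: result_column=(source_column, function)
summary = esd.groupby("Department").agg(
    employees=("EEID", "count"),
    mean_salary=("Annual Salary", "mean"),
    max_salary=("Annual Salary", "max"),
)
print(summary)
```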
Merge and join
import pandas as pd
employee = {
"id": [1,2,3,4],
'names': ['ahmad' , 'mahmood' , 'khalil' , 'khanwali']
}
employee2 = {
"id": [1,2,3,4],
'salary': [12000,10000,4500,8000]
}
df = pd.DataFrame(employee)
df1 = pd.DataFrame(employee2)
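# Inner join on the shared 'id' column (inner is the default for merge)
emp = pd.merge(df, df1, on='id')
print(emp)
# Left join: keeps every row of df; without 'on', merge uses the columns common to both frames
emp = pd.merge(df, df1, how='left')
print(emp)
# concat stacks the two frames row-wise instead of matching rows on 'id'
emp = pd.concat([df, df1])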
print(emp)
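The join types only differ when the two tables do not share exactly the same keys. The sketch below uses hypothetical tables in which id 1 has no salary and id 5 has no name, to show what inner, left, and outer joins keep, and how `concat` differs from all of them.

```python
import pandas as pd

# Hypothetical tables whose ids overlap only partly.
names = pd.DataFrame({"id": [1, 2, 3, 4], "names": ["ahmad", "mahmood", "khalil", "khanwali"]})
salaries = pd.DataFrame({"id": [2, 3, 4, 5], "salary": [10000, 4500, 8000, 9000]})

inner = pd.merge(names, salaries, on="id", how="inner")  # ids present in both tables
left = pd.merge(names, salaries, on="id", how="left")    # every row of names; missing salary becomes NaN
outer = pd.merge(names, salaries, on="id", how="outer")  # every id from either table

# concat does not match keys at all; it simply stacks the rows.
stacked = pd.concat([names, salaries], ignore_index=True)

print(inner)
print(left)
print(outer)
print(stacked)
```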
Introduction to Matplotlib
A comprehensive library for creating static, animated, and interactive visualizations in Python.
• Built on NumPy and designed for easy and flexible plotting.
• Matplotlib was created by John D. Hunter.
• Matplotlib is open source and we can use it freely.
• https://matplotlib.org/stable/ -- documentation
• https://github.com/matplotlib/matplotlib -- codebase
Why matplotlib
Advantages:
• Versatile: Supports a variety of plots (line, scatter, bar, etc.).
• Customizable: Extensive options for styling and formatting.
• Integration: Works well with Jupyter notebooks, other libraries like Pandas, and GUI applications.
Installation & usage:
• To install it, run `pip install matplotlib`
• To use it, add `import matplotlib.pyplot as plt`
Plot types
Common Plot Types are as follows:
• Line Plot: `plt.plot()`
• Scatter Plot: `plt.scatter()`
• Bar Plot: `plt.bar()`
• Histogram: `plt.hist()`
• Box Plot: `plt.boxplot()`
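Each of the functions listed above draws onto an Axes that lives inside a Figure. As a rough sketch of these basic components, the example below creates both explicitly and sets the title, axis labels, and legend on the Axes. The data points are made up, and the `Agg` backend line is only there so the sketch also runs on a machine without a display; remove it and call `plt.show()` to open a window.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line for an on-screen window
import matplotlib.pyplot as plt

# Figure = the whole canvas, Axes = one plotting area inside it.
fig, ax = plt.subplots(figsize=(6, 4))

# Hypothetical data points; the label is picked up by the legend.
ax.plot([1, 2, 3, 4], [10, 20, 15, 30], marker="o", label="sample series")

ax.set_title("Basic components of a plot")
ax.set_xlabel("x value")
ax.set_ylabel("y value")
ax.legend()
```

The `plt.plot()` style used elsewhere in these slides does the same thing, but draws on whichever Axes is currently active.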
Scatter plot
import numpy as np
import matplotlib.pyplot as pt
x_axis = np.random.random(50) * 100
y_axis = np.random.random(50) * 100
pt.scatter(x_axis,y_axis)
pt.show()
pt.scatter(x_axis,y_axis, color='#0000ff', marker="1", s= 100)
pt.show()
Line plot or chart
#line chart
years = [2000+ x for x in range(24)] # x axis
home_price = np.random.random(24) * 1000 # y axis
years1 = [2000+ x for x in range(24)] # x axis
home_price1 = np.random.random(24) * 500 # y axis
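# pt.plot(years, home_price)  # plain line with default styling
pt.plot(years, home_price, c='red', lw=4, label='line-1')  # add linestyle='--' for a dashed line
pt.plot(years1, home_price1, label='line-2')
pt.legend(loc='upper right')  # place the legend in the upper-right corner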
pt.show()
Barchart
import matplotlib.pyplot as plt
# Data for the bar chart
products = ['Product A', 'Product B', 'Product C', 'Product D']
sales = [120, 300, 250, 450]
# Create a bar chart
plt.bar(products, sales, color='skyblue', edgecolor='black', width=0.6)
# Add Title and Labels
plt.title('Sales of Products in Q1 2024', fontsize=16, fontweight='bold', color='darkblue')
plt.xlabel('Products', fontsize=12, fontweight='bold')
plt.ylabel('Sales (in Units)', fontsize=12, fontweight='bold')
# Gridlines (add gridlines for better readability)
plt.grid(True, which='both', axis='y', linestyle='--', linewidth=0.7)
# Label each bar with its value
for i, value in enumerate(sales):
    plt.text(i, value + 10, str(value), ha='center', fontsize=11)
plt.gca().set_facecolor('whitesmoke')  # Add a background color to the chart
plt.ylim(0, 1000)  # Set the limits for the y-axis
plt.tight_layout()  # Adjust spacing so labels are not cut off
plt.show()  # Display the bar chart
Histogram chart
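# Data for a histogram (continuous data)
# Small hand-written sample:
# ages = [22, 23, 25, 26, 28, 30, 32, 35, 36, 37, 40, 42, 45, 48, 50]
# Larger simulated sample: 12,000 values from a normal distribution (mean 20, standard deviation 1.5)
ages = np.random.normal(20, 1.5, 12000)
# Creating a histogram
# pt.hist(ages, bins=5, color='lightgreen', edgecolor='black')
pt.hist(ages, bins=112, cumulative=True)  # cumulative=True: each bar counts all values up to that bin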
pt.xlabel('Age')
pt.ylabel('Frequency')
pt.title('Age Distribution')
pt.show()
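What a histogram draws can be checked numerically with `np.histogram`, which returns the bin edges and the count in each bin. The sketch below uses the small hand-written list of ages with 5 bins, and a running sum to show what the cumulative option adds up.

```python
import numpy as np

ages = [22, 23, 25, 26, 28, 30, 32, 35, 36, 37, 40, 42, 45, 48, 50]

# Five equal-width bins between the smallest and largest age.
counts, edges = np.histogram(ages, bins=5)
print(edges)   # 6 edges for 5 bins
print(counts)  # number of ages that fall in each bin

# A cumulative histogram shows the running total of these counts.
cumulative = np.cumsum(counts)
print(cumulative)
```

The last cumulative value always equals the number of observations, which is a quick sanity check on a cumulative histogram.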
Pie chart
# Company names and their market caps (in billions)
companies = [
'Apple', 'Microsoft', 'Nvidia', 'Saudi Aramco', 'Alphabet (Google)', 'Amazon',
'Meta Platforms (Facebook)', 'Berkshire Hathaway', 'TSMC', 'Eli Lilly',
'JPMorgan Chase', 'Tesla', 'Visa', 'Johnson & Johnson', 'ExxonMobil',
'Samsung', 'Chevron', 'Walmart', 'Pfizer', 'Procter & Gamble',
'Mastercard', 'Alibaba', 'Boeing', 'Cisco', 'IBM',
'Shell', 'American Express', 'Qualcomm', 'Verizon', 'Morgan Stanley'
]
market_caps = [
3100, 3100, 2200, 2100, 1700, 1400, 768, 768, 500, 500,
500, 800, 500, 450, 400, 320, 330, 648, 240, 320,
350, 240, 150, 216, 215, 211, 196, 189, 180, 179
]
# Create the pie chart
pt.figure(figsize=(10, 10))
pt.pie(market_caps, labels=companies, autopct='%2.2f%%', startangle=140, colors=pt.cm.Paired.colors)
# Add a title
pt.title('Market Share of Top 30 Companies by Market Cap (2024)')
# Display the pie chart
pt.show()
Boxplot or boxchart
# Sample data for the salaries in different departments
salaries = [
[45000, 48000, 52000, 55000, 58000, 60000, 62000, 67000, 70000], # Department A
[40000, 43000, 47000, 50000, 53000, 56000, 59000], # Department B
[30000, 35000, 40000, 42000, 45000, 47000, 49000, 50000], # Department C
[60000, 62000, 65000, 67000, 70000, 75000], # Department D
[50000, 52000, 54000, 55000, 58000, 60000, 62000, 65000] # Department E
]
# Create the box plot
plt.figure(figsize=(8, 6))
plt.boxplot(salaries, labels=['Dept A', 'Dept B', 'Dept C', 'Dept D', 'Dept E'])
# Add a title and labels
plt.title('Salary Distribution by Department')
plt.ylabel('Salary ($)')
plt.xlabel('Departments')
# Display the plot
plt.show()
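The parts of a box can be reproduced by hand. The sketch below computes the quartiles for Department A with `np.percentile` and the limits beyond which points would be drawn as outliers, assuming Matplotlib's default whisker factor of 1.5 times the interquartile range.

```python
import numpy as np

# Department A salaries from the example above
dept_a = [45000, 48000, 52000, 55000, 58000, 60000, 62000, 67000, 70000]

# The box spans Q1 to Q3, with a line at the median.
q1, median, q3 = np.percentile(dept_a, [25, 50, 75])
iqr = q3 - q1

# Points outside these limits are plotted individually as outliers
# (assuming the default whisker factor of 1.5).
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [v for v in dept_a if v < lower_limit or v > upper_limit]

print(q1, median, q3, iqr)
print(lower_limit, upper_limit, outliers)
```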
Multiple Figures
import matplotlib.pyplot as plt
import numpy as np
# Generating sample data for 1 year (12 months)
months = [
'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'
]
# Sample prices for each cryptocurrency (in USD)
btc_prices = [
32000, 33000, 34000, 35000, 36000, 37000,
38000, 39000, 40000, 41000, 42000, 43000
] # Bitcoin (BTC)
eth_prices = [
2000, 2100, 2200, 2300, 2400, 2500,
2600, 2700, 2800, 2900, 3000, 3100
] # Ethereum (ETH)
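# Figure 1: Bitcoin
plt.figure(1)
plt.plot(months, btc_prices, marker='o', label='Bitcoin (BTC)', color='orange')
# Add a title and labels (they apply to the current figure, so set them after plt.figure)
plt.title('Cryptocurrency Prices Over 1 Year')
plt.xlabel('Months')
plt.ylabel('Price (USD)')
plt.legend()
# Figure 2: Ethereum, drawn in a separate window
plt.figure(2)
plt.plot(months, eth_prices, marker='o', label='Ethereum (ETH)', color='blue')
plt.title('Cryptocurrency Prices Over 1 Year')
plt.xlabel('Months')
plt.ylabel('Price (USD)')
plt.legend()
plt.show()  # Display both figures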
Subplots and saving
import matplotlib.pyplot as plt
import numpy as np
# Generate x values
x = np.linspace(-2 * np.pi, 2 * np.pi, 1000)
# Calculate y values for each function
y_cos = np.cos(x) # Cosine wave
y_sin = np.sin(x) # Sine wave
y_tan = np.tan(x) # Tangent wave
# Create a figure with 3 subplots
fig, axs = plt.subplots(3, 1, figsize=(10, 12))
# Cosine Wave
axs[0].plot(x, y_cos, color='blue', label='Cosine Wave')
axs[0].set_title('Cosine Wave')
axs[0].set_ylabel('cos(x)')
axs[0].set_ylim(-1.5, 1.5) # Limit y-axis for better visibility
axs[0].grid(True)
axs[0].legend()
Subplots and saving continue…
# Sine Wave
axs[1].plot(x, y_sin, color='orange', label='Sine Wave')
axs[1].set_title('Sine Wave')
axs[1].set_ylabel('sin(x)')
axs[1].set_ylim(-1.5, 1.5) # Limit y-axis for better visibility
axs[1].grid(True)
axs[1].legend()
# Tangent Wave
axs[2].plot(x, y_tan, color='green', label='Tangent Wave')
axs[2].set_title('Tangent Wave')
axs[2].set_ylabel('tan(x)')
axs[2].set_ylim(-10, 10) # Limit y-axis for better visibility
axs[2].grid(True)
axs[2].legend()
# Adjust layout to prevent overlap
plt.tight_layout()
plt.savefig('subplots.png', dpi=300, transparent=True)  # PNG supports a transparent background; JPEG does not
# plt.show()
3d plotting
import numpy as np
import matplotlib.pyplot as pt
# Create a grid of x and y values
x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
x, y = np.meshgrid(x, y)
# Calculate z values based on the mathematical formula
z = np.sin(np.sqrt(x**2 + y**2))
# Create a 3D plot
fig = pt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
# Plot the surface
surface = ax.plot_surface(x, y, z, cmap='viridis', edgecolor='none')
# Add a color bar which maps values to colors
fig.colorbar(surface, shrink=0.5, aspect=10)
# Set titles and labels
ax.set_title('3D Plot of z = sin(sqrt(x^2 + y^2))')
ax.set_xlabel('X axis')
ax.set_ylabel('Y axis')
ax.set_zlabel('Z axis')
# Show the plot
pt.show()
3d plotting second example: scatter plot
import numpy as np
import matplotlib.pyplot as pt
# Generate random data for the scatter plot
num_points = 100
x = np.random.rand(num_points) * 10 # X values
y = np.random.rand(num_points) * 10 # Y values
z = np.random.rand(num_points) * 10 # Z values
# Create a 3D scatter plot
fig = pt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
# Scatter the data points
scatter = ax.scatter(x, y, z, c='g', marker='o', alpha=0.7)
# Set titles and labels
ax.set_title('3D Scatter Plot', fontsize=16)
ax.set_xlabel('X axis', fontsize=12)
ax.set_ylabel('Y axis', fontsize=12)
ax.set_zlabel('Z axis', fontsize=12)
# Show the plot
pt.show()
Using both pandas and matplotlib
import pandas as pd
import matplotlib.pyplot as plt
esd = pd.read_excel('ESD.xlsx')
agg_ = esd.groupby(['Ethnicity']).agg({"EEID": "count"})
ethnicity_counts = agg_.reset_index() # Reset index to make 'Ethnicity' a column
ethnicity_counts.columns = ['Ethnicity', 'Count'] # Rename columns for clarity
plt.figure(figsize=(8, 6)) # Set figure size
plt.pie(ethnicity_counts['Count'], labels=ethnicity_counts['Ethnicity'], autopct='%1.1f%%',
startangle=140)
plt.title('Employee Distribution by Ethnicity')
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()
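Pandas objects also have their own `plot` method, which calls Matplotlib internally. As a rough sketch, the example below counts categories with `value_counts` and draws them as a bar chart. The six rows are hypothetical, and the `Agg` backend line is only there so the sketch runs without a display; remove it and call `plt.show()` for an on-screen window.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line for an on-screen window
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical employee rows.
esd = pd.DataFrame(
    {
        "EEID": ["E01", "E02", "E03", "E04", "E05", "E06"],
        "Ethnicity": ["Asian", "Asian", "Black", "Caucasian", "Asian", "Latino"],
    }
)

# value_counts() returns a Series: category -> number of rows, largest first.
counts = esd["Ethnicity"].value_counts()

# Series.plot returns the Matplotlib Axes it drew on, so it can be customized further.
ax = counts.plot(kind="bar", color="skyblue", edgecolor="black")
ax.set_title("Employee Count by Ethnicity")
ax.set_xlabel("Ethnicity")
ax.set_ylabel("Count")
ax.get_figure().tight_layout()
```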
Datasets resources
Datasets
• https://archive.ics.uci.edu/ - UCI Machine Learning Repository
• https://datasetsearch.research.google.com/ - Google Dataset Search
• https://data.un.org/ - UN data
• https://www.statista.com - Statista