Data Visualization
CSDS3202
Contents
• Basics of data visualization
• Importance of visualization
• Design principles
• Introduction to data visualization libraries in Python – Matplotlib,
Seaborn
• Generate basic graphs such as bar graphs, histograms, line graphs,
scatter plots
• Generate statistical visualizations of data such as distribution plots, pie
charts, bar charts, heat maps
• Generate visual maps and images
Data visualization
• Data visualization is a fundamental aspect of data science: it
represents data in a graphical format.
• It is the process of creating visual elements such as charts, graphs,
maps, and diagrams to communicate complex information in an easy
and understandable manner.
• The goal of data visualization is to tell a story and present data in a
way that helps users (data experts and non-experts) make sense of
the data and identify patterns, trends, and insights.
Basics of data visualization
• Choosing the right type of
chart or graph
• Designing for clarity and
simplicity
• Using appropriate scales
• Highlighting important
information
Importance of visualization
• Data visualization helps to
• simplify complex data and make it more accessible to a wide range of audiences.
• identify hidden patterns and trends in large datasets.
• support decision-makers in making more informed decisions by surfacing insights
from the data.
• enhance data quality by making it easier to spot errors and anomalies in the
data.
• save time by presenting data in a way that is easy to understand and analyze.
Design principles
• Clarity - clear and easy to understand and avoid clutter
• Simplicity - simple and focused on the most important information
• Consistency - use consistent colors, fonts, and other design elements
throughout the visualization
• Context - provide context for the data by including labels, annotations, and
other relevant information
• Accuracy - ensure that the data is represented accurately and transparently
• Functionality - where the medium allows, offer interactive features such as
zooming, filtering, and sorting.
• Aesthetics - should be visually appealing and engaging with pleasing colors,
fonts, and other design elements
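As a rough illustration of the Context and Clarity principles, the sketch below labels a simple line chart, annotates one point, and removes borders that carry no information. It uses the yearly profit figures from this chapter's example dataset; the title and annotation text are assumptions made for the example, not part of the original slides.

```python
import matplotlib.pyplot as plt

years = [2010, 2011, 2012, 2013, 2014, 2015]
profit = [20, 24, 30, 15, 35, 50]

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(years, profit, marker='o', color='tab:blue')

# Context: a title and axis labels tell the reader what is plotted
ax.set_title('Yearly profit')
ax.set_xlabel('Year')
ax.set_ylabel('Profit')

# Highlighting: annotate the single point the reader should notice
low = min(profit)
low_year = years[profit.index(low)]
ax.annotate('Lowest profit', xy=(low_year, low), xytext=(low_year, low + 15),
            arrowprops=dict(arrowstyle='->'), ha='center')

# Clarity: hide the top and right borders, which carry no information
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
```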
Data visualization libraries
• Popular Python libraries for visualization include:
1. matplotlib
2. seaborn
3. bokeh
4. altair
• This chapter focuses on Matplotlib and Seaborn.
Why matplotlib?
• Matplotlib produces publication-quality figures in a variety of
formats
• Supports interactive environments across Python platforms.
• Pandas comes equipped with useful wrappers around several
matplotlib plotting routines, which allow quick and handy
plotting of Series and DataFrame objects.
• Before using Matplotlib, you need to import the library into
your Python script or notebook
import matplotlib.pyplot as plt
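A minimal sketch of the pandas wrapper mentioned above: `DataFrame.plot` draws with matplotlib and returns the Axes, which can then be adjusted with ordinary matplotlib calls. The data is this chapter's example dataset; the title and marker are choices made for the illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'year': [2010, 2011, 2012, 2013, 2014, 2015],
                   'sales': [50, 70, 90, 80, 100, 120],
                   'profit': [20, 24, 30, 15, 35, 50]})

# One call plots both columns against 'year' and adds a legend
ax = df.plot(x='year', y=['sales', 'profit'], marker='o',
             title='Sales and Profit')
ax.set_ylabel('Amount')   # the returned object is a regular matplotlib Axes
plt.show()
```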
Dataset used
• The following DataFrame 'df' is used
to create the plots in this chapter
import pandas as pd
import matplotlib.pyplot as plt
dic = {'year': [2010, 2011, 2012, 2013, 2014, 2015],
'sales': [50, 70, 90, 80, 100, 120],
'profit': [20, 24, 30, 15, 35, 50],
'rating':['B','B','A','B','A','A']}
df = pd.DataFrame(dic)
Line Plot
• Create a line plot to show the sales
and profit for all years
plt.plot(df['year'], df['sales'], label='Sales', linestyle='-', marker='>')
plt.plot(df['year'], df['profit'], label='Profit', linestyle='--', color='r')
plt.xlabel('Year')
plt.ylabel('Amount')
plt.title('Sales and Profit')
plt.legend()
plt.show()
Line Plot - changing limits, ticks and
figure size
Try these on the line plot:
• plt.xlim(low,high)
• plt.ylim(low,high)
• plt.xticks([list of points])
• plt.yticks([list of points])
• plt.figure(figsize=(width,height)) - call this before the plot commands, since it starts a new figure
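The calls listed above could be combined as in the following sketch. The specific limits, tick positions, and figure size are arbitrary values picked for this example.

```python
import matplotlib.pyplot as plt

years = [2010, 2011, 2012, 2013, 2014, 2015]
sales = [50, 70, 90, 80, 100, 120]

plt.figure(figsize=(8, 4))        # width and height in inches; set before plotting
plt.plot(years, sales, marker='>')
plt.xlim(2009, 2016)              # visible range of the x axis
plt.ylim(0, 150)                  # visible range of the y axis
plt.xticks(years)                 # one tick per year
plt.yticks(range(0, 151, 25))     # a tick every 25 units
ax = plt.gca()                    # keep a handle on the current Axes
plt.show()
```

Setting the x ticks explicitly is useful here because the years are whole numbers, and automatic tick placement may otherwise choose fractional positions.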
Scatter Plot
• Used to observe the relationship
between two numeric variables
• Helps identify patterns, trends,
clusters, outliers and anomalies
in data.
plt.scatter(df['sales'],df['profit'],c='g')
plt.xlabel('Sales')
plt.ylabel('Profit')
plt.show()
Histogram
• Used to represent the frequency of
occurrence of a range of values in
a dataset using bars of different
heights.
• Shows the distributional features
of a variable (peaks, outliers,
skewness)
plt.hist(df['sales'], bins=4)
plt.show()
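To see what `bins=4` does to the sales values, the counts can be computed directly with `numpy.histogram`, which splits the data range into equal-width bins in the same way. This is a small sketch for checking the bar heights, not a replacement for the plot.

```python
import numpy as np

sales = [50, 70, 90, 80, 100, 120]

# Split the range 50..120 into 4 equal-width bins and count values per bin
counts, edges = np.histogram(sales, bins=4)
print(edges)    # bin boundaries: 50, 67.5, 85, 102.5, 120
print(counts)   # values per bin: 1, 2, 2, 1
```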
Bar Plot - Vertical
• Used to represent data associated
with a categorical variable.
• Used to compare the values of
different categories or groups
plt.bar(df['rating'],df['profit'])
plt.show()
Note: Bars for repeated rating values are drawn
on top of each other, so only the highest profit
for each rating is visible
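The effect described in the note can be checked, and made explicit, by aggregating first. The sketch below computes the highest profit per rating, which is the bar height that remains visible, and plots that aggregate instead of the raw rows.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'profit': [20, 24, 30, 15, 35, 50],
                   'rating': ['B', 'B', 'A', 'B', 'A', 'A']})

# Highest profit within each rating group
tallest = df.groupby('rating')['profit'].max()
print(tallest)   # A -> 50, B -> 24

# Plot one bar per rating, so nothing is hidden behind another bar
ax = tallest.plot(kind='bar', rot=0)
ax.set_ylabel('Highest profit')
plt.show()
```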
Bar Plot - Vertical
• Plot the median value of the
profit column for each rating
df.groupby('rating')['profit'].median().plot(kind='bar')
plt.show()
Bar Plot - Horizontal
• Plot the mean value of the
profit column for each rating
df.groupby('rating')['profit'].mean().plot(kind='barh', color='red')
plt.show()
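The median and mean can also be compared side by side in one chart. The following sketch uses `agg` to compute both statistics per rating; grouping them in a single bar chart is a choice made for this example.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'profit': [20, 24, 30, 15, 35, 50],
                   'rating': ['B', 'B', 'A', 'B', 'A', 'A']})

# One row per rating, one column per statistic
summary = df.groupby('rating')['profit'].agg(['median', 'mean'])
print(summary.round(2))

# Each column becomes its own bar within every rating group
ax = summary.plot(kind='bar', rot=0)
ax.set_ylabel('Profit')
plt.show()
```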
Box Plot
• It is a graphical representation
of the distribution of a dataset.
It displays the median,
quartiles, and outliers of the
data.
plt.boxplot(df['profit'])
plt.show()
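The numbers behind the box plot can be reproduced by hand. The sketch below computes the quartiles of the profit column and applies the usual 1.5 × IQR rule, which is matplotlib's default, to decide which points would be drawn as outliers.

```python
import pandas as pd

profit = pd.Series([20, 24, 30, 15, 35, 50])

# Box edges (q1, q3) and the line inside the box (median)
q1, median, q3 = profit.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond these fences are drawn as individual outlier markers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = profit[(profit < lower_fence) | (profit > upper_fence)]

print(q1, median, q3)      # 21.0 27.0 33.75
print(outliers.tolist())   # [] since no value lies beyond the fences
```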
Pie Chart
• A pie chart is a circular statistical
chart divided into slices to show
numerical proportions.
• Each slice of the pie chart
represents a category or value, and
the size of each slice corresponds to
its percentage of the whole.
df.groupby('rating')['sales'].mean().plot(kind='pie', autopct="%3.2f%%", explode=[0.2, 0])
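The percentages printed on the slices can be verified with a short calculation. Note that the chart above plots the mean sales per rating, so each slice is that mean's share of the sum of the two means, not a share of total sales.

```python
import pandas as pd

df = pd.DataFrame({'sales': [50, 70, 90, 80, 100, 120],
                   'rating': ['B', 'B', 'A', 'B', 'A', 'A']})

mean_sales = df.groupby('rating')['sales'].mean()

# A pie chart normalizes its values by their sum
share = mean_sales / mean_sales.sum() * 100
print(share.round(2))   # A -> 60.78, B -> 39.22
```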
Subplots
• Create multiple plots in one figure
• Use the plt.subplot() function to select the position of each plot.
• It takes 3 parameters:
• number of rows
• number of columns
• current index
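How the index maps to a position is easiest to see by labelling each cell. In this sketch, a 2 × 3 grid is filled in index order; the grid size is an arbitrary choice for the illustration.

```python
import matplotlib.pyplot as plt

rows, cols = 2, 3
fig = plt.figure(figsize=(9, 5))

# Indices start at 1 and run left to right, then top to bottom
for index in range(1, rows * cols + 1):
    ax = plt.subplot(rows, cols, index)
    ax.text(0.5, 0.5, f'index {index}', ha='center', va='center')
    ax.set_xticks([])
    ax.set_yticks([])

plt.tight_layout()
plt.show()
```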
Subplots
• Create a subplot with 4 plots
plt.figure(figsize=(15,10))
plt.subplot(2,2,1)
plt.boxplot(df['profit'])
plt.subplot(2,2,2)
df.groupby('rating')['sales'].mean().plot(kind='pie', autopct="%3.2f%%", explode=[0.2, 0])
plt.subplot(2,2,3)
plt.scatter(df['sales'],df['profit'],c='g')
plt.xlabel('Sales')
plt.ylabel('Profit')
plt.subplot(2,2,4)
df.groupby('rating')['profit'].mean().plot(kind='barh',color='red')
plt.suptitle("Combined Chart")
plt.show()
Saving Plots
plt.boxplot(df['profit'])
plt.savefig('chart1.jpg')
The box plot will be saved to the local disk, in the current working
directory, with the name chart1.jpg
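A slightly fuller sketch of saving a figure is shown below. The file name `chart1.png`, the resolution, and the tight bounding box are choices made for this example; the output format is inferred from the file extension. In a script it is generally safer to save before calling `plt.show()`, because the figure may be closed once the window is dismissed.

```python
import matplotlib.pyplot as plt

profit = [20, 24, 30, 15, 35, 50]

fig, ax = plt.subplots()
ax.boxplot(profit)
ax.set_title('Profit')

# dpi controls the resolution; bbox_inches='tight' trims surrounding white space
fig.savefig('chart1.png', dpi=150, bbox_inches='tight')
plt.close(fig)   # release the figure once it has been written
```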
Seaborn
• Seaborn is a library for making statistical plots using Python.
• It builds on top of matplotlib and integrates closely
with pandas
• Import the library before using it
import seaborn as sns
Distribution Plot
• Used for visualizing the
distribution of the data; it
supports histograms and kernel
density estimation (KDE)
sns.displot(data=df, x="profit", kde=True)
Pair Plot
Shows joint and marginal
distributions for all pairwise
relationships and for each
variable, respectively.
sns.pairplot(data=df, hue="rating")
Heat Map
• It is a graphical representation of
data that uses colors to visualize
the values of a matrix.
• The color bar shows which colors
correspond to which range of values.
Following heat map shows the values for both
‘sales’ and ‘profit’ columns
sns.heatmap(data = df.iloc[:,1:-1],annot=True)
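Two small refinements may make this heat map easier to read; both are suggestions rather than part of the original example. Using the year as the row index labels each row with its year, and `fmt='d'` prints the annotations as whole numbers, since seaborn's default format (`'.2g'`) would show a value such as 100 as `1e+02`.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({'year': [2010, 2011, 2012, 2013, 2014, 2015],
                   'sales': [50, 70, 90, 80, 100, 120],
                   'profit': [20, 24, 30, 15, 35, 50]})

# Rows are labelled by year; columns are the two numeric measures
values = df.set_index('year')[['sales', 'profit']]

# fmt='d' formats the integer annotations without scientific notation
ax = sns.heatmap(values, annot=True, fmt='d', cmap='YlOrRd')
ax.set_title('Sales and profit by year')
plt.show()
```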
Visualizing Maps using Folium Library
• Folium is one of the best libraries in Python for visualizing
geospatial data.
• Install the library using the command
!pip install folium
And import the library as
import folium
Creating a map and adding markers
muscat = [23.5880, 58.3829]
nizwa = [22.9171, 57.5363]
salalah = [17.0194, 54.1108]
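# If the "Stamen Terrain" tile set is unavailable in your Folium version,
# omit the tiles argument to use the default OpenStreetMap tiles.
m = folium.Map(muscat, zoom_start=5, tiles="Stamen Terrain")
folium.Marker(muscat, popup="Muscat City").add_to(m)
folium.Marker(nizwa, tooltip="Nizwa").add_to(m)
folium.CircleMarker(salalah, radius=40, popup="Salalah").add_to(m)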
m
Choropleth Maps
This code uses a DataFrame “oman” (not shown here; it needs
the columns 'Region' and 'count') and the location “muscat”
defined earlier to create the choropleth map
om = 'https://raw.githubusercontent.com/codeforamerica/click_that_hood/master/public/data/oman.geojson'
m1 = folium.Map(muscat,zoom_start=6)
folium.Choropleth(geo_data=om,
data = oman,
columns =['Region','count'],
key_on = 'feature.properties.name',
fill_color='YlOrRd',highlight=True).add_to(m1)
m1
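A choropleth only colors a region when a value in the first entry of `columns` matches the property named in `key_on`. The sketch below shows one way to check for mismatches before drawing. Everything in it is hypothetical: the tiny `geo` dictionary stands in for the GeoJSON file, and the region names and counts in `oman` are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the GeoJSON: only the parts used for matching
geo = {'type': 'FeatureCollection',
       'features': [
           {'type': 'Feature', 'properties': {'name': 'Muscat'}, 'geometry': None},
           {'type': 'Feature', 'properties': {'name': 'Dhofar'}, 'geometry': None}]}

# Hypothetical data frame with the two columns passed to columns=
oman = pd.DataFrame({'Region': ['Muscat', 'Dhofar', 'Musandam'],
                     'count': [12, 7, 3]})   # made-up counts

# key_on='feature.properties.name' means matching is done on this property
geo_names = {feature['properties']['name'] for feature in geo['features']}

# Regions in the data that have no shape to color
unmatched = sorted(set(oman['Region']) - geo_names)
print(unmatched)   # ['Musandam']
```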
Visualizing Image Datasets
• Visualizing image dataset from
sklearn library using matplotlib
• Display 10 random images
from sklearn.datasets import fetch_olivetti_faces
import matplotlib.pyplot as plt
dataset = fetch_olivetti_faces(shuffle=True, random_state=10)
for k in range(10):
    plt.subplot(2,5,k+1)
    plt.imshow(dataset.data[k].reshape(64,64))
    plt.title('person '+str(dataset.target[k]))
    plt.axis('off')
plt.show()
Visualizing Image Datasets
• Display 10 digits as images
from sklearn.datasets import load_digits
digits = load_digits()
for number in range(1,11):
    plt.subplot(3, 4, number)
    plt.imshow(digits.images[number],cmap='binary')
    plt.axis('off')
plt.show()
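A variation on the digits example, sketched below, prints the shape of the image array and titles each image with its label from `digits.target`. The 2 × 5 layout and figure size are choices made for this example.

```python
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt

digits = load_digits()
print(digits.images.shape)   # (1797, 8, 8): 1797 images of 8 x 8 pixels

fig, axes = plt.subplots(2, 5, figsize=(8, 4))

# Pair each Axes with an image and its label; zip stops after 10 Axes
for ax, image, label in zip(axes.ravel(), digits.images, digits.target):
    ax.imshow(image, cmap='binary')
    ax.set_title(f'digit {label}')
    ax.axis('off')

plt.show()
```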
References
• Charles Mahler (2023). 7 Best Practices for Data Visualization.
Available 2023-02-12 at
https://thenewstack.io/7-best-practices-for-data-visualization/
• Matplotlib (n.d.), Visualization with Python. Available 2023-02-12 at
https://matplotlib.org/
• Seaborn (n.d.), seaborn: statistical data visualization. Available
2023-02-12 at https://seaborn.pydata.org/