Data Visualization

CSDS3202
Contents
• Basics of data visualization
• Importance of visualization
• Design principles
• Introduction to data visualization libraries in Python – Matplotlib,
Seaborn
• Generate basic graphs such as bar graphs, histograms, line graphs,
scatter plots
• Generate statistical visualizations of data such as distribution plots, pie
charts, bar charts, heat maps
• Generate visual maps and images
Data visualization
• Data visualization is a fundamental aspect of data science: it
represents data in a graphical format.
• It is the process of creating visual elements such as charts, graphs,
maps, and diagrams to communicate complex information in an easy
and understandable manner.
• The goal of data visualization is to tell a story and present data in a
way that helps the user (data experts and non-experts alike) make sense of
the data and identify patterns, trends, and insights.
Basics of data visualization
• Choosing the right type of
chart or graph
• Designing for clarity and
simplicity
• Using appropriate scales
• Highlighting important
information
Importance of visualization
• Data visualization will help
• simplify complex data and make it more accessible to a wide range of audiences.

• identify hidden patterns and trends in large datasets.

• decision-makers make more informed decisions by finding the insights from the
data.

• enhance data quality by making it easier to spot errors and anomalies in the
data.

• save time by presenting data in a way that is easy to understand and analyze.
Design principles
• Clarity - be clear and easy to understand; avoid clutter
• Simplicity - keep it simple and focused on the most important information
• Consistency - use consistent colors, fonts, and other design elements
throughout the visualization
• Context - provide context for the data by including labels, annotations, and
other relevant information
• Accuracy - ensure that the visualization represents the data accurately and
transparently
• Functionality - should be functional and interactive with features such as
zooming, filtering, and sorting.
• Aesthetics - should be visually appealing and engaging with pleasing colors,
fonts, and other design elements
Data visualization libraries
• Popular Python libraries for visualization include:
1. matplotlib
2. seaborn
3. bokeh
4. altair
• This chapter focuses mainly on Matplotlib and Seaborn.
Why matplotlib?
• Matplotlib produces publication-quality figures in a variety of
formats
• Supports interactive environments across Python platforms.
• Pandas comes equipped with useful wrappers around several
matplotlib plotting routines
• Quick and handy plotting of Series and DataFrame objects.
• Before using Matplotlib, you need to import the library into
your Python script or notebook
import matplotlib.pyplot as plt
Dataset used
• The following DataFrame ‘df’ is used to create the plots in this chapter

import pandas as pd
import matplotlib.pyplot as plt

# Yearly sales and profit figures, with a rating (A or B) for each year
dic = {'year': [2010, 2011, 2012, 2013, 2014, 2015],
       'sales': [50, 70, 90, 80, 100, 120],
       'profit': [20, 24, 30, 15, 35, 50],
       'rating': ['B', 'B', 'A', 'B', 'A', 'A']}
df = pd.DataFrame(dic)
Line Plot
• Create a line plot to show the sales
and profit for all years
plt.plot(df['year'], df['sales'], label='Sales', linestyle='-', marker='>')
plt.plot(df['year'], df['profit'], label='Profit', linestyle='--', color='r')

plt.xlabel('Year')
plt.ylabel('Amount')
plt.title('Sales and Profit')
plt.legend()
plt.show()
Line Plot - changing limits, ticks and figure size
Try these on the line plot:
• plt.xlim(low, high)
• plt.ylim(low, high)
• plt.xticks([list of points])
• plt.yticks([list of points])
• plt.figure(figsize=(width, height)) - call this before plotting
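A minimal sketch of how these calls fit together on the sales and profit line plot. The limits, tick positions, and figure size below are illustrative assumptions, not values from the slides.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'year': [2010, 2011, 2012, 2013, 2014, 2015],
                   'sales': [50, 70, 90, 80, 100, 120],
                   'profit': [20, 24, 30, 15, 35, 50]})

# Set the figure size first: plt.figure() starts a new, empty figure
fig = plt.figure(figsize=(8, 4))
plt.plot(df['year'], df['sales'], label='Sales', marker='>')
plt.plot(df['year'], df['profit'], label='Profit', linestyle='--', color='r')

# Illustrative limits and ticks (assumed values)
plt.xlim(2009, 2016)
plt.ylim(0, 140)
plt.xticks(list(df['year']))      # one tick per year
plt.yticks(range(0, 141, 20))     # one tick every 20 units
plt.legend()

ax = plt.gca()                    # the current axes, useful for inspection
```

When running this as a script, add plt.show() at the end to display the figure.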
Scatter Plot
• Used to observe the relationship between
two numeric variables
• A scatter plot helps to identify
patterns, trends, clusters, outliers
and anomalies in data.

plt.scatter(df['sales'],df['profit'],c='g')
plt.xlabel('Sales')
plt.ylabel('Profit')
plt.show()
Histogram
• Used to represent the frequency of
occurrence of ranges of values in
a dataset using bars of different
heights.
• Shows the distributional
features (peaks, outliers, skewness)
of a variable

plt.hist(df['sales'], bins=4)
plt.show()
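To see what bins=4 does to the sales values, the bin boundaries and counts can be computed directly. This is a small sketch using np.histogram, which plt.hist relies on for its binning; the variable names are illustrative.

```python
import numpy as np

sales = [50, 70, 90, 80, 100, 120]

# Split the range from min(sales) to max(sales) into 4 equal-width bins
counts, edges = np.histogram(sales, bins=4)

print(edges)    # bin boundaries
print(counts)   # number of values falling in each bin
```

Each bin is 17.5 units wide (the range 50 to 120 divided by 4), and the last bin includes its right edge, so the value 120 is counted in the fourth bar.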
Bar Plot - Vertical
• Used to represent data associated
with a categorical variable.
• Used to compare the values of
different categories or groups

plt.bar(df['rating'],df['profit'])
plt.show()

Note: Each rating occurs several times, so its bars are drawn on top of
one another and only the highest profit value for each rating is visible
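The visible bar heights can be checked by aggregating the data explicitly. The sketch below is one way to do that, under the assumption that the maximum per rating is what you want to show; plotting the aggregated values makes that choice explicit instead of relying on overlapping bars.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'profit': [20, 24, 30, 15, 35, 50],
                   'rating': ['B', 'B', 'A', 'B', 'A', 'A']})

# Highest profit for each rating: this is the bar height that stays visible
tallest = df.groupby('rating')['profit'].max()
print(tallest)

# One bar per rating, drawn from the aggregated values
bars = plt.bar(tallest.index, tallest.values)
```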
Bar Plot - Vertical
• Plot the median value of
the profit column for each rating

df.groupby('rating')['profit'].median().plot(kind='bar')
plt.show()
Bar Plot - Horizontal
• Plot the mean value of
the profit column for each rating

df.groupby('rating')['profit'].mean().plot(kind='barh', color='red')
plt.show()
Box Plot
• A box plot is a graphical representation
of the distribution of a dataset.
It displays the median,
quartiles, and outliers of the
data.

plt.boxplot(df['profit'])
plt.show()
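The quantities a box plot draws can be computed by hand. This sketch assumes Matplotlib's default whisker rule, which flags points lying more than 1.5 times the interquartile range beyond the quartiles as outliers; the variable names are illustrative.

```python
import numpy as np

profit = np.array([20, 24, 30, 15, 35, 50])

# The box spans the first to third quartile; the line inside is the median
q1, median, q3 = np.percentile(profit, [25, 50, 75])
iqr = q3 - q1

# Points outside these fences are drawn as individual outlier markers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = profit[(profit < lower_fence) | (profit > upper_fence)]

print(q1, median, q3)
print(outliers)
```

For the profit column the quartiles are 21.0, 27.0 and 33.75, and no value falls outside the fences, so the plot shows no outlier markers.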
Pie Chart
• A pie chart is a circular statistical
chart divided into slices to show
numerical proportions.
• Each slice of the pie chart
represents a category or value, and
the size of each slice corresponds to
its percentage of the whole.
df.groupby('rating')['sales'].mean().plot(kind='pie', autopct="%3.2f%%", explode=[0.2, 0])
plt.show()
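The percentages printed on the slices can be reproduced by hand. The sketch below computes each rating's share of the total and formats it with the same format string that is passed to autopct; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'sales': [50, 70, 90, 80, 100, 120],
                   'rating': ['B', 'B', 'A', 'B', 'A', 'A']})

# Mean sales per rating, then each mean as a percentage of the total
mean_sales = df.groupby('rating')['sales'].mean()
share = mean_sales / mean_sales.sum() * 100

# Same format string as autopct: two decimals followed by a percent sign
labels = ["%3.2f%%" % pct for pct in share]
print(labels)   # ['60.78%', '39.22%']
```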
Subplots
• Create multiple plots in one figure
• Use the plt.subplot() function to place several plots in one figure.
• It takes 3 parameters:
• number of rows
• number of columns
• current index (starting at 1, counted row by row)
Subplots
• Create a subplot with 4 plots
plt.figure(figsize=(15,10))
plt.subplot(2,2,1)
plt.boxplot(df['profit'])
plt.subplot(2,2,2)
df.groupby('rating')['sales'].mean().plot(kind='pie', autopct="%3.2f%%", explode=[0.2, 0])
plt.subplot(2,2,3)
plt.scatter(df['sales'],df['profit'],c='g')
plt.xlabel('Sales')
plt.ylabel('Profit')
plt.subplot(2,2,4)
df.groupby('rating')['profit'].mean().plot(kind='barh',color='red')
plt.suptitle("Combined Chart")
plt.show()
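As an alternative to repeated plt.subplot() calls, the same 2 x 2 grid can be built with plt.subplots(), which returns the figure and an array of axes. This is a sketch of that different style, not the approach used in the slides.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'sales': [50, 70, 90, 80, 100, 120],
                   'profit': [20, 24, 30, 15, 35, 50],
                   'rating': ['B', 'B', 'A', 'B', 'A', 'A']})

# One call creates the figure and all four axes; axes[row, column]
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

axes[0, 0].boxplot(df['profit'])
df.groupby('rating')['sales'].mean().plot(
    kind='pie', autopct="%3.2f%%", explode=[0.2, 0], ax=axes[0, 1])
axes[1, 0].scatter(df['sales'], df['profit'], c='g')
axes[1, 0].set_xlabel('Sales')
axes[1, 0].set_ylabel('Profit')
df.groupby('rating')['profit'].mean().plot(
    kind='barh', color='red', ax=axes[1, 1])

fig.suptitle("Combined Chart")
```

Passing ax=... to the pandas plot call tells it which panel to draw into, so the order of the statements no longer matters.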
Saving Plots
plt.boxplot(df['profit'])
plt.savefig('chart1.jpg')

The box plot is saved in the current working directory under the name
chart1.jpg
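A slightly fuller sketch of saving a plot. The file name, resolution, and figure size are assumptions chosen for illustration; Matplotlib infers the output format from the file extension.

```python
from pathlib import Path
import matplotlib.pyplot as plt

profit = [20, 24, 30, 15, 35, 50]

plt.figure(figsize=(6, 4))
plt.boxplot(profit)
plt.title('Profit')

out_file = Path('chart1.png')     # hypothetical file name
# dpi sets the resolution; bbox_inches='tight' trims surrounding whitespace
plt.savefig(out_file, dpi=150, bbox_inches='tight')
plt.close()                       # release the figure once it is saved

print(out_file.resolve())         # full path of the saved file
```

In a script, call plt.savefig() before plt.show(): once the plot window is closed, the figure may no longer be available and the saved file can come out blank.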
Seaborn
• Seaborn is a library for making statistical plots using Python.
• It builds on top of matplotlib and integrates closely
with pandas
• Import the library before using it

import seaborn as sns


Distribution Plot
Used for visualizing the
distribution of the data; it combines
a histogram with a kernel
density estimate (kde=True)

sns.displot(data=df, x="profit", kde=True)
Pair Plot
Shows joint and marginal
distributions for all pairwise
relationships and for each
variable, respectively.

sns.pairplot(data=df, hue="rating")
Heat Map
• A heat map is a graphical representation of
data that uses colors to visualize the
values of a matrix.
• The color scale shows which range of
values each color represents.

The following heat map shows the values of the
‘sales’ and ‘profit’ columns

sns.heatmap(data=df.iloc[:, 1:-1], annot=True)
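A variation on the heat map above, sketched under the assumption that labeling the rows by year is useful: the year column becomes the index, so each row of the heat map is labeled with its year instead of its row number. The color map and title are illustrative choices.

```python
import pandas as pd
import seaborn as sns

df = pd.DataFrame({'year': [2010, 2011, 2012, 2013, 2014, 2015],
                   'sales': [50, 70, 90, 80, 100, 120],
                   'profit': [20, 24, 30, 15, 35, 50]})

# Rows are labeled by year, columns are the two numeric measures
values = df.set_index('year')[['sales', 'profit']]

# annot=True writes the value in each cell; fmt='d' prints it as an integer
ax = sns.heatmap(values, annot=True, fmt='d', cmap='YlOrRd')
ax.set_title('Sales and profit by year')
```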
Visualizing Maps using Folium Library
• Folium is a popular Python library for visualizing
geospatial data on interactive maps.
• Install the library using the command (in a notebook)

!pip install folium

Then import the library:

import folium
Creating a map and adding markers
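# Coordinates are given as [latitude, longitude]
muscat = [23.5880, 58.3829]
nizwa = [22.9171, 57.5363]
salalah = [17.0194, 54.1108]
# Newer Folium releases may not bundle the Stamen tiles; if this call fails,
# drop the tiles argument to use the default OpenStreetMap tiles
m = folium.Map(muscat, zoom_start=5, tiles="Stamen Terrain")
folium.Marker(muscat, popup="Muscat City").add_to(m)
folium.Marker(nizwa, tooltip="Nizwa").add_to(m)
folium.CircleMarker(salalah, radius=40, popup="Salalah").add_to(m)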
m
Choropleth Maps
This code uses a DataFrame “oman” (not shown here) with the columns
'Region' and 'count' to create the choropleth map. The 'Region' values
must match the 'name' property of the regions in the GeoJSON file.
om = 'https://raw.githubusercontent.com/codeforamerica/click_that_hood/master/public/data/oman.geojson'
m1 = folium.Map(muscat, zoom_start=6)
folium.Choropleth(geo_data=om,
                  data=oman,
                  columns=['Region', 'count'],
                  key_on='feature.properties.name',
                  fill_color='YlOrRd', highlight=True).add_to(m1)
m1
Visualizing Image Datasets
• Visualizing an image dataset from
the sklearn library using matplotlib
• Display 10 random images
from sklearn.datasets import fetch_olivetti_faces
import matplotlib.pyplot as plt
dataset = fetch_olivetti_faces(shuffle=True,
                               random_state=10)

for k in range(10):
    plt.subplot(2, 5, k+1)
    # Each face is stored as a flat row of 4096 values: reshape to 64 x 64
    plt.imshow(dataset.data[k].reshape(64, 64))
    plt.title('person ' + str(dataset.target[k]))
    plt.axis('off')
plt.show()
Visualizing Image Datasets
• Display 10 digits as images
from sklearn.datasets import load_digits
digits = load_digits()
for number in range(1, 11):
    plt.subplot(3, 4, number)
    plt.imshow(digits.images[number], cmap='binary')
    plt.axis('off')
plt.show()
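A variation that also shows the label of each image, sketched with plt.subplots() instead of repeated plt.subplot() calls. It displays the first 10 images of the dataset; the title text and figure size are illustrative choices.

```python
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt

digits = load_digits()            # 8 x 8 grayscale images of handwritten digits

fig, axes = plt.subplots(2, 5, figsize=(10, 4))

# zip stops after the 10 axes, so only the first 10 images are drawn
for ax, image, label in zip(axes.ravel(), digits.images, digits.target):
    ax.imshow(image, cmap='binary')
    ax.set_title('digit ' + str(label))
    ax.axis('off')
```

When running this as a script, add plt.show() at the end to display the figure.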
