DATA MINING
Sections Revision Part 2
− Data Visualization
Data Visualization
• Data visualization is the presentation of data in graphical format.
• It helps people understand the significance of data by summarizing and
presenting large amounts of data in a simple, easy-to-understand format, and
it helps communicate information clearly and effectively.
Plots
− Histogram
− Box Plot
− Bar Plot
− Column Chart
− Pie Chart
− Scatter Plot
− Line Chart
− Violin Plot
− Density Plot
− WordCloud
− Heat Map
HISTOGRAM
• A histogram is a graphical
representation of the distribution of a numeric
variable. It takes only numeric variables as
input.
• The variable is cut into several bins, and the
number of observations per bin is represented by
the height of the bar.
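The binning step described above can be sketched in plain Python. This is only an illustration under stated assumptions, not how matplotlib implements it: the helper name bin_counts and the sample values are hypothetical, the bins are equal-width between the minimum and maximum, and the maximum value is placed in the last bin.

```python
def bin_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins  # assumes the values are not all identical
    counts = [0] * bins
    for v in values:
        # The maximum value belongs to the last bin (right edge inclusive).
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


sales = [10, 12, 15, 21, 22, 25, 28, 30, 41, 50]  # hypothetical sample data
counts = bin_counts(sales, bins=4)
print(counts)  # → [3, 4, 1, 2]; each number is the height of one bar
```

Changing the number of bins changes the shape of the histogram, so it is worth trying a few values.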
MADE BY π P a g e |2
Implementation using pandas and matplotlib (df is assumed to be an existing pandas DataFrame):
− import pandas as pd
− import matplotlib.pyplot as plt
Create a histogram for every numeric column
− df.hist()
Show the plot
− plt.show()
Implementation using seaborn:
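• Seaborn is a graphics library built on top of Matplotlib.
• It makes charts look more polished and covers many common data
visualization needs.
Install it from the command line:
− pip install seaborn
Then, in Python (sns.distplot is deprecated since seaborn 0.11; sns.histplot is its replacement for histograms):
− import seaborn as sns
− sns.histplot(df["Sales"], bins=20)
− plt.show()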
Boxplot
• A box plot is one of the most common types of graphic. It gives a nice summary of
one or several numeric variables. The line that
divides the box into 2 parts represents
the median of the data.
• The ends of the box show the upper and
lower quartiles.
• The extreme lines (whiskers) show the highest and lowest
values, excluding outliers.
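The numbers a box plot draws can be computed by hand. The sketch below is a minimal illustration, assuming the common 1.5 × IQR rule for deciding what counts as an outlier (this is matplotlib's default whisker setting, but other tools may use a different rule). The helper name box_summary and the sample values are hypothetical.

```python
import statistics


def box_summary(values):
    """Return the numbers a box plot draws, using the 1.5 * IQR whisker rule."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,                     # lower end of the box
        "median": median,             # line dividing the box
        "q3": q3,                     # upper end of the box
        "whisker_low": min(inside),   # lowest value that is not an outlier
        "whisker_high": max(inside),  # highest value that is not an outlier
        "outliers": [v for v in values if v < low_fence or v > high_fence],
    }


income = [20, 22, 25, 27, 30, 32, 35, 38, 90]  # hypothetical sample data
summary = box_summary(income)
print(summary)
```

In this sample the value 90 lies above the upper fence, so it would be drawn as a separate point rather than extending the whisker.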
Box plot for each numeric attribute of the DataFrame
− df.plot.box()
− plt.show()
individual attribute box plot
− plt.boxplot(df['Income'])
− plt.show()
Barplot
• A barplot (or barchart) is one of the most common types of graphic.
• It shows the relationship between a numeric and a categoric variable.
• Each entity of the categoric variable is represented as a bar.
• The size of the bar represents its numeric value.
Make a fake dataset:
− frequency = [3, 12]
− bars = ('Male', 'Female')
Create vertical bars
− plt.bar(bars, frequency)
Create names on the x-axis
− plt.xticks(bars)
Show graphic
− plt.show()
Create horizontal bars
− plt.barh(bars, frequency)
Create names on the y-axis
− plt.yticks(bars)
Show graphic
− plt.show()
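In practice the bar heights usually come from counting a categorical column rather than being typed in by hand. A minimal sketch, assuming a hypothetical list of raw observations called gender, shows how the labels and heights passed to plt.bar could be derived:

```python
from collections import Counter

# Hypothetical raw observations of a categorical variable.
gender = ["Male"] * 3 + ["Female"] * 12

counts = Counter(gender)            # category -> number of observations
bars = tuple(counts.keys())         # category labels, one per bar
frequency = list(counts.values())   # bar heights
print(bars, frequency)  # → ('Male', 'Female') [3, 12]
```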
Column Chart
• A column chart is used to show a
comparison among different
attributes, or it can show a
comparison of items over time.
− df.plot.bar()
− plt.show()
Pie Chart
• A pie chart shows how categories make up parts of a whole, that is, the
composition of something at a single point in time.
• A pie chart represents numbers as
percentages, and all
segments must sum to 100%.
− plt.pie(df['Income'], labels=df['EMPID'], autopct='%1.2f%%')
− plt.show()
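The percentages printed on the slices are simply each value divided by the total. A minimal sketch of that arithmetic follows; the EMPID keys and income values are hypothetical.

```python
income = {"E01": 3000, "E02": 4500, "E03": 2500}  # hypothetical EMPID -> Income

total = sum(income.values())
# Each slice's share of the whole, in percent.
shares = {emp: 100 * value / total for emp, value in income.items()}

for emp, pct in shares.items():
    print(f"{emp}: {pct:.2f}%")
```

Because every share is computed against the same total, the slices always add up to 100%.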
Scatter Chart
• A scatter chart shows the relationship between
two different variables, and it can reveal
trends in how the points are distributed.
• It should be used when there are many different
data points, and you want to highlight similarities
in the data set.
• This is useful when looking for outliers and for understanding the distribution
of your data.
scatter plot between sales and age
− plt.scatter(df['Age'], df['Sales'])
− plt.show()
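A scatter plot shows a relationship visually; one common way to put a number on a linear relationship is the Pearson correlation coefficient. The sketch below computes it by hand as an illustration only. The helper name pearson_r and the sample values are hypothetical, and the data were chosen to be perfectly linear.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: +1 perfect increasing line, -1 perfect decreasing line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


age = [22, 25, 30, 35, 40]          # hypothetical sample data
sales = [110, 125, 150, 175, 200]   # exactly 5 * age, so the points lie on a line
r = pearson_r(age, sales)
print(round(r, 3))
```

A value near 0 does not mean there is no relationship, only that there is no linear one, which is one reason to look at the plot as well.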
Line Chart (or Line Graph)
• A line chart or line graph is a type of chart which displays information as a
series of data points called ‘markers’ connected
by straight line segments.
• A line chart is often used to visualize a trend in
data over intervals of time.
− plt.plot( 'Age','Sales', data=df[['Age','Sales']],
color='skyblue', alpha=0.3 , linestyle='--' ,
linewidth=5)
− plt.show()
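Because a line chart connects the points in the order they are given, unsorted x values produce a zigzag instead of a trend. A minimal sketch of sorting the pairs first, using hypothetical Age and Sales values:

```python
age = [40, 22, 35, 25, 30]          # hypothetical, not in order
sales = [200, 110, 175, 125, 150]

# Sort the (age, sales) pairs by age so the segments run left to right.
pairs = sorted(zip(age, sales))
age_sorted = [a for a, _ in pairs]
sales_sorted = [s for _, s in pairs]
print(age_sorted, sales_sorted)
```

With a DataFrame, the same effect can be had by sorting on the x column before plotting.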
Violin Plot
• A violin plot can be used to display the
distribution of the data and its probability
density.
• Furthermore, we get a visualization of the
median of the data (the white dot in the center of the
inner box plot).
− sns.violinplot(x="vs", y='wt', data=df0)
Density Plot
• A density plot shows the distribution of a
numerical variable. It takes only a set of numeric
values as input. It is closely related to a histogram: instead of counting
observations per bin, it draws a smoothed estimate of the distribution.
− sns.kdeplot(df['Sales'])
− plt.show()
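The smoothed curve of a density plot is typically a kernel density estimate: a small bell curve is placed on every observation and the curves are averaged. The sketch below illustrates that idea with a Gaussian kernel. It is not seaborn's implementation; the helper name gaussian_kde, the sample values, and the hand-picked bandwidth of 5.0 are assumptions (seaborn chooses a bandwidth automatically).

```python
import math


def gaussian_kde(values, bandwidth):
    """Return a function that estimates the density at any point x."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum of Gaussian bumps centred on each observation.
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density


sales = [10, 12, 15, 21, 22, 25, 28, 30, 41, 50]  # hypothetical sample data
density = gaussian_kde(sales, bandwidth=5.0)       # bandwidth picked by hand

# Numerically integrate the curve from -40 to 140 in steps of 0.5.
grid = [x * 0.5 for x in range(-80, 281)]
area = sum(density(x) for x in grid) * 0.5
print(round(area, 3))  # the area under a density curve is approximately 1
```

A smaller bandwidth gives a bumpier curve and a larger one gives a smoother curve, much as the number of bins does for a histogram.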
Wordcloud (or Tag Cloud)
• A wordcloud is a visual representation of text data.
• It displays a list of words, with the importance of each
shown by font size or color.
• This format is useful for quickly perceiving the most
prominent terms.
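− pip install wordcloud
− from wordcloud import WordCloud
Create the wordcloud object (text is assumed to be a string holding the input text)
− wordcloud = WordCloud(width=480, height=480, margin=0).generate(text)
Display the generated image
− plt.imshow(wordcloud, interpolation='bilinear')
− plt.axis('off')
− plt.show()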
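The idea behind a wordcloud, word importance mapped to font size, can be sketched with word counts. This is only an illustration of the principle: the wordcloud package uses its own sizing and also filters common stopwords. The sample text and the 10 to 40 point size range are hypothetical choices.

```python
import re
from collections import Counter

# Hypothetical input text.
text = "data mining finds patterns in data and data mining needs clean data"

words = re.findall(r"[a-z]+", text.lower())
freq = Counter(words)                 # word -> number of occurrences
max_count = max(freq.values())

# Scale font size linearly so the most frequent word gets 40 points.
font_size = {w: 10 + 30 * c / max_count for w, c in freq.items()}
print(freq.most_common(2))  # → [('data', 4), ('mining', 2)]
```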
Heat Map (or Heatmap)
• A heat map (or heatmap) is a graphical
representation of data where the individual
values contained in a matrix are represented as
colors.
− plt.figure(figsize=(12, 8))
− sns.heatmap(df[['Age','Income', 'Sales']])
− plt.show()
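When columns such as Age, Income and Sales are on very different scales, the largest column tends to dominate the color range. One possible remedy, sketched here as an assumption rather than something the heat map call does for you, is to rescale each column to the range 0 to 1 before mapping values to colors. The helper name min_max_columns and the sample rows are hypothetical.

```python
def min_max_columns(rows):
    """Rescale every column of a row-major matrix to the range [0, 1]."""
    scaled_cols = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        span = (hi - lo) or 1  # avoid division by zero for constant columns
        scaled_cols.append([(v - lo) / span for v in col])
    # Transpose back to rows.
    return [list(row) for row in zip(*scaled_cols)]


rows = [  # hypothetical Age, Income, Sales values
    [25, 3000, 110],
    [40, 5200, 200],
    [31, 4100, 150],
]
scaled = min_max_columns(rows)

# Map each scaled value to a grey level between 0 (dark) and 255 (light).
grey = [[round(255 * v) for v in row] for row in scaled]
print(scaled[2])
```

After rescaling, each cell's color reflects its position within its own column rather than its raw magnitude.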