Data Visualization I
MSBA7001 Business Intelligence and Analytics
HKU Business School
The University of Hong Kong
Instructor: Dr. DING Chao
Agenda
• Tableau
• Connecting to Data
• UI Overview
• Basic Charts
• Matplotlib
Tableau
Why Tableau?
• Tableau is an effective tool for creating interactive data
visualizations quickly
• It is simple and user-friendly
• Moreover, users can perform basic calculations and run
some simple stats in Tableau itself
• More than 50,000 customer accounts and growing.
Connecting to Data
Connect to Data
• The data file we are going to use is “Global Superstore.xls”
• The data is provided by Tableau
• It was created to train Tableau users on Tableau tactics, data
visualization strategy, and design
• We are going to use this dataset throughout the training
sessions
Connect to Data
• To connect to the Global Superstore file, click on Microsoft
Excel, navigate to where you saved the file and click open
Connecting to Tableau
• Now Tableau brings us to the data connection window.
• Here we can see the name of the file – and here we can click
to rename the connection if desired
• Drag any data sheet to canvas to load data
Joining Multiple Sheets
• Drag two sheets to canvas to join them by the common field
(primary key)
• This is similar to the JOIN and keys in SQL
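For readers who have not used SQL joins, the idea can be sketched with Python's built-in sqlite3 module. The two tables, their columns, and their rows below are made-up toy data (not the actual Global Superstore sheets); order_id plays the role of the common field.

```python
import sqlite3

# In-memory database with two toy tables that share the key order_id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, sales REAL)")
conn.execute("CREATE TABLE returned_orders (order_id TEXT, returned TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("A-1", 120.0), ("A-2", 80.5), ("A-3", 42.0)])
conn.executemany("INSERT INTO returned_orders VALUES (?, ?)", [("A-2", "Yes")])

# LEFT JOIN keeps every order and attaches the return flag where the keys match.
rows = conn.execute(
    "SELECT o.order_id, o.sales, r.returned "
    "FROM orders AS o LEFT JOIN returned_orders AS r "
    "ON o.order_id = r.order_id "
    "ORDER BY o.order_id"
).fetchall()
for row in rows:
    print(row)
conn.close()
```

Orders without a matching row in the second table get an empty (None) value for the joined column, which is what a left join is expected to do.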
Data Types
• If our column names aren’t ideal, we can click on the drop
down arrow to the right of the name and select rename.
• Clicking on the data type icon allows us to change the
default data type for that column.
Live versus Extract
• Connecting live leaves the data in the database or file
• This is best when we want to leverage a high performance
database’s capabilities, or to get up-to-the-second changes
in data visualized in Tableau
• The other option is to extract the data into Tableau's high
performance in-memory data engine.
• This can help when connecting to a slow database or to take
query load off critical systems.
Tableau UI Overview
Menus & Toolbar
• New sheet tabs are found at the bottom. We can create
sheets, dashboards, and stories with these tabs. We can also
do things like rename the sheets, drag to rearrange them,
duplicate sheets, copy formatting, and many other things.
• Sheets are where we build visualizations.
• We will call visualizations we create viz from now on.
Menus & Toolbar
• At the top, we have the menus, which contain a lot of
powerful controls
• Below is the toolbar, with buttons like undo – there is no
limit to how much you can undo, and this is a very
important button that allows you to explore
• Here we also have save – there’s no automatic save in
Tableau, so make sure to save your work periodically.
Data Pane
• On the left of the screen is
the data pane. The fields
(also called pills) from that
data source are listed
below, broken out into
dimensions and measures.
• Different fields can be
grouped together in a
folder or created as a
hierarchy (e.g., location,
categories)
Dimensions & Measures
• Dimensions contain qualitative values (such as names,
dates, or geographical data). You can use dimensions to
categorize, segment, and reveal the details in your data.
Dimensions affect the level of detail in the view.
• Think of them as the things you group by or drill down by.
Dimensions are usually (but not always) categorical fields
such as Order Priority and City.
• The dimensions we use to build the view determine how
many colors we have: Order Priority has 4 categories,
so it would give us 4 colors.
Dimensions & Measures
• Measures contain numeric, quantitative values that you can
measure.
• Measures can be aggregated. When you drag a measure
into the view, Tableau applies an aggregation to that
measure (by default).
• Think of them as the data elements that you want to
perform calculations on.
• Dimensions come out onto the view as themselves
• Measures come out onto the view as aggregates
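The "group by a dimension, aggregate a measure" idea can be sketched in plain Python. The rows below are made-up toy data, and SUM is used because it is the aggregation Tableau usually applies to a measure by default.

```python
from collections import defaultdict

# Toy rows: "region" acts as a dimension, "sales" as a measure (made-up numbers).
rows = [
    {"region": "East", "sales": 100.0},
    {"region": "West", "sales": 250.0},
    {"region": "East", "sales": 50.0},
]

# Group by the dimension and aggregate the measure with SUM.
total_sales = defaultdict(float)
for row in rows:
    total_sales[row["region"]] += row["sales"]

print(dict(total_sales))
```

The dimension appears "as itself" (one entry per region), while the measure appears only as an aggregate.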
Continuous vs Discrete
• Tableau represents data differently in the view depending
on whether the field is discrete (blue), or continuous
(green).
• Continuous and discrete are mathematical terms.
• Continuous means "forming an unbroken whole, without
interruption"; discrete means "individually separate and
distinct."
Continuous vs Discrete
• Continuous field values are treated as an infinite range.
Generally, continuous fields add axes to the view.
• Discrete values are treated as finite. Generally, discrete
fields add headers to the view.
• Text and categories are inherently discrete. Numbers can
also be discrete if they can only take one of a limited set of
distinct, separate values. On the other hand, numbers are
continuous if they can take on any value in a range.
Continuous vs Discrete
• This table shows examples of what the different fields look
like in the view.
Continuous vs Discrete
[Figure: example views of continuous and discrete fields]
Shelves & Cards
• Finally, we have the shelves (or cards).
• A view can be built by dragging and dropping fields from the
data pane into the canvas directly, or onto the shelves.
Demo I
Show a table of total sales and profits by region and by
market.
Basic Charts
Types of Charts
• There are so many different types of charts and graphs.
• How do you choose the right chart?
• The following page provides some good suggestions:
https://blog.hubspot.com/marketing/data-visualization-choosing-chart
• Some of the contents in the following slides are from the
above page
Bar Charts
Bar Charts
• A bar chart (or a column chart) is used to show a
comparison among different items, or it can show a
comparison of items over time
• Design Best Practices for Bar Charts:
Use consistent colors throughout the chart, selecting accent
colors to highlight meaningful data points or changes over
time.
Use horizontal labels to improve readability.
Start the y-axis at 0 to appropriately reflect the values in your
graph.
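Outside Tableau, these practices can be sketched with matplotlib (introduced later in these slides). The sales figures are made up for illustration, and the choice of which bar to accent is arbitrary.

```python
import matplotlib.pyplot as plt

# Made-up sales figures for three product categories.
categories = ["Furniture", "Office Supplies", "Technology"]
sales = [410, 380, 470]

# One consistent color, with an accent color on the bar we want to highlight.
colors = ["steelblue", "steelblue", "darkorange"]
bars = plt.bar(categories, sales, color=colors)

plt.ylim(bottom=0)  # start the y-axis at 0
plt.ylabel("Sales")
plt.title("Sales by category (toy data)")
plt.show()
```

The category names on the x axis are drawn horizontally by default, in line with the readability advice above.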
Demo I
Which product category has the worst sales/profits?
Stacked Bar Charts
Stacked Bar Charts
• A stacked bar chart should be used to compare many
different items and show the composition of each item
being compared.
• Design Best Practices for Stacked Bar Charts:
Best used to illustrate part-to-whole relationships.
Use contrasting colors for greater clarity.
Make chart scale large enough to view group sizes in relation
to one another.
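A stacked bar chart can be sketched in matplotlib by drawing a second set of bars on top of the first with the bottom argument. The market names, segment names, and numbers below are placeholders, not values from the course dataset.

```python
import matplotlib.pyplot as plt

# Placeholder data: two segments per market.
markets = ["Market A", "Market B", "Market C"]
consumer = [30, 45, 25]
corporate = [20, 15, 35]

# The second series starts where the first one ends (bottom=consumer).
lower = plt.bar(markets, consumer, color="tab:blue", label="Consumer")
upper = plt.bar(markets, corporate, bottom=consumer,
                color="tab:orange", label="Corporate")

plt.ylabel("Sales")
plt.legend()
plt.show()
```

The two contrasting colors make the part-to-whole split visible within each bar.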
Demo I
Is shipping cost increasing over time?
Scatter Plots
Scatter Plots
• A scatter plot will show the relationship between two
different variables or it can reveal the distribution trends. It
should be used when there are many different data points,
and you want to highlight similarities in the data set.
• This is useful when looking for outliers or for understanding
the distribution of your data.
• Design Best Practices for Scatter Plots:
Include more variables, such as different sizes, to incorporate
more data.
Start y-axis at 0 to represent data accurately.
If you use trend lines, only use a maximum of two to make
your plot easy to understand.
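As a rough matplotlib sketch of these practices, the example below encodes a third variable as marker size. All three lists are made-up numbers, and the variable names are only illustrative.

```python
import matplotlib.pyplot as plt

# Made-up data: two variables to compare, plus a third shown as marker size.
shipping_cost = [5, 12, 20, 33, 41, 58]
sales = [40, 95, 150, 260, 300, 450]
quantity = [2, 5, 3, 8, 4, 6]

# The s argument sets the marker area, so larger quantities get larger dots.
points = plt.scatter(shipping_cost, sales,
                     s=[q * 30 for q in quantity], alpha=0.6)

plt.ylim(bottom=0)  # start the y-axis at 0
plt.xlabel("Shipping cost")
plt.ylabel("Sales")
plt.show()
```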
Demo I
What is the relationship between discount and profit?
Line Charts
Line Charts
• A line chart reveals trends or progress over time and can be
used to show many different categories of data. You should
use it when you chart a continuous data set.
• Line charts are often used for looking at how something
changes over time.
• Design Best Practices for Line Charts:
Use solid lines only.
Don't plot more than four lines to avoid visual distractions.
Demo I
Show monthly sales and forecast sales for the next year.
Dual Axis Charts
Dual Axis Charts
• A dual axis chart allows you to plot data using two y-axes
and a shared x-axis. It's used with three data sets, one of
which is based on a continuous set of data and another
which is better suited to being grouped by category.
• Design Best Practices for Dual Axis Charts:
Use the y-axis on the left side for the primary
variable because brains are naturally inclined to look left first.
Use different graphing styles to illustrate the two data sets, as
illustrated above.
Choose contrasting colors for the two data sets.
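In matplotlib, one way to sketch a dual axis chart is with twinx(), which adds a second y-axis that shares the x-axis. The monthly numbers below are invented for illustration.

```python
import matplotlib.pyplot as plt

# Invented monthly data on two very different scales.
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 150, 170, 160]
profit_ratio = [0.10, 0.12, 0.08, 0.11]

fig, ax_left = plt.subplots()
ax_left.bar(months, sales, color="tab:blue")  # primary variable on the left axis
ax_left.set_ylabel("Sales")

ax_right = ax_left.twinx()                    # second y-axis, same x-axis
ax_right.plot(months, profit_ratio, "ro-")    # different style and contrasting color
ax_right.set_ylabel("Profit ratio")

plt.show()
```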
Demo I
Sales and profits in the same graph?
Histogram
Histogram
• A histogram is a plot that lets you discover, and show, the
underlying frequency distribution of a set of continuous
data.
• This allows the inspection of the data for its underlying
distribution (e.g., normal distribution), outliers, skewness,
etc.
• To construct a histogram from a continuous variable you
first need to split the data into intervals, called bins.
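The binning step can be seen directly in matplotlib's hist function, which returns the count in each bin and the bin edges. The data here are random draws generated only for illustration, and 20 bins is an arbitrary choice.

```python
import random
import matplotlib.pyplot as plt

# Illustrative data: 1,000 random draws from a normal distribution.
random.seed(0)
values = [random.gauss(100, 15) for _ in range(1000)]

# Split the range of the data into 20 equal-width bins and count values per bin.
counts, bin_edges, patches = plt.hist(values, bins=20)

plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()
```

With 20 bins there are 21 bin edges, and the counts add up to the number of data points.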
Demo I
What is the frequency of discounts across different product
categories?
Heat Maps
Heat Maps
• A heat map shows the relationship between two items and
provides rating information, such as high to low or poor to
excellent.
• The rating information is displayed using varying colors or
saturation.
• Design Best Practices for Heat Maps:
Use a basic and clear map outline to avoid distracting from
the data.
Use a single color in varying shades to show changes in data.
Avoid using multiple patterns.
Demo I
What does profit margin look like across different product
categories?
Pie Charts
Pie Charts
• A pie chart shows a static number and how categories
represent part of a whole -- the composition of something.
• A pie chart represents numbers in percentages, and the
total sum of all segments needs to equal 100%.
• Design Best Practices for Pie Charts:
Don't illustrate too many categories to ensure differentiation
between slices.
Ensure that the slice values add up to 100%.
Order slices according to their size.
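A small matplotlib sketch of these practices follows. The market names and profit values are placeholders; the slices are sorted by size before plotting, and autopct prints each slice's percentage.

```python
import matplotlib.pyplot as plt

# Placeholder profits per market.
profits = {"Market B": 20, "Market A": 40, "Market D": 10, "Market C": 30}

# Order slices by size, largest first.
ordered = sorted(profits.items(), key=lambda item: item[1], reverse=True)
labels = [name for name, value in ordered]
values = [value for name, value in ordered]

# Shares of the whole; they should add up to 1 (that is, 100%).
shares = [value / sum(values) for value in values]

plt.pie(values, labels=labels, autopct="%1.0f%%",
        startangle=90, counterclock=False)
plt.axis("equal")  # keep the pie circular
plt.show()
```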
Demo I
Which market generates more profits?
Box Plot
Box Plots
• Box plots visually show the
distribution of numerical
data and skewness through
displaying the data quartiles
(or percentiles) and
averages.
• They present five statistics:
min, first quartile, median,
third quartile, max.
• They can also help spot
outliers.
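The five statistics can be computed by hand and compared with a matplotlib box plot. The numbers below are a made-up sample; note that matplotlib, by default, draws the whiskers only out to points within 1.5 times the interquartile range and plots anything beyond as an outlier, so the whiskers need not reach the min and max.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up sample with one unusually large value.
data = [2, 4, 4, 5, 6, 7, 8, 9, 30]

# Quartiles via linear interpolation (NumPy's default percentile method).
q1, median, q3 = np.percentile(data, [25, 50, 75])
print(min(data), q1, median, q3, max(data))

# The box spans q1..q3, the line inside is the median,
# and points beyond the whiskers are drawn individually as outliers.
result = plt.boxplot(data)
plt.show()
```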
Demo I
Use a box plot to show the distribution of profit margins in the
Canada and USA markets.
Bullet Graphs
Bullet Graphs
• Bullet graphs are a variation of bar charts with the addition
of reference lines and reference areas.
• They help illustrate the relationship between two measures.
• They are usually used to compare actual value with target
value.
• For instance, the actual sales fall between 60% and 80% of
the target sales, which can be a risk signal to decision
makers.
Demo I
Use a bullet graph to compare sales in 2013 with sales in 2012,
which serves as the base.
Dumbbell Chart
Dumbbell Charts
• Also called DNA charts. They are used to demonstrate the
changes/trends between two data points.
• A dumbbell connects two dots with a line; it is therefore
built as a dual axis chart.
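For comparison, a dumbbell chart can be approximated in matplotlib with horizontal lines plus two scatter series. This is only a sketch: the market names and sales values are placeholders, and it is not how Tableau builds the chart.

```python
import matplotlib.pyplot as plt

# Placeholder sales for two years.
markets = ["Market A", "Market B", "Market C"]
sales_2012 = [100, 140, 90]
sales_2013 = [130, 150, 120]
y = list(range(len(markets)))

# One horizontal line per market connecting the two years...
connectors = plt.hlines(y, sales_2012, sales_2013, color="gray")
# ...and one dot at each end of the line.
plt.scatter(sales_2012, y, color="tab:blue", label="2012")
plt.scatter(sales_2013, y, color="tab:orange", label="2013")

plt.yticks(y, markets)
plt.xlabel("Sales")
plt.legend()
plt.show()
```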
Demo I
Use a dumbbell chart to show sales in 2012 and sales in 2013.
Exercise
• Using “Music Sale_data.xlsx”, create visualizations to
show best artists, best-selling genre, etc.
• See some examples in the next two slides.
Tableau Resources
• Find datasets from HKSAR
https://data.gov.hk/en/
• Find student resources from Tableau
https://community.tableau.com/docs/DOC-10635
• Student Viz Assignment Contest by Tableau
https://www.tableau.com/student-viz-assignment-contest
• View projects from Tableau Public
https://public.tableau.com/en-us/s/
Matplotlib
The SciPy Ecosystem
• NumPy: defines numerical array and matrix types
• Matplotlib: provides a data visualization package
• IPython: makes Jupyter Notebook possible
• pandas: provides high-performance, easy-to-use data structures
Matplotlib
• Matplotlib provides a library for basic data visualization
• Basic module: matplotlib.pyplot
• matplotlib.pyplot is a collection of command-style
functions that make matplotlib work like MATLAB
plot
• The plot method creates a chart
• The basic syntax is
plot([x], y, [args])
  x: data (array) for the x axis, optional
  y: data (array) for the y axis
  args: other arguments
• You may first generate the data, and then pass them to the
arguments
plot Line Charts
import matplotlib.pyplot as plt
gdp = [300.2, 543.3, 1075.9, 2862.5, 5979.6, 10289.7,
14958.3]
plt.plot(gdp)  # x axis defaults to 0, 1, 2, ... if not specified
plt.show()     # don't forget to show the plot
plot Line Charts
years = [1950, 1960, 1970, 1980, 1990, 2000, 2010]
gdp = [300.2, 543.3, 1075.9, 2862.5, 5979.6, 10289.7, 14958.3]
plt.plot(years, gdp)
plt.show()
plot Line Charts
args = '[color][marker][line]'

Color characters:
'b'    blue
'g'    green
'r'    red
'c'    cyan
'm'    magenta
'y'    yellow
'k'    black
'w'    white

Marker characters:
'.'    point marker
','    pixel marker
'o'    circle marker
'v'    triangle_down marker
'^'    triangle_up marker
'<'    triangle_left marker
'>'    triangle_right marker
's'    square marker
'p'    pentagon marker
'*'    star marker
'h'    hexagon1 marker
'+'    plus marker
'x'    x marker
'D'    diamond marker

Line style characters:
'-'    solid line style
'--'   dashed line style
'-.'   dash-dot line style
':'    dotted line style
plot Line Charts
plt.plot(years, gdp, 'ro-.')
plt.show()
More charts
• Line/scatter chart: plot([x], y, [args])
• Scatter plot: scatter(x, y, [args])
• Bar plot: bar(x, height, [args])
• Histogram: hist(x, [args])
• Box plot: boxplot(x, [args])
• Pie chart: pie(x, [args])
• More examples:
• https://matplotlib.org/stable/gallery/index.html
Format Charts
• xlabel() adds a label to the x axis
• ylabel() adds a label to the y axis
• title() adds a title to the entire chart
• text(x, y, text) adds text at location x, y in the chart
• grid(True) shows the grid lines in the chart
Format Charts
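import numpy as np

# Assumed sample data (x is not defined on the slide):
# 10,000 random draws with mean 100 and standard deviation 15
x = 100 + 15 * np.random.randn(10000)

plt.hist(x, 50, color = 'b', density = True)
plt.xlabel('Smarts')
plt.ylabel('Probability')
plt.title('Histogram of IQ')
plt.text(50, .025, r'$\mu=100,\ \sigma=15$')  # matplotlib renders math symbols in text
plt.grid(True)
plt.show()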
Writing mathematical expressions
• Any text element can use math text. You should use raw
strings (precede the quotes with an 'r'), and surround the
math text with a pair of dollar signs ($), as in TeX
• To make subscripts and superscripts, use the '_' and '^'
symbols
r'$\alpha_i > \beta_i$'
• See a simple tutorial here:
https://matplotlib.org/2.0.2/users/mathtext.html
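A short sketch of math text in axis labels and a title, using subscripts and superscripts. The plotted data are arbitrary and only give the labels something to attach to.

```python
import matplotlib.pyplot as plt

# Arbitrary data: x and its square.
x = [0, 1, 2, 3, 4]
y = [v ** 2 for v in x]

plt.plot(x, y, "bo-")
plt.xlabel(r"$x_i$")      # subscript with _
plt.ylabel(r"$x_i^2$")    # superscript with ^
title = plt.title(r"$\alpha_i > \beta_i$")  # Greek letters, as in TeX
plt.show()
```

The raw-string prefix r keeps Python from treating backslashes such as \a and \b as escape sequences.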
Subplots
• The subplot method adds additional subplots to the same
chart.
subplot(pos)
• pos is a three-digit integer, where the first digit is the
number of rows, the second the number of columns, and
the third the index of the subplot
• 235 means 2 rows with 3 subplots in each row, so 6 subplots
in total. The current subplot is the 5th.
1 2 3
4 5 6
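To see the numbering in practice, the sketch below creates all six positions of a 2 x 3 grid and draws only in position 5. The plotted values are arbitrary.

```python
import matplotlib.pyplot as plt

# Create the six subplots 231, 232, ..., 236 and label each with its pos value.
axes = []
for index in range(1, 7):
    ax = plt.subplot(230 + index)
    ax.set_title(str(230 + index))
    axes.append(ax)

# Draw something in the 5th subplot (second row, middle column).
axes[4].plot([1, 2, 3], [2, 4, 1], "g^-")

plt.subplots_adjust(wspace=0.5, hspace=0.5)
plt.show()
```

The index runs left to right along the first row and then continues on the second row, matching the grid above.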
Subplots in One Chart
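# advertising is assumed to be a table (e.g., a pandas DataFrame)
# with columns TV, radio, newspaper, and sales
plt.subplot(131)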
plt.plot('TV', 'sales', 'r.', data = advertising)
plt.xlabel('TV')
plt.ylabel('Sales')
plt.subplot(132)
plt.plot('radio', 'sales', 'bo', data = advertising)
plt.xlabel('Radio')
plt.subplot(133)
plt.plot('newspaper', 'sales', 'g>', data = advertising)
plt.xlabel('Newspaper')
plt.subplots_adjust(wspace = 0.8)
plt.suptitle('Impact of promotional strategies on Sales')
plt.show()
Subplots in One Chart
Save your figure
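• Call savefig() before showing the figure
• show() behaves much like close(): once the figure has been shown
it is released, so a savefig() call placed after show() may save
an empty image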
……
plt.savefig('fig.png')
plt.show()
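A complete version of the save-then-show order might look like the sketch below. Writing to the system temporary directory is a choice made here only to avoid cluttering the working directory; the slide itself saves to 'fig.png' in the current folder.

```python
import os
import tempfile
import matplotlib.pyplot as plt

years = [1950, 1960, 1970, 1980, 1990, 2000, 2010]
gdp = [300.2, 543.3, 1075.9, 2862.5, 5979.6, 10289.7, 14958.3]
plt.plot(years, gdp, "ro-.")

# Save first, while the figure still exists...
out_path = os.path.join(tempfile.gettempdir(), "fig.png")
plt.savefig(out_path)

# ...then show it.
plt.show()

print(os.path.getsize(out_path) > 0)
```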
Summary
• Plot line charts
• More charts
• Format charts
• Subplots in one chart
• Save figures