Data Communication for Scientists

The document discusses the importance of communication and data visualization in data science, emphasizing the need for data scientists to effectively convey their findings to decision-makers. It covers various visualization techniques, their appropriate use cases, and best practices for presentations. Additionally, it highlights common pitfalls and provides guidelines for creating effective visualizations.

Uploaded by

Clarisse Gaiola

NOVA IMS Information Management School
5. COMMUNICATION AND DATA VISUALIZATION

Data Science for Marketing


© 2021-2024 Nuno António
Accreditations and Certifications
Instituto Superior de Estatística e Gestão da Informação
Universidade Nova de Lisboa
Summary
1. Communication in Data Science
2. Introduction to data visualization
3. Display mediums
4. Do’s and Don’ts

5.1 Communication in Data Science
Communication and Data visualization

Top barriers faced by data scientists
§ “Lack of management/financial support”
§ “Lack of clear questions to answer”
§ “Results not used by decision makers”
§ “Explaining Data Science to others”
Source: “Data Science and the Art of Persuasion” (Harvard Business Review, 2019)

It’s not enough to have the technical skills
As a data scientist, you also need to:
§ Explain effectively how you arrived at a result
§ Justify the rationale for your approach
§ Convince the audience that your results should be used
§ Show how your results can improve the business in a specific way

Reports Presentations

Presentations
§ Storytelling:
  § Set the context/background so the audience understands the story’s relevance
  § Use an interesting narrative to get the message across
  § Convey the message in business terms
  § Highlight the business impact and opportunity
  § At the end, summarize the highlights and present the call to action
§ Data visualization:
  § Good visualizations are essential to convey a good story
  § But visualizations (or annotations) should not tell the story by themselves; that is the presenter’s job, done interactively during the presentation

Presentation tips
§ Be punctual and respect the predefined timings (if any)
§ Dress appropriately for the audience
§ Adapt to the rhythm of the presentation (e.g., if the CEO engages in a discussion, you may set the timings aside)
§ Avoid technical terms if the audience is mostly people from the business side
§ Use large fonts in slides (minimum 20 points)
§ Avoid slides with long texts. If necessary, use only visualizations and 2 or 3 words
§ Avoid spelling and grammar mistakes
§ Focus on the message you want to convey, not on the slides
§ Never, ever:
  § Ramble on and on without a specific idea to transmit
  § Turn your back to the audience
  § Read verbatim off the slides or from your notes

5.2 Introduction to data visualization
Communication and Data visualization

How humans gather and process information

SYSTEM I
Thought-processing that is fast, automatic, and unconscious. We use this method frequently to:
§ Read text on a sign
§ Determine the source of a sound
§ Solve 1+1
§ Recognize the difference between colors
§ Ride a bike

SYSTEM II
Slow, logical, infrequent, and calculating thought. It is used to:
§ Distinguish the difference in meaning behind multiple signs side by side
§ Recite your phone number
§ Understand complex social cues
§ Solve 23x21

The human memory

Human memory has three types:
§ Sensory (aka iconic): where preconscious processing is made
§ Short-term (aka working memory): where information resides during conscious processing (temporary, dedicated to visual information, limited storage capacity)
§ Long-term

Why visualization
§ To exploit the human visual system as a means of communication
§ The human visual system is both well characterized and suitable for transmitting information
§ The visual system provides a very high-bandwidth channel to the brain
§ A significant amount of the human brain is dedicated to visual processing
§ A significant amount of visual information processing occurs in parallel at the preconscious level

Why visualize data
§ Displays a large amount of information in a small space
§ Makes users think about the substance, not the methodology or technology
§ Makes complex information easy to interpret
§ Encourages the detection of patterns, anomalies and trends
§ Makes comparisons easier

Anscombe’s quartet
“Graphics can be more precise and revealing than conventional statistical computation” (Tufte, 2007)

   dataset I       dataset II      dataset III     dataset IV
    x      y        x      y        x      y        x      y
  10.0   8.04     10.0   9.14     10.0   7.46      8.0   6.58
   8.0   6.95      8.0   8.14      8.0   6.77      8.0   5.76
  13.0   7.58     13.0   8.74     13.0  12.74      8.0   7.71
   9.0   8.81      9.0   8.77      9.0   7.11      8.0   8.84
  11.0   8.33     11.0   9.26     11.0   7.81      8.0   8.47
  14.0   9.96     14.0   8.10     14.0   8.84      8.0   7.04
   6.0   7.24      6.0   6.13      6.0   6.08      8.0   5.25
   4.0   4.26      4.0   3.10      4.0   5.39     19.0  12.50
  12.0  10.84     12.0   9.13     12.0   8.15      8.0   5.56
   7.0   4.82      7.0   7.26      7.0   6.42      8.0   7.91
   5.0   5.68      5.0   4.74      5.0   5.73      8.0   6.89

Summary statistics shared by the four datasets:
N = 11
x̄ = 9.0
ȳ = 7.50
linear regression: y = 3.0 + 0.5x
var(x) = 10
var(y) = 3.75
corr(x,y) = 0.816
R² = 0.67

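
The shared summary statistics can be checked with a short script. The following is a minimal sketch using only the Python standard library; the function name `summarize` and the dictionary layout are illustrative choices, not part of the original slides. It assumes the variances on the slide are population variances (dividing by N), since that is the form that reproduces var(x) = 10.

```python
from statistics import mean

# Anscombe's quartet, as listed in the table on the slide.
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_SHARED, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return the summary statistics shown on the slide for one dataset."""
    n = len(x)
    mean_x, mean_y = mean(x), mean(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx                    # least-squares slope
    intercept = mean_y - slope * mean_x  # least-squares intercept
    r = sxy / (sxx * syy) ** 0.5         # Pearson correlation
    return {
        "n": n, "mean_x": mean_x, "mean_y": mean_y,
        "var_x": sxx / n, "var_y": syy / n,  # population variances (assumption)
        "slope": slope, "intercept": intercept, "r": r, "r2": r * r,
    }


for name, (x, y) in QUARTET.items():
    s = summarize(x, y)
    print(f"{name:>3}: mean_x={s['mean_x']:.1f} mean_y={s['mean_y']:.2f} "
          f"y={s['intercept']:.1f}+{s['slope']:.1f}x r={s['r']:.3f} R2={s['r2']:.2f}")
```

The four datasets print nearly identical numbers, which is the point of the quartet: the differences only show up once the data are plotted.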
Anscombe’s quartet visualization
[Figure: four scatter plots, one per dataset (I, II, III, IV), each with x from 0 to 20 and y from 0 to 14]

Data visualization in CRISP-DM
It is used in all phases, but in particular:
§Data Understanding:
§ Exploratory Data Analysis
§Modeling:
§ Assess models’ performance
§Evaluation:
§ Assess models’ generalization
§ Communicate results
§Deployment:
§ Communicate results
§ Monitor performance
5.3 Display mediums
Communication and Data visualization

Chart selection diagram – just some examples

Distribution
Display mediums

Histogram
§ One variable - numeric
§ Changing the number of bins can change how the distribution appears
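
The effect of the bin count can be seen without any plotting library. The sketch below is illustrative only: `histogram_counts` is a hypothetical helper and the values are made up, not taken from the deck. It uses equal-width bins over the range of the data.

```python
def histogram_counts(values, bins):
    """Count values in `bins` equal-width bins spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value would fall just past the last bin, so clamp it.
        idx = bins - 1 if width == 0 else min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical data with two clusters (around 3 and around 13).
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 11, 12, 12, 13, 13, 13, 14, 14, 15]

for bins in (2, 7):
    print(f"{bins} bins:")
    for count in histogram_counts(values, bins):
        print("  " + "#" * count)
```

With 2 bins the data look evenly spread; with 7 bins the gap between the two clusters becomes visible, even though the data have not changed.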

Density plot
§ One variable - numeric
§ Typically used when there is a high number of data points

Count plot
§ One variable - categorical
§ Used to understand the distribution among categories and cardinality issues
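
The counts behind a count plot can be sketched with `collections.Counter`. The sample values below are hypothetical and deliberately include inconsistent spellings, one common source of the cardinality issues a count plot helps to spot.

```python
from collections import Counter

# Hypothetical raw values of a categorical column; not taken from the deck.
marital = ["married", "single", "married", "divorced", "Married",
           "single", "married", "unknown", "maried"]

counts = Counter(marital)
cardinality = len(counts)  # number of distinct categories
rare = sorted(c for c, n in counts.items() if n == 1)

# Text version of a count plot, most frequent category first.
for category, n in counts.most_common():
    print(f"{category:<10}{'#' * n} {n}")
print("distinct categories:", cardinality)
print("seen only once:", rare)
```

Categories that appear only once ("Married", "maried") suggest that the column needs cleaning before it is analyzed.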

Scatter plot
§ Two numerical variables
§ Most familiar way to visualize bivariate distributions

Joy plot (also known as a ridgeline plot)
§ One numerical variable by a categorical variable
§ Great way to visualize the distribution (density curves) of a large number of groups in relation to each other

Joy plot
§ Multiple numerical variables by a categorical variable
§ Great way to visualize the distribution (density curves) of a large number of groups in relation to each other

3D area plot
§ Three numerical variables
§ Advanced method to explore the distribution of three variables

Pair plots
§ Multiple variables, of multiple types
§ Used to plot multiple pairwise bivariate distributions
§ Very helpful in the data understanding phase

Relationship
Display mediums

Scatter plot – three variables
§ Two numerical variables, by a categorical variable
§ Color represents the categorical variable

Scatter plot – four variables
§ Two numerical variables, by two categorical variables
§ One of the categorical variables is represented by the color
§ The other categorical variable is represented by the marker
§ Usually not very helpful if there are too many data points

Bubble plot – three to five variables
§ Three numerical variables
§ One of the variables defines the size
§ Just as in scatter plots, it is possible to add two categorical variables (color and marker)
§ Usually not very helpful if there are too many data points

Facet grid – three to six variables
§ Grid of scatter plots, bubble plots, or line plots
§ Helpful when there are too many data points, to break plots down by categorical variables
§ Very helpful in the data understanding phase

Composition
Display mediums

Pie chart – Sorted bar plot

§ Pie chart:
  § Good to show shares/proportions
  § Not helpful with too many categories
§ Sorted bar plot:
  § Helpful when there are many categories
  § If a category requires highlighting, just color the category bar
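
The preparation step for a sorted bar plot with one highlighted category can be sketched as follows. The category names, values, and color names are hypothetical; any plotting library could consume the resulting order and color mapping.

```python
# Hypothetical share of sales by channel (percent); not taken from the deck.
shares = {"Email": 12, "Search": 31, "Social": 24,
          "Direct": 18, "Referral": 9, "Display": 6}
HIGHLIGHT = "Direct"  # the category that needs attention

# Sort from largest to smallest so the ranking is readable at a glance.
ordered = sorted(shares.items(), key=lambda item: item[1], reverse=True)

# One accent color for the highlighted bar, a neutral color for the rest.
colors = {name: ("orange" if name == HIGHLIGHT else "lightgray")
          for name, _ in ordered}

for name, value in ordered:
    marker = "*" if name == HIGHLIGHT else " "
    print(f"{marker} {name:<9}{'#' * value} {value}%")
```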
Stacked bar plot – 100%
§ One numerical variable, shown as a proportion, by two categorical variables
§ Useful to depict proportions of components with subcomponents
§ Useful to depict changes over time (x as a time dimension) when only relative differences matter. However, not useful if x has too many periods
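
A 100% stacked bar plot draws each period's components as shares of that period's total. The normalization can be sketched as below; the quarterly figures and the helper name `to_shares` are hypothetical.

```python
# Hypothetical sales by channel and quarter; not taken from the deck.
sales = {
    "Q1": {"Online": 30, "Store": 70},
    "Q2": {"Online": 45, "Store": 75},
    "Q3": {"Online": 60, "Store": 60},
}


def to_shares(groups):
    """Convert each period's absolute values into percentages of its total."""
    result = {}
    for period, parts in groups.items():
        total = sum(parts.values())
        result[period] = {name: 100 * value / total for name, value in parts.items()}
    return result


shares = to_shares(sales)
for period, parts in shares.items():
    print(period, {name: f"{pct:.1f}%" for name, pct in parts.items()})
```

In this made-up example, Store sales grow in absolute terms from Q1 to Q2 (70 to 75) while their share falls (70.0% to 62.5%), which is why the 100% version suits cases where only relative differences matter.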

Stacked bar plot
§ One numerical variable by two categorical variables
§ Useful to depict values of components with subcomponents
§ Useful to depict changes over time (x as a time dimension) when both relative and absolute differences matter. However, not useful if x has too many periods

Area plot – 100%
§ One time variable vs. one numerical variable, shown as a proportion, by one categorical variable
§ Useful to depict changes over time (x as a time dimension) when only relative differences matter

Area plot
§ One time variable vs. one numerical variable, by one categorical variable
§ Useful to depict changes over time (x as a time dimension) when both relative and absolute differences matter

Comparison
Display mediums

Tables
Work best for:
§ Comparing items
§ Looking up individual values
§ Data that must be shown precisely

Marital     Subscribed - NO   Subscribed - YES
Divorced              4,136                476
Married              22,396              2,532
Single                9,948              1,620
Unknown                  68                 12

Column plot
§ Values by categorical variable, with each series being a category
§ Useful to make comparisons with few categories

Column plot with facet
§ Useful to make comparisons with multiple categorical variables, with many categories
§ Very helpful in the data understanding phase

Line plot
§ Useful to make comparisons over time
§ Not suitable for a high number of categories

Other types
Display mediums

Heatmap
source: https://www.machinelearningplus.com/plots/top-50-matplotlib-visualizations-the-master-plots-python

Treemap
source: https://www.machinelearningplus.com/plots/top-50-matplotlib-visualizations-the-master-plots-python

Geospatial data visualizations
source: https://i.stack.imgur.com/QVoiv.png

5.4 Do’s and Don’ts
Communication and Data visualization

Do’s and don’ts
[Figure: bar chart of Budget and Actual by month, January to December; y-axis from €30000,00 to €75000,00]
Applications encourage users to create poorly designed graphs, and sometimes graphs are made to deceive.
Adapted from Few (2008)

Do’s and don’ts
[Figure: bar chart of Budget and Actual by month, January to December; y-axis from €0,00 to €80000,00]
Starting the y-axis at 0 puts the differences in proportion and reflects the full values more accurately
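
A small calculation shows how much a truncated axis exaggerates a difference. This is a sketch only: `drawn_ratio` is a hypothetical helper, and the January figures (Actual 53 000, Budget 50 000) are the ones that appear in the deck's data table for this chart.

```python
def drawn_ratio(value_a, value_b, axis_start):
    """Ratio of the drawn bar heights when the y-axis starts at axis_start."""
    return (value_a - axis_start) / (value_b - axis_start)


actual_jan, budget_jan = 53_000, 50_000

truncated = drawn_ratio(actual_jan, budget_jan, 30_000)  # axis starts at 30 000
zero_based = drawn_ratio(actual_jan, budget_jan, 0)      # axis starts at 0

print(f"axis from 30 000: Actual bar is {truncated:.2f}x the Budget bar")
print(f"axis from 0:      Actual bar is {zero_based:.2f}x the Budget bar")
```

With the axis starting at 30 000 the Actual bar is drawn 15% taller than the Budget bar, although the real difference is 6%.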

Do’s and don’ts
[Figure: bar chart of Budget and Actual by month, January to December; y-axis from €0,00 to €80000,00; chart background removed]
Removing the background makes the chart easier to read and improves color contrast

Do’s and don’ts
[Figure: bar chart of Actual and Budget by month, January to December; y-axis from €0,00 to €80000,00; 3D effects removed]
Most of the time, 3D effects are unjustified. 3D effects create a disparity of depth, occlusion hides information, perspective creates distortion, and tilted text is not legible

Do’s and don’ts
[Figure: bar chart of Actual and Budget by month, January to December; y-axis from €0 to €80 000]
§ Horizontal and vertical grid lines should be used only when strictly useful for improving readability
§ Due to the scale of the values, removing unnecessary decimals and adding the thousands separator improves readability
§ Since months do not need a separation and the grid lines are displayed on the y-axis, all tick marks can be removed
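
The change in axis labels from "€75000,00" to "€75 000" can be produced with Python's format specification. This is a minimal sketch; `format_euros` is a hypothetical helper, and the space as thousands separator follows the style used on the slide.

```python
def format_euros(value):
    """Format a number as euros with no decimals and a space as thousands separator."""
    # The ',' option inserts comma separators; they are swapped for spaces afterwards.
    return "€" + f"{value:,.0f}".replace(",", " ")


for tick in (0, 10_000.00, 75_000.00):
    print(format_euros(tick))
```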
Do’s and don’ts
[Figure: bar chart of Actual and Budget by month, Jan to Dec; y-axis from €0 to €80 000; horizontal month labels]
§ Enlarged text makes reading easier
§ Avoid using distracting fonts or elements, such as bold, italic or underlined text
§ Avoid 90-degree labels. They are difficult to read. Abbreviating month names and placing them horizontally makes reading easier

Do’s and don’ts
[Figure: bar chart of Actual and Budget by month, Jan to Dec; y-axis labeled "Sales €", from 0 to 80 000; x-axis labeled "2020"]
§ Including a label with the € symbol on the y-axis allows removing the € symbol from the values, which makes reading easier
§ Labeling the x-axis also makes reading easier

Do’s and don’ts
§ The reorientation and repositioning of the y-axis label facilitates reading
§ The reorientation and repositioning of the legend makes it easier to read and creates more space for the chart itself
[Figure: bar chart of Actual and Budget by month, Jan to Dec 2020; "Sales €" label and legend placed above the chart; y-axis from 0 to 80 000]

Do’s and don’ts
[Figure: bar chart of Actual and Budget (Sales €) by month for 2020; y-axis from 0 to 80 000; a data label on every bar and a data table below the x-axis]

Values in €:
         Jan     Feb     Mar     Apr     May     Jun     Jul     Aug     Sep     Oct     Nov     Dec
Actual   53 000  58 000  61 500  51 000  57 000  61 000  56 000  60 000  70 000  62 000  64 000  67 000
Budget   50 000  52 000  60 000  50 500  56 500  62 000  55 500  61 500  71 500  64 000  67 000  71 500

Be careful with the data-ink ratio. Ink that is used to display anything that isn’t data or does not improve readability should be reduced to a minimum
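
For reference, the data-ink ratio is commonly stated, following Tufte, as the share of a graphic's ink that presents data. The formula below is a sketch of that usual definition, not a quotation from the slides:

```latex
\[
\text{data-ink ratio} = \frac{\text{data-ink}}{\text{total ink used to print the graphic}}
\]
```

Under this definition, repeating every value as both a data label and a table cell adds ink without adding data, which lowers the ratio.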

Do’s and don’ts
§ Avoid combining chart types. It makes reading difficult
§ Having a secondary y-axis, especially on a scale or measure different from the primary axis, makes it difficult to interpret the graph
[Figure: chart of Actual and Budget by month, Jan to Dec 2020, combining chart types; primary y-axis "Sales €" from 0 to 80 000; secondary y-axis in € from 40 000 to 75 000]

Do’s and don’ts
[Figure: bar chart of Actual and Budget (Sales €) by month, Jan to Dec 2020; y-axis from 0 to 80 000]
§ Use colors consistently. If sales are blue and the budget is gray on a graph, the sales and budget colors must be the same on all graphs
§ Use contrasting colors, and consider printing requirements and the cultural significance of the colors. For example, red is often used to represent bad things or a warning
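
One way to keep colors consistent is to define the series colors once and look them up from every chart. The sketch below is an assumption about how this could be organized, not something prescribed by the slides; `SERIES_COLORS` and `color_for` are hypothetical names.

```python
# Single source of truth for series colors, shared by every chart in a report.
# Blue for sales (Actual) and gray for Budget follow the example on the slide.
SERIES_COLORS = {"Actual": "blue", "Budget": "gray"}


def color_for(series):
    """Return the agreed color for a series, failing loudly for unknown series."""
    if series not in SERIES_COLORS:
        raise KeyError(f"No color assigned to series {series!r}; add it to SERIES_COLORS")
    return SERIES_COLORS[series]


for series in ("Actual", "Budget"):
    print(series, "->", color_for(series))
```

Failing on an unknown series, instead of falling back to a default color, makes an inconsistent chart impossible to produce by accident.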
Do’s and don’ts
[Figure: line chart of Actual and Budget (Sales €) by month, Jan to Dec 2020; y-axis from 0 to 80 000]
§ With the graph transformed into lines, the y-axis line better defines the beginning of the graph
§ Since this is a time-based graph, lines are better at revealing changes in patterns over time than bars

Do’s and don’ts
[Figure: line chart of Actual and Budget (Sales €) by month, Jan to Dec 2020; y-axis from 45 000 to 75 000]
If the purpose of the graph is to examine the differences between actual sales and the budget, changing the initial value of the y-axis can improve the understanding of the graph

Do’s and don’ts
[Figure: line chart of Actual and Budget (Sales €) by month, Jan to Dec 2020; y-axis from 45 000 to 75 000; each line labeled directly, no legend]
Labeling the series eliminates the need for a legend and makes reading easier

Do’s and don’ts
§ Representing sales variation from budget as a single line improves readability. This requires modifying the y-axis labels
§ Every chart requires a title
[Figure: line chart titled "Sales variance from budget"; y-axis labeled "Euros €", from -6 000 to 8 000; x-axis Jan to Dec 2020]
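
The single variance line is the month-by-month difference between actual sales and budget. The sketch below uses the monthly figures from the deck's data table (decimals dropped); the variable names are illustrative.

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
# Monthly values in euros, as read from the data table in the deck.
ACTUAL = [53000, 58000, 61500, 51000, 57000, 61000,
          56000, 60000, 70000, 62000, 64000, 67000]
BUDGET = [50000, 52000, 60000, 50500, 56500, 62000,
          55500, 61500, 71500, 64000, 67000, 71500]

# Positive values mean sales above budget, negative values mean below budget.
variance = [actual - budget for actual, budget in zip(ACTUAL, BUDGET)]

for month, value in zip(MONTHS, variance):
    print(f"{month}: {value:+,}".replace(",", " "))
```

The results range from -4 500 (Dec) to +6 000 (Feb), which fits the chart's y-axis of -6 000 to 8 000.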
Do’s and don’ts
§ Expressing the variation as a percentage provides a quicker understanding. As a percentage, the y-axis label can be removed
§ The title should reflect the subject under analysis
[Figure: line chart titled "Sales percentage variance from budget"; y-axis from -8,0% to 12,0%; x-axis Jan to Dec 2020]
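
The percentage variance divides each month's difference by that month's budget. A minimal sketch, assuming the monthly figures from the deck's data table and a decimal comma for labels as on the slide; the names are illustrative.

```python
# Monthly values in euros (Jan to Dec), as read from the data table in the deck.
ACTUAL = [53000, 58000, 61500, 51000, 57000, 61000,
          56000, 60000, 70000, 62000, 64000, 67000]
BUDGET = [50000, 52000, 60000, 50500, 56500, 62000,
          55500, 61500, 71500, 64000, 67000, 71500]

# Variance relative to budget, in percent.
pct_variance = [100 * (actual - budget) / budget
                for actual, budget in zip(ACTUAL, BUDGET)]

# One decimal and a decimal comma, matching the axis labels on the slide.
labels = [f"{pct:.1f}%".replace(".", ",") for pct in pct_variance]
print(labels)
```

All twelve values fall between -8% and 12%, the range of the chart's y-axis.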
Do’s and don’ts
[Figure: the initial chart (bar chart of Budget and Actual by month, y-axis from €30000,00 to €75000,00) next to the final chart ("Sales percentage variance from budget", y-axis from -8,0% to 12,0%, Jan to Dec 2020)]
From a difficult-to-read graph to an easy-to-read one, in 13 steps

Data Science for Marketing
© 2021-2024 Nuno António (Rev. 2024-08-28)