Data Visualization
Introduction
• Data visualization plays a crucial role in data science
because it helps us understand the data.
• Data visualization can be used in all steps of the data
science life cycle to facilitate data exploration, identify
anomalies, understand relationships and trends, and
produce reports.
• The best data in the world won't be worth anything if no
one can understand it. It is not enough to collect and
analyze data; it must also be presented to the end users
and other interested parties who will then act on it.
This is where data visualization comes in.
• Sometimes data does not make sense until you can look at it in a visual
form, such as with charts and plots.
• Being able to quickly visualize your data samples for yourself and others is
an important skill both in applied statistics and in applied machine
learning.
• Statistics focuses on quantitative descriptions and estimates of data.
Data visualization complements it with an important suite of tools for
gaining a qualitative understanding.
• Data visualization can be helpful when exploring and getting to know a
dataset and can help with identifying patterns, corrupt data, outliers, and
much more.
Classifications of Visualizations
• There are several ways to categorize and think about
different kinds of visualizations. Here are four of the most
useful. The first two are unrelated to the others; the last
two are related to each other.
• Complexity
• Infographics versus Data Visualization
• Exploration versus Explanation
• Informative versus Persuasive versus Visual Art
Complexity
• One way to classify a data visualization is by counting
how many different data dimensions it represents.
• By this we mean the number of discrete types of
information that are visually encoded in a diagram.
• For example, a simple line graph may show the price of a
company’s stock on different days: that’s two data
dimensions.
• If multiple companies are shown (and therefore
compared), there are now three dimensions; if trading
volume per day is added to the graph, there are four.
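The stock example above can be sketched in code. This is a minimal illustration, assuming matplotlib is installed; the company names, prices, and volumes are made-up values, and the function name plot_stock_dimensions is hypothetical. Each data dimension is mapped to its own visual encoding.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so no display is required
import matplotlib.pyplot as plt


def plot_stock_dimensions():
    """Build a chart that encodes four data dimensions (illustrative data only)."""
    # Made-up values for illustration, not real market data.
    days = [1, 2, 3, 4, 5]
    prices = {
        "Company A": [10.0, 10.4, 10.1, 10.8, 11.2],
        "Company B": [20.0, 19.6, 19.9, 20.5, 20.3],
    }
    volume_a = [120, 150, 90, 200, 170]  # hypothetical shares traded per day

    fig, price_ax = plt.subplots()

    # Dimensions 1 and 2: day (x position) and price (y position).
    # Dimension 3: company, encoded as a separate labeled line.
    for company, series in prices.items():
        price_ax.plot(days, series, marker="o", label=company)
    price_ax.set_xlabel("Day")
    price_ax.set_ylabel("Price")
    price_ax.legend(loc="upper left")

    # Dimension 4: trading volume, encoded as bars on a second y-axis.
    volume_ax = price_ax.twinx()
    volume_ax.bar(days, volume_a, alpha=0.2, color="gray")
    volume_ax.set_ylabel("Volume (Company A)")

    return fig, price_ax, volume_ax


fig, price_ax, volume_ax = plot_stock_dimensions()
```

Calling fig.savefig("stocks.png") would write the chart to an image file. Removing the loop over companies or the volume bars takes the chart back down to three or two dimensions.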
Infographics
• We suggest that the term infographics is useful for
referring to any visual representation of data that is:
• manually drawn (and therefore a custom treatment of the
information);
• specific to the data at hand (and therefore nontrivial to recreate
with different data);
• aesthetically rich (strong visual content meant to draw the eye
and hold interest); and
• relatively data-poor (because each piece of information must be
manually encoded).
• Because they are drawn manually, infographics have the
option of being aesthetically rich.
• Another consequence of their manual origins is they tend
to be limited in the amount of data they can convey,
simply due to the practical limitations of manipulating
many data points.
• Similarly, it is difficult to change or update the data in an
infographic, as any changes must be implemented
manually.
Data Visualization
• By contrast, we suggest that the terms data
visualization and information visualization (casually, data
viz and info viz) are useful for referring to any visual representation
of data that is:
• algorithmically drawn (may have custom touches but is largely rendered
with the help of computerized methods);
• easy to regenerate with different data (the same form may be repurposed
to represent different datasets with similar dimensions or characteristics);
• often aesthetically barren (data is not decorated); and
• relatively data-rich (large volumes of data are welcome and viable, in
contrast to infographics).
• The advantage of this approach is that it is relatively simple to
update or regenerate the visualization with more or new data. While
they may show great volumes of data, information visualizations
are often less aesthetically rich than infographics.
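The "algorithmically drawn" and "easy to regenerate" properties can be shown with a small sketch that uses only the Python standard library. The function render_bar_chart and both datasets are hypothetical; the point is that the same drawing rule is reused unchanged when the data changes.

```python
def render_bar_chart(data, width=20):
    """Return a plain-text bar chart for a mapping of label -> non-negative value."""
    largest = max(data.values())
    label_width = max(len(label) for label in data)
    lines = []
    for label, value in data.items():
        # Scale each bar relative to the largest value in this dataset.
        bar_length = round(value / largest * width) if largest else 0
        lines.append(f"{label:<{label_width}} | {'#' * bar_length} {value}")
    return "\n".join(lines)


# Two hypothetical datasets with the same shape (label -> count).
sales_q1 = {"North": 40, "South": 25, "East": 10}
sales_q2 = {"North": 30, "South": 45, "East": 15, "West": 5}

# The same function regenerates the chart for new data with no manual redrawing.
print(render_bar_chart(sales_q1))
print()
print(render_bar_chart(sales_q2))
```

The output is deliberately undecorated, which matches the "aesthetically barren but data-rich" trade-off described above; an infographic would need to be redrawn by hand for the second dataset.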
Exploration
• Exploratory data visualizations are appropriate when you have a
whole bunch of data and you are not sure what is in it.
• When you need to get a sense of what is inside your data set,
translating it into a visual medium can help you quickly identify its
features, including interesting curves, lines, trends, or anomalous
outliers.
• Exploration is generally best done at a high level of granularity.
There may be a whole lot of noise in your data, but if you
oversimplify or strip out too much information, you could end up
missing something important.
• This type of visualization is typically part of the data
analysis phase, and is used to find the story the data has to tell
you.
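As a rough sketch of an exploratory first look, the snippet below plots a histogram and a box plot of the same values, assuming matplotlib is installed. The readings are invented for illustration, with one deliberately anomalous value. The box plot's default rule (points more than 1.5 times the interquartile range beyond the quartiles) is one common way to surface outliers, not the only one.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so no display is required
import matplotlib.pyplot as plt

# Hypothetical readings; the last value is deliberately anomalous.
readings = [12, 14, 13, 15, 14, 13, 16, 15, 14, 13, 12, 15, 48]

fig, (hist_ax, box_ax) = plt.subplots(1, 2, figsize=(8, 3))

# The histogram shows the overall shape of the distribution.
hist_ax.hist(readings, bins=10)
hist_ax.set_title("Distribution")

# boxplot() returns the drawn artists; "fliers" holds the points it
# treats as outliers under its default 1.5 x IQR whisker rule.
box = box_ax.boxplot(readings)
box_ax.set_title("Box plot")

outliers = [float(v) for v in box["fliers"][0].get_ydata()]
print("Points flagged as outliers:", outliers)
```

Nothing is filtered out before plotting, in keeping with the advice above to explore at a fine level of detail and decide what is noise only after looking.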
Explanation
• By contrast, explanatory data visualization is appropriate when you
already know what the data has to say, and you are trying to tell that
story to somebody else.
• It could be the head of your department, a grant committee, or the
general public.
• Whoever your audience is, the story you are trying to tell (or the answer
you are trying to share) is known to you at the outset, and therefore you
can design to specifically accommodate and highlight that story.
• In other words, you will need to make certain editorial decisions about
which information stays in, and which is distracting or irrelevant and
should come out.
• This is a process of selecting focused data that will support the story you
are trying to tell.
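One possible way to express such an editorial decision in code is to keep every series on the chart for context but mute all except the one the story is about. This is a sketch assuming matplotlib is installed; the function plot_highlighted, the region names, and the figures are all hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so no display is required
import matplotlib.pyplot as plt


def plot_highlighted(series_by_name, x_values, highlight):
    """Draw every series in muted gray except the one the story is about."""
    fig, ax = plt.subplots()
    for name, values in series_by_name.items():
        if name == highlight:
            # The focus of the story: strong color, thicker line, legend entry.
            ax.plot(x_values, values, color="tab:red", linewidth=2.5, label=name)
        else:
            # Context only: unlabeled and visually quiet.
            ax.plot(x_values, values, color="lightgray", linewidth=1)
    ax.set_title(f"{highlight} compared with the other series")
    ax.legend()
    return fig, ax


# Made-up monthly figures for illustration.
months = [1, 2, 3, 4, 5, 6]
signups = {
    "Region A": [30, 32, 31, 33, 34, 35],
    "Region B": [28, 27, 29, 30, 29, 31],
    "Region C": [20, 26, 33, 41, 52, 60],
    "Region D": [35, 34, 36, 35, 37, 36],
}

fig, ax = plot_highlighted(signups, months, highlight="Region C")
```

A stricter edit would drop the gray series entirely; keeping them muted is a design choice that preserves context without competing with the main message.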
• If exploratory data visualization is part of the data
analysis phase, then explanatory data visualization is
part of the presentation phase.
• Such a visualization may stand on its own, or may be part
of a larger presentation, such as a speech, a newspaper
article, or a report.
• In these scenarios, there is some supporting narrative
(written or verbal) that further explains things.
Informative versus Persuasive versus Visual Art
• There are three main categories of explanatory
visualizations based on the relationships between the
three necessary players: the designer, the reader, and
the data.
Informative
• An informative visualization primarily serves the
relationship between the reader and the data.
• It aims for a neutral presentation of the facts in such a
way that will educate the reader (though not necessarily
persuade them).
• Informative visualizations are often associated with broad
data sets, and seek to distill the content into a
manageably consumable form.
• Ideally, they form the bulk of visualizations that the
average person encounters on a day-to-day basis,
whether that's at work, in the newspaper, or on a
service-provider's website.
Persuasive
• A persuasive visualization primarily serves the
relationship between the designer and the reader.
• It is useful when the designer wishes to change the
reader’s mind about something.
• It represents a very specific point of view, and advocates
a change of opinion or action on the part of the reader.
• In this category of visualization, the data represented is
specifically chosen for the purpose of supporting the
designer's point of view, and is presented carefully so as
to convince the reader of the same.
Visual Art
• The third category, visual art, primarily serves the
relationship between the designer and the data.
• Visual art is unlike the previous two categories in that it often
entails unidirectional encoding of information, meaning that the
reader may not be able to decode the visual presentation to
understand the underlying information.
• Whereas both informative and persuasive visualizations are meant
to be easily decodable (bidirectional in their encoding), visual art
merely translates the data into a visual form.
• The designer may intend only to condense it, translate it into a new
medium, or make it beautiful; they may not intend for the reader
to be able to extract anything from it other than enjoyment.
Choosing the Appropriate Visualization Technique
• Classification of visualization techniques helps in choosing the
appropriate visualization method for different datasets by providing a
structured framework to match the characteristics of the data with the
most effective visual representation.
1. Understanding the Data
• Data Type: Visualization techniques are classified based on the type of
data (e.g., categorical, numerical, time-series, geospatial). This helps in
determining whether to use bar charts, scatter plots, maps, or other
methods.
• Example: For categorical data, bar charts or pie charts are suitable, while scatter
plots work better for numerical relationships.
• Dimensionality: High-dimensional datasets require specific techniques
like heat maps, parallel coordinates, or dimensionality reduction
methods (e.g., t-SNE) for effective visualization.
• Example: Parallel coordinate plots are suitable for datasets with more than two
attributes.
2. Highlighting Relationships
• Classification helps identify whether the visualization needs to show
relationships (e.g., correlation, causation), distributions, comparisons,
or compositions.
• Example: Scatter plots are ideal for analyzing correlations, while stacked bar
charts can represent compositions.
3. Scalability
• For large datasets, the classification of techniques into methods that
handle large volumes of data (e.g., treemaps, heat maps) can guide the
selection process.
• Example: Heat maps are better suited for summarizing large datasets compared
to simple tables.
4. Purpose of Visualization
• Techniques are often classified by their purpose, such as exploratory (to
uncover patterns) or explanatory (to communicate insights).
• Example: Bullet graphs are ideal for explanatory purposes when tracking
progress toward a goal.
5. Audience and Context
• Classifications based on complexity (simple, intermediate, advanced)
help ensure that the visualization is suitable for the intended audience.
• Example: Donut charts may be more engaging for general audiences, whereas
technical teams might prefer scatter plots or box plots for detailed analysis.
6. Multivariate and Linked Data
• For datasets with multiple variables or linked data points, the
classification provides techniques like scatter plot matrices or
interactive dashboards to display relationships effectively.
7. Geospatial Data
• Geographic data is often visualized using maps, and classification
identifies specific techniques (e.g., choropleth maps, bubble maps)
suited to spatial information.
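The matching process described in points 1 to 7 can be sketched as a small rule-based helper using only the Python standard library. The function suggest_chart, its parameter names, and the order in which the rules are checked are assumptions made for illustration. The "line chart" suggestion for time-series data is likewise an assumption based on the earlier stock-price example. Real choices also depend on audience and purpose, which this sketch ignores.

```python
def suggest_chart(data_type, n_variables=2, goal="comparison"):
    """Suggest a chart type from simple rules of thumb (illustrative only).

    data_type:   "categorical", "numerical", "time-series", or "geospatial"
    n_variables: number of attributes to show at once
    goal:        "comparison", "relationship", or "composition"
    """
    if data_type == "geospatial":
        return "choropleth map or bubble map"
    if data_type == "time-series":
        return "line chart"  # assumption: not stated explicitly in the text
    if n_variables > 2:
        # Multivariate data needs techniques built for many attributes.
        return "parallel coordinates, heat map, or scatter plot matrix"
    if data_type == "categorical":
        return "stacked bar chart" if goal == "composition" else "bar chart"
    if data_type == "numerical":
        return "scatter plot" if goal == "relationship" else "box plot"
    raise ValueError(f"unknown data type: {data_type!r}")


print(suggest_chart("categorical"))
print(suggest_chart("categorical", goal="composition"))
print(suggest_chart("numerical", goal="relationship"))
print(suggest_chart("numerical", n_variables=5))
print(suggest_chart("geospatial"))
```

A lookup like this is only a starting point: it narrows the candidates, after which the designer still judges which option suits the audience and the story.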
Conclusion
• By systematically categorizing visualization techniques based on data
types, dimensionality, purpose, audience, and scalability, classification
helps ensure that an appropriate and effective technique is chosen.
• This reduces misinterpretation, enhances communication, and aids in
uncovering meaningful insights from the dataset.