Data Analysis – Analyzing and Visualizing
Learning Goals
• Perform basic analyses on data to answer simple questions
• Identify which visualization is appropriate based on the type of data
Introduction
Data
Data is numbers, characters, images, or other methods of recording, in a form which can be assessed to make a determination or decision about a specific action. Many believe that data on its own has no meaning; only when interpreted does it take on meaning and become information. By closely examining data we can find patterns that reveal information, and that information can then be used to enhance knowledge (The Free On-line Dictionary of Computing, 1993-2005 Denis Howe).
Qualitative data
Data that is represented in a verbal or narrative format is qualitative data. These types of data are collected through focus groups, interviews, open-ended questionnaire items, and other less structured situations. A simple way to think of qualitative data is as data in the form of words.
Quantitative data
Quantitative data is data expressed in numerical terms, where the numeric values may be large or small. Numerical values may also correspond to a specific category or label.
Data Strategies
There are a variety of strategies for quantitative and qualitative analyses. Different strategies provide data analysts with an organized approach to working with data; they enable the analyst to create a "logical sequence" for the use of different procedures. As you work with data and develop your skills in data analysis, consider which strategy suits your question and why you would use it.
Data visualization refers to representing abstracted data schematically and visually while preserving the data's own features and relationships. Data visualization's primary goal is to use graphical tools to communicate information in a more understandable and efficient manner. Good visual results depend on sound data analysis, and the final result can then be used to make decisions. Data visualization entails turning data into a visual context so that the human brain can better understand and inspect it, that is, comprehend the outcomes of its processing. The visual channel is the fastest human cognitive pathway, which is why more individuals prefer to view data than to hear or read it. In an era of overproduced digital data, fast data processing, analysis, and display are typically priorities in everyday life. Visual data analysis and processing tools are constantly updated as a result of continuous contact between managers and users on the one hand, and developers on the other.
Data Analysis
Data analysis can refer to a variety of specific procedures and methods. However,
before programs can effectively use these procedures and methods, we believe it is
important to see data analysis as part of a process. By this, we mean that data analysis
involves goals; relationships; decision making; and ideas, in addition to working with
the actual data itself. Simply put, data analysis includes ways of working with
information (data) to support the work, goals and plans of your program or agency.
From this perspective, we present a data analysis process that includes the following
key components:
Purpose
Questions
Data Collection
Data Analysis Procedures and Methods
Interpretation/Identification of Findings
Writing, Reporting, and Dissemination
Evaluation
Data analysis can follow either a linear or a cyclical approach.
Overview of Data Analysis Methods
Common methods that are used for data analysis include:
A. Data Exploration: The process of analyzing and comprehending data in order to gain knowledge and spot trends or relationships is known as data exploration. It entails using a variety of methods and tools, such as statistical analysis, visualization, and summarization.
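As a small illustration, the sketch below explores a single numerical column of a list-of-rows dataset (the kind produced by the csv.reader-style loader later in these notes); the column index and the use of Python's statistics module are illustrative choices, not a prescribed tool.
import statistics

def exploreColumn(data, colIndex):
    # Collect the column's values, skipping the header row and blank cells
    values = [float(row[colIndex]) for row in data[1:] if row[colIndex] != ""]
    print("Count:", len(values))
    print("Min:", min(values), " Max:", max(values))
    print("Mean:", statistics.mean(values))
    print("Median:", statistics.median(values))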
B. Data Cleaning: Finding and correcting mistakes, inconsistencies, and inaccuracies in a dataset is known as data cleaning. It entails a number of tasks, including handling missing data, updating data formats, identifying and deleting outliers, dealing with duplicates, and addressing anomalies in data values. Common techniques include mean imputation for missing data, duplicate removal, format standardization, and data value correction.
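To make two of these tasks concrete, here is a minimal cleaning sketch, assuming a list-of-rows dataset with a header row; the column index and the use of the empty string as the missing-value marker are assumptions for illustration.
def imputeMissingWithMean(data, colIndex):
    # Mean imputation: replace missing numerical entries with the column mean
    rows = data[1:]  # skip the header row
    present = [float(r[colIndex]) for r in rows if r[colIndex] != ""]
    columnMean = sum(present) / len(present)
    for r in rows:
        if r[colIndex] == "":
            r[colIndex] = columnMean

def removeDuplicateRows(data):
    # Duplicate removal: keep only the first occurrence of each identical row
    seen = set()
    result = [data[0]]  # keep the header
    for row in data[1:]:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result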
C. Data Modeling: The term "modeling" describes the process of turning raw data into a mathematical model that can be used to predict future values or categorize brand-new data points. In machine learning, the aim of data modeling is to produce a model that accurately reflects the relationships and patterns in the data and can be applied to forecast the behavior of new, unforeseen data. A linear regression model is one common choice for this kind of statistical analysis (a sketch follows the list of steps below). Modeling entails a number of steps, such as:
Data preparation: Data cleaning, transformation, and scaling are all examples of
data preparation.
Feature selection: choosing the variables or characteristics that are most pertinent to
the model.
Model selection: It refers to picking the best model or algorithm for the given data
and issue.
Model training: entails setting the model's parameters and training the model using
the data.
Model evaluation: involves verifying the model's performance on a separate set of data and making any needed improvements. Once trained and evaluated, the model can be used to generate predictions on new, unseen data.
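The sketch below walks through these steps for a simple linear regression, using scikit-learn as one common library choice (an assumption, not a requirement of the method); the synthetic x/y data is invented purely for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic data: y is roughly 3x plus noise
X = np.arange(100).reshape(-1, 1)
y = 3 * X.ravel() + np.random.normal(0, 5, size=100)

# Data preparation: hold out part of the data for later evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Model selection and training
model = LinearRegression()
model.fit(X_train, y_train)

# Model evaluation on the held-out data
predictions = model.predict(X_test)
print("Mean squared error:", mean_squared_error(y_test, predictions))
print("Slope:", model.coef_[0], "Intercept:", model.intercept_)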
D. Data Visualization: The ability to explore the relationship between variables and spot patterns in the data makes data visualization an essential component of analyses such as linear regression. Useful visualizations include scatter plots, regression lines, residual plots, and diagnostic plots, all of which can be used in linear regression research.
E. Statistical Summaries: Finding patterns and relationships in the data can be aided
by computing summary statistics like mean, median, variance, and correlation
coefficients.
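For example, a few of these summaries can be computed with Python's built-in statistics module (statistics.correlation needs Python 3.10 or newer); the hours/scores numbers below are made up.
import statistics

hours = [5, 7, 6, 8, 10, 6, 7]
scores = [60, 72, 65, 80, 95, 66, 75]

print("Mean:", statistics.mean(hours))
print("Median:", statistics.median(hours))
print("Variance:", statistics.variance(hours))
print("Correlation:", statistics.correlation(hours, scores))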
F. Hypothesis Testing: Analysts can assess whether patterns or relationships they
detect are statistically significant or just the result of chance by testing hypotheses
about the data.
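As an illustration, a two-sample t-test (here using scipy, one common tool, as an assumption) can check whether the difference between two groups' means could just be chance; the group values below are invented.
from scipy import stats

group_a = [20, 35, 30, 35, 27]
group_b = [25, 32, 34, 20, 25]

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print("t statistic:", t_stat)
print("p value:", p_value)
if p_value < 0.05:
    print("The difference is statistically significant at the 0.05 level.")
else:
    print("The difference could plausibly be due to chance.")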
Basic Data Analyses – Mean/Median/Mode
There are many basic analyses we can run on features in data to get a sense of what
the data means. You've learned about some of them already in math or statistics
classes.
Mean: sum(lst) / len(lst)
Median: sorted(lst)[len(lst) // 2]
Mode: use mostCommonValue algorithm with a dictionary mapping values to counts.
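These shorthands can be turned into small runnable functions; the sketch below also handles the even-length median case (which the one-liner above glosses over) and spells out the dictionary-based mode.
def mean(lst):
    return sum(lst) / len(lst)

def median(lst):
    s = sorted(lst)
    mid = len(s) // 2
    if len(s) % 2 == 1:
        return s[mid]                    # odd length: the middle value
    return (s[mid - 1] + s[mid]) / 2     # even length: average the two middle values

def mode(lst):
    # mostCommonValue: map each value to its count, then take the most frequent
    counts = { }
    for item in lst:
        counts[item] = counts.get(item, 0) + 1
    return max(counts, key=counts.get)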
Calculating Probabilities
You'll also often want to calculate probabilities based on your data. In general, the
probability that a certain data type occurs in a dataset is the count of how often it
occurred, divided by the total number of data points.
Probability: lst.count(item) / len(lst)
Conditional probability (the probability of something occurring given another factor) is slightly harder to compute. But if you create a modified version of the list that contains only those elements with that factor, you can apply the same equation to it.
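In code, that looks roughly like the following; hasFactor is a hypothetical helper that tests whether an element has the conditioning factor.
def probability(lst, item):
    return lst.count(item) / len(lst)

def conditionalProbability(lst, item, hasFactor):
    # P(item | factor): restrict the list to elements with the factor first
    withFactor = [x for x in lst if hasFactor(x)]
    return withFactor.count(item) / len(withFactor)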
Calculating Joint Probabilities
What if we want to determine how often two features (likely across multiple columns
in the data) occur together in the same data point? This is a joint probability. It
requires slightly more complicated code to compute the result.
count = 0
for i in range(len(data)):
    if meetsCondition1(data[i]) and meetsCondition2(data[i]):
        count += 1
print(count / len(data))
Messy Data – Duplicates
You'll also sometimes need to clean up messy data to get a proper analysis. Some of
this is done in the data cleaning stage, but even cleaned data can have problems. One
potential issue is duplicate data, when the same data entry is included in the dataset
multiple times. To detect duplicate data, check if your data has a unique ID per data
point; then you can count how often each ID occurs to find the duplicates.
for id in dataIds:
    if dataIds.count(id) > 1:
        print("Duplicate:", id)
Messy Data – Missing Values
Analyses can also run into problems when there are missing values in some data entries. Data can be missing if some entries were not collected; this is especially likely in surveys with optional questions.
for i in range(len(data)):
    if data[i] == "": # can also check for 'n/a' or 'none'
        print("Missing row:", i)
To deal with missing data, ask yourself: how crucial is that data? If it's an important part of the analysis, all entries that are missing the needed data point should be removed, and the final report should state how much data was thrown out. If it's less important, you can substitute an 'N/A' class for categorical data, or skip the entry for numerical data. Either way, be careful about how missing data affects the analysis.
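A sketch of those two strategies, assuming a list-of-rows dataset and an illustrative column index, might look like this:
def dropRowsMissingValue(data, colIndex):
    # For crucial data: remove entries with a missing value and report how many
    kept = [row for row in data if row[colIndex] not in ("", "n/a", "none")]
    print("Dropped", len(data) - len(kept), "rows with missing values")
    return kept

def fillMissingCategory(data, colIndex):
    # For a less important categorical column: substitute an explicit 'N/A' class
    for row in data:
        if row[colIndex] in ("", "n/a", "none"):
            row[colIndex] = "N/A"
    return data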
Messy Data – Outliers
Finally, be careful about how outliers can affect the results of data analysis. Outliers are data points that are extremely different from the rest of the dataset. For example, in a dataset of daily time reports, most people might report 5-15 hours, but one person might report 100 hours. The easiest way to detect outliers is to use visualizations, which we'll discuss later in the lecture. Outliers should be removed from some calculations (especially means) to avoid skewing the results. Be careful, though: some outlier-like data may not actually be an outlier and may reveal important information.
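Besides visual inspection, a common numeric rule of thumb (not the visualization-based approach described above) flags values far outside the interquartile range; the 1.5 multiplier below is conventional rather than fixed, and the quartile indexing is a rough approximation.
def findOutliers(values):
    # Flag values more than 1.5 * IQR outside the middle half of the data
    s = sorted(values)
    q1 = s[len(s) // 4]
    q3 = s[(3 * len(s)) // 4]
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

print(findOutliers([5, 8, 10, 12, 9, 7, 11, 100]))  # reports [100]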
Example: Analyzing Ice Cream Data
We've now cleaned the ice cream dataset from last week. Let's analyze the data to
answer this question: which ice cream flavors do people like most?
Here's a bit of code to load and represent the dataset:
import csv
def readData(filename):
    f = open(filename, "r")
    reader = csv.reader(f)
    data = [ ]
    for row in reader:
        data.append(row)
    return data
Example: Total Preferences
First: how many times does each flavor occur in any of a person's preferences?
def getIceCreamCounts(data):
    iceCreamDict = { }
    for i in range(1, len(data)): # skip header
        for flavor in data[i]:
            if flavor not in iceCreamDict:
                iceCreamDict[flavor] = 0
            iceCreamDict[flavor] += 1
    return iceCreamDict
What is data visualization and why is it important?
Data visualization is the act of taking information (data) and placing it into a visual context, such as a map or graph. Data visualizations make big and small data easier for the human being to understand, and visualization also makes it easier to detect patterns, trends, and outliers in groups of data.
Data Science and data visualization are not two different entities; they are bound to each other. Data visualization is a subset of Data Science, which is not a single process, method, or workflow. Visualization is used for two primary purposes: exploration and presentation.
Data visualization's primary goal is to use visuals to convey information in a more effective and clear way. When information is represented in visual form, the human brain is better equipped to spot links and patterns and to understand them. We can analyze data and spot patterns with graphs in a way that is not possible with any other method. Various graphic formats, such as bar graphs, pie charts, tables, and diagrams, can be used to visualize data. Technology has advanced quickly because of the usage of IT tools, and data analysis has also benefited from the expanding use of visualization, that is, making information more palatable in a visual manner. Visualization is founded on people's quick perception of visual forms: the ordinary person's brain rapidly memorizes visual representations and the information it receives from the world through the visual sense. Finding methods and techniques to improve daily human existence is a growing area of scientific interest, and the rule that guides competition among data visualization tools on the market is that they must be "closer" to users, i.e., more user friendly.
Data visualization is an effective tool for comprehending and communicating complicated data. It entails displaying data in a graphical or pictorial format to make it simpler to understand and interpret. Data visualization is becoming more crucial than ever due to the growing availability of data in a variety of fields, including business, social sciences, humanities, sports, environmental sciences, and healthcare. This chapter offers a thorough introduction to data visualization tools and techniques and their uses in many fields. By surveying the various kinds of data visualization tools and techniques available, it aims to emphasize the value of data visualization in clearly communicating and analyzing data. Whether you are a learner or a seasoned professional, this will be a useful resource for enhancing your comprehension of data visualization and its applications.
Overview of Data Visualization Methods
Figure 1: Steps in data visualization
Visual Variables Show Differences
In visualization, we use different visual variables to demonstrate the differences
between categories or data points. Which visual variable you use depends on the type
of the data you're representing – categorical, ordinal, or numerical.
There are seven visual variables in total; the ones most relevant here are position, size, value, color, and shape.
Visual Variable Options – Numerical
If you want to encode numerical data, you basically have two options: position and
size.
Position: where something is located in the chart, as in an x,y position. Positions to
the upper-right tend to be correlated with larger numbers.
Size: how large a visual element is, or how long it is in the chart. The larger the size,
the bigger the number.
Visual Variable Options – Ordinal
For ordinal data, you can use position and size, but you can also use value.
Value: the shade or lightness of a color in the chart (for example, from light to dark, 0 to 255 in grayscale). Shades are ordered based on the ordinal comparison.
Visual Variable Options – Categorical
Categorical data can be presented using position, size, and value, but it also adds two
other options: color and shape.
Color: each category can be assigned a different color (red for Cat1, blue for Cat2,
pink for Cat3).
Shape: each category can be assigned a different shape (square for Cat1, circle for
Cat2, triangle for Cat3).
Select the right type of graph
Data visualization improves communication
Overview of Data Visualization Tools
Data scientists, and the scientific community that employs this knowledge in practice, are very interested in the constantly expanding subject of data visualization, since it enables them to make deeper, quicker, and more effective observations of their data. This supports decisions that are more effective, faster, and more helpful for their organizations. It is therefore crucial to present the data in a creative way, using simple tools like colors, elements, and dimensions, as well as analyses that affect how representative the data is. Numerous visualization technologies have been created and are employed, because data visualization, particularly with regard to large data, necessitates not only an understanding of design and data but also a foundational understanding of statistics.
What are the best data visualization software of 2019?
Sisense
Looker
Periscope Data
Zoho Analytics
Tableau
Domo
Microsoft Power BI
QlikView
What is data discovery and visualization?
What are data visualization tools?
Is Excel a data visualization tool?
How do you create good data visualization?
What kind of visual communication do you want to create?
Four Types of Data Visualizations
Data Visualization
Some basic principles (adapted from Tufte 2009)
Principle 1: The chart should tell a story
Principle 2: The chart should have graphical integrity
Examples of the “lie factor”
Principle 3: The chart should minimize graphical complexity
Generally, the simpler the better…
Sometimes a table is better
When is a table better than a chart?
For a few data points, a table can do just as well…
Choosing a Visualization
There are dozens of different visualizations you can use on data. In order to choose
the best visualization for the job, consider how many dimensions of data you need to
visualize. We'll go over three options: one-dimensional data, two-dimensional data,
and three-dimensional data.
One-Dimensional Data
A one-dimensional visualization only visualizes a single feature of the dataset. For
example:
"I want to know how many of each product type are in my data"
"I want to know the proportion of people who have cats in my data"
To visualize one-dimensional data, use a histogram or a pie chart. Histograms show counts: for categorical or ordinal data, show the count for each type of data using bars (length = count); for numerical data, bucket the values across the distribution. A histogram shows you the shape and spread of numerical data. A pie chart shows the proportion of the data that falls into each category of the feature; the proportions should always add up to 100%. It doesn't make sense to use a pie chart on a numerical feature unless you use bucketing.
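As a quick sketch (matplotlib itself is covered in more detail later in these notes), a pie chart for a one-dimensional categorical feature might look like this; the pet-ownership numbers are made up.
import matplotlib.pyplot as plt

labels = ["Cats", "Dogs", "No pets"]
counts = [30, 45, 25]

fig, ax = plt.subplots()
ax.pie(counts, labels=labels, autopct="%1.0f%%")  # label each slice's share
ax.set_title("Pet ownership in the survey")
plt.show()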
Two-Dimensional Data
A two-dimensional visualization shows how two features in the dataset relate to each
other. For example:
"I want to know the cost of each product category that we have"
"I want to know the weight of the animals that people own, by pet species"
"I want to know how the size of the product affects the cost of shipping"
To visualize two-dimensional data, use a bar chart, a scatter plot, a line plot, or a box-
and-whiskers plot. A bar chart compares the average results of a numerical feature
across the categories of a categorical feature. You can add error bars on a bar chart to
see if two categories are significantly different. A box-and-whisker plot also compares
averages of a numerical feature across categories of a categorical feature, but it
visually provides summary statistics across the range of the data. This plot is
especially useful for data that is not normally distributed around the average. A scatter
plot compares two numerical features by plotting every data point as a dot on the
graph (with the first feature as the x axis and the second as the y axis). Scatter plots
are useful for observing trends in data. A line plot uses a numerical feature that
specifically measures time on the x axis, and a different numerical feature on the y
axis. Because there's generally one data point per time stamp, the points are connected
using lines, to show a trend over time.
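A minimal two-dimensional sketch, with invented product size and shipping cost values, shows a scatter plot of one numerical feature against another:
import matplotlib.pyplot as plt

sizes = [1, 2, 3, 4, 5, 6, 7, 8]
shippingCosts = [2.5, 3.0, 4.1, 4.8, 6.0, 6.2, 7.5, 8.1]

fig, ax = plt.subplots()
ax.scatter(sizes, shippingCosts)
ax.set_xlabel("Product size")
ax.set_ylabel("Shipping cost")
ax.set_title("Shipping cost vs. product size")
plt.show()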
Three-Dimensional Data
A three-dimensional visualization tries to show the relationship between three
different features at the same time. For example:
"I want to know the cost and the development time by product category"
"I want to know the weight of the animals that people own and how much
they cost, by pet species"
"I want to know how the size of the product and the manufacturing location
affects the cost of shipping"
To visualize three-dimensional data, use a colored scatter plot, a scatter plot matrix, or
a bubble plot. A colored scatter plot lets you compare two numerical features across a set of categories in a categorical feature. Each category has its values plotted in a scatter plot, and each category gets a different color. This plot makes it easy to tell which categories have different trends. A bubble plot can be used to compare three numerical features. One feature is the x axis, and another is the y axis. The third feature is used to specify the size of the data points. Bubble plots can get confusing when there are a lot of data points, but they are useful for sparse data. A scatter plot matrix can be used to compare three (or more) numerical features. Each column corresponds to one of the three features, and each row corresponds to one of the three features. The graph shown in each position is then the scatter plot between the row's feature and the column's feature. Note that graphs on the diagonal are histograms, as they compare a feature to itself.
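For example, a bubble plot can be sketched in matplotlib by passing the third feature as the marker size; every number below is invented for illustration.
import matplotlib.pyplot as plt

cost = [10, 20, 30, 40, 50]
devTime = [3, 5, 4, 8, 7]
unitsSold = [100, 400, 250, 900, 600]   # third feature, shown as bubble size

fig, ax = plt.subplots()
ax.scatter(cost, devTime, s=unitsSold, alpha=0.5)   # s= sets the bubble size
ax.set_xlabel("Cost")
ax.set_ylabel("Development time")
ax.set_title("Cost vs. development time (bubble size = units sold)")
plt.show()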
One of the golden rules of data visualization is…..
Coding Visualizations with Matplotlib
The matplotlib library can be used to generate interesting visualizations in Python.
Matplotlib Core Ideas
For every visualization you make in Matplotlib, you'll need to set up a figure and axis.
This is generally done with the code:
fig, ax = plt.subplots()
You can then directly add visualizations to the axis by calling different methods on
ax, with the data as the parameter. Let's look at histograms and bar charts specifically.
Once you're done setting up the visualization, call plt.show() to display the chart.
Histogram Example - Numerical
import matplotlib.pyplot as plt
import random
# Generate a normal distribution
data = []
for i in range(100000):
    data.append(random.normalvariate(0, 1))
# Set up the plot
fig, ax = plt.subplots()
# Set # of bins with the 'bins' arg
ax.hist(data, bins=20)
plt.show()
Bar Chart Example - Categorical
Let's use our ice cream data to make a nice categorical histogram (which will be
formed using bar charts). We'll graph the counts of the three classic flavors: vanilla,
chocolate, and strawberry. First, process the data to get those counts:
data = readData("icecream.csv")
d = getIceCreamCounts(data)
flavors = [ "vanilla", "chocolate", "strawberry" ]
counts = [ d["vanilla"], d["chocolate"], d["strawberry"] ]
import matplotlib.pyplot as plt
# Set up the plot
fig, ax = plt.subplots()
# Set up the bars
ind = range(len(counts))
rects1 = ax.bar(ind, counts)
# Add labels
ax.set_ylabel('Counts')
ax.set_title('Counts of Three Flavors')
ax.set_xticks(ind)
ax.set_xticklabels(flavors)
plt.show()
Advanced Bar Chart
We can make our visualizations more advanced by adding side-by-side bars, and
using the other matplotlib features to add data to the chart. For example, let's write a
bit of matplotlib code to compare averages and standard deviations across an arbitrary
data set.
menMeans = [20, 35, 30, 35, 27]
menStd = [2, 3, 4, 1, 2]
womenMeans = [25, 32, 34, 20, 25]
womenStd = [3, 5, 2, 3, 3]
# From matplotlib website
import matplotlib.pyplot as plt
import numpy as np
fig, ax = plt.subplots()
# Using numpy arrays lets us do useful operations
mensInd = np.arange(5)
width = 0.35 # the width of the bars
womensInd = mensInd + width
rects1 = ax.bar(mensInd, menMeans, width,
                color='r', yerr=menStd)
rects2 = ax.bar(womensInd, womenMeans, width,
                color='b', yerr=womenStd)
# Labels and titles
ax.set_ylabel('Scores')
ax.set_title('Scores by group and gender')
ax.set_xticks(mensInd + width / 2)
ax.set_xticklabels(['G1', 'G2', 'G3', 'G4', 'G5'])
ax.legend([rects1[0], rects2[0]], ['Men', 'Women'])
plt.show()