Data Analysis

Exploratory data analysis (EDA) involves visually analyzing datasets to understand their main characteristics without pre-existing hypotheses. EDA was developed by John Tukey to encourage exploring data graphically rather than just testing hypotheses. Descriptive statistics and univariate analysis are important techniques in EDA to summarize single variables through measures of distribution, central tendency, and dispersion.

Uploaded by

Hillary Murunga

MODULE -III

EXPLORATORY ANALYSIS
Exploratory data analysis:
In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their
main characteristics, often with visual methods. A statistical model can be used or not, but primarily
EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.
Exploratory data analysis is a concept developed by John Tukey (1977) that offers a new
perspective on statistics. Tukey's idea was that in traditional statistics the data was not being explored
graphically; it was just being used to test hypotheses. The first attempt to develop a tool was made at
Stanford, in a project called PRIM-9. The tool was able to visualize data in nine dimensions, and
it was therefore able to provide a multivariate perspective of the data.

Today, exploratory data analysis is a must and has been included in the big data analytics life
cycle. The ability to find insight and communicate it effectively in an organization is fueled
by strong EDA capabilities.

Based on Tukey's ideas, Bell Labs developed the S programming language in order to provide an
interactive interface for doing statistics. The idea of S was to provide extensive graphical capabilities
with an easy-to-use language. In today's world, in the context of Big Data, R, which is based on
the S programming language, is the most popular software for analytics.

[Figure: Top Analytics, Data Mining, Data Science software used, 2015. Bar chart with a horizontal
axis from 0% to 50% and one bar each for R, RapidMiner, SQL, Python, Excel, KNIME, Hadoop, Tableau,
SAS base and Spark.]
The following program demonstrates the use of exploratory data analysis.
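
A minimal sketch of such a program is given here in Python, using only the standard library. It is an illustration written for these notes, not the original program: the data are the eight quiz scores used as an example in this module, and the function name five_number_summary is an assumption. It prints Tukey's five-number summary and a crude text plot, which is often enough for a first look at a single variable.

```python
import statistics

# Eight quiz scores, reused from the worked example in this module.
scores = [15, 20, 21, 20, 36, 15, 25, 15]

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, q2, q3 = statistics.quantiles(ordered, n=4)
    return ordered[0], q1, q2, q3, ordered[-1]

summary = five_number_summary(scores)
for label, value in zip(["min", "Q1", "median", "Q3", "max"], summary):
    print(f"{label:>6}: {value}")

# A crude text plot: one '*' per observation of each distinct value.
for value in sorted(set(scores)):
    print(f"{value:>3} | {'*' * scores.count(value)}")
```

The text plot makes the single high score of 36 stand out at once, which is the kind of feature EDA is meant to surface before any formal test is run.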

Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore the data,
and possibly formulate hypotheses that could lead to new data collection and experiments. EDA is
different from initial data analysis (IDA), which focuses more narrowly on checking assumptions
required for model fitting and hypothesis testing, and handling missing values and making
transformations of variables as needed. EDA encompasses IDA.

Tukey held that too much emphasis in statistics was placed on statistical
hypothesis testing (confirmatory data analysis); more emphasis needed to be placed on using data to
suggest hypotheses to test. In particular, he held that confusing the two types of analyses and employing
them on the same set of data can lead to systematic bias owing to the issues inherent in testing
hypotheses suggested by the data.

The objectives of EDA are to:

Suggest hypotheses about the causes of observed phenomena


Assess assumptions on which statistical inference will be based
Support the selection of appropriate statistical tools and techniques
Provide a basis for further data collection through surveys or experiments

Many EDA techniques have been adopted into data mining, as well as into big data analytics. They are
also being taught to young students as a way to introduce them to statistical thinking.

DESCRIPTIVE ANALYSIS:
Descriptive statistics are used to describe the basic features of the data in a study. They provide simple
summaries about the sample and the measures. Together with simple graphics analysis, they form the
basis of virtually every quantitative analysis of data.
Descriptive statistics are typically distinguished from inferential statistics. With descriptive statistics you
are simply describing what is or what the data shows. With inferential statistics, you are trying to reach
conclusions that extend beyond the immediate data alone. For instance, we use inferential statistics to try
to infer from the sample data what the population might think. Or, we use inferential statistics to make
judgments of the probability that an observed difference between groups is a dependable one or one that
might have happened by chance in this study. Thus, we use inferential statistics to make inferences from
our data to more general conditions; we use descriptive statistics simply to describe what's going on in
our data.

Descriptive Statistics are used to present quantitative descriptions in a manageable form. In a research
study we may have lots of measures. Or we may measure a large number of people on any measure.
Descriptive statistics help us to simplify large amounts of data in a sensible way. Each descriptive
statistic reduces lots of data into a simpler summary. For instance, consider a simple number used to
summarize how well a batter is performing in baseball, the batting average. This single number is simply
the number of hits divided by the number of times at bat (reported to three significant digits). A batter
who is hitting .333 is getting a hit one time in every three at bats. One batting .250 is hitting one time in
four. The single number describes a large number of discrete events. Or, consider the scourge of many
students, the Grade Point Average (GPA). This single number describes the general performance of a
student across a potentially wide range of course experiences.

Every time you try to describe a large set of observations with a single indicator you run the risk of
distorting the original data or losing important detail. The batting average doesn't tell you whether the
batter is hitting home runs or singles. It doesn't tell whether she's been in a slump or on a streak. The
GPA doesn't tell you whether the student was in difficult courses or easy ones, or whether they were
courses in their major field or in other disciplines. Even given these limitations, descriptive statistics
provide a powerful summary that may enable comparisons across people or other units.

Univariate Analysis:
Univariate analysis involves the examination across cases of one variable at a time. There are three
major characteristics of a single variable that we tend to look at:
the distribution
the central tendency
the dispersion
In most situations, we would describe all three of these characteristics for each of the variables in our
study.
The Distribution: The distribution is a summary of the frequency of individual values or ranges of
values for a variable. The simplest distribution would list every value of a variable and the number of
persons who had each value. For instance, a typical way to describe the distribution of college students
is by year in college, listing the number or percent of students at each of the four years. Or, we describe
gender by listing the number or percent of males and females. In these cases, the variable has few
enough values that we can list each one and summarize how many sample cases had the value. But what
do we do for a variable like income or GPA? With these variables there can be a large number of
possible values, with relatively few people having each one. In this case, we group the raw scores into
categories according to ranges of values. For instance, we might look at GPA according to the letter
grade ranges. Or, we might group income into four or five ranges of income values.
Frequency distribution table.
One of the most common ways to describe a single variable is with a frequency distribution. Depending
on the particular variable, all of the data values may be represented, or you may group the values into
categories first (e.g., with age, price, or temperature variables, it would usually not be sensible to
determine the frequencies for each value; rather, the values are grouped into ranges and the frequencies
determined). Frequency distributions can be depicted in two ways, as a table or as a graph. Table 1
shows an age frequency distribution with five categories of age ranges defined. The same frequency
distribution can be depicted in a graph as shown in Figure 1. This type of graph is often referred to as a
histogram or bar chart.
Frequency distribution bar chart.
Distributions may also be displayed using percentages. For example, you could use percentages to
describe the:
percentage of people in different income levels
percentage of people in different age ranges
percentage of people in different ranges of standardized test scores
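
The grouping of values into ranges, with counts and percentages, can be sketched in a few lines of Python. This is only an illustration: the ages are made up, and the function name frequency_table and the range width of 10 years are assumptions chosen for the example.

```python
from collections import Counter

# Hypothetical ages, made up for illustration.
ages = [23, 35, 41, 19, 52, 38, 27, 64, 45, 31]

def frequency_table(values, width):
    """Group values into ranges of the given width and count each range."""
    counts = Counter((v // width) * width for v in values)
    total = len(values)
    table = []
    for start in sorted(counts):
        label = f"{start}-{start + width - 1}"
        percent = 100 * counts[start] / total
        table.append((label, counts[start], percent))
    return table

table = frequency_table(ages, width=10)
for label, count, percent in table:
    # The row of '#' characters is a text version of the bar chart.
    print(f"{label:>7} {count:>3} {percent:5.1f}%  {'#' * count}")
```

The same table serves both presentations described above: the count and percent columns are the frequency table, and the row of '#' marks is a rough bar chart.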
Central Tendency: The central tendency of a distribution is an estimate of the "center" of a distribution
of values. There are three major types of estimates of central tendency:
Mean
Median
Mode
The Mean or average is probably the most commonly used method of describing central tendency. To
compute the mean all you do is add up all the values and divide by the number of values. For example,
the mean or average quiz score is determined by summing all the scores and dividing by the number of
students taking the exam. For example, consider the test score values:
15, 20, 21, 20, 36, 15, 25, 15

The sum of these 8 values is 167, so the mean is 167/8 = 20.875.

The Median is the score found at the exact middle of the set of values. One way to compute the median
is to list all scores in numerical order, and then locate the score in the center of the sample. For example,
if there are 500 scores in the list, the median falls between score #250 and score #251. If we order the
8 scores shown above, we would get:
15, 15, 15, 20, 20, 21, 25, 36

There are 8 scores and score #4 and #5 represent the halfway point. Since both of these scores are 20,
the median is 20. If the two middle scores had different values, you would have to interpolate to
determine the median.
The mode is the most frequently occurring value in the set of scores. To determine the mode, you might
again order the scores as shown above, and then count each one. The most frequently occurring value is
the mode. In our example, the value 15 occurs three times and is the mode. In some distributions there
is more than one modal value. For instance, in a bimodal distribution there are two values that occur
most frequently.
Notice that for the same set of 8 scores we got three different values -- 20.875, 20, and 15 -- for the
mean, median and mode respectively. If the distribution is truly normal (i.e., bell-shaped), the mean,
median and mode are all equal to each other.
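
The three estimates can be checked with Python's standard statistics module. This is a small sketch for verification only; the variable names are assumptions.

```python
import statistics

scores = [15, 20, 21, 20, 36, 15, 25, 15]

mean = statistics.mean(scores)      # sum of the values divided by their count
median = statistics.median(scores)  # average of the two middle values for 8 scores
mode = statistics.mode(scores)      # most frequently occurring value

print(mean, median, mode)
```

For a distribution with several modal values, statistics.multimode returns all of them as a list.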
Dispersion: Dispersion refers to the spread of the values around the central tendency. There are two
common measures of dispersion, the range and the standard deviation. The range is simply the highest
value minus the lowest value. In our example distribution, the high value is 36 and the low is 15, so the
range is 36 - 15 = 21.

The Standard Deviation is a more accurate and detailed estimate of dispersion because an outlier can
greatly exaggerate the range (as was true in this example, where the single outlier value of 36 stands
apart from the rest of the values). The Standard Deviation shows the relation that the set of scores has
to the mean of the sample. Again let's take the set of scores:
15, 20, 21, 20, 36, 15, 25, 15
To compute the standard deviation, we first find the distance between each value and the mean.
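
The remaining steps of the calculation can be sketched in Python as follows. The sketch assumes the sample form of the standard deviation, which divides the sum of squared distances by n - 1; the text does not say which form is intended, so treat that choice and the variable names as assumptions.

```python
import math
import statistics

scores = [15, 20, 21, 20, 36, 15, 25, 15]

mean = sum(scores) / len(scores)              # 20.875
deviations = [x - mean for x in scores]       # distance of each value from the mean
squared = [d ** 2 for d in deviations]        # squaring removes the signs
sum_of_squares = sum(squared)
variance = sum_of_squares / (len(scores) - 1) # sample variance (divide by n - 1)
std_dev = math.sqrt(variance)                 # back to the original units

value_range = max(scores) - min(scores)

print(f"range = {value_range}, variance = {variance}, std dev = {std_dev:.4f}")
# The hand calculation should agree with the library function.
print(statistics.stdev(scores))
```

Dividing by n instead of n - 1 gives the population form, which statistics.pstdev computes.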
Comparative analysis:
Comparative analysis is also called comparison analysis. Use comparison analysis to measure the
financial relationships between variables over two or more reporting periods. Businesses use
comparative analysis as a way to identify their competitive positions and operating results over a defined
period. Larger organizations often have the resources to perform financial comparative analysis
monthly or quarterly, but it is recommended to perform an annual financial comparison analysis at a
minimum.
Financial Comparatives:
Financial statements outline the financial comparatives, which are the variables
defining operating activities, investing activities and financing activities for a company. Analysts
assess company financial statements using percentages, ratios and amounts when making financial
comparative analysis. This information is the business intelligence that decision makers use for
determining future business decisions. A financial comparison analysis may also be performed to
determine company profitability and stability. For example, management of a new venture may make
a financial comparison analysis periodically to evaluate company performance. Identifying losses
early and redefining processes within a shorter period is preferable to facing unforeseen annual
losses.

Comparative Format:

The comparative format for comparative analysis in accounting is a side by side view of the financial
comparatives in the financial statements. Comparative analysis accounting identifies an organization's
financial performance. For example, income statements identify financial comparables such
as company income, expenses, and profit over a period of time. A comparison analysis report identifies
where a business meets or exceeds budgets. Potential lenders will also utilize this information to
determine a company's credit limit.
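
A side by side comparison of this kind can be sketched in Python. All figures, the period labels and the function names compare and with_profit are hypothetical, made up to show the layout and the amount and percentage changes an analyst would read off.

```python
# Hypothetical figures for two reporting periods, made up for illustration.
income_statement = {
    "income":   {"2022": 500_000, "2023": 575_000},
    "expenses": {"2022": 420_000, "2023": 460_000},
}

def with_profit(statement):
    """Add a derived profit line: income minus expenses for each period."""
    periods = statement["income"].keys()
    profit = {p: statement["income"][p] - statement["expenses"][p] for p in periods}
    return {**statement, "profit": profit}

def compare(statement, earlier, later):
    """Return earlier amount, later amount, change and percent change per line item."""
    rows = {}
    for item, amounts in statement.items():
        change = amounts[later] - amounts[earlier]
        percent = 100 * change / amounts[earlier]
        rows[item] = (amounts[earlier], amounts[later], change, percent)
    return rows

rows = compare(with_profit(income_statement), "2022", "2023")
print(f"{'item':<10}{'2022':>10}{'2023':>10}{'change':>10}{'% change':>10}")
for item, (old, new, change, percent) in rows.items():
    print(f"{item:<10}{old:>10,}{new:>10,}{change:>10,}{percent:>9.1f}%")
```

Reading across a row shows the amount and the relative change for one comparative; reading down a column shows one period's statement.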

Comparative Analysis in Business:


Financial statements play a pivotal role in comparative analysis in business. By analyzing financial
comparatives, businesses are able to pinpoint significant trends and project future trends with the
identification of considerable or abnormal changes. Business comparative analysis against others in their
industry allows a company to evaluate industry results and gauge overall company performance.
Different factors such as political events, economic changes, or industry changes influence the changes
in trends. Companies may often document significant events in their financial statements that have a
major influence on a change in trends.

CLUSTERING:
Cluster Analysis

A cluster is a group of objects that belong to the same class. In other words, similar objects are grouped
in one cluster and dissimilar objects are grouped in another cluster.
What is Clustering?
Clustering is the process of making a group of abstract objects into classes of similar objects.

Points to Remember

A cluster of data objects can be treated as one group.

While doing cluster analysis, we first partition the set of data into groups based on data
similarity and then assign the labels to the groups.

The main advantage of clustering over classification is that it is adaptable to changes and helps
single out useful features that distinguish different groups.

Applications of Cluster Analysis

Clustering analysis is broadly used in many applications such as market research, pattern
recognition, data analysis, and image processing.

Clustering can also help marketers discover distinct groups in their customer base. And they can
characterize their customer groups based on the purchasing patterns.

In the field of biology, it can be used to derive plant and animal taxonomies, categorize genes
with similar functionalities and gain insight into structures inherent to populations.

Clustering also helps in identification of areas of similar land use in an earth observation
database. It also helps in the identification of groups of houses in a city according to house type,
value, and geographic location.

Clustering also helps in classifying documents on the web for information discovery.

Clustering is also used in outlier detection applications such as detection of credit card fraud.

As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of
data to observe characteristics of each cluster.

Requirements of Clustering in Data Mining

The following points throw light on why clustering is required in data mining -
• Scalability - We need highly scalable clustering algorithms to deal with large databases.
• Ability to deal with different kinds of attributes - Algorithms should be capable of being applied to
any kind of data such as interval-based (numerical) data, categorical, and binary data.
• Discovery of clusters with arbitrary shape - The clustering algorithm should be capable of
detecting clusters of arbitrary shape. It should not be bounded to only distance measures that
tend to find spherical clusters of small sizes.
• High dimensionality - The clustering algorithm should be able to handle not only low
dimensional data but also high dimensional space.
• Ability to deal with noisy data - Databases contain noisy, missing or erroneous data. Some
algorithms are sensitive to such data and may lead to poor quality clusters.
• Interpretability - The clustering results should be interpretable, comprehensible, and usable.
Clustering Methods:

Clustering methods can be classified into the following categories -
Partitioning Method

Hierarchical Method

Density-based Method

Grid-Based Method

Model-Based Method

Constraint-based Method

Partitioning Method
Suppose we are given a database of 'n' objects and the partitioning method constructs 'k' partitions of
the data. Each partition will represent a cluster and k ≤ n. It means that it will classify the data into k
groups, which satisfy the following requirements -
• Each group contains at least one object.
• Each object must belong to exactly one group.
Points to remember

For a given number of partitions (say k), the partitioning method will create an initial partitioning.
Then it uses the iterative relocation technique to improve the partitioning by moving objects
from one group to other.
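
One well-known partitioning method that follows this pattern is k-means. The sketch below is a minimal Python version under stated assumptions: the points are made up, the first k points serve as the initial centers, and the function name k_means is chosen for the example. It is not a definitive implementation, and unlike the requirement above it does not guarantee that every group stays non-empty on arbitrary data.

```python
import math

# Hypothetical 2-D points, made up for illustration.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

def k_means(points, k, iterations=100):
    """Partition points into k groups by iterative relocation."""
    centers = list(points[:k])  # initial partitioning: first k points (an assumption)
    assignment = None
    for _ in range(iterations):
        # Relocation step: move every object to the group with the nearest center.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # no object changed group, so the partitioning is stable
        assignment = new_assignment
        # Update step: each center becomes the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a group lost all its objects
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assignment, centers

assignment, centers = k_means(points, k=2)
print(assignment)
print(centers)
```

Each object ends up in exactly one group because the assignment list holds a single group index per point.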

Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects. We can classify
hierarchical methods on the basis of how the hierarchical decomposition is formed. There are two
approaches here -

Agglomerative Approach

Divisive Approach

Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we start with each object forming a
separate group. It keeps on merging the objects or groups that are close to one another. It keeps on doing
so until all of the groups are merged into one or until the termination condition holds.
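
The bottom-up merging can be sketched in Python as follows. The sketch makes two assumptions that the text leaves open: closeness between groups is measured by single linkage (the distance between their two closest members), and the termination condition is reaching a requested number of groups. The points and the function name agglomerative are also made up for the example.

```python
import math

# Hypothetical 2-D points, made up for illustration.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

def agglomerative(points, target_clusters):
    """Merge the two closest groups repeatedly until target_clusters remain."""
    # Start with each object forming a separate group.
    clusters = [[p] for p in points]

    def linkage(a, b):
        # Single linkage: distance between the two closest members (an assumption).
        return min(math.dist(p, q) for p in a for q in b)

    # Termination condition (an assumption): stop at the requested number of groups.
    while len(clusters) > target_clusters:
        # Find the pair of groups that are closest to one another.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # merge group j into group i
        del clusters[j]                          # j > i, so index i is unaffected
    return clusters

clusters = agglomerative(points, target_clusters=2)
for group in clusters:
    print(sorted(group))
```

Setting target_clusters to 1 merges everything into a single group, which corresponds to running the approach to its end.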
Divisive Approach
This approach is also known as the top-down approach. In this, we start with all of the objects in the
same cluster. In the continuous iteration, a cluster is split up into smaller clusters. It is done until each
object is in one cluster or the termination condition holds. This method is rigid, i.e., once a merging or
splitting is done, it can never be undone.

Approaches to Improve Quality of Hierarchical Clustering


Here are the two approaches that are used to improve the quality of hierarchical clustering -
• Perform careful analysis of object linkages at each hierarchical partitioning.
• Integrate hierarchical agglomeration by first using a hierarchical agglomerative algorithm to
group objects into micro-clusters, and then performing macro-clustering on the micro-clusters.

Density-based Method
This method is based on the notion of density. The basic idea is to continue growing the given cluster
as long as the density in the neighborhood exceeds some threshold, i.e., for each data point within a
given cluster, the neighborhood of a given radius has to contain at least a minimum number of points.
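
The idea of growing a cluster while the neighborhood stays dense can be sketched in Python in the style of the DBSCAN algorithm. The points, the radius, the minimum number of points and the function name density_clusters are assumptions made for the example; a point counts as its own neighbor, and points that no dense neighborhood reaches are labeled -1 as noise.

```python
import math

# Hypothetical 2-D points, made up for illustration; the last one is isolated.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5),
          (8.5, 9.0), (5.0, 20.0)]

def density_clusters(points, radius, min_points):
    """Label each point with a cluster id, or -1 for noise (DBSCAN-style)."""
    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= radius]

    labels = [None] * len(points)        # None means not visited yet
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_points:
            labels[i] = -1               # too sparse: provisionally noise
            continue
        cluster_id += 1                  # dense neighborhood: start a new cluster
        labels[i] = cluster_id
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id   # border point reached from a dense point
            if labels[j] is not None:
                continue                 # already assigned, do not expand again
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_points:
                queue.extend(j_neighbors)  # j is dense too, so keep growing
    return labels

labels = density_clusters(points, radius=1.5, min_points=3)
print(labels)
```

Because growth follows dense neighborhoods rather than distance to a center, clusters found this way need not be spherical.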
Grid-based Method
In this, the objects together form a grid. The object space is quantized into a finite number of cells that
form a grid structure.
Advantages

• The major advantage of this method is fast processing time.
• It is dependent only on the number of cells in each dimension in the quantized space.
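
The quantization step can be sketched in Python as below. The points, the cell size, the density threshold and the function names grid_cells and dense_cells are assumptions for illustration. After one pass that assigns points to cells, further work operates on the cells only, which is where the dependence on the number of cells comes from.

```python
from collections import Counter

# Hypothetical 2-D points, made up for illustration.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5),
          (8.5, 9.0), (5.0, 20.0)]

def grid_cells(points, cell_size):
    """Quantize the object space: count the points falling in each grid cell."""
    return Counter(
        tuple(int(coord // cell_size) for coord in p) for p in points
    )

def dense_cells(points, cell_size, min_count):
    """Keep only the cells that hold at least min_count points."""
    cells = grid_cells(points, cell_size)
    return {cell: n for cell, n in cells.items() if n >= min_count}

cells = grid_cells(points, cell_size=5)
dense = dense_cells(points, cell_size=5, min_count=2)
print(cells)
print(dense)
```

A full grid-based method would go on to join adjacent dense cells into clusters; that step is left out of this sketch.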

Model-based methods
In this method, a model is hypothesized for each cluster to find the best fit of data for a given model.
This method locates the clusters by clustering the density function. It reflects the spatial distribution of
the data points.

This method also provides a way to automatically determine the number of clusters based on standard
statistics, taking outlier or noise into account. It therefore yields robust clustering methods.
Constraint-based Method
In this method, the clustering is performed by the incorporation of user or application-oriented
constraints. A constraint refers to the user expectation or the properties of desired clustering results.
Constraints provide us with an interactive way of communication with the clustering process.
Constraints can be specified by the user or the application requirement.
ASSOCIATION:

As the name Association Rule Mining suggests, association rules are simple If/Then statements that help
discover relationships between seemingly independent data in relational databases or other data
repositories.
Most machine learning algorithms work with numeric datasets and hence tend to be mathematical.
However, association rule mining is suitable for non-numeric, categorical data and requires just a little
bit more than simple counting.

Association rule mining is a procedure which aims to observe frequently occurring patterns,
correlations, or associations from datasets found in various kinds of databases such as relational
databases, transactional databases, and other forms of repositories.

An association rule has two parts:
• an antecedent (if) and
• a consequent (then).

An antecedent is something that's found in data, and a consequent is an item that is found in
combination with the antecedent. Have a look at this rule for instance:
"If a customer buys bread, he's 70% likely to buy milk."
In the above association rule, bread is the antecedent and milk is the consequent. Simply put, it can be
understood as a retail store's association rule to target their customers better. If the above rule is a result
of thorough analysis of some data sets, it can be used to not only improve customer service but also
improve the company's revenue.

Association rules are created by thoroughly analyzing data and looking for frequent if/then patterns.
Then, depending on the following two parameters, the important relationships are observed:

Support: Support indicates how frequently the if/then relationship appears in the database, i.e., the
share of all transactions that contain both the antecedent and the consequent.
Confidence: Confidence tells how often the relationship has been found to be true, i.e., the share of
transactions containing the antecedent that also contain the consequent.
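
Both parameters come down to counting, as the following Python sketch shows. The five transactions and the function names support and confidence are made up for the example.

```python
# Hypothetical transactions, made up for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "jam"},
]

def support(itemset, transactions):
    """Share of all transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Share of transactions with the antecedent that also contain the consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)

rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(f"support(bread, milk)       = {rule_support:.2f}")
print(f"confidence(bread -> milk)  = {rule_confidence:.2f}")
```

Here bread and milk appear together in 3 of 5 transactions, and 3 of the 4 transactions with bread also contain milk.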
So, in a given transaction with multiple items, Association Rule Mining primarily tries to find the rules
that govern how or why such products/items are often bought together. For example, peanut butter and
jelly are frequently purchased together because a lot of people like to make PB&J sandwiches.

Association Rule Mining is sometimes referred to as "Market Basket Analysis", as it was the first
application area of association mining. The aim is to discover associations of items occurring together
more often than you'd expect from randomly sampling all the possibilities. The classic anecdote of Beer
and Diaper will help in understanding this better.

The story goes like this: young American men who go to the stores on Fridays to buy diapers have a
predisposition to grab a bottle of beer too. However unrelated and vague that may sound to us laymen,
association rule mining shows us how and why!

Let's do a little analytics ourselves, shall we?

Suppose an X store's retail transactions database includes the following data:

Total number of transactions: 600,000
Transactions containing diapers: 7,500 (1.25 percent)
Transactions containing beer: 60,000 (10 percent)
Transactions containing both beer and diapers: 6,000 (1.0 percent)
From the above figures, we can conclude that if there was no relation between beer and diapers (that is,
they were statistically independent), then we would have got only 10% of diaper purchasers to buy beer
too.
However, as surprising as it may seem, the figures tell us that 80% (=6000/7500) of the people who buy
diapers also buy beer.

This is a significant jump, by a factor of 8, over the expected probability of 10%. This factor of increase
is known as Lift - which is the ratio of the observed frequency of co-occurrence of our items and the
expected frequency.

The lift is determined simply by counting the transactions in the database and performing simple
mathematical operations.
So, for our example, one plausible association rule can state that the people who buy diapers will also
purchase beer with a Lift factor of 8. If we talk mathematically, the lift can be calculated as the ratio of
the joint probability of two items x and y, divided by the product of their probabilities.
Lift = P(x,y) / [P(x)P(y)]

However, if the two items are statistically independent, then the joint probability of the two items will be
the same as the product of their probabilities. Or, in other words,
P(x,y) = P(x)P(y),

which makes the Lift factor = 1. An interesting point worth mentioning here is that anti-correlation can
even yield Lift values less than 1 - which corresponds to mutually exclusive items that rarely occur
together.
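
The lift for the beer and diaper figures given earlier can be checked with a few lines of Python. The function name lift and its parameter names are assumptions; the counts are the ones from the example.

```python
def lift(total, count_x, count_y, count_both):
    """Lift = P(x,y) / [P(x)P(y)], with probabilities estimated from counts."""
    p_x = count_x / total
    p_y = count_y / total
    p_xy = count_both / total
    return p_xy / (p_x * p_y)

# x = diapers, y = beer, using the transaction counts from the example.
observed = lift(total=600_000, count_x=7_500, count_y=60_000, count_both=6_000)
print(f"lift = {observed:.2f}")

# If the items were independent, 10% of the 7,500 diaper baskets (750) would hold beer.
independent = lift(total=600_000, count_x=7_500, count_y=60_000, count_both=750)
print(f"lift under independence = {independent:.2f}")
```

Any count of joint transactions below 750 in this example would give a lift under 1, the anti-correlated case.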

Association Rule Mining has helped data scientists find out patterns they never knew existed.

1. Market Basket Analysis:


This is the most typical example of association mining. Data is collected using barcode scanners in most
supermarkets. This database, known as the "market basket" database, consists of a large number of
records on past transactions. A single record lists all the items bought by a customer in one sale.
Knowing which groups are inclined towards which set of items gives these shops the freedom to adjust
the store layout and the store catalogue to place the items optimally relative to one another.

2. Medical Diagnosis:
Association rules in medical diagnosis can be useful for assisting physicians in curing patients.
Diagnosis is not an easy process and has a scope of errors which may result in unreliable end-results.
Using relational association rule mining, we can identify the probability of the occurrence of an illness
concerning various factors and symptoms. Further, using learning techniques, this interface can be
extended by adding new symptoms and defining relationships between the new signs and the
corresponding diseases.

3. Census Data:
Every government has tonnes of census data. This data can be used to plan efficient public
services (education, health, transport) as well as help public businesses (for setting up new factories,
shopping malls, and even marketing particular products). This application of association rule mining and
data mining has immense potential in supporting sound public policy and bringing forth an efficient
functioning of a democratic society.

4. Protein Sequence:
Proteins are sequences made up of twenty types of amino acids. Each protein bears a unique 3D
structure which depends on the sequence of these amino acids. A slight change in the sequence can
cause a change in structure which might change the functioning of the protein. This dependency of the
protein functioning on its amino acid sequence has been a subject of great research. Earlier it was
thought that these sequences are random, but now it's believed that they aren't. Nitin Gupta, Nitin
Mangal, Kamal Tiwari, and Pabitra Mitra have deciphered the nature of associations between different
amino acids that are present in a protein. Knowledge and understanding of these association rules will
be extremely helpful during the synthesis of artificial proteins.

Hypothesis Generation:

In a nutshell, hypothesis generation is what helps you come up with new ideas for what you need to
change. Sure, you can do this by sitting around in a room and brainstorming new features, but reaching
out and learning from your users is a much faster way of getting the right data.

Imagine you were building a product to help people buy shoes online. Hypothesis generation might
include things like:

• Talking to people who buy shoes online to explore what their problems are
• Talking to people who don't buy shoes online to understand why
• Watching people attempt to buy shoes both online and offline in order to understand what their
problems really are rather than what they tell you they are
• Watching people use your product to figure out if you've done anything particularly confusing
that is keeping them from buying shoes from you

As you can see, you can do hypothesis generation at any point in the development of your product. For
example, before you have any product at all, you need to do research to learn about your potential users'
habits and problems. Once you have a product, you need to do hypothesis generation to understand how
people are using your product and what problems you've caused.
To be clear, the research itself does not generate hypotheses. YOU do that. The goal is not to just go out
and have people tell you exactly what they want and then build it. The goal is to gain an understanding
of your users or your product to help you think up clever ideas for what to build next.
Good hypothesis generation almost always involves qualitative research.

At some point, you need to observe people or talk to people in order to understand them better.
However, you can sometimes use data mining or other metrics analysis to begin to generate a
hypothesis. For example, you might look at your registration flow and notice a severe drop off half way
through. This might give you a clue that you have some sort of user problem half way through your
registration process that you might want to look into with some qualitative research.

Hypothesis Validation:

Hypothesis validation is different. In this case, you already have an idea of what is wrong, and you have
an idea of how you might possibly fix it. You now have to go out and do some research to figure out if
your assumptions and decisions were correct.

For our fictional shoe-buying product, hypothesis validation might look something like:

Standard usability testing on a proposed new purchase flow to see if it goes more smoothly than
the old one
Showing mockups to people in a particular persona group to see if a proposed new feature
appeals to that specific group of people
A/B testing of changes to see if a new feature improves purchase conversion

Hypothesis validation also almost always involves some sort of tangible thing that is getting tested.
That thing could be anything from a wireframe to a prototype to an actual feature, but there's
something that you're testing and getting concrete data about.
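
One possible way to get concrete data from the A/B test mentioned above is a two-proportion z-test on purchase conversion. The sketch below is a minimal version that assumes independent visitors and reasonably large samples; the function name and the visitor and conversion counts are hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of variant A (old) and variant B (new).

    Returns the z statistic and a two-sided p-value based on the normal
    approximation with a pooled conversion rate.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) written via the complementary
    # error function.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical numbers: 5000 visitors per variant.
z, p = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=250, n_b=5000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the difference in conversion is unlikely to be due to chance alone. Whether the difference is large enough to matter for the business is a separate judgment.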

Hypothesis:
Simply put, a hypothesis is a possible view or assertion of an analyst about the problem he or she is
working upon. It may be true or may not be true.
For example, if you are asked to build a credit risk model to identify which customers are likely to lapse
and which are not, these can be a possible set of hypotheses:

Customers with poor credit history in past are more likely to default in future
Customers with high (loan value / income) are likely to default more than those with low ratio
Customers doing impulsive shopping are more likely to be at a higher credit risk

At this stage, you don't know which out of these hypotheses would be true.
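
Once data is available, each hypothesis can be checked with a small, targeted comparison. The sketch below compares default rates between groups for the first two hypotheses. The eight customer records, the 0.6 cut-off for a "high" loan value / income ratio, and the helper name default_rate are all assumptions made for illustration; a sample this small proves nothing by itself.

```python
# Hypothetical records: (poor_credit_history, loan_to_income_ratio, defaulted)
customers = [
    (True, 0.9, True),
    (True, 0.4, True),
    (True, 0.7, False),
    (False, 0.8, True),
    (False, 0.3, False),
    (False, 0.2, False),
    (False, 0.5, False),
    (True, 0.6, True),
]


def default_rate(rows):
    """Fraction of the given customers who defaulted."""
    rows = list(rows)
    if not rows:
        return float("nan")
    return sum(1 for r in rows if r[2]) / len(rows)


# Hypothesis 1: poor credit history goes with more defaults.
poor_history = default_rate(r for r in customers if r[0])
good_history = default_rate(r for r in customers if not r[0])

# Hypothesis 2: a high loan value / income ratio goes with more defaults.
# The 0.6 threshold is an arbitrary illustrative cut-off.
high_ratio = default_rate(r for r in customers if r[1] >= 0.6)
low_ratio = default_rate(r for r in customers if r[1] < 0.6)

print(f"poor history: {poor_history:.2f}, good history: {good_history:.2f}")
print(f"high ratio:   {high_ratio:.2f}, low ratio:    {low_ratio:.2f}")
```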

Why hypothesis generation is important:

Now, the natural question which arises is why an upfront hypothesis generation is important. Let us try
and understand the 2 broad approaches and their contrast:

Approach 1: Non-hypothesis driven data analysis (i.e. boiling the ocean)

In today's world, there is no end to what data you can capture and how much time you can spend in
trying to find out more variables / data. For example, in this particular case mentioned above, if you
don't form initial hypotheses, you will try and understand every possible variable available to you. This
would include Bureau variables (which will have hundreds of variables), the company's internal
experience variables, and other external data sources. So, you are already talking about analyzing 300 -
500 variables. As an analyst, you will take a lot of time to do this and the value in doing that is not
much. Why? Because, even if you understand the distribution of all 500 variables, you would need to
understand their correlation and a lot of other information, which can take a very long time. This strategy is
typically known as boiling the ocean. So, you don't know exactly what you are looking for and you are
exploring every possible variable and relationship in the hope of using them all, which is very difficult and
time consuming.
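
To put a number on the effort described above: with n variables there are n(n-1)/2 distinct pairs whose correlation could be examined. The short sketch below computes this for 300 and 500 variables; the function name pairwise_count is an assumption for illustration.

```python
import math


def pairwise_count(n_vars):
    """Number of distinct variable pairs, i.e. n choose 2."""
    return math.comb(n_vars, 2)


for n_vars in (300, 500):
    print(f"{n_vars} variables -> {pairwise_count(n_vars)} pairwise relationships")
```

For 500 variables this gives 124,750 pairs, before considering any relationships among three or more variables.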

Approach 2: Hypothesis driven analysis
In this case, you list down a comprehensive set of analyses first - basically whatever comes to your mind.
Next, you see which out of these variables are readily available or can be collected. Now, this list should
give you a set of smaller, specific individual pieces of analysis to work on. For example, instead of
understanding all 500 variables first, you check whether the bureau provides number of past defaults or
not and use it in your analysis. This saves a lot of time and effort, and if you progress through the hypotheses in
order of your expected importance, you will be able to finish the analysis in a fraction of the time.
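
The procedure above (list hypotheses, keep those whose data is available, work through them by expected importance) can be sketched as a small planning step. The class, field names, variable names, and importance scores below are all hypothetical; the scores stand in for the analyst's own judgment.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    statement: str
    variable: str      # data needed to test the hypothesis (hypothetical names)
    available: bool    # is the variable readily available or collectable?
    importance: int    # analyst's expected importance, 1 (low) to 5 (high)


def plan(hypotheses):
    """Keep testable hypotheses and order them by expected importance."""
    usable = [h for h in hypotheses if h.available]
    return sorted(usable, key=lambda h: h.importance, reverse=True)


candidates = [
    Hypothesis("Poor credit history leads to more defaults",
               "past_defaults", True, 5),
    Hypothesis("High loan value / income leads to more defaults",
               "loan_to_income", True, 4),
    Hypothesis("Impulsive shoppers carry higher credit risk",
               "impulse_score", False, 3),
]

for h in plan(candidates):
    print(h.importance, h.variable, "-", h.statement)
```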

If you have read through the examples closely, the benefit of the hypothesis driven approach should be
pretty clear. You can further read the books "The McKinsey Way" and "The Pyramid Principle" for gaining
more insight into this process.

MODULE -IV
VISUALIZATION-1
Data Visualization:
In order to understand data, it is often useful to visualize it. Normally in Big Data applications, the
interest lies in finding insight rather than just making beautiful plots. The following are examples of
different approaches to understanding data using plots.

To start analyzing the flights data, we can begin by checking if there are correlations between numeric
variables.
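
A minimal sketch of such a check is shown below. It computes the Pearson correlation for every pair of numeric columns using only the standard library. The column names and values are invented stand-ins and are not taken from the actual flights data set; in practice the resulting matrix would usually be drawn as a heatmap or scatter-plot matrix.

```python
import math
from itertools import combinations

# Hypothetical numeric columns standing in for the flights data.
flights = {
    "dep_delay": [5, 20, -3, 45, 0, 12],
    "arr_delay": [2, 25, -8, 50, -1, 10],
    "distance": [500, 1200, 300, 800, 2400, 950],
}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Correlation for every distinct pair of columns.
corr = {
    (a, b): pearson(flights[a], flights[b])
    for a, b in combinations(flights, 2)
}

for (a, b), r in corr.items():
    print(f"{a} vs {b}: r = {r:.2f}")
```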

Visualization (or visualisation) is any technique for creating images, diagrams, or animations to
communicate a message. Visualization through visual imagery has been an effective way to
communicate both abstract and concrete ideas since the dawn of humanity. Examples from history
