Unit 4 Big Data Complete Notes

ccbd bca

Uploaded by

Jeevan Kp
Big Data

Unit IV

Introduction
Data are the key ingredient of any analytical exercise. Hence, it is important to
thoroughly consider and list all data sources that are of potential interest before starting the
analysis.

Types of Data Sources


Data can originate from a variety of sources and take different forms:
1. Transactional Data: Structured, low-level, detailed information
capturing the key characteristics of a customer transaction.
E.g., purchase, claim, cash transfer.
2. Unstructured Data: Data embedded in text documents.
E.g., emails, web pages.
• It requires extensive pre-processing before it can be included in the analytical process.
3. Qualitative Data: Expert-based data that approximates and characterizes
rather than measures.
• It can be observed and recorded.
• It is collected through observation, one-to-one interviews,
focus groups, etc.

Sampling
Sampling is a statistical technique used to select, manipulate and analyse a
representative subset of data points in order to identify patterns and trends in the larger data set being
examined.
• The aim of sampling is to take a subset of past customer data and use it to build an
analytical model.
• The key requirement for a good sample is that it is representative of the future
customers on which the analytical model will be run.
Definitions:
1. Population:
The collection of observations about which the user would like to make an inference.
2. Sample:
The specific group of individuals from which data is actually collected.
3. Sampling Frame:
The actual list of individuals from which the sample will be drawn.
E.g., when studying working conditions at Company X, the population is all 1000 employees of the
company and the sampling frame is the company's HR database, which lists the names and
contact details of every employee.
4. Sample Size:
The number of individuals in the sample. It depends on the size of the population and on
how precisely the sample must represent the population as a whole.

SJFGC, Mysuru Page 1


Types of Sampling
1. Simple Random Sampling:
• In a simple random sample, every member of the population has an equal chance of
being selected.
• The sampling frame should include the whole population.
• Tools such as a random number generator are used.
E.g., to select a simple random sample of 100 employees of Company X, assign a
number from 1 to 1000 to every employee in the company database and use a random
number generator to select 100 numbers.
2. Systematic Sampling:
• It is similar to simple random sampling, but it is usually slightly easier to conduct.
• Every member of the population is listed with a number, but instead of randomly
generating numbers, individuals are chosen at regular intervals.
E.g., all employees of the company are listed in alphabetical order. From the first 10 numbers,
a starting point is selected at random, say number 6.
From number 6 onwards, every 10th person on the list is selected (6, 16, 26, 36, ...).
3. Stratified Sampling:
• It involves dividing the population into sub-populations that may differ in important ways,
i.e., the population is divided into subgroups called strata based on a relevant
characteristic.
• It allows more precise conclusions to be drawn by ensuring that every subgroup is
properly represented in the sample.
E.g., the company has 1000 employees. Using gender as the stratum, they are divided into
800 females and 200 males. A sample of 100 that mirrors this split is obtained by randomly
selecting 80 females and 20 males.
4. Cluster Sampling:
• It also involves dividing the population into subgroups, but each subgroup should have
characteristics similar to those of the whole population.
• Instead of sampling individuals from each subgroup, entire subgroups are selected
at random.
E.g., the company has offices in 10 cities across the country, each with roughly the same number of
employees in similar roles. Instead of travelling to every office to collect data,
random sampling is used to select 3 offices. These are the clusters.
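
The four sampling schemes above can be sketched with Python's standard `random` module. This is a minimal illustration, not a reference implementation: the function names, the fixed seed and the employee numbering 1..1000 are assumptions made for the example.

```python
import random

def simple_random_sample(population, n, seed=0):
    """Every member has the same chance of being picked."""
    return random.Random(seed).sample(population, n)

def systematic_sample(population, step, seed=0):
    """Random start within the first `step` members, then every step-th member."""
    start = random.Random(seed).randrange(step)
    return population[start::step]

def stratified_sample(strata, fraction, seed=0):
    """Draw the same fraction from every stratum (dict: name -> members)."""
    rng = random.Random(seed)
    return {name: rng.sample(members, round(len(members) * fraction))
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, seed=0):
    """Pick whole clusters at random and keep all of their members."""
    chosen = random.Random(seed).sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}

# Hypothetical company: employee numbers 1..1000
employees = list(range(1, 1001))
srs = simple_random_sample(employees, 100)
sys_s = systematic_sample(employees, 10)
strat = stratified_sample({"female": employees[:800], "male": employees[800:]}, 0.1)
offices = {f"office_{i}": employees[i * 100:(i + 1) * 100] for i in range(10)}
clus = cluster_sample(offices, 3)
```

Stratified sampling keeps the 800/200 split (80 and 20 members), while cluster sampling returns three complete offices.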


Types of Data Elements


An analysis works with different types of data elements.
1. Continuous Data:
It represents measurements. Its values cannot be counted, but they can be
measured.
E.g., the height of a person, which can be described using intervals on the real number line.

2. Categorical Data:
It represents characteristics of the data.
• The categories may be coded as numbers, but those numbers have no arithmetic meaning.
E.g., a person's gender, language, etc.
Categorical data is classified into
a. Nominal Data
b. Ordinal Data
c. Binary Data

a. Nominal Data:
These are data elements that can only take on a limited set of values with no
meaningful ordering in between.
E.g., marital status, profession.
• Nominal data has no order, so changing the order of its values does not change
their meaning.
b. Ordinal Data:
These are data elements that can only take on a limited set of values with a
meaningful ordering in between.
E.g., age coded as young, middle-aged and old.
c. Binary Data:
These are data elements that can only take on two values.
E.g., employment status (employed or not employed).

Visual Data Exploration & Exploratory Statistical Analysis

• Visual data exploration is a very important part of getting to know the data in an
"informal" way.
• It gives some initial insights into the data, which can be usefully adopted
throughout the modelling.
• Different plots/graphs can be used.
E.g.,
1. Pie Chart: It represents a variable's distribution as a pie, whereby each section
represents the percentage of the total taken by each value of the variable.
2. Bar Chart: It represents the frequency of each of the values, either absolute or relative, as
bars.
3. Histogram: It provides an easy way to visualise the central tendency and to determine the
variability or spread of the data.
4. Scatter Plot: It plots one variable against another to show whether there are
any correlation patterns in the data.


A next step after visual analysis could be inspecting some basic statistical
measurements such as averages, standard deviations, minimum, maximum and confidence intervals.

Missing Values
A missing value occurs when no data value is stored for a variable in an
observation.
• Missing data is a common problem and a challenge for analysts.
• Some analytical techniques, such as decision trees, can deal with missing
values directly.
• Missing values can occur for various reasons. The information can be
non-applicable or undisclosed, or an error can occur while the data is collected or merged.
E.g., a churn date is not applicable to a customer who has not churned, and income may be
missing because the customer chose not to disclose it.

The most popular schemes to deal with missing values are:

1. Replace (Impute): Replace the missing value with a known value, such as the mean or median of the variable.
2. Delete: The most straightforward option. It consists of deleting observations or variables
with lots of missing values.
3. Keep: Missing values can sometimes be meaningful, in which case they are kept as a separate category.
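
The three schemes can be sketched on a small list of records. This is only an illustration: the records, the field names and the choice of the median as the imputed value are assumptions, and `None` is used to mark a missing value.

```python
from statistics import median

# Hypothetical records; None marks a missing value.
rows = [
    {"age": 34, "income": 1800},
    {"age": None, "income": 2400},
    {"age": 51, "income": None},
    {"age": 45, "income": 3000},
]

def impute_median(rows, field):
    """Replace: fill missing values of `field` with the median of the known ones."""
    known = [r[field] for r in rows if r[field] is not None]
    fill = median(known)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]

def delete_incomplete(rows):
    """Delete: drop every observation that has at least one missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def keep_as_flag(rows, field):
    """Keep: add an indicator so that 'missing' can act as its own category."""
    return [{**r, field + "_missing": r[field] is None} for r in rows]

imputed = impute_median(rows, "age")
complete = delete_incomplete(rows)
flagged = keep_as_flag(rows, "income")
```

Deleting is the simplest option but here it discards half of the observations, which is why imputing or keeping is often preferred when data is scarce.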


Outlier Detection and Treatment


Outliers are observations that are significantly different from the remaining
data (population).
• They are the points that lie outside the overall pattern of the distribution.

Types of Outliers:
1. Univariate Outlier: It can be found by looking at the distribution of values in a
single feature space.
2. Multivariate Outlier:
• It can be found in an n-dimensional space (n features).
• Multivariate outliers are hard to spot by inspection, because each value may look normal on its own, so detecting them usually requires training a model.

Some of the common causes of outliers are:

1. Data entry errors (human errors)
2. Measurement errors (instrument errors)
3. Experimental errors
4. Data processing errors (data manipulation on the data set)
5. Sampling errors (extracting or mixing data from wrong or various sources)

Detection and treatment are the two important steps in dealing with outliers.
• The first check for outliers is to calculate the minimum and maximum values for each
of the data elements.
• Various tools can be used to detect outliers:
1. Histograms
2. Box-Plots
3. Z-Score

1. Histograms:
• A histogram is a graphical representation that organizes a group of data points into user-specified
ranges.
• It is similar in appearance to a bar graph.


Box-plots:
• A box-plot is a standardized way of displaying the distribution of data based on three
key quartiles of the data:
a. First quartile (Q1): 25% of the observations have a lower value
b. Median (Q2): 50% of the observations have a lower value
c. Third quartile (Q3): 75% of the observations have a lower value
The three quartiles are drawn as a box. The minimum and maximum values are added
unless they are too far away from the edges of the box. A common convention for "too far" is
more than 1.5 × IQR, where IQR = Q3 - Q1.

Z-Score:
• It is a numerical measurement that describes a value's relationship to the mean of a
group of values.
• It is measured in terms of standard deviations from the mean.
• A z-score of 0 indicates that the data point is identical to the mean.
• Formula:

$z_i = \frac{x_i - \mu}{\sigma}$

$x_i$ represents the input value
$\sigma$ represents the standard deviation of the input values
$\mu$ represents the average of the input values
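
Both numerical checks can be sketched with the standard `statistics` module. The data set is hypothetical, and two choices here are assumptions rather than part of the notes: the population standard deviation (`pstdev`) is used for σ, and the cut-offs |z| > 3 and 1.5 × IQR are common rules of thumb.

```python
from statistics import mean, pstdev, quantiles

def z_scores(values):
    """z_i = (x_i - mean) / standard deviation, for every value."""
    mu, sigma = mean(values), pstdev(values)
    return [(x - mu) / sigma for x in values]

def z_outliers(values, threshold=3.0):
    """Values whose absolute z-score exceeds the threshold."""
    return [x for x, z in zip(values, z_scores(values)) if abs(z) > threshold]

def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR outside the box [Q1, Q3]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]

# Hypothetical measurements with one obviously extreme value
values = [10, 11, 12, 13] * 5 + [50]
by_z = z_outliers(values)
by_iqr = iqr_outliers(values)
```

On this data both rules flag only the value 50. With very small samples the z-score rule can miss outliers, because a single extreme value inflates the standard deviation it is measured against.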


Standardizing Data:
• It is a data pre-processing activity targeted at scaling variables to a similar range.
• It focuses on transforming raw data into usable information before it is analyzed.
• Raw data can contain variations in entries that are meant to be the same, which could
later affect the data analysis.
• Standardizing changes the data so that it is consistent across all entries.
• Once the information in the dataset is consistent and standardized, it is
significantly easier to analyze and use.
• Standardization processes create compatibility, similarity, measurement and symbol
standards.
• Standardization is especially useful for regression-based approaches.

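Min/Max Standardization:

$X_{new} = \frac{X_{old} - \min(X_{old})}{\max(X_{old}) - \min(X_{old})} \cdot (newmax - newmin) + newmin$

where newmin and newmax are the bounds of the target range (for example 0 and 1).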
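
A minimal sketch of min/max standardization, assuming the usual definition in which each value is rescaled linearly from the observed range to a target range (0 to 1 by default). The function name and the income figures are made up for the example.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly from [min(values), max(values)] to [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        raise ValueError("all values are identical, so the range is zero")
    span = old_max - old_min
    return [(x - old_min) / span * (new_max - new_min) + new_min for x in values]

incomes = [1000, 1500, 2000, 3000]          # hypothetical monthly incomes
scaled = min_max_scale(incomes)             # target range 0..1
centred = min_max_scale(incomes, -1.0, 1.0) # target range -1..1
```

The smallest income maps to the lower bound and the largest to the upper bound. A single extreme value compresses all other values into a narrow band, so outliers are usually treated before scaling.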

Categorization
• It is also known as coarse classification, classing, grouping, and binning.
• It helps to identify and assign categories to a collection of data to allow for more
accurate analysis.
• Classification helps the user with knowledge discovery and future planning.
E.g., classifying an email as 'spam' or 'not spam'.

Common steps for data classification:

a. Scan: This step involves taking stock of the entire database and making a
digital game plan to tackle the process.
b. Identify: Anything from file type to character units to the size of data packets may
be used to sort the information into searchable, sortable categories.
c. Separate: Once the data is categorized, the complete dataset can be separated by
category.

Binning
• Binning is used to minimize the effects of small observation errors.
• The original data values are divided into small intervals known as bins.
• It has a smoothing effect on the input data and may also reduce the chance of overfitting
in the case of small datasets.


Various methods can be used to do the categorization:

1. Equal Frequency Binning:
The bins contain (approximately) the same number of observations.
2. Equal Interval Binning/Equal Width Binning:
The bins have equal width w, so the bin boundaries are
min, min + w, min + 2w, ..., min + nw = max

where w = (max - min)/n and n is the number of bins.
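
The two binning methods can be sketched as follows. Both functions return a bin index (starting at 0) for every value. The function names and the income figures are illustrative assumptions, and ties in equal frequency binning are broken by position, which is only one of several possible conventions.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of width (max - min) / n_bins."""
    lo, hi = min(values), max(values)
    w = (hi - lo) / n_bins
    if w == 0:
        return [0] * len(values)
    # The maximum value falls on the last boundary, so it is put in the last bin.
    return [min(int((x - lo) / w), n_bins - 1) for x in values]

def equal_frequency_bins(values, n_bins):
    """Assign each value to a bin so that the bins hold (nearly) the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

incomes = [1000, 1200, 1300, 2000, 1800, 1400]   # hypothetical incomes
by_width = equal_width_bins(incomes, 2)          # w = (2000 - 1000) / 2 = 500
by_freq = equal_frequency_bins(incomes, 2)       # 3 observations per bin
```

With equal width the bins are 1000 to 1500 and 1500 to 2000, which puts four incomes in the first bin and two in the second. Equal frequency instead moves the boundary so that each bin holds three incomes.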

Chi-Squared Analysis
• It is a more sophisticated way to do coarse classification.
• The chi-square test is a test of statistical significance for categorical variables.
• It is a useful measure for comparing experimentally obtained results with
those expected theoretically under a hypothesis.
• Formula:

$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$

where $O_i$ is the observed frequency and $E_i$ the expected frequency of category i.

• If there is no difference between the observed and expected frequencies, the value of chi-square
is zero.
• If there is a difference between the observed and expected frequencies, the value of
chi-square is greater than zero.
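
A small sketch of the chi-square statistic for a contingency table. The expected frequencies are computed under the assumption that rows and columns are independent, which is the usual hypothesis in this setting. The table values and function names are hypothetical.

```python
def chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def expected_counts(table):
    """Expected cell counts under independence: row total * column total / grand total."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    total = sum(row_tot)
    return [[r * c / total for c in col_tot] for r in row_tot]

# Hypothetical counts: rows are two groupings, columns are good/bad customers
table = [[30, 10], [20, 40]]
exp = expected_counts(table)
stat = sum(chi_square(o, e) for o, e in zip(table, exp))
```

Here the expected counts are 20/20 and 30/30, and the statistic is 50/3 (about 16.67). The larger the value, the more the observed grouping deviates from what independence would predict.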

Weights of Evidence Coding

• Weights of Evidence (WOE) tells the predictive power of an independent variable in relation
to the dependent variable.
• WOE evolved from the credit scoring world, where it is generally described as a
measure of the separation of good and bad customers.
• Bad customers are the customers who defaulted on a loan.
• Good customers are the customers who paid back their loan.
• WOE is calculated as:

$WOE = \ln\left(\frac{\text{Distribution of Goods}}{\text{Distribution of Bads}}\right)$

Positive WOE means: Distribution of Goods > Distribution of Bads
Negative WOE means: Distribution of Goods < Distribution of Bads
• WOE helps to transform a continuous independent variable into a set of bins based on the
similarity of the dependent variable distribution.

Benefits of WOE
1. It can treat outliers.
2. It can handle missing values, as missing values can be binned separately.
3. It helps to build a strictly linear relationship with the log odds.


Variable Selection
Variable selection means choosing which variables to include in the model, rather
than using all the variables that are available.

Variable selection/feature selection is used for several reasons:

i. Simplification of models to make them easier for researchers to interpret
ii. Shorter training times
iii. To avoid the curse of dimensionality

Various filter measures are used in the selection of variables:
1. Pearson's Correlation: Pearson's correlation coefficient is the test statistic that
measures the relationship or association between two continuous variables.

• It measures the linear dependency between two variables and always lies between -1 and +1.

2. Fisher Score: It measures the relationship between a continuous variable and a categorical
(for example good/bad) target.

Formula:

$\text{Fisher score} = \frac{|\bar{X}_G - \bar{X}_B|}{\sqrt{s_G^2 + s_B^2}}$

Here $\bar{X}_G$ and $\bar{X}_B$ are the average values of the variable for the Goods and the Bads.

$s_G^2$ and $s_B^2$ are the corresponding variances.
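
Both filter measures can be sketched in a few lines. The Fisher score here assumes the form |mean of Goods - mean of Bads| divided by the square root of the summed variances, and the sample variance (`statistics.variance`) is an assumption. The data values are made up.

```python
import math
from statistics import mean, variance

def pearson(x, y):
    """Pearson correlation coefficient between two equally long numeric lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    spread = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return cov / spread

def fisher_score(good, bad):
    """|mean(good) - mean(bad)| / sqrt(var(good) + var(bad))."""
    return abs(mean(good) - mean(bad)) / math.sqrt(variance(good) + variance(bad))

r_up = pearson([1, 2, 3, 4], [2, 4, 6, 8])       # perfectly increasing
r_down = pearson([1, 2, 3], [3, 2, 1])           # perfectly decreasing
score = fisher_score([10, 12, 14], [20, 22, 24]) # hypothetical values per class
```

A correlation close to +1 or -1, or a high Fisher score, suggests that the variable carries information about the target and is a candidate for the model.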


Information Value

• It is one of the most useful techniques for selecting important variables in a predictive
model.
• It helps to rank variables on the basis of their importance.
• Information Value is calculated as:

$IV = \sum_{i=1}^{k} (\text{Distribution of Good}_i - \text{Distribution of Bad}_i) \times WOE_i$

• k represents the number of categories of the variable.

Rule related to Information Value
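
The WOE and Information Value calculations can be sketched together, since IV is built from the per-category WOE values. The counts below are hypothetical. A commonly quoted rule of thumb, which may differ from the one these notes intended, treats an IV below 0.02 as not predictive, 0.02 to 0.1 as weak, 0.1 to 0.3 as medium and above 0.3 as strong.

```python
import math

def woe_iv(bins):
    """bins: {category: (n_good, n_bad)} -> ({category: WOE}, information value).

    Assumes every category has at least one good and one bad customer,
    otherwise the logarithm is undefined.
    """
    total_good = sum(g for g, _ in bins.values())
    total_bad = sum(b for _, b in bins.values())
    woe, iv = {}, 0.0
    for name, (g, b) in bins.items():
        dist_good, dist_bad = g / total_good, b / total_bad
        woe[name] = math.log(dist_good / dist_bad)
        iv += (dist_good - dist_bad) * woe[name]
    return woe, iv

# Hypothetical counts of good and bad customers per age category
age_bins = {"young": (100, 50), "middle": (500, 30), "old": (400, 20)}
woe, iv = woe_iv(age_bins)
```

In this example the young category holds 10% of the goods but 50% of the bads, so its WOE is negative, while the old category has a WOE of ln 2.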

Segmentation

Segmentation is the process of dividing up the data and grouping similar
data together based on a chosen parameter.

• Segmentation is a pre-processing activity because it makes it possible to estimate different
analytical models, each tailored to a specific segment.

E.g., banks might adopt special strategies for specific segments of customers.

Why is Data Segmentation Important?

1. It makes the stored data easier to analyze.

2. It helps to mass-personalize marketing communications and to reduce costs.


Analytics

• Analytics is everywhere and strongly embedded in our daily lives.
• It refers to extracting useful business patterns or mathematical decision models from a
preprocessed data set.

Examples for Analytics Application:

Analytics Process Model

The analytics process model describes the iterative chain of processing
steps involved in turning data into information or decisions.
It includes the following steps:

1. Source Data

• Source data that could be of potential interest needs to be identified.
• This is a very important step, as data is the key ingredient of any analytical model
and the selection of data has a decisive impact on the analytical
models that will be built in a subsequent step.
• All data is then gathered in a staging area, which could be a data mart
or data warehouse.


2. Data Mining Mart

• Exploratory analysis can be considered here, using Online Analytical Processing
(OLAP) facilities for multidimensional data analysis.
• The data analysis operations may be roll-up, drill-down, slicing and dicing.

3. Data Cleaning

• It gets rid of inconsistencies such as missing values, outliers and
duplicate data.

4. Data Transformation

• Additional transformations may also be considered, such as binning, alphanumeric to
numeric coding, geographical aggregation and so forth.

5. Analytics

• The analytical model is estimated on the pre-processed and transformed
data.
• Once the model has been built, it is interpreted and evaluated by the
business experts.
• Trivial patterns that the model detects are still useful, because they provide
some validation of the model.
• During the analytics step, the need for additional data may be identified,
which may necessitate additional cleaning and transformation.
• The most time-consuming step is data selection and preprocessing,
which takes around 80% of the total effort needed to build an analytical
model.
• Analytics is a multidisciplinary exercise in which many different job profiles
need to collaborate.

6. Interpretation & Evaluation

• The last step is interpretation and evaluation, where the results are
interpreted and evaluated by the experts.

People Involved in Building an Analytical Model

1. Data Scientist: Data scientists are analytical experts who use their skills in
both technology and social science to find trends and manage data.

2. Data Miner: A person who explores and analyzes large blocks of information to
glean meaningful patterns and trends.

3. Data Analyst: A person who understands data and uses it to make strategic business
decisions.

Analytical Model Requirements

1. Business Performance:

The analytical model should solve the business problem for which it was developed.

2. Statistical Performance:

The model should have statistical significance and predictive power.

3. The Model Should be Interpretable & Justifiable:

Interpretability: It refers to understanding the patterns that the analytical model captures.

Justifiability: It refers to the degree to which a model corresponds to prior business
knowledge and intuition.

4. The Analytical Model Should be Operationally Efficient:

Operational efficiency refers to the effort needed to collect the data, preprocess it, evaluate the model and
feed its outputs to the business application.

5. Economic Cost:

Software costs and human and computing resources should be taken into consideration.

Types of Analytics

1. Predictive Analytics

2. Descriptive Analytics

3. Social Network Analytics

Predictive Analytics

Predictive analytics mines data using statistical algorithms and
machine learning to predict trends or probabilities.

• It uses historical data, and the patterns in that data, to predict the future.
• It creates models based on patterns in data to predict the probability of
something happening in the future.
• The better the model and the training data, the better the prediction.


Steps in Predictive Analytics are:

1. Create Model 2. Train Model 3. Evaluate 4. Test Model 5. Predict

Training Data Set: The data set used to build the model.

Testing Data Set: The data set used to validate the model that was built.

Applications of Predictive Analytics

1. Retail: Predictive analytics is used in retail to improve the sales position
and forge better relations with customers.

2. Health: It is used to predict epidemics or public health issues, for example based on the probability
of a person suffering the same condition again.

Types of Predictive Analytics are:

1. Regression

2. Classification

Regression

Regression is a statistical technique used to model the relationship between two sets of
variables.

• The variables are:
a. Independent (X) variables
b. Dependent (Y) variables

a. Independent Variables:

• Independent variables are regarded as inputs to a system and may take on
different values freely.
• They are also called predictor variables.

b. Dependent Variables:

• Dependent variables are those whose values change as a consequence of changes in
other values in the system.
• They are also called criterion variables.


Types of Regression

1. Linear regression

2. Logistic regression

Linear Regression

Linear regression applies when the relationship between the variables can be described
with a straight line.

• Linear regression analysis is used to find equations that fit data.
• The straight-line equation has the form: y = ax + b

Here y is the dependent variable and x is the independent variable;
'a' is the slope of the line and 'b' is the y-intercept.

Slope: The slope of a line is the change in y for a one-unit increase in x.

Y-Intercept: It is the height at which the line crosses the vertical axis. It is obtained by
setting x = 0 in the above equation.

Example:
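
As a small worked illustration, the slope and intercept of y = ax + b can be estimated by ordinary least squares. The notes do not say which fitting method they have in mind, so least squares is an assumption here, and the data points are made up so that they lie exactly on a known line.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares estimates of slope a and intercept b in y = a*x + b."""
    mx, my = mean(xs), mean(ys)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]            # hypothetical points lying on y = 2x + 1
a, b = fit_line(xs, ys)
predicted = a * 5 + b        # prediction for a new input x = 5
```

The fitted slope is 2 (y grows by 2 for each unit of x) and the intercept is 1 (the value of y at x = 0), so the prediction for x = 5 is 11.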

Logistic Regression:

Logistic regression is the analysis to conduct when the dependent variable is binary.

• It is used to describe data and to explain the relationship between one dependent
binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
• Formula (for a single independent variable x):

$p(x) = \frac{1}{1 + e^{-(ax + b)}}$

• A linear regression model can generate predictions ranging from negative to
positive infinity, whereas the probability of an outcome can only lie between 0 and 1. The logistic
function maps the linear score ax + b into that range, so that 0 < p(x) < 1.
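
A minimal sketch of how the logistic function squeezes a linear score into a probability. The coefficient values are arbitrary and chosen only for illustration, and `math.exp` would overflow for very large negative scores, which this sketch does not guard against.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(x, a, b):
    """Probability of the positive class for input x, given slope a and intercept b."""
    return sigmoid(a * x + b)

p_low = predict_probability(-10, a=1.0, b=0.0)
p_mid = predict_probability(0, a=1.0, b=0.0)
p_high = predict_probability(10, a=1.0, b=0.0)
```

A linear score of 0 corresponds to a probability of exactly 0.5, while strongly negative or positive scores approach 0 and 1 respectively.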

Decision Trees

A decision tree is a graphical representation of all the possible solutions
to a problem/decision based on given conditions.

• It can be used for both classification and regression problems, but it is mostly used for
classification problems.
• A decision tree is a tree-structured classifier, where internal nodes represent the features
of a dataset, branches represent the decision rules and each leaf node represents the
outcome.
• A decision tree simply asks a question and, based on the answer (Yes/No), splits
the tree into sub-trees.
• In a decision tree, there are two types of nodes:

a. Decision Node: Decision nodes are used to make a decision and have
multiple branches.
b. Leaf Node: Leaf nodes are the outputs of those decisions and do not contain
any further branches.

Structure of Decision Trees


Decision Tree Terminologies

a. Root Node:

• The root node is where the decision tree starts.
• It represents the entire data set, which gets further divided into two or more
homogeneous sets.

b. Leaf Node:

• Leaf nodes are the final output nodes. The tree cannot be split further after
reaching a leaf node.

c. Branch/Sub-Tree:

• A tree formed by splitting the tree.

d. Parent/Child Node:

• A node that is split into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are
its child nodes. The root node is the parent of the nodes directly below it.

Process

1. Splitting:

It is the process of dividing a decision node or the root node into sub-nodes according to a condition.

2. Pruning:

It is the process of removing unwanted branches from the tree.

Example:

Suppose a candidate has a job offer and wants to decide whether to
accept the offer or not. To solve this problem, the decision tree starts with the root
node (the Salary attribute, chosen by an attribute selection measure, ASM). The root node splits into the next decision node
(distance from the office) and one leaf node, based on the corresponding labels. That
decision node in turn splits into one decision node (cab facility) and one leaf node.
Finally, the last decision node splits into two leaf nodes (Accepted offer and Declined offer).

Attribute Selection Measures

While implementing a decision tree, the main issue is how to select the best
attribute for the root node and for the sub-nodes. To solve this problem there is a technique
called the attribute selection measure, or ASM. With this measure, the best attribute
for each node of the tree can be selected.

There are two popular techniques for ASM:

1. Information Gain

2. Gini Index

Information Gain:

Information gain is the change in entropy after a dataset is segmented
based on an attribute.

• It measures how much information a feature provides about a class.
• The node is split, and the decision tree built, according to the value of the information gain.
• A decision tree algorithm always tries to maximize the information
gain, and the node/attribute with the highest information gain is split first.
• It can be calculated using the formula below:

Information Gain = Entropy(S) - [(Weighted Avg) × Entropy(each feature)]


Entropy:

Entropy is a metric that measures the impurity of a given attribute. It specifies the
randomness in the data. Entropy can be calculated as:

$Entropy(S) = -P(yes)\log_2 P(yes) - P(no)\log_2 P(no)$

Where,

o S = the set of samples
o P(yes) = probability of yes
o P(no) = probability of no
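
Entropy and information gain can be sketched as follows. Information gain is taken to be the entropy of the whole set minus the weighted average entropy of the subsets produced by a split, and the entropy function is written for any number of classes, not only yes/no. The toy records and the attribute name `salary` are assumptions made for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """-sum(p * log2(p)) over the classes present in `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy(S) minus the weighted average entropy after splitting on `attribute`."""
    n = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    weighted = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

# Hypothetical job-offer data: salary level and whether the offer was accepted
rows = [{"salary": "high"}, {"salary": "high"}, {"salary": "low"}, {"salary": "low"}]
labels = ["yes", "yes", "no", "no"]
base = entropy(labels)
gain = information_gain(rows, labels, "salary")
```

An even yes/no mix has an entropy of 1 bit. Splitting on salary separates the classes perfectly here, so the gain equals the full 1 bit.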

Gini Index:

• The Gini index is a measure of impurity or purity used while creating a decision tree in
the CART (Classification and Regression Tree) algorithm.
• An attribute with a low Gini index should be preferred over one with a high
Gini index.
• The CART algorithm creates only binary splits, and it uses the Gini index to choose
them.
• The Gini index can be calculated using the formula below:

$Gini = 1 - \sum_j P_j^2$

where $P_j$ is the proportion of samples that belong to class j.
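
A short sketch of the Gini index for a node and for a binary split. Scoring the split as the size-weighted average of the two child nodes is how CART is usually described, and is treated here as an assumption. The labels are invented.

```python
from collections import Counter

def gini(labels):
    """1 - sum of squared class proportions: 0 for a pure node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_of_split(left, right):
    """Size-weighted average Gini index of the two child nodes of a binary split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

pure = gini(["yes", "yes"])
mixed = gini(["yes", "no"])
good_split = gini_of_split(["yes", "yes"], ["no", "no"])
bad_split = gini_of_split(["yes", "no"], ["yes", "no"])
```

The split with the lower value is preferred: the first split yields two pure children (0.0), whereas the second leaves both children as mixed as the parent (0.5).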

Advantages of the Decision Tree

• It is simple to understand, as it follows the same process that a human follows when
making a decision in real life.
• It can be very useful for solving decision-related problems.
• It helps to think about all the possible outcomes of a problem.
• It requires less data cleaning than other algorithms.

Disadvantages of the Decision Tree

• A decision tree may contain many layers, which makes it complex.
• It may overfit, which can be resolved using the Random Forest
algorithm.
• With more class labels, the computational complexity of the decision tree may increase.

Neural Network

A neural network is a mathematical representation designed to operate like a human
brain.

• Neural networks are inspired by the biological neurons within the human body,
which activate under certain circumstances, resulting in a related action performed
by the body in response.
• Neural nets consist of various layers of interconnected artificial neurons powered
by activation functions, which help in switching them ON/OFF.

Structure of Neural Network:

Here,
x0, x1, x2 are the inputs
w0, w1, w2 are the weights
F is the activation function
• Weights are numeric values that are multiplied with the inputs.
• The activation function is a mathematical formula that helps the neuron to switch
ON/OFF.
• Bias Component (B): The neuron takes the inputs, computes their weighted
sum and adds a bias component.
Architecture of Neural Network

A neural network consists of three different layers:

1. Input Layer:
• The input layer represents the dimensions of the input vector.
• It accepts the inputs, in several different formats, provided by the programmer.
2. Hidden Layer:
• It represents the intermediary nodes that divide the input space into regions with (soft)
boundaries.
• It takes in a set of weighted inputs and produces an output through an activation function.
• The hidden layer sits between the input and output layers.
• It performs all the calculations to find hidden patterns and features.
3. Output Layer:
• It represents the output of the neural network.
• The input goes through a series of transformations in the hidden layer, which
finally results in the output that is conveyed by this layer.

Activation Functions:
• Activation functions help to normalize the output of each neuron to a range between 0
and 1 or between -1 and 1.
The most popular transformation functions are:
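
Two widely used choices that match the ranges mentioned above are the logistic (sigmoid) function, with outputs between 0 and 1, and the hyperbolic tangent, with outputs between -1 and 1. Whether these are exactly the functions the notes listed is an assumption. The sketch below applies them to a single neuron that computes the weighted sum of its inputs plus a bias, with made-up input and weight values.

```python
import math

def sigmoid(z):
    """Logistic activation: output strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation function."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)

inputs = [1.0, 0.5, -1.0]     # hypothetical x0, x1, x2
weights = [0.2, 0.4, 0.1]     # hypothetical w0, w1, w2
out_sigmoid = neuron(inputs, weights, bias=0.0, activation=sigmoid)
out_tanh = neuron(inputs, weights, bias=0.0, activation=math.tanh)
```

The same weighted sum gives different outputs depending on the activation function, which is why the choice of activation is part of the network design.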

Types of Neural Networks

a. Perceptron Model:
• It is one of the simplest and oldest models of a neuron.
• It is the smallest unit of a neural network and performs certain computations to detect
features or business intelligence in the input data.
• It accepts weighted inputs and applies the activation function to obtain the output as
the final result.
• The perceptron is also known as a TLU (threshold logic unit).

Advantages of Perceptron
• Perceptrons can implement logic gates such as AND, OR, or NAND.

Disadvantages of Perceptron
• Perceptrons can only learn linearly separable problems, such as the boolean AND
problem. For non-linear problems, such as the boolean XOR problem, they do not work.
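
The limitation can be seen with a small perceptron trained by the classic perceptron learning rule. This is a sketch under stated assumptions: a step activation, a learning rate of 1, integer weights starting at zero and 20 passes over the data are illustrative choices, not values from the notes.

```python
def step(z):
    """Threshold activation: the neuron fires (1) when the weighted sum is >= 0."""
    return 1 if z >= 0 else 0

def train_perceptron(samples, epochs=20):
    """Perceptron learning rule: nudge weights and bias by the prediction error."""
    w, b = [0, 0], 0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            error = target - step(w[0] * x1 + w[1] * x2 + b)
            w[0] += error * x1
            w[1] += error * x2
            b += error
    return w, b

def accuracy(samples, w, b):
    """Fraction of samples that the trained perceptron classifies correctly."""
    hits = sum(step(w[0] * x1 + w[1] * x2 + b) == t for (x1, x2), t in samples)
    return hits / len(samples)

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

acc_and = accuracy(AND, *train_perceptron(AND))
acc_xor = accuracy(XOR, *train_perceptron(XOR))
```

The perceptron learns AND perfectly, but no single straight line separates the XOR outputs, so at least one XOR case is always misclassified no matter how long it trains.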

b. Feed Forward Neural Networks

• In this network, input data travels in one direction only, passing through the neural nodes
and exiting through the output nodes.
• Hidden layers may or may not be present.
• The weights are static.
• This network is used in simple classification and face recognition.

Advantages of Feed Forward Neural Networks
1. Less complex, easy to design and maintain
2. Fast [one-way propagation]
3. Highly responsive to noisy data
Disadvantages of Feed Forward Neural Networks
1. Cannot be used for deep learning [due to the absence of dense layers and back
propagation]

c. Multilayer Perceptron
• In this model, the input data travels through several layers of neurons.
• Every node is connected to all neurons in the next layer, which makes it a fully
connected neural network.
• Input and output layers are present, with multiple hidden layers in between.
• It has bi-directional propagation: inputs are propagated forward and errors are propagated backward.
• This model is used in speech recognition and machine translation.

Advantages of Multilayer Perceptron
1. Used for deep learning [due to the presence of dense, fully connected layers and
back propagation]
Disadvantages of Multilayer Perceptron
1. Comparatively complex to design and maintain
2. Comparatively slow (depends on the number of hidden layers)

d. Convolutional Neural Network

• A convolutional neural network contains a three-dimensional arrangement of
neurons, instead of the standard two-dimensional array.
• The first layer is called a convolutional layer.
• Each neuron in the convolutional layer only processes the information
from a small part of the visual field.
• Input features are taken in batch-wise, like a filter.
• This model is used in image processing and computer vision.

Advantages of Convolutional Neural Network
1. Used for deep learning with few parameters
2. Fewer parameters to learn compared to a fully connected layer

Disadvantages of Convolutional Neural Network
1. Comparatively complex to design and maintain
2. Comparatively slow [depends on the number of hidden layers]

e. Radial Basis Function Neural Networks

• A radial basis function (RBF) network consists of an input vector followed by a layer of RBF
neurons and an output layer with one node per category.
• Classification is performed by measuring the input's similarity to data points from the
training set, where each neuron stores a prototype.
• When a new input vector [the n-dimensional vector that you are trying to classify]
needs to be classified, each neuron calculates the Euclidean distance between the
input and its prototype.
• The output layer consists of a set of neurons [one per category].

f. Recurrent Neural Networks

• In a recurrent neural network, the output of a layer is fed back to the input to help in predicting the
outcome of the layer.
• The first layer is typically a feed forward neural network, followed by a
recurrent neural network layer in which some of the information from the
previous time-step is remembered by a memory function.
• This model is used in text processing, such as auto-suggest and grammar checks, and in text-to-speech
processing.

Advantages of Recurrent Neural Networks
1. Can model sequential data, where each sample can be assumed to depend on
the previous ones.
2. Can be used with convolutional layers to extend the effective pixel neighbourhood.
Disadvantages of Recurrent Neural Networks
1. Vanishing and exploding gradient problems
2. Training recurrent neural nets can be a difficult task

Applications of Neural Network:


 Pattern recognition
 Control Systems & Monitoring
 Mobile Computing
 Marketing & Financial Applications
 Forecasting: sales, market research, etc.

Advantages of Neural Network


1. Parallel Processing Capability
2. Storing data on the entire network
3. Capability to work with incomplete knowledge

Disadvantages of Neural Network


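1. No assurance of a proper network structure (it is usually found by trial and error)
2. Unexplained behaviour of the network (it does not show why or how a result was produced)
3. Hardware Dependence
4. Difficulty of showing the problem to the network (problems must be translated into numerical values)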


Descriptive Analytics

It is analytics that creates a summary of historical data to yield useful information and
possibly prepare the data for further analysis.
 The aim is to describe patterns of customer behaviour.
 It serves as a preliminary step in the business intelligence process, creating a
foundation for further analysis & understanding.
 This analysis seeks to answer what happened, without performing more
complex analysis.
E.g., Summarising events such as sales & operations data.
The three most common types of Descriptive Analytics are:
 Association rules
 Sequence rules
 Clustering

Association Rules
Association rules help to detect frequently occurring patterns between items.
E.g., Market Basket Analysis is one of the key techniques used by large retailers to show
associations between items.
 It allows retailers to identify relationships between the items that
people frequently buy together.
Implications of Association Rules:
An association rule is an implication of the form
X ⇒ Y

whereby X ⊂ I, Y ⊂ I and X ∩ Y = ∅ (I is the set of all items)

 X is referred to as the rule antecedent
 Y is referred to as the rule consequent


Basic Definitions:
1. Support Count (σ): Frequency of occurrence of an item set.
E.g., σ({Milk, Bread}) = 1

2. Frequent Item set: An item set whose support is greater than or equal to a minimum
support threshold.

Rule Evaluation Metrics


1. Support (S): The frequency with which the rule occurs, i.e., the number of transactions that
contain both X & Y (often expressed as a fraction of all transactions).

2. Confidence (C): The strength of the association; it measures how often the items in Y appear
in transactions that contain X.

3. Lift: Ratio of the observed support to that expected if X & Y were independent

Lift(X ⇒ Y) = Support(X ∪ Y) / (Support(X) × Support(Y))
 If lift = 1, the occurrences of X and Y are independent of each other
 If lift > 1, the occurrences of X and Y are positively dependent on each other
(the items are complements: the presence of one makes the other more likely)
 If lift < 1, the items are substitutes for each other, meaning that the presence of one item
has a negative effect on the presence of the other item
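
The three metrics can be worked through on a small example. This is a minimal sketch: the basket data and the function names are hypothetical, and support is taken as a fraction of all transactions (an assumption that the lift formula needs). `Fraction` is used so the results stay exact.

```python
from fractions import Fraction


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


def confidence(x, y, transactions):
    """Support(X u Y) / Support(X). Assumes X occurs in at least one transaction."""
    return support(set(x) | set(y), transactions) / support(x, transactions)


def lift(x, y, transactions):
    """Support(X u Y) / (Support(X) * Support(Y)). Assumes X and Y both occur."""
    joint = support(set(x) | set(y), transactions)
    return joint / (support(x, transactions) * support(y, transactions))


# Hypothetical market baskets, one set of items per transaction.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk"},
    {"tea"},
]

s = support({"milk", "bread"}, baskets)       # 2/5
c = confidence({"milk"}, {"bread"}, baskets)  # 2/3
l = lift({"milk"}, {"bread"}, baskets)        # 10/9, slightly above 1
```

Here milk and bread appear together a little more often than independence would predict, so the lift is just above 1.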
Apriori Algorithm:
It is used for mining frequent item sets & devising association rules from a
transactional database.
Steps:
1. Calculate the support of the item sets (of size k=1) in the transactional database. This is
called generating the candidate set.
2. Prune the candidate set by eliminating items with a support less than the given
threshold.
3. Join the frequent item sets to form sets of size k+1, & repeat the above steps until no more
item sets can be formed.
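
The three steps can be sketched in a few lines. This is an illustrative implementation rather than an optimised one; the function name `apriori`, the use of support counts (not fractions) and the four example transactions are assumptions made for the sketch.

```python
from itertools import combinations


def apriori(transactions, min_support_count):
    """Return {itemset: support count} for every frequent item set."""
    # Step 1: candidate item sets of size k = 1.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Count the transactions that contain each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        # Step 2: prune candidates whose support is below the threshold.
        level = {c: n for c, n in counts.items() if n >= min_support_count}
        frequent.update(level)
        # Step 3: join frequent k-item sets into candidates of size k + 1 ...
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # ... keeping only those whose (k - 1)-subsets are all frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Hypothetical transactional database.
transactions = [
    {"I1", "I2", "I3"},
    {"I2", "I3", "I4"},
    {"I3", "I4", "I5"},
    {"I2", "I3", "I4"},
]
frequent_itemsets = apriori(transactions, min_support_count=3)
```

With a minimum support count of 3, the frequent item sets are {I2}, {I3}, {I4}, {I2, I3} and {I3, I4}; {I2, I4} occurs only twice, so no item set of size three survives the pruning.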


Problem: Suppose that the given minimum support count is 3 and the required confidence is 80%.

The following rules can be obtained from the frequent itemsets of size two (2-frequent
itemsets):

1. I2 -> I3: Confidence = 3/3 = 100%
2. I3 -> I2: Confidence = 3/4 = 75%
3. I3 -> I4: Confidence = 3/4 = 75%
4. I4 -> I3: Confidence = 3/3 = 100%

Since our required confidence is 80%, only rules 1 and 4 are included in the result.
Therefore, it can be concluded that customers who bought item two (I2) always bought item
three (I3) with it, and customers who bought item four (I4) always bought item three (I3) with it.
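
The rule filtering in this problem can be checked with a short sketch. The support counts below are read off the confidence fractions above (numerator = count of the pair, denominator = count of the antecedent); the function name `rules_from_pairs` is hypothetical.

```python
# Support counts implied by the confidence fractions in the problem.
support_count = {
    frozenset({"I2"}): 3,
    frozenset({"I3"}): 4,
    frozenset({"I4"}): 3,
    frozenset({"I2", "I3"}): 3,
    frozenset({"I3", "I4"}): 3,
}
MIN_CONFIDENCE = 0.80


def rules_from_pairs(counts, min_confidence):
    """Generate X -> Y rules from frequent 2-item sets and keep the confident ones."""
    rules = []
    for itemset, pair_count in counts.items():
        if len(itemset) != 2:
            continue
        for antecedent in itemset:
            (consequent,) = itemset - {antecedent}
            # Confidence = count(X and Y together) / count(X)
            conf = pair_count / counts[frozenset({antecedent})]
            if conf >= min_confidence:
                rules.append((antecedent, consequent, conf))
    return sorted(rules)


accepted = rules_from_pairs(support_count, MIN_CONFIDENCE)
# [('I2', 'I3', 1.0), ('I4', 'I3', 1.0)]
```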

Sequence Rules

Sequence rules are used for finding statistically relevant patterns in data
where the values are delivered in a sequence.
 The goal of sequence rules is to find the maximal sequences among all sequences that have a
certain user-specified minimum support and confidence.
E.g., Sequence of webpage visits in Web Analytics.
Consider the example of a transactions data set in Web analytics,
where the letters A, B, C, ... refer to the webpages.
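
A sequence rule such as "A is followed by C" can be scored in the same way as an association rule, except that the order of the pages matters. The sketch below is illustrative only: the sessions are hypothetical, gaps between the pages are allowed, and the function names are assumptions.

```python
def contains_in_order(session, pattern):
    """True if the pages in `pattern` occur in `session` in that order (gaps allowed)."""
    remaining = iter(session)
    # `page in remaining` consumes the iterator, so later pages must come later.
    return all(page in remaining for page in pattern)


def sequence_rule_stats(sessions, antecedent, consequent):
    """Support and confidence of the rule: antecedent, later followed by consequent."""
    n_rule = sum(contains_in_order(s, antecedent + consequent) for s in sessions)
    n_antecedent = sum(contains_in_order(s, antecedent) for s in sessions)
    support = n_rule / len(sessions)
    confidence = n_rule / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical sessions: each string is the ordered list of pages in one visit.
sessions = ["ABC", "BC", "ACD", "ABD", "DCA"]
support_ac, confidence_ac = sequence_rule_stats(sessions, "A", "C")  # (0.4, 0.5)
```

The session "DCA" contains both pages but in the wrong order, so it counts towards the antecedent A and not towards the rule.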


Advantages of Descriptive Analytics


 Identify gaps and performance issues early - before they become problems.
 Identify specific learners who require additional support, regardless of how many
students or employees there are.
 Identify successful learners in order to offer positive feedback or additional resources.
 Analyze the value and impact of course design and learning resources.

Disadvantages of Descriptive Analytics


 They are limited in that they only allow summaries of the people or
objects that have actually been measured; the data cannot be used to generalize to other
people or objects.
Applications
1. Market Basket analysis: The aim is to detect which products or services are frequently
purchased together by analyzing market baskets
2. Recommender Systems: The systems adopted by companies such as Amazon & Netflix to
give a recommendation based on past purchases or browsing behaviour.


Difference between Descriptive Analytics & Predictive Analytics

Social Network Analytics


 Social Network analysis is the process of gathering and analyzing data from social
networks such as Facebook, Instagram, etc.
 It is commonly used by marketers to track online conversations about products and
companies
E.g., Web pages connected by hyperlinks,
email traffic between people.
Social Network Definitions
A social network consists of both nodes & edges.
1. Node: A node can be defined as a customer, family, webpage, etc.
2. Edge: An edge can be defined as a relationship, transmission, etc.
 Edges can be weighted based on the interaction frequency, intimacy, etc.

Representation:
1. Sociograms: Social networks can be represented as sociograms.
 Sociograms are good for small-scale networks.
 The colour of a node corresponds to a specific status.


2. Matrix: A matrix is good for larger-scale networks.

 These matrices will be symmetrical and typically very sparse (with lots of zeros).
 The matrix can also contain the weights in case of weighted connections.
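
The matrix representation can be sketched for a small weighted network. The node names, edge weights and the function name `adjacency_matrix` are hypothetical; the connections are assumed to be undirected, which is what makes the matrix symmetrical.

```python
def adjacency_matrix(nodes, weighted_edges):
    """Build a symmetric matrix whose entry [i][j] is the weight of the edge i-j (0 if none)."""
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for a, b, weight in weighted_edges:
        matrix[index[a]][index[b]] = weight
        matrix[index[b]][index[a]] = weight  # undirected: mirror the entry
    return matrix


# Hypothetical network: the weight is the interaction frequency.
nodes = ["Ann", "Bob", "Cid", "Dee", "Eve"]
edges = [("Ann", "Bob", 3), ("Bob", "Cid", 1), ("Dee", "Eve", 2)]

m = adjacency_matrix(nodes, edges)
size = len(nodes)
is_symmetric = all(m[i][j] == m[j][i] for i in range(size) for j in range(size))
zero_share = sum(row.count(0) for row in m) / (size * size)  # 19 of 25 cells are zero
```

Even in this tiny example most cells are zero; in a network with thousands of nodes the share of zeros is far higher, which is why sparse storage is normally used.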

Social Network Metrics


A social network can be characterized by various social network metrics.
Consider the well-known Kite network:

1. Degree: Number of connections of a node


Example: Diane has the most connections. She works as a connector or hub.

2. Closeness: The average distance of a node to all other nodes in the network
Example: Fernando & Garth are the closest to all others. They are the best positioned to
communicate messages that need to flow quickly through to all other nodes in the network.


3. Betweenness: Counts the number of times a node or connection lies on the shortest path
between any two nodes in the network.
Example: Heather has the highest betweenness. She sits in between two important
communities. She plays a broker role between both communities but is also a single point of
failure.
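
The three metrics can be computed for the Kite network with breadth-first search. This is a sketch under one loud assumption: the edge list below is the commonly published Krackhardt kite graph (10 people, 18 connections) and is not taken from these notes. Betweenness follows Brandes' accumulation scheme; the function names are hypothetical.

```python
from collections import deque

# Assumed edge list of the Krackhardt kite network.
KITE_EDGES = [
    ("Andre", "Beverly"), ("Andre", "Carol"), ("Andre", "Diane"),
    ("Andre", "Fernando"), ("Beverly", "Diane"), ("Beverly", "Ed"),
    ("Beverly", "Garth"), ("Carol", "Diane"), ("Carol", "Fernando"),
    ("Diane", "Ed"), ("Diane", "Fernando"), ("Diane", "Garth"),
    ("Ed", "Garth"), ("Fernando", "Garth"), ("Fernando", "Heather"),
    ("Garth", "Heather"), ("Heather", "Ike"), ("Ike", "Jane"),
]


def build_graph(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def bfs(graph, source):
    """Return distances, shortest-path counts, predecessors and visit order from `source`."""
    dist, sigma = {source: 0}, {source: 1}
    preds = {v: [] for v in graph}
    order, queue = [], deque([source])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in graph[v]:
            if w not in dist:  # first time w is reached
                dist[w] = dist[v] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                sigma[w] += sigma[v]
                preds[w].append(v)
    return dist, sigma, preds, order


def degree(graph):
    return {v: len(neighbours) for v, neighbours in graph.items()}


def average_distance(graph):
    """Closeness as defined above: mean distance to all other nodes (connected graph assumed)."""
    return {v: sum(bfs(graph, v)[0].values()) / (len(graph) - 1) for v in graph}


def betweenness(graph):
    score = {v: 0.0 for v in graph}
    for s in graph:
        _, sigma, preds, order = bfs(graph, s)
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):  # farthest nodes first
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every unordered pair was counted from both ends.
    return {v: b / 2 for v, b in score.items()}


graph = build_graph(KITE_EDGES)
deg = degree(graph)
avg = average_distance(graph)
btw = betweenness(graph)
```

Under this edge list, `max(deg, key=deg.get)` is Diane, the smallest average distance is shared by Fernando and Garth, and `max(btw, key=btw.get)` is Heather, matching the three examples above.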

Social Network Learning


In Social Network Learning, the goal is within-network classification: to compute the
marginal class membership probability of a particular node given the other nodes in the
network.
A social network learner consists of the following components:
1. Local Model:
This is a model using node-specific characteristics, typically estimated using a
classical predictive analytics model.
2. Network Model:
It is a model that will make use of the connections in the network to do the
inferencing.
3. Collective Inferencing Procedure:
This is a procedure to determine how the unknown nodes are estimated together,
thereby influencing each other.

Relational Neighbour Classifier


It is a relational model based on the idea that the behaviour between nodes is correlated,
that is, connected nodes have a propensity to belong to the same class.
 In particular, it predicts a node’s class based on its neighbouring nodes & adjacent
edges.
 The classifier makes use of homophily, which states that connected nodes have
a propensity to belong to the same class. This idea is referred to as “Guilt by
Association”.
 If two nodes are associated, they tend to exhibit similar behaviour.
The posterior class probability for node ‘n’ to belong to class ‘c’ is
calculated as:
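
A minimal sketch of the classifier, assuming the commonly used weighted-vote form P(c | n) = (1/Z) × Σ w(n, nj), where the sum runs over the neighbours nj of n whose class is c, w(n, nj) is the weight of the connecting edge and Z is the sum of the weights of all labelled neighbours. The function name, the node names and the class labels below are hypothetical.

```python
def relational_neighbour(node, neighbours, labels):
    """Return {class: probability} for `node` from the classes of its labelled neighbours.

    neighbours: {node: {neighbour: edge weight}}
    labels:     {node: known class}
    """
    totals = {}
    for neighbour, weight in neighbours[node].items():
        if neighbour in labels:  # only neighbours with a known class can vote
            cls = labels[neighbour]
            totals[cls] = totals.get(cls, 0.0) + weight
    z = sum(totals.values())  # normalisation factor Z
    return {cls: total / z for cls, total in totals.items()} if z else {}


# Hypothetical customer network: "n" is unlabelled, its four neighbours are labelled.
neighbours = {"n": {"a": 1, "b": 1, "c": 1, "d": 1}}
labels = {"a": "churner", "b": "churner", "c": "churner", "d": "non-churner"}

probabilities = relational_neighbour("n", neighbours, labels)
# {'churner': 0.75, 'non-churner': 0.25}
```

Because three of the four equally weighted neighbours are churners, node n is estimated to be a churner with probability 0.75, which is the "guilt by association" idea in numbers.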
