Module-1
by Dr. Rupak Chakraborty (Dept. of CSE-AI, TIU)
Big Data Analytics - Overview
The volume of data that one has to deal with has exploded to unimaginable levels in the past decade, and at the same time, the price of data
storage has systematically reduced. Private companies and research institutions capture terabytes of data about their users' interactions,
business, and social media, as well as from sensors in devices such as mobile phones and automobiles. The challenge of this era is to make
sense of this sea of data. This is where big data analytics comes into the picture.
Big Data Analytics largely involves collecting data from different sources, munging it so that it can be consumed by
analysts, and finally delivering data products useful to the organization's business.
The process of converting large amounts of unstructured raw data, retrieved from different sources, into a data product useful for organizations
forms the core of Big Data Analytics.
Traditional Data Mining Life Cycle
In order to provide a framework to organize the work needed by an organization and deliver clear insights from Big Data, it is useful to think of
it as a cycle with different stages. It is by no means linear, meaning all the stages are related to each other. This cycle has superficial
similarities with the more traditional data mining cycle as described in the CRISP methodology.
CRISP-DM Methodology
The CRISP-DM methodology, which stands for Cross-Industry Standard Process for Data Mining, is a cycle that describes commonly used
approaches that data mining experts use to tackle problems in traditional BI data mining. It is still being used by traditional BI data mining
teams.
The major stages of the cycle, as described by the CRISP-DM methodology, and the way they are interrelated are outlined below.
CRISP-DM was conceived in 1996 and, the next year, it got underway as a European Union project under the ESPRIT funding initiative. The
project was led by five companies: SPSS, Teradata, Daimler AG, NCR Corporation, and OHRA (an insurance company). The project was
finally incorporated into SPSS. The methodology is extremely detailed in how a data mining project should be specified.
Let us now learn a little more about each of the stages involved in the CRISP-DM life cycle −
Business Understanding − This initial phase focuses on understanding the project objectives and requirements from a business
perspective, and then converting this knowledge into a data mining problem definition. A preliminary plan is designed to achieve the
objectives. A decision model, especially one built using the Decision Model and Notation standard, can be used.
Data Understanding − The data understanding phase starts with an initial data collection and proceeds with activities in order to get
familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form
hypotheses for hidden information.
Data Preparation − The data preparation phase covers all activities to construct the final dataset (data that will be fed into the
modeling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times, and not in any prescribed
order. Tasks include table, record, and attribute selection as well as transformation and cleaning of data for modeling tools.
Modeling − In this phase, various modeling techniques are selected and applied and their parameters are calibrated to optimal values.
Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the
form of data. Therefore, it is often required to step back to the data preparation phase.
Evaluation − At this stage in the project, you have built a model (or models) that appears to have high quality, from a data analysis
perspective. Before proceeding to final deployment of the model, it is important to evaluate the model thoroughly and review the steps
executed to construct the model, to be certain it properly achieves the business objectives.
A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this
phase, a decision on the use of the data mining results should be reached.
Deployment − Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge
of the data, the knowledge gained will need to be organized and presented in a way that is useful to the customer.
Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a
repeatable data scoring (e.g. segment allocation) or data mining process.
In many cases, it will be the customer, not the data analyst, who will carry out the deployment steps. Even if the analyst deploys the model, it
is important for the customer to understand upfront the actions which will need to be carried out in order to actually make use of the created
models.
SEMMA Methodology
SEMMA is another methodology developed by SAS for data mining modeling. It stands for Sample, Explore, Modify, Model, and Assess. Here
is a brief description of its stages −
Sample − The process starts with data sampling, e.g., selecting the dataset for modeling. The dataset should be large enough to
contain sufficient information to retrieve, yet small enough to be used efficiently. This phase also deals with data partitioning.
Explore − This phase covers the understanding of the data by discovering anticipated and unanticipated relationships between the
variables, and also abnormalities, with the help of data visualization.
Modify − The Modify phase contains methods to select, create and transform variables in preparation for data modeling.
Model − In the Model phase, the focus is on applying various modeling (data mining) techniques on the prepared variables in order t o
create models that possibly provide the desired outcome.
Assess − The evaluation of the modeling results shows the reliability and usefulness of the created models.
The main difference between CRISP-DM and SEMMA is that SEMMA focuses on the modeling aspect, whereas CRISP-DM gives more
importance to the stages of the cycle prior to modeling, such as understanding the business problem to be solved and understanding and
preprocessing the data to be used as input to, for example, machine learning algorithms.
Big Data Life Cycle
In today's big data context, the previous approaches are either incomplete or suboptimal. For example, the SEMMA methodology completely
disregards data collection and the preprocessing of different data sources. These stages normally constitute most of the work in a successful big
data project.
A big data analytics cycle can be described by the following stages −
Business Problem Definition
Research
Human Resources Assessment
Data Acquisition
Data Munging
Data Storage
Exploratory Data Analysis
Data Preparation for Modeling and Assessment
Modeling
Implementation
In this section, we will throw some light on each of these stages of the big data life cycle.
Business Problem Definition
This is a point common to the traditional BI and big data analytics life cycles. Normally it is a non-trivial stage of a big data project to define the
problem and evaluate correctly how much potential gain it may have for an organization. It seems obvious to mention this, but the expected
gains and costs of the project have to be evaluated.
Research
Analyze what other companies have done in the same situation. This involves looking for solutions that are reasonable for your company,
even if it involves adapting other solutions to the resources and requirements that your company has. In this stage, a methodology for
the future stages should be defined.
Human Resources Assessment
Once the problem is defined, it is reasonable to continue by analyzing whether the current staff is able to complete the project successfully. Traditional
BI teams might not be capable of delivering an optimal solution for all the stages, so it should be considered before starting the project whether there is
a need to outsource a part of the project or hire more people.
Data Acquisition
This stage is key in a big data life cycle; it defines which types of profiles will be needed to deliver the resulting data product. Data
gathering is a non-trivial step of the process; it normally involves gathering unstructured data from different sources. To give an example, it
could involve writing a crawler to retrieve reviews from a website. This involves dealing with text, perhaps in different languages, and normally
requires a significant amount of time to be completed.
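To make this concrete, here is a minimal sketch of what such a review crawler might look like in Python, assuming the requests and beautifulsoup4 packages are installed; the URL and the CSS selector are purely illustrative and would have to be adapted to the actual website (and its terms of use).

import requests
from bs4 import BeautifulSoup

def fetch_reviews(url):
    """Download one page and return the text of each review block found on it."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # "div.review-text" is a hypothetical selector; real sites use their own markup.
    return [node.get_text(strip=True) for node in soup.select("div.review-text")]

if __name__ == "__main__":
    reviews = fetch_reviews("https://example.com/restaurant/123/reviews")
    print(f"Fetched {len(reviews)} reviews")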
Data Munging
Once the data is retrieved, for example, from the web, it needs to be stored in an easy-to-use format. To continue with the reviews example,
let's assume the data is retrieved from different sites, each of which displays the data differently.
Suppose one data source gives reviews in terms of a rating in stars; this can be read as a mapping for the response
variable y ∈ {1, 2, 3, 4, 5}. Another data source gives reviews using a two-arrow system, one for up-voting and the other for down-voting. This
would imply a response variable of the form y ∈ {positive, negative}.
In order to combine both data sources, a decision has to be made to make these two response representations equivalent. This
can involve converting the first data source's representation to the second form, considering one star as negative and five stars as
positive. This process often requires a large time allocation to be delivered with good quality.
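A minimal sketch of this munging step is given below. The two record formats and the cut-off (one and two stars as negative, four and five as positive) are illustrative assumptions, not part of any particular dataset.

def star_to_sentiment(stars):
    """Map a 1-5 star rating onto the {positive, negative} representation.
    Three-star reviews are dropped here as ambiguous; this cut-off is an assumption."""
    if stars <= 2:
        return "negative"
    if stars >= 4:
        return "positive"
    return None

def vote_to_sentiment(vote):
    """Map an up/down vote onto the same representation."""
    return "positive" if vote == "up" else "negative"

# Two hypothetical sources with different response representations.
source_a = [{"text": "Great food", "stars": 5}, {"text": "Awful service", "stars": 1}]
source_b = [{"text": "Loved it", "vote": "up"}, {"text": "Too slow", "vote": "down"}]

combined = [{"text": r["text"], "y": star_to_sentiment(r["stars"])} for r in source_a]
combined += [{"text": r["text"], "y": vote_to_sentiment(r["vote"])} for r in source_b]
combined = [r for r in combined if r["y"] is not None]   # drop ambiguous reviews
print(combined)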
Data Storage
Once the data is processed, it sometimes needs to be stored in a database. Big data technologies offer plenty of alternatives on this
point. The most common alternative is using the Hadoop File System for storage, which, together with Hive, provides users a limited version of
SQL known as the Hive Query Language. This allows most analytics tasks to be done in similar ways as they would be done in traditional BI data
warehouses, from the user's perspective. Other storage options to be considered are MongoDB, Redis, and Spark.
This stage of the cycle is related to the human resources available, in terms of their ability to implement different architectures. Modified
versions of traditional data warehouses are still being used in large-scale applications. For example, Teradata and IBM offer SQL databases
that can handle terabytes of data; open source solutions such as PostgreSQL and MySQL are still being used for large-scale applications.
Even though there are differences in how the different storages work in the background, from the client side, most solutions provide a SQL
API. Hence having a good understanding of SQL is still a key skill to have for big data analytics.
Although, a priori, this stage seems to be the most important topic, in practice this is not true. It is not even an essential stage. It is possible to
implement a big data solution that works with real-time data, in which case we only need to gather data to develop the model
and then implement it in real time. There would be no need to formally store the data at all.
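Since most of these stores expose a SQL API to the client, a simple sketch is enough to illustrate the point. Python's built-in sqlite3 module is used below purely as a stand-in for any SQL-speaking engine (Hive, PostgreSQL, Teradata and so on); the table and its contents are illustrative.

import sqlite3

conn = sqlite3.connect(":memory:")                      # stand-in for a real warehouse
conn.execute("CREATE TABLE reviews (source TEXT, sentiment TEXT)")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?)",
    [("site_a", "positive"), ("site_a", "negative"), ("site_b", "positive")],
)

# A similar aggregation, with minor dialect changes, would run as HiveQL or PostgreSQL.
query = "SELECT source, SUM(sentiment = 'positive') AS positives FROM reviews GROUP BY source"
for source, positives in conn.execute(query):
    print(source, positives)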
Exploratory Data Analysis
Once the data has been cleaned and stored in a way that insights can be retrieved from it, the data exploration phase is mandatory. The
objective of this stage is to understand the data; this is normally done with statistical techniques and by plotting the data. This is a good
stage to evaluate whether the problem definition makes sense or is feasible.
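As a small sketch, assuming pandas and matplotlib are available and using an illustrative column of star ratings, the exploration might start with summary statistics and a quick plot.

import pandas as pd
import matplotlib.pyplot as plt

# Illustrative data; in practice this would come from the storage layer.
df = pd.DataFrame({"stars": [5, 4, 1, 3, 5, 2, 4, 4, 5, 1]})

print(df.describe())                 # central tendency and dispersion
print(df["stars"].value_counts())    # frequency distribution

df["stars"].hist(bins=5)             # quick look at the distribution
plt.xlabel("stars")
plt.ylabel("count")
plt.show()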
Data Preparation for Modeling and Assessment
This stage involves reshaping the cleaned data retrieved previously and using statistical preprocessing for missing value imputation, outlier
detection, normalization, feature extraction and feature selection.
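A minimal sketch of this kind of preprocessing with scikit-learn is shown below; the small numeric matrix and the choice of steps (mean imputation, standardization, dropping constant features) are illustrative assumptions.

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold

# Illustrative feature matrix with one missing value and one constant column.
X = np.array([[1.0, 200.0, 7.0],
              [2.0, np.nan, 7.0],
              [3.0, 240.0, 7.0]])

X = SimpleImputer(strategy="mean").fit_transform(X)   # missing value imputation
X = StandardScaler().fit_transform(X)                 # normalization
X = VarianceThreshold().fit_transform(X)              # drop zero-variance features
print(X)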
Modeling
The prior stage should have produced several datasets for training and testing, for example, for a predictive model. This stage involves trying
different models with the aim of solving the business problem at hand. In practice, it is normally desired that the model gives
some insight into the business. Finally, the best model or combination of models is selected by evaluating its performance on a left-out dataset.
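The sketch below illustrates this with scikit-learn on a synthetic stand-in dataset: two candidate models are trained and the one that performs best on the left-out split is selected. The choice of models and metric is illustrative.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)  # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
}

# Fit each candidate and keep the one that does best on the left-out data.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

best = max(scores, key=scores.get)
print(scores, "-> selected:", best)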
Implementation
In this stage, the data product developed is implemented in the data pipeline of the company. This involves setting up a validation scheme
while the data product is working, in order to track its performance. For example, in the case of implementing a predictive model, this stage
would involve applying the model to new data and, once the response is available, evaluating the model.
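A minimal sketch of such a validation scheme is given below, with a synthetic model and data standing in for the deployed data product; in a real pipeline the new data would arrive continuously and the true responses some time later.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Stand-ins for the deployed model and for data flowing through the pipeline.
X, y = make_classification(n_samples=200, n_features=10, random_state=1)
model = LogisticRegression(max_iter=1000).fit(X[:150], y[:150])
X_new, y_new_later = X[150:], y[150:]       # new data now, true responses later

predictions = model.predict(X_new)          # apply the model to new data
performance_log = []                        # validation scheme: track quality over time
performance_log.append(accuracy_score(y_new_later, predictions))
print(performance_log)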
Univariate, Bivariate and Multivariate Data and their Analysis
1. Univariate data –
This type of data consists of only one variable. The analysis of univariate data is thus the simplest form of analysis since the
information deals with only one quantity that changes. It does not deal with causes or relationships and the main purpose of the
analysis is to describe the data and find patterns that exist within it. An example of univariate data is height.
Suppose that the heights of seven students of a class are recorded (figure 1); there is only one variable, height, and it is not
related to any cause or relationship. The patterns found in this type of data can be described by drawing
conclusions using measures of central tendency (mean, median and mode), the dispersion or spread of the data (range, minimum,
maximum, quartiles, variance and standard deviation), and by using frequency distribution tables, histograms, pie charts,
frequency polygons and bar charts.
2. Bivariate data
This type of data involves two different variables. The analysis of this type of data deals with causes and relationships and the
analysis is done to find out the relationship between the two variables. An example of bivariate data is temperature and ice
cream sales in the summer season.
Suppose the temperature and ice cream sales are the two variables of a bivariate dataset (figure 2). Here, the relationship is visible
from the table: temperature and sales are directly proportional to each other and thus related, because as the temperature
increases, the sales also increase. Thus bivariate data analysis involves comparisons, relationships, causes and explanations.
These variables are often plotted on the X and Y axes of a graph for a better understanding of the data; one of these variables is
independent while the other is dependent.
3. Multivariate data
When the data involves three or more variables, it is categorized as multivariate. For example, suppose an
advertiser wants to compare the popularity of four advertisements on a website; their click rates could be measured for both
men and women and the relationships between the variables could then be examined. It is similar to bivariate but contains more than one
dependent variable. The way the analysis is performed on this data depends on the goals to be achieved. Some of the techniques are
regression analysis, path analysis, factor analysis and multivariate analysis of variance (MANOVA).
There are lots of different tools, techniques and methods that can be used to conduct the analysis: software
libraries, visualization tools and statistical testing methods. The comparison below summarizes univariate, bivariate and
multivariate analysis, and a short code sketch after it illustrates all three.
Univariate − It summarizes only one variable at a time. It does not deal with causes and relationships and does not contain any
dependent variable. The main purpose is to describe. Example: the heights of students.
Bivariate − It summarizes two variables. It deals with causes and relationships, and the analysis is done to explain them. It contains
only one dependent variable. The main purpose is to explain. Example: temperature and ice cream sales during summer vacation.
Multivariate − It summarizes more than two variables. It is similar to bivariate but contains more than two variables, and the analysis
is done to study the relationships among them, which is its main purpose. Example: an advertiser wants to compare the popularity of
four advertisements on a website; the click rates can be measured for both men and women and the relationships between the
variables examined.
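To make the distinction concrete, the sketch below computes a univariate summary, a bivariate correlation and a simple multivariate breakdown with pandas; all the numbers are small illustrative values, not real measurements.

import pandas as pd

# Univariate: one variable (illustrative heights, in cm, of seven students).
heights = pd.Series([150, 152, 155, 155, 160, 162, 165])
print(heights.mean(), heights.median(), heights.mode().iloc[0], round(heights.std(), 2))

# Bivariate: two variables (illustrative temperature vs. ice cream sales).
temperature = pd.Series([20, 24, 28, 31, 35])
sales = pd.Series([110, 135, 160, 180, 210])
print(temperature.corr(sales))   # close to +1: directly proportional

# Multivariate: more than two variables (illustrative click rate by advertisement and gender).
ads = pd.DataFrame({
    "ad": [1, 1, 2, 2, 3, 3, 4, 4],
    "is_male": [0, 1, 0, 1, 0, 1, 0, 1],
    "click_rate": [0.10, 0.12, 0.08, 0.15, 0.20, 0.18, 0.05, 0.07],
})
print(ads.groupby(["ad", "is_male"])["click_rate"].mean())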
Big Data Analytics - Core Deliverables
As mentioned in the big data life cycle, the data products that result from developing a big data project are, in most cases, some of the
following −
Machine learning implementation − This could be a classification algorithm, a regression model or a segmentation model.
Recommender system − The objective is to develop a system that recommends choices based on user behavior. Netflix is the
characteristic example of this data product, where based on the ratings of users, other movies are recommended.
Dashboard − Businesses normally need tools to visualize aggregated data. A dashboard is a graphical mechanism to make this data
accessible.
Ad-Hoc analysis − Normally, business areas have questions, hypotheses or myths that can be answered by doing ad-hoc analysis with
data.
Big Data Analytics - Key Stakeholders
In large organizations, in order to successfully develop a big data project, it is necessary to have management backing the project. This
normally involves finding a way to show the business advantages of the project. There is no unique solution to the problem of finding
sponsors for a project, but a few guidelines are given below −
Check who and where are the sponsors of other projects similar to the one that interests you.
Having personal contacts in key management positions helps, so any contact can be triggered if the project is promising.
Who would benefit from your project? Who would be your client once the project is on track?
Develop a simple, clear, and exciting proposal and share it with the key players in your organization.
The best way to find sponsors for a project is to understand the problem and what the resulting data product will be once it has been
implemented. This understanding will give you an edge in convincing the management of the importance of the big data project.
Big Data Analytics - Data Analyst
A data analyst has a reporting-oriented profile, with experience in extracting and analyzing data from traditional data warehouses using
SQL. Their tasks are normally either on the side of data storage or in reporting general business results. Data warehousing is by no means
simple; it is just different from what a data scientist does.
Many organizations struggle to find competent data scientists in the market. It is, however, a good idea to select prospective data
analysts and teach them the relevant skills to become data scientists. This is by no means a trivial task and would normally involve the
person doing a master's degree in a quantitative field, but it is definitely a viable option. The basic skills a competent data analyst must have
are listed below −
Business understanding
SQL programming
Report design and implementation
Dashboard development
Big Data Analytics - Data Scientist
The role of a data scientist is normally associated with tasks such as predictive modeling, developing segmentation algorithms,
recommender systems, A/B testing frameworks and often working with raw unstructured data.
The nature of their work demands a deep understanding of mathematics, applied statistics and programming. There are a few skills common
to a data analyst and a data scientist, for example, the ability to query databases. Both analyze data, but the decisions of a data
scientist can have a greater impact on an organization.
Here is a set of skills a data scientist normally needs to have −
Programming in a statistical package such as R, Python, SAS, SPSS, or Julia
Able to clean, extract, and explore data from different sources
Research, design, and implementation of statistical models
Deep statistical, mathematical, and computer science knowledge
In big data analytics, people normally confuse the role of a data scientist with that of a data architect. In reality, the difference is quite simple.
A data architect defines the tools and the architecture in which the data will be stored, whereas a data scientist uses this architecture. Of course,
a data scientist should be able to set up new tools if needed for ad-hoc projects, but the infrastructure definition and design should not be a
part of their task.
Big Data Analytics - Problem Definition
Problem Definition
Problem definition is probably one of the most complex and most heavily neglected stages in the big data analytics pipeline. In order to define
the problem a data product would solve, experience is mandatory. Most data scientist aspirants have little or no experience in this stage.
Most big data problems can be categorized in the following ways −
Supervised classification
Supervised regression
Unsupervised learning
Learning to rank
Hopefully, students already know about the first three.
Learning to Rank
This problem can be considered a regression problem, but it has particular characteristics and deserves separate treatment. The
problem involves, given a collection of documents and a query, finding the most relevant ordering of the documents. In order to develop a supervised
learning algorithm, it is necessary to label how relevant an ordering is for a given query.
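As a minimal sketch of the pointwise flavour of this problem, the snippet below trains a plain regressor on illustrative query-document feature vectors with graded relevance labels, and then orders new candidate documents by their predicted score; the features, labels and model choice are all assumptions made for illustration.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Illustrative query-document features and graded relevance labels (0 = irrelevant, 2 = highly relevant).
X_train = np.array([[0.9, 0.1], [0.7, 0.3], [0.2, 0.8], [0.1, 0.9]])
y_train = np.array([2.0, 2.0, 1.0, 0.0])

scorer = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

# For a new query, score its candidate documents and sort them by predicted relevance.
candidates = np.array([[0.8, 0.2], [0.3, 0.6], [0.1, 0.95]])
order = np.argsort(-scorer.predict(candidates))
print("ranking (best first):", order)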
Big Data Analytics - Data Collection
Data collection plays the most important role in the Big Data cycle. The Internet provides almost unlimited sources of data for a variety of
topics. The importance of this area depends on the type of business, but traditional industries can acquire diverse sources of external data
and combine them with their transactional data.
For example, let's assume we would like to build a system that recommends restaurants. The first step would be to gather data, in this case,
reviews of restaurants from different websites, and store them in a database. As we are interested in raw text and would use it for
analytics, it is not that relevant where the data for developing the model is stored. This may sound contradictory to the main big data
technologies, but in order to implement a big data application, we simply need to make it work in real time.
Big Data Analytics - Cleansing Data
Once the data is collected, we normally have diverse data sources with different characteristics. The most immediate step would be to make
these data sources homogeneous and continue to develop our data product. However, it depends on the type of data. We should ask
ourselves if it is practical to homogenize the data.
Maybe the data sources are completely different, and the information loss would be large if the sources were homogenized. In this case, we
can think of alternatives. Can one data source help me build a regression model and the other one a classification model? Is it possible to
work with the heterogeneity to our advantage rather than just lose information? Making these decisions is what makes analytics interesting
and challenging.
In the case of reviews, it is possible to have a language for each data source. Again, we have two choices −
Homogenization − It involves translating different languages to the language where we have more data. The quality of translation
services is acceptable, but if we would like to translate massive amounts of data with an API, the cost would be significant. There are
software tools available for this task, but that would be costly too.
Heterogenization − Would it be possible to develop a solution for each language? As it is simple to detect the language of a corpus,
we could develop a recommender for each language, as sketched below. This would involve more work in terms of tuning each recommender
according to the number of languages available, but it is definitely a viable option if only a few languages are involved.
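A small sketch of the heterogenization route is given below, assuming the third-party langdetect package is installed; the reviews are illustrative, and each language bucket would then feed its own recommender.

from collections import defaultdict
from langdetect import detect

reviews = [
    "The food was wonderful",
    "La comida estaba deliciosa",
    "Das Essen war ausgezeichnet",
]

by_language = defaultdict(list)
for text in reviews:
    by_language[detect(text)].append(text)   # e.g. 'en', 'es', 'de'

for lang, texts in by_language.items():
    print(lang, len(texts))                  # one recommender would be built per bucket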
Big Data Analytics - Data Exploration
Exploratory data analysis is a concept developed by John Tukey (1977) that consists of a new perspective on statistics. Tukey's idea
was that in traditional statistics the data was not being explored graphically; it was just being used to test hypotheses. The first attempt to
develop a tool was made at Stanford; the project was called PRIM-9. The tool was able to visualize data in nine dimensions and therefore
provide a multivariate perspective of the data.
These days, exploratory data analysis is a must and has been included in the big data analytics life cycle. The ability to find insights and
communicate them effectively in an organization is fueled by strong EDA capabilities.
Based on Tukey's ideas, Bell Labs developed the S programming language in order to provide an interactive interface for doing statistics.
The idea of S was to provide extensive graphical capabilities with an easy-to-use language. In today's world, in the context of Big
Data, R, which is based on the S programming language, is the most popular software for analytics.