Big Data Analytics Module-1
1.0 OBJECTIVES of Big Data
Irrespective of the size of the enterprise, whether big or small, data
continues to be a precious and irreplaceable asset. Data is present in homogeneous
as well as heterogeneous sources. The need of the hour is to understand, manage,
process and analyze this data to draw valuable insights. Digital data can be
structured, semi-structured or unstructured.
1. Unstructured data: This is data which does not conform to a data model or
is not in a form which can be used easily by a computer program. About 80% of an
organization's data is in this format; for example, memos, chat-room transcripts,
PowerPoint presentations, images, videos, letters, research papers, white papers,
the body of an email, etc.
2. Semi-structured data: This is data which does not conform to a strict data model
but has some structure, usually in the form of tags or markers; for example, HTML
data, XML data and email data.
3. Structured data: This is data which is in an organized form, for example in rows
and columns, and can be used easily by a computer program. Data stored in a
relational database is an example of structured data.
The "Internet of Things" and its widely ultra-connected nature are leading to a
burgeoningrise in big data. There is no dearth of data for today's enterprise. On the
contrary, they are mired in data and quite deep at that. That brings us to the
following questions:
1. Why is it that we cannot forego big data?
2. How has it come to assume such enormous importance in running a business?
3. How does it compare with the traditional Business Intelligence (BI)
environment?
4. Is it here to replace the traditional, relational database management system and
data warehouse environment, or is it likely to complement their existence?
Data is widely available. What is scarce is the ability to draw valuable insight.
As of 2014, LinkedIn has more than 250 million user accounts and has
added many additional features and data-related products, such as recruiting, job
seeker tools, advertising, and InMaps, which show a social graph of a user's
professional network.
"What are the events associated with this data?" and so on.
Small data (data as it existed prior to the big data revolution) is about certainty. It
is about known datasources; it is about no major changes to the composition or
context of data.
Composition
Data Condition
Context
Most often we have answers to queries like why this data was generated,
where and when it was generated, exactly how we would like to use it, what
questions this data will be able to answer, and so on. Big data is about complexity:
complexity in terms of multiple and unknown datasets, in terms of exploding
volume, in terms of the speed at which the data is being generated and the speed at
which it needs to be processed, and in terms of the variety of data (internal or
external, behavioural or social) that is being generated.
• Big data is high-volume, high-velocity and high-variety information assets that
demand cost-effective, innovative forms of information processing for enhanced
insight and decision making.
• Big data refers to datasets whose size is typically beyond the storage capacity
of, and too complex for, traditional database software tools.
• Big data is anything beyond the human and technical infrastructure needed to
support storage, processing and analysis.
• It is data that is big in volume, velocity and variety. Refer to Figure 1.3.
Variety: Data can be structured, semi-structured or unstructured. Data stored in a
database is an example of structured data; HTML data, XML data and email data
are examples of semi-structured data.
Velocity: Velocity essentially refers to the speed at which data is being created in
real time. We have moved from simple desktop applications, like a payroll
application, to real-time processing applications.
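To make the Variety dimension concrete, the short Python sketch below contrasts structured, semi-structured and unstructured records. It is purely illustrative; the field names and sample values are hypothetical, and a real system would read such data from actual databases, XML/JSON feeds or documents.

import json
import xml.etree.ElementTree as ET

# Structured: a fixed schema, like a row from a relational table (hypothetical columns).
structured_row = {"customer_id": 101, "name": "Asha", "balance": 2500.75}
print(json.dumps(structured_row))      # easily stored, queried and exchanged

# Semi-structured: self-describing tags (XML/HTML/email headers) but no rigid schema.
order = ET.fromstring("<order><id>42</id><item>book</item></order>")
print(order.find("item").text)         # -> book

# Unstructured: free text such as the body of an email or a memo; no model to query directly.
memo = "Hi team, the Q3 memo and slides are attached. Regards, Ravi."
print("memo" in memo.lower())          # crude keyword search -> True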
For the sake of easy comprehension, we will look at the definition in three
parts. Refer Figure 1.4.
Part III of the definition: "enhanced insight and decision making" talks about
deriving deeper, richer and meaningful insights and then using these insights to
make faster and better decisions to gain business value and thus a competitive
edge.
Data -> Information -> Actionable intelligence -> Better decisions -> Enhanced business value
Figure 1.4 Definition of big data – Gartner
Refer figure 1.5. Following are a few challenges with big data:
Data volume: Data today is growing at an exponential rate. This high tide of data
will continue to rise. The key questions are:
"Will all this data be useful for analysis?",
"Do we work with all this data or a subset of it?",
"How will we separate the knowledge from the noise?", etc.
Storage: Cloud computing is the answer to managing infrastructure for big data as
far as cost-efficiency, elasticity and easy upgrading/downgrading are concerned.
However, this further complicates the decision to host big data solutions outside
the enterprise.
Data retention: How long should one retain this data? Some data may be required
for long-term decisions, but some data may quickly become irrelevant and obsolete.
Skilled professionals: In order to develop, manage and run those applications that
generate insights, organizations need professionals who possess a high level of
proficiency in data sciences.
Other challenges: Other challenges of big data are with respect to capture,
storage, search, analysis, transfer and security of big data.
Visualization: Big data refers to datasets whose size is typically beyond the storage
capacity of traditional database software tools, and there is no explicit definition of
how big a dataset should be for it to be considered big data. Visualizing data at this
scale is a challenge in itself: data visualization (computer graphics) is becoming
popular as a separate discipline, yet there are very few data visualization experts.
The more data we have for analysis, the greater the analytical accuracy
and the greater the confidence in the decisions based on these analytical findings.
This analytical accuracy will lead to a greater positive impact in terms of enhancing
operational efficiencies, reducing cost and time, creating new products and services,
and optimizing existing services. Refer Figure 1.6.
The transformed data is then loaded into the enterprise data warehouse
(available at the enterprise level) or data marts (available at the business unit/
functional unit or business process level).
Business intelligence and analytics tools are then used to enable decision
making through ad hoc queries, SQL, enterprise dashboards, data mining,
Online Analytical Processing (OLAP), etc. Refer Figure 1.7.
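As an illustration of the kind of ad hoc query a BI dashboard might issue against a warehouse, here is a minimal, self-contained Python sketch using an in-memory SQLite table as a stand-in for the enterprise data warehouse; the table and column names are hypothetical.

import sqlite3

# An in-memory table standing in for a (hypothetical) warehouse fact table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, quarter TEXT, units INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("North", "Q1", 120), ("North", "Q2", 90), ("South", "Q1", 200)])

# An ad hoc aggregation of the kind a dashboard or OLAP tool would run:
# how many units were sold per region?
for region, total in conn.execute(
        "SELECT region, SUM(units) FROM sales GROUP BY region ORDER BY region"):
    print(region, total)   # North 210, South 200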
Following are the differences that one encounters when dealing with traditional BI
and big data.
Organizations have been trying to reduce customer churn, increase sales, and
cross-sell customers for many years.
What is new is the opportunity to fuse advanced analytical techniques with Big
Data to produce more impactful analyses for these traditional problems.
Many compliance and regulatory laws have been in existence for decades, but
additional requirements are added every year, which represent additional
complexity and data requirements for organizations.
Tables – 1.3 and 1.4 explain the comparison between BI and Data Science.
Business Intelligence
Typical Techniques and Data Types:
• Standard and ad hoc reporting, dashboards, alerts, queries, details on demand
• Structured data, traditional sources, manageable datasets
Common Questions:
• What happened last quarter?
• How many units sold?
• Where is the problem? In which situations?
Table 1.4: BI
1.9.1 Current Analytical Architecture: Figure 1.9 explains a typical analytical
architecture.
1. For data sources to be loaded into the data warehouse, data needs to be well
understood, structured and normalized with the appropriate data type
definitions.
2. As a result of this level of control on the EDW (enterprise data warehouse, on a
server or on the cloud), additional local systems may emerge in the form of
departmental warehouses and local data marts that business users create to
accommodate their need for flexible analysis. However, these local systems
reside in isolation, often are not synchronized or integrated with other data
stores and may not be backed up.
3. In the data warehouse, data is read by additional applications across the
enterprise for BI and reporting purposes.
4. At the end of this workflow, analysts get data from the server. Because users
generally are not allowed to run custom or intensive analytics on production
databases, analysts create data extracts from the EDW to analyze data offline in
R or other local analytical tools, while the EDW continues to store and process
critical data, supporting enterprise applications and enabling corporate reporting
activities.
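The sketch below illustrates step 4: an analyst pulling a local extract out of the warehouse and exploring it offline. It assumes the pandas library is available and that a file named edw_extract.db with an orders table has already been exported from the EDW; both names are hypothetical.

import sqlite3
import pandas as pd

# Hypothetical local extract exported from the EDW, so the analysis below
# never touches the production warehouse.
conn = sqlite3.connect("edw_extract.db")
orders = pd.read_sql("SELECT customer_id, order_date, amount FROM orders", conn)

# Offline exploration on the analyst's machine (the role R often plays).
orders["order_date"] = pd.to_datetime(orders["order_date"])
monthly = orders.set_index("order_date")["amount"].resample("M").sum()
print(monthly.describe())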
As shown in Figure 1.10, in the 1990s the volume of information was often
measured in terabytes. Most organizations analyzed structured data in rows and
columns and used relational databases and data warehouses to manage large
amounts of enterprise information.
Figure 1.10: Data Evolution and the Rise of Big Data Sources
1. Data devices and the "SensorNet" gather data from multiple locations and
continuously generate new data about this data. For each gigabyte of new data
created, an additional petabyte of data is created about that data.
Retail shopping loyalty cards record not just the amount an individual
spends, but the locations of stores that person visits, the kinds of products
purchased, the stores where goods are purchased most often, and the combinations
of products purchased together. Collecting this data provides insights into
shopping and travel habits and the likelihood of successful advertisement targeting
for certain types of retail promotions.
2. Data collectors include sample entities that collect data from devices and users;
for example:
Data results from a cable TV provider tracking the shows a
person watches, which TV channels someone will and will not pay to watch on
demand, and the prices someone is willing to pay for premium TV content.
Retail stores tracking the path a customer takes through their store while
pushing a shopping cart with an RFID chip, so they can gauge which products get
the most foot traffic, using geospatial data collected from the RFID chips.
3. Data aggregators make sense of the data collected from the various entities
from the "SensorNet" or the "Internet of Things." These organizations compile
data from the devices and usage patterns collected by government agencies,
retail stores and websites. In turn, they can choose to transform and package
the data as products to sell to list brokers, who may want to generate marketing
lists of people who may be good targets for specific ad campaigns.
4. Data users / buyers: These groups directly benefit from the data collected
and aggregated by others within the data value chain. Retail banks, acting as a
data buyer, may want to know which customers have the highest likelihood to
apply for a second mortgage or a home equity line of credit.
To provide input for this analysis, retail banks may purchase data from a data
aggregator. This kind of data may include demographic information about people
living in specific locations; people who appear to have a specific level of debt, yet
still have solid credit scores (or other characteristics such as paying bills on time
and having savings accounts) that can be used to infer credit worthiness; and those
who are searching the web for information about paying off debts or doing home
remodeling projects. Obtaining data from these various sources and aggregators
will enable a more targeted marketing campaign, which would have been more
challenging before Big Data due to the lack of information or high-performing
technologies.
Refer Figure 1.12 for the key roles of the new big data ecosystem.
3. Data Savvy Professionals: These people have less technical depth but have a
basic knowledge of statistics or machine learning and can define key questions
that can be answered using advanced analytics.
They tend to have a base knowledge of working with data, or an
appreciation for some of the work being performed by data scientists and
others with deep analytical talent.
These three groups must work together closely to solve complex Big Data
challenges.
Most organizations are familiar with people in the latter two groups mentioned,
but the first group, Deep Analytical Talent, tends to be the newest role for most
and the least understood.
For simplicity, this discussion focuses on the emerging role of the Data
Scientist. It describes the kinds of activities that role performs and provides a
more detailed view of the skills needed to fulfill that role.
There are three recurring sets of activities that data scientists perform:
Data scientists are generally thought of as having five main sets of skills and
behavioral characteristics, as shown in Figure 1.13:
• Quantitative skill: such as mathematics or statistics.
• Skeptical mind-set and critical thinking: It is important that data scientists can
examine their work critically rather than in a one-sided way.
• Curious and creative: Data scientists are passionate about data and finding
creative ways to solve problems and portray information.
Data scientists are generally comfortable using this blend of skills to acquire,
manage, analyze, and visualize data and tell compelling stories about it.
OBJECTIVES
Big Data is creating significant new opportunities for organizations to derive new value
and create competitive advantage from their most valuable asset: information. For
businesses, Big Data helps drive efficiency, quality, and personalized products and
services, producing improved levels of customer satisfaction and profit. For scientific
efforts, Big Data analytics enable new avenues of investigation with potentially richer
results and deeper insights than previously available. In many cases, Big Data analytics
integrate structured and unstructured data with real-time feeds and queries, opening new
paths to innovation and insight.
2. About gaining a meaningful, deeper, and richer insight into your business to
steer it in the right direction, understanding the customer's demographics to
cross-sell and up-sell to them, better leveraging the services of your vendors
and suppliers, etc.
Author's experience: The other day I was pleasantly surprised to get a few
recommendations via email from one of my frequently visited online retailers.
They had recommended a clothing line from my favorite brand, and the color
suggested was also to my liking. How did they arrive at this? In the recent past, I
had been buying clothing of a particular brand, and my color preference was
pastel shades. They had this stored in their database and pulled it out while making
recommendations to me.
3. About a competitive edge over your competitors by enabling you with findings
that allow quicker and better decision-making.
4. A tight handshake between three communities: IT, business users, and data
scientists. Refer Figure 3.3.
5. Working with datasets whose volume and variety exceed the current storage
and processing capabilities and infrastructure of your enterprise.
6. About moving code to data. This makes perfect sense as the program for
distributed processing is tiny (just a few KBs) compared to the data (terabytes or
petabytes today, and likely to be exabytes or zettabytes in the near future).
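To see why shipping the code is cheaper than shipping the data, consider the word-count pattern popularized by MapReduce. The Python sketch below is a single-machine simulation, not Hadoop itself: the two small functions represent the few kilobytes of code sent out to the nodes, while the "blocks" stand in for data splits that stay where they are stored.

from collections import defaultdict

def map_phase(block):
    """Emit (word, 1) pairs for one data block held on one node."""
    for word in block.split():
        yield word.lower(), 1

def reduce_phase(pairs):
    """Sum the counts emitted by all the mappers."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Two "blocks" standing in for distributed file-system splits on different nodes.
blocks = ["big data needs big thinking", "data moves less when code moves more"]
pairs = (pair for block in blocks for pair in map_phase(block))
print(reduce_phase(pairs))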
2.2 CLASSIFICATION OF ANALYTICS
Basic analytics: This primarily is slicing and dicing of data to help with basic
business insights. This is about reporting on historical data, basic visualization, etc.
Let us take a closer look at analytics 1.0, analytics 2.0, and analytics 3.0. Refer
Table 2.1. Figure 2.1 shows the subtle growth of analytics from descriptive to
diagnostic to predictive to prescriptive analytics, which asks "How can we make it
happen?"
Analytics 1.0
• Era: mid 1990s to 2009
• Descriptive statistics (report on events, occurrences, etc. of the past)
• Key questions asked: What happened? Why did it happen?
• Small and structured data sources. Data stored in enterprise data warehouses or
data marts.
• Data was internally sourced.
• Relational databases
Analytics 2.0
• Era: 2005 to 2012
• Descriptive + predictive statistics (use data from the past to make predictions
for the future)
• Key questions asked: What happened? Why will it happen?
• Big data is being taken up seriously. Data is mainly unstructured, arriving at a
much higher pace. This fast flow of data entailed that the influx of big volume
data had to be stored and processed rapidly, often on massive parallel servers
running Hadoop.
• Data was often externally sourced.
• Database appliances, Hadoop clusters, SQL to Hadoop environments, etc.
Analytics 3.0
• Era: 2012 to present
• Descriptive + predictive + prescriptive statistics (use data from the past to make
prophecies for the future and at the same time make recommendations to
leverage the situation to one's advantage)
• Key questions asked: What will happen? When will it happen? Why will it
happen? What should be the action taken to take advantage of what will happen?
• A blend of big data and traditional analytics to yield insights and offerings with
speed and impact.
• Data is being both internally and externally sourced.
• In-memory analytics, in-database processing, agile analytical methods, machine
learning techniques, etc.
Table 2.1: Analytics 1.0, 2.0 and 3.0
There are mainly seven challenges of big data: scale, security, schema,
continuous availability, consistency, partition tolerance and data quality.
Security: Most of the NoSQL big data platforms have poor security mechanisms
(lack of proper authentication and authorization mechanisms) when it comes to
safeguarding big data. This cannot be ignored given that big data carries credit
card information, personal information and other sensitive data.
Schema: Rigid schemas have no place. We want the technology to be able to fit
our big data and not the other way around. The need of the hour is dynamic
schemas; static (pre-defined) schemas are obsolete.
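A minimal sketch of what a dynamic schema buys us, using plain Python dictionaries to mimic documents in a NoSQL document store (the field names are invented for illustration): records in the same collection can carry different attributes, and new attributes can appear without altering a predefined schema.

# Two "documents" in the same logical collection; neither is forced into a
# fixed column layout, so new attributes can appear at any time.
customers = [
    {"id": 1, "name": "Asha", "email": "asha@example.com"},
    {"id": 2, "name": "Ravi", "phones": ["+91-98xxxxxxxx"], "loyalty_tier": "gold"},
]

# Queries simply cope with fields that may or may not be present.
for c in customers:
    print(c["name"], c.get("loyalty_tier", "no tier on record"))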
Continuous availability: The big question here is how to provide 24/7 support,
because almost all RDBMS and NoSQL big data platforms have a certain amount
of downtime built in.
Let us study the various approaches to analysis of data and what it leads to.
Reactive-Business Intelligence: What does Business Intelligence (BI) help us
with? It allows the businesses to make faster and better decisions by providing the
right information to the right person at the right time in the right format. It is about
analysis of the past or historical data and then displaying the findings of the
analysis or reports in the form of enterprise dashboards, alerts, notifications, etc. It
has support for both pre-specified reports as well as ad hoc querying.
Reactive - Big Data Analytics: Here the analysis is done on huge datasets but the
approach is still reactive as it is still based on static data.
Data science employs techniques and theories drawn from many fields within the
broad areas of mathematics, statistics and information technology, including
machine learning, data engineering, probability models, statistical learning, pattern
recognition and learning, etc.
A data scientist should have the following abilities to play the role of a data
scientist:
• Understanding of domain
• Business strategy
• Problem solving
• Communication
• Presentation
• Keenness
The following are the key skills that a data scientist must have to
comprehend data, interpret it and analyze it:
• Mathematics.
• Statistics.
• Artificial Intelligence (AI).
• Algorithms.
• Machine learning.
• Pattern recognition.
• Natural Language Processing.
To sum it up, the data science process is as follows (a minimal sketch in code follows the list):
• Collecting raw data from multiple different data sources.
• Processing the data.
• Integrating the data and preparing clean datasets.
• Engaging in exploratory data analysis using models and algorithms.
• Preparing presentations using data visualizations.
• Communicating the findings to all stakeholders.
• Making faster and better decisions.
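The toy Python sketch below mirrors those steps end to end on two tiny, made-up CSV "sources"; it uses only the standard library and is meant to show the shape of the process, not a production pipeline.

import csv
import io
import statistics

# Collect: two "sources" (inline CSV strings standing in for real feeds).
source_a = "region,revenue\nNorth,120\nSouth,200\n"
source_b = "region,revenue\nNorth,90\nEast,150\n"

# Process + integrate: parse both sources into one clean dataset.
rows = []
for src in (source_a, source_b):
    rows.extend(csv.DictReader(io.StringIO(src)))
clean = [{"region": r["region"], "revenue": float(r["revenue"])} for r in rows]

# Explore: a simple descriptive summary that would guide model choice.
revenues = [r["revenue"] for r in clean]
print("mean revenue:", statistics.mean(revenues))

# Communicate: a per-region summary a stakeholder could act on.
by_region = {}
for r in clean:
    by_region[r["region"]] = by_region.get(r["region"], 0.0) + r["revenue"]
print(by_region)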
2.7 RESPONSIBILITIES
Basically Available: This constraint states that the system does guarantee the
availability of the data as regards CAP Theorem; there will be a response to any
request. But, that response could still be ‘failure’ to obtain the requested data or the
data may be in an inconsistent or changing state, much like waiting for a check to
clear in your bank account.
Eventual consistency: The system will eventually become consistent once it stops
receiving input. The data will propagate to everywhere it should sooner or later,
but the system will continue to receive input and is not checking the consistency of
every transaction before it moves on to the next one. Werner Vogels' article
“Eventually Consistent – Revisited” covers this topic in much greater detail.
Soft state: The state of the system could change over time, so even during times
without input there may be changes going on due to
‘eventual consistency,’ thus the state of the system is always ‘soft.’
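The BASE behaviour described above can be imitated with a toy simulation. The Python sketch below is not how any particular database implements replication; it simply shows a write being accepted on one replica ("basically available"), reads returning stale values while the state is still "soft", and all replicas converging once the pending update is propagated ("eventually consistent").

# Three replicas of a tiny key-value store.
replicas = [{"balance": 100} for _ in range(3)]
pending = []  # updates accepted but not yet applied everywhere

def write(key, value):
    replicas[0][key] = value        # basically available: accept the write immediately
    pending.append((key, value))    # remember to propagate it later

def read(replica_index, key):
    return replicas[replica_index][key]

def propagate():
    while pending:
        key, value = pending.pop(0)
        for replica in replicas[1:]:
            replica[key] = value

write("balance", 80)
print(read(1, "balance"))   # 100 -> a stale read before propagation (soft state)
propagate()
print(read(1, "balance"))   # 80  -> the replicas have converged (eventual consistency)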
Phase 2-Data preparation: In this phase, the team also needs to familiarize itself
with the data thoroughly and take steps to condition the data.
Phase 3-Model planning: Phase 3 is model planning, where the team determines
the methods, techniques and workflow it intends to follow for the subsequent
model building phase. The team explores the data to learn about the relationships
between variables and subsequently selects key variables and the most suitable
models.
Phase 4-Model building: In Phase 4, the team develops data sets for
testing, training, and production purposes. In addition, in this phase the team builds
and executes models based on the work done in the model planning phase. The
team also considers whether its existing tools will suffice for running the models,
or if it will need a more robust
environment for executing models and workflows (for example, fast hardware and
parallel processing, if applicable).
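As a concrete illustration of Phase 4, the sketch below splits a dataset into training and test portions, fits a model and checks it on held-out data. It assumes the scikit-learn library is available and uses a synthetic dataset in place of the team's real data; it is a sketch of the phase, not the prescribed tooling.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# "Develop datasets for testing and training purposes" (synthetic stand-in data).
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Build and execute a model based on the plan from the model planning phase.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate on held-out data before deciding whether a more robust
# environment (faster hardware, parallel processing) is needed.
print("hold-out accuracy:", accuracy_score(y_test, model.predict(X_test)))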
Phase 6-Operationalize: In Phase 6, the team delivers final reports, briefings, code
and technical documents. In addition, the team may run a pilot project to
implement the models in a production environment.