
BIG DATA ANALYTICS

Module -1
1.0 OBJECTIVES of Big Data
Irrespective of the size of the enterprise, big or small, data continues to be a precious and irreplaceable asset. Data is present in homogeneous as well as heterogeneous sources. The need of the hour is to understand, manage, process, and analyze this data to draw valuable insights. Digital data can be structured, semi-structured, or unstructured.

Data generates information, and from information we can draw valuable insight. As depicted in Figure 1.1, digital data can be broadly classified into structured, semi-structured, and unstructured data.

1. Unstructured data: This is data that does not conform to a data model or is not in a form that can be used easily by a computer program. About 80% of an organization's data is in this format; for example, memos, chat-room transcripts, PowerPoint presentations, images, videos, letters, research papers, white papers, the body of an email, etc.

Figure 1.1: Classification of digital data (unstructured, semi-structured, structured)

2. Semi-structured data: Semi-structured data is also referred to as having a self-describing structure. This is data that does not conform to a data model but has some structure; however, it is not in a form that can be used easily by a computer program. About 10% of an organization's data is in this format; for example, HTML, XML, JSON, email data, etc.

3. Structured data: When data follows a pre-defined schema/structure, we say it is structured data. This is data in an organized form (e.g., in rows and columns) that can be easily used by a computer program. Relationships exist between entities of data, such as classes and their objects. About 10% of an organization's data is in this format.
Data stored in databases is an example of structured data.
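As a quick, purely illustrative sketch (the records below are invented, not taken from the text), the same customer fact can appear in all three forms; a program can query the structured row directly, must parse the semi-structured JSON, and has to fall back on search or text processing for the unstructured email:

```python
import json

# Structured: fixed schema, rows and columns (e.g., one row of a relational table).
structured_row = {"customer_id": 101, "name": "Asha", "city": "Pune", "spend": 4500.0}

# Semi-structured: self-describing keys/tags, but no rigid schema (e.g., JSON or XML).
semi_structured = '{"customer": {"id": 101, "name": "Asha", "orders": [{"sku": "A12"}]}}'

# Unstructured: free text such as the body of an email; no data model at all.
unstructured_email = "Hi team, Asha from Pune called about her last order..."

print(structured_row["spend"])                        # trivial to query
print(json.loads(semi_structured)["customer"]["id"])  # needs parsing, but still navigable
print("Asha" in unstructured_email)                   # needs search / text processing
```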

1.1 INTRODUCTION TO BIG DATA

The "Internet of Things" and its widely ultra-connected nature are leading to a
burgeoningrise in big data. There is no dearth of data for today's enterprise. On the
contrary, they are mired in data and quite deep at that. That brings us to the
following questions:
1. Why is it that we cannot forego big data?
2. How has it come to assume such magnanimous importance in running
business?
3. How does it compare with the traditional Business Intelligence (BI)
environment?
4. Is it here to replace the traditional, relational database management system and
data warehouse environment or is it likely to complement their existence?"

Data is widely available. What is scarce is the ability to draw valuable insight.

Some examples of Big Data:


• There are examples of Big Data Analytics in different areas such as retail, IT infrastructure, and social media.
• Retail: As mentioned earlier, Big Data presents many opportunities to improve sales and marketing analytics.
• An example of this is the U.S. retailer Target. After analyzing consumer purchasing behavior, Target's statisticians determined that the retailer made a great deal of money from three main life-event situations:
• Marriage, when people tend to buy many new products
• Divorce, when people buy new products and change their spending habits
• Pregnancy, when people have many new things to buy and have an urgency to buy them. The analysis helped Target manage its inventory, knowing that there would be demand for specific products and that it would likely vary by month over the coming nine- to ten-month cycle.
• IT infrastructure: The MapReduce paradigm is an ideal technical framework for many Big Data projects, which rely on large data sets with unconventional data structures (a minimal sketch of the paradigm follows below).
• One of the main benefits of Hadoop is that it employs a distributed file system, meaning it can use a distributed cluster of servers and commodity hardware to process large amounts of data.
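To make the MapReduce idea concrete, here is a minimal single-machine sketch of a word count (real Hadoop jobs distribute the map, shuffle, and reduce steps across a cluster of data nodes; the input lines here are invented):

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def reduce_phase(pairs):
    # Shuffle + reduce: group the pairs by key and sum the counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data needs big storage", "data flows fast"]
print(reduce_phase(map_phase(lines)))
# {'big': 2, 'data': 2, 'needs': 1, 'storage': 1, 'flows': 1, 'fast': 1}
```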

Some of the most common examples of Hadoop implementations are in the social media space, where Hadoop can manage transactions, give textual updates, and develop social graphs among millions of users.

Twitter and Facebook generate massive amounts of unstructured data and use Hadoop and its ecosystem of tools to manage this high volume.

Social media: Social media represents a tremendous opportunity to leverage social and professional interactions to derive new insights.

LinkedIn represents a company in which data itself is the product. Early on, LinkedIn founder Reid Hoffman saw the opportunity to create a social network for working professionals.

As of 2014, LinkedIn has more than 250 million user accounts and has added many additional features and data-related products, such as recruiting, job seeker tools, advertising, and InMaps, which show a social graph of a user's professional network.

1.2 CHARACTERISTICS OF DATA

As depicted in Figure 1.2, data has three key characteristics:


1. Composition: The composition of data deals with the structure of data, that is, the sources of data, the granularity, the types, and the nature of data as to whether it is static or real-time streaming.
2. Condition: The condition of data deals with the state of data, that is, "Can one use this data as is for analysis?" or "Does it require cleansing for further enhancement and enrichment?"
3. Context: The context of data deals with "Where has this data been generated?", "Why was this data generated?", "How sensitive is this data?", "What are the events associated with this data?", and so on.

Small data (data as it existed prior to the big data revolution) is about certainty. It is about known data sources; it is about no major changes to the composition or context of data.
Figure 1.2: Characteristics of data (composition, condition, context)

Most often we have answers to queries like why this data was generated, where and when it was generated, exactly how we would like to use it, what questions this data will be able to answer, and so on. Big data is about complexity: complexity in terms of multiple and unknown datasets, in terms of exploding volume, in terms of the speed at which the data is being generated and the speed at which it needs to be processed, and in terms of the variety of data (internal or external, behavioural or social) that is being generated.

1.3 EVOLUTION OF BIG DATA


The 1970s and before was the era of mainframes; the data was essentially primitive and structured. Relational databases evolved in the 1980s and 1990s; this was the era of data-intensive applications. The World Wide Web (WWW) and the Internet of Things (IoT) have led to an onslaught of structured, unstructured, and multimedia data. Refer Table 1.1.

Table 1.1 The evolution of big data

1.4 DEFINITION OF BIG DATA

• Big data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
• Big data refers to datasets whose size is typically beyond the storage capacity of, and too complex for, traditional database software tools.
• Big data is anything beyond the human and technical infrastructure needed to support storage, processing, and analysis.
• It is data that is big in volume, velocity, and variety. Refer to Figure 1.3.

Variety: Data can be structured, semi-structured, or unstructured. Data stored in a database is an example of structured data. HTML data, XML data, email data, and CSV files are examples of semi-structured data. PowerPoint presentations, images, videos, research papers, white papers, the body of an email, etc. are examples of unstructured data.

Figure 1.3: Data: big in volume, variety, and velocity

Velocity: Velocity essentially refers to the speed at which data is being created in real time. We have moved from simple desktop applications like a payroll application to real-time processing applications.

Volume: Volume can be in terabytes, petabytes, or even zettabytes.

Gartner Glossary: Big data is high-volume, high-velocity, and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight and decision making.

For the sake of easy comprehension, we will look at the definition in three
parts. Refer Figure 1.4.

Part I of the definition: "Big data is high-volume, high-velocity, and high-variety information assets" talks about voluminous data (humongous data) that may have great variety (a good mix of structured, semi-structured, and unstructured data) and will require a good speed/pace for storage, preparation, processing, and analysis.

Part II of the definition: "cost-effective, innovative forms of information processing" talks about embracing new techniques and technologies to capture (ingest), store, process, persist, integrate, and visualize the high-volume, high-velocity, and high-variety data.

Part III of the definition: "enhanced insight and decision making" talks about
deriving deeper, richer and meaningful insights and then using these insights to
make faster and better decisions to gain business value and thus a competitive
edge.
Data → Information → Actionable intelligence → Better decisions → Enhanced business value
Figure 1.4 Definition of big data – Gartner

1.5 CHALLENGES WITH BIG DATA

Refer figure 1.5. Following are a few challenges with big data:

Figure 1.5 Challenges with big data

Data volume: Data today is growing at an exponential rate, and this high tide of data will continue to rise. The key questions are:
“Will all this data be useful for analysis?”,
“Do we work with all this data or with a subset of it?”,
“How will we separate the knowledge from the noise?”, etc.

Storage: Cloud computing is the answer to managing infrastructure for big data as far as cost-efficiency, elasticity, and easy upgrading/downgrading are concerned. However, this further complicates the decision to host big data solutions outside the enterprise.

Data retention: How long should one retain this data? Some data may be required for long-term decisions, but some data may quickly become irrelevant and obsolete.

Skilled professionals: In order to develop, manage, and run those applications that generate insights, organizations need professionals who possess a high-level proficiency in data science.

Other challenges: Other challenges of big data are with respect to the capture, storage, search, analysis, transfer, and security of big data.

Visualization: Big data refers to datasets whose size is typically beyond the storage capacity of traditional database software tools. There is no explicit definition of how big a dataset should be for it to be considered big data. Data visualization (computer graphics) is becoming popular as a separate discipline, and there are very few data visualization experts.

1.6 WHY BIG DATA?

The more data we have for analysis, the greater will be the analytical accuracy and the greater will be the confidence in our decisions based on these analytical findings. The analytical accuracy will lead to a greater positive impact in terms of enhancing operational efficiencies, reducing cost and time, originating new products and services, and optimizing existing services. Refer Figure 1.6.

Figure 1.6: Why big data?

1.7 DATA WAREHOUSE ENVIRONMENT

Operational or transactional or day-to-day business data is gathered from
Enterprise Resource Planning (ERP) systems, Customer Relationship Management
(CRM), Legacy systems, and several third-party applications.

The data from these sources may differ in format.


This data is then integrated, cleaned up, transformed, and standardized
through the process of Extraction, Transformation, and Loading (ETL).

The transformed data is then loaded into the enterprise data warehouse
(available at the enterprise level) or data marts (available at the business unit/
functional unit or business process level).

Business intelligence and analytics tools are then used to enable decision making through the use of ad hoc queries, SQL, enterprise dashboards, data mining, Online Analytical Processing (OLAP), etc. Refer Figure 1.7.

Figure 1.7: Data Warehouse Environment
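As a toy sketch of the ETL step described above (the file name, table, and columns are invented for illustration, and SQLite stands in for a real warehouse), the flow is: extract raw operational records, transform and standardize them, then load them into a table that BI tools could query:

```python
import csv
import sqlite3

def extract(path):
    # Extract: read raw operational records (e.g., a CRM export).
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: clean and standardize formats before loading.
    return [(r["customer_id"], r["name"].strip().title(), float(r["amount"])) for r in rows]

def load(rows, db="warehouse.db"):
    # Load: write the cleaned rows into a warehouse table.
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS sales (customer_id TEXT, name TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()

load(transform(extract("crm_export.csv")))  # 'crm_export.csv' is a hypothetical source file
```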

1.8 TRADITIONAL BUSINESS INTELLIGENCE (BI) VERSUS BIG DATA

Following are the differences that one encounters when dealing with traditional BI and big data.

In a traditional BI environment, all of the enterprise's data is housed in a central server, whereas in a big data environment data resides in a distributed file system. The distributed file system scales horizontally by scaling in (decreasing) or out (increasing) the number of nodes, as compared to a typical database server that scales vertically.

In traditional BI, data is generally analysed in an offline mode, whereas in big data, it is analysed in both real-time streaming as well as offline mode.

Traditional BI is about structured data, and it is here that data is taken to the processing functions (move data to code), whereas big data is about variety: structured, semi-structured, and unstructured data, and here the processing functions are taken to the data (move code to data).
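One way to see the "move data to code" versus "move code to data" contrast is the following sketch (SQLite is used purely as a stand-in; on a big data platform the pushed-down "code" would be a MapReduce job or query shipped to the data nodes):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user TEXT, amount REAL)")
con.executemany("INSERT INTO events VALUES (?, ?)",
                [("a", 10.0), ("b", 250.0), ("c", 990.0)])

# Move data to code: fetch everything, then filter inside the application.
all_rows = con.execute("SELECT user, amount FROM events").fetchall()
big_spenders = [row for row in all_rows if row[1] > 100]

# Move code to data: push the predicate down so only matching rows travel.
big_spenders_pushed = con.execute(
    "SELECT user, amount FROM events WHERE amount > 100").fetchall()

print(big_spenders == big_spenders_pushed)  # True; only where the work happens differs
```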

1.9 STATE OF THE PRACTICE IN ANALYTICS

Current business problems provide many opportunities for organizations to become more analytical and data-driven, as shown in Table 1.2.

• Optimize business operations: sales, pricing, profitability, efficiency
• Identify business risk: customer churn, fraud, default
• Predict new business opportunities: upsell, cross-sell, best new customer prospects
• Comply with laws or regulatory requirements: Anti-Money Laundering, Fair Lending

Table 1.2: Business Drivers for Advanced Analytics

The first three examples do not represent new problems.

Organizations have been trying to reduce customer churn, increase sales, and
cross-sell customers for many years.

What is new is the opportunity to fuse advanced analytical techniques with Big
Data to produce more impactful analyses for these traditional problems.

The last example portrays emerging regulatory requirements.

Many compliance and regulatory laws have been in existence for decades, but
additional requirements are added every year, which represent additional
complexity and data requirements for organizations.

Laws related to anti-money laundering (AML) and fraud prevention require advanced analytical techniques to comply with and manage properly.

The following aspects of the state of the practice are discussed next:

1.9.1 BI Versus Data Science
1.9.2 Current Analytical Architecture (data flow)
1.9.3 Drivers of Big Data
1.9.4 Emerging Big Data Ecosystem and a New Approach to Analytics

1.9.1 BI Versus Data Science: Refer figure 1.8 for a comparison of BI with Data Science.

Figure 1.8: Comparing BI with Data Science

Tables 1.3 and 1.4 summarize the comparison between Data Science and BI.

Predictive Analytics and Data Mining (Data Science)

Typical techniques and data types:
• Optimization, predictive modelling, forecasting, statistical analysis
• Structured/unstructured data, many types of sources, very large datasets

Common questions:
• What if ... ?
• What's the optimal scenario for our business?
• What will happen next? What if these trends continue? Why is this happening?

Table 1.3: Data Science

Business Intelligence

Typical techniques and data types:
• Standard and ad hoc reporting, dashboards, alerts, queries, details on demand
• Structured data, traditional sources, manageable datasets

Common questions:
• What happened last quarter?
• How many units sold?
• Where is the problem? In which situations?

Table 1.4: BI
1.9.2 Current Analytical Architecture: Figure 1.9 explains a typical analytical architecture.

Figure 1.9: Typical Analytical Architecture

1. For data sources to be loaded into the data warehouse, data needs to be well understood, structured, and normalized with the appropriate data type definitions.
2. As a result of this level of control on the EDW (enterprise data warehouse, on a server or on the cloud), additional local systems may emerge in the form of departmental warehouses and local data marts that business users create to accommodate their need for flexible analysis. However, these local systems reside in isolation, often are not synchronized or integrated with other data stores, and may not be backed up.
3. In the data warehouse, data is read by additional applications across the enterprise for BI and reporting purposes.
4. At the end of this workflow, analysts get data from the server. Because users generally are not allowed to run custom or intensive analytics on production databases, analysts create data extracts from the EDW to analyze the data offline in R or other local analytical tools, while the EDW itself continues to store and process critical data, supporting enterprise applications and enabling corporate reporting activities. (A minimal sketch of such an extract follows after this list.)

Although reports and dashboards are still important for organizations, most traditional data architectures prevent data exploration and more sophisticated analysis.
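The sketch below illustrates step 4 above (the warehouse file, table, and column names are invented): the analyst pulls a bounded, read-only extract from the EDW into a local file and explores it offline, instead of running intensive analytics on the production database:

```python
import csv
import sqlite3

# Connect to the (hypothetical) enterprise data warehouse in read-only mode.
edw = sqlite3.connect("file:edw.db?mode=ro", uri=True)

# Pull a bounded extract rather than running heavy analytics in production.
rows = edw.execute(
    "SELECT customer_id, region, revenue FROM fact_sales WHERE year = 2023").fetchall()

# Save the extract locally; the analyst explores it offline in R, Python, etc.
with open("sales_extract_2023.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["customer_id", "region", "revenue"])
    writer.writerows(rows)
```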

1.9.3 Drivers of Big Data:

As shown in Figure 1.10, in the 1990s the volume of information was often measured in terabytes. Most organizations analyzed structured data in rows and columns and used relational databases and data warehouses to manage large amounts of enterprise information.
Figure 1.10: Data Evolution and the Rise of Big Data Sources

The following decade (the 2000s) saw different kinds of data sources, mainly productivity and publishing tools such as content management repositories and network-attached storage systems, used to manage this kind of information, and the data began to increase in size and started to be measured at petabyte scale.

In the 2010s, the information that organizations try to manage has broadened to include many other kinds of data. In this era, everyone and everything is leaving a digital footprint. These applications, which generate data volumes that can be measured in exabyte scale, provide opportunities for new analytics and drive new value for organizations. The data now comes from multiple sources, such as medical information, photos and video footage, video surveillance, mobile devices, smart devices, non-traditional IT devices, etc.

1.9.4 Emerging Big Data Ecosystem and a New Approach to Analytics

Figure 1.11: Emerging Big Data Ecosystem


As the new ecosystem takes shape, there are four main groups of players within
this interconnected web. These are shown in Figure 1.11.

1. Data devices and the "SensorNet" gather data from multiple locations and continuously generate new data about this data. For each gigabyte of new data created, an additional petabyte of data is created about that data.

For example, consider someone playing an online video game through a PC, game console, or smartphone. In this case, the video game provider captures
data about the skill and levels attained by the player. Intelligent systems monitor
and log how and when the user plays the game. As a consequence, the game
provider can fine-tune the difficulty of the game, suggest other related games that
would most likely interest the user, and offer additional equipment and
enhancements for the character based on the user's age, gender, and interests. This
information may get stored locally or uploaded to the game provider's cloud to
analyze the gaming habits and opportunities for upsell and cross-sell and identify
typical profiles of specific kinds of users.

Smartphones provide another rich source of data. In addition to messaging and basic phone usage, they store and transmit data about Internet usage, SMS usage, and real-time location. This metadata can be used for analyzing traffic patterns by scanning the density of smartphones in locations to track the speed of cars or the relative traffic congestion on busy roads. In this way, GPS devices in cars can give drivers real-time updates and offer alternative routes to avoid traffic delays.

Retail shopping loyalty cards record not just the amount an individual spends, but the locations of stores that person visits, the kinds of products purchased, the stores where goods are purchased most often, and the combinations of products purchased together. Collecting this data provides insights into shopping and travel habits and the likelihood of successful advertisement targeting for certain types of retail promotions.

2. Data collectors include entities that collect data from the device and its users. Examples include:
• A cable TV provider tracking the shows a person watches, which TV channels someone will and will not pay to watch on demand, and the prices someone is willing to pay for premium TV content.
• Retail stores tracking the path a customer takes through the store while pushing a shopping cart with an RFID chip, so they can gauge which products get the most foot traffic using geospatial data collected from the RFID chips.

3. Data aggregators make sense of the data collected from the various entities in the "SensorNet" or the "Internet of Things." These organizations compile data from the devices and usage patterns collected by government agencies, retail stores, and websites. In turn, they can choose to transform and package the data as products to sell to list brokers, who may want to generate marketing lists of people who may be good targets for specific ad campaigns.

4. Data users / buyers: These groups directly benefit from the data collected and aggregated by others within the data value chain. Retail banks, acting as data buyers, may want to know which customers have the highest likelihood to apply for a second mortgage or a home equity line of credit.

To provide input for this analysis, retail banks may purchase data from a data
aggregator. This kind of data may include demographic information about people
living in specific locations; people who appear to have a specific level of debt, yet
still have solid credit scores (or other characteristics such as paying bills on time
and having savings accounts) that can be used to infer credit worthiness; and those
who are searching the web for information about paying off debts or doing home
remodeling projects. Obtaining data from these various sources and aggregators
will enable a more targeted marketing campaign, which would have been more
challenging before Big Data due to the lack of information or high-performing
technologies.

Using technologies such as Hadoop to perform natural language processing on unstructured, textual data from social media websites, users can gauge the reaction to events such as presidential campaigns. People may, for example, want to determine public sentiment toward a candidate by analyzing related blogs and online comments. Similarly, data users may want to track and prepare for natural disasters by identifying which areas a hurricane affects first and how it moves, based on which geographic areas are tweeting about it or discussing it via social media.
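A very crude sketch of that idea follows (the keyword lists and posts are invented; a real system would run proper NLP models over far larger volumes, typically on Hadoop or similar platforms): tally positive versus negative words across a batch of posts to gauge the overall reaction to an event:

```python
POSITIVE = {"great", "win", "support", "love"}
NEGATIVE = {"bad", "lose", "against", "angry"}

def gauge_sentiment(posts):
    # Count positive vs. negative keyword hits across all posts.
    score = 0
    for post in posts:
        for word in post.lower().split():
            if word in POSITIVE:
                score += 1
            elif word in NEGATIVE:
                score -= 1
    return score

posts = ["Great speech, love the plan", "Angry crowd outside the venue"]
print(gauge_sentiment(posts))  # +1 here: > 0 leans positive, < 0 leans negative
```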

1.11 KEY ROLES FOR THE NEW BIG DATA ECOSYSTEM

Refer figure 1.12 for the key roles of the new big data ecosystem.

1. Deep Analytical Talent: technically savvy, with strong analytical skills. Members possess a combination of skills to handle raw, unstructured data and to apply complex analytical techniques at massive scale. This group has advanced training in quantitative disciplines, such as mathematics, statistics, and machine learning. To do their jobs, members need access to a robust analytic sandbox or workspace where they can perform large-scale analytical data experiments.

Examples of current professions fitting into this group include statisticians, economists, mathematicians, and the new role of the Data Scientist.

2. Data Savvy Professionals: have less technical depth but have a basic knowledge of statistics or machine learning and can define key questions that can be answered using advanced analytics. These people tend to have a base knowledge of working with data, or an appreciation for some of the work being performed by data scientists and others with deep analytical talent.

Examples of data savvy professionals include financial analysts, market research analysts, life scientists, operations managers, and business and functional managers.

3. Technology and Data Enablers: This group represents people providing technical expertise to support analytical projects, such as provisioning and administering analytical sandboxes, and managing large-scale data architectures that enable widespread analytics within companies and other organizations.

This role requires skills related to computer engineering, programming, and database administration.

These three groups must work together closely to solve complex Big Data
challenges.
Most organizations are familiar with people in the latter two groups mentioned,
but the first group, Deep Analytical Talent, tends to be the newest role for most
and the least understood.

For simplicity, this discussion focuses on the emerging role of the Data Scientist. It describes the kinds of activities that role performs and provides a more detailed view of the skills needed to fulfill that role.

Activities of data scientist:

There are three recurring sets of activities that data scientists perform:

Reframe business challenges as analytics challenges. Specifically, this is the skill to diagnose business problems, consider the core of a given problem, and determine which kinds of analytical methods can be applied to solve it.

Design, implement, and deploy statistical models and data mining techniques on Big Data. This set of activities is mainly what people think about when they consider the role of the Data Scientist: namely, applying complex or advanced analytical methods to a variety of business problems using data.

Develop insights that lead to actionable recommendations. It is critical to note that applying advanced methods to data problems does not necessarily drive new business value. Instead, it is important to learn how to draw insights out of the data and communicate them effectively.

Profile of a data scientist:

Data scientists are generally thought of as having five main sets of skills and behavioral characteristics, as shown in Figure 1.13:

Quantitative skills: such as mathematics or statistics.

Figure 1.13: Data scientist

Technical aptitude: namely, software engineering, machine learning, and programming skills.

Skeptical mind-set and critical thinking: It is important that data scientists can examine their work critically rather than in a one-sided way.

Curious and creative: Data scientists are passionate about data and finding creative ways to solve problems and portray information.

Communicative and collaborative: Data scientists must be able to articulate the business value in a clear way and collaboratively work with other groups, including project sponsors and key stakeholders.

Data scientists are generally comfortable using this blend of skills to acquire,
manage, analyze, and visualize data and tell compelling stories about it.
OBJECTIVES
Big Data is creating significant new opportunities for organizations to derive new value and create competitive advantage from their most valuable asset: information. For businesses, Big Data helps drive efficiency, quality, and personalized products and services, producing improved levels of customer satisfaction and profit. For scientific efforts, Big Data analytics enable new avenues of investigation with potentially richer results and deeper insights than previously available. In many cases, Big Data analytics integrate structured and unstructured data with real-time feeds and queries, opening new paths to innovation and insight.

2.1 INTRODUCTION TO BIG DATA ANALYTICS

Big Data Analytics is...

1. Technology-enabled analytics: Quite a few data analytics and visualization tools are available in the market today from leading vendors such as IBM, Tableau, SAS, R Analytics, Statistica, World Programming Systems (WPS), etc. to help process and analyze your big data.

2. About gaining meaningful, deeper, and richer insight into your business to steer it in the right direction: understanding the customer's demographics to cross-sell and up-sell to them, better leveraging the services of your vendors and suppliers, etc.
Author's experience: The other day I was pleasantly surprised to get a few recommendations via email from one of my frequently visited online retailers. They had recommended a clothing line from my favorite brand, and the color suggested was also to my liking. How did they arrive at this? In the recent past, I had been buying clothing of a particular brand, and my color preference was for pastel shades. They had this stored in their database and pulled it out while making recommendations to me.

3. About a competitive edge over your competitors by enabling you with findings that allow quicker and better decision-making.

4. A tight handshake between three communities: IT, business users, and data
scientists. Refer Figure 3.3.

5. Working with datasets whose volume and variety exceed the current storage
and processing capabilities and infrastructure of your enterprise.

6. About moving code to data. This makes perfect sense, as the program for distributed processing is tiny (just a few KB) compared to the data (terabytes or petabytes today, and likely to be exabytes or zettabytes in the near future).

2.2 CLASSIFICATION OF ANALYTICS

There are basically two schools of thought:

2.2.1 Those that classify analytics into basic, operationalized, advanced, and monetized analytics.
2.2.2 Those that classify analytics into analytics 1.0, analytics 2.0, and analytics 3.0.

2.2.1 First School of Thought

It includes Basic analytics, Operationalized analytics, Advanced analytics, and Monetized analytics.

Basic analytics: This is primarily slicing and dicing of data to help with basic business insights. It is about reporting on historical data, basic visualization, etc.

Figure 2.1: Analytics 1.0, 2.0 and 3.0 (Why did it happen? What will happen? How can we make it happen?)


Operationalized analytics: It is operationalized analytics if it gets woven into the enterprise's business processes.
Advanced analytics: This is largely about forecasting the future by way of predictive and prescriptive modelling.
Monetized analytics: This is analytics in use to derive direct business revenue.

2.2.2 Second School of Thought

Let us take a closer look at analytics 1.0, analytics 2.0, and analytics 3.0. Refer Table 2.1. Figure 2.1 shows the subtle growth of analytics from descriptive → diagnostic → predictive → prescriptive analytics.
Analytics 1.0 (era: mid-1990s to 2009)
• Descriptive statistics (report on events, occurrences, etc. of the past).
• Key questions asked: What happened? Why did it happen?
• Data from legacy systems, ERP, CRM, and 3rd-party applications.
• Small and structured data sources; data stored in enterprise data warehouses or data marts.
• Data was internally sourced.
• Technology: relational databases.

Analytics 2.0 (era: 2005 to 2012)
• Descriptive + predictive statistics (use data from the past to make predictions for the future).
• Key questions asked: What will happen? Why will it happen?
• Big data.
• Big data is being taken up seriously; data is mainly unstructured, arriving at a much higher pace. This fast flow of data meant that the influx of high-volume data had to be stored and processed rapidly, often on massively parallel servers running Hadoop.
• Data was often externally sourced.
• Technology: database appliances, Hadoop clusters, SQL-to-Hadoop environments, etc.

Analytics 3.0 (era: 2012 to present)
• Descriptive + predictive + prescriptive statistics (use data from the past to make predictions for the future and, at the same time, make recommendations to leverage the situation to one's advantage).
• Key questions asked: What will happen? When will it happen? Why will it happen? What should be the action taken to take advantage of what will happen?
• A blend of big data and data from legacy systems, ERP, CRM, and 3rd-party applications.
• A blend of big data and traditional analytics to yield insights and offerings with speed and impact.
• Data is both internally and externally sourced.
• Technology: in-memory analytics, in-database processing, agile analytical methods, machine learning techniques, etc.

Table 2.1: Analytics 1.0, 2.0 and 3.0
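To make the descriptive-versus-predictive distinction above concrete, here is a toy sketch (the monthly sales figures are invented): descriptive analytics reports on what already happened, while predictive analytics fits a trend to the past and projects it one step forward:

```python
# Toy monthly sales (units), oldest to newest.
sales = [120, 132, 128, 145, 150, 161]

# Descriptive (Analytics 1.0): what happened?
average = sum(sales) / len(sales)
print(f"Average monthly sales so far: {average:.1f}")

# Predictive (Analytics 2.0 and beyond): what will happen next? Fit a straight-line trend.
n = len(sales)
x_mean, y_mean = (n - 1) / 2, average
slope = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(sales)) / \
        sum((x - x_mean) ** 2 for x in range(n))
forecast = y_mean + slope * (n - x_mean)  # project one month ahead
print(f"Forecast for next month: {forecast:.1f}")
```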

2.3 CHALLENGES OF BIG DATA

There are mainly seven challenges of big data: scale, security, schema, continuous availability, consistency, partition tolerance, and data quality.

Scale: Storage (RDBMS (Relational Database Management System) or NoSQL (Not Only SQL)) is one major concern that needs to be addressed to handle the need for scaling rapidly and elastically. The need of the hour is a storage system that can best withstand the onslaught of the large volume, velocity, and variety of big data. Should you scale vertically or should you scale horizontally?

Security: Most NoSQL big data platforms have poor security mechanisms (lack of proper authentication and authorization mechanisms) when it comes to safeguarding big data. This is a concern that cannot be ignored, given that big data carries credit card information, personal information, and other sensitive data.

Schema: Rigid schemas have no place. We want the technology to be able to fit our big data and not the other way around. The need of the hour is dynamic schemas; static (pre-defined) schemas are obsolete.

Continuous availability: The big question here is how to provide 24/7 support, because almost all RDBMS and NoSQL big data platforms have a certain amount of downtime built in.

Consistency: Should one opt for consistency or eventual consistency?

Partition tolerance: How does one build partition-tolerant systems that can take care of both hardware and software failures?

Data quality: How does one maintain data quality (data accuracy, completeness, timeliness, etc.)? Do we have appropriate metadata in place?

2.4 IMPORTANCE OF BIG DATA

Let us study the various approaches to the analysis of data and what they lead to.

Reactive - Business Intelligence: What does Business Intelligence (BI) help us with? It allows businesses to make faster and better decisions by providing the right information to the right person at the right time in the right format. It is about analysis of past or historical data and then displaying the findings of the analysis or reports in the form of enterprise dashboards, alerts, notifications, etc. It has support for both pre-specified reports as well as ad hoc querying.

Reactive - Big Data Analytics: Here the analysis is done on huge datasets but the
approach is still reactive as it is still based on static data.

Proactive - Analytics: This is to support futuristic decision making by the use of data mining, predictive modelling, text mining, and statistical analysis. This analysis is not on big data, as it still uses traditional database management practices and therefore has severe limitations on storage capacity and processing capability.

Proactive - Big Data Analytics: This is sifting through terabytes, petabytes, and exabytes of information to filter out the relevant data to analyze. This also includes high-performance analytics to gain rapid insights from big data and the ability to solve complex problems using more data.

2.5 BIG DATA TECHNOLOGIES

Following are the requirements of technologies to meet the challenges of big data:

• The first requirement is cheap and ample storage.
• We need faster processors to help with quicker processing of big data.
• Affordable, open-source, distributed big data platforms, such as Hadoop.
• Parallel processing, clustering, virtualization, large grid environments (to distribute processing to a number of machines), high connectivity, and high throughput (the rate at which something is processed).
• Cloud computing and other flexible resource allocation arrangements.

2.6 DATA SCIENCE


Data science is the science of extracting knowledge from data. In other words, it is the science of drawing out hidden patterns in data using statistical and mathematical techniques.

It employs techniques and theories drawn from the broad areas of mathematics, statistics, and information technology, including machine learning, data engineering, probability models, statistical learning, pattern recognition and learning, etc.

A data scientist works on massive datasets for weather prediction, oil drilling, earthquake prediction, financial fraud, terrorist networks and activities, global economic impacts, sensor logs, social media analytics, customer churn, collaborative filtering (predicting users' interests), regression analysis, etc. Data science is multi-disciplinary. Refer to Figure 2.2.
Figure 2.2 Data Scientist
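As a minimal sketch of one of the techniques named above, collaborative filtering (the ratings below are invented), we can recommend to a user the items rated by the most similar other user:

```python
# User -> {item: rating}; a tiny, invented ratings matrix.
ratings = {
    "asha":  {"laptop": 5, "phone": 3, "headset": 4},
    "bilal": {"laptop": 5, "phone": 2, "camera": 5},
    "chen":  {"phone": 4, "camera": 1},
}

def similarity(a, b):
    # Similarity = agreement on co-rated items (sum of rating products).
    common = set(a) & set(b)
    return sum(a[i] * b[i] for i in common)

def recommend(user):
    # Find the most similar other user and suggest items the target has not rated yet.
    others = [(similarity(ratings[user], ratings[o]), o) for o in ratings if o != user]
    _, nearest = max(others)
    return [item for item in ratings[nearest] if item not in ratings[user]]

print(recommend("asha"))  # ['camera'], inferred from bilal's similar tastes
```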

2.6.1 Business Acumen (Expertise) Skills:

A data scientist should have the following abilities to play the role of a data scientist:
• Understanding of domain
• Business strategy
• Problem solving
• Communication
• Presentation
• Keenness

2.6.2 Technology Expertise:

The following skills are required as far as technical expertise is concerned:

• Good database knowledge, such as RDBMS.
• Good NoSQL database knowledge, such as MongoDB, Cassandra, HBase, etc.
• Programming languages such as Java, Python, C++, etc.
• Open-source tools such as Hadoop.
• Data warehousing.
• Data mining
• Visualization such as Tableau, Flare, Google visualization APIs, etc.

2.6.3 Mathematics Expertise:

The following are the key skills that a data scientist must have to comprehend, interpret, and analyze data:
• Mathematics.
• Statistics.
• Artificial Intelligence (AI).
• Algorithms.
• Machine learning.
• Pattern recognition.
• Natural Language Processing.
To sum it up, the data science process is:
• Collecting raw data from multiple different data sources.
• Processing the data.
• Integrating the data and preparing clean datasets.
• Engaging in explorative data analysis using model and algorithms.
• Preparing presentations using data visualizations.
• Communicating the findings to all stakeholders.
• Making faster and better decisions.

2.7 RESPONSIBILITIES

Refer figure 2.3 to understand the responsibilities of a data scientist.

Data Management: A data scientist employs several approaches to develop the relevant datasets for analysis. Raw data is just "raw", unsuitable for analysis. The data scientist works on it to prepare it to reflect the relationships and contexts. This data then becomes useful for processing and further analysis.

Analytical Techniques: Depending on the business questions we are trying to answer and the type of data available at hand, the data scientist employs a blend of analytical techniques to develop models and algorithms to understand the data, interpret relationships, spot trends, and reveal patterns.

Figure 2.3 Data scientist: your new best friend!!!

Business Analysis: A data scientist is a business analyst who distinguishes cool facts from insights and is able to apply his business expertise and domain knowledge to see the results in the business context.
Communicator: He is a good presenter and communicator who is able to
communicate the results of his findings in a language that is understood by the
different business stakeholders.

2.8 SOFT STATE EVENTUAL CONSISTENCY


ACID properties in RDBMS:
Atomicity: Either the task (or all tasks) within a transaction are performed or none
of them are. This is the all-or-none principle. If one element of a transaction fails
the entire transaction fails.

Consistency: The transaction must meet all protocols or rules defined by the system at all times. The transaction does not violate those protocols, and the database must remain in a consistent state at the beginning and end of a transaction; there are never any half-completed transactions.

Isolation: No transaction has access to any other transaction that is in an intermediate or unfinished state. Thus, each transaction is independent unto itself. This is required for both performance and consistency of transactions within a database.
Durability: Once the transaction is complete, it will persist as complete and
cannot be undone; it will survive system failure, power loss and other types of
system breakdowns.
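A small sketch of the all-or-none (atomicity) idea using SQLite follows (the account names and amounts are invented): either both halves of a transfer commit together, or a failure rolls the whole transaction back:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
con.executemany("INSERT INTO accounts VALUES (?, ?)",
                [("savings", 900.0), ("current", 100.0)])
con.commit()

def transfer(amount):
    # All-or-none: the debit and the credit succeed together or not at all.
    try:
        with con:  # opens a transaction; commits on success, rolls back on error
            con.execute("UPDATE accounts SET balance = balance - ? WHERE name = 'savings'",
                        (amount,))
            if amount > 900:
                raise ValueError("insufficient funds")  # simulate a failing step
            con.execute("UPDATE accounts SET balance = balance + ? WHERE name = 'current'",
                        (amount,))
    except ValueError:
        pass  # the partial debit was rolled back automatically

transfer(5000)  # fails mid-way; balances remain unchanged
print(con.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# [('current', 100.0), ('savings', 900.0)]
```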

BASE (Basically Available, Soft state, Eventual consistency): In a system where BASE is the prime requirement for reliability, the activity/potential (the "pH") of the data changes; it essentially slows down.

Basically Available: This constraint states that the system does guarantee the availability of the data as regards the CAP theorem; there will be a response to any request. But that response could still be 'failure' to obtain the requested data, or the data may be in an inconsistent or changing state, much like waiting for a check to clear in your bank account.

Eventual consistency: The system will eventually become consistent once it stops receiving input. The data will propagate to everywhere it should sooner or later, but the system will continue to receive input and does not check the consistency of every transaction before it moves on to the next one. Werner Vogels' article "Eventually Consistent – Revisited" covers this topic in much greater detail.

Soft state: The state of the system could change over time, so even during times
without input there may be changes going on due to
‘eventual consistency,’ thus the state of the system is always ‘soft.’
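A toy simulation of these BASE properties follows (this is not any particular database's replication protocol, just an illustration): a write lands on one replica first, a read elsewhere may briefly see stale data, and the replicas converge once the update propagates:

```python
import copy

# Three replicas of the same record; writes hit one node and propagate later.
replicas = [{"user": "asha", "city": "Pune"} for _ in range(3)]

def write(node, key, value):
    # Soft state: only one replica is updated immediately.
    replicas[node][key] = value

def propagate():
    # Anti-entropy step: copy the latest state to every replica.
    latest = copy.deepcopy(replicas[0])
    for i in range(len(replicas)):
        replicas[i] = copy.deepcopy(latest)

write(0, "city", "Mumbai")
print(replicas[2]["city"])  # 'Pune'   -> stale read: basically available, not yet consistent
propagate()
print(replicas[2]["city"])  # 'Mumbai' -> eventually consistent once the update has spread
```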

2.9 DATA ANALYTICS LIFE CYCLE

Here is a brief overview of the main phases of the Data Analytics Lifecycle:

Phase 1 - Discovery: In Phase 1, the team learns the business domain, including relevant history, such as whether the organization or business unit has attempted similar projects in the past from which it can learn. The team assesses the resources available to support the project in terms of people, technology, time, and data. Important activities in this phase include framing the business problem as an analytics challenge that can be addressed in subsequent phases and formulating initial hypotheses (IHs) to test and begin learning the data.

Phase 2 - Data preparation: Phase 2 requires the presence of an analytic sandbox, in which the team can work with data and perform analytics for the duration of the project. The team needs to execute extract, load, and transform (ELT) or extract, transform, and load (ETL) to get data into the sandbox; ELT and ETL are sometimes abbreviated as ETLT. Data should be transformed in the ETLT process so the team can work with it and analyze it. In this phase, the team also needs to familiarize itself with the data thoroughly and take steps to condition the data.

Figure 2.4: Overview of the Data Analytics Lifecycle

Phase 3 - Model planning: Phase 3 is model planning, where the team determines the methods, techniques, and workflow it intends to follow for the subsequent model-building phase. The team explores the data to learn about the relationships between variables and subsequently selects key variables and the most suitable models.

Phase 4 - Model building: In Phase 4, the team develops data sets for testing, training, and production purposes. In addition, in this phase the team builds and executes models based on the work done in the model planning phase. The team also considers whether its existing tools will suffice for running the models, or whether it will need a more robust environment for executing models and workflows (for example, fast hardware and parallel processing, if applicable).
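A bare-bones sketch of Phase 4 follows (the data is synthetic; a real project would use the variables and methods selected in Phase 3): split the prepared data into training and test sets, fit a simple model, and check it on the held-out portion before deciding whether to operationalize it:

```python
import random

random.seed(0)
# Synthetic prepared dataset: (feature, target) pairs with a known trend plus noise.
data = [(x, 2.0 * x + 1.0 + random.uniform(-1, 1)) for x in range(100)]
random.shuffle(data)

train, test = data[:80], data[80:]   # training set vs. held-out test set

# Model building: fit y = a*x + b by least squares on the training set.
n = len(train)
x_mean = sum(x for x, _ in train) / n
y_mean = sum(y for _, y in train) / n
a = sum((x - x_mean) * (y - y_mean) for x, y in train) / \
    sum((x - x_mean) ** 2 for x, _ in train)
b = y_mean - a * x_mean

# Evaluate on the test set before moving toward Phases 5 and 6.
mae = sum(abs((a * x + b) - y) for x, y in test) / len(test)
print(f"model: y = {a:.2f}x + {b:.2f}, test MAE = {mae:.2f}")
```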

Phase 5 - Communicate results: In Phase 5, the team, in collaboration with major stakeholders, determines whether the results of the project are a success or a failure based on the criteria developed in Phase 1. The team should identify key findings, quantify the business value, and develop a narrative to summarize and convey findings to stakeholders.

Phase 6 - Operationalize: In Phase 6, the team delivers final reports, briefings, code, and technical documents. In addition, the team may run a pilot project to implement the models in a production environment.
