Unit – 1
(Learning Notes)
SYLLABUS:
Introduction to Big Data Analytics: Big Data Overview
State of the Practice in Analytics
Role of Data Scientists
Big Data Analytics in Industry Verticals
Big Data Overview
Data is created constantly, and at an ever-increasing rate. Mobile
phones, social media, and imaging technologies used to determine a medical
diagnosis all create new data, which must be stored somewhere for some
purpose. Devices and sensors automatically generate diagnostic information
that needs to be stored and processed in real time. Merely keeping up with
this huge influx of data is difficult, but substantially more challenging is
analysing vast amounts of it, especially when it does not conform to
traditional notions of data structure, to identify meaningful patterns and
extract useful information.
These challenges of the data deluge present the opportunity to
transform business, government, science, and everyday life. Several
industries have led the way in developing their ability to gather and exploit
data:
1. Credit card companies monitor every purchase their customers make
and can identify fraudulent purchases with a high degree of accuracy
using rules derived by processing billions of transactions.
2. Mobile phone companies analyse subscribers' calling patterns to
determine, for example, whether a caller's frequent contacts are on a
rival network. If that rival network is offering an attractive promotion
that might cause the subscriber to defect, the mobile phone company
can proactively offer the subscriber an incentive to remain in her
contract.
3. For companies such as LinkedIn and Facebook, data itself is their
primary product. The valuations of these companies are heavily derived
from the data they gather and host, which contains more and more
intrinsic value as the data grows.
Three attributes stand out as defining Big Data
characteristics:
1. Huge volume of data: Rather than thousands or millions of rows, Big
Data can be billions of rows and millions of columns.
2. Complexity of data types and structures: Big Data reflects the variety of
new data sources, formats, and structures, including the digital traces
left on the web and in other digital repositories for subsequent analysis.
3. Speed of new data creation and growth: Big Data can describe high-
velocity data, with rapid data ingestion and near-real-time analysis.
Definition:
Big Data is data whose scale, distribution, diversity, and/or
timeliness require the use of new technical architectures and
analytics to enable insights that unlock new sources of business
value.
Data Structures
Big Data can come in multiple forms, including structured and
unstructured data such as financial data, text files, multimedia files, and
genetic mappings. Contrary to much of the traditional data analysis
performed by organizations, most Big Data is unstructured or semi-structured
in nature, which requires different techniques and tools to process and
analyse. Distributed computing environments and massively parallel
processing (MPP) architectures that enable parallelized data ingest and
analysis are the preferred approach to process such complex data.
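The idea of parallelized analysis can be sketched in miniature with Python's
standard library: split the data into shards (partitions), process them
concurrently, and combine the per-shard results. The shard contents below
are invented placeholders, not data from the text.

    from concurrent.futures import ProcessPoolExecutor

    def analyse_shard(shard):
        # Each worker analyses its own partition independently.
        return sum(len(record) for record in shard)

    if __name__ == "__main__":
        # Four shards standing in for partitions of a much larger dataset.
        shards = [["alpha", "beta"], ["gamma"], ["delta", "epsilon"], ["zeta"]]
        with ProcessPoolExecutor() as pool:
            partial_results = list(pool.map(analyse_shard, shards))
        print(sum(partial_results))  # combine partial results, MPP-style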
1. STRUCTURED DATA: Data containing a defined data type, format, and
structure (that is, transaction data, online analytical processing [OLAP]
data cubes, traditional RDBMS, CSV files, and even simple
spreadsheets).
2. SEMI-STRUCTURED DATA: Textual data files with a discernible
pattern that enables parsing (such as Extensible Markup Language
[XML] data files that are self-describing and defined by an XML
schema).
3. QUASI-STRUCTURED DATA: Textual data with erratic data formats
that can be formatted with effort, tools, and time (for instance, web
clickstream data that may contain inconsistencies in data values and
formats).
4. UNSTRUCTURED DATA: Data that has no inherent structure, which
may include text documents, PDFs, images, and video.
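As a minimal illustration of how these levels of structure affect processing,
the Python sketch below parses one record of each kind: a structured CSV row,
a semi-structured XML fragment, and a quasi-structured clickstream log line.
The sample records and field names are invented for illustration.

    import csv, io, re
    import xml.etree.ElementTree as ET

    # Structured: a CSV row with a fixed, known schema.
    row = next(csv.reader(io.StringIO("1001,2024-01-15,249.99")))
    txn_id, txn_date, amount = row[0], row[1], float(row[2])

    # Semi-structured: self-describing XML, parsed by tag name.
    doc = ET.fromstring("<order><id>1001</id><total>249.99</total></order>")
    order_id = doc.findtext("id")

    # Quasi-structured: a clickstream line whose erratic format takes
    # effort (here, a regular expression) to recover fields from.
    line = "GET /products?id=42 HTTP/1.1 status=200 t=87ms"
    match = re.search(r"id=(\d+).*status=(\d+)", line)
    product_id, status = match.groups()

    print(txn_id, amount, order_id, product_id, status)

Unstructured data, such as free text or video, has no such record layout at
all and typically requires specialized processing (for example, natural
language processing or computer vision) before analysis.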
Analyst Perspective on Data Repositories
The introduction of spreadsheets enabled business users to create
simple logic on data structured in rows and columns and create their own
analyses of business problems. Database administrator training is not
required to create spreadsheets: They can be set up to do many things quickly
and independently of information technology (IT) groups. Spreadsheets are
easy to share, and end users have control over the logic involved. However,
their proliferation can result in "many versions of the truth." In other words,
it can be challenging to determine if a particular user has the most relevant
version of a spreadsheet, with the most current data and logic in it. Moreover,
if a laptop is lost or a file becomes corrupted, the data and logic within the
spreadsheet could be lost. This is an ongoing challenge because spreadsheet
programs such as Microsoft Excel still run on many computers worldwide.
With the proliferation of data islands (or "spreadmarts"), the need to centralize
the data is more pressing than ever.
State of the Practice in Analytics
Common business drivers for analytics include:
1. Optimize business operations
2. Identify business risk
3. Predict new business opportunities
4. Comply with laws or regulatory requirements
BI Versus Data Science
Although much is written generally about analytics, it is important to
distinguish between BI and Data Science. One way to evaluate the type of
analysis being performed is to examine the time horizon and the kind of
analytical approaches being used. BI tends to provide reports, dashboards,
and queries on business questions for the current period or in the past. BI
systems make it easy to answer questions related to quarter-to-date revenue,
progress toward quarterly targets, and how much of a given product was sold
in a prior quarter or year. These questions tend to be closed-ended and
explain current or past behaviour, typically by aggregating historical data
and grouping it in some way. BI provides hindsight and some insight and
generally answers questions related to "when" and "where" events occurred.
By comparison, Data Science tends to use disaggregated data in a more
forward-looking, exploratory way, focusing on analysing the present and
enabling informed decisions about the future. Rather than aggregating
historical data to look at how many of a given product sold in the previous
quarter, a team may employ Data Science techniques such as time series
analysis to forecast future product sales and revenue more accurately than
extending a simple trend line. In addition, Data Science tends to be more
exploratory in nature and may use scenario optimization to deal with more
open-ended questions.
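To make this contrast concrete, the sketch below compares extending a simple
trend line with a basic seasonal time series forecast (classical additive
decomposition). The quarterly sales figures are invented for illustration.

    import numpy as np

    # Eight quarters of product sales (illustrative numbers only).
    sales = np.array([120, 135, 160, 210, 130, 150, 175, 230], dtype=float)
    t = np.arange(len(sales))

    # "Extending a simple trend line": least-squares fit, projected ahead.
    slope, intercept = np.polyfit(t, sales, 1)
    future_t = np.arange(len(sales), len(sales) + 4)
    trend_forecast = slope * future_t + intercept

    # Time series refinement: average each quarter's deviation from the
    # trend and add it back to the projection (additive seasonality).
    residuals = sales - (slope * t + intercept)
    seasonal = residuals.reshape(-1, 4).mean(axis=0)  # one offset per quarter
    seasonal_forecast = trend_forecast + seasonal

    print("trend only:  ", trend_forecast.round(1))
    print("trend+season:", seasonal_forecast.round(1))

Even this toy example shows why a seasonal model can forecast sales more
accurately than a straight trend line when demand peaks in particular
quarters.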
This approach provides insight into current activity and foresight into
future events, while generally focusing on questions related to "how" and
"why" events occur. Where BI problems tend to require highly structured data
organized in rows and columns for accurate reporting, Data Science projects
tend to use many types of data sources, including large or unconventional
datasets.
Depending on an organization's goals, it may choose to embark on a BI
project if it is doing reporting, creating dashboards, or performing simple
visualizations, or it may choose Data Science projects if it needs to do a more
sophisticated analysis with disaggregated or varied datasets.
Current Analytical Architecture
For data sources to be loaded into the data warehouse, data needs to
be well understood, structured, and normalized with the appropriate
data type definitions. Although this kind of centralization enables
security, backup, and failover of highly critical data, it also means that
data typically must go through significant pre-processing and
checkpoints before it can enter this sort of controlled environment,
which does not lend itself to data exploration and iterative analytics.
As a result of this level of control on the EDW, additional local systems
may emerge in the form of departmental warehouses and local data
marts that business users create to accommodate their need for flexible
analysis. These local data marts may not have the same constraints for
security and structure as the main EDW and allow users to do some
level of more in-depth analysis. However, these one-off systems reside
in isolation, often are not synchronized or integrated with other data
stores and may not be backed up.
Once in the data warehouse, data is read by additional applications
across the enterprise for BI and reporting purposes. These are high-
priority operational processes getting critical data feeds from the data
warehouses and repositories.
At the end of this workflow, analysts get data provisioned for their
downstream analytics. Because users generally are not allowed to run
custom or intensive analytics on production databases, analysts create
data extracts from the EDW to analyse data offline in R or other local
analytical tools. Often, these tools are limited to in-memory analytics
on desktops, analysing samples of data rather than the entire
population of a dataset. Because these analyses are based on data
extracts, they reside in a separate location, and the results of the
analysis-and any insights on the quality of the data or anomalies-rarely
are fed back into the main data repository.
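A minimal sketch of this extract-and-sample workflow in Python with pandas;
the file name and the 1% sampling rate are assumptions for illustration.

    import pandas as pd

    # Hypothetical EDW extract, too large to load whole into memory.
    SAMPLE_FRACTION = 0.01
    pieces = []
    for chunk in pd.read_csv("edw_extract.csv", chunksize=100_000):
        # Keep ~1% of each chunk so the sample fits in desktop memory.
        pieces.append(chunk.sample(frac=SAMPLE_FRACTION, random_state=42))
    sample = pd.concat(pieces, ignore_index=True)
    print(f"analysing {len(sample)} sampled rows in memory")

As the challenges below note, working from such samples rather than the full
population is exactly the constraint that can skew model accuracy.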
Challenges in the current architecture:
1. The typical data architectures just described are designed for storing
and processing mission-critical data, supporting enterprise
applications, and enabling corporate reporting activities. Although
reports and dashboards are still important for organizations, most
traditional data architectures inhibit data exploration and more
sophisticated analysis. Moreover, traditional data architectures have
several additional implications for data scientists.
2. High-value data is hard to reach and leverage, and predictive analytics
and data mining activities are last in line for data. Because the EDWs
are designed for central data management and reporting, those wanting
data for analysis are generally prioritized after operational processes.
3. Data moves in batches from EDW to local analytical tools. This workflow
means that data scientists are limited to performing in-memory
analytics (such as with R, SAS, SPSS, or Excel), which will restrict the
size of the data sets they can use. As such, analysis may be subject to
constraints of sampling, which can skew model accuracy.
4. Data Science projects will remain isolated and ad hoc, rather than
centrally managed. The implication of this isolation is that the
organization can never harness the power of advanced analytics in a
scalable way, and Data Science projects will exist as nonstandard
initiatives, which are frequently not aligned with corporate business
goals or strategy.
Drivers of Big Data
To better understand the market drivers related to Big Data, it is helpful
to first understand the history of data stores and the kinds of repositories
and tools used to manage them. Key sources of the modern data deluge include:
1. Medical information, such as genomic sequencing and diagnostic
imaging
2. Photos and video footage uploaded to the World Wide Web
3. Video surveillance, such as the thousands of video cameras spread across a
city
4. Mobile devices, which provide geospatial location data of the users, as
well as metadata about text messages, phone calls, and application
usage on smartphones
5. Smart devices, which provide sensor-based collection of information
from smart electric grids, smart buildings, and many other public and
industry infrastructures
6. Non-traditional IT devices, including the use of radio-frequency
identification (RFID) readers, GPS navigation systems, and seismic
processing
Emerging Big Data Ecosystem and a New Approach to Analytics
Organizations and data collectors are realizing that the data they can
gather from individuals contains intrinsic value and, as a result, a new
economy is emerging. As this new digital economy continues to evolve, the
market sees the introduction of data vendors and data cleaners that use
crowdsourcing (such as Mechanical Turk and Galaxy Zoo) to test the
outcomes of machine learning techniques. Other vendors offer added value by
repackaging open source tools in a simpler way and bringing the tools to
market.
Vendors such as Cloudera, Hortonworks, and Pivotal have provided this
value-add for the open source framework Hadoop.
1. Data devices and the "Sensornet" gather data from multiple
locations and continuously generate new data about this data. For each
gigabyte of new data created, an additional petabyte of data is created
about that data.
a. For example, consider someone playing an online video game
through a PC, game console, or smartphone. In this case, the
video game provider captures data about the skill and levels
attained by the player. Intelligent systems monitor and log how
and when the user plays the game. As a consequence, the game
provider can fine-tune the difficulty of the game, suggest other
related games that would most likely interest the user, and offer
additional equipment and enhancements for the character based
on the user's age, gender, and interests. This information may get
stored locally or uploaded to the game provider's cloud to analyse
the gaming habits and opportunities for upsell and cross-sell and
identify archetypical profiles of specific kinds of users.
b. Smartphones provide another rich source of data. In addition to
messaging and basic phone usage, they store and transmit data
about Internet usage, SMS usage, and real-time location. This
metadata can be used for analysing traffic patterns by scanning
the density of smartphones in locations to track the speed of cars
or the relative traffic congestion on busy roads. In this way, GPS
devices in cars can give drivers real-time updates and offer
alternative routes to avoid traffic delays.
c. Retail shopping loyalty cards record not just the amount an
individual spends, but the locations of stores that person visits,
the kinds of products purchased, the stores where goods are
purchased most often, and the combinations of products
purchased together. Collecting this data provides insights into
shopping and travel habits and the likelihood of successful
advertisement targeting for certain types of retail promotions.
2. Data collectors are the entities that gather data from devices and their
users. Examples include:
a. A cable TV provider tracking the shows a person watches, which
TV channels someone will and will not pay to watch on demand,
and the prices someone is willing to pay for premium TV content.
b. Retail stores tracking the path a customer takes through their
store while pushing a shopping cart with an RFID chip, so they
can gauge which products get the most foot traffic using
geospatial data collected from the RFID chips.
3. Data aggregators make sense of the data collected from the various
entities of the "Sensornet" or the "Internet of Things." These
organizations compile data from the devices and usage patterns
collected by government agencies, retail stores, and websites. In turn,
they can choose to transform and package the data as products to sell
to list brokers, who may want to generate marketing lists of people who
may be good targets for specific ad campaigns.
a. Retail banks, acting as data buyers, may want to know which
customers have the highest likelihood of applying for a second
mortgage or a home equity line of credit. To provide input for this
analysis, retail banks may purchase data from a data aggregator.
This kind of data may include demographic information about
people living in specific locations; people who appear to have a
specific level of debt, yet still have solid credit scores (or other
characteristics such as paying bills on time and having savings
accounts) that can be used to infer creditworthiness; and those
who are searching the web for information about paying off debts
or doing home remodelling projects. Obtaining data from these
various sources and aggregators will enable a more targeted
marketing campaign, which would have been more challenging
before Big Data due to the lack of information or high-performing
technologies.
b. Using technologies such as Hadoop to perform natural language
processing on unstructured, textual data from social media
websites, users can gauge the reaction to events such as
presidential campaigns. People may, for example, want to
determine public sentiments toward a candidate by analysing
related blogs and online comments. Similarly, data users may
want to track and prepare for natural disasters by identifying
which areas a hurricane affects first and how it moves, based on
which geographic areas are tweeting about it or discussing it via
social media.
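As a toy illustration of this kind of sentiment gauging, the Python sketch
below tallies positive and negative words in a handful of comments. The word
lists and comments are invented; a real analysis would apply a curated
lexicon (or a trained model) over millions of posts on a platform such as
Hadoop.

    from collections import Counter
    import re

    # Tiny illustrative lexicons; real analyses use curated sentiment lists.
    POSITIVE = {"great", "support", "win", "hope"}
    NEGATIVE = {"scandal", "fail", "corrupt", "lies"}

    comments = [
        "Great rally tonight, real hope for change",
        "Another scandal, more lies from the campaign",
        "I support the candidate and hope she wins",
    ]

    tally = Counter()
    for comment in comments:
        words = set(re.findall(r"[a-z']+", comment.lower()))
        tally["positive"] += len(words & POSITIVE)
        tally["negative"] += len(words & NEGATIVE)

    print(tally)  # a rough public-sentiment signal toward the candidate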
KEY ROLES FOR THE NEW BIG DATA ECOSYSTEM
BIG DATA ANALYTICS ACTIVITIES
There are three recurring sets of activities that data scientists perform:
1. Reframe business challenges as analytics challenges. Specifically, this
is a skill to diagnose business problems, consider the core of a given
problem, and determine which kinds of candidate analytical methods
can be applied to solve it.
2. Design, implement, and deploy statistical models and data mining
techniques on Big Data. This set of activities is mainly what people
think about when they consider the role of the Data Scientist: namely,
applying complex or advanced analytical methods to a variety of
business problems using data.
3. Develop insights that lead to actionable recommendations. It is critical
to note that applying advanced methods to data problems does not
necessarily drive new business value. Instead, it is important to learn
how to draw insights out of the data and communicate them effectively.
DATA SCIENTIST SKILLS
1. Quantitative skill: such as mathematics or statistics
2. Technical aptitude: namely, software engineering, machine learning,
and programming skills
3. Skeptical mind-set and critical thinking: It is important that data
scientists can examine their work critically rather than in a one-sided
way.
4. Curious and creative: Data scientists are passionate about data and
finding creative ways to solve problems and portray information.
5. Communicative and collaborative: Data scientists must be able to
articulate the business value in a clear way and collaboratively work
with other groups, including project sponsors and key stakeholders.