UNIT – 1:
What is Big Data?
The term Big Data refers to all the data that is being generated
across the globe at an unprecedented rate.
Big data is a collection of massive and complex data sets. The
term covers the huge quantities of data themselves as well as
the data management capabilities, social media analytics and
real-time data processing built around them.
Big Data Requirement
The need to handle huge amounts of data is increasing day by
day…
How much time does it take to process the following?
Excel: Have you ever tried a pivot table on a 500 MB file?
SAS/R: Have you ever tried a frequency table on a 2 GB file?
Access: Have you ever tried running a query on a 10 GB file?
SQL: Have you ever tried running a query on a 50 GB file?
• Can we think of running a query on a 20,980,000 GB file?
What if we get a new data set like this every day?
• What if we need to execute complex queries on this data set
every day?
• Does anybody really deal with this type of data set?
• Is it possible to store and analyze this data?
Yes. Google deals with more than 20 PB of data every day.
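As a rough check on the figures above, the 20,980,000 GB in the question can be converted to petabytes. The sketch below assumes binary units (1 PB = 1024 TB, 1 TB = 1024 GB); the function name is only illustrative.

```python
GB_PER_TB = 1024  # assumption: binary units
TB_PER_PB = 1024


def gb_to_pb(gigabytes: float) -> float:
    """Convert a size in gigabytes to petabytes."""
    return gigabytes / (GB_PER_TB * TB_PER_PB)


size_pb = gb_to_pb(20_980_000)
print(f"{size_pb:.2f} PB")  # prints "20.01 PB"
```

The result is roughly 20 PB, which matches the daily volume attributed to Google.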
Big Data means:
A collection of data sets so large and complex that it becomes
difficult to process using on-hand database management
tools or traditional data processing applications.
“Big Data” is the data whose scale, diversity, and complexity
require new architecture, techniques, algorithms, and
analytics to manage it and extract value and hidden
knowledge from it…
Big data examples:
Predictive inventory ordering.
Personalized marketing.
Streamlined media streaming.
Personalized health plans for cancer patients.
Live road mapping for autonomous vehicles.
The New York Stock Exchange is an example of a Big Data
source: it generates about one terabyte of new trade data
per day.
Social Media
Statistics show that 500+ terabytes of new data are
ingested into the databases of the social media site Facebook
every day.
Big Data analytics is fueling everything we do online, in
every industry.
Take the music streaming platform Spotify, for
example. The company has nearly 96 million users who
generate a tremendous amount of data every day.
From this information, the cloud-based platform
automatically generates suggested songs through a smart
recommendation engine, based on likes, shares, search
history, and more. This is enabled by the techniques, tools,
and frameworks that have come out of Big Data analytics.
Types of Digital Data
Today, data undoubtedly is an invaluable asset of any
enterprise (big or small). Even though professionals work with
data all the time, the understanding, management and analysis
of data from heterogeneous sources remains a serious
challenge.
Digital Data
The computer and Internet duo has imparted the digital form
to data. Digital data can be classified into three forms:
– Unstructured
– Semi-structured
– Structured
Usually, data is in the unstructured format which makes
extracting information from it difficult.
According to Merrill Lynch, 80–90% of business data is either
unstructured or semi-structured.
Gartner also estimates that unstructured data
constitutes 80% of all enterprise data.
Formats of Digital Data
[Figure: percent distribution of the three forms of data]
Unstructured data:
• This is data which does not conform to a data model or is
not in a form which can be used easily by a computer program.
It cannot be stored in the form of rows and columns as in a
database.
It is not easily usable by a program.
It does not follow any rules or semantics.
• About 80–90% of an organization's data is in this format,
for example memos, chat rooms, PowerPoint presentations,
images, videos, letters, research reports, the body of an
email, etc.
Where does Unstructured Data Come from?
• Anything in a non-database form is unstructured data.
• It can be classified into two broad categories:
Bitmap objects: For example, image, video, or audio
files.
Textual objects: For example, Microsoft Word
documents, emails, or Microsoft Excel spreadsheets.
A lot of unstructured data is also noisy text such as
chats, emails and SMS texts.
• The language of noisy text differs significantly from the
standard form of language.
Unstructured data is further divided into:
Captured data
User-generated data
a. Captured data:
It is data based on the user’s behavior. A good
example is GPS on smartphones, which tracks the user
at every moment and provides real-time output.
b. User-generated data:
It is the kind of unstructured data that users themselves
put on the internet every moment, for example tweets
and retweets, likes, shares, and comments on YouTube,
Facebook, etc.
How to Manage Unstructured Data?
Let us look at a few generic tasks performed to enable
storage and search of unstructured data:
• Indexing: Let us go back to our understanding of the
Relational Database Management System (RDBMS). In this
system, data is indexed to enable faster search and
retrieval. An index is defined on the basis of some value in
the data; it is nothing but an identifier that represents
a larger record in the data set. In the absence of an index,
the whole data set or document has to be scanned to
retrieve the desired information. In the case of
unstructured data too, indexing helps in searching and
retrieval. The unstructured data is indexed based on text
or some other attribute, e.g. file name. Indexing
unstructured data is difficult because this data neither
has any predefined attributes nor follows any pattern or
naming convention. Text can be indexed based on a text
string, but for non-text files, e.g. audio or video,
indexing depends on file names. This becomes a hindrance
when naming conventions are not followed.
• Tags/Metadata: Using metadata, data in a document, etc.
can be tagged. This enables search and retrieval. But in
unstructured data, this is difficult as little or no
metadata is available. The structure of the data has to be
determined, which is very difficult as the data itself has
no particular format and comes from more than one source.
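The indexing idea described above can be sketched as a small inverted index. This is a minimal illustration, not a production search engine: the file names and contents are made up, text files are indexed by the words they contain, and non-text files are indexed by the words in their file names only.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(text_files, other_files):
    """Map each word to the set of file names it can be found under."""
    index = defaultdict(set)
    for name, content in text_files.items():
        for word in tokenize(content):  # text: index the content
            index[word].add(name)
    for name in other_files:
        for word in tokenize(name):  # audio/video: only the file name
            index[word].add(name)
    return index


# Hypothetical sample data
text_files = {
    "memo.txt": "Quarterly sales memo for the retail team",
    "chat.log": "can u send the sales report asap",
}
media_files = ["team_meeting_2021.mp4", "VID0001.mp4"]

index = build_index(text_files, media_files)
print(sorted(index.get("sales", set())))
print(sorted(index.get("meeting", set())))
```

A lookup for "sales" finds both text files without scanning them again. A lookup for "meeting" finds only team_meeting_2021.mp4; the file VID0001.mp4 cannot be found this way because its name follows no useful convention, which is the hindrance mentioned above.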
Semi-structured Data
Semi-structured data does not conform to any data model, i.e.
it is difficult to determine the meaning of the data, and the
data cannot be stored in rows and columns as in a database.
However, semi-structured data has tags and markers which help
to group data and describe how it is stored. These give some
metadata, but not enough for management and automation of data.
• Similar entities in the data are grouped and organized in a
hierarchy. The attributes or the properties within a group may
or may not be the same. For example, two addresses (Address 1
and Address 2) may or may not contain the same number of
properties.
• For example, an e-mail follows a standard format: To, From,
Subject, CC, Body. The tags give us some metadata, but the
body of the e-mail has no format, nor any structure that
conveys the meaning of the data it contains.
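The e-mail example can be tried out with Python's standard email module. The message below is invented for illustration. The header fields come out as named values, while the body is returned as one unformatted string.

```python
from email import message_from_string

# Hypothetical message, used only for illustration
raw = """\
From: alice@example.com
To: bob@example.com
Cc: carol@example.com
Subject: Sales report

Hi Bob, pls find the report attached. Let me know what u think!
"""

msg = message_from_string(raw)

# The tagged part: each header can be looked up by name
headers = {name: msg[name] for name in ("From", "To", "Cc", "Subject")}

# The untagged part: free text with no internal structure
body = msg.get_payload()

print(headers["Subject"])
print(body)
```

The headers can be searched and sorted like fields in a table, but extracting meaning from the body needs text analysis.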
What is Semi-structured Data?
• Semi-structured data, also described as schema-less or
self-describing, is a form of structured data that contains
tags or markup elements to separate elements and to build
hierarchies of records and fields in the data.
How to Manage Semi-structured Data?
XML – A Solution for Semi-structured Data Management
XML has no predefined tags.
The words in the <> (angle brackets) are user-defined tags.
XML is known as self-describing, as data can exist without a
schema and a schema can be added later.
A schema can be described in a DTD or in XML Schema (XSD).
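A short sketch of XML as semi-structured data, using Python's standard xml.etree.ElementTree module. The tag names and address values are invented; note that the two address records do not carry the same number of properties and that no schema is needed to read them.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with user-defined tags
xml_text = """
<addresses>
  <address>
    <street>12 Hill Road</street>
    <city>Pune</city>
  </address>
  <address>
    <street>4 Lake View</street>
    <city>Mumbai</city>
    <pincode>400001</pincode>
  </address>
</addresses>
"""

root = ET.fromstring(xml_text.strip())

# Turn each <address> element into a dict of tag -> text
records = [
    {child.tag: child.text for child in address}
    for address in root.findall("address")
]

for record in records:
    print(record)
```

The tags describe the data (street, city, pincode), so a program can group and navigate it even though the records differ in shape.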
Structured Data
• Structured data is organized in semantic chunks (entities)
• Similar entities are grouped together (relations or classes)
• Entities in the same group have the same descriptions
(attributes)
• Descriptions for all entities in a group (schema):
have the same defined format
have a predefined length
Example: data stored in databases. Structured data comes
from databases, spreadsheets, SQL and OLTP systems.
What is a Data Model?
A data model is a collection of concepts that describe the
structure of a database, i.e. data types, relationships,
constraints, etc.
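The three parts of a data model named above (data types, relationships, constraints) can be seen in a small relational schema. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled

# Data types: INTEGER, TEXT, REAL
conn.execute("""
    CREATE TABLE customer (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    )""")

# Relationship: purchase.customer_id refers to customer.id
# Constraint: amount must be positive
conn.execute("""
    CREATE TABLE purchase (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        amount      REAL NOT NULL CHECK (amount > 0)
    )""")

conn.execute("INSERT INTO customer (id, name) VALUES (1, 'Asha')")
conn.execute("INSERT INTO purchase (customer_id, amount) VALUES (1, 250.0)")


def try_insert(sql):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


ok_negative = try_insert("INSERT INTO purchase (customer_id, amount) VALUES (1, -5.0)")
ok_orphan = try_insert("INSERT INTO purchase (customer_id, amount) VALUES (99, 10.0)")
print(ok_negative, ok_orphan)
```

Both extra inserts are rejected: the first violates the CHECK constraint, the second refers to a customer that does not exist. This is what it means for structured data to conform to a data model.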
What are the 5 Vs of Big Data?
Doug Laney introduced the concept of the 3 Vs of Big Data,
viz. Volume, Variety, and Velocity.
Volume refers to the amount of data that is being
collected. The data could be structured or unstructured.
Today, the volume of data in most organizations is
approaching exabytes. Some experts predict the volume
of data to reach zettabytes in the coming years.
Velocity refers to the rate at which data is generated,
captured, and shared.
The sources of high-velocity data include the following:
IT devices, including routers, switches, firewalls, etc.,
constantly generate valuable data.
Social media, including Facebook posts, tweets, and
other social media activities, create huge amounts of
data, which must be analyzed at high speed.
Portable devices, including mobile phones, PDAs, etc.,
also generate data at high speed.
Variety refers to the different kinds of data (data types,
formats, etc.) that are coming in for analysis. The data is
generated from different types of sources, such as
internal, external, social, and behavioral, and comes in
different formats, such as images, text, videos, etc.
Over the last few years, 2 additional Vs of data have also
emerged – value and veracity.
Value refers to the usefulness of the collected data.
Veracity refers to the quality of data that is coming in from
different sources. It generally refers to the uncertainty of
data, i.e., whether the obtained data is correct or
consistent. Out of the huge amount of data that is
generated in almost every process, only the data that is
correct and consistent can be used for further analysis.
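As a minimal sketch of what checking veracity can look like in practice, the code below keeps only records that pass a correctness rule and are not duplicates. The records, the age range and the choice of id as the duplicate key are all assumptions made for illustration; real quality rules depend on the data set.

```python
# Hypothetical incoming records
records = [
    {"id": 1, "age": 34, "country": "IN"},
    {"id": 2, "age": -5, "country": "IN"},    # incorrect: negative age
    {"id": 3, "age": None, "country": "US"},  # incomplete: missing age
    {"id": 1, "age": 34, "country": "IN"},    # inconsistent: duplicate id
]


def is_valid(record):
    """Assumed rule: age must be an integer between 0 and 120."""
    age = record["age"]
    return isinstance(age, int) and 0 <= age <= 120


def clean(rows):
    """Keep the first valid record for each id and drop the rest."""
    seen_ids = set()
    kept = []
    for row in rows:
        if row["id"] in seen_ids or not is_valid(row):
            continue
        seen_ids.add(row["id"])
        kept.append(row)
    return kept


cleaned = clean(records)
print(len(records), "records in,", len(cleaned), "usable")
```

Here only one of the four records survives, which illustrates why veracity limits how much of the collected volume is actually usable for analysis.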
Web Analytics in Big Data
Web Analytics is the methodological study of
online/offline patterns and trends. It is a technique that we can
employ to collect, measure, report, and analyze our website
data. It is normally carried out to analyze the performance of a
website and optimize its web usage.
Web analytics is the process of analyzing the behavior of
visitors to a website. This involves tracking, reviewing and
reporting data to measure web activity, including the use of a
website and its components, such as webpages, images and
videos.
Web analytics helps to:
Refine marketing campaigns.
Analyze visitor searches on various websites.
Analyze website conversions and customer buying
patterns.
Improve website user experience.
Boost search engine ranking and SEO (Search Engine
Optimization).
Understand and optimize referral sources.
Boost online sales.
Web Analytics Process
The primary objective of carrying out Web Analytics is to
optimize the website in order to provide better user
experience. It provides a data-driven report to measure visitors’
flow throughout the website.
The process of web analytics consists of the following steps:
Set the business goals.
Set the Key Performance Indicators (KPIs) to track goal
achievement.
Collect correct and suitable data.
Analyze the data to extract insights.
Test alternatives based on assumptions learned from the
data analysis.
Implement insights based on either the data analysis or
website testing.
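The KPI, analysis and testing steps can be sketched with a single KPI. The example below assumes the KPI is the conversion rate (conversions divided by visitors) and compares two page variants; the variant names and counts are invented, and a real test would also check whether the difference is statistically significant.

```python
def conversion_rate(visitors: int, conversions: int) -> float:
    """KPI: share of visitors who completed the goal."""
    if visitors == 0:
        return 0.0
    return conversions / visitors


# Hypothetical collected data: variant -> (visitors, conversions)
variants = {"A": (1000, 30), "B": (1000, 45)}

# Analyze: compute the KPI for each alternative
rates = {name: conversion_rate(v, c) for name, (v, c) in variants.items()}

# Test alternatives: pick the variant with the higher KPI
best = max(rates, key=rates.get)

for name, rate in rates.items():
    print(f"Variant {name}: {rate:.1%}")
print("Better variant:", best)
```

With these numbers variant B converts at 4.5% against 3.0% for A, so B would be the candidate to implement.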
Which tool is used in web analytics?
Google Analytics is one of the most popular free web analytics
tools. It is a traditional analytics solution, meaning it
provides real-time data about your site's traffic, such as
pageviews, sessions, time on page, bounce rates, and other
stats and metrics.
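To make these metrics concrete, the sketch below computes pageviews, sessions and bounce rate from a small, made-up hit log. It does not use the Google Analytics API, and it assumes simplified definitions: a pageview is one logged hit, a session is one distinct session id, and a bounce is a session with exactly one pageview. Analytics tools may define these differently.

```python
from collections import Counter

# Hypothetical hit log: (session_id, page)
hits = [
    ("s1", "/home"), ("s1", "/products"), ("s1", "/checkout"),
    ("s2", "/home"),
    ("s3", "/blog"), ("s3", "/home"),
    ("s4", "/products"),
]

pageviews = len(hits)

# Number of pages viewed in each session
pages_per_session = Counter(session_id for session_id, _ in hits)
sessions = len(pages_per_session)

# Assumed definition: a bounce is a single-page session
bounces = sum(1 for count in pages_per_session.values() if count == 1)
bounce_rate = bounces / sessions

print("Pageviews:", pageviews)
print("Sessions:", sessions)
print(f"Bounce rate: {bounce_rate:.0%}")
```

For this log the result is 7 pageviews, 4 sessions and a bounce rate of 50%.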