(Course Code : 1TD08011) (Department Optional Course - 5)
Big Data Analytics
Dr. Sangeeta Vhatkar (Thakur COE and Technology, Kandivali(E))
Dipali Pawar (Zeal COE and Research, Pune)
Rupali D. Pashte (Shree L. R. Tiwari COE, Mumbai)
Dr. Zahir Aalm (Thakur COE and Technology, Kandivali(E))
Introduction to Big Data
University Prescribed Syllabus
Introduction to Big Data, Big Data characteristics, types of Big Data,
Traditional vs. Big Data business approach, Big Data Challenges,
Examples of Big Data in Real Life, Big Data Applications.
Self-learning Topics : Identification of Big Data applications and its
solutions.
1.1 INTRODUCTION TO BIG DATA AND HADOOP

GQ. Firstly, we need to know "what is data" ?
(1) Nowadays, the amount of data created by various advanced
technologies like social networking sites, e-commerce, etc. is
very large. It is really difficult to store such huge data using
traditional data storage facilities.
(2) Until 2003, the total size of data produced was 5 billion
gigabytes. If this data were stored in the form of disks, it would
fill an entire football field. In 2011, the same amount of data was
created every two days, and in 2013 it was created every ten
minutes. This is a truly tremendous rate.
(3) In this topic we will discuss and define big data, and cover
some of the processes and engines related to it.

Big Data is a massive collection of data that continues to grow
exponentially with time. Some examples of Big Data are :
1. Stock Exchange : The data in the share market regarding
information about prices and status details of shares of lots
of companies is very huge.
2. Social Media Data : The data of social networking sites contains
information about all the account holders, their posts, chat
history, advertisements, etc. On topmost sites like Facebook and
WhatsApp, there are literally billions of users.
Big Data Analytics (MU-Sem 8-IT) (Introduction to Big Data)
3. Video Sharing Portals : Video sharing portals like YouTube,
Vimeo, etc. contain millions of videos, each of which requires
lots of memory to store.
4. Search Engine Data : Search engines like Google and
Yahoo hold a lot of metadata regarding various sites.
5. Transport Data : Transport data contains information about
model, capacity, distance and availability of various vehicles.
6. Banking Data : The big giants in the banking domain like SBI or
ICICI hold large amounts of data regarding huge transactions of
account holders.
1.2 BIG DATA CHARACTERISTICS

GQ. What are the characteristics of Big Data ?
UQ. Describe any five characteristics of Big Data.
UQ. Explain what characteristic of Social Networks makes them Big Data.
UQ. Explain Big Data along with its V's.
(1) Volume represents the volume, i.e. the amount of data that is
growing at a high rate; data volume is now measured in petabytes.
(2) Value refers to turning data into value. By turning accessed
big data into value, businesses may generate revenue.
(3) Veracity refers to the uncertainty of available data. Veracity
arises due to the high volume of data, which brings
incompleteness and inconsistency.
(4) Visualization is the process of displaying data in charts,
graphs, maps, and other visual forms.
(5) Variety refers to the different data types, i.e. various data
formats like text, audio, video, etc.
(6) Velocity is the rate at which data grows. Social media
contributes a major role in the velocity of growing data.
(7) Virality describes how quickly information gets spread across
people-to-people (P2P) networks.
1.2.1 Volume
• As it follows from the name, big data is used to refer to enormous
amounts of information.
• We are talking about not gigabytes but terabytes and petabytes of
data.
• The IoT (Internet of Things) is creating exponential growth in
data.
• The volume of data is projected to grow significantly in the
coming years.
• Hence, Volume is one characteristic which needs to be
considered while dealing with Big Data.
Volume [Data at Rest] : Terabytes, Petabytes; Records/Archives; Tables/Files; Distributed
1.2.2 Variety
• Variety refers to heterogeneous sources and the nature of data,
both structured and unstructured.
• Data comes in different formats : from structured, numeric data
in traditional databases to unstructured text documents, emails,
videos, audio, stock ticker data and financial transactions.
• This variety of unstructured data poses certain issues for storing,
mining and analysing data.
• Organizing the data in a meaningful way is no simple task,
especially when the data itself changes rapidly.
• Another challenge of Big Data processing goes beyond the
massive volumes and increasing velocities of data; it also lies in
manipulating the enormous variety of these data.
Variety [Data in many Forms] : Text; Multimedia; Structured; Unstructured
1.2.3 Veracity
• Veracity describes whether the data can be trusted. Veracity
refers to the uncertainty of available data.
• Veracity arises due to the high volume of data, which brings
incompleteness and inconsistency.
• Hygiene of data in analytics is important, because otherwise you
cannot guarantee the accuracy of your results.
• Data comes from so many different sources that it is difficult
to link, match, cleanse and transform data across systems.
• However, it is useless if the data being analysed are inaccurate or
incomplete.
• Veracity is all about making sure the data is accurate, which
requires processes to keep bad data from accumulating in your
systems.
Veracity [Data in Doubt] : Trustworthiness; Authenticity; Accuracy; Availability
1.2.4 Velocity
• Velocity is the speed at which data grows, is processed and
becomes accessible.
• Data flows in from sources like business processes, application
logs, networks, social media sites, sensors, mobile devices, etc.
• The flow of data is massive and continuous.
• Most data are warehoused before analysis, but there is an
increasing need for real-time processing of these enormous
volumes.
• Real-time processing reduces storage requirements while
providing more responsive, accurate and profitable responses.
• It should be processed fast, by batch or in a stream-like manner,
because it just keeps growing every year.
Velocity [Data in Motion] : Streaming; Batch; Real / Near Time Processes
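The batch vs. stream contrast described in the Velocity section can be sketched in a few lines of plain Python (an illustrative toy; all names are invented). A running mean updated one record at a time arrives at the same answer as a batch computation over warehoused data, without retaining every record.

```python
def batch_mean(values):
    """Batch style: all records are warehoused first, then analysed."""
    return sum(values) / len(values)

def stream_mean(stream):
    """Stream style: each record updates a running mean on arrival,
    so no record needs to be stored after it is processed."""
    count, mean = 0, 0.0
    for x in stream:
        count += 1
        mean += (x - mean) / count   # incremental update
    return mean

readings = [12.0, 15.0, 11.0, 18.0, 14.0]
print(batch_mean(readings))    # 14.0
print(stream_mean(readings))   # same value, computed incrementally
```

The streaming version trades a tiny amount of per-record work for the ability to answer at any moment, which is the "real-time processing reduces storage requirements" point above.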
1.2.5 Value
• Value refers to turning data into value. By turning accessed big
data into value, businesses may generate revenue.
• Value is the end game. After addressing volume, velocity,
variety, variability, veracity, and visualization (which takes a lot
of time, effort and resources), you want to be sure your
organization is getting value from the data.
• For example, data that can be used to analyze consumer behavior
is valuable for your company because you can use the research
results to make individualized offers.
Value [Data into Money] : Statistical Events; Correlations
1.2.6 Visualization
• Big data visualization is the process of displaying data in charts,
graphs, maps, and other visual forms.
• It is used to help people easily understand and interpret their data
at a glance, and to clearly show trends and patterns that arise
from this data.
• Raw data comes in different formats, so creating data
visualizations is a process of gathering, managing, and
transforming data into a format that is most usable and
meaningful.
• Big Data visualization makes your data as accessible as possible
to everyone within your organization, whether they have
technical data skills or not.
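As a toy illustration of the "gather, manage, transform" pipeline described above (plain Python, no plotting library; all names are invented for the sketch), raw category records can be aggregated and rendered as a simple text bar chart:

```python
from collections import Counter

def text_bar_chart(events):
    """Aggregate raw records, then render each category as a bar."""
    counts = Counter(events)          # transform raw records into counts
    lines = []
    for category, n in sorted(counts.items()):
        lines.append(f"{category:6s} | {'#' * n} ({n})")
    return "\n".join(lines)

# Raw data arrives as unaggregated records in mixed order.
raw_events = ["mobile", "web", "mobile", "email", "web", "mobile"]
print(text_bar_chart(raw_events))
# email  | # (1)
# mobile | ### (3)
# web    | ## (2)
```

A real deployment would feed the same aggregated counts into a charting tool; the essential step, turning raw records into a summarized, displayable form, is the same.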
1.3.1 Type #1 : Unstructured
• Unstructured data is the data which lacks a proper format or
sequence and is not constrained by a fixed schema.

Advantages
1. It is not constrained by a fixed schema.
2. It is very flexible due to the absence of a schema.
3. Data is portable.
4. It is very scalable.
5. It can deal easily with the heterogeneity of sources.
6. These types of data have a variety of business intelligence and
analytics applications.

Disadvantages
1. It is difficult to store and manage unstructured data due to the
lack of schema and structure.
2. Indexing the data is difficult and error prone due to unclear
structure and the absence of pre-defined attributes. Due to this,
search results are not very accurate.
3. Ensuring security of the data is a difficult task.
1.3.2 Type #2 : Structured
• Any data that can be stored, accessed and processed in the form
of a fixed format is termed "structured" data.
• Over a period of time, talent in computer science has achieved
greater success in developing techniques for working with this
kind of data (where the format is well known in advance) and
deriving value out of it.
• The size of such data can grow to a huge extent; typical sizes are
in the range of multiple zettabytes. Data stored in a relational
database management system is one example of structured data.
• Structured data is the data which conforms to a data model, has
a well-defined structure, follows a consistent order and can be
easily accessed and used by a person or a computer program.
• Structured data is usually stored in well-defined schemas such as
databases. It is generally tabular, with columns and rows that
clearly define its attributes.
• SQL (Structured Query Language) is often used to manage
structured data stored in databases.

1.3.2(A) Characteristics of Structured Data
• Data conforms to a data model and has an easily identifiable
structure.
• Data is stored in the form of rows and columns.
Example : Database
• Data is well organised, so the definition, format and meaning of
the data are explicitly known.
• Data resides in fixed fields within a record or file.
• Similar entities are grouped together to form relations or classes.
• Entities in the same group have the same attributes.
• Easy to access and query, so data can be easily used by other
programs.
• Data elements are addressable, so efficient to analyse and process.
1.3.2(B) Sources of Structured Data
(1) SQL databases (2) Spreadsheets such as Excel
(3) OLTP systems (4) Online forms
(5) Sensors such as GPS or RFID tags
(6) Network and web server logs
(7) Medical devices
1.3.2(C) Advantages of Structured Data
1. Structured data have a well-defined structure that helps in easy
storage and access of data.
2. Data can be indexed based on text strings as well as attributes.
This makes search operations hassle-free.
3. Data mining is easy, i.e. knowledge can be easily extracted from
the data.
4. Operations such as updating and deleting are easy due to the well
structured form of data.
5. Business Intelligence operations such as data warehousing can be
easily undertaken.
6. Easily scalable in case there is an increment of data.
7. Ensuring security of data is easy.

Structured - Example
Employee_Table
| Employee_ID | Employee_Name | Gender | Department | Salary_In_lacs |
| 1 | XYK | MALE | FINANCE | 850000 |
| 2 | ABC | MALE | ADMIN | 250000 |
| 3 | PQR | FEMALE | SALES | 350000 |
| 4 | WXR | FEMALE | FINANCE | 600000 |
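Since the text notes that SQL is used to manage structured data, the Employee_Table above can be reproduced with Python's built-in sqlite3 module (a minimal sketch; the table and values mirror the example, and the in-memory database is purely illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
cur = conn.cursor()

# Fixed schema: every row has the same well-defined columns.
cur.execute("""CREATE TABLE Employee_Table (
    Employee_ID   INTEGER PRIMARY KEY,
    Employee_Name TEXT,
    Gender        TEXT,
    Department    TEXT,
    Salary_In_lacs INTEGER)""")

cur.executemany("INSERT INTO Employee_Table VALUES (?,?,?,?,?)", [
    (1, "XYK", "MALE",   "FINANCE", 850000),
    (2, "ABC", "MALE",   "ADMIN",   250000),
    (3, "PQR", "FEMALE", "SALES",   350000),
    (4, "WXR", "FEMALE", "FINANCE", 600000),
])

# Because the structure is known in advance, querying is easy.
cur.execute("SELECT Employee_Name FROM Employee_Table "
            "WHERE Department = 'FINANCE' ORDER BY Employee_ID")
print([name for (name,) in cur.fetchall()])   # ['XYK', 'WXR']
```

The fixed schema is exactly what makes the data "easy to access and query" as listed in the characteristics above.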
1.3.3 Type #3 : Semi-structured
• Semi-structured is the third type of big data. Semi-structured
data can contain both forms of data.
• Semi-structured data pertains to the data containing both the
formats mentioned above, that is, structured and unstructured
data.
• To be precise, it refers to data that, although it has not been
classified under a particular repository (database), yet contains
vital information or tags that segregate individual elements
within the data.
• Web application data, which is unstructured, consists of log files,
transaction history files, etc.
• Online transaction processing systems are built to work with
structured data, wherein data is stored in relations (tables).
• Semi-structured data is data that does not conform to a data
model but has some structure. It lacks a fixed or rigid schema. It
is data that does not reside in a relational database but has some
organizational properties that make it easier to analyze. With
some processing, we can store it in a relational database.
1.3.3(A) Characteristics of Semi-structured Data
1. Data does not conform to a data model but has some structure.
Data can not be stored in the form of rows and columns as in
databases.
2. Semi-structured data contains tags and elements (metadata)
which are used to group data and describe how the data is stored.
3. Similar entities are grouped together and organized in a
hierarchy. Entities in the same group may or may not have the
same attributes or properties.
4. It does not contain sufficient metadata, which makes automation
and management of data difficult.
5. Size and type of the same attributes in a group may differ.
6. Due to the lack of a well-defined structure, it can not be used by
computer programs easily.
1.3.3(B) Sources of Semi-structured Data
(1) E-mails (2) XML and other markup languages
(3) Binary executables (4) TCP/IP packets
(5) Zipped files (6) Integration of data from different sources
(7) Web pages
1.3.3(C) Advantages and Disadvantages of Semi-structured Data

Advantages
1. The data is not constrained by a fixed schema.
2. It is flexible, i.e. the schema can be easily changed.
3. Data is portable.
4. It is possible to view structured data as semi-structured data.
5. It supports users who can not express their needs in SQL.
6. It can deal easily with the heterogeneity of sources.

Disadvantages
1. The lack of a fixed, rigid schema makes storage of the data
difficult.
2. Interpreting the relationship between data is difficult as there is
no separation of the schema and the data.
3. Queries are less efficient as compared to structured data.
Semi-structured - Example
• Users can see semi-structured data as structured in form, but it is
actually not defined with, e.g., a table definition as in a relational
DBMS.
• Personal data stored in an XML file :
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>20</age></rec>
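The tags in the XML example are exactly the "organizational properties" that make semi-structured data easier to analyze: a program can locate elements by name even though there is no fixed relational schema. A small sketch using Python's built-in XML parser (the record and tag names follow the example above; the wrapping root element is added so the document is well-formed):

```python
import xml.etree.ElementTree as ET

# The same personal data, wrapped in a single root element.
doc = """<records>
  <rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
  <rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
  <rec><name>Satish Mane</name><sex>Male</sex><age>20</age></rec>
</records>"""

root = ET.fromstring(doc)
# Tags let us segregate individual elements without a table definition.
people = [(r.findtext("name"), int(r.findtext("age"))) for r in root]
print(people)
# [('Prashant Rao', 35), ('Seema R.', 41), ('Satish Mane', 20)]
```

Note that nothing stops one record from carrying an extra tag or omitting one, which is precisely why semi-structured data is flexible but harder to automate than a fixed table.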
1.4 DIFFERENCE BETWEEN STRUCTURED, SEMI-STRUCTURED AND UNSTRUCTURED DATA

GQ. What is the difference between structured, semi-structured and unstructured data ?

| Properties | Structured data | Semi-structured data | Unstructured data |
| Technology | It is based on relational database tables. | It is based on XML/RDF (Resource Description Framework). | It is based on character and binary data. |
| Transaction management | Matured transactions and various concurrency techniques. | Transactions are adapted from the DBMS; not matured. | No transaction management and no concurrency. |
| Version management | Versioning over tuples, rows, tables. | Versioning over tuples or graphs is possible. | Versioned as a whole. |
| Flexibility | It is schema dependent and less flexible. | It is more flexible than structured data but less flexible than unstructured data. | It is more flexible and there is absence of schema. |
| Scalability | It is very difficult to scale the DB schema. | Its scaling is simpler than structured data. | It is more scalable. |
| Robustness | Very robust. | New technology, not very widespread. | - |
| Query performance | Structured queries allow complex joining. | Queries over anonymous nodes are possible. | Only textual queries are possible. |

1.5 TRADITIONAL VS. BIG DATA BUSINESS APPROACH

1. Traditional Data
• Traditional data is the structured data which is being majorly
maintained by all types of businesses, starting from very small to
big organizations.
• In a traditional database system, a centralized database
architecture is used to store and maintain the data in a fixed
format or fields in a file. For managing and accessing the data,
Structured Query Language (SQL) is used.

2. Big Data
• We can consider big data an upper version of traditional data.
Big data deals with too large or complex data sets which are
difficult to manage in traditional data-processing application
software.
• It deals with large volumes of both structured, semi-structured
and unstructured data.
• Volume, Velocity, Variety, Veracity and Value refer to the 5 V's
of big data.
• Big data not only refers to a large amount of data but also to
extracting meaningful information by analyzing huge, complex
data sets.

UQ. Compare big data analytics with traditional analytics.

| | Traditional Data | Big Data |
| 1. | Traditional data is generated at enterprise level. | Big data is generated outside and at enterprise level. |
| 2. | Its volume ranges from Gigabytes to Terabytes. | Its volume ranges from Petabytes to Zettabytes or Exabytes. |
| 3. | Traditional database systems deal with structured data. | Big data systems deal with structured, semi-structured and unstructured data. |
| 4. | Traditional data is generated per hour or per day or more. | Big data is generated more frequently, mainly per second. |
| 5. | Traditional data sources are centralized, and the data is managed in centralized form. | Big data sources are distributed, and the data is managed in distributed form. |
| 6. | Data integration is very easy. | Data integration is very difficult. |
| 7. | A normal system configuration is capable of processing traditional data. | A high system configuration is required to process big data. |
| 8. | The size of the data is very small as compared to big data. | The size is more than the traditional data size. |
| 9. | Traditional database tools are required to perform any database operation. | Special kinds of database tools are required to perform any database operation. |
| 10. | Normal functions can manipulate the data. | Special kinds of functions are needed to manipulate the data. |
| 11. | Its data model is strict-schema based and it is static. | Its data model is flat-schema based and it is dynamic. |
| 12. | Traditional data is stable, with known inter-relationships. | Big data is not stable, with unknown relationships. |
| 13. | Traditional data is in manageable volume. | Big data is in huge volume which becomes unmanageable. |
| 14. | It is easy to manage and manipulate the data. | It is difficult to manage and manipulate the data. |
| 15. | Its data sources include ERP transaction data, CRM transaction data, financial data, organizational data, web transaction data, etc. | Its data sources include social media, device data, sensor data, video, images, audio, etc. |

1.6 BIG DATA APPLICATIONS

There are various big data applications, as shown in Fig. 1.6.1 :
1. Fraud detection
2. IT log analytics
3. Call center analytics
4. Social media analysis
Fig. 1.6.1 : Big data applications

1. Fraud detection
• Fraud detection is a Big Data application example for
businesses which have operations involving any type of claims
or transaction processing.
• A number of times, the detection of fraud is concluded long
after the fact, at which point the damage has already been done
and all that is left is to decrease the harm and revise policies to
prevent it in future.
• Big Data platforms can analyze the claims and transactions of
businesses. They identify large-scale patterns across many
transactions or detect anomalous behaviour of some user.
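The "detect anomalous behaviour" step that fraud-detection platforms perform can be sketched with a simple z-score outlier test (an illustrative toy in plain Python, not how any particular platform works; the threshold and transaction data are invented):

```python
def zscore_outliers(amounts, threshold=3.0):
    """Flag values more than `threshold` standard deviations
    away from the mean of the whole batch."""
    n = len(amounts)
    mean = sum(amounts) / n
    var = sum((x - mean) ** 2 for x in amounts) / n
    std = var ** 0.5
    # If std is zero every value is identical, so nothing is anomalous.
    return [x for x in amounts if std and abs(x - mean) / std > threshold]

# 99 routine transactions plus one anomalous large transfer.
txns = [100.0] * 99 + [9_000.0]
print(zscore_outliers(txns))   # [9000.0]
```

Production systems use far richer features and models, but the shape of the task is the same: learn what "normal" looks like across many transactions, then surface the records that deviate from it.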
2. IT log analytics
• An enormous quantity of logs and trace data is generated in IT
solutions and IT departments. Many times such data goes
unexamined: organizations simply don't have the manpower.
• Big data has the ability to quickly identify large-scale patterns,
helping to diagnose and prevent problems, which helps the
organization at large.

3. Call center analytics
• Now we turn to the customer-facing Big Data application
examples, of which call center analytics are particularly
powerful. Without a Big Data solution, much of the insight that a
call center can provide will be ignored or exposed too late.
• By making sense of time/quality resolution metrics, Big Data
solutions are able to identify recurring problems or customer and
staff behaviour patterns. Big data can also capture and process
call content itself.

4. Social media analysis
• With the help of social media, we can observe real-time insights
into how the market is responding to products and campaigns.
• With the help of these insights, it is possible for companies to
adjust their pricing, promotion, and campaign placement to get
optimal results.

1.7 BIG DATA CHALLENGES

1. Sharing and Accessing Data
• Perhaps the most frequent challenge in big data efforts is the
inaccessibility of data sets from external sources.
• Sharing big data can cause substantial challenges, including the
need for inter- and intra-institutional legal documents.
• Accessing data from public repositories leads to multiple
difficulties.
• It is necessary for the data to be available in an accurate,
complete and timely manner, because if data in a company's
information system is to be used to make accurate decisions in
time, then it becomes necessary for the data to be available in
this manner.

2. Privacy and Security
• This is another most important challenge with Big Data. It has
sensitive, conceptual, technical as well as legal significance.
• Most organizations are unable to maintain regular checks due to
the large amounts of data generated. However, it is necessary to
perform security checks and observation in real time, because
that is most beneficial.
• There is some information about a person which, when
combined with external large data sets, may lead to facts about
that person which may be secretive, and he might not want the
owner to know this information about him.
• Some organizations collect information about people in order to
add value to their business, by making insights into their lives
that they are unaware of.

3. Analytical Challenges
• There are some huge analytical challenges in big data which
raise some main questions : how to deal with a problem if the
data volume gets too large? Or how to find out the important
data points? Or how to use the data to the best advantage?
• The large amounts of data on which these types of analysis are
to be done can be structured (organized data), semi-structured
(semi-organized data) or unstructured (unorganized data).
• There are two techniques through which decision making can be
done :
1. Either incorporate massive data volumes in the analysis, or
2. Determine upfront which big data is relevant.
4. Technical Challenges

Quality of data
1. Collecting and storing large amounts of data comes at a cost. Big
companies, business leaders and IT leaders always want large
data storage.
2. For better results and conclusions, Big Data focuses on quality
data storage rather than having irrelevant data.
3. This further raises the question of how it can be ensured that the
data is relevant, how much data would be enough for decision
making, and whether the stored data is accurate or not.

Fault tolerance
1. Fault tolerance is another technical challenge; fault-tolerant
computing is extremely hard, involving intricate algorithms.
2. Nowadays, new technologies like cloud computing and big data
always intend that whenever a failure occurs, the damage done
should be within an acceptable threshold, i.e. the whole task
should not have to begin from scratch.

Scalability
1. Big data projects can grow and evolve rapidly. The scalability
issue of Big Data has led towards cloud computing.
2. It leads to various challenges, like how to run and execute
various jobs so that the goal of each workload can be achieved
cost-effectively.
3. It also requires dealing with system failures in an efficient
manner. This again leads to a big question : what kinds of
storage devices are to be used ?
1.8 EXAMPLES OF BIG DATA IN REAL LIFE
(1) In the Education Industry
The University of Alabama has more than 38,000 students and an
ocean of data. In the past, when there were no real solutions to
analyze that much data, some of it seemed useless. Now,
administrators can use analytics and data visualizations on this data
to draw out patterns of students, revolutionizing the university's
operations, recruitment, and retention efforts.

(2) In Healthcare
Wearable devices and sensors have been introduced in the
healthcare industry which can provide a real-time feed to the
electronic health record of a patient. One such technology is Apple's.
Apple has come up with Apple HealthKit, CareKit, and
ResearchKit. The main goal is to empower iPhone users to store and
access their real-time health records on their phones.

(3) In Government Sector
The Food and Drug Administration (FDA), which runs under the
jurisdiction of the Federal Government of the USA, leverages the
analysis of big data to discover patterns and associations, to identify
and examine the expected or unexpected occurrences of food-based
infections.

(4) In Media and Entertainment Industry
Spotify, an on-demand music-providing platform, uses Big Data
Analytics, collects data from all its users around the globe, and then
uses the analyzed data to give informed music recommendations and
suggestions to every individual user.
Amazon Prime, which offers videos, music, and Kindle books in a
one-stop shop, is also heavily using big data.

(5) In Weather Patterns
Deep Thunder, which is a research project by IBM, provides
weather forecasting through high-performance computing of big
data. IBM is also assisting Tokyo with improved weather forecasting
for natural disasters and for predicting the probability of damaged
power lines.

(6) In Transportation Industry
Uber generates and uses a huge amount of data regarding drivers,
their vehicles, locations, every trip from every vehicle, etc. All this
data is analyzed and then used to predict supply, demand, the
location of drivers, and the fare that will be set for every trip.

(7) In Banking Sector
Various anti-money laundering software packages such as SAS
AML use Data Analytics in banking to detect suspicious
transactions and analyze customer data. Bank of America has been a
SAS customer for more than 25 years.

(8) In Marketing
Amazon has collected data about the purchases made by millions
of people around the world. It analyzed the purchase patterns and
payment methods used by the customers and used the results to make
new offers and advertisements.

(9) In Business Insights
Netflix is using Big Data to understand user behavior, the type of
content users like, the popular movies on the website, the content it
can suggest to a user, and which series or movies it should invest in.

(10) In Space Sector
NASA is collecting data from different satellites and rovers about
the geography, atmospheric conditions, and other factors of Mars for
their upcoming mission. It uses big data to manage all that data and
analyzes it to run simulations.
Introduction to Big Data Frameworks

University Prescribed Syllabus
What is Hadoop? Core Hadoop Components; Hadoop Ecosystem;
Working with Apache Spark; What is NoSQL? NoSQL data
architecture patterns : Key-value stores, Graph stores, Column
family (Bigtable) stores, Document stores, MongoDB.
Self-learning Topics : HDFS vs GFS, MongoDB vs other NoSQL
systems, Implementation of Apache Spark.
2.1 CONCEPT OF HADOOP

2.1.1 What is Hadoop ?
Hadoop is an open-source software platform for storing massive
volumes of data and running applications on clusters (groups) of
commodity hardware. It gives us massive data storage capability,
massive computational power and the ability to handle virtually
limitless concurrent tasks or jobs. Its main essential purpose is to
support growing big data technologies and forward-thinking
analytics like predictive analytics, machine learning and data
mining. Hadoop has the capability to handle different modes of data
such as structured, unstructured and semi-structured data. It gives us
the flexibility to collect, process, and investigate data in ways that
the old data warehouse concept failed to do.
2.1.2 History of Hadoop
• Hadoop was introduced by Doug Cutting and Mike Cafarella in
2002. Its beginning was the Google File System paper, published
by Google.
• In the year 2002, Doug Cutting and Mike Cafarella started to
work on the Apache Nutch project. It is an open source, i.e. free,
web crawler software project.
• While working on Apache Nutch, they were facing some issues
with big data. Storing that data would have required investing a
lot of money, which became the challenge for completing that
project.
• Out of this problem, Hadoop came into existence.
• In 2003, Google presented a file system known as GFS (Google
File System). It is a proprietary distributed file system developed
to provide efficient access to data.
• In the year 2004, Google released a white paper on MapReduce.
This technique simplifies data processing on large clusters
(groups).
• In 2005, Doug Cutting and Mike Cafarella presented a new file
system known as NDFS (Nutch Distributed File System); this
file system also included MapReduce. In 2006, Doug Cutting
joined Yahoo and, based on the Nutch project, announced a new
project, Hadoop, with a file system known as HDFS (Hadoop
Distributed File System).
• Hadoop's first version, 0.1.0, was released in April 2006. Doug
Cutting named the project Hadoop after his son's toy elephant.
In 2007, Yahoo successfully ran two clusters of 1000 machines.
• In 2008, Hadoop became the quickest system to sort 1 terabyte
of data, on a 900-node cluster in 209 seconds. In 2013, Hadoop 2
was released, and in 2017, Hadoop 3.0 was released.
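The MapReduce technique mentioned in the history above can be illustrated with a tiny in-process word count in plain Python (a conceptual sketch of the map, shuffle and reduce phases only; this is not the Hadoop or Google API, and in a real cluster each phase runs distributed across many nodes):

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (key, 1) pair for every word in an input line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework
    does between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine each key's values into a single count."""
    return key, sum(values)

lines = ["big data big clusters", "big data"]
pairs = [p for line in lines for p in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)   # {'big': 3, 'data': 2, 'clusters': 1}
```

The appeal of the model is that map and reduce are independent per key, so the framework can scatter them across thousands of machines without the programmer writing any distribution logic.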
2.1.3 Features of Hadoop
1. Suitable for Big Data Analysis : As Big Data tends to be
distributed and unstructured in nature, Hadoop clusters are well
matched for the analysis of Big Data. Since it is processing logic
(not the actual data) that flows to the computing nodes, less
network bandwidth is consumed. This concept is called data
locality, and it helps to increase the productivity of Hadoop-based
applications.
2. Scalability : Hadoop clusters can easily be scaled to any extent
by adding extra cluster nodes, and thus allow for the growth of
Big Data. Also, scaling does not require adjustments to
application logic.
3. Fault Tolerance : The Hadoop network has a facility to duplicate
the input data onto other cluster nodes. So, in the event of a
cluster node failure, data processing can still proceed by using
the data stored on another cluster node.
2.1.4 Advantages of Hadoop
1. Fast : In HDFS the data is distributed over the cluster and
mapped in such a way that helps in faster retrieval. Even the
tools to process the data are often on the same servers, thus
reducing the processing time; this is an efficient way to manage
the data. It can process terabytes of data in minutes and
petabytes in hours.
2. Scalable : A Hadoop cluster can be extended by just adding
nodes to the cluster.
3. Cost Effective : Hadoop is open source and uses commodity
hardware to store data, so it is cheaper as compared to a
traditional RDBMS.
4. Resilient to failure : HDFS has the property with which it can
duplicate data over the network, so if one node is down or some
other network failure happens, then Hadoop takes the backup
copy of the data and uses it. Normally, data are replicated thrice,
but the replication factor is configurable.
2.1.5 Challenges of Hadoop

1. Hadoop is a complex distributed system with a low-level application programming interface.
2. Specialized skills are required for using Hadoop, and this prevents most developers from efficiently building solutions.
3. Business logic and infrastructure APIs have no clear separation, therefore the burden falls on application developers.
4. Automated testing of end-to-end solutions is unfeasible or terribly difficult.
5. Common data patterns often require support for data consistency and accuracy, which Hadoop does not provide.
6. Hadoop is more than just disconnected storage.
7. Hadoop is a varied collection of many open source projects.
8. Understanding multiple technologies and hand-coding the integration between them is difficult.
9. Significant effort is wasted on simple tasks like data ingestion and ETL (Extract, Transform, Load).
> 1. Issue with Small Files
Hadoop is not suitable for small data. HDFS (Hadoop Distributed File System) lacks the ability to efficiently support random reading of small files because of its high-capacity design. Small files are a major problem in HDFS. A small file is significantly smaller than the HDFS block size (default 128 MB). If we store vast numbers of such small files, HDFS cannot handle them well, since HDFS works best with a small number of large files for storing large data sets, rather than with several small files.
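To see why many small files strain HDFS, the rough arithmetic below (a Python sketch, not part of Hadoop; the `namenode_objects` helper and its per-object accounting are simplifying assumptions) compares the metadata entries the NameNode must track for one large file versus the same bytes spread over thousands of small files, assuming the default 128 MB block size.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size in bytes

def namenode_objects(file_sizes):
    """Each file costs one file entry plus one entry per occupied block."""
    total = 0
    for size in file_sizes:
        blocks = max(1, math.ceil(size / BLOCK_SIZE))
        total += 1 + blocks  # 1 file entry + its block entries
    return total

one_big = namenode_objects([1024 * 1024 * 1024])    # a single 1 GB file
many_small = namenode_objects([128 * 1024] * 8192)  # same data as 128 KB files

print(one_big)     # 9     -> 1 file entry + 8 block entries
print(many_small)  # 16384 -> 8192 file entries + 8192 block entries
```

The same gigabyte of data costs the NameNode three orders of magnitude more metadata entries when stored as 128 KB files, which is why HDFS favours a small number of large files.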
> 3. Support for Batch Processing only
Hadoop supports batch processing only; it does not process streamed data, and hence overall performance is slower. The MapReduce framework of Hadoop does not leverage the memory of the Hadoop cluster to the maximum.
> 4. No Real-time Data Processing
Apache Hadoop is designed for batch processing: it takes a vast amount of data as input, processes it, and generates the output. Although batch processing is very efficient for processing high volumes of data, the output can be delayed depending on the size of the data being processed and the computational power of the system, so Hadoop is not appropriate for real-time data processing.
> 5. No Delta Iteration
Hadoop is not efficient for iterative processing, as Hadoop does not support cyclic data flow, i.e., a chain of phases in which the output of an earlier phase is the input to the succeeding phase.
> 6. Latency
The Hadoop MapReduce framework is comparatively slower, since it supports various formats, structures and huge volumes of data. In MapReduce, Map takes a set of data and converts it into another set of data, where individual elements are broken down into key-value pairs, and Reduce takes the output from the Map as input and processes it further. MapReduce requires a lot of time to accomplish these tasks, thereby increasing latency.
> 7. Security
Hadoop is challenging in managing complex applications. If the user managing the platform doesn't know how to secure it, the data can be in danger. At the storage and network levels, Hadoop is missing encryption, which is a major point of concern. Hadoop supports Kerberos authentication, which is hard to manage. HDFS supports access control lists (ACLs) and a traditional file permissions model. However, third-party vendors have enabled organizations to leverage Active Directory Kerberos and LDAP for authentication.
> 8. No Abstraction
Hadoop does not have any kind of abstraction; thus, MapReduce developers need to hand-code each operation, which makes it difficult to work with.
> 9. No Caching
Hadoop is not efficient at caching. In Hadoop, MapReduce cannot cache the intermediate data in memory for further requirements, which reduces the performance of Hadoop.
> 10. Lengthy Lines of Code
Hadoop has approximately 1,20,000 lines of code; the number of lines produces a corresponding number of bugs, and it takes more time to execute the programs.

GQ. Explain how Hadoop goals are covered in the Hadoop distributed file system.
HDFS and MapReduce are the two major components of Hadoop: HDFS is useful from the infrastructure point of view, and MapReduce is useful from the programming point of view. To understand the concept behind the scalability of Hadoop from a single node to a thousand-node cluster, HDFS is very useful. HDFS covers the goals of Hadoop as follows :
1. Handling of large datasets : As Hadoop supports distributed storage and processing of large data sets, the HDFS architecture is designed to be most useful for storing and retrieving large data.
2. Fault tolerance and data replication : In HDFS, data files are divided into big blocks of data, and for fault tolerance each block is stored on three nodes, of which two nodes are from the same rack and one is from a different rack. A block is considered as the unit of data stored on each data node. The redundancy of data leads to robustness, fault detection, quick recovery of data, and scalability.
3. Commodity Hardware : HDFS assumes that the cluster will consist of common hardware such as less expensive or average machines. An important feature of Hadoop is that HDFS can be installed on any average commodity hardware; installation and execution of Hadoop do not require any supercomputers or high-end hardware. This reduces the overall cost.
4. Data Locality concept : Data locality means locating the computation logic near the data, instead of moving data to the computation logic or application space. This reduces the bandwidth utilization in the system. HDFS provides interfaces for applications to relocate themselves closer to the location where the data resides.
Digital India
• Digital India is a flagship programme of the Government of India with a vision to transform India into a digitally empowered society and build a knowledge economy.
• Its vision is mostly focused on three main areas :
(i) Digital infrastructure as a basic utility to every citizen,
(ii) Governance and services on demand, and
(iii) Digital empowerment of the citizens of India.
• Subsequently, to fulfil this vision, the availability of different data resources has increased.
• In Big Data, a huge amount of data is collected and stored, whether it is in structured, unstructured or semi-structured form. This data may contain various business-related transactions, emails, images, audio, surveillance camera videos, logs, unstructured data from blogs or messages from social media, medical data, bank transaction data, e-Governance data, media data, defence-related data, and data from IT sectors.
• If this data is efficiently cleaned and then analyzed, it can be helpful in data visualization for business trade for various enterprises or organizations.
• This digital technology has made the progress of enterprises or organizations much easier. Data collected from tweets, various blogs and other social networking sites can be useful for an enterprise or organization to analyze consumers' views. It helps them to understand the needs and choices of their customers.
The V's attributes of Big Data : Volume, Velocity, Variety, Veracity and Value.
1. Volume : Means the huge amount of data a data set contains.
2. Velocity : In big data, velocity refers to the speed at which data is generated from various digital platforms, such as online systems, sensors, social media, live web capture and messages sent via SMS.
3. Variety : Refers to the different types of data that the data set contains.
4. Veracity : Means complexity, which indicates that big data must be able to transfer via multiple different data centers, such as the cloud and geographical zones.
5. Value : Is a measure of the visible or invisible benefits gained by the use of big data; without it, big data is not useful.

2.3 WORKING WITH APACHE SPARK

• Spark is intended to enhance, not replace, the Hadoop stack. From day one, Spark was designed to read and write data from and to HDFS, as well as other storage systems, such as HBase and Amazon's S3. Hence, Hadoop users can enrich their processing capabilities by combining Spark with Hadoop MapReduce, HBase, and other big data frameworks.
• Second, we have constantly focused on making it as easy as possible for every Hadoop user to take advantage of Spark's capabilities. No matter whether you run Hadoop 1.x or Hadoop 2.0 (YARN), and no matter whether you have administrative privileges to configure the Hadoop cluster or not, there is a way for you to run Spark! In particular, there are three ways to deploy Spark in a Hadoop cluster : standalone, YARN, and Spark in MapReduce (SIMR).
• Standalone deployment : With the standalone deployment, one can statically allocate resources on all or a subset of machines in a Hadoop cluster and run Spark side by side with Hadoop MapReduce. The user can then run arbitrary Spark jobs on the HDFS data.
suoddns wep 3M %4LHadoop users who have
aa
2.
Analytics (MU-Som, 8-11) (nto. to Big Data Fram )__Pg.n0.
‘There are different kinds of management systems
ment
. deploy Hadoop Yam can simpy
deployed or are pani (© Sr hasta? saa ‘ RDBMS oar | NoSQL
without any pre-ins
Spark on ee ‘allows users t0 easily integrate cad (Relational (Online Analytical (Not only
access required Me ake advantage of he fall ONE Sy Database Processing) SQL)
their Hadoop stack running on top of Spark, Management
as well as of other components : System)
spark In MapReduce (SIMR): For the Hadoop fe 2
running YARN yet, another option, in or ss sd
deployment, is to we SIMR 1 launch Spark jobs
MapReduce, Wh SIM, user can Sart oxPSTimenig
Spark and use its shell within a couple of minutes
downloading it! This wemendously lowers the by
deployment, and lets vitally everyone play wit Spark,
2.4.1 Introduction to NoSQL

• A database is a systematic collection of data, and a database management system supports the storage and manipulation of data, which makes data management easy. For example, an online telephone directory uses a database to store data of people, phone numbers and other contact details, which can be used by a service provider to manage billing, client-related issues, handling of data, etc. That means a database management system provides the mechanism to store and retrieve the data.
• NoSQL refers to all databases and data stores that are not based on the relational database management system (RDBMS) principles. NoSQL is the new set of databases that has emerged in the recent past as an alternative solution to relational databases.
• Carlo Strozzi introduced the term NoSQL to name his file-based database in 1998.
• NoSQL does not represent a single product or technology, but a group of products and various related data concepts for storage and management. NoSQL is an approach to database management that can accommodate a wide variety of data models, including key-value, document, column and graph formats.
• A NoSQL database generally means that it is non-relational, distributed, flexible and scalable. So, we can sum it up as : NoSQL is an approach to database design that provides flexible schemas for the storage and retrieval of data, beyond the traditional table structures found in relational databases.
• It relates to large data sets accessed and manipulated on a Web scale.
2.4.2 Brief History of NoSQL Databases

• 1998 - Carlo Strozzi uses the term NoSQL for his lightweight, open-source relational database.
• 2000 - Graph database Neo4j is launched.
• 2004 - Google BigTable is launched.
• 2005 - CouchDB is launched.
• 2007 - The research paper on Amazon Dynamo is released.
• 2008 - Facebook open sources the Cassandra project.
• 2009 - The term NoSQL is reintroduced.
2.4.3 Why NoSQL ?

• The concept of NoSQL databases became popular with internet giants like Google, Facebook, Amazon, etc. who deal with huge volumes of data. The system response time becomes slow when we use an RDBMS for massive volumes of data. To resolve this problem, we could scale up our systems by upgrading the existing hardware, but this process is expensive. The alternative for this issue is to distribute the database load over multiple hosts whenever the load increases; this method is known as scaling out.
• NoSQL databases are non-relational, so they scale out better than relational databases, as they are designed with web applications in mind. A NoSQL database is exactly the type of database that can handle all sorts of semi-structured data, unstructured data, rapidly changing data and big data. So, to resolve the problems related to volume and semi-structured data, NoSQL databases have emerged.
2.4.4 CAP Theorem

• The CAP theorem plays an important role in NoSQL databases. The CAP theorem, also called Brewer's theorem, states that it is impossible for a distributed data store to offer more than two out of the three guarantees : Consistency, Availability and Partition tolerance.
• So basically, some NoSQL databases offer consistency and partition tolerance, while some offer availability and partition tolerance. Partition tolerance is common, as NoSQL databases are distributed in nature; so, based on the requirement, we can choose which NoSQL database has to be used. Different types of NoSQL databases are available based on data models.

Fig. 2.4.1 : CAP Property
• Consistency : This means that the data in the database remains consistent after the execution of an operation. For example, after an update operation all clients see the same data.
• Availability : This means that the system is always on (service guarantees availability); there is no downtime.
• Partition Tolerance : This means that the system continues to function even if the communication among the servers is unreliable, i.e., the servers may be partitioned into multiple groups that cannot communicate with one another.
• Theoretically, it is impossible to fulfil all 3 requirements. CAP provides the basic requirements for a distributed system to follow 2 of the 3 requirements. Therefore, all the current NoSQL databases follow different combinations of C, A, P from the CAP theorem. Here is a brief description of the three combinations CA, CP, AP :
• CA - Single site cluster, therefore all nodes are always in contact. When a partition occurs, the system blocks.
• CP - Some data may not be accessible, but the rest is still consistent/accurate.
• AP - The system is still available under partitioning, but some of the data returned may be inaccurate.
The use of the word consistency in CAP and its use in ACID do not refer to the same identical concept. In CAP, the term consistency refers to the consistency of the values in different copies of the same data item in a replicated distributed system. In ACID, it refers to the fact that a transaction will not violate the integrity constraints specified on the database schema.
2.4.5 Characteristics / Features of NoSQL

UQ. Describe characteristics of a NoSQL database.

1. Non-relational
• NoSQL databases never follow the relational model.
• They never provide tables with flat fixed-column records.
• They work with self-contained aggregates or BLOBs.
• They don't require object-relational mapping and data normalization.
• There are no complex features like query languages, query planners, referential integrity, joins, or ACID.
2. Open-source
• NoSQL databases don't require expensive licensing fees and can run on inexpensive hardware, rendering their deployment cost-effective.
3. Schema-free
• NoSQL databases are either schema-free or have relaxed schemas.
• They do not require any sort of definition of the schema of the data.
• They offer heterogeneous structures of data in the same domain.
4. Simple API
• Offers easy-to-use interfaces for storage and querying of the data provided.
• APIs allow low-level data manipulation and selection methods.
• Text-based protocols, mostly used with HTTP REST with JSON.
• Mostly no standard-based query language.
• Web-enabled databases running as internet-facing services.
5. Distributed
• Multiple NoSQL databases can be executed in a distributed fashion.
• Often the ACID concept can be sacrificed for scalability and throughput.
• Mostly no synchronous replication between distributed nodes; instead, asynchronous Multi-Master Replication, peer-to-peer, or HDFS replication is used.
• Only eventual consistency is provided.
• Shared-Nothing Architecture : this enables less coordination and higher distribution.
2.4.6 Advantages and Disadvantages of NoSQL

Advantages of NoSQL
1. Scale (horizontal) : SQL databases are vertically scalable. This means that you increase the load capacity of a single server by increasing things like RAM, CPU or SSD. On the other hand, NoSQL databases are horizontally scalable. This means that you handle more traffic by sharding, i.e., adding more servers to your NoSQL database.
2. Simple data model (fewer joins).
3. Handles high (streaming) volume.
4. Reliability.
5. Schema-less (no modelling or prototyping required).
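The horizontal-scaling idea above can be sketched with a toy key-value store that shards keys across several in-memory "servers" by hash. This is an illustrative assumption only (the `ShardedStore` class is hypothetical), not the API of any real NoSQL product.

```python
import hashlib

class ShardedStore:
    """Toy key-value store: each key is hashed onto one of N shards,
    so adding shards spreads load instead of upgrading one machine."""

    def __init__(self, num_shards):
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, key):
        # Deterministic hash so the same key always lands on the same shard.
        digest = hashlib.md5(key.encode()).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key):
        return self._shard_for(key).get(key)

store = ShardedStore(num_shards=3)
store.put("user:42", {"name": "Asha"})
print(store.get("user:42"))  # {'name': 'Asha'}
```

Routing by key hash is what lets a NoSQL cluster scale out: each server holds only its own shard, with no coordination needed on reads or writes of a single key.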
• Building applications on traditional relational databases places large demands on software developers : the persistence layer has to generate the SELECT, INSERT, UPDATE and DELETE statements that read and write data to and from the database, object-relational mapping is required, and this process is expensive and associated with slowdowns in development and testing schedules. Rapid change when developing new or modified applications is difficult, and experienced staff are needed to deal with large amounts of data in complex processes.
• All these RDBMS limits are best overcome by NoSQL databases. These databases are schema-less and can be scaled easily. They can accommodate application changes easily and handle any volume of data efficiently. This agility has become the business driver for NoSQL databases.

2.5 NOSQL DATA ARCHITECTURE PATTERNS

UQ. What are the different architectural patterns in NoSQL ?
UQ. Explain graph data store and column family store patterns with relevant examples.

• NoSQL databases were born out of the limits of traditional relational or SQL databases, which use tables and the relationships established among them. Developers favoured NoSQL databases because they didn't require an upfront schema and they were able to go straight to development. This 'ad-hoc' approach to organizing data has arguably been NoSQL's greatest selling point, which continues to appeal to organizations that need to store, retrieve, and analyse rapidly changing unstructured data.
• The data stored in NoSQL follows any of the four data architecture patterns :
(a) Key-Value Stores
(b) Column Family (BigTable) Stores
(c) Document Stores
(d) Graph Stores

2.5.1 Key-Value Stores

GQ. Design an enterprise data architectural NoSQL pattern. Identify two applications that can use it.

• One of the most basic NoSQL database models is the key-value store. The data is collected in the pattern of key-value pairs, as in a simple map.
• A series of strings, integers or characters is typically the key, and the value is connected or co-related to the key.
• NoSQL databases for key-value pairs typically store information in a hashtable where each key is unique, and the value may be of any form (JavaScript Object Notation (JSON), Binary Large Object (BLOB), string, etc.).
• Application : This style of architecture is commonly used for storing session information, user profiles and shopping carts. Among its merits are its ability for wide management of data volumes and of heavy read/write loads.

Fig. 2.5.1 : An example of Key-Value store

• Keys and values are flexible. Keys can be image names, web page URLs, or file path names that point to values like binary images, HTML web pages, and PDF documents.
• A constraint associated with key-value store databases is the complexity in handling queries which attempt to include multiple key-value pairs; such queries may delay output and may cause data to clash with many-to-many relationships.

GQ. State example of any two key-value databases. (2 Marks)
Examples here are :
• DynamoDB (developed by Amazon).
• Berkeley DB (developed by Oracle).
• Redis : An advanced open-source key-value store, also referred to as a data structure server because keys can include strings, hashes, lists, sets and sorted sets. This product, written in C/C++, is searingly quick, which makes it perfect for data collection in real time.
• Riak : A powerful, open-source, distributed database that predictably scales capability and simplifies creation by prototyping, developing, and deploying applications quickly. Written in Erlang and C, this technology gives transparent fault-tolerant/fail-over functionality and a comprehensive and versatile API, perfect for point-of-sale and factory control systems.
• VoltDB : A scalable in-memory database that offers complete transactional ACID consistency and ultra-high throughput; it refers to itself as NewSQL. This technology relies on segmentation and replication to achieve high-availability data snapshots and durable command logging using Java stored procedures (for crash recovery), making it ideal for capital markets, digital networks, network services, and online gaming.
som 847) (nto. t0
goat Anais MUS fig Oat Arayics U-Ser 67) (rot ig Ona Fram). no. 2-36
atabase
0:
va 25.2 column store D3 + Basically, columns ate in this son of storage mode. Data is
readily available and it is possible to perform queries such as
Number, AVERAGE, COUNT on columns easly
+The setbacks for this system includes: transactions should be
avoided or not supported, queries can decrease high performance
with table joins, record updates and deletes reduce storage
efficiency, and it can be difficult to design efficient
partitioning/indexing schemes.
Q._ State example of any two column store databases (2 Marks)
Examples here are
+ HBase : HBase is a distributed, portable, Big Data Store
modelled after Google's BigTable technology, the Hadoop
database
ef «Google's BigTable
a + Cassandra : An open-source distributed database management
ae system built to manage very large volumes of data scattered over
Seecaa | several servers without a single point of failure while delivering a
Sear : highly accessible service.
Fig. 25.2: An Example of Column Store + Written in Java, this product is best for non-transactional real-
time data analysis with linear scalability and proven fault-
tolerance combined with column indexes.
+ This pattem employs data storage in individual cells that is
further divided into columns, rather than storing data in relational
tuples
+ Databases that are column-oriented operate only on columns,» In the form of key-value pairs, the record database fetches and
They together store vast quantities of data in columns. The
column format and tiles will diverge from one row to another.
accumulates information, but here the values are called
documents. A complicated data structure can be represented as a
+ Esch column is handled differently, but stil, like conventional =
databases, each individual column will contain several other
columns (Niharika, 2020)
+ Itis hierarchical version of key-value databases,
che Pts
!
:
| %& 2.5.3 Document Database
:
(tm 22-29) 48193)
Tecr-Noo Publications (MU: 22:29) (M191)
‘Scanned with CamScannerrE. t~—”
Big Data Analytics (MU-Sem.
rr) (inv. to Big Data Fram). 9.00. (2
The document can be in text form’ arrays, strings, JSON
(JavaScript Object XML (Extensible Markup
Language) or any other forma
«The use of nested documents
efficient since most of the gen
the form of JSONs and is unstructured.
Notation),
js immensely popular. It is highly
erated information is generally in
Fig. 2.5. : An Example of Document
pig pots Arbvics MU-Sem 8) tn 3540
ig Data Fram.).Pg no.. (2-38)
sphis format is extemely useful ang
stud daa and iS Spl to reine serine for semi
Sage: THE drawbacks associated amy ae bane fon
serrening for of handing mile dy is system includes the
pair
‘output for the reduce step,
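A minimal sketch of the document pattern just described, using plain Python dicts serialized as JSON. The `insert`/`find` helpers are hypothetical and stand in for a real document database's API; the point is that, unlike a plain key-value store, the value (document) is schema-free yet queryable by its fields.

```python
import json

docs = {}  # key -> JSON document, stored serialized as a document DB might

def insert(doc_id, document):
    """Store a schema-free document under a key."""
    docs[doc_id] = json.dumps(document)

def find(field, value):
    """Return every document whose given field matches value."""
    return [json.loads(d) for d in docs.values()
            if json.loads(d).get(field) == value]

# Documents in the same store need not share a schema ("tags" is optional).
insert("u1", {"name": "Ravi", "city": "Mumbai"})
insert("u2", {"name": "Meera", "city": "Pune", "tags": ["admin"]})

print(find("city", "Pune"))  # [{'name': 'Meera', 'city': 'Pune', 'tags': ['admin']}]
```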
(2) A Reduce task processes the output of a map task. Similar to the map stage, all reduce tasks occur at the same time, and they work independently. The data is aggregated and combined to deliver the desired output. The final result is a reduced set of <key, value> pairs which MapReduce, by default, stores in HDFS.

3.1.3 How Hadoop Map and Reduce Work Together
• As the name suggests, MapReduce works by processing input data in two stages - Map and Reduce. To demonstrate this, we will use a simple example of counting the number of occurrences of words in each document.
• The final output we are looking for is : how many times the words Apache, Hadoop, Class, and Track appear in total in all documents.
• For illustration purposes, the example environment consists of three nodes. The input contains six documents distributed across the cluster. We will keep it simple here, but in real circumstances there is no limit. You can have thousands of servers and billions of documents.
1. First, in the Map stage, the input data (the six documents) is split and distributed across the cluster (the three servers). In this case, each map task works on a split containing two documents. During mapping, there is no communication between the nodes; they perform independently.
2. Then, map tasks create a <key, value> pair for every word. These pairs show how many times a word occurs. A word is the key, and the value is its count. For example, one document contains three of the four words we are looking for : Apache 7 times, Class 8 times, and Track 6 times. The key-value pairs in one map task output look like this :
<apache, 7>  <class, 8>  <track, 6>
This process is done in parallel tasks on all nodes for all documents and gives a unique output.
3. After input splitting and mapping completes, the outputs of every map task are shuffled. This is the first step of the Reduce stage. Since we are looking for the frequency of occurrence of four words, there are four parallel Reduce tasks. The reduce tasks can run on the same nodes as the map tasks, or they can run on any other node. The shuffle step ensures the keys Apache, Hadoop, Class, and Track are sorted for the reduce step. This process groups the values by keys in the form of <key, value-list> pairs.
4. In the reduce step of the Reduce stage, each of the four tasks processes a <key, value-list> to provide a final key-value pair. The reduce tasks also happen at the same time and work independently.
In our example from the diagram, each of the four reduce tasks gets its own key (apache, hadoop, class or track), sums the occurrence counts for that word across all map outputs, and emits a single final <word, total> pair.
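The map → shuffle → reduce flow described above can be simulated in a few lines of single-process Python. This is a sketch only: real Hadoop distributes these phases across nodes, and the two sample documents here are assumptions, not the six documents of the example.

```python
from collections import defaultdict

documents = [
    "apache hadoop track",
    "hadoop class class track",
]

# Map: emit a (word, 1) pair for every word occurrence.
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle: group all emitted values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: sum the grouped values for each key.
result = {word: sum(counts) for word, counts in grouped.items()}
print(result)  # {'apache': 1, 'hadoop': 2, 'track': 2, 'class': 2}
```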
• A combiner always works in between the Mapper and the Reducer. The output produced by the Mapper is the intermediate output in terms of key-value pairs, which is massive in size.
• If we directly feed this huge output to the Reducer, it will result in increased network congestion. So, to minimize network congestion, we place a combiner in between the Mapper and the Reducer.
• These combiners are also known as semi-reducers. It is not necessary to add a combiner to your Map-Reduce program; it is optional.
• A combiner is also a class in our Java program, like the Map and Reduce classes, that is used in between the Map and Reduce classes.
• A combiner helps us to produce abstract details or a summary of very large datasets. When we process or deal with very large datasets using Hadoop, a combiner is very much necessary, resulting in the enhancement of overall performance.

3.2.1 How does a Combiner work ?

• In the above example, we can see that two Mappers are containing different data. The main text file is divided between two different Mappers. Each Mapper is assigned to process a different line of our data. In our above example, we have two lines of data, so we have two Mappers, one to handle each line.
• The Mappers produce the intermediate key-value pairs, where the name of the particular word is the key and its count is its value. For example, for the data "Geeks For Geeks For" the key-value pairs are shown below :
Key-value pairs generated for the data :
(Geeks, 1)
(For, 1)
(Geeks, 1)
(For, 1)
• The key-value pairs generated by the Mapper are known as the intermediate key-value pairs, or the intermediate output of the Mapper.
• Now we can minimize the number of these key-value pairs by introducing a combiner for each Mapper in our program. In our case, we have 4 key-value pairs generated by each Mapper. Since these intermediate key-value pairs are not ready to be fed directly to the Reducer (because that can increase network congestion), the combiner will combine these intermediate key-value pairs before sending them to the Reducer.
• The combiner combines these intermediate key-value pairs per their key. For the above data, the combiner will partially reduce them by merging the pairs with the same key and generate new key-value pairs as shown below :
Partially reduced key-value pairs with combiner :
(Geeks, 2)
(For, 2)
• With the help of the combiner, the Mapper output got partially reduced in terms of size (key-value pairs), which can now be made available to the Reducer for better performance.
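The local merge performed by the combiner can be sketched as follows (illustrative Python, not Hadoop's Java `Combiner` class; `mapper` and `combine` are hypothetical helper names):

```python
from collections import Counter

def mapper(line):
    # Intermediate output of one mapper: one (word, 1) pair per occurrence.
    return [(word, 1) for word in line.split()]

def combine(pairs):
    # Local partial reduce on a single mapper's output, merging pairs by key.
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())

line = "Geeks For Geeks For"
intermediate = mapper(line)       # 4 intermediate pairs
combined = combine(intermediate)  # only 2 pairs travel to the reducer
print(intermediate)  # [('Geeks', 1), ('For', 1), ('Geeks', 1), ('For', 1)]
print(combined)      # [('Geeks', 2), ('For', 2)]
```

Because the combiner halves the pair count here before anything crosses the network, the reducer receives less data while still computing the same totals.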
• Now the Reducer takes the output from the combiners and produces the final output, which is stored in HDFS (Hadoop Distributed File System).

3.2.2 Advantages of Combiners
1. Reduces the time taken for transferring the data from Mapper to Reducer.
2. Reduces the size of the intermediate output generated by the Mapper.
3. Improves performance by minimizing network congestion.

3.2.3 Disadvantages of Combiners
1. The intermediate key-value pairs generated by Mappers are stored on local disk, and the combiners run later on to partially reduce the output, which results in expensive disk input/output.
2. The Map-Reduce job cannot depend on the function of the combiner because there is no guarantee of its execution.
3.3 MATRIX-VECTOR MULTIPLICATION BY MAPREDUCE

UQ. Write pseudo code for matrix-vector multiplication by MapReduce. Illustrate with an example showing all the steps. What happens when the vector does not fit in memory in matrix-vector multiplication ?
UQ. Write MapReduce pseudo code to multiply two matrices. Illustrate the procedure on an example.

• MapReduce is a technique in which a huge program is subdivided into small tasks that run in parallel to make computation faster and save time; it is mostly used in distributed systems. It has 2 important parts :
• Mapper : It takes the raw data input and organizes it into key-value pairs. For example, in a dictionary, you search for the word 'Data' and its associated meaning is 'facts and statistics collected together for reference or analysis'. Here the Key is 'Data' and the Value associated with it is 'facts and statistics collected together for reference or analysis'.
• Reducer : It is responsible for processing the data in parallel and producing the final output.
Algorithm 1 : The map function
1. for each element m_ij of M do
2.     produce (key, value) pair as ((i, k), (M, j, m_ij)), for k = 1, 2, 3, ... up to the number of columns of N
3. for each element n_jk of N do
4.     produce (key, value) pair as ((i, k), (N, j, n_jk)), for i = 1, 2, 3, ... up to the number of rows of M
5. return the set of (key, value) pairs in which each key (i, k) has a list with values (M, j, m_ij) and (N, j, n_jk) for all possible values of j

Algorithm 2 : The reduce function
1. for each key (i, k) do
2.     sort the values beginning with M by j into list_M
3.     sort the values beginning with N by j into list_N
4.     multiply m_ij and n_jk for the j-th value of each list
5.     sum up m_ij * n_jk
6. return ((i, k), Σ_j m_ij * n_jk)
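The two algorithms can be simulated in plain Python as a sketch (not Hadoop code): 0-based indices replace the 1-based i, j, k above, and a dictionary stands in for the shuffle that groups pairs by key.

```python
from collections import defaultdict

def map_matrices(M, N):
    # Algorithm 1: emit ((i, k), (name, j, value)) pairs.
    rows_M, cols_M = len(M), len(M[0])   # cols_M must equal rows of N
    cols_N = len(N[0])
    pairs = []
    for i in range(rows_M):
        for j in range(cols_M):
            for k in range(cols_N):          # replicate m_ij for every k
                pairs.append(((i, k), ('M', j, M[i][j])))
    for j in range(len(N)):
        for k in range(cols_N):
            for i in range(rows_M):          # replicate n_jk for every i
                pairs.append(((i, k), ('N', j, N[j][k])))
    return pairs

def reduce_pairs(pairs):
    # Shuffle: group values by their (i, k) key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Algorithm 2: sort each list by j and sum m_ij * n_jk.
    result = {}
    for key, values in groups.items():
        list_M = sorted(v for v in values if v[0] == 'M')
        list_N = sorted(v for v in values if v[0] == 'N')
        result[key] = sum(m[2] * n[2] for m, n in zip(list_M, list_N))
    return result

M = [[1, 2], [3, 4]]
N = [[5, 6], [7, 8]]
print(reduce_pairs(map_matrices(M, N)))
# {(0, 0): 19, (0, 1): 22, (1, 0): 43, (1, 1): 50}
```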
Let us consider the following matrix multiplication example to visualize MapReduce.
Consider the following matrices :

A = | 1  2 |        B = | 5  6 |
    | 3  4 |            | 7  8 |

Here matrix A is a 2 x 2 matrix, which means the number of rows (i) = 2 and the number of columns (j) = 2. B is also a 2 x 2 matrix, where the number of rows (j) = 2 and the number of columns (k) = 2. Each cell of the matrices is labelled A_ij and B_jk, e.g. element 3 in matrix A is called A_21, i.e. 2nd row, 1st column. Now, one-step matrix multiplication has 1 Mapper and 1 Reducer. The formulas are :

Mapper for Matrix A : (k, v) = ((i, k), (A, j, A_ij)) for all k
Mapper for Matrix B : (k, v) = ((i, k), (B, j, B_jk)) for all i

Therefore, computing the mapper for Matrix A :
# Here all dimensions are 2, therefore when k = 1, i can have
# 2 values 1 & 2, and each case can have 2 further
# values of j = 1 and j = 2. Substituting all values
# in the formula :

i=1  k=1  j=1  ((1, 1), (A, 1, 1))
          j=2  ((1, 1), (A, 2, 2))
     k=2  j=1  ((1, 2), (A, 1, 1))
          j=2  ((1, 2), (A, 2, 2))
i=2  k=1  j=1  ((2, 1), (A, 1, 3))
          j=2  ((2, 1), (A, 2, 4))
     k=2  j=1  ((2, 2), (A, 1, 3))
          j=2  ((2, 2), (A, 2, 4))
Computing the mapper for Matrix B :

j=1  k=1  i=1  ((1, 1), (B, 1, 5))
          i=2  ((2, 1), (B, 1, 5))
     k=2  i=1  ((1, 2), (B, 1, 6))
          i=2  ((2, 2), (B, 1, 6))
j=2  k=1  i=1  ((1, 1), (B, 2, 7))
          i=2  ((2, 1), (B, 2, 7))
     k=2  i=1  ((1, 2), (B, 2, 8))
          i=2  ((2, 2), (B, 2, 8))
Reducer : (k, v) = (i, k) => make sorted lists A_list and B_list
          ((i, k)) => summation (A_ij * B_jk) for j
          Output => ((i, k), sum)

Therefore, computing the reducer :

# We can observe from the Mapper computation
# that 4 keys are common : (1, 1), (1, 2),
# (2, 1) and (2, 2).
# Make a separate list for Matrix A &
# B with the adjoining values taken from
# the Mapper step above :
(1, 1) => A_list = ((A, 1, 1), (A, 2, 2))
          B_list = ((B, 1, 5), (B, 2, 7))
          Now A_list x B_list : ((1*5) + (2*7)) = 19   ...(i)

(1, 2) => A_list = ((A, 1, 1), (A, 2, 2))
          B_list = ((B, 1, 6), (B, 2, 8))
          Now A_list x B_list : ((1*6) + (2*8)) = 22   ...(ii)

(2, 1) => A_list = ((A, 1, 3), (A, 2, 4))
          B_list = ((B, 1, 5), (B, 2, 7))
          Now A_list x B_list : ((3*5) + (4*7)) = 43   ...(iii)

(2, 2) => A_list = ((A, 1, 3), (A, 2, 4))
          B_list = ((B, 1, 6), (B, 2, 8))
          Now A_list x B_list : ((3*6) + (4*8)) = 50   ...(iv)

From (i), (ii), (iii) and (iv) we conclude that :
((1, 1), 19)
((1, 2), 22)
((2, 1), 43)
((2, 2), 50)

Therefore, the final matrix is

| 19  22 |
| 43  50 |
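As a sanity check, the same matrix follows from the ordinary definition c_ik = Σ_j a_ij * b_jk, computed directly in plain Python (0-based indices):

```python
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

# C[i][k] = sum over j of A[i][j] * B[j][k]
C = [[sum(A[i][j] * B[j][k] for j in range(2)) for k in range(2)]
     for i in range(2)]
print(C)   # [[19, 22], [43, 50]]
```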
3.4 RELATIONAL ALGEBRA OPERATIONS

1. Selection
2. Projection
3. Union & Intersection
4. Natural Join
5. Grouping & Aggregation
Selection

• Apply a condition C to each tuple in the relation and produce as output only those tuples that satisfy C.
• The result of this selection is denoted by σ_C(R).
• Selections really do not need the full power of MapReduce; they can be done conveniently in the map portion alone.
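A map-only selection can be sketched in plain Python: each Mapper tests the condition C on a tuple t and emits (t, t) if t satisfies C, while the Reducer is just the identity. The relation R and condition C below are illustrative assumptions:

```python
def selection_mapper(tuple_, condition):
    # Emit (t, t) if tuple t satisfies C, otherwise nothing;
    # with an identity reducer, the map phase alone does the work.
    return [(tuple_, tuple_)] if condition(tuple_) else []

R = [(1, 'a'), (2, 'b'), (3, 'a')]       # a toy relation
C = lambda t: t[1] == 'a'                # the selection condition
selected = [k for t in R for k, _ in selection_mapper(t, C)]
print(selected)   # [(1, 'a'), (3, 'a')]
```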