Selected Topics in Computer Science (CoSc4181)
Lecture 02: Basic concepts of Big Data
Department of Computer Science
Dilla University
By: Tsegalem G/hiwot
2022 G.C
Contents of Big Data
• Understand the basics of Big Data
• Evolution of Big Data
• Characteristics of Big Data
• Big Data sources
• Benefits of Big Data
• Challenges of Big Data and solution
• The main concepts of Hadoop
• Hadoop components and ecosystem
What Is Big Data?
Big Data is a term used to describe a collection of data that is
huge in size and yet growing exponentially with time.
It generates value from many different sources by processing very
large quantities of digital information that cannot be analyzed
with traditional computing techniques.
Handling bigger data requires different approaches:
– Techniques, tools and architecture
Big data is the realization of storing, processing, and analyzing data
that was previously ignored due to the limitations of traditional data
management technologies.
Big data is all data (structured, semi-structured and unstructured).
Factors for Evolution of Big Data
Evolution of technology
• The world changed from the telephone to the cellphone
• From stand-alone computers to networked computers (the Internet)
IoT (an estimated 50 billion IoT devices by 2020)
Social media e.g. Instagram, Facebook, YouTube etc.
Others, e.g. Amazon, Flipkart, etc.
Examples of Big Data
Walmart handles more than 2.5 petabytes of data.
NASA Climate Simulation 32 PB/Day.
Google processes 20 PB a day (2008)
Facebook stores about 2.5 PB of user data, and it grows daily.
eBay has 6.5 PB of user data + 50 TB/day.
Twitter produces over 90 million tweets per day.
How can you avoid big data?
Pay cash for everything!
Never go online!
Don’t use a Cellphone!
Don’t fill any online prescriptions!
Never leave your house!
Characteristics of Big Data (5V)
Volume
Velocity
Variety
Veracity
Value
Volume(Scale)
Refers to the quantity of generated and stored data.
The name Big Data itself is related to size which is enormous.
Size of data plays a very crucial role in determining value out
of data.
Hence, volume is one characteristic which needs to be
considered while dealing with big data.
Velocity(Speed)
Refers to the speed of generation of data.
Data is live streaming or in motion.
• Example: medical devices that monitor patients need to collect
data, send it to its destination, and have it analysed quickly.
Big data is often available in real-time.
Compared to small data, big data is produced more continuously.
Variety (Complexity)
The type and nature of the data.
Refers to the varied sources and the nature of the data, both
structured and unstructured.
Traditional database systems were designed to address
smaller volumes of structured data, fewer updates, and a
predictable, consistent data structure. However, nowadays
about 80% of the data is unstructured and cannot be put into tables easily.
Veracity/Validity
It refers to the quality and accuracy of data.
Gathered data could have missing pieces, may be inaccurate or may
not be able to provide real, valuable insight.
Veracity, overall, refers to the level of trust there is in the collected data.
Data can sometimes become messy and difficult to use.
A large amount of data can cause more confusion than insights if it's
incomplete.
• For example, concerning the medical field, if data about what drugs
a patient is taking is incomplete, then the patient's life may be
endangered.
Value
This refers to the value that big data can provide, and it relates
directly to what organizations can do with that collected data.
Being able to pull value from big data is a requirement, as the
value of big data increases significantly depending on the insights
that can be gained from it.
Organizations can use the same big data tools to gather and
analyse the data, but how they derive value from that data should
be unique to them.
Big data is all data
Big Data sources
Users
Application
Systems
Sensors
Who’s Generating Big Data
Social media and networks (all of us are generating data)
Scientific instruments (measuring all kinds of data)
Mobile devices (tracking all objects all the time)
Sensor technology and networks (collecting all sorts of data)
Benefits of Big Data
To make better decisions and take meaningful actions at the
right time.
Reduce costs of business processes
Fraud Detection: Financial companies, in particular, use big
data to detect fraud. Data analysts use machine learning
algorithms and artificial intelligence to detect anomalies and
transaction patterns.
Increased productivity
Improved customer service
Increased innovation and development of next-generation products.
Challenges of Big Data
Lack of proper understanding of Big Data.
Data growth issues or Data storage growth.
Insufficient budget/Costs increase too fast.
Nature of Data.
Lack of trained professionals: experts in using the new
technology and dealing with big data, and a lack of analytic skills.
Confusion during Big Data tool selection.
Integrating data from a variety of sources.
Cont…
Traditional systems are useful for structured data but they can’t
manage such a large amount of unstructured data.
80% of the data over the globe is unstructured or available in widely
varying structures, which are difficult to analyse through
traditional systems.
So, how can big data be processed with reasonable cost and time?
To better address the high storage and computational needs of big
data, computer clusters are a better fit.
A computer cluster is a set of computers that work together so that
they can be viewed as a single system.
Do we have a framework for cluster computing?
What is Hadoop?
Hadoop is a collection of open-source software utilities that
facilitate using a network of many computers to solve problems
involving massive amounts of data and computation.
Allows for distributed processing of large datasets across
clusters of commodity computers using a simple programming
model.
Originally designed for computer clusters built from commodity
hardware
Cont…
Commodity computers are cheap and widely available.
These are mainly useful for achieving greater computational
power at low cost.
Similar to data residing in the local file system of a personal
computer, in Hadoop data resides in a distributed file system,
which is called HDFS.
Core-Components of Hadoop
1. Hadoop Distributed File System (HDFS)
2. MapReduce
1. Hadoop Distributed File System (HDFS)
A distributed file system; it is the storage part of Hadoop.
Splits files into a number of blocks and distributes them across
nodes in a cluster.
It then transfers packaged code to the nodes to process the data
in parallel.
This approach takes advantage of data locality, where each node
manipulates the data it has local access to.
Cont.…
Stores multiple copies of data on different nodes
Typically has a single Namenode and a number of Datanodes.
HDFS works on a master/slave architecture.
Master services can communicate with each other, and in the
same way slave services can communicate with each other.
The Namenode is the master node and each Datanode is a
corresponding slave node, and they can talk to each other.
HDFS daemons
Has three services
1. Name node
2. Secondary name node
3. Data node
1. Name/Master node
HDFS consists of only one Namenode; we call it the master node,
and it keeps track of files.
It manages all file system metadata; it holds metadata about
the whole data set.
It contains details such as the file name, ownership, permissions,
the number of blocks, on which Datanode each block is stored,
where the replicas are stored, and other details.
Cont.…
Since there is only one Namenode, it is a single point of failure.
It has a direct connection with the client.
It receives heartbeats and block reports from all the Datanodes.
It handles authentication and authorization.
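To make the idea of Namenode metadata concrete, here is a minimal illustrative sketch in Python of the kind of bookkeeping described above; the file path, node names, and structure are hypothetical, not the Namenode's real internal data structures.

```python
# Illustrative only: a toy picture of the metadata the Namenode keeps for one
# file (hypothetical names; the real Namenode uses its own internal structures).
file_metadata = {
    "path": "/user/student/logs.txt",
    "owner": "student",
    "permissions": "rw-r--r--",
    "block_size_mb": 128,
    "replication": 3,
    # block id -> Datanodes holding a replica of that block
    "blocks": {
        "blk_0001": ["datanode1", "datanode3", "datanode4"],
        "blk_0002": ["datanode2", "datanode3", "datanode5"],
    },
}

# A client asking "where is blk_0001?" is answered from this mapping.
print(file_metadata["blocks"]["blk_0001"])
```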
2. Data node
Stores the actual data in it as blocks.
Known as slave nodes; responsible for serving client reads
and writes.
Receives data as instructed by the Namenode and reports back
with an acknowledgement.
Stores multiple copies of each block.
Every Datanode sends a heartbeat message to the Namenode
every 3 seconds to convey that it is alive.
Cont.…
In this way, when the Namenode does not receive a heartbeat
from a Datanode for 2 minutes, it takes that Datanode as dead
and starts the process of replicating its blocks on some other
Datanode.
Serves read and write requests from clients.
Has no knowledge about the HDFS file system as a whole; it only stores blocks.
Receives data from the client or from a peer Datanode, as directed by the Namenode.
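The heartbeat and dead-node check described above can be sketched as follows; this is a toy simulation with hypothetical Datanode names and timestamps, not real HDFS code.

```python
import time

HEARTBEAT_INTERVAL = 3   # seconds between heartbeats, as stated above
DEAD_TIMEOUT = 120       # 2 minutes of silence => the Datanode is taken as dead

# Last time each Datanode was heard from (hypothetical values for illustration).
last_heartbeat = {
    "datanode1": time.time(),        # reported just now
    "datanode2": time.time() - 200,  # silent for more than 2 minutes
}

def find_dead_nodes(last_heartbeat, now=None):
    """Return the Datanodes the Namenode would consider dead."""
    now = time.time() if now is None else now
    return [node for node, t in last_heartbeat.items() if now - t > DEAD_TIMEOUT]

for node in find_dead_nodes(last_heartbeat):
    # In real HDFS the Namenode would now schedule re-replication of the
    # blocks that lived on this Datanode onto other Datanodes.
    print(node, "is dead; re-replicate its blocks elsewhere")
```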
3. Secondary name node
This node only takes care of checkpointing the file system
metadata which is in the Namenode.
It is also known as the checkpoint node.
It is the helper node for the Name Node.
The checkpointed metadata includes:
• The list of files
• The list of blocks for each file
• The list of Datanodes for each block
• File attributes such as creation time
• A record of every change in the metadata
HDFS Master/Slave Architecture
HDFS Blocks
Data files in HDFS are broken into block-sized chunks, which
are stored as independent units.
• The default block size is 128 MB in Apache Hadoop 2.0
(64 MB in Apache Hadoop 1.0).
Blocks are replicated for reliability.
• Multiple copies of data blocks are created and distributed
on nodes throughout the cluster to keep data highly
available even if a node failure occurs.
Cont.…
One copy is kept on the local node and another copy on a remote rack.
A third copy is kept on the local rack; additional replicas are
randomly placed.
Default replication is 3-fold.
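A toy placement function following the policy described on this slide (one copy on the local node, one on a remote rack, one on the local rack, extras at random); the rack and node names are hypothetical, and real HDFS placement is more involved.

```python
import random

# Hypothetical cluster layout: rack -> nodes.
racks = {
    "rack1": ["node1", "node2", "node3"],
    "rack2": ["node4", "node5", "node6"],
}

def place_replicas(local_node, local_rack, replication=3):
    """Pick replica locations following the policy described above."""
    remote_rack = random.choice([r for r in racks if r != local_rack])
    chosen = [
        local_node,                                                        # copy 1: local node
        random.choice(racks[remote_rack]),                                 # copy 2: remote rack
        random.choice([n for n in racks[local_rack] if n != local_node]),  # copy 3: local rack
    ]
    # Any additional replicas are placed randomly on the remaining nodes.
    remaining = [n for nodes in racks.values() for n in nodes if n not in chosen]
    while len(chosen) < replication and remaining:
        chosen.append(remaining.pop(random.randrange(len(remaining))))
    return chosen[:replication]

print(place_replicas("node1", "rack1"))   # e.g. ['node1', 'node5', 'node2']
```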
Cont…
E.g. a 420 MB file is split into four blocks: 128 MB + 128 MB + 128 MB + 36 MB.
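The split above can be checked with a few lines of Python (a sketch using the Hadoop 2.0 default block size from the previous slide):

```python
# Split a 420 MB file into 128 MB blocks; the last block holds the remainder.
FILE_SIZE_MB = 420
BLOCK_SIZE_MB = 128

blocks = []
remaining = FILE_SIZE_MB
while remaining > 0:
    blocks.append(min(BLOCK_SIZE_MB, remaining))
    remaining -= BLOCK_SIZE_MB

print(blocks)   # [128, 128, 128, 36] -> 4 blocks; the last one is not padded to 128 MB
```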
File write operation on HDFS
To write a file to HDFS, the client needs to interact with the Namenode.
The Namenode provides the addresses of the slaves on which the client
will start writing the data.
As soon as the client finishes writing a block, that slave starts
copying the block to another slave, which in turn copies the
block to another slave (3 replicas by default).
After the required replicas are created, an acknowledgement is
sent back to the client.
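The write pipeline and acknowledgements described above can be sketched as a toy simulation (hypothetical Datanode names, not real HDFS client code):

```python
# Datanode addresses the Namenode would hand to the client (hypothetical).
pipeline = ["datanode1", "datanode4", "datanode7"]

def write_block(block, pipeline):
    """Client writes to the first slave; each slave forwards the block to the
    next one, then acknowledgements travel back to the client."""
    for i, node in enumerate(pipeline):
        print(node, "stores", block)
        if i + 1 < len(pipeline):
            print(node, "forwards", block, "to", pipeline[i + 1])
    for node in reversed(pipeline):
        print("acknowledgement from", node)
    print("client receives the final acknowledgement")

write_block("blk_0001", pipeline)
```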
1. Setting up HDFS pipeline.
2. Write pipe line
3. Acknowledgement in HDFS write
File read operation on HDFS
To read a file from HDFS, the client needs to interact with the Namenode.
The Namenode provides the addresses of the slaves where the file is stored.
The client then interacts with the respective Datanodes to read the file.
The Namenode also provides the client with a token, which the client
shows to the Datanodes for authentication.
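A corresponding toy sketch of the read path (hypothetical block and Datanode names): the client gets block locations from the Namenode and reads each block directly from one of its Datanodes.

```python
# What the Namenode would return for the file's blocks (hypothetical).
block_locations = {
    "blk_0001": ["datanode1", "datanode3"],
    "blk_0002": ["datanode2", "datanode3"],
}

for block, nodes in block_locations.items():
    chosen = nodes[0]   # real HDFS prefers the closest Datanode
    print("read", block, "from", chosen)
```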
HDFS file reading mechanisms
HDFS Features
Distributed
Scalable
Cost effective
Fault tolerant.
High throughput
Others
1. Distributed
Stores huge files in a distributed manner across the network.
This is done by combining commodity (cheap) computers
into a cluster.
The client uses the cluster as if it were a single computer.
2. Scalable
As discussed, HDFS is a distributed file system.
It is easily scalable, both horizontally and vertically.
A few extra nodes help in scaling up the framework.
3. Economical/cost effective
One important feature is that there is no need to buy expensive
server machines, because it is possible to combine cheap machines.
Its systems are highly economical, as ordinary computers can
be used for data processing.
4. Fault tolerance
It stores copies of the data on different machines and is
resistant to hardware failure.
This also holds for failures of a main switch or a whole rack
(a copy is placed on another rack); this is called rack awareness.
Replication is expensive: with 3-fold replication, only about 1/3
of the total storage holds unique data.
If a Namenode failure happens, then a backup Namenode is the solution.
5. High throughput
HDFS stores data in a distributed fashion, which allows the data to
be processed in parallel on a cluster of nodes.
This decreases the processing time and thus provides high
throughput.
Latency: the time to get the first record.
Throughput: the number of records processed per unit of time.
6. Others
Unlimited data storage
High speed processing system
Processing of all varieties of data:
1. Structured
2. Unstructured
3. Semi-structured
2. MapReduce
MapReduce is the processing part of Hadoop.
It processes data in parallel in a distributed environment.
Hadoop distributes the computation over the cluster.
It is a programming framework (library and runtime) for analyzing
data sets stored in HDFS.
MapReduce jobs are composed of two functions: map and reduce.
The Mapper
1. Data is split and sent to worker nodes.
2. Maps are individual tasks that transform input records into
intermediate records.
3. Each block is processed in isolation by a map task called a
mapper.
4. The following diagram shows a simplified flow diagram for the
MapReduce program.
Shuffling
Before the reducer there is a shuffle step, which exchanges the
intermediate outputs of the map tasks and moves them to where
they are required by the reducers.
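A minimal sketch of the shuffle idea in plain Python: the intermediate (key, value) pairs emitted by the map tasks (hypothetical values here) are grouped by key before being handed to the reducers.

```python
from collections import defaultdict

# Intermediate (key, value) pairs emitted by the map tasks (hypothetical).
intermediate = [("even", 4), ("odd", 9), ("even", 16), ("odd", 1)]

grouped = defaultdict(list)
for key, value in intermediate:
    grouped[key].append(value)   # shuffle: collect all values that share a key

print(dict(grouped))   # {'even': [4, 16], 'odd': [9, 1]}
```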
Reducer
Reduces the set of intermediate values that share a key to a
smaller set of values.
All of the values with the same key are presented to a single
reducer together.
Produces the final output.
Example 01: sum of squares
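The original slide showed this example as a diagram; here is a minimal MapReduce-style sketch of it in plain Python, assuming a small hypothetical input.

```python
from functools import reduce

data = [1, 2, 3, 4, 5]                      # hypothetical input records

# Map: each mapper squares its input records.
mapped = [x * x for x in data]              # [1, 4, 9, 16, 25]

# Reduce: a single reducer sums all the intermediate values.
total = reduce(lambda a, b: a + b, mapped)  # 55

print(total)
```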
Example: square of even and odd numbers
Example: square of even and odd and prime numbers
Example: the word count process
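The word count diagram is not reproduced here; the sketch below walks through the same map, shuffle, and reduce stages in plain Python on a hypothetical two-line input.

```python
from collections import defaultdict

lines = ["big data is big", "data is everywhere"]   # hypothetical input split

# Map: emit (word, 1) for every word in every input line.
pairs = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the 1s by word.
grouped = defaultdict(list)
for word, one in pairs:
    grouped[word].append(one)

# Reduce: sum the counts for each word.
counts = {word: sum(ones) for word, ones in grouped.items()}
print(counts)   # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```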
MapReduce Engine
1. Job Tracker
Responsible for accepting jobs from clients, dividing those
jobs into tasks, and assigning those tasks to be executed by
worker nodes.
The JobTracker talks to the NameNode to find out the location of
the data and requests from the NameNode the metadata needed
for processing.
The NameNode in response gives the metadata to the JobTracker.
Cont.…
2. Task tracker
Runs MapReduce tasks.
It is the slave node for the JobTracker, and it takes tasks
from the JobTracker.
It also receives code from the JobTracker.
The process of applying that code to a file is known as the
mapper.
Big Data Life Cycle with Hadoop
1. Ingesting data into the system
The data is ingested/transferred to Hadoop from various
sources such as RDBMSs, other systems, or local files.
Sqoop transfers data from an RDBMS to HDFS, whereas
Flume transfers event data.
2. Processing the data in storage
The second stage is Processing.
In this stage, the data is stored and processed.
Cont.…
The data is stored in HDFS or HBase, and MapReduce
performs the data processing.
3. Computing and analyzing data
The third stage is to analyze.
Here, the data is analyzed by processing frameworks such
as Pig and Hive.
4. Visualizing the results
In this stage, the analyzed data can be accessed by users.
Assignment two
1. List and describe Hadoop ecosystem
2. Write Application of Big Data Analytics
3. What is Network File System?
4. Define the following terms
RPC
SSH
TCP/IP
5. Compare traditional RDBMS and Hbase
6. Advantages and disadvantages of Hadoop.
The end