Selected Topics in Computer Science (CoSc4181)
Lecture 02: Basic concepts of Big Data
Department of Computer Science
Dilla University
By: Tsegalem G/hiwot
2022 G.C
Contents of Big Data
• Understand the basics of Big Data
• Evolution of Big Data
• Characteristics of Big Data
• Big Data sources
• Benefits of Big Data
• Challenges of Big Data and solution
• The main concepts of Hadoop
• Hadoop components and ecosystem
What Is Big Data?
Big Data is a term used to describe a collection of data that is
huge in size and yet growing exponentially with time.
It generates value from many different sources by processing very
large quantities of digital information that cannot be analyzed
with traditional computing techniques.
Handling bigger data requires different approaches:
– Techniques, tools and architecture
Big data is the realization of storing, processing, and analyzing data
that was previously ignored due to the limitations of traditional data
management technologies.
Big data is all data (structured, semi-structured and unstructured).
Factors for Evolution of Big Data
Evolution of technology
• The world changed from the telephone to the cellphone
• From stand-alone computers to networked computers (the Internet)
IoT (an estimated 50 billion IoT devices by 2020)
Social media e.g. Instagram, Facebook, YouTube etc.
Others, e.g. Amazon, Flipkart, etc.
Examples of Big Data
Walmart handles more than 2.5 petabytes of data.
NASA Climate Simulation 32 PB/Day.
Google processes 20 PB a day (2008)
Facebook stores about 2.5 PB of user data, and it grows daily.
eBay has 6.5 PB of user data + 50 TB/day.
Twitter produces over 90 million tweets per day.
How can you avoid big data?
Pay cash for everything!
Never go online!
Don’t use a Cellphone!
Don’t fill any online prescriptions!
Never leave your house!
Characteristics of Big Data (5V)
Volume
Velocity
Variety
Veracity
Value
Volume(Scale)
Refers to the quantity of generated and stored data.
The name Big Data itself is related to size which is enormous.
Size of data plays a very crucial role in determining value out
of data.
Hence, volume is one characteristic which needs to be
considered while dealing with big data.
Velocity(Speed)
Refers to the speed of generation of data.
Data is live streaming or in motion.
• Example: medical devices that monitor patients need to collect
data, send it to its destination, and have it analysed quickly.
Big data is often available in real-time.
Compared to small data, big data is produced more continuously.
Variety (Complexity)
The type and nature of the data.
Refers to the varied sources and the nature of the data, both
structured and unstructured.
Traditional database systems were designed to address
smaller volumes of structured data, fewer updates, and a
predictable, consistent data structure. However, nowadays
about 80% of the data is unstructured and cannot be put into tables easily.
Veracity/Validity
It refers to the quality and accuracy of data.
Gathered data could have missing pieces, may be inaccurate or may
not be able to provide real, valuable insight.
Veracity, overall, refers to the level of trust there is in the collected data.
Data can sometimes become messy and difficult to use.
A large amount of data can cause more confusion than insights if it's
incomplete.
• For example, concerning the medical field, if data about what drugs
a patient is taking is incomplete, then the patient's life may be
endangered.
Value
This refers to the value that big data can provide, and it relates
directly to what organizations can do with that collected data.
Being able to pull value from big data is a requirement, as the
value of big data increases significantly depending on the insights
that can be gained from it.
Organizations can use the same big data tools to gather and
analyse the data, but how they derive value from that data should
be unique to them.
Big data is all data
Big Data sources
Users
Application
Systems
Sensors
Who’s Generating Big Data
Social media and networks (all of us are generating data)
Scientific instruments (measuring all kinds of data)
Mobile devices (tracking all objects all the time)
Sensor technology and networks (collecting all sorts of data)
Benefits of Big Data
To make better decisions and take meaningful actions at the
right time.
Reduce costs of business processes
Fraud Detection: Financial companies, in particular, use big
data to detect fraud. Data analysts use machine learning
algorithms and artificial intelligence to detect anomalies and
transaction patterns.
Increased productivity
Improved customer service
Increased innovation and development of next-generation products.
Challenges of Big Data
Lack of proper understanding of Big Data.
Data growth issues or Data storage growth.
Insufficient budget/Costs increase too fast.
Nature of Data.
Lack of trained professionals: experts in using the new
technology and dealing with big data, and a lack of analytic skills.
Confusion during Big Data tool selection.
Integrating data from a variety of sources.
Cont…
Traditional systems are useful for structured data but they can’t
manage such a large amount of unstructured data.
80% of the data over the globe is unstructured or available in widely
varying structures, which are difficult to analyse through
traditional systems.
So, how can big data be processed with reasonable cost and time?
To better address the high storage and computational needs of big
data, computer clusters are a better fit.
A computer cluster is a set of computers that work together so that
they can be viewed as a single system.
Do we have a framework for cluster computing?
What is Hadoop?
Hadoop is a collection of open-source software utilities that
facilitate using a network of many computers to solve problems
involving massive amounts of data and computation.
Allows for distributed processing of large datasets across
clusters of commodity computers using a simple programming
model.
Originally designed for computer clusters built from commodity
hardware
Cont…
Commodity computers are cheap and widely available.
These are mainly useful for achieving greater computational
power at low cost.
Similar to data residing in the local file system of a personal
computer, in Hadoop data resides in a distributed file system,
which is called HDFS.
Core-Components of Hadoop
1. Hadoop Distributed File System (HDFS)
2. MapReduce
1. Hadoop Distributed File System (HDFS)
A distributed file system; it is the storage part of Hadoop.
Splits files into a number of blocks and distributes them across
nodes in a cluster.
It then transfers packaged code to the nodes to process the data
in parallel.
This approach takes advantage of data locality, where each node
manipulates the data it has local access to.
Cont.…
Stores multiple copies of data on different nodes
Typically has a single Namenode and a number of Datanodes.
HDFS works on a master/slave architecture.
Master services can communicate with each other, and in the
same way slave services can communicate with each other.
The Namenode is the master node and each Datanode is a
corresponding slave node, and they can talk to each other.
HDFS daemons
Has three services
1. Name node
2. Secondary name node
3. Data node
1. Name/Master node
HDFS consists of only one Namenode; we call it the master node,
and it keeps track of files.
It manages all file system metadata; it holds metadata about
the whole data set.
It contains details such as the file name, ownership, permissions,
the number of blocks, on which Datanode each block is stored,
where the replicas are stored, and other details.
Cont.…
Since there is only one Namenode, it is a single point of failure.
It has a direct connection with the client.
It receives heartbeats and block reports from all the Datanodes.
It handles authentication and authorization.
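To make the idea of Namenode metadata concrete, here is a minimal illustrative sketch in Python of the kind of bookkeeping described above; the file path, node names, and structure are hypothetical, not the Namenode's real internal data structures.

```python
# Illustrative only: a toy picture of the metadata the Namenode keeps for one
# file (hypothetical names; the real Namenode uses its own internal structures).
file_metadata = {
    "path": "/user/student/logs.txt",
    "owner": "student",
    "permissions": "rw-r--r--",
    "block_size_mb": 128,
    "replication": 3,
    # block id -> Datanodes holding a replica of that block
    "blocks": {
        "blk_0001": ["datanode1", "datanode3", "datanode4"],
        "blk_0002": ["datanode2", "datanode3", "datanode5"],
    },
}

# A client asking "where is blk_0001?" is answered from this mapping.
print(file_metadata["blocks"]["blk_0001"])
```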
2. Data node
Stores the actual data in it as blocks.
Known as slave nodes; responsible for serving client reads
and writes.
Receives data as instructed by the Namenode and reports back
with an acknowledgement.
Stores multiple copies of each block.
Every Datanode sends a heartbeat message to the Namenode
every 3 seconds to convey that it is alive.
Cont.…
In this way, when the Namenode does not receive a heartbeat
from a Datanode for 2 minutes, it takes that Datanode as dead
and starts the process of replicating its blocks on some other
Datanode.
Serves read and write requests from clients.
Has no knowledge about the HDFS file system as a whole; it only stores blocks.
Receives data from the client or from a peer Datanode, as directed by the Namenode.
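The heartbeat and dead-node check described above can be sketched as follows; this is a toy simulation with hypothetical Datanode names and timestamps, not real HDFS code.

```python
import time

HEARTBEAT_INTERVAL = 3   # seconds between heartbeats, as stated above
DEAD_TIMEOUT = 120       # 2 minutes of silence => the Datanode is taken as dead

# Last time each Datanode was heard from (hypothetical values for illustration).
last_heartbeat = {
    "datanode1": time.time(),        # reported just now
    "datanode2": time.time() - 200,  # silent for more than 2 minutes
}

def find_dead_nodes(last_heartbeat, now=None):
    """Return the Datanodes the Namenode would consider dead."""
    now = time.time() if now is None else now
    return [node for node, t in last_heartbeat.items() if now - t > DEAD_TIMEOUT]

for node in find_dead_nodes(last_heartbeat):
    # In real HDFS the Namenode would now schedule re-replication of the
    # blocks that lived on this Datanode onto other Datanodes.
    print(node, "is dead; re-replicate its blocks elsewhere")
```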
3. Secondary name node
This node only takes care of checkpointing the file system
metadata which is in the Namenode.
It is also known as the checkpoint node.
It is the helper node for the Name Node.
The checkpointed metadata includes:
• The list of files
• The list of blocks for each file
• The list of Datanodes for each block
• File attributes such as creation time
• A record of every change in the metadata
HDFS Master/Slave Architecture
HDFS Blocks
Data files in HDFS are broken into block-sized chunks, which
are stored as independent units.
• The default block size is 128 MB in Apache Hadoop 2.0
(64 MB in Apache Hadoop 1.0).
Blocks are replicated for reliability.
• Multiple copies of data blocks are created and distributed
on nodes throughout the cluster to keep data highly
available even if a node failure occurs.
Cont.…
One copy is kept on the local node and another copy on a remote rack.
A third copy is kept on the local rack; additional replicas are
randomly placed.
Default replication is 3-fold.
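A toy placement function following the policy described on this slide (one copy on the local node, one on a remote rack, one on the local rack, extras at random); the rack and node names are hypothetical, and real HDFS placement is more involved.

```python
import random

# Hypothetical cluster layout: rack -> nodes.
racks = {
    "rack1": ["node1", "node2", "node3"],
    "rack2": ["node4", "node5", "node6"],
}

def place_replicas(local_node, local_rack, replication=3):
    """Pick replica locations following the policy described above."""
    remote_rack = random.choice([r for r in racks if r != local_rack])
    chosen = [
        local_node,                                                        # copy 1: local node
        random.choice(racks[remote_rack]),                                 # copy 2: remote rack
        random.choice([n for n in racks[local_rack] if n != local_node]),  # copy 3: local rack
    ]
    # Any additional replicas are placed randomly on the remaining nodes.
    remaining = [n for nodes in racks.values() for n in nodes if n not in chosen]
    while len(chosen) < replication and remaining:
        chosen.append(remaining.pop(random.randrange(len(remaining))))
    return chosen[:replication]

print(place_replicas("node1", "rack1"))   # e.g. ['node1', 'node5', 'node2']
```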
Cont…
E.g. a 420 MB file is split into four blocks: 128 MB + 128 MB + 128 MB + 36 MB.
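The split above can be checked with a few lines of Python (a sketch using the Hadoop 2.0 default block size from the previous slide):

```python
# Split a 420 MB file into 128 MB blocks; the last block holds the remainder.
FILE_SIZE_MB = 420
BLOCK_SIZE_MB = 128

blocks = []
remaining = FILE_SIZE_MB
while remaining > 0:
    blocks.append(min(BLOCK_SIZE_MB, remaining))
    remaining -= BLOCK_SIZE_MB

print(blocks)   # [128, 128, 128, 36] -> 4 blocks; the last one is not padded to 128 MB
```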
File write operation on HDFS
To write a file to HDFS, the client needs to interact with the Namenode.
The Namenode provides the addresses of the slaves on which the client
will start writing the data.
As soon as the client finishes writing a block, that slave starts
copying the block to another slave, which in turn copies the
block to another slave (3 replicas by default).
After the required replicas are created, an acknowledgement is
sent back to the client.
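The write pipeline and acknowledgements described above can be sketched as a toy simulation (hypothetical Datanode names, not real HDFS client code):

```python
# Datanode addresses the Namenode would hand to the client (hypothetical).
pipeline = ["datanode1", "datanode4", "datanode7"]

def write_block(block, pipeline):
    """Client writes to the first slave; each slave forwards the block to the
    next one, then acknowledgements travel back to the client."""
    for i, node in enumerate(pipeline):
        print(node, "stores", block)
        if i + 1 < len(pipeline):
            print(node, "forwards", block, "to", pipeline[i + 1])
    for node in reversed(pipeline):
        print("acknowledgement from", node)
    print("client receives the final acknowledgement")

write_block("blk_0001", pipeline)
```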
1. Setting up HDFS pipeline.
2. Write pipe line
3. Acknowledgement in HDFS write
File read operation on HDFS
To read a file from HDFS, the client needs to interact with the Namenode.
The Namenode provides the addresses of the slaves where the file is stored.
The client then interacts with the respective Datanodes to read the file.
The Namenode also provides the client with a token, which the client
shows to the Datanodes for authentication.
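A corresponding toy sketch of the read path (hypothetical block and Datanode names): the client gets block locations from the Namenode and reads each block directly from one of its Datanodes.

```python
# What the Namenode would return for the file's blocks (hypothetical).
block_locations = {
    "blk_0001": ["datanode1", "datanode3"],
    "blk_0002": ["datanode2", "datanode3"],
}

for block, nodes in block_locations.items():
    chosen = nodes[0]   # real HDFS prefers the closest Datanode
    print("read", block, "from", chosen)
```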
HDFS file reading mechanisms
HDFS Features
Distributed
Scalable
Cost effective
Fault tolerant.
High throughput
Others
1. Distributed
Stores huge files in a distributed manner across the network.
This is done by combining commodity (cheap) computers
into a cluster.
The client uses the cluster as if it were a single computer.
2. Scalable
As discussed, HDFS is a distributed file system.
It is easily scalable, both horizontally and vertically.
A few extra nodes help in scaling up the framework.
3. Economical/cost effective
One important feature is that there is no need to buy expensive
server machines, because it is possible to combine cheap machines.
Its systems are highly economical, as ordinary computers can
be used for data processing.
4. Fault tolerance
It stores copies of the data on different machines and is
resistant to hardware failure.
This also holds for failures of a main switch or a whole rack
(a copy is placed on another rack); this is called rack awareness.
Replication is expensive: with 3-fold replication, only about 1/3
of the total storage holds unique data.
If a Namenode failure happens, then a backup Namenode is the solution.
5. High throughput
HDFS stores data in a distributed fashion, which allows the data to
be processed in parallel on a cluster of nodes.
This decreases the processing time and thus provides high
throughput.
Latency: the time to get the first record.
Throughput: the number of records processed per unit of time.
6. Others
Unlimited data storage
High speed processing system
Processing of all varieties of data:
1. Structured
2. Unstructured
3. Semi-structured
2. MapReduce
MapReduce is the processing part of Hadoop.
It processes data in parallel in a distributed environment.
Hadoop distributes the computation over the cluster.
It is a programming framework (library and runtime) for analyzing
data sets stored in HDFS.
MapReduce jobs are composed of two functions: map and reduce.
The Mapper
1. Data is split and sent to worker nodes.
2. Maps are individual tasks that transform input records into
intermediate records.
3. Each block is processed in isolation by a map task called a
mapper.
4. The following diagram shows a simplified flow diagram for the
MapReduce program.
Shuffling
Before the reducer there is a shuffle step, which exchanges the
intermediate outputs of the map tasks and moves them to where
they are required by the reducers.
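A minimal sketch of the shuffle idea in plain Python: the intermediate (key, value) pairs emitted by the map tasks (hypothetical values here) are grouped by key before being handed to the reducers.

```python
from collections import defaultdict

# Intermediate (key, value) pairs emitted by the map tasks (hypothetical).
intermediate = [("even", 4), ("odd", 9), ("even", 16), ("odd", 1)]

grouped = defaultdict(list)
for key, value in intermediate:
    grouped[key].append(value)   # shuffle: collect all values that share a key

print(dict(grouped))   # {'even': [4, 16], 'odd': [9, 1]}
```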
Reducer
Reduces the set of intermediate values that share a key to a
smaller set of values.
All of the values with the same key are presented to a single
reducer together.
Produces the final output.
Example 01: sum of squares
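The original slide showed this example as a diagram; here is a minimal MapReduce-style sketch of it in plain Python, assuming a small hypothetical input.

```python
from functools import reduce

data = [1, 2, 3, 4, 5]                      # hypothetical input records

# Map: each mapper squares its input records.
mapped = [x * x for x in data]              # [1, 4, 9, 16, 25]

# Reduce: a single reducer sums all the intermediate values.
total = reduce(lambda a, b: a + b, mapped)  # 55

print(total)
```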
Example: square of even and odd numbers
Example: square of even and odd and prime numbers
Example: the word count process
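The word count diagram is not reproduced here; the sketch below walks through the same map, shuffle, and reduce stages in plain Python on a hypothetical two-line input.

```python
from collections import defaultdict

lines = ["big data is big", "data is everywhere"]   # hypothetical input split

# Map: emit (word, 1) for every word in every input line.
pairs = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the 1s by word.
grouped = defaultdict(list)
for word, one in pairs:
    grouped[word].append(one)

# Reduce: sum the counts for each word.
counts = {word: sum(ones) for word, ones in grouped.items()}
print(counts)   # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```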
MapReduce Engine
1. Job Tracker
Responsible for accepting jobs from clients, dividing those
jobs into tasks, and assigning those tasks to be executed by
worker nodes.
The JobTracker talks to the NameNode to find out the location of
the data and requests from the NameNode the metadata needed
for processing.
The NameNode in response gives the metadata to the JobTracker.
Cont.…
2. Task tracker
Runs MapReduce tasks.
It is the slave node for the JobTracker, and it takes tasks
from the JobTracker.
It also receives code from the JobTracker.
The process of applying that code to a file is known as the
mapper.
Big Data Life Cycle with Hadoop
1. Ingesting data into the system
The data is ingested/transferred to Hadoop from various
sources such as RDBMSs, other systems, or local files.
Sqoop transfers data from an RDBMS to HDFS, whereas
Flume transfers event data.
2. Processing the data in storage
The second stage is Processing.
In this stage, the data is stored and processed.
Cont.…
The data is stored in HDFS or HBase, and MapReduce
performs the data processing.
3. Computing and analyzing data
The third stage is to analyze.
Here, the data is analyzed by processing frameworks such
as Pig and Hive.
4. Visualizing the results
In this stage, the analyzed data can be accessed by users.
Assignment two
1. List and describe Hadoop ecosystem
2. Write Application of Big Data Analytics
3. What is Network File System?
4. Define the following terms
RPC
SSH
TCP/IP
5. Compare traditional RDBMS and Hbase
6. Advantages and disadvantages of Hadoop.
The end