Module 1 - Introduction To Big Data

This document provides an introduction to big data and Hadoop. It discusses what big data is, the challenges it poses, and the different types of data: structured, unstructured, and semi-structured. It also discusses Hadoop capabilities such as distributed processing and storage. Key components of Hadoop include HDFS for storage, YARN for distributed processing, MapReduce for distributed computation, Pig and Hive for data analysis, and Sqoop and Flume for data transfer. Hadoop has evolved from version 1, in which MapReduce handled both data processing and cluster resource management, to version 2, which moves resource management into YARN while HDFS continues to provide storage, and version 3, which further improves YARN.

Uploaded by

raghunath sastry

Big Data and Hadoop

Introduction to Big Data and Hadoop
Session Objectives
This session will help you to:

ᗍ Understand what Big Data is
ᗍ List the challenges associated with Big Data
ᗍ Understand the difference between real-time and batch processing
ᗍ Describe Hadoop capabilities
ᗍ Describe the Hadoop ecosystem

Slide 2
Units of Data
Data Generated by Social Media Platforms
ᗍ Billions of users
ᗍ Generate petabytes (PB) of data per day
ᗍ Run millions of queries on that data every day

Slide 4
Data Generated by Entertainment/Infotainment Platforms

Slide 5
Space Agencies

The NASA Center for Climate Simulation (NCCS) stores 32 petabytes of data comprising climatic observations

Slide 6
What is Big Data?
ᗍ Huge amount of data (terabytes or petabytes)

ᗍ Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications

ᗍ The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization

Data veracity is the degree to which data is accurate, precise and trusted. Data is often viewed as certain and reliable. The reality of problem spaces, data sets and operational environments is that data is often uncertain, imprecise and difficult to trust.

https://simplicable.com/new/data-veracity
Slide 7
Slide 8
What is Unstructured Data?

Unstructured data is essentially everything else. Unstructured data has internal structure but is not structured via pre-defined data models or schema. It may be textual or non-textual, and human- or machine-generated. It may also be stored within a non-relational database such as a NoSQL database.
What is Unstructured Data?
Typical human-generated unstructured data includes:
• Text files: Word processing, spreadsheets, presentations, email, logs.
• Email: Email has some internal structure thanks to its metadata, and we sometimes refer to it as semi-structured. However, its message field is unstructured and traditional analytics tools cannot parse it.
• Social Media: Data from Facebook, Twitter, LinkedIn.
• Website: YouTube, Instagram, photo sharing sites.
• Mobile data: Text messages, locations.
• Communications: Chat, IM, phone recordings, collaboration software.
• Media: MP3, digital photos, audio and video files.
• Business applications: MS Office documents, productivity applications.

Typical machine-generated unstructured data includes:
• Satellite imagery: Weather data, land forms, military movements.
• Scientific data: Oil and gas exploration, space exploration, seismic imagery, atmospheric data.
• Digital surveillance: Surveillance photos and video.
• Sensor data: Traffic, weather, oceanographic sensors.
IBM’s Definition of Big Data

[Figure: Variety axis labelled Photo, Web, Video, Audio; Volume axis labelled MB, GB, TB, PB]

Slide 11
Structured and Unstructured Data

ᗍ 2,500 exabytes of new information in 2012, with the internet as the primary driver

ᗍ The digital universe grew by 62% last year to 800K petabytes and will go to 1.2 zettabytes this year
Slide 12
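The figures above mix petabytes, exabytes, and zettabytes. A minimal Python sketch, assuming decimal prefixes (each unit is 1,000 times the previous one), puts them on a common scale. The helper name `convert` is hypothetical and not part of the original slides.

```python
# Decimal byte units; each step up is a factor of 1,000.
# Assumption: the slide uses decimal rather than binary prefixes.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def convert(value, from_unit, to_unit):
    """Convert a quantity between two byte units (hypothetical helper)."""
    steps = UNITS.index(from_unit) - UNITS.index(to_unit)
    factor = 1000 ** abs(steps)
    # Going to a smaller unit multiplies; going to a larger unit divides.
    return value * factor if steps >= 0 else value / factor

print(convert(800_000, "PB", "ZB"))  # 800K petabytes expressed in zettabytes
print(convert(2_500, "EB", "ZB"))    # 2,500 exabytes expressed in zettabytes
```

Expressed this way, 800K petabytes and 1.2 zettabytes can be compared directly.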
What is Hadoop?
Apache Hadoop is a framework that allows the distributed processing of large data sets across clusters of commodity computers using a simple programming model.

It is an open-source data management framework with scale-out storage and distributed processing.

Slide 13
Batch Processing
ᗍ Processing transactions in a group or batch
ᗍ The following three phases are common to any batch processing or business analytics project, irrespective of the type of data (structured or unstructured):

Data Collection -> Data Preparation -> Data Presentation
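The three phases can be sketched as three functions applied to a whole batch at once. This is a minimal Python illustration under stated assumptions: the record format ("city,amount"), the sample values, and the function names are all hypothetical.

```python
from collections import Counter

def collect():
    """Data collection: gather raw records (hard-coded here for illustration)."""
    return ["  Pune,120", "Delhi,80", "bad-record", "pune,30 "]

def prepare(raw_records):
    """Data preparation: drop malformed rows and normalise the fields."""
    cleaned = []
    for line in raw_records:
        parts = line.strip().split(",")
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip rows that do not match "city,amount"
        cleaned.append((parts[0].strip().title(), int(parts[1])))
    return cleaned

def present(records):
    """Data presentation: aggregate the whole batch into per-city totals."""
    totals = Counter()
    for city, amount in records:
        totals[city] += amount
    return dict(totals)

report = present(prepare(collect()))
print(report)
```

The whole input is processed as one group, which is what distinguishes a batch job from a system that reacts to each record as it arrives.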

Slide 14
Data Collection

[Diagram labels: Unstructured Data, Flume; Structured Data, Sqoop; Real Time System; Business Analytics / Batch Processing System]

Slide 15
Flume
Apache Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of streaming data into the Hadoop Distributed
File System (HDFS). It has a simple and flexible architecture based on streaming data
flows; and is robust and fault tolerant with tunable reliability mechanisms for failover
and recovery.

YARN coordinates data ingest from Apache Flume and other services that deliver raw data into an Enterprise Hadoop cluster.

Flume lets Hadoop users ingest high-volume streaming data into HDFS for storage.
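The streaming data flow described above can be pictured as a source that produces events, a buffer that holds them, and a sink that writes them out. The Python sketch below is only a conceptual stand-in: it does not use Flume, the source/channel/sink names are borrowed from Flume's own documentation rather than from this slide, and a plain list stands in for HDFS.

```python
from collections import deque

class Channel:
    """Buffer between source and sink (in-memory stand-in)."""
    def __init__(self):
        self.events = deque()

    def put(self, event):
        self.events.append(event)

    def take(self):
        # Return the oldest event, or None when the buffer is empty.
        return self.events.popleft() if self.events else None

def source(lines, channel):
    """Source: turn incoming log lines into events on the channel."""
    for line in lines:
        channel.put(line.rstrip("\n"))

def sink(channel, store):
    """Sink: drain the channel into the destination (a list standing in for HDFS)."""
    while (event := channel.take()) is not None:
        store.append(event)

channel, hdfs_stub = Channel(), []
source(["GET /index\n", "GET /about\n"], channel)
sink(channel, hdfs_stub)
print(hdfs_stub)
```

Decoupling the producer from the writer through a buffer is one way to absorb bursts of incoming data.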
Sqoop
Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to import data from relational databases such as MySQL and Oracle into HDFS, and to export data from HDFS back to relational databases.
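The import direction can be illustrated conceptually in Python. This sketch does not call Sqoop: an in-memory SQLite database stands in for MySQL or Oracle, a `StringIO` buffer stands in for a file in HDFS, and the table, rows, and `import_table` helper are hypothetical.

```python
import csv
import io
import sqlite3

# In-memory SQLite stands in for the relational database server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (id INTEGER, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [(1, "Asha", "Pune"), (2, "Ravi", "Delhi")],
)

def import_table(connection, table, out_file):
    """Write every row of a table as comma-delimited text; return the row count."""
    # The table name is interpolated directly, which is acceptable only in a toy example.
    rows = connection.execute(f"SELECT * FROM {table} ORDER BY id").fetchall()
    csv.writer(out_file, lineterminator="\n").writerows(rows)
    return len(rows)

target = io.StringIO()  # stands in for a file in HDFS
imported = import_table(conn, "student", target)
print(imported)
print(target.getvalue())
```

An export would run the same idea in reverse, reading delimited text and inserting rows into the table.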
MapReduce
Hadoop MapReduce (Hadoop Map/Reduce) is a software framework for distributed processing of large data sets on computing clusters. It is a sub-project of the Apache Hadoop project. Apache Hadoop is an open-source framework that allows users to store and process big data in a distributed environment across clusters of computers using simple programming models. MapReduce is the core component for data processing in the Hadoop framework.
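The programming model can be illustrated with the classic word count. The sketch below runs in a single Python process and only mimics the map, shuffle, and reduce steps; it is not Hadoop's Java API, and the function names are chosen for this example.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data and Hadoop", "Hadoop stores big data"])
print(counts)
```

On a cluster, the map calls would run in parallel on different nodes, and the shuffle would move grouped keys across the network to the reducers.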
MapReduce
Pig
Pig is a high-level scripting platform that is used with Apache Hadoop. Pig enables data workers to write complex data transformations without knowing Java. Pig's simple SQL-like scripting language is called Pig Latin, and appeals to developers already familiar with scripting languages and SQL.

Sample_script.pig
-- Load comma-delimited student records from HDFS with a typed schema, then print them
student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING PigStorage(',')
    AS (id:int, name:chararray, city:chararray);
DUMP student;
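For readers who do not know Pig Latin, the same load-and-dump step can be approximated in plain Python. This is a rough single-machine analogue, not how Pig executes the script; the sample rows and the `load_students` helper are hypothetical.

```python
import csv
import io

# Hypothetical sample contents standing in for student.txt.
SAMPLE = "1,Asha,Pune\n2,Ravi,Delhi\n"

def load_students(text_file):
    """Parse comma-delimited rows into (id:int, name:str, city:str) tuples."""
    return [(int(row[0]), row[1], row[2]) for row in csv.reader(text_file) if row]

students = load_students(io.StringIO(SAMPLE))
for record in students:  # counterpart of DUMP: print every tuple
    print(record)
```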
Data Presentation

[Diagram labels: Business Analytics / Batch Processing System; Pig (Data Processing); Output]

Slide 21
Hadoop Key Characteristics
ᗍ Reliable
ᗍ Scalable
ᗍ Economical
ᗍ Flexible

Slide 23
Hadoop Ecosystem

[Diagram, layers from top to bottom]
ᗍ Apache Oozie (Workflow)
ᗍ Hive (DW System), Pig Latin (Data Analysis), Other YARN Frameworks (MPI, GIRAPH), HBase
ᗍ MapReduce Framework
ᗍ YARN (Cluster Resource Management)
ᗍ HDFS (Hadoop Distributed File System)
ᗍ Flume (Unstructured or Semi-Structured data), Sqoop (Import or Export of Structured Data)
Slide 24
Hadoop versions with history
https://archive.apache.org/dist/hadoop/core/

Slide 25
Hadoop 2.x Core Components

ᗍ The entire infrastructure is divided into two parts: storage and processing
ᗍ Hadoop Distributed File System (HDFS) provides the storage mechanism. Yet Another Resource Negotiator (YARN) provides the processing part
ᗍ There are five Hadoop daemons in total: three for HDFS and two for YARN. They work in a master and slave mode
ᗍ Name Node and Secondary Name Node (Masters) and Data Nodes (Slaves) for HDFS; Resource Manager (Master) and Node Managers (Slaves) for YARN
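The five daemons and their roles can be written down as a small lookup table. This Python sketch only restates the list above; the dictionary layout and the `daemons_for` helper are illustrative choices.

```python
# Daemon name -> (layer, role), as listed on the slide.
DAEMONS = {
    "NameNode": ("HDFS", "master"),
    "Secondary NameNode": ("HDFS", "master"),
    "DataNode": ("HDFS", "slave"),
    "ResourceManager": ("YARN", "master"),
    "NodeManager": ("YARN", "slave"),
}

def daemons_for(layer):
    """Return the daemon names that belong to one layer (HDFS or YARN)."""
    return [name for name, (lyr, _) in DAEMONS.items() if lyr == layer]

print(len(DAEMONS), daemons_for("HDFS"), daemons_for("YARN"))
```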

Slide 26
Hadoop 1.x Vs Hadoop 2.x

Slide 27
Hadoop 3.x Core Components
A major improvement in Hadoop 3.0 is related to the way YARN works and what it can support. Hadoop’s resource manager, YARN, was introduced in Hadoop 2.0 to make Hadoop clusters run efficiently. In Hadoop 3.0, YARN comes with multiple enhancements in the following areas:
• Support for long-running services, driven by the need to consolidate infrastructure.
• Better resource isolation for disk and network, resource utilization, user experience, Docker opportunities and elasticity.
• Re-architecture of the YARN Timeline Service as ATS v2.

Slide 28
Difference between 2.x and 3.x

Slide 29
Difference between 2.x and 3.x

Slide 30
Hadoop 2.x Core Components

[Diagram]
YARN: Resource Manager; Node Manager, Node Manager, Node Manager, Node Manager
HDFS Cluster: Admin Node, Name Node; Data Node, Data Node, Data Node, Data Node

Slide 31
Components
YARN - Apache YARN (“Yet Another Resource Negotiator”) is the resource management layer of Hadoop.
YARN was introduced in Hadoop 2.x. It allows different data processing engines, such as graph processing, interactive processing, stream processing, and batch processing, to run and process data stored in HDFS (Hadoop Distributed File System).
Apart from resource management, YARN also does job scheduling.
YARN extends the power of Hadoop to other evolving technologies.
Slide 32
Components
HDFS Cluster - A cluster is a collection of nodes. A node is a process running on a virtual or physical machine or in a container.
When you run Hadoop in local (standalone) mode, it writes data to the local file system instead of HDFS (Hadoop Distributed File System).

Slide 33
Components
Node - A node is a process running on a virtual or physical machine or in a container. We say process because the machine may also be running other programs besides Hadoop.

Slide 34
Components
Resource Manager - The Resource Manager is the core
component of YARN

Slide 35
Components
Name Node - The NameNode is the centerpiece of HDFS and is also known as the Master. The NameNode stores only the metadata of HDFS (the directory tree of all files in the file system) and tracks the files across the cluster.

Slide 36
Components
Node Manager - The NodeManager is a more generic, efficient, and flexible version of the TaskTracker from the Hadoop 1 architecture.
In contrast to the fixed number of slots for map and reduce tasks in MRv1, the NodeManager of MRv2 has a number of dynamically created resource containers.
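The difference between fixed slots and dynamically created containers can be sketched with two toy worker classes. This is a simplified Python model under stated assumptions: both class names are hypothetical, and memory is the only resource tracked, whereas a real NodeManager accounts for more than that.

```python
class SlotTaskTracker:
    """MRv1-style worker: a fixed number of map slots and reduce slots."""
    def __init__(self, map_slots, reduce_slots):
        self.free = {"map": map_slots, "reduce": reduce_slots}

    def launch(self, kind):
        if self.free[kind] == 0:
            return False  # no slot of this kind, even if other slots sit idle
        self.free[kind] -= 1
        return True

class ContainerNodeManager:
    """MRv2-style worker: containers are carved out of one memory pool on demand."""
    def __init__(self, memory_mb):
        self.free_mb = memory_mb

    def launch(self, memory_mb):
        if memory_mb > self.free_mb:
            return False
        self.free_mb -= memory_mb
        return True

tracker = SlotTaskTracker(map_slots=2, reduce_slots=2)
slot_results = [tracker.launch("map") for _ in range(3)]

manager = ContainerNodeManager(memory_mb=4096)
container_results = [manager.launch(1024) for _ in range(3)]
print(slot_results, container_results)
```

In the slot model the third map task is refused although two reduce slots are idle; in the container model the same capacity can serve any kind of task.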

Slide 37
Components
Data Node - The DataNode is responsible for storing the actual data in HDFS.
The DataNode is also known as the Slave. The NameNode and DataNodes are in constant communication.
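The division of labour between the NameNode (metadata) and the DataNodes (actual data) can be sketched as two small Python classes. This is a toy model under stated assumptions: the 4-byte block size and round-robin placement are chosen for readability, the class and method names are hypothetical, and real HDFS uses far larger blocks and replicates each block across several DataNodes.

```python
class DataNode:
    """Stores actual block contents (stand-in for a DataNode)."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes

class NameNode:
    """Stores metadata only: which blocks form a file and where they live."""
    def __init__(self):
        self.files = {}  # path -> list of (block_id, datanode_name)

    def add_file(self, path, data, datanodes, block_size=4):
        placement = []
        for i in range(0, len(data), block_size):
            index = i // block_size
            node = datanodes[index % len(datanodes)]  # round-robin (assumption)
            block_id = f"{path}#blk{index}"
            node.blocks[block_id] = data[i:i + block_size]  # data goes to the DataNode
            placement.append((block_id, node.name))         # metadata stays here
        self.files[path] = placement

    def locate(self, path):
        return self.files[path]

nodes = [DataNode("dn1"), DataNode("dn2")]
nn = NameNode()
nn.add_file("/logs/a.txt", b"0123456789", nodes)
print(nn.locate("/logs/a.txt"))
```

A client that wants to read the file would ask the NameNode for the block locations and then fetch the bytes from the DataNodes.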

Slide 38
Secondary NameNode
ᗍ In HDFS 1.0, not a hot standby for the NameNode
ᗍ By default, connects to the NameNode every hour*
ᗍ Housekeeping, backup of NameNode metadata
ᗍ Saved metadata is used to bring up the Secondary NameNode

[Diagram: the Secondary NameNode takes the metadata from the NameNode every hour and keeps it secure]
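The hourly housekeeping can be pictured as merging a log of recent changes into a saved snapshot of the metadata. The slide does not spell out this mechanism, so the image-plus-edit-log model in the Python sketch below is an assumption based on how HDFS checkpointing is commonly described; the data layout and the `apply_edits` helper are hypothetical.

```python
def apply_edits(image, edits):
    """Replay logged namespace changes on top of a saved image."""
    merged = dict(image)
    for op, path in edits:
        if op == "create":
            merged[path] = True
        elif op == "delete":
            merged.pop(path, None)
    return merged

# Hypothetical NameNode state: last saved image plus edits logged since then.
image = {"/data/a.txt": True}
edit_log = [("create", "/data/b.txt"), ("delete", "/data/a.txt")]

# Checkpoint: merge the edits into a fresh image, then start an empty log.
checkpoint = apply_edits(image, edit_log)
edit_log = []
print(sorted(checkpoint))
```

Keeping a recent merged copy shortens recovery, because only the edits made since the last checkpoint need to be replayed.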

Slide 39
Thank you
