
Big Data Open Source Framework

By: Faizan Irshad


What is Hadoop?

 Apache Hadoop is a framework that allows the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
 It provides open-source data management with scale-out storage and distributed processing.
Brief History of Hadoop

Designed to answer the question: “How to process big data with reasonable cost and time?”
Hadoop’s Developers

2005: Doug Cutting and Michael J. Cafarella developed Hadoop to support distribution for the Nutch search engine project. The project was funded by Yahoo.

2006: Yahoo gave the project to the Apache Software Foundation.

[Photo: Doug Cutting]
Hadoop Users

• Hadoop is in use at most organizations that handle big data, e.g. Yahoo, Facebook, Amazon, Netflix, etc.

• Some examples of scale:

o Yahoo!’s Search Webmap runs on a 10,000-core Linux cluster and powers Yahoo! Web search

o Facebook’s Hadoop cluster hosts 100+ PB of data


Goals / Requirements:
• Abstract and facilitate the storage and processing of large and/or rapidly growing data sets
• Structured and unstructured data
• Simple programming models
• High scalability and availability
• Use commodity (cheap!) hardware with little redundancy
• Fault-tolerance
Hadoop Framework
 Data is processed in parallel, which makes it possible to run statistical analysis over large amounts of data.
 It is a framework based on Java programming.
 It is designed to scale from a single server to thousands of machines, each offering local computation and storage.
Hadoop Framework
 Being a framework, Hadoop is made up of several modules that are supported by a large ecosystem of technologies.
 The Hadoop ecosystem is a platform or suite that provides various services to solve big data problems. It includes Apache projects as well as various commercial tools and solutions.
 There are four major elements of Hadoop: HDFS, MapReduce, YARN, and Hadoop Common utilities. Most of the other tools and solutions are used to supplement or support these major elements. All of these tools work together to provide services such as ingestion, analysis, storage, and maintenance of data.
Hadoop Ecosystem
Following are the components that collectively form a Hadoop ecosystem:

• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: Programming-based data processing
• PIG, HIVE: Query-based processing of data services
• HBase: NoSQL database
• Mahout: Machine learning algorithm library
• Solr, Lucene: Searching and indexing
• Zookeeper: Managing the cluster
• Oozie: Job scheduling

Note: Apart from the above-mentioned components, there are many other
components too that are part of the Hadoop ecosystem.
Hadoop Ecosystem
HDFS:
• HDFS is the primary component of the Hadoop ecosystem. It is responsible for storing large structured or unstructured data sets across various nodes, and it maintains the metadata describing that data in the form of log files.
• HDFS consists of two core components:
• Name Node
• Data Node
• The Name Node is the prime node. It holds the metadata (data about data) and requires comparatively fewer resources than the Data Nodes, which actually store the data. The Data Nodes are commodity hardware in the distributed environment, which is what makes Hadoop cost-effective.
• HDFS coordinates the clusters and the underlying hardware, and thus works at the heart of the system.
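To make the Name Node / Data Node division concrete, the following is a minimal sketch (not from the original slides) of writing and reading a file through the HDFS Java FileSystem API. The NameNode URI and the file path are assumptions for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally picked up from core-site.xml
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        // The Name Node records the file's metadata; the blocks themselves
        // are written to Data Nodes.
        Path file = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello HDFS");
        }
        System.out.println("File exists: " + fs.exists(file));
        fs.close();
    }
}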
Apache HBase:
• HBase is an open-source database from Apache that runs on a Hadoop cluster. It is a NoSQL database that supports all kinds of data and is thus capable of handling anything stored in a Hadoop database. It provides the capabilities of Google’s BigTable and is therefore able to work effectively on big data sets.
• When we need to search for or retrieve a small piece of data in a huge database, the request must be processed within a very short span of time. In such cases HBase comes in handy, as it gives us a fault-tolerant way of storing and quickly retrieving that data.
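As a rough illustration of that kind of small, fast lookup, here is a minimal sketch using the HBase Java client (not part of the original slides). The reachable cluster, the "users" table, and its "info" column family are all assumptions.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key "u1", column info:name
            Put put = new Put(Bytes.toBytes("u1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Point lookup by row key -- the kind of quick retrieval described above
            Result result = table.get(new Get(Bytes.toBytes("u1")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}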
YARN:
• Yet Another Resource Negotiator: as the name implies, YARN is the component that helps manage resources across the clusters. In short, it performs scheduling and resource allocation for the Hadoop system. It consists of three major components:
• Resource Manager
• Node Manager
• Application Manager
• The Resource Manager has the privilege of allocating resources for the applications in the system, whereas the Node Managers handle the allocation of resources such as CPU, memory, and bandwidth on each machine and then report back to the Resource Manager. The Application Manager works as an interface between the Resource Manager and the Node Managers and negotiates between the two as required.
MapReduce:
• MapReduce is a method for processing data stored in HDFS.
• MapReduce makes use of two functions, Map() and Reduce(), whose tasks are:
• Map() performs sorting and filtering of the data, organizing it into groups. Map generates key-value pairs as its result, which are later processed by the Reduce() method.
• Reduce(), as the name suggests, performs summarization by aggregating the mapped data. Put simply, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
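The classic word-count job is the usual way to see these two functions working together. Below is a minimal sketch in Java against the Hadoop MapReduce API (not taken from the slides); the driver code that configures and submits the Job is omitted.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map(): split each input line into words and emit a (word, 1) pair per word
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String word : value.toString().split("\\s+")) {
                if (!word.isEmpty()) ctx.write(new Text(word), ONE);
            }
        }
    }

    // Reduce(): for each word, sum all the 1s emitted by the mappers
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }
}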
[Diagram slide: MapReduce]
PIG:
Pig was originally developed by Yahoo. It works with the Pig Latin language, a query-based language similar to SQL.
• It is a platform for structuring the data flow and for processing and analyzing huge data sets.
• Pig does the work of executing the commands, and in the background all the MapReduce activities are taken care of. After processing, Pig stores the result in HDFS.
• The Pig Latin language is specially designed for this framework and runs on the Pig runtime, just the way Java runs on the JVM.
• Pig provides ease of programming and optimization, and hence is a major segment of the Hadoop ecosystem.
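As a rough illustration, the following sketch (not from the slides) submits a short Pig Latin script from Java through the PigServer class; the input path, field layout, and output path are made up for the example.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // Run the script as MapReduce jobs on the cluster
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        pig.registerQuery("logs = LOAD '/user/demo/logs' AS (user:chararray, bytes:long);");
        pig.registerQuery("by_user = GROUP logs BY user;");
        pig.registerQuery("totals = FOREACH by_user GENERATE group, SUM(logs.bytes);");
        // Execution is triggered here; the result is written back to HDFS
        pig.store("totals", "/user/demo/totals");
    }
}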
HIVE:
• With the help of an SQL-like methodology and interface, HIVE performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).
• It is highly scalable, as it allows both real-time and batch processing. Also, all the SQL data types are supported by Hive, which makes query processing easier.
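One common way to issue HQL from a Java program is the Hive JDBC driver talking to HiveServer2. The sketch below is illustrative only: the host, port, credentials, and the "sales" table are assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // HiveServer2 commonly listens on port 10000; adjust for your cluster
        String url = "jdbc:hive2://hive-server:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // HQL looks like SQL; Hive turns it into jobs on the cluster
             ResultSet rs = stmt.executeQuery(
                     "SELECT category, COUNT(*) FROM sales GROUP BY category")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}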
Mahout:
• Mahout brings machine learnability to a system or application. Machine learning, as the name suggests, helps a system develop itself based on patterns, user/environment interaction, or algorithms.
• It provides various libraries and functionalities such as collaborative filtering, clustering, and classification, which are all concepts of machine learning. It allows invoking algorithms as per our need with the help of its own libraries.
Other Components
Apart from all of these, there are some other components that carry out important tasks in order to make Hadoop capable of processing large datasets. They are as follows:

• Solr, Lucene: These two services perform searching and indexing with the help of Java libraries.
• Zookeeper: Coordination and synchronization among the resources and components of Hadoop used to be a significant problem, which led to inconsistency. Zookeeper overcame these problems by providing synchronization (a small client sketch follows this list).
• Oozie: Oozie performs the task of a scheduler: it schedules jobs and binds them together as a single unit.
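For a feel of what that synchronization layer looks like to a client, here is a minimal sketch using the ZooKeeper Java API (not from the slides). The connection string and the znode path are assumptions, and a production client would also handle session events and retries.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
    public static void main(String[] args) throws Exception {
        // Connect to a (hypothetical) ZooKeeper ensemble
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 30000, event -> { });
        // A znode is a small piece of shared state that every component of the
        // cluster observes in the same, consistent order
        zk.create("/demo-config", "v1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        System.out.println(new String(zk.getData("/demo-config", false, null)));
        zk.close();
    }
}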
Hadoop Key Characteristics
• Reliable
• Scalable
• Economical
• Flexible
Apache Spark:
• Apache Hadoop and Apache Spark are two open-source
frameworks you can use to manage and process large volumes of
data for analytics. Organizations must process data at scale and
speed to gain real-time insights for business intelligence.
• Apache Hadoop allows you to cluster multiple computers to
analyze massive datasets in parallel more quickly. Apache Spark
uses in-memory caching and optimized query execution for fast
analytic queries against data of any size.
• Spark is a more advanced technology than Hadoop, as Spark supports artificial intelligence and machine learning (AI/ML) workloads in data processing.
• Spark is best suited for real-time data, whereas Hadoop is best suited for structured data and batch processing, so many companies use Spark and Hadoop together to meet their data analytics goals.
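To contrast with the MapReduce example earlier, here is the same word count as a minimal Spark sketch in Java (not from the slides). The local master setting and the HDFS paths are assumptions; note that intermediate results stay in memory rather than being written to disk between steps.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("hdfs://namenode:9000/user/demo/input.txt");
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);   // aggregation happens in memory
            counts.saveAsTextFile("hdfs://namenode:9000/user/demo/wordcount-output");
        }
    }
}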
Thank you!
