Big Data Analytics

UNIT 5: HADOOP ECOSYSTEM

5.1 Overview and comparison of different ecosystem tools like Apache Flume, Pig, Sqoop, Hive,
HBase, Storm, Zookeeper, Oozie

The Hadoop Ecosystem is a platform, or framework, consisting of a collection of tools that together
enable Big Data management. These include:

• HDFS -> Hadoop Distributed File System
• YARN -> Yet Another Resource Negotiator
• MapReduce -> Programmatic data processing
• Spark -> In-memory data processing
• Pig, Hive -> Data processing services using SQL-like queries
• HBase -> NoSQL database
• Mahout, Spark MLlib -> Machine learning
• Apache Drill -> SQL on Hadoop
• ZooKeeper -> Cluster management
• Oozie -> Job scheduling
• Flume, Sqoop -> Data ingestion services
• Solr & Lucene -> Searching and indexing
• Ambari -> Cluster provisioning, monitoring, and maintenance

Apache Flume - Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of data. It has a simple and flexible architecture based on
streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many
failover and recovery mechanisms. It uses a simple, extensible data model that allows for online
analytic applications.

Apache Flume is a service which helps in ingesting unstructured and semi-structured data into HDFS
or HBase (a NoSQL database).

It can be used to transport massive quantities of event data, including but not limited to network
traffic data, social-media-generated data, email messages, and virtually any other data source.

A Flume agent is a (JVM) process that hosts the components through which events flow from an
external source to the next destination (hop). An event in Flume is a unit of data flow having a byte
payload and an optional set of string attributes.

2
A Flume source consumes events delivered to it by an external source like a web server. The external
source sends events to Flume in a format that is recognized by the target Flume source. For example,
an Avro Flume source can be used to receive Avro events from Avro clients or other Flume agents in
the flow that send events from an Avro sink. A similar flow can be defined using a Thrift Flume
Source to receive events from a Thrift Sink or a Flume Thrift RPC Client or Thrift clients written in any
language generated from the Flume thrift protocol. When a Flume source receives an event, it stores
it into one or more channels. The channel is a passive store that keeps the event until it’s consumed
by a Flume sink. The file channel is one example – it is backed by the local filesystem. The sink
removes the event from the channel and puts it into an external repository like HDFS (via Flume
HDFS sink) or forwards it to the Flume source of the next Flume agent (next hop) in the flow. The
source and sink within the given agent run asynchronously with the events staged in the channel.
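
As an illustration, here is a minimal sketch of a Flume agent configuration wiring a netcat source to
an HDFS sink through a file channel. The agent name (a1), host, port, and HDFS path are assumptions
for illustration only:

    # Name the components of agent a1 (all names and addresses are hypothetical)
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # Source: receive newline-terminated events over TCP
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44444
    a1.sources.r1.channels = c1

    # Channel: file channel backed by the local filesystem, as described above
    a1.channels.c1.type = file

    # Sink: drain events from the channel into HDFS
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events
    a1.sinks.k1.channel = c1

The agent would then be started with the standard flume-ng command, pointing it at this properties
file and the agent name a1.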

Apache Pig – Apache Pig is a platform for analysing very large data sets. Pig was initially developed
by Yahoo. It provides a platform for building data flows for ETL (Extract, Transform and Load), and
for processing and analysing huge data sets.

Apache Pig has two parts: a language layer and an infrastructure layer. Pig Latin is in the language
layer, and the Pig runtime is in the infrastructure layer, or execution environment.

Pig's language layer currently consists of a textual language called Pig Latin. It has an SQL-like
command structure and is very easy to learn; it does not require extensive programming skills. For
example, ten lines of Pig Latin can be sufficient to express what would take a hundred lines of
MapReduce Java code. Internally, the Pig Latin compiler converts Pig Latin code into a sequential
set of MapReduce jobs, and this happens in the background. Pig has the following key
properties:

• Ease of programming. It is trivial to achieve parallel execution of simple data analysis tasks.
Complex tasks comprised of multiple interrelated data transformations are explicitly encoded
as data flow sequences, making them easy to write, understand, and maintain.

• Optimization opportunities. The system optimizes the execution of tasks automatically.

• Extensibility. Users can create their own functions to do special-purpose processing.

Pig's infrastructure layer (the runtime environment) consists of a compiler that produces sequences
of Map-Reduce programs or jobs, for which large-scale parallel implementations already exist (e.g.,
the Hadoop subproject).

In Apache Pig, the LOAD command first loads the data. We then apply various operations to it, such
as grouping, filtering, joining, and sorting. At the end, the data can either be displayed on the
screen or stored in HDFS, as in the sketch below.
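
The following is a minimal Pig Latin sketch of that flow. The input path, schema, and field names
are assumptions:

    -- Load comma-separated records from HDFS (path and schema are assumed)
    logs = LOAD 'input/clicks.csv' USING PigStorage(',')
           AS (user:chararray, url:chararray, clicks:int);

    -- Filter out rows with no clicks
    active = FILTER logs BY clicks > 0;

    -- Group by user and total their clicks
    grouped = GROUP active BY user;
    totals = FOREACH grouped GENERATE group AS user, SUM(active.clicks) AS total_clicks;

    -- Either display the result on screen or store it back into HDFS
    DUMP totals;
    STORE totals INTO 'output/click_totals';

Behind the scenes, the Pig Latin compiler turns this script into one or more MapReduce jobs.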

Apache Sqoop – Sqoop provides a data ingestion service. It is a tool designed for efficiently
transferring bulk data between Apache Hadoop and structured datastores such as relational
databases.

There is a key difference between Flume and Sqoop. Flume ingests unstructured data or semi-
structured data into HDFS. It does not export data from HDFS. Sqoop can import structured data
from RDBMS or Enterprise data warehouses to HDFS. In addition, it can export data from HDFS into
RDBMS or Enterprise data warehouses.
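
For example, a typical import and export pair might look like the following commands. The
connection string, credentials, table names, and HDFS paths are assumptions:

    # Import a relational table into HDFS (connection details are hypothetical)
    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username analyst -P \
      --table orders \
      --target-dir /user/hadoop/orders

    # Export processed results from HDFS back into an RDBMS table
    sqoop export \
      --connect jdbc:mysql://dbhost/sales \
      --username analyst -P \
      --table order_summaries \
      --export-dir /user/hadoop/order_summaries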

Apache Hive – Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a
massive scale and facilitates reading, writing, and managing petabytes of data residing in distributed
storage using SQL.

The query language of Hive is called Hive Query Language (HQL), which is very similar to SQL. Hive
has two basic components: the Hive command line and the JDBC/ODBC driver. The Hive command line
interface is used to execute HQL commands. Java Database Connectivity (JDBC) and Open Database
Connectivity (ODBC) drivers are used to establish connections between Hive and client applications.

Hive is highly scalable. It can be used for large data set processing (i.e. batch processing) and
for interactive query processing. It supports all primitive data types of SQL. Tasks can be
performed using predefined functions and/or user defined functions (UDFs).
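
As a sketch, a Java client can connect to HiveServer2 through the JDBC driver mentioned above and
run HQL. The host, database, and table names here are assumptions, and the Hive JDBC driver
(org.apache.hive.jdbc.HiveDriver) is assumed to be on the classpath:

    // Minimal HiveServer2 JDBC sketch; connection details and table are hypothetical
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
        public static void main(String[] args) throws Exception {
            // HiveServer2 typically listens on port 10000
            Connection con = DriverManager.getConnection(
                    "jdbc:hive2://hiveserver:10000/default", "user", "");
            Statement stmt = con.createStatement();
            // HQL reads much like SQL
            ResultSet rs = stmt.executeQuery(
                    "SELECT category, COUNT(*) FROM products GROUP BY category");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
            con.close();
        }
    }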

Apache HBase – HBase is a type of NoSQL database. It is not an RDBMS and does not support SQL as
its primary access language. HBase is very much a distributed database. Being a NoSQL database,
HBase lacks many of the features found in an RDBMS, such as typed columns, secondary indexes,
triggers, and advanced query languages.

One can store data in HDFS either directly or through HBase.

HBase is a distributed, column-oriented database built on top of the Hadoop file system. It provides
fast record lookups (and updates) for large tables. It is based on a data model similar to Google's
Bigtable, designed to provide quick random access to huge amounts of structured data. It leverages
the fault tolerance provided by the Hadoop Distributed File System (HDFS), and as part of the Hadoop
ecosystem it provides random, real-time read/write access to data in HDFS.

By default, Hadoop can perform only batch processing, and data can be accessed only in a sequential
manner, which is time consuming. HBase enables random access to data: it sits on top of the Hadoop
File System and provides read and write access, as in the sketch below.
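
Here is a minimal sketch of random reads and writes using the HBase Java client API. The table name
('users'), column family ('info'), and row key are assumptions:

    // Hypothetical table 'users' with column family 'info'
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("users"))) {
                // Random write: put one cell into row "user123"
                Put put = new Put(Bytes.toBytes("user123"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"),
                        Bytes.toBytes("Pune"));
                table.put(put);

                // Random read: fetch the row back directly by key
                Get get = new Get(Bytes.toBytes("user123"));
                Result result = table.get(get);
                System.out.println(Bytes.toString(
                        result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"))));
            }
        }
    }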

Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase.

Apache Storm – Apache Storm is a distributed real-time big data processing system. It is designed to
process vast amounts of data in a fault-tolerant and horizontally scalable manner. It is a streaming
data framework capable of very high ingestion rates. Although Storm is stateless, it manages the
distributed environment and cluster state via Apache ZooKeeper. It is simple, and you can execute
all kinds of manipulations on real-time data in parallel.

Apache Storm continues to be a leader in real-time data analytics. Storm is easy to set up and
operate, and it guarantees that every message will be processed through the topology at least once.

Hadoop performs batch processing, whereas Storm performs real-time data processing. Hadoop is
stateful; Storm is stateless.

A Storm streaming process can handle tens of thousands of messages per second on a cluster. In
Hadoop, MapReduce jobs are executed in a sequential order and complete eventually.

Many organizations use Storm as an integral part of their system. For example, Twitter uses
Apache Storm for its range of Publisher Analytics products, which process all the tweets and clicks
on the Twitter platform. Apache Storm is deeply integrated with Twitter infrastructure.

Storm is used for real-time fraud detection in banking systems and for dynamic pricing in retail
systems.

Storm provides the following benefits:

1. Storm is open source, robust, and user friendly. It could be utilized in small companies as
well as large corporations.

2. Storm is fault tolerant, flexible, reliable, and supports any programming language.

3. Storm is very fast even under increasing load because it is highly scalable - resources can be
added linearly.

4. Storm provides guaranteed data processing even if any of the connected nodes in the cluster
die or messages are lost.
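
As a sketch, a simple Storm topology is wired together with the Java API (Storm 2.x packages). Here
SentenceSpout and CountBolt are hypothetical user-defined classes, and the parallelism hints are
arbitrary:

    // Minimal topology sketch; SentenceSpout and CountBolt are assumed to exist
    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.topology.TopologyBuilder;

    public class WordCountTopology {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            // The spout emits an unbounded stream of tuples (e.g., sentences)
            builder.setSpout("sentences", new SentenceSpout(), 2);
            // The bolt consumes the stream; shuffleGrouping spreads tuples randomly
            builder.setBolt("counts", new CountBolt(), 4)
                   .shuffleGrouping("sentences");

            // Run in-process for local testing; a production job would use StormSubmitter
            try (LocalCluster cluster = new LocalCluster()) {
                cluster.submitTopology("word-count", new Config(),
                        builder.createTopology());
                Thread.sleep(10_000);
            }
        }
    }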

Apache ZooKeeper – Apache ZooKeeper is a centralized service for maintaining configuration
information, naming, providing distributed synchronization, and providing group services. All of these
kinds of services are used in some form or another by distributed applications. Co-ordinating and
managing a service in a distributed environment is a complicated process. ZooKeeper solves this
issue with its simple architecture and API. ZooKeeper allows developers to focus on core application
logic without worrying about the distributed nature of the application.

For example, in HBase, ZooKeeper acts as a centralized monitoring server that maintains configuration
information and provides distributed synchronization. Distributed synchronization here means
coordinating the distributed applications running across the cluster, with ZooKeeper responsible for
providing coordination services between nodes. In addition, Apache HBase uses ZooKeeper to track the
status of distributed data.

In other words, ZooKeeper is a distributed co-ordination service to manage a large set of hosts.

The ZooKeeper framework was originally built at Yahoo for accessing their applications in an easy
and robust manner. Later, Apache ZooKeeper became a standard for organized service used by
Hadoop, HBase, and other distributed frameworks.
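
Here is a minimal sketch with the ZooKeeper Java client, creating a znode that holds a piece of
configuration and reading it back. The connection string and znode path are assumptions:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkExample {
        public static void main(String[] args) throws Exception {
            // Connect to a ZooKeeper ensemble (host and port are hypothetical);
            // a production client would wait for the connected event first
            ZooKeeper zk = new ZooKeeper("zkhost:2181", 3000, event -> {});

            // Store a small piece of configuration under a znode
            zk.create("/demo-config", "v1".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // Any client connected to the ensemble now sees the same value
            byte[] data = zk.getData("/demo-config", false, null);
            System.out.println(new String(data));
            zk.close();
        }
    }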

Apache Oozie – Apache Oozie is an open-source tool based on Java technology that simplifies the
process of creating workflows and managing coordination among jobs. One advantage of the Oozie
framework is that it is fully integrated with the Apache Hadoop stack and supports Hadoop jobs for
Apache MapReduce, Pig, Hive, and Sqoop. In addition, it can be used to schedule jobs specific to a
system, such as Java programs.

In addition, Apache Oozie provides a clock and alarm service inside the Hadoop Ecosystem. For Hadoop
jobs, Oozie is essentially a scheduler: it schedules Hadoop jobs and binds them together as one
logical unit of work.

There are two kinds of Oozie jobs:

1. Oozie workflow: a sequential set of actions to be executed. You can think of it as a relay
race, where each athlete waits for the previous one to complete their part.

2. Oozie Coordinator: Oozie jobs that are triggered when data becomes available. Think of this as
the stimulus-response system in our body: just as we respond to an external stimulus, an Oozie
coordinator responds to the availability of data and rests otherwise.
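
As a sketch, an Oozie workflow is defined in an XML file (workflow.xml). The application name, the
Pig script, and the parameterized paths below are assumptions:

    <!-- Minimal workflow sketch running a single (hypothetical) Pig action -->
    <workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.5">
        <start to="pig-node"/>
        <action name="pig-node">
            <pig>
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <script>clicks.pig</script>
            </pig>
            <ok to="end"/>
            <error to="fail"/>
        </action>
        <kill name="fail">
            <message>Pig action failed</message>
        </kill>
        <end name="end"/>
    </workflow-app>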

