Big Data Analytics

UNIT 5: HADOOP ECOSYSTEM

5.1 Overview and comparison of different ecosystem tools like Apache Flume, Pig, Sqoop, Hive,
HBase, Storm, Zookeeper, Oozie

The Hadoop Ecosystem is a platform, or framework, consisting of a collection of tools that together
enable Big Data management. These include:

• HDFS -> Hadoop Distributed File System
• YARN -> Yet Another Resource Negotiator
• MapReduce -> Programmatic data processing
• Spark -> In-memory data processing
• Pig, Hive -> Data processing services using SQL-like queries
• HBase -> NoSQL database
• Mahout, Spark MLlib -> Machine learning
• Apache Drill -> SQL on Hadoop
• ZooKeeper -> Cluster management
• Oozie -> Job scheduling
• Flume, Sqoop -> Data ingestion services
• Solr & Lucene -> Searching and indexing
• Ambari -> Cluster provisioning, monitoring, and maintenance

Apache Flume - Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of data. It has a simple and flexible architecture based on
streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many
failover and recovery mechanisms. It uses a simple, extensible data model that allows for online
analytic applications.

Apache Flume is a service which helps in ingesting unstructured and semi-structured data into HDFS
or HBase (a NoSQL database).

It can be used to transport massive quantities of event data, including but not limited to network
traffic data, social-media-generated data, email messages, and virtually any other data source.

A Flume agent is a (JVM) process that hosts the components through which events flow from an
external source to the next destination (hop). An event in Flume is a unit of data flow having a byte
payload and an optional set of string attributes.

2
A Flume source consumes events delivered to it by an external source like a web server. The external
source sends events to Flume in a format that is recognized by the target Flume source. For example,
an Avro Flume source can be used to receive Avro events from Avro clients or other Flume agents in
the flow that send events from an Avro sink. A similar flow can be defined using a Thrift Flume
Source to receive events from a Thrift Sink or a Flume Thrift RPC Client or Thrift clients written in any
language generated from the Flume thrift protocol. When a Flume source receives an event, it stores
it into one or more channels. The channel is a passive store that keeps the event until it’s consumed
by a Flume sink. The file channel is one example – it is backed by the local filesystem. The sink
removes the event from the channel and puts it into an external repository like HDFS (via Flume
HDFS sink) or forwards it to the Flume source of the next Flume agent (next hop) in the flow. The
source and sink within the given agent run asynchronously with the events staged in the channel.
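
As an illustration, here is a minimal sketch of a Flume agent configuration wiring a netcat source to
an HDFS sink through a file channel. The agent name (a1), host, port, and HDFS path are assumptions
for illustration only:

    # Name the components of agent a1 (all names and addresses are hypothetical)
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # Source: receive newline-terminated events over TCP
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44444
    a1.sources.r1.channels = c1

    # Channel: file channel backed by the local filesystem, as described above
    a1.channels.c1.type = file

    # Sink: drain events from the channel into HDFS
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events
    a1.sinks.k1.channel = c1

The agent would then be started with the standard flume-ng command, pointing it at this properties
file and the agent name a1.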

Apache Pig – Apache Pig is a platform for analysing very large data sets. Pig was initially developed
by Yahoo. It provides a platform for building data flows for ETL (Extract, Transform and Load), and
for processing and analysing huge data sets.

Apache Pig has two parts: a language layer and an infrastructure layer. Pig Latin is in the language
layer, and the Pig runtime is in the infrastructure layer, or execution environment.

Pig's language layer currently consists of a textual language called Pig Latin. It has an SQL-like
command structure and is very easy to learn; it does not require extensive programming skills. For
example, ten lines of Pig Latin can be sufficient to express what would take a hundred lines of
MapReduce Java code. Internally, the Pig Latin compiler converts Pig Latin code into a sequential
set of MapReduce jobs, and this happens in the background. Pig has the following key
properties:

• Ease of programming. It is trivial to achieve parallel execution of simple data analysis tasks.
Complex tasks comprised of multiple interrelated data transformations are explicitly encoded
as data flow sequences, making them easy to write, understand, and maintain.

• Optimization opportunities. The system optimizes the execution of tasks automatically.

• Extensibility. Users can create their own functions to do special-purpose processing.

Pig's infrastructure layer (the runtime environment) consists of a compiler that produces sequences
of Map-Reduce programs or jobs, for which large-scale parallel implementations already exist (e.g.,
the Hadoop subproject).

In Apache Pig, the LOAD command first loads the data. We then apply various operations to it, such
as grouping, filtering, joining, and sorting. At the end, the data can either be displayed on the
screen or stored in HDFS, as in the sketch below.
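
The following is a minimal Pig Latin sketch of that flow. The input path, schema, and field names
are assumptions:

    -- Load comma-separated records from HDFS (path and schema are assumed)
    logs = LOAD 'input/clicks.csv' USING PigStorage(',')
           AS (user:chararray, url:chararray, clicks:int);

    -- Filter out rows with no clicks
    active = FILTER logs BY clicks > 0;

    -- Group by user and total their clicks
    grouped = GROUP active BY user;
    totals = FOREACH grouped GENERATE group AS user, SUM(active.clicks) AS total_clicks;

    -- Either display the result on screen or store it back into HDFS
    DUMP totals;
    STORE totals INTO 'output/click_totals';

Behind the scenes, the Pig Latin compiler turns this script into one or more MapReduce jobs.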

Apache Sqoop – Sqoop provides a data ingestion service. It is a tool designed for efficiently
transferring bulk data between Apache Hadoop and structured datastores such as relational
databases.

There is a key difference between Flume and Sqoop. Flume ingests unstructured data or semi-
structured data into HDFS. It does not export data from HDFS. Sqoop can import structured data
from RDBMS or Enterprise data warehouses to HDFS. In addition, it can export data from HDFS into
RDBMS or Enterprise data warehouses.
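
For example, a typical import and export pair might look like the following commands. The
connection string, credentials, table names, and HDFS paths are assumptions:

    # Import a relational table into HDFS (connection details are hypothetical)
    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username analyst -P \
      --table orders \
      --target-dir /user/hadoop/orders

    # Export processed results from HDFS back into an RDBMS table
    sqoop export \
      --connect jdbc:mysql://dbhost/sales \
      --username analyst -P \
      --table order_summaries \
      --export-dir /user/hadoop/order_summaries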

Apache Hive – Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a
massive scale and facilitates reading, writing, and managing petabytes of data residing in distributed
storage using SQL.

The query language of Hive is called Hive Query Language (HQL), which is very similar to SQL. Hive
has two basic components: the Hive command line and the JDBC/ODBC driver. The Hive command line
interface is used to execute HQL commands. Java Database Connectivity (JDBC) and Open Database
Connectivity (ODBC) drivers are used to establish connections between Hive and client applications.

Hive is highly scalable. It can be used for large data set processing (i.e. batch processing) and
for interactive query processing. It supports all primitive data types of SQL. Tasks can be
performed using predefined functions and/or user defined functions (UDFs).
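
As a sketch, a Java client can connect to HiveServer2 through the JDBC driver mentioned above and
run HQL. The host, database, and table names here are assumptions, and the Hive JDBC driver
(org.apache.hive.jdbc.HiveDriver) is assumed to be on the classpath:

    // Minimal HiveServer2 JDBC sketch; connection details and table are hypothetical
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
        public static void main(String[] args) throws Exception {
            // HiveServer2 typically listens on port 10000
            Connection con = DriverManager.getConnection(
                    "jdbc:hive2://hiveserver:10000/default", "user", "");
            Statement stmt = con.createStatement();
            // HQL reads much like SQL
            ResultSet rs = stmt.executeQuery(
                    "SELECT category, COUNT(*) FROM products GROUP BY category");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
            con.close();
        }
    }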

Apache HBase – HBase is a type of NoSQL database. It is not an RDBMS and does not support SQL as
its primary access language. HBase is very much a distributed database. Being a NoSQL database,
HBase lacks many of the features found in an RDBMS, such as typed columns, secondary indexes,
triggers, and advanced query languages.

One can store data in HDFS either directly or through HBase.

HBase is a distributed, column-oriented database built on top of the Hadoop file system. It provides
fast record lookups (and updates) for large tables. It is based on a data model similar to Google's
Bigtable, designed to provide quick random access to huge amounts of structured data. It leverages
the fault tolerance provided by the Hadoop Distributed File System (HDFS), and as part of the Hadoop
ecosystem it provides random, real-time read/write access to data in HDFS.

By default, Hadoop can perform only batch processing, and data can be accessed only in a sequential
manner, which is time consuming. HBase enables random access to data: it sits on top of the Hadoop
File System and provides read and write access, as in the sketch below.
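
Here is a minimal sketch of random reads and writes using the HBase Java client API. The table name
('users'), column family ('info'), and row key are assumptions:

    // Hypothetical table 'users' with column family 'info'
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("users"))) {
                // Random write: put one cell into row "user123"
                Put put = new Put(Bytes.toBytes("user123"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"),
                        Bytes.toBytes("Pune"));
                table.put(put);

                // Random read: fetch the row back directly by key
                Get get = new Get(Bytes.toBytes("user123"));
                Result result = table.get(get);
                System.out.println(Bytes.toString(
                        result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"))));
            }
        }
    }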

Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase.

Apache Storm – Apache Storm is a distributed real-time big data processing system. It is designed to
process vast amounts of data in a fault-tolerant and horizontally scalable manner. It is a streaming
data framework capable of very high ingestion rates. Although Storm is stateless, it manages the
distributed environment and cluster state via Apache ZooKeeper. It is simple, and you can execute
all kinds of manipulations on real-time data in parallel.

Apache Storm continues to be a leader in real-time data analytics. Storm is easy to set up and
operate, and it guarantees that every message will be processed through the topology at least once.

Hadoop performs batch processing, whereas Storm performs real-time data processing. Hadoop is
stateful; Storm is stateless.

A Storm streaming process can handle tens of thousands of messages per second on a cluster. In
Hadoop, MapReduce jobs are executed in a sequential order and complete eventually.

Many organizations use Storm as an integral part of their system. For example, Twitter uses
Apache Storm for its range of Publisher Analytics products, which process all the tweets and clicks
on the Twitter platform. Apache Storm is deeply integrated with Twitter infrastructure.

Storm is used for real-time fraud detection in banking systems and for dynamic pricing in retail
systems.

Storm provides the following benefits:

1. Storm is open source, robust, and user friendly. It could be utilized in small companies as
well as large corporations.

2. Storm is fault tolerant, flexible, reliable, and supports any programming language.

3. Storm is very fast even under increasing load because it is highly scalable - resources can be
added linearly.

4. Storm provides guaranteed data processing even if any of the connected nodes in the cluster
die or messages are lost.
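
As a sketch, a simple Storm topology is wired together with the Java API (Storm 2.x packages). Here
SentenceSpout and CountBolt are hypothetical user-defined classes, and the parallelism hints are
arbitrary:

    // Minimal topology sketch; SentenceSpout and CountBolt are assumed to exist
    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.topology.TopologyBuilder;

    public class WordCountTopology {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            // The spout emits an unbounded stream of tuples (e.g., sentences)
            builder.setSpout("sentences", new SentenceSpout(), 2);
            // The bolt consumes the stream; shuffleGrouping spreads tuples randomly
            builder.setBolt("counts", new CountBolt(), 4)
                   .shuffleGrouping("sentences");

            // Run in-process for local testing; a production job would use StormSubmitter
            try (LocalCluster cluster = new LocalCluster()) {
                cluster.submitTopology("word-count", new Config(),
                        builder.createTopology());
                Thread.sleep(10_000);
            }
        }
    }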

Apache ZooKeeper – Apache ZooKeeper is a centralized service for maintaining configuration
information, naming, providing distributed synchronization, and providing group services. All of these
kinds of services are used in some form or another by distributed applications. Co-ordinating and
managing a service in a distributed environment is a complicated process. ZooKeeper solves this
issue with its simple architecture and API. ZooKeeper allows developers to focus on core application
logic without worrying about the distributed nature of the application.

For example, in HBase, ZooKeeper acts as a centralized monitoring server that maintains configuration
information and provides distributed synchronization. Distributed synchronization here means
coordinating the distributed applications running across the cluster, with ZooKeeper responsible for
providing coordination services between nodes. In addition, Apache HBase uses ZooKeeper to track the
status of distributed data.

In other words, ZooKeeper is a distributed co-ordination service to manage a large set of hosts.

The ZooKeeper framework was originally built at Yahoo for accessing their applications in an easy
and robust manner. Later, Apache ZooKeeper became a standard for organized service used by
Hadoop, HBase, and other distributed frameworks.
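
Here is a minimal sketch with the ZooKeeper Java client, creating a znode that holds a piece of
configuration and reading it back. The connection string and znode path are assumptions:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkExample {
        public static void main(String[] args) throws Exception {
            // Connect to a ZooKeeper ensemble (host and port are hypothetical);
            // a production client would wait for the connected event first
            ZooKeeper zk = new ZooKeeper("zkhost:2181", 3000, event -> {});

            // Store a small piece of configuration under a znode
            zk.create("/demo-config", "v1".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // Any client connected to the ensemble now sees the same value
            byte[] data = zk.getData("/demo-config", false, null);
            System.out.println(new String(data));
            zk.close();
        }
    }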

Apache Oozie – Apache Oozie is an open-source tool based on Java technology that simplifies the
process of creating workflows and managing coordination among jobs. One advantage of the Oozie
framework is that it is fully integrated with the Apache Hadoop stack and supports Hadoop jobs for
Apache MapReduce, Pig, Hive, and Sqoop. In addition, it can be used to schedule jobs specific to a
system, such as Java programs.

In addition, Apache Oozie provides a clock and alarm service inside the Hadoop Ecosystem. For Hadoop
jobs, Oozie is essentially a scheduler: it schedules Hadoop jobs and binds them together as one
logical unit of work.

There are two kinds of Oozie jobs:

1. Oozie workflow: a sequential set of actions to be executed. You can think of it as a relay
race, where each athlete waits for the previous one to complete their part.

2. Oozie Coordinator: Oozie jobs that are triggered when data becomes available. Think of this as
the stimulus-response system in our body: just as we respond to an external stimulus, an Oozie
coordinator responds to the availability of data and rests otherwise.
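
As a sketch, an Oozie workflow is defined in an XML file (workflow.xml). The application name, the
Pig script, and the parameterized paths below are assumptions:

    <!-- Minimal workflow sketch running a single (hypothetical) Pig action -->
    <workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.5">
        <start to="pig-node"/>
        <action name="pig-node">
            <pig>
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <script>clicks.pig</script>
            </pig>
            <ok to="end"/>
            <error to="fail"/>
        </action>
        <kill name="fail">
            <message>Pig action failed</message>
        </kill>
        <end name="end"/>
    </workflow-app>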

