APACHE HADOOP – INTRODUCTION
PRESENTED BY HARIKRISHNAN K
OUR MAIN TOPICS TODAY
What is Hadoop?
What is Big Data?
Introduction to Apache Hadoop
Hadoop Architecture
Conclusion
INTRODUCTION
Apache Hadoop is a core part of the computing
infrastructure for many web companies, such as
Facebook, Amazon, LinkedIn, Twitter, IBM, AOL,
and Alibaba.
Most of the Hadoop framework is written in Java,
some parts are written in C, and the command-line
utilities are written as shell scripts.
WHAT IS HADOOP?
Hadoop is an open-source framework from the Apache
Software Foundation for writing and running
distributed applications that process large
amounts of data.
What is Big Data?
Big data refers to data that is so large, fast, or
complex that it is difficult or impossible to process
using traditional methods.
TRADITIONAL APPROACH
In this approach, an enterprise has a single
computer to store and process big data.
For storage, programmers rely on a database from
the vendor of their choice, such as Oracle or IBM.
The user interacts with the application, which in
turn handles data storage and analysis.
LIMITATION
This approach works fine for applications that
process modest volumes of data that standard
database servers can accommodate, or up to the
limit of the processor handling the data.
But when it comes to dealing with huge, growing
amounts of data, pushing everything through a
single database becomes a bottleneck.
GOOGLE’S SOLUTION
Google solved this problem using an algorithm
called MapReduce.
This algorithm divides a task into small parts,
assigns them to many computers, and collects the
results, which, when integrated, form the final
dataset.
HADOOP
Building on the solution described by Google, Doug
Cutting and his team developed an open-source
project called HADOOP.
Hadoop runs applications using the MapReduce
algorithm, in which the data is processed in
parallel across many nodes.
In short, Hadoop is used to develop applications
that can perform complete statistical analysis on
huge amounts of data.
HADOOP FEATURES
1.Reliability
When machines work together as a single unit and one of
them fails, another machine takes over its responsibility,
so the cluster keeps working in a reliable and
fault-tolerant fashion.
The Hadoop infrastructure has built-in fault-tolerance
features, and hence Hadoop is highly reliable.
2.Economical
Hadoop uses commodity hardware (like your PC or laptop).
For example, in a small Hadoop cluster, all your DataNodes can have ordinary
configurations such as 8-16 GB of RAM, 5-10 TB hard disks, and Xeon processors.
Using hardware-based RAID with Oracle for the same purpose would cost at least
five times as much, so the cost of ownership of a Hadoop-based project is minimized.
A Hadoop environment is also easier to maintain, which keeps it economical.
Also, Hadoop is open-source software, so there is no licensing cost.
3.Scalability
Hadoop has the built-in capability of integrating seamlessly with cloud-based
services.
So, if you are installing Hadoop in the cloud, you don't need to worry about
scalability: you can procure more hardware and expand your setup within
minutes whenever required.
4.Distributed Processing
In Hadoop, any job submitted by the client is divided into a number of
sub-tasks.
These sub-tasks are independent of each other, so they execute in
parallel, giving high throughput.
5.Distributed Storage
Hadoop splits each file into a number of blocks.
These blocks are stored in a distributed fashion across the cluster of machines.
HADOOP ARCHITECTURE
HIVE
Hive is a data warehouse system used for querying
and analyzing large datasets stored in HDFS.
Hive uses a query language called HiveQL, which is similar
to SQL.
HIVE COMMANDS:
Data Definition Language (DDL)
DDL statements are used to build and modify the tables and other objects in the database.
Data Manipulation Language (DML)
DML statements are used to retrieve, store, modify, delete, insert and update data in the
database.
Example:
LOAD and INSERT statements.
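As an illustrative sketch (not part of the original slides), the following Java program runs one DDL and one DML statement against a HiveServer2 instance over JDBC; the host, port, table name, and HDFS path are hypothetical placeholders.

// Minimal sketch: submitting HiveQL over JDBC, assuming a reachable HiveServer2
// and the hive-jdbc driver on the classpath.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveQLExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Hypothetical HiveServer2 endpoint and default database.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "", "");
             Statement stmt = conn.createStatement()) {
            // DDL: define a table over raw text lines.
            stmt.execute("CREATE TABLE IF NOT EXISTS logs (line STRING)");
            // DML: load a file already sitting in HDFS into the table (example path).
            stmt.execute("LOAD DATA INPATH '/data/logs.txt' INTO TABLE logs");
        }
    }
}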
MAPREDUCE
MapReduce is a software framework and programming model
used for processing huge amounts of data.
MapReduce programs work in two phases, namely Map and Reduce.
Map tasks deal with splitting and mapping the data, while Reduce
tasks shuffle and reduce the data.
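To make the two phases concrete, here is the classic word-count job written against the Hadoop MapReduce Java API; treat it as a sketch, with the input and output paths supplied as command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: split each input line into words and emit (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: after the shuffle, sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. an HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}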
PIG
Pig is a scripting platform that runs on Hadoop clusters and is designed to process
and analyze large datasets.
Pig is extensible, self-optimizing, and easily programmed.
Programmers can use Pig to write data transformations without knowing Java.
Pig accepts both structured and unstructured data as input for its
analytics and uses HDFS to store the results.
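As a hedged sketch of how Pig Latin can be driven from Java, the snippet below uses Pig's PigServer class in local mode to count words in a hypothetical input.txt; on a real cluster the exec type and paths would point at HDFS.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigWordCount {
    public static void main(String[] args) throws Exception {
        // Local mode for illustration; a cluster exec type would run against Hadoop.
        PigServer pig = new PigServer(ExecType.LOCAL);
        // Hypothetical input file with one text line per record.
        pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
        // Materialize the result to a directory (on HDFS when running against a cluster).
        pig.store("counts", "word_counts");
    }
}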
APACHE SQOOP
Apache Sqoop is a tool in the Hadoop ecosystem designed to transfer data
between HDFS (Hadoop storage) and relational database servers such as MySQL,
Oracle, SQLite, Teradata, Netezza, Postgres, etc.
Apache Sqoop imports data from relational databases into HDFS and exports
data from HDFS to relational databases.
Sqoop is also used to import data from external datastores into Hadoop ecosystem
tools such as Hive and HBase.
ZOOKEEPER
ZooKeeper is a distributed coordination service used to manage a large set of hosts.
Coordinating and managing a service in a distributed environment is a complicated
process.
ZooKeeper solves this problem with its simple architecture and API.
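The sketch below (an illustration, not from the slides) uses the ZooKeeper Java client to write and then read a small coordination znode; the connection string and znode path are hypothetical.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
    public static void main(String[] args) throws Exception {
        // Connect to a (hypothetical) ZooKeeper ensemble; the watcher lambda ignores events.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });
        // Publish a small piece of shared configuration as a persistent znode
        // (fails if the znode already exists; fine for a one-off sketch).
        String path = zk.create("/app-config", "v1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        // Any host in the cluster can read the same znode to coordinate.
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data));
        zk.close();
    }
}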
APACHE SPARK
Apache Spark is a fast, in-memory data processing engine suitable for
use in a wide range of circumstances.
Spark can be deployed in several ways.
It offers APIs in the Java, Python, Scala, and R programming languages and
supports SQL, streaming data, machine learning, and graph processing.
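As an illustrative Java sketch (the file name and schema are hypothetical), the snippet below starts a local Spark session, loads a JSON file into a DataFrame, and queries it with Spark SQL.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkExample {
    public static void main(String[] args) {
        // Local-mode session for illustration; on a cluster the master URL would differ.
        SparkSession spark = SparkSession.builder()
                .appName("spark-sketch")
                .master("local[*]")
                .getOrCreate();
        // Read a (hypothetical) JSON file into an in-memory DataFrame and query it with SQL.
        Dataset<Row> events = spark.read().json("events.json");
        events.createOrReplaceTempView("events");
        Dataset<Row> counts = spark.sql("SELECT type, COUNT(*) AS n FROM events GROUP BY type");
        counts.show();
        spark.stop();
    }
}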
APACHE MAHOUT
Mahout performs collaborative filtering, clustering, and classification.
Mahout provides a command line to invoke various algorithms.
It ships with a predefined set of libraries that already contain built-in
algorithms for different use cases.
COLLABORATIVE FILTERING
Mahout mines user behaviors, patterns, and characteristics,
and based on them it predicts and makes recommendations to
users.
The typical use case is an e-commerce website.
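A minimal sketch of user-based collaborative filtering with Mahout's Taste API; ratings.csv is a hypothetical file of userID,itemID,preference rows, and the user ID and neighborhood size are placeholders.

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical preference data: userID,itemID,preference per line.
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        // Recommend 3 items for user 1 based on the behavior of similar users.
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}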
CLUSTERING
Clustering organizes similar data into groups; for example, articles can be
grouped into blogs, news, research papers, etc.
CLASSIFICATION
Classification means sorting data into various categories; for example,
articles can be categorized into blogs, news, essays, research papers,
and other categories.
APACHE DRILL
Apache Drill is used to drill into any kind of data.
It is an open-source application that works in a distributed
environment to analyze large datasets.
It supports different kinds of NoSQL databases and file systems, which is
a powerful feature of Drill.
Examples: Azure Blob Storage, Google Cloud Storage, HBase,
MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Swift, NAS, and
local files.
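As a hedged sketch, the snippet below queries Drill over JDBC, assuming a Drill instance running in embedded mode on the local machine and that its bundled classpath (cp) storage plugin with the sample employee.json dataset is available.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillExample {
    public static void main(String[] args) throws Exception {
        // "zk=local" targets a Drill instance running in embedded mode on this machine.
        try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=local");
             Statement stmt = conn.createStatement();
             // Query a JSON file exposed through Drill's classpath storage plugin.
             ResultSet rs = stmt.executeQuery(
                     "SELECT full_name FROM cp.`employee.json` LIMIT 3")) {
            while (rs.next()) {
                System.out.println(rs.getString("full_name"));
            }
        }
    }
}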
APACHE OOZIE
Apache Oozie acts as a clock and alarm service inside the Hadoop ecosystem.
For Hadoop jobs, Oozie is essentially a scheduler.
It schedules Hadoop jobs and binds them together as one logical unit of work.
There are two kinds of Oozie jobs:
1. Oozie workflow:
A sequential set of actions to be executed.
2. Oozie coordinator:
Oozie jobs that are triggered when the data they depend on becomes
available.
APACHE FLUME
Flume is a service that helps ingest unstructured and semi-structured
data into HDFS.
It gives us a reliable, distributed solution for collecting,
aggregating, and moving large amounts of data.
It helps us ingest online streaming data from various sources, such as network traffic,
social media, email messages, and log files, into HDFS.
APACHE FLUME ARCHITECTURE
A Flume agent ingests streaming data from various data sources into HDFS.
The Flume agent has three components: source, channel, and sink.
1. Source: accepts the data from the incoming stream and stores it in
the channel.
2. Channel: acts as the local or primary storage; a channel is temporary
storage between the source of the data and its persistent storage in HDFS.
3. Sink: collects the data from the channel and commits or
writes the data to HDFS permanently.
HADOOP DISTRIBUTED FILE SYSTEM
HDFS makes it possible to store different types of large datasets (i.e. structured, unstructured,
and semi-structured data).
HDFS has two core components: the NameNode, which stores the file-system metadata, and the
DataNodes, which store the actual data blocks.
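To show the HDFS client API in action, here is a small Java sketch that writes a file through org.apache.hadoop.fs.FileSystem; the target path is hypothetical and the configuration is picked up from the cluster's core-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from the cluster configuration; the path below is hypothetical.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/hello.txt");
        // Write a small file; HDFS splits larger files into blocks and
        // replicates them across DataNodes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("hello from HDFS");
        }
        // Report how large the stored file is.
        System.out.println("Wrote " + fs.getFileStatus(path).getLen() + " bytes to " + path);
    }
}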
YARN
YARN comprises two major components: the ResourceManager and the NodeManager.
The ResourceManager allocates cluster resources among applications, while a NodeManager
runs on each worker node and manages the containers running there.
CONCLUSION
When big companies like Facebook, IBM, and Yahoo were struggling to find a way to deal
with voluminous data, Hadoop offered a practical solution.
Apache Hadoop has become an essential tool for tackling big data. As the world turns
digital, we will come across more and more data and will need simpler ways to
handle growing big data.
