APACHE HADOOP – INTRODUCTION
PRESENTED BY HARIKRISHNAN K
OUR MAIN TOPICS TODAY
What is Hadoop?
What is Big Data?
Introduction to Apache Hadoop
Hadoop Architecture
Conclusion
INTRODUCTION
Apache Hadoop is a core part of the computing
infrastructure for many web companies, such as
Facebook, Amazon, LinkedIn, Twitter, IBM, AOL,
and Alibaba.
Most of the Hadoop framework is written in Java,
some parts are written in C, and the command-line
utilities are written as shell scripts.
WHAT IS HADOOP?
Hadoop is an open-source framework from the Apache
Software Foundation for writing and running
distributed applications that process large
amounts of data.
What is Big Data?
Big data refers to data that is so large, fast, or
complex that it is difficult or impossible to process
using traditional methods.
TRADITIONAL APPROACH
In this approach, an enterprise has a single
computer to store and process big data.
For storage, programmers rely on a database from
the vendor of their choice, such as Oracle or IBM.
The user interacts with the application, which in
turn handles data storage and analysis.
LIMITATION
This approach works fine for applications that
process modest volumes of data that standard
database servers can accommodate, or up to the
limit of the processor handling the data.
But when it comes to dealing with huge, growing
amounts of data, pushing everything through a
single database becomes a bottleneck.
GOOGLE’S SOLUTION
Google solved this problem using an algorithm
called MapReduce.
This algorithm divides a task into small parts,
assigns them to many computers, and collects the
results, which, when integrated, form the final
dataset.
HADOOP
Building on the solution described by Google, Doug
Cutting and his team developed an open-source
project called HADOOP.
Hadoop runs applications using the MapReduce
algorithm, in which the data is processed in
parallel across many nodes.
In short, Hadoop is used to develop applications
that can perform complete statistical analysis on
huge amounts of data.
HADOOP FEATURES
1.Reliability
When machines work together as a single unit and one of
them fails, another machine takes over its responsibility,
so the cluster keeps working in a reliable and
fault-tolerant fashion.
The Hadoop infrastructure has built-in fault-tolerance
features, and hence Hadoop is highly reliable.
2.Economical
Hadoop uses commodity hardware (like your PC or laptop).
For example, in a small Hadoop cluster, all your DataNodes can have ordinary
configurations such as 8-16 GB of RAM, 5-10 TB hard disks, and Xeon processors.
Using hardware-based RAID with Oracle for the same purpose would cost at least
five times as much, so the cost of ownership of a Hadoop-based project is minimized.
A Hadoop environment is also easier to maintain, which keeps it economical.
Also, Hadoop is open-source software, so there is no licensing cost.
3.Scalability
Hadoop has the built-in capability of integrating seamlessly with cloud-based
services.
So, if you are installing Hadoop in the cloud, you don't need to worry about
scalability: you can procure more hardware and expand your setup within
minutes whenever required.
4.Distributed Processing
In Hadoop, any job submitted by the client is divided into a number of
sub-tasks.
These sub-tasks are independent of each other, so they execute in
parallel, giving high throughput.
5.Distributed Storage
Hadoop splits each file into a number of blocks.
These blocks are stored in a distributed fashion across the cluster of machines.
HADOOP ARCHITECTURE
HIVE
Hive is a data warehouse system used for querying
and analyzing large datasets stored in HDFS.
Hive uses a query language called HiveQL, which is similar
to SQL.
HIVE COMMANDS:
Data Definition Language (DDL)
DDL statements are used to build and modify the tables and other objects in the database.
Data Manipulation Language (DML)
DML statements are used to retrieve, store, modify, delete, insert and update data in the
database.
Example:
LOAD and INSERT statements.
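As an illustrative sketch (not part of the original slides), the following Java program runs one DDL and one DML statement against a HiveServer2 instance over JDBC; the host, port, table name, and HDFS path are hypothetical placeholders.

// Minimal sketch: submitting HiveQL over JDBC, assuming a reachable HiveServer2
// and the hive-jdbc driver on the classpath.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveQLExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Hypothetical HiveServer2 endpoint and default database.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "", "");
             Statement stmt = conn.createStatement()) {
            // DDL: define a table over raw text lines.
            stmt.execute("CREATE TABLE IF NOT EXISTS logs (line STRING)");
            // DML: load a file already sitting in HDFS into the table (example path).
            stmt.execute("LOAD DATA INPATH '/data/logs.txt' INTO TABLE logs");
        }
    }
}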
MAPREDUCE
MapReduce is a software framework and programming model
used for processing huge amounts of data.
MapReduce programs work in two phases, namely Map and Reduce.
Map tasks deal with splitting and mapping the data, while Reduce
tasks shuffle and reduce the data.
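To make the two phases concrete, here is the classic word-count job written against the Hadoop MapReduce Java API; treat it as a sketch, with the input and output paths supplied as command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: split each input line into words and emit (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: after the shuffle, sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. an HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}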
PIG
Pig is a scripting platform that runs on Hadoop clusters and is designed to process
and analyze large datasets.
Pig is extensible, self-optimizing, and easily programmed.
Programmers can use Pig to write data transformations without knowing Java.
Pig accepts both structured and unstructured data as input for its
analytics and uses HDFS to store the results.
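As a hedged sketch of how Pig Latin can be driven from Java, the snippet below uses Pig's PigServer class in local mode to count words in a hypothetical input.txt; on a real cluster the exec type and paths would point at HDFS.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigWordCount {
    public static void main(String[] args) throws Exception {
        // Local mode for illustration; a cluster exec type would run against Hadoop.
        PigServer pig = new PigServer(ExecType.LOCAL);
        // Hypothetical input file with one text line per record.
        pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
        // Materialize the result to a directory (on HDFS when running against a cluster).
        pig.store("counts", "word_counts");
    }
}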
APACHE SQOOP
Apache Sqoop is a tool in the Hadoop ecosystem designed to transfer data
between HDFS (Hadoop storage) and relational database servers such as MySQL,
Oracle, SQLite, Teradata, Netezza, Postgres, etc.
Apache Sqoop imports data from relational databases into HDFS and exports
data from HDFS to relational databases.
Sqoop is also used to import data from external datastores into Hadoop ecosystem
tools such as Hive and HBase.
ZOOKEEPER
ZooKeeper is a distributed coordination service used to manage a large set of hosts.
Coordinating and managing a service in a distributed environment is a complicated
process.
ZooKeeper solves this problem with its simple architecture and API.
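The sketch below (an illustration, not from the slides) uses the ZooKeeper Java client to write and then read a small coordination znode; the connection string and znode path are hypothetical.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
    public static void main(String[] args) throws Exception {
        // Connect to a (hypothetical) ZooKeeper ensemble; the watcher lambda ignores events.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });
        // Publish a small piece of shared configuration as a persistent znode
        // (fails if the znode already exists; fine for a one-off sketch).
        String path = zk.create("/app-config", "v1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        // Any host in the cluster can read the same znode to coordinate.
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data));
        zk.close();
    }
}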
APACHE SPARK
Apache Spark is a fast, in-memory data processing engine suitable for
use in a wide range of circumstances.
Spark can be deployed in several ways.
It offers APIs in the Java, Python, Scala, and R programming languages and
supports SQL, streaming data, machine learning, and graph processing.
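As an illustrative Java sketch (the file name and schema are hypothetical), the snippet below starts a local Spark session, loads a JSON file into a DataFrame, and queries it with Spark SQL.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkExample {
    public static void main(String[] args) {
        // Local-mode session for illustration; on a cluster the master URL would differ.
        SparkSession spark = SparkSession.builder()
                .appName("spark-sketch")
                .master("local[*]")
                .getOrCreate();
        // Read a (hypothetical) JSON file into an in-memory DataFrame and query it with SQL.
        Dataset<Row> events = spark.read().json("events.json");
        events.createOrReplaceTempView("events");
        Dataset<Row> counts = spark.sql("SELECT type, COUNT(*) AS n FROM events GROUP BY type");
        counts.show();
        spark.stop();
    }
}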
APACHE MAHOUT
Mahout performs collaborative filtering, clustering, and classification.
Mahout provides a command line to invoke various algorithms.
It ships with a predefined set of libraries that already contain built-in
algorithms for different use cases.
COLLABORATIVE FILTERING
Mahout mines user behaviors, patterns, and characteristics,
and based on them it predicts and makes recommendations to
users.
The typical use case is an e-commerce website.
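A minimal sketch of user-based collaborative filtering with Mahout's Taste API; ratings.csv is a hypothetical file of userID,itemID,preference rows, and the user ID and neighborhood size are placeholders.

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical preference data: userID,itemID,preference per line.
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        // Recommend 3 items for user 1 based on the behavior of similar users.
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}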
CLUSTERING
Clustering organizes similar data into groups; for example, articles can be
grouped into blogs, news, research papers, etc.
CLASSIFICATION
Classification means sorting data into various categories; for example,
articles can be categorized into blogs, news, essays, research papers,
and other categories.
APACHE DRILL
Apache Drill is used to drill into any kind of data.
It is an open-source application that works in a distributed
environment to analyze large datasets.
It supports different kinds of NoSQL databases and file systems, which is
a powerful feature of Drill.
Examples: Azure Blob Storage, Google Cloud Storage, HBase,
MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Swift, NAS, and
local files.
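As a hedged sketch, the snippet below queries Drill over JDBC, assuming a Drill instance running in embedded mode on the local machine and that its bundled classpath (cp) storage plugin with the sample employee.json dataset is available.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillExample {
    public static void main(String[] args) throws Exception {
        // "zk=local" targets a Drill instance running in embedded mode on this machine.
        try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=local");
             Statement stmt = conn.createStatement();
             // Query a JSON file exposed through Drill's classpath storage plugin.
             ResultSet rs = stmt.executeQuery(
                     "SELECT full_name FROM cp.`employee.json` LIMIT 3")) {
            while (rs.next()) {
                System.out.println(rs.getString("full_name"));
            }
        }
    }
}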
APACHE OOZIE
Apache Oozie acts as a clock and alarm service inside the Hadoop ecosystem.
For Hadoop jobs, Oozie is essentially a scheduler.
It schedules Hadoop jobs and binds them together as one logical unit of work.
There are two kinds of Oozie jobs:
1. Oozie workflow:
A sequential set of actions to be executed.
2. Oozie coordinator:
Oozie jobs that are triggered when the data they depend on becomes
available.
APACHE FLUME
Flume is a service that helps ingest unstructured and semi-structured
data into HDFS.
It gives us a reliable, distributed solution for collecting,
aggregating, and moving large amounts of data.
It helps us ingest online streaming data from various sources, such as network traffic,
social media, email messages, and log files, into HDFS.
APACHE FLUME ARCHITECTURE
A Flume agent ingests streaming data from various data sources into HDFS.
The Flume agent has three components: source, channel, and sink.
1. Source: accepts the data from the incoming stream and stores it in
the channel.
2. Channel: acts as the local or primary storage; a channel is temporary
storage between the source of the data and its persistent storage in HDFS.
3. Sink: collects the data from the channel and commits or
writes the data to HDFS permanently.
HADOOP DISTRIBUTED FILE SYSTEM
HDFS makes it possible to store different types of large datasets (i.e. structured, unstructured,
and semi-structured data).
HDFS has two core components: the NameNode, which stores the file-system metadata, and the
DataNodes, which store the actual data blocks.
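To show the HDFS client API in action, here is a small Java sketch that writes a file through org.apache.hadoop.fs.FileSystem; the target path is hypothetical and the configuration is picked up from the cluster's core-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from the cluster configuration; the path below is hypothetical.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/hello.txt");
        // Write a small file; HDFS splits larger files into blocks and
        // replicates them across DataNodes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("hello from HDFS");
        }
        // Report how large the stored file is.
        System.out.println("Wrote " + fs.getFileStatus(path).getLen() + " bytes to " + path);
    }
}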
YARN
YARN comprises two major components: the ResourceManager and the NodeManager.
The ResourceManager allocates cluster resources among applications, while a NodeManager
runs on each worker node and manages the containers running there.
CONCLUSION
When big companies like Facebook, IBM, and Yahoo were struggling to find a way to deal
with voluminous data, Hadoop offered a practical solution.
Apache Hadoop has become an essential tool for tackling big data. As the world turns
digital, we will come across more and more data and will need simpler ways to
handle growing big data.
