An Overview on Big Data and Hadoop
MapReduce [18] is a framework pioneered by Google for processing large amounts of data in a distributed environment. Due to the simplicity of its programming model and its run-time tolerance for node failures, MapReduce is widely used by companies such as Facebook [19], the New York Times [20], etc.

MapReduce enables distributed processing of large data sets: the data is not stored in one place, and input and output do not happen on a single machine; instead, the data is stored across a cluster of commodity computers and processed with a simple programming model called MapReduce.

When MapReduce runs across the clusters of computers, the framework takes care of synchronizing the data, deciding where the data should be read from, and reading it; all of this is taken care of by the system. If a job is running, it may have to retrieve data from multiple nodes of the cluster, and all of this is handled by the Hadoop framework. This simple programming model is called MapReduce, and it is an open-source approach to data management.

With data spread over many machines, the time to read the data is very high because it is read from different machines [22]. HDFS allows users to put, get and delete files, and it follows a write-once, read-many policy. For example, when a picture is uploaded to Facebook, it can be seen whenever the user wants to see it, even after five years; the user does not write it again, but uploads the file once and reads it multiple times.

In this diagram two things are important to note. The name node is nothing but the admin node, in a master-slave type of configuration: there is the name node (or admin node), which is the master; then there are the slaves, which are called data nodes; there is a job tracker, which is associated with the name node; and there is a task tracker, which is associated with each data node.

Basically, an HDFS cluster is the name given to the whole configuration of masters and slaves where the data is stored, and the MapReduce engine is the programming model used to retrieve and analyze the data.
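To make the put/get/delete and write-once/read-many behaviour concrete, here is a minimal sketch using the Hadoop FileSystem Java API; the class name, path and file contents are illustrative and not taken from the paper.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutGetDelete {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/picture.jpg");  // illustrative path

        // "put": the file is written once
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("example bytes".getBytes("UTF-8"));
        }

        // "get": it can then be read any number of times
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println("first byte: " + in.read());
        }

        // "delete": remove the file when it is no longer needed
        fs.delete(file, false);
        fs.close();
    }
}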
Each Datanode sends a heartbeat to the Namenode so that the Namenode knows the node is working and users can request read and write operations. The heartbeat also helps the Namenode detect connectivity with its Datanodes. If the Namenode does not receive a heartbeat from a Datanode within the configured period of time, it marks the node as down. Data blocks stored on this node are then considered lost, and the Namenode automatically replicates the blocks of the lost node onto other Datanodes [23].

The Datanode is the place where the actual data, structured or unstructured, is stored. The Namenode is the node which contains the metadata about the Datanodes: any information regarding the data and where it is available is held in the Namenode.
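As a concrete illustration of the "configured period of time" mentioned above: in stock Hadoop the dead-node timeout is usually derived from two hdfs-default.xml settings. The property names, default values and formula below are assumptions based on common Hadoop defaults, not taken from this paper, and may differ between versions.

public class HeartbeatTimeout {
    public static void main(String[] args) {
        long heartbeatIntervalMs = 3_000;    // dfs.heartbeat.interval: 3 s default (assumed)
        long recheckIntervalMs = 300_000;    // dfs.namenode.heartbeat.recheck-interval: 5 min default (assumed)
        // Common rule of thumb: a Datanode is marked down after 2 * recheck + 10 * heartbeat intervals
        long deadNodeTimeoutMs = 2 * recheckIntervalMs + 10 * heartbeatIntervalMs;
        System.out.println("Datanode marked down after ~" + deadNodeTimeoutMs / 60000.0 + " minutes"); // ~10.5
    }
}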
YARN is a complete framework: it provides a Resource Manager, which resides on the Namenode and takes care of overall resource management, and a Node Manager, which resides on each Datanode and manages the data and the jobs running inside that Datanode [24].

Fig. 4.2 HDFS cluster and YARN

One side is the HDFS side, which is the storage side, and the other is the YARN side, which is the framework side. Under the Namenode, the Datanodes and the Namenode interact with the Node Manager, which is where the program runs on the Datanode. The Node Manager runs on the Datanode, and the Resource Manager interacts with the Node Managers to decide which job is allocated to which Node Manager; the job then runs on that Datanode, and the result is signalled back to the Resource Manager [25].

5. HDFS ARCHITECTURE
The client is the application software which runs on the user's machine and is used to interact with the Datanodes and the Namenode. The client reads and writes data from the Datanodes [23].

Fig. 5.1 HDFS Structure. Source: http://hadoop.apache.org

The figure contains the Namenode and the client. There is also something called a rack: a rack is a storage area where multiple Datanodes are physically put together, so a rack is a physical collection of Datanodes stored at a single location, and there can be Datanodes in different places. The diagram shows the client interacting with the Namenode. There is also the term replication: to protect the data through fault tolerance, the Hadoop system keeps the same data in multiple copies, replicated on multiple Datanodes, and the minimum number of replicas required by HDFS can be defined by the user. By default, when data is written to HDFS it is replicated onto three different Datanodes. Finally, there are block operations (Block Ops): operations are performed on blocks, and the default size of each block in HDFS is 64 MB.
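The replication factor and block size described above are ordinary client-side settings. The sketch below shows how they could be set through the Java Configuration API; the property names (dfs.replication, dfs.blocksize) are the ones used in stock hdfs-default.xml and are an assumption here, as are the class name and file path.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSettings {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3);                  // default: three replicas per block
        conf.setLong("dfs.blocksize", 64L * 1024 * 1024);   // 64 MB blocks (older releases; newer defaults are larger)
        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/blocks.dat"))) {
            out.writeUTF("data split into 64 MB blocks, three copies each");
        }
        fs.close();
    }
}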
Metadata is the data about the data: where the blocks are stored, which racks are available, and which Datanodes are available on which rack; it is essentially the block map of the Namenode. On the Namenode this metadata is kept in RAM for fast access, while the actual data is stored on the Datanodes.

5.1 Job Tracker and Task Tracker
MapReduce is the programming model used to retrieve and analyze the data, and the client is the application used to interact with both the Namenode and the Datanodes, which means it interacts with the Job Tracker and the Task Trackers. A client is application software running on the user's machine which is used to interact with the cluster, give commands and look at the status of the Job Tracker or Task Trackers; the interaction between the user and the Namenode and Datanodes is done through this client, which is also called the HDFS client [26].
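To show the client side of this interaction, here is a minimal sketch of a MapReduce driver that a client would run to submit a job; the class names and paths are illustrative (not from the paper), and the mapper/reducer pair it refers to is sketched in Section 6 below.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");   // the client submits this job to the cluster
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);        // sketched in Section 6
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // waitForCompletion submits the job and polls its status until it finishes
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}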
6. HADOOP ECOSYSTEM
The Hadoop ecosystem consists of tools for data analysis, moving large amounts of unstructured and structured data, data processing, querying data, storing data, and other similar data-oriented processes. These utilities each serve a unique purpose and are geared toward different tasks completed through, or user roles interacting with, Hadoop [28].

So Hadoop is not one tool; it is an ecosystem and a framework, and the following are the various components in Hadoop.

The place where data actually gets into Hadoop is HDFS, the Hadoop Distributed File System, which is nothing but the file system that sits on top of the cluster. All the data is stored in the Hadoop Distributed File System, or HDFS. How, then, is data moved into Hadoop?

Flume is a framework for harvesting, aggregating and moving huge amounts of log data or text files in and out of Hadoop. Of the multiple ways to bring data in, Flume is the tool mostly used for moving unstructured or semi-structured data into Hadoop; it is not intended for structured data. Anything coming from the web, such as Facebook, LinkedIn, Twitter or any other social media, is loaded through Flume, which acts as the channel through which the data is sent into HDFS. For data coming from an RDBMS such as MySQL, or through a connector for Oracle, the tool used is Sqoop.

Apache Sqoop efficiently transfers bulk data between Apache Hadoop and structured data stores such as relational databases. Sqoop imports individual tables or complete datasets into HDFS, and it can also be used to extract data from Hadoop and export it into external structured data stores. Sqoop works with relational databases such as Teradata, Oracle, MySQL, etc.

The full form of Sqoop is SQL-to-Hadoop: structured data is moved into Hadoop with Sqoop, and connectors for the various RDBMS products are available on top of it. Now assume this is your cluster: on top of this cluster there has to be a system or framework that does resource management, because to run a job or program on Hadoop something must decide where the job should run, where the data is available, where it is stored and where data should be moved. All of this is done by a framework called YARN, which stands for 'Yet Another Resource Negotiator'; it is the framework that manages the cluster's resources. On top of Hadoop there is then the MapReduce framework, which is used to process jobs: programs are written in MapReduce, the program is broken up and sent to the different nodes of the cluster where the actual data resides, and the results are collected and brought back.
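The following is a minimal sketch of such a MapReduce program: the classic word count, written against the Hadoop MapReduce Java API. The class names are illustrative and pair with the driver sketched in Section 5.1; each map task processes a split of the input stored near it and the reducers aggregate the results.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit (word, 1) for every token in the input line
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // The framework groups all counts for a word; the reducer sums them
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}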
Hive is a data warehousing package built on top of Hadoop that is used for complex data analysis and exploration. Hive essentially uses MapReduce internally, but it gives flexibility to those who are not from a programming background; it was developed by Facebook.

Pig is an open-source, high-level dataflow system that sits on top of the Hadoop framework and can read data from HDFS for analysis [29]. Its language, Pig Latin, was developed at Yahoo, and it is mostly used for data analysis.

Mahout is a machine learning tool: a scalable machine learning library that implements various approaches to machine learning (a small recommendation example is sketched after the list below). At present Mahout contains four main groups of algorithms:

1. Recommendations, also known as collaborative filtering
2. Classification, also known as categorization
3. Clustering
4. Frequent itemset mining, also known as parallel frequent pattern mining
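As an illustration of the first group (recommendations), here is a small sketch using Mahout's Taste collaborative-filtering API. The CSV file of userID,itemID,preference triples and all names are illustrative, and the classes assume the Mahout 0.x Taste library rather than anything described in this paper.

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class MahoutRecommenderSketch {
    public static void main(String[] args) throws Exception {
        // ratings.csv: lines of "userID,itemID,preference" (illustrative file)
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top three recommendations for user 1
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}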
There are also other frameworks, for graph jobs and for running real-time jobs; in Hadoop 1.0 only MapReduce was supported, but with YARN many kinds of tasks can now run. Then there is a NoSQL database called HBase. Apache HBase is a distributed, column-oriented, NoSQL database built on top of Hadoop that uses HDFS for its underlying storage. As said earlier, HDFS works on a write-once, read-many-times pattern, but this is not always enough: we may require real-time read/write random access to a huge dataset, and this is where HBase comes into the picture, storing data by column. Finally, there is a tool called Apache Oozie, which is used for workflow management: when, for example, thousands of jobs are running and the workflow has to be managed, this tool is used.
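To illustrate the random read/write access that HBase adds on top of HDFS, here is a minimal sketch using the HBase 1.x Java client; the table name, column family and values are illustrative, and the table is assumed to already exist with a column family named info.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccess {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Random write: one cell in column family "info"
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Random read of the same row, without rewriting a whole file as HDFS would require
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}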
7. CONCLUSION
This paper has presented an overview and a basic introductory discussion that is most useful for researchers who are joining the field as beginners. A time has begun in which the world is capable of generating data in terabytes and petabytes every day, every hour and every minute. Big data and Hadoop, together with the related open-source tools that are still in their developing stages, provide many facilities for this technology and will become even more important in the future. In recent years, fields such as social media, government projects and sensor data have all given more importance to big data and its technologies. This paper also throws some light on other emerging big data technologies.

Although this paper clearly has not resolved and covered all the points, researchers can extend this topic according to their needs in the subject of big data.
Researchers can frame a framework and use this overview for their basic research; they can also take this paper and extend the work towards reading and writing data in Hadoop.

8. REFERENCES
[1] https://opensource.com/resources/big-data
[15] http://searchcloudcomputing.techtarget.com/definition/big-data-Big-Data
[16] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler, "The Hadoop Distributed File System". In IEEE, Contemporary Computing (IC3), Sixth International Conference, pages 404-409, Noida, 2010.