

International Journal of Computer Applications (0975 – 8887)
Volume 154 – No.10, November 2016

An overview on Big Data and Hadoop


Shaikh Abdul Hannan
Computer Science and I.T.,
Al-Baha University,
Al-Baha, Saudi Arabia.

ABSTRACT
Everyone talks about big data, but what does big data actually mean? How is it changing the point of view of researchers in science, companies, non-profits, governments, institutions and other organizations that are learning about the world around them through big data? Where does this data come from, how is it processed, and how are the results stored and used for future work? And why is open source so important to answering these questions? This paper discusses all of these points in order to clarify what big data actually means and how it appears in our day-to-day life. In today's 21st century, the most important area is social media, which searches and shares information and generates huge amounts of data every day, so the importance of big data grows as millions and billions of people use these media to share and store information. Nowadays many projects are being developed around social media, sensor data, stock exchange data, transport data, and scientific fields where storing and retrieving data is the most important factor. We therefore need new technology, namely big data and Hadoop, to handle this huge amount of data, which cannot be handled by an RDBMS. Big data has basic, important characteristics such as volume, variety, veracity and velocity, and it handles large amounts of data with management, analysis, storage and processing done very fast within the required time span. This paper discusses the important characteristics of big data, the types of data used in big data, the various sources of big data in our day-to-day life, an introduction to big data and Hadoop, the structure of the Hadoop core components, the roles of the Namenode and Datanode, the functions of the job tracker and task tracker, and the Hadoop Ecosystem in detail.

Keywords
Big Data, Hadoop, HDFS, MapReduce, Hadoop Ecosystem, Namenode, Datanode.

1. INTRODUCTION
Data is a very important factor in every field of today's life. There is no hard and fast rule about exactly what size a database needs to be in order for the data inside of it to be considered "big." Instead, what typically defines big data is the need for new techniques and tools in order to be able to process it. In order to use big data, you need programs which span multiple physical and/or virtual machines working together in concert in order to process all of the data in a reasonable span of time [1]. Big data is the capability to manage a huge volume of disparate data, at the right speed, and within the right time frame to allow real-time analysis and reaction. Big data is an evolving term that describes any amount of structured, semi-structured and unstructured data that has the potential to be mined for information [2].

Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate to deal with them. Challenges include analysis, capture, data creation, search, sharing, storage, transfer, visualization, querying, updating and information privacy. "Big Data refers to the massive amounts of data that collect over time that are difficult to analyze and handle using common database management tools. Big Data includes business transactions, e-mail messages, photos, surveillance videos and activity logs (see machine-generated data). Scientific data from sensors can reach mammoth proportions over time, and Big Data also includes unstructured text posted on the Web, such as blogs and social media." [3]. In other words, big data is a collection of data that is too large and complex to process using on-hand database management tools or traditional data processing applications such as RDBMS, DBMS or SQL files.

Analysis of such data sets can find new correlations, to "spot business trends, prevent diseases, combat crime and so on." Scientists, practitioners of media and advertising, and governments alike regularly meet difficulties with large data sets in areas including Internet search, finance and business informatics. Scientists encounter limitations in e-Science work, including meteorology, genomics, complex physics simulations, and biological and environmental research [4].

Normally, data sizes are measured in MB or GB: a video may be 1 GB, 2 GB, 5 GB or more, and an audio file a few MB. On social media everyone shares pictures, posts, audio files and video files, and when millions of such files are put together the total is certainly a very large amount of data; this is what big data is.

Every day, about 2.5 quintillion bytes of data are created — so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This data is big data. In 1992 the data produced by the internet was just 100 GB per day, in 1997 it was about 100 GB per hour, in 2002 it was 100 GB per second, in 2013 it was 28,875 GB per second, and it is predicted that in 2018 about 50,000 GB of data might be generated per second. The following figure shows how rapidly data is increasing [5].

Figure 1: Global internet traffic


2. RELATED WORK
Big data issues, challenges and tools are important basic topics for everyone who is related to computer science. In [6] the authors discussed the basic properties of big data — volume, velocity, variety, complexity and value — as well as the important sources from which big data is generated. Big data has great importance in various fields such as social media, the government sector, sensor data, log storage and risk analysis. In social media, big data is especially important: Facebook generates terabytes of data every day. The authors also discussed data protection and security, which are very sensitive issues and can disturb a person's private life, because the data taken for analysis is the personal data of individuals; in the case of social media, after some pattern is discovered, a person may not want it to become known to anyone. Some technical challenges such as fault tolerance and scalability are also discussed. Hadoop and MapReduce are presented as the main tools and technologies to process and deal with big data: Hadoop is basically a framework on which MapReduce works as a programming model. It works in batch processing, meaning it divides a task into smaller units and then executes them in parallel. At the end of the paper, a comparison between Hadoop and grid computing tools is also shown [6].

In [7], big data is described as growing at an exponential rate, so it becomes important to develop new technologies to deal with it. The paper covers the leading tools and technologies for big data storage and processing: Hadoop, MapReduce and NoSQL are the major big data technologies, and they are very helpful in big data management. Technologies based on Hadoop, called the Hadoop Ecosystem, are also discussed, and the paper throws some light on other emerging big data technologies. There are many areas from which big data is generated; the paper covers those areas and provides solutions for dealing with that data.

"Ten Common Hadoopable Problems" by Cloudera explains in detail where Hadoop technology can usefully be applied and provides a solution to each specific problem with Hadoop. Banks need Hadoop to perform risk analysis on their customers' profiles; important and varied types of data are present in banking, government and financial institutes, and this data is very sensitive due to privacy and security requirements, so it needs better management, and Hadoop is capable of managing and retrieving it very fast. There are other areas as well, such as advertisement targeting, which helps companies target the right customers for the products they sell, and Hadoop is a very suitable technology in this area. Point-of-sale transaction analysis, which figures out customers' buying patterns, is now in huge demand. Fraud analysis, analysing and tracking network data to predict failures, and threat analysis are other areas in which Hadoop can be very efficient and helpful [8].

In [9], Hadoop and Hadoop Distributed File System infrastructure extensions are discussed in detail. The major enhancements concern how Hadoop stores data, and data storage, data processing and placement using MapReduce are also reviewed. The paper also shows a comparison of Hadoop infrastructure extensions on the basis of scalability, fault tolerance, load time, data locality, data compression, etc. Hadoop is widely accepted in many areas, but its extensions, which are improvements of Hadoop, can also be very helpful. HadoopDB, Hadoop++, CoHadoop, Hail, Dare, Cheetah, etc. are the main extensions of Hadoop and are considered for comparison.

Why we use big data and some of its important characteristics are discussed in [10], along with unstructured data and how to deal with it. The results shown in that paper indicate that big data analytics is very important for making business intelligent. Kapil Bakshi [11] mainly discussed the analysis of unstructured data; MapReduce and Hadoop are the major tools for analysing unstructured data and are widely discussed in that paper. Demchenko, de Laat and Membrey [12] discussed the basic definition of big data and also focused on the importance of the Hadoop Ecosystem. The five V's — Volume, Velocity, Variety, Value and Veracity — are discussed as the main properties of big data. Big data analytics, security, data structures and models are the main components of the Hadoop Ecosystem, and the authors reviewed these core architectural components, which are very important in big data challenges.

3. IMPORTANT CHARACTERISTICS OF BIG DATA
The general consensus of the day is that there are specific attributes that define big data. In most big data circles, these are called the four V's: volume, variety, veracity and velocity.

1. Volume:
Big data implies enormous volumes of data. It used to be employees or users who created data; now data is generated and created for different purposes by machines, networks and human interaction on systems like social media, so the volume of data to be analyzed is huge [13]. Earlier the data came only from the employees of an organization, but today data comes from employees, partners, customers, Facebook, Twitter, etc. — from every place and every website. Now consider a file which might be a few KB or a few GB, and imagine how much data is created every day; if thousands of such files are put together, this is the volume of data.

2. Variety:
Variety refers to the many sources and types of data, both structured and unstructured. Data used to be stored in spreadsheets and databases coming from a few sources; now data comes in the form of e-mails, photos, videos, monitoring devices, PDFs, audio, etc. This variety of unstructured data creates problems for storing, mining and analyzing data. Today data comes in all formats, for example from social networking sites such as Facebook, Twitter, LinkedIn and YouTube: it has been seen that approximately 400 million tweets are sent per day by 200 million active users on Twitter. So this is variety, that is, different forms of data.

Fig. 3.1 Four V's of Big Data

3. Veracity:
Veracity refers to the biases, noise and abnormality in data: is the data that is being stored and mined meaningful to the problem being analyzed? Veracity is the biggest challenge in data analysis when compared to things like volume and velocity.

Biases, noise, abnormal data or uncertainty of data means data which is not certain and has no standard naming. For example, on Facebook some people write "good" in the shorthand "GUD", and "GUD" has no meaning of its own, so it is uncertain or noisy data. Similarly, "good morning" is written as "GM", and "GM" has no meaning in any English dictionary, but it is still accepted. This type of data is called uncertain data, and it is part of big data.

4. Velocity:
Velocity deals with the pace at which data flows in from sources like business processes, machines, networks and human interaction with things like social media sites and mobile devices. The flow of data is massive and continuous, and this real-time data can help researchers and businesses make valuable decisions.

Velocity also means analyzing streaming data, that is, how fast the data is processed. For example, the New York Stock Exchange, the biggest stock exchange in the world, generates, captures and deals with about 1 terabyte of data in each trading session, so one can imagine how fast the data has to be processed. This is velocity: the analysis of streaming data. These are the four V's of big data [14].

3.1 Types of data
1. Structured data
2. Unstructured data
3. Semi-structured data

Structured data is data that has been organized into a formatted repository, typically a database, so that its elements can be made addressable for more effective processing and analysis. A database which has information stored in such a way that it can be readily used holds structured data: the data is stored in the form of tables with rows and columns, and the perfect example is an RDBMS.

Unstructured data can be textual or non-textual; this is the variety of data. It can be PDF files, audio files, video files, images and all sorts of data that come from social networking sites such as Facebook, Twitter and YouTube, in the form of likes, posts, comments and uploaded photos. Every sort of this data comes under unstructured data.

Semi-structured data is essentially data described by metadata: under a parent node there is a child node, and under that child node another child node. The question might arise why it is called semi-structured: it is organized in a manner, but not fully organized, and it can still be stored in databases. For example, consider an HTML document: it has tags, with a parent tag and then child tags. This type of data is called semi-structured data.

A traditional system, that is an RDBMS, deals only with structured data, but now there is a need to handle structured, unstructured and semi-structured data together, so a technology was needed that deals with all three types of data. This need had to be fulfilled because approximately 80% of all data is either semi-structured or unstructured, which cannot be dealt with by an RDBMS; that is why we need big data [15].

3.2 Sources of Big Data
The main sources of big data are:
1) Social media and networks
2) Mobile devices
3) Sensor technology and networks
4) Scientific instruments

Consider the first one, social media and networks. Social media networks including Facebook, LinkedIn, Twitter, YouTube and Google are all sources of big data. The data is in the form of likes, comments, posts, uploads, pages and groups; all of this is big data, so social media is certainly a source of big data.

Mobile devices: earlier, mobile devices were mostly used for calls and to send text messages. Now the mobile phone has become a super device which can track the user, and this tracking can generate a lot of data, so it is certainly a source of big data.

Sensor technology and networks: in this case a signal is sent and received and then analyzed in real time, and all of this generates a lot of data, so it is certainly a source of big data.

Scientific instruments: a satellite moving around the earth may be programmed to take a picture every minute; now imagine the amount of data that will be produced. If it is taking a picture every minute, it is certainly a large amount of data, and it is important not only to store the pictures but also to analyze them. So this is also certainly a source of big data.

4. BACKGROUND
The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks [16].

The main challenge is not only to store the data but to read and write the data, to analyze it, and to have something which distributes the files. Hadoop is a framework which solves this problem; reading from a single place is not going to solve it, so there needs to be a distributed way of storing the data. HDFS fulfills that requirement by storing the data across nodes, which is why Hadoop technology is needed. Hadoop also helps in processing huge amounts of data at a very fast rate.

Instead of storing data at a single location, Hadoop stores data in a distributed fashion: it follows a distributed file system where data is stored in different nodes so that it can be retrieved in parallel.

4.1 Hadoop
Hadoop was developed by Doug Cutting and Mike Cafarella [17]. Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.


MapReduce [18] is a framework pioneered by Google for processing large amounts of data in a distributed environment. Due to the simplicity of its programming model and its run-time tolerance for node failures, MapReduce is widely used by companies such as Facebook [19], the New York Times [20], etc.

Distributed processing of large data sets means the data is not stored at one place and input and output do not happen from a single place; the data is stored across the cluster on commodity computers and processed using a simple programming model, and that programming model is called MapReduce.

When MapReduce works across the clusters of computers, the framework takes care of synchronizing the data and of reading the data and knowing where to read it from; all of these things are taken care of by the system. If a job is running, it has to retrieve its data from multiple machines in the cluster, and all of this is handled by the Hadoop framework. On top of this there is a simple programming model called MapReduce, and the whole stack is open-source data management.

Hadoop can be described as a fault-tolerant distributed system for data storage and processing which is open source under Apache. Hadoop provides a reliable shared storage and analysis system. It is designed to scale up from a single server to thousands of machines with a high degree of fault tolerance. Reliable means it creates a replica of the data three times: if data becomes corrupt, bad, or otherwise inaccessible, a replica exists within the cluster, so there is no need to worry about losing data.

Hadoop is the technology, a data warehouse is where data is stored, mining means analyzing, and with Hadoop, data mining is done on the data warehouse.

Data stored in warehouses comes from operational systems, for example a data center, and the data is stored in clusters; a cluster is a combination of racks, and racks are a combination of data nodes. When all of these are combined, a data center — for example a data warehouse — is made. A data warehouse stores current and historical data in order so that data mining can be done on it. Data mining is analyzing the data, and there has to be a technique with which we can do it; this technique is Hadoop.

4.2 Hadoop Core Components
A Hadoop cluster is composed of two parts: the Hadoop Distributed File System and MapReduce. A Hadoop cluster uses the Hadoop Distributed File System (HDFS) [21] to manage its data.

HDFS: The Hadoop Distributed File System is used for storing data, and processing is done by MapReduce. HDFS is a distributed file system which holds large amounts of data across multiple nodes in a cluster, in contrast with a traditional server which has limited storage and does not store data on multiple nodes. In HDFS a file is broken into small blocks with a default size of 64 MB, and these blocks are replicated across the cluster, three times by default. Replication provides durability, high availability and throughput. Some features of HDFS are very important: it is highly fault tolerant, because data is retrieved from multiple nodes rather than one single node, since the data is replicated on three different nodes by default; and it has high throughput, meaning the rate at which data can be read is very high, because it reads the data from different machines in parallel [22].

HDFS allows clients to put, get and delete files. It also follows a write-once, read-many-times policy. For example, a picture uploaded to Facebook can be seen whenever the user wants to see it, even after five years: the user uploads (writes) the file once and reads it multiple times.

In the diagram two things are important to note: the Name node is essentially an admin node, in a master-slave type of configuration. There is a name node, or admin node, which is the master, and there are slaves which are called data nodes; there is also a job tracker, which is associated with the name node, and a task tracker, which is associated with each data node.

An HDFS cluster is the name given to the whole configuration of masters and slaves where the data is stored, and the MapReduce engine is the programming model which is used to retrieve and analyze the data.

Fig 4.1 Core Components of Hadoop

MapReduce
Hadoop MapReduce is the computation framework built upon HDFS. There are two versions of Hadoop MapReduce: MapReduce 1.0 and MapReduce 2.0 (YARN [22]). MapReduce takes a set of input key/value pairs and produces a set of output key/value pairs. When a MapReduce job is submitted to the cluster, it is divided into M map tasks and R reduce tasks, where each map task processes one block (e.g., 64 MB) of input data. MapReduce is a distributed programming paradigm used to analyse the data in HDFS, and it is made up of two procedures: in the map phase, mapping and sorting of the data are done, and the reduce phase performs the logic operations. This is what the core components are about.

4.3 HDFS Components
There are two major components of HDFS: the Namenode and the Datanode. The Namenode is the master node, on which the job tracker runs, and the Datanode is the slave, on which the task tracker runs. The Namenode is the master system, a highly reliable machine. It does not store any data; it maintains and manages the blocks which are present on the datanodes, and it is the datanode which actually keeps the data. Datanodes are slaves deployed on each machine which provide the storage facility. They are responsible for serving read and write requests from the client: reads and writes happen directly from the data node requested by the client, and the task tracker runs on each data node.
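To make this client read/write path concrete, the following minimal sketch (not from the original paper) uses the HDFS Java API to create, read back and delete a small file. The Namenode address hdfs://namenode:9000, the path /user/demo/sample.txt and the class name HdfsClientSketch are illustrative placeholders only; in a real cluster the address would normally come from core-site.xml.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder Namenode address; normally taken from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        Path file = new Path("/user/demo/sample.txt");
        try (FileSystem fs = FileSystem.get(conf)) {
            // Write once: the Namenode picks the datanodes, the client streams the block to them.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read many times: the client asks the Namenode for block locations,
            // then reads directly from a datanode.
            try (FSDataInputStream in = fs.open(file)) {
                byte[] buffer = new byte[32];
                int n = in.read(buffer);
                System.out.println(new String(buffer, 0, n, StandardCharsets.UTF_8));
            }

            // Delete (non-recursive) once the file is no longer needed.
            fs.delete(file, false);
        }
    }
}
```

Equivalent operations are available from the command line through the hdfs dfs -put, -cat and -rm commands.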


A Datanode periodically reports its status through a heartbeat message and asks the Namenode for instructions. Every Datanode listens to the network so that other Datanodes and users can request read and write operations. The heartbeat also helps the Namenode to detect connectivity with its Datanodes: if the Namenode does not receive a heartbeat from a Datanode within the configured period of time, it marks the node as down. Data blocks stored on this node will be considered lost, and the Namenode will automatically replicate the blocks of this lost node onto some other datanodes [23].

The datanode is the place where the actual data, structured and unstructured, is stored. The Namenode is the node which contains the metadata about the datanodes: any information regarding the data and where it is available is kept there.

YARN is a complete framework, a resource manager that resides on the Namenode and takes care of overall resource management. There is also a Node Manager, which resides on each data node and manages the data and the jobs which are running inside that datanode [24].

Fig. 4.2 HDFS cluster and YARN

One side is the HDFS side, which is the storage side, and the other is the YARN side, which is the framework side. On the Namenode side there is the Resource Manager, and the Node Manager runs on each datanode, where the programs actually execute. The Resource Manager interacts with the Node Managers to decide which job is allocated to which Node Manager; the job then runs on that data node and the result is signalled back to the Resource Manager [25].

Fig. 4.3 Namenode and Resource Manager

5. HDFS ARCHITECTURE
The client is the application software which runs on the user's machine and is used to interact with the datanodes and the Namenode. The client reads and writes data from the datanodes [23].

Fig 5.1 HDFS Structure. Source: http://hadoop.apache.org

The structure contains the Namenode and the client. There is also something called a rack: a rack is the storage area where multiple data nodes are put together physically, and these racks may be at different places. Basically a rack is a physical collection of data nodes stored at a single location, and there can be datanodes in different places. The diagram shows the client interacting with the Namenode. There is also the term replication: to protect the data through fault tolerance in the Hadoop system, the same data is distributed in multiple copies, replicated on multiple data nodes, and the minimum number of replicas required by HDFS can be defined by the user. Basically, when data is written to HDFS it is replicated by default on three different data nodes. There are also Block Ops, the operations performed using blocks, and the default size of each block in HDFS is 64 MB.

Metadata is data about the data: it records where the blocks are stored, which racks are available, and on which rack which datanodes are available; it is essentially the block map of the Namenode. The Namenode keeps this metadata in RAM for fast access, while the actual data is stored on the datanodes.

5.1 Job Tracker and Task Tracker
MapReduce is the programming model used to retrieve and analyze the data, and the client is the application used to interact with both the Namenode and the datanode, which means interacting with the Job Tracker and the Task Tracker. A client is application software running on the user's machine which is used to interact, give commands, and look at the status of the Job Tracker or Task Tracker; the interaction between the user and the Namenode and datanode is done through this client, which is also called the HDFS client [26].

Fig. 5.2 JobTracker
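To make the map and reduce phases and the job submission described above concrete, here is a minimal word-count sketch using the standard Hadoop MapReduce Java API. It is the classic textbook example rather than code from this paper; the class names and the command-line input/output paths are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountSketch {

    // Map phase: each map task reads one block of input and emits (word, 1) pairs.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: all values for the same key arrive sorted and grouped, and are summed.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // The driver submits the job to the cluster; input and output are HDFS paths.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count sketch");
        job.setJarByClass(WordCountSketch.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Such a job would typically be packaged into a jar and submitted with the hadoop jar command, after which the JobTracker (or, under YARN, the ResourceManager) schedules the map and reduce tasks on the TaskTrackers or NodeManagers.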


6. HADOOP ECOSYSTEM
The Hadoop Ecosystem consists of tools for data analysis, moving large amounts of unstructured and structured data, data processing, querying data, storing data, and other similar data-oriented processes. These utilities each serve a unique purpose and are geared toward different tasks completed through, or user roles interacting with, Hadoop [28].

Fig. 6.2 Hadoop Ecosystem

So Hadoop is not one tool; it is an ecosystem and a framework, and the following are the various pieces in Hadoop.

The place where data actually lands in Hadoop is HDFS, the Hadoop Distributed File System, which is nothing but the file system that sits on top of the cluster. All the data is stored in the Hadoop Distributed File System, or HDFS. Now how do you get data into Hadoop?

Flume is a framework for harvesting, aggregating and moving huge amounts of log data or text files in and out of Hadoop. There are multiple ways to load data, and Flume is the tool mostly used for moving unstructured or semi-structured data into Hadoop. Anything coming from the web, such as Facebook, LinkedIn, Twitter or any other social media, is loaded through Flume: Flume acts as the channel through which data can be sent into HDFS. For data coming from an RDBMS such as MySQL, or through a connector for Oracle, the tool is Sqoop.

Apache Sqoop efficiently transfers bulk data between Apache Hadoop and structured data stores such as relational databases. Sqoop imports individual tables or complete datasets into HDFS. Sqoop can also be used to extract data from Hadoop and export it into external structured data stores. Sqoop works with relational databases such as Teradata, Oracle, MySQL, etc.

The full form of Sqoop is "SQL to Hadoop": structured data is moved into Hadoop using Sqoop, and connectors for the various RDBMS tools are available on top of it. Now assume this is your cluster. On top of this cluster there needs to be a system or framework which can do resource management: to run a job or program on Hadoop there has to be a framework that decides where the job should be done, where the data is available, where it is stored and where data should move. All of this is done by a framework called YARN, which stands for "Yet Another Resource Negotiator"; it is the framework which manages the complete resource management. Then there is the MapReduce framework, which is used on top of Hadoop to process jobs: programs are written in MapReduce, broken up and sent to the different nodes of the cluster where the actual data resides, and the results are collected and brought back.

Hive is a data warehousing package built on top of Hadoop that is used for complex data analysis and exploration. Hive is a tool which essentially uses MapReduce internally, but it gives flexibility to those who are not from a programming background; it was developed by Facebook.

Pig is an open-source, high-level dataflow system that sits on top of the Hadoop framework and can read data from HDFS for analysis [29]. Its language, Pig Latin, was developed by Yahoo, and it is mostly used for data analysis.

Mahout is a machine learning tool: a scalable machine learning library that implements several different approaches to machine learning. At present Mahout contains four main groups of algorithms:
1. Recommendations, also known as collaborative filtering
2. Classification, also known as categorization
3. Clustering
4. Frequent item set mining, also known as parallel frequent pattern mining

There are other frameworks as well, for example for graph jobs and for running real-time jobs: in Hadoop 1.0 only MapReduce was supported, but with YARN many kinds of tasks can now run. Then there is a NoSQL database called HBase. HBase is a distributed, column-oriented database which uses HDFS for the underlying storage. As said earlier, HDFS works on a write-once, read-many-times pattern, but this is not always enough: we may require real-time read/write random access to a huge dataset, and this is where HBase comes into the picture. Apache HBase is a column-oriented, NoSQL database built on top of HDFS, and it stores data in columns. Finally, there is a tool called Apache Oozie, which is used for workflow management: for example, when thousands of jobs are running, this tool is used to manage the workflow.
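As a small illustration of the random, real-time read/write access that HBase adds on top of HDFS, the following sketch uses the HBase Java client API to write and immediately read back a single cell. The table name "users", column family "info" and row key "user1" are assumptions made for the example, the table is assumed to already exist, and this is not code from the original paper.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath; the cluster address is not hard-coded here.
        Configuration conf = HBaseConfiguration.create();

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Random write: a single row keyed by "user1", column family "info".
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Al-Baha"));
            table.put(put);

            // Random read: fetch the same row back immediately.
            Result result = table.get(new Get(Bytes.toBytes("user1")));
            byte[] city = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
            System.out.println("city = " + Bytes.toString(city));
        }
    }
}
```

The same put and get operations can also be performed interactively from the HBase shell.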


7. CONCLUSION
This paper has presented an overview and basic introductory discussion which is most important for researchers who are joining the field as beginners. A time has begun in which the world is capable of generating data in terabytes and petabytes every day, every hour and every minute. Big data and Hadoop provide facilities for this technology through related open source tools, which are still in a developing stage and will become more important in the future. In recent years, all fields — such as social media, government projects and sensor data — have been giving more importance to big data and its technologies. This paper also throws some light on other emerging big data technologies.

Although this paper clearly has not resolved and covered all the points, researchers can extend this topic according to their needs in the subject of big data. Researchers can frame a framework and use this overview for their basic research, and they can use this paper to extend their work on reading and writing data in Hadoop.

8. REFERENCES
[1] https://opensource.com/resources/big-data
[2] http://www.ibmbigdatahub.com/infographic/four-vs-big-data
[3] http://www.opentracker.net/article/definitions-big-data
[4] http://studymafia.org/wp-content/uploads/2015/05/CSE-Big-Data-Report.pdf
[5] http://www.vcloudnews.com/every-day-big-data-statistics-2-5-quintillion-bytes-of-data-created-daily/
[6] Avita Katal, Mohammad Wazid, R. H. Goudar, "Big Data: Issues, Challenges, Tools and Good Practices". In IEEE, Contemporary Computing (IC3), Sixth International Conference, pages 404-409, Noida, 2013.
[7] Jaskaran Singh and Varun Singla, "Big Data: Tools and Technologies in Big Data", International Journal of Computer Applications, Volume 112, No. 15, Feb. 2015.
[8] Cloudera White Paper, "Ten Common Hadoopable Problems", 2011.
[9] Kala Karun A., Chitharanjan K., "A Review on Hadoop – HDFS Infrastructure Extensions". In IEEE, Information & Communication Technologies (ICT), pages 132-137, 2013.
[10] Sachchidanand Singh, Nirmala Singh, "Big Data Analytics". In IEEE, International Conference on Communication, Information & Computing Technology (ICCICT), pages 1-4, 2012.
[11] Kapil Bakshi, "Considerations for Big Data: Architecture and Approach". In IEEE, Aerospace Conference, pages 1-7, 2012.
[12] Demchenko, Y., de Laat, C., Membrey, P., "Defining Architecture Components of the Big Data Ecosystem". In Collaboration Technologies and Systems (CTS), pages 104-112, 2014.
[13] http://insidebigdata.com/2013/09/12/beyond-volume-variety-velocity-issue-big-data-veracity/
[14] http://www.ibmbigdatahub.com/infographic/four-vs-big-data
[15] http://searchcloudcomputing.techtarget.com/definition/big-data-Big-Data
[16] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler, "The Hadoop Distributed File System". In IEEE, Contemporary Computing (IC3), Sixth International Conference, pages 404-409, Noida, 2010.
[17] http://searchcloudcomputing.techtarget.com/definition/Hadoop
[18] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6, ser. OSDI'04. Berkeley, CA, USA: USENIX Association, 2004, pp. 10-10.
[19] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica, "Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling," in Proceedings of the 5th European Conference on Computer Systems. ACM, 2010, pp. 265-278.
[20] D. Gottfrid, "Self-service, Prorated Supercomputing Fun!" http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-supercomputing-fun/
[21] Apache, "HDFS," http://apache.hadoop.org/hdfs/
[22] Apache Foundation, "YARN," https://hadoop.apache.org/docs/r0.23.0/hadoopyarn/hadoop-yarn-site/YARN.html
[23] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler, "The Hadoop Distributed File System", Yahoo!, Sunnyvale, California, USA.
[24] Jia-Chun Lin, Ingrid Chieh Yu, Einar Broch Johnsen, Ming-Chang Lee, "ABS-YARN: A Formal Framework for Modeling Hadoop YARN Clusters", Department of Informatics, University of Oslo, Norway.
[25] Khalid Adam Ismail Hammad, et al., "Big Data Analysis and Storage", Proceedings of the 2015 International Conference on Operations Excellence and Service Engineering, Orlando, Florida, USA, September 10-11, 2015.
[26] https://hadoopinku.wordpress.com/category/hadoop-2/
