Big Data?
Big Data is a very large, loosely structured data set that defies traditional storage.
Human-generated data includes emails, documents, photos, tweets, and videos uploaded to YouTube. This data can
be Big Data too.
Machine-generated data is a new breed of data. This category consists of sensor data and logs generated
by 'machines', such as email logs and click stream logs.
The original Big Data was web data, for example at Facebook, Yahoo, Twitter, and eBay.
A lot of Big Data is unstructured.
Hadoop solves the Big Data problem
More storage and compute power can be achieved by adding more nodes to a Hadoop cluster. Hadoop doesn't
enforce a 'schema' on the data it stores. It can handle arbitrary text and binary data. So Hadoop can 'digest' any
unstructured data easily.
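The idea of storing data without a schema and interpreting it only when it is read can be sketched in a few lines of
Python. This is a toy illustration, not Hadoop code: the names raw_store, store and read_clicks are hypothetical, and
the JSON click-event format with a "url" field is an assumption made for the example.

```python
import json

# Raw records are kept exactly as they arrived; nothing is validated on write.
raw_store = []


def store(record_bytes):
    """Write path: append the bytes untouched, with no schema check."""
    raw_store.append(record_bytes)


def read_clicks():
    """Read path: interpret only the records that parse as JSON click events."""
    clicks = []
    for record in raw_store:
        try:
            event = json.loads(record.decode("utf-8"))
        except (UnicodeDecodeError, ValueError):
            continue  # not JSON text; some other reader may understand it
        if isinstance(event, dict) and "url" in event:
            clicks.append(event["url"])
    return clicks
```

Text, binary and JSON records can all be stored side by side; each reader decides later which records it understands.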
Hadoop provides storage for Big Data at reasonable cost.
Sometimes organizations don't capture a type of data because it is too costly to store. Since
Hadoop provides storage at reasonable cost, this type of data can be captured and stored. One example would be
web site click logs.
With Hadoop, one can store data longer.
Hadoop provides not only distributed storage but also distributed processing. The compute framework of
Hadoop is called MapReduce. MapReduce has been proven at the scale of petabytes.
Hadoop provides rich analytics
Hadoop?
Hadoop is an open source software stack that runs on a cluster of machines. Hadoop provides distributed
storage and distributed processing for very large data sets.
Much of Hadoop's early development was done at Yahoo. It is an Apache project released under the
Apache License, Version 2.0.
Cost of hardware: Hadoop runs on a cluster of machines. The cluster size can be anywhere from 10 nodes to thousands
of nodes.
Who can learn Hadoop?
A hands-on developer or admin can learn Hadoop.
Hadoop is written in Java, so knowing Java helps.
Hadoop runs on Linux, so you should have basic Linux command-line skills.
Some Linux scripting skills will go a long way.
What technical roles are available in Hadoop?
Job Type | Job functions | Skills
Hadoop Developer | develops MapReduce jobs, designs data warehouses | Java, Scripting, Linux
Hadoop Admin | manages Hadoop cluster, designs data pipelines | Linux administration, Network Management, Experience in managing large cluster of machines
Data Scientist | Data mining and figuring out hidden knowledge in data | Math, data mining algorithms
Business Analyst | Analyzes data! | Pig, Hive, SQL superman, familiarity with other BI tools
Hadoop development tools are:
Karmasphere IDE: tuned for developing for Hadoop
Eclipse and other Java IDEs: for writing Java code
Command-line editors like Vim: no matter which IDE you use, you will be editing a lot of files and scripts,
so familiarity with CLI editors is essential.
Hadoop provides two things: storage and processing.
When people say 'Hadoop', they usually mean its two core components: HDFS and MapReduce.
Storage is provided by the Hadoop Distributed File System (HDFS), whereas processing is provided by MapReduce.
MapReduce is a programming framework, much like other frameworks such as Spring, Struts, or MFC. Its description
was published by Google in 2004. Hadoop runs in the world's largest computer centers and at the
largest companies. As you will discover, the Hadoop framework organizes the data and the computations, and then
runs your code. At times, it makes sense to run your solution, expressed in the MapReduce paradigm, even on a single
machine.
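To make the paradigm concrete, here is a minimal single-machine sketch of a word count expressed as map, shuffle and
reduce steps. It is plain Python, not the Hadoop API: the function names map_phase, shuffle_phase, reduce_phase and
run_job are hypothetical, and run_job merely stands in for the work the framework would do when it organizes the data
and calls your code.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Drive the three phases in order, standing in for the framework."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data is big", "hadoop stores big data"]
    print(run_job(sample))
```

Only map_phase and reduce_phase contain problem-specific logic. In a cluster setting the framework would run many
copies of each on different nodes, which is why the same code shape works on one machine or many.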
The Hadoop Distributed File System (HDFS) provides practically unlimited file space, available from any Hadoop node.
HBase is a high-performance database of practically unlimited size that works on top of Hadoop. If you need the power
of familiar SQL over your large data sets, Hive provides you with an answer, and Pig offers a higher-level scripting
language for the same kind of work. While Hadoop can be used by programmers and taught to students as an introduction
to Big Data, its companion projects (including ZooKeeper
Are you forced to give up on projects because you don't know how to easily distribute the computations between
multiple computers? MapReduce helps you solve these problems.
You want to have unlimited storage, solving this problem once and for all, so as to concentrate on what's really
important. The answer is: you can mount HDFS as a FUSE file system.
Each single use is saved in a log, and you need to generate a summary of use of resources for each client by day or
by hour. From this you will do your invoices, so it IS important. But the data set is large. You can write a quick
MapReduce job for that. Better yet, you can use Hive, a data warehouse infrastructure built on top of Hadoop, with
its ETL capabilities, to generate your invoices in no time.
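As a rough sketch of what such a summary job computes, the Python below sums usage per client per day. It runs on one
machine and does not use Hadoop or Hive. The log format (timestamp, client id and units used, separated by spaces) and
the names map_usage and summarize are assumptions made for this example, since the source does not specify them.

```python
from collections import defaultdict


def map_usage(line):
    """Map: turn one log line into ((client, day), units)."""
    timestamp, client, units = line.split()
    day = timestamp.split("T")[0]  # keep only the YYYY-MM-DD part
    return (client, day), int(units)


def summarize(lines):
    """Shuffle and reduce in one step: sum units per (client, day) key."""
    totals = defaultdict(int)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        key, units = map_usage(line)
        totals[key] += units
    return dict(totals)


if __name__ == "__main__":
    log = [
        "2013-01-15T10:23:00 clientA 120",
        "2013-01-15T17:02:11 clientA 30",
        "2013-01-15T09:00:00 clientB 5",
        "2013-01-16T08:45:30 clientA 10",
    ]
    for (client, day), units in sorted(summarize(log).items()):
        print(client, day, units)
```

Summarizing by hour instead of by day would only change how the key is built in map_usage. In a SQL-style tool the
same computation would be expressed as a sum grouped by client and day.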
HDFS - Hadoop Distributed File System
HDFS is the 'file system' or 'storage layer' of Hadoop. It takes care of storing data, and it can handle very
large amounts of data. In an HDFS cluster, there is one master node and many worker nodes. The master node is
called the Name Node (NN) and the workers are called Data Nodes (DN). Data Nodes actually store the data; they
are the workhorses. HDFS is designed to run on commodity hardware. It keeps multiple copies of data around the
cluster, so that data survives the failure of individual machines. HDFS was built to work with mechanical disk
drives. It supports writing files once; appending is supported to enable applications like HBase.
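The division of labor between the Name Node and the Data Nodes can be illustrated with a toy model. This is not HDFS
code. The block size of 128 MB and three copies per block are assumed values chosen for the example, the function
names are hypothetical, and the round-robin placement is a simplification: real HDFS placement also takes rack
topology into account.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes
REPLICATION = 3                 # assumed number of copies per block


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size of each block that a file of file_size bytes occupies."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_blocks(num_blocks, data_nodes, replication=REPLICATION):
    """Toy Name Node: map each block index to the Data Nodes holding a copy.

    Uses simple round-robin placement, so the copies of one block always
    land on distinct Data Nodes.
    """
    if replication > len(data_nodes):
        raise ValueError("need at least as many Data Nodes as replicas")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            data_nodes[(block + i) % len(data_nodes)] for i in range(replication)
        ]
    return placement


if __name__ == "__main__":
    blocks = split_into_blocks(300 * 1024 * 1024)
    for block, nodes in place_blocks(len(blocks), ["dn1", "dn2", "dn3", "dn4"]).items():
        print(block, blocks[block], nodes)
```

The placement map is the kind of metadata the master keeps, while the blocks themselves live on the workers. Losing
any single Data Node in this model still leaves two copies of every block.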