Hadoop

Hadoop is a framework for distributed processing of large datasets across clusters of computers. It uses MapReduce as a programming model where users define map and reduce functions. The MapReduce framework automatically parallelizes the job and manages task execution and hardware failures. The Hadoop Distributed File System (HDFS) stores very large files reliably and provides high throughput access to application data. Major companies use Hadoop to analyze petabytes of data.


Hadoop MapReduce

Felipe Meneses Besson

IME-USP, Brazil

July 7, 2010

Agenda

- What is Hadoop?
- Hadoop Subprojects
- MapReduce
- HDFS
- Development and tools

What is Hadoop?
A framework for large-scale data processing (Tom White, 2009):

- A project of the Apache Software Foundation
- Mostly written in Java
- Inspired by Google's MapReduce and GFS (Google File System)

A brief history

- 2004: Google published the paper that introduced MapReduce; together with GFS (described in a paper from 2003), it offered an alternative way to handle the volume of data to be processed
- 2005: Doug Cutting integrated MapReduce into Hadoop
- 2006: Doug Cutting joined Yahoo!
- 2008: Cloudera was founded [1]
- 2009: A Hadoop cluster sorted 100 terabytes in 173 minutes (on 3,400 nodes) [2]

Nowadays, Cloudera is an active contributor to the Hadoop project and provides Hadoop consulting and commercial products.

[1] Cloudera: http://www.cloudera.com
[2] Sort Benchmark: http://sortbenchmark.org/

Hadoop Characteristics

- A scalable and reliable system for shared storage and analysis
- It automatically handles data replication and node failure
- It does the hard work, so developers can focus on the data-processing logic
- It enables applications to work on petabytes of data in parallel

Who's using Hadoop

Source: Hadoop wiki, September 2009

Hadoop Subprojects
Apache Hadoop is a collection of related subprojects that fall under the umbrella of infrastructure for distributed computing.

All projects are hosted by the Apache Software Foundation.

MapReduce
MapReduce is a programming model and an associated implementation for processing and generating large data sets (Jeffrey Dean and Sanjay Ghemawat, 2004)

- Based on a functional programming model
- A batch data processing system
- A clean abstraction for programmers
- Automatic parallelization and distribution
- Fault tolerance

Programming model

Users implement two functions:

  map (in_key, in_value) -> (out_key, intermediate_value) list
  reduce (out_key, intermediate_value list) -> out_value list
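The two signatures above can be sketched in a single process. This is only an illustration of the programming model, not Hadoop's implementation: `run_mapreduce`, `line_length_map` and `max_reduce` are hypothetical names, and the grouping step stands in for the framework's shuffle.

```python
from collections import defaultdict
from typing import Dict, List


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map over every record, group by output key, then reduce each group."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for in_key, in_value in records:
        # map (in_key, in_value) -> (out_key, intermediate_value) list
        for out_key, intermediate_value in map_fn(in_key, in_value):
            groups[out_key].append(intermediate_value)
    # reduce (out_key, intermediate_value list) -> out_value list
    return {key: list(reduce_fn(key, values)) for key, values in sorted(groups.items())}


def line_length_map(file_name, line):
    # Hypothetical map: emit the length of each line, keyed by file name.
    yield file_name, len(line)


def max_reduce(file_name, lengths):
    # Hypothetical reduce: keep only the longest line length per file.
    yield max(lengths)


result = run_mapreduce(
    [("a.txt", "hello"), ("a.txt", "hi"), ("b.txt", "hadoop")],
    line_length_map,
    max_reduce,
)
print(result)  # {'a.txt': [5], 'b.txt': [6]}
```

In a real cluster the map calls, the grouping and the reduce calls run on different machines; only the two user functions keep this shape.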

Map Function

Input:

- Records from some data source (e.g., lines of files, rows of a database) are presented as (key, value) pairs
- Example: (filename, content)

Output:

- One or more intermediate values in (key, value) format
- Example: (word, number_of_occurrences)

map (in_key, in_value) -> (out_key, intermediate_value) list

Source: (Cloudera, 2010)

Example:

  map (k, v):
    if (isPrime(v)) then emit (k, v)

  (foo, 7)   -> (foo, 7)
  (test, 10) -> (nothing)
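A runnable version of the prime-filter map might look like the sketch below. The trial-division `is_prime` helper is an assumption added for the example; the slide only names `isPrime` without defining it.

```python
def is_prime(n: int) -> bool:
    """Trial division; adequate for small illustrative inputs."""
    if n < 2:
        return False
    divisor = 2
    while divisor * divisor <= n:
        if n % divisor == 0:
            return False
        divisor += 1
    return True


def prime_filter_map(key, value):
    """Emit the pair unchanged when the value is prime, otherwise emit nothing."""
    if is_prime(value):
        yield key, value


print(list(prime_filter_map("foo", 7)))    # [('foo', 7)]
print(list(prime_filter_map("test", 10)))  # []
```

A map function may emit zero, one or many pairs per input record, which is why it is written as a generator here.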

Reduce Function

After the map phase is over, all the intermediate values for a given output key are combined into a list.

Input:

- Intermediate values
- Example: (A, [42, 100, 312])

Output:

- Usually only one final value per key
- Example: (A, 454)

reduce (out_key, intermediate_value list) -> out_value list

Source: (Cloudera, 2010)

Example:

  reduce (k, vals):
    sum = 0
    foreach int v in vals:
      sum += v
    emit (k, sum)

  (A, [42, 100, 312]) -> (A, 454)
  (B, [12, 6, -2])    -> (B, 16)
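The summing reduce can be checked against the slide's numbers with a short sketch. The `grouped` dictionary is a stand-in for the per-key lists the framework would build after the map phase.

```python
def sum_reduce(key, values):
    """Add up all intermediate values for one key and emit a single pair."""
    total = 0
    for value in values:
        total += value
    yield key, total


# Stand-in for the grouped map output handed to the reducers.
grouped = {"A": [42, 100, 312], "B": [12, 6, -2]}
outputs = [pair for key, values in grouped.items() for pair in sum_reduce(key, values)]
print(outputs)  # [('A', 454), ('B', 16)]
```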

Terminology

Job: unit of work that the client wants to be performed

- Input data + MapReduce program + configuration information

Task: part of the job

- map and reduce tasks

Jobtracker: node that coordinates all the jobs in the system by scheduling tasks to run on tasktrackers

Tasktracker: nodes that run tasks and send progress reports to the jobtracker

Split: fixed-size piece of the input data
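How a job's input turns into splits and map tasks can be sketched as follows. The 64 MB split size and the 200 MB input are illustrative assumptions, `Split` and `make_splits` are hypothetical names, and the sketch assumes one map task per split.

```python
from dataclasses import dataclass
from typing import List

MB = 1024 * 1024


@dataclass(frozen=True)
class Split:
    offset: int  # byte position where the split starts
    length: int  # number of bytes in the split


def make_splits(total_bytes: int, split_bytes: int) -> List[Split]:
    """Cut the input into fixed-size pieces; only the last one may be shorter."""
    if split_bytes <= 0:
        raise ValueError("split_bytes must be positive")
    return [
        Split(offset, min(split_bytes, total_bytes - offset))
        for offset in range(0, total_bytes, split_bytes)
    ]


splits = make_splits(200 * MB, 64 * MB)
# Assumption for this sketch: the job gets one map task per split.
map_tasks = [f"map-{index}" for index, _ in enumerate(splits)]
print(len(splits), map_tasks)
```

With these numbers the input yields four splits, the last one covering the remaining 8 MB.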

Data flow

Source: (Cloudera, 2010)

Real Example

  map (String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
      EmitIntermediate(w, "1");

  reduce (String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
      result += ParseInt(v);
    Emit(AsString(result));
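The word-count pseudocode can be tried end to end with the sketch below. It is an approximation: sorting plus `itertools.groupby` imitates the grouping the framework performs between the two phases, the sample documents are made up, and `wc_reduce` returns the word together with its count so the output is readable.

```python
from itertools import groupby
from operator import itemgetter


def wc_map(doc_name, contents):
    # key: document name, value: document contents
    for word in contents.split():
        yield word, "1"


def wc_reduce(word, counts):
    # key: a word, values: an iterable of counts encoded as strings
    result = 0
    for count in counts:
        result += int(count)
    return word, str(result)


# Made-up input documents for illustration.
documents = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog the end")]

# Sort the intermediate pairs so that equal keys become adjacent.
intermediate = sorted(pair for name, text in documents for pair in wc_map(name, text))

word_counts = [
    wc_reduce(word, (count for _, count in group))
    for word, group in groupby(intermediate, key=itemgetter(0))
]
print(word_counts)
```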

Combiner Function

- Compresses (pre-aggregates) the intermediate values
- Runs locally on mapper nodes after the map phase
- It is like a mini-reduce
- Used to save bandwidth before sending data to the reducer

Applied on a mapper machine

Source: (Cloudera, 2010)
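The bandwidth saving from a combiner can be seen in a small sketch. `map_words` and `combine` are hypothetical names and the input sentence is made up; the point is only that fewer records leave the mapper. Reusing the reduce logic as a combiner is safe here because addition is commutative and associative.

```python
from collections import Counter


def map_words(text):
    """Map phase: one (word, 1) record per word occurrence."""
    return [(word, 1) for word in text.split()]


def combine(pairs):
    """Mini-reduce on the mapper: sum the counts per word before sending them on."""
    local_totals = Counter()
    for word, count in pairs:
        local_totals[word] += count
    return sorted(local_totals.items())


mapper_output = map_words("to be or not to be")
combined = combine(mapper_output)
print(len(mapper_output), "records before combining")  # 6 records before combining
print(len(combined), "records after combining")        # 4 records after combining
print(combined)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```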

HDFS
Hadoop Distributed Filesystem

- Inspired by GFS
- Designed to work with very large files
- Runs on commodity hardware
- Streaming data access
- Replication and locality

Nodes

A Namenode (the master)

- Manages the filesystem namespace
- Knows the location of all blocks

Datanodes (workers)

- Keep blocks of data
- Periodically report their lists of blocks back to the namenode
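The division of labour between the namenode and the datanodes can be illustrated with a toy model. This is not the HDFS protocol: the `NameNode` class, its method names and the block identifiers are all hypothetical, and the sketch only shows how periodic block reports let the master know where every block lives.

```python
from collections import defaultdict


class NameNode:
    """Toy master that tracks which datanodes hold which blocks."""

    def __init__(self):
        self.block_locations = defaultdict(set)  # block id -> set of datanode ids

    def receive_block_report(self, datanode_id, block_ids):
        # Forget what this datanode reported earlier, then record its current list.
        for holders in self.block_locations.values():
            holders.discard(datanode_id)
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode_id)

    def locate(self, block_id):
        return sorted(self.block_locations.get(block_id, ()))


namenode = NameNode()
namenode.receive_block_report("dn1", ["blk_1", "blk_2"])
namenode.receive_block_report("dn2", ["blk_2", "blk_3"])
namenode.receive_block_report("dn1", ["blk_1"])  # dn1 no longer holds blk_2
print(namenode.locate("blk_2"))  # ['dn2']
```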

Duplication

Input data copied into HDFS is split into blocks

Each data block is replicated to multiple machines
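Splitting a file into blocks and replicating each block can be sketched like this. The 64 MB block size, the replication factor of 3 and the round-robin placement are assumptions made for the illustration; real HDFS placement policies are more involved, and `place_blocks` is a hypothetical helper.

```python
MB = 1024 * 1024


def place_blocks(file_bytes, block_bytes, datanodes, replication):
    """Return {block index: [datanodes holding a replica]} using round-robin placement."""
    if replication > len(datanodes):
        raise ValueError("not enough datanodes for the requested replication")
    num_blocks = -(-file_bytes // block_bytes)  # ceiling division
    placement = {}
    for block in range(num_blocks):
        # Consecutive datanodes (wrapping around) give distinct machines per block.
        placement[block] = [
            datanodes[(block + replica) % len(datanodes)]
            for replica in range(replication)
        ]
    return placement


placement = place_blocks(150 * MB, 64 * MB, ["dn1", "dn2", "dn3", "dn4"], 3)
for block, nodes in placement.items():
    print(block, nodes)
```

Because every block lives on several machines, losing one datanode does not lose data, and a map task can often be scheduled on a machine that already holds its input block.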

MapReduce data flow

Source: (Tom White, 2009)

Hadoop filesystems

Source: (Tom White, 2009)

Development and Tools

Hadoop operation modes

Hadoop supports three modes of operation:

- Standalone
- Pseudo-distributed
- Fully distributed

More details:
http://oreilly.com/other-programming/excerpts/hadooptdg/installing-apache-hadoop.html

Java example

Guidelines to get started

The basic steps for running a Hadoop job are:

1. Compile your job into a JAR file
2. Copy input data into HDFS
3. Execute hadoop, passing the JAR and relevant arguments
4. Monitor tasks via the web interface (optional)
5. Examine the output when the job is complete

API, tools and training

Do you want to use a scripting language?

- http://wiki.apache.org/hadoop/HadoopStreaming
- http://hadoop.apache.org/core/docs/current/streaming.html
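With Hadoop Streaming, the mapper and reducer are ordinary programs that read lines from standard input and write tab-separated key/value lines to standard output. The sketch below shows that contract for word count; `mapper` and `reducer` are hypothetical function names, and the final lines simulate the sort that takes place between the two phases instead of running a real job.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'word<TAB>1' for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts per word; input lines must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Local simulation: sorted() plays the role of the sort between map and reduce.
mapped = sorted(mapper(["b a", "a c a"]))
reduced = list(reducer(mapped))
print(reduced)  # ['a\t3', 'b\t1', 'c\t1']
```

In an actual streaming job each function would be wrapped in its own script that iterates over `sys.stdin` and prints every result line.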

Eclipse plugin for MapReduce development

- http://wiki.apache.org/hadoop/EclipsePlugIn

Hadoop training (videos, exercises)

- http://www.cloudera.com/developers/learn-hadoop/training/

Bibliography

Hadoop: The Definitive Guide
  Tom White (2009). Hadoop: The Definitive Guide. O'Reilly, San Francisco, 1st Edition.

Google article
  Jeffrey Dean and Sanjay Ghemawat (2004). MapReduce: Simplified Data Processing on Large Clusters. Available at: http://labs.google.com/papers/mapreduce-osdi04.pdf

Hadoop in 45 Minutes or Less
  Tom Wheeler. Large-Scale Data Processing for Everyone. Available at: http://www.tomwheeler.com/publications/2009/lambda_lounge_hadoop_200910/twheelerhadoop-20091001-handouts.pdf

Cloudera videos and training
  http://www.cloudera.com/resources/?type=Training
