The document discusses using Python with Hadoop frameworks. It introduces Hadoop Distributed File System (HDFS) and MapReduce, and how to use the mrjob library to write MapReduce jobs in Python. It also covers using Python with higher-level Hadoop frameworks like Pig, accessing HDFS with snakebite, and using Python clients for HBase and the PySpark API for the Spark framework. Key advantages discussed are Python's rich ecosystem and ability to access Hadoop frameworks.
Introduction to Hadoop, covering the Hadoop Distributed File System (HDFS), MapReduce framework, and the Hadoop ecosystem including Pig, Hive, and others.
Discusses key features of Hadoop such as linear scalability, schema on read, transparent parallelism, and handling unstructured data.
Explains advantages and disadvantages of using Python for Hadoop, including performance issues and community support.
Introduction to mrjob for writing MapReduce jobs in Python, including a canonical word count example.
Walks through the mrjob word count example with sample outputs for different input strings.
An overview of mrjob, with a look at alternatives such as raw Hadoop Streaming and Pydoop for Python integration.
Describes Pydoop for writing MapReduce jobs in Python, focusing on performance and usage with sample code.
Introduction to Pig for data analysis in Hadoop with examples of Pig Latin syntax and user-defined functions.
Overview of Snakebite, a pure Python client for HDFS, explaining library usage and CLI commands.
Introduces Apache HBase, a NoSQL database on Hadoop, its use cases, and Python client options.
Overview of Apache Spark as a cluster computing system, including its capabilities and use of PySpark.
Provides sample PySpark code for a word count application, demonstrating its syntax and operations.
Wraps up the presentation, reiterating the main focus on Hadoop with Python.
Agenda
• Introduction to Hadoop
• MapReduce with mrjob
• Pig with Python UDFs
• snakebite for HDFS
• HBase and Python clients
• Spark and PySpark
3.
Hadoop Distributed File System (HDFS)
• Stores files in folders (that's it)
• Nobody cares what's in your files
• Chunks large files into blocks (~64MB-2GB)
• 3 replicas of each block (better safe than sorry)
• Blocks are scattered all over the place
[Diagram: a file split into blocks, scattered across the cluster]
4.
MapReduce
• Analyzes raw data in HDFS where the data is
• Jobs are split into Mappers and Reducers

Mappers (you code this)
Loads data from HDFS
Filter, transform, parse
Outputs (key, value) pairs

Reducers (you code this, too)
Automatically grouped by the mapper's output key
Aggregate, count, statistics
Outputs to HDFS
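To make the Mapper, shuffle, Reducer flow concrete, here is a toy sketch in plain Python (no Hadoop involved; it only illustrates the data flow the framework provides around your two functions):

from collections import defaultdict

def mapper(line):
    # Filter, transform, parse: emit (key, value) pairs
    for word in line.lower().split():
        yield (word, 1)

def reducer(key, values):
    # Aggregate the values that arrived for one key
    yield (key, sum(values))

lines = ["the quick brown fox", "the lazy dog"]

# The "shuffle": group every mapper output value by its key
groups = defaultdict(list)
for line in lines:
    for key, value in mapper(line):
        groups[key].append(value)

# Reduce each group; prints pairs like ('the', 2), ('quick', 1), ...
for key, values in groups.items():
    for result in reducer(key, values):
        print(result)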
5.
Hadoop Ecosystem
• Higher-level languages like Pig and Hive
• HDFS data systems like HBase and Accumulo
• Alternative execution engines like Storm and Spark
• Close friends like ZooKeeper, Flume, Avro, Kafka
6.
Cool Thing #1: Linear Scalability
• HDFS and MapReduce scale linearly
• If you have twice as many computers, jobs run twice as fast
• If you have twice as much data, jobs run twice as slow
• If you have twice as many computers, you can store twice as much data
DATA LOCALITY!!
7.
Cool Thing #2: Schema on Read
LOAD DATA FIRST, ASK QUESTIONS LATER
Data is parsed/interpreted as it is loaded out of HDFS
What implications does this have?
BEFORE:
ETL, schema design upfront, tossing out original data, comprehensive data study

WITH HADOOP:
Keep original data around!
Have multiple views of the same data!
Work with unstructured data sooner!
Store first, figure out what to do with it later!
8.
Cool Thing #3: Transparent Parallelism
Network programming? Inter-process communication? Threading? Distributed stuff?
Fault tolerance? Code deployment? RPC? Message passing? Locking?
Data storage? Scalability? Data center fires?

With MapReduce, I DON'T CARE. The MapReduce framework takes care of all of that;
… I just have to be sure my solution fits into this tiny box.
9.
Cool Thing #4: Unstructured Data
• Unstructured data: media, text, forms, log data, lumped structured data
• Query languages like SQL and Pig assume some sort of "structure"
• MapReduce is just Java: you can do anything Java can do in a Mapper or Reducer
10.
Why Python?
• Python vs. Java
• Compiled vs. scripts
• Python libraries we all love
• Integration with other things
11.
Why Not?
• Python vs. Java
• Almost nothing is native
• Performance
• Being out of date
• Being "weird"
• Smaller community, almost no official support
mrjob
• Write MapReduce jobs in Python!
• Open sourced and maintained by Yelp
• Wraps "Hadoop Streaming" in CPython, Python 2.5+
• Well documented
• Can run locally, in Amazon EMR, or on Hadoop
14.
Canonical Word Count
from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")

class MRWordFreqCount(MRJob):
    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def reducer(self, word, counts):
        yield (word, sum(counts))

if __name__ == '__main__':
    MRWordFreqCount.run()
15.
Canonical Word Count
from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")

class MRWordFreqCount(MRJob):
    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def reducer(self, word, counts):
        yield (word, sum(counts))

if __name__ == '__main__':
    MRWordFreqCount.run()
The quick brown fox jumps over the lazy dog
the, 1
quick, 1
brown, 1
fox, 1
jumps, 1
over, 1
the, 1
lazy, 1
dog, 1
16.
Canonical Word Count
from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")

class MRWordFreqCount(MRJob):
    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def reducer(self, word, counts):
        yield (word, sum(counts))

if __name__ == '__main__':
    MRWordFreqCount.run()
I like this Hadoop thing
i, 1
like, 1
this, 1
hadoop, 1
thing, 1
17.
Canonical Word Count
from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")

class MRWordFreqCount(MRJob):
    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def reducer(self, word, counts):
        yield (word, sum(counts))

if __name__ == '__main__':
    MRWordFreqCount.run()
dog, [1, 1, 1, 1, 1, 1]
dog, 6
18.
Canonical Word Count
from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")

class MRWordFreqCount(MRJob):
    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def reducer(self, word, counts):
        yield (word, sum(counts))

if __name__ == '__main__':
    MRWordFreqCount.run()
cat, [1, 1, 1, 1, 1, 1, 1, 1]
cat, 8
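A usage note: a job like the one above is normally launched from the shell (mrjob's -r flag picks where it runs: inline/local, hadoop, or emr), but it can also be driven from Python. This is a hedged sketch using the runner API documented for mrjob of that era (make_runner, stream_output, parse_output_line); the module name word_count.py and the file input.txt are placeholders, not from the slides.

# Assumes the MRWordFreqCount class above is saved as word_count.py
from word_count import MRWordFreqCount

# '-r inline' runs everything in-process; swap in '-r hadoop' or '-r emr'
mr_job = MRWordFreqCount(args=['-r', 'inline', 'input.txt'])
with mr_job.make_runner() as runner:
    runner.run()
    for line in runner.stream_output():
        word, count = mr_job.parse_output_line(line)
        print("%s: %d" % (word, count))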
Pydoop
• Write MapReduce jobs in Python!
• Uses Hadoop C++ Pipes, which should be faster than wrapping streaming
• Actively being worked on
• I'm not sure which is better
22.
Pydoop Word Count
with open('stop.txt') as f:
    STOP_WORDS = set(l.strip() for l in f if not l.isspace())

def mapper(_, v, writer):
    for word in v.split():
        if word in STOP_WORDS:
            writer.count("STOP_WORDS", 1)
        else:
            writer.emit(word, 1)

def reducer(word, icounts, writer):
    writer.emit(word, sum(map(int, icounts)))
$ pydoop script wc.py hdfs_input hdfs_output --upload-file-to-cache stop.txt
23.
Pig
• Pig is a higher-level platform and language for analyzing data that happens to run MapReduce underneath
a = LOAD 'inputdata.txt';
b = FOREACH a GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word;
c = GROUP b BY word;
d = FOREACH c GENERATE group, COUNT(b);
STORE d INTO 'wc';
24.
Pig UDFs
Users can write user-defined functions to extend the functionality of Pig
Can use Jython (faster) or CPython (access to more libs)
b = FOREACH a GENERATE revstr(phonenum);
...
m = GROUP j BY username;
n = FOREACH m GENERATE group, sortedconcat(j.tags);

@outputSchema("tags:chararray")
def sortedconcat(bag):
    out = set()
    for tag in bag:
        out.add(tag)
    return '-'.join(sorted(out))

@outputSchema("rev:chararray")
def revstr(instr):
    return instr[::-1]
25.
snakebite
• A pure Python client
• Handles most NameNode ops (moving/renaming files, deleting files)
• Handles most DataNode reading ops (reading files, getmerge)
• Doesn't handle writing to DataNodes yet
• Two ways to use: library and command line interface
26.
snakebite - Library
from snakebite.client import Client
client = Client("1.2.3.4", 54310, use_trash=False)
for x in client.ls(['/data']):
    print x
print ''.join(client.cat('/data/ref/refdata*.csv'))
Useful for doing HDFS file manipulation in data flows or job setups
Can be used to read reference data from MapReduce jobs
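As an illustration of the job-setup use case, a hedged sketch follows; the NameNode address and paths are made up, and it leans on snakebite client methods (test, delete, mkdir) whose results are mostly generators that must be consumed for the operation to run.

from snakebite.client import Client

# NameNode host/port are placeholders
client = Client("namenode.example.com", 54310, use_trash=False)

# Clear a previous run's output directory before launching a job
if client.test("/data/wc_output", exists=True):
    list(client.delete(["/data/wc_output"], recurse=True))  # consume the generator

# Stage today's input directory, creating parent directories as needed
list(client.mkdir(["/data/in/2014/01/01"], create_parent=True))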
27.
snakebite - CLI
$ snakebite get /path/in/hdfs/mydata.txt /local/path/data.txt
$ snakebite rm /path/in/hdfs/mydata.txt
$ for fp in `snakebite ls /data/new/`; do
    snakebite mv "/data/new/$fp" "/data/in/`date +%Y/%m/%d/`$fp"
  done
The 'hadoop' CLI client is written in Java and spins up a new JVM every time (1-3 sec).
Snakebite doesn't have that problem, making it good for lots of programmatic interactions with HDFS.
28.
From the website:
Apache HBase is the Hadoop database, a distributed, scalable, big data store.
When Would I Use Apache HBase?
Use Apache HBase when you need random, realtime read/write access to
your Big Data. This project's goal is the hosting of very large tables --
billions of rows X millions of columns -- atop clusters of commodity
hardware. Apache HBase is an open-source, distributed, versioned, non-
relational database modeled after Google's Bigtable: A Distributed Storage
System for Structured Data by Chang et al. Just as Bigtable leverages the
distributed data storage provided by the Google File System, Apache HBase
provides Bigtable-like capabilities on top of Hadoop and HDFS.
29.
Python clients
Starbase or Happybase
Uses the HBase Thrift gateway interface (slow)
Last commit 6 months ago
Appears to be fully featured
Not really there yet and have failed to gain community momentum. Java is still king.
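For a feel of the API, here is a hedged Happybase sketch; the host, table, and column family names are invented, and the connection goes through the HBase Thrift gateway mentioned above.

import happybase

# Connect to an HBase Thrift gateway (host is a placeholder)
connection = happybase.Connection('thrift-gateway.example.com')
table = connection.table('users')

# Write one cell and read the row back
table.put('row-user42', {'info:name': 'Don'})
print(table.row('row-user42'))

# Scan rows sharing a key prefix
for key, data in table.scan(row_prefix='row-user'):
    print("%s -> %s" % (key, data))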
30.
From the website:
Apache Spark is a fast and general-purpose cluster
computing system. It provides high-level APIs in Scala,
Java, and Python that make parallel jobs easy to write,
and an optimized engine that supports general computation
graphs. It also supports a rich set of higher-level tools
including Shark (Hive on Spark), MLlib for machine
learning, GraphX for graph processing, and Spark
Streaming.
In general, Spark is faster than MapReduce and
easier to write than MapReduce
31.
PySpark
• Spark's native language is Scala, but it also supports Java and Python
• Python API is always a tad behind Scala
• Programming in Spark (and PySpark) is in the form of chaining transformations and actions on RDDs (see the sketch below)
• RDDs are "Resilient Distributed Datasets"
• RDDs are kept in memory for the most part
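A small sketch of that chaining style, as it might be typed into the pyspark shell (which provides sc as the SparkContext); transformations are lazy and nothing actually runs until an action such as collect() is called. The numbers are just an example, not from the slides.

nums = sc.parallelize([1, 2, 3, 4, 5])      # build an RDD from a local list
evens = nums.filter(lambda x: x % 2 == 0)   # transformation: lazy, nothing runs yet
doubled = evens.map(lambda x: x * 2)        # still lazy
print(doubled.collect())                    # action: triggers the job -> [4, 8]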
32.
PySpark Word Count Example
import sys
from operator import add
from pyspark import SparkContext

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print >> sys.stderr, "Usage: wordcount <file>"
        exit(-1)
    sc = SparkContext(appName="PythonWordCount")
    lines = sc.textFile(sys.argv[1], 1)
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add)
    output = counts.collect()
    for (word, count) in output:
        print "%s: %i" % (word, count)
    sc.stop()
Speaker notes: Donald Miner will do a quick introduction to Apache Hadoop, then discuss the different ways Python can be used to get the job done in Hadoop. This includes writing MapReduce jobs in Python in various different ways, interacting with HBase, writing custom behavior in Pig and Hive, interacting with the Hadoop Distributed File System, using Spark, and integration with other corners of the Hadoop ecosystem. The state of Python with Hadoop is far from stable, so we'll spend some honest time talking about the state of these open source projects and what's missing.