This document provides an introduction to data science with Hadoop, covering its core components, HDFS and MapReduce, as well as ecosystem tools such as Sqoop, Flume, Hive, and Mahout. It examines the strengths and weaknesses of Hadoop's core features and describes the data scientist's role in data wrangling, exploration, modeling, and product development. It also surveys the high-level programming languages and frameworks that extend Hadoop for data analysis.