© 2010 – 2015 Cloudera, Inc. All Rights Reserved
  
Introduction to Apache Hadoop and its Ecosystem
Mark Grover | Budapest Data Forum
June 5th, 2015
@mark_grover
github.com/markgrover/hadoop-intro-fast
  
Schedule and Logistics
• Facilities
• Questions
• Schedule (times are approximate)

Time             Event
9:00 – 10:20     Tutorial
10:20 – 10:30    Break
10:30 – 12:00    Tutorial
  
About the Presentation…
• What’s ahead
    • Fundamental Concepts
    • HDFS: The Hadoop Distributed File System
    • Data Processing with MapReduce
    • The Hadoop Ecosystem
    • Hadoop Clusters: Past, Present, and Future
    • Conclusion + Q&A
  
Fundamental Concepts
Why the World Needs Hadoop
  
Volume
• Every day…
    • More than 1.5 billion shares are traded on the NYSE
    • Facebook stores 2.7 billion comments and Likes
• Every minute…
    • Foursquare handles more than 2,000 check-ins
    • TransUnion makes nearly 70,000 updates to credit files
• Every second…
    • Banks process more than 10,000 credit card transactions
  
Velocity
• We are generating data faster than ever
    • Processes are increasingly automated
    • People are increasingly interacting online
    • Systems are increasingly interconnected
  
Variety
• We’re producing a variety of data, including
    • Audio
    • Video
    • Images
    • Log files
    • Web pages
    • Product ratings
    • Social network connections
• Not all of this maps cleanly to the relational model
  
Big Data Can Mean Big Opportunity
• One tweet is an anecdote
    • But a million tweets may signal important trends
• One person’s product review is an opinion
    • But a million reviews might uncover a design flaw
• One person’s diagnosis is an isolated case
    • But a million medical records could lead to a cure
  
We Need a System that Scales
• Too much data for traditional tools
• Two key problems
    • How to reliably store this data at a reasonable cost
    • How to process all the data we’ve stored
  
What is Apache Hadoop?
• Scalable data storage and processing
    • Distributed and fault-tolerant
    • Runs on standard hardware
• Two main components
    • Storage: Hadoop Distributed File System (HDFS)
    • Processing: MapReduce
• Hadoop clusters are composed of computers called nodes
    • Clusters range from a single node up to several thousand nodes
  
How Did Apache Hadoop Originate?
• Heavily influenced by Google’s architecture
    • Notably, the Google Filesystem and MapReduce papers
• Other Web companies quickly saw the benefits
    • Early adoption by Yahoo, Facebook and others

[Timeline figure, 2002 to 2006, in chronological order: Nutch spun off from Lucene; Google publishes GFS paper; Google publishes MapReduce paper; Nutch rewritten for MapReduce; Hadoop becomes Lucene subproject]
  
How Are Organizations Using Hadoop?
• Let’s look at a few common uses…
  
Hadoop Use Case: Log Sessionization

Web Server Log Data
    ...
    10.174.57.241 - - [17/Feb/2014:17:57:41 -0500] "GET /s?q=widget HTTP/1.1" 200 3617 "http://www.hotbot.com/find/dualcore" "WebTV 1.2" "U=129"
    10.174.57.241 - - [17/Feb/2014:17:58:03 -0500] "GET /wres.html HTTP/1.1" 200 5741 "http://www.example.com/s?q=widget" "WebTV 1.2" "U=129"
    10.174.57.241 - - [17/Feb/2014:17:58:25 -0500] "GET /detail?w=41 HTTP/1.1" 200 8584 "http://www.example.com/wres.html" "WebTV 1.2" "U=129"
    10.174.57.241 - - [17/Feb/2014:17:59:36 -0500] "GET /order.do HTTP/1.1" 200 964 "http://www.example.com/detail?w=41" "WebTV 1.2" "U=129"
    10.174.57.241 - - [17/Feb/2014:17:59:47 -0500] "GET /confirm HTTP/1.1" 200 964 "http://www.example.com/order.do" "WebTV 1.2" "U=129"
    10.218.46.19 - - [17/Feb/2014:17:57:43 -0500] "GET /ide.html HTTP/1.1" 404 955 "http://www.example.com/s?q=JBuilder" "Mosaic/3.6 (X11;SunOS)"
    10.32.51.237 - - [17/Feb/2014:17:58:04 -0500] "GET /os.html HTTP/1.1" 404 955 "http://www.example.com/s?q=VMS" "Mozilla/1.0b (Win3.11)"
    10.157.96.181 - - [17/Feb/2014:17:58:26 -0500] "GET /mp3.html HTTP/1.1" 404 955 "http://www.example.com/s?q=Zune" "Mothra/2.77" "U=3622"
    ...

Process Logs with Hadoop

Clickstream Data for User Sessions
    Recent Activity for John Smith
    February 12, 2014
        1) Search for 'Widget'
        2) Widget Results
        3) View Details for Widget X
        4) Order Widget X
    February 17, 2014
        5) Track Order
        6) Click 'Contact Us' Link
        7) Submit Complaint
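The step from raw log lines to per-user sessions can be sketched in plain Python. This is only an illustration of the idea, not how the job on the slide is implemented: it assumes sessions are keyed by client IP address (the slide appears to key on the user instead), and the 30-minute inactivity cutoff `SESSION_GAP`, the regular expression, and the helper names `parse` and `sessionize` are all hypothetical choices made for this sketch.

```python
import re
from datetime import datetime, timedelta

# Captures the client IP, the timestamp without its UTC offset, and the path.
LOG_PATTERN = re.compile(r'^(\S+) \S+ \S+ \[([^\] ]+) [^\]]*\] "\S+ (\S+)')
SESSION_GAP = timedelta(minutes=30)   # assumed inactivity cutoff


def parse(line):
    """Return (ip, timestamp, path) for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    ip, stamp, path = match.groups()
    return ip, datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S"), path


def sessionize(lines):
    """Group requests by IP, starting a new session after a long gap."""
    by_ip = {}
    # Sorting the (ip, timestamp, path) tuples orders each IP's hits by time.
    for ip, when, path in sorted(h for h in map(parse, lines) if h is not None):
        sessions = by_ip.setdefault(ip, [])
        if sessions and when - sessions[-1][-1][0] <= SESSION_GAP:
            sessions[-1].append((when, path))   # continue the current session
        else:
            sessions.append([(when, path)])     # start a new session
    return by_ip
```

Calling `sessionize` on the log lines above would return a dictionary mapping each IP address to a list of sessions, each session being a time-ordered list of `(timestamp, path)` pairs.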
  
Hadoop Use Case: Customer Analytics

Example Inc. Public Web Site (February 9 - 15)

Category     Unique Visitors   Page Views   Bounce Rate   Conversion Rate   Average Time on Page
Television   1,967,345         8,439,206    23%           51%               17 seconds
Smartphone   982,384           3,185,749    47%           41%               23 seconds
MP3 Player   671,820           2,174,913    61%           12%               42 seconds
Stereo       472,418           1,627,843    74%           19%               26 seconds
Monitor      327,018           1,241,837    56%           17%               37 seconds
Tablet       217,328           816,545      48%           28%               53 seconds
Printer      127,124           535,261      27%           64%               34 seconds
  
Hadoop Use Case: Sentiment Analysis

[Chart: References to Product in Social Media (Hourly). Horizontal axis: hour of day, 00 to 23. Vertical axis: number of references, 10,000 to 80,000. Series: Negative, Neutral, Positive.]
  
Hadoop Use Case: Product Recommendations

Products Recommended for You
    Audio Adapter - Stereo       $4.99     (341 ratings)
    Over-the-Ear Headphones      $29.99    (1,672 ratings)
    9-volt Alkaline Battery      $1.79     (847 ratings)
  
“In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox”
    Grace Hopper, early advocate of distributed computing
  
Comparing Hadoop to Other Systems
• Monolithic systems don’t scale
• Modern high-performance computing systems are distributed
    • They spread computations across many machines in parallel
    • Widely used for scientific applications
    • Let’s examine how a typical HPC system works
  
Architecture of a Typical HPC System

[Diagram sequence: a Storage System connected to Compute Nodes by a Fast Network]
    Step 1: Copy input data
    Step 2: Process the data
    Step 3: Copy output data
  
You Don’t Just Need Speed…
• The problem is that we have way more data than code

    $ du -ks code/
    1,087
    $ du -ks data/
    854,632,947,314
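To put those two `du` figures in perspective, here is a back-of-the-envelope sketch of how long each would take to copy over a network. The link speed is an assumption chosen purely for illustration (a single 10 gigabit/s link with no overhead), and the sketch treats the `du -ks` output as units of 1024 bytes; neither number comes from the slides.

```python
# Sizes reported by `du -ks` on the slide, treated as units of 1024 bytes.
CODE_KB = 1087
DATA_KB = 854632947314

# Assumption for illustration only: one 10 gigabit/s link running at
# full speed with no protocol overhead.
LINK_BITS_PER_SECOND = 10 * 10**9


def transfer_seconds(size_kb, bits_per_second=LINK_BITS_PER_SECOND):
    """Return the seconds needed to push size_kb kilobytes over the link."""
    size_bits = size_kb * 1024 * 8
    return float(size_bits) / bits_per_second


print("code: %.6f seconds" % transfer_seconds(CODE_KB))
print("data: %.1f days" % (transfer_seconds(DATA_KB) / 86400.0))
```

Under these assumptions the code moves in under a millisecond while the data needs roughly eight days, which is the imbalance the slide is pointing at.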
  
You Need Speed At Scale

[Diagram: the link between the Storage System and the Compute Nodes is labeled "Bottleneck"]
  
Hadoop Design Fundamental: Data Locality
• This is a hallmark of Hadoop’s design
    • Don’t bring the data to the computation
    • Bring the computation to the data
• Hadoop uses the same machines for storage and processing
    • Significantly reduces need to transfer data across network
  
Other Hadoop Design Fundamentals
• Machine failure is unavoidable – embrace it
    • Build reliability into the system
• “More” is usually better than “faster”
    • Throughput matters more than latency
  
HDFS
The Hadoop Distributed Filesystem
  
HDFS: Hadoop Distributed File System
• Inspired by the Google File System
• Reliable, low-cost storage for massive amounts of data
• Similar to a UNIX filesystem in some ways
    • Hierarchical
    • UNIX-style paths (e.g., /sales/alice.txt)
    • UNIX-style file ownership and permissions
  
HDFS: Hadoop Distributed File System
• There are also some major deviations from UNIX filesystems
    • Highly optimized for processing data with MapReduce
        • Designed for sequential access to large files
        • Cannot modify file content once written
    • It’s actually a user-space Java process
        • Accessed using special commands or APIs
    • No concept of a current working directory
  
HDFS Blocks
• Files added to HDFS are split into fixed-size blocks
    • Block size is configurable, but defaults to 64 megabytes

[Diagram: a 225 MB File split into four blocks]
    Block #1: First 64 MB
    Block #2: Next 64 MB
    Block #3: Next 64 MB
    Block #4: Remaining 33 MB
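The arithmetic behind that diagram can be sketched in a few lines of Python. The function name `split_into_blocks` and the constants are hypothetical names for this illustration; HDFS does this splitting internally, and nothing here calls a Hadoop API.

```python
MB = 1024 * 1024
DEFAULT_BLOCK_SIZE = 64 * MB  # the default quoted on the slide


def split_into_blocks(file_size, block_size=DEFAULT_BLOCK_SIZE):
    """Return the list of block lengths (in bytes) for a file of file_size bytes."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only what is left over
    return sizes


for number, size in enumerate(split_into_blocks(225 * MB), start=1):
    print("Block #%d: %d MB" % (number, size // MB))
```

For the 225 MB file on the slide this yields three 64 MB blocks followed by one 33 MB block.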
  
Why Does HDFS Use Such Large Blocks?

[Diagram labels: "Current location of disk head" and "Where the data you need is stored"]
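One common way to read this diagram is that moving the disk head (a seek) costs a fixed amount of time per block, so larger blocks spread that cost over more data. The sketch below works through that reasoning; the 10 ms seek time and 100 MB/s transfer rate are assumed round numbers for illustration, not figures from the slides, and `seek_overhead` is a hypothetical helper.

```python
SEEK_SECONDS = 0.010             # assumed average seek time
TRANSFER_MB_PER_SECOND = 100.0   # assumed sequential read rate


def seek_overhead(block_mb, seek=SEEK_SECONDS, rate=TRANSFER_MB_PER_SECOND):
    """Fraction of the time to read one block that is spent seeking."""
    transfer = block_mb / rate
    return seek / (seek + transfer)


for block_mb in (0.004, 1, 64):  # a 4 KB block, a 1 MB block, a 64 MB block
    print("%8.3f MB block: %5.1f%% of read time spent seeking"
          % (block_mb, 100 * seek_overhead(block_mb)))
```

With these assumed numbers, a 4 KB block spends almost all of its read time seeking, while a 64 MB block spends under two percent.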
  
HDFS Replication
• Each block is then replicated across multiple nodes
    • Replication factor is also configurable, but defaults to three
• Benefits of replication
    • Availability: data isn’t lost when a node fails
    • Reliability: HDFS compares replicas and fixes data corruption
    • Performance: allows for data locality
• Let’s see an example of replication…
  
HDFS Replication (cont’d)

[Diagram sequence: the file is split into Block #1, Block #2, Block #3, and Block #4, and each block in turn is replicated across nodes A, B, C, D, and E]
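The effect of replication on availability can be illustrated with a toy placement routine. This is a simplified sketch: it places replicas round-robin across the five nodes from the diagram, which is an assumption made for brevity, whereas HDFS's actual placement policy is more involved. The names `place_replicas` and `surviving_blocks` are hypothetical.

```python
from itertools import cycle, islice

NODES = ["A", "B", "C", "D", "E"]   # node names from the diagram
REPLICATION_FACTOR = 3              # the default quoted on the slide


def place_replicas(block_ids, nodes=NODES, replication=REPLICATION_FACTOR):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    placement = {}
    for index, block in enumerate(block_ids):
        # Start each block one node further along so the load is spread out.
        start = index % len(nodes)
        placement[block] = list(islice(cycle(nodes), start, start + replication))
    return placement


def surviving_blocks(placement, failed_node):
    """Return the blocks that still have a replica after one node fails."""
    return [block for block, holders in placement.items()
            if any(node != failed_node for node in holders)]


placement = place_replicas([1, 2, 3, 4])
for block in sorted(placement):
    print("Block #%d -> %s" % (block, ", ".join(placement[block])))
```

Because every block lives on three different nodes, `surviving_blocks` returns all four blocks no matter which single node is removed.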
  
Accessing HDFS via the Command Line
• Users typically access HDFS via the hadoop fs command
    • Actions specified with subcommands (prefixed with a minus sign)
    • Most are similar to corresponding UNIX commands

    $ hadoop fs -ls /user/tomwheeler
    $ hadoop fs -cat /customers.csv
    $ hadoop fs -rm /webdata/access.log
    $ hadoop fs -mkdir /reports/marketing
  
Copying Local Data To and From HDFS
• Remember that HDFS is distinct from your local filesystem
    • hadoop fs -put copies local files to HDFS
    • hadoop fs -get fetches a local copy of a file from HDFS

[Diagram: a Client Machine and a Hadoop Cluster]
    Client Machine to Hadoop Cluster:    $ hadoop fs -put sales.txt /reports
    Hadoop Cluster to Client Machine:    $ hadoop fs -get /reports/sales.txt
  
HDFS Daemons
• There are two daemon processes in HDFS
• NameNode (master)
    • Exactly one active NameNode per cluster
    • Manages namespace and metadata
• DataNode (slave)
    • Many per cluster
    • Performs block storage and retrieval
  
How File Reads Work

[Diagram sequence: a client and nodes A, B, C, D, and E]
    Step 1: Get block locations from the NameNode
    Step 2: Read those blocks directly from the DataNodes
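The two steps of a read can be mimicked with a small in-memory model. This is a toy sketch, not the HDFS client API: the `NameNode` and `DataNode` classes, their methods, and the sample file contents are all hypothetical stand-ins that only show which component answers which question.

```python
class NameNode(object):
    """Holds only metadata: which blocks make up a file and where they live."""

    def __init__(self):
        self.block_map = {}   # path -> list of (block_id, [node names])

    def get_block_locations(self, path):
        return self.block_map[path]


class DataNode(object):
    """Holds the actual block contents."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}      # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]


def read_file(path, namenode, datanodes):
    """Step 1: ask the NameNode for locations. Step 2: read from DataNodes."""
    data = b""
    for block_id, holders in namenode.get_block_locations(path):
        # Any replica will do; this sketch simply takes the first one listed.
        data += datanodes[holders[0]].read_block(block_id)
    return data


# Made-up cluster state: one file, two blocks, two replicas of each block.
datanodes = {name: DataNode(name) for name in "ABCDE"}
namenode = NameNode()
for node, block_id, content in [("A", 1, b"hello, "), ("C", 1, b"hello, "),
                                ("B", 2, b"HDFS"), ("E", 2, b"HDFS")]:
    datanodes[node].blocks[block_id] = content
namenode.block_map["/sales/alice.txt"] = [(1, ["A", "C"]), (2, ["B", "E"])]

print(read_file("/sales/alice.txt", namenode, datanodes))
```

Note that the file's bytes never pass through the `NameNode` object; it only hands out locations, which matches the division of labor on the slide.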
  
HDFS Demo
• I will now demonstrate the following
    1. How to list the contents of a directory
    2. How to create a directory in HDFS
    3. How to copy a local file to HDFS
    4. How to display the contents of a file in HDFS
    5. How to remove a file from HDFS

TODO: provide VM and instructions for demo
  
Data Processing with MapReduce
A Scalable Data Processing Framework
  
What is MapReduce?
• MapReduce is a programming model
    • It’s a way of processing data
    • You can implement MapReduce in any language
  
Understanding Map and Reduce
• You supply two functions to process data: Map and Reduce
    • Map: typically used to transform, parse, or filter data
    • Reduce: typically used to summarize results
• The Map function always runs first
    • The Reduce function runs afterwards, but is optional
• Each piece is simple, but can be powerful when combined
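The division of labor between the two functions can be sketched as a tiny in-memory driver. This is a single-process illustration of the programming model only, not Hadoop's implementation; `run_mapreduce`, `parse_order`, `total_cost`, and the sample order lines are hypothetical names and made-up data.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn=None):
    """Run map_fn over every record, then (optionally) reduce_fn per key."""
    # Map phase: each record may yield zero or more (key, value) pairs.
    mapped = []
    for record in records:
        mapped.extend(map_fn(record))
    if reduce_fn is None:
        return mapped  # map-only job: the Reduce function is optional

    # Group values by key, then hand keys to the reducer in sorted order.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return [reduce_fn(key, groups[key]) for key in sorted(groups)]


def parse_order(line):
    """Map: parse 'cust_id,cost' and filter out malformed lines."""
    parts = line.split(",")
    if len(parts) != 2:
        return []
    try:
        return [(parts[0], float(parts[1]))]
    except ValueError:
        return []


def total_cost(cust_id, costs):
    """Reduce: summarize all values seen for one key."""
    return (cust_id, sum(costs))


orders = ["alice,10.0", "bob,4.5", "garbage", "alice,2.5"]  # made-up data
print(run_mapreduce(orders, parse_order, total_cost))
```

Passing only `parse_order` runs a map-only job that returns the parsed pairs without summarizing them.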
  
  
MapReduce Benefits
• Scalability
    • Hadoop divides the processing job into individual tasks
    • Tasks execute in parallel (independently) across cluster
• Simplicity
    • Processes one record at a time
• Ease of use
    • Hadoop provides job scheduling and other infrastructure
    • Far simpler for developers than typical distributed computing
  
MapReduce in Hadoop
• MapReduce processing in Hadoop is batch-oriented
• A MapReduce job is broken down into smaller tasks
    • Tasks run concurrently
    • Each processes a small amount of overall input
• MapReduce code for Hadoop is usually written in Java
    • This uses Hadoop’s API directly
• You can do basic MapReduce in other languages
    • Using the Hadoop Streaming wrapper program
    • Some advanced features require Java code
  
MapReduce Example in Python
• The following example uses Python
    • Via Hadoop Streaming
• It processes log files and summarizes events by type
    • I’ll explain both the data flow and the code
  
Job Input
• Here’s the job input

    2013-06-29 22:16:49.391 CDT INFO "This can wait"
    2013-06-29 22:16:52.143 CDT INFO "Blah blah blah"
    2013-06-29 22:16:54.276 CDT WARN "This seems bad"
    2013-06-29 22:16:57.471 CDT INFO "More blather"
    2013-06-29 22:17:01.290 CDT WARN "Not looking good"
    2013-06-29 22:17:03.812 CDT INFO "Fairly unimportant"
    2013-06-29 22:17:05.362 CDT ERROR "Out of memory!"

• Each map task gets a chunk of this data to process
    • Typically corresponds to a single block in HDFS
  
Python Code for Map Function

    #!/usr/bin/env python
    import sys

    # Define list of known log levels
    levels = ['TRACE', 'DEBUG', 'INFO',
              'WARN', 'ERROR', 'FATAL']

    # Read records from standard input
    for line in sys.stdin:
        # Use whitespace to split into fields
        fields = line.split()
        if len(fields) < 4:
            continue  # skip lines too short to contain a level
        # Extract "level" field and convert to uppercase for consistency
        level = fields[3].upper()
        # If it matches a known level, print it, a tab separator, and the
        # literal value 1 (since the level can only occur once per line)
        if level in levels:
            print("%s\t1" % level)
  
Output of Map Function
• The map function produces key/value pairs as output

    INFO 1
    INFO 1
    WARN 1
    INFO 1
    WARN 1
    INFO 1
    ERROR 1
  
The “Shuffle and Sort”
• Hadoop automatically merges, sorts, and groups map output
    • The result is passed as input to the reduce function
    • More on this later…

    Map Output    (Shuffle and Sort)    Reduce Input
    INFO 1                              ERROR 1
    INFO 1                              INFO 1
    WARN 1                              INFO 1
    INFO 1                              INFO 1
    WARN 1                              INFO 1
    INFO 1                              WARN 1
    ERROR 1                             WARN 1
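The sort-and-group step shown in the table can be approximated locally with the standard library. This sketch runs in a single process and ignores the fact that on a cluster the shuffle also moves data between machines; `shuffle_and_sort` is a hypothetical helper name, and the pairs are the map output from the slide.

```python
from itertools import groupby
from operator import itemgetter

# Map output from the slide, as (key, value) pairs in arrival order.
map_output = [("INFO", 1), ("INFO", 1), ("WARN", 1), ("INFO", 1),
              ("WARN", 1), ("INFO", 1), ("ERROR", 1)]


def shuffle_and_sort(pairs):
    """Sort pairs by key and group together the values that share a key."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]


reduce_input = shuffle_and_sort(map_output)
for key, values in reduce_input:
    print("%s\t%s" % (key, values))
```

Note that `groupby` only merges adjacent items, which is why the pairs are sorted by key first.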
  
  
Input to Reduce Function
• Reduce function receives a key and all values for that key

    ERROR 1
    INFO 1
    INFO 1
    INFO 1
    INFO 1
    WARN 1
    WARN 1

• Keys are always passed to reducers in sorted order
    • Although not obvious here, values are unordered
  
Python Code for Reduce Function

    #!/usr/bin/env python
    import sys

    # Initialize loop variables
    previous_key = None
    total = 0

    for line in sys.stdin:
        # Extract the key and value passed via standard input
        key, value = line.split()
        if key == previous_key:
            # If key unchanged, increment the count
            total = total + int(value)
        else:
            # If key changed, print data for old level
            if previous_key:
                print('%s\t%i' % (previous_key, total))
            # Start tracking data for the new record
            previous_key = key
            total = int(value)

    # Print data for the final key (if any input was seen)
    if previous_key:
        print('%s\t%i' % (previous_key, total))
  
Output of Reduce Function
• Its output is a sum for each level

    ERROR 1
    INFO 4
    WARN 2
  
Recap of Data Flow

Map input
    2013-06-29 22:16:49.391 CDT INFO "This can wait"
    2013-06-29 22:16:52.143 CDT INFO "Blah blah blah"
    2013-06-29 22:16:54.276 CDT WARN "This seems bad"
    2013-06-29 22:16:57.471 CDT INFO "More blather"
    2013-06-29 22:17:01.290 CDT WARN "Not looking good"
    2013-06-29 22:17:03.812 CDT INFO "Fairly unimportant"
    2013-06-29 22:17:05.362 CDT ERROR "Out of memory!"

Map output    (Shuffle and sort)    Reduce input    Reduce output
INFO 1                              ERROR 1         ERROR 1
INFO 1                              INFO 1          INFO 4
WARN 1                              INFO 1          WARN 2
INFO 1                              INFO 1
WARN 1                              INFO 1
INFO 1                              WARN 1
ERROR 1                             WARN 1
  
How to Run a Hadoop Streaming Job
• I’ll demonstrate this now…

TODO: provide VM and instructions for demo
  
MapReduce Daemons
• There are two daemon processes in MapReduce
• JobTracker (master)
    • Exactly one active JobTracker per cluster
    • Accepts jobs from client
    • Schedules and monitors tasks on slave nodes
    • Reassigns tasks in case of failure
• TaskTracker (slave)
    • Many per cluster
    • Performs the shuffle and sort
    • Executes map and reduce tasks
  
The Hadoop Ecosystem
Open Source Tools that Complement Hadoop
  
The Hadoop Ecosystem
• "Core Hadoop" consists of HDFS and MapReduce
    • These are the kernel of a much broader platform
• Hadoop has many related projects
    • Some help you integrate Hadoop with other systems
    • Others help you analyze your data
• These are not considered “core Hadoop”
    • Rather, they’re part of the Hadoop ecosystem
    • Many are also open source Apache projects
  
Apache Sqoop
• Sqoop exchanges data between RDBMS and Hadoop
• Can import entire DB, a single table, or a table subset into HDFS
    • Does this very efficiently via a Map-only MapReduce job
    • Can also export data from HDFS back to the database

[Diagram: Database <-> Hadoop Cluster]
  
Apache Flume
• Flume imports data into HDFS as it is being generated by various sources

[Diagram: sources feeding the Hadoop Cluster: Program Output, UNIX syslog, Log Files, Custom Sources, and many more...]
  
Apache Pig
• Pig offers high-level data processing on Hadoop
    • An alternative to writing low-level MapReduce code

    people = LOAD '/data/customers' AS (cust_id, name);
    orders = LOAD '/data/orders' AS (ord_id, cust_id, cost);
    groups = GROUP orders BY cust_id;
    totals = FOREACH groups GENERATE group, SUM(orders.cost) AS t;
    result = JOIN totals BY group, people BY cust_id;
    DUMP result;

• Pig turns this into MapReduce jobs that run on Hadoop
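For readers who do not know Pig Latin, the script's logic can be traced statement by statement in plain Python. This is a single-machine sketch of what the script computes, not of how Pig executes it; the sample rows standing in for /data/customers and /data/orders are made up for illustration.

```python
from collections import defaultdict

# Made-up sample rows standing in for /data/customers and /data/orders.
people = [(1, "Alice"), (2, "Bob")]                       # (cust_id, name)
orders = [(100, 1, 20.0), (101, 2, 5.0), (102, 1, 7.5)]   # (ord_id, cust_id, cost)

# groups = GROUP orders BY cust_id;
groups = defaultdict(list)
for ord_id, cust_id, cost in orders:
    groups[cust_id].append(cost)

# totals = FOREACH groups GENERATE group, SUM(orders.cost) AS t;
totals = {cust_id: sum(costs) for cust_id, costs in groups.items()}

# result = JOIN totals BY group, people BY cust_id;
# (the joined row carries the fields of both inputs, side by side)
result = [(cust_id, totals[cust_id], cust_id, name)
          for cust_id, name in people if cust_id in totals]

# DUMP result;
for row in result:
    print(row)
```

With the sample rows above, each output row pairs a customer's total order cost with that customer's record.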
  
Apache Hive
• Hive is another abstraction on top of MapReduce
    • Like Pig, it also reduces development time
    • Hive uses a SQL-like language called HiveQL

    SELECT customers.cust_id, SUM(cost) AS total
    FROM customers
    JOIN orders
    ON customers.cust_id = orders.cust_id
    GROUP BY customers.cust_id
    ORDER BY total DESC;
  
Apache Mahout
• Mahout is a scalable machine learning library
• Support for several categories of problems
    • Classification
    • Clustering
    • Collaborative filtering
    • Frequent itemset mining
• Many algorithms implemented in MapReduce
    • Can parallelize inexpensively with Hadoop
  
Apache HBase
• HBase is a NoSQL database built on top of Hadoop
• Can store massive amounts of data
    • Gigabytes, terabytes, and even petabytes of data in a table
    • Tables can have many thousands of columns
• Scales to provide very high write throughput
    • Hundreds of thousands of inserts per second
  
Cloudera Impala
• Massively parallel SQL engine which runs on a Hadoop cluster
    • Inspired by Google’s Dremel project
    • Can query data stored in HDFS or HBase tables
• High performance
    • Typically > 10 times faster than Pig or Hive
    • Query syntax virtually identical to Hive / SQL
• Impala is 100% open source (Apache-licensed)
  
Querying Data with Hive and Impala
• I’ll demonstrate this now…

TODO: provide VM and instructions for demo
  
Visual Overview of a Complete Workflow

[Diagram: a Hadoop Cluster with Impala at the center, surrounded by these activities]
    • Import Transaction Data from RDBMS
    • Sessionize Web Log Data with Pig
    • Sentiment Analysis on Social Media with Hive
    • Analyst uses Impala for business intelligence
    • Generate Nightly Reports using Pig, Hive, or Impala
    • Build product recommendations for Web site
  
Hadoop Distributions
• Conceptually similar to a Linux distribution
• These distributions include a stable version of Hadoop
    • Plus ecosystem tools like Flume, Sqoop, Pig, Hive, Impala, etc.
• Benefits of using a distribution
    • Integration testing helps ensure all tools work together
    • Easy installation and updates
    • Compatibility certification from hardware vendors
    • Commercial support
• Apache Bigtop – upstream distribution for many commercial distributions
Cloudera’s Distribution (CDH)
•  Cloudera’s Distribution including Apache Hadoop (CDH)
    •  The most widely used distribution of Hadoop
    •  Stable, proven and supported environment
•  Combines Hadoop with many important ecosystem tools
    •  Including all those I’ve mentioned
•  How much does it cost?
    •  It’s completely free
    •  Apache licensed; it’s 100% open source too
  
Hadoop Clusters: Past, Present, and Future
  
Hadoop Cluster Overview
•  A cluster is made up of nodes
    •  A node is simply a (typically rackmount) server
    •  There may be a few, or a few thousand, nodes
    •  Most are slave nodes, but a few are master nodes
    •  Every node is responsible for both storage and processing
•  Nodes are connected together by network switches
•  Slave nodes do not use RAID
    •  Block splitting and replication are built into HDFS
•  Nearly all production clusters run Linux
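
The point about block splitting and replication can be sketched in a few lines of Python. This is an illustration under stated assumptions, not HDFS code: the 64 MB block size and replication factor of three are the defaults quoted earlier in this deck, the 225 MB file and node names A to E echo the earlier replication example, and the round-robin placement in `place_replicas` is a hypothetical stand-in for the placement policy HDFS actually uses.

```python
BLOCK_SIZE_MB = 64  # default block size quoted earlier in this deck
REPLICATION = 3     # default replication factor quoted earlier in this deck


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size in MB of each block for a file of the given size."""
    full, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full
    if remainder:
        sizes.append(remainder)
    return sizes


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical placement for illustration only; it is not the
    policy HDFS really applies.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block in range(num_blocks):
        placement[block + 1] = [
            nodes[(block + i) % len(nodes)] for i in range(replication)
        ]
    return placement


blocks = split_into_blocks(225)
layout = place_replicas(len(blocks), ["A", "B", "C", "D", "E"])
for block_id, holders in layout.items():
    size = blocks[block_id - 1]
    print("Block #%d (%d MB) -> %s" % (block_id, size, ", ".join(holders)))
```

Since every block lives on three different nodes, losing any single node (or any single disk) leaves at least two copies of each block, which is why per-node RAID adds little.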
  
Anatomy of a Small Hadoop Cluster
•  Typically consists of industry-standard rackmounted servers
•  JobTracker and NameNode might run on same server
•  TaskTracker and DataNode are always co-located on each slave node for data locality
[Diagram: a master node (JobTracker, NameNode) and slave nodes (TaskTracker, DataNode) connected by a network switch]
  
Anatomy of a Large Cluster
[Diagram: a "core" network switch connected to each top-of-rack switch; numbered callouts mark the master nodes]
1. Master (active NameNode)
2. Master (standby NameNode)
3. Master (active JobTracker)
4. Master (standby JobTracker)
  
Build, Buy, or Use the Cloud?
•  There are several ways to run Hadoop at scale
    •  Build your own cluster
    •  Buy a pre-configured cluster from hardware vendor
    •  Run Hadoop in the cloud
        •  Private cloud: virtualized hardware in your data center
        •  Public cloud: on a service like Amazon EC2
•  Let’s cover the pros, cons, and concerns…
  
Build versus Buy
•  Pros for building your own
    •  Select whichever components you like
    •  Can reuse components you already have
    •  Avoid reliance on a single vendor
•  Pros for buying pre-configured cluster
    •  Vendor tests all the components together (certification)
    •  Avoids “blame the other vendor” during support calls
    •  May actually be less expensive due to economies of scale
  
Host in the Cloud?
•  Good choice when need for cluster is sporadic
    •  And amount of data stored / transferred is relatively low

[Table: "Your Own Cluster" versus "Hosted in the Cloud", with one X per row for Hardware Cost, Staffing Cost, Power / HVAC, Storage Cost, Bandwidth Cost, and Performance; the column of each X did not survive extraction]
  
Hadoop's Current Data Processing Architecture
•  Hadoop's facilities for data processing are based on the MapReduce framework
•  Works well, but there are two important limitations
    •  Only one active JobTracker per cluster (scalability)
    •  MapReduce is not an ideal fit for all processing needs (flexibility)
[Diagram: JobTracker and TaskTracker]
  
YARN: Next Generation Processing Architecture
•  YARN generalizes scheduling and resource allocation
•  Improves scalability
    •  Names and roles of daemons have changed
    •  Many responsibilities previously associated with JobTracker are now delegated to slave nodes
•  Improves flexibility
    •  Allows processing frameworks other than MapReduce
    •  Example: Apache Giraph (graph processing framework)
•  Maintains backwards compatibility
    •  MapReduce is also supported by YARN
    •  Requires no change to existing MapReduce jobs
  
Hadoop File Formats
•  Hadoop supports a number of input/output formats, including
    •  Free-form text
    •  Delimited text
    •  Several specialized formats for efficient storage
        •  Sequence files
        •  Avro
        •  RCFile
        •  Parquet
•  Also possible to add support for custom formats
  
Typical Dataset Example
•  Most of these formats store data as rows of fields
    •  Each row contains all fields for a single record
    •  Additional files contain additional records in the same format

date       time     buyer item     price
2014-02-11 22:16:49 Alice Cable    19.23
2014-02-11 22:17:52 Bob   DVD      28.78
2014-02-11 22:17:54 Alice Keyboard 36.99
2014-02-12 22:16:57 Alice Adapter  19.23
2014-02-12 22:17:01 Bob   Cable    28.78
2014-02-12 22:17:03 Alice Mouse    36.99
2014-02-12 22:17:05 Chuck Antenna  24.99
[The rows are shown split across two files: File #1 and File #2]
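
To show what "each row contains all fields for a single record" means for a program that reads such files, here is a minimal Python sketch that parses the sample rows above and totals spending per buyer. It assumes fields are separated by whitespace and never contain spaces themselves; the helper name `parse_record` is made up for this example.

```python
from collections import defaultdict
from decimal import Decimal

FIELDS = ("date", "time", "buyer", "item", "price")

# Sample rows copied from the slide above.
LINES = """\
2014-02-11 22:16:49 Alice Cable 19.23
2014-02-11 22:17:52 Bob DVD 28.78
2014-02-11 22:17:54 Alice Keyboard 36.99
2014-02-12 22:16:57 Alice Adapter 19.23
2014-02-12 22:17:01 Bob Cable 28.78
2014-02-12 22:17:03 Alice Mouse 36.99
2014-02-12 22:17:05 Chuck Antenna 24.99
""".splitlines()


def parse_record(line):
    """Split one whitespace-delimited line into a dict keyed by field name."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(values)))
    record = dict(zip(FIELDS, values))
    record["price"] = Decimal(record["price"])  # exact arithmetic for money
    return record


totals = defaultdict(Decimal)  # Decimal() is zero, so new buyers start at 0
for line in LINES:
    record = parse_record(line)
    totals[record["buyer"]] += record["price"]

for buyer in sorted(totals):
    print("%s\t%s" % (buyer, totals[buyer]))
```

Note that the loop has to read every field of every row even though it uses only `buyer` and `price`; that cost is what column-oriented formats aim to avoid.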
  
The Parquet File Format
•  Parquet is a new high-performance file format
    •  Originally developed by engineers from Cloudera and Twitter
    •  Open source, with an active developer community
•  Can store each column in its own file
    •  Allows for much better compression due to similar values
    •  Reduces I/O when only a subset of columns is needed

File #1 (date)  File #2 (time)  File #3 (buyer)  File #4 (item)  File #5 (price)
2014-02-11      22:16:49        Alice            Cable           19.23
2014-02-11      22:16:52        Bob              DVD             28.78
2014-02-11      22:16:54        Alice            Keyboard        36.99
2014-02-12      22:16:57        Alice            Adapter         19.23
2014-02-12      22:17:01        Bob              Cable           28.78
2014-02-12      22:17:03        Alice            Mouse           36.99
2014-02-12      22:17:05        Chuck            Antenna         24.99
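
The two benefits listed above can be sketched with the same sample data. The Python below pivots rows into per-column lists, run-length encodes the `date` column to show why similar adjacent values compress well, and sums `price` while touching only that one column. It illustrates the column-oriented idea only: the function names are hypothetical, and run-length encoding is used here as a simple stand-in, not as a description of Parquet's actual on-disk encoding.

```python
from decimal import Decimal
from itertools import groupby

FIELDS = ("date", "time", "buyer", "item", "price")

# Sample rows copied from the slide above.
ROWS = [
    ("2014-02-11", "22:16:49", "Alice", "Cable", "19.23"),
    ("2014-02-11", "22:16:52", "Bob", "DVD", "28.78"),
    ("2014-02-11", "22:16:54", "Alice", "Keyboard", "36.99"),
    ("2014-02-12", "22:16:57", "Alice", "Adapter", "19.23"),
    ("2014-02-12", "22:17:01", "Bob", "Cable", "28.78"),
    ("2014-02-12", "22:17:03", "Alice", "Mouse", "36.99"),
    ("2014-02-12", "22:17:05", "Chuck", "Antenna", "24.99"),
]


def to_columns(rows, fields=FIELDS):
    """Pivot row-oriented records into one list of values per field."""
    return {name: [row[i] for row in rows] for i, name in enumerate(fields)}


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(ROWS)

# Seven date values collapse to two (value, count) pairs.
encoded_dates = run_length_encode(columns["date"])
print(encoded_dates)

# A query over price reads one column and ignores the other four.
total = sum(Decimal(price) for price in columns["price"])
print(total)
```

In a row-oriented file the same total would require reading all five fields of every record.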
  
Conclusion	
  
Key Points
•  We’re generating massive volumes of data
    •  This data can be extremely valuable
    •  Companies can now analyze what they previously discarded
•  Hadoop supports large-scale data storage and processing
    •  Heavily influenced by Google's architecture
    •  Already in production by thousands of organizations
    •  HDFS is Hadoop's storage layer
    •  MapReduce is Hadoop's processing framework
•  Many ecosystem projects complement Hadoop
    •  Some help you to integrate Hadoop with existing systems
    •  Others help you analyze the data you’ve stored
  
Highly Recommended Books
Author: Tom White, ISBN: 1-449-31152-0
Author: Eric Sammer, ISBN: 1-449-32705-2
  
My book
@hadooparchbook
hadooparchitecturebook.com
  
About Cloudera
•  Helps companies profit from their data
    •  Founded by experts from Facebook, Google, Oracle, and Yahoo
•  We offer products and services for large-scale data analysis
    •  Software (CDH distribution and Cloudera Manager)
    •  Consulting and support services
    •  Training and certification
•  Active developers of open source “Big Data” software
    •  Staff includes committers to every single project I’ll cover today
  
Questions?
•  Thank you for attending!
•  I’ll be happy to answer any additional questions now…
•  Want to learn even more?
    •  Cloudera training: developers, analysts, sysadmins, and more
    •  Offered in more than 50 cities worldwide, and online too!
    •  See http://university.cloudera.com/ for more info
•  Demo and slides at github.com/markgrover/hadoop-intro-fast
•  Twitter: mark_grover
•  Survey page: tiny.cloudera.com/mark
  

Intro to hadoop tutorial

  • 1.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   Introduc=on  to  Apache  Hadoop     and  its  Ecosystem   Mark  Grover    |    Budapest  Data  Forum   June  5th,  2015   @mark_grover   github.com/markgrover/hadoop-­‐intro-­‐fast   ©  Copyright  2010-­‐2014              Cloudera,  Inc.                All  rights  reserved.      
  • 2.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   •  Facili=es   •  Ques=ons   •  Schedule  (=mes  are  approximate)   Schedule  and  Logis=cs   Time   Event   9:00  –  10:20   Tutorial   10:20  –  10:30   Break   10:30  –  12:00   Tutorial  
  • 3.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   About  the  Presenta=on…   •  What’s  ahead   •  Fundamental  Concepts   •  HDFS:  The  Hadoop  Distributed  File  System   •  Data  Processing  with  MapReduce   •  The  Hadoop  Ecosystem   •  Hadoop  Clusters:  Past,  Present,  and  Future   •  Conclusion  +  Q&A  
  • 4.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   Fundamental  Concepts   Why  the  World  Needs  Hadoop  
  • 5.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   Volume   •  Every  day…   •  More  than  1.5  billion  shares  are  traded  on  the  NYSE   •  Facebook  stores  2.7  billion  comments  and  Likes   •  Every  minute…   •  Foursquare  handles  more  than  2,000  check-­‐ins   •  TransUnion  makes  nearly  70,000  updates  to  credit  files   •  Every  second…   •  Banks  process  more  than  10,000  credit  card  transac=ons  
  • 6.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   •  We  are  genera=ng  data  faster  than  ever   •  Processes  are  increasingly  automated   •  People  are  increasingly  interac=ng  online   •  Systems  are  increasingly  interconnected   Velocity  
  • 7.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   Variety   •  We’re  producing  a  variety  of  data,  including   •  Audio   •  Video   •  Images   •  Log  files   •  Web  pages   •  Product  ra=ngs   •  Social  network  connec=ons   •  Not  all  of  this  maps  cleanly  to  the  rela=onal  model  
  • 8.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   Big  Data  Can  Mean  Big  Opportunity   •  One  tweet  is  an  anecdote   •  But  a  million  tweets  may  signal  important  trends   •  One  person’s  product  review  is  an  opinion   •  But  a  million  reviews  might  uncover  a  design  flaw   •  One  person’s  diagnosis  is  an  isolated  case   •  But  a  million  medical  records  could  lead  to  a  cure  
  • 9.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   We  Need  a  System  that  Scales   •  Too  much  data  for  tradi=onal  tools   •  Two  key  problems   •  How  to  reliably  store  this  data  at  a  reasonable  cost   •  How  to  we  process  all  the  data  we’ve  stored  
  • 10.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   What  is  Apache  Hadoop?   •  Scalable  data  storage  and  processing   •  Distributed  and  fault-­‐tolerant     •  Runs  on  standard  hardware   •  Two  main  components   •  Storage:  Hadoop  Distributed  File  System  (HDFS)   •  Processing:  MapReduce   •  Hadoop  clusters  are  composed  of  computers  called  nodes   •  Clusters  range  from  a  single  node  up  to  several  thousand  nodes  
  • 11.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   How  Did  Apache  Hadoop  Originate?   •  Heavily  influenced  by  Google’s  architecture   •  Notably,  the  Google  Filesystem  and  MapReduce  papers   •  Other  Web  companies  quickly  saw  the  benefits   •  Early  adop=on  by  Yahoo,  Facebook  and  others   2002 2003 2004 2005 2006 Google publishes MapReduce paper Nutch rewritten for MapReduce Hadoop becomes Lucene subproject Nutch spun off from Lucene Google publishes GFS paper
  • 12.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   How  Are  Organiza=ons  Using  Hadoop?   •  Let’s  look  at  a  few  common  uses…  
  • 13.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   Hadoop  Use  Case:  Log  Sessioniza=on   February 12, 2014 10.174.57.241 - - [17/Feb/2014:17:57:41 -0500] "GET /s?q=widget HTTP/1.1" 200 3617 "http://www.hotbot.com/find/dualcore" "WebTV 1.2" "U=129" 1) Search for 'Widget' 2) Widget Results 3) View Details for Widget X Recent Activity for John Smith February 17, 2014 5) Track Order 6) Click 'Contact Us' Link 7) Submit Complaint 4) Order Widget X ... 10.174.57.241 - - [17/Feb/2014:17:58:03 -0500] "GET /wres.html HTTP/1.1" 200 5741 "http://www.example.com/s?q=widget" "WebTV 1.2" "U=129" 10.174.57.241 - - [17/Feb/2014:17:58:25 -0500] "GET /detail?w=41 HTTP/1.1" 200 8584 "http://www.example.com/wres.html" "WebTV 1.2" "U=129" 10.174.57.241 - - [17/Feb/2014:17:59:36 -0500] "GET /order.do HTTP/1.1" 200 964 "http://www.example.com/detail?w=41" "WebTV 1.2" "U=129" 10.174.57.241 - - [17/Feb/2014:17:59:47 -0500] "GET /confirm HTTP/1.1" 200 964 "http://www.example.com/order.do" "WebTV 1.2" "U=129" 10.218.46.19 - - [17/Feb/2014:17:57:43 -0500] "GET /ide.html HTTP/1.1" 404 955 "http://www.example.com/s?q=JBuilder" "Mosaic/3.6 (X11;SunOS)" 10.32.51.237 - - [17/Feb/2014:17:58:04 -0500] "GET /os.html HTTP/1.1" 404 955 "http://www.example.com/s?q=VMS" "Mozilla/1.0b (Win3.11)" 10.157.96.181 - - [17/Feb/2014:17:58:26 -0500] "GET /mp3.html HTTP/1.1" 404 955 "http://www.example.com/s?q=Zune" "Mothra/2.77" "U=3622" ... Web Server Log Data Clickstream Data for User Sessions Process Logs with Hadoop
  • 14.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   Hadoop  Use  Case:  Customer  Analy=cs   Example Inc. Public Web Site (February 9 - 15) Category Unique Visitors Page Views Bounce Rate Conversion RateAverage Time on Page Television 1,967,345 8,439,206 23% 51%17 seconds Smartphone 982,384 3,185,749 47% 41%23 seconds MP3 Player 671,820 2,174,913 61% 12%42 seconds Stereo 472,418 1,627,843 74% 19%26 seconds Monitor 327,018 1,241,837 56% 17%37 seconds Tablet 217,328 816,545 48% 28%53 seconds Printer 127,124 535,261 27% 64%34 seconds
  • 15.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   Hadoop  Use  Case:  Sen=ment  Analysis   00 01 02 03 04 05 06 07 08 09 10 Negative Neutral Positive References to Product in Social Media (Hourly) 10,000 20,000 30,000 40,000 50,000 60,000 70,000 80,000 Hour 11 12 13 14 15 16 17 18 19 20 21 22 23
  • 16.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   Hadoop  Use  Case:  Product  Recommenda=ons   Audio Adapter - Stereo Products Recommended for You $4.99 (341 ratings) Over-the-Ear Headphones $29.99 (1,672 ratings) 9-volt Alkaline Battery $1.79 (847 ratings)
  • 17.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   —    Grace  Hopper,  early  advocate  of  distributed  compu=ng   “In  pioneer  days  they  used  oxen  for  heavy   pulling,  and  when  one  ox  couldn’t  budge  a  log,   we  didn’t  try  to  grow  a  larger  ox”  
  • 18.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   Comparing  Hadoop  to  Other  Systems   •  Monolithic  systems  don’t  scale   •  Modern  high-­‐performance  compu=ng  systems  are  distributed   •  They  spread  computa=ons  across  many  machines  in  parallel   •  Widely-­‐used  used  for  scien=fic  applica=ons   •  Let’s  examine  how  a  typical  HPC  system  works  
  • 19.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   Architecture  of  a  Typical  HPC  System   Storage System Compute Nodes Fast Network
  • 20.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   Architecture  of  a  Typical  HPC  System   Storage System Compute Nodes Step 1: Copy input data Fast Network
  • 21.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   Architecture  of  a  Typical  HPC  System   Storage System Compute Nodes Step 2: Process the data Fast Network
  • 22.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   Architecture  of  a  Typical  HPC  System   Storage System Compute Nodes Step 3: Copy output data Fast Network
  • 23.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   You  Don’t  Just  Need  Speed…   •  The  problem  is  that  we  have  way  more  data  than  code   $ du -ks code/ 1,087 $ du –ks data/ 854,632,947,314
  • 24.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   You  Need  Speed  At  Scale   Storage System Compute Nodes Bottleneck
  • 25.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   Hadoop  Design  Fundamental:  Data  Locality   •  This  is  a  hallmark  of  Hadoop’s  design   •  Don’t  bring  the  data  to  the  computa=on   •  Bring  the  computa=on  to  the  data   •  Hadoop  uses  the  same  machines  for  storage  and  processing   •  Significantly  reduces  need  to  transfer  data  across  network  
  • 26.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   Other  Hadoop  Design  Fundamentals   •  Machine  failure  is  unavoidable  –  embrace  it   •  Build  reliability  into  the  system   •  “More”  is  usually  beqer  than  “faster”   •  Throughput  maqers  more  than  latency  
  • 27.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   The  Hadoop  Distributed  Filesystem   HDFS  
  • 28.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   HDFS:  Hadoop  Distributed  File  System   •  Inspired  by  the  Google  File  System   •  Reliable,  low-­‐cost  storage  for  massive  amounts  of  data   •  Similar  to  a  UNIX  filesystem  in  some  ways   •  Hierarchical   •  UNIX-­‐style  paths  (e.g.,  /sales/alice.txt)   •  UNIX-­‐style  file  ownership  and  permissions  
  • 29.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   HDFS:  Hadoop  Distributed  File  System   •  There  are  also  some  major  devia=ons  from  UNIX  filesystems   •  Highly-­‐op=mized  for  processing  data  with  MapReduce   •  Designed  for  sequen=al  access  to  large  files   •  Cannot  modify  file  content  once  wriqen   •  It’s  actually  a  user-­‐space  Java  process   •  Accessed  using  special  commands  or  APIs   •  No  concept  of  a  current  working  directory  
  • 30.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   HDFS  Blocks   •  Files  added  to  HDFS  are  split  into  fixed-­‐size  blocks   •  Block  size  is  configurable,  but  defaults  to  64  megabytes   Block #1: First 64 MB Lorem ipsum dolor sit amet, consectetur sed adipisicing elit, ado lei eiusmod tempor etma incididunt ut libore tua dolore magna alli quio ut enim ad minim veni veniam, quis nostruda exercitation ul laco es sed laboris nisi ut eres aliquip ex eaco modai consequat. Duis hona irure dolor in repre sie honerit in ame mina lo voluptate elit esse oda cillum le dolore eu fugi gia nulla aria tur. Ente culpa qui officia ledea un mollit anim id est o laborum ame elita tu a magna omnibus et. Lorem ipsum dolor sit amet, consectetur sed adipisicing elit, ado lei eiusmod tempor etma incididunt ut libore tua dolore magna alli quio ut enim ad minim veni veniam, quis nostruda exercitation ul laco es sed laboris nisi ut eres aliquip ex eaco modai consequat. Duis hona irure dolor in repre sie honerit in ame mina lo voluptate elit esse oda cillum le dolore eu fugi gia nulla aria tur. Ente culpa qui officia ledea un mollit anim id est o laborum ame elita tu a magna omnibus et. Block #4: Remaining 33 MB Block #2: Next 64 MB Block #3: Next 64 MB 225 MB File
  • 31.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   Why  Does  HDFS  Use  Such  Large  Blocks?   Current location of disk head Where the data you need is stored
  • 32.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   HDFS  Replica=on   •  Each  block  is  then  replicated  across  mul=ple  nodes   •  Replica=on  factor  is  also  configurable,  but  defaults  to  three   •  Benefits  of  replica=on   •  Availability:  data  isn’t  lost  when  a  node  fails   •  Reliability:  HDFS  compares  replicas  and  fixes  data  corrup=on   •  Performance:  allows  for  data  locality   •  Let’s  see  an  example  of  replica=on…  
  • 33.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   HDFS  Replica=on  (cont’d)   Lorem ipsum dolor sit amet, consectetur sed adipisicing elit, ado lei eiusmod tempor etma incididunt ut libore tua dolore magna alli quio ut enim ad minim veni veniam, quis nostruda exercitation ul laco es sed laboris nisi ut eres aliquip ex eaco modai consequat. Duis hona irure dolor in repre sie honerit in ame mina lo voluptate elit esse oda cillum le dolore eu fugi gia nulla aria tur. Ente culpa qui officia ledea un mollit anim id est o laborum ame elita tu a magna omnibus et. A B C Block #1 Block #2 Block #3 Block #4 E D
  • 34.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   HDFS  Replica=on  (cont’d)   C E Lorem ipsum dolor sit amet, consectetur sed adipisicing elit, ado lei eiusmod tempor etma incididunt ut libore tua dolore magna alli quio irure dolor in repre sie honerit in ame mina lo voluptate elit esse oda cillum le dolore eu fugi gia nulla aria tur. Ente culpa qui officia ledea un mollit anim id est o laborum ame elita tu a magna omnibus et. D B ABlock #1 Block #2 Block #3 Block #4 ut enim ad minim veni veniam, quis nostruda exercitation ul laco es sed laboris nisi ut eres aliquip ex eaco modai consequat. Duis hona
  • 35.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   HDFS  Replica=on  (cont’d)   Lorem ipsum dolor sit amet, consectetur sed adipisicing elit, ado lei eiusmod tempor etma incididunt ut libore tua dolore magna alli quio ut enim ad minim veni veniam, quis nostruda exercitation ul laco es sed laboris nisi ut eres aliquip ex eaco modai consequat. Duis hona un mollit anim id est o laborum ame elita tu a magna omnibus et. Block #1 Block #2 Block #3 Block #4 A E B irure dolor in repre sie honerit in ame mina lo voluptate elit esse oda cillum le dolore eu fugi gia nulla aria tur. Ente culpa qui officia ledea D C
  • 36.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   HDFS  Replica=on  (cont’d)   Lorem ipsum dolor sit amet, consectetur sed adipisicing elit, ado lei eiusmod tempor etma incididunt ut libore tua dolore magna alli quio ut enim ad minim veni veniam, quis nostruda exercitation ul laco es sed laboris nisi ut eres aliquip ex eaco modai consequat. Duis hona irure dolor in repre sie honerit in ame mina lo voluptate elit esse oda cillum le dolore eu fugi gia nulla aria tur. Ente culpa qui officia ledea Block #1 Block #2 Block #3 Block #4 B C E A D un mollit anim id est o laborum ame elita tu a magna omnibus et.
  • 37.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   Accessing  HDFS  via  the  Command  Line   •  Users  typically  access  HDFS  via  the  hadoop fs  command   •  Ac=ons  specified  with  subcommands  (prefixed  with  a  minus  sign)   •  Most  are  similar  to  corresponding  UNIX  commands   $ hadoop fs -ls /user/tomwheeler $ hadoop fs -cat /customers.csv $ hadoop fs -rm /webdata/access.log $ hadoop fs -mkdir /reports/marketing
  • 38.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   Copying  Local  Data  To  and  From  HDFS   •  Remember  that  HDFS  is  dis=nct  from  your  local  filesystem   •  hadoop fs –put  copies  local  files  to  HDFS   •  hadoop fs –get  fetches  a  local  copy  of  a  file  from  HDFS   $ hadoop fs -put sales.txt /reports Hadoop Cluster Client Machine $ hadoop fs -get /reports/sales.txt
  • 39.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   HDFS  Daemons   •  There  are  two  daemon  processes  in  HDFS   •  NameNode  (master)   •  Exactly  one  ac=ve  NameNode  per  cluster   •  Manages  namespace  and  metadata   •  DataNode  (slave)   •  Many  per  cluster   •  Performs  block  storage  and  retrieval  
  • 40.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   How  File  Reads  Work   client C D A B E Step 1: Get block locations from the NameNode
  • 41.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   How  File  Reads  Work  (cont'd)   client C D A B E Step 2: Read those blocks directly from the DataNodes
  • 42.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   HDFS  Demo   •  I  will  now  demonstrate  the  following   1.  How  to  list  the  contents  of  a  directory   2.  How  to  create  a  directory  in  HDFS   3.  How  to  copy  a  local  file  to  HDFS   4.  How  to  display  the  contents  of  a  file  in  HDFS   5.  How  to  remove  a  file  from  HDFS   TODO:  provide  VM  and   instruc=ons  for  demo  
  • 43.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   A  Scalable  Data  Processing  Framework   Data  Processing  with  MapReduce  
  • 44.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   What  is  MapReduce?   •  MapReduce  is  a  programming  model   •  It’s  a  way  of  processing  data     •  You  can  implement  MapReduce  in  any  language  
  • 45.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   Understanding  Map  and  Reduce   •  You  supply  two  func=ons  to  process  data:  Map  and  Reduce   •  Map:  typically  used  to  transform,  parse,  or  filter  data   •  Reduce:  typically  used  to  summarize  results   •  The  Map  func=on  always  runs  first   •  The  Reduce  func=on  runs  averwards,  but  is  op=onal   •  Each  piece  is  simple,  but  can  be  powerful  when  combined  
  • 46.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   MapReduce  Benefits   •  Scalability   •  Hadoop  divides  the  processing  job  into  individual  tasks   •  Tasks  execute  in  parallel  (independently)  across  cluster   •  Simplicity   •  Processes  one  record  at  a  =me   •  Ease  of  use   •  Hadoop  provides  job  scheduling  and  other  infrastructure   •  Far  simpler  for  developers  than  typical  distributed  compu=ng  
  • 47.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   MapReduce  in  Hadoop   •  MapReduce  processing  in  Hadoop  is  batch-­‐oriented   •  A  MapReduce  job  is  broken  down  into  smaller  tasks   •  Tasks  run  concurrently   •  Each  processes  a  small  amount  of  overall  input   •  MapReduce  code  for  Hadoop  is  usually  wriqen  in  Java   •  This  uses  Hadoop’s  API  directly   •  You  can  do  basic  MapReduce  in  other  languages   •  Using  the  Hadoop  Streaming  wrapper  program   •  Some  advanced  features  require  Java  code  
  • 48.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   MapReduce  Example  in  Python   •  The  following  example  uses  Python   •  Via  Hadoop  Streaming   •  It  processes  log  files  and  summarizes  events  by  type   •  I’ll  explain  both  the  data  flow  and  the  code  
  • 49.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   Job  Input   •  Here’s  the  job  input     •  Each  map  task  gets  a  chunk  of  this  data  to  process   •  Typically  corresponds  to  a  single  block  in  HDFS   2013-06-29 22:16:49.391 CDT INFO "This can wait" 2013-06-29 22:16:52.143 CDT INFO "Blah blah blah" 2013-06-29 22:16:54.276 CDT WARN "This seems bad" 2013-06-29 22:16:57.471 CDT INFO "More blather" 2013-06-29 22:17:01.290 CDT WARN "Not looking good" 2013-06-29 22:17:03.812 CDT INFO "Fairly unimportant" 2013-06-29 22:17:05.362 CDT ERROR "Out of memory!"
  • 50.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   #!/usr/bin/env python import sys levels = ['TRACE', 'DEBUG', 'INFO', 'WARN', 'ERROR', 'FATAL'] for line in sys.stdin: fields = line.split() level = fields[3].upper() if level in levels: print "%st1" % level 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Python  Code  for  Map  Func=on   If  it  matches  a  known  level,  print   it,  a  tab  separator,  and  the  literal   value  1  (since  the  level  can  only   occur  once  per  line)   Read  records  from  standard  input.   Use  whitespace  to  split  into  fields.       Define  list  of  known  log  levels   Extract  “level”  field  and  convert  to   uppercase  for  consistency.  
  • 51.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   Output  of  Map  Func=on   •  The  map  func=on  produces  key/value  pairs  as  output   INFO 1 INFO 1 WARN 1 INFO 1 WARN 1 INFO 1 ERROR 1
  • 52.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   The  “Shuffle  and  Sort”   •  Hadoop  automa9cally  merges,  sorts,  and  groups  map  output   •  The  result  is  passed  as  input  to  the  reduce  func=on   •  More  on  this  later…   INFO 1 INFO 1 WARN 1 INFO 1 WARN 1 INFO 1 ERROR 1 ERROR 1 INFO 1 INFO 1 INFO 1 INFO 1 WARN 1 WARN 1 Shuffle  and  Sort   Map  Output   Reduce  Input  
  • 53.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   Input  to  Reduce  Func=on   •  Reduce  func=on  receives  a  key  and  all  values  for  that  key       •  Keys  are  always  passed  to  reducers  in  sorted  order   •  Although  not  obvious  here,  values  are  unordered   ERROR 1 INFO 1 INFO 1 INFO 1 INFO 1 WARN 1 WARN 1
  • 54.
    ©  2010  –  2015  Cloudera,  Inc.  All  Rights  Reserved   Python  Code  for  Reduce  Func=on   #!/usr/bin/env python import sys previous_key = None sum = 0 for line in sys.stdin: key, value = line.split() if key == previous_key: sum = sum + int(value) # continued on next slide 1 2 3 4 5 6 7 8 9 10 11 12 13 Ini=alize  loop  variables   Extract  the  key  and  value   passed  via  standard  input   If  key  unchanged,     increment  the  count  
  • 55.
Python Code for Reduce Function

        # continued from previous slide
        else:
            # If key changed, print data for old level
            if previous_key:
                print('%s\t%i' % (previous_key, sum))

            # Start tracking data for the new record
            previous_key = key
            sum = int(value)

    # Print data for the final key (skipped if there was no input)
    if previous_key:
        print('%s\t%i' % (previous_key, sum))
  • 56.
Output of Reduce Function
• Its output is a sum for each level

    ERROR   1
    INFO    4
    WARN    2
  • 57.
Recap of Data Flow

Map input:
    2013-06-29 22:16:49.391 CDT INFO "This can wait"
    2013-06-29 22:16:52.143 CDT INFO "Blah blah blah"
    2013-06-29 22:16:54.276 CDT WARN "This seems bad"
    2013-06-29 22:16:57.471 CDT INFO "More blather"
    2013-06-29 22:17:01.290 CDT WARN "Not looking good"
    2013-06-29 22:17:03.812 CDT INFO "Fairly unimportant"
    2013-06-29 22:17:05.362 CDT ERROR "Out of memory!"

Map output:
    INFO    1
    INFO    1
    WARN    1
    INFO    1
    WARN    1
    INFO    1
    ERROR   1

Shuffle and sort

Reduce input:
    ERROR   1
    INFO    1
    INFO    1
    INFO    1
    INFO    1
    WARN    1
    WARN    1

Reduce output:
    ERROR   1
    INFO    4
    WARN    2
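The whole data flow can be reproduced on a single machine with a short Python sketch. It is a local simulation under stated assumptions, not a Hadoop job: `map_fn`, `reduce_fn`, and `run_job` are hypothetical names, and an in-memory sort stands in for the shuffle and sort that Hadoop performs across the cluster.

```python
from itertools import groupby
from operator import itemgetter

LEVELS = {"TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL"}

LOG_LINES = [
    '2013-06-29 22:16:49.391 CDT INFO "This can wait"',
    '2013-06-29 22:16:52.143 CDT INFO "Blah blah blah"',
    '2013-06-29 22:16:54.276 CDT WARN "This seems bad"',
    '2013-06-29 22:16:57.471 CDT INFO "More blather"',
    '2013-06-29 22:17:01.290 CDT WARN "Not looking good"',
    '2013-06-29 22:17:03.812 CDT INFO "Fairly unimportant"',
    '2013-06-29 22:17:05.362 CDT ERROR "Out of memory!"',
]


def map_fn(line):
    """Emit (level, 1) when the fourth field is a known log level."""
    fields = line.split()
    if len(fields) > 3 and fields[3].upper() in LEVELS:
        yield fields[3].upper(), 1


def reduce_fn(key, values):
    """Sum all values received for one key."""
    return key, sum(values)


def run_job(lines):
    """Run map, then a local stand-in for shuffle and sort, then reduce."""
    map_output = [pair for line in lines for pair in map_fn(line)]
    map_output.sort(key=itemgetter(0))
    return [
        reduce_fn(key, [value for _, value in group])
        for key, group in groupby(map_output, key=itemgetter(0))
    ]


reduce_output = run_job(LOG_LINES)
for level, count in reduce_output:
    print("%s\t%i" % (level, count))
```

Running the same logic through a small harness like this is a convenient way to check map and reduce functions before submitting them to a cluster.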
  • 58.
How to Run a Hadoop Streaming Job
• I'll demonstrate this now…

TODO: provide VM and instructions for demo
  • 59.
MapReduce Daemons
• There are two daemon processes in MapReduce
• JobTracker (master)
  • Exactly one active JobTracker per cluster
  • Accepts jobs from client
  • Schedules and monitors tasks on slave nodes
  • Reassigns tasks in case of failure
• TaskTracker (slave)
  • Many per cluster
  • Performs the shuffle and sort
  • Executes map and reduce tasks
  • 60.
The Hadoop Ecosystem
Open Source Tools that Complement Hadoop
  • 61.
The Hadoop Ecosystem
• "Core Hadoop" consists of HDFS and MapReduce
  • These are the kernel of a much broader platform
• Hadoop has many related projects
  • Some help you integrate Hadoop with other systems
  • Others help you analyze your data
• These are not considered "core Hadoop"
  • Rather, they're part of the Hadoop ecosystem
  • Many are also open source Apache projects
  • 62.
Apache Sqoop
• Sqoop exchanges data between RDBMS and Hadoop
• Can import entire DB, a single table, or a table subset into HDFS
  • Does this very efficiently via a Map-only MapReduce job
• Can also export data from HDFS back to the database

[Diagram: Database <-> Hadoop Cluster]
  • 63.
Apache Flume
• Flume imports data into HDFS as it is being generated by various sources

[Diagram: Program Output, UNIX syslog, Log Files, Custom Sources, and many more -> Hadoop Cluster]
  • 64.
Apache Pig
• Pig offers high-level data processing on Hadoop
  • An alternative to writing low-level MapReduce code

    people = LOAD '/data/customers' AS (cust_id, name);
    orders = LOAD '/data/orders' AS (ord_id, cust_id, cost);
    groups = GROUP orders BY cust_id;
    totals = FOREACH groups GENERATE group, SUM(orders.cost) AS t;
    result = JOIN totals BY group, people BY cust_id;
    DUMP result;

• Pig turns this into MapReduce jobs that run on Hadoop
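To make the relations in the Pig script concrete, here is a rough plain-Python equivalent of its GROUP, SUM, and JOIN steps. It is a sketch only: the sample rows are made up for illustration and stand in for the contents of /data/customers and /data/orders, and the helper names are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for /data/customers and /data/orders.
people = [(1, "Alice"), (2, "Bob"), (3, "Chuck")]          # (cust_id, name)
orders = [(100, 1, 19.0), (101, 2, 28.5),                  # (ord_id, cust_id, cost)
          (102, 1, 37.0), (103, 3, 25.0)]


def total_by_customer(order_rows):
    """GROUP orders BY cust_id, then SUM(orders.cost) for each group."""
    totals = defaultdict(float)
    for _ord_id, cust_id, cost in order_rows:
        totals[cust_id] += cost
    return dict(totals)


def join_totals(totals, people_rows):
    """JOIN totals BY group, people BY cust_id (inner join)."""
    return [
        (cust_id, totals[cust_id], cust_id, name)
        for cust_id, name in people_rows
        if cust_id in totals
    ]


result = join_totals(total_by_customer(orders), people)
for row in result:
    print(row)
```

Each output tuple mirrors the joined relation: the group key, the total, then the matching customer's fields.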
  • 65.
Apache Hive
• Hive is another abstraction on top of MapReduce
  • Like Pig, it also reduces development time
  • Hive uses a SQL-like language called HiveQL

    SELECT customers.cust_id, SUM(cost) AS total
    FROM customers
    JOIN orders
    ON customers.cust_id = orders.cust_id
    GROUP BY customers.cust_id
    ORDER BY total DESC;
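Because the query is ordinary SQL, its result shape can be previewed without a cluster. The sketch below runs the same statement against an in-memory SQLite database from Python's standard library. SQLite is only a stand-in here (it is not Hive and does not use MapReduce), and the table contents are made-up sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (cust_id INTEGER, name TEXT);
    CREATE TABLE orders (ord_id INTEGER, cust_id INTEGER, cost REAL);
    -- Hypothetical sample rows, for illustration only.
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Chuck');
    INSERT INTO orders VALUES (100, 1, 19.0), (101, 2, 28.5),
                              (102, 1, 37.0), (103, 3, 25.0);
""")

# Same statement as on the slide.
query = """
    SELECT customers.cust_id, SUM(cost) AS total
    FROM customers
    JOIN orders
    ON customers.cust_id = orders.cust_id
    GROUP BY customers.cust_id
    ORDER BY total DESC
"""

rows = conn.execute(query).fetchall()
for cust_id, total in rows:
    print(cust_id, total)
conn.close()
```

The rows come back as one (cust_id, total) pair per customer, largest total first.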
  • 66.
Apache Mahout
• Mahout is a scalable machine learning library
• Support for several categories of problems
  • Classification
  • Clustering
  • Collaborative filtering
  • Frequent itemset mining
• Many algorithms implemented in MapReduce
  • Can parallelize inexpensively with Hadoop
  • 67.
Apache HBase
• HBase is a NoSQL database built on top of Hadoop
• Can store massive amounts of data
  • Gigabytes, terabytes, and even petabytes of data in a table
  • Tables can have many thousands of columns
• Scales to provide very high write throughput
  • Hundreds of thousands of inserts per second
  • 68.
Cloudera Impala
• Massively parallel SQL engine which runs on a Hadoop cluster
  • Inspired by Google's Dremel project
  • Can query data stored in HDFS or HBase tables
• High performance
  • Typically > 10 times faster than Pig or Hive
  • Query syntax virtually identical to Hive / SQL
• Impala is 100% open source (Apache-licensed)
  • 69.
Querying Data with Hive and Impala
• I'll demonstrate this now…

TODO: provide VM and instructions for demo
  • 70.
Visual Overview of a Complete Workflow

[Diagram: Hadoop Cluster with Impala, surrounded by the following activities]
• Import Transaction Data from RDBMS
• Sessionize Web Log Data with Pig
• Analyst uses Impala for business intelligence
• Sentiment Analysis on Social Media with Hive
• Generate Nightly Reports using Pig, Hive, or Impala
• Build product recommendations for Web site
  • 71.
Hadoop Distributions
• Conceptually similar to a Linux distribution
• These distributions include a stable version of Hadoop
  • Plus ecosystem tools like Flume, Sqoop, Pig, Hive, Impala, etc.
• Benefits of using a distribution
  • Integration testing helps ensure all tools work together
  • Easy installation and updates
  • Compatibility certification from hardware vendors
  • Commercial support
• Apache Bigtop – upstream distribution for many commercial distributions
  • 72.
Cloudera's Distribution (CDH)
• Cloudera's Distribution including Apache Hadoop (CDH)
  • The most widely used distribution of Hadoop
  • Stable, proven and supported environment
• Combines Hadoop with many important ecosystem tools
  • Including all those I've mentioned
• How much does it cost?
  • It's completely free
  • Apache licensed – it's 100% open source too
  • 73.
Hadoop Clusters
Past, Present, and Future
  • 74.
Hadoop Cluster Overview
• A cluster is made up of nodes
  • A node is simply a (typically rackmount) server
  • There may be a few – or a few thousand – nodes
  • Most are slave nodes, but a few are master nodes
  • Every node is responsible for both storage and processing
• Nodes are connected together by network switches
• Slave nodes do not use RAID
  • Block splitting and replication are built into HDFS
• Nearly all production clusters run Linux
  • 75.
Anatomy of a Small Hadoop Cluster
• Typically consists of industry-standard rackmounted servers
• JobTracker and NameNode might run on same server
• TaskTracker and DataNode are always co-located on each slave node for data locality

[Diagram: Master Node (JobTracker, NameNode) and Slave Nodes (TaskTracker, DataNode) connected by a network switch]
  • 76.
Anatomy of a Large Cluster
• "Core" network switch connected to each top-of-rack switch

[Diagram legend]
1. Master (active NameNode)
2. Master (standby NameNode)
3. Master (active JobTracker)
4. Master (standby JobTracker)
  • 77.
Build, Buy, or Use the Cloud?
• There are several ways to run Hadoop at scale
  • Build your own cluster
  • Buy a pre-configured cluster from hardware vendor
  • Run Hadoop in the cloud
    • Private cloud: virtualized hardware in your data center
    • Public cloud: on a service like Amazon EC2
• Let's cover the pros, cons, and concerns…
  • 78.
Build versus Buy
• Pros for building your own
  • Select whichever components you like
  • Can reuse components you already have
  • Avoid reliance on a single vendor
• Pros for buying pre-configured cluster
  • Vendor tests all the components together (certification)
  • Avoids "blame the other vendor" during support calls
  • May actually be less expensive due to economies of scale
  • 79.
Host in the Cloud?
• Good choice when need for cluster is sporadic
  • And amount of data stored / transferred is relatively low

[Table columns: Your Own Cluster | Hosted in the Cloud]
    Hardware Cost:    X
    Staffing Cost:    X
    Power / HVAC:     X
    Storage Cost:     X
    Bandwidth Cost:   X
    Performance:      X
  • 80.
Hadoop's Current Data Processing Architecture
• Hadoop's facilities for data processing are based on the MapReduce framework
• Works well, but there are two important limitations
  • Only one active JobTracker per cluster (scalability)
  • MapReduce is not an ideal fit for all processing needs (flexibility)

[Diagram: JobTracker, TaskTracker]
  • 81.
YARN: Next Generation Processing Architecture
• YARN generalizes scheduling and resource allocation
• Improves scalability
  • Names and roles of daemons have changed
  • Many responsibilities previously associated with JobTracker are now delegated to slave nodes
• Improves flexibility
  • Allows processing frameworks other than MapReduce
  • Example: Apache Giraph (graph processing framework)
• Maintains backwards compatibility
  • MapReduce is also supported by YARN
  • Requires no change to existing MapReduce jobs
  • 82.
Hadoop File Formats
• Hadoop supports a number of input/output formats, including
  • Free-form text
  • Delimited text
  • Several specialized formats for efficient storage
    • Sequence files
    • Avro
    • RCFile
    • Parquet
• Also possible to add support for custom formats
  • 83.
Typical Dataset Example
• Most of these formats store data as rows of fields
  • Each row contains all fields for a single record
• Additional files contain additional records in the same format

Records shown across File #1 and File #2:

    date        time      buyer  item      price
    2014-02-11  22:16:49  Alice  Cable     19.23
    2014-02-11  22:17:52  Bob    DVD       28.78
    2014-02-11  22:17:54  Alice  Keyboard  36.99
    2014-02-12  22:16:57  Alice  Adapter   19.23
    2014-02-12  22:17:01  Bob    Cable     28.78
    2014-02-12  22:17:03  Alice  Mouse     36.99
    2014-02-12  22:17:05  Chuck  Antenna   24.99
  • 84.
The Parquet File Format
• Parquet is a new high-performance file format
  • Originally developed by engineers from Cloudera and Twitter
  • Open source, with an active developer community
• Can store each column in its own file
  • Allows for much better compression due to similar values
  • Reduces I/O when only a subset of columns are needed

    File #1 (date)  File #2 (time)  File #3 (buyer)  File #4 (item)  File #5 (price)
    2014-02-11      22:16:49        Alice            Cable           19.23
    2014-02-11      22:16:52        Bob              DVD             28.78
    2014-02-11      22:16:54        Alice            Keyboard        36.99
    2014-02-12      22:16:57        Alice            Adapter         19.23
    2014-02-12      22:17:01        Bob              Cable           28.78
    2014-02-12      22:17:03        Alice            Mouse           36.99
    2014-02-12      22:17:05        Chuck            Antenna         24.99
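The two benefits of a columnar layout can be illustrated with a few lines of Python. This is a sketch of the general idea only and does not use or imitate Parquet's actual on-disk encoding: run-length encoding is chosen here simply as an easy-to-read example of compressing similar adjacent values, and `to_columns` and `run_length_encode` are hypothetical helper names.

```python
from itertools import groupby

FIELDS = ("date", "time", "buyer", "item", "price")

# Records from the slide, in row-oriented form.
ROWS = [
    ("2014-02-11", "22:16:49", "Alice", "Cable", 19.23),
    ("2014-02-11", "22:16:52", "Bob", "DVD", 28.78),
    ("2014-02-11", "22:16:54", "Alice", "Keyboard", 36.99),
    ("2014-02-12", "22:16:57", "Alice", "Adapter", 19.23),
    ("2014-02-12", "22:17:01", "Bob", "Cable", 28.78),
    ("2014-02-12", "22:17:03", "Alice", "Mouse", 36.99),
    ("2014-02-12", "22:17:05", "Chuck", "Antenna", 24.99),
]


def to_columns(rows, fields):
    """Pivot row-oriented records into one list of values per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(fields)}


def run_length_encode(values):
    """Collapse runs of identical adjacent values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


columns = to_columns(ROWS, FIELDS)

# Similar values sit next to each other, so the date column shrinks to two pairs.
encoded_dates = run_length_encode(columns["date"])

# A query that needs only prices touches one column and skips the other four.
total_price = sum(columns["price"])

print(encoded_dates)
print(round(total_price, 2))
```

In the row-oriented form, answering the same price query would require reading every field of every record.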
  • 85.
Conclusion
  • 86.
Key Points
• We're generating massive volumes of data
  • This data can be extremely valuable
  • Companies can now analyze what they previously discarded
• Hadoop supports large-scale data storage and processing
  • Heavily influenced by Google's architecture
  • Already in production by thousands of organizations
  • HDFS is Hadoop's storage layer
  • MapReduce is Hadoop's processing framework
• Many ecosystem projects complement Hadoop
  • Some help you to integrate Hadoop with existing systems
  • Others help you analyze the data you've stored
  • 87.
Highly Recommended Books
• Author: Tom White, ISBN: 1-449-31152-0
• Author: Eric Sammer, ISBN: 1-449-32705-2
  • 88.
My book
• @hadooparchbook
• hadooparchitecturebook.com
  • 89.
About Cloudera
• Helps companies profit from their data
  • Founded by experts from Facebook, Google, Oracle, and Yahoo
• We offer products and services for large-scale data analysis
  • Software (CDH distribution and Cloudera Manager)
  • Consulting and support services
  • Training and certification
• Active developers of open source "Big Data" software
  • Staff includes committers to every single project I'll cover today
  • 90.
Questions?
• Thank you for attending!
• I'll be happy to answer any additional questions now…
• Want to learn even more?
  • Cloudera training: developers, analysts, sysadmins, and more
  • Offered in more than 50 cities worldwide, and online too!
  • See http://university.cloudera.com/ for more info
• Demo and slides at github.com/markgrover/hadoop-intro-fast
• Twitter: mark_grover
• Survey page: tiny.cloudera.com/mark