Scalding by Adform Research, Alex Gryzlov
Quick Guide
What is Scalding?
• Scala wrapper for Cascading
What is Cascading?
• Tap / Pipe / Sink abstraction over MapReduce in Java
What is Scalding?
• Scala wrapper for Cascading
• Just like working with in-memory collections!
TextLine( args("input") )
.flatMap('line -> 'word) { line : String => tokenize(line) }
.groupBy('word) { _.size }
.write( Tsv( args("output") ) )
• No more scripting and UDFs!
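To make the comparison with in-memory collections concrete, here is a minimal sketch of the same word count over a plain Scala Seq, using only the standard library. The tokenize helper is an assumption (the slide does not show its body), and the object name is purely illustrative.

```scala
object InMemoryWordCount {
  // Assumed tokenizer: lower-case the line and split on non-word characters.
  def tokenize(line: String): Seq[String] =
    line.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq

  // Same shape as the pipeline on the slide:
  // flatMap lines to words, group by word, take the size of each group.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(tokenize)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }

  def main(args: Array[String]): Unit =
    wordCount(Seq("to be or not to be", "To be is to do"))
      .toSeq
      .sortBy(_._1)
      .foreach { case (word, n) => println(s"$word\t$n") }
}
```

The Scalding version differs mainly in where the data lives: the same flatMap and groupBy steps are planned as Hadoop jobs instead of being evaluated in memory.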
Hands on
• Clone the skeleton repository
• Get IntelliJ IDEA and the Scala plugin
• Open the project
• Compile, wait for dependencies to download
• Create a run configuration …
• Create a specs2 configuration for tests
Run the WordCountJob in local mode with the given input and output
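As a rough illustration of what such a run configuration launches, the sketch below completes the word count snippet into a full job class and starts it in local mode through com.twitter.scalding.Tool. The tokenizer body, the RunWordCountLocally object, and both file paths are assumptions, not part of the skeleton repository. It needs Scalding on the classpath.

```scala
import com.twitter.scalding._

// Fields-API word count, completing the snippet from the earlier slide.
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => tokenize(line) }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))

  // Assumed tokenizer: lower-case the line and split on non-word characters.
  def tokenize(line: String): Array[String] =
    line.toLowerCase.split("\\W+").filter(_.nonEmpty)
}

// Hypothetical launcher, equivalent to an IDE run configuration whose main
// class is com.twitter.scalding.Tool and whose program arguments are
// "<job class> --local --input ... --output ...".
object RunWordCountLocally {
  def main(cmdArgs: Array[String]): Unit =
    Tool.main(Array(
      classOf[WordCountJob].getName,
      "--local",                          // run without a Hadoop cluster
      "--input", "data/input.txt",        // placeholder path
      "--output", "target/wordcount.tsv"  // placeholder path
    ))
}
```

Swapping --local for --hdfs is what turns the same job into the cluster run shown on the EMR slide.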
Building and Deploying
• Get sbt
• sbt assembly produces a jar file in target/scala-2.10
• sbt s3-upload produces the jar and uploads it to S3
• Configure TeamCity
Running on EMR
• hadoop fs -get s3://dev-adform-temp-results/wordcount-job.jar job.jar
• hadoop jar job.jar 
com.twitter.scalding.Tool ← Entry class
com.adform.dspr.WordCountJob ← Scalding job class
--hdfs ← Run in HDFS mode
--input s3://adform-dsp-metadata/countries/countries.txt ← Parameter
--output s3://dev-adform-temp-results/wordcount ← Parameter
Under the covers
• sbt "run-main com.twitter.scalding.Tool com.adform.dspr.WordCountJob --hdfs --tool.graph --input dummy --output dummy"
• dot -Tpng com.adform.dspr.WordCountJob0.dot -o logical_plan.png
• dot -Tpng com.adform.dspr.WordCountJob0_steps.dot -o mr_plan.png
Development
• Different APIs:
  • Fields – everything is a string
  • Typed – working with classes, e.g. Request/Transaction
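For comparison with the Fields-API snippet earlier in the deck, here is a sketch of the same word count in the Typed API, adapted from the example in the Scalding README. Source names such as TypedTsv vary between Scalding versions, and the tokenizer is an assumption. It needs Scalding on the classpath.

```scala
import com.twitter.scalding._

class TypedWordCountJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))   // TypedPipe[String], one element per line
    .flatMap { line => tokenize(line) }     // TypedPipe[String], one element per word
    .groupBy { word => word }               // use each word as the key
    .size                                   // number of occurrences per key
    .write(TypedTsv[(String, Long)](args("output")))

  // Assumed tokenizer: lower-case the line and split on non-word characters.
  def tokenize(line: String): Array[String] =
    line.toLowerCase.split("\\W+").filter(_.nonEmpty)
}
```

Here the compiler checks every step: passing a function of the wrong type to flatMap is a compile error rather than a runtime failure on the cluster.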
Development
• Fields:
  • No need to parse columns
  • Redundant
  • No IDE support like auto-completion
• Typed:
  • All benefits of types
  • More manual work with parsing
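The "manual work with parsing" in the Typed API usually looks something like the sketch below: a case class plus a hand-written parser from a raw line. The Request fields and the tab-separated layout are assumptions for illustration only; the real Request/Transaction classes are not shown in the slides.

```scala
import scala.util.Try

// Hypothetical record; the field names and column order are assumptions.
final case class Request(timestamp: Long, userId: String, price: Double)

object Request {
  // Parse one tab-separated line, returning None for malformed rows
  // (wrong column count or non-numeric timestamp/price).
  def parse(line: String): Option[Request] =
    line.split("\t", -1) match {
      case Array(ts, user, price) =>
        Try(Request(ts.toLong, user, price.toDouble)).toOption
      case _ =>
        None
    }
}
```

In a typed pipeline a parser like this would typically be applied with flatMap, so malformed rows are dropped and everything downstream works with Request values instead of strings.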
Resources
• https://github.com/twitter/scalding
• https://github.com/twitter/scalding/tree/develop/tutorial
• https://github.com/twitter/scalding/wiki
• http://www.slideshare.net/AntwnisChalkiopoulos/scalding-presentation
• http://www.slideshare.net/ktoso/scalding-the-notsobasics-scaladays-2014
• https://gitz.adform.com/dspr/data-processing/tree/develop/jobs/process-logs-rtb
My Experience
• Running the job locally is a HUGE time saver
• Programming in Scala is amazing (no more UDFs)
• Type safety, IDE support!
• Debugging !!!!111
• More optimal job plans
My Experience
• A lot of configuring and googling random issues
• Scarce documentation, had to read source code
• IntelliJ is slow
• Boilerplate code for parsing data
Use cases
• Easy jobs → Hive
• Non-trivial jobs → Scalding
• Optional: Scalding is nice for matrix calculations; Twitter also
provides many monoids (algorithms) for approximations, e.g.
HyperLogLog, CountMinSketch, etc. (see Algebird).
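To show why monoids matter for approximations, here is a toy Count-Min sketch in plain Scala with a merge operation that forms a monoid. It is a simplified sketch for illustration, not Algebird's implementation: the Monoid trait, the Cms class, and the hashing scheme are all assumptions made for this example.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical minimal monoid type class (Algebird's real Monoid is richer).
trait Monoid[T] {
  def zero: T
  def plus(a: T, b: T): T
}

// Toy Count-Min sketch: one row of counters per hash seed.
final case class Cms(table: Vector[Vector[Long]]) {
  require(table.nonEmpty && table.head.nonEmpty, "sketch needs at least one counter")
  private val width = table.head.length

  // Map an item to a counter index in the given row, using the row as hash seed.
  private def bucket(row: Int, item: String): Int = {
    val h = MurmurHash3.stringHash(item, row)
    ((h % width) + width) % width // keep the index non-negative
  }

  // Increment one counter per row.
  def add(item: String): Cms =
    Cms(table.zipWithIndex.map { case (counters, row) =>
      val i = bucket(row, item)
      counters.updated(i, counters(i) + 1L)
    })

  // Minimum over rows: may overestimate on collisions, never underestimates.
  def estimate(item: String): Long =
    table.zipWithIndex.map { case (counters, row) => counters(bucket(row, item)) }.min
}

object Cms {
  def empty(depth: Int, width: Int): Cms =
    Cms(Vector.fill(depth)(Vector.fill(width)(0L)))

  // Merging two sketches of the same shape is element-wise addition,
  // which is associative and has the empty sketch as identity.
  def monoid(depth: Int, width: Int): Monoid[Cms] = new Monoid[Cms] {
    val zero: Cms = empty(depth, width)
    def plus(a: Cms, b: Cms): Cms =
      Cms(a.table.zip(b.table).map { case (x, y) =>
        x.zip(y).map { case (p, q) => p + q }
      })
  }
}
```

Because plus is associative, each mapper can build a small sketch of its own partition and the reducers only have to merge sketches, which is the property that makes such structures a good fit for MapReduce-style aggregation.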
process-logs-rtb
• Had to hack Scalding:
  • WritableMultiSinkTap
  • Records
  • CompressedTsv
  • ModelKryoInstantiator
• Uses typed API
• Helpers like FluentJob