Scalding by Adform Research, Alex Gryzlov

What is Scalding?
• Scala wrapper for Cascading

What is Cascading?
Tap / Pipe / Sink abstraction over Map / Reduce in Java

What is Scalding?
• Scala wrapper for Cascading
• Just like working with in-memory collections!
    TextLine(args("input"))
      .flatMap('line -> 'word) { line: String => tokenize(line) }
      .groupBy('word) { _.size }
      .write(Tsv(args("output")))
• No more scripting and UDFs!
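
To make the "in-memory collections" comparison concrete, here is a minimal sketch of the same word count written against plain Scala collections, with no Scalding or Hadoop dependency. The `tokenize` helper is not shown in the slides, so the version below is an assumption (lower-case, split on non-word characters); the object name is also hypothetical.

```scala
object InMemoryWordCount {
  // Hypothetical tokenizer (the slides do not define tokenize):
  // lower-cases the line and splits on runs of non-word characters.
  def tokenize(line: String): Seq[String] =
    line.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq

  // Same shape as the Scalding pipeline above:
  // flatMap lines to words, group by word, take the size of each group.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(tokenize)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }
}
```

The Scalding version reads almost the same, but each step becomes part of a Cascading flow that is planned into MapReduce jobs rather than evaluated in memory.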

Hands on
• Clone the skeleton repository
• Get IntelliJ IDEA and the Scala plugin
• Open the project
• Compile, wait for dependencies to download
• Create a run configuration …
• Create a specs2 configuration for tests

Run the WordCountJob in local mode with the given input and output

Building and Deploying
• Get sbt
• sbt assembly produces a jar file in target/scala-2.10
• sbt s3-upload produces the jar and uploads it to S3
• Configure TeamCity

Running on EMR
• hadoop fs -get s3://dev-adform-temp-results/wordcount-job.jar job.jar
• hadoop jar job.jar
    com.twitter.scalding.Tool       (entry class)
    com.adform.dspr.WordCountJob    (Scalding job class)
    --hdfs                          (run in HDFS mode)
    --input s3://adform-dsp-metadata/countries/countries.txt    (parameter)
    --output s3://dev-adform-temp-results/wordcount             (parameter)

Under the covers
• sbt run-main
    com.twitter.scalding.Tool
    com.adform.dspr.WordCountJob
    --hdfs
    --tool.graph
    --input dummy --output dummy
• dot -Tpng com.adform.dspr.WordCountJob0.dot -o logical_plan.png
• dot -Tpng com.adform.dspr.WordCountJob0_steps.dot -o mr_plan.png

Development
• Different APIs:
• Fields – everything is a string
• Typed – working with classes, e.g. Request/Transaction

Development
• Fields:
• No need to parse columns
• Redundant
• No IDE support like auto-completion
• Typed:
• All benefits of types
• More manual work with parsing
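
The "manual work with parsing" on the Typed side can be sketched in plain Scala as follows. The slides name `Request` and `Transaction` but do not show their definitions, so the case class, its three columns, and the `parseRequest` helper below are all hypothetical.

```scala
object TypedParsingSketch {
  // Hypothetical record: the real Request class and its columns
  // are not shown in the slides.
  final case class Request(id: String, country: String, bidPrice: Double)

  // Parse one tab-separated line into a Request.
  // Returns None for rows with the wrong column count or a non-numeric price,
  // so malformed input can be dropped instead of failing the job.
  def parseRequest(line: String): Option[Request] =
    line.split("\t", -1) match {
      case Array(id, country, price) =>
        scala.util.Try(price.toDouble).toOption.map(Request(id, country, _))
      case _ => None
    }
}
```

Once rows are parsed like this, later steps work with `request.country` or `request.bidPrice`, which the compiler and the IDE can check, rather than with field names the compiler cannot verify.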

Resources
• https://github.com/twitter/scalding
• https://github.com/twitter/scalding/tree/develop/tutorial
• https://github.com/twitter/scalding/wiki
• http://www.slideshare.net/AntwnisChalkiopoulos/scalding-presentation
• http://www.slideshare.net/ktoso/scalding-the-notsobasics-scaladays-2014
• https://gitz.adform.com/dspr/data-processing/tree/develop/jobs/process-logs-rtb

My Experience
• Running the job locally is a HUGE time saver
• Programming in Scala is amazing (no more UDFs)
• Type safety, IDE support!
• Debugging !!!!111
• More optimal job plans

My Experience
• A lot of configuring and googling random issues
• Scarce documentation, had to read source code
• IntelliJ is slow
• Boilerplate code for parsing data

Use cases
• Easy jobs → Hive
• Non-trivial jobs → Scalding
• Optional: Scalding is nice for doing matrix calculations; Twitter also
provides a lot of monoids (algorithms) for nice approximations, e.g.
HyperLogLog, CountMinSketch, etc. (see Algebird).
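
Why monoids fit MapReduce can be shown with a small standalone sketch. This is not Algebird's API: the `Monoid` trait, `countsMonoid`, and `sumAll` below are illustrative definitions written for this example. The point is that an associative combine with an identity element lets partial results from different mappers be merged in any grouping.

```scala
object MonoidSketch {
  // Minimal monoid: an identity element plus an associative combine.
  // Illustrative only; not the Algebird trait.
  trait Monoid[A] {
    def zero: A
    def plus(a: A, b: A): A
  }

  // Word counts form a monoid: merge two maps by adding counts per key.
  val countsMonoid: Monoid[Map[String, Long]] = new Monoid[Map[String, Long]] {
    val zero: Map[String, Long] = Map.empty
    def plus(a: Map[String, Long], b: Map[String, Long]): Map[String, Long] =
      b.foldLeft(a) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0L) + v) }
  }

  // Combine partial results (for example one per mapper), starting from zero.
  def sumAll[A](parts: Seq[A])(m: Monoid[A]): A =
    parts.foldLeft(m.zero)(m.plus)
}
```

Approximate structures such as HyperLogLog fit the same pattern because two sketches can be merged into one, so they can be aggregated across partitions the same way as exact counts.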

process-logs-rtb
• Had to hack Scalding:
• WritableMultiSinkTap
• Records
• CompressedTsv
• ModelKryoInstantiator
• Uses the Typed API
• Helpers like FluentJob
