Scalding

Mario Pastorelli (Mario.Pastorelli@eurecom.fr)

EURECOM

September 27, 2012

What is Scalding




   Scalding is a Scala library written on top of Cascading that makes
   it easy to define MapReduce programs




Summary




  Hadoop MapReduce Programming Model



  Cascading



  Scalding




Map and Reduce

  At a high level, a MapReduce job is described by two functions
  operating over lists of key/value pairs.
      Map: a function from an input key/value pair to a list of
      intermediate key/value pairs

            map : (key_input, value_input) → list(key_map, value_map)

      Reduce: a function from an intermediate key and the list of its
      values to a list of output key/value pairs

            reduce : (key_map, list(value_map)) → list(key_reduce, value_reduce)
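
  A plain-Scala sketch of these two signatures for word count (the
  function bodies and types here are illustrative, not part of the
  original slides):

    def map(key: Long, value: String): List[(String, Int)] =
      value.split("\\s+").toList.map(word => (word, 1))

    def reduce(key: String, values: List[Int]): List[(String, Int)] =
      List((key, values.sum))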




Hadoop Programming Model
  The Hadoop MapReduce programming model lets the programmer control
  all of the job's workflow components. Job components are divided
  into two phases:

      The Map Phase (figure): Data Source → reader (k_i, v_i) → Mapper
      (k_m, v_m) → Combiner → Partitioner → Sorter; the combiner merges
      values that share a key, e.g. combine(v_m1, v_m5) = v_m6.

      The Reduce Phase (figure): Shuffle → Sorter → Grouper → Reducer
      (k_r, v_r) → writer → Data Dest.

Example: Word Count 1/2
   class TokenizerMapper extends Mapper<Object,Text,Text,IntWritable> {

       private final static IntWritable one = new IntWritable(1);
       private final Text word = new Text();

       public void map(Object key, Text value, Context context)
             throws IOException, InterruptedException {
         StringTokenizer itr = new StringTokenizer(value.toString());
         while (itr.hasMoreTokens()) {
           word.set(itr.nextToken());
           context.write(word, one);
         }
       }
   }

   class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> {

       public void reduce(Text key, Iterable<IntWritable> values,
                          Context context) throws IOException, InterruptedException {
         int sum = 0;
         for (IntWritable val : values)
           sum += val.get();
         context.write(key, new IntWritable(sum));
       }
   }

Example: Word Count 2/2
   public class WordCount {

       public static void main(String[] args) throws Exception {
         Configuration conf = new Configuration();
         Job job = new Job(conf, "word count");
         job.setMapperClass(TokenizerMapper.class);

         job.setCombinerClass(IntSumReducer.class);

         job.setReducerClass(IntSumReducer.class);
         job.setOutputKeyClass(Text.class);
         job.setOutputValueClass(IntWritable.class);
         FileInputFormat.addInputPath(job, new Path(args[0]));
         FileOutputFormat.setOutputPath(job, new Path(args[1]));
         System.exit(job.waitForCompletion(true) ? 0 : 1);
       }
   }


                 Emitting the integer 1 for every occurrence of a word is very
                 inefficient (1 TB of input yields more than 1 TB of intermediate data)
                 Hadoop cannot know whether the reducer can also be used as the
                 combiner: it must be set manually (setCombinerClass)
Hadoop weaknesses
      The reducer cannot always be used as a combiner; Hadoop relies on
      an explicit combiner specification or on manual partial
      aggregation tied to the mapper instance's life cycle (the
      in-mapper combiner pattern, sketched after this list)
      Combiners are limited to associative and commutative functions
      (like sum); partial aggregation is more general and powerful
      The programming model is limited to the map/reduce phases;
      multi-job programs are often difficult and counter-intuitive
      (think of iterative algorithms like PageRank)
      Joins can be difficult; many techniques must be implemented
      from scratch
      More generally, MapReduce is indeed simple, but many
      optimizations feel more like hacks than natural parts of the model
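
      A hypothetical sketch of the in-mapper combiner pattern, written in
      Scala against the Hadoop Mapper API for brevity (class and variable
      names are illustrative, not from the slides): partial counts are
      accumulated for the lifetime of the mapper instance and emitted once
      in cleanup(), instead of writing (word, 1) for every token.

    import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
    import org.apache.hadoop.mapreduce.Mapper
    import scala.collection.mutable

    class InMapperCombinerWordCount
        extends Mapper[LongWritable, Text, Text, IntWritable] {

      // partial aggregation state, kept for the whole life of the mapper instance
      private val counts = mutable.Map.empty[String, Int]

      override def map(key: LongWritable, value: Text,
                       context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
        value.toString.split("\\s+").foreach { word =>
          counts(word) = counts.getOrElse(word, 0) + 1
        }

      // called once after the whole input split has been consumed
      override def cleanup(context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
        counts.foreach { case (word, n) =>
          context.write(new Text(word), new IntWritable(n))
        }
    }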

Summary




  Hadoop MapReduce Programming Model



  Cascading



  Scalding




Cascading


      Open source project developed @Concurrent
      It is a Java application framework on top of Hadoop, designed to
      be extensible, that provides:
            Processing API: to develop complex data flows
            Integration API: integration testing supported by the framework,
            to avoid putting unstable software into production
            Scheduling API: used to schedule units of work from any
            third-party application
      It replaces the MapReduce programming model with a more generic,
      data-flow-oriented programming model
      Cascading has a data flow optimizer that converts user-defined data
      flows into optimized data flows



Cascading Programming Model




      A Cascading program is composed of flows
      A flow is composed of a source tap, a sink tap and the pipes
      that connect them
      A pipe applies a particular transformation to its input data
      flow
      Pipes can be combined to create more complex programs




Example: Word Count

           MapReduce word count concept (figure): Data Source → TextLine
           (k_i, v_i) → Map (tokenize the text and emit 1 for each token) →
           Shuffle → Reduce (count the values and emit the result) (k_r, v_r) →
           TextLine → Data Dest

           Cascading word count concept (figure): TextLine → tokenize each
           line → group by tokens → count values in every group → TextLine

Example: Word Count
   public class WordCount {
     public static void main( String[] args ) {
       Tap docTap = new Hfs( new TextDelimited( true, "\t" ), args[0] );
       Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), args[1] );

       RegexSplitGenerator s = new RegexSplitGenerator(
                                     new Fields( "token" ),
                                     "[ \\[\\]\\(\\),.]" );
       Pipe docPipe = new Each( "token", new Fields( "text" ), s,
                                Fields.RESULTS ); // text -> token

       Pipe wcPipe = new Pipe( "wc", docPipe );
       wcPipe = new GroupBy( wcPipe, new Fields( "token" ) );
       wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

       // connect the taps and pipes to create a flow definition
       FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
                                          .addSource( docPipe, docTap )
                                          .addTailSink( wcPipe, wcTap );

       new HadoopFlowConnector().connect( flowDef ).complete();
     }
   }

Summary




  Hadoop MapReduce Programming Model



  Cascading



  Scalding




Scalding


      Open source project developed @Twitter
      Two APIs:
           Field Based
               Primary API: stable
               Uses Cascading Fields: dynamic with errors at runtime
           Type Safe
               Secondary API: experimental
               Uses Scala Types: static with errors at compile time
      The two APIs can work together using pipe.typed and
      TypedPipe.from
      This presentation focuses on the TypeSafe API (a short sketch
      contrasting the two APIs follows)
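
      A hedged sketch contrasting the two APIs on the same operation
      (computing line lengths). The method names follow the classic
      Scalding APIs, but exact signatures may differ between versions,
      and the snippet assumes it runs inside a Scalding Job (where args
      and Source.read are available):

    // Field-Based API: fields are named by symbols, mistakes appear at runtime
    val lineLengths = TextLine( args( "input" ) )
      .read
      .map( 'line -> 'length ) { line : String => line.length }

    // TypeSafe API: plain Scala types, mistakes appear at compile time
    val typedLineLengths : TypedPipe[Int] =
      TypedPipe.from( TextLine( args( "input" ) ) )
               .map { line => line.length }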




Why Scalding

        The high-level idea of MapReduce comes from LISP and is based on
        functions (map/reduce) and function composition
        Cascading works on objects representing functions and uses
        constructors to compose pipes:

        Pipe wcPipe = new Pipe( "wc", docPipe );
        wcPipe = new GroupBy( wcPipe, new Fields( "token" ) );
        wcPipe = new Every( wcPipe, Fields.ALL, new Count(),
                            Fields.ALL );

        Functional programming can naturally describe data flows: every
        pipe can be seen as a function working on its input data, and
        pipes can be combined using functional composition. The code
        above could then be written as:

        docPipe.groupBy( new Fields( "token" ) )
               .every( Fields.ALL, new Count(), Fields.ALL )

Example: Word Count
    class WordCount(args : Args) extends Job(args) {

      /* TextLine reads each line of the given file */
      val input = TypedPipe.from( TextLine( args( "input" ) ) )

      /* tokenize every line and flatten the result into a list of words */
      val words = input.flatMap{ tokenize(_) }

      /* group by word and add a new field, size, that is the group size */
      val wordGroups = words.groupBy{ identity(_) }.size

      /* write each (word, count) pair as a line using TextLine */
      wordGroups.write((0,1), TextLine( args( "output" ) ) )

      /* Split a piece of text into individual words */
      def tokenize(text : String) : Array[String] = {
        // Lowercase each word and remove punctuation.
        text.trim.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "")
                             .split("\\s+")
      }
    }

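    Assuming a standard Scalding project setup, a job like this is typically
    launched through com.twitter.scalding.Tool (for example via hadoop jar
    with --hdfs --input ... --output ..., or with --local for testing); the
    exact invocation depends on how the project is packaged.
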
Scalding TypeSafe API


   Two main concepts:
       TypedPipe[T]: a class whose instances are distributed objects
       that wrap a Cascading Pipe object and hold the transformations
       applied up to that point. Its interface is similar to Scala's
       Iterator[T] (map, flatMap, groupBy, filter, . . . )
       KeyedList[K,V]: a trait that represents a sharded list of
       items. Two implementations:
           Grouped[K,V]: represents a grouping on keys of type K
           CoGrouped2[K,V,W,Result]: represents a cogroup over two
           grouped pipes; used for joins (see the sketch after this list)
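
       A hypothetical sketch of a typed join built on these concepts; the
       method names (group, join, toTypedPipe) are assumptions about the
       typed API and may differ across Scalding versions, and the pipes
       themselves are illustrative placeholders:

    // two keyed pipes sharing the url key (contents are placeholders)
    val pageViews : TypedPipe[(String, Int)] = ???   // (url, views)
    val pageLikes : TypedPipe[(String, Int)] = ???   // (url, likes)

    // group turns a pipe of pairs into a Grouped[String, Int];
    // join cogroups the two pipes on the url key and pairs the values
    val joined : TypedPipe[(String, (Int, Int))] =
      pageViews.group.join( pageLikes.group ).toTypedPipe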




Conclusions


      The MapReduce API is powerful but limited
      The Cascading API is as simple as the MapReduce API but more
      generic and powerful
      Scalding combines Cascading and Scala to easily describe
      distributed programs. Its major strengths are:
          Functional programming naturally describes data flows
          Scalding is similar to the Scala standard library: if you know
          Scala, you already know how to use Scalding
          Statically typed (TypeSafe API): no type errors at runtime
          Scala is standard and runs on top of the JVM
          Scala libraries and tools can be used in production: IDEs,
          debuggers, test frameworks, build systems and everything else




Thank you for listening




