training@datadotz.com
APACHE PIG
Senthil Kumar A
Introduction
• Pig is an abstraction over MapReduce.
• Programs are written in a data-flow language called Pig Latin.
• Pig was originally created at Yahoo! to serve a need similar to the one Hive addresses.
• Many developers do not know Java or MapReduce.
• Under the covers, Pig Latin scripts are turned into MapReduce jobs that run on the Hadoop cluster.
• Latest release at the time of writing is 0.12.0.
Pig Features
• Joining datasets
• Sorting and aggregation
• Grouping data
• Referring to fields by position, e.g. $0, $1 (useful for large datasets)
• Creating UDFs in Java
Installation
• tar -xvf pig-***.tgz
• Set JAVA_HOME
• Set HADOOP_HOME
Accessing Pig
• Interactive mode
  • Grunt, the Pig shell
• Batch mode
  • Submitting a Pig script directly
• PigServer
  • Java class with a JDBC-like interface
Grunt: The Pig Shell (bin/pig)
• A = load '/user/senthil/drugdata' using PigStorage(',') ;
• F = filter A by $2 == 'avil';
• dump F;
Naming Fields and Declaring Data Types
• A = load '/user/senthil/drugdata' using PigStorage(',') as
(pid:int, pname:chararray, drug:chararray,
gender:chararray, tot_amt:int);
• F = filter A by drug == 'avil';
• dump F;
Data Types
• Scalar types (with example constants)
  • int: 10
  • float: 10.0F
  • long: 10L
  • double: 10.0
  • chararray: 'hello'
  • bytearray: no constant form; the default type when no type is declared
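As a rough illustration of how these scalar types appear in a script, the sketch below declares a schema on load and casts one field. It reuses the drugdata path and field names from the earlier slides. The aliases B, doubled and third are made up for this example.

```pig
-- Declare a type for each field at load time
A = load '/user/senthil/drugdata' using PigStorage(',') as
    (pid:int, pname:chararray, drug:chararray, gender:chararray, tot_amt:int);

-- Cast int to long and to double; 2L is a long constant, 3.0 a double constant
-- (doubled and third are illustrative field names)
B = foreach A generate pid, (long)tot_amt * 2L as doubled, (double)tot_amt / 3.0 as third;

-- Print the resulting schema to confirm the field types
describe B;
```

If the as (...) clause is omitted, the fields are treated as bytearray until they are cast or used in a typed expression.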
Data formats
• PigStorage
  • Loads/stores relations as field-delimited text (default delimiter: tab)
• BinStorage
  • Loads/stores relations in HDFS from or to binary files
• TextLoader
  • Loads relations in HDFS from plain text
  • Loads each whole line as a single column
• PigDump
  • Stores relations in HDFS by writing the toString() representation of tuples, one per line
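A minimal sketch of how each of these load/store functions might be named in a script. The notes.txt input and the two output directories are hypothetical paths chosen for illustration; only drugdata comes from the slides.

```pig
-- TextLoader: every input line becomes a one-field tuple
-- (notes.txt is a hypothetical file)
raw_lines = load '/user/senthil/notes.txt' using TextLoader() as (line:chararray);

-- PigStorage: split each line on commas
A = load '/user/senthil/drugdata' using PigStorage(',');

-- BinStorage: write the relation in Pig's binary format (hypothetical output path)
store A into '/user/senthil/drugdata_bin' using BinStorage();

-- PigDump: write each tuple's toString() form, one per line (hypothetical output path)
store A into '/user/senthil/drugdata_dump' using PigDump();
```

PigDump output is meant for reading by people. Data that has to be loaded back into Pig is usually stored with PigStorage or BinStorage instead.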
Store the results
• A = load '/user/senthil/drugdata' using PigStorage(',') ;
• F = filter A by $2 == 'avil';
• Store F into '/pig_result001' using PigStorage(',');
• STORE writes the relation to a directory in HDFS.
Viewing the Schema
• A = load '/user/senthil/drugdata' using PigStorage(',') ;
• F = filter A by $2 == 'avil';
• Describe F;
• Describe A;
• Illustrate F;
• Illustrate A;
Execution Plan
• A = load '/user/senthil/drugdata' using PigStorage(',') ;
• F = filter A by $2 == 'avil';
• Explain F;
Grouping and Sorting
• A = load '/user/senthil/drugdata' using PigStorage(',');
• D = GROUP A by $2;
• sm = foreach D generate group, SUM(A.$4) as s;
• smorder = order sm by s desc;
• dump smorder;
Eliminating duplicates
• SQL equivalent: SELECT DISTINCT drug FROM patient;
• A = load '/user/senthil/drugdata' using PigStorage(',') as (pid:int,
pname:chararray, drug:chararray,gender:chararray,tot_amt:int);
• D = foreach A generate drug;
• unique = DISTINCT D;
• Dump unique;
Limit, Match, Non-Match and Count
• -- LIMIT: reduce the number of output records
• A = load '/user/senthil/drugdata' using PigStorage(',') as (pid:int,
pname:chararray, drug:chararray,gender:chararray,tot_amt:int);
• F = limit A 2;
• dump F;
• -- MATCHES: similar to LIKE in SQL, but the pattern is a Java regular expression
• A = load '/user/senthil/drugdata' using PigStorage(',') as (pid:int,
pname:chararray, drug:chararray,gender:chararray,tot_amt:int);
• F = filter A by pname matches 'Brandon.*';
• dump F;
Limit, Match, Non-Match and Count (cont.)
• -- NOT MATCHES: keep records whose pname does not match 'Brandon.*'
• A = load '/user/senthil/drugdata' using PigStorage(',') as (pid:int,
pname:chararray, drug:chararray,gender:chararray,tot_amt:int);
• F = filter A by not pname matches 'Brandon.*';
• dump F;
• -- COUNT: count all records in the relation
• A = load '/user/senthil/drugdata' using PigStorage(',');
• F = GROUP A ALL;
• sm = foreach F generate COUNT_STAR(A);
• dump sm;
Macros in Pig
• DEFINE my_macro(V, col,value) returns B {
$B = FILTER $V BY $col == '$value';
};
• A = load '/datagen_10.txt' using PigStorage(',');
• C = my_macro(A,$2,'metacin');
• dump C;
Joining Data Sets
• Pig Latin supports inner and outer joins of two or more relations.
• -- Inner join: join two relations on a common key
• A = load '/datagen_10.txt' using PigStorage(',');
• B = load '/drug.txt' using PigStorage();
• C = join A by $2, B by $0;
• dump C;
Outer joins
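• Pig can perform left, right, and full outer joins (similar to SQL).
• Outer joins need a declared schema on the relation that supplies nulls for non-matching keys, so add an as (...) clause to the loads before running them.
• A = load '/datagen_10.txt' using PigStorage(',');
• B = load '/drug.txt' using PigStorage();
• -- Use exactly one of: left outer, right outer, full outer
• C = join A by $2 [left outer|right outer|full outer], B by $0;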
• Dump C;
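The three outer join variants could be written out as in the sketch below. The slides do not give the layout of these files, so every field name here is an assumption. The schema for datagen_10.txt is borrowed from the drugdata slides, and drug_name and descr for drug.txt are invented.

```pig
-- Hypothetical schemas; outer joins need them to produce nulls for missing matches
A = load '/datagen_10.txt' using PigStorage(',') as
    (pid:int, pname:chararray, drug:chararray, gender:chararray, tot_amt:int);
B = load '/drug.txt' using PigStorage() as (drug_name:chararray, descr:chararray);

-- Keep every record of A; B's fields are null where no drug matches
left_j = join A by drug left outer, B by drug_name;

-- Keep every record of B; A's fields are null where no patient record matches
right_j = join A by drug right outer, B by drug_name;

-- Keep every record of both relations
full_j = join A by drug full outer, B by drug_name;

dump left_j;
```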
GROUP vs COGROUP
• GROUP – collects records of one input based on a key
• COGROUP – collects records of n inputs based on a key
• C = COGROUP A by $2, B by $0;
• Dump C;
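To make the difference concrete, the sketch below runs both operators on the relations from the join slides and prints their schemas. The alias G is introduced only for this example.

```pig
A = load '/datagen_10.txt' using PigStorage(',');
B = load '/drug.txt' using PigStorage();

-- GROUP: one bag per key, shaped (group, {tuples of A})
G = group A by $2;

-- COGROUP: one bag per input per key, shaped (group, {tuples of A}, {tuples of B})
C = cogroup A by $2, B by $0;

-- Compare the two result schemas
describe G;
describe C;
```

A key that occurs in only one of the inputs still appears in the COGROUP result, with an empty bag for the other input.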
Pig Scripts
• Use Pig scripts to place Pig Latin statements and Pig commands in a single file.
• It is good practice to give script files the .pig extension.
• Scripts stored in HDFS can be run directly:
  • pig hdfs://path/script.pig
• Single-line (--) as well as multi-line (/* */) comments can be added.
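A script file might look like the sketch below, which puts the grouping example from the earlier slide into a file with both comment styles. The file name drug_totals.pig and the output directory are hypothetical.

```pig
/* drug_totals.pig (hypothetical file name)
   Sums the fifth field of drugdata for each value of the third field. */

A = load '/user/senthil/drugdata' using PigStorage(',');  -- comma-separated input
D = group A by $2;                                        -- one group per drug
sm = foreach D generate group, SUM(A.$4) as s;            -- total per group

-- Write the totals to HDFS (hypothetical output directory)
store sm into '/pig_result_totals' using PigStorage(',');
```

In batch mode a script normally ends with store rather than dump, so the result is kept in HDFS after the job finishes.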
Pig Server
• It is not a daemon server.
• It is a single-threaded stub for running Pig from a Java application.
• Class: org.apache.pig.PigServer
• Allows Java programs to invoke Pig commands.
• Pass "local" or "mapreduce" to select the execution mode.
• Example:
  • PigServer ps = new PigServer("local");
  • ps.registerQuery("A = load 'file';");
  • ps.registerQuery("B = group A by $0;");
  • ps.store("B", "outfile");
Implementation of UPPER UDF
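package com;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Eval UDF that returns its first argument in upper case.
public class Upper extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Return null for an empty tuple or a null first field
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        String str = (String) input.get(0);
        return str.toUpperCase();
    }
}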
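Once the class is compiled and packaged, it could be called from Pig Latin roughly as follows. The jar name myudfs.jar is an assumption. The load statement and field names are reused from the earlier slides.

```pig
-- Make the jar containing com.Upper available to Pig (hypothetical jar name)
register myudfs.jar;

A = load '/user/senthil/drugdata' using PigStorage(',') as
    (pid:int, pname:chararray, drug:chararray, gender:chararray, tot_amt:int);

-- Call the UDF by its fully qualified class name
B = foreach A generate pid, com.Upper(pname);

dump B;
```

Declaring pname as chararray matters here, because the UDF casts its argument to a Java String.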
THANK YOU