Apache Pig: Senthil Kumar A

This document provides an overview of Apache Pig, including:
• Pig is a data flow language called Pig Latin that allows abstraction over MapReduce jobs.
• Pig was created at Yahoo to allow developers without Java/MapReduce knowledge to analyze large datasets.
• Features include joining, sorting, grouping, and user defined functions using Java.
• Pig scripts can be run interactively using Grunt or submitted in batch mode.
• Common tasks like loading, filtering, grouping, joining, and storing data are demonstrated using Pig Latin statements.

training@datadotz.com

APACHE PIG
Senthil Kumar A

Introduction
• Pig is an abstraction over MapReduce.
• Its data-flow language is called Pig Latin.
• Pig was originally created at Yahoo! to serve a need similar to Hive's.
• Many developers don't have knowledge of Java/MapReduce.
• Under the covers, Pig Latin scripts are compiled into MapReduce jobs that run on the Hadoop cluster.
• Latest release (at the time of writing) is 0.12.0

Pig Features
• Joining datasets
• Sorting and aggregation
• Grouping data
• Referring to elements by position (useful for large datasets)
• Creation of UDFs using Java

Installation
• tar -xvf pig-***.tgz
• Set JAVA_HOME
• Set HADOOP_HOME

Accessing Pig
• Interactive mode
• Grunt, the Pig shell

• Batch mode
• Submitting a Pig script directly

• Pig server
• Java class, JDBC-like interface

Grunt: The Pig Shell (bin/pig)


• A = load '/user/senthil/drugdata' using PigStorage(',') ;
• F = filter A by $2 == 'avil';
• dump F;
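
To make the positional reference $2 concrete, the following is a minimal plain-Java sketch of what the load and filter above do to each input line. It only models the semantics and is not how Pig executes the script. The class name PositionalFilterSketch and the sample rows are hypothetical; the column layout (pid, pname, drug, gender, tot_amt) is assumed from the schema used elsewhere in these slides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PositionalFilterSketch {
    // Mimics: A = load ... using PigStorage(','); F = filter A by $2 == 'avil';
    static List<String[]> filterByThirdField(List<String> lines, String wanted) {
        List<String[]> kept = new ArrayList<>();
        for (String line : lines) {
            // PigStorage(',') splits each line into fields on commas
            String[] fields = line.split(",", -1);
            // Positions start at $0, so $2 is the third field
            if (fields.length > 2 && wanted.equals(fields[2])) {
                kept.add(fields);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows: pid, pname, drug, gender, tot_amt
        List<String> lines = Arrays.asList(
            "1,Brandon,avil,M,100",
            "2,Alice,metacin,F,250",
            "3,Carol,avil,F,75");
        for (String[] row : filterByThirdField(lines, "avil")) {
            System.out.println(String.join(",", row));
        }
    }
}
```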

Naming fields with aliases and data types


• A = load '/user/senthil/drugdata' using PigStorage(',') as
(pid:int, pname:chararray, drug:chararray,
gender:chararray, tot_amt:int);

• F = filter A by drug == 'avil';


• dump F;

Data Types
• Scalar Types
• int 10
• float 10.0F
• long 10L
• double 10.0
• chararray 'hello'
• bytearray

Data formats
• PigStorage
• Loads/stores relations using a field-delimited text format
• BinStorage
• Loads/stores relations in HDFS from or to binary files
• TextLoader
• Loads relations in HDFS from a plain text format
• Loads a whole line as single column
• PigDump
• Stores relations in HDFS by writing the toString() representation of
tuples, one per line
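
The difference between PigStorage and TextLoader can be pictured with a small plain-Java sketch. It models only how one input line turns into fields and makes no claim about Pig's actual loader classes. The class and method names here are hypothetical.

```java
import java.util.regex.Pattern;

class LoaderSketch {
    // PigStorage(delim): one field per delimiter-separated token
    static String[] pigStorageFields(String line, String delim) {
        return line.split(Pattern.quote(delim), -1);
    }

    // TextLoader: the whole line becomes a single field
    static String[] textLoaderFields(String line) {
        return new String[] { line };
    }

    public static void main(String[] args) {
        String line = "1,Brandon,avil,M,100"; // hypothetical sample line
        System.out.println(pigStorageFields(line, ",").length); // prints 5
        System.out.println(textLoaderFields(line).length);      // prints 1
    }
}
```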

Store the results


• A = load '/user/senthil/drugdata' using PigStorage(',') ;
• F = filter A by $2 == 'avil';
• Store F into '/pig_result001' using PigStorage(',') ;

Store -> writes the data to an HDFS directory



Viewing the Schema


• A = load '/user/senthil/drugdata' using PigStorage(',') ;
• F = filter A by $2 == 'avil';

• Describe F;
• Describe A;

• Illustrate F;
• Illustrate A;

Execution Plan
• A = load '/user/senthil/drugdata' using PigStorage(',') ;
• F = filter A by $2 == 'avil';

• Explain F;

Grouping and Sorting


• A = load '/user/senthil/drugdata' using PigStorage(',');
• D = GROUP A by $2;
• sm = foreach D generate group,SUM(A.$4) as s;
• smorder = order sm by s desc;
• dump smorder;
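
As a rough illustration of what the GROUP, SUM and ORDER statements above compute, here is a plain-Java sketch over a few hypothetical rows. It is a simplification: amounts are parsed as whole numbers, whereas Pig decides the numeric type from the field's declared or inferred type. The class name GroupSumSketch and the sample data are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GroupSumSketch {
    // Mimics: D = GROUP A by $2;
    //         sm = foreach D generate group, SUM(A.$4) as s;
    //         smorder = order sm by s desc;
    static List<Map.Entry<String, Long>> totalsByDrug(List<String> lines) {
        Map<String, Long> sums = new LinkedHashMap<>();
        for (String line : lines) {
            String[] f = line.split(",", -1);
            // Group key is $2 (drug); the summed value is $4 (tot_amt)
            sums.merge(f[2], Long.parseLong(f[4]), Long::sum);
        }
        List<Map.Entry<String, Long>> ordered = new ArrayList<>(sums.entrySet());
        // Largest total first, like "order sm by s desc"
        ordered.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        return ordered;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows: pid, pname, drug, gender, tot_amt
        List<String> lines = Arrays.asList(
            "1,Brandon,avil,M,100",
            "2,Alice,metacin,F,250",
            "3,Carol,avil,F,75");
        for (Map.Entry<String, Long> e : totalsByDrug(lines)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```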

Eliminating duplicates
• SQL equivalent: select distinct drug from patient;
• A = load '/user/senthil/drugdata' using PigStorage(',') as (pid:int,
pname:chararray, drug:chararray,gender:chararray,tot_amt:int);
• D = foreach A generate drug;

• unique = DISTINCT D;
• Dump unique;

Limit, Match, Non-Match and Count


• -- LIMIT
• Reduces the number of output records
• A = load '/user/senthil/drugdata' using PigStorage(',') as (pid:int,
pname:chararray, drug:chararray,gender:chararray,tot_amt:int);
• F = limit A 2;
• dump F;

• -- MATCHES: similar to LIKE in SQL


• A = load '/user/senthil/drugdata' using PigStorage(',') as (pid:int,
pname:chararray, drug:chararray,gender:chararray,tot_amt:int);
• F = filter A by pname matches 'Brandon.*';
• dump F;
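
One point worth spelling out: matches takes a Java regular expression and the pattern has to cover the whole value, which is why the trailing .* is needed to get the effect of LIKE 'Brandon%'. The sketch below shows the same behaviour with String.matches in plain Java. The class name MatchesSketch and the sample names are hypothetical, and returning false for null is a simplification of how a filter drops rows with a null field.

```java
class MatchesSketch {
    // Whole-string regular expression match, as with Pig's matches operator
    static boolean pigMatches(String value, String regex) {
        // Simplification: a null value never passes the filter
        return value != null && value.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(pigMatches("Brandon Lee", "Brandon.*")); // true
        System.out.println(pigMatches("Brandon Lee", "Brandon"));   // false: pattern must cover the whole value
        System.out.println(pigMatches("Mr Brandon", "Brandon.*"));  // false: value does not start with Brandon
    }
}
```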

Limit, Match, Non-Match and Count (cont.)
• -- Not matches Brandon
• A = load '/user/senthil/drugdata' using PigStorage(',') as (pid:int,
pname:chararray, drug:chararray,gender:chararray,tot_amt:int);
• F = filter A by not pname matches 'Brandon.*';
• dump F;

• -- Count
• A = load '/user/senthil/drugdata' using PigStorage(',');
• F = GROUP A ALL;
• sm = foreach F generate COUNT_STAR(A);
• dump sm;

Macros in Pig
• DEFINE my_macro(V, col,value) returns B {
$B = FILTER $V BY $col == '$value';
};
• A = load '/datagen_10.txt' using PigStorage(',');
• C = my_macro(A,$2,'metacin');
• dump C;

Joining Data Sets


• PigLatin supports inner and outer joins of two or more relations.

Inner join --Join two tables by common key


• A = load '/datagen_10.txt' using PigStorage(',');
• B = load '/drug.txt' using PigStorage();
• C = join A by $2, B by $0;
• dump C;
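
The result of the inner join can be pictured with a small plain-Java sketch: every row of A is paired with every row of B that has the same key, and each output row carries all fields of A followed by all fields of B. This models the result only, not Pig's execution strategy. The class name InnerJoinSketch, the sample rows, and the second column of the drug rows are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class InnerJoinSketch {
    // Mimics: C = join A by $2, B by $0;
    static List<String> innerJoin(List<String[]> a, List<String[]> b) {
        // Index B by its join key ($0)
        Map<String, List<String[]>> bByKey = new HashMap<>();
        for (String[] row : b) {
            bByKey.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row);
        }
        List<String> joined = new ArrayList<>();
        for (String[] left : a) {
            List<String[]> matches = bByKey.get(left[2]); // A's join key is $2
            if (matches == null) {
                continue; // an inner join drops rows without a partner
            }
            for (String[] right : matches) {
                joined.add(String.join(",", left) + "," + String.join(",", right));
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // Hypothetical rows, already split into fields
        List<String[]> a = Arrays.asList(
            new String[] { "1", "Brandon", "avil", "M", "100" },
            new String[] { "2", "Alice", "metacin", "F", "250" });
        List<String[]> b = Arrays.asList(
            new String[] { "avil", "antihistamine" },
            new String[] { "crocin", "analgesic" });
        for (String row : innerJoin(a, b)) {
            System.out.println(row);
        }
    }
}
```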

Outer joins
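• Pig can perform left, right, and full outer joins (similar to SQL). The relation that has to supply nulls for non-matching keys must be loaded with a schema (an as clause).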

• A = load '/datagen_10.txt' using PigStorage(',');


• B = load '/drug.txt' using PigStorage();
• C = join A by $2 [left outer|right outer|full outer], B by $0;
• Dump C;
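
For the left outer case, a plain-Java sketch of the result may help: rows of A without a partner in B are kept and the B side is filled with nulls. The bWidth parameter stands in for knowing how many fields B has. The class name LeftOuterJoinSketch and the sample rows are hypothetical, and this models the output only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LeftOuterJoinSketch {
    // Mimics: C = join A by $2 left outer, B by $0;
    // bWidth is the number of fields in B, used to pad unmatched rows with nulls.
    static List<List<String>> leftOuterJoin(List<String[]> a, List<String[]> b, int bWidth) {
        Map<String, List<String[]>> bByKey = new HashMap<>();
        for (String[] row : b) {
            bByKey.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row);
        }
        List<List<String>> out = new ArrayList<>();
        for (String[] left : a) {
            List<String[]> matches = bByKey.get(left[2]);
            if (matches == null) {
                // No partner in B: keep the A row and pad the B side with nulls
                List<String> row = new ArrayList<>(Arrays.asList(left));
                for (int i = 0; i < bWidth; i++) {
                    row.add(null);
                }
                out.add(row);
            } else {
                for (String[] right : matches) {
                    List<String> row = new ArrayList<>(Arrays.asList(left));
                    row.addAll(Arrays.asList(right));
                    out.add(row);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical rows, already split into fields
        List<String[]> a = Arrays.asList(
            new String[] { "1", "Brandon", "avil", "M", "100" },
            new String[] { "2", "Alice", "metacin", "F", "250" });
        List<String[]> b = Arrays.asList(
            new String[] { "avil", "antihistamine" },
            new String[] { "crocin", "analgesic" });
        for (List<String> row : leftOuterJoin(a, b, 2)) {
            System.out.println(row);
        }
    }
}
```

A right outer join is the mirror image (unmatched rows of B are kept), and a full outer join keeps the unmatched rows of both sides.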

GROUP vs COGROUP
• GROUP – collects records of one input based on a key
• COGROUP – collects records of n inputs based on a key
• C = COGROUP A by $2, B by $0;
• Dump C;
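
The shape of a COGROUP result is easier to see with data. In the plain-Java sketch below, each key maps to two bags: the rows of A with that key and the rows of B with that key, where either bag may be empty. This is an illustration of the output shape under hypothetical sample rows. The class name CogroupSketch is an assumption, and keys are sorted only to make the printout stable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CogroupSketch {
    // Mimics: C = COGROUP A by $2, B by $0;
    // Result: key -> [bag of A rows, bag of B rows]
    static Map<String, List<List<String>>> cogroup(List<String[]> a, List<String[]> b) {
        Map<String, List<List<String>>> out = new TreeMap<>();
        for (String[] row : a) {
            bags(out, row[2]).get(0).add(String.join(",", row));
        }
        for (String[] row : b) {
            bags(out, row[0]).get(1).add(String.join(",", row));
        }
        return out;
    }

    // Returns the pair of bags for a key, creating two empty bags on first use
    private static List<List<String>> bags(Map<String, List<List<String>>> out, String key) {
        return out.computeIfAbsent(key, k -> {
            List<List<String>> pair = new ArrayList<>();
            pair.add(new ArrayList<>());
            pair.add(new ArrayList<>());
            return pair;
        });
    }

    public static void main(String[] args) {
        // Hypothetical rows, already split into fields
        List<String[]> a = Arrays.asList(
            new String[] { "1", "Brandon", "avil", "M", "100" },
            new String[] { "2", "Alice", "metacin", "F", "250" },
            new String[] { "3", "Carol", "avil", "F", "75" });
        List<String[]> b = Arrays.asList(
            new String[] { "avil", "antihistamine" },
            new String[] { "crocin", "analgesic" });
        cogroup(a, b).forEach((key, pair) ->
            System.out.println(key + " -> A" + pair.get(0) + " B" + pair.get(1)));
    }
}
```

Unlike an inner join, keys that appear in only one input still produce a row; the bag for the other input is simply empty.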

Pig Scripts
• Use Pig scripts to place Pig Latin statements and Pig commands in a single file.
• Good practice: name the file with the .pig extension
• Scripts stored in HDFS can also be run
• pig hdfs://path/script.pig
• Single-line (--) as well as multi-line (/* */) comments can be added

Pig Server
• It is not a daemon server
• It is a single-threaded stub for running Pig inside a Java application
• org.apache.pig.PigServer class
• Allows Java programs to invoke Pig commands
• Use "local" or "mapreduce" to indicate the execution mode
• PigServer
• PigServer ps = new PigServer("local");
• ps.registerQuery("A = load 'file';");
• ps.registerQuery("B = group A by $0;");
• ps.store("B", "outfile");

Implementation of UPPER UDF


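package com;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class Upper extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // No input: return null so Pig treats the result as null
        if (input == null || input.size() == 0) {
            return null;
        }
        try {
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            // Report the failure instead of silently swallowing it
            throw new IOException("Caught exception processing input row", e);
        }
    }
}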

THANK YOU
