Creating a Table using HBase Shell
You can create a table using the create command. You must
specify the table name and the column family name.
The syntax to create a table in HBase shell is shown below.
create '<table name>','<column family>'
Example
Given below is a sample schema of a table named emp. It has
two column families: “personal data” and “professional data”.
Row key | personal data | professional data
You can create this table in HBase shell as shown below.
hbase(main):002:0> create 'emp', 'personal data', 'professional data'
Inserting Data using HBase Shell
Data is inserted into an HBase table using the put command.
Using the put command, you can insert rows into a table. Its syntax is
as follows:
put '<table name>','<row key>','<column family:column name>','<value>'
Inserting the First Row
Let us insert the first row values into the emp table as shown
below.
hbase(main):007:0> put 'emp','1','professional data:salary','50000'
0 row(s) in 0.0240 seconds
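Each put command writes a single cell. To fill in the remaining
columns of row 1 shown in the schema above, the command is repeated
once per cell, for example:
hbase(main):008:0> put 'emp','1','personal data:name','raju'
hbase(main):009:0> put 'emp','1','personal data:city','hyderabad'
hbase(main):010:0> put 'emp','1','professional data:designation','manager'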
Updating Data using HBase Shell
You can update an existing cell value using the put command. To
do so, just follow the same syntax and mention your new value
as shown below.
put '<table name>','<row key>','<column family:column name>','<new value>'
The newly given value replaces the existing value, updating the
row.
Example
Suppose there is a table in HBase called emp with the following
data.
hbase(main):003:0> scan 'emp'
ROW     COLUMN+CELL
 row1   column=personal:name, timestamp=1418051555, value=raju
 row1   column=personal:city, timestamp=1418275907, value=Hyderabad
 row1   column=professional:designation, timestamp=14180555, value=manager
 row1   column=professional:salary, timestamp=1418035791555, value=50000
1 row(s) in 0.0100 seconds
The following command updates the city value of the employee
named 'Raju' to Delhi.
hbase(main):002:0> put 'emp','row1','personal:city','Delhi'
0 row(s) in 0.0400 seconds
Reading Data using HBase Shell
The get command and the get() method of the HTable class are used to
read data from a table in HBase. Using the get command, you can read
a single row of data at a time. Its syntax is as follows:
get '<table name>','<row key>'
Example
The following example shows how to use the get command. Let
us read the first row of the emp table.
hbase(main):012:0> get 'emp', '1'
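If required, the get command can also be restricted to particular
columns of a row by passing the column name as an option, for example:
hbase(main):013:0> get 'emp', '1', {COLUMN => 'personal data:name'}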
Deleting a Specific Cell in a Table
Using the delete command, you can delete a specific cell in a table.
The syntax of delete command is as follows:
delete '<table name>', '<row key>', '<column name>', '<time stamp>'
Example
Here is an example to delete a specific cell. Here we are deleting
the city cell.
hbase(main):006:0> delete 'emp', '1', 'personal data:city', 1417521848375
0 row(s) in 0.0060 seconds
Deleting All Cells in a Row
Using the deleteall command, you can delete all the cells in a
row. Given below is the syntax of the deleteall command.
deleteall '<table name>', '<row key>'
Example
Here is an example of “deleteall” command, where we are
deleting all the cells of row1 of emp table.
hbase(main):007:0> deleteall 'emp','1'
0 row(s) in 0.0240 seconds
Verify the table using the scan command. A snapshot of the table
after deleting the row is given below.
hbase(main):022:0> scan 'emp'
Counting the Rows in a Table
You can count the number of rows of a table using
the count command. Its syntax is as follows:
count '<table name>'
After deleting the first row, the emp table will have two rows. Verify
it as shown below.
hbase(main):023:0> count 'emp'
2 row(s) in 0.090 seconds
=> 2
Dropping a Table using HBase Shell
Using the drop command, you can delete a table. Before dropping
a table, you have to disable it.
hbase(main):018:0> disable 'emp'
0 row(s) in 1.4580 seconds
hbase(main):019:0> drop 'emp'
0 row(s) in 0.3060 seconds
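You can verify the deletion using the exists command, which reports
whether a table is present:
hbase(main):020:0> exists 'emp'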
Apache Pig - Group Operator
The GROUP operator is used to group the data in one or more
relations. It collects the data having the same key.
Syntax
Given below is the syntax of the group operator.
grunt> Group_data = GROUP Relation_name BY column_name;
Example
Assume that we have a file named student_details.txt in the HDFS
directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Apache Pig with the relation
name student_details as shown below.
grunt> student_details = LOAD
'hdfs://localhost:9000/pig_data/student_details.txt'
USING PigStorage(',')
as (id:int, firstname:chararray, lastname:chararray,
age:int, phone:chararray, city:chararray);
Now, let us group the records/tuples in the relation by age as
shown below.
grunt> group_data = GROUP student_details by age;
Verification
Verify the relation group_data using the DUMP operator as shown
below.
grunt> Dump group_data;
Output
Then you will get output displaying the contents of the relation
named group_data as shown below. Here you can observe that the
resulting schema has two columns −
• One is age, by which we have grouped the relation.
• The other is a bag, which contains the group of tuples − the
student records with the respective age.
(21,{(4,Preethi,Agarwal,21,9848022330,Pune),(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
(22,{(3,Rajesh,Khanna,22,9848022339,Delhi),(2,siddarth,Battacharya,22,9848022338,Kolkata)})
(23,{(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)})
(24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334,trivendram)})
You can see the schema of the relation after grouping the data using
the describe command as shown below.
grunt> Describe group_data;
group_data: {group: int,student_details: {(id: int,firstname: chararray,
lastname: chararray,age: int,phone: chararray,city: chararray)}}
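You can also group a relation by more than one column by listing the
keys in parentheses. For example, the following statement groups the
records by age and city together:
grunt> group_multiple = GROUP student_details by (age, city);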
Apache Pig - Join Operator
The JOIN operator is used to combine records from two or more
relations. While performing a join operation, we declare one or more
fields from each relation as keys. When these keys match, the two
particular tuples are matched; otherwise the records are dropped.
Joins can be of the following types −
• Self-join
• Inner-join
• Outer-join − left join, right join, and full join
This chapter explains with examples how to use the join operator
in Pig Latin. Assume that we have two files
namely customers.txt and orders.txt in the /pig_data/ directory of
HDFS as shown below.
customers.txt
1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00
orders.txt
102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060
And we have loaded these two files into Pig with the
relations customers and orders as shown below.
grunt> customers = LOAD
'hdfs://localhost:9000/pig_data/customers.txt' USING
PigStorage(',')
as (id:int, name:chararray, age:int,
address:chararray, salary:int);
grunt> orders = LOAD
'hdfs://localhost:9000/pig_data/orders.txt' USING
PigStorage(',')
as (oid:int, date:chararray, customer_id:int,
amount:int);
Let us now perform various Join operations on these two
relations.
Self-join
Self-join is used to join a table with itself as if the table were two
relations, temporarily renaming at least one relation.
Generally, in Apache Pig, to perform self-join, we will load the
same data multiple times, under different aliases (names).
Therefore let us load the contents of the file customers.txt as two
tables as shown below.
grunt> customers1 = LOAD
'hdfs://localhost:9000/pig_data/customers.txt' USING
PigStorage(',')
as (id:int, name:chararray, age:int,
address:chararray, salary:int);
grunt> customers2 = LOAD
'hdfs://localhost:9000/pig_data/customers.txt' USING
PigStorage(',')
as (id:int, name:chararray, age:int,
address:chararray, salary:int);
Syntax
Given below is the syntax of performing self-join operation using
the JOIN operator.
grunt> Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key;
Example
Let us perform self-join operation on the relation customers, by
joining the two relations customers1 and customers2 as shown
below.
grunt> customers3 = JOIN customers1 BY id, customers2
BY id;
Verification
Verify the relation customers3 using the DUMP operator as shown
below.
grunt> Dump customers3;
Output
It will produce the following output, displaying the contents of the
relation customers3.
(1,Ramesh,32,Ahmedabad,2000,1,Ramesh,32,Ahmedabad,2000)
(2,Khilan,25,Delhi,1500,2,Khilan,25,Delhi,1500)
(3,kaushik,23,Kota,2000,3,kaushik,23,Kota,2000)
(4,Chaitali,25,Mumbai,6500,4,Chaitali,25,Mumbai,6500)
(5,Hardik,27,Bhopal,8500,5,Hardik,27,Bhopal,8500)
(6,Komal,22,MP,4500,6,Komal,22,MP,4500)
(7,Muffy,24,Indore,10000,7,Muffy,24,Indore,10000)
Inner Join
Inner Join is used quite frequently; it is also referred to as equijoin.
An inner join returns rows when there is a match in both tables.
It creates a new relation by combining column values of two
relations (say A and B) based upon the join-predicate. The query
compares each row of A with each row of B to find all pairs of
rows which satisfy the join-predicate. When the join-predicate is
satisfied, the column values for each matched pair of rows of A
and B are combined into a result row.
Syntax
Here is the syntax of performing inner join operation using
the JOIN operator.
grunt> result = JOIN relation1 BY columnname, relation2 BY columnname;
Example
Let us perform inner join operation on the two
relations customers and orders as shown below.
grunt> customer_orders = JOIN customers BY id, orders
BY customer_id;
Verification
Verify the relation customer_orders using the DUMP operator as
shown below.
grunt> Dump customer_orders;
Output
You will get the following output, displaying the contents of the
relation named customer_orders.
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
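When one of the relations is small enough to fit in memory, Pig can
perform a replicated (fragment-replicate) join, which avoids a reduce
phase. For example, assuming the orders relation is small:
grunt> customer_orders_rep = JOIN customers BY id, orders BY customer_id USING 'replicated';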
Outer Join − Unlike inner join, outer join returns all the rows from at
least one of the relations. An outer join operation is carried out in
three ways −
• Left outer join
• Right outer join
• Full outer join
Left Outer Join
The left outer Join operation returns all rows from the left table,
even if there are no matches in the right relation.
Syntax
Given below is the syntax of performing left outer join operation
using the JOIN operator.
grunt> Relation3_name = JOIN Relation1_name BY id LEFT OUTER, Relation2_name BY
customer_id;
Example
Let us perform left outer join operation on the two relations
customers and orders as shown below.
grunt> outer_left = JOIN customers BY id LEFT OUTER,
orders BY customer_id;
Verification
Verify the relation outer_left using the DUMP operator as shown
below.
grunt> Dump outer_left;
Output
It will produce the following output, displaying the contents of the
relation outer_left.
(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)
Right Outer Join
The right outer join operation returns all rows from the right table,
even if there are no matches in the left table.
Syntax
Given below is the syntax of performing right outer join operation
using the JOIN operator.
grunt> outer_right = JOIN customers BY id RIGHT, orders BY customer_id;
Example
Let us perform right outer join operation on the two
relations customers and orders as shown below.
grunt> outer_right = JOIN customers BY id RIGHT, orders
BY customer_id;
Verification
Verify the relation outer_right using the DUMP operator as shown
below.
grunt> Dump outer_right;
Output
It will produce the following output, displaying the contents of the
relation outer_right.
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
Full Outer Join
The full outer join operation returns all rows from both relations;
where there is no match, the missing fields are left empty (null).
Syntax
Given below is the syntax of performing full outer join using
the JOIN operator.
grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;
Example
Let us perform full outer join operation on the two
relations customers and orders as shown below.
grunt> outer_full = JOIN customers BY id FULL OUTER,
orders BY customer_id;
Verification
Verify the relation outer_full using the DUMP operator as shown
below.
grunt> Dump outer_full;
Output
It will produce the following output, displaying the contents of the
relation outer_full.
(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)
Apache Pig - Split Operator
The SPLIT operator is used to split a relation into two or more
relations.
Syntax
Given below is the syntax of the SPLIT operator.
grunt> SPLIT Relation1_name INTO Relation2_name IF (condition1), Relation3_name IF (condition2);
Example
Assume that we have a file named student_details.txt in the HDFS
directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the relation
name student_details as shown below.
grunt> student_details = LOAD
'hdfs://localhost:9000/pig_data/student_details.txt'
USING PigStorage(',')
as (id:int, firstname:chararray, lastname:chararray,
age:int, phone:chararray, city:chararray);
Let us now split the relation into two − one listing the students of
age less than 23, and the other listing the students having an age
between 22 and 25.
grunt> SPLIT student_details into student_details1 if age<23,
student_details2 if (age>22 and age<25);
Verification
Verify the relations student_details1 and student_details2 using
the DUMP operator as shown below.
grunt> Dump student_details1;
grunt> Dump student_details2;
Output
It will produce the following output, displaying the contents of the
relations student_details1 and student_details2 respectively.
grunt> Dump student_details1;
(1,Rajiv,Reddy,21,9848022337,Hyderabad)
(2,siddarth,Battacharya,22,9848022338,Kolkata)
(3,Rajesh,Khanna,22,9848022339,Delhi)
(4,Preethi,Agarwal,21,9848022330,Pune)
grunt> Dump student_details2;
(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,23,9848022335,Chennai)
(7,Komal,Nayak,24,9848022334,trivendram)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
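Pig also provides an OTHERWISE clause for SPLIT, which collects all
the tuples that do not satisfy any of the earlier conditions. For
example:
grunt> SPLIT student_details into student_details1 if age<23, student_details2 OTHERWISE;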
Apache Pig - Filter Operator
The FILTER operator is used to select the required tuples from a
relation based on a condition.
Syntax
Given below is the syntax of the FILTER operator.
grunt> Relation2_name = FILTER Relation1_name BY (condition);
Example
Assume that we have a file named student_details.txt in the HDFS
directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the relation
name student_details as shown below.
grunt> student_details = LOAD
'hdfs://localhost:9000/pig_data/student_details.txt'
USING PigStorage(',')
as (id:int, firstname:chararray, lastname:chararray,
age:int, phone:chararray, city:chararray);
Let us now use the Filter operator to get the details of the
students who belong to the city Chennai.
grunt> filter_data = FILTER student_details BY city ==
'Chennai';
Verification
Verify the relation filter_data using the DUMP operator as shown
below.
grunt> Dump filter_data;
Output
It will produce the following output, displaying the contents of the
relation filter_data as follows.
(6,Archana,Mishra,23,9848022335,Chennai)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
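Filter conditions can be combined using the AND, OR, and NOT
operators. For example, the following statement selects the students
from Chennai who are older than 23:
grunt> filter_data2 = FILTER student_details BY (age > 23) AND (city == 'Chennai');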
Apache Pig - Distinct Operator
The DISTINCT operator is used to remove redundant (duplicate)
tuples from a relation.
Syntax
Given below is the syntax of the DISTINCT operator.
grunt> Relation_name2 = DISTINCT Relation_name1;
Example
Assume that we have a file named student_details.txt in the HDFS
directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
006,Archana,Mishra,9848022335,Chennai
And we have loaded this file into Pig with the relation
name student_details as shown below.
grunt> student_details = LOAD
'hdfs://localhost:9000/pig_data/student_details.txt'
USING PigStorage(',')
as (id:int, firstname:chararray, lastname:chararray,
phone:chararray, city:chararray);
Let us now remove the redundant (duplicate) tuples from the
relation named student_details using the DISTINCT operator, and
store it as another relation named distinct_data as shown below.
grunt> distinct_data = DISTINCT student_details;
Verification
Verify the relation distinct_data using the DUMP operator as shown
below.
grunt> Dump distinct_data;
Output
It will produce the following output, displaying the contents of the
relation distinct_data as follows.
(1,Rajiv,Reddy,9848022337,Hyderabad)
(2,siddarth,Battacharya,9848022338,Kolkata)
(3,Rajesh,Khanna,9848022339,Delhi)
(4,Preethi,Agarwal,9848022330,Pune)
(5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,9848022335,Chennai)
Apache Pig - Foreach Operator
The FOREACH operator is used to generate specified data
transformations based on the column data.
Syntax
Given below is the syntax of FOREACH operator.
grunt> Relation_name2 = FOREACH Relation_name1 GENERATE (required data);
Example
Assume that we have a file named student_details.txt in the HDFS
directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the relation
name student_details as shown below.
grunt> student_details = LOAD
'hdfs://localhost:9000/pig_data/student_details.txt'
USING PigStorage(',')
as (id:int, firstname:chararray,
lastname:chararray,age:int, phone:chararray,
city:chararray);
Let us now get the id, age, and city values of each student from
the relation student_details and store it into another relation
named foreach_data using the foreach operator as shown below.
grunt> foreach_data = FOREACH student_details GENERATE
id,age,city;
Verification
Verify the relation foreach_data using the DUMP operator as shown
below.
grunt> Dump foreach_data;
Output
It will produce the following output, displaying the contents of the
relation foreach_data.
(1,21,Hyderabad)
(2,22,Kolkata)
(3,22,Delhi)
(4,21,Pune)
(5,23,Bhuwaneshwar)
(6,23,Chennai)
(7,24,trivendram)
(8,24,Chennai)
Apache Pig - Order By Operator
The ORDER BY operator is used to display the contents of a
relation in a sorted order based on one or more fields.
Syntax
Given below is the syntax of the ORDER BY operator.
grunt> Relation_name2 = ORDER Relation_name1 BY column_name (ASC|DESC);
Example
Assume that we have a file named student_details.txt in the HDFS
directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the relation
name student_details as shown below.
grunt> student_details = LOAD
'hdfs://localhost:9000/pig_data/student_details.txt'
USING PigStorage(',')
as (id:int, firstname:chararray,
lastname:chararray,age:int, phone:chararray,
city:chararray);
Let us now sort the relation in a descending order based on the
age of the student and store it into another relation
named order_by_data using the ORDER BY operator as shown
below.
grunt> order_by_data = ORDER student_details BY age
DESC;
Verification
Verify the relation order_by_data using the DUMP operator as
shown below.
grunt> Dump order_by_data;
Output
It will produce the following output, displaying the contents of the
relation order_by_data.
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
(7,Komal,Nayak,24,9848022334,trivendram)
(6,Archana,Mishra,23,9848022335,Chennai)
(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)
(3,Rajesh,Khanna,22,9848022339,Delhi)
(2,siddarth,Battacharya,22,9848022338,Kolkata)
(4,Preethi,Agarwal,21,9848022330,Pune)
(1,Rajiv,Reddy,21,9848022337,Hyderabad)
Apache Pig - Limit Operator
The LIMIT operator is used to get a limited number of tuples
from a relation.
Syntax
Given below is the syntax of the LIMIT operator.
grunt> Result = LIMIT Relation_name number_of_tuples;
Example
Assume that we have a file named student_details.txt in the HDFS
directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the relation
name student_details as shown below.
grunt> student_details = LOAD
'hdfs://localhost:9000/pig_data/student_details.txt'
USING PigStorage(',')
as (id:int, firstname:chararray,
lastname:chararray,age:int, phone:chararray,
city:chararray);
Now, let us get the first four tuples of the relation student_details
and store them in another relation
named limit_data using the LIMIT operator as shown below.
grunt> limit_data = LIMIT student_details 4;
Verification
Verify the relation limit_data using the DUMP operator as shown
below.
grunt> Dump limit_data;
Output
It will produce the following output, displaying the contents of the
relation limit_data as follows.
(1,Rajiv,Reddy,21,9848022337,Hyderabad)
(2,siddarth,Battacharya,22,9848022338,Kolkata)
(3,Rajesh,Khanna,22,9848022339,Delhi)
(4,Preethi,Agarwal,21,9848022330,Pune)
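Note that LIMIT by itself does not guarantee which tuples are
returned. To get, say, the four oldest students, combine it with the
ORDER BY operator:
grunt> sorted_by_age = ORDER student_details BY age DESC;
grunt> oldest_four = LIMIT sorted_by_age 4;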
Eval Functions
Given below is the list of eval functions provided by Apache Pig.
S.N.  Function & Description
1   AVG() − Computes the average of the numeric values in a single-column bag.
2   BagToString() − Concatenates the elements of a bag into a string. While
concatenating, we can place a delimiter between these values (optional).
3   CONCAT() − Concatenates two or more expressions of the same type.
4   COUNT() − Gets the number of tuples in a bag, ignoring null values.
5   COUNT_STAR() − Similar to the COUNT() function, but includes null values
in the count.
6   DIFF() − Compares two bags (fields) in a tuple.
7   IsEmpty() − Checks whether a bag or map is empty.
8   MAX() − Calculates the highest value of a column (numeric values or
chararrays) in a single-column bag.
9   MIN() − Gets the minimum (lowest) value (numeric or chararray) of a
certain column in a single-column bag.
10  PluckTuple() − Defines a string prefix and filters the columns in a
relation that begin with the given prefix.
11  SIZE() − Computes the number of elements based on any Pig data type.
12  SUBTRACT() − Subtracts two bags. It takes two bags as inputs and returns
a bag which contains the tuples of the first bag that are not in the second bag.
13  SUM() − Gets the total of the numeric values of a column in a
single-column bag.
14  TOKENIZE() − Splits a string (which contains a group of words) in a
single tuple and returns a bag which contains the output of the split operation.
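As an illustration, several of these functions can be applied to the
student_details relation used earlier by first grouping all of its
tuples together:
grunt> student_group_all = GROUP student_details ALL;
grunt> student_stats = FOREACH student_group_all GENERATE COUNT(student_details), AVG(student_details.age);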