GL BAJAJ Institute of Technologies & Management, Greater Noida
[Approved by AICTE, Govt. of India & Affiliated to Dr. APJ Abdul Kalam Technical University, Lucknow, U.P., India]
Department of Applied Computational Science & Engineering
Program-6
Aim: Installation of Hive along with practice examples.
Theory: Apache Hive is a distributed, fault-tolerant data warehouse system that enables
analytics at a massive scale. A data warehouse provides a central store of
information that can easily be analyzed to make informed, data-driven decisions.
Hive allows users to read, write, and manage petabytes of data using SQL.
Prerequisites:
JDK (Java) installed on the system
Hadoop installed
7-Zip
Steps:
Step 1: Check whether Java is installed using the following command.
$ java -version
Step 2: Check whether Hadoop is installed using the following command.
$ hadoop version
Step 3: Download the Apache Hive archive apache-hive-2.3.9-bin.tar.gz from the Hive downloads page.
Step 4: Unzip and install Hive:
The following commands extract the Hive archive and verify the result:
$ tar zxvf apache-hive-2.3.9-bin.tar.gz
$ ls
On successful extraction, you see the following output:
apache-hive-2.3.9-bin apache-hive-2.3.9-bin.tar.gz
Step 5: Copying files to the /usr/local/hive directory:
We need to move the files as the superuser ("su -"). The following commands move the extracted directory to /usr/local/hive:
$ su -
password:
# cd /home/cloudera/user/Downloads
# mv apache-hive-2.3.9-bin /usr/local/hive
# exit
Step 6: Setting up environment for Hive:
You can set up the Hive environment by appending the following lines to ~/.bashrc file:
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/local/hadoop/lib/*:.
export CLASSPATH=$CLASSPATH:/usr/local/hive/lib/*:.
The following command is used to reload the ~/.bashrc file:
$ source ~/.bashrc
Step 7: Configuring Hive:
To configure Hive with Hadoop, you need to edit the hive-env.sh file, which is placed
in the $HIVE_HOME/conf directory. The following commands change to the Hive config folder
and copy the template file:
$ cd $HIVE_HOME/conf
$ cp hive-env.sh.template hive-env.sh
Edit the hive-env.sh file by appending the following line:
export HADOOP_HOME=/usr/local/hadoop
Hive installation is completed successfully.
Step 8: Verify the Hive installation with the following command:
$ bin/hive
Program-7
Aim: Installation of HBase and installing Thrift, along with practice examples.
Theory: HBase provides low latency random read and write access to petabytes of data by
distributing requests from applications across a cluster of hosts. Each host has
access to data in HDFS and S3, and serves read and write requests in milliseconds.
Steps:
Step 1: Download the HBase file hbase-3.0.0-beta-1-bin.tar.gz from Apache website.
Step 2: Unzip and Move:
$ cd /home/cloudera/user/Downloads
$ sudo tar -zxvf hbase-3.0.0-beta-1-bin.tar.gz
$ sudo mv hbase-3.0.0-beta-1 /usr/local/hbase
Step 3: Edit hbase-env.sh and hbase-site.xml:
$ cd /usr/local/hbase/conf
In the hbase-env.sh file, you need to export the JAVA_HOME path. First, check the
JAVA_HOME path on your system. You can display it using the command given below:
$ echo $JAVA_HOME
On this system, the above command gives the output /usr/lib/jvm/java-7-openjdk-amd64.
To update the hbase-env.sh, we need to run the command given below:
$ sudo nano hbase-env.sh
Copy and paste the following lines into the hbase-env.sh file:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HBASE_REGIONSERVERS=/usr/local/hbase/conf/regionservers
export HBASE_MANAGES_ZK=true
Use Ctrl+X and Y to save.
Now update the .bashrc file to export hbase variables:
$ sudo nano ~/.bashrc
Copy and paste the below lines at the end of .bashrc:
#HBASE VARIABLES START
export HBASE_HOME=/usr/local/hbase
export PATH=$PATH:$HBASE_HOME/bin
#HBASE VARIABLES END
Use Ctrl+X and Y to save.
To apply the above changes in the current shell, we need to run the command given
below:
$ source ~/.bashrc
Now to update the hbase-site.xml file, use the command given below:
$ cd /usr/local/hbase/conf
$ sudo nano hbase-site.xml
Update the file so that its configuration section contains the following:
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:54310/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/usr/local/hadoop/zookeeper</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
</configuration>
Use Ctrl+X and Y to save.
Step 4: Starting HBase:
$ cd /usr/local/hbase/bin
$ sudo chown -R hduser:hadoop /usr/local/hbase/
$ ./start-hbase.sh
You will see a message for each daemon as it starts.
Next, run the jps command to check the HBase daemons:
$ jps
To check the HBase directory in HDFS:
$ hadoop fs -ls hdfs://localhost:54310/hbase/
Program-8
Aim: Write Pig commands: write Pig Latin scripts to sort, group, join, project, and filter your
data.
Pig Latin Script for Sorting:
The ORDER BY operator is used to display the contents of a relation in a sorted order based
on one or more fields.
Syntax:
grunt> Relation2_name = ORDER Relation1_name BY field_name (ASC|DESC);
Pig Latin Script for Group:
The GROUP operator is used to group the data in one or more relations. It collects the data
having the same key.
Syntax:
grunt> Group_data = GROUP Relation_name BY age;
Pig Latin Script for Join:
The JOIN operator is used to combine records from two or more relations. While performing
a join operation, we declare one (or a group of) tuple(s) from each relation as keys. When
these keys match, the two particular tuples are matched; otherwise, the records are dropped.
Joins can be of the following types −
Self-join
Inner-join
Outer-join − left join, right join, and full join
Syntax:
Given below is the syntax of performing self-join operation using the JOIN operator.
grunt> Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key ;
Here is the syntax of performing inner join operation using the JOIN operator.
grunt> result = JOIN relation1 BY columnname, relation2 BY columnname;
Pig Latin Script for Filter:
The FILTER operator is used to select the required tuples from a relation based on a
condition.
Syntax:
grunt> Relation2_name = FILTER Relation1_name BY (condition);
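To make the operator semantics above concrete, the short Python sketch below mimics ORDER, GROUP, JOIN, and FILTER on a small in-memory relation. The relation and its fields are made-up illustrative data, not part of any Pig script:

```python
from collections import defaultdict

# Toy relation: tuples of (name, age, city) -- illustrative data only
students = [("Ali", 21, "Agra"), ("Ben", 22, "Pune"), ("Cara", 21, "Pune")]
cities = [("Pune", "MH"), ("Agra", "UP")]

# ORDER students BY age DESC
ordered = sorted(students, key=lambda t: t[1], reverse=True)

# GROUP students BY age: each key maps to the bag of tuples sharing it
grouped = defaultdict(list)
for t in students:
    grouped[t[1]].append(t)

# JOIN students BY city, cities BY city (inner join on matching keys)
joined = [(s, c) for s in students for c in cities if s[2] == c[0]]

# FILTER students BY age == 21
filtered = [t for t in students if t[1] == 21]

print(ordered[0])     # ('Ben', 22, 'Pune')
print(len(joined))    # 3
print(len(filtered))  # 2
```

Pig evaluates the same ideas over distributed bags rather than Python lists, but the key matching, grouping, and filtering behave as sketched here.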
Program-9
Aim: Run Pig Latin scripts to find the word count.
Program:
Assume we have data in the file like below:
This is a hadoop class
hadoop is a bigdata technology
We want to generate the count of each word as output, like below:
(a,2)
(is,2)
(This,1)
(class,1)
(hadoop,2)
(bigdata,1)
(technology,1)
Steps to generate this using Pig Latin:
Step 1: Load the data from HDFS
Use the LOAD statement to load the data into a relation.
The AS keyword declares column names; since the file has no column structure here, we declare a single column named line.
input = LOAD '/path/to/file/' AS (line:chararray);
Step 2: Convert the sentences into words
The data we have is in sentences, so we have to convert it into words using the
TOKENIZE function.
(TOKENIZE(line));
(or)
If we have a delimiter such as space, we can specify it as:
(TOKENIZE(line,' '));
Output will be like this:
({(This),(is),(a),(hadoop),(class)})
({(hadoop),(is),(a),(bigdata),(technology)})
But we have to convert it into multiple rows, like below:
(This)
(is)
(a)
(hadoop)
(class)
(hadoop)
(is)
(a)
(bigdata)
(technology)
Convert columns into rows:
We have to convert every line of data into multiple rows; for this, Pig provides the FLATTEN function.
FLATTEN un-nests the bag produced by TOKENIZE, so the array of words becomes multiple rows.
words = FOREACH input GENERATE FLATTEN(TOKENIZE(line,' ')) AS word;
Then the output is like below:
(This)
(is)
(a)
(hadoop)
(class)
(hadoop)
(is)
(a)
(bigdata)
(technology)
Step 3: Apply GROUP BY
We have to count each word's occurrences; for that, we group all identical words.
Grouped = GROUP words BY word;
Step 4: Generate word count
wordcount = FOREACH Grouped GENERATE group, COUNT(words);
We can print the word count on console using Dump.
DUMP wordcount;
Output will be like below:
(a,2)
(is,2)
(This,1)
(class,1)
(hadoop,2)
(bigdata,1)
(technology,1)
Below is the complete program:
input = LOAD '/path/to/file/' AS (line:chararray);
words = FOREACH input GENERATE FLATTEN(TOKENIZE(line,' ')) AS word;
Grouped = GROUP words BY word;
wordcount = FOREACH Grouped GENERATE group, COUNT(words);
DUMP wordcount;
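As a quick sanity check, the whole pipeline (tokenize, flatten, group, count) can be mirrored in plain Python on the sample input; this only reproduces the logic of the Pig script, it is not part of it:

```python
from collections import Counter

sample = ["This is a hadoop class", "hadoop is a bigdata technology"]

# TOKENIZE + FLATTEN: split every line into words and merge into one flat list
flat_words = [w for line in sample for w in line.split(" ")]

# GROUP BY word + COUNT: tally the occurrences of each distinct word
counts = Counter(flat_words)

print(counts["hadoop"])  # 2
print(counts["class"])   # 1
```

The Counter plays the role of GROUP followed by COUNT, producing the same (word, count) pairs shown in the expected output above.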
Program-10
Aim: Run Pig Latin scripts to find the maximum temperature for each year.
Word Count using Pig Latin:
Step 1:
1. Create a text file having a few lines of text and save it as bd.txt.
2. Create a directory in HDFS named wc.
3. Copy the bd.txt file from the local file system to the HDFS directory wc.
Step2:
inputline = load '/user/cloudera/wc/bd.txt' using PigStorage('\t') as (data:chararray);
words = FOREACH inputline GENERATE FLATTEN(TOKENIZE(data)) AS word;
filtered_words = FILTER words BY word MATCHES '\\w+';
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
ordered_word_count = ORDER word_count BY count DESC;
DUMP ordered_word_count;
You can use the command below to save the result in HDFS:
grunt> STORE ordered_word_count INTO '/user/cloudera/wc/output/';
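The two extras in this script, the MATCHES '\\w+' filter and the descending ORDER, can likewise be mirrored in plain Python. This is a sketch on the sample text from Program-9, not part of the Pig job:

```python
import re
from collections import Counter

sample = ["This is a hadoop class", "hadoop is a bigdata technology"]
tokens = [w for line in sample for w in line.split()]

# FILTER words BY word MATCHES '\\w+': keep only purely alphanumeric tokens
kept = [w for w in tokens if re.fullmatch(r"\w+", w)]

# GROUP + COUNT, then ORDER ... BY count DESC
ordered_counts = sorted(Counter(kept).items(), key=lambda kv: kv[1], reverse=True)

print(ordered_counts[0][1])  # 2 (the most frequent words occur twice)
```

As in the Pig script, the most frequent words come first; ties keep their original relative order here because Python's sort is stable.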
Find the maximum temperature in a given dataset using Pig:
Option1:
A = LOAD 'input' USING PigStorage() AS (Year:int,Temp:int);
B = GROUP A ALL;
C = FOREACH B GENERATE MAX(A.Temp);
DUMP C;
or
Option2: Using (ORDER and LIMIT)
A = LOAD 'input' USING PigStorage() AS (Year:int,Temp:int);
B = ORDER A BY Temp DESC;
C = LIMIT B 1;
D = FOREACH C GENERATE Temp;
DUMP D;
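Note that both options return the single overall maximum. The per-year maximum that the aim asks for needs a GROUP BY Year before taking MAX, e.g. C = FOREACH (GROUP A BY Year) GENERATE group, MAX(A.Temp);. The Python sketch below mirrors both computations on made-up (Year, Temp) records:

```python
from collections import defaultdict

# Hypothetical (Year, Temp) records standing in for the 'input' file
records = [(1990, 31), (1990, 44), (1991, 38), (1991, 27), (1992, 41)]

# What Option 1 and Option 2 compute: the overall maximum temperature
overall_max = max(temp for _, temp in records)

# Per-year maximum: group by Year, then take MAX within each group
by_year = defaultdict(list)
for year, temp in records:
    by_year[year].append(temp)
max_per_year = {year: max(temps) for year, temps in by_year.items()}

print(overall_max)   # 44
print(max_per_year)  # {1990: 44, 1991: 38, 1992: 41}
```

The dictionary of per-year maxima corresponds to the (group, MAX) tuples the nested-GROUP Pig statement would emit.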