MEERUT INSTITUTE OF ENGINEERING AND
TECHNOLOGY MEERUT
                LAB MANUAL
           (Big Data and Analytics Lab)
    DR. A.P.J. ABDUL KALAM TECHNICAL UNIVERSITY
                       LUCKNOW
                B.TECH. THIRD YEAR
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING - DS
2                             BIG DATA ANALYTICS LABORATORY (KDS651)
                                                 Vision of Institute
To be an outstanding institution in the country imparting technical education, providing need-based, value-based
and career-based programs and producing self-reliant, self-sufficient technocrats capable of meeting new
challenges.
                                                Mission of Institute
The mission of the institute is to educate young aspirants in various technical fields to fulfill global requirement of
human resources by providing sustainable quality education, training and invigorating environment besides
molding them into skilled competent and socially responsible citizens who will lead the building of a powerful
nation.
      3                                                              BIG DATA ANALYTICS LABORATORY (KDS651)
Course Objectives:
          This course is designed to:
              1. Get familiar with Hadoop distributions, configuring Hadoop and performing
                 File management tasks
              2. Experiment MapReduce in Hadoop frameworks
              3. Implement MapReduce programs in variety applications
              4. Explore MapReduce support for debugging
              5. Understand different approaches for building Hadoop MapReduce programs for real-
                 time applications
     List of Experiments:
     1 Downloading and Installation Hadoop, Understanding different Hadoop modes, startup scripts,
     configuration files.
     2. Implementation of file management in Hadoop
          ➢ Adding files and Directories
          ➢ Retrieving file
          ➢ Deleting files
     3. Matrix Multiplication with Hadoop MapReduce
     4. To run a Word Count MapReduce program to understand the MapReduce paradigm
     5. Implementation of K-mean clustering using Map reduce
     6. Installation of HIVE alone practical Example
     7. Installing of HBase, Installation thrift alone with practice example
     8. Write PIG Commands. Write pig Latin script sort, group and filter in your data
     9. Run the PIG Latin Script to find world count
     10. Run the PIG Latin Script to find a max temp for each and every year.
      4                                                             BIG DATA ANALYTICS LABORATORY (KDS651)
Text Books:
1. Tom White, “Hadoop: The Definitive Guide” Fourth Edition, O’reilly Media, 2015.
Reference Books:
1. Glenn J. Myatt, Making Sense of Data , John Wiley & Sons, 2007 Pete Warden, Big Data
Glossary, O’Reilly, 2011.
2. Michael Berthold, David J.Hand, Intelligent Data Analysis, Spingers, 2007.
3. Chris Eaton, Dirk DeRoos, Tom Deutsch, George Lapis, Paul Zikopoulos, Uderstanding Big
Data : Analytics for Enterprise Class Hadoop and Streaming Data, McGrawHill Publishing,
2012.
4. AnandRajaraman and Jeffrey David UIIman, Mining of Massive Datasets Cambridge
University Press, 2012.
Course Outcomes:
Upon completion of the course, the students should be able to:
    1. Configure Hadoop and perform File Management Tasks (L2)
    2. Apply MapReduce programs to real time issues like word count, weather dataset and
       sales of a company (L3)
    3. Critically analyze huge data set using Hadoop distributed file systems and MapReduce
       (L5)
    4. Apply different data processing tools like Pig, Hive and Spark.(L6)
5                                                           BIG DATA ANALYTICS LABORATORY (KDS651)
6
    Lab: 1
                 Downloading and Installation Hadoop, Understanding different Hadoop
                 modes, startup scripts, configuration files.
AIM: To Install Apache Hadoop.
Introduction of bigdata architecture:
 ➢ A big data architecture is designed to handle the ingestion, processing, and analysis of data that
is too large or complex for traditional database systems.
Component of Big Data Architecture
➢ Data sources. All big data solutions start with one or more data sources.
➢ Data storage. Data for batch processing operations is typically stored in a distributed file store
that can hold high volumes of large files in various formats
➢ Batch processing. Because the data sets are so large, often a big data solution must process data
files using long-running batch jobs to filter, aggregate, and otherwise prepare the data for analysis.
Usually these jobs involve reading source files, processing them, and writing the output to new files.
➢ Real-time message ingestion. If the solution includes real-time sources, the architecture must
include a way to capture and store real-time messages for stream processing.
➢ Stream processing. After capturing real-time messages, the solution must process them by
filtering, aggregating, and otherwise preparing the data for analysis
➢ Analytical data store. Many big data solutions prepare data for analysis and then serve the
processed data in a structured format that can be queried using analytical tools
➢ Analysis and reporting. The goal of most big data solutions is to provide insights into the data
through analysis and reporting.
➢ Orchestration. Most big data solutions consist of repeated data processing operations,
encapsulated in work flows that transform source data, move data between multiple sources and
sinks, load the processed data into an analytical data store, or push the results straight to a report
or dashboard.
❖ Introduction of Hadoop Architecture:
➢ Apache Hadoop offers a scalable, flexible and reliable distributed computing big data framework
for a cluster of systems with storage capacity and local computing power by leveraging commodity
hardware.
➢ Hadoop follows a Master Slave architecture for the transformation and analysis of large datasets
using Hadoop MapReduce paradigm. The 3 important hadoop components that play a vital role in
the Hadoop architecture.
➢ Hadoop Common – the libraries and utilities used by other Hadoop modules
➢ Hadoop Distributed File System (HDFS) – the Java-based scalable system that stores data
across multiple machines without prior organization.
➢ YARN – (Yet Another Resource Negotiator) provides resource management for the processes
running on Hadoop.
➢ MapReduce – a parallel processing software framework. It is comprised of two steps. Map step
is a master node that takes inputs and partitions them into smaller sub problems and then
6                                                             BIG DATA ANALYTICS LABORATORY (KDS651)
7
distributes them to worker nodes. After the map step has taken place, the master node takes the
answers to all of the sub problems and combines them to produce output.
EXCERCISE:
1) What do you know about the term “Big Data” and what is application of Big Data ?
2) What is Internet of things?
3) What are the challenges of using Hadoop?
4) Tell us how big data and Hadoop are related to each other.
5) How would you transform unstructured data into structured data?
Hadoop can run in 3 different modes.
1. Standalone (Local) Mode
By default, Hadoop is configured to run in a no distributed mode. It runs as a single Java process.
Instead of HDFS, this mode utilizes the local file system. This mode useful for debugging and there
isn't any need to configure core-site.xml, hdfs-site.xml, mapred-site.xml, masters & slaves. Stand-
alone mode is usually the fastest mode in Hadoop.
In Standalone Mode none of the Daemon will run i.e. Namenode, Datanode, Secondary Name node,
Job Tracker, and Task Tracker. We use job-tracker and task-tracker for processing purposes in
Hadoop1. For Hadoop2 we use Resource Manager and Node Manager. Standalone Mode also
means that we are installing Hadoop only in a single system. By default, Hadoop is made to run in
this Standalone Mode or we can also call it as the Local mode. We mainly use Hadoop in this Mode
for the Purpose of Learning, testing, and debugging.
Hadoop works very much Fastest in this mode among all of these 3 modes. As we all know HDFS
(Hadoop distributed file system) is one of the major components for Hadoop whichutilized for
storage Permission is not utilized in this mode. You can think of HDFS as similar to the file system’s
available for windows i.e. NTFS (New Technology File System) and FAT32(File Allocation Table
which stores the data in the blocks of 32 bits ). when your Hadoop works in this mode there is no
need to configure the files – hdfs-site.xml, mapred-site.xml, core- site.xml for Hadoop environment.
In this Mode, all of your Processes will run on a single JVM(Java Virtual Machine) and
this mode can only be used for small development purposes.
2. Pseudo-Distributed Mode (Single node)
Hadoop can also run on a single node in a Pseudo Distributed mode. In this mode, each daemon
runs on separate java process. In this mode custom configuration is required( core-site.xml, hdfs-
site.xml, mapred-site.xml ). Here HDFS is utilized for input and output. This mode of deployment
is useful for testing and debugging purposes.
In Pseudo-distributed Mode we also use only a single node, but the main thing is that the cluster
is simulated, which means that all the processes inside the cluster will run independently to each
other. All the daemons that are Namenode, Datanode, Secondary Name node, Resource Manager,
Node Manager, etc. will be running as a separate process on separate JVM(Java Virtual Machine)
or we can say run on different java processes that is why it is called a Pseudo- distributed.
One thing we should remember that as we are using only the single node set up so all the Master
and Slave processes are handled by the single system. Namenode and Resource Manager are used
as Master and Datanode and Node Manager is used as a slave. A secondary name node isalso
used as a Master. The purpose of the Secondary Name node is to just keep the hourly based backup
of the Name node. In this Mode,
Hadoop is used for development and for debugging purposes both.
Our HDFS(Hadoop Distributed File System ) is utilized for managing the Input and Output
processes.
7                                                             BIG DATA ANALYTICS LABORATORY (KDS651)
8
We need to change the configuration files mapred-site.xml, core-site.xml, hdfs-site.xml for
setting up the environment.
3. Fully Distributed Mode
This is the production mode of Hadoop. In this mode typically one machine in the cluster is
designated as NameNode and another as Resource Manager exclusively. These are masters. All
other nodes act as Data Node and Node Manager. These are the slaves. Configuration parameters
and environment need to specified for Hadoop Daemons.
This mode offers fully distributed computing capability, reliability, fault tolerance and scalability.
This is the most important one in which multiple nodes are used few of them run the Master
Daemon’s that are Namenode and Resource Manager and the rest of them run the Slave
Daemon’s that are DataNode and Node Manager. Here Hadoop will run on the clusters of
Machine or nodes. Here the data that is used is distributed across different nodes. This is
actually the Production Mode of Hadoop let’s clarify or understand this Mode in a better way in
Physical Terminology.
Once you download the Hadoop in a tar file format or zip file format then you install it in your
system and you run all the processes in a single system but here in the fully distributed mode we
are extracting this tar or zip file to each of the nodes in the Hadoop cluster and then we are
using a particular node for a particular process. Once you distribute the process among the
nodes then you’ll define which nodes are
working as a master or which one of them is working as a slave.
8                                                             BIG DATA ANALYTICS LABORATORY (KDS651)
9
Startup scripts
The $HADOOP_INSTALL/hadoop/bin directory contains some scripts used to launch Hadoop
DFS and Hadoop Map/Reduce daemons. These are:
start-dfs.sh - Starts the Hadoop DFS daemons, the namenode and datanodes. Use this
before start-mapred.sh
stop-dfs.sh - Stops the Hadoop DFS daemons.
start-mapred.sh - Starts the Hadoop Map/Reduce daemons, the jobtracker and tasktrackers.
stop-mapred.sh - Stops the Hadoop Map/Reduce daemons.
start-all.sh - Starts all Hadoop daemons, the namenode, datanodes, the jobtracker and
tasktrackers. Deprecated; use start-dfs.sh then start-mapred.sh
stop-all.sh - Stops all Hadoop daemons. Deprecated; use stop-mapred.sh then stop-dfs.sh
Configuration files
The $HADOOP_INSTALL/hadoop/conf directory contains some configuration files for Hadoop.
These are:
hadoop-env.sh - This file contains some environment variable settings used by Hadoop. You can
use these to affect some aspects of Hadoop daemon behavior, such as where log files are stored, the
maximum amount of heap used etc. The only variable you should need to change in this fileis
JAVA_HOME, which specifies the path to the Java 1.5.x installation used by Hadoop.
slaves - This file lists the hosts, one per line, where the Hadoop slave daemons (datanodes and
tasktrackers) will run. By default this contains the single entry localhost
hadoop-default.xml - This file contains generic default settings for Hadoop daemons and
Map/Reduce jobs. Do not modify this file.
mapred-default.xml - This file contains site specific settings for the Hadoop Map/Reduce daemons
and jobs. The file is empty by default. Putting configuration properties in this file will override
Map/Reduce settings in the hadoop-default.xml file. Use this file to tailor the behavior of
Map/Reduce on your site.
hadoop-site.xml - This file contains site specific settings for all Hadoop daemons and
9                                                           BIG DATA ANALYTICS LABORATORY (KDS651)
10
Map/Reduce jobs. This file is empty by default. Settings in this file override those in hadoop-
default.xml and mapred-default.xml. This file should contain settings that must be respected by all
servers and clients in a Hadoop installation, for instance, the location of the namenode and the
jobtracker.
Hadoop software can be installed in three modes of
Hadoop is a Java-based programming framework that supports the processing and storage of
extremely large datasets on a cluster of inexpensive machines. It was the first major open source
project in the big data playing field and is sponsored by the Apache Software Foundation.
Hadoop-2.7.3 is comprised of four main layers:
   ➢ Hadoop Common is the collection of utilities and libraries that support other Hadoop
       modules.
   ➢ HDFS, which stands for Hadoop Distributed File System, is responsible for persisting
       data to disk.
   ➢ YARN, short for Yet Another Resource Negotiator, is the "operating system" for HDFS.
   ➢ MapReduce is the original processing model for Hadoop clusters. It distributes work
       within the cluster or map, then organizes and reduces the results from the nodes into a
       response to a query. Many other processing models are available for the 2.x version of
       Hadoop.
Hadoop clusters are relatively complex to set up, so the project includes a stand-alone mode
which is suitable for learning about Hadoop, performing simple operations, and debugging.
Procedure:
we'll install Hadoop in stand-alone mode and run one of the example example MapReduce
programs it includes to verify the installation.
Prerequisites:
Step1: Installing Java 8 version.
Openjdk version "1.8.0_91"
OpenJDK Runtime Environment (build 1.8.0_91-8u91-b14-3ubuntu1~16.04.1-b14)
OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)
This output verifies that OpenJDK has been successfully installed.
Note: To set the path for environment variables. i.e. JAVA_HOME
Step2: Installing Hadoop
With Java in place, we'll visit the Apache Hadoop Releases page to find the most
recent stable release. Follow the binary for the current release:
10                                                          BIG DATA ANALYTICS LABORATORY (KDS651)
11
Download Hadoop from www.hadoop.apache.org
11                                           BIG DATA ANALYTICS LABORATORY (KDS651)
12
Procedure to Run Hadoop
1. Install Apache Hadoop 2.2.0 in Microsoft Windows OS
If Apache Hadoop 2.2.0 is not already installed then follow the post Build, Install,
Configure and Run Apache Hadoop 2.2.0 in Microsoft Windows OS.
2. Start HDFS (Namenode and Datanode) and YARN (Resource Manager and Node
Manager)
Run following commands.
Command Prompt
C:\Users\abhijitg>cd c:\hadoop
c:\hadoop>sbin\start-dfs
c:\hadoop>sbin\start-yarn
starting yarn daemons
Namenode, Datanode, Resource Manager and Node Manager will be started in few
minutes and ready to execute Hadoop MapReduce job in the Single Node (pseudo-
distributed mode) cluster.
Resource Manager & Node Manager:
12                                                     BIG DATA ANALYTICS LABORATORY (KDS651)
13
Run wordcount MapReduce job
Now we'll run wordcount MapReduce job available
in %HADOOP_HOME%\share\hadoop\mapreduce\hadoop-mapreduce-examples-
2.2.0.jar
Create a text file with some content. We'll pass this file as input to
the wordcount MapReduce job for counting words.
C:\file1.txt
Install Hadoop
Run Hadoop Wordcount Mapreduce Example
Create a directory (say 'input') in HDFS to keep all the text files (say 'file1.txt') to be used for
counting words.
C:\Users\abhijitg>cd c:\hadoop
C:\hadoop>bin\hdfs dfs -mkdir input
Copy the text file(say 'file1.txt') from local disk to the newly created 'input' directory in HDFS.
C:\hadoop>bin\hdfs dfs -copyFromLocal c:/file1.txt input
13                                                             BIG DATA ANALYTICS LABORATORY (KDS651)
14
Check content of the copied file.
C:\hadoop>hdfs dfs -ls input
Found 1 items
-rw-r--r-- 1 ABHIJITG supergroup           55 2014-02-03 13:19 input/file1.txt
C:\hadoop>bin\hdfs dfs -cat input/file1.txt
Install Hadoop
Run Hadoop Wordcount Mapreduce Example
Run the wordcount MapReduce job provided
in   %HADOOP_HOME%\share\hadoop\mapreduce\hadoop-mapreduce-examples-2.2.0.jar
C:\hadoop>bin\yarnjarshare/hadoop/mapreduce/hadoop-mapreduce-examples- 2.2.0.jar
wordcount input output
14/02/03 13:22:02 INFO client.RMProxy: Connecting to ResourceManager at
/0.0.0.0:8032
14/02/03 13:22:03 INFO input.FileInputFormat: Total input paths to process : 1 14/02/03
13:22:03 INFO mapreduce.JobSubmitter: number of splits:1
:
:
14/02/03 13:22:04 INFO mapreduce.JobSubmitter: Submitting tokens for job:
job_1391412385921_0002
14/02/03 13:22:04 INFO impl.YarnClientImpl: Submitted application
application_1391412385921_0002 to ResourceManager at /0.0.0.0:8032 14/02/03
13:22:04 INFO mapreduce.Job: The url to track the job:
http://ABHIJITG:8088/proxy/application_1391412385921_0002/
14/02/03 13:22:04 INFO mapreduce.Job: Running job: job_1391412385921_0002 14/02/03
13:22:14 INFO mapreduce.Job: Job job_1391412385921_0002 running in uber mode : false
14/02/03 13:22:14 INFO mapreduce.Job: map 0% reduce 0%
14/02/03 13:22:22 INFO mapreduce.Job: map 100% reduce 0%
14/02/03 13:22:30 INFO mapreduce.Job: map 100% reduce 100%
14/02/03 13:22:30 INFO mapreduce.Job: Job job_1391412385921_0002 completed
successfully
14/02/03 13:22:31 INFO mapreduce.Job: Counters: 43 File
      System Counters
           FILE: Number of bytes read=89
           FILE: Number of bytes written=160142
           FILE: Number of read operations=0 FILE:
           Number of large read operations=0 FILE:
           Number of write operations=0
                                                                                        5
14                                                     BIG DATA ANALYTICS LABORATORY (KDS651)
15
           HDFS: Number of bytes read=171
           HDFS: Number of bytes written=59
           HDFS: Number of read operations=6
           HDFS: Number of large read operations=0 HDFS:
           Number of write operations=2
     Job Counters
           Launched map tasks=1
           Launched reduce tasks=1
           Data-local map tasks=1
           Total time spent by all maps in occupied slots (ms)=5657 Total
           time spent by all reduces in occupied slots (ms)=6128
     Map-Reduce Framework Map
           input records=2 Map
           output records=7 Map
           output bytes=82
           Map output materialized bytes=89 Input
           split bytes=116
           Combine input records=7
           Combine output records=6
           Reduce input groups=6
           Reduce shuffle bytes=89
           Reduce input records=6
           Reduce output records=6
           Spilled Records=12 Shuffled
           Maps =1
           Failed Shuffles=0
           Merged Map outputs=1
           GC time elapsed (ms)=145
           CPU time spent (ms)=1418
           Physical memory (bytes) snapshot=368246784
           Virtual memory (bytes) snapshot=513716224 Total
           committed heap usage (bytes)=307757056
     Shuffle Errors
           BAD_ID=0
           CONNECTION=0
           IO_ERROR=0
           WRONG_LENGTH=0
           WRONG_MAP=0
           WRONG_REDUCE=0
     File Input Format Counters
           Bytes Read=55
     File Output Format Counters
                                                                                              6
15                                                       BIG DATA ANALYTICS LABORATORY (KDS651)
16
            Bytes Written=59
http://abhijitg:8088/cluster
Result: We've installed Hadoop in stand-alone mode and verified it by running an
example program it provided.
                               Other method to install Hadoop
 Installation of HADOOP
1-Download virtual box
    ➢ Virtual box-oracle (https://www.virtualbox.org/wiki/Downloads)
    ➢ VirtualBox 7.0.6 platform packages
    ➢ Windows hosts
2. Google: sandbox HDP clouder
    ➢ Go to Hortonswork HDP -> download now
    ➢ Choose Installation Type: virtual Box
    ➢ Fill form (chose for udemy training)
    ➢ After all click: download older version: 2.5.0 (11 GB)
3- Connect virtual box to HDP
     ➢   Open virtualbox
     ➢   Click import
     ➢   Chose path of set of of HDP 2.5.0
     ➢   It will take time to Import (Approx 2 min)
16                                                        BIG DATA ANALYTICS LABORATORY (KDS651)
17
Page of Ambari:
   ➢ Open setting
   ➢ Click network : reduce base memory (up to less than half) : it is form remove GURU
       MEDITATION ERROR
   ✓ After start machine (main page of virtual box)
   ✓ After some time aapprox 3-4 min: it will generate http:// link
   ✓ Local host connect: http://127.0.0.1:8888/ 0r
   ✓ http://127.0.0.1:8080/#/main/dashboard/metrics
   ✓ Login on ambari
   ✓ Username: maria_dev
   ✓ Password: maria_dev
5- Hadoop Main page
17                                                      BIG DATA ANALYTICS LABORATORY (KDS651)
   18
 Lab NO: 2                   Implementation of file management in Hadoop
                               ➢ Adding files and Directories
                               ➢ Retrieving file
                               ➢ Deleting files
AIM: Implementation of file management in Hadoop
 With growing data velocity the data size easily outgrows the storage limit of a machine.
A solution would be to store the data across a network of machines. Such filesystems are
called distributed filesystems. Since data is stored across a network all the complications
of a network come in.
This is where Hadoop comes in. It provides one of the most reliable filesystems. HDFS
(Hadoop Distributed File System) is a unique design that provides storage for extremely
large files with streaming data access pattern and it runs on commodity hardware. Let’s
elaborate the terms:
•      Extremely large files: Here we are talking about the data in range of petabytes(1000
    TB).
•      Streaming Data Access Pattern: HDFS is designed on principle of write-once and
    read-many-times. Once data is written large portions of dataset can be processed any
    number times.
•      Commodity hardware: Hardware that is inexpensive and easily available in the
    market. This is one of feature which specially distinguishes HDFS from other file
    system.
Nodes: Master-slave nodes typically forms the HDFS cluster.
1.     NameNode(MasterNode):
    •          Manages all the slave nodes and assign work to them.
    •          It executes filesystem namespace operations like opening, closing,
       renaming
    •          files and directories.
    •          It should be deployed on reliable hardware which has the high config. not
       on commodity hardware.
2.     DataNode(SlaveNode):
    •          Actual worker nodes, who do the actual work like reading, writing,
       processing etc.
    •          They also perform creation, deletion, and replication upon instruction from
       the master.
    •          They can be deployed on commodity hardware.
HDFS daemons: Daemons are the processes running in background.
•      Namenodes:
           • Run on the master node.
           • Store metadata (data about data) like file path, the number of blocks, block
               Ids. etc.
           • Require high amount of RAM.
           • Store meta-data in RAM for fast retrieval i.e to reduce seek time. Though a
               persistent copy of it is kept on disk.
•      DataNodes:
           • Run on slave nodes.
   18                                                   BIG DATA ANALYTICS LABORATORY (KDS651)
   19
          •    Require high memory as data is actually stored here.
Data storage in HDFS: Now let’s see how the data is stored in a distributed manner.
Lets assume that 100TB file is inserted, then masternode(namenode) will first divide the
file into blocks of 10TB (default size is 128 MB in Hadoop 2.x and above). Then these
blocks         are        stored         across        different       datanodes(slavenode).
Datanodes(slavenode)replicate the blocks among themselves and the information of
what blocks they contain is sent to the master. Default replication factor is 3 means for
each block 3 replicas are created (including itself). In hdfs.site.xml we can increase or
decrease the replication factor i.e we can edit its configuration here.
Note: MasterNode has the record of everything, it knows the location and info of each
and every single data nodes and the blocks they contain, i.e. nothing is done without the
permission of masternode.
Why divide the file into blocks?
Answer: Let’s assume that we don’t divide, now it’s very difficult to store a 100 TB file
on a single machine. Even if we store, then each read and write operation on that whole file
is going to take very high seek time. But if we have multiple blocks of size 128MB then its
become easy to perform various read and write operations on it compared todoing it
on a whole file at once. So we divide the file to have faster data access i.e. reduce seek
time.
Why replicate the blocks in data nodes while storing?
Answer: Let’s assume we don’t replicate and only one yellow block is present on datanode
D1. Now if the data node D1 crashes we will lose the block and which willmake the
overall data inconsistent and faulty. So we replicate the blocks to achieve fault-
tolerance.
Terms related to HDFS:
•       HeartBeat : It is the signal that datanode continuously sends to namenode. If
    namenode doesn’t receive heartbeat from a datanode then it will consider it dead.
•       Balancing : If a datanode is crashed the blocks present on it will be gone too and the
    blocks will be under-replicated compared to the remaining blocks. Here master
    node(namenode) will give a signal to datanodes containing replicas of those lostblocks
    to replicate so that overall distribution of blocks is balanced.
•       Replication:: It is done by datanode.
Note: No two replicas of the same block are present on the same datanode.
Features:
•       Distributed data storage.
•       Blocks reduce seek time.
•       The data is highly available as the same block is present at multiple datanodes.
•       Even if multiple datanodes are down we can still do our work, thus making it
    highly reliable.
•       High fault tolerance.
Limitations: Though HDFS provide many features there are some areas where it
doesn’t work well.
•       Low latency data access: Applications that require low-latency access to data i.e in
    the range of milliseconds will not work well with HDFS, because HDFS is designed
    keeping in mind that we need high-throughput of data even at the cost of latency.
•       Small file problem: Having lots of small files will result in lots of seeks and lots
    of movement from one datanode to another datanode to retrieve each small file, this
    whole process is a very inefficient data access pattern.
   19                                                     BIG DATA ANALYTICS LABORATORY (KDS651)
    20
 File Management tasks in Hadoop
 1. Create a directory in HDFS at given path(s).
Usage:
hadoop fs -mkdir <paths>
Example:
hadoop fs -mkdir /user/saurzcode/dir1 /user/saurzcode/dir2
2. List the contents of a directory.
Usage :
hadoop fs -ls <args>
Example:
hadoop fs -ls /user/saurzcode
 3. Upload and download a file in HDFS.
Upload: hadoop fs -put:
Copy single src file, or multiple src files from local file system to the Hadoop data file system
Usage:
hadoop fs -put <localsrc> ... <HDFS_dest_Path>
Example:
hadoop fs -put /home/saurzcode/Samplefile.txt /user/ saurzcode/dir3/
Download: hadoop fs -get:
Copies/Downloads files to the local file system
Usage:
hadoop fs -get <hdfs_src> <localdst>
Example:
hadoop fs -get /user/saurzcode/dir3/Samplefile.txt /home/
4. See contents of a file
Same as unix cat command:
Usage:
hadoop fs -cat <path[filename]>
Example:
hadoop fs -cat /user/saurzcode/dir1/abc.txt
    20                                                                BIG DATA ANALYTICS LABORATORY (KDS651)
    21
5. Copy a file from source to destination
This command allows multiple sources as well in which case the destination must be a directory.
Usage:
hadoop fs -cp <source> <dest>
Example:
hadoop fs -cp /user/saurzcode/dir1/abc.txt /user/saurzcode/ dir2
6. Copy a file from/To Local file system to HDFS
copyFromLocal
Usage:
hadoop fs -copyFromLocal <localsrc> URI
Example:
hadoop fs -copyFromLocal /home/saurzcode/abc.txt                     /user/ saurzcode/abc.txt
Similar to put command, except that the source is restricted to a local file reference.
copyToLocal
Usage:
hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
Similar to get command, except that the destination is restricted to a local file reference.
7. Move file from source to destination.
Note:- Moving files across filesystem is not permitted.
Usage :
hadoop fs -mv <src> <dest>
Example:
hadoop fs -mv /user/saurzcode/dir1/abc.txt /user/saurzcode/ dir2
8. Remove a file or directory in HDFS.
Remove files specified as argument. Deletes directory only when it is empty
Usage :
hadoop fs -rm <arg> Example:
hadoop fs -rm /user/saurzcode/dir1/abc.txt
 9. Recursive version of delete.
Usage :
hadoop fs -rmr <arg>
Example:
hadoop fs -rmr /user/saurzcode/
10. Display last few lines of a file.
Similar to tail command in Unix.
Usage :
hadoop fs -tail <path[filename]>
Example:
hadoop fs -tail /user/saurzcode/dir1/abc.txt
    21                                                                BIG DATA ANALYTICS LABORATORY (KDS651)
   22
11. Display the aggregate length of a file.
Usage :
hadoop fs -du <path>
Example:
hadoop fs -du /user/saurzcode/dir1/abc.txt
   22                                         BIG DATA ANALYTICS LABORATORY (KDS651)
    23
 Lab NO: 3
                                   Matrix Multiplication with Hadoop MapReduce
AIM: To understand matrix multiplication in Hadoop MapReduce
MapReduce is a technique in which a huge program is subdivided into small tasks and run parallelly
to make computation faster, save time, and mostly used in distributed systems. It has 2 important
parts:
Mapper: It takes raw data input and organizes into key, value pairs. For example, In a dictionary,
you search for the word “Data” and its associated meaning is “facts and statistics collected together
for reference or analysis”. Here the Key is Data and the Value associated with is facts and
statistics collected together for reference or analysis.
Reducer: It is responsible for processing data in parallel and produce final output.
Let us consider the matrix multiplication example to visualize MapReduce. Consider the following
matrix:
                                       2×2 matrices A and B
Here matrix A is a 2×2 matrix which means the number of rows(i)=2 and the number of
columns(j)=2. Matrix B is also a 2×2 matrix where number of rows(j)=2 and number of
columns(k)=2. Each cell of the matrix is labelled as Aij and Bij. Ex. element 3 in matrix A is called
A21 i.e. 2nd-row 1st column. Now One step matrix multiplication has 1 mapper and 1 reducer. The
Formula is:
Mapper for Matrix A (k, v)=((i, k), (A, j, Aij)) for all k
Mapper for Matrix B (k, v)=((i, k), (B, j, Bjk)) for all i
Therefore computing the mapper for Matrix A:
# k, i, j computes the number of times it occurs.
# Here all are 2, therefore when k=1, i can have
# 2 values 1 & 2, each case can have 2 further
# values of j=1 and j=2. Substituting all values
# in formula
k=1 i=1 j=1 ((1, 1), (A, 1, 1))
         j=2 ((1, 1), (A, 2, 2))
   i=2 j=1 ((2, 1), (A, 1, 3))
         j=2 ((2, 1), (A, 2, 4))
    23                                                        BIG DATA ANALYTICS LABORATORY (KDS651)
    24
k=2 i=1 j=1 ((1, 2), (A, 1, 1))
         j=2 ((1, 2), (A, 2, 2))
   i=2 j=1 ((2, 2), (A, 1, 3))
         j=2 ((2, 2), (A, 2, 4))
Computing the mapper for Matrix B
i=1 j=1 k=1 ((1, 1), (B, 1, 5))
         k=2 ((1, 2), (B, 1, 6))
   j=2 k=1 ((1, 1), (B, 2, 7))
         k=2 ((1, 2), (B, 2, 8))
i=2 j=1 k=1 ((2, 1), (B, 1, 5))
         k=2 ((2, 2), (B, 1, 6))
   j=2 k=1 ((2, 1), (B, 2, 7))
         k=2 ((2, 2), (B, 2, 8))
The formula for Reducer is:
Reducer(k, v)=(i, k)=>Make sorted Alist and Blist
(i, k) => Summation (Aij * Bjk)) for j
Output =>((i, k), sum)
Therefore computing the reducer:
# We can observe from Mapper computation
# that 4 pairs are common (1, 1), (1, 2),
# (2, 1) and (2, 2)
# Make a list separate for Matrix A &
# B with adjoining values taken from
# Mapper step above:
(1, 1) =>Alist ={(A, 1, 1), (A, 2, 2)}
     Blist ={(B, 1, 5), (B, 2, 7)}
     Now Aij x Bjk: [(1*5) + (2*7)] =19 ------- (i)
(1, 2) =>Alist ={(A, 1, 1), (A, 2, 2)}
     Blist ={(B, 1, 6), (B, 2, 8)}
     Now Aij x Bjk: [(1*6) + (2*8)] =22 ------- (ii)
(2, 1) =>Alist ={(A, 1, 3), (A, 2, 4)}
     Blist ={(B, 1, 5), (B, 2, 7)}
     Now Aij x Bjk: [(3*5) + (4*7)] =43 ------- (iii)
    24                                                  BIG DATA ANALYTICS LABORATORY (KDS651)
    25
(2, 2) =>Alist ={(A, 1, 3), (A, 2, 4)}
     Blist ={(B, 1, 6), (B, 2, 8)}
     Now Aij x Bjk: [(3*6) + (4*8)] =50 ------- (iv)
From (i), (ii), (iii) and (iv) we conclude that
((1, 1), 19)
((1, 2), 22)
((2, 1), 43)
((2, 2), 50)
Therefore the Final Matrix is:
Final output of Matrix multiplication.
Code for mapper and Reducer
 import java.io.IOException; import java.util.*;
 import java.util.AbstractMap.SimpleEntry; import
 java.util.Map.Entry;
 import org.apache.hadoop.fs.Path; import
 org.apache.hadoop.conf.*; import
 org.apache.hadoop.io.*;
 import org.apache.hadoop.mapreduce.*;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import
 org.apache.hadoop.mapreduce.lib.input.TextInputFormat; import
 org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; import
 org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
 public class TwoStepMatrixMultiplication {
       public static class Map extends Mapper<LongWritable, Text, Text, Text> {
             public void map(LongWritable key, Text value, Context context) throws
 IOException, InterruptedException {
                    String line = value.toString();
                    String[] indicesAndValue = line.split(","); Text outputKey = new
                    Text();
                    Text outputValue = new Text();
                    if (indicesAndValue[0].equals("A")) { outputKey.set(indicesAndValue[2]);
                           outputValue.set("A," + indicesAndValue[1] + "," +
 indicesAndValue[3]);
                           context.write(outputKey, outputValue);
    25                                                     BIG DATA ANALYTICS LABORATORY (KDS651)
  26
                } else {
                      outputKey.set(indicesAndValue[1]); outputValue.set("B," +
                      indicesAndValue[2] + "," +
indicesAndValue[3]);
                      context.write(outputKey, outputValue);
                }
          }
      }
         public static class Reduce extends Reducer<Text, Text, Text, Text> {
         public void reduce(Text key, Iterable<Text> values, Context context) throws
         IOException, InterruptedException {
         String[] value;
         ArrayList<Entry<Integer, Float>> listA = new ArrayList<Entry<Integer, Float>>();
         ArrayList<Entry<Integer, Float>> listB = new ArrayList<Entry<Integer, Float>>();
         for (Text val : values) {
         value = val.toString().split(",");
         if (value[0].equals("A")) { listA.add(new SimpleEntry<Integer,
         Float>(Integer.parseInt(value[1]), Float.parseFloat(value[2])));
         } else {
         listB.add(new SimpleEntry<Integer, Float>(Integer.parseInt(value[1]),
         Float.parseFloat(value[2])));
         }
         }
         String i; float a_ij; String k; float b_jk;
         Text outputValue = new Text();
         for (Entry<Integer, Float> a : listA) { i = Integer.toString(a.getKey()); a_ij =
         a.getValue();
         for (Entry<Integer, Float> b : listB) { k = Integer.toString(b.getKey()); b_jk =
         b.getValue(); outputValue.set(i + "," + k + "," +
         Float.toString(a_ij*b_jk));
         context.write(null, outputValue);
         }
         }
         }
         }
         public static void main(String[] args) throws Exception { Configuration conf = new
         Configuration();
         Job job = new Job(conf, "MatrixMatrixMultiplicationTwoSteps");
         job.setJarByClass(TwoStepMatrixMultiplication.class);
         job.setOutputKeyClass(Text.class); job.setOutputValueClass(Text.class);
         job.setMapperClass(Map.class); job.setReducerClass(Reduce.class);
         job.setInputFormatClass(TextInputFormat.class);
         job.setOutputFormatClass(TextOutputFormat.class);
         FileInputFormat.addInputPath(job, new Path("hdfs:// 127.0.0.1:9000/matrixin"));
         FileOutputFormat.setOutputPath(job, new Path("hdfs:// 127.0.0.1:9000/matrixout"));
         job.waitForCompletion(true);
         }
  26                                                      BIG DATA ANALYTICS LABORATORY (KDS651)
 27
  Lab NO: 4          To run a Word Count MapReduce program to understand
                     the MapReduce paradigm
AIM: To Develop a MapReduce program by using Word count
 FOR THIS WE NEED 3 SOFTWARE
 1-HADOOP INSTALLATION
 2- PUTTY CONNECT WITH HADOOP
 3- ECLIPSE FOR DOWNLOAD JAR FILE (JAVA)
 Step1: create text file on which map reduce woks
 Ex: hi Tarun Hi
 Tarun ram ram
 Hi Tarun ram
 Hi Hi Hi
 Create directory (In case of Hadoop, create in user/maria_dev/folder name(dataflow))
 Ex: Hadoop fs –ls /user/maria_dev/dataflow
 Step2: upload file (wordcount) in this folder by direct in Hadoop or upload file by using command
 put.
 Step3: run command Hadoop fs-ls
 It will show file
 Step4: To show contain of file by using command cat
 Ex: Hadoop dfs –cat/user/maria_dev/word_count
 27                                                         BIG DATA ANALYTICS LABORATORY (KDS651)
28
Part 2: create jar file of program of mapreduce wordcount by using ecliips software
Step1: new ->java project->next ->fnish (mapreduce)
Step2: in java project import all packages of yarn , HDFS etc
Step3: write program of word count mapreduce
28                                                 BIG DATA ANALYTICS LABORATORY (KDS651)
 29
Map Function – It takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (Key-Value pair).
Example – (Map function in Word Count)
Input
Set of data
Bus, Car, bus, car, train, car, bus, car, train, bus, TRAIN,BUS, buS, caR, CAR, car, BUS, TRAIN
Output
Convert into another set of data
(Key,Value)
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1),
(TRAIN,1),(BUS,1), (buS,1), (caR,1), (CAR,1), (car,1), (BUS,1), (TRAIN,1)
Reduce Function – Takes the output from Map as an input and combines those data tuples
into a smaller set of tuples.
Example – (Reduce function in Word Count)
Input         Set of Tuples
(output of Map function)
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1),
(TRAIN,1),(BUS,1),
(buS,1),(caR,1),(CAR,1), (car,1), (BUS,1), (TRAIN,1)
Output            Converts into smaller set of tuples
                   (BUS,7), (CAR,7), (TRAIN,4)
Work Flow of Program
 29                                                                  BIG DATA ANALYTICS LABORATORY (KDS651)
30
Workflow of MapReduce consists of 5 steps
1. Splitting – The splitting parameter can be anything, e.g. splitting by space,
   comma, semicolon, or even by a new line (‘\n’).
2. Mapping – as explained above
3. Intermediate splitting – the entire process in parallel on different clusters. In order to
   group them in “Reduce Phase” the similar KEY data should be on same cluster.
4. Reduce – it is nothing but mostly group by phase
5. Combining – The last phase where all the data (individual result set from each
   cluster) is combine together to form a Result
Now Let’s See the Word Count Program in Java
Make sure that Hadoop is installed on your system with java idk Steps to
follow
Step 1. Open Eclipse> File > New > Java Project > (Name it – MRProgramsDemo) > Finish
Step 2. Right Click > New > Package ( Name it - PackageDemo) > Finish Step 3.
Right Click on Package > New > Class (Name it - WordCount) Step 4. Add
Following Reference Libraries –
                                                                                                  9
30                                                           BIG DATA ANALYTICS LABORATORY (KDS651)
31
Right Click on Project > Build Path> Add External Archivals
   • /usr/lib/hadoop-0.20/hadoop-core.jar
   • Usr/lib/hadoop-0.20/lib/Commons-cli-1.2.jar
Program: Step 5. Type following Program :
package PackageDemo;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
importorg.apache.hadoop.io.IntWritable;
importorg.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
importorg.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class WordCount {
public static void main(String [] args) throws Exception
{
Configuration c=new Configuration();
String[] files=new GenericOptionsParser(c,args).getRemainingArgs();
Path input=new Path(files[0]);
Path output=new Path(files[1]);
Job j=new Job(c,"wordcount");
j.setJarByClass(WordCount.class);
j.setMapperClass(MapForWordCount.class);
j.setReducerClass(ReduceForWordCount.class);
j.setOutputKeyClass(Text.class);
j.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(j, input);
FileOutputFormat.setOutputPath(j, output);
System.exit(j.waitForCompletion(true)?0:1);
}
public static class MapForWordCount extends Mapper<LongWritable, Text, Text,
IntWritable>{
public void map(LongWritable key, Text value, Context con) throws IOException,
InterruptedException
{
String line = value.toString();
                                                                                       10
31                                                  BIG DATA ANALYTICS LABORATORY (KDS651)
32
String[] words=line.split(",");
for(String word: words )
{
   Text outputKey = new Text(word.toUpperCase().trim());
   IntWritable outputValue = new IntWritable(1);
   con.write(outputKey, outputValue);
}
}
}
public static class ReduceForWordCount extends Reducer<Text, IntWritable, Text,
IntWritable>
{
public void reduce(Text word, Iterable<IntWritable> values, Context con) throws
IOException,
InterruptedException
{
int sum = 0;
for(IntWritable value : values)
{
sum += value.get();
}
con.write(word, new IntWritable(sum));
}
}
}
Make Jar File
Right Click on Project> Export> Select export destination as Jar File > next> Finish
                                                                                        11
32                                                   BIG DATA ANALYTICS LABORATORY (KDS651)
33
                                        12
33   BIG DATA ANALYTICS LABORATORY (KDS651)
34
To Move this into Hadoop directly, open the terminal and enter the following
commands:
[training@localhost ~]$ hadoop fs -put wordcountFile wordCountFile
Run Jar file
(Hadoop jar jarfilename.jar packageName.ClassName PathToInputTextFile
PathToOutputDirectry)
[training@localhost ~]$ Hadoop jar MRProgramsDemo.jar
PackageDemo.WordCount wordCountFile MRDir1
Result: Open Result
[training@localhost ~]$ hadoop fs -ls MRDir1
Found 3 items
-rw-r--r-- 1 training supergroup
0 2016-02-23 03:36 /user/training/MRDir1/_SUCCESS
drwxr-xr-x - training supergroup
0 2016-02-23 03:36 /user/training/MRDir1/_logs
-rw-r--r-- 1 training supergroup
20 2016-02-23 03:36 /user/training/MRDir1/part-r-00000
[training@localhost ~]$ hadoop fs -cat MRDir1/part-r-00000
BUS 7
CAR 4
TRAIN 6
                                                                                        13
34                                                   BIG DATA ANALYTICS LABORATORY (KDS651)
35
 Lab NO: 5
                         Implementation of K-mean clustering using Map reduce
AIM: Implementation of K-mean clustering using Map reduce.
 K-Means is a clustering algorithm that partition a set of data point into k clusters. The k-means
clustering algorithm is commonly used on large data sets, and because of the characteristics of the
algorithm is a good candidate for parallelization. The aim of this project is to implement a
framework in java for performing k-means clustering using Hadoop MapReduce.
Pseudocode
The classical k-means algorithm works as an iterative process in which at each iteration itcomputes
the distance between the data points and the centroids, that are randomly initialized at the beginning
of the algorithm.
We decided to design such algorithm as a MapReduce workflow. A single stage of MapReduce
roughly corresponds to a single iteration of the classical algorithm. As in the classical algorithm at
the first stage the centroids are randomly sampled from the set of data points. The map function
takes as input a data point and the list of centroids, computes the distance between the point and
each centroid and emits the point and the closest centroid. The reduce function collects all the points
beloging to a cluster and computes the new centroid and emits it. At the end of each stage,it finds
a new approximation of the centroids, that are used for the next iteration. The workflow continues
until the distance from each centroid of a previuos stage and the corresponding centroids of the
current stage drops below a given threshold.
centroids = k random sampled points from the dataset.
do:
     Map:
      - Given a point and the set of centroids.
      - Calculate the distance between the point and each centroid.
      - Emit the point and the closest centroid.
     Reduce:
       - Given the centroid and the points belonging to its cluster.
       - Calculate the new centroid as the aritmetic mean position of the points.
       - Emit the new centroid.
     prev_centroids = centroids.
     centroids = new_centroids.
while prev_centroids - centroids > threshold.
 Mapper
The mapper calculates the distance between the data point and each centroid. Then emits the index
of the closest centroid and the data point.
class MAPPER
   method MAP(file_offset, point)
     min_distance = POSITIVE_INFINITY
35                                                             BIG DATA ANALYTICS LABORATORY (KDS651)
36
     closest_centroid = -1
     for all centroid in list_of_centroids
        distance = distance(centroid, point)
        if (distance < min_distance)
           closest_centroid = index_of(centroid)
           min_distance = distance
     EMIT(closest_centroid, point)
 Combiner
At each stage we need to sum the data points belonging to a cluster to calculate the centroid
(arithmetic mean of points). Since the sum is an associative and commutative function, our
algorithm can benefit from the use of a combiner to reduce the amount of data to be transmitted to
the reducers.
class COMBINER
   method COMBINER(centroid_index, list_of_points)
      point_sum.number_of_points = 0
      point_sum = 0
      for all point in list_of_points:
         point_sum += point
         point_sum.number_of_points += 1
      EMIT(centroid_index, point_sum)
We implemented the combiner only in the Hadoop algorithm.
 Reducer
The reducer calculates the new approximation of the centroid and emits it. The result of the
MapReduce stage will be the same even if the combiner is not called by the Hadoop framework.
class REDUCER
   method REDUCER(centroid_index, list_of_point_sums)
     number_of_points = partial_sum.number_of_points
     point_sum = 0
     for all partial_sum in list_of_partial_sums:
        point_sum += partial_sum
        point_sum.number_of_points += partial_sum.number_of_points
     centroid_value = point_sum / point_sum.number_of_points
     EMIT(centroid_index, centroid_value)
36                                                          BIG DATA ANALYTICS LABORATORY (KDS651)
37
 Lab NO: 6
                              Installation of HIVE alone practical Example
AIM: Use Hive query to find out movie rating from data set
           1. Install Java: Apache Hive requires Java to run. Check that Java is installed on your
system by running the command "java -version" in the command prompt. If it's not installed,
download and install Java on your system.
2. Download Hive: Download the latest version of Apache Hive from the official website
(https://hive.apache.org/downloads.html).
3. Extract the downloaded file: Extract the downloaded file to a directory of your choice.
4. Configure Hive: Go to the conf directory in the Hive installation folder and edit the hive- site.xml
file. This file contains configuration settings for Hive. Configure the file according toyour needs,
such as setting the location of Hadoop and the metastore database.
5. Start Hadoop: Hive requires Hadoop to run. Start the Hadoop daemons by running the command
"start-all.sh" in the bin directory of your Hadoop installation.
6. Start Hive: To start Hive, run the command "hive" in the bin directory of your Hive installation.
7. Test Hive: Once you have started Hive, run some sample queries to test that everything is
working correctly.
Hive Commands:
Data Definition Language (DDL )
DDL statements are used to build and modify the tables and other objects in the database.
          DDL
                                   Function
          Command
          CREATE                   It is used to create a table or Database
          SHOW                     It is used to show Database, Table, Properties, etc
          ALTER                    It is used to make changes to the existing table
          DESCRIBE                 It describes the table columns
          TRUNCATE                 Used to permanently truncate and delete the rows of table
          DELETE                   Deletes the table data, but, can be restored
37                                                            BIG DATA ANALYTICS LABORATORY (KDS651)
Using Hive & Uploading Data
  36
                              Click Here
  38                           BIG DATA ANALYTICS LABORATORY (KDS651)
39   BIG DATA ANALYTICS LABORATORY (KDS651)
  40
1-Click: Hive View
2-Click: Upload Table
  40                    BIG DATA ANALYTICS LABORATORY (KDS651)
  41
Click: Setting Button Ahead of File Type
  38                                       BIG DATA ANALYTICS LABORATORY (KDS651)
1-Select Number 9: TAB(horizontal tab)
 39
2-Then Close
 39                                      BIG DATA ANALYTICS LABORATORY (KDS651)
Choose File: u.data
40
40                    BIG DATA ANALYTICS LABORATORY (KDS651)
 41
It will look like this:-
 41                        BIG DATA ANALYTICS LABORATORY (KDS651)
1- Change Table Name To – ratings
   42
2- Change Column1 To – movie_id
3- Change Column2 To – ratingCount
4- Click Upload Table – After few minutes, you may need to refresh it.
   42                                                                    BIG DATA ANALYTICS LABORATORY (KDS651)
1- Go to Hive
2-Paste43 this Query & Execute:-
SELECT movie_id, count(movie_id) as ratingCount
FROM ratings
GROUP BY movie_id
ORDER BY ratingCount
DESC;
      43                                          BIG DATA ANALYTICS LABORATORY (KDS651)
 44
Result will look like this:
 44                           BIG DATA ANALYTICS LABORATORY (KDS651)
  45
Click on the graph button:
  45                         BIG DATA ANALYTICS LABORATORY (KDS651)
 46
1- Drag movie_id to x
2- Drag ratingcount to y
 46                        BIG DATA ANALYTICS LABORATORY (KDS651)
 47
Result:
 47       BIG DATA ANALYTICS LABORATORY (KDS651)
48
 Lab No: 7        Installing of HBase, Installation thrift alone with practice
                  example
AIM: Installing of HBase, Installation thrift alone with practice example
1. Download the HBase distribution package from the official Apache HBase website.
2. Extract the downloaded package to a directory of your choice.
3. Set the `HBASE_HOME` environment variable to the path where you extracted the package.
4. Update the `PATH` environment variable to include the `$HBASE_HOME/bin` directory.
5. Configure the `hbase-site.xml` file located in the `$HBASE_HOME/conf` directory to suit your
needs.
6. If you plan to use HBase with Hadoop, make sure that the Hadoop configuration files are also
in the `$HBASE_HOME/conf` directory.
7. Start the HBase server by running the `start-hbase.sh` script located in the
`$HBASE_HOME/bin` directory.
8. Verify that HBase is running by accessing the HBase shell using the `hbase shell` command.
Note that these are general steps and the specifics of installation may vary depending on your
operating system and the version of HBase you are installing. It is always a good idea to consult
the official documentation and any relevant installation guides specific to your setup.
7. Test Hive: Once you have started Hive, run some sample queries to test that everything is
working correctly.
General commands
In Hbase, general commands are categorized into following commands
     •   Status
     •   Version
     •   Table_help ( scan, drop, get, put, disable, etc.)
     •   Whoami
To get enter into HBase shell command, first of all, we have to execute the code as mentioned
below
48                                                           BIG DATA ANALYTICS LABORATORY (KDS651)
49
hbase Shell
Once we get to enter into HBase shell, we can execute all shell commands mentioned below. With
the help of these commands, we can perform all type of table operations in the HBase shell mode.
Let us look into all of these commands and their usage one by one with an example.
Status
Syntax:status
This command will give details about the system status like a number of servers present in the
cluster, active server count, and average load value. You can also pass any particular parameters
depending on how detailed status you want to know about the system. The parameters can be
‘summary’, ‘simple’, or ‘detailed’, the default parameter provided is “summary”.
hbase(main):001:0>status
hbase(main):002:0>status 'simple'
hbase(main):003:0>status 'summary'
hbase(main):004:0> status 'detailed'
49                                                          BIG DATA ANALYTICS LABORATORY (KDS651)
50
When we execute this command status, it will give information about number of server’s present,
dead servers and average load of server, here in screenshot it shows the information like- 1 live
server, 1 dead servers, and 7.0000 average load.
Version
Syntax: version
     •   This command will display the currently used HBase version in command mode
     •   If you run version command, it will give output as shown above
Table help
Syntax:table_help
This command guides
     •   What and how to use table-referenced commands
     •   It will provide different HBase shell command usages and its syntaxes
     •   Here in the screen shot above, its shows the syntax to “create” and “get_table” command
         with its usage. We can manipulate the table via these commands once the table gets created
         in HBase.
     •   It will give table manipulations commands like put, get and all other commands
         information.
50                                                           BIG DATA ANALYTICS LABORATORY (KDS651)
51
Syntax: Whoami
This command “whoami” is used to return the current HBase user information from the HBase
cluster.
It will provide information like
     •   Groups present in HBase
     •   The user information, for example in this case “hduser” represent the user name as shown
         in screen shot
TTL(Time To Live) – Attribute
In HBase, Column families can be set to time values in seconds using TTL. HBase will
automatically delete rows once the expiration time is reached. This attribute applies to all versions
of a row – even the current version too.
The TTL time encoded in the HBase for the row is specified in UTC. This attribute used with table
management commands.
Important differences between TTL handling and Column family TTLs are below
     •   Cell TTLs are expressed in units of milliseconds instead of seconds.
     •   A cell TTLs cannot extend the effective lifetime of a cell beyond a Column Family level
         TTL setting.
Tables Managements commands
These commands will allow programmers to create tables and table schemas with rows and column
families.
The following are Table Management commands
     •   Create
     •   List
     •   Describe
     •   Disable
     •   Disable_all
     •   Enable
     •   Enable_all
     •   Drop
51                                                            BIG DATA ANALYTICS LABORATORY (KDS651)
52
     •   Drop_all
     •   Show_filters
     •   Alter
     •   Alter_status
52                      BIG DATA ANALYTICS LABORATORY (KDS651)
53
 Lab No: 8
                   Write PIG Commands. Write pig Latin script sort, group and
                   filter in your data
AIM: Write PIG Commands. Write pig Latin script sort, group and filter in
your data
Commonly used Pig Latin commands:
     1. LOAD: Loads data from a specified data source into Pig. Example: input_data = LOAD
          'data.txt' USING PigStorage(',') AS (id: int, name: chararray, age:
          int);
     2.   FILTER: Filters the data based on a specified condition. Example: filtered_data =
          FILTER input_data BY age > 25;
     3.   GROUP: Groups the data based on one or more fields. Example: grouped_data =
          GROUP input_data BY name;
     4. FOREACH: Applies transformations or computations on each record of the data.
        Example: transformed_data = FOREACH input_data GENERATE id, name, age *
          2 AS doubled_age;
     5. ORDER: Sorts the data based on one or more fields. Example: sorted_data = ORDER
          input_data BY age DESC;
     6. DISTINCT: Removes duplicate records from the data. Example: distinct_data =
          DISTINCT input_data;
     7. JOIN: Performs a join operation between two or more relations based on a common field.
        Example: joined_data = JOIN input_data BY id, other_data BY id;
     8. STORE: Stores the data into a specified location or data source. Example: STORE
          input_data INTO 'output.txt' USING PigStorage(',');
     9. DUMP: Displays the data on the screen or console. Example: DUMP input_data;
     10. DESCRIBE: Provides the schema information of a relation. Example: DESCRIBE
          input_data;
These are just a few examples of Pig Latin commands. Pig provides a wide range of operators
and functions for data manipulation and analysis.
53                                                          BIG DATA ANALYTICS LABORATORY (KDS651)
54
Write pig Latin script sort, group and filter in your data
In this script, we assume that the input data is in a CSV file called "input_data.csv" with
columns: id, name, age, and salary. The script loads the data, filters it to include only records
where age is greater than 25, groups the filtered data by name, calculates the average salary for
each group, sorts the groups based on the average salary in descending order, stores the sorted
data in a file called "output_data", and finally displays the sorted data using the DUMP
command.
-- Load the input data
input_data = LOAD 'input_data.csv' USING PigStorage(',') AS (id: int, name: chararray, age: int,
salary: float);
-- Filter the data to include only records where age is greater than 25
filtered_data = FILTER input_data BY age > 25;
-- Group the filtered data by name
grouped_data = GROUP filtered_data BY name;
-- Calculate the average salary for each group
avg_salary = FOREACH grouped_data GENERATE group AS name, AVG(filtered_data.salary)
AS average_salary;
-- Sort the grouped data by average salary in descending order
sorted_data = ORDER avg_salary BY average_salary DESC;
-- Store the sorted data
STORE sorted_data INTO 'output_data' USING PigStorage(',');
-- Display the sorted data
DUMP sorted_data;
54                                                            BIG DATA ANALYTICS LABORATORY (KDS651)
55
 Lab No: 9
                  Run the PIG Latin Script to find world count
AIM: Run the PIG Latin Script to find world count
In this script, we assume that the input data is in a text file called "input.txt". The script loads the
data, tokenizes it by splitting on whitespace, groups the tokenized data by word, counts the
occurrences of each word, stores the word count results in a file called "output.txt", and finally
displays the word count results using the DUMP command.
To run this script, you would typically save it in a file with a .pig extension (e.g., wordcount.pig),
and then execute it using the Pig Latin interpreter or tool provided by Apache Pig. The specific
steps to run the script may vary depending on your Pig installation and environment.
-- Load the input data
input_data = LOAD 'input.txt' USING TextLoader();
-- Tokenize the input data by splitting on whitespace
tokenized_data = FOREACH input_data GENERATE FLATTEN(TOKENIZE($0)) AS word;
-- Group the tokenized data by word
grouped_data = GROUP tokenized_data BY word;
-- Count the occurrences of each word
word_count = FOREACH grouped_data GENERATE group AS word, COUNT(tokenized_data)
AS count;
-- Store the word count results
STORE word_count INTO 'output.txt' USING PigStorage(',');
-- Display the word count results
DUMP word_count;
55                                                              BIG DATA ANALYTICS LABORATORY (KDS651)
56
 Lab No: 10
                  Run the PIG Latin Script to find a max temp for each and every
                  year.
AIM: Run the PIG Latin Script to find a max temp for each and every year.
To find the maximum temperature for each year using Pig Latin, you can follow this example
script assuming your input data is in the format (year: int, temperature: float):
In this script, we assume that the input data is in a CSV file called "input.txt" with columns: year
and temperature. The script loads the data, groups it by year, calculates the maximum temperature
for each year using the MAX function, stores the results in a file called "output.txt", and finally
displays the results using the DUMP command.
To run this script, save it in a file with a .pig extension (e.g., maxtemp.pig), and then execute it
using the Pig Latin interpreter or tool provided by Apache Pig. Make sure to replace 'input.txt' and
'output.txt' with the actual file paths in your system.
-- Load the input data
input_data = LOAD 'input.txt' USING PigStorage(',') AS (year: int, temperature:
float);
-- Group the data by year
grouped_data = GROUP input_data BY year;
-- Calculate the maximum temperature for each year
max_temp_by_year = FOREACH grouped_data GENERATE group AS year,
MAX(input_data.temperature) AS max_temperature;
-- Store the results
STORE max_temp_by_year INTO 'output.txt' USING PigStorage(',');
-- Display the results
DUMP max_temp_by_year;
56                                                           BIG DATA ANALYTICS LABORATORY (KDS651)