
COURSE FILE

For

Business Intelligence and Data Analytics (CSET 371)

Faculty Name : Dr. Aditya Bhardwaj

Course Type : Specialization Elective

Program : BTech

Semester and Year : V Semester & III Year

L-T-P : 2 – 0 – 2

Credits : 3

School : SCSET

Level : UG

School of Computer Science Engineering & Technology

Bennett University, Greater Noida, Uttar Pradesh

Course File Format

The course file format is as indicated below.

Detailed Syllabus
Module 1 (8 hours)
Big Data Analytics: Data and Relations, Business Intelligence, Business intelligence vs business
analytics, Why what and how BI? OLTP VS OLAP, Ethics in Business Intelligence, Big Data
Technology Component, Structured/unstructured, streaming data, Streaming, Stream Data Types and
computing, Real Time Analysis of Big Data, Big Data Architecture, Big Data Warehouse.

Module 2 (6 hours)

Introduction to Hadoop, Hadoop high level architecture, Processing data with Hadoop, HDFS, Design of
HDFS, NameNodes and DataNodes, MapReduce, Mapper and Reducer function, Analysis of Real
Time Data using MapReduce.
Module 3 (6 hours)
Hadoop Ecosystem: Pig Overview, Pig Grunt Shell, Use cases for Pig-ETL Processing, Pig Relational
Operators, Hive, Hive file format, HBase, Architecture of Hive and HBase.
Module 4 (8 hours)
HQL, Associations and Joins, Aggregate function, Polymorphic queries, Clauses, Subqueries, Spark,
Core, Spark SQL, Spark RDD, Deployment and Integration, Spark GraphX and Graph Analytics,
Functional vs Procedural programming models, NoSQL, Use of Tableau, data source and worksheet,
Big Data Predictive Analysis, Research Topics in Big Data Analytics.

TEXTBOOKS/LEARNING RESOURCES:
1. Peter Ghavami, Big Data Analytics Methods (2nd ed.), De Gruyter, 2020. ISBN 9781547417951.
2. Seema Acharya, Data Analytics using R (1st ed.), McGraw-Hill Education, 2018. ISBN 9352605241.

https://www.csebu.com/view_course_desc.php?ct_id=1958 2/6
REFERENCE BOOKS/LEARNING RESOURCES:
1. Ana Azevedo and Manuel Filipe Santos, Integration Challenges for Analytics, Business Intelligence, and Data Mining (1st ed.), Engineering Science Reference, 2020. ISBN 9781799857832.

EVALUATION POLICY

Components of Course Evaluation Percentage Distribution

End Term Examination 35%

Quiz 5%

Continuous Lab Evaluation 20%

Certification/Assignment 20%

End Term Lab Exam/Project Work 20%

Total 100%

1. Lecture Wise Plan



No. Content Planned

1 Course structure/handout and Assessment mechanism (15); Big Data Analytics: Data and Relations (30)
2 Business Intelligence (15); Business intelligence vs business analytics (15); Ethics in Business Intelligence (10)
3 Why what and how BI? (25); OLTP VS OLAP (20)
4 Big Data Technology Component (25); Structured/unstructured (10); Streaming data, Stream Data Types (10)
5 Real time Analysis of Big Data (25); Big Data Architecture (10); Big Data Warehouse (10)
6 Introduction to Hadoop (10); Hadoop architecture (35)
7 Processing Data with Hadoop (30); Real-time examples of Hadoop (15)
8 HDFS (25); Design of HDFS (15)
9 Assessment/Buffer Lecture*
10 NameNodes and DataNodes (25); MapReduce (10); Mapper and Reducer function (10)
11 Analysis of Real time data using MapReduce (45)
12 Hadoop Ecosystem (10); Pig Overview (10); Pig Grunt Shell (25)
13 Use cases for Pig-ETL Processing (25); Pig Relational Operators (20)
14 Hive (15); Hive file format (30)
15 Data Analytics using Hive (45)
16 HBase (15); Architecture of Hive and HBase (30)
17 HQL (10); Associations and Joins (20); Aggregate function (15)
18 Polymorphic queries (15); Clauses (15); Subqueries (15)
19 Assessment/Buffer Lecture*
20 Expert Lecture from Industry (50)*
21 Spark (10); Core (15); Spark SQL (20)
22 Spark RDD (30); Deployment and Integration (20)
23 Spark GraphX (20); Graph Analytics (25)
24 Functional vs Procedural programming models (10); NoSQL (35)
25 Use of Tableau, data source and worksheet (50)
26 Case Study (50)
27 Research Topics in Big Data Analytics (50)
28 Buffer Lecture (45)

3. Lab Wise Plan



No. Content Planned

1 Linux HDFS Commands Prerequisites

2 Installation and Deployment of Hadoop Cluster


3 MapReduce Implementation and Use Cases

4 Analyzing Data Set with Pig Grunt Shell

5 Data Analytics using HQL Queries

6 Advance Data Management with Hive

7 Programming in Spark using RDD

8 Data Processing in Spark

9 Social Media Analytics and Text Mining

10 Big Data Cluster Analysis

11 Big Data Predictive analysis

12 Big Data Forecasting and Data Analytics using Tableau

13 Buffer Lab

14 End-Term Lab Exam

5. Teaching Resources (if any)


• Lectures/Lab
Available in the folder

6. Assessment Materials

Surprise Quiz 1:

Q1. In MapReduce, the output of the Mapper is stored on:


a. HDFS
b. In-memory
c. Local disk

Q2. The total number of V's of Big Data is:


a. 5
b. 3
c. 6
d. 4
Q3. Select one:
a. MapReduce jobs that are causing excessive memory swaps.
b. There is an infinite loop in Map or reduce tasks.
c. HDFS is full
d. The MasterNode is down.

Q4. Identify the correct statement.


Select one:
a. A NameNode node acts as the Slave and is responsible for executing a Task assigned to it by the
JobTracker.
b. MapReduce is based on “Data Locality” feature.
c. All of the mentioned
d. Reduce Task in MapReduce is performed using the Map() function

Q5. Which of the following best describes the workings of TextInputFormat?

a. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReaders of both splits containing the broken line.
b. Input file splits may cross line breaks. A line that crosses file splits is ignored.
c. The input file is split exactly at the line breaks, so each RecordReader will read a series of complete lines.
d. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the end of the broken line.

Q6. Which MapReduce phase is theoretically able to utilize features of the underlying file system
in order to optimize parallel execution?

a. Split
b. Map
c. Combine

Q7. You are running a Hadoop cluster with all monitoring facilities properly configured. Which scenario will go undetected?
a. MapReduce jobs that are causing excessive memory swaps.
b. There is an infinite loop in Map or reduce tasks.
c. HDFS is full
d. The MasterNode is down.

Surprise Quiz 2:
Q1. Which instruction below can be substituted for SORT BY and DISTRIBUTE BY?
a. None
b. GROUP BY
c. ORDER BY
d. CLUSTER BY
Q2. Hive supports triggers.
Select one:
True
False

Q3. Which function is used to store the output in Pig?

a. PigBag
b. Pig Storage
c. Pig Store

Q4. Where is the data behind hive tables stored?

a. HDFS
b. In a traditional database like MYSQL or Oracle

Q5. Where are hive table definitions stored?


a. HDFS
b. On all nodes
c. In a traditional database like MySQL or Oracle

Q6. Assume you have a Pig relation with 10 columns. What happens when you load a dataset with
12 columns in each row?

a. Pig will load the dataset with no issues


b. Pig load instructions will error out

Q7. Which property is used to enable partition?


a. hive.exec.dynamic.partition
b. hive.exec.dyn.part
c. hive.dyn.partition
d. hive.enable.exec.dynamic.partition

Q8. Which of the statements below is correct?


a. Hive enforces schema on read
b. Hive enforces schema on write

Q9. Which of the below is not a reason to Partition a Hive table?


a. To improve performance
b. To target queries on specific portion of the dataset
c. To reduce the size of the data stored in HDFS

Q10. What is the downside of using ORDER BY?


a. ORDER BY involves only one reducer
b. There are no downsides
c. ORDER BY sometimes provides incorrect results
Lab Assignment
Lab Assignment 1
Title: "Installation and Configuration of Cloudera VM Big Data Setup"

Learning Goals
In this activity, students will learn:

• Installation and configuration of VirtualBox for the Big Data Lab.
• Installation and configuration of the Cloudera Virtual Machine (VM) image.
• Launching the Cloudera VM.
• Practice of the Linux commands required for the lab.

Instructions
Please use the following instructions to download and install the Cloudera QuickStart VM with VirtualBox before proceeding to the Getting Started with the Cloudera VM Environment video. The screenshots are from a Mac, but the instructions should be the same for Windows. Please see the discussion boards if you have any issues.

1. Install VirtualBox. Go to https://www.virtualbox.org/wiki/Downloads to download and install VirtualBox for your


computer.

2. Download the Cloudera VM. Download the Cloudera VM from https://downloads.cloudera.com/demo_vm/virtualbox/cloudera-quickstart-vm-5.4.2-0-virtualbox.zip. The VM is over 4GB, so it will take some time to download.

3. Unzip the Cloudera VM:

Right-click cloudera-quickstart-vm-5.4.2-0-virtualbox.zip and select “Extract All…”

4. Start VirtualBox.

5. Begin importing. Import the VM by going to File -> Import Appliance


6. Click the Folder icon.

7. Select the cloudera-quickstart-vm-5.4.2-0-virtualbox.ovf from the Folder where you unzipped the VirtualBox
VM and click Open.
8. Click Next to proceed.

9. Click Import.
10. The virtual machine image will be imported. This can take several minutes.

11. Launch Cloudera VM. When the importing is finished, the quickstart-vm-5.4.2-0 VM will appear on the left in the
VirtualBox window. Select it and click the Start button to launch the VM.
12. Cloudera VM booting. It will take several minutes for the Virtual Machine to start. The booting process takes a
long time since many Hadoop tools are started.

13. The Cloudera VM desktop. Once the booting process is complete, the desktop will appear with a browser.
14. Shutting down the Cloudera VM. Before we can change the settings for the Cloudera VM, the VM needs to be
powered off. If the VM is running, click on System in the top toolbar, and then click on Shutdown:

Next, click on Shut down:


Steps to install the Hadoop
Introduction
Every major industry is implementing Apache Hadoop as the standard framework for processing and storing
big data. Hadoop is designed to be deployed across a network of hundreds or even thousands of dedicated
servers. All these machines work together to deal with the massive volume and variety of incoming datasets.
Deploying Hadoop services on a single node is a great way to get yourself acquainted with basic Hadoop
commands and concepts.
This easy-to-follow guide helps you install Hadoop on Ubuntu 18.04 or Ubuntu 20.04.

Prerequisites
• Access to a terminal window/command line

• Sudo or root privileges on local /remote machines


Install OpenJDK on Ubuntu
The Hadoop framework is written in Java, and its services require a compatible Java Runtime Environment
(JRE) and Java Development Kit (JDK). Use the following command to update your system before initiating
a new installation:
sudo apt update

At the moment, Apache Hadoop 3.x fully supports Java 8. The OpenJDK 8 package in Ubuntu contains
both the runtime environment and development kit.
Type the following command in your terminal to install OpenJDK 8:
sudo apt install openjdk-8-jdk -y
The OpenJDK or Oracle Java version can affect how elements of a Hadoop ecosystem interact. To install a
specific Java version, check out our detailed guide on how to install Java on Ubuntu.
Once the installation process is complete, verify the current Java version:
java -version; javac -version
The output informs you which Java edition is in use.

Set Up a Non-Root User for Hadoop Environment


It is advisable to create a non-root user, specifically for the Hadoop environment. A distinct user improves
security and helps you manage your cluster more efficiently. To ensure the smooth functioning of Hadoop
services, the user should have the ability to establish a passwordless SSH connection with the localhost.
Install OpenSSH on Ubuntu
Install the OpenSSH server and client using the following command:
sudo apt install openssh-server openssh-client -y
In the example below, the output confirms that the latest version is already installed.
If you have installed OpenSSH for the first time, use this opportunity to implement these vital SSH security
recommendations.
Create Hadoop User
Utilize the adduser command to create a new Hadoop user:
sudo adduser hdoop

The username, in this example, is hdoop. You are free to use any username and password you see fit.
Switch to the newly created user and enter the corresponding password:
su - hdoop
The user now needs to be able to SSH to the localhost without being prompted for a password.
Enable Passwordless SSH for Hadoop User
Generate an SSH key pair and define the location it is to be stored in:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
The system proceeds to generate and save the SSH key pair.

Use the cat command to store the public key as authorized_keys in the ssh directory:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Set the permissions for your user with the chmod command:
chmod 0600 ~/.ssh/authorized_keys
The new user is now able to SSH without needing to enter a password every time. Verify everything is set
up correctly by using the hdoop user to SSH to localhost:
ssh localhost
After an initial prompt, the Hadoop user is now able to establish an SSH connection to the localhost
seamlessly.
Download and Install Hadoop on Ubuntu
Visit the official Apache Hadoop project page, and select the version of Hadoop you want to implement.
The steps outlined in this tutorial use the Binary download for Hadoop Version 3.2.1.
Select your preferred option, and you are presented with a mirror link that allows you to download
the Hadoop tar package.

Use the provided mirror link and download the Hadoop package with the wget command:
wget https://downloads.apache.org/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
Once the download is complete, extract the files to initiate the Hadoop installation:
tar xzf hadoop-3.2.1.tar.gz
The Hadoop binary files are now located within the hadoop-3.2.1 directory.

Single Node Hadoop Deployment (Pseudo-Distributed Mode)


Hadoop excels when deployed in a fully distributed mode on a large cluster of networked
servers. However, if you are new to Hadoop and want to explore basic commands or test applications, you
can configure Hadoop on a single node.
This setup, also called pseudo-distributed mode, allows each Hadoop daemon to run as a single Java
process. A Hadoop environment is configured by editing a set of configuration files:
• bashrc

• hadoop-env.sh
• core-site.xml
• hdfs-site.xml
• mapred-site.xml
• yarn-site.xml
Configure Hadoop Environment Variables (bashrc)
Edit the .bashrc shell configuration file using a text editor of your choice (we will be using nano):
sudo nano .bashrc
Define the Hadoop environment variables by adding the following content to the end of the file:
#Hadoop Related Options
export HADOOP_HOME=/home/hdoop/hadoop-3.2.1
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

Once you add the variables, save and exit the .bashrc file.
It is vital to apply the changes to the current running environment by using the following command:
source ~/.bashrc
Edit hadoop-env.sh File
The hadoop-env.sh file serves as a master file to configure YARN, HDFS, MapReduce, and Hadoop-related
project settings.
When setting up a single node Hadoop cluster, you need to define which Java implementation is to be
utilized. Use the previously created $HADOOP_HOME variable to access the hadoop-env.sh file:

sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh


Uncomment the $JAVA_HOME variable (i.e., remove the # sign) and add the full path to the OpenJDK
installation on your system. If you have installed the same version as presented in the first part of this
tutorial, add the following line:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
The path needs to match the location of the Java installation on your system.
If you need help to locate the correct Java path, run the following command in your terminal window:
which javac
The resulting output provides the path to the Java binary directory.

The section of the path just before the /bin/javac directory needs to be assigned to
the $JAVA_HOME variable.

Edit core-site.xml File


The core-site.xml file defines HDFS and Hadoop core properties.
To set up Hadoop in a pseudo-distributed mode, you need to specify the URL for your NameNode, and the
temporary directory Hadoop uses for the map and reduce process.
Open the core-site.xml file in a text editor:
sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml

Add the following configuration to override the default values for the temporary directory and add your
HDFS URL to replace the default local file system setting:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hdoop/tmpdata</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://127.0.0.1:9000</value>
</property>
</configuration>
This example uses values specific to the local system. You should use values that match your system's requirements. The data needs to be consistent throughout the configuration process.

Do not forget to create a Linux directory in the location you specified for your temporary data.
Edit hdfs-site.xml File
The properties in the hdfs-site.xml file govern the location for storing node metadata, fsimage file, and edit
log file. Configure the file by defining the NameNode and DataNode storage directories.
Additionally, the default dfs.replication value of 3 needs to be changed to 1 to match the single node setup.
Use the following command to open the hdfs-site.xml file for editing:
sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Add the following configuration to the file and, if needed, adjust the NameNode and DataNode directories
to your custom locations:
<configuration>
<property>
<name>dfs.name.dir</name>
<value>/home/hdoop/dfsdata/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/hdoop/dfsdata/datanode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
If necessary, create the specific directories you defined for the dfs.data.dir value.
Edit mapred-site.xml File
Use the following command to access the mapred-site.xml file and define MapReduce values:
sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
Add the following configuration to change the default MapReduce framework name value to yarn:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

Edit yarn-site.xml File


The yarn-site.xml file is used to define settings relevant to YARN. It contains configurations for the Node
Manager, Resource Manager, Containers, and Application Master.
Open the yarn-site.xml file in a text editor:
sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
Append the following configuration to the file:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>127.0.0.1</value>
</property>
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>

<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
</configuration>

Format HDFS NameNode


It is important to format the NameNode before starting Hadoop services for the first time:
hdfs namenode -format
The shutdown notification signifies the end of the NameNode format process.
Start Hadoop Cluster
Navigate to the hadoop-3.2.1/sbin directory and execute the following commands to start the NameNode
and DataNode:
./start-dfs.sh
The system takes a few moments to initiate the necessary nodes.

Once the namenode, datanodes, and secondary namenode are up and running, start the YARN resource and
nodemanagers by typing:
./start-yarn.sh
As with the previous command, the output informs you that the processes are starting.

Type this simple command to check if all the daemons are active and running as Java processes:
jps
If everything is working as intended, the resulting list of running Java processes contains all the HDFS and
YARN daemons.

Access Hadoop UI from Browser


Use your preferred browser and navigate to your localhost URL or IP. The default port number 9870 gives
you access to the Hadoop NameNode UI:

http://localhost:9870
The NameNode user interface provides a comprehensive overview of the entire cluster.
The default port 9864 is used to access individual DataNodes directly from your browser:
http://localhost:9864

The YARN Resource Manager is accessible on port 8088:


http://localhost:8088
The Resource Manager is an invaluable tool that allows you to monitor all running processes in your Hadoop
cluster.
Assignment 1
HDFS Linux Practice Commands

File Commands

1. ls Directory listing
2. ls -al Formatted listing with hidden files
3. ls -lt Sorting the Formatted listing by time modification
4. cd dir Change directory to dir
5. cd Change to home directory
6. pwd Show current working directory
7. mkdir dir Creating a directory dir
8. head file Output the first 10 lines of the file
9. tail file Output the last 10 lines of the file
10. touch file Create or update file
11. rm file Deleting the file
12. rm -r dir Deleting the directory
13. rm -f file Force to remove the file
14. rm -rf dir Force to remove the directory dir
15. cp file1 file2 Copy the contents of file1 to file2
16. cp -r dir1 dir2 Copy dir1 to dir2; create dir2 if not present
17. mv file1 file2 Rename or move file1 to file2; if file2 is an existing directory, move file1 into it

Searching
1. grep pattern file Search for pattern in file
2. grep -r pattern dir Search recursively for pattern in dir
3. command | grep Search pattern in the output of a command
pattern
4. locate file Find all instances of file
5. find . -name filename Searches in the current directory (represented by a
period) and below it, for files and directories with
names starting with filename
6. pgrep pattern Searches for all the named processes that match the pattern and, by default, returns their IDs

System Info
1. date Show the current date and time
2. cal Show this month's calendar
3. uptime Show current uptime
4. whoami Who you are logged in as
5. finger user Display information about user
6. cat /proc/cpuinfo Cpu information
7. cat /proc/meminfo Memory information
8. man command Show the manual for command

Compression
1. tar cf file.tar file Create tar named file.tar containing file


2. tar xf file.tar Extract the files from file.tar
3. tar czf file.tar.gz files Create a tar with Gzip compression
4. tar xzf file.tar.gz Extract a tar using Gzip
5. tar cjf file.tar.bz2 Create tar with Bzip2 compression
6. tar xjf file.tar.bz2 Extract a tar using Bzip2
7. gzip file Compresses file and renames it to file.gz
8. gzip -d file.gz Decompresses file.gz back to file
HDFS User/Admin Command

1. hdfs dfs -ls /                                    List files in /
2. hdfs dfs -mkdir /user/test                        Make a directory
3. dd if=/dev/urandom of=sample.txt bs=64M count=1   Create a test file called sample.txt on the local filesystem
4. hdfs dfs -put sample.txt /user/test               Copy a file from the local filesystem into HDFS
5. hdfs fsck /user/test/sample.txt                   Check the health of the file
6. hdfs dfsadmin -report                             Summary report of HDFS
7. more /etc/hadoop/conf/hdfs-site.xml               Check the hdfs-site.xml configuration file
8. more /etc/hadoop/conf/filename                    Check any required Hadoop configuration file

Lab Assignment 2: Running WordCount with Hadoop MapReduce Streaming

CO-Mapping
Lab. No. Title CO1 CO2 CO3
2 -
Outcome

After successful completion of the lab, students will be able:


❖ To learn how to create applications over the Hadoop big data platform.
❖ To examine and control the required configuration files.
❖ To understand the working architecture of the Hadoop MapReduce framework.

The Objective

Word Count case study in Hadoop. Word count is the standard example for understanding the Hadoop MapReduce paradigm: we count the number of occurrences of each word in an input file and produce, as output, the list of words together with the number of occurrences of each word.

Instructions:

1. Open a Terminal (Right-click on Desktop or click Terminal icon in the top toolbar)
2. Review the following to create the python code
Section 1: wordcount_mapper.py

# Write your Python program code logic here



Section 2: wordcount_reducer.py

# Write your Python program code logic here
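For reference, a minimal mapper and reducer in the usual Hadoop Streaming style might look like the sketch below. It only illustrates the expected structure (read from standard input, emit tab-separated key/value pairs); it is not the prescribed solution for this lab.

#!/usr/bin/env python
# wordcount_mapper.py (sketch) - emit "<word> TAB 1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%s' % (word, 1))

#!/usr/bin/env python
# wordcount_reducer.py (sketch) - sum the counts for each word; Hadoop Streaming
# delivers the mapper output to the reducer sorted by key
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print('%s\t%s' % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))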

You can cut and paste the above into a text file as follows from the terminal prompt in
Cloudera VM.

Type in the following to open a text editor, and then cut and paste the above lines for
wordcount_mapper.py into the text editor, save, and exit. Repeat for wordcount_reducer.py

> gedit wordcount_mapper.py

> gedit wordcount_reducer.py

Enter the following to see that the indentations line up as above

> more wordcount_mapper.py

> more wordcount_reducer.py

Enter the following to make it executable

> chmod +x wordcount_mapper.py

> chmod +x wordcount_reducer.py

Enter the following to see what directory you are in

> pwd
3. Create some data:

> echo " data that contains greater variety, arriving in increasing volumes and with more
velocity" > /home/cloudera/testfile1

> echo "Apache Hadoop is an open source framework that is used to efficiently store and
process large datasets " > /home/cloudera/testfile2

4. Create a directory on the HDFS file system (if it already exists, that's OK):
hdfs dfs -mkdir /user/cloudera/input

5. Copy the files from local filesystem to the HDFS filesystem:


hdfs dfs -put /home/cloudera/testfile1 /user/cloudera/input

hdfs dfs -put /home/cloudera/testfile2 /user/cloudera/input

7. Run the Hadoop WordCount example with the input and output specified.
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-input /user/cloudera/input \
-output /user/cloudera/output_new \
-mapper /home/cloudera/wordcount_mapper.py \
-reducer /home/cloudera/wordcount_reducer.py

Hadoop prints out a whole lot of logging or error information. If it runs successfully, you will see something like the following scroll by on the screen:

....

8. Check the output file to see the results:


hdfs dfs -cat /user/cloudera/output_new/part-00000
9. View the output directory:
hdfs dfs -ls /user/cloudera/output_new

Look at the files there and check out the contents, e.g.:

hdfs dfs -cat /user/cloudera/output_new/part-00000

Wordcount_mapper.py
Wordcount_reducer.py

Task → Cloudera terminal output showing successful execution of the command (map 100% and reduce 100%).

Task → Output at the Cloudera terminal



Task → Deployed MapReduce application accessed through the URL



Lab Assignment 3: "Performing Data Analysis for Web Server Log Files using Hadoop MapReduce Framework"

CO-Mapping
Lab. No. CO1 CO2 CO3
3 -
Outcome

After successful completion of the lab, students will be able:


❖ To learn how to create applications using the Hadoop Big Data platform.
❖ To learn how to set the security permissions required to execute configuration files on the Big Data platform.
❖ To perform the data processing required for the Hadoop platform.
❖ To analyze web server log files using the Hadoop MapReduce framework.

The Objective

Web server log files include information about system performance that can be used to determine when additional capacity is needed to optimize the user experience. Log files can help analysts identify slow queries, errors that cause transactions to take too long, or bugs that impact website or application performance. Web server log analysis therefore helps industries boost their search engine rankings. In this lab, we will learn how to analyze web server log files using the Hadoop MapReduce framework.

Lab Problem Statement: Suppose you are working as a Big Data Analyst at CISCO. Your manager has asked you to analyze web server log files using the Big Data Hadoop MapReduce framework. Consider a sample web server log file of Bennett University. Develop a Hadoop MapReduce job and write the Mapper and Reducer code needed to analyze the log file data set and complete the following tasks:

Task 1: Launch and deploy Hadoop Big Data platform at the Cloudera Virtual Machine

Task 2: Process web server data set at the HDFS framework.

Task 3: Write Python source code for the required 'Mapper' and 'Reducer' functions (a minimal sketch is shown after the task list).

Task 4: At deployed Hadoop MapReduce framework, analyze ‘Bennett webserver log file’
data set and answer the following queries:

I. How many requests have been generated to access the Bennett University webpage?
II. Calculate the number of hits generated by user IP address 192.12.68.34.
III. Average processing time taken in the execution of the Mapper, Shuffle, and Reducer phases.
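For Task 3, a minimal Hadoop Streaming style mapper sketch is shown below. It assumes a common Apache-style access log in which the client IP address is the first whitespace-separated field of each line; the actual Bennett log format is not given here, so treat the field positions as assumptions. The reducer can reuse the sum-by-key pattern from the WordCount reducer of Lab Assignment 2 to total the hits per IP address.

#!/usr/bin/env python
# weblog_mapper.py (sketch) - emit "<client_ip> TAB 1" for every request line;
# assumes the client IP is the first field of each log record
import sys

for line in sys.stdin:
    fields = line.split()
    if fields:                       # skip blank lines
        print('%s\t%s' % (fields[0], 1))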

Lab Assignment 4: "How to Process and Analyze Datasets using Apache Pig Hadoop MapReduce Framework"

CO-Mapping
Lab. No. CO1 CO2 CO3
4

Outcome

After successful completion of the lab, students will be able:


❖ To deploy the Cloudera Hadoop platform for Apache Pig.
❖ To learn how to store data using Apache Pig Storage.
❖ To explore the Pig Grunt Shell.
❖ To write and implement Pig Latin scripts to process, analyze and manipulate data sets.

The Objective

Pig is a high-level scripting language that is used with Apache Hadoop. To analyze data using Apache Pig, developers need to write scripts in the Pig Latin language. Pig excels at describing data analysis problems as data flows. Pig is widely used in industry because it supports a rich set of features for manipulating data. In this lab, students will learn how to deploy and write Pig scripts so that they can analyze real-time data sets using the Apache Hadoop Pig MapReduce framework.

Lab Problem Statement:

With the increasing number of e-commerce businesses, there is a need to analyze real-time datasets using Apache Pig. Consider a 'Sales2022.csv' dataset and analyze it using Apache Pig to answer the following queries at the Grunt Shell:
Task 1: Launch and deploy Hadoop Big Data platform at the Cloudera Virtual
Machine.

Task 2: Load and create a Schema for ‘Sales2022.csv’ dataset using Pig Storage
delimiter format.

Task 3: Write Pig Script to group and dump the data by ‘Country’.

Task 4: Write Pig Script to filter and display the data where the city is 'Chicago'.

Task 5: Compose Pig Script to fetch data where payment has been made using
‘Mastercard’

Task 6: Write Pig Script to analyze the number of product sales on transactions date
‘1/21/2022 14:06’

Task 7: Write Pig Script to analyze the sales of Product 1 category.

Task 8: Write Pig Script to identify data where Price is >1200.

Lab Assignment 5: "Movie Data Set Analysis Using Apache Pig Hadoop MapReduce Framework"

CO-Mapping
Lab. No. CO1 CO2 CO3
5
Outcome

After successful completion of the lab, students will be able:


❖ Deploy and configure the Apache Pig Hadoop MapReduce Framework.
❖ To learn how to use Pig Storage Delimiter while storing different real-time datasets
❖ Write and implement Pig Latin scripts to process, analyze and manipulate
multiple data sets.

The Objective

Pig is a high-level scripting language that is used with Apache Hadoop. To analyze data using Apache Pig, developers need to write scripts in the Pig Latin language. Pig excels at describing data analysis problems as data flows. Pig is widely used in industry because it supports a rich set of features for manipulating data. In this lab, students will learn how to deploy, analyze and write Pig scripts for multiple datasets over the Apache Hadoop Pig MapReduce framework.

Lab Problem Statement:

The MovieLens dataset is mostly used for building recommender systems that analyze user movie ratings. In this lab assignment, you have to explore the two MovieLens datasets 'u.data' and 'u.item' to find trends in movie preferences. The data is provided as input to Apache Pig, which analyzes and partitions it based on ratings. Consider these given datasets and write Apache Pig scripts to answer the following queries at the Grunt Shell:

Data set 1: ‘u.data’

Task 1: Launch and deploy Apache Hadoop Big Data platform at the Cloudera
Virtual Machine.

Task 2: Write Pig Script to create a relation schema for movie data set 1, 'u.data'.

Task 3: Write Pig Script to group the ratings by ‘movieId’.

Task 4: Write Pig Script to find the movie with avg rating >4.0.

Data set 2: ‘u.item’


Task 5: Write Pig Script to create a relation schema for movie data set 2, 'u.item'.

Task 6: Write Pig Script to analyze and display the ‘u.item’ data set for movie title
‘Shanghai Triad’ only.

Task 7: Write Pig Script to perform a 'Pig Join Operation' on the 'u.data' and 'u.item' datasets.

Task 8: Find the oldest 3-star movies from the 'u.data' and 'u.item' datasets.
Task 9: Write Pig Script to find the oldest 5-star movies from the 'u.data' and 'u.item' datasets.

Lab Assignment 6: "Apache Hive Query Language (HQL) Use Case - Sports Data Set Analysis"

CO-Mapping
Lab. No. CO1 CO2 CO3
6

Outcome

After successful completion of the lab, students will be able:


❖ Deploy and configure the Apache Hive Hadoop MapReduce Framework.
❖ To learn how to implement Hive Storage Delimiter while storing different real-time
datasets
❖ Write and implement HQL scripts to process, analyze and manipulate multiple
data sets.
The Objective

Apache Hive is a data warehouse system built on top of Hadoop and used to examine structured and semi-structured data. Hive abstracts away the complexity of Hadoop MapReduce. It provides a mechanism to project structure onto the data and to query that data using HQL (Hive Query Language). In this lab, students will learn how to deploy, analyze, and write Hive Query Language (HQL) scripts for analysis of the 'Sports.csv' dataset over the deployed Apache Hadoop MapReduce framework.

Lab Problem Statement for ‘Sports.csv’ dataset

Sports data analytics is used not only in cricket but also in many other sports such as 'Swimming', 'Tennis', 'Shooting', 'Rowing', etc. It can be used to improve overall team performance and maximize winning chances. Consider a 'Sports.csv' dataset of international games which is provided as input to Apache Hive, analyzed, and partitioned based on the different given categories. Write Apache Hive Query Language (HQL) scripts to answer the following queries in Hadoop CLI shell mode:

Data set: ‘Sports.csv’

Task 1: Write HDFS shell command to load the data set at ‘root directory’ from
Apache Hadoop Big Data local file system platform.

Task 2: Write HQL Script to design ‘schema’ and create the required ‘table’ for
‘sports.csv’.

Task 3: Write HQL Script to analyze and list the total number of medals won by each
country for ‘Swimming’ sport category.

Task 4: Write HQL Script to calculate and display the total number of ‘Gold
Medals’ won by India. (Assume all sports category).

Task 5: Write HQL Script to list the number of medals won by ‘China’ in
‘Shooting’.

Task 6: Write HQL Script to calculate and count of the total number of medals each
country won.

Task 7: Write HQL Script to display year and countries name that won medals in
‘Shooting’.
Task 8: Write HQL Script to list the country that won gold and silver medals in
‘Football’.

Lab Assignment 7: "Apache Hive Query Language (HQL) Use Case - University Students Registration Data Set Analysis"

CO-Mapping
Lab. No. CO1 CO2 CO3
7

Outcome

After successful completion of the lab, students will be able:


❖ Deploy and configure the Apache Hive Hadoop MapReduce Framework.
❖ To learn how to implement Hive Storage Delimiter while storing different real-time
datasets
❖ Write and implement HQL scripts to process, analyze and manipulate multiple
data sets.

The Objective

Apache Hive is a data warehouse system built on top of Hadoop and used to examine structured and semi-structured data. Hive abstracts away the complexity of Hadoop MapReduce. It provides a mechanism to project structure onto the data and to query that data using HQL (Hive Query Language). In this lab, students will learn how to deploy, analyze, and write Hive Query Language (HQL) scripts for analysis of the university student registration datasets over the deployed Apache Hadoop MapReduce framework.

Lab Problem Statement

Consider that in Bennett University, the School of Computer Science Engineering and Technology (SCSET) has completed the registration process for the B.Tech 2022-2026 batch. The students have enrolled for different specialization courses such as 'Cloud Computing', 'AI', 'FullStack', 'DS', etc. Assume the 'Batch_Distribution', 'Demo_Missed', 'Demo_Schedule', 'Enquiry' and 'Enroll' datasets for this registration process. Deploy the Hadoop framework over the Cloudera virtual machine and write Apache Hive Query Language (HQL) scripts to answer the following queries for these datasets. Use Hadoop Hive CLI shell mode to complete your task.

Task 1: Write HQL Script to design ‘schema’ and create the required ‘table’ for
‘Batch_Distribution’, ‘Demo_Missed’, ‘Demo_Schedule’, ‘Enquiry’ and ‘Enroll’
datasets.

Task 2: Write HQL Script to list the students who have been enrolled for ‘Weekend
Batch Courses’.

Task 3: Write HQL Script to display the students who have done payment using
‘Googlepay’.

Task 4: Write HQL Script to display the students who have paid the fees in ‘Cash’
mode.

Task 5: Write HQL Script to list the students who have paid complete fees.

Task 6: Write HQL Script to list the students whose fees is pending.

Task 7: Write HQL Script to list the details of students whose 1 installment payment
is pending.

Task 8: Write HQL Script to display the students who have enrolled for ‘Cloud’
batch.

Task 9: Write HQL Script to display the students who have enrolled for ‘FullStack’
batch.

Task 10: Write HQL Script to list the information of students who have missed their
course demo.

Lab Assignment 8: "Analyzing E-commerce Smartphone Diwali 2022 Sales using Apache Hive"
CO-Mapping
Lab. No. CO1 CO2 CO3
8

Outcome

After successful completion of the lab, students will be able:


❖ Deploy and configure the Apache Hive Hadoop MapReduce Framework.
❖ To learn how to implement Hive Storage Delimiter while storing different real-time
datasets
❖ Write and implement HQL scripts to process, analyze and manipulate multiple
data sets.

The Objective

Data has the potential to transform business and drive the creation of business value. The real power of data lies in the use of analytical tools that allow the user to extract useful knowledge and quantify the factors that impact events. Some examples include customer sentiment analysis, e-commerce data set analysis, geo-spatial analysis of key operation centers, etc. In this lab, students will learn how to deploy, analyze, and write Hive Query Language (HQL) scripts for analysis of the 'Smartphone.csv' dataset over the deployed Apache Hadoop MapReduce framework.
Lab Problem Statement

Consider telecom companies that want to analyze their Diwali 2022 smartphone sales datasets '2022_OctSales.csv' and '2022_NovSales.csv' using Apache Hive. As a Big Data Analyst, you have been given the challenge of extracting the customer clickstream and finding hidden information in these real-time datasets. This would help the telecom companies analyze the Diwali smartphone sales and develop a customer product recommendation system. For the given case study, deploy the Hadoop framework over the Cloudera virtual machine and write Apache Hive Query Language (HQL) scripts to answer the following queries.

Task 1: Write HQL Script to design schema and create the required ‘External Table’
for ‘2022_OctSales.csv’ and ‘2022_NovSales.csv’ datasets.
Task 2: Write HQL Script to identify the month in which maximum sales profit has
been generated.

Task 3: Write HQL Script to display the difference in revenue gained from smartphone sales between October and November.

Task 4: Write HQL Script to identify and display the name of company that received
maximum sales in the combined category of Oct and Nov month.

Task 5: Assume from the given data sets, a telephonic company wants to reward
the top 5 users of its website with BONUS credit points of Rs 10,000. Write a query
to generate a list of top 5 users who spend the most time on the website. [Hint: Can
use Dynamic Partitioning]

Task 6: Write HQL Script to identify and display the company whose sales increased from October to November.

Lab Assignment 9: "Customer Order Data Set Analysis using Apache PySpark RDD Interface"

CO-Mapping
Lab. No. CO1 CO2 CO3
9

Outcome

After successful completion of the lab, students will be able:


❖ To deploy and configure the Apache Spark Framework.
❖ To learn how to use RDDs while working on a Spark cluster.
❖ To write and implement PySpark scripts to process, analyze and manipulate multiple data sets.
The Objective

Apache Spark has emerged as the next big thing in the Big Data domain – quickly
rising from an ascending technology to an established superstar in just a matter of
years. Spark allows you to quickly extract actionable insights from large amounts of
data, on a real-time basis, making it an essential tool in many modern businesses.
In this lab, students will learn how to deploy, analyze, and write PySpark Script for
analysis of ‘customerorder.csv’ dataset over the deployed Apache Spark
Framework.

Lab Problem Statement

Consider that Walmart wants to analyze its customer sales dataset 'customer-order.csv' using the Apache Spark Framework. This dataset file contains comma-separated fields: a customer ID, an item ID, and the amount spent on that item. As a Big Data Analyst, you have to write a Python Spark script that consolidates and analyzes this dataset to answer the following queries.

Task 1: Configure and deploy the Apache Spark Framework on your Cloudera VM.

Task 2: Write PySpark script to split the given data set each comma-delimited line
into fields.

Task 3: Write PySpark script to map each line to key/value pairs of customer ID and amount.

Task 4: Write PySpark script to use reduceByKey to add up the amount spent per customer ID.

Task 5: Write PySpark script to collect the results and display them on the Spark shell.
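A compact PySpark RDD sketch covering Tasks 2-5 is shown below for reference. The file path and the column order (customer ID, item ID, amount spent) are assumptions taken from the problem statement; this is not the official solution.

# customer_order_totals.py (illustrative sketch)
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("CustomerOrderTotals")
sc = SparkContext(conf=conf)

# Task 2: split each comma-delimited line into fields
lines = sc.textFile("file:///home/cloudera/customer-order.csv")   # path is an assumption
fields = lines.map(lambda line: line.split(","))

# Task 3: map each line to (customerID, amount) key/value pairs
pairs = fields.map(lambda f: (int(f[0]), float(f[2])))

# Task 4: reduceByKey to add up the amount spent per customer ID
totals = pairs.reduceByKey(lambda a, b: a + b)

# Task 5: collect the results and display them on the Spark shell/driver
for customer, amount in totals.collect():
    print("%d\t%.2f" % (customer, amount))

sc.stop()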

Lab Assignment 10: "Building a Recommendation System using Apache Spark"

CO-Mapping
Lab. No. CO1 CO2 CO3
10

Outcome

After successful completion of the lab, students will be able:


❖ To deploy and configure the Apache Spark Framework.
❖ To learn how to implement a recommendation system while working on a Spark cluster.
❖ To implement PySpark scripts to process, analyze and manipulate multiple data sets.
❖ To implement machine learning libraries and Elasticsearch queries for the recommendation system model.

The Objective

Apache Spark has emerged as the next big thing in the Big Data domain – quickly
rising from an ascending technology to an established superstar in just a matter of
years. Spark allows you to quickly extract actionable insights from large amounts of
data, on a real-time basis, making it an essential tool in many modern businesses.
In this lab, students will learn how to implement recommendation system over the
deployed Apache Spark Framework.

Lab Problem Statement

Companies like Amazon and Netflix provide recommendations based on the user's interests. Using Apache Spark's machine learning capabilities along with Elasticsearch, you are supposed to build a recommendation system. As a Big Data Analyst, you have to write a Python Spark script that consolidates and analyzes this data set to answer the following queries.

Task 1. Load the product dataset into Spark.

Task 2. Use Spark DataFrame operations to clean up the dataset.

Task 3. Load the cleaned data into Elasticsearch.


Task 4. Using Spark MLlib, train a collaborative filtering recommendation model
from the rating data.

Task 5. Save the resulting model data into Elasticsearch.

Task 6. Using Elasticsearch queries, generate recommendations.

Figure 1. Sample Workflow of the Steps for Building Recommendation System
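A minimal PySpark sketch of Tasks 1-5 is given below for orientation. The input file name, the column names (userId, productId, rating) and the ALS hyperparameters are illustrative assumptions; writing the factors to Elasticsearch would additionally require the elasticsearch-hadoop connector on the classpath, so this sketch simply persists them as JSON instead.

# recommender_sketch.py (illustrative sketch, not the official solution)
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("ProductRecommender").getOrCreate()

# Tasks 1-2: load the product ratings and clean them with DataFrame operations
ratings = (spark.read.csv("ratings.csv", header=True, inferSchema=True)   # assumed file
                .select("userId", "productId", "rating")                  # assumed columns
                .dropna()
                .dropDuplicates(["userId", "productId"]))

# Task 4: train a collaborative filtering model with ALS
als = ALS(userCol="userId", itemCol="productId", ratingCol="rating",
          rank=10, maxIter=10, regParam=0.1, coldStartStrategy="drop")
model = als.fit(ratings)

# Task 5: persist the learned item factors (stand-in for loading them into Elasticsearch)
model.itemFactors.write.mode("overwrite").json("item_factors")

# Task 6 preview: top-5 product recommendations per user
model.recommendForAllUsers(5).show(truncate=False)

spark.stop()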

Lab Assignment 11: "Building a Healthcare Prediction Model using Apache Spark"

CO-Mapping
Lab. No. CO1 CO2 CO3
11
Outcome

After successful completion of the lab, students will be able:


❖ To deploy and configure the Apache Spark Framework.
❖ To learn how to implement a prediction system while working on a Spark cluster.
❖ To implement PySpark scripts to process, analyze and manipulate multiple data sets.
❖ To implement Python and machine learning libraries for the prediction system model.

The Objective

Apache Spark has emerged as the next big thing in the Big Data domain – quickly
rising from an ascending technology to an established superstar in just a matter of
years. Spark allows you to quickly extract actionable insights from large amounts of
data, on a real-time basis, making it an essential tool in many modern businesses.
In this lab, students will learn how to implement prediction system over the deployed
Apache Spark Framework.

Lab Problem Statement

Diabetes can lead to serious complications that can affect many different parts of the human body. Consider the diabetes dataset from the healthcare industry that consists of information about patients' diabetes report values. As a Big Data Analyst, you have to write Python Spark scripts that consolidate and analyze this data set to answer the following queries.

Task 1. Load the diabetes dataset into Spark.

Task 2. Perform data cleaning and preparation using Spark Data Frame operations.

Task 3. Perform correlation analysis and feature selection.

Task 4. Using PySpark, build the required prediction model.

Task 5. Visualize the resulting model data in the Spark environment.
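The sketch below outlines one possible PySpark approach to Tasks 1-4. The file name 'diabetes.csv', the label column 'Outcome' and the use of logistic regression follow the commonly used Pima diabetes layout and are assumptions, not requirements of this lab.

# diabetes_prediction_sketch.py (illustrative sketch)
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("DiabetesPrediction").getOrCreate()

# Tasks 1-2: load the dataset and drop incomplete records
df = spark.read.csv("diabetes.csv", header=True, inferSchema=True).dropna()

# Task 3: simple correlation check of each feature against the label column
features = [c for c in df.columns if c != "Outcome"]
for c in features:
    print(c, df.stat.corr(c, "Outcome"))

# Task 4: assemble the feature vector and fit a basic classifier
assembler = VectorAssembler(inputCols=features, outputCol="features")
train, test = assembler.transform(df).randomSplit([0.8, 0.2], seed=42)
model = LogisticRegression(featuresCol="features", labelCol="Outcome").fit(train)

# Task 5: inspect predictions inside the Spark environment
model.transform(test).select("Outcome", "prediction", "probability").show(10)

spark.stop()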



Lab Assignment 12: "Implementing Twitter Streaming Application using PySpark Framework"

CO-Mapping
Lab. No. CO1 CO2 CO3
12

Outcome

After successful completion of the lab, students will be able:


❖ To deploy and configure the Apache Spark Framework.
❖ To learn how to implement and fetch live tweets from the Twitter application.
❖ To implement PySpark scripts to process, analyze and manipulate multiple data sets.
❖ To implement real-time streaming for the deployed Twitter application.

The Objective

Apache Spark has emerged as the next big thing in the Big Data domain – quickly
rising from an ascending technology to an established superstar in just a matter of
years. Spark allows you to quickly extract actionable insights from large amounts of
data, on a real-time basis, making it an essential tool in many modern businesses.
In this lab, students will learn how to implement spark streaming system.

Lab Problem Statement

Apache Spark is an efficient framework for analyzing large amounts of data in real time and performing various analyses on that data. Many resources discuss Spark and its popularity in the big data space, but it is worthwhile to highlight that its core features include real-time big data processing using Resilient Distributed Datasets (RDDs)/DataFrames, streaming, and machine learning. In this lab, you will learn how Spark Streaming components can be used in conjunction with PySpark to solve a business problem.

Task 1. Retrieve Tweets from the Twitter API.

Task 2. Import the necessary packages and create receive-tweets.py, read_tweets.py.

Task 3. Create a Stream Listener instance.

Task 4. Perform and start streaming session using SPARK Data frame.

Task 5. Tweets preprocessing and finding trending #tags.
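For Tasks 4-5, the Spark Streaming side might look like the sketch below. It assumes that a separate receiver script (the receive-tweets.py of Task 2) pushes raw tweet text to a local TCP socket on port 5555, which stands in here for the Twitter API connection of Tasks 1-3; the host, port and batch interval are illustrative choices only.

# read_tweets.py (illustrative sketch of the streaming side)
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="TwitterHashtagTrends")
ssc = StreamingContext(sc, 10)                     # 10-second micro-batches

tweets = ssc.socketTextStream("localhost", 5555)   # fed by the tweet receiver

# Task 5: basic preprocessing and per-batch trending hashtag counts
hashtags = (tweets.flatMap(lambda text: text.split())
                  .filter(lambda word: word.startswith("#"))
                  .map(lambda tag: (tag.lower(), 1))
                  .reduceByKey(lambda a, b: a + b)
                  .transform(lambda rdd: rdd.sortBy(lambda kv: -kv[1])))

hashtags.pprint(10)                                # show the top hashtags of each batch

ssc.start()
ssc.awaitTermination()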

END TERM EXAMINATION ODD SEMESTER 2022-23

COURSE CODE: CSET 371                          MAX. DURATION: 2 HRS
COURSE TITLE: BIG DATA ANALYTICS AND BUSINESS INTELLIGENCE
COURSE CREDIT: 3                               TOTAL MARKS: 35

GENERAL INSTRUCTIONS: -
1. Do not write anything on the question paper except name, enrolment number and
department/school.
2. Carrying mobile phone, smart watch and any other non-permissible materials in the
examination hall is an act of UFM.

COURSE INSTRUCTIONS:
a) All questions are compulsory
b) Write complete syntax for the asked queries.
SECTION A
Max Marks: 15 Marks
Q1.) When is data known as Big Data? Illustrate the five key properties of Big Data. [2 Marks]
Q2.) Justify why Spark is faster than MapReduce. List the 3 functionalities of the Spark Engine. [3 Marks]
Q3.) Draw the architecture of Pig and explain how Pig Latin is converted to MapReduce. [3 Marks]
Q4.) List the property that you can use to enable bucketing in Apache Hive. [1 Mark]
Q5.) Write the property used to increase the buffer size for sorting during the shuffle phase. [2 Marks]
Q6.) Draw the functional architecture of Hive and justify the purpose of the 'driver' module. [1+3=4 Marks]

SECTION B                                                        Max Marks: 10
Q7.) Bring out two key differences between internal and external tables in HQL. [2 Marks]
Q8.) Justify how the StreamingContext object plays a key role in Apache Spark. [2 Marks]
Q9.) As a Big Data Admin, write syntax-free commands for the following queries. [1*6 = 6 Marks]
i) To copy a file from the local file system into HDFS; ii) Pig Latin that can remove duplicate tuples from a table; iii) Hive Query Language (HQL) to replace Col2 with Col3 in a table; iv) Pig Latin to create an external table; v) Spark data frame to read CSV, JSON and Parquet files; vi) PySpark script to create a complete SparkSession.
SECTION C
Max Marks: 10
Q10) Consider the given datasets and write Pig Latin scripts to answer the following queries. [1*5 = 5 Marks]

Table 1. Student.csv
Roll no   Name     CGPA
001       John     3.0
002       Jack     4.0
003       Smith    4.5
004       David    4.2
005       Robert   3.4

Table 2. Dept.csv
Roll no   Dept. Name   University
001       CSE          Harvard
002       EC           Stanford
003       IT           Harvard
004       Civil        Yale
005       CSE          MIT
a) To join two relations namely ‘Student’ and ‘Dept’ based on the ‘Roll no’ column.
b) To display the details of students who belong to MIT and Stanford University.
c) To find the tuples of those students where CGPA is greater than 3.5.
d) To display the details of all students from Harvard University.
e) To partition a relation based on the CGPA acquired by the students: if CGPA >= 4.0, place the tuple into relation X; if CGPA < 4.0, place it into relation Y.

Q11) Consider the following 'Jan.csv' and 'Feb.csv' two-month sales datasets of your organization. Write Apache Hive Query Language (HQL) scripts to answer the following queries. [1*5 = 5 Marks]

Table 1. Sales in Jan 2022
Event_time   Event_type   Product_id   Category_id   Company   Price   Userid   User_session
2022-01-01   Cart         12           5773203       Samsung   1441    43       26dd6e6e
2022-01-02   Cart         45           5773353       Realme    4141    64       49e8d843
2022-01-03   Cart         56           5881589       Nokia     5656    34       49e8d843

Table 2. Sales in Feb 2022
Event_time   Event_type   Product_id   Category_id   Company   Price   Userid   User_session
2022-02-05   view         76           5873203       Apple     9414    23       28dd6e6e
2022-02-01   Cart         89           5873353       Samsung   8414    56       92e8d843
2022-02-04   Purchase     31           5981589       Vivo      6756    47       12e8d843

a) To find the total revenue generated due to purchases made in January.


b) To find the change in revenue generated due to purchases from January to February.
c) To calculate the distinct categories of products. Categories with a null code can be ignored.
d) To calculate the total number of products available in each category.
e) Your company wants to reward the top 10 users of its website with a Golden Customer Plan. Write an HQL query to generate a list of the top 10 users who spend the most.

ALL THE BEST


SOLUTION TO END TERM QUESTION PAPER

SECTION A
Max Marks: 15 Marks
Q1.) When is data known as Big Data? Illustrate the five key properties of Big Data. [2 Marks]
Answer:
Big data is a collection of data from many different sources and is often described by five characteristics: volume, value, variety, velocity, and veracity.

• Volume: the size and amounts of big data that companies manage and analyze
• Value: the most important “V” from the perspective of the business, the value of big data
usually comes from insight discovery and pattern recognition that lead to more effective
operations, stronger customer relationships and other clear and quantifiable business benefits
• Variety: the diversity and range of different data types, including unstructured data, semi-
structured data and raw data
• Velocity: the speed at which companies receive, store and manage data – e.g., the specific
number of social media posts or search queries received within a day, hour or other unit of
time
• Veracity: the “truth” or accuracy of data and information assets, which often determines
executive-level confidence

The additional characteristic of variability can also be considered:

• Variability: the changing nature of the data companies seek to capture, manage and analyze
– e.g., in sentiment or text analytics, changes in the meaning of key words or phrases

Q2.) Justify why Spark is faster than MapReduce. List the three functionalities of the Spark
Engine. [3 Marks]
Answer: Spark's in-memory computation makes it faster than MapReduce: intermediate results are kept in memory across stages instead of being written back to HDFS between jobs (a minimal sketch follows the list below). The Spark Engine is responsible for:
• Interacting with cluster and storage manager.
• Dividing task into sub tasks for parallel execution.
• Scheduling tasks on the cluster.
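
A minimal PySpark sketch of this in-memory behaviour (the file path and filter condition are illustrative placeholders, not from the course material):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CacheDemo").master("local[*]").getOrCreate()

# cache() keeps the RDD partitions in executor memory after the first action,
# so later actions reuse the in-memory data instead of re-reading from disk
# (MapReduce would write intermediate results back to HDFS between jobs).
rdd = spark.sparkContext.textFile("data/sales.csv").cache()

print(rdd.count())                                      # first action: reads the file and fills the cache
print(rdd.filter(lambda line: "Cart" in line).count())  # second action: served from memory

spark.stop()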

Q3.) Draw the architecture of Pig and explain how Pig Latin is converted to MapReduce.
[3 Marks]
Answer:
i. Parser: The parser checks the syntax of the script and performs type checking. The parser's
output is a DAG (directed acyclic graph) that represents the Pig Latin statements and the
logical operators.
ii. Optimizer: The DAG is passed to the logical optimizer, which carries out the logical
optimizations.
iii. Compiler: The compiler then compiles the optimized logical plan into a series of MapReduce
jobs.
iv. Execution engine: Finally, the MapReduce jobs are submitted to Hadoop in sorted order,
and the desired results are produced when these jobs are executed on Hadoop.

Q4.) List the property that you can use to enable bucketing in Apache Hive.
[1 Mark]
Answer: hive.enforce.bucketing (enabled with SET hive.enforce.bucketing = true;)
Q5) Write the property used to increase the buffer size for sorting during the shuffle
phase. [2 Marks]
Answer: mapreduce.task.io.sort.mb

Q6.) Draw the functional architecture of Hive and justify the purpose of the 'driver' module.
[1+3=4 Marks]

• The driver contacts the compiler to validate the Hive query.
• The driver converts Hive queries into a MapReduce program with the help of the compiler.
Step 1. Execute Query: A Hive interface such as the Command Line or Web UI sends the
query to the Driver for execution.
Step 2. Get Plan: The driver takes the help of the query compiler, which parses the query to
check the syntax and build the query plan. The driver converts Hive queries into a MapReduce
program with the help of the compiler.
Step 3. Get Metadata: The compiler sends a metadata request to the Metastore.
Step 4. Send Metadata: The Metastore sends metadata as a response to the compiler.
Step 5. Send Plan: The compiler checks the requirement and resends the plan to the driver.
Up to here, the parsing and compiling of the query is complete.
Step 6. Execute Plan: The driver sends the execution plan to the execution engine.
Step 7. Execute Job: Internally, the execution job is a MapReduce job. The execution engine
sends the job to the JobTracker, which is in the Name node, and it assigns this job to the
TaskTracker, which is in the Data node. Here, the query executes as a MapReduce job.
Step 7.1. Metadata Ops: Meanwhile, during execution, the execution engine can perform
metadata operations with the Metastore. By default, Hive uses a Derby database as the
metastore, but Derby is limited to a single instance of the Hive CLI.
Step 8. Fetch Result: The execution engine receives the results from the Data nodes.
Step 9. Send Results: The execution engine sends those resultant values to the driver.
Step 10. Send Results: The driver sends the results to the Hive interfaces.

SECTION B
Max Marks: 10
Q7) Bring out two key differences between internal and external tables in HQL. [2 Marks]
Answer: When you delete an internal table, both the schema and the data under the warehouse
folder are deleted, whereas for an external table only the schema is lost.

In the case of internal tables, both the table and the data contained in the table are managed
by Hive. That is, we can add/delete/modify any data using Hive. When we DROP the table,
the data also gets deleted along with the table.
Eg: CREATE TABLE tweets (text STRING, words INT, length INT) ROW FORMAT
DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

In the case of external tables, only the table (metadata) is managed by Hive. The data present
in these tables can be at any storage location, such as HDFS. We can't add/delete/modify the
data in these tables through Hive; we can only use the data in these tables via SELECT
statements. When we DROP the table, only the table gets deleted and not the data contained
in it. This is why it is said that only the metadata gets deleted. When we create EXTERNAL
tables, we need to mention the location of the data.

Eg: CREATE EXTERNAL TABLE tweets (text STRING, words INT, length INT) ROW
FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/tweets';

When to use External and Internal Tables


Use managed tables when Hive should manage the lifecycle of the table, or when generating
temporary tables.
Use external tables when files are already present or in remote locations, and the files should
remain even if the table is dropped.

Q8) Justify how the StreamingContext object plays a key role in Apache Spark. [2 Marks]

Answer:
The StreamingContext is the main entry point for Spark Streaming functionality. It provides
the methods used to create DStreams from various input sources. It can be created either by
providing a Spark master URL and an appName, from an org.apache.spark.SparkConf
configuration (see the core Spark documentation), or from an existing
org.apache.spark.SparkContext.
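
A minimal PySpark sketch (using the classic DStream API; the host, port and batch interval are illustrative assumptions) showing a StreamingContext created from an existing SparkContext:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "StreamingDemo")
ssc = StreamingContext(sc, 5)                       # StreamingContext with a 5-second batch interval

lines = ssc.socketTextStream("localhost", 9999)     # DStream from a socket source
counts = lines.flatMap(lambda l: l.split(" ")).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
counts.pprint()                                     # print each batch's word counts

ssc.start()                                         # start receiving data
ssc.awaitTermination()                              # run until the streaming job is stopped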

Answer 9:
i) hdfs dfs -copyFromLocal /path1 /path2 ... /pathn /destination
ii) The Apache Pig DISTINCT operator is used to remove duplicate tuples in a relation.
Initially, Pig sorts the given data and then eliminates duplicates.

E.g. grunt> Result = DISTINCT A;

iii) hive> ALTER TABLE <tablename> CHANGE Col2 Col3 <data type>;
(Alternatively, ALTER TABLE ... REPLACE COLUMNS can be used by listing the full column set with Col3 in place of Col2.)

iv) create external table if not exists [external-table-name] ([column1-name] [column1-type],
[column2-name] [column2-type], …) comment '[comment]' row format [format-type]
fields terminated by '[termination-character]' stored as [storage-type] location '[location]';

v) flightTimeCsvDF = spark.read \
.format("csv") \
.option("header", "true") \
.schema(flightSchemaStruct) \
.option("mode", "FAILFAST") \
.option("dateFormat", "M/d/y") \
.load("data/flight*.csv")
JSON:
flightTimeJsonDF = spark.read \
.format("json") \
.schema(flightSchemaDDL) \
.option("dateFormat", "M/d/y") \
.load("data/flight*.json")

Parquet:
flightTimeParquetDF = spark.read \
.format("parquet") \
.load("data/flight*.parquet")

vi)
import sys
from pyspark.sql import SparkSession

# Create a complete SparkSession (app name and master URL are illustrative)
spark = SparkSession.builder.appName("HelloSparkSQL").master("local[3]").getOrCreate()

Answer 10:
a) A = LOAD 'student.csv' USING PigStorage(',') AS (Rollno:int, Name:chararray, CGPA:float);
B = LOAD 'Dept.csv' USING PigStorage(',') AS (Rollno:int, DeptName:chararray, University:chararray);
C = JOIN A BY Rollno, B BY Rollno;
DUMP C;

b) D = FILTER C BY (University == 'MIT') OR (University == 'Stanford');
DUMP D;

c) E = FILTER A BY CGPA > 3.5;
DUMP E;

d) F = FILTER C BY University == 'Harvard';
DUMP F;

e) SPLIT A INTO X IF CGPA >= 4.0, Y IF CGPA < 4.0;
DUMP X;
DUMP Y;

Answer 11:
a) -- Assuming Jan.csv and Feb.csv are loaded into a single Hive table, here named CosmeticStore
SELECT SUM(price) AS Total_Revenue_Jan
FROM CosmeticStore
WHERE date_format(event_time, 'MM') = '01'
AND event_type = 'purchase';

b) WITH Total_Monthly_Revenue AS (
SELECT
SUM(CASE WHEN date_format(event_time, 'MM') = '01' THEN price ELSE 0 END) AS Jan_Revenue,
SUM(CASE WHEN date_format(event_time, 'MM') = '02' THEN price ELSE 0 END) AS Feb_Revenue
FROM CosmeticStore
WHERE event_type = 'purchase'
AND date_format(event_time, 'MM') IN ('01', '02')
)
SELECT Feb_Revenue, Jan_Revenue, (Feb_Revenue - Jan_Revenue)
AS Difference_Of_Revenue FROM Total_Monthly_Revenue;

c) SELECT DISTINCT SPLIT(category_code, '\\.')[0] AS Categories
FROM CosmeticStore
WHERE SPLIT(category_code, '\\.')[0] <> '';
d) SELECT SPLIT(category_code,'\\.')[0] AS Categories, COUNT (product_id) AS
Count_Of_Products
FROM CosmeticStore
WHERE SPLIT(category_code,'\\.')[0]<>''
GROUP BY SPLIT(category_code,'\\.')[0]
ORDER BY Count_Of_Products DESC;

e).
--To create Partitioning and bucketing
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.enforce.bucketing=true;

--Creating table with partition on event_type and clustering on price


CREATE TABLE IF NOT EXISTS Dynamic_Part_Cluster_CosmeticStore(
event_time timestamp, product_id string, category_id string, category_code string,
brand string, price float, user_id bigint, user_session string
)
PARTITIONED BY (event_type string)
CLUSTERED BY (price) INTO 7 BUCKETS
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE;

-- Adding data from CosmeticStore table into partitioned and clustered table.
INSERT INTO TABLE Dynamic_Part_Cluster_CosmeticStore
PARTITION (event_type)
SELECT event_time, product_id, category_id, category_code, brand, price, user_id,
user_session, event_type
FROM CosmeticStore;

--Checking whether the table has been successfully created and loaded with the data
EXIT;
hadoop fs -ls /user/hive/warehouse/Dynamic_Part_Cluster_CosmeticStore

--Checking if the partitions (event_type=purchase) have been created successfully
hadoop fs -ls /user/hive/warehouse/Dynamic_Part_Cluster_CosmeticStore/event_type=purchase

--Checking if the partitions (event_type=cart) have been created successfully
hadoop fs -ls /user/hive/warehouse/Dynamic_Part_Cluster_CosmeticStore/event_type=cart

--Checking if the partitions (event_type=remove_from_cart) have been created successfully
hadoop fs -ls /user/hive/warehouse/Dynamic_Part_Cluster_CosmeticStore/event_type=remove_from_cart

--Checking if the partitions (event_type=view) have been created successfully
hadoop fs -ls /user/hive/warehouse/Dynamic_Part_Cluster_CosmeticStore/event_type=view
--Re-entering Hive
hive

--Query for part (e): top 10 users who spent the most
SELECT user_id, SUM(price) AS Total_Spend
FROM Dynamic_Part_Cluster_CosmeticStore
WHERE event_type='purchase'
GROUP BY user_id
ORDER BY Total_Spend DESC
LIMIT 10;

Hackathon Problem Statement 1

“Industries aim for a carbon-footprint-free environment, but we know this is difficult
because emissions are hard to track. A complete analysis is therefore all-encompassing and
includes direct and indirect emissions, known as scopes. The analysis determines the overall
amount of carbon dioxide and other greenhouse gases emitted, which will help existing
industries calculate and even predict their emissions for the next few years.”

• Introduction

Corporates and investors in the manufacturing industry have shown increased interest in
carbon footprint analysis, which indicates how a given manufacturing process contributes to
the carbon intensity of the environment. Emissions are directly related to energy costs, so this
tracking has tangible relevance to the profitability of a company. A clear grasp of carbon
emissions can reveal inefficiencies within a company, yet this remains one of the most ignored
issues in Indian manufacturing industries. At a minimum, we have created a tool that helps an
individual organization better understand the company's emissions and obtain a CO2 index
value, giving it a competitive advantage among peers.

• Objectives

1. From an investor's point of view, CO2 emission tracking helps in sustainable development,
which is one of the most discussed topics in the manufacturing industry.

2. The global scientific community has reached a consensus goal of stabilizing the earth's
climate to stay within two degrees. Our project can play a small role in this mission.

3. The analysis does not cover the full business cycle, so some data may have to be added
manually. The advantage of doing so is a better understanding of the emission chronology
department-wise.

4. Since a carbon footprint alone does not provide a comprehensive approach, we have trained
a model on the dataset to analyze and predict the next two years of emissions using data
science techniques.

5. Concluding with reduction targets makes it easier to achieve a company's reduction target
within the set deadline.

• Tables

Hardware    Personal machine, i.e., laptops
Software    Spyder, Anaconda Navigator, Tableau, CLI, Terminal and Python IDE
Technique   Regression model with a multi-regressor pipeline approach

• Implementation Methodology along with Flowchart

Scope 1: Direct greenhouse gas emissions from sources that are owned or controlled by a
company.

Scope 2: Indirect greenhouse gas emissions resulting from the generation of electricity, heat
or steam purchased by a company.

Scope 3: Indirect greenhouse gas emissions from sources not owned or directly controlled by
a company but related to the company's activities.

At a minimum, a carbon footprint analysis should include all Scope 1 (direct) and Scope 2
(indirect) carbon and methane emissions. Due to the difficulties in obtaining comparable and
quantifiable data, Scope 3 emissions (from other emissions-generating activities such as the
supply chain) are not typically included. Investors should be able to conduct more robust
analyses of all major GHG emissions as companies improve their disclosure. At this time, a
carbon footprint analysis cannot provide a complete picture of a given company's or
investment portfolio's overall environmental impacts. In our opinion, it is critical to look
further, toward the broader benefits associated with those businesses, in order to gain a better
understanding of their broader portfolio and economic impact.
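
A trivial Python sketch of how the minimum footprint is aggregated under these scope definitions (the scope values and function name are hypothetical, for illustration only):

def carbon_footprint(scope1_tonnes, scope2_tonnes, scope3_tonnes=None):
    # Minimum analysis: Scope 1 (direct) + Scope 2 (purchased energy) emissions.
    total = scope1_tonnes + scope2_tonnes
    # Scope 3 is added only when comparable, quantifiable data is available.
    if scope3_tonnes is not None:
        total += scope3_tonnes
    return total

print(carbon_footprint(1200.0, 800.0))          # minimum footprint: Scopes 1 and 2
print(carbon_footprint(1200.0, 800.0, 450.0))   # fuller picture including Scope 3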

• Running Source Code Repository Link

Here is the running source code link:
https://github.com/bennettswallaby/Carbonated-CO2-Emissions-Tracker/tree/main

• Results with Screenshot

• Step 1: Installing dependencies
Figure 1: Importing Dependencies

• Step 2: Making scope
Figure 2: Scaling Emissions

• Step 3: Making an RDD DataFrame for the scopes

• Step 4: Creating a Spark session

• Step 5: Importing RandomForest for the regression model

• Step 6: Using TensorFlow for scope regression

• Step 7: Creating an Azure bucket

• Step 8: Calculating final emissions
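
The regression stage (Steps 3-6 above) can be sketched in PySpark as below. This is a minimal illustrative sketch, not the project's actual code: the input file "scope_data.csv" and the column names "energy_kwh", "fuel_litres" and "co2_emissions" are assumed placeholders, and Spark MLlib's RandomForestRegressor stands in for the multi-regressor pipeline.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import RandomForestRegressor

# Step 4: create the Spark session
spark = SparkSession.builder.appName("CO2EmissionRegression").getOrCreate()

# Step 3: load the (hypothetical) scope dataset into a DataFrame
df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("scope_data.csv")

# Steps 5-6: assemble feature columns and fit a RandomForest regression model
assembler = VectorAssembler(inputCols=["energy_kwh", "fuel_litres"], outputCol="features")
rf = RandomForestRegressor(featuresCol="features", labelCol="co2_emissions", numTrees=50)
model = Pipeline(stages=[assembler, rf]).fit(df)

# Step 8: predict emissions for the records
model.transform(df).select("co2_emissions", "prediction").show(5)

spark.stop()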

• Data Visualization with Power BI and React.



• Conclusion

Taking scope 3 or downstream impacts into account for both suppliers and consumers, in our
opinion, aids in understanding total environmental impact. It can also help businesses assess
where there may be opportunities to identify and engage suppliers who are leaders or laggards
in this field. Companies concerned with scope 3 emissions can improve product recycling and
disposal.

Overall, addressing scope 3 emissions can assist businesses in identifying and promoting
energy efficiency opportunities outside of their own operations, including their
products/services after they have been sold. Similarly, focusing on downstream effects can
assist investors in identifying investment opportunities. We provide a few examples below to
demonstrate how businesses and investors can have a broader, positive impact.

6. Mapping of Questions to Course Outcomes (COs)

7. Registration list of the students

Sno Enroll No Student Name


1 E20CSE344 AARYAMAN YADAV

2 E20CSE077 ABHINAV CHAUDHARY

3 E20CSE055 ALTAMASH ALAM

4 E20CSE429 APARAJITA MEHROTRA

5 E20CSE186 BATCHU SRINIVAS VISHNU VAMSI

6 E20CSE168 D HARSHA VARDHAN REDDY

7 E20CSE287 DAKSH GARG

8 E20CSE296 DANISH ARA YAKTA

10 E20CSE044 DIVYANSH PALIA

11 E20CSE316 Gadiraju Venkata Siva Srirama Rushit Varma

12 E20CSE244 GUNIKA DHINGRA

13 E20CSE470 HARSHITA JAIN

14 E20CSE460 ISHAN VARSHNEY


15 E20CSE216 MANPREETH SAI MUTHYALA

16 E20CSE148 MOIZ ANWAR

17 E20CSE067 PARITOSH TRIPATHI

18 E20CSE332 PARVATHANENI VIJAYA LAKSHMI

19 E20CSE040 PRATEEK KUMAR

20 E20CSE479 PRIYAM SRIVASTAVA

21 E20CSE338 PUSHPENDER SHARMA

22 E20CSE203 RISHABH .

23 E20CSE400 SAMARTH SHARMA

24 E20CSE122 SASHANK DURBHA

25 E20CSE300 SHAGUN KADAM

26 E20CSE201 SHREYA CHATTERJEE

27 E20CSE249 SIDDANSH CHAWLA

28 E20CSE199 SIKHAKOLLI VENU

29 E20CSE517 SUMRAN TALREJA

30 E20CSE382 TANISHQ YADAV

31 E20CSE402 VAMSI BITRA


8. Result Statistics and result distribution

9. Course assessment and review by Faculty

• During the entire course and at its end, it was found that the course was very useful for the
students because it helped them understand the key concepts of Big Data in terms of Hadoop,
Pig and Hive, and their implementation.

• The course covered almost all the aspects of Big Data and Business Intelligence.

• The students found the Hackathon assignments useful and interesting for improving their
knowledge and presentation skills.

• Both candidates completed the Infosys certification on Big Data and enjoyed this MOOC
course.


10. Student feedback / End of Course Survey

Category           BCA
CLO                5 out of 5
Mid Sem Feedback   4.5 out of 5

11. Special efforts and measures taken to improve the class’s learning

• During lectures, real-world practical demonstrations helped to maintain the students'
interaction consistently.

• Videos were used to take short breaks during the sessions.

• The focus was on creating a natural exchange of thoughts, with an emphasis on opportunities
as a big data professional.

• Used Mentimeter, open-source tools and industry-oriented MOOC courses.

• Hackathon sponsored by DataCouch Private Ltd.

• The top 3 hackathon winner teams received prizes of Rs 10,000/-, Rs 6,000/- and Rs 4,000/-.

• Free coupons for the industry-ready course 'JUST ENOUGH PYTHON FOR ML' (worth
Rs 10,000/-) were given to all participants.

• Industry tie-up with DataCouch Private Ltd for Hackathon sponsorship and internship
opportunities.
