CSET 371 Course File
For
Program : BTech
L-T-P : 2– 0 – 2
Credits :3
School : SCSET
Level : UG
Course File Format
Detailed Syllabus
Module 1 (8 hours)
Big Data Analytics: Data and Relations, Business Intelligence, Business Intelligence vs Business Analytics, the Why, What, and How of BI, OLTP vs OLAP, Ethics in Business Intelligence, Big Data Technology Components, Structured/Unstructured and Streaming Data, Stream Data Types and Computing, Real-Time Analysis of Big Data, Big Data Architecture, Big Data Warehouse.
Module 2 (6 hours)
Introduction to Hadoop, Hadoop high level architecture, Processing data with Hadoop, HDFS, Design of
HDFS, NameNodes and DataNodes, MapReduce, Mapper and Reducer function, Analysis of Real
Time Data using MapReduce.
Module 3 (6 hours)
Hadoop Ecosystem: Pig Overview, Pig Grunt Shell, Use cases for Pig-ETL Processing, Pig Relational
Operators, Hive, Hive file format, HBase, Architecture of Hive and HBase.
Module 4 (8 hours)
HQL, Associations and Joins, Aggregate function, Polymorphic queries, Clauses, Subqueries, Spark,
Core, Spark SQL, Spark RDD, Deployment and Integration, Spark GraphX and Graph Analytics,
Functional vs Procedural programming models, NoSQL, Use of Tableau, data source and worksheet,
Big Data Predictive Analysis, Research Topics in Big Data Analytics.
TEXTBOOKS/LEARNING RESOURCES:
1. Peter Ghavami, Big Data Analytics Methods (2nd ed.), De Gruyter, 2020. ISBN 9781547417951.
2. Seema Acharya, Data Analytics using R (1st ed.), New York: McGraw-Hill Education, 2018. ISBN 9352605241.
REFERENCE BOOKS/LEARNING RESOURCES:
1. Ana Azevedo and Manuel Filipe Santos, Integration Challenges for Analytics, Business Intelligence, and Data Mining (1st ed.), Engineering Science Reference, 2020. ISBN 9781799857832.
EVALUATION POLICY
Quiz 5%
Certification/Assignment 20%
Total 100%
Lecture Plan (excerpt): Lecture No. and Topic(s) (weight)
Structured/unstructured (10)
8: HDFS (25)
14: Hive (15); HBase (15)
16: Architecture of Hive and HBase (30)
17: HQL (10); Associations and Joins (20); Aggregate function (15); Polymorphic queries (15)
18: Clauses (15); Subqueries (15)
19: Assessment/Buffer Lecture*
21: Spark (10); Core (15); Spark SQL (20)
Lab 13: Buffer Lab
6. Assessment Materials
Surprise Quiz 1:
a. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReaders of both splits containing the broken line.
b. Input file splits may cross line breaks. A line that crosses file splits is ignored.
c. The input file is split exactly at the line breaks, so each RecordReader will read a series of complete lines.
d. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the end of the broken line.
Q6. Which MapReduce phase is theoretically able to utilize features of the underlying file system
in order to optimize parallel execution?
a. Split
b. Map
c. Combine
Q7. You are running a Hadoop cluster with all monitoring facilities properly configured. Which scenario will go undetected?
a. MapReduce jobs that are causing excessive memory swaps.
b. There is an infinite loop in Map or reduce tasks.
c. HDFS is full
d. The MasterNode is down.
Surprise Quiz 2:
Q1. Which instruction below can be used in place of both SORT BY and DISTRIBUTE BY?
a. None
b. GROUP BY
c. ORDER BY
d. CLUSTER BY
Q2. Hive supports triggers.
Select one:
True
False
a. PigBag
b. Pig Storage
c. Pig Store
a. HDFS
b. In a traditional database like MySQL or Oracle
Q6. Assume you have a Pig relation with 10 columns. What happens when you load a dataset with
12 columns in each row?
Learning Goals
In this activity, students will learn:
Instructions
Please use the following instructions to download and install the Cloudera Quickstart VM with VirtualBox
before proceeding to the Getting Started with the Cloudera VM Environment video. The screenshots are
from a Mac but the instructions should be the same for Windows. Please see the discussion boards if you
have any issues.
4. Start VirtualBox.
7. Select the cloudera-quickstart-vm-5.4.2-0-virtualbox.ovf from the Folder where you unzipped the VirtualBox
VM and click Open.
8. Click Next to proceed.
9. Click Import.
10. The virtual machine image will be imported. This can take several minutes.
11. Launch Cloudera VM. When the importing is finished, the quickstart-vm-5.4.2-0 VM will appear on the left in the
VirtualBox window. Select it and click the Start button to launch the VM.
12. Cloudera VM booting. It will take several minutes for the Virtual Machine to start. The booting process takes a
long time since many Hadoop tools are started.
13. The Cloudera VM desktop. Once the booting process is complete, the desktop will appear with a browser.
14. Shutting down the Cloudera VM. Before we can change the settings for the Cloudera VM, the VM needs to be
powered off. If the VM is running, click on System in the top toolbar, and then click on Shutdown:
Introduction
Every major industry is implementing Apache Hadoop as the standard framework for processing and storing
big data. Hadoop is designed to be deployed across a network of hundreds or even thousands of dedicated
servers. All these machines work together to deal with the massive volume and variety of incoming datasets.
Deploying Hadoop services on a single node is a great way to get yourself acquainted with basic Hadoop
commands and concepts.
This easy-to-follow guide helps you install Hadoop on Ubuntu 18.04 or Ubuntu 20.04.
Prerequisites
• Access to a terminal window/command line
At the moment, Apache Hadoop 3.x fully supports Java 8. The OpenJDK 8 package in Ubuntu contains
both the runtime environment and development kit.
Type the following command in your terminal to install OpenJDK 8:
sudo apt install openjdk-8-jdk -y
The OpenJDK or Oracle Java version can affect how elements of a Hadoop ecosystem interact. To install a
specific Java version, check out our detailed guide on how to install Java on Ubuntu.
Once the installation process is complete, verify the current Java version:
java -version; javac -version
The output informs you which Java edition is in use.
The username, in this example, is hdoop. You are free to use any username and password you see fit.
Switch to the newly created user and enter the corresponding password:
su - hdoop
The user now needs to be able to SSH to the localhost without being prompted for a password.
Enable Passwordless SSH for Hadoop User
Generate an SSH key pair and define the location where it is to be stored:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
The system proceeds to generate and save the SSH key pair.
Use the cat command to append the public key to authorized_keys in the .ssh directory:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Set the permissions for your user with the chmod command:
chmod 0600 ~/.ssh/authorized_keys
The new user is now able to SSH without needing to enter a password every time. Verify everything is set
up correctly by using the hdoop user to SSH to localhost:
ssh localhost
After an initial prompt, the Hadoop user is now able to establish an SSH connection to the localhost
seamlessly.
Download and Install Hadoop on Ubuntu
Visit the official Apache Hadoop project page, and select the version of Hadoop you want to implement.
The steps outlined in this tutorial use the Binary download for Hadoop Version 3.2.1.
Select your preferred option, and you are presented with a mirror link that allows you to download
the Hadoop tar package.
Use the provided mirror link and download the Hadoop package with the wget command:
wget https://downloads.apache.org/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
Once the download is complete, extract the files to initiate the Hadoop installation:
tar xzf hadoop-3.2.1.tar.gz
The Hadoop binary files are now located within the hadoop-3.2.1 directory.
To set up Hadoop in pseudo-distributed (single-node) mode, configure the following files:
• hadoop-env.sh
• core-site.xml
• hdfs-site.xml
• mapred-site.xml
• yarn-site.xml
Configure Hadoop Environment Variables (bashrc)
Edit the .bashrc shell configuration file using a text editor of your choice (we will be using nano):
sudo nano .bashrc
Define the Hadoop environment variables by adding the following content to the end of the file:
#Hadoop Related Options
export HADOOP_HOME=/home/hdoop/hadoop-3.2.1
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
Once you add the variables, save and exit the .bashrc file.
It is vital to apply the changes to the current running environment by using the following command:
source ~/.bashrc
Edit hadoop-env.sh File
The hadoop-env.sh file serves as a master file to configure YARN, HDFS, MapReduce, and Hadoop-related
project settings.
When setting up a single node Hadoop cluster, you need to define which Java implementation is to be utilized. Use the previously created $HADOOP_HOME variable to access the hadoop-env.sh file:
sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Uncomment the $JAVA_HOME line and add the full path to your OpenJDK installation:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
The path needs to match the location of the Java installation on your system.
If you need help locating the correct Java path, run the following command in your terminal window:
which javac
The resulting output provides the path to the Java binary. If it is a symbolic link, resolve it with readlink -f $(which javac). The section of the path just before /bin/javac is the value to assign to the $JAVA_HOME variable.
Edit core-site.xml File
Open the core-site.xml file for editing:
sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml
Add the following configuration to override the default values for the temporary directory and add your HDFS URL to replace the default local file system setting:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hdoop/tmpdata</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://127.0.0.1:9000</value>
</property>
</configuration>
This example uses values specific to the local system. You should use values that match your system's requirements. The data needs to be consistent throughout the configuration process.
Do not forget to create a Linux directory in the location you specified for your temporary data.
Edit hdfs-site.xml File
The properties in the hdfs-site.xml file govern the location for storing node metadata, fsimage file, and edit
log file. Configure the file by defining the NameNode and DataNode storage directories.
Additionally, the default dfs.replication value of 3 needs to be changed to 1 to match the single node setup.
Use the following command to open the hdfs-site.xml file for editing:
sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Add the following configuration to the file and, if needed, adjust the NameNode and DataNode directories
to your custom locations:
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/hdoop/dfsdata/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/hdoop/dfsdata/datanode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
If necessary, create the specific directories you defined for the dfs.namenode.name.dir and dfs.datanode.data.dir values.
Edit mapred-site.xml File
Use the following command to access the mapred-site.xml file and define MapReduce values:
sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
Add the following configuration to change the default MapReduce framework name value to yarn:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Edit yarn-site.xml File
The yarn-site.xml file holds the YARN (ResourceManager and NodeManager) settings. Open it for editing:
sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
Among the properties to add is the NodeManager environment whitelist, which must include the Hadoop variables defined earlier:
<configuration>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
</configuration>
Once the NameNode, DataNodes, and SecondaryNameNode are up and running, start the YARN ResourceManager and NodeManagers by typing:
./start-yarn.sh
As with the previous command, the output informs you that the processes are starting.
Type this simple command to check if all the daemons are active and running as Java processes:
jps
If everything is working as intended, the resulting list of running Java processes contains all the HDFS and
YARN daemons.
The default port 9870 gives access to the NameNode user interface:
http://localhost:9870
The NameNode user interface provides a comprehensive overview of the entire cluster.
The default port 9864 is used to access individual DataNodes directly from your browser:
http://localhost:9864
File Commands
1. ls Directory listing
2. ls -al Formatted listing with hidden files
3. ls -lt Formatted listing sorted by modification time
4. cd dir Change directory to dir
5. cd Change to home directory
6. pwd Show current working directory
7. mkdir dir Creating a directory dir
8. head file Output the first 10 lines of the file
9. tail file Output the last 10 lines of the file
10. touch file Create or update file
11. rm file Deleting the file
12. rm -r dir Deleting the directory
13. rm -f file Force to remove the file
14. rm -rf dir Force to remove the directory dir
15. cp file1 file2 Copy the contents of file1 to file2
16. cp -r dir1 dir2 Copy dir1 to dir2; create dir2 if it does not exist
17. mv file1 file2 Rename file1 to file2, or move file1 into file2 if file2 is an existing directory
Searching
1. grep pattern file Search for pattern in file
2. grep -r pattern dir Search recursively for pattern in dir
3. command | grep pattern Search for pattern in the output of a command
4. locate file Find all instances of file
5. find . -name filename Search the current directory (represented by a period) and below it for files and directories whose names start with filename
6. pgrep pattern Search for all named processes matching the pattern and, by default, return their IDs
System Info
1. date Show the current date and time
2. cal Show this month's calendar
3. uptime Show current uptime
4. whoami Show who you are logged in as
5. finger user Display information about user
6. cat /proc/cpuinfo CPU information
7. cat /proc/meminfo Memory information
8. man command Show the manual for command
CO-Mapping
Lab. No. Title CO1 CO2 CO3
2 -
Outcome
The Objective
Word Count case study in Hadoop. Word count is the standard example used to understand the Hadoop MapReduce paradigm: we count the number of occurrences of each word in an input file and output the list of words together with the count for each word.
Instructions:
1. Open a Terminal (Right-click on Desktop or click Terminal icon in the top toolbar)
2. Review the following to create the Python code
Section 1: wordcount_mapper.py
Section 2: wordcount_reducer.py
You can cut and paste the mapper and reducer code (sketched below) into text files from the terminal prompt in the Cloudera VM. Type the following to open a text editor, then paste the lines for wordcount_mapper.py into the editor, save, and exit. Repeat for wordcount_reducer.py.
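The code for the two sections is not reproduced in this file. What follows is a minimal sketch of a typical Hadoop Streaming word-count mapper and reducer in Python, assuming whitespace-separated words and tab-separated key/value output (the exact listings used in class may differ):

Section 1 sketch: wordcount_mapper.py
#!/usr/bin/env python
# Reads lines from stdin and emits "<word><TAB>1" for every word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word.lower(), 1))

Section 2 sketch: wordcount_reducer.py
#!/usr/bin/env python
# Reads the shuffled "<word><TAB>count" pairs from stdin (sorted by key)
# and sums the counts for each word.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    word, _, count = line.partition("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

Remember to make both scripts executable (chmod +x wordcount_mapper.py wordcount_reducer.py) before running the streaming job.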
> pwd
3. Create some data:
> echo " data that contains greater variety, arriving in increasing volumes and with more
velocity" > /home/cloudera/testfile1
> echo "Apache Hadoop is an open source framework that is used to efficiently store and
process large datasets " > /home/cloudera/testfile2
4. Create a directory on the HDFS file system (if it already exists, that's OK):
hdfs dfs -mkdir /user/cloudera/input
7. Run the Hadoop WordCount example with the input and output specified.
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-input /user/cloudera/input \
-output /user/cloudera/output_new \
-mapper /home/cloudera/wordcount_mapper.py \
-reducer /home/cloudera/wordcount_reducer.py
Hadoop prints out a large amount of logging and error information. If the job runs successfully, you will see output like the following scroll by on the screen:
....
Look at the files there and check out the contents, e.g.:
Wordcount_mapper.py
Wordcount_reducer.py
Task → Capture the Cloudera terminal output showing successful execution of the command, with map 100% and reduce 100%.
Lab Assignment 3:
CO-Mapping
Lab. No. CO1 CO2 CO3
3 -
Outcome
The Objective
Web server log files include information about system performance that can be used to determine when additional capacity is needed to optimize the user experience. Log files can help analysts identify slow queries, errors that cause transactions to take too long, or bugs that impact website or application performance. Web server log analysis therefore helps industries boost their search engine rankings. In this lab, we will learn how to analyze web server log files using the Hadoop MapReduce framework.
Lab Problem Statement: Suppose you are working as a Big Data Analyst at Cisco. Your manager has asked you to analyze web server log files using the Big Data Hadoop MapReduce framework. Consider a sample web server log file of Bennett University. Develop a Hadoop MapReduce job and write Mapper and Reducer code that analyzes the log file data set to complete the following tasks:
Task 1: Launch and deploy Hadoop Big Data platform at the Cloudera Virtual Machine
Task 3: Write python source code for the required ‘Mapper’ and ‘Reducer function’.
Task 4: At deployed Hadoop MapReduce framework, analyze ‘Bennett webserver log file’
data set and answer the following queries:
I. How many requests have been generated to access the Bennett University webpage?
II. Calculate the number of hits generated by the user IP address 192.12.68.34.
III. Average processing time taken in the execution of the Mapper, Shuffle, and Reducer phases.
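A minimal sketch for Tasks 3 and 4, assuming the Bennett web server log follows the Common Log Format with the client IP address as the first whitespace-separated field (the log layout and script names are assumptions, not taken from the handout):

weblog_mapper.py (sketch)
#!/usr/bin/env python
# Emits "<client_ip><TAB>1" for every request line in the log.
import sys

for line in sys.stdin:
    fields = line.split()
    if fields:                      # skip blank lines
        print("%s\t%d" % (fields[0], 1))

weblog_reducer.py (sketch)
#!/usr/bin/env python
# Sums the hit count per IP address; input arrives sorted by key.
import sys

current_ip, hits = None, 0
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    ip, _, count = line.partition("\t")
    if ip == current_ip:
        hits += int(count)
    else:
        if current_ip is not None:
            print("%s\t%d" % (current_ip, hits))
        current_ip, hits = ip, int(count)
if current_ip is not None:
    print("%s\t%d" % (current_ip, hits))

Summing all reducer output values gives the total number of requests (query I), and the line for 192.12.68.34 answers query II; the map, shuffle, and reduce timings for query III can be read from the job counters printed when the streaming job completes.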
Lab Assignment 4:
CO-Mapping
Lab. No. CO1 CO2 CO3
4
Outcome
The Objective
Pig is a high-level scripting language that is used with Apache Hadoop. To analyze data using Apache Pig, developers write scripts in the Pig Latin language. Pig excels at describing data analysis problems as data flows and is widely used in industry because it provides a rich set of features for manipulating data. In this lab, students will learn how to deploy and write Pig scripts so that they can analyze real-time data sets using the Apache Hadoop Pig MapReduce framework.
With the increasing number of e-commerce businesses, there is a need to analyze real-time datasets using Apache Pig. Consider a 'Sales2022.csv' dataset and analyze it using Apache Pig to answer the following queries at the Grunt shell:
Task 1: Launch and deploy Hadoop Big Data platform at the Cloudera Virtual
Machine.
Task 2: Load and create a Schema for ‘Sales2022.csv’ dataset using Pig Storage
delimiter format.
Task 3: Write Pig Script to group and dump the data by ‘Country’.
Task 4: Write Pig Script to filter and dump the data where the city is 'Chicago'.
Task 5: Compose Pig Script to fetch data where payment has been made using
‘Mastercard’
Task 6: Write Pig Script to analyze the number of product sales on the transaction date '1/21/2022 14:06'.
Lab Assignment 5:
CO-Mapping
Lab. No. CO1 CO2 CO3
5
Outcome
The Objective
Pig is a high-level scripting language that is used with Apache Hadoop. To analyze data using Apache Pig, developers write scripts in the Pig Latin language. Pig excels at describing data analysis problems as data flows and is widely used in industry because it provides a rich set of features for manipulating data. In this lab, students will learn how to deploy, analyze, and write Pig scripts for multiple datasets over the Apache Hadoop Pig MapReduce framework.
The MovieLens dataset is widely used for building recommender systems that analyze user movie ratings. In this lab assignment, you have to explore the two MovieLens datasets 'u.data' and 'u.item' to find trends in movie preferences. The data is provided as input to Apache Pig, where it is analyzed and partitioned based on ratings. Consider these datasets and write Apache Pig scripts to answer the following queries at the Grunt shell:
Task 1: Launch and deploy Apache Hadoop Big Data platform at the Cloudera
Virtual Machine.
Task 2: Write Pig Script to create a relation schema for movie data set 1, 'u.data'.
Task 4: Write Pig Script to find the movies with an average rating > 4.0.
Task 6: Write Pig Script to analyze and display the ‘u.item’ data set for movie title
‘Shanghai Triad’ only.
Task 7: Write Pig Script to perform a 'Pig Join Operation' on the 'u.data' and 'u.item' datasets.
Task 8: Find the oldest 3-star movies from the 'u.data' and 'u.item' datasets.
Task 9: Write Pig Script to find the oldest 5-star movies from the 'u.data' and 'u.item' datasets.
Lab Assignment 6:
CO-Mapping
Lab. No. CO1 CO2 CO3
6
Outcome
Apache Hive is a data warehouse system built on top of Hadoop and used to examine structured and semi-structured data. Hive abstracts away the complexity of Hadoop MapReduce: it provides a mechanism to project structure onto the data and query it using HQL (Hive Query Language). In this lab, students will learn how to deploy, analyze, and write Hive Query Language (HQL) scripts for the analysis of the 'Sports.csv' dataset over the deployed Apache Hadoop MapReduce framework.
Sports data analytics is used not only in cricket but in many other sports such as swimming, tennis, shooting, and rowing. It can be used to improve overall team performance and maximize winning chances. Consider a 'Sports.csv' dataset of international games, which is provided as input to Apache Hive and is analyzed and partitioned based on the given categories. Write Apache Hive Query Language (HQL) scripts to answer the following queries in Hadoop CLI shell mode:
Task 1: Write an HDFS shell command to load the data set at the 'root directory' from the Apache Hadoop Big Data local file system platform.
Task 2: Write HQL Script to design ‘schema’ and create the required ‘table’ for
‘sports.csv’.
Task 3: Write HQL Script to analyze and list the total number of medals won by each
country for ‘Swimming’ sport category.
Task 4: Write HQL Script to calculate and display the total number of ‘Gold
Medals’ won by India. (Assume all sports category).
Task 5: Write HQL Script to list the number of medals won by ‘China’ in
‘Shooting’.
Task 6: Write HQL Script to count the total number of medals each country won.
Task 7: Write HQL Script to display the years and the names of the countries that won medals in 'Shooting'.
Task 8: Write HQL Script to list the country that won gold and silver medals in
‘Football’.
Lab Assignment 7:
CO-Mapping
Lab. No. CO1 CO2 CO3
7
Outcome
The Objective
Apache Hive is a data warehouse system built on top of Hadoop and used to examine structured and semi-structured data. Hive abstracts away the complexity of Hadoop MapReduce: it provides a mechanism to project structure onto the data and query it using HQL (Hive Query Language). In this lab, students will learn how to deploy, analyze, and write Hive Query Language (HQL) scripts for the analysis of the given datasets over the deployed Apache Hadoop MapReduce framework.
Task 1: Write HQL Script to design ‘schema’ and create the required ‘table’ for
‘Batch_Distribution’, ‘Demo_Missed’, ‘Demo_Schedule’, ‘Enquiry’ and ‘Enroll’
datasets.
Task 2: Write HQL Script to list the students who have been enrolled for ‘Weekend
Batch Courses’.
Task 3: Write HQL Script to display the students who have made payment using 'Googlepay'.
Task 4: Write HQL Script to display the students who have paid the fees in ‘Cash’
mode.
Task 5: Write HQL Script to list the students who have paid the complete fees.
Task 6: Write HQL Script to list the students whose fees are pending.
Task 7: Write HQL Script to list the details of students whose 1st installment payment is pending.
Task 8: Write HQL Script to display the students who have enrolled for ‘Cloud’
batch.
Task 9: Write HQL Script to display the students who have enrolled for ‘FullStack’
batch.
Task 10: Write HQL Script to list the information of students who have missed their
course demo.
Lab Assignment 8:
Outcome
The Objective
Data has the potential to transform business and drive the creation of business value. The real power of data lies in the use of analytical tools that allow the user to extract useful knowledge and quantify the factors that impact events. Some examples include customer sentiment analysis, e-commerce data set analysis, geo-spatial analysis of key operation centers, etc. In this lab, students will learn how to deploy, analyze, and write Hive Query Language (HQL) scripts for the analysis of the 'Smartphone.csv' dataset over the deployed Apache Hadoop MapReduce framework.
Lab Problem Statement
Consider a telecom company that wants to analyze its Diwali 2022 smartphone sales datasets, '2022_OctSales.csv' and '2022_NovSales.csv', using Apache Hive. As a Big Data Analyst, you have been given the challenge of extracting the customer clickstream and finding hidden information in these real-time datasets. This would help the company analyze the Diwali smartphone sales and develop a customer product recommendation system. For the given case study, deploy the Hadoop framework over the Cloudera virtual machine and write Apache Hive Query Language (HQL) scripts to answer the following queries.
Task 1: Write HQL Script to design schema and create the required ‘External Table’
for ‘2022_OctSales.csv’ and ‘2022_NovSales.csv’ datasets.
Task 2: Write HQL Script to identify the month in which the maximum sales profit was generated.
Task 3: Write HQL Script to display the difference in revenue gained from smartphone sales between October and November.
Task 4: Write HQL Script to identify and display the name of the company that received the maximum sales across the combined October and November data.
Task 5: Assume, from the given data sets, that the company wants to reward the top 5 users of its website with BONUS credit points of Rs 10,000. Write a query to generate a list of the top 5 users who spend the most time on the website. [Hint: You can use Dynamic Partitioning.]
Task 6: Write HQL Script to identify and display the companies whose sales increased from October to November.
Lab Assignment 9:
CO-Mapping
Lab. No. CO1 CO2 CO3
9
Outcome
Apache Spark has emerged as the next big thing in the Big Data domain – quickly
rising from an ascending technology to an established superstar in just a matter of
years. Spark allows you to quickly extract actionable insights from large amounts of
data, on a real-time basis, making it an essential tool in many modern businesses.
In this lab, students will learn how to deploy, analyze, and write PySpark Script for
analysis of ‘customerorder.csv’ dataset over the deployed Apache Spark
Framework.
Task 1: Configure and deploy the Apache Spark Framework on your Cloudera VM.
Task 2: Write PySpark script to split each comma-delimited line of the given data set into fields.
Task 3: Write PySpark script to map each line to key/value pairs of customer ID and the amount.
Task 5: Write PySpark script to finally collect the results and display them in the Spark shell.
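A minimal PySpark sketch covering the tasks above, assuming customerorder.csv has comma-separated lines of the form customer_id,item_id,amount (the column layout and file path are assumptions):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("CustomerOrders")
sc = SparkContext(conf=conf)

# Task 2: split each comma-delimited line into fields
lines = sc.textFile("file:///home/cloudera/customerorder.csv")
fields = lines.map(lambda line: line.split(","))

# Task 3: map each line to (customer ID, amount) key/value pairs
pairs = fields.map(lambda f: (int(f[0]), float(f[2])))

# Sum the amounts per customer, then collect and display the result (Task 5)
totals = pairs.reduceByKey(lambda a, b: a + b)
for customer, total in totals.collect():
    print("%s\t%.2f" % (customer, total))

The script can be submitted with spark-submit or pasted into the pyspark shell.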
CO-Mapping
Lab. No. CO1 CO2 CO3
10
Outcome
The Objective
Apache Spark has emerged as the next big thing in the Big Data domain – quickly
rising from an ascending technology to an established superstar in just a matter of
years. Spark allows you to quickly extract actionable insights from large amounts of
data, on a real-time basis, making it an essential tool in many modern businesses.
In this lab, students will learn how to implement a recommendation system over the deployed Apache Spark Framework.
Companies like Amazon and Netflix provide recommendations based on the user's interests. Using Apache Spark's machine learning capabilities along with Elasticsearch, you are supposed to build a recommendation system. As a Big Data Analyst, you have to write a Python Spark script that consolidates and analyzes this data set to answer the following queries.
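Since the dataset and queries are not listed in this file, the following is only an illustrative sketch of a collaborative-filtering recommender built with Spark ML's ALS, assuming a MovieLens-style ratings.csv with userId, movieId, and rating columns (the file name and column names are assumptions):

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("ALSRecommender").getOrCreate()

# Load the ratings data (assumed columns: userId, movieId, rating)
ratings = spark.read.csv("ratings.csv", header=True, inferSchema=True)

# Train an ALS collaborative-filtering model
als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          maxIter=10, regParam=0.1, coldStartStrategy="drop")
model = als.fit(ratings)

# Recommend the top 5 movies for every user
model.recommendForAllUsers(5).show(truncate=False)

The resulting recommendations could then be indexed into Elasticsearch for serving, as described in the problem statement.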
CO-Mapping
Lab. No. CO1 CO2 CO3
11
Outcome
The Objective
Apache Spark has emerged as the next big thing in the Big Data domain – quickly
rising from an ascending technology to an established superstar in just a matter of
years. Spark allows you to quickly extract actionable insights from large amounts of
data, on a real-time basis, making it an essential tool in many modern businesses.
In this lab, students will learn how to implement a prediction system over the deployed Apache Spark Framework.
Diabetes can lead to serious complications that affect many different parts of the human body. Consider a diabetes dataset from a healthcare company that consists of information about patients' diabetes report values. As a Big Data Analyst, you have to write Python Spark scripts that consolidate and analyze this data set to answer the following queries.
Task 2. Perform data cleaning and preparation using Spark Data Frame operations.
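A sketch of Task 2, assuming a Pima-style diabetes.csv with columns such as Glucose, BMI, and Outcome (the file name and column names are assumptions, not taken from the lab handout):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("DiabetesPrep").getOrCreate()

# Load the raw data
df = spark.read.csv("diabetes.csv", header=True, inferSchema=True)

# Drop duplicate rows and rows with missing values
clean = df.dropDuplicates().dropna()

# In this dataset a zero Glucose or BMI usually means "missing"; treat it as null
for col_name in ["Glucose", "BMI"]:
    clean = clean.withColumn(
        col_name,
        F.when(F.col(col_name) == 0, F.lit(None)).otherwise(F.col(col_name)))
clean = clean.dropna()

clean.printSchema()
clean.show(5)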
CO-Mapping
Lab. No. CO1 CO2 CO3
12
Outcome
The Objective
Apache Spark has emerged as the next big thing in the Big Data domain – quickly
rising from an ascending technology to an established superstar in just a matter of
years. Spark allows you to quickly extract actionable insights from large amounts of
data, on a real-time basis, making it an essential tool in many modern businesses.
In this lab, students will learn how to implement a Spark streaming system.
Apache Spark is an efficient framework for analyzing large amounts of data in real time and performing various analyses on the data. Many resources discuss Spark and its popularity in the big data space, but it is worth highlighting that its core features include real-time big data processing using Resilient Distributed Datasets (RDDs)/DataFrames, streaming, and machine learning. In this lab, you will learn how Spark Streaming components can be used in conjunction with PySpark to solve a business problem.
Task 4. Create and start a streaming session using a Spark DataFrame.
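A sketch of Task 4 using Spark Structured Streaming, assuming text lines arrive on a local socket (for example, one fed by nc -lk 9999); the source, host, and port are assumptions:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()

# Create a streaming DataFrame of lines read from the socket source
lines = (spark.readStream.format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Split each line into words and count them
words = lines.select(F.explode(F.split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Start the streaming session and print each micro-batch to the console
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()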
GENERAL INSTRUCTIONS: -
1. Do not write anything on the question paper except name, enrolment number and
department/school.
2. Carrying a mobile phone, smart watch, or any other non-permissible material in the examination hall is an act of UFM.
COURSE INSTRUCTIONS:
a) All questions are compulsory
b) Write complete syntax for the asked queries.
SECTION A
Max Marks: 15 Marks
Q1.) When can data be called Big Data? Illustrate the five key properties of Big Data. [2 Marks]
Q2.) Justify why Spark is faster than MapReduce. List the 3 functionalities of the Spark Engine. [3 Marks]
Q3.) Draw the architecture of Pig and explain how Pig Latin is converted to MapReduce. [3 Marks]
Q4.) List the property that you can use to enable bucketing in Apache Hive. [1 Mark]
Q5.) Write the property used to increase the buffer size for sorting during the shuffle phase. [2 Marks]
Q6.) Draw the functional architecture of Hive and justify the purpose of the 'driver' module. [1+3=4 Marks]
Table: sample 'student.csv' data
Roll no | Name | CGPA
001 | John | 3.0
002 | Jack | 4.0
003 | Smith | 4.5
004 | David | 4.2
005 | Robert | 3.4
Q11) Consider the following two-month sales datasets, 'Jan.csv' and 'Feb.csv', of your organization. Write Apache Hive Query Language (HQL) scripts to answer the following queries. [1*5 = 5 Marks]
Table 1. Sales in Jan 2022
Columns: Event_time, Event_type, Product_id, Category_id, Company, Price, Userid, User_session
(Sample row values include: 2022-02-04, Purchase, 31, 5873353, 5981589, Vivo, 6756, 47, 12e8d843.)
SECTION A
Max Marks: 15 Marks
Q1.) When can data be called Big Data? Illustrate the five key properties of Big Data. [2 Marks]
Answer:
Big data is a collection of data from many different sources and is often described by five characteristics: volume, value, variety, velocity, and veracity.
• Volume: the size and amounts of big data that companies manage and analyze
• Value: the most important “V” from the perspective of the business, the value of big data
usually comes from insight discovery and pattern recognition that lead to more effective
operations, stronger customer relationships and other clear and quantifiable business benefits
• Variety: the diversity and range of different data types, including unstructured data, semi-
structured data and raw data
• Velocity: the speed at which companies receive, store and manage data – e.g., the specific
number of social media posts or search queries received within a day, hour or other unit of
time
• Veracity: the “truth” or accuracy of data and information assets, which often determines
executive-level confidence
• Variability: the changing nature of the data companies seek to capture, manage and analyze
– e.g., in sentiment or text analytics, changes in the meaning of key words or phrases
Q2.) Justify why Spark is faster than MapReduce. List the 3 functionalities of the Spark Engine. [3 Marks]
Answer: In-memory computation makes Spark faster than MapReduce. The Spark Engine is responsible for:
• Interacting with cluster and storage manager.
• Dividing task into sub tasks for parallel execution.
• Scheduling tasks on the cluster.
Q3.) Draw the architecture of Pig and explain how Pig Latin is converted to MapReduce. [3 Marks]
Answer:
i. Parser: The parser checks the syntax of the script and performs type checking. Its output is a DAG (directed acyclic graph) that represents the Pig Latin statements and logical operators.
ii. Optimizer: The DAG is passed to the logical optimizer, which carries out the logical optimizations.
iii. Compiler: The compiler then compiles the optimized logical plan into a series of MapReduce jobs.
iv. Execution engine: Finally, the MapReduce jobs are submitted to Hadoop in sorted order, and the desired results are produced when these MapReduce jobs are executed on Hadoop.
Q4.) List the property that you can use to enable bucketing in Apache Hive. [1 Mark]
Answer: hive.enforce.bucketing
Q5) Write the property used to increase the buffer size for sorting during the shuffle
phase? [2 Marks]
Answer: mapreduce.task.io.sort.mb
Q6.) Draw functional architecture of Hive and justify purpose of ‘driver’ module.
[1+3=4 Marks]
In the case of Internal Tables, both the table and the data contained in the table are managed by Hive; that is, we can add/delete/modify any data using Hive. When we DROP the table, the data also gets deleted along with the table.
Eg: CREATE TABLE tweets (text STRING, words INT, length INT) ROW FORMAT
DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
In the case of External Tables, only the table is managed by Hive. The data present in these tables can come from any storage location, such as HDFS. We can't add/delete/modify the data in these tables; we can only use the data through SELECT statements. When we DROP the table, only the table gets deleted and not the data contained in it. This is why it's said that only the metadata gets deleted. When we create EXTERNAL tables, we need to mention the location of the data.
Eg: CREATE EXTERNAL TABLE tweets (text STRING, words INT, length INT) ROW
FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/tweets';
Q8) Justify how the StreamingContext object plays a key role in Apache Spark. [2 Marks]
The StreamingContext is the main entry point for Spark Streaming functionality. It provides methods used to create DStreams from various input sources. It can be created either by providing a Spark master URL and an appName, from an org.apache.spark.SparkConf configuration (see the core Spark documentation), or from an existing org.apache.spark.SparkContext.
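An illustrative PySpark sketch (not part of the original answer key) of creating a StreamingContext and a DStream from an input source:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "StreamingContextDemo")
ssc = StreamingContext(sc, 5)             # 5-second batch interval

# DStream of lines received from a socket input source
lines = ssc.socketTextStream("localhost", 9999)
lines.count().pprint()                    # print the record count of each batch

ssc.start()              # start the computation
ssc.awaitTermination()   # wait for the streaming job to terminate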
Answer:
i) hdfs dfs -copyFromLocal /path1 /path2 .... /pathn /destination
ii) The Apache Pig DISTINCT operator is used to remove duplicate tuples in a relation.
Initially, Pig sorts the given data and then eliminates duplicates.
iii) hive> ALTER TABLE <tablename> REPLACE COLUMNS (<old column name> INT,
<new column name> STRING);
v) flightTimeCsvDF = spark.read \
.format("csv") \
.option("header", "true") \
.schema(flightSchemaStruct) \
.option("mode", "FAILFAST") \
.option("dateFormat", "M/d/y") \
.load("data/flight*.csv")
JSON:
flightTimeJsonDF = spark.read \
.format("json") \
.schema(flightSchemaDDL) \
.option("dateFormat", "M/d/y") \
.load("data/flight*.json")
Parquet:
flightTimeParquetDF = spark.read \
.format("parquet") \
.load("data/flight*.parquet")
vi)
import sys
from pyspark.sql import *
from lib.utils import *
Answer 10.
a). A = LOAD 'student.csv' USING PigStorage(',') AS (Rollno:int, Name:chararray, CGPA:float);
B = LOAD 'Dept.csv' USING PigStorage(',') AS (Rollno:int, DeptName:chararray, University:chararray);
C = JOIN A BY Rollno, B BY Rollno;
DUMP C;
b)
NewDelhiStudents = FILTER cseData1 BY (University == 'MIT') OR (University == 'Stanford');
c).
marksGT70 = FILTER cseData1 BY CGPA >= 3.5;
Answer:
a) SELECT SUM(price) AS Total_Revenue_Jan
FROM CosmeticStore
WHERE date_format(event_time, 'MM')=10
AND
event_type= 'purchase';
b) WITH Total_Monthly_Revenue AS (
SELECT
SUM (CASE WHEN date_format(event_time, 'MM')=10 THEN price ELSE 0
END) AS October_Revenue,
SUM (CASE WHEN date_format(event_time, 'MM')=11 THEN price ELSE 0
END) AS November_Revenue
FROM CosmeticStore
WHERE event_type= 'purchase'
AND date_format(event_time, 'MM') in ('10','11')
)
SELECT November_Revenue, October_Revenue, (November_Revenue-October_Revenue)
AS Difference_Of_Revenue FROM Total_Monthly_Revenue;
e).
--To create Partitioning and bucketing
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.enforce.bucketing=true;
-- Adding data from CosmeticStore table into partitioned and clustered table.
INSERT INTO TABLE Dynamic_Part_Cluster_CosmeticStore
PARTITION (event_type)
SELECT event_time, product_id, category_id, category_code, brand, price, user_id,
user_session, event_type
FROM CosmeticStore;
--Checking whether the table has been successfully created and loaded with the data
EXIT;
hadoop fs -ls /user/hive/warehouse/Dynamic_Part_Cluster_CosmeticStore
--Question 8 Query
SELECT user_id, SUM(price) AS Total_Spend
FROM Dynamic_Part_Cluster_CosmeticStore
WHERE event_type='purchase'
GROUP BY user_id
ORDER BY Total_Spend DESC
LIMIT 10;
• Introduction
Corporates and investors in the manufacturing industry have shown increased interest in carbon footprint analysis, which indicates how a given manufacturing process contributes to the carbon intensity of the environment. Emissions are directly related to energy costs, so this tracking has tangible relevance for an individual organization: it helps the company better understand its emissions and quantify a CO2 index, which can translate into a competitive advantage among peers.
• Objectives
2. The global scientific community has reached a consensus goal of stabilizing the earth's climate to stay within two degrees. Our project can play a small role in that mission.
3. The analysis does not cover the full business cycle, so one may possibly have to add
4. Given that a carbon footprint alone does not provide a comprehensive approach, we have trained on the dataset to analyze the next 2 years of data.
• Tables
IDE.
Scope 1: Direct greenhouse gas emissions from sources that are owned or controlled by a company.
Scope 2: Indirect greenhouse gas emissions resulting from the generation of purchased electricity, heat, or steam.
Scope 3: Indirect greenhouse gas emissions from sources not owned or directly controlled by the company.
At a minimum, a carbon footprint analysis should include all scope 1 (direct) and scope 2
(indirect) carbon and methane emissions. Due to the difficulties in obtaining comparable and
supply chain) are not typically included. Investors should be able to conduct more robust
improve their disclosure. At this time, a carbon footprint analysis cannot provide a complete
opinion, it is critical to look further, toward the broader benefits associated with those
businesses, in order to gain a better understanding of their broader portfolio and economic
impact.
https://github.com/bennettswallaby/Carbonated-CO2-Emissions-
Tracker/tree/main
• Conclusion
Taking scope 3 or downstream impacts into account for both suppliers and
It can also help businesses assess where there may be opportunities to identify
and engage suppliers who are leaders or laggards in this field. Companies
disposal.
including their products/services after they have been sold. Similarly, focusing
on downstream effects can assist investors in identifying investment
22 E20CSE203 RISHABH .
• During the course and at its end, it was found that the course was very useful for the students because it helped them understand the key concepts of Big Data
• The course covered almost all aspects of Big Data and Business Intelligence.
• The students found the Hackathon assignments useful and interesting for improving
• Both candidates completed the Infosys certification on Big Data, and they have
Category BCA
CLO 5 out of 5
11. Special efforts and measures taken to improve the class’s learning
• Industry tie-up with DataCouch Private Ltd for Hackathon sponsorship and Internship
opportunities.