Data Analytics Lab Manual

The document is a lab manual for a data analytics course. It provides instructions for 4 assignments involving analyzing different datasets using Python and R. The first assignment involves analyzing the Iris flower dataset, including summarizing features, computing statistics, and creating histograms and boxplots to visualize the data distributions and identify outliers. Key steps are described like downloading the dataset, exploring feature types and distributions, and using pandas and matplotlib for data visualization.

Uploaded by

Anushka Joshi
Data Analytics Lab Manual

computer engineer (Savitribai Phule Pune University)


Downloaded by Anushka Joshi (anushkajoshi712@gmail.com)
BE (Computer) (Semester I) Data Analytics Laboratory (2018-2019)

SINHGAD TECHNICAL EDUCATION SOCIETY’S


RMD SINHGAD SCHOOL OF ENGINEERING
Warje, Pune 411033

DEPARTMENT OF COMPUTER ENGINEERING

LAB MANUAL
Academic year 2018-19

B.E. COMPUTER (SEM – I)

DATA ANALYTICS LAB (Laboratory Practice I )

Subject Incharge

TEACHING SCHEME               EXAMINATION SCHEME
PRACTICAL: 4 HRS/WEEK         UNIVERSITY PRACTICAL: 50 MARKS
                              UNIVERSITY TERMWORK: 50 MARKS
CREDITS: 02

RMDSSOE, Warje Department of Computer Engineering

RMD Sinhgad School of Engineering


DEPARTMENT OF COMPUTER ENGINEERING

CERTIFICATE

This is to certify that Mr. /Miss _____________________________________ of class ___________
Div __________ Roll No. _________ Examination Seat No. ______________ has completed all
the practical work in the subject of Data Analytics Lab, satisfactorily, as prescribed by Savitribai
Phule Pune University in the Academic Year 2018 - 2019.

Mrs. Vina Lomte Dr. V. V. Dixit


Staff In-charge Head of Department Principal

INDEX
Sr No  Title                                      Date of performance  Date of completion  Marks Obtained  Signature
1      Study of Iris flower Data Set
2      Naive Bayes' Algorithm for classification
3      Trip History Analysis
4      Bigmart Sales Analysis

Assignment No: 1
Title: Study of Iris Flower Data Set
Problem Statement:
Download the Iris flower dataset or any other dataset into a DataFrame (e.g.
https://archive.ics.uci.edu/ml/datasets/Iris ). Use Python/R and perform the following:
• How many features are there and what are their types (e.g., numeric, nominal)?
• Compute and display summary statistics for each feature available in the dataset (e.g.
minimum value, maximum value, mean, range, standard deviation, variance and percentiles).
• Data Visualization: create a histogram for each feature in the dataset to illustrate the feature
distributions. Plot each histogram.
• Create a boxplot for each feature in the dataset. All of the boxplots should be combined into a
single plot. Compare distributions and identify outliers.

Theory:

This is perhaps the best known database to be found in the pattern recognition literature. The data set
contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is
linearly separable from the other 2; the latter are NOT linearly separable from each other.

Predicted attribute: class of iris plant.

This is an exceedingly simple domain.

This data differs from the data presented in Fisher's article in the following two samples.

The 35th sample should be: 4.9,3.1,1.5,0.2,"Iris-setosa" where the error is in the fourth feature.

The 38th sample: 4.9,3.6,1.4,0.1,"Iris-setosa" where the errors are in the second and third features.

Attribute Information:

1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
-- Iris Setosa
-- Iris Versicolour
-- Iris Virginica

The extra modules needed for coding

• Pandas:

Pandas is an open-source, BSD-licensed Python library providing high-performance, easy-to-use data
structures and data analysis tools for the Python programming language.

For Fedora Users:

sudo yum install numpy scipy python-matplotlib ipython python-pandas sympy python-nose atlas-devel

For Ubuntu Users:

sudo apt-get install python3-pandas

Pandas deals with the following three data structures −

• Series
• DataFrame
• Panel (removed from later pandas releases)

DataFrame is a two-dimensional array with heterogeneous data. For example,

Name    Age   Gender   Rating
Steve   32    Male     3.45
Lia     28    Female   4.6
Vin     45    Male     3.9
Katie   38    Female   2.78

The table represents the data of a sales team of an organization with their overall performance
rating. The data is represented in rows and columns. Each column represents an attribute and each
row represents a person.

pandas.DataFrame

A pandas DataFrame can be created using the following constructor –

pandas.DataFrame( data, index, columns, dtype, copy)

S.No  Parameter  Description
1     data       Takes various forms such as ndarray, Series, map, list, dict, constants,
                 or another DataFrame.
2     index      Row labels for the resulting frame. Optional; defaults to np.arange(n)
                 if no index is passed.
3     columns    Column labels for the resulting frame. Optional; defaults to np.arange(n)
                 if no column labels are passed.
4     dtype      Data type of each column.
5     copy       Whether to copy data from the input; the default is False for DataFrame
                 and 2-D ndarray input.
Example 1

The following example shows how to create a DataFrame by passing a list of dictionaries.

import pandas as pd

# Keys missing from a dictionary become NaN in the resulting frame.
data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)

Its output is as follows −

   a   b     c
0  1   2   NaN
1  5  10  20.0

• matplotlib

- matplotlib.pyplot is a collection of command style functions that make matplotlib work like
MATLAB. Each pyplot function makes some change to a figure: e.g., creates a figure, creates
a plotting area in a figure, plots some lines in a plotting area, decorates the plot with labels,
etc.

- In matplotlib.pyplot various states are preserved across function calls, so that it keeps track of
things like the current figure and plotting area, and the plotting functions are directed to the
current axes (please note that "axes" here and in most places in the documentation refers to
the axes part of a figure and not the strict mathematical term for more than one axis).

Example 1

Generating visualizations with pyplot is very quick:


import matplotlib.pyplot as plt
plt.plot([1, 2, 3, 4])
plt.ylabel('some numbers')
plt.show()

• Sklearn

- Scikit-learn provides a range of supervised and unsupervised learning algorithms via a
consistent interface in Python.
- It is built on NumPy, SciPy and matplotlib, and contains many efficient tools for machine
learning and statistical modeling, including classification, regression, clustering and
dimensionality reduction.

Loading Your Data Set


- The first step in almost any data science task is loading your data. This is also the starting
point when working with scikit-learn.
- To load the data, import the module datasets from sklearn. Then use the
load_digits() function from datasets to load the data:

Example 1:
# import `datasets` from `sklearn`
from sklearn import ________

# Load in the `digits` data
digits = datasets.load_digits()

# Print the `digits` data
print(______)

• How to install?

– sudo apt-get install python3-pandas
– sudo apt-get install python3-matplotlib
– sudo apt-get install python3-sklearn

• How to Find the Mean, Median, Mode, Range, and Standard Deviation
Simplify comparisons of sets of numbers, especially large sets, by
calculating the center values using the mean, mode and median. Use the ranges and
standard deviations of the sets to examine the variability of the data.
Calculating Mean

The mean identifies the average value of the set of numbers. For example,
consider the data set containing the values 20, 24, 25, 36, 25, 22, 23.
Formula
To find the mean, use the formula: Mean equals the sum of the numbers in the data set
divided by the number of values in the data set. In mathematical terms: Mean=(sum of all
terms)÷(how many terms or values in the set).
Adding Data Set

Add the numbers in the example data set: 20+24+25+36+25+22+23=175.


Finding Divisor

Divide by the number of data points in the set. This set has seven values so divide by 7.
Finding Mean

Insert the values into the formula to calculate the mean. The mean equals the sum
of the values (175) divided by the number of data points (7). Since 175÷7=25, the
mean of this data set equals 25. Not all mean values will equal a whole number.
Calculating Range

Range shows the mathematical distance between the lowest and highest values in the
data set. Range measures the variability of the data set. A wide range indicates greater
variability in the data, or perhaps a single outlier far from the rest of the data. Outliers
may skew, or shift, the mean value enough to impact data analysis.
Identifying Low and High Values

In the sample group, the lowest value is 20 and the highest value is 36.
Calculating Range

To calculate range, subtract the lowest value from the highest value. Since 36-20=16,
the range equals 16.
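
The mean and range worked out above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the median and mode are computed as well, although the text above only works through the mean and the range by hand. The variable names are chosen here for illustration.

```python
import statistics

# Example data set used in this section.
data = [20, 24, 25, 36, 25, 22, 23]

mean = statistics.mean(data)          # sum of the values divided by their count: 175 / 7
median = statistics.median(data)      # middle value of the sorted data
mode = statistics.mode(data)          # most frequent value
value_range = max(data) - min(data)   # highest value minus lowest value: 36 - 20

print("mean:", mean)
print("median:", median)
print("mode:", mode)
print("range:", value_range)
```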

Calculating Standard Deviation

Standard deviation measures the variability of the data set. Like range, a smaller
standard deviation indicates less variability.
Formula

Finding the standard deviation requires squaring the difference between each
data point and the mean, adding all the squares [∑(x-µ)²], dividing that sum by one
less than the number of values (N-1), and finally calculating the square root of the
quotient.
Mathematically, start with calculating the mean.
Calculating the Mean

Calculate the mean by adding all the data point values, then dividing by the number
of data points. In the sample data set, 20+24+25+36+25+22+23=175. Divide the sum,
175, by the number of data points, 7, or 175÷7=25. The mean equals 25.
Squaring the Difference

Next, subtract the mean from each data point, then square each difference. The
formula looks like this: ∑(x-µ)², where ∑ means sum, x represents each data set value
and µ represents the mean value.
Continuing with the example set, the values become:
20-25=-5 and (-5)²=25; 24-25=-1 and (-1)²=1; 25-25=0 and 0²=0; 36-25=11 and
11²=121; 25-25=0 and 0²=0; 22-25=-3 and (-3)²=9; and 23-25=-2 and (-2)²=4.
Adding the Squared Differences

Adding the squared differences yields:
25+1+0+121+0+9+4=160.
Division by N-1
Divide the sum of the squared differences by one less than the number of data points.
The example data set has 7 values, so N-1 equals 7-1=6. The sum of the squared
differences, 160, divided by 6 equals approximately 26.6667.
Standard Deviation
Calculate the standard deviation by finding the square root of the result of the division by N-1. In
the example, the square root of 26.6667 equals approximately 5.164. Therefore, the
standard deviation equals approximately 5.164.
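
The hand calculation can be reproduced step by step in Python. This is a minimal sketch that follows the same order as the text (mean, squared differences, division by N-1, square root) and then compares the result with statistics.stdev from the standard library, which also uses the N-1 divisor. The variable names are illustrative.

```python
import math
import statistics

data = [20, 24, 25, 36, 25, 22, 23]

# Step 1: the mean.
mean = sum(data) / len(data)

# Step 2: squared difference between each data point and the mean.
squared_diffs = [(x - mean) ** 2 for x in data]

# Step 3: divide the sum of squares by N-1 to get the sample variance.
sample_variance = sum(squared_diffs) / (len(data) - 1)

# Step 4: the square root of the variance is the standard deviation.
sample_std = math.sqrt(sample_variance)

print("sum of squares:", sum(squared_diffs))
print("variance:", round(sample_variance, 4))
print("standard deviation:", round(sample_std, 3))

# The standard library gives the same sample (N-1) statistic.
assert math.isclose(sample_std, statistics.stdev(data))
```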
Evaluating Standard Deviation
Standard deviation helps evaluate data. Numbers in the data set that fall within one
standard deviation of the mean are typical values for the data set. Numbers that fall outside of
two standard deviations are extreme values or outliers. In the example set, two standard
deviations equal about 10.33 and the value 36 lies 11 above the mean, so 36 is an outlier. Outliers
may represent erroneous data or may suggest unforeseen circumstances and should
be carefully considered when interpreting data.
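
The two-standard-deviation rule described above can be applied programmatically. A minimal sketch, assuming the same example data set and the sample (N-1) standard deviation; the threshold of two standard deviations comes from the text, and the variable names are illustrative.

```python
import statistics

data = [20, 24, 25, 36, 25, 22, 23]
mean = statistics.mean(data)
std = statistics.stdev(data)   # sample standard deviation (N-1 divisor)

# Flag values that lie more than two standard deviations from the mean.
outliers = [x for x in data if abs(x - mean) > 2 * std]
print("outliers:", outliers)
```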

The Data Set

Import packages
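
A minimal sketch of the first two tasks of the problem statement (feature types and summary statistics). It assumes that the copy of the Iris data bundled with scikit-learn (sklearn.datasets.load_iris) is an acceptable substitute for the file downloaded from the UCI page; the column name species and the variable names are chosen here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the copy of Iris that ships with scikit-learn (no download needed).
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Add the class label as a nominal (categorical) column.
df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

# Number of rows/columns and the type of every feature:
# four numeric features and one nominal class column.
print(df.shape)
print(df.dtypes)

# Summary statistics per numeric feature (count, mean, std, min,
# percentiles, max), with range and variance added as extra columns.
numeric = df.select_dtypes("number")
summary = numeric.describe().T
summary["range"] = summary["max"] - summary["min"]
summary["variance"] = numeric.var()
print(summary)
```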

Sample Output
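
The visualization tasks can be sketched as follows. This is one possible approach, not the manual's own listing: it again assumes the scikit-learn copy of the data, saves the plots to files with illustrative names instead of opening windows, and uses the 1.5 * IQR rule (the default whisker rule of a matplotlib boxplot) as an assumed definition of an outlier.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line to use plt.show() interactively
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# One histogram per feature, drawn on a 2x2 grid.
axes = df.hist(bins=15, figsize=(8, 6))
plt.tight_layout()
plt.savefig("iris_histograms.png")

# All four boxplots combined into a single plot.
fig, ax = plt.subplots(figsize=(8, 5))
df.boxplot(ax=ax)
ax.set_ylabel("cm")
fig.savefig("iris_boxplots.png")

# Count the points that fall outside the boxplot whiskers (1.5 * IQR rule).
q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1
is_outlier = (df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)
outlier_counts = is_outlier.sum()
print(outlier_counts)
```

Comparing the printed counts with the boxplot shows which features have points beyond the whiskers.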

Conclusion:
Thus we have learnt and implemented feature extraction and visualization, including a histogram and box plot for each feature.
We also compared the distributions and identified outliers.

Assignment No: 2

Title: Naive Bayes' Algorithm for classification

Problem Statement:

Download Pima Indians Diabetes dataset. Use Naive Bayes' Algorithm for classification
- Load the data from CSV file and split it into training and test datasets.
- Summarize the properties in the training dataset so that we can calculate probabilities and
make predictions.
- Classify samples from a test dataset and a summarized training dataset.

Theory :

Implement a classification algorithm that is Naïve Bayes. Implement the
following operations:
1. Split the dataset into Training and Test dataset.
2. Calculate conditional probability of each feature in training dataset.
3. Classify sample from a test dataset.
4. Display confusion matrix with predicted and actual values.

Dataset

The dataset includes data from 768 women with 8 characteristics, in particular:

1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-Hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)

The last column of the dataset indicates if the person has been diagnosed with diabetes (1) or not (0).

The Problem

The type of dataset and problem is a classic supervised binary classification. Given a number of
elements all with certain characteristics (features), we want to build a machine learning model to
identify people affected by type 2 diabetes.

To solve the problem we will have to analyze the data, do any required transformation and
normalization, apply a machine learning algorithm, train a model, check the performance of the
trained model and iterate with other algorithms until we find the most performant for our type of
dataset.

What is Naive Bayes algorithm?

Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem
with the “naive” assumption of conditional independence between every pair of features given the class.
A Naive Bayes model is easy to build and particularly useful for very large data sets. Despite its
simplicity, Naive Bayes can perform competitively with more sophisticated classification methods on some problems.

Bayes theorem provides a way of calculating posterior probability P(c|x) from P(c), P(x) and P(x|c):

P(c|x) = P(x|c) * P(c) / P(x)

Above,

 P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
 P(c) is the prior probability of class.
 P(x|c) is the likelihood which is the probability of predictor given class.
 P(x) is the prior probability of predictor.

How does the Naive Bayes algorithm work?

Let’s understand it using an example. Below I have a training data set of weather and corresponding
target variable ‘Play’ (suggesting possibilities of playing). Now, we need to classify whether players
will play or not based on weather condition. Let’s follow the below steps to perform it.

Step 1: Convert the data set into a frequency table

Step 2: Create Likelihood table by finding the probabilities like Overcast probability = 0.29 and
probability of playing is 0.64.

Step 3: Now, use Naive Bayesian equation to calculate the posterior probability for each class. The
class with the highest posterior probability is the outcome of prediction.

Problem: Players will play if the weather is sunny. Is this statement correct?

We can solve it using above discussed method of posterior probability.

P(Yes | Sunny) = P( Sunny | Yes) * P(Yes) / P (Sunny)

Here we have P (Sunny |Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P( Yes)= 9/14 = 0.64

Now, P (Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is higher than P (No | Sunny) = 1 - 0.60 = 0.40, so the prediction is that players will play.
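
The same calculation can be done with exact fractions, which avoids the rounding in 0.33, 0.64 and 0.36. A minimal sketch using only the counts quoted above (3 sunny days out of 9 "Yes" days, 5 sunny days and 9 "Yes" days out of 14); the variable names are illustrative.

```python
from fractions import Fraction

# Probabilities taken from the counts quoted in the text.
p_sunny_given_yes = Fraction(3, 9)
p_yes = Fraction(9, 14)
p_sunny = Fraction(5, 14)

# Bayes' theorem: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny

# Only two classes exist, so the posterior of "No" is the complement.
p_no_given_sunny = 1 - p_yes_given_sunny

print("P(Yes | Sunny) =", float(p_yes_given_sunny))
print("P(No  | Sunny) =", float(p_no_given_sunny))
```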

Naive Bayes uses a similar method to predict the probability of different class based on various
attributes. This algorithm is mostly used in text classification and with problems having multiple
classes.

The algorithm is categorized into the following steps:

1. Handle Data: Load the data from CSV file and split it into training and test datasets.
2. Summarize Data: summarize the properties in the training dataset so that we can calculate
probabilities and make predictions.
3. Make a Prediction: Use the summaries of the dataset to generate a single prediction.
4. Make Predictions: Generate predictions given a test dataset and a summarized training
dataset.
5. Evaluate Accuracy: Evaluate the accuracy of predictions made for a test dataset as the
percentage correct out of all predictions made.
6. Tie it Together: Use all of the code elements to present a complete and standalone
implementation of the Naive Bayes algorithm
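
The steps above can be sketched in plain Python. This is an illustration under stated assumptions rather than the manual's reference solution: each numeric feature is assumed to follow a normal (Gaussian) distribution within a class, the function names are hypothetical, and the toy rows below are invented for the demonstration and are NOT taken from the Pima Indians Diabetes file. Reading the real CSV (for example with the csv module) is left out.

```python
import math
import random

def split_dataset(rows, split_ratio, seed=1):
    """Step 1: shuffle a copy of the rows and split it into train and test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * split_ratio)
    return rows[:cut], rows[cut:]

def summarize_by_class(rows):
    """Step 2: per class, store the prior and (mean, stdev) of every feature."""
    grouped = {}
    for *features, label in rows:
        grouped.setdefault(label, []).append(features)
    summaries = {}
    for label, samples in grouped.items():
        stats = []
        for column in zip(*samples):
            mean = sum(column) / len(column)
            variance = sum((x - mean) ** 2 for x in column) / (len(column) - 1)
            stats.append((mean, math.sqrt(variance)))
        summaries[label] = (len(samples) / len(rows), stats)
    return summaries

def gaussian_pdf(x, mean, stdev):
    """Likelihood of x under a normal distribution (assumed feature model)."""
    exponent = math.exp(-((x - mean) ** 2) / (2 * stdev ** 2))
    return exponent / (math.sqrt(2 * math.pi) * stdev)

def predict(summaries, features):
    """Steps 3-4: return the class with the largest prior * likelihood."""
    best_label, best_score = None, -1.0
    for label, (prior, stats) in summaries.items():
        score = prior
        for x, (mean, stdev) in zip(features, stats):
            score *= gaussian_pdf(x, mean, stdev)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy rows [glucose, bmi, label]; not real patient data.
toy_rows = [
    [80, 20.0, 0], [85, 21.0, 0], [90, 22.0, 0],
    [95, 23.0, 0], [100, 24.0, 0], [105, 25.0, 0],
    [170, 38.0, 1], [175, 39.0, 1], [180, 40.0, 1],
    [185, 41.0, 1], [190, 42.0, 1], [195, 43.0, 1],
]

train, test = split_dataset(toy_rows, split_ratio=0.67)
model = summarize_by_class(train)
predictions = [predict(model, row[:-1]) for row in test]

# Step 5: accuracy and a confusion matrix keyed by (actual, predicted).
correct = sum(p == row[-1] for p, row in zip(predictions, test))
accuracy = 100.0 * correct / len(test)
confusion = {}
for p, row in zip(predictions, test):
    confusion[(row[-1], p)] = confusion.get((row[-1], p), 0) + 1
print("accuracy:", accuracy)
print("confusion matrix:", confusion)
```

The toy classes are deliberately far apart, so every test row is classified correctly; on the real dataset the classes overlap and the accuracy will be lower.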

Applications:
• Real time Prediction: Naive Bayes is an eager learning classifier and it is fast.
Thus, it could be used for making predictions in real time.

• Multi class Prediction: This algorithm is also well known for its multi class prediction
feature. Here we can predict the probability of multiple classes of the target variable.

• Text classification/ Spam Filtering/ Sentiment Analysis: Naive Bayes classifiers are
mostly used in text classification, where they often achieve a high success rate (due to good
results in multi class problems and the independence rule). As a
result, they are widely used in Spam filtering (identifying spam e-mail) and Sentiment
Analysis (in social media analysis, to identify positive and negative customer
sentiments).

• Recommendation System: A Naive Bayes Classifier and Collaborative Filtering
together build a Recommendation System that uses machine learning and data mining
techniques to filter unseen information and predict whether a user would like a given
resource or not.

Input:
Structured Dataset : PimaIndiansDiabetes Dataset
File: PimaIndiansDiabetes.csv
Output:
1. Dataset split according to the split ratio.
2. Conditional probability of each feature.
3. Visualization of the performance of the algorithm with a confusion matrix.

Conclusion: Hence, we have studied classification algorithm that is Naïve Bayes
classification.
Questions:
1. What is Bayes Theorem?
2. What is confusion matrix?
3. Which function is used to split the dataset in R?
4. What are steps of Naïve Bayes algorithm?
5. What is conditional probability?

Assignment No: 3

Title: Trip History Analysis

Problem Statement:

Use trip history dataset that is from a bike sharing service in the United States. The data is
provided quarter-wise from 2010 (Q4) onwards. Each file has 7 columns. Predict the class of user.

Problem Definition:

Theory:
Data Set Information
Bike sharing systems are a new generation of traditional bike rentals where the whole process from
membership to rental and return has become automatic. Through these systems, a user is able to
easily rent a bike from a particular position and return it at another position.

Attribute Information:

Both hour.csv and day.csv have the following fields, except hr which is not available in day.csv

• instant: record index

• dteday : date

• season : season (1:spring, 2:summer, 3:fall, 4:winter)

• yr : year (0: 2011, 1:2012)

• mnth : month ( 1 to 12)

• hr : hour (0 to 23)

• holiday : whether day is holiday or not (extracted from [Web Link])

• weekday : day of the week

• workingday : if day is neither weekend nor holiday is 1, otherwise is 0.

• weathersit :
- 1: Clear, Few clouds, Partly cloudy

- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered
clouds
- 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog

• temp : Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min),
t_min=-8, t_max=+39 (only in hourly scale)

• atemp: Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min),
t_min=-16, t_max=+50 (only in hourly scale)

• hum: Normalized humidity. The values are divided by 100 (max)

• windspeed: Normalized wind speed. The values are divided by 67 (max)

• casual: count of casual users

• registered: count of registered users

• cnt: count of total rental bikes including both casual and registered
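
The normalization used for temp, atemp, hum and windspeed is a min-max scaling. A minimal sketch of the formula (t-t_min)/(t_max-t_min) and its inverse; the function names normalize and denormalize are chosen here for illustration, and the 15.5 degree reading is an arbitrary example value.

```python
def normalize(t, t_min, t_max):
    """Min-max scale a raw value into the 0..1 range."""
    return (t - t_min) / (t_max - t_min)

def denormalize(value, t_min, t_max):
    """Invert the scaling to recover the raw value."""
    return value * (t_max - t_min) + t_min

# temp uses t_min=-8 and t_max=+39, so 15.5 degrees Celsius maps to 0.5.
temp = normalize(15.5, -8, 39)
print(temp)

# Going back from the normalized value to degrees Celsius.
print(denormalize(temp, -8, 39))

# hum is divided by 100, which is the same formula with t_min=0, t_max=100.
print(denormalize(0.5, 0, 100))
```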

Classification Problem: Prediction of the biker’s class
Label: Member / Casual
Classification: predicts categorical class labels (discrete or nominal). It
constructs a model based on the training set and the values (class labels) in a
classifying attribute, and uses it in classifying new data.

1. Language Used: Python with Pandas, Scikit learn library

2. Functions Defined: Alphanumeric to Numeric Data conversion

3. Classifier Selection and Training: KNN, Decision Tree

4. Classifier Usage to predict the class Label of Unseen Data
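
The four points above can be sketched with pandas and scikit-learn. Everything specific in this example is an assumption made for illustration: the column names (duration, start_station, end_station, member_type) and the twelve toy trips are hypothetical and are not taken from the real trip history files, and LabelEncoder is only one possible way to convert alphanumeric columns to numbers.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy trips; column names and values are illustrative only.
trips = pd.DataFrame({
    "duration": [300, 420, 540, 660, 780, 900,
                 2400, 2520, 2640, 2760, 2880, 3000],
    "start_station": ["A", "B"] * 6,
    "end_station": ["C", "C", "D", "D", "C", "D"] * 2,
    "member_type": ["Member"] * 6 + ["Casual"] * 6,
})

# Alphanumeric to numeric conversion: encode each text column as integers.
features = trips[["duration", "start_station", "end_station"]].copy()
for column in ["start_station", "end_station"]:
    features[column] = LabelEncoder().fit_transform(features[column])
labels = trips["member_type"]

# Hold out a quarter of the trips for testing, keeping both classes present.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=0, stratify=labels)

# Classifier selection and training: KNN and Decision Tree.
models = {
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(name, "accuracy:", scores[name])

# Predict the class label of an unseen trip (same column order as training).
unseen = pd.DataFrame(
    {"duration": [2700], "start_station": [0], "end_station": [1]})
unseen_label = models["Decision Tree"].predict(unseen)[0]
print("predicted class:", unseen_label)
```

In the toy data the two classes differ only in trip duration, so both classifiers separate them perfectly; real trip data is noisier and would also benefit from feature scaling before KNN.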

Conclusion:

Thus we have used the trip history dataset and learnt to predict the class of user.

Assignment No: 4

Title: Wordcount using MapReduce

Problem Statement:

Write a Hadoop program that counts the number of occurrences of each word in a text file.

Problem Definition :

Hadoop Installation Guide :

https://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/

Theory:

What is MapReduce ?

• MapReduce is a framework using which we can write applications to process huge
amounts of data, in parallel, on large clusters of commodity hardware in a reliable
manner.
• MapReduce is a processing technique and a programming model for distributed
computing based on Java.
• The MapReduce algorithm contains two important tasks, namely Map and Reduce.

Map and Reduce

• Map takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key/value pairs).
• The Reduce task takes the output from a map as its input and combines
those data tuples into a smaller set of tuples. As the sequence of the name MapReduce
implies, the reduce task is always performed after the map job.
• The major advantage of MapReduce is that it is easy to scale data processing over
multiple computing nodes.
• Under the MapReduce model, the data processing primitives are called mappers and
reducers.

• Decomposing a data processing application into mappers and reducers is sometimes
nontrivial. But, once we write an application in the MapReduce form, scaling the
application to run over hundreds, thousands, or even tens of thousands of machines in a
cluster is merely a configuration change.
• This simple scalability is what has attracted many programmers to use the MapReduce
model.
The Algorithm

• A MapReduce program executes in three stages, namely map stage, shuffle stage, and
reduce stage.
• Map stage: The map or mapper’s job is to process the input data. Generally the input
data is in the form of a file or directory and is stored in the Hadoop Distributed File System (HDFS).
The input file is passed to the mapper function line by line. The mapper processes the data
and creates several small chunks of data.
• Reduce stage: This stage is the combination of the Shuffle stage and the Reduce stage.
The Reducer’s job is to process the data that comes from the mapper. After processing, it
produces a new set of output, which will be stored in the HDFS.
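
The three stages can be imitated in memory with ordinary Python functions. This is a conceptual sketch only: it runs on a single machine without Hadoop, and the function names map_stage, shuffle_stage and reduce_stage as well as the two input lines are chosen here for illustration.

```python
from collections import defaultdict

def map_stage(line):
    """Map: emit a (word, 1) key/value pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_stage(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_stage(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)

# Illustrative input; in Hadoop the lines would come from a file in HDFS.
lines = ["apple banana apple", "banana plum"]

mapped = [pair for line in lines for pair in map_stage(line)]
grouped = shuffle_stage(mapped)
counts = dict(reduce_stage(key, values) for key, values in sorted(grouped.items()))
print(counts)
```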

Data input and output

Terminologies

• Mapper - Maps the input key/value pairs to a set of intermediate key/value pairs.
• NameNode - Node that manages the Hadoop Distributed File System (HDFS).
• DataNode - Node where data is presented in advance before any processing takes place.
• MasterNode - Node where JobTracker runs and which accepts job requests from clients.
• SlaveNode - Node where Map and Reduce program runs.
• JobTracker - Schedules jobs and tracks the jobs assigned to the Task Tracker.
• Task Tracker - Tracks the task and reports status to JobTracker.
• Job - An execution of a Mapper and Reducer across a dataset.
• Task - An execution of a Mapper or a Reducer on a slice of data.

The wordcount flow

Hadoop Streaming

Steps :

1. The following command is to create a directory to store the compiled classes.
$ mkdir words
2. Download hadoop-core-1.2.1.jar, which is used to compile and execute the MapReduce
program
3. Compile the program
4. The following command is used to create an input directory in HDFS.
$ hadoop fs -mkdir /input
5. The following command is used to copy input dataset file on HDFS.
$ hadoop fs -put fruits.txt /input
6. The following command is used to verify the files in the input directory.
$ hadoop fs -ls /input
7. The following command runs the Wordcount mapper and reducer scripts locally on the input
file, piping the mapper output through sort to imitate the shuffle stage.
$ cat fruits.txt | python mapper.py | sort | python reducer.py
Output:
apple 4
banana 2
grapes 3
mango 2
orange 2
pineapple 1
plum 2
pomegranate 1
raspberry 2
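
The scripts mapper.py and reducer.py are not listed in this manual. The following is a minimal sketch of what their logic might look like, written as two functions in one file so that it can be run without Hadoop. The function names mapper, reducer and run_local, the tab-separated record format and the sample lines are assumptions made for illustration.

```python
from itertools import groupby

def mapper(lines):
    """mapper.py logic: emit one 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_records):
    """reducer.py logic: sum the counts of consecutive records with the same word.

    The records must already be sorted by word, which is what the
    'sort' step of the pipeline (or the Hadoop shuffle) guarantees.
    """
    pairs = (record.split("\t") for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)

def run_local(lines):
    """Equivalent of: cat file | mapper | sort | reducer."""
    return list(reducer(sorted(mapper(lines))))

# Illustrative input lines, not the contents of fruits.txt.
sample = ["apple mango", "apple plum mango"]
result = run_local(sample)
for word, total in result:
    print(word, total)
```

In a real streaming job each function would live in its own script, read its records from sys.stdin and print its output records to standard output.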

Conclusion:

Thus we have learnt the Mapper and Reducer concepts and implemented a Hadoop program that
counts the number of occurrences of each word in a text file.
