Department of
Computer Science and Engineering
10212CS210 – Big Data Analytics
Course Category : Program Elective
Credits : 4
Slot : S1 & S5
Semester : Summer
Academic Year : 2024-2025
Faculty Name : Dr. S. Jagan
School of Computing
Vel Tech Rangarajan Dr. Sagunthala R&D Institute of
Science and Technology
Unit 2 - Big Data Processing
Introduction to Big Data, Big Data Analytics, Evolution of
Big Data – Best Practices for Big Data Analytics – Big Data
Characteristics – Understanding Big Data Storage – A
General Overview of High-Performance Architecture –
HDFS – MapReduce Programming Model – Understanding
the Basics of MapReduce – Loading Data into HDFS –
Introduction to Apache Spark, Features, Components,
Resilient Distributed Datasets, Data Sharing using Spark
RDD, Spark Programming.
Definition of Big Data
Processing of Big Data?
Who is generating big data?
Data Analytics Visualization
Evolution of Big Data
Evaluation of Big Data
Big Data and Its Importance
• Businesses can utilize outside intelligence while making decisions.
• Improved customer service.
• Early identification of risks to the product/services, if any.
• Better operational efficiency.
Why Big Data Is Important
• Cost saving.
• Time reduction.
• Control of online reputation.
• Understanding of market conditions.
• To boost customer acquisition and retention.
• As a driver of innovation and product development.
• To solve advertisers' problems and offer marketing insights.
Big Data Applications
Education Industry
• Customized and dynamic learning programs.
• Reframing course material.
• Grading Systems.
• Career prediction.
Insurance Industry
• Collecting information.
• Gaining customer insight.
• Fraud Detection.
• Threat Mapping.
Government
• Welfare.
• Cyber Security.
Applications – cont…
Banking Sector
• Venture credit hazard treatment.
• Business clarity.
• Customer statistics alteration.
• Money laundering detection.
• Risk mitigation.
Health Care
• Patient monitoring.
• Diagnosis.
• Treatments.
Hadoop – Features and Advantages
• The Apache™ Hadoop® project develops open-source software
for reliable, scalable, distributed computing.
• The Apache Hadoop software library is a framework that allows
for the distributed processing of large data sets across clusters of
computers using simple programming models.
• It is designed to scale up from single servers to thousands of
machines, each offering local computation and storage.
Map Reduce Architecture
Example – Map Reduce
Algorithm using Hadoop architecture
• MapReduce is a programming model for processing large data
sets with a parallel, distributed algorithm on a cluster.
• MapReduce, when coupled with HDFS, can be used to handle big data.
• The fundamentals of this HDFS–MapReduce system are (a conceptual sketch follows below):
• Map
• Shuffle
• Reduce
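The three phases can be illustrated with a minimal word-count sketch on a plain
Scala collection. This is conceptual Scala, not the Hadoop Java API, and the
input lines are invented for illustration.

// Map phase: emit a (word, 1) pair for every word in every input line
val lines = Seq("big data is big", "data is everywhere")
val mapped: Seq[(String, Int)] =
  lines.flatMap(line => line.split(" ").map(word => (word, 1)))

// Shuffle phase: group together all pairs that share the same key
val shuffled: Map[String, Seq[Int]] =
  mapped.groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2)) }

// Reduce phase: sum the counts for each key
val reduced: Map[String, Int] =
  shuffled.map { case (word, ones) => (word, ones.sum) }

reduced.foreach(println)   // (is,2), (data,2), (big,2), (everywhere,1)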
Map Reduce
Application of MapReduce
• Where can we apply MapReduce?
• Examples: log analysis, building search indexes, counting words or records,
and large-scale data aggregation.
Overview of Hadoop Ecosystems
A wide variety of companies and organizations use Hadoop for both research and
production. Users are encouraged to add themselves to the Hadoop PoweredBy wiki page.
• Hadoop Common – contains the libraries and utilities needed by the other Hadoop
modules.
• Hadoop Distributed File System (HDFS) – a distributed file system that
stores data on commodity machines, providing very high aggregate bandwidth
across the cluster.
• Hadoop YARN – introduced in 2012, a platform responsible for managing
computing resources in clusters and using them for scheduling users'
applications.
• Hadoop MapReduce – an implementation of the MapReduce programming
model for large-scale data processing.
Ecosystem – cont…
• HDFS
• It employs a NameNode and DataNode architecture to
implement a distributed file system that provides
high-performance access to data across highly scalable
Hadoop clusters.
Hadoop Distributions
Hadoop versus SQL
HDFS Commands
• hdfs version – prints the Hadoop/HDFS version.
• hdfs dfs -mkdir /user/dataflair/dir1 – creates the directory dir1 in HDFS.
• hdfs dfs -ls /user/dataflair/dir1 – lists the contents of the directory.
• hdfs dfs -put /home/dataflair/Desktop/sample /user/dataflair/dir1 – copies the
local file sample into the HDFS directory.
• hdfs dfs -copyFromLocal /home/dataflair/Desktop/sample /user/dataflair/dir1 –
same effect as -put: copies a local file into HDFS.
• hadoop fs -getfacl /user/dataflair/dir1/sample – displays the access control
list (ACL) of the file.
HDFS Commands
• hadoop fs -getfacl -R /user/dataflair/dir1 – displays the ACLs of the directory
and, recursively, of everything inside it.
• hdfs dfs -copyToLocal /user/dataflair/dir1/sample /home/dataflair/Desktop –
copies the HDFS file sample to the local file system.
• hdfs dfs -cat /user/dataflair/dir1/sample – prints the contents of the file.
• hadoop fs -mv /user/dataflair/dir1/purchases.txt /user/dataflair/dir2 – moves the
file from dir1 to dir2 within HDFS.
• hadoop fs -cp /user/dataflair/dir2/purchases.txt /user/dataflair/dir1 – copies the
file from dir2 back to dir1 within HDFS.
Introduction to Spark
• Apache Spark is an open-source cluster computing framework.
• Its primary purpose is to handle data generated in real time.
• Spark was built on top of Hadoop MapReduce.
• It is optimized to run in memory, whereas alternative approaches
such as Hadoop MapReduce write data to and from disk.
• As a result, Spark processes data much faster than these alternatives.
Features of Apache Spark
• Fast – It provides high performance for both batch and streaming
data, using a state-of-the-art DAG scheduler, a query optimizer, and a
physical execution engine.
• Easy to Use – Applications can be written in Java, Scala,
Python, R, and SQL. It also provides more than 80 high-level
operators.
• Generality – It provides a collection of libraries including SQL and
DataFrames, MLlib for machine learning, GraphX, and Spark
Streaming.
Features of Apache Spark
• Lightweight – It is a light, unified analytics engine used
for large-scale data processing.
• Runs Everywhere – It can easily run on Hadoop, Apache Mesos,
Kubernetes, standalone, or in the cloud.
Uses of Spark
Data integration:
• The data generated by different systems is rarely consistent enough to
be combined for analysis.
• To fetch consistent data from these systems we can use processes such as
Extract, Transform, and Load (ETL).
• Spark is used to reduce the cost and time required for this ETL
process.
Stream processing:
• It is always difficult to handle real-time generated data such
as log files.
• Spark can operate on streams of data and reject
potentially fraudulent operations.
Uses of Spark
Machine learning:
• Machine learning approaches become more feasible and
increasingly accurate as the volume of data grows.
• Because Spark can store data in memory and run
repeated queries quickly, it makes it easy to work with machine
learning algorithms.
Interactive analytics:
• Spark is able to generate responses rapidly, so instead of
running only pre-defined queries, we can explore the data
interactively.
Spark Installation
Download the Apache Spark tar file.
Extract the downloaded tar file.
sudo tar -xzvf /home/codegyani/spark-2.4.1-bin-hadoop2.7.tgz
Open the bashrc file.
sudo nano ~/.bashrc
Now, add the following Spark path at the end of the file.
export SPARK_HOME=/home/codegyani/spark-2.4.1-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
Update the environment variables.
source ~/.bashrc
Let's test the installation; on the command prompt, type
spark-shell
Spark Architecture
Spark follows a master-slave architecture. Its cluster consists of a
single master and multiple slaves.
The Spark architecture depends upon two abstractions:
• Resilient Distributed Dataset (RDD)
• Directed Acyclic Graph (DAG)
Resilient Distributed Datasets (RDD)
Resilient Distributed Datasets are groups of data items that can be stored
in memory on worker nodes. Here,
Resilient: restores the data on failure.
Distributed: the data is distributed among different nodes.
Dataset: a group of data.
Directed Acyclic Graph (DAG)
A Directed Acyclic Graph is a finite directed graph
that performs a sequence of computations on
data.
Each node is an RDD partition, and each edge
is a transformation on top of the data.
Here, the graph refers to the navigation, whereas
directed and acyclic refer to how it is done.
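As a small illustration, assuming the spark-shell used later in this unit (where
sc is the SparkContext and sparkdata.txt is the sample file), each transformation
below adds to the DAG, and toDebugString prints the resulting lineage.

val data   = sc.textFile("sparkdata.txt")      // RDD of lines
val words  = data.flatMap(_.split(" "))        // transformation: new RDD, new edge in the DAG
val pairs  = words.map(word => (word, 1))      // transformation
val counts = pairs.reduceByKey(_ + _)          // transformation that introduces a shuffle stage

println(counts.toDebugString)                  // prints the lineage (the DAG) of this RDD
counts.count()                                 // action: only now is the DAG actually executed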
Driver Program
• The Driver Program is a process that runs the main() function of the
application and creates the SparkContext object.
• The purpose of the SparkContext is to coordinate the Spark applications,
which run as independent sets of processes on a cluster.
• To run on a cluster, the SparkContext connects to one of several types of
cluster managers and then performs the following tasks:
• It acquires executors on nodes in the cluster.
• Then, it sends your application code to the executors. Here, the
application code can be defined by JAR or Python files passed to
the SparkContext.
• At last, the SparkContext sends tasks to the executors to run.
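A minimal driver-program sketch, assuming local mode; the application name and
master URL below are illustrative, and on a real cluster the master is normally
supplied by spark-submit.

import org.apache.spark.{SparkConf, SparkContext}

object MyDriver {
  def main(args: Array[String]): Unit = {
    // main() creates the SparkContext, which coordinates the application
    val conf = new SparkConf().setAppName("MyDriver").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // The SparkContext acquires executors, ships the code, and sends them tasks
    val rdd = sc.parallelize(1 to 100)
    println(rdd.sum())    // 5050.0 -- computed by the executors, returned to the driver

    sc.stop()
  }
}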
Cluster Manager
• The role of the cluster manager is to allocate resources across applications.
Spark is capable of running on a large number of clusters.
• There are various types of cluster managers, such as Hadoop YARN, Apache
Mesos, and the Standalone Scheduler.
• Here, the Standalone Scheduler is a standalone Spark cluster manager that makes
it possible to install Spark on an empty set of machines.
Worker Node
• The worker node is a slave node.
• Its role is to run the application code in the cluster.
Executor
• An executor is a process launched for an application on a worker
node.
• It runs tasks and keeps data in memory or disk storage across them.
• It reads and writes data to external sources.
• Every application has its own executors.
Task
• A unit of work that is sent to one executor.
Spark Components
• The Spark project consists of different types of tightly integrated
components.
• At its core, Spark is a computational engine that can schedule,
distribute and monitor multiple applications.
Spark Core
• The Spark Core is the heart of Spark and performs the core
functionality.
• It holds the components for task scheduling, fault recovery,
interacting with storage systems and memory management.
Spark SQL
• Spark SQL is built on top of Spark Core. It provides
support for structured data.
• It allows the data to be queried via SQL (Structured Query Language) as
well as the Apache Hive variant of SQL, called HQL (Hive
Query Language).
• It supports JDBC and ODBC connections that establish a relation
between Java objects and existing databases, data warehouses and
business intelligence tools.
• It also supports various sources of data, such as Hive tables, Parquet,
and JSON.
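A small Spark SQL sketch, assuming the spark-shell (where spark is the
SparkSession); the toy rows below are invented for illustration.

import spark.implicits._

// Build a DataFrame from local data and register it as a temporary view
val people = Seq(("Ann", 34), ("Bob", 41), ("Carl", 29)).toDF("name", "age")
people.createOrReplaceTempView("people")

// Query the same structured data through the DataFrame API or through SQL
people.filter($"age" > 30).show()
spark.sql("SELECT name, age FROM people WHERE age > 30").show()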
Spark Streaming
• Spark Streaming is a Spark component that supports scalable and
fault-tolerant processing of streaming data.
• It uses Spark Core's fast scheduling capability to perform streaming
analytics.
• It accepts data in mini-batches and performs RDD transformations
on that data.
• Its design ensures that applications written for streaming data can
be reused to analyze batches of historical data with little
modification.
• The log files generated by web servers can be considered a real-time
example of a data stream.
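A minimal Spark Streaming sketch, assuming the spark-shell and a text source on a
local socket (for example one started with nc -lk 9999); the host, port, and batch
interval are illustrative.

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc   = new StreamingContext(sc, Seconds(5))         // 5-second mini-batches
val lines = ssc.socketTextStream("localhost", 9999)      // DStream of incoming lines

// The same RDD-style transformations, applied to every mini-batch
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()                                           // print each batch's word counts

ssc.start()                                              // start receiving data
ssc.awaitTermination()                                   // keep running until stopped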
MLlib
• MLlib is a machine learning library that
contains various machine learning algorithms.
• These include correlations and hypothesis testing,
classification and regression, clustering, and principal
component analysis.
• It is nine times faster than the disk-based
implementation used by Apache Mahout.
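A small MLlib (spark.ml) sketch, assuming the spark-shell; the four toy points
below are invented so that k-means with k = 2 finds two obvious clusters.

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors

val points = Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.2)
).map(Tuple1.apply)

val df = spark.createDataFrame(points).toDF("features")

val model = new KMeans().setK(2).setSeed(1L).fit(df)
model.clusterCenters.foreach(println)   // two centres, near (0,0) and (9,9)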
GraphX
• GraphX is a library that is used to manipulate graphs and perform
graph-parallel computations.
• It makes it possible to create a directed graph with arbitrary properties
attached to each vertex and edge.
• To manipulate graphs, it supports various fundamental operators such as
subgraph, joinVertices, and aggregateMessages.
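A tiny GraphX sketch, assuming the spark-shell; the vertices, edge labels, and the
"follows" filter are invented for illustration.

import org.apache.spark.graphx.{Edge, Graph}

val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "likes")))

val graph = Graph(vertices, edges)   // directed graph with properties on vertices and edges
println(graph.numEdges)              // 2

// Fundamental operator: keep only the "follows" edges
val follows = graph.subgraph(epred = triplet => triplet.attr == "follows")
follows.triplets.collect().foreach(t => println(s"${t.srcAttr} follows ${t.dstAttr}"))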
What is RDD?
• The RDD (Resilient Distributed Dataset) is Spark's core abstraction.
• It is a collection of elements, partitioned across the nodes of the cluster so
that we can execute various parallel operations on it.
There are two ways to create RDDs (both are shown below):
• Parallelizing an existing collection in the driver program
• Referencing a dataset in an external storage system, such as a shared
filesystem, HDFS, HBase, or any data source offering a Hadoop
InputFormat.
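Both ways of creating an RDD, as they would be run in the spark-shell (sc is the
SparkContext; the HDFS path is an illustrative example).

// 1. Parallelize an existing collection in the driver program
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))
println(nums.reduce(_ + _))                          // 15

// 2. Reference a dataset in external storage (example HDFS path)
val lines = sc.textFile("hdfs:///spark/sparkdata.txt")
println(lines.count())                               // number of lines in the file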
RDD Operations
The RDD provides two types of operations:
• Transformation
• Action
Transformation
• In Spark, the role of a transformation is to create a new dataset from an
existing one.
• Transformations are considered lazy, as they are only computed when an
action requires a result to be returned to the driver program (see the
illustration below).
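A short illustration of laziness in the spark-shell: the two transformations build
new RDDs but do no work until an action is called.

val data    = sc.parallelize(Seq(1, 2, 3, 4, 5))
val squared = data.map(x => x * x)          // transformation: nothing is computed yet
val big     = squared.filter(_ > 10)        // transformation: still nothing is computed

println(big.collect().mkString(", "))       // action: 16, 25 -- the whole chain runs now
println(big.count())                        // action: 2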
RDD Persistence
• Spark provides a convenient way to work on a dataset by persisting it in
memory across operations.
• While persisting an RDD, each node stores any partitions of it that it computes
in memory. We can then reuse them in other tasks on that dataset.
• We can use either the persist() or the cache() method to mark an RDD to be
persisted. Spark's cache is fault-tolerant.
RDD Shared Variables
• In Spark, when a function is passed to a transformation operation, it is
executed on a remote cluster node.
• It works on separate copies of all the variables used in the function.
• These variables are copied to each machine, and no updates to the variables
on the remote machine are propagated back to the driver program.
RDD Shared Variables
• In any case, if a partition of an RDD is lost, it will automatically be
recomputed using the transformations that originally created it.
• Different storage levels are available for storing persisted RDDs. Use these
levels by passing a StorageLevel object (Scala, Java, Python) to persist().
The cache() method uses the default storage level, which is
StorageLevel.MEMORY_ONLY.
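A short persistence sketch for the spark-shell; the file name and the "error"
filter are illustrative.

import org.apache.spark.storage.StorageLevel

val logs   = sc.textFile("sparkdata.txt")
val errors = logs.filter(line => line.contains("error"))

errors.cache()                                    // same as persist(StorageLevel.MEMORY_ONLY)
// errors.persist(StorageLevel.MEMORY_AND_DISK)   // or pass an explicit storage level

println(errors.count())   // first action: computes the partitions and keeps them in memory
println(errors.count())   // later actions reuse the cached partitions; lost ones are recomputed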
Spark Word Count Example
In the Spark word count example, we find the frequency of each word that exists
in a particular file. Here, we use the Scala language to perform the Spark
operations.
Steps to execute the Spark word count example
In this example, we find and display the number of occurrences of each
word.
Create a text file on your local machine and write some text into it.
$ nano sparkdata.txt
Spark Word Count Example
Check the text written in the sparkdata.txt file.
$ cat sparkdata.txt
Spark Word Count Example
• Create a directory in HDFS where the text file will be kept.
$ hdfs dfs -mkdir /spark
• Upload the sparkdata.txt file to this HDFS directory.
$ hdfs dfs -put /home/codegyani/sparkdata.txt /spark
Spark Word Count Example
Now, use the following command to open Spark in Scala mode.
$ spark-shell
Spark Word Count Example
Let's create an RDD by using the following command.
scala> val data = sc.textFile("/spark/sparkdata.txt")
scala> data.collect;
Spark Word Count Example
Here, we split the existing data into individual words by using the
following command.
scala> val splitdata = data.flatMap(line => line.split(" "));
Now, we can read the generated result by using the following command.
scala> splitdata.collect;
Spark Word Count Example
Now, perform the map operation.
• scala> val mapdata = splitdata.map(word => (word,1));
Here, we are assigning a value 1 to each word.
Now, we can read the generated result by using the following command.
• scala> mapdata.collect;
Spark Word Count Example
Now, perform the reduce operation.
• scala> val reducedata = mapdata.reduceByKey(_+_);
Here, we are summing the counts for each word.
Now, we can read the generated result by using the following command.
• scala> reducedata.collect;
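For reference, the same word count can be chained into a single expression and
written back to HDFS; paste it as one block (e.g. with :paste in spark-shell).
The output directory below is illustrative and must not already exist.

val counts = sc.textFile("/spark/sparkdata.txt")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.collect().foreach(println)                    // print the (word, count) pairs
counts.saveAsTextFile("/spark/wordcount-output")     // write the results back to HDFS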
Spark Char Count Example
In the Spark char count example, we find the frequency of each character that
exists in a particular file. Here, we use the Scala language to perform the Spark
operations.
Steps to execute the Spark char count example
In this example, we find and display the number of occurrences of each character.
Create a text file on your local machine and write some text into it.
$ nano sparkdata.txt
Spark Char Count Example
Check the text written in the sparkdata.txt file.
$ cat sparkdata.txt
Spark Char Count Example
• Create a directory in HDFS where the text file will be kept.
• $ hdfs dfs -mkdir /spark
• Upload the sparkdata.txt file to this HDFS directory.
• $ hdfs dfs -put /home/codegyani/sparkdata.txt /spark
Spark Char Count Example
Now, use the following command to open Spark in Scala mode.
$ spark-shell
Spark Char Count Example
Let's create an RDD by using the following command.
scala> val data = sc.textFile("/spark/sparkdata.txt");
scala> data.collect;
Spark Char Count Example
Here, we split the existing data into individual characters by using the
following command.
scala> val splitdata = data.flatMap(line => line.split(""));
Now, we can read the generated result by using the following command.
scala> splitdata.collect;
Spark Char Count Example
Now, perform the map operation.
scala> val mapdata = splitdata.map(word => (word,1));
Here, we are assigning a value 1 to each character.
Now, we can read the generated result by using the following command.
scala> mapdata.collect;
Spark Char Count Example
Now, perform the reduce operation.
scala> val reducedata = mapdata.reduceByKey(_+_);
Here, we are summing the counts for each character.
Now, we can read the generated result by using the following command.
scala> reducedata.collect;