
Department of

Computer Science and Engineering

10212CS210 – Big Data Analytics


Course Category : Program Elective
Credits : 4
Slot : S1 & S5
Semester : Summer
Academic Year : 2024-2025
Faculty Name : Dr. S. Jagan

School of Computing
Vel Tech Rangarajan Dr. Sagunthala R&D Institute of
Science and Technology
Unit 2 - Big Data Processing

Introduction to Big Data, Big Data Analytics, Evolution of Big Data – Best Practices for Big Data Analytics – Big Data Characteristics – Understanding Big Data Storage – A General Overview of High-Performance Architecture – HDFS – MapReduce Programming Model – Understanding the Basics of MapReduce, Loading Data into HDFS, Introduction to Apache Spark, Features, Components, Resilient Distributed Datasets, Data Sharing using Spark RDD, Spark Programming.



Definition of Big Data



Processing of Big Data?



Who is generating big data?



Data Analytics → Visualization



Evolution of Big Data



Evolution of Big Data – cont…



Big Data and Its Importance

• Businesses can utilize outside intelligence when making decisions.

• Improved customer service.

• Early identification of risks to products and services, if any.

• Better operational efficiency.



Why Big Data Is Important

• Cost savings.
• Time reduction.
• Control of online reputation.
• Understanding market conditions.
• Boosting customer acquisition and retention.
• Driving innovation and product development.
• Solving advertisers' problems and offering marketing insights.



Big Data Applications

Education Industry
• Customized and dynamic learning programs.
• Reframing course material.
• Grading Systems.
• Career prediction.
Insurance Industry
• Collecting information.
• Gaining customer insight.
• Fraud Detection.
• Threat Mapping.
Government
• Welfare.
• Cyber Security.
Applications – cont…

Banking Sector
• Venture credit hazard treatment.
• Business clarity.
• Customer statistics alteration.
• Money laundering detection.
• Risk mitigation.
Health Care
• Patient monitoring.
• Diagnosis.
• Treatments.



Hadoop – Features and Advantages

• The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

• The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

• It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.



Map Reduce Architecture



Example – Map Reduce



Algorithm using Hadoop architecture

• MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
• MapReduce, when coupled with HDFS, can be used to handle big data.
• The fundamentals of this HDFS–MapReduce system are three phases (a small sketch follows below):
  • Map
  • Shuffle
  • Reduce
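
To make the three phases concrete, here is a minimal, purely illustrative Scala sketch that simulates map, shuffle, and reduce on an in-memory collection; the sample lines are made up, and real Hadoop MapReduce jobs are written against the Hadoop API rather than like this.

// Map: turn each input line into (key, value) pairs
val lines = Seq("big data", "big ideas", "data lakes")
val mapped = lines.flatMap(_.split(" ")).map(word => (word, 1))

// Shuffle: group together all values that share a key
val shuffled = mapped.groupBy(_._1)

// Reduce: aggregate the values for each key
val reduced = shuffled.map { case (word, pairs) => (word, pairs.map(_._2).sum) }

println(reduced)   // e.g. Map(big -> 2, data -> 2, ideas -> 1, lakes -> 1)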



MapReduce





Applications of MapReduce

• Where can MapReduce be applied?

• What are some examples?



Overview of the Hadoop Ecosystem

A wide variety of companies and organizations use Hadoop for both research and production. Users are encouraged to add themselves to the Hadoop PoweredBy wiki page.

• Hadoop Common – contains libraries and utilities needed by other Hadoop modules.

• Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.

• Hadoop YARN – introduced in 2012, a platform responsible for managing computing resources in clusters and using them for scheduling users' applications.

• Hadoop MapReduce – an implementation of the MapReduce programming model for large-scale data processing.



Ecosystem – cont…

• HDFS
• It employs a NameNode and DataNode architecture to implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters.



Hadoop Distributions



Hadoop versus SQL



HDFS Commands



HDFS Commands

• hdfs version – print the Hadoop/HDFS version.

• hdfs dfs -mkdir /user/dataflair/dir1 – create a directory in HDFS.

• hdfs dfs -ls /user/dataflair/dir1 – list the contents of a directory.

• hdfs dfs -put /home/dataflair/Desktop/sample /user/dataflair/dir1 – copy a file from the local file system into HDFS.

• hdfs dfs -copyFromLocal /home/dataflair/Desktop/sample /user/dataflair/dir1 – similar to -put; copies a local file into HDFS.

• hadoop fs -getfacl /user/dataflair/dir1/sample – display the access control list (ACL) of a file.



HDFS Commands

• hadoop fs -getfacl -R /user/dataflair/dir1 – display ACLs recursively for a directory tree.

• hdfs dfs -copyToLocal /user/dataflair/dir1/sample /home/dataflair/Desktop – copy a file from HDFS to the local file system.
• hdfs dfs -cat /user/dataflair/dir1/sample – print the contents of a file.
• hadoop fs -mv /user/dataflair/dir1/purchases.txt /user/dataflair/dir2 – move a file within HDFS.
• hadoop fs -cp /user/dataflair/dir2/purchases.txt /user/dataflair/dir1 – copy a file within HDFS.



Introduction to Spark

• Apache Spark is an open-source cluster computing framework.

• Its primary purpose is to handle real-time generated data.
• Spark was built on top of Hadoop MapReduce.
• It is optimized to run in memory, whereas alternative approaches such as Hadoop MapReduce write data to and from disk.
• As a result, Spark processes data much faster than these alternatives.



Features of Apache Spark

• Fast - It provides high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.

• Easy to Use - Applications can be written in Java, Scala, Python, R, and SQL. Spark also provides more than 80 high-level operators.

• Generality - It provides a collection of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming.
Features of Apache Spark

• Lightweight - It is a lightweight unified analytics engine used for large-scale data processing.

• Runs Everywhere - It can easily run on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud.



Uses of Spark
Data integration:
• The data generated by different systems is often not consistent enough to be combined for analysis.
• To fetch consistent data from these systems, we can use processes such as Extract, Transform, and Load (ETL).
• Spark is used to reduce the cost and time required for this ETL process.
Stream processing:
• It is always difficult to handle real-time generated data such as log files.
• Spark is capable of operating on streams of data and rejecting potentially fraudulent operations.
Uses of Spark

Machine learning:
• Machine learning approaches become more feasible and increasingly accurate as the volume of data grows.
• Because Spark can store data in memory and run repeated queries quickly, it makes it easy to work on machine learning algorithms.
Interactive analytics:
• Spark is able to respond rapidly, so instead of running only pre-defined queries, we can handle the data interactively.



Spark Installation
Download the Apache Spark tar file.
Unzip the downloaded tar file.
sudo tar -xzvf /home/codegyani/spark-2.4.1-bin-hadoop2.7.tgz

Open the bashrc file.
sudo nano ~/.bashrc

Now, add the following Spark path at the end of the file.
export SPARK_HOME=/home/codegyani/spark-2.4.1-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH

Update the environment variables.
source ~/.bashrc

Now, test the installation: on the command prompt, type
spark-shell
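
Once the shell starts, a quick sanity check (just a sketch; the version printed is whatever was installed) confirms that the SparkContext is available:

scala> sc.version                       // prints the installed Spark version, e.g. 2.4.1
scala> sc.parallelize(1 to 5).sum()     // distributed sum; should return 15.0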



Spark Architecture

Spark follows a master-slave architecture. Its cluster consists of a single master and multiple slaves.

The Spark architecture depends upon two abstractions:

• Resilient Distributed Dataset (RDD)
• Directed Acyclic Graph (DAG)



Resilient Distributed Datasets (RDD)

The Resilient Distributed Datasets are the group of data items that can be stored in-memory on worker nodes. Here,

• Resilient: Restore the data on failure.
• Distributed: Data is distributed among different nodes.
• Dataset: Group of data.



Directed Acyclic Graph (DAG)

• A Directed Acyclic Graph is a finite directed graph that performs a sequence of computations on data.

• Each node is an RDD partition, and each edge is a transformation on top of the data.

• Here, the graph describes the navigation between RDDs, whereas directed and acyclic describe how the computations are carried out.
Driver Program

• The Driver Program is a process that runs the main() function of the application and creates the SparkContext object.
• The purpose of the SparkContext is to coordinate the Spark applications, which run as independent sets of processes on a cluster.
• To run on a cluster, the SparkContext connects to one of several types of cluster managers and then performs the following tasks (a small sketch follows below):
  • It acquires executors on nodes in the cluster.
  • Then, it sends your application code to the executors. Here, the application code can be defined by JAR or Python files passed to the SparkContext.
  • At last, the SparkContext sends tasks to the executors to run.
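
As a rough illustration (not part of the course material), a minimal standalone driver program in Scala might look like the following; the application name and the local master URL are assumptions made for the sketch.

import org.apache.spark.{SparkConf, SparkContext}

object DriverSketch {
  def main(args: Array[String]): Unit = {
    // the driver creates the SparkContext, which coordinates the executors
    val conf = new SparkConf().setAppName("DriverSketch").setMaster("local[*]")  // master URL is an assumption
    val sc = new SparkContext(conf)

    // tasks derived from this job are shipped to the executors
    val evens = sc.parallelize(1 to 100).filter(_ % 2 == 0).count()
    println(s"even numbers: $evens")

    sc.stop()
  }
}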



Cluster Manager

• The role of the cluster manager is to allocate resources across applications. Spark is capable of running on a large number of clusters.

• There are various types of cluster managers, such as Hadoop YARN, Apache Mesos, and the Standalone Scheduler.

• Here, the Standalone Scheduler is Spark's own standalone cluster manager, which makes it possible to install Spark on an empty set of machines.



Worker Node

• The worker node is a slave node.

• Its role is to run the application code in the cluster.

Executor
• An executor is a process launched for an application on a worker node.
• It runs tasks and keeps data in memory or disk storage across them.
• It reads and writes data to external sources.
• Every application has its own executors.

Task
• A unit of work that is sent to one executor.



Spark Components

• The Spark project consists of different types of tightly integrated components.
• At its core, Spark is a computational engine that can schedule, distribute, and monitor multiple applications.



Spark Core

• The Spark Core is the heart of Spark and performs the core
functionality.
• It holds the components for task scheduling, fault recovery,
interacting with storage systems and memory management.



Spark SQL

• Spark SQL is built on top of Spark Core. It provides support for structured data.
• It allows the data to be queried via SQL (Structured Query Language) as well as the Apache Hive variant of SQL, called HQL (Hive Query Language).
• It supports JDBC and ODBC connections that establish a relation between Java objects and existing databases, data warehouses, and business intelligence tools.
• It also supports various sources of data, such as Hive tables, Parquet, and JSON. A small example follows below.
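
As a brief, hedged illustration of querying structured data from the spark-shell (the spark session object is created automatically; the JSON file path and the column names are assumptions):

scala> val people = spark.read.json("/spark/people.json")   // assumed sample file
scala> people.createOrReplaceTempView("people")
scala> spark.sql("SELECT name, age FROM people WHERE age > 21").show()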



Spark Streaming

• Spark Streaming is a Spark component that supports scalable and fault-tolerant processing of streaming data.
• It uses Spark Core's fast scheduling capability to perform streaming analytics.
• It accepts data in mini-batches and performs RDD transformations on that data.
• Its design ensures that applications written for streaming data can be reused to analyze batches of historical data with little modification.
• The log files generated by web servers can be considered a real-time example of a data stream. A small sketch follows below.
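
A minimal sketch of a streaming word count over 5-second mini-batches, written for the spark-shell (where sc already exists); the socket source on localhost:9999 is purely an assumption for illustration.

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))          // 5-second mini-batches
val lines = ssc.socketTextStream("localhost", 9999)     // assumed text source
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()                                          // print the counts computed in each batch
ssc.start()
ssc.awaitTermination()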
MLlib

• MLlib is a machine learning library that contains various machine learning algorithms.
• These include correlations and hypothesis testing, classification and regression, clustering, and principal component analysis.
• It is nine times faster than the disk-based implementation used by Apache Mahout.



GraphX

• GraphX is a library used to manipulate graphs and perform graph-parallel computations.

• It makes it possible to create a directed graph with arbitrary properties attached to each vertex and edge.

• To manipulate graphs, it supports various fundamental operators such as subgraph, joinVertices, and aggregateMessages. A small example follows below.
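
For illustration only, a tiny property graph can be built in the spark-shell as follows; the vertex names and edge labels are made-up values.

import org.apache.spark.graphx._

// vertices: (id, property); edges: Edge(srcId, dstId, property)
val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
val graph = Graph(vertices, edges)

println(graph.numVertices)   // 3
println(graph.numEdges)      // 2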



What is RDD?

• The RDD (Resilient Distributed Dataset) is Spark's core abstraction.

• It is a collection of elements partitioned across the nodes of the cluster so that we can execute various parallel operations on it.

There are two ways to create RDDs (see the sketch below):

• Parallelizing an existing collection in the driver program.
• Referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
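
Both creation paths, sketched in the spark-shell (the HDFS path is an assumption that follows the word count example later in this unit):

scala> val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))       // parallelize a driver-side collection
scala> val fromStorage = sc.textFile("hdfs:///spark/sparkdata.txt")  // reference an external dataset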



RDD Operations

The RDD provides two types of operations:

• Transformation
• Action

Transformation
• In Spark, the role of a transformation is to create a new dataset from an existing one.
• Transformations are considered lazy, as they are only computed when an action requires a result to be returned to the driver program (see the sketch below).
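
A tiny illustration of this laziness in the spark-shell:

scala> val nums = sc.parallelize(1 to 10)
scala> val squares = nums.map(n => n * n)   // transformation: builds the lineage, nothing runs yet
scala> squares.reduce(_ + _)                // action: triggers the computation and returns 385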



RDD Persistence

• Spark provides a convenient way to work on a dataset by persisting it in memory across operations.

• While persisting an RDD, each node stores any partitions of it that it computes in memory, and we can then reuse them in other tasks on that dataset.

• We can use either the persist() or the cache() method to mark an RDD to be persisted. Spark's cache is fault-tolerant.



RDD Shared Variables

• In Spark, when a function is passed to a transformation operation, it is executed on a remote cluster node.
• The node works on separate copies of all the variables used in the function.
• These variables are copied to each machine, and no updates to the variables on the remote machines are propagated back to the driver program (see the sketch below).
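
A small sketch of this behaviour (illustrative only; in local mode the copies may share the driver's JVM, so the effect is most visible when running on a real cluster):

var counter = 0
val rdd = sc.parallelize(1 to 10)

// each executor receives its own copy of `counter`
rdd.foreach(x => counter += x)

// updates made on the executors are not sent back to the driver,
// so in cluster mode this still prints 0
println(counter)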



RDD Persistence – cont…

• In any case, if a partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.

• Different storage levels are available for storing persisted RDDs. Use these levels by passing a StorageLevel object (Scala, Java, Python) to persist(). The cache() method uses the default storage level, which is StorageLevel.MEMORY_ONLY (see the sketch below).
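
A brief sketch of marking an RDD for persistence in the spark-shell (the file path follows the word count example later in this unit):

import org.apache.spark.storage.StorageLevel

val words = sc.textFile("/spark/sparkdata.txt").flatMap(_.split(" "))
words.persist(StorageLevel.MEMORY_ONLY)   // equivalent to words.cache()
words.count()   // first action computes and caches the partitions
words.count()   // second action reuses the cached partitions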



Spark Word Count Example

In the Spark word count example, we find the frequency of each word in a particular file. Here, we use the Scala language to perform the Spark operations.

Steps to execute the Spark word count example

In this example, we find and display the number of occurrences of each word.
Create a text file on your local machine and write some text into it.
$ nano sparkdata.txt



Spark Word Count Example

Check the text written in the sparkdata.txt file.


$ cat sparkdata.txt



Spark Word Count Example

• Create a directory in HDFS where the text file will be kept.

$ hdfs dfs -mkdir /spark

• Upload the sparkdata.txt file to HDFS in that directory.

$ hdfs dfs -put /home/codegyani/sparkdata.txt /spark



Spark Word Count Example

Now, run the following command to open Spark in Scala mode.
$ spark-shell



Spark Word Count Example

Let's create an RDD by using the following command (the file was uploaded to the /spark directory in HDFS).

scala> val data=sc.textFile("/spark/sparkdata.txt")

scala> data.collect;



Spark Word Count Example

• Here, we split the existing data into individual words by using the following command.

scala> val splitdata = data.flatMap(line => line.split(" "));

Now, we can read the generated result by using the following command.

scala> splitdata.collect;



Spark Word Count Example

Now, perform the map operation.


• scala> val mapdata = splitdata.map(word => (word,1));
Here, we are assigning a value 1 to each word.
Now, we can read the generated result by using the following command.
• scala> mapdata.collect;



Spark Word Count Example

Now, perform the reduce operation.

• scala> val reducedata = mapdata.reduceByKey(_+_);
Here, we are summing the counts for each word.
Now, we can read the generated result by using the following command (a consolidated version of the whole pipeline follows below).
• scala> reducedata.collect;
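
For reference, the whole word count can also be written as one chained expression and the result saved back to HDFS; the output path is an assumption.

val counts = sc.textFile("/spark/sparkdata.txt").
  flatMap(_.split(" ")).
  map(word => (word, 1)).
  reduceByKey(_ + _)
counts.saveAsTextFile("/spark/wordcount-output")   // writes one part file per partition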



Spark Char Count Example

In the Spark char count example, we find the frequency of each character in a particular file. Here, we use the Scala language to perform the Spark operations.
Steps to execute the Spark char count example
In this example, we find and display the number of occurrences of each character.
Create a text file on your local machine and write some text into it.
$ nano sparkdata.txt



Spark Char Count Example
Check the text written in the sparkdata.txt file.
$ cat sparkdata.txt



Spark Char Count Example

• Create a directory in HDFS where the text file will be kept.

• $ hdfs dfs -mkdir /spark
• Upload the sparkdata.txt file to HDFS in that directory.
• $ hdfs dfs -put /home/codegyani/sparkdata.txt /spark



Spark Char Count Example

Now, run the following command to open Spark in Scala mode.
$ spark-shell



Spark Char Count Example

• Let's create an RDD by using the following command (the file was uploaded to the /spark directory in HDFS).

• scala> val data=sc.textFile("/spark/sparkdata.txt");

scala> data.collect;



Spark Char Count Example

Here, we split the existing data into individual characters by using the following command (split("") breaks each line into single-character strings).

scala> val splitdata = data.flatMap(line => line.split(""));

Now, we can read the generated result by using the following command.
scala> splitdata.collect;



Spark Char Count Example

Now, perform the map operation.


scala> val mapdata = splitdata.map(word => (word,1));
Here, we are assigning a value 1 to each character.
Now, we can read the generated result by using the following command.
scala> mapdata.collect;



Spark Char Count Example

Now, perform the reduce operation


scala> val reducedata = mapdata.reduceByKey(_+_);
Here, we are summing the counts for each character.
Now, we can read the generated result by using the following command.
scala> reducedata.collect;

