
UNIT-4

NoSQL Databases: Introduction to NoSQL

NoSQL
NoSQL (Not Only SQL) refers to a new generation of databases that provide a
mechanism for storage and retrieval of data that is modeled in ways other than
the tabular relations used in relational databases (RDBMS).

While traditional databases use tables (rows and columns), NoSQL databases
can store:

●​ Documents​

●​ Key-Value pairs​

●​ Graph structures​

●​ Wide-columns​

Key point: NoSQL databases are designed for flexibility, scalability, and handling large volumes of unstructured or semi-structured data.

Why NoSQL?
Relational databases are excellent for structured data with clear relationships.​
But today's apps (social media, big data, IoT, AI) need to handle:

✅ Large volumes of data (Big Data)
✅ Unstructured or semi-structured data
✅ High velocity and variety of data
✅ Horizontal scaling (adding more servers easily)
RDBMS sometimes struggles in these scenarios — that's where NoSQL shines.

Types of NoSQL Databases

| Type | Examples | Description |
| --- | --- | --- |
| Document-based | MongoDB, CouchDB | Stores data as JSON-like documents |
| Key-Value | Redis, DynamoDB | Data is stored as key-value pairs |
| Column-family | Cassandra, HBase | Stores data in columns instead of rows |
| Graph-based | Neo4j, Amazon Neptune | Stores data as nodes and edges for relationships |

Features of NoSQL
●​ Schema-less: Flexible data models​

●​ Highly scalable: Can handle huge amounts of data​

●​ Distributed: Built for cloud and distributed data centers​

●​ Fast performance: Optimized for high-speed read/write​

When to use NoSQL?


✅ When dealing with big data​
✅ For real-time web apps (ex: chats, recommendations)​
✅ For flexible schema requirements (ex: IoT, user profiles)​
✅ When scaling horizontally across servers

When NOT to use NoSQL


❌ When data is highly relational and consistent (ex: Banking)​
❌ When strong ACID compliance is mandatory​
❌ For complex joins and transactions

Summary
●​ NoSQL = Not Only SQL → designed for modern apps​

●​ Supports various models: document, key-value, column, graph​

●​ Scales horizontally and handles unstructured/large data​

●​ Use wisely → it's not always the replacement for RDBMS

MongoDB: Introduction
MongoDB is a popular NoSQL, document-oriented database.​
It stores data in JSON-like documents (actually BSON - Binary JSON) which
makes it flexible and easy to use.

✅ Key Features
●​ Schema-less → Flexible document structures​

●​ Scalable → Easily distributed across servers​


●​ High performance → Fast reads and writes​

●​ Rich query language → Supports powerful queries​

📦 Example of a document:
json

{
  "name": "John",
  "age": 25,
  "email": "john@example.com"
}

📚 MongoDB Data Types


| Data Type | Description | Example |
| --- | --- | --- |
| String | Text data | "Hello" |
| Integer | Number | 42 |
| Double | Decimal | 42.5 |
| Boolean | True/False | true |
| Array | List of values | ["red", "green"] |
| Object | Embedded document | { "city": "Delhi" } |
| Date | Date/time | ISODate("2023-05-01T00:00:00Z") |
| Null | Null value | null |
| ObjectId | Automatically generated unique ID | ObjectId("...") |

✍️ Creating Documents
To insert documents into a collection:

db.users.insertOne({

name: "Alice",

age: 30,

email: "alice@example.com"

})
For multiple documents:

db.users.insertMany([

{ name: "Bob", age: 25 },

{ name: "Charlie", age: 28 }

])

🔄 Updating Documents
// Update one document

db.users.updateOne(
  { name: "Alice" },
  { $set: { age: 31 } }
)

// Update many documents

db.users.updateMany(
  { age: { $lt: 30 } },
  { $set: { status: "young" } }
)

Operators: $set, $inc, $rename, $unset, etc.

❌ Deleting Documents
// Delete one document

db.users.deleteOne({ name: "Charlie" })

// Delete many documents

db.users.deleteMany({ age: { $lt: 25 } })

🔎 Querying Documents
// Find all documents
db.users.find()

// Find with condition

db.users.find({ age: { $gt: 25 } })

// Find and project only name field

db.users.find({ age: { $gt: 25 } }, { name: 1 })

// Sort and limit

db.users.find().sort({ age: -1 }).limit(2)

MongoDB supports conditions ($gt, $lt, $in, $or, etc.)

Introduction to Indexing
Index → Makes search/query operations faster.

Without index → MongoDB scans all documents (slow for large collections).​
With index → MongoDB finds data faster (just like an index page of a book).

// Create index on "name" field

db.users.createIndex({ name: 1 })
// List all indexes

db.users.getIndexes()

Types of Indexes: Single field, Compound, Text, Geospatial, etc.

📦 Capped Collections
Capped collection → Fixed-size collection (cannot grow beyond limit).

✅ High performance for inserts​


✅ Auto removes oldest documents when size limit is reached​
✅ Useful for logs, cache, real-time data
// Create capped collection

db.createCollection("logs", { capped: true, size: 100000 })

// Insert into capped collection

db.logs.insertOne({ event: "User login", time: new Date() })

Important: Individual documents cannot be deleted and the collection cannot be resized → documents are automatically removed in FIFO order when the collection is full.
✅ Summary (Quick Table)
| Operation | MongoDB Command |
| --- | --- |
| Insert | insertOne(), insertMany() |
| Update | updateOne(), updateMany() |
| Delete | deleteOne(), deleteMany() |
| Query | find() with filters |
| Indexing | createIndex() |
| Capped Collection | createCollection() with { capped: true } |

Database-Level Commands
| Command | Description |
| --- | --- |
| show dbs | List all databases |
| use myDatabase | Switch to or create a new database |
| db | Show current database name |
| db.dropDatabase() | Delete the current database |

📁 Collection-Level Commands
| Command | Description |
| --- | --- |
| show collections | List all collections in the current database |
| db.createCollection("myCollection") | Create a new collection |
| db.myCollection.drop() | Delete a collection |

📝 CRUD Operations
Create

db.collection.insertOne({ name: "Anshika", age: 25 });
db.collection.insertMany([{ name: "Himank" }, { name: "Bhavishya" }]);

Read

db.collection.find(); // All documents


db.collection.find({ name: "Anshika" }); // Filtered
db.collection.findOne({ age: 25 });

Update
db.collection.updateOne({ name: "Anshika" }, { $set: { age: 26 } });
db.collection.updateMany({ age: 25 }, { $inc: { age: 1 } });

Delete

db.collection.deleteOne({ name: "Bhavishya" });


db.collection.deleteMany({ age: { $lt: 20 } });

🔍 Query Operators
| Operator | Example | Meaning |
| --- | --- | --- |
| $gt | { age: { $gt: 18 } } | Greater than |
| $lt | { age: { $lt: 30 } } | Less than |
| $in | { name: { $in: ["Anshika", "Himank"] } } | Match any value in array |
| $and | { $and: [{ age: 25 }, { name: "Anshika" }] } | AND condition |
| $or | { $or: [{ age: 25 }, { name: "Bhavishya" }] } | OR condition |

📊 Aggregation
db.collection.aggregate([
{ $match: { age: { $gte: 25 } } },
{ $group: { _id: "$age", total: { $sum: 1 } } }
]);

🧾 Indexes
db.collection.createIndex({ name: 1 }); // Ascending
db.collection.dropIndex("name_1");

What is Apache Spark?


Apache Spark is an open-source, distributed computing framework designed for
big data processing. It performs fast, in-memory computations and supports batch,
streaming, machine learning, and graph processing workloads.

⚙️ Key Features of Apache Spark


| Feature | Description |
| --- | --- |
| Speed | Processes data up to 100x faster than Hadoop MapReduce using in-memory computing. |
| Ease of Use | Supports high-level APIs in Java, Scala, Python (PySpark), and R. |
| Multi-Engine Support | Runs batch jobs, streaming data, machine learning, and graph processing in one platform. |
| Fault Tolerance | Automatically recovers from node failures using RDD lineage. |
| Works with Hadoop | Can run on Hadoop clusters and access HDFS data. |
🧠 Core Concepts of Spark
1. RDD (Resilient Distributed Dataset)

●​ The fundamental data structure of Spark.​

●​ Immutable, distributed collection of objects.​

●​ Can be created from Hadoop files, local data, or transformed from existing
RDDs.​

Key Properties:

●​ Resilient: Recovers from failure.​

●​ Distributed: Stored across multiple nodes.​

●​ Lazy Evaluation: Transforms are not executed until an action is called.​

2. Transformations and Actions

| Transformations | Actions |
| --- | --- |
| map() | collect() |
| filter() | count() |
| flatMap() | reduce() |
| union() | take(n) |
| groupByKey() | saveAsTextFile() |

●​ Transformations: Create a new RDD from existing ones (lazy).
●​ Actions: Trigger computation and return a result (see the sketch below).
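
A minimal PySpark sketch of this idea (assuming PySpark is installed; the variable names are illustrative, not from the notes). The map/filter calls only describe new RDDs, and nothing is computed until an action such as collect() or count() runs:

python

from pyspark import SparkContext

sc = SparkContext.getOrCreate()  # reuse the running context if there is one

# Build an RDD from a local Python list (4 partitions is an arbitrary choice).
numbers = sc.parallelize([1, 2, 3, 4, 5, 6], 4)

# Transformations: lazily describe new RDDs, nothing runs yet.
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# Actions: trigger the actual computation on the executors.
print(evens.collect())                      # [4, 16, 36]
print(squares.count())                      # 6
print(squares.reduce(lambda a, b: a + b))   # 91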

3. Spark Components (Libraries)

| Component | Use |
| --- | --- |
| Spark Core | Basic functions (task scheduling, memory management, fault recovery). |
| Spark SQL | SQL interface and DataFrame API. |
| Spark Streaming | Real-time data processing. |
| MLlib | Machine learning library. |
| GraphX | Graph processing and analysis. |


4. Spark Architecture

Driver Program → Cluster Manager (YARN, Mesos, Standalone) → Worker Nodes → Executors (run tasks)

●​ Driver: Main program that defines transformations/actions.​

●​ Cluster Manager: Allocates resources (YARN, Mesos, Standalone).​

●​ Executors: Run tasks and store data.​
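
As a rough illustration (not taken from the notes; the application name and executor settings below are made-up example values), this is how the driver-side entry point is typically created. The master URL decides which cluster manager the driver talks to:

python

from pyspark.sql import SparkSession

# The driver program starts here. "local[*]" keeps everything on one machine,
# while "yarn" or "spark://host:7077" would hand resource allocation to a
# cluster manager that launches executors on worker nodes.
spark = (
    SparkSession.builder
    .appName("architecture-demo")           # hypothetical application name
    .master("local[*]")
    .config("spark.executor.memory", "2g")  # example executor sizing
    .config("spark.executor.cores", "2")
    .getOrCreate()
)

sc = spark.sparkContext   # lower-level handle used for RDD examples
print(sc.master)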

5. Anatomy of a Spark Job

When an action (like .count()) is called:

1.​ Spark builds a DAG (Directed Acyclic Graph).​

2.​ DAG is divided into Stages (based on data shuffling).​

3.​ Each stage is divided into Tasks (each runs on a partition).​

4.​ Executors execute tasks and return results to the Driver.​
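
A small sketch of that flow (assuming a SparkContext named sc, as in the earlier examples): reduceByKey needs all values for a key on the same partition, so Spark inserts a shuffle and the single collect() action becomes one job with two stages of tasks:

python

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

words = sc.parallelize(["spark", "hadoop", "spark", "hive", "spark"], 2)

# Narrow transformation: each partition is mapped independently.
pairs = words.map(lambda w: (w, 1))

# Wide transformation: requires a shuffle, which marks a stage boundary.
counts = pairs.reduceByKey(lambda a, b: a + b)

# The action builds the DAG, splits it into stages and tasks, and runs them.
print(counts.collect())   # e.g. [('spark', 3), ('hadoop', 1), ('hive', 1)]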


🖥️ Spark Deployment Modes
| Mode | Description |
| --- | --- |
| Local | Runs on a single machine (for development/testing). |
| Standalone | Uses Spark’s built-in cluster manager. |
| YARN | Uses Hadoop YARN for resource management. |
| Mesos | Another general-purpose cluster manager. |
| Kubernetes | Container-based deployment option. |

⚡ Spark on YARN
●​ Spark can run on Hadoop YARN.​

●​ Two modes:​

○​ Client Mode: Driver runs on the client machine.​

○​ Cluster Mode: Driver runs inside a YARN container (recommended for production).

●​ Allows Spark to run alongside Hadoop jobs, using the same cluster resources.​
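
For illustration only (this assumes a configured Hadoop/YARN cluster and that HADOOP_CONF_DIR points at its configuration files): in client mode the application can simply ask for YARN as the master, while cluster mode is normally selected at submission time, for example with spark-submit --master yarn --deploy-mode cluster.

python

from pyspark.sql import SparkSession

# Client mode: this Python process is the driver and YARN only provides
# the executors. Cluster mode is chosen when submitting, not in code.
spark = (
    SparkSession.builder
    .appName("yarn-client-demo")   # hypothetical application name
    .master("yarn")
    .getOrCreate()
)

print(spark.sparkContext.master)   # "yarn"
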
📊 Spark SQL and DataFrames

●​ DataFrame: Similar to a SQL table.​

●​ Supports SQL queries:​

python

df = spark.read.csv("file.csv", header=True)

df.select("name", "age").filter("age > 25").show()

●​ Register DataFrames as temporary tables:​

python

df.createOrReplaceTempView("people")
spark.sql("SELECT * FROM people WHERE age > 25").show()

🌊 Spark Streaming
●​ Processes real-time data from sources like Kafka, Flume, or TCP sockets.​

●​ Converts streaming data into DStreams (Discretized Streams).​

python

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 1) # 1 second batch interval

lines = ssc.socketTextStream("localhost", 9999)

lines.pprint()

ssc.start()

ssc.awaitTermination()

🤖 Spark MLlib (Machine Learning)


●​ Offers pre-built algorithms:​

○​ Classification, Regression​

○​ Clustering (KMeans)​
○​ Collaborative Filtering​

○​ Dimensionality Reduction (PCA)​

●​ Supports pipelines, transformers, and evaluators.​
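
A minimal MLlib sketch (the toy data and column names are made up for illustration): a VectorAssembler turns the numeric columns into a single feature vector, and a Pipeline chains it with a KMeans estimator:

python

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# Tiny made-up dataset with two numeric features.
df = spark.createDataFrame(
    [(1.0, 1.2), (0.8, 1.0), (8.0, 9.1), (8.2, 8.9)],
    ["x", "y"],
)

# MLlib estimators expect a single vector column of features.
assembler = VectorAssembler(inputCols=["x", "y"], outputCol="features")
kmeans = KMeans(k=2, seed=42, featuresCol="features")

# The pipeline fits the stages in order and returns a reusable model.
model = Pipeline(stages=[assembler, kmeans]).fit(df)
model.transform(df).select("x", "y", "prediction").show()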

🧠 GraphX
●​ Enables graph computation on RDDs.​

●​ Use cases:​

○​ Social network analysis​

○​ PageRank​

○​ Connected components​

✅ Advantages of Spark
●​ High performance (in-memory)​

●​ Unified platform for all data workloads​

●​ Easy APIs for faster development​

●​ Fault tolerance and scalability​

●​ Works well with HDFS, Hive, Cassandra, HBase, and more​

❗ Limitations
●​ High memory consumption​

●​ Not ideal for low-latency applications (real-time dashboards)​

●​ Requires tuning for very large-scale production use​

🎓 Final Summary Table


| Concept | Summary |
| --- | --- |
| Spark Core | Task scheduling, memory & fault management |
| RDD | Immutable, distributed data structure |
| Spark SQL | Run SQL queries on structured data |
| Spark Streaming | Real-time stream processing |
| MLlib | Machine learning library |
| GraphX | Graph analytics |
| Deployment | Local, Standalone, YARN, Kubernetes |
| Jobs → Stages → Tasks | Execution model of a Spark application |

Background
At its core, Apache Spark is an open-source distributed computing system designed to quickly process large volumes of data that could hardly be handled on a single machine. Spark distributes data and computations across multiple machines, allowing for parallel processing.

It was first developed at UC Berkeley’s AMPLab in 2009. At that time, Hadoop MapReduce was the leading parallel programming engine for processing massive datasets across multiple machines. AMPLab collaborated with early MapReduce users to identify its strengths and limitations, driving the creation of more versatile computing platforms. They also worked closely with Hadoop users at UC Berkeley, who focused on large-scale machine learning requiring iterative algorithms and multiple data passes.

These discussions highlighted some insights. Cluster computing had significant potential. However, MapReduce made building large applications inefficient, especially for machine learning tasks requiring multiple data passes. For example, a machine learning algorithm might need to make many passes over the data, and with MapReduce each pass must be written as a separate job and launched individually on the cluster.

To address this, the Spark team created a functional programming-based API to simplify multistep applications and developed a new engine for efficient in-memory data sharing across computation steps.

The Spark Application Architecture


A typical Spark application consists of several key components:
●​ Driver: This JVM process manages the Spark application,
handling user input and distributing work to the executors.
●​ Cluster Manager: This component oversees the cluster of
machines running the Spark application. Spark can work with
various cluster managers, including YARN, Apache Mesos, or
its standalone manager.
●​ Executors: These processes execute tasks the driver assigns
and report their status and results. Each Spark application has
its own set of executors. A single worker node can host
multiple executors.
Job, Stage, and Task
●​ Job: In Spark, a job represents a series of transformations
applied to data. It encompasses the entire workflow from start
to finish.
●​ Stage: A stage is a job segment executed without data shuffling.
Spark splits the job into different stages when a transformation
requires shuffling data across partitions.
●​ Task: A task is the smallest unit of execution within Spark.
Each stage is divided into multiple tasks, running the same
code on a separate data partition executed by individual
executors.

In Spark, a job is divided into stages wherever data shuffling is necessary. Each stage is further broken down into tasks that are executed in parallel across different data partitions. A single Spark application can have more than one Spark job.

Resilient Distributed Dataset (RDD)


RDD is Spark’s primary data abstraction. Whether DataFrames or Datasets are used, they are compiled into RDDs behind the scenes. An RDD represents an immutable, partitioned collection of records that can be operated on in parallel. Data inside an RDD is kept in memory for as long, and as much, as possible.

Properties
Internally, each RDD in Spark has five key properties:

●​ List of Partitions: The RDD is divided into partitions, which are the units of parallelism in Spark.
●​ Computation Function: A function determines how to
compute the data for each partition.
●​ Dependencies: The RDD keeps track of its dependencies on
other RDDs, which describes how it was created.
●​ Partitioner (Optional): For key-value RDDs, a partitioner
specifies how the data is partitioned, such as using a hash
partitioner.
●​ Preferred Locations (Optional): This property lists the
preferred locations for computing each partition, such as the
data block locations in the HDFS.

Lazy Evaluation
When you define an RDD, its data is not loaded or transformed immediately; nothing happens until an action triggers the execution. This approach allows Spark to determine the most efficient way to execute the transformations.
●​ Transformations, such as map or filter, are operations that
define how the data should be transformed, but they don't
execute until an action forces the computation. Spark doesn't
modify the original RDD when a transformation is applied to
an RDD. Instead, it creates a new RDD that represents the
result of applying the transformation, because RDDs are immutable.
●​ Actions are the commands that Spark runs to produce output
or store data, thereby driving the actual execution of the
transformations.
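
To make the laziness concrete, here is a small sketch (assuming a SparkContext named sc): the transformation chain returns instantly because Spark only records the lineage, and the data is processed only when an action such as take() or count() runs:

python

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Returns immediately: Spark just records range -> map -> filter,
# it does not materialise ten million numbers.
big = sc.range(0, 10_000_000)
transformed = big.map(lambda x: x * 2).filter(lambda x: x % 3 == 0)

# Work happens only when an action asks for a result.
print(transformed.take(3))   # [0, 6, 12]
print(transformed.count())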

Partitions
When an RDD is created, Spark divides the data into multiple chunks, known as partitions. Each partition is a logical subset of the data and can be processed independently by different executors. This enables Spark to perform operations on large datasets in parallel.
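
A quick way to see this in practice (again assuming a SparkContext named sc): glom() groups each partition's elements into a list so the split is visible, and repartition() reshuffles the data into a different number of partitions:

python

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Ask for 4 partitions explicitly; each can be processed by a different executor.
rdd = sc.parallelize(range(100), 4)
print(rdd.getNumPartitions())                          # 4

# glom() turns each partition into a list, showing how the data was split.
print([len(part) for part in rdd.glom().collect()])    # [25, 25, 25, 25]

# repartition() shuffles the data into a new number of partitions.
print(rdd.repartition(8).getNumPartitions())           # 8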

Note: I’ll explore Spark partitions in detail in an upcoming article.

Fault Tolerance
Spark RDDs achieve fault tolerance through lineage. Spark forms the dependency lineage graph by keeping track of each RDD’s dependencies on other RDDs, which is the series of transformations that created it. Suppose any partition of an RDD is lost due to a node failure or other issue. In that case, Spark can reconstruct the lost data by reapplying the transformations to the original dataset described by the lineage. This approach eliminates the need to replicate data across nodes. Instead, Spark only needs to recompute the lost partitions, making the system efficient and resilient to failures.
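
For illustration, the lineage that Spark keeps for recovery can be printed with toDebugString() (a minimal sketch assuming a SparkContext named sc):

python

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

rdd = (
    sc.parallelize(range(10), 2)
      .map(lambda x: (x % 3, x))
      .reduceByKey(lambda a, b: a + b)
)

# toDebugString() shows the chain of dependencies (the lineage graph);
# if a partition is lost, Spark replays these steps for that partition only.
print(rdd.toDebugString().decode())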

Why are RDDs immutable?


You might wonder why Spark RDDs are immutable. Here’s the gist:
●​ Concurrent Processing: Immutability keeps data consistent
across multiple nodes and threads, avoiding complex
synchronization and race conditions.
●​ Lineage and Fault Tolerance: Each transformation creates a
new RDD, preserving the lineage and allowing Spark to
recompute lost data reliably. Mutable RDDs would make this
much harder.
●​ Functional Programming: RDDs follow functional
programming principles that emphasize immutability, making
it easier to handle failures and maintain data integrity.

The journey of the Spark application


Before diving into the flow of a Spark application, it’s essential to

understand the different execution modes Spark offers. We have three

options:

●​ Cluster Mode: In this mode, the driver process is launched on a worker node within the cluster alongside the executor processes. The cluster manager handles all the processes related to the Spark application.
●​ Client Mode: The driver remains on the client machine that
submitted the application. This setup requires the client
machine to maintain the driver process throughout the
application’s execution.
●​ Local mode: This mode runs the entire Spark application on a
single machine, achieving parallelism through multiple
threads. It’s commonly used for learning Spark or testing
applications in a simpler, local environment.
SCALA

SCALA: Introduction
Scala = Scalable Language​
It is a modern, hybrid programming language that combines:

✅ Object-Oriented Programming (OOP) → Like Java (Classes, Objects, Inheritance)​


✅ Functional Programming (FP) → Functions, Immutable data, Closures, etc.
✅ Features
●​ Runs on JVM (Java Virtual Machine) → Can use Java libraries​

●​ Statically typed → Type checking at compile time​

●​ Immutable collections and concurrency-friendly​

●​ Concise syntax compared to Java​

📦 Classes and Objects in Scala


➡️ Defining a Class
class Person(val name: String, val age: Int) {
def greet(): Unit = {
println(s"Hello, my name is $name and I am $age years old.")
}
}

●​ val name → Immutable (read-only)​


●​ def → Defines a method​

●​ Unit → Equivalent to void in Java​

➡️ Creating an Object
val p = new Person("Alice", 25)
p.greet()

➡️ Singleton Object (object)


object Hello {
def main(args: Array[String]): Unit = {
println("Hello, Scala!")
}
}

●​ No static keyword → Use object for static-like behavior.​

🔢 Basic Types and Operators


| Type | Example |
| --- | --- |
| Int | 10 |
| Double | 10.5 |
| Boolean | true |
| String | "Scala" |
| Char | 'A' |

➡️ Operators
| Operator | Example |
| --- | --- |
| Arithmetic | +, -, *, /, % |
| Relational | ==, !=, >, <, >=, <= |
| Logical | &&, \|\|, ! |

val a = 10
val b = 5

println(a + b) // 15
println(a > b) // true

🔧 Built-in Control Structures


➡️ If-Else
val num = 5
if (num > 0)
println("Positive")
else
println("Negative")

➡️ Match (like switch)


val grade = "A"
grade match {
case "A" => println("Excellent")
case "B" => println("Good")
case _ => println("Other")
}

➡️ For loop
for (i <- 1 to 5)
println(i)

➡️ While loop
var i = 0
while (i < 5) {
println(i)
i += 1
}

📌 Functions and Closures


➡️ Function
def add(a: Int, b: Int): Int = {
return a + b
}
println(add(3, 4))

OR Shorter form:
def add(a: Int, b: Int) = a + b

➡️ Anonymous Function (Lambda)


val square = (x: Int) => x * x
println(square(5)) // 25

➡️ Closure
A closure is a function which uses variables from its surrounding scope.

val factor = 3
val multiplier = (x: Int) => x * factor
println(multiplier(4)) // 12

factor is captured → this is a closure.

🧬 Inheritance
// Parent class
class Animal {
def sound(): Unit = {
println("Animal makes sound")
}
}

// Child class
class Dog extends Animal {
override def sound(): Unit = {
println("Dog barks")
}
}

val d = new Dog()


d.sound() // Dog barks

➡️ Points
●​ extends → for inheritance​

●​ override → to override parent methods​

●​ By default, methods in Scala are virtual → can be overridden​

Summary Table
| Concept | Description | Example |
| --- | --- | --- |
| Class | Blueprint of objects | class Person |
| Object | Singleton object or create instance | object Hello or new Person |
| Types | Int, Double, Boolean, String, etc. | val a: Int = 10 |
| Operators | Arithmetic, Relational, Logical | +, ==, && |
| Control Structures | if-else, for, while, match | if, for (i <- ...) |
| Functions | Named/Anonymous functions | def add(a, b) / (x: Int) => ... |
| Closures | Functions with captured variables | val factor = 3 ... |
| Inheritance | Reusing parent class features | class Dog extends Animal |
