UNIT-4
NoSQL Databases: Introduction to NoSQL
NoSQL
NoSQL (Not Only SQL) refers to a new generation of databases that provide a
mechanism for storage and retrieval of data that is modeled in ways other than
the tabular relations used in relational databases (RDBMS).
While traditional databases use tables (rows and columns), NoSQL databases
can store:
● Documents
● Key-Value pairs
● Graph structures
● Wide-columns
Key point: NoSQL databases are designed for flexibility, scalability,
and handling large volumes of unstructured or semi-structured data.
Why NoSQL?
Relational databases are excellent for structured data with clear relationships.
But today's apps (social media, big data, IoT, AI) need to handle:
✅ Large volumes of data (Big Data)
✅ Unstructured or semi-structured data
✅ High velocity and variety of data
✅ Horizontal scaling (adding more servers easily)
RDBMS sometimes struggles in these scenarios — that's where NoSQL shines.
Types of NoSQL Databases
Type | Example | Description
Document-based | MongoDB, CouchDB | Stores data as JSON-like documents
Key-Value | Redis, DynamoDB | Data is stored as key-value pairs
Column-family | Cassandra, HBase | Stores data in columns instead of rows
Graph-based | Neo4j, Amazon Neptune | Stores data as nodes and edges for relationships
Features of NoSQL
● Schema-less: Flexible data models
● Highly scalable: Can handle huge amounts of data
● Distributed: Built for cloud and distributed data centers
● Fast performance: Optimized for high-speed read/write
When to use NoSQL?
✅ When dealing with big data
✅ For real-time web apps (ex: chats, recommendations)
✅ For flexible schema requirements (ex: IoT, user profiles)
✅ When scaling horizontally across servers
When NOT to use NoSQL
❌ When data is highly relational and requires strong consistency (ex: Banking)
❌ When strong ACID compliance is mandatory
❌ For complex joins and transactions
Summary
● NoSQL = Not Only SQL → designed for modern apps
● Supports various models: document, key-value, column, graph
● Scales horizontally and handles unstructured/large data
● Use wisely → it is not always a replacement for RDBMS
MongoDB: Introduction
MongoDB is a popular NoSQL, document-oriented database.
It stores data in JSON-like documents (actually BSON - Binary JSON) which
makes it flexible and easy to use.
✅ Key Features
● Schema-less → Flexible document structures
● Scalable → Easily distributed across servers
● High performance → Fast reads and writes
● Rich query language → Supports powerful queries
📦 Example of a document:
json
{
  "name": "John",
  "age": 25,
  "email": "john@example.com"
}
📚 MongoDB Data Types
Data Type | Description | Example
String | Text data | "Hello"
Integer | Number | 42
Double | Decimal | 42.5
Boolean | True/False | true
Array | List of values | ["red", "green"]
Object | Embedded document | { "city": "Delhi" }
Date | Date/time | ISODate("2023-05-01T00:00:00Z")
Null | Null value | null
ObjectId | Unique ID, automatically generated | ObjectId("...")
✍️ Creating Documents
To insert documents into a collection:
db.users.insertOne({
name: "Alice",
age: 30,
email: "alice@example.com"
})
For multiple documents:
db.users.insertMany([
{ name: "Bob", age: 25 },
{ name: "Charlie", age: 28 }
])
🔄 Updating Documents
// Update one document
db.users.updateOne(
  { name: "Alice" },
  { $set: { age: 31 } }
)
// Update many documents
db.users.updateMany(
  { age: { $lt: 30 } },
  { $set: { status: "young" } }
)
Operators: $set, $inc, $rename, $unset, etc.
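For example, a few of these operators applied to the same users collection (the fields are illustrative):
db.users.updateOne({ name: "Alice" }, { $inc: { age: 1 } })           // increment a numeric field
db.users.updateOne({ name: "Alice" }, { $rename: { email: "mail" } }) // rename a field
db.users.updateOne({ name: "Alice" }, { $unset: { status: "" } })     // remove a field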
❌ Deleting Documents
// Delete one document
db.users.deleteOne({ name: "Charlie" })
// Delete many documents
db.users.deleteMany({ age: { $lt: 25 } })
🔎 Querying Documents
// Find all documents
db.users.find()
// Find with condition
db.users.find({ age: { $gt: 25 } })
// Find and project only name field
db.users.find({ age: { $gt: 25 } }, { name: 1 })
// Sort and limit
db.users.find().sort({ age: -1 }).limit(2)
MongoDB supports conditions ($gt, $lt, $in, $or, etc.)
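For example (illustrative queries against the same users collection):
// Match any of several values
db.users.find({ name: { $in: ["Alice", "Bob"] } })
// Combine alternative conditions with $or
db.users.find({ $or: [ { age: { $lt: 25 } }, { age: { $gt: 60 } } ] })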
Introduction to Indexing
Index → Makes search/query operations faster.
Without index → MongoDB scans all documents (slow for large collections).
With index → MongoDB finds data faster (just like an index page of a book).
// Create index on "name" field
db.users.createIndex({ name: 1 })
// List all indexes
db.users.getIndexes()
Types of Indexes: Single field, Compound, Text, Geospatial, etc.
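For illustration, a compound index and a text index (the bio field is hypothetical):
// Compound index on name (ascending) and age (descending)
db.users.createIndex({ name: 1, age: -1 })
// Text index to support keyword search on a string field
db.users.createIndex({ bio: "text" })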
📦 Capped Collections
Capped collection → Fixed-size collection (cannot grow beyond limit).
✅ High performance for inserts
✅ Auto removes oldest documents when size limit is reached
✅ Useful for logs, cache, real-time data
// Create capped collection
db.createCollection("logs", { capped: true, size: 100000 })
// Insert into capped collection
db.logs.insertOne({ event: "User login", time: new Date() })
Important: Documents cannot be individually deleted and the collection cannot be resized;
once the size limit is reached, the oldest documents are automatically removed in FIFO order.
✅ Summary (Quick Table)
Operation | MongoDB Command
Insert | insertOne(), insertMany()
Update | updateOne(), updateMany()
Delete | deleteOne(), deleteMany()
Query | find() with filters
Indexing | createIndex()
Capped Collection | createCollection() with { capped: true }
Database-Level Commands
Command | Description
show dbs | List all databases
use myDatabase | Switch to or create a database
db | Show current database name
db.dropDatabase() | Delete the current database
📁 Collection-Level Commands
Command | Description
show collections | List all collections in the current database
db.createCollection("myCollection") | Create a new collection
db.myCollection.drop() | Delete a collection
📝 CRUD Operations
Create
db.collection.insertOne({ name: "Anshika", age: 25 });
db.collection.insertMany([{ name: "Himank" }, { name: "Bhavishya" }]);
Read
db.collection.find(); // All documents
db.collection.find({ name: "Anshika" }); // Filtered
db.collection.findOne({ age: 25 });
Update
db.collection.updateOne({ name: "Anshika" }, { $set: { age: 26 } });
db.collection.updateMany({ age: 25 }, { $inc: { age: 1 } });
Delete
db.collection.deleteOne({ name: "Bhavishya" });
db.collection.deleteMany({ age: { $lt: 20 } });
🔍 Query Operators
Operator | Example | Meaning
$gt | { age: { $gt: 18 } } | Greater than
$lt | { age: { $lt: 30 } } | Less than
$in | { name: { $in: ["Anshika", "Himank"] } } | Match any value in array
$and | { $and: [{ age: 25 }, { name: "Anshika" }] } | AND condition
$or | { $or: [{ age: 25 }, { name: "Bhavishya" }] } | OR condition
📊 Aggregation
db.collection.aggregate([
  { $match: { age: { $gte: 25 } } },              // keep only documents with age >= 25
  { $group: { _id: "$age", total: { $sum: 1 } } } // group by age and count documents per group
]);
🧾 Indexes
db.collection.createIndex({ name: 1 }); // Ascending
db.collection.dropIndex("name_1");
What is Apache Spark?
Apache Spark is an open-source, distributed computing framework designed for
big data processing. It performs fast, in-memory computations and supports batch,
streaming, machine learning, and graph processing workloads.
⚙️ Key Features of Apache Spark
Feature | Description
Speed | Processes data up to 100x faster than Hadoop MapReduce using in-memory computing.
Ease of Use | Supports high-level APIs in Java, Scala, Python (PySpark), and R.
Multi-Engine Support | Runs batch jobs, streaming data, machine learning, and graph processing on one platform.
Fault Tolerance | Automatically recovers from node failures using RDD lineage.
Works with Hadoop | Can run on Hadoop clusters and access HDFS data.
🧠 Core Concepts of Spark
1. RDD (Resilient Distributed Dataset)
● The fundamental data structure of Spark.
● Immutable, distributed collection of objects.
● Can be created from Hadoop files, local data, or transformed from existing
RDDs.
Key Properties:
● Resilient: Recovers from failure.
● Distributed: Stored across multiple nodes.
● Lazy Evaluation: Transforms are not executed until an action is called.
2. Transformations and Actions
Transformations | Actions
map() | collect()
filter() | count()
flatMap() | reduce()
union() | take(n)
groupByKey() | saveAsTextFile()
● Transformations: Create a new RDD from existing ones (lazy).
● Actions: Trigger computation and return a result.
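For illustration, a minimal PySpark sketch (assuming a local Spark installation) showing that transformations stay lazy until an action runs:
python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

nums = sc.parallelize([1, 2, 3, 4, 5])      # create an RDD
evens = nums.filter(lambda x: x % 2 == 0)   # transformation: recorded, not executed
squares = evens.map(lambda x: x * x)        # another lazy transformation
print(squares.collect())                    # action: triggers the computation -> [4, 16]
print(squares.count())                      # another action -> 2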
3. Spark Components (Libraries)
Component | Use
Spark Core | Basic functions (task scheduling, memory management, fault recovery).
Spark SQL | SQL interface and DataFrame API.
Spark Streaming | Real-time data processing.
MLlib | Machine learning library.
GraphX | Graph processing and analysis.
4. Spark Architecture
Driver Program → Cluster Manager (YARN, Mesos, Standalone) → Worker Nodes → Executors (run tasks)
● Driver: Main program that defines transformations/actions.
● Cluster Manager: Allocates resources (YARN, Mesos, Standalone).
● Executors: Run tasks and store data.
5. Anatomy of a Spark Job
When an action (like .count()) is called:
1. Spark builds a DAG (Directed Acyclic Graph).
2. DAG is divided into Stages (based on data shuffling).
3. Each stage is divided into Tasks (each runs on a partition).
4. Executors execute tasks and return results to the Driver.
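As a rough sketch (hypothetical input path, assuming an existing SparkContext sc), a word-count style job that Spark splits into two stages at the shuffle:
python
pairs = sc.textFile("hdfs:///tmp/input.txt") \
          .flatMap(lambda line: line.split()) \
          .map(lambda word: (word, 1))           # narrow transformations -> first stage
counts = pairs.reduceByKey(lambda a, b: a + b)   # shuffle boundary -> second stage
print(counts.count())                            # action: builds the DAG and runs the tasks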
🖥️ Spark Deployment Modes
Mode | Description
Local | Runs on a single machine (for development/testing).
Standalone | Uses Spark’s built-in cluster manager.
YARN | Uses Hadoop YARN for resource management.
Mesos | Another general-purpose cluster manager.
Kubernetes | Container-based deployment option.
⚡ Spark on YARN
● Spark can run on Hadoop YARN.
● Two modes:
○ Client Mode: Driver runs on the client machine.
○ Cluster Mode: Driver runs inside YARN container (recommended for
production).
● Allows Spark to run alongside Hadoop jobs, using the same cluster resources.
📊 Spark SQL and DataFrames
● DataFrame: Similar to a SQL table.
● Supports SQL queries:
python
df = spark.read.csv("file.csv", header=True)
df.select("name", "age").filter("age > 25").show()
● Register DataFrames as temporary tables:
python
df.createOrReplaceTempView("people")
spark.sql("SELECT * FROM people WHERE age > 25").show()
🌊 Spark Streaming
● Processes real-time data from sources like Kafka, Flume, or TCP sockets.
● Converts streaming data into DStreams (Discretized Streams).
python
from pyspark.streaming import StreamingContext
ssc = StreamingContext(sc, 1) # 1 second batch interval
lines = ssc.socketTextStream("localhost", 9999)
lines.pprint()
ssc.start()
ssc.awaitTermination()
🤖 Spark MLlib (Machine Learning)
● Offers pre-built algorithms:
○ Classification, Regression
○ Clustering (KMeans)
○ Collaborative Filtering
○ Dimensionality Reduction (PCA)
● Supports pipelines, transformers, and evaluators.
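A small, hedged sketch in the DataFrame-based pyspark.ml style; the toy data and column names are invented for illustration:
python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# Toy dataset with two numeric features per row
data = spark.createDataFrame(
    [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)], ["x", "y"])

# MLlib expects a single vector column, conventionally named "features"
features = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(data)

model = KMeans(k=2, seed=42).fit(features)        # cluster the rows into 2 groups
model.transform(features).select("x", "y", "prediction").show()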
🧠 GraphX
● Enables graph computation on RDDs.
● Use cases:
○ Social network analysis
○ PageRank
○ Connected components
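Since GraphX is exposed as a Scala API, here is a hedged spark-shell sketch with a toy "follows" graph (vertices and edges invented for illustration):
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Toy social graph: users as vertices, "follows" relationships as edges
val users: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val follows: RDD[Edge[String]] =
  sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")))

val graph = Graph(users, follows)
val ranks = graph.pageRank(0.0001).vertices   // run PageRank until the scores converge
ranks.collect().foreach(println)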
✅ Advantages of Spark
● High performance (in-memory)
● Unified platform for all data workloads
● Easy APIs for faster development
● Fault tolerance and scalability
● Works well with HDFS, Hive, Cassandra, HBase, and more
❗ Limitations
● High memory consumption
● Not ideal for low-latency applications (real-time dashboards)
● Requires tuning for very large-scale production use
🎓 Final Summary Table
Concept | Summary
Spark Core | Task scheduling, memory & fault management
RDD | Immutable, distributed data structure
Spark SQL | Run SQL queries on structured data
Spark Streaming | Real-time stream processing
MLlib | Machine learning library
GraphX | Graph analytics
Deployment | Local, Standalone, YARN, Kubernetes
Jobs → Stages → Tasks | Execution model of a Spark application
Background
At its core, Apache Spark is an open-source distributed computing
system designed to quickly process volumes of data too large to be
handled on a single machine. Spark distributes data and computations
across multiple machines, allowing for parallel processing.
It was first developed at UC Berkeley’s AMPLab in 2009. At that time,
Hadoop MapReduce was the leading parallel programming engine for
processing massive datasets across multiple machines. AMPLab
collaborated with early MapReduce users to identify its strengths and
limitations, driving the creation of more versatile computing platforms.
They also worked closely with Hadoop users at UC Berkeley, who
focused on large-scale machine learning requiring iterative algorithms
and multiple data passes.
These discussions highlighted two insights: cluster computing had
significant potential, but MapReduce made building large applications
inefficient, especially for machine learning tasks requiring multiple
data passes. For example, a machine learning algorithm might need to
make many passes over the data, and with MapReduce each pass must be
written as a separate job and launched individually on the cluster.
To address this, the Spark team created a functional programming-based
API to simplify multistep applications and developed a new engine for
efficient in-memory data sharing across computation steps.
The Spark Application Architecture
A typical Spark application consists of several key components:
● Driver: This JVM process manages the Spark application,
handling user input and distributing work to the executors.
● Cluster Manager: This component oversees the cluster of
machines running the Spark application. Spark can work with
various cluster managers, including YARN, Apache Mesos, or
its standalone manager.
● Executors: These processes execute tasks the driver assigns
and report their status and results. Each Spark application has
its own set of executors. A single worker node can host
multiple executors.
Job, Stage, and Task
● Job: In Spark, a job represents a series of transformations
applied to data. It encompasses the entire workflow from start
to finish.
● Stage: A stage is a job segment executed without data shuffling.
Spark splits the job into different stages when a transformation
requires shuffling data across partitions.
● Task: A task is the smallest unit of execution in Spark. Each
stage is divided into multiple tasks; every task runs the same
code on a separate data partition and is executed by an
individual executor.
In Spark, a job is divided into stages wherever data shuffling is
necessary. Each stage is further broken down into tasks that execute in
parallel across different data partitions. A single Spark application
can have more than one Spark job.
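A hedged sketch (assuming an existing SparkContext sc): each action launches its own job, and the reduceByKey shuffle introduces a stage boundary inside each job:
python
words = sc.parallelize(["spark", "scala", "spark", "python"])
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(counts.count())     # action 1 -> job 1
print(counts.collect())   # action 2 -> job 2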
Resilient Distributed Dataset (RDD)
The RDD is Spark's primary data abstraction. Whether you use DataFrames
or Datasets, they are compiled into RDDs behind the scenes. An RDD
represents an immutable, partitioned collection of records that can be
operated on in parallel. Data inside an RDD is kept in memory for as
long, and as much, as possible.
Properties
Internally, each RDD in Spark has five key properties:
● List of Partitions: The RDD is divided into partitions, which
are the units of parallelism in Spark.
● Computation Function: A function determines how to
compute the data for each partition.
● Dependencies: The RDD keeps track of its dependencies on
other RDDs, which describes how it was created.
● Partitioner (Optional): For key-value RDDs, a partitioner
specifies how the data is partitioned, such as using a hash
partitioner.
● Preferred Locations (Optional): This property lists the
preferred locations for computing each partition, such as the
data block locations in the HDFS.
Lazy Evaluation
When you define an RDD, its data is not loaded or transformed
immediately; nothing runs until an action triggers execution. This
approach allows Spark to determine the most efficient way to execute
the transformations.
● Transformations, such as map or filter, are operations that
define how the data should be transformed, but they don't
execute until an action forces the computation. Spark doesn't
modify the original RDD when a transformation is applied.
Instead, because RDDs are immutable, it creates a new RDD that
represents the result of applying the transformation.
● Actions are the commands that Spark runs to produce output
or store data, thereby driving the actual execution of the
transformations.
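A tiny sketch of this behaviour, assuming an existing SparkContext sc:
python
rdd = sc.parallelize(range(10))
filtered = rdd.filter(lambda x: x > 5)   # transformation: only recorded; rdd itself is unchanged
print(filtered.count())                  # action: the filter actually runs now -> 4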
Partitions
When an RDD is created, Spark divides the data into multiple chunks,
known as partitions. Each partition is a logical data subset and can be
processed independently with different executors. This enables Spark to
perform operations on large datasets in parallel.
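For example (a hedged sketch assuming an existing SparkContext sc):
python
rdd = sc.parallelize(range(100), numSlices=4)                # explicitly request 4 partitions
print(rdd.getNumPartitions())                                # 4
partial_sums = rdd.mapPartitions(lambda part: [sum(part)])   # each partition is processed independently
print(partial_sums.collect())                                # one partial sum per partition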
Fault Tolerance
Spark RDDs achieve fault tolerance through lineage. Spark forms the
dependency lineage graph by keeping track of each RDD’s dependencies
on other RDDs, which is the series of transformations that created it.
Suppose any partition of an RDD is lost due to a node failure or other
issues. In that case, Spark can reconstruct the lost data by reapplying the
transformations to the original dataset described by the lineage. This
approach eliminates the need to replicate data across nodes. Instead,
Spark only needs to recompute the lost partitions, making the system
efficient and resilient to failures.
Why RDDs Are Immutable
You might wonder why Spark RDDs are immutable. Here’s the gist:
● Concurrent Processing: Immutability keeps data consistent
across multiple nodes and threads, avoiding complex
synchronization and race conditions.
● Lineage and Fault Tolerance: Each transformation creates a
new RDD, preserving the lineage and allowing Spark to
recompute lost data reliably. Mutable RDDs would make this
much harder.
● Functional Programming: RDDs follow functional
programming principles that emphasize immutability, making
it easier to handle failures and maintain data integrity.
The Journey of a Spark Application
Before diving into the flow of a Spark application, it’s essential to
understand the different execution modes Spark offers. We have three
options:
● Cluster Mode: In this mode, the driver process is launched on a
worker node within the cluster alongside the executor
processes. The cluster manager handles all the processes
related to the Spark application.
● Client Mode: The driver remains on the client machine that
submitted the application. This setup requires the client
machine to maintain the driver process throughout the
application’s execution.
● Local mode: This mode runs the entire Spark application on a
single machine, achieving parallelism through multiple
threads. It’s commonly used for learning Spark or testing
applications in a simpler, local environment.
SCALA
SCALA: Introduction
Scala = Scalable Language
It is a modern, hybrid programming language that combines:
✅ Object-Oriented Programming (OOP) → Like Java (Classes, Objects, Inheritance)
✅ Functional Programming (FP) → Functions, Immutable data, Closures, etc.
✅ Features
● Runs on JVM (Java Virtual Machine) → Can use Java libraries
● Statically typed → Type checking at compile time
● Immutable collections and concurrency-friendly
● Concise syntax compared to Java
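For instance, a small sketch of the concise, immutable style:
val nums = List(1, 2, 3)        // immutable collection
val doubled = nums.map(_ * 2)   // returns a new List; nums is unchanged
println(doubled)                // List(2, 4, 6)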
📦 Classes and Objects in Scala
➡️ Defining a Class
class Person(val name: String, val age: Int) {
def greet(): Unit = {
println(s"Hello, my name is $name and I am $age years old.")
}
}
● val name → Immutable (read-only)
● def → Defines a method
● Unit → Equivalent to void in Java
➡️ Creating an Object
val p = new Person("Alice", 25)
p.greet()
➡️ Singleton Object (object)
object Hello {
def main(args: Array[String]): Unit = {
println("Hello, Scala!")
}
}
● No static keyword → Use object for static-like behavior.
🔢 Basic Types and Operators
Type | Example
Int | 10
Double | 10.5
Boolean | true
String | "Scala"
Char | 'A'
➡️ Operators
Category | Operators
Arithmetic | +, -, *, /, %
Relational | ==, !=, >, <, >=, <=
Logical | &&, ||, !
val a = 10
val b = 5
println(a + b) // 15
println(a > b) // true
🔧 Built-in Control Structures
➡️ If-Else
val num = 5
if (num > 0)
println("Positive")
else
println("Negative")
➡️ Match (like switch)
val grade = "A"
grade match {
case "A" => println("Excellent")
case "B" => println("Good")
case _ => println("Other")
}
➡️ For loop
for (i <- 1 to 5)
println(i)
➡️ While loop
var i = 0
while (i < 5) {
println(i)
i += 1
}
📌 Functions and Closures
➡️ Function
def add(a: Int, b: Int): Int = {
return a + b
}
println(add(3, 4))
OR Shorter form:
def add(a: Int, b: Int) = a + b
➡️ Anonymous Function (Lambda)
val square = (x: Int) => x * x
println(square(5)) // 25
➡️ Closure
A closure is a function which uses variables from its surrounding scope.
val factor = 3
val multiplier = (x: Int) => x * factor
println(multiplier(4)) // 12
factor is captured → this is a closure.
🧬 Inheritance
// Parent class
class Animal {
def sound(): Unit = {
println("Animal makes sound")
}
}
// Child class
class Dog extends Animal {
override def sound(): Unit = {
println("Dog barks")
}
}
val d = new Dog()
d.sound() // Dog barks
➡️ Points
● extends → for inheritance
● override → to override parent methods
● By default, methods in Scala are virtual → can be overridden
Summary Table
Concept | Description | Example
Class | Blueprint of objects | class Person
Object | Singleton object or created instance | object Hello or new Person
Types | Int, Double, Boolean, String, etc. | val a: Int = 10
Operators | Arithmetic, Relational, Logical | +, ==, &&
Control Structures | if-else, for, while, match | if, for (i <- ...)
Functions | Named/Anonymous functions | def add(a, b) / (x: Int) => ...
Closures | Functions with captured variables | val factor = 3 ...
Inheritance | Reusing parent class features | class Dog extends Animal