
UNIT-4

NoSQL Databases: Introduction to NoSQL

NoSQL
NoSQL (Not Only SQL) refers to a new generation of databases that provide a
mechanism for storage and retrieval of data that is modeled in ways other than
the tabular relations used in relational databases (RDBMS).

While traditional databases use tables (rows and columns), NoSQL databases
can store:

●​ Documents​

●​ Key-Value pairs​

●​ Graph structures​

●​ Wide-columns​

Key point: NoSQL databases are designed for flexibility, scalability, and handling large volumes of unstructured or semi-structured data.

Why NoSQL?
Relational databases are excellent for structured data with clear relationships.​
But today's apps (social media, big data, IoT, AI) need to handle:

✅ Large volumes of data (Big Data)
✅ Unstructured or semi-structured data
✅ High velocity and variety of data
✅ Horizontal scaling (adding more servers easily)
RDBMS sometimes struggles in these scenarios — that's where NoSQL shines.

Types of NoSQL Databases

| Type | Examples | Description |
| --- | --- | --- |
| Document-based | MongoDB, CouchDB | Stores data as JSON-like documents |
| Key-Value | Redis, DynamoDB | Data is stored as key-value pairs |
| Column-family | Cassandra, HBase | Stores data in columns instead of rows |
| Graph-based | Neo4j, Amazon Neptune | Stores data as nodes and edges for relationships |

Features of NoSQL
●​ Schema-less: Flexible data models​

●​ Highly scalable: Can handle huge amounts of data​

●​ Distributed: Built for cloud and distributed data centers​

●​ Fast performance: Optimized for high-speed read/write​

When to use NoSQL?


✅ When dealing with big data​
✅ For real-time web apps (ex: chats, recommendations)​
✅ For flexible schema requirements (ex: IoT, user profiles)​
✅ When scaling horizontally across servers

When NOT to use NoSQL


❌ When data is highly relational and consistent (ex: Banking)​
❌ When strong ACID compliance is mandatory​
❌ For complex joins and transactions

Summary
●​ NoSQL = Not Only SQL → designed for modern apps​

●​ Supports various models: document, key-value, column, graph​

●​ Scales horizontally and handles unstructured/large data​

●​ Use wisely → it's not always the replacement for RDBMS

MongoDB: Introduction
MongoDB is a popular NoSQL, document-oriented database.​
It stores data in JSON-like documents (actually BSON - Binary JSON) which
makes it flexible and easy to use.

✅ Key Features
●​ Schema-less → Flexible document structures​

●​ Scalable → Easily distributed across servers​


●​ High performance → Fast reads and writes​

●​ Rich query language → Supports powerful queries​

📦 Example of a document:
json

{
  "name": "John",
  "age": 25,
  "email": "john@example.com"
}

📚 MongoDB Data Types


| Data Type | Description | Example |
| --- | --- | --- |
| String | Text data | "Hello" |
| Integer | Number | 42 |
| Double | Decimal | 42.5 |
| Boolean | True/False | true |
| Array | List of values | ["red", "green"] |
| Object | Embedded document | { "city": "Delhi" } |
| Date | Date/time | ISODate("2023-05-01T00:00:00Z") |
| Null | Null value | null |
| ObjectId | Automatically generated unique ID | ObjectId("...") |

✍️ Creating Documents
To insert documents into a collection:

db.users.insertOne({

name: "Alice",

age: 30,

email: "alice@example.com"

})
For multiple documents:

db.users.insertMany([

{ name: "Bob", age: 25 },

{ name: "Charlie", age: 28 }

])

🔄 Updating Documents
// Update one document

db.users.updateOne(
  { name: "Alice" },
  { $set: { age: 31 } }
)

// Update many documents

db.users.updateMany(
  { age: { $lt: 30 } },
  { $set: { status: "young" } }
)

Operators: $set, $inc, $rename, $unset, etc.

❌ Deleting Documents
// Delete one document

db.users.deleteOne({ name: "Charlie" })

// Delete many documents

db.users.deleteMany({ age: { $lt: 25 } })

🔎 Querying Documents
// Find all documents
db.users.find()

// Find with condition

db.users.find({ age: { $gt: 25 } })

// Find and project only name field

db.users.find({ age: { $gt: 25 } }, { name: 1 })

// Sort and limit

db.users.find().sort({ age: -1 }).limit(2)

MongoDB supports conditions ($gt, $lt, $in, $or, etc.)

Introduction to Indexing
Index → Makes search/query operations faster.

Without index → MongoDB scans all documents (slow for large collections).​
With index → MongoDB finds data faster (just like an index page of a book).

// Create index on "name" field

db.users.createIndex({ name: 1 })
// List all indexes

db.users.getIndexes()

Types of Indexes: Single field, Compound, Text, Geospatial, etc.

📦 Capped Collections
Capped collection → Fixed-size collection (cannot grow beyond limit).

✅ High performance for inserts​


✅ Auto removes oldest documents when size limit is reached​
✅ Useful for logs, cache, real-time data
// Create capped collection

db.createCollection("logs", { capped: true, size: 100000 })

// Insert into capped collection

db.logs.insertOne({ event: "User login", time: new Date() })

Important: Individual documents cannot be deleted and the collection cannot be resized → documents are automatically removed in FIFO order when the collection is full.
✅ Summary (Quick Table)
| Operation | MongoDB Command |
| --- | --- |
| Insert | insertOne(), insertMany() |
| Update | updateOne(), updateMany() |
| Delete | deleteOne(), deleteMany() |
| Query | find() with filters |
| Indexing | createIndex() |
| Capped Collection | createCollection() with { capped: true } |

Database-Level Commands
| Command | Description |
| --- | --- |
| show dbs | List all databases |
| use myDatabase | Switch to or create a new database |
| db | Show current database name |
| db.dropDatabase() | Delete the current database |

📁 Collection-Level Commands
| Command | Description |
| --- | --- |
| show collections | List all collections in the current database |
| db.createCollection("myCollection") | Create a new collection |
| db.myCollection.drop() | Delete a collection |

📝 CRUD Operations
Create

db.collection.insertOne({ name: "Anshika", age: 25 });
db.collection.insertMany([{ name: "Himank" }, { name: "Bhavishya" }]);

Read

db.collection.find(); // All documents


db.collection.find({ name: "Anshika" }); // Filtered
db.collection.findOne({ age: 25 });

Update
db.collection.updateOne({ name: "Anshika" }, { $set: { age: 26 } });
db.collection.updateMany({ age: 25 }, { $inc: { age: 1 } });

Delete

db.collection.deleteOne({ name: "Bhavishya" });


db.collection.deleteMany({ age: { $lt: 20 } });

🔍 Query Operators
| Operator | Example | Meaning |
| --- | --- | --- |
| $gt | { age: { $gt: 18 } } | Greater than |
| $lt | { age: { $lt: 30 } } | Less than |
| $in | { name: { $in: ["Anshika", "Himank"] } } | Match any value in array |
| $and | { $and: [{ age: 25 }, { name: "Anshika" }] } | AND condition |
| $or | { $or: [{ age: 25 }, { name: "Bhavishya" }] } | OR condition |

📊 Aggregation
db.collection.aggregate([
{ $match: { age: { $gte: 25 } } },
{ $group: { _id: "$age", total: { $sum: 1 } } }
]);

🧾 Indexes
db.collection.createIndex({ name: 1 }); // Ascending
db.collection.dropIndex("name_1");

What is Apache Spark?


Apache Spark is an open-source, distributed computing framework designed for
big data processing. It performs fast, in-memory computations and supports batch,
streaming, machine learning, and graph processing workloads.

⚙️ Key Features of Apache Spark


| Feature | Description |
| --- | --- |
| Speed | Processes data up to 100x faster than Hadoop MapReduce using in-memory computing. |
| Ease of Use | Supports high-level APIs in Java, Scala, Python (PySpark), and R. |
| Multi-Engine Support | Runs batch jobs, streaming data, machine learning, and graph processing in one platform. |
| Fault Tolerance | Automatically recovers from node failures using RDD lineage. |
| Works with Hadoop | Can run on Hadoop clusters and access HDFS data. |
🧠 Core Concepts of Spark
1. RDD (Resilient Distributed Dataset)

●​ The fundamental data structure of Spark.​

●​ Immutable, distributed collection of objects.​

●​ Can be created from Hadoop files, local data, or transformed from existing
RDDs.​

Key Properties:

●​ Resilient: Recovers from failure.​

●​ Distributed: Stored across multiple nodes.​

●​ Lazy Evaluation: Transforms are not executed until an action is called.​

2. Transformations and Actions

| Transformations | Actions |
| --- | --- |
| map() | collect() |
| filter() | count() |
| flatMap() | reduce() |
| union() | take(n) |
| groupByKey() | saveAsTextFile() |

●​ Transformations: Create a new RDD from existing ones (lazy).
●​ Actions: Trigger computation and return a result (see the sketch below).
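
A minimal PySpark sketch of this idea (assuming PySpark is installed; the variable names are illustrative, not from the notes). The map/filter calls only describe new RDDs, and nothing is computed until an action such as collect() or count() runs:

python

from pyspark import SparkContext

sc = SparkContext.getOrCreate()  # reuse the running context if there is one

# Build an RDD from a local Python list (4 partitions is an arbitrary choice).
numbers = sc.parallelize([1, 2, 3, 4, 5, 6], 4)

# Transformations: lazily describe new RDDs, nothing runs yet.
squares = numbers.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# Actions: trigger the actual computation on the executors.
print(evens.collect())                      # [4, 16, 36]
print(squares.count())                      # 6
print(squares.reduce(lambda a, b: a + b))   # 91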

3. Spark Components (Libraries)

| Component | Use |
| --- | --- |
| Spark Core | Basic functions (task scheduling, memory management, fault recovery). |
| Spark SQL | SQL interface and DataFrame API. |
| Spark Streaming | Real-time data processing. |
| MLlib | Machine learning library. |
| GraphX | Graph processing and analysis. |


4. Spark Architecture

Driver Program → Cluster Manager (YARN, Mesos, Standalone) → Worker Nodes → Executors (run tasks)

●​ Driver: Main program that defines transformations/actions.​

●​ Cluster Manager: Allocates resources (YARN, Mesos, Standalone).​

●​ Executors: Run tasks and store data.​
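
As a rough illustration (not taken from the notes; the application name and executor settings below are made-up example values), this is how the driver-side entry point is typically created. The master URL decides which cluster manager the driver talks to:

python

from pyspark.sql import SparkSession

# The driver program starts here. "local[*]" keeps everything on one machine,
# while "yarn" or "spark://host:7077" would hand resource allocation to a
# cluster manager that launches executors on worker nodes.
spark = (
    SparkSession.builder
    .appName("architecture-demo")           # hypothetical application name
    .master("local[*]")
    .config("spark.executor.memory", "2g")  # example executor sizing
    .config("spark.executor.cores", "2")
    .getOrCreate()
)

sc = spark.sparkContext   # lower-level handle used for RDD examples
print(sc.master)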

5. Anatomy of a Spark Job

When an action (like .count()) is called:

1.​ Spark builds a DAG (Directed Acyclic Graph).​

2.​ DAG is divided into Stages (based on data shuffling).​

3.​ Each stage is divided into Tasks (each runs on a partition).​

4.​ Executors execute tasks and return results to the Driver.​
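
A small sketch of that flow (assuming a SparkContext named sc, as in the earlier examples): reduceByKey needs all values for a key on the same partition, so Spark inserts a shuffle and the single collect() action becomes one job with two stages of tasks:

python

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

words = sc.parallelize(["spark", "hadoop", "spark", "hive", "spark"], 2)

# Narrow transformation: each partition is mapped independently.
pairs = words.map(lambda w: (w, 1))

# Wide transformation: requires a shuffle, which marks a stage boundary.
counts = pairs.reduceByKey(lambda a, b: a + b)

# The action builds the DAG, splits it into stages and tasks, and runs them.
print(counts.collect())   # e.g. [('spark', 3), ('hadoop', 1), ('hive', 1)]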


🖥️ Spark Deployment Modes
| Mode | Description |
| --- | --- |
| Local | Runs on a single machine (for development/testing). |
| Standalone | Uses Spark’s built-in cluster manager. |
| YARN | Uses Hadoop YARN for resource management. |
| Mesos | Another general-purpose cluster manager. |
| Kubernetes | Container-based deployment option. |

⚡ Spark on YARN
●​ Spark can run on Hadoop YARN.​

●​ Two modes:​

○​ Client Mode: Driver runs on the client machine.​

○​ Cluster Mode: Driver runs inside a YARN container (recommended for production).

●​ Allows Spark to run alongside Hadoop jobs, using the same cluster resources.​
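
For illustration only (this assumes a configured Hadoop/YARN cluster and that HADOOP_CONF_DIR points at its configuration files): in client mode the application can simply ask for YARN as the master, while cluster mode is normally selected at submission time, for example with spark-submit --master yarn --deploy-mode cluster.

python

from pyspark.sql import SparkSession

# Client mode: this Python process is the driver and YARN only provides
# the executors. Cluster mode is chosen when submitting, not in code.
spark = (
    SparkSession.builder
    .appName("yarn-client-demo")   # hypothetical application name
    .master("yarn")
    .getOrCreate()
)

print(spark.sparkContext.master)   # "yarn"
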
📊 Spark SQL and DataFrames

●​ DataFrame: Similar to a SQL table.​

●​ Supports SQL queries:​

python

df = spark.read.csv("file.csv", header=True)

df.select("name", "age").filter("age > 25").show()

●​ Register DataFrames as temporary tables:​

python

df.createOrReplaceTempView("people")
spark.sql("SELECT * FROM people WHERE age > 25").show()

🌊 Spark Streaming
●​ Processes real-time data from sources like Kafka, Flume, or TCP sockets.​

●​ Converts streaming data into DStreams (Discretized Streams).​

python

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 1) # 1 second batch interval

lines = ssc.socketTextStream("localhost", 9999)

lines.pprint()

ssc.start()

ssc.awaitTermination()

🤖 Spark MLlib (Machine Learning)


●​ Offers pre-built algorithms:​

○​ Classification, Regression​

○​ Clustering (KMeans)​
○​ Collaborative Filtering​

○​ Dimensionality Reduction (PCA)​

●​ Supports pipelines, transformers, and evaluators.​
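
A minimal MLlib sketch (the toy data and column names are made up for illustration): a VectorAssembler turns the numeric columns into a single feature vector, and a Pipeline chains it with a KMeans estimator:

python

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# Tiny made-up dataset with two numeric features.
df = spark.createDataFrame(
    [(1.0, 1.2), (0.8, 1.0), (8.0, 9.1), (8.2, 8.9)],
    ["x", "y"],
)

# MLlib estimators expect a single vector column of features.
assembler = VectorAssembler(inputCols=["x", "y"], outputCol="features")
kmeans = KMeans(k=2, seed=42, featuresCol="features")

# The pipeline fits the stages in order and returns a reusable model.
model = Pipeline(stages=[assembler, kmeans]).fit(df)
model.transform(df).select("x", "y", "prediction").show()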

🧠 GraphX
●​ Enables graph computation on RDDs.​

●​ Use cases:​

○​ Social network analysis​

○​ PageRank​

○​ Connected components​

✅ Advantages of Spark
●​ High performance (in-memory)​

●​ Unified platform for all data workloads​

●​ Easy APIs for faster development​

●​ Fault tolerance and scalability​

●​ Works well with HDFS, Hive, Cassandra, HBase, and more​

❗ Limitations
●​ High memory consumption​

●​ Not ideal for low-latency applications (real-time dashboards)​

●​ Requires tuning for very large-scale production use​

🎓 Final Summary Table


| Concept | Summary |
| --- | --- |
| Spark Core | Task scheduling, memory & fault management |
| RDD | Immutable, distributed data structure |
| Spark SQL | Run SQL queries on structured data |
| Spark Streaming | Real-time stream processing |
| MLlib | Machine learning library |
| GraphX | Graph analytics |
| Deployment | Local, Standalone, YARN, Kubernetes |
| Jobs → Stages → Tasks | Execution model of a Spark application |

Background
At its core, Apache Spark is an open-source distributed computing system designed to quickly process large volumes of data that could hardly be handled on a single machine. Spark distributes data and computations across multiple machines, allowing for parallel processing.

It was first developed at UC Berkeley’s AMPLab in 2009. At that time, Hadoop MapReduce was the leading parallel programming engine for processing massive datasets across multiple machines. AMPLab collaborated with early MapReduce users to identify its strengths and limitations, driving the creation of more versatile computing platforms. They also worked closely with Hadoop users at UC Berkeley, who focused on large-scale machine learning requiring iterative algorithms and multiple data passes.

These discussions highlighted some insights. Cluster computing had significant potential. However, MapReduce made building large applications inefficient, especially for machine learning tasks requiring multiple data passes. For example, a machine learning algorithm might need to make many passes over the data, and with MapReduce each pass must be written as a separate job and launched individually on the cluster.

To address this, the Spark team created a functional programming-based API to simplify multistep applications and developed a new engine for efficient in-memory data sharing across computation steps.

The Spark Application Architecture


A typical Spark application consists of several key components:
●​ Driver: This JVM process manages the Spark application,
handling user input and distributing work to the executors.
●​ Cluster Manager: This component oversees the cluster of
machines running the Spark application. Spark can work with
various cluster managers, including YARN, Apache Mesos, or
its standalone manager.
●​ Executors: These processes execute tasks the driver assigns
and report their status and results. Each Spark application has
its own set of executors. A single worker node can host
multiple executors.
Job, Stage, and Task
●​ Job: In Spark, a job represents a series of transformations
applied to data. It encompasses the entire workflow from start
to finish.
●​ Stage: A stage is a job segment executed without data shuffling.
Spark splits the job into different stages when a transformation
requires shuffling data across partitions.
●​ Task: A task is the smallest unit of execution within Spark.
Each stage is divided into multiple tasks, running the same
code on a separate data partition executed by individual
executors.

In Spark, a job is divided into stages wherever data shuffling is necessary. Each stage is further broken down into tasks that are executed in parallel across different data partitions. A single Spark application can have more than one Spark job.

Resilient Distributed Dataset (RDD)


RDD is Spark’s primary data abstraction. Whether DataFrames or Datasets are used, they are compiled into RDDs behind the scenes. An RDD represents an immutable, partitioned collection of records that can be operated on in parallel. Data inside an RDD is kept in memory for as long, and as much, as possible.

Properties
Internally, each RDD in Spark has five key properties:

●​ List of Partitions: The RDD is divided into partitions, which are the units of parallelism in Spark.
●​ Computation Function: A function determines how to
compute the data for each partition.
●​ Dependencies: The RDD keeps track of its dependencies on
other RDDs, which describes how it was created.
●​ Partitioner (Optional): For key-value RDDs, a partitioner
specifies how the data is partitioned, such as using a hash
partitioner.
●​ Preferred Locations (Optional): This property lists the
preferred locations for computing each partition, such as the
data block locations in the HDFS.

Lazy Evaluation
When you define an RDD, its data is not loaded or transformed immediately; nothing happens until an action triggers the execution. This approach allows Spark to determine the most efficient way to execute the transformations.
●​ Transformations, such as map or filter, are operations that
define how the data should be transformed, but they don't
execute until an action forces the computation. Spark doesn't
modify the original RDD when a transformation is applied to
an RDD. Instead, it creates a new RDD that represents the
result of applying the transformation, because RDDs are immutable.
●​ Actions are the commands that Spark runs to produce output
or store data, thereby driving the actual execution of the
transformations.
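
To make the laziness concrete, here is a small sketch (assuming a SparkContext named sc): the transformation chain returns instantly because Spark only records the lineage, and the data is processed only when an action such as take() or count() runs:

python

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Returns immediately: Spark just records range -> map -> filter,
# it does not materialise ten million numbers.
big = sc.range(0, 10_000_000)
transformed = big.map(lambda x: x * 2).filter(lambda x: x % 3 == 0)

# Work happens only when an action asks for a result.
print(transformed.take(3))   # [0, 6, 12]
print(transformed.count())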

Partitions
When an RDD is created, Spark divides the data into multiple chunks, known as partitions. Each partition is a logical subset of the data and can be processed independently by different executors. This enables Spark to perform operations on large datasets in parallel.
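
A quick way to see this in practice (again assuming a SparkContext named sc): glom() groups each partition's elements into a list so the split is visible, and repartition() reshuffles the data into a different number of partitions:

python

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Ask for 4 partitions explicitly; each can be processed by a different executor.
rdd = sc.parallelize(range(100), 4)
print(rdd.getNumPartitions())                          # 4

# glom() turns each partition into a list, showing how the data was split.
print([len(part) for part in rdd.glom().collect()])    # [25, 25, 25, 25]

# repartition() shuffles the data into a new number of partitions.
print(rdd.repartition(8).getNumPartitions())           # 8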

Note: I’ll explore Spark partitions in detail in an upcoming article.

Fault Tolerance
Spark RDDs achieve fault tolerance through lineage. Spark forms the dependency lineage graph by keeping track of each RDD’s dependencies on other RDDs, which is the series of transformations that created it. Suppose any partition of an RDD is lost due to a node failure or other issue. In that case, Spark can reconstruct the lost data by reapplying the transformations to the original dataset described by the lineage. This approach eliminates the need to replicate data across nodes. Instead, Spark only needs to recompute the lost partitions, making the system efficient and resilient to failures.
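
For illustration, the lineage that Spark keeps for recovery can be printed with toDebugString() (a minimal sketch assuming a SparkContext named sc):

python

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

rdd = (
    sc.parallelize(range(10), 2)
      .map(lambda x: (x % 3, x))
      .reduceByKey(lambda a, b: a + b)
)

# toDebugString() shows the chain of dependencies (the lineage graph);
# if a partition is lost, Spark replays these steps for that partition only.
print(rdd.toDebugString().decode())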

Why are RDDs immutable?


You might wonder why Spark RDDs are immutable. Here’s the gist:
●​ Concurrent Processing: Immutability keeps data consistent
across multiple nodes and threads, avoiding complex
synchronization and race conditions.
●​ Lineage and Fault Tolerance: Each transformation creates a
new RDD, preserving the lineage and allowing Spark to
recompute lost data reliably. Mutable RDDs would make this
much harder.
●​ Functional Programming: RDDs follow functional
programming principles that emphasize immutability, making
it easier to handle failures and maintain data integrity.

The journey of the Spark application


Before diving into the flow of a Spark application, it’s essential to

understand the different execution modes Spark offers. We have three

options:

●​ Cluster Mode: In this mode, the driver process is launched on a worker node within the cluster alongside the executor processes. The cluster manager handles all the processes related to the Spark application.
●​ Client Mode: The driver remains on the client machine that
submitted the application. This setup requires the client
machine to maintain the driver process throughout the
application’s execution.
●​ Local mode: This mode runs the entire Spark application on a
single machine, achieving parallelism through multiple
threads. It’s commonly used for learning Spark or testing
applications in a simpler, local environment.
SCALA

SCALA: Introduction
Scala = Scalable Language​
It is a modern, hybrid programming language that combines:

✅ Object-Oriented Programming (OOP) → Like Java (Classes, Objects, Inheritance)​


✅ Functional Programming (FP) → Functions, Immutable data, Closures, etc.
✅ Features
●​ Runs on JVM (Java Virtual Machine) → Can use Java libraries​

●​ Statically typed → Type checking at compile time​

●​ Immutable collections and concurrency-friendly​

●​ Concise syntax compared to Java​

📦 Classes and Objects in Scala


➡️ Defining a Class
class Person(val name: String, val age: Int) {
def greet(): Unit = {
println(s"Hello, my name is $name and I am $age years old.")
}
}

●​ val name → Immutable (read-only)​


●​ def → Defines a method​

●​ Unit → Equivalent to void in Java​

➡️ Creating an Object
val p = new Person("Alice", 25)
p.greet()

➡️ Singleton Object (object)


object Hello {
def main(args: Array[String]): Unit = {
println("Hello, Scala!")
}
}

●​ No static keyword → Use object for static-like behavior.​

🔢 Basic Types and Operators


| Type | Example |
| --- | --- |
| Int | 10 |
| Double | 10.5 |
| Boolean | true |
| String | "Scala" |
| Char | 'A' |

➡️ Operators
| Operator | Example |
| --- | --- |
| Arithmetic | +, -, *, /, % |
| Relational | ==, !=, >, <, >=, <= |
| Logical | &&, \|\|, ! |

val a = 10
val b = 5

println(a + b) // 15
println(a > b) // true

🔧 Built-in Control Structures


➡️ If-Else
val num = 5
if (num > 0)
println("Positive")
else
println("Negative")

➡️ Match (like switch)


val grade = "A"
grade match {
case "A" => println("Excellent")
case "B" => println("Good")
case _ => println("Other")
}

➡️ For loop
for (i <- 1 to 5)
println(i)

➡️ While loop
var i = 0
while (i < 5) {
println(i)
i += 1
}

📌 Functions and Closures


➡️ Function
def add(a: Int, b: Int): Int = {
return a + b
}
println(add(3, 4))

OR Shorter form:
def add(a: Int, b: Int) = a + b

➡️ Anonymous Function (Lambda)


val square = (x: Int) => x * x
println(square(5)) // 25

➡️ Closure
A closure is a function which uses variables from its surrounding scope.

val factor = 3
val multiplier = (x: Int) => x * factor
println(multiplier(4)) // 12

factor is captured → this is a closure.

🧬 Inheritance
// Parent class
class Animal {
def sound(): Unit = {
println("Animal makes sound")
}
}

// Child class
class Dog extends Animal {
override def sound(): Unit = {
println("Dog barks")
}
}

val d = new Dog()


d.sound() // Dog barks

➡️ Points
●​ extends → for inheritance​

●​ override → to override parent methods​

●​ By default, methods in Scala are virtual → can be overridden​

Summary Table
| Concept | Description | Example |
| --- | --- | --- |
| Class | Blueprint of objects | class Person |
| Object | Singleton object or create instance | object Hello or new Person |
| Types | Int, Double, Boolean, String, etc. | val a: Int = 10 |
| Operators | Arithmetic, Relational, Logical | +, ==, && |
| Control Structures | if-else, for, while, match | if, for (i <- ...) |
| Functions | Named/Anonymous functions | def add(a, b) / (x: Int) => ... |
| Closures | Functions with captured variables | val factor = 3 ... |
| Inheritance | Reusing parent class features | class Dog extends Animal |
