A NoSQL database provides a mechanism for storage and retrieval of data that employs less
constrained consistency models than traditional relational databases.
NoSQL systems are also referred to as "Not only SQL" to emphasize that they do in fact allow
SQL-like query languages to be used.
In relational databases:
You can't add a record which does not fit the schema
You need to add NULLs to unused items in a row
You must respect the datatypes, e.g. you can't add a string to an integer field
You can't add multiple items in a field (you should create another table: primary key, foreign
key, joins, normalization, ...)
In NoSQL Databases:
There is no schema to consider
There is no unused cell
There is no datatype (implicit)
Most of these considerations are handled in the application layer
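The contrast between the two lists can be sketched in a few lines of Python. This is only an illustration under assumptions: SQLite (from the standard library) stands in for a relational database, plain dictionaries stand in for schemaless records, and the `users` table and its fields are made up for the example.

```python
import sqlite3

# Relational side: the schema is fixed up front.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, phone TEXT)")

# A row that omits a column still occupies it, as NULL.
conn.execute("INSERT INTO users (id, name) VALUES (1, 'Ada')")
row = conn.execute("SELECT id, name, phone FROM users WHERE id = 1").fetchone()

# A record with a field the schema does not know is rejected.
try:
    conn.execute("INSERT INTO users (id, name, nickname) VALUES (2, 'Bob', 'bobby')")
    rejected = False
except sqlite3.OperationalError:
    rejected = True

# Schemaless side: each record carries only the fields it has,
# and a single field may hold several items.
documents = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Bob", "nickname": "bobby", "phones": ["555-0100", "555-0101"]},
]

print(row, rejected, documents[1]["phones"])
```

In the schemaless version, checking that `phones` is a list or that `id` is a number becomes the application's job.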
NoSQL avoids:
The overhead of ACID transactions: Atomicity (A), Consistency (C), Isolation (I, "concurrency
control"), Durability (D, "permanent changes")
Complexity of SQL queries
Burden of up-front schema design
DBA presence
Transactions (they should be handled at the application layer)
Provides:
Easy and frequent changes to DB
Fast development
Large data volumes (e.g. Google)
Schemaless
We need a distributed database system having such features:
1. Fault tolerance
2. High availability
3. Consistency
4. Scalability
Which is impossible, according to the CAP theorem.
The CAP theorem is a fundamental concept in distributed computing that states that a distributed
data store can only guarantee two out of the following three properties:
1. Consistency (C):
Every read receives the most recent write or an error. All nodes return the same data, ensuring no
discrepancies. It's like ensuring all replicas reflect the latest update.
2. Availability (A):
Every request (read or write) receives a response without guaranteeing the latest data. The system
remains operational even if some nodes fail.
3. Partition Tolerance (P):
The system continues to operate even when network partitions (communication breakdowns
between nodes) occur. Partition tolerance is essential in distributed systems.
CAP Trade-off
• In a distributed system, network partitions can happen due to network failures. When this occurs,
the system must prioritize Consistency or Availability.
• CP (Consistency + Partition Tolerance):
The system sacrifices availability to ensure data remains consistent across nodes. Example: HBase.
• AP (Availability + Partition Tolerance):
The system sacrifices consistency to ensure availability, allowing operations even during partitions.
Example: Cassandra.
• CA (Consistency + Availability):
This scenario is typically only possible in systems without partitions, such as a single-node
database. An example is a relational database like PostgreSQL.
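The CP and AP choices can be illustrated with a toy store that has two replicas and a switch simulating a partition. This is a minimal sketch, not how HBase or Cassandra are implemented; `TwoReplicaStore`, `Unavailable`, and the `partitioned` flag are hypothetical names invented for the example.

```python
class Unavailable(Exception):
    """Raised when a CP store refuses a request during a partition."""


class TwoReplicaStore:
    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas = [{}, {}]
        self.partitioned = False  # True simulates a network partition

    def write(self, replica, key, value):
        if self.partitioned:
            if self.mode == "CP":
                # CP: refuse rather than let the replicas diverge.
                raise Unavailable("cannot reach the other replica")
            # AP: accept the write on the reachable replica only.
            self.replicas[replica][key] = value
            return
        for r in self.replicas:
            r[key] = value

    def read(self, replica, key):
        if self.partitioned and self.mode == "CP":
            raise Unavailable("cannot confirm the latest value")
        return self.replicas[replica].get(key)


ap = TwoReplicaStore("AP")
ap.write(0, "x", 1)
ap.partitioned = True
ap.write(0, "x", 2)  # still accepted
ap_reads = (ap.read(0, "x"), ap.read(1, "x"))  # replica 1 is now stale

cp = TwoReplicaStore("CP")
cp.write(0, "x", 1)
cp.partitioned = True
try:
    cp.write(0, "x", 2)
    cp_refused = False
except Unavailable:
    cp_refused = True

print(ap_reads, cp_refused)
```

The AP store keeps answering but its replicas disagree during the partition; the CP store stays consistent by refusing requests.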
Apache Cassandra
• Apache Cassandra is a highly scalable, high-performance distributed database designed to
handle large amounts of data across many commodity servers, providing high availability
with no single point of failure.
• It is a type of NoSQL database.
NoSQL
• A NoSQL database (sometimes called Not Only SQL) is a database that provides a mechanism
to store and retrieve data other than the tabular relations used in relational databases.
• These databases are schema-free, support easy replication, have a simple API, are eventually
consistent, and can handle huge amounts of data.
• The primary objective of a NoSQL database is to have
– simplicity of design,
– horizontal scaling, and
– finer control over availability.
NoSQL vs. RDBMS
Popular NoSQL Databases
• Apache HBase:
– HBase is an open source, non-relational, distributed database modeled after Google's BigTable
and is written in Java. It is developed as a part of the Apache Hadoop project and runs on top of
HDFS, providing BigTable-like capabilities for Hadoop.
• MongoDB:
– MongoDB is a cross-platform document-oriented database system that avoids using the
traditional table-based relational database structure in favor of JSON-like documents with
dynamic schemas, making the integration of data in certain types of applications easier.
Features of Cassandra
• Elastic scalability: Cassandra is highly scalable; it allows you to add more hardware to
accommodate more customers and more data as per requirement.
• Always on architecture: Cassandra has no single point of failure and it is continuously
available for business-critical applications that cannot afford a failure.
• Fast linear-scale performance: Cassandra is linearly scalable, i.e., it increases your throughput
as you increase the number of nodes in the cluster. Therefore it maintains a quick response time.
• Flexible data storage: Cassandra accommodates all possible data formats including:
structured, semi-structured, and unstructured. It can dynamically accommodate changes to
your data structures according to your need.
• Easy data distribution: Cassandra provides the flexibility to distribute data where you need
by replicating data across multiple datacenters.
• Transaction support: Cassandra offers atomicity, isolation, and durability for writes within a
single partition, together with tunable consistency; it does not provide full ACID transactions
across partitions as an RDBMS does.
• Fast writes: Cassandra was designed to run on cheap commodity hardware. It performs
blazingly fast writes and can store hundreds of terabytes of data, without sacrificing the
read efficiency.
History of Cassandra
• Cassandra was developed at Facebook for inbox search.
• It was open-sourced by Facebook in July 2008.
• Cassandra was accepted into Apache Incubator in March 2009.
• It was made an Apache top-level project in February 2010.
Data replication in Cassandra
• In Cassandra, one or more of the nodes in a cluster act as replicas for a given piece of data.
• If it is detected that some of the nodes responded with an out-of-date value, Cassandra will
return the most recent value to the client.
• After returning the most recent value, Cassandra performs a read repair in the background
to update the stale values.
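The read path described above can be sketched as follows. This is a simplified model, not Cassandra's implementation: each replica is a plain dictionary, every value is stored with a timestamp, and the repair runs inline rather than in the background. The function name `read_with_repair` and the sample data are assumptions made for the example.

```python
def read_with_repair(replicas, key):
    """Return the newest value for key and bring stale replicas up to date."""
    # Each replica maps key -> (value, timestamp); a missing key reads as None.
    responses = [(replica, replica.get(key)) for replica in replicas]
    present = [cell for _, cell in responses if cell is not None]
    if not present:
        return None

    # The cell with the highest timestamp is the most recent write.
    newest = max(present, key=lambda cell: cell[1])

    # Read repair: overwrite every replica that answered with something older.
    for replica, cell in responses:
        if cell != newest:
            replica[key] = newest

    return newest[0]


replicas = [
    {"user:1": ("alice@old.example", 100)},
    {"user:1": ("alice@new.example", 200)},
    {},  # this replica missed the write entirely
]
value = read_with_repair(replicas, "user:1")
print(value, replicas)
```

After the call, all three replicas hold the value written at timestamp 200.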
Cassandra QL
• Users can access Cassandra through its nodes using Cassandra Query Language (CQL). CQL
treats the database (Keyspace) as a container of tables.
• Programmers use cqlsh: a prompt to work with CQL or separate application language
drivers.
• Clients approach any of the nodes for their read-write operations. That node (coordinator)
plays a proxy between the client and the nodes holding the data.
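The coordinator role can be illustrated with a small in-memory model. It is a sketch under assumptions: the `Node` class, its `handle` method, and the byte-sum placement rule are invented for the example and do not reflect Cassandra's partitioner or driver API.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.peers = []  # all nodes in the cluster, including this one

    def owner_of(self, key):
        # Hypothetical placement rule: any deterministic function works here.
        index = sum(key.encode()) % len(self.peers)
        return self.peers[index]

    def handle(self, op, key, value=None):
        """Act as coordinator: forward the request to the node holding the key."""
        owner = self.owner_of(key)
        if op == "write":
            owner.data[key] = value
            return owner.name
        return owner.data.get(key)


nodes = [Node("n1"), Node("n2"), Node("n3")]
for node in nodes:
    node.peers = nodes

# The client may contact any node; the result is the same.
stored_on = nodes[0].handle("write", "cart:42", "3 items")
results = [node.handle("read", "cart:42") for node in nodes]
print(stored_on, results)
```

Whichever node receives the request, it only relays it; the data lives on the node chosen by the placement rule.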
Data Model
• The data model of Cassandra is significantly different from what we normally see in an
RDBMS.
• Cassandra database is distributed over several machines that operate together.
• The outermost container is known as the Cluster. For failure handling, every node contains a
replica, and in case of a failure, the replica takes charge.
• Cassandra arranges the nodes in a cluster, in a ring format, and assigns data to them.
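One way to picture the ring is as a sorted list of positions (tokens), where each key is hashed to a position and assigned to the first node at or after it, wrapping around at the end. The sketch below assumes an MD5-based `token` function and a `Ring` class made up for the example; Cassandra's actual partitioner and token ranges differ in detail.

```python
import bisect
import hashlib


def token(text):
    """Map a string to a position on the ring (0 .. 2**32 - 1)."""
    return int.from_bytes(hashlib.md5(text.encode()).digest()[:4], "big")


class Ring:
    def __init__(self, node_names):
        # Each node sits at the position given by the hash of its name.
        self.entries = sorted((token(name), name) for name in node_names)
        self.tokens = [t for t, _ in self.entries]

    def owner(self, key):
        # First node at or after the key's position, wrapping past the end.
        i = bisect.bisect_left(self.tokens, token(key)) % len(self.entries)
        return self.entries[i][1]


names = ["node-a", "node-b", "node-c"]
ring = Ring(names)
owners = {key: ring.owner(key) for key in ["user:1", "user:2", "order:9"]}
print(owners)
```

Every key maps to exactly one node, and the mapping stays the same each time it is computed.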
• Keyspace is the outermost container for data in Cassandra. The basic attributes of a Keyspace
in Cassandra are:
• Replication factor:
– It is the number of machines in the cluster that will receive copies of the same data.
• Replica placement strategy:
– It is the strategy used to place replicas in the ring. We have strategies such as simple
strategy (rack-unaware strategy), old network topology strategy (rack-aware strategy), and
network topology strategy (datacenter-shared strategy).
• Column families:
– Keyspace is a container for a list of one or more column families. A column family, in turn, is
a container of a collection of rows. Each row contains ordered columns. Column families
represent the structure of your data. Each keyspace has at least one and often many column
families.
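The nesting of keyspace, column family, row, and column, together with the replication factor, can be modelled in a few lines. This is a sketch with assumed names (`Keyspace`, `put`, `row`, `simple_placement`) and not Cassandra's storage format; the placement function follows the simple strategy idea of taking consecutive nodes clockwise around the ring.

```python
from dataclasses import dataclass, field


@dataclass
class Keyspace:
    name: str
    replication_factor: int
    placement_strategy: str
    # column family name -> row key -> {column name: value}
    column_families: dict = field(default_factory=dict)

    def put(self, family, row_key, column, value):
        rows = self.column_families.setdefault(family, {})
        rows.setdefault(row_key, {})[column] = value

    def row(self, family, row_key):
        """Return the row's columns ordered by column name."""
        return sorted(self.column_families[family][row_key].items())


def simple_placement(ring_nodes, first_index, replication_factor):
    """Pick replication_factor consecutive nodes, walking clockwise around the ring."""
    count = min(replication_factor, len(ring_nodes))
    return [ring_nodes[(first_index + i) % len(ring_nodes)] for i in range(count)]


ks = Keyspace("shop", replication_factor=3, placement_strategy="simple")
ks.put("users", "u1", "name", "Ada")
ks.put("users", "u1", "email", "ada@example.com")

columns = ks.row("users", "u1")
replica_nodes = simple_placement(["n1", "n2", "n3", "n4"], 3, ks.replication_factor)
print(columns, replica_nodes)
```

With a replication factor of 3 and the first replica on the last node of a four-node ring, the remaining two copies wrap around to the first two nodes.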
Column family
• A column family is a container for an ordered collection of rows. Each row, in turn, is an
ordered collection of columns.
• A Cassandra column family has the following attributes:
– keys_cached: It represents the number of locations to keep cached per SSTable.
– rows_cached: It represents the number of rows whose entire contents will be cached in
memory.
– preload_row_cache: It specifies whether you want
RDBMS vs. Cassandra
Summary
SQL Databases
1. Predefined Schema
2. Standard definition and interface language
3. Tight consistency
4. Well-defined semantics
NoSQL Databases
1. No predefined schema
2. Per-product definition and interface language
3. Getting an answer quickly is more important than getting a correct answer