Relational databases
• Benefits of relational databases:
🡺 Designed for OLTP
🡺 ACID properties
🡺 Strong consistency, concurrency, and recovery
🡺 Solid mathematical background (relational algebra)
🡺 Standard query language (SQL)
🡺 Lots of tools to use with them, e.g. reporting services, entity frameworks, ...
NoSQL: why, what and when?
But...
❑ Relational databases were not built for distributed applications.
Because...
❑ Joins are expensive
❑ Hard to scale horizontally (adding more machines)
❑ Impedance (object-relational) mismatch occurs
❑ Expensive (product cost, hardware, maintenance)
And...
They are also weak in:
❑ Speed (performance)
❑ High availability
❑ Partition tolerance
Why NoSQL now? Driving trends
What is NoSQL?
❑ A NoSQL database provides a mechanism for storage and retrieval of data that employs less constrained models than traditional relational databases
❑ NoSQL systems are also referred to as "Not only SQL" to emphasize that they do in fact allow SQL-like query languages to be used
Motivations of NoSQL databases
o Simplicity of design
o Simpler "horizontal" scaling to clusters of machines (which is a problem for relational databases)
o Finer control over availability: servers can be added or removed without application downtime
o Limiting the object-relational impedance mismatch
Characteristics of NoSQL databases
NoSQL avoids:
▶ Overhead of ACID transactions
▶ Complexity of SQL queries
▶ Burden of up-front schema design
▶ DBA presence
▶ Transactions (these should be handled at the application layer)
Provides:
▶ Easy and frequent changes to the DB
▶ Fast development
▶ Large data volumes (e.g. Google)
▶ Schema-less data
What do we need?
• We need a distributed database system with the following features:
– Fault tolerance
– High availability
– Consistency
– Scalability
Which is impossible, according to the CAP theorem!
CAP Theorem
■ Three properties of a system:
❑ Consistency (all copies have the same value)
❑ Availability (the system can run even if parts have failed)
❑ Via replication
❑ Partition tolerance (the network can break into two or more parts, each with active systems that can't talk to the other parts)
■ Brewer's CAP "Theorem": you can have at most two of these three properties for any system
■ Very large systems will partition at some point
❑ 🡺 Choose one of consistency or availability
❑ Traditional databases choose consistency
❑ Most web applications choose availability
■ Except for specific parts such as order processing
Availability
■ Traditionally thought of as the server/process being available five 9's (99.999%) of the time.
■ However, for a large multi-node system, at almost any point in time there's a good chance that a node is either down or there is a network disruption among the nodes.
❑ We want a system that is resilient in the face of network disruption
Eventual Consistency
■ When no updates occur for a long period of time, eventually all updates will propagate through the system and all the nodes will be consistent
■ For a given accepted update and a given node, eventually either the update reaches the node or the node is removed from service
■ Known as BASE (Basically Available, Soft state, Eventual consistency), as opposed to ACID
❑ Soft state: copies of a data item may be inconsistent
❑ Eventually consistent: copies become consistent at some later time if there are no more updates to that data item
CAP theorem
We cannot achieve all three of these properties in a distributed database system.
NoSQL: when?
o To handle a huge volume of structured, semi-structured and unstructured data.
o When you need to follow modern software development practices like Agile and Scrum, and to deliver prototypes or applications fast.
o If you prefer object-oriented programming.
o If your relational database cannot scale up to your traffic at an acceptable cost.
o If you want an efficient, scale-out architecture in place of an expensive, monolithic architecture.
o If you have local data transactions that need not be very durable.
o If you are going with schema-less data and want to include new fields without any ceremony.
o When your priority is easy scalability and availability.
NoSQL: when not?
o If you are required to perform complex and dynamic querying and reporting, you should avoid NoSQL, as it has limited query functionality. For such requirements, prefer SQL.
o NoSQL also lacks the ability to perform dynamic operations and cannot guarantee ACID properties. For cases like financial transactions, you should go with SQL databases.
o You should also avoid NoSQL if your application needs run-time flexibility.
o If consistency is a must and there are not going to be any large-scale changes in data volume, then going with a SQL database is the better option.
NoSQL is getting more and more popular
What is a schema-less data model?
In relational databases:
▶ You can't add a record that does not fit the schema
▶ You need to add NULLs for unused items in a row
▶ You have to consider the data types, e.g. you can't add a string to an integer field
▶ You can't add multiple items in a field (you would have to create another table: primary key, foreign key, joins, normalization, ... !!!)
In NoSQL databases:
▶ There is no schema to consider
▶ There are no unused cells
▶ There are no explicit data types
▶ Most such considerations are handled in the application layer
▶ We gather all items in an aggregate (document)
Aggregate Data Models
NoSQL databases are classified into four major data models:
• Key-value
• Document
• Column family (or wide column)
• Graph
Each DB has its own query language.
Example systems per data model:
Column family: Azure Cosmos DB, Accumulo, Cassandra, Scylla, HBase
Document: Azure Cosmos DB, Apache CouchDB, ArangoDB, BaseX, Clusterpoint, Couchbase, eXist-db, IBM Domino, MarkLogic, MongoDB, OrientDB, Qizx, RethinkDB
Key-value: Azure Cosmos DB, Aerospike, Apache Ignite, ArangoDB, Berkeley DB, Couchbase, Dynamo, FoundationDB, InfinityDB, MemcacheDB, MUMPS, Oracle NoSQL Database, OrientDB, Redis, Riak, SciDB, SDBM/Flat File dbm
Graph: Azure Cosmos DB, AllegroGraph, ArangoDB, InfiniteGraph, Apache Giraph, MarkLogic, Neo4J, AgensGraph, OrientDB, Virtuoso
Key-value data model
🡺 The simplest NoSQL databases
🡺 The main idea is the use of a hash table
🡺 Access data (values) by strings called keys
🡺 Data has no required format: it may have any format
🡺 Data model: (key, value) pairs
🡺 Basic operations: Insert(key, value), Fetch(key), Update(key, value), Delete(key)
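The four basic operations above map directly onto a hash table. A minimal in-memory sketch in Java (an illustration of the model only, not the API of any particular key-value store):

    import java.util.HashMap;
    import java.util.Map;

    // Minimal in-memory key-value store: every basic operation is a
    // single hash-table lookup, and values are opaque byte arrays.
    public class KeyValueStore {
        private final Map<String, byte[]> table = new HashMap<>();

        public void insert(String key, byte[] value) { table.put(key, value); }
        public byte[] fetch(String key)              { return table.get(key); }
        public void update(String key, byte[] value) { table.put(key, value); }
        public void delete(String key)               { table.remove(key); }

        public static void main(String[] args) {
            KeyValueStore store = new KeyValueStore();
            store.insert("user:42", "{\"name\":\"Alice\"}".getBytes());
            System.out.println(new String(store.fetch("user:42")));
        }
    }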
Column family data model
🡺 Based on Google's Bigtable
🡺 The column is the lowest/smallest unit of data
🡺 The names and format of the columns can vary from row to row in the same table
🡺 Each column family typically contains multiple columns that are used together
🡺 Within a given column family, all data is stored in a row-by-row fashion, such that the columns for a given row are stored together, rather than each column being stored separately
🡺 A wide-column store can be interpreted as a two-dimensional key-value store
🡺 Each column is a tuple that contains a name, a value and a timestamp
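A toy sketch of that two-dimensional view in Java (purely illustrative, not any product's API): the table is a sorted map from row key, through column family and column name, down to timestamped values.

    import java.util.TreeMap;

    // Toy wide-column layout: rowKey -> family -> column -> timestamp -> value
    public class WideColumnSketch {
        public static void main(String[] args) {
            TreeMap<String, TreeMap<String, TreeMap<String, TreeMap<Long, String>>>> table = new TreeMap<>();

            // Write one cell of row "Minsk", family "geo", column "country"
            table.computeIfAbsent("Minsk", r -> new TreeMap<>())
                 .computeIfAbsent("geo", f -> new TreeMap<>())
                 .computeIfAbsent("country", c -> new TreeMap<>())
                 .put(2011L, "Belarus");

            // A read addresses the cell by (row, family, column, timestamp)
            System.out.println(table.get("Minsk").get("geo").get("country").get(2011L));
        }
    }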
Some statistics about Facebook Search (using Cassandra):
❖ MySQL, > 50 GB of data:
🡺 Average write: ~300 ms
🡺 Average read: ~350 ms
❖ Rewritten with Cassandra, > 50 GB of data:
🡺 Average write: 0.12 ms
🡺 Average read: 15 ms
Graph data model
🡺 Similar to the network data model at a high level of abstraction
🡺 Based on graph theory
🡺 You can apply graph algorithms easily
🡺 Graph query languages: Gremlin, Cypher, SPARQL
🡺 The underlying storage mechanism of graph databases can vary: a relational, key-value store or document-oriented database
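As a toy illustration of the model itself (plain Java, not a graph database API): nodes and edges both carry properties, and traversing edges directly replaces relational joins.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Toy property graph: nodes and edges, both able to carry properties.
    public class PropertyGraphSketch {
        record Node(String id, Map<String, String> props) {}
        record Edge(String from, String label, String to) {}

        public static void main(String[] args) {
            Map<String, Node> nodes = new HashMap<>();
            List<Edge> edges = new ArrayList<>();

            nodes.put("alice", new Node("alice", Map.of("age", "34")));
            nodes.put("bob",   new Node("bob",   Map.of("age", "29")));
            edges.add(new Edge("alice", "FRIENDS_WITH", "bob"));

            // Traversal instead of a join: follow matching edges directly
            edges.stream()
                 .filter(e -> e.from().equals("alice") && e.label().equals("FRIENDS_WITH"))
                 .forEach(e -> System.out.println("alice -> " + e.to()));
        }
    }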
Document-based data model
• The central concept of a document-oriented database is the notion of a document
• Documents in a document store are roughly equivalent to the programming concept of an object
• While each document-oriented database implementation differs on the details of this definition, in general they all assume that documents encapsulate and encode data (or information) in some standard format or encoding
• Encodings in use include XML, YAML and JSON, as well as binary forms like BSON
• Different types of documents are allowed in a single store
• Documents are addressed in the database via a unique key that represents that document. This key is a simple identifier (or ID), typically a string, a URI, or a path
• Each key is paired with a complex data structure known as a document
• Documents can contain many different key-value pairs, key-array pairs, or even nested documents
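A small sketch of one such document, modeled here with plain Java collections (illustrative only; a real document store would accept the same structure as JSON or BSON):

    import java.util.List;
    import java.util.Map;

    // One document behind a single key, mixing key-value pairs,
    // a key-array pair, and a nested sub-document.
    public class DocumentSketch {
        public static void main(String[] args) {
            Map<String, Object> document = Map.of(
                "name", "Alice",                          // key-value pair
                "tags", List.of("admin", "beta-tester"),  // key-array pair
                "address", Map.of(                        // nested document
                    "city", "Minsk",
                    "country", "Belarus"));

            Map<String, Map<String, Object>> store = Map.of("user:42", document);
            System.out.println(store.get("user:42").get("address"));
        }
    }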
SQL vs NoSQL
Common advantages of NoSQL systems
■ Cheap and easy to implement (open source)
■ Data are replicated to multiple nodes (therefore identical and fault-tolerant) and can be partitioned
❑ When data is written, the latest version is on at least one node and is then replicated to other nodes
❑ No single point of failure
■ Easy to distribute
■ Don't require a schema
What does NoSQL not provide?
■ Joins
■ Group by
❑ But PNUTS (a massively parallel and geographically distributed database system for Yahoo!'s web applications) provides a materialized-view approach to joins/aggregation
■ ACID transactions
■ SQL
■ Integration with applications that are based on SQL
What: HBase is...
An open-source, non-relational, distributed column-family database modeled after Google's Bigtable.
Think of it as a sparse, consistent, distributed, multidimensional, sorted map:
labeled tables of rows
rows consisting of key-value cells:
(row key, column family, column, timestamp) -> value
HBase
Provides random, real-time read/write access to Big Data.
The goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware.
HDFS vs HBase
HBase
Tables in HBase can serve as the input and output for MapReduce jobs run in Hadoop.
HBase may be accessed through the Java API, but also through REST, Avro or Thrift gateway APIs.
HBase runs on top of HDFS and is well-suited for fast read and write operations on large datasets with high throughput and low input/output latency.
Phoenix
HBase is not a direct replacement for a classic SQL database; however, the Apache Phoenix project provides a SQL layer for HBase.
Apache Phoenix is an open-source, massively parallel, relational database engine supporting OLTP for Hadoop, using Apache HBase as its backing store.
Phoenix provides a JDBC driver that hides the intricacies of the NoSQL store, enabling users to create, delete, and alter SQL tables, views, indexes, and sequences; insert and delete rows singly and in bulk; and query data through SQL.
Phoenix compiles queries and other statements into native NoSQL store APIs.
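Because Phoenix is exposed through JDBC, ordinary JDBC code runs against HBase. A minimal sketch (the ZooKeeper host and the CITIES table are illustrative placeholders; note that Phoenix uses UPSERT rather than INSERT, and autocommit is off by default):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Minimal Phoenix-over-HBase session through plain JDBC.
    public class PhoenixExample {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host");
                 Statement stmt = conn.createStatement()) {

                stmt.execute("CREATE TABLE IF NOT EXISTS CITIES ("
                           + "NAME VARCHAR PRIMARY KEY, COUNTRY VARCHAR)");

                // Phoenix uses UPSERT (insert-or-update) instead of INSERT
                stmt.executeUpdate("UPSERT INTO CITIES VALUES ('Minsk', 'Belarus')");
                conn.commit();

                try (ResultSet rs = stmt.executeQuery(
                         "SELECT COUNTRY FROM CITIES WHERE NAME = 'Minsk'")) {
                    while (rs.next()) System.out.println(rs.getString(1));
                }
            }
        }
    }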
Usage
HBase now serves several data-driven websites:
Facebook elected to implement its new messaging platform using HBase in November 2010, but migrated away from HBase in 2018 (to MyRocks).
Twitter runs HBase across its entire Hadoop cluster.
HP IceWall SSO is a web-based single sign-on solution that uses HBase to store user data for authenticating users.
Adobe currently has about 30 nodes running HDFS, Hadoop and HBase in clusters ranging from 5 to 14 nodes, in both production and development.
See "Powered By Apache HBase" at http://hbase.apache.org/poweredbyhbase.html
Enterprises that use HBase
What: Part of the Hadoop ecosystem
Provides real-time random read/write access to data stored in HDFS
[Diagram: a Consumer reads data from HBase and a Producer writes data to it; HBase itself reads from and writes to HDFS]
Hive vs. HBase
o Unlike Hive, HBase operations run in real time on its database rather than as MapReduce jobs
o Apache Hive is a data warehouse system built on top of Hadoop; Apache HBase is a NoSQL key/value store on top of HDFS
o Apache Hive provides SQL features for Spark/Hadoop data; HBase can store or process Hadoop data with near real-time read/write needs
o Hive should be used for analytical querying of data collected over a period of time; HBase is primarily used to store and process unstructured Hadoop data
o HBase is perfect for real-time querying of Big Data; Hive should not be used for real-time querying
What: Features-1
Linear scalability, capable of storing hundreds of terabytes of data
Automatic and configurable sharding of tables
Automatic failover support
Strictly consistent reads and writes
What: Features-2
Integrates nicely with Hadoop MapReduce (both as source and destination)
Easy Java API for client access
Thrift gateway and REST APIs
Bulk import of large amounts of data
Replication across clusters & backup options
Block cache and Bloom filters for real-time queries
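A minimal sketch of that Java API (assumes an HBase 1.0+ client on the classpath and a reachable cluster; the "cities" table and its "geo" family are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    // Connect, write one cell, read it back, and clean up.
    public class HBaseClientSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("cities"))) {

                Put put = new Put(Bytes.toBytes("Minsk"));    // row key
                put.addColumn(Bytes.toBytes("geo"),           // column family
                              Bytes.toBytes("country"),       // column qualifier
                              Bytes.toBytes("Belarus"));      // value
                table.put(put);

                Result result = table.get(new Get(Bytes.toBytes("Minsk")));
                System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("geo"), Bytes.toBytes("country"))));
            }
        }
    }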
How to use HBase?
HBase Table
How: the Data
Row keys are uninterpreted byte arrays
Columns are grouped into column families (CFs)
CFs are defined statically upon table creation
A cell is an uninterpreted byte array plus a timestamp; all values are stored as byte arrays
Rows are ordered and accessed by row key; different kinds of data are separated into CFs
Rows can have different columns, a cell can have multiple versions, and data can be very "sparse"

Row Key         Data
Minsk           geo:{'country':'Belarus','region':'Minsk'}
                demography:{'population':'1,937,000'@ts=2011}
New_York_City   geo:{'country':'USA','state':'NY'}
                demography:{'population':'8,175,133'@ts=2010,
                            'population':'8,244,910'@ts=2011}
Suva            geo:{'country':'Fiji'}
How: Writing the Data
Row updates are atomic
Updates across multiple rows are NOT atomic; there is no transaction support out of the box
HBase stores N versions of a cell (default 3)
Tables are usually "sparse": not all columns are populated in every row
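A hedged sketch of versioned writes with a 2.x-era client (reusing the illustrative "cities"-style Table from the earlier sketch): explicit timestamps create two versions of one cell, and the read asks for up to the default three.

    import java.util.List;
    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.CellUtil;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    // Write two versions of the same cell, then read them both back.
    public class VersionedWriteSketch {
        static void demo(Table table) throws Exception {
            byte[] row = Bytes.toBytes("New_York_City");
            byte[] cf  = Bytes.toBytes("demography");
            byte[] col = Bytes.toBytes("population");

            // Explicit timestamps create two versions of one cell
            table.put(new Put(row).addColumn(cf, col, 2010L, Bytes.toBytes("8,175,133")));
            table.put(new Put(row).addColumn(cf, col, 2011L, Bytes.toBytes("8,244,910")));

            // Ask for up to 3 versions (the default retention) on read
            Result result = table.get(new Get(row).readVersions(3));
            List<Cell> versions = result.getColumnCells(cf, col);
            for (Cell cell : versions) {
                System.out.println(cell.getTimestamp() + " -> "
                    + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }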
How: Reading the Data
A reader will always read the last written (and committed) values
Reading a single row: Get
Reading multiple rows: Scan (very fast)
A Scan usually defines a start key and a stop key
Rows are ordered, so it is easy to do a partial key scan
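For example, a key-range scan against the same illustrative table (with a 2.x client the start row is inclusive and the stop row exclusive):

    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    // Because rows are sorted by key, all rows from "M" up to
    // (but not including) "O" come back in order.
    public class ScanSketch {
        static void demo(Table table) throws Exception {
            Scan scan = new Scan()
                .withStartRow(Bytes.toBytes("M"))   // inclusive
                .withStopRow(Bytes.toBytes("O"));   // exclusive

            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            }
        }
    }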
How: MapReduce Integration
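The original slide here is a diagram; as a code-level sketch of the same integration (table name illustrative), HBase's TableMapReduceUtil wires a table in as MapReduce input, feeding the mapper one row per call -- essentially what HBase's bundled RowCounter tool does:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    // Map-only job that counts the rows of an HBase table via a counter.
    public class RowCountSketch {
        static class RowMapper extends TableMapper<ImmutableBytesWritable, Result> {
            @Override
            protected void map(ImmutableBytesWritable rowKey, Result columns, Context context) {
                context.getCounter("sketch", "rows").increment(1);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(HBaseConfiguration.create(), "hbase-row-count");
            job.setJarByClass(RowCountSketch.class);
            // The table serves as the MapReduce input source
            TableMapReduceUtil.initTableMapperJob(
                "cities", new Scan(), RowMapper.class,
                ImmutableBytesWritable.class, Result.class, job);
            job.setNumReduceTasks(0);                         // map-only
            job.setOutputFormatClass(NullOutputFormat.class); // counter is the output
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }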
How: Sharding the Data
Automatic and configurable sharding of tables:
Tables are partitioned into Regions
A Region is defined by its start & end row keys
Regions are the "atoms" of distribution
Regions are assigned to RegionServers (HBase cluster slaves)
How: Setup: Components
[Diagram: HBase components, including ZooKeeper and clients]
How: Setup: Hadoop Cluster
[Diagram: typical Hadoop+HBase setup. The master node runs the HDFS NameNode, the MapReduce JobTracker and the HBase HMaster; each slave node runs a DataNode, a TaskTracker and a RegionServer]
How: Setup: Automatic Failover
When to Use HBase?
When: What HBase is good at
Serving large amounts of data: built to scale from the get-go
Fast random access to the data
Write-heavy applications*
Append-style writing (inserting/overwriting new data) rather than heavy read-modify-write operations
When: HBase vs ...
General commands (HBase shell)
• status: Provides the status of HBase, for example, the number of servers.
• version: Provides the version of HBase being used.
• table_help: Provides help for table-reference commands.
• whoami: Provides information about the current user.
HBase DDL commands
• create: Creates a table.
• list: Lists all the tables in HBase.
• disable: Disables a table.
• is_disabled: Verifies whether a table is disabled.
• enable: Enables a table.
• is_enabled: Verifies whether a table is enabled.
• describe: Provides the description of a table.
• alter: Alters a table.
• exists: Verifies whether a table exists.
• drop: Drops a table from HBase.
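The same DDL operations are also available programmatically through the Java Admin API; a hedged sketch using HBase 2.x-style builders (the "cities" table and "geo" family are illustrative):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

    // create / exists / disable / drop, via the Java Admin API.
    public class AdminSketch {
        public static void main(String[] args) throws Exception {
            try (Connection connection =
                     ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Admin admin = connection.getAdmin()) {

                TableName name = TableName.valueOf("cities");  // illustrative
                admin.createTable(TableDescriptorBuilder.newBuilder(name)
                        .setColumnFamily(ColumnFamilyDescriptorBuilder.of("geo"))
                        .build());                             // shell: create 'cities', 'geo'

                System.out.println(admin.tableExists(name));   // shell: exists
                admin.disableTable(name);                      // shell: disable
                admin.deleteTable(name);                       // shell: drop
            }
        }
    }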
HBase data manipulation commands
• put: Puts a cell value at a specified column in a specified row in a particular table.
• get: Fetches the contents of a row or a cell.
• delete: Deletes a cell value in a table.
• deleteall: Deletes all the cells in a given row.
• scan: Scans and returns the table data.
• count: Counts and returns the number of rows in a table.
• truncate: Disables, drops, and recreates a table.