NoSQL Systems
RDBMS Databases
● Good for handling transactional workloads involving small amounts of
data with random read/write properties.
● Are ACID-compliant: they guarantee atomicity, consistency, isolation, and durability.
○ They are generally restricted to a single node.
○ They do not provide out-of-the-box redundancy and fault tolerance.
● To handle large volumes of data, RDBMSs employ vertical scaling, which
is a more costly strategy than horizontal scaling.
○ This makes RDBMSs less than ideal for long-term storage of data that accumulates over time.
RDBMS Databases
● Relational databases need to be manually sharded, mostly using application logic.
○ This means that the application logic needs to know which shard to query in order to get the required data.
○ This further complicates data processing when data from multiple shards is required.
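A minimal sketch of what "the application logic needs to know which shard to query" can look like. The shards are plain Python dicts standing in for separate database servers, and `NUM_SHARDS`, `shard_for`, `put`, `get`, and `get_many` are hypothetical names used only for illustration; real deployments may route by range or by a lookup table instead of a hash.

```python
import hashlib

NUM_SHARDS = 3  # assumption: a fixed number of shards
# Each dict stands in for one independent database server.
shards = [dict() for _ in range(NUM_SHARDS)]

def shard_for(key: str) -> int:
    """Pick a shard deterministically by hashing the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put(key: str, row: dict) -> None:
    shards[shard_for(key)][key] = row

def get(key: str):
    # The application, not the database, decides which shard to ask.
    return shards[shard_for(key)].get(key)

def get_many(keys):
    """Fetch rows that may live on different shards and combine them in the application."""
    return {k: get(k) for k in keys}

put("user:1", {"name": "Ali"})
put("user:2", {"name": "Zaid"})
print(get("user:1"))
print(get_many(["user:1", "user:2"]))
```

Note that `get_many` has to visit every shard that holds one of the requested keys, which is the extra work the second sub-bullet refers to.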
RDBMS Databases
● Data retrieved from multiple shards must be joined
using the application logic.
RDBMS Databases
● Relational databases generally require data to adhere to a schema.
○ Semi-structured and unstructured data are not directly supported.
● A traditional RDBMS is generally not useful as the primary storage device
in a Big Data solution environment.
Types of NoSQL Systems
1. Key-value Database
2. Document-oriented Database
3. Column-oriented Database
4. Graph Database
Key-value Database
● One of the simplest NoSQL databases.
● Data is represented as a collection of <key,value> pairs.
● It works by storing buckets of <key,value> pairs in a logical way, so that all
relevant data relating to an item are stored within that item.
● A key can have a dynamic set of attributes attached to it.
● Fast response time.
● Ability to store an enormous number of records with extremely low latency.
● Provides all the maintenance and failover services.
● Some examples of this type of database are Redis, Riak, Amazon
DynamoDB, and Voldemort.
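A toy in-memory sketch of the bucket and <key,value> idea described above. `KeyValueStore` and its methods are hypothetical names, not the API of Redis, Riak, DynamoDB, or Voldemort; the point is only that each key maps to one value, and values under different keys may carry different attributes.

```python
class KeyValueStore:
    """Toy in-memory key-value store; buckets group related keys."""

    def __init__(self):
        self._buckets = {}  # bucket name -> {key: value}

    def put(self, bucket, key, value):
        self._buckets.setdefault(bucket, {})[key] = value

    def get(self, bucket, key, default=None):
        return self._buckets.get(bucket, {}).get(key, default)

    def delete(self, bucket, key):
        self._buckets.get(bucket, {}).pop(key, None)

store = KeyValueStore()
# The two values have different attribute sets; the store does not care.
store.put("users", "u1", {"name": "Ali", "age": 25})
store.put("users", "u2", {"name": "Zaid", "email": "zaid@example.com"})
print(store.get("users", "u1"))
store.delete("users", "u2")
print(store.get("users", "u2"))
```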
DOCUMENT-ORIENTED DATABASE
● A document-oriented database extends the concept of a key-value
database by employing flexible data structures.
● It stores records as “documents”.
● It supports nested and complex document structures to define subcategories
of information.
● The data values in a key-value database are opaque to the store, whereas
the data values in a document-oriented database are transparent to the
store.
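To illustrate the opaque versus transparent distinction, here is a small sketch using plain Python dicts as stand-ins for both kinds of store. `kv_store`, `doc_store`, and `find` are hypothetical names chosen for this example.

```python
import json

# Key-value store: the value is an opaque blob; the store can only return it by key.
kv_store = {"u1": json.dumps({"name": "Ali", "age": 25}).encode("utf-8")}

# Document store: the value is a structure the store itself can inspect.
doc_store = {
    "u1": {"name": "Ali", "age": 25},
    "u2": {"name": "Zaid", "age": 31},
}

def find(store, field, value):
    """Query by a field inside the documents (possible only when values are transparent)."""
    return [doc for doc in store.values() if doc.get(field) == value]

blob = kv_store["u1"]        # lookup by key only
decoded = json.loads(blob)   # the application must decode the blob itself
matches = find(doc_store, "age", 31)
print(decoded["name"], matches)
```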
DOCUMENT-ORIENTED DATABASE
● Strengths
○ Lower cost of scaling out compared to a SQL database.
○ Can index the fields of documents, which allows the user to query not only by the primary
key but also by a document’s contents.
○ Schemaless: the user is completely free to define the contents of a document.
● Limitations
○ Generally not suitable for business transaction applications.
○ Does not offer any referential integrity support.
○ Generally does not offer joins across collections.
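The field-indexing strength listed above can be sketched as a secondary index that maps each value of a field to the primary keys of the documents containing it. This is an illustration only; `build_index`, `city_index`, and `find_by_city` are hypothetical names, and real document stores typically use tree-based index structures rather than a dict.

```python
from collections import defaultdict

documents = {
    1: {"name": "Ali", "city": "Amman"},
    2: {"name": "Zaid", "city": "Irbid"},
    3: {"name": "Maryam", "city": "Amman"},
}

def build_index(docs, field):
    """Map each value of `field` to the set of primary keys that have it."""
    index = defaultdict(set)
    for key, doc in docs.items():
        if field in doc:
            index[doc[field]].add(key)
    return index

city_index = build_index(documents, "city")

def find_by_city(city):
    # Index lookup instead of scanning every document.
    return [documents[k] for k in sorted(city_index.get(city, ()))]

print(find_by_city("Amman"))
```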
MongoDB
● Data Representation:
○ MongoDB is a document-style database.
○ A document is analogous to a row in an RDBMS.
○ In MongoDB, a collection is a group of documents. This is analogous to a table in an RDBMS.
○ Documents in MongoDB are stored in a JavaScript Object Notation (JSON)-like format (BSON, a binary encoding of JSON).
● Indexing and Sharding
○ Documents are indexed according to keywords for faster access and retrieval.
○ Sharding (or index sharding) is the process of splitting a database across multiple
machines.
○ MongoDB incorporates auto-sharding, through which a MongoDB cluster can split data
and re-balance automatically.
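The idea of automatic re-balancing can be sketched with a toy loop that moves chunks of data from the fullest shard to the emptiest one. This is not MongoDB's actual balancer; `rebalance`, `cluster`, and the chunk names are hypothetical and serve only to show the principle.

```python
def rebalance(shards):
    """Move chunks from the fullest shard to the emptiest until counts differ by at most 1."""
    moves = 0
    while True:
        fullest = max(shards, key=lambda name: len(shards[name]))
        emptiest = min(shards, key=lambda name: len(shards[name]))
        if len(shards[fullest]) - len(shards[emptiest]) <= 1:
            return moves
        shards[emptiest].append(shards[fullest].pop())
        moves += 1

# Hypothetical cluster: shardA holds all six chunks; shardB and shardC are empty.
cluster = {"shardA": ["c1", "c2", "c3", "c4", "c5", "c6"], "shardB": [], "shardC": []}
moves = rebalance(cluster)
print(moves, {name: len(chunks) for name, chunks in cluster.items()})
```

After the call, each of the three shards holds two chunks, so adding an empty host and re-running the loop spreads the load without any change to application code.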
MongoDB
● Automatic sharding benefits:
○ Automatic balancing of data.
○ Scaling out with minimal downtime, i.e., new hosts can be added.
○ Replication to avoid single point of failure.
MongoDB
● A shard consists of one or more servers that contain the subset of data
that the shard is responsible for.
● If there is more than one server in a shard, the shard may also contain
replicated data.
Example
Mongo DB + Python
! python -m pip install pymongo==3.7.2
###########
import pymongo
from pymongo import MongoClient
client = MongoClient()
#######
Mongo DB + Python
# Create (or open) the database db1
mydb = client["db1"]
# Create the collection
mydb.create_collection('addressbook')
# Set the collection to work with
collection = mydb.addressbook
# Insert one document into the collection
collection.insert_one({'name': 'Ali'})
# Show the documents in the collection
list(collection.find())
Mongo DB + Python
# Insert a document with nested and array fields
data = {
    'name': "Ali",                   # String
    'age': 25,                       # Integer
    'gender': "M",                   # String
    'address': {
        'street': "ahmad tarawnwh",  # String
        'number': 77,                # Integer
        'city': "AMMAN",             # String
        'floor': None,               # Null
        'postalcode': "11910",       # String containing a number
    },
    'favouriteFruits': ['banana', 'pineapple', 'orange']  # Array
}
collection.insert_one(data)
Mongo DB + Python
# All documents
list(collection.find())
# Documents whose name is "Ali"
list(collection.find({'name': "Ali"}))
# Projection: selecting only some fields (_id is returned unless excluded)
list(collection.find({}, {'name': 1, 'age': 1}))
# Projection: excluding some fields
list(collection.find({}, {'name': 0, 'age': 0}))
# Projection: selecting only some fields and excluding the _id
list(collection.find({}, {'name': 1, 'address.city': 1, '_id': 0}))
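To make the effect of an inclusion projection concrete without a running server, the sketch below imitates it in plain Python. `project` is a hypothetical helper, not part of pymongo, and it only models the inclusion case with dotted field names.

```python
def project(doc, fields, include_id=True):
    """Keep only the listed fields; dotted names such as 'address.city' reach into sub-documents."""
    out = {}
    if include_id and "_id" in doc:
        out["_id"] = doc["_id"]
    for path in fields:
        source, target = doc, out
        parts = path.split(".")
        for i, part in enumerate(parts):
            if not isinstance(source, dict) or part not in source:
                break  # field missing in this document: skip it
            if i == len(parts) - 1:
                target[part] = source[part]
            else:
                source = source[part]
                target = target.setdefault(part, {})
    return out

doc = {"_id": 1, "name": "Ali", "age": 25,
       "address": {"city": "AMMAN", "number": 77}}

with_id = project(doc, ["name", "age"])
without_id = project(doc, ["name", "address.city"], include_id=False)
print(with_id)
print(without_id)
```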
Comparison Query Operators
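# Comparison operators: age less than 30
list(collection.find({'age': {'$lt': 30}}))
# Same filter, returning only name and age
list(collection.find({'age': {'$lt': 30}}, {'name': 1, 'age': 1, '_id': 0}))
# Age greater than or equal to 25
list(collection.find({'age': {'$gte': 25}}, {'name': 1, 'age': 1, '_id': 0}))
# $in operator: age equals any value in the list (20 or 30), not a range
list(collection.find({'age': {'$in': [20, 30]}}, {'name': 1, 'age': 1, '_id': 0}))
# $nin operator: age equals none of the listed values (documents without an age field also match)
list(collection.find({'age': {'$nin': [20, 30]}}, {'name': 1, 'age': 1, '_id': 0}))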
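The meaning of these operators can be checked offline with a small matcher written in plain Python. It is a sketch, not how MongoDB evaluates queries: `OPERATORS`, `matches`, and the `people` list are hypothetical, only the operators used above are modelled, and every sample document is assumed to contain the queried field.

```python
import operator

# Assumption: only a handful of comparison operators are modelled.
OPERATORS = {
    "$lt": operator.lt,
    "$lte": operator.le,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}

def matches(doc, query):
    """Return True if every condition in `query` holds for `doc`."""
    for field, condition in query.items():
        value = doc[field]  # assumption: the field is present
        if isinstance(condition, dict):
            # Several operators on one field are combined with an implicit AND.
            if not all(OPERATORS[op](value, arg) for op, arg in condition.items()):
                return False
        elif value != condition:
            return False
    return True

people = [
    {"name": "Ali", "age": 25},
    {"name": "Zaid", "age": 30},
    {"name": "Maryam", "age": 20},
    {"name": "Kareem", "age": 41},
]

def names(query):
    return [p["name"] for p in people if matches(p, query)]

print(names({"age": {"$lt": 30}}))
print(names({"age": {"$in": [20, 30]}}))
print(names({"age": {"$nin": [20, 30]}}))
print(names({"age": {"$gt": 15, "$lt": 30}}))
```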
Logical Query Operators
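# Explicit $and: name is "Ali" and age is less than 30
list(collection.find(
    {'$and': [{'name': "Ali"}, {'age': {'$lt': 30}}]},
    {'name': 1, 'age': 1, '_id': 0}))
# Explicit $and: age greater than 15 and less than 30
list(collection.find(
    {'$and': [{'age': {'$gt': 15}}, {'age': {'$lt': 30}}]},
    {'name': 1, 'age': 1, '_id': 0}))
# Equivalent shorter form: several operators on the same field are ANDed implicitly
list(collection.find(
    {'age': {'$gt': 15, '$lt': 30}},
    {'name': 1, 'age': 1, '_id': 0}))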
Sorting
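# Sort by age, descending
list(collection.find(
    {},
    {'name': 1, 'age': 1, '_id': 0}
).sort('age', -1))
# Sort by name ascending, then by age descending
list(collection.find(
    {},
    {'name': 1, 'age': 1, '_id': 0}
).sort([('name', pymongo.ASCENDING), ('age', pymongo.DESCENDING)]))
# Equivalent, since pymongo.ASCENDING is 1 and pymongo.DESCENDING is -1:
# .sort([('name', 1), ('age', -1)])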
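For comparison, the same "name ascending, then age descending" ordering can be reproduced in plain Python. The `people` list is hypothetical sample data; the sketch relies on Python's sort being stable, so sorting by the secondary key first and the primary key second gives the combined order.

```python
people = [
    {"name": "Zaid", "age": 30},
    {"name": "Ali", "age": 25},
    {"name": "Ali", "age": 40},
]

# Sort by the least significant key first (age, descending) ...
by_age_desc = sorted(people, key=lambda d: d["age"], reverse=True)
# ... then by the most significant key (name, ascending).
# Ties on 'name' keep the age ordering because the sort is stable.
result = sorted(by_age_desc, key=lambda d: d["name"])
print(result)
```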
Aggregation Operations
● You can use aggregation operations to:
○ Group values from multiple documents together.
○ Perform operations on the grouped data to return a single result.
○ Analyze data changes over time.
● To perform aggregation operations, we can use:
○ Aggregation pipelines
○ Single purpose aggregation methods
Aggregation Pipeline
A pipeline consists of one or more stages that process documents.
Each stage performs an operation on its input documents and passes the result to the next stage.
Sample stage operations:
● $project – select fields for the output documents.
● $match – select documents to be processed.
● $sort – sort documents.
● $group – group documents by a specified key.
….
Example
# Create the collection and get a handle to it
mydb.create_collection('stdinfo')
std_collection = mydb.stdinfo
data = [
    {'name': 'ali', 'gpa': 90, 'prog': "CS"},
    {'name': 'zaid', 'gpa': 88, 'prog': "DS"},
    {'name': 'ahmed', 'gpa': 70, 'prog': "SE"},
    {'name': 'maryam', 'gpa': 68, 'prog': "SE"},
    {'name': 'fatema', 'gpa': 87, 'prog': "DS"},
    {'name': 'kareem', 'gpa': 77, 'prog': "CS"}
]
std_collection.insert_many(data)
list(std_collection.find())
# Average GPA per program
list(std_collection.aggregate([
    {
        '$group': {
            '_id': '$prog',
            'avgGPA': {'$avg': "$gpa"}
        }
    }
]))
# Average GPA per program, highest average first
list(std_collection.aggregate([
    {
        '$group': {
            '_id': '$prog',
            'avgGPA': {'$avg': "$gpa"}
        }
    },
    {
        '$sort': {'avgGPA': -1}
    }
]))
# Keep only the programs whose average GPA is above 70, highest first
list(std_collection.aggregate([
    {
        '$group': {
            '_id': '$prog',
            'avgGPA': {'$avg': "$gpa"}
        }
    },
    {
        '$match': {'avgGPA': {'$gt': 70}}
    },
    {
        '$sort': {'avgGPA': -1}
    }
]))
# Filter the documents first (CS and SE only), then group and sort
list(std_collection.aggregate([
    {
        '$match': {'prog': {'$in': ['CS', 'SE']}}
    },
    {
        '$group': {
            '_id': '$prog',
            'avgGPA': {'$avg': "$gpa"}
        }
    },
    {
        '$sort': {'avgGPA': -1}
    }
]))
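As a cross-check that needs no MongoDB server, the match, group, and sort steps can be imitated in plain Python on the same student records. `average_gpa_by_program` and the `avg_gpa` key are hypothetical names for this sketch; the code mimics the stages and is not how the aggregation framework is implemented.

```python
from collections import defaultdict

students = [
    {"name": "ali", "gpa": 90, "prog": "CS"},
    {"name": "zaid", "gpa": 88, "prog": "DS"},
    {"name": "ahmed", "gpa": 70, "prog": "SE"},
    {"name": "maryam", "gpa": 68, "prog": "SE"},
    {"name": "fatema", "gpa": 87, "prog": "DS"},
    {"name": "kareem", "gpa": 77, "prog": "CS"},
]

def average_gpa_by_program(docs, programs=None):
    """Mimic an optional $match, then $group with $avg, then a descending $sort."""
    grades = defaultdict(list)
    for doc in docs:
        if programs is None or doc["prog"] in programs:  # $match
            grades[doc["prog"]].append(doc["gpa"])       # $group
    averages = [{"_id": prog, "avg_gpa": sum(g) / len(g)} for prog, g in grades.items()]
    return sorted(averages, key=lambda d: d["avg_gpa"], reverse=True)  # $sort

all_programs = average_gpa_by_program(students)
cs_and_se = average_gpa_by_program(students, programs={"CS", "SE"})
print(all_programs)
print(cs_and_se)
```

With this data the per-program averages are 87.5 for DS, 83.5 for CS, and 69.0 for SE, so a filter of "average above 70" would drop SE.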
COLUMN-ORIENTED DATABASE
A column-oriented database stores its content by column as opposed to by row
and serializes all of the values of a column together. A columnar database aims to
efficiently retrieve or write data from hard disk storage in order to reduce the time it
takes to return a query result.
Strengths
● High data compression, which helps storage capacity to be used more
efficiently.
● Can achieve high query performance on aggregation queries such as AVG,
SUM, MAX, MIN, and COUNT.
● More efficient for inserting the values of a single column at once, as these can be
written efficiently without affecting any other columns for the rows.
● The quick searching, scanning, and aggregation abilities of column-oriented
database storage are highly efficient for analytics.
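The difference between the two layouts can be sketched with ordinary Python containers. `rows`, `row_store`, and `column_store` are hypothetical names; a real columnar engine adds compression and on-disk encoding that this illustration leaves out.

```python
rows = [
    {"name": "ali", "gpa": 90, "prog": "CS"},
    {"name": "zaid", "gpa": 88, "prog": "DS"},
    {"name": "ahmed", "gpa": 70, "prog": "SE"},
]

# Row-oriented layout: all values of one record are stored together.
row_store = [tuple(r.values()) for r in rows]

# Column-oriented layout: all values of one column are stored together.
column_store = {field: [r[field] for r in rows] for field in rows[0]}

# An aggregate over one column touches only that column's values.
avg_gpa = sum(column_store["gpa"]) / len(column_store["gpa"])
print(row_store[0])
print(column_store["gpa"], avg_gpa)
```

In the row layout, computing the same average means reading every record, including the name and program values that the query does not need.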
GRAPH DATABASE