Document Databases
MODULE 4, CLASS 1
• Documents are the main concept in document databases.
• The database stores and retrieves documents, which can be XML, JSON,
BSON, and so on.
• These documents are self-describing, hierarchical tree data structures
which can consist of maps, collections, and scalar values.
• The documents stored are similar to each other but do not have to be
exactly the same.
• Document databases store documents in the value part of the key-value
store; think about document databases as key-value stores where the value
is examinable.
Oracle and MongoDB
• The _id is a special field that
is found on all documents in
Mongo, just like ROWID in
Oracle.
• In MongoDB, _id can be
assigned by the user, as long
as it is unique.
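• For instance, a minimal Mongo shell sketch (collection and field names here are illustrative, not from the slides):
db.customer.insert({ _id: "martin@example.com", firstname: "Martin" })  // user-assigned _id
db.customer.insert({ firstname: "Pramod" })  // _id is auto-generated as an ObjectId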
What Is a Document Database?
[Figure: two example documents, DOCUMENT-1 and DOCUMENT-2]
• Looking at the documents, we can see that they are similar, but have
differences in attribute names.
• This is allowed in document databases.
• The schema of the data can differ across documents, but these documents can still belong to the same collection—unlike an RDBMS, where every row in a table has to follow the same schema.
• We represent a list of cities visited as an array, or a list of addresses as a list of documents embedded inside the main document.
• Embedding child documents as sub objects inside documents provides for easy
access and better performance.
• If you look at the documents, you will see that some of the attributes are similar,
such as firstname or city.
• At the same time, there are attributes in the second document which do not exist
in the first document, such as addresses, while likes is in the first document but
not the second.
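• A sketch of what the two documents might look like (all field names and values are illustrative, not taken from the original figure):
DOCUMENT-1
{ "firstname": "Martin",
  "lastcity": "Boston",
  "likes": [ "Biking", "Photography" ] }
DOCUMENT-2
{ "firstname": "Pramod",
  "lastcity": "Chicago",
  "citiesvisited": [ "Chicago", "London", "Pune", "Bangalore" ],
  "addresses": [ { "state": "AK", "city": "DILLINGHAM" },
                 { "state": "MH", "city": "PUNE" } ] }
• Both share firstname and lastcity; likes appears only in the first document and addresses only in the second, yet both can live in the same collection.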
Popular document databases
• MongoDB [MongoDB]
• CouchDB [CouchDB]
• Terrastore [Terrastore]
• OrientDB [OrientDB]
• RavenDB [RavenDB]
• Lotus Notes [Notes Storage Facility], which uses document storage
Features of a document database (MongoDB): how MongoDB works
1. Consistency
2. Availability
3. Transactions
4. Query Features
5. Scaling
1. Consistency
• Consistency in a MongoDB database is configured by using replica sets and choosing to wait for the writes to be replicated to all the slaves or a given number of slaves.
• Every write can specify the number of
servers the write has to be propagated to
before it returns as successful.
• A command like db.runCommand({ getlasterror : 1 , w : "majority" }) tells the database how strong a consistency guarantee you want.
• For example, if you have one server and specify the w as majority, the write
will return immediately since there is only one node.
• If you have three nodes in the replica set and specify w as majority, the write
will have to complete at a minimum of two nodes before it is reported as a
success.
• You can increase the w value for stronger consistency but you will suffer on
write performance, since now the writes have to complete at more nodes.
• Replica sets also allow you to increase read performance by allowing reads from slaves when slaveOk is set;
• this parameter can be set on the connection, or database, or collection, or
individually for each operation:
Mongo mongo = new Mongo("localhost:27017");
mongo.slaveOk();
• The code above sets slaveOk for the whole connection. Alternatively, we can set slaveOk per operation, so that we can decide which operations can work with data from the slave node:
DBCollection collection = getOrderCollection();
BasicDBObject query = new BasicDBObject();
query.put("name", "Martin");
DBCursor cursor = collection.find(query).slaveOk();
• Similar to various options available for read, you can change the settings to
achieve strong write consistency, if desired.
WriteConcern
• By default, a write is reported successful once the database
receives it; you can change this so as to wait for the writes to be
synced to disk or to propagate to two or more slaves.
• This is known as WriteConcern:
• You make sure that certain writes are written to the master and
some slaves by setting WriteConcern to REPLICAS_SAFE.
Code: setting the WriteConcern for all writes to a collection
DBCollection shopping = database.getCollection("shopping");
shopping.setWriteConcern(REPLICAS_SAFE);
• WriteConcern can also be set per operation by specifying it on the save command:
WriteResult result = shopping.insert(order, REPLICAS_SAFE);
• There is a tradeoff that you need to think about carefully, based on your application needs and business requirements, to decide what settings make sense for slaveOk during reads and what safety level you desire during writes with WriteConcern.
END OF M4,C1
Availability, Transactions, Query Features
M4,C2
2. Availability
• Recall: CAP theorem.
• Document databases try to improve on availability by replicating data using the master-slave setup.
• The same data is available on multiple nodes and the clients can get to the data even when the primary node is down.
• Usually, the application code does not have to
determine if the primary node is available or not.
• MongoDB implements replication, providing high
availability using replica sets.
REPLICA SET
• In a replica set, there are two or more
nodes participating in an asynchronous
master-slave replication.
• The replica-set nodes elect the master,
or primary, among themselves.
• Assuming all the nodes have equal voting
rights, some nodes can be favoured for
being closer to the other servers, for
having more RAM, and so on;
• users can affect this by assigning a priority—a number between 0 and 1000—to a node.
• All requests go to the master node, and the data is replicated to the slave
nodes.
• If the master node goes down, the remaining nodes in the replica set vote
among themselves to elect a new master;
• all future requests are routed to the new master, and the slave nodes start
getting data from the new master.
• When the node that failed comes back online, it joins in as a slave and catches
up with the rest of the nodes by pulling all the data it needs to get current.
Example: Replica set configuration with higher priority assigned to nodes in the same datacenter
• We have two nodes, mongo A and mongo B, running the MongoDB
database in the primary datacenter, and mongo C in the secondary
datacenter.
• If we want nodes in the primary datacenter to be elected as primary nodes, we can assign them a higher priority than the other nodes, as sketched after this list.
• More nodes can be added to the replica sets without having to take them
offline.
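A minimal Mongo shell sketch of such a configuration (hostnames, ports, and priority values are assumptions):
rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "mongoA:27017", priority: 10 },  // primary datacenter
    { _id: 1, host: "mongoB:27017", priority: 10 },  // primary datacenter
    { _id: 2, host: "mongoC:27017", priority: 1 }    // secondary datacenter
  ]
})
With these priorities, mongo A or mongo B will win the election as long as either is reachable.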
• The application writes or reads from the primary (master) node.
• When the connection is established, the application only needs to connect to one node (primary or not, it does not matter) in the replica set; the rest of the nodes are discovered automatically.
• When the primary node goes down, the driver talks to the new primary elected by the
replica set.
• The application does not have to manage any of the communication failures or node
selection criteria.
• Using replica sets gives you the
ability to have a highly available
document data store.
• Replica sets are generally used for
data redundancy, automated
failover, read scaling, server
maintenance without downtime, and
disaster recovery.
3. Transactions
• Transactions, in the traditional RDBMS sense, mean that you can start modifying the database with insert, update, or delete commands over different tables and then decide if you want to keep the changes or not by using commit or rollback.
• These constructs are generally not available in NoSQL solutions—a write either succeeds or fails.
• Transactions at the single-document level are known as atomic transactions.
• Transactions involving more than one operation are not possible, although there are products such as RavenDB that do support transactions across multiple operations.
• By default, all writes are reported as
successful.
• A finer control over the write can be achieved by using the WriteConcern parameter.
• We ensure that the order is written to more than one node before it's reported successful by using WriteConcern.REPLICAS_SAFE.
• Different levels of WriteConcern let you
choose the safety level during writes;
• for example, when writing log entries, you can use the lowest level of safety, WriteConcern.NONE.
final Mongo mongo = new Mongo(mongoURI);
mongo.setWriteConcern(REPLICAS_SAFE);
DBCollection shopping = mongo.getDB(orderDatabase).getCollection(shoppingCollection);
try {
  WriteResult result = shopping.insert(order, REPLICAS_SAFE);
  // Writes made it to primary and at least one secondary
} catch (MongoException writeException) {
  // Writes did not make it to minimum of two nodes including primary
  dealWithWriteFailure(order, writeException);
}
4. Query Features
• CouchDB allows you to query via views—complex queries on documents
which can be either materialized or dynamic (think of them as RDBMS views
which are either materialized or not).
• With CouchDB, if you need to aggregate the number of reviews for a product as
well as the average rating, you could add a view implemented via map-reduce
to return the count of reviews and the average of their ratings.
• When there are many requests, you don’t want to compute the count
and average for every request;
• instead you can add a materialized view that precomputes the values
and stores the results in the database.
• These materialized views are updated when queried, if any data was
changed since the last update.
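• A hedged sketch of such a map-reduce view (the document structure and field names are assumptions, not from the slides). The map function emits one rating per review; the reduce keeps a running count and sum, from which the average is sum/count:
// map
function (doc) {
  if (doc.type === "review") {
    emit(doc.productId, doc.rating);
  }
}
// reduce (rereduce combines partial results from different key ranges)
function (keys, values, rereduce) {
  if (rereduce) {
    var count = 0, sum = 0;
    for (var i = 0; i < values.length; i++) {
      count += values[i].count;
      sum += values[i].sum;
    }
    return { count: count, sum: sum };
  }
  var total = 0;
  for (var j = 0; j < values.length; j++) {
    total += values[j];
  }
  return { count: values.length, sum: total };
}
• CouchDB's built-in _stats reduce would also return the count and sum directly.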
• One of the good features of document
databases, as compared to key-value
stores, is that we can query the data
inside the document without having to
retrieve the whole document by its key
and then introspect the document.
• This feature brings these databases closer
to the RDBMS query model.
MongoDB has a query language which is expressed
via JSON and has constructs such as
$query for the where clause,
$orderby for sorting the data, or
$explain to show the execution plan
of the query.
There are many more constructs like these
that can be combined to create a MongoDB
query.
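A hedged Mongo shell sketch combining these constructs (collection and field names are assumptions):
db.order.find({ "$query": { customerId: "883c2c5b4e5b" },
                "$orderby": { orderDate: -1 } })          // where clause plus sort
db.order.find({ customerId: "883c2c5b4e5b" }).explain()  // show the execution plan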
END OF M4,C2
Query Features (cont), Scaling, Suitable Use Cases
M4,C3
SQL and equivalent MongoDB queries
1. We want to return all the documents in the order collection (all rows in the order table).
• The SQL for this would be:
• SELECT * FROM order
• The equivalent query in Mongo shell would be:
• db.order.find()
2. Selecting the orders for a single customerId of 883c2c5b4e5b
• SQL Query would be:
• SELECT * FROM order WHERE customerId =
"883c2c5b4e5b"
• The equivalent query in Mongo to get all orders for a
single customerId of 883c2c5b4e5b:
• db.order.find({"customerId":"883c2c5b4e5b"})
3. Selecting orderId and orderDate for one customer
• SQL would be:
• SELECT orderId,orderDate FROM order WHERE
customerId = "883c2c5b4e5b"
• The equivalent in Mongo would be:
• db.order.find({customerId:"883c2c5b4e5b"},{orderId:1,orderDate:1})
• Similarly, queries to count, sum, and so on are all
available.
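• For instance, a hedged sketch (the orderValue field is an assumption; the aggregation framework requires MongoDB 2.2 or later):
db.order.count({ customerId: "883c2c5b4e5b" })  // count of matching documents
db.order.aggregate([ { $group: { _id: null, total: { $sum: "$orderValue" } } } ])  // sum across orders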
• Since the documents are aggregated objects, it is really easy to query for documents using the fields of embedded child objects.
4. Query for all the orders where one of the items ordered has a name like "Refactoring".
• The SQL for this would be:
• SELECT * FROM customerOrder, orderItem, product WHERE customerOrder.orderId = orderItem.customerOrderId AND orderItem.productId = product.productId AND product.name LIKE '%Refactoring%'
• The equivalent Mongo query would be:
• db.orders.find({"items.product.name": /Refactoring/})
The query for MongoDB is simpler because the objects
are embedded inside a single document and you can
query based on the embedded child documents.
5. SCALING
• Scaling for heavy-read loads can be achieved by adding more read slaves, so
that all the reads can be directed to the slaves.
• Given a heavy-read application, with our 3-node replica-set cluster, we can
add more read capacity to the cluster as the read load increases just by
adding more slave nodes to the replica set to execute reads with the slaveOk
flag
• This is horizontal scaling for reads.
Adding a new node, mongo D, to an existing replica-set cluster
Once the new node, mongo D, is started, it needs
to be added to the replica set.
rs.add("mongod:27017");
When a new node is added, it will sync up with the existing nodes, join the replica set as a secondary node, and start serving read requests.
An advantage of this setup is that we do not
have to restart any other nodes, and there is no
downtime for the application either.
SCALING FOR WRITE
When we want to scale for write, we can start
sharding the data.
Sharding is similar to partitions in RDBMS where
we split data by value in a certain column, such as
state or year.
With RDBMS, partitions are usually on the same node, so the
client application does not have to query a specific partition
but can keep querying the base table;
the RDBMS takes care of finding the right partition
for the query and returns the data
• In sharding, the data is also split by a certain field, but then moved to different Mongo nodes.
• The data is dynamically moved between nodes to ensure that shards are
always balanced.
• We can add more nodes to the cluster and increase the number of writable
nodes, enabling horizontal scaling for writes.
db.runCommand({ shardcollection : "ecommerce.customer", key : { firstname : 1 } })
• Splitting the data on the first name of the customer ensures that the
data is balanced across the shards for optimal write performance;
• furthermore, each shard can be a replica set, ensuring better read performance within the shard; a sketch of such a setup follows below.
MongoDB sharded setup where each shard is a replica set
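A hedged shell sketch of assembling such a cluster (replica-set names and hostnames are assumptions), run against a mongos router:
sh.addShard("rs1/mongoA1:27017,mongoA2:27017")
sh.addShard("rs2/mongoB1:27017,mongoB2:27017")
sh.enableSharding("ecommerce")
db.runCommand({ shardcollection : "ecommerce.customer", key : { firstname : 1 } })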
• When we add a new shard to this existing sharded cluster, the
data will now be balanced across four shards instead of three.
• As all this data movement and infrastructure refactoring is happening, the application will not experience any downtime, although the cluster may not perform optimally when large amounts of data are being moved to rebalance the shards.
• The shard key plays an important role.
• You may want to place your MongoDB database shards closer to
their users, so sharding based on user location may be a good idea.
• When sharding by customer location, all user data for the East
Coast of the USA is in the shards that are served from the East
Coast, and all user data for the West Coast is in the shards that are
on the West Coast.
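• A hedged sketch of location-based sharding using tag-aware sharding (the location field, its values, and the shard names are assumptions; tag ranges are half-open, so this assumes location is exactly "east" or "west"):
db.runCommand({ shardcollection : "ecommerce.customer", key : { location : 1 } })
sh.addShardTag("shardEast", "EAST")
sh.addShardTag("shardWest", "WEST")
sh.addTagRange("ecommerce.customer", { location: "east" }, { location: "west" }, "EAST")
sh.addTagRange("ecommerce.customer", { location: "west" }, { location: MaxKey }, "WEST")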
Suitable Use Cases
• Event Logging
• Content Management Systems, Blogging Platforms
• Web Analytics or Real-Time Analytics
• E-Commerce Applications
Event Logging
• Applications have different event logging needs;
• within the enterprise, there are many different
applications that want to log events.
• Document databases can store all these different types
of events and can act as a central data store for event
storage.
• This is especially true when the type of data being
captured by the events keeps changing.
• Events can be sharded by the name of the application
where the event originated or by the type of event such
as order_processed or customer_logged.
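• A hedged sketch of sharding such an event store by originating application (database, collection, and field names are assumptions):
db.runCommand({ shardcollection : "logging.events", key : { applicationName : 1 } })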
Content Management Systems, Blogging Platforms
• Since document databases have no predefined
schemas and usually understand JSON
documents, they work well in content
management systems or applications for
publishing websites, managing user comments, user registrations, profiles, and web-facing documents.
Web Analytics or Real-Time Analytics
• Document databases can store data for real-
time analytics;
• since parts of the document can be updated,
it’s very easy to store page views or unique
visitors, and new metrics can be easily
added without schema changes.
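• A hedged sketch of such an in-place partial update (collection and field names are assumptions); $inc bumps a counter inside the document, and upsert creates the document the first time the page is seen:
db.analytics.update(
  { page: "/products/refactoring" },
  { $inc: { pageViews: 1 } },
  { upsert: true }
)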
E-Commerce Applications
• E-commerce applications often need to have a flexible schema for products and orders, as well as the ability to evolve their data models without expensive database refactoring or data migration.
When Not to Use
• Complex Transactions Spanning Different Operations
• Queries against Varying Aggregate Structure
Complex Transactions Spanning Different Operations
• If you need to have atomic cross-document operations, then document databases may not be for you.
• However, there are some document databases that do support these kinds of operations, such as RavenDB.
Queries against Varying Aggregate Structure
• Flexible schema means that the database does not enforce any restrictions on
the schema.
• Data is saved in the form of application entities. If you need to query these entities ad hoc, your queries will keep changing (in RDBMS terms, this would mean that the join criteria, and hence the tables to join, keep changing).
• Since the data is saved as an aggregate, if the design of the aggregate is
constantly changing, you need to save the aggregates at the lowest level of
granularity—basically, you need to normalize the data.
• In this scenario, document databases may not work.
END OF MODULE 4