Database Management Systems (BCS403)
MODULE – 5
A lock is a variable associated with a data item that describes the status of the item with respect to
possible operations that can be applied to it. Generally, there is one lock for each data item in the
database. Locks are used as a means of synchronizing the access by concurrent transactions to the
database items.
Binary Locks:
▪ A binary lock can have two states or values: locked and unlocked (or 1 and 0).
▪ A distinct lock is associated with each database item X. If the value of the lock on X is 1, item X
cannot be accessed by a database operation that requests the item. If the value of the lock on X is 0,
the item can be accessed when requested, and the lock value is changed to 1.
▪ The current value (or state) of the lock associated with item X is referred to as lock(X).
▪ Two operations, lock_item and unlock_item, are used with binary locking.
▪ A transaction requests access to an item X by first issuing a lock_item(X) operation. If LOCK(X)
= 1, the transaction is forced to wait. If LOCK(X) = 0, LOCK(X) is set to 1 (the transaction locks the
item) and the transaction is allowed to access item X.
▪ When the transaction is through using the item, it issues an unlock_item(X) operation, which
sets LOCK(X) back to 0 (unlocks the item) so that X may be accessed by other transactions.
Hence, a binary lock enforces mutual exclusion on the data item.
▪ A description of the lock_item(X) and unlock_item(X) operations is shown in Figure 1.
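The following is a minimal Python sketch (not the textbook's Figure 1 itself) of how a lock manager might implement binary lock_item and unlock_item; the class and dictionary names are illustrative.

import threading

class BinaryLockManager:
    def __init__(self):
        self._cond = threading.Condition()
        self._lock_value = {}              # LOCK(X): True means 1 (locked), False/absent means 0 (unlocked)

    def lock_item(self, x):
        with self._cond:
            while self._lock_value.get(x, False):   # if LOCK(X) = 1, the transaction must wait
                self._cond.wait()
            self._lock_value[x] = True              # LOCK(X) <- 1: the transaction now holds the lock

    def unlock_item(self, x):
        with self._cond:
            self._lock_value[x] = False             # LOCK(X) <- 0
            self._cond.notify_all()                 # wake up any waiting transactions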
The lock manager module of the DBMS can enforce these rules. Between the lock_item(X) and
unlock_item(X) operations in transaction T, T is said to hold the lock on item X. At most one
transaction can hold the lock on a particular item. Thus no two transactions can access the same item
concurrently.
read_lock(X):
B: if LOCK(X) = “unlocked”
then begin LOCK(X) ← “read-locked”;
no_of_reads(X) ← 1
end
else if LOCK(X) = “read-locked”
then no_of_reads(X) ← no_of_reads(X) + 1
else begin
wait (until LOCK(X) = “unlocked” and the lock manager
wakes up the transaction);
go to B
end;
write_lock(X):
B: if LOCK(X) = “unlocked”
then LOCK(X) ← “write-locked”
else begin
wait (until LOCK(X) = “unlocked” and the lock manager
wakes up the transaction);
go to B
end;
unlock(X):
if LOCK(X) = “write-locked”
then begin LOCK(X) ← “unlocked”;
wakeup one of the waiting transactions, if any
end
else if LOCK(X) = “read-locked”
then begin
no_of_reads(X) ← no_of_reads(X) −1;
if no_of_reads(X) = 0
then begin LOCK(X) ← “unlocked”;
wakeup one of the waiting transactions, if any
end
end;
Figure 2: Locking and unlocking operations for two-mode (read/write, or shared/exclusive) locks.
▪ When the shared/exclusive locking scheme is used, the system must enforce the following rules:
1. A transaction T must issue the operation read_lock(X) or write_lock(X) before any
read_item(X) operation is performed in T.
2. A transaction T must issue the operation write_lock(X) before any
write_item(X) operation is performed in T.
3. A transaction T must issue the operation unlock(X) after all read_item(X) and
write_item(X) operations are completed in T.
4. A transaction T will not issue a read_lock(X) operation if it already holds a read (shared) lock or
a write (exclusive) lock on item X. This rule may be relaxed for downgrading of locks.
5. A transaction T will not issue a write_lock(X) operation if it already holds a read (shared) lock
or write (exclusive) lock on item X. This rule may also be relaxed for upgrading of locks, as we
discuss shortly.
6. A transaction T will not issue an unlock(X) operation unless it already holds a read (shared) lock
or a write (exclusive) lock on item X.
Y; consequently, when T2′ issues its read_lock(X), it is forced to wait until T1′ releases the lock
by issuing unlock(X) in the schedule. This can lead to deadlock.
Figure 4: Transactions T1′ and T2′, which are the same as T1 and T2 in Figure 3 but follow the two-
phase locking protocol. They can produce a deadlock.
❖ If every transaction in a schedule follows the two-phase locking protocol, the schedule is guaranteed to
be serializable. The locking protocol, by enforcing two-phase locking rules, also enforces
serializability.
❖ Two-phase locking may limit the amount of concurrency that can occur in a schedule because a
transaction T may not be able to release an item X after it is through using it if T must lock an
additional item Y later; or, conversely, T must lock the additional item Y before it needs it so that it
can release X. Hence, X must remain locked by T until all items that the transaction needs to read or
write have been locked; only then can X be released by T. Meanwhile, another transaction seeking to
access X may be forced to wait, even though T is done with X; conversely, if Y is locked earlier than it
is needed, another transaction seeking to access Y is forced to wait even though T is not using Y yet.
This is the price for guaranteeing serializability of all schedules without having to check the schedules
themselves.
❖ The two-phase locking protocol guarantees serializability (that is, every schedule that is permitted is
serializable) but it does not permit all possible serializable schedules (that is, some serializable
schedules will be prohibited by the protocol).
❖ Conservative (static) 2PL requires a transaction to lock all the items it accesses before the transaction begins execution, by predeclaring its read-set and write-set. If any of the predeclared items cannot be locked, the transaction does not lock any item; instead, it waits until all the items are available for locking.
❖ Conservative 2PL is a deadlock-free protocol. But it is difficult to use in practice because of the need
to predeclare the read-set and write-set, which is not possible in some situations.
❖ Strict 2PL guarantees strict schedules where a transaction T does not release any of its exclusive
(write) locks until after it commits or aborts. Hence, no other transaction can read or write an item that
is written by T unless T has committed, leading to a strict schedule for recoverability.
❖ Strict 2PL is not deadlock-free.
❖ Rigorous 2PL also guarantees strict schedules. In this variation, a transaction T does not release any of
its locks (exclusive or shared) until after it commits or aborts, and so it is easier to implement than
strict 2PL.
❖ Difference between strict and rigorous 2PL: the former holds write locks until after it commits or aborts, whereas the
latter holds all locks (read and write) until after it commits or aborts.
❖ Difference between conservative and rigorous 2PL: the former must lock all its items before it starts,
so once the transaction starts it is in its shrinking phase; the latter does not unlock any of its items until
after it terminates (by committing or aborting), so the transaction is in its expanding phase until it
ends.
❖ Usually the concurrency control subsystem itself is responsible for generating the read_lock and
write_lock requests.
Example: Suppose the system is to enforce the strict 2PL protocol. Then, whenever transaction T
issues a read_item(X), the system calls the read_lock(X) operation on behalf of T. If the state
of LOCK(X) is write_locked by some other transaction T′, the system places T in the waiting queue
for item X; otherwise, it grants the read_lock(X) request and permits the read_item(X)
operation of T to execute. On the other hand, if transaction T issues a write_item(X), the system
calls the write_lock(X) operation on behalf of T. If the state of LOCK(X) is write_locked or
read_locked by some other transaction T′, the system places T in the waiting queue for item X; if the
state of LOCK(X) is read_locked and T itself is the only transaction holding the read lock on X, the
system upgrades the lock to write_locked and permits the write_item(X) operation by T. Finally,
if the state of LOCK(X) is unlocked, the system grants the write_lock(X) request and permits the
write_item(X) operation to execute. After each action, the system must update its lock table
appropriately.
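A rough Python sketch of the lock-manager behaviour just described, under simplifying assumptions (no deadlock handling, a simplified lock table); the class, field, and method names are illustrative and not from any particular DBMS.

class Strict2PLLockManager:
    def __init__(self):
        self.read_holders = {}    # item -> set of transactions holding a read (shared) lock
        self.write_holder = {}    # item -> transaction holding the write (exclusive) lock, or None
        self.wait_queue = {}      # item -> list of transactions waiting for the item

    def request_read(self, t, x):
        # read_lock(X) issued on behalf of T when it performs read_item(X)
        if self.write_holder.get(x) not in (None, t):
            self.wait_queue.setdefault(x, []).append(t)   # X is write_locked by another T': block T
            return False
        self.read_holders.setdefault(x, set()).add(t)     # grant read_lock(X)
        return True

    def request_write(self, t, x):
        # write_lock(X) issued on behalf of T when it performs write_item(X)
        other_readers = self.read_holders.get(x, set()) - {t}
        if self.write_holder.get(x) not in (None, t) or other_readers:
            self.wait_queue.setdefault(x, []).append(t)   # a conflicting lock is held by another T': block T
            return False
        self.write_holder[x] = t                          # grant write_lock(X); upgrades T's own read lock if any
        return True

    def release_all(self, t):
        # Under strict 2PL, T's write locks are released only at commit or abort
        # (rigorous 2PL would also hold the read locks until then).
        for x, holder in list(self.write_holder.items()):
            if holder == t:
                self.write_holder[x] = None
        for holders in self.read_holders.values():
            holders.discard(t)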
❖ Disadvantages of locking: Locking is generally considered to have a high overhead, because every
read or write operation is preceded by a system locking request. The use of locks can also cause two
additional problems: deadlock and starvation.
Figure 5: Illustrating the deadlock problem. (a) A partial schedule of T1′ and T2′ that is
in a state of deadlock. (b) A wait-for graph for the partial schedule in (a).
❖ In wait-die, an older transaction is allowed to wait for a younger transaction, whereas a younger
transaction requesting an item held by an older transaction is aborted and restarted. In the wound-wait
approach, a younger transaction is allowed to wait for an older one, whereas an older transaction
requesting an item held by a younger transaction preempts the younger transaction by aborting it.
❖ Both schemes end up aborting the younger of the two transactions (the transaction that started later)
that may be involved in a deadlock, assuming that this will waste less processing.
❖ These two techniques are deadlock-free, since in wait-die, transactions only wait for younger
transactions so no cycle is created. Similarly, in wound-wait, transactions only wait for older
transactions so no cycle is created. But both techniques may cause some transactions to be aborted and
restarted needlessly, even though those transactions may never actually cause a deadlock.
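The two rules can be summarised in a small Python sketch; the ts values stand for transaction timestamps (smaller means older), and the function names are illustrative.

def wait_die(requester_ts, holder_ts):
    # An older (smaller timestamp) requester waits; a younger requester is aborted ("dies").
    return "wait" if requester_ts < holder_ts else "abort requester"

def wound_wait(requester_ts, holder_ts):
    # An older requester preempts ("wounds") the younger holder; a younger requester waits.
    return "abort holder" if requester_ts < holder_ts else "wait"

print(wait_die(5, 9))     # older transaction requesting -> wait
print(wound_wait(9, 5))   # younger transaction requesting -> wait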
❖ Cautious waiting: Suppose that transaction Ti tries to lock an item X but cannot do so because X is locked by some
other transaction Tj with a conflicting lock. The cautious waiting rule is: if Tj is not blocked (not waiting for some other
locked item), then Ti is blocked and allowed to wait; otherwise, abort Ti. Cautious waiting is deadlock-free, which can be
seen by considering the time b(T) at which each blocked transaction T was blocked: if the two
transactions Ti and Tj above both become blocked and Ti is waiting for Tj, then b(Ti) < b(Tj),
since Ti can only wait for Tj at a time when Tj is not blocked itself. Hence, the blocking times form a
total ordering on all blocked transactions, so no cycle that causes deadlock can occur.
Deadlock Detection:
❖ In the deadlock detection approach, the system checks whether a state of deadlock actually exists rather than
preventing it. This approach is attractive if there is little interference among transactions, that is, if
different transactions rarely access the same items at the same time. This can happen if transactions are short and
each transaction locks only a few items, or if the transaction load is light.
❖ If transactions are long and each transaction uses many items, or if the transaction load is heavy, it
may be advantageous to use a deadlock prevention scheme.
❖ The system can construct and maintain a wait-for graph to detect a state of deadlock. One node is
created in the wait-for graph for each transaction that is currently executing. Whenever a transaction Ti
is waiting to lock an item X that is currently locked by a transaction Tj, a directed edge (Ti → Tj) is
created in the wait-for graph. When Tj releases the lock(s) on the items that Ti was waiting for, the
directed edge is dropped from the wait-for graph. A state of deadlock occurs if and only if the wait-for
graph has a cycle.
This approach has the problem of determining when the system should check for a deadlock. The wait-for graph can be
checked for a cycle every time an edge is added to it, but this may cause excessive
overhead. Criteria such as the number of currently executing transactions or the period of time several
transactions have been waiting to lock items may be used instead to decide when to check for a cycle.
Figure 5(b) shows the wait-for graph for the (partial) schedule shown in Figure 5(a). If the system is
in a state of deadlock, some of the transactions causing the deadlock must be aborted. Choosing which
transactions to abort is known as victim selection. The algorithm for victim selection should generally
avoid selecting transactions that have been running for a long time and that have performed many
updates, and it should try instead to select transactions that have not made many changes (younger
transactions).
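A possible Python sketch of the cycle check on a wait-for graph; the adjacency-set representation is an assumption for illustration, not any particular DBMS's implementation.

def has_cycle(wait_for):
    """wait_for maps each transaction to the set of transactions it is waiting for."""
    visited, in_stack = set(), set()

    def dfs(t):
        visited.add(t)
        in_stack.add(t)
        for u in wait_for.get(t, ()):
            if u in in_stack or (u not in visited and dfs(u)):
                return True
        in_stack.discard(t)
        return False

    return any(t not in visited and dfs(t) for t in wait_for)

print(has_cycle({"T1'": {"T2'"}, "T2'": {"T1'"}}))   # True: the cycle T1' -> T2' -> T1' means deadlock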
❖ Timeouts: A scheme to deal with deadlock is the use of timeouts. In this method, if a transaction waits
for a period longer than a system-defined timeout period, the system assumes that the transaction may
be deadlocked and aborts it—regardless of whether a deadlock actually exists.
Starvation:
▪ When locking is used, starvation can occur when a transaction cannot proceed for an indefinite period
of time while other transactions in the system continue normally.
▪ This may occur if the waiting scheme for locked items is unfair in that it gives priority to some
transactions over others.
▪ One solution for starvation is to have a fair waiting scheme, such as using a first-come-first-served
queue; transactions are enabled to lock an item in the order in which they originally requested the lock.
▪ Another scheme allows some transactions to have priority over others but increases the priority of a
transaction the longer it waits, until it eventually gets the highest priority and proceeds.
▪ Starvation can also occur because of victim selection if the algorithm selects the same transaction as
victim repeatedly, thus causing it to abort and never finish execution.
▪ The algorithm can use higher priorities for transactions that have been aborted multiple times to avoid
this problem.
▪ The wait-die and wound-wait schemes avoid starvation, because they restart a transaction that has
been aborted with its same original timestamp, so the possibility that the same transaction is aborted
repeatedly is slim.
Questions :
1. Discuss the problems of deadlock and starvation, and the different approaches to dealing with these
problems.
2. Describe the wait-die and wound-wait protocols for deadlock prevention.
3. Describe the cautious waiting, no waiting, and timeout protocols for deadlock prevention.
1. Timestamps:
▪ A timestamp is a unique identifier created by the DBMS to identify a transaction. Timestamp values are assigned in the
order in which the transactions are submitted to the system, so a timestamp can be thought of as the transaction start
time; the timestamp of transaction T is referred to as TS(T).
▪ Timestamps can be generated in two ways. One is to use a counter that is incremented each time its value is assigned to
a transaction. Another is to use the current date/time value of the system clock and ensure that no two timestamp values are
generated during the same tick of the clock.
The timestamp ordering algorithm associates with each database item X two timestamp (TS) values:
1. read_TS(X): The read timestamp of item X is the largest timestamp among all the timestamps of transactions
that have successfully read item X, that is, read_TS(X) = TS(T), where T is the youngest transaction that has read X
successfully.
2. write_TS(X): The write timestamp of item X is the largest of all the timestamps of transactions
that have successfully written item X—that is, write_TS(X) = TS(T), where T is the youngest
transaction that has written X successfully. Based on the algorithm, T will also be the last
transaction to write item X.
Basic Timestamp Ordering (TO): Whenever some transaction T tries to issue a read_item(X) or a
write_item(X) operation, the basic TO algorithm compares the timestamp of T with read_TS(X)
and write_TS(X) to ensure that the timestamp order of transaction execution is not violated. If this
order is violated, then transaction T is aborted and resubmitted to the system as a new transaction with a
new timestamp. If T is aborted and rolled back, any transaction T1 that may have used a value written by T
must also be rolled back. Similarly, any transaction T2 that may have used a value written by T1 must also
be rolled back, and so on. This effect is known as cascading rollback and is one of the problems associated
with basic TO, since the schedules produced are not guaranteed to be recoverable. An additional protocol
must be enforced to ensure that the schedules are recoverable, cascadeless, or strict.
The concurrency control algorithm must check whether conflicting operations violate the timestamp
ordering in the following two cases:
1. Whenever a transaction T issues a write_item(X) operation: If read_TS(X) > TS(T) or write_TS(X) > TS(T),
then abort and roll back T and reject the operation; otherwise, execute the write_item(X) operation of T and set
write_TS(X) to TS(T).
2. Whenever a transaction T issues a read_item(X) operation: If write_TS(X) > TS(T), then abort and roll back T and
reject the operation; otherwise, execute the read_item(X) operation of T and set read_TS(X) to the larger of TS(T)
and the current read_TS(X).
Whenever the basic TO algorithm detects two conflicting operations that occur in the incorrect order, it
rejects the later of the two operations by aborting the transaction that issued it. The schedules produced by
basic TO are hence guaranteed to be conflict serializable. Deadlock does not occur with timestamp
ordering. But cyclic restart (and hence starvation) may occur if a transaction is continually aborted and
restarted.
Strict Timestamp Ordering (TO): Strict TO ensures that the schedules are both strict (for easy
recoverability) and (conflict) serializable. In this variation, a transaction T that issues a read_item(X) or
write_item(X) such that TS(T) > write_TS(X) has its read or write operation delayed until the
transaction T′ that wrote the value of X (hence TS(T′) = write_TS(X)) has committed or aborted.
To implement this algorithm, it is necessary to simulate the locking of an item X that has been written by
transaction T′ until T′ is either committed or aborted. This algorithm does not cause deadlock, since T
waits for T′ only if TS(T)>TS(T′).
Thomas’s Write Rule: A modification of the basic TO algorithm, known as Thomas’s write rule, does not
enforce conflict serializability, but it rejects fewer write operations by modifying the checks for the
write_item(X) operation as follows:
1. If read_TS(X) > TS(T), then abort and roll back T and reject the operation.
2. If write_TS(X) > TS(T), then do not execute the write operation but continue processing. This is
because some transaction with timestamp greater than TS(T)—and hence after T in the timestamp
ordering—has already written the value of X. Thus, we must ignore the write_item(X) operation of
T because it is already outdated and obsolete. Notice that any conflict arising from this situation
would be detected by case (1).
3. If neither the condition in part (1) nor the condition in part (2) occurs, then execute the
write_item(X) operation of T and set write_TS(X) to TS(T).
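A compact Python sketch of the basic TO checks, with Thomas's write rule applied to writes; the dictionaries model read_TS(X), write_TS(X) and the item values, and a missing timestamp is treated as 0 purely for illustration.

read_ts, write_ts, value = {}, {}, {}    # read_TS(X), write_TS(X), current value of X

def read_item(t_ts, x):
    if write_ts.get(x, 0) > t_ts:
        return "abort"                   # a younger transaction already wrote X: roll back T
    read_ts[x] = max(read_ts.get(x, 0), t_ts)
    return value.get(x)

def write_item(t_ts, x, v):
    if read_ts.get(x, 0) > t_ts:
        return "abort"                   # case 1: a younger transaction already read X
    if write_ts.get(x, 0) > t_ts:
        return "ignore"                  # case 2: Thomas's write rule, the outdated write is skipped
    value[x] = v                         # case 3: perform the write
    write_ts[x] = t_ts
    return "ok"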
Questions :
1. What is a timestamp? How does the system generate timestamps?
2. Discuss the timestamp ordering protocol for concurrency control. How does strict timestamp
ordering differ from basic timestamp ordering?
❖ Multiversion concurrency control techniques keep copies (versions) of the old values of a data item when the item is
updated. When a transaction requests to read an item, the appropriate version is chosen to maintain the
serializability of the currently executing schedule.
❖ One reason for keeping multiple versions is that some read operations that would be rejected in other
techniques can still be accepted by reading an older version of the item to maintain serializability. When
a transaction writes an item, it writes a new version and the old version(s) of the item is retained. Some
multiversion concurrency control algorithms use the concept of view serializability rather than conflict
serializability.
❖ Drawback of multiversion techniques: More storage is needed to maintain multiple versions of the
database items. In some cases, older versions can be kept in a temporary store. It is also possible that
older versions may have to be maintained anyway—for example, for recovery purposes. Some database
applications may require older versions to be kept to maintain a history of the changes of data item
values. The extreme case is a temporal database, which keeps track of all changes and the times at
which they occurred. In such cases, there is no additional storage penalty for multiversion techniques,
since older versions are already maintained.
In the multiversion technique based on timestamp ordering, several versions X1, X2, … , Xk of each data item X are
maintained. For each version Xi, the system keeps write_TS(Xi), the timestamp of the transaction that wrote version
Xi, and read_TS(Xi), the largest timestamp among the transactions that have successfully read version Xi. To ensure
serializability, the following two rules are used:
1. If transaction T issues a write_item(X) operation, and version i of X has the highest
write_TS(Xi) of all versions of X that is also less than or equal to TS(T), and read_TS(Xi) > TS(T),
then abort and roll back transaction T; otherwise, create a new version Xj of X with
read_TS(Xj) = write_TS(Xj) = TS(T).
2. If transaction T issues a read_item(X) operation, find the version i of X that has the highest
write_TS(Xi) of all versions of X that is also less than or equal to TS(T); then return the
value of Xi to transaction T, and set the value of read_TS(Xi) to the larger of TS(T) and the
current read_TS(Xi).
A read_item(X) is always successful as seen in case 2, since it finds the appropriate version Xi to
read based on the write_TS of the various existing versions of X. In case 1, however, transaction T
may be aborted and rolled back. This happens if T attempts to write a version of X that should have
been read by another transaction T′ whose timestamp is read_TS(Xi); however, T′ has already read
version Xi, which was written by the transaction with timestamp equal to write_TS(Xi). If this
conflict occurs, T is rolled back; otherwise, a new version of X, written by transaction T, is created. If
T is rolled back, cascading rollback may occur. Hence, to ensure recoverability, a transaction T should
not be allowed to commit until after all the transactions that have written some version that T has read
have committed.
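A small Python sketch of the multiversion read rule (case 2); the data layout, names, and values are illustrative only.

versions = {
    # item -> list of versions; each version keeps its value, write_TS(Xi) and read_TS(Xi)
    "X": [{"value": 5, "write_ts": 10, "read_ts": 12},
          {"value": 8, "write_ts": 20, "read_ts": 20}],
}

def mv_read(t_ts, item):
    # Case 2: choose the version with the highest write_TS(Xi) that is <= TS(T)
    candidates = [v for v in versions[item] if v["write_ts"] <= t_ts]
    chosen = max(candidates, key=lambda v: v["write_ts"])   # assumes at least one qualifying version exists
    chosen["read_ts"] = max(chosen["read_ts"], t_ts)        # update read_TS(Xi)
    return chosen["value"]

print(mv_read(15, "X"))   # a transaction with TS = 15 reads the version written at timestamp 10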
Figure 6: Lock compatibility tables.(a) Lock compatibility table for read/write locking scheme.
(b) Lock compatibility table for read/write/certify locking scheme.
An entry of Yes means that if a transaction T holds the type of lock specified in the column header
on item X and if transaction T′ requests the type of lock specified in the row header on the same item
X, then T′ can obtain the lock because the locking modes are compatible. An entry of No in the table
indicates that the locks are not compatible, so T′ must wait until T releases the lock. In the standard
locking scheme, once a transaction obtains a write lock on an item, no other transactions can access
that item.
❖ Multiversion 2PL will allow other transactions T′ to read an item X while a single transaction T holds
a write lock on X. This is accomplished by maintaining two versions for each item X:
▪ The committed version X, which must always have been written by some committed transaction.
▪ A second local version X′, which can be created when a transaction T acquires a write lock on X.
Other transactions can continue to read the committed version of X while T holds the write lock.
Transaction T can write the value of X′ as needed, without affecting the value of the committed
version X. But once T is ready to commit, it must obtain a certify lock on all items that it currently
holds write locks on before it can commit; this is another form of lock upgrading. The certify lock is
not compatible with read locks, so the transaction may have to delay its commit until all its write-
locked items are released by any reading transactions in order to obtain the certify locks. Once the
certify locks—which are exclusive locks—are acquired, the committed version X of the data item is
set to the value of version X′, version X′ is discarded and the certify locks are then released. The lock
compatibility table for this scheme is shown in Figure 6(b).
❖ In the multiversion 2PL scheme, reads can proceed concurrently with a single write operation which is
not permitted under the standard 2PL schemes. The cost is that a transaction may have to delay its
commit until it obtains exclusive certify locks on all the items it has updated.
❖ This scheme avoids cascading aborts, since transactions are only allowed to read the version X that
was written by a committed transaction. But deadlocks may occur and must be handled by various
techniques.
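The read/write/certify compatibility checks of Figure 6(b) can be sketched as a small Python table; the entries reproduce the standard matrix described above, and the function name is illustrative.

COMPATIBLE = {
    # (lock held by T, lock requested by T'): can T' be granted the lock?
    ("read", "read"): True,     ("read", "write"): True,     ("read", "certify"): False,
    ("write", "read"): True,    ("write", "write"): False,   ("write", "certify"): False,
    ("certify", "read"): False, ("certify", "write"): False, ("certify", "certify"): False,
}

def can_grant(held_mode, requested_mode):
    return COMPATIBLE[(held_mode, requested_mode)]

print(can_grant("write", "read"))     # True: readers still see the committed version X
print(can_grant("read", "certify"))   # False: the writer's commit waits until readers release X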
Questions :
1. How do optimistic concurrency control techniques differ from other concurrency control
techniques? Why are they called validation or certification techniques? Discuss the typical
phases of an optimistic concurrency control method.
❖ In optimistic (validation-based) concurrency control techniques, no checking is done while the transaction is
executing; updates are applied to local copies of the data items, and at the end of transaction execution a validation
phase checks whether any of the transaction’s updates violate serializability. Certain information needed by the
validation phase must be kept by the system. If serializability is not violated, the transaction is committed and the
database is updated from the local copies; otherwise, the transaction is aborted and then restarted later.
❖ There are three phases for this concurrency control protocol:
1. Read phase: A transaction can read values of committed data items from the database. However,
updates are applied only to local copies (versions) of the data items kept in the transaction
workspace.
2. Validation phase: Checking is performed to ensure that serializability will not be violated if the
transaction updates are applied to the database.
3. Write phase: If the validation phase is successful, the transaction updates are applied to the
database; otherwise, the updates are discarded and the transaction is restarted.
❖ The optimistic concurrency control protocol does all the checks at once. Hence, transaction execution proceeds
with a minimum of overhead until the validation phase is reached. If there is little interference among
transactions, most will be validated successfully. If there is much interference, many transactions that
execute to completion will have their results discarded and must be restarted later. Under such
circumstances, optimistic techniques do not work well.
❖ The techniques are called optimistic because they assume that little interference will occur and hence
most transactions will be validated successfully, so that there is no need to do checking during
transaction execution. This assumption is generally true in many transaction processing workloads.
❖ The optimistic protocol described uses transaction timestamps and also requires that the
write_sets and read_sets of the transactions be kept by the system. Additionally, start and end
times for the three phases need to be kept for each transaction.
❖ In the validation phase for transaction Ti, the protocol checks that Ti does not interfere with any
recently committed transactions or with any other concurrent transactions that have started their
validation phase. The validation phase for Ti checks that, for each such transaction Tj that is either
recently committed or is in its validation phase, one of the following conditions holds:
1. Transaction Tj completes its write phase before Ti starts its read phase.
2. Ti starts its write phase after Tj completes its write phase, and the read_set of Ti has no items
in common with the write_set of Tj.
3. Both the read_set and write_set of Ti have no items in common with the write_set of
Tj, and Tj completes its read phase before Ti completes its read phase.
When validating transaction Ti against each one of the transactions Tj, the first condition is checked
first, since (1) is the simplest condition to check. Only if condition (1) is false is condition (2) checked, and
only if (2) is false is condition (3) checked. If any one of these three conditions holds with each
transaction Tj, there is no interference and Ti is validated successfully. If none of these three conditions
holds for any one Tj, the validation of transaction Ti fails (because Ti and Tj may violate serializability)
and so Ti is aborted and restarted later because interference with Tj may have occurred.
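A hedged Python sketch of the validation test of Ti against one transaction Tj, following the three conditions above; transactions are modelled as dictionaries with illustrative field names (read_start, read_end, write_start, write_end, read_set, write_set).

def validate(ti, tj):
    # Condition 1: Tj completes its write phase before Ti starts its read phase.
    if tj["write_end"] < ti["read_start"]:
        return True
    # Condition 2: Ti starts its write phase after Tj completes its write phase,
    # and read_set(Ti) has no items in common with write_set(Tj).
    if ti["write_start"] > tj["write_end"] and not (ti["read_set"] & tj["write_set"]):
        return True
    # Condition 3: neither read_set(Ti) nor write_set(Ti) intersects write_set(Tj),
    # and Tj completes its read phase before Ti completes its read phase.
    if (not ((ti["read_set"] | ti["write_set"]) & tj["write_set"])
            and tj["read_end"] < ti["read_end"]):
        return True
    return False   # possible interference: abort and restart Ti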
❖ The basic idea of snapshot isolation is that a transaction sees the data items that it reads based on the committed
values of the items in the database snapshot (or database state) when the transaction starts. Snapshot isolation ensures
that a transaction, or, in some cases, the database statement, will only see the records that were committed in
the database at the time the transaction started. Any insertions, deletions, or updates that occur after
the transaction starts will not be seen by the transaction.
❖ Snapshot isolation does not allow the problems of dirty read and nonrepeatable read to occur. Certain
anomalies that violate serializability can occur when snapshot isolation is used as the basis for
concurrency control.
❖ In this scheme, read operations do not require read locks to be applied to the items, thus reducing the
overhead associated with two-phase locking. But write operations do require write locks. Thus, for
transactions that have many reads, the performance is much better than 2PL. When writes do occur,
the system will have to keep track of older versions of the updated items in a temporary version store
(also known as tempstore), with the timestamps of when the version was created. This is necessary so
that a transaction that started before the item was written can still read the value (version) of the item
that was in the database snapshot when the transaction started.
❖ To keep track of versions, items that have been updated will have pointers to a list of recent versions
of the item in the tempstore, so that the correct item can be read for each transaction. The tempstore
items will be removed when no longer needed, so a method to decide when to remove unneeded
versions will be needed.
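A minimal Python sketch of reading from such a temporary version store under snapshot isolation; the structure, names, and timestamps are illustrative.

tempstore = {
    # item -> list of (commit_timestamp, value), oldest first; values are illustrative
    "X": [(5, 100), (12, 150)],
}

def snapshot_read(item, tx_start_ts):
    # No read lock is taken; the transaction sees the newest version committed
    # before its own start timestamp (its snapshot).
    visible = [(ts, v) for ts, v in tempstore[item] if ts <= tx_start_ts]
    return max(visible)[1] if visible else None

print(snapshot_read("X", 10))   # a transaction that started at time 10 sees 100, not 150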
❖ Variations of this method have been used in several commercial and open source DBMSs, including
Oracle and PostGRES. If the users require guaranteed serializability, then the problems with anomalies
that violate serializability will have to be solved by the programmers/software engineers by analyzing
the set of transactions to determine which types of anomalies can occur, and adding checks that do not
permit these anomalies. This can place a burden on the software developers when compared to the
DBMS enforcing serializability in all cases.
❖ Variations of snapshot isolation (SI) techniques, known as serializable snapshot isolation (SSI), have
been proposed and implemented in some of the DBMSs that use SI as their primary concurrency
control method. For example, recent versions of the PostGRES DBMS allow the user to choose
between basic SI and SSI. The tradeoff is ensuring full serializability with SSI versus living with
possible rare anomalies but having better performance with basic SI.
Questions :
1. What is snapshot isolation? What are the advantages and disadvantages of concurrency control
based on snapshot isolation?
The particular choice of data item type can affect the performance of concurrency control and
recovery.
Consider, for example, the case where the data item size is one disk block. If a transaction T needs to lock a record B,
it must lock the whole disk block X that contains B. Now, if another
transaction S wants to lock a different record C that happens to reside in the same disk block X in a
conflicting lock mode, it is forced to wait. If the data item size was a single record instead of a disk
block, transaction S would be able to proceed, because it would be locking a different data item
(record).
❖ The smaller the data item size, the larger the number of items in the database. Because every item is
associated with a lock, the system will have a larger number of active locks to be handled by the lock
manager. More lock and unlock operations will be performed, causing a higher overhead. In addition,
more storage space will be required for the lock table. For timestamps, storage is required for the
read_TS and write_TS for each data item, and there will be similar overhead for handling a large
number of items.
❖ The best item size depends on the types of transactions involved. If a typical transaction accesses a
small number of records, it is advantageous to have the data item granularity be one record. If a
transaction typically accesses many records in the same file, it may be better to have block or file
granularity so that the transaction will consider all those records as one (or a few) data items.
A granularity hierarchy such as the one in Figure 7 (database, files, pages, records) can be used to illustrate a
multiple granularity level 2PL protocol, with shared/exclusive locking
modes, where a lock can be requested at any level. Additional types of locks will be needed to support
such a protocol efficiently. Consider the following scenario, which refers to the example in Figure 7.
Suppose transaction T1 wants to update all the records in file f1, and T1 requests and is granted an
exclusive lock for f1. Then all of f1’s pages (p11 through p1n), and the records contained on those
pages, are locked in exclusive mode. This is beneficial for T1 because setting a single file-level lock is
more efficient than setting n page-level locks or having to lock each record individually.
Now suppose another transaction T2 only wants to read record r1nj from page p1n of file f1; then T2
would request a shared record-level lock on r1nj. However, the database system (the transaction
manager or the lock manager) must verify the compatibility of the requested lock with already held
locks. This can be verified by traversing the tree from the leaf r1nj to p1n to f1 to db. If at any time a
conflicting lock is held on any of those items, then the lock request for r1nj is denied and T2 is blocked
and must wait. This traversal would be fairly efficient.
If transaction T2’s request came before transaction T1’s request, the shared record lock is granted to T2
for r1nj, but when T1’s file-level lock is requested, it can be time-consuming for the lock manager to
check all nodes (pages and records) that are descendants of node f1 for a lock conflict. This would be
very inefficient and would defeat the purpose of having multiple granularity level locks.
❖ To make multiple granularity level locking practical, intention locks are needed. The idea behind
intention locks is for a transaction to indicate, along the path from the root to the desired node, what
type of lock (shared or exclusive) it will require from one of the node’s descendants.
There are three types of intention locks:
1. Intention-shared (IS) indicates that one or more shared locks will be requested on some descendant node(s).
2. Intention-exclusive (IX) indicates that one or more exclusive locks will be requested on some descendant node(s).
3. Shared-intention-exclusive (SIX) indicates that the current node is locked in shared mode but that one or more
exclusive locks will be requested on some descendant node(s).
The compatibility table of the three intention locks, and the actual shared and exclusive locks, is
shown in Figure 8.
❖ An appropriate locking protocol must be used in addition to the three types of intention locks. The
multiple granularity locking (MGL) protocol consists of the following rules:
1. The lock compatibility (based on Figure 8) must be adhered to.
2. The root of the tree must be locked first, in any mode.
3. A node N can be locked by a transaction T in S or IS mode only if the parent of node N is already
locked by transaction T in either IS or IX mode.
4. A node N can be locked by a transaction T in X, IX, or SIX mode only if the parent of node N is
already locked by transaction T in either IX or SIX mode.
5. A transaction T can lock a node only if it has not unlocked any node (to enforce the 2PL protocol).
6. A transaction T can unlock a node, N, only if none of the children of node N are currently locked
by T.
Rule 1 states that conflicting locks cannot be granted. Rules 2, 3, and 4 state the conditions when a
transaction may lock a given node in any of the lock modes. Rules 5 and 6 of the MGL protocol
enforce 2PL rules to produce serializable schedules.
Basically, the locking starts from the root and goes down the tree until the node that needs to be locked
is encountered, whereas unlocking starts from the locked node and goes up the tree until the root itself
is unlocked.
❖ The multiple granularity level protocol is especially suited when processing a mix of transactions that
include:
(1) short transactions that access only a few items (records or fields) and
(2) long transactions that access entire files.
In this environment, less transaction blocking and less locking overhead are incurred by such a
protocol when compared to a single-level granularity locking approach.
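The MGL checks can be sketched in Python as a compatibility matrix (in the spirit of Figure 8) plus the parent-mode tests of rules 3 and 4; the table entries follow the standard intention-lock matrix, and the names are illustrative.

MGL_COMPATIBLE = {
    "IS":  {"IS": True,  "IX": True,  "S": True,  "SIX": True,  "X": False},
    "IX":  {"IS": True,  "IX": True,  "S": False, "SIX": False, "X": False},
    "S":   {"IS": True,  "IX": False, "S": True,  "SIX": False, "X": False},
    "SIX": {"IS": True,  "IX": False, "S": False, "SIX": False, "X": False},
    "X":   {"IS": False, "IX": False, "S": False, "SIX": False, "X": False},
}

def parent_mode_ok(requested, parent_mode):
    # Rule 3: S or IS on a node needs IS or IX on the parent.
    # Rule 4: X, IX, or SIX on a node needs IX or SIX on the parent.
    if requested in ("S", "IS"):
        return parent_mode in ("IS", "IX")
    if requested in ("X", "IX", "SIX"):
        return parent_mode in ("IX", "SIX")
    return False

print(MGL_COMPATIBLE["IX"]["IS"])    # True: another transaction may still place an IS lock
print(parent_mode_ok("X", "IX"))     # True: a file-level IX allows a record-level X underneath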
Questions :
Two of the main reasons why many big data applications do not use a traditional relational (SQL) DBMS are:
(1) SQL systems offer too many services (powerful query language, concurrency control, etc.),
which such an application may not need; and
(2) a structured data model such as the traditional relational model may be too restrictive.
Although newer relational systems do have more complex object-relational modeling options, they still
require schemas, which are not required by many of the NOSQL systems.
Consider an application such as Facebook, with millions of users who submit posts, many with images
and videos; then these posts must be displayed on pages of other users using the social media
relationships among the users. User profiles, user relationships, and posts must all be stored in a huge
collection of data stores, and the appropriate posts must be made available to the sets of users that have
signed up to see these posts. Some of the data for this type of application is not suitable for a traditional
relational system and typically needs multiple types of databases and data storage systems. Some of the
organizations that were faced with these data management and storage applications decided to develop
their own systems:
➢ Google developed a proprietary NOSQL system known as BigTable, which is used in many of
Google’s applications that require vast amounts of data storage, such as Gmail, Google Maps,
and Web site indexing. Apache HBase is an open source NOSQL system based on similar
concepts. Google’s innovation led to the category of NOSQL systems known as column-based
or wide column stores; they are also sometimes referred to as column family stores.
➢ Amazon developed a NOSQL system called DynamoDB that is available through Amazon’s
cloud services. This innovation led to the category known as key-value data stores or sometimes
key-tuple or key-object data stores.
➢ Facebook developed a NOSQL system called Cassandra, which is now open source and known
as Apache Cassandra. This NOSQL system uses concepts from both key-value stores and
column-based systems.
➢ Other software companies started developing their own solutions and making them available to
users who need these capabilities—for example, MongoDB and CouchDB, which are classified
as document-based NOSQL systems or document stores.
➢ Another category of NOSQL systems is the graph-based NOSQL systems, or graph databases;
these include Neo4J and GraphBase, among others.
➢ Some NOSQL systems, such as OrientDB, combine concepts from many of the categories
discussed above.
In addition to the newer types of NOSQL systems listed above, it is also possible to classify database
systems based on the object model or on the native XML model as NOSQL systems, although they may
not have the high-performance and replication characteristics of the other types of NOSQL systems.
1. Scalability: There are two kinds of scalability in distributed systems: horizontal and vertical. In NOSQL
systems, horizontal scalability is generally used, where the distributed system is expanded by adding
more nodes for data storage and processing as the volume of data grows. Vertical scalability refers to
expanding the storage and computing power of existing nodes. In NOSQL systems, horizontal
scalability is employed while the system is operational, so techniques for distributing the existing data
among new nodes without interrupting system operation are necessary.
2. Availability, Replication and Eventual Consistency: Many applications that use NOSQL systems
require continuous system availability. To accomplish this, data is replicated over two or more nodes
in a transparent manner, so that if one node fails, the data is still available on other nodes. Replication
improves data availability and can also improve read performance, because read requests can often be
serviced from any of the replicated data nodes. However, write performance becomes more
cumbersome because an update must be applied to every copy of the replicated data items; this can
slow down write performance if serializable consistency is required. Many NOSQL applications do not
require serializable consistency, so more relaxed forms of consistency known as eventual consistency
are used.
3. Replication Models: Two major replication models used in NOSQL systems are master-slave and
master-master replication. Master-slave replication requires one copy to be the master copy; all write
operations must be applied to the master copy and then propagated to the slave copies, usually using
eventual consistency (the slave copies will eventually be the same as the master copy). For read, the
master-slave paradigm can be configured in various ways. One configuration requires all reads to also
be at the master copy, so this would be similar to the primary site or primary copy methods of
distributed concurrency control, with similar advantages and disadvantages. Another configuration
would allow reads at the slave copies but would not guarantee that the values are the latest writes,
since writes to the slave nodes can be done after they are applied to the master copy. The master-
master replication allows reads and writes at any of the replicas but may not guarantee that reads at
nodes that store different copies see the same values. Different users may write the same data item
concurrently at different nodes of the system, so the values of the item will be temporarily inconsistent.
A reconciliation method to resolve conflicting write operations of the same data item at different nodes
must be implemented as part of the master-master replication scheme.
4. Sharding of Files: In many NOSQL applications, files (or collections of data objects) can have many
millions of records (or documents or objects), and these records can be accessed concurrently by
thousands of users. So it is not practical to store the whole file in one node. Sharding of the file records
is often employed in NOSQL systems. This serves to distribute the load of accessing the file records to
multiple nodes. The combination of sharding the file records and replicating the shards works in
tandem to improve load balancing as well as data availability.
5. High-performance data access: To find an individual record or object among millions of data records or documents,
NOSQL systems typically use either hashing or range partitioning on the object keys; the key of an object is used to
determine the node and location where the object is stored, and for applications that require range queries on the keys,
range partitioning is preferred. Other indexes can also be used to locate objects based on attribute conditions different from
the key K.
2) NOSQL characteristics related to data models and query languages: NOSQL systems emphasize
performance and flexibility over modeling power and complex querying.
1. Not Requiring a Schema: The flexibility of not requiring a schema is achieved in many NOSQL
systems by allowing semi-structured, self-describing data (a small JSON example is given after this list). The users can specify a partial schema in
some systems to improve storage efficiency, but it is not required to have a schema in most of the
NOSQL systems. As there may not be a schema to specify constraints, any constraints on the data
would have to be programmed in the application programs that access the data items. There are various
languages for describing semistructured data, such as JSON (JavaScript Object Notation) and XML
(Extensible Markup Language). JSON is used in several NOSQL systems, but other methods for
describing semi-structured data can also be used.
2. Less Powerful Query Languages: Many applications that use NOSQL systems may not require a
powerful query language such as SQL, because search (read) queries in these systems often locate
single objects in a single file based on their object keys. NOSQL systems typically provide a set of
functions and operations as a programming API (application programming interface), so reading and
writing the data objects is accomplished by calling the appropriate operations by the programmer. In
many cases, the operations are called CRUD operations, for Create, Read, Update, and Delete. In other
cases, they are known as SCRUD because of an added Search (or Find) operation. Some NOSQL
systems also provide a high-level query language, but it may not have the full power of SQL; only a
subset of SQL querying capabilities would be provided. In particular, many NOSQL systems do not
provide join operations as part of the query language itself; the joins need to be implemented in the
application programs.
3. Versioning: Some NOSQL systems provide storage of multiple versions of the data items, with the
timestamps of when the data version was created.
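As mentioned under item 1 above, here is a small illustrative JSON example of schema-less, self-describing records; the field names and values are made up.

import json

posts = [
    {"_id": "post1", "author": "u17", "text": "hello", "tags": ["intro"]},
    {"_id": "post2", "author": "u42", "image_url": "http://example.com/x.jpg", "likes": 12},
]
# The two records have different fields; without a schema this is not a violation.
print(json.dumps(posts, indent=2))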
NOSQL systems are usually divided into the following categories:
1. Document-based NOSQL systems: These systems store data in the form of documents using well-known formats, such as
JSON (JavaScript Object Notation). Documents are accessible via their
document id, but can also be accessed rapidly using other indexes.
2. NOSQL key-value stores: These systems have a simple data model based on fast access by the key
to the value associated with the key; the value can be a record or an object or a document or even
have a more complex data structure.
3. Column-based or wide column NOSQL systems: These systems partition a table by column into
column families (a form of vertical partitioning) where each column family is stored in its own
files. They also allow versioning of data values.
4. Graph-based NOSQL systems: Data is represented as graphs, and related nodes can be found by
traversing the edges using path expressions.
Additional categories can be added as follows to include some systems that are not easily categorized
into the above four categories, as well as some other types of systems that have been available even
before the term NOSQL became widely used.
5. Hybrid NOSQL systems: These systems have characteristics from two or more of the above four
categories.
6. Object databases
7. XML databases
Even keyword-based search engines store large amounts of data with fast search access, so the stored
data can be considered as large NOSQL big data stores.
CAP THEOREM
❖ Concurrency control in distributed databases (DDBS) is required to enforce the ACID properties
(atomicity, consistency, isolation, durability) of transactions that are running concurrently.
❖ In a system with data replication, concurrency control becomes more complex because there can be
multiple copies of each data item. So if an update is applied to one copy of an item, it must be applied
to all other copies in a consistent manner.
❖ The possibility exists that one copy of an item X is updated by a transaction T1 whereas another copy
is updated by a transaction T2, so two inconsistent copies of the same item exist at two different nodes
in the distributed system. If two other transactions T3 and T4 want to read X, each may read a different
copy of item X. There are distributed concurrency control methods that do not allow this inconsistency
among copies of the same data item, thus enforcing serializability and hence the isolation property in the presence of
multiple copies of the data items.
❖ The three desirable properties referred to in the CAP theorem are consistency, availability, and partition tolerance.
Availability means that each read or write request for a data item will either be processed successfully or will receive
a message that the operation cannot be completed.
❖ Partition tolerance means that the system can continue operating if the network connecting the nodes
has a fault that results in two or more partitions, where the nodes in each partition can only
communicate among each other.
❖ Consistency means that the nodes will have the same copies of a replicated data item visible for
various transactions.
❖ The word consistency in CAP and its use in ACID do not refer to exactly the same concept. In CAP,
the term consistency refers to the consistency of the values in different copies of the same data item in
a replicated distributed system. In ACID, it refers to the fact that a transaction will not violate the
integrity constraints specified on the database schema. However, if we consider that the consistency of
replicated copies is a specified constraint, then the two uses of the term consistency would be related.
❖ The CAP theorem states that it is not possible to guarantee all three of the desirable properties—
consistency, availability, and partition tolerance—at the same time in a distributed system with data
replication. If this is the case, then the distributed system designer would have to choose two
properties out of the three to guarantee. It is generally assumed that in many traditional (SQL)
applications, guaranteeing consistency through the ACID properties is important.
❖ In a NOSQL distributed data store, a weaker consistency level is often acceptable, and guaranteeing
the other two properties (availability, partition tolerance) is important. Hence, weaker consistency
levels are often used in NOSQL systems instead of guaranteeing serializability. In particular, a form of
consistency known as eventual consistency is often adopted in NOSQL systems.
Questions :
1. For which types of applications were NOSQL systems developed?
2. What are the main categories of NOSQL systems? List a few of the NOSQL systems in each
category.
3. What are the main characteristics of NOSQL systems in the areas related to data models and
query languages?
4. What are the main characteristics of NOSQL systems in the areas related to distributed
systems and distributed databases?
5. What is the CAP theorem? Which of the three properties (consistency, availability, partition
tolerance) are most important in NOSQL systems?
6. What are the similarities and differences between using consistency in CAP versus using
consistency in ACID?
❖ Example based on COMPANY database: The operation createCollection is used to create each
collection. For example, the following command can be used to create a collection called project to
hold PROJECT objects from the COMPANY database:
db.createCollection("project", { capped : true, size : 1310720, max : 500 } )
The first parameter “project” is the name of the collection, which is followed by an optional document
that specifies collection options. In the example, the collection is capped; this means it has upper limits
on its storage space (size) and number of documents (max). The capping parameters help the system
choose the storage options for each collection.
❖ For the example, create another document collection called worker to hold information about the
EMPLOYEEs who work on each project. For example:
db.createCollection("worker", { capped : true, size : 5242880, max : 2000 } )
❖ Each document in a collection has a unique ObjectId field, called _id, which is automatically
indexed in the collection unless the user explicitly requests no index for the _id field. The value of
ObjectId can be specified by the user, or it can be system-generated if the user does not specify an _id
field for a particular document.
▪ System-generated ObjectIds have a specific format, which combines the timestamp when the
object is created (4 bytes, in an internal MongoDB format), the node id (3 bytes), the process id
(2 bytes), and a counter (3 bytes) into a 12-byte Id value.
▪ User-generated ObjectIds can have any value specified by the user as long as it uniquely
identifies the document and so these Ids are similar to primary keys in relational systems.
❖ A collection does not have a schema. The structure of the data fields in documents is chosen based on
how documents will be accessed and used, and the user can choose a normalized design (similar to
normalized relational tuples) or a denormalized design (similar to XML documents or complex
objects). Interdocument references can be specified by storing in one document the ObjectId or
ObjectIds of other related documents.
❖ Figure 10(a) shows a simplified MongoDB document showing some of the data from the COMPANY
database example. In the example, the _id values are user-defined, and the documents whose _id starts
with P (for project) will be stored in the “project” collection, whereas those whose _id starts with W
(for worker) will be stored in the “worker” collection. In Figure 10(a), the workers information is
embedded in the project document; so there is no need for the “worker” collection. This is known as
the denormalized pattern, which is similar to creating a complex object or an XML document. A list of
values that is enclosed in square brackets [ … ] within a document represents a field whose value is an
array.
❖ Another option is to use the design in Figure 10(b), where worker references are embedded in the
project document, but the worker documents themselves are stored in a separate “worker” collection.
❖ A third option in Figure 10(c) would use a normalized design, similar to First Normal Form relations.
The choice of which design option to use depends on how the data will be accessed. The design in
Figure 10(c) is not the general normalized design for a many-to-many relationship, such as the one
between employees and projects; rather, we would need three collections for “project”, “employee”,
and “works_on”.
❖ In the design in Figure 10(c), an EMPLOYEE who works on several projects would be represented by
multiple worker documents with different _id values; each document would represent the employee as
worker for a particular project. This is similar to the design decisions for XML schema design. The
typical document-based system does not have a schema, so the design rules would have to be followed
whenever individual documents are inserted into a collection.
Figure 10: Example of simple documents in MongoDB. (a) Denormalized document design with embedded
subdocuments. (b) Embedded array of document references. (c) Normalized documents. (d) Inserting the
documents in (c) into their collections.
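A hedged Python sketch (assuming a local MongoDB server and the pymongo driver) of storing and reading back a denormalized project document in the style of Figure 10(a); the field names and values are illustrative.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["company"]

db["project"].insert_one({
    "_id": "P1",
    "Pname": "ProductX",
    "Plocation": "Bellaire",
    "Workers": [                       # embedded array of worker subdocuments, as in Figure 10(a)
        {"Ename": "John Smith", "Hours": 32.5},
        {"Ename": "Joyce English", "Hours": 20.0},
    ],
})

print(db["project"].find_one({"_id": "P1"}))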
❖ In MongoDB, individual CRUD operations on a single document are atomic; when atomicity across multiple documents is
required, a two-phase commit method is used to ensure atomicity and consistency of multidocument transactions.
Replication in MongoDB: The concept of replica set is used in MongoDB to create multiple copies of
the same data set on different nodes in the distributed system, and it uses a variation of the master-slave
approach for replication.
Example: to replicate a particular document collection C, replica set will have one primary copy of the
collection C stored in one node N1, and at least one secondary copy (replica) of C stored at another
node N2. Additional copies can be stored in nodes N3, N4, etc., as needed, but the cost of storage and
update (write) increases with the number of replicas. The total number of participants in a replica set
must be at least three, so if only one secondary copy is needed, a participant in the replica set known as
an arbiter must run on the third node N3. The arbiter does not hold a replica of the collection but
participates in elections to choose a new primary if the node storing the current primary copy fails. If
the total number of members in a replica set is n (one primary plus i secondaries, for a total of n = i + 1),
then n must be an odd number; if it is not, an arbiter is added to ensure the election process works
correctly if the primary fails.
In MongoDB replication, all write operations must be applied to the primary copy and then propagated
to the secondaries. For read operations, the user can choose the particular read preference for their
application. The default read preference processes all reads at the primary copy, so all read and write
operations are performed at the primary node. In this case, secondary copies are mainly to make sure
that the system continues operation if the primary fails, and MongoDB can ensure that every read
request gets the latest document value. To increase read performance, it is possible to set the read
preference so that read requests can be processed at any replica (primary or secondary); however, a read
at a secondary is not guaranteed to get the latest version of a document because there can be a delay in
propagating writes from the primary to the secondaries.
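A hedged pymongo sketch of setting the read preference when connecting to a replica set; the host names and the replica-set name are hypothetical.

from pymongo import MongoClient

# Hypothetical three-node replica set "rs0"; reads may go to a secondary and be slightly stale.
client = MongoClient(
    "mongodb://node1:27017,node2:27017,node3:27017/"
    "?replicaSet=rs0&readPreference=secondaryPreferred"
)
doc = client["company"]["project"].find_one({"_id": "P1"})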
Sharding in MongoDB: When a collection holds a very large number of documents or requires a large
storage space, storing all the documents in one node can lead to performance problems, particularly if
there are many user operations accessing the documents concurrently using various CRUD operations.
▪ Sharding of the documents in the collection (also known as horizontal partitioning) divides the
documents into disjoint partitions known as shards. This allows the system to add more nodes as
needed by a process known as horizontal scaling of the distributed system, and to store the shards
of the collection on different nodes to achieve load balancing. Each node will process only those
operations pertaining to the documents in the shard stored at that node. Also, each shard will
contain fewer documents than if the entire collection were stored at one node, thus further
improving performance.
▪ Two ways to partition a collection into shards in MongoDB are range partitioning and hash
partitioning. Both require that the user specify a particular document field to be used as the basis
for partitioning the documents into shards. The partitioning field known as the shard key in
MongoDB must have two characteristics: it must exist in every document in the collection, and it
must have an index. The ObjectId can be used, but any other field possessing these two
characteristics can also be used as the basis for sharding. The values of the shard key are divided
into chunks either through range partitioning or hash partitioning, and the documents are partitioned
based on the chunks of shard key values.
▪ Range partitioning creates the chunks by specifying a range of key values; for example, if the shard
key values ranged from one to ten million, it is possible to create ten ranges—1 to 1,000,000;
1,000,001 to 2,000,000; . . . ; 9,000,001 to 10,000,000—and each chunk would contain the key
values in one range. Hash partitioning applies a hash function h(K) to each shard key K, and the
partitioning of keys into chunks is based on the hash values. In general, if range queries are
commonly applied to a collection (for example, retrieving all documents whose shard key value is
between 200 and 400), then range partitioning is preferred because each range query will typically
be submitted to a single node that contains all the required documents in one shard. If most searches
retrieve one document at a time, hash partitioning may be preferable because it randomizes the
distribution of shard key values into chunks (a sketch of both options appears at the end of this subsection).
▪ When sharding is used, MongoDB queries are submitted to a module called the query router,
which keeps track of which nodes contain which shards based on the particular partitioning method
used on the shard keys. The query (CRUD operation) will be routed to the nodes that contain the
shards that hold the documents that the query is requesting. If the system cannot determine which
shards hold the required documents, the query will be submitted to all the nodes that hold shards of
the collection.
Sharding and replication are used together; sharding focuses on improving performance via load
balancing and horizontal scalability, whereas replication focuses on ensuring system availability when
certain nodes fail in the distributed system.
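Example (illustrative sketch): the commands below, issued through pymongo against a mongos query router of a sharded cluster (an assumption; the database, collection, and shard key names are hypothetical), show how a collection can be sharded with a range-partitioned or a hash-partitioned shard key.

from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")   # connect to the query router

client.admin.command("enableSharding", "company")     # allow sharding for the database

# Range partitioning: chunks are contiguous ranges of ProjectId values,
# so a range query is usually routed to a single shard.
client.admin.command("shardCollection", "company.project", key={"ProjectId": 1})

# Hash partitioning (alternative): chunks are based on h(WorkerId), which spreads
# the keys evenly and suits workloads that retrieve one document at a time.
client.admin.command("shardCollection", "company.worker", key={"WorkerId": "hashed"})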
NOSQL Key-Value Stores
The data model used in key-value stores is relatively simple, and in many of these systems, there is no
query language but rather a set of operations that can be used by the application programmers.
The key is a unique identifier associated with a data item and is used to locate this data item
rapidly.
The value is the data item and it can have very different formats for different key-value storage
systems.
➢ In some systems, the value is just a string or array of bytes, and the application using the
key-value store has to interpret the structure of the data value.
➢ In other systems, standard formatted data is allowed;
Example: structured data rows (tuples) similar to relational data, or semistructured data
using JSON or some other self-describing data format.
The main characteristic of key-value stores is the fact that every value (data item) must be
associated with a unique key and that retrieving the value by supplying the key must be very fast.
1. DynamoDB Overview
The DynamoDB system is an Amazon product and is available as part of Amazon’s AWS/SDK
platforms (Amazon Web Services/Software Development Kit).
It can be used as part of Amazon’s cloud computing services, for the data storage component.
DynamoDB data model. The basic data model in DynamoDB uses the concepts of tables, items, and
attributes. A table in DynamoDB does not have a schema; it holds a collection of self-describing items.
Each item will consist of a number of (attribute, value) pairs, and attribute values can be single-valued
or multivalued. A table will hold a collection of items, and each item is a self-describing record
(or object). DynamoDB also allows the user to specify the items in JSON format, and the
system will convert them to the internal storage format of DynamoDB. When a table is created, it is
required to specify a table name and a primary key; the primary key will be used to rapidly locate the
items in the table. Thus, the primary key is the key and the item is the value for the DynamoDB
key-value store. The primary key attribute must exist in every item in the table. The primary key can be one
of the following two types:
■ A single attribute.
The DynamoDB system will use this attribute to build a hash index on the items in the table.
The items are not ordered in storage by the value of the hash attribute.
■ A pair of attributes.
The primary key will be a pair of attributes (A, B): attribute A will be used for hashing, and
because there can be multiple items with the same value of A, the B values will be used for
ordering the records that share the same A value.
A table with this type of key can have additional secondary indexes defined on its attributes.
For example, if we want to store multiple versions of some type of items in a table, we could use
ItemID as hash and Date or Timestamp (when the version was created) as range in a hash and
range type primary key.
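Example (illustrative sketch): a hash-and-range table of this kind could be created with boto3, the AWS SDK for Python (an assumption; the table, attribute names, and values are hypothetical, and AWS credentials are assumed to be configured).

import boto3

dynamodb = boto3.resource("dynamodb")

# ItemId is the hash attribute and CreatedAt is the range attribute,
# so multiple versions of the same item are ordered by creation time.
table = dynamodb.create_table(
    TableName="ItemVersions",
    KeySchema=[
        {"AttributeName": "ItemId", "KeyType": "HASH"},
        {"AttributeName": "CreatedAt", "KeyType": "RANGE"},
    ],
    AttributeDefinitions=[
        {"AttributeName": "ItemId", "AttributeType": "S"},
        {"AttributeName": "CreatedAt", "AttributeType": "N"},
    ],
    BillingMode="PAY_PER_REQUEST",
)
table.wait_until_exists()

# Items are self-describing (attribute, value) pairs; no schema beyond the key.
table.put_item(Item={"ItemId": "doc-17", "CreatedAt": 1700000000, "Status": "draft"})
item = table.get_item(Key={"ItemId": "doc-17", "CreatedAt": 1700000000})["Item"]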
2. Voldemort Key-Value Distributed Data Store
Voldemort keeps a collection of (key, value) pairs in a store; assume the store is called s. The basic
interface for data storage and retrieval is very simple and includes three operations: get, put, and delete.
The operation s.put(k, v) inserts an item as a key-value pair with key k and value v.
The operation s.delete(k) deletes the item whose key is k from the store, and the operation v =
s.get(k) retrieves the value v associated with key k.
The application can use these basic operations to implement whatever functionality it requires. At the
basic storage level, both keys and values are arrays of bytes (strings).
The values v in the (k, v) items can be specified in JSON (JavaScript Object Notation), and the
system will convert between JSON and the internal storage format.
The Serializer class must be provided by the user and will include operations to convert the user
format into a string of bytes for storage as a value, and to convert back a string (array of bytes)
retrieved via s.get(k) into the user format.
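Example (illustrative sketch): the toy Python store below is not the Voldemort API itself; it only mimics the get/put/delete interface and the role of a user-supplied serializer that converts between the application format and the byte strings kept as values.

import json

class JsonSerializer:
    # Converts between the application's dict format and the bytes stored as values.
    def to_bytes(self, obj):
        return json.dumps(obj).encode("utf-8")

    def to_object(self, raw):
        return json.loads(raw.decode("utf-8"))

class KeyValueStore:
    # Toy in-memory store exposing the get/put/delete interface described above.
    def __init__(self, serializer):
        self._data = {}
        self._ser = serializer

    def put(self, k, v):
        self._data[k] = self._ser.to_bytes(v)      # values are kept as byte arrays

    def get(self, k):
        return self._ser.to_object(self._data[k])

    def delete(self, k):
        self._data.pop(k, None)

s = KeyValueStore(JsonSerializer())
s.put("emp:101", {"Ename": "John Smith", "Dno": 5})
print(s.get("emp:101"))        # {'Ename': 'John Smith', 'Dno': 5}
s.delete("emp:101")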
A variation of the data distribution algorithm known as consistent hashing is used in Voldemort
for data distribution among the nodes in the distributed cluster of nodes.
A hash function h(k) is applied to the key k of each (k, v) pair, and h(k) determines where the
item will be stored.
The method assumes that h(k) is an integer value, usually in the range 0 to Hmax = 2^n − 1, where
n is chosen based on the desired range for the hash values.
This method is best visualized by considering the range of all possible integer hash values 0 to
Hmax to be evenly distributed on a circle (or ring).
The nodes in the distributed system are then also located on the same ring; usually each node
will have several locations on the ring (see the figure). The positioning of the points on the ring that
represent the nodes is done in a pseudorandom manner.
An item (k, v) will be stored on the node whose position in the ring follows the position of h(k)
on the ring in a clockwise direction.
In Figure (a), we assume there are three nodes in the distributed cluster labeled A, B, and C,
where node C has a bigger capacity than nodes A and B.
On the circle, two instances each of A and B are placed, and three instances of C (because of its
higher capacity), in a pseudorandom manner to cover the circle.
Figure (a) indicates which (k, v) items are placed in which nodes based on the h(k) values.
The h(k) values that fall in the parts of the circle marked as range 1 in Figure (a) will have their
(k, v) items stored in node A because that is the node whose label follows h(k) on the ring in a
clockwise direction;
Those in range 2 are stored in node B; and those in range 3 are stored in node C.
This scheme allows horizontal scalability because when a new node is added to the distributed
system, it can be added in one or more locations on the ring depending on the node capacity. Only
a limited percentage of the (k, v) items will be reassigned to the new node from the existing
nodes based on the consistent hashing placement algorithm. Also, those items assigned to the new
node may not all come from only one of the existing nodes because the new node can have
multiple locations on the ring.
For example, if a node D is added and it has two placements on the ring as shown in Figure (b),
then some of the items from nodes B and C would be moved to node D.
The items whose keys hash to range 4 on the circle (Figure (b)) would be migrated to node
D. This scheme also allows replication by placing the specified number of replicas of an item on
successive nodes on the ring in a clockwise direction. Sharding is built into the method:
different items in the store (file) are located on different nodes in the distributed cluster, which
means the items are horizontally partitioned (sharded) among the nodes in the distributed system.
When a node fails, its load of data items can be distributed to the other existing nodes whose
labels follow the labels of the failed node in the ring.
Nodes with higher capacity can have more locations on the ring, as illustrated by node C in
Figure (a), and thus store more items than smaller-capacity nodes.
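Example (illustrative sketch): a minimal consistent-hashing ring in Python, assuming MD5 as the hash function and the node names A, B, and C from the figure; the number of ring positions per node models its capacity.

import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes_with_positions):
        # nodes_with_positions: dict mapping node name -> number of ring positions
        self.ring = []                       # sorted list of (hash value, node) pairs
        for node, positions in nodes_with_positions.items():
            for i in range(positions):
                bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    @staticmethod
    def _hash(key):
        # Map a key to an integer position on the ring
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        # The item is stored on the first node clockwise from h(key)
        idx = bisect.bisect_right(self.ring, (self._hash(key), ""))
        if idx == len(self.ring):            # wrap around the ring
            idx = 0
        return self.ring[idx][1]

# Node C gets more ring positions because of its higher capacity
ring = ConsistentHashRing({"A": 2, "B": 2, "C": 3})
print(ring.node_for("item-42"))              # -> one of A, B, or C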
Voldemort uses a method similar to the one developed for DynamoDB for consistency in the
presence of replicas.
Concurrent write operations by different processes are allowed, so when items are replicated,
two or more different values can be associated with the same key at different nodes.
Consistency is achieved when the item is read by using a technique known as versioning and
read repair.
Concurrent writes are allowed, but each write is associated with a vector clock value.
When a read occurs, it is possible that different versions of the same value (associated with the
same key) are read from different nodes.
If the system can reconcile the versions to a single final value, it passes that value to the read;
otherwise, more than one version can be passed back to the application, which reconciles the
various versions into one version based on the application semantics and gives this reconciled
value back to the nodes.
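Example (illustrative sketch): the Python fragment below is only a toy illustration of vector-clock reconciliation; a version is discarded during read repair if some other version dominates it (is at least as new on every node), and truly concurrent versions are all returned to the application.

def dominates(vc1, vc2):
    # True if vector clock vc1 is at least as new as vc2 on every node
    nodes = set(vc1) | set(vc2)
    return all(vc1.get(n, 0) >= vc2.get(n, 0) for n in nodes)

def reconcile(versions):
    # versions: list of (vector clock, value) pairs read from different replicas
    return [
        (vc, val) for vc, val in versions
        if not any(dominates(other, vc) and other != vc for other, _ in versions)
    ]

print(reconcile([({"A": 2, "B": 1}, "v2"), ({"A": 1, "B": 1}, "v1")]))
# -> [({'A': 2, 'B': 1}, 'v2')]   the dominated version is repaired away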
Oracle key-value store. Oracle has one of the well-known SQL relational database systems, and Oracle
also offers a system based on the key-value store concept; this system is called the Oracle NoSQL
Database.
Redis key-value cache and store. Redis differs from the other systems because it caches its data in
main memory to further improve performance.
It offers master-slave replication and high availability, and it also offers persistence by backing up the
cache to disk.
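Example (illustrative sketch): basic Redis usage through the redis-py client (an assumption; the host and key names are hypothetical).

import redis

r = redis.Redis(host="localhost", port=6379)
r.set("emp:101:name", "John Smith")      # value is held in main memory on the server
print(r.get("emp:101:name"))             # b'John Smith'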
Apache Cassandra. Cassandra is a NOSQL system that is not easily categorized into one category; it is
sometimes characterized as a column-based system and sometimes as a key-value store.
Questions :
1. What are the data modeling concepts used in MongoDB? What are the main CRUD operations
of MongoDB?
2. Discuss how replication and sharding are done in MongoDB.
3. Discuss the data modeling concepts in DynamoDB.
4. Describe the consistent hashing schema for data distribution, replication, and sharding. How are
consistency and versioning handled in Voldemort?
Column-Based or Wide Column NOSQL Systems
The Google distributed storage system for big data, known as BigTable, is a well-known example of
this class of NOSQL systems and it is used in many Google applications that require large amounts of
data storage, such as Gmail.
BigTable uses the Google File System (GFS) for data storage and distribution. An open source system
known as Apache Hbase is similar to Google BigTable, but it uses HDFS (Hadoop Distributed File
System) for data storage. HDFS is used in many cloud computing applications.
Hbase can also use Amazon’s Simple Storage System (known as S3) for data storage.
Cassandra is an example of a column-based NOSQL system, but it can also be characterized as a
key-value store.
BigTable (and Hbase) is described as a sparse multidimensional distributed persistent sorted map,
where the word map means a collection of (key, value) pairs (the key is mapped to the value).
One of the main differences that distinguish column-based systems from key-value stores is the
nature of the key.
In column-based systems such as Hbase, the key is multidimensional and has several
components:
a combination of table name, row key, column, and timestamp.
The column is composed of two components: column family and column qualifier.
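Example (illustrative sketch): conceptually, a column-based store can be viewed as a map from a multidimensional key to a value; the Python dictionary below is only an illustration, with hypothetical table, row, column family, qualifier, and timestamp values.

cell_store = {
    # (table, row key, column family, column qualifier, timestamp) -> value
    ("EMPLOYEE", "row1", "Name", "Fname", 1416902396091): "John",
    ("EMPLOYEE", "row1", "Name", "Lname", 1416902396091): "Smith",
    ("EMPLOYEE", "row1", "Address", "City", 1416902396091): "Houston",
}
print(cell_store[("EMPLOYEE", "row1", "Name", "Fname", 1416902396091)])   # John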
1. Hbase Data Model and Versioning
• Hbase data model.
The data model in Hbase organizes data using the concepts of namespaces, tables, column
families, column qualifiers, columns, rows, and data cells.
The concept of column family is somewhat similar to vertical partitioning, because columns
(attributes) that belong to the same column family, and so tend to be accessed together, are
stored in the same files. Each column family of a table is stored in its own files using the
HDFS file system.
• Versions and Timestamps.
Hbase can keep several versions of a data item, along with the timestamp associated with each
version.
The timestamp is a long integer number that represents the system time when the version was
created, so newer versions have larger timestamp values.
Hbase uses midnight ‘January 1, 1970 UTC’ as timestamp value zero, and uses a long integer
that measures the number of milliseconds since that time as the system timestamp value.
It is also possible for the user to define the timestamp value explicitly in a Date format rather
than using the system-generated timestamp.
If the timestamp is left out of a retrieval request, the latest version of the item is retrieved unless a
default number of versions is specified, say the latest three versions.
The default number of versions to be retrieved, as well as the default number of versions that
the system needs to keep, are parameters that can be specified during table creation.
• Namespaces. A namespace is a collection of tables.
A namespace basically specifies a collection of one or more tables that are typically used
together by user applications, and it corresponds to a database that contains a collection of
tables in relational terminology.
2. Hbase CRUD Operations
Hbase only provides low-level CRUD operations. It is the responsibility of the application
programs to implement more complex operations, such as joins between rows in different tables.
The create operation creates a new table and specifies one or more column families associated with
that table.
The put operation is used for inserting new data or new versions of existing data items.
The get operation is for retrieving the data associated with a single row in a table, and the scan
operation retrieves all the rows.
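Example (illustrative sketch): the create, put, get, and scan operations just described, issued from Python through the third-party happybase client (an assumption; it requires an Hbase Thrift server, and the table, column family, and row key names are hypothetical).

import happybase

conn = happybase.Connection("localhost")

# create: a new table with two column families; keep up to 3 versions of Name cells
conn.create_table("employee", {"Name": dict(max_versions=3), "Address": dict()})
table = conn.table("employee")

# put: insert new data (or new versions of existing cells)
table.put(b"row1", {b"Name:Fname": b"John",
                    b"Name:Lname": b"Smith",
                    b"Address:City": b"Houston"})

# get: retrieve the data of a single row
print(table.row(b"row1"))

# scan: iterate over all the rows of the table
for row_key, data in table.scan():
    print(row_key, data)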
3. Hbase Storage and Distributed System Concepts
Each Hbase table is divided into regions, where each region holds a range of the row keys in the table;
a region is assigned to a region server (node) to manage. Each region will have a number of stores,
where each column family is assigned to one store within the region.
A master server (master node) is responsible for monitoring the region servers and for splitting a table
into regions and assigning regions to region servers.
Hbase uses the Apache Zookeeper open source system for services related to managing the naming,
distribution, and synchronization of the Hbase data on the distributed Hbase server nodes, as well as for
related coordination services.
Hbase uses Apache HDFS (Hadoop Distributed File System) for distributed file services. Hbase
is built on top of both HDFS and Zookeeper.
Zookeeper can have several replicas on several nodes for availability, and it keeps the data it
needs in main memory to speed access to the master servers and region servers.
NOSQL Graph Databases and Neo4j
Neo4j is an open source graph database system; its data model organizes data using the concepts of
nodes and relationships, and both nodes and relationships can have properties.
Nodes can have labels; the nodes that have the same label are grouped into a collection that
identifies a subset of the nodes in the database graph for querying purposes.
A node can have zero, one, or several labels. Relationships are directed; each relationship has a start
node and end node as well as a relationship type, which serves a similar role to a node label by
identifying similar relationships that have the same relationship type.
Properties can be specified via a map pattern, which is made of one or more “name : value” pairs
enclosed in curly brackets; for example {Lname : ‘Smith’, Fname : ‘John’, Minit : ‘B’}. In
conventional graph theory, nodes and relationships are generally called vertices and edges.
Comparing the Neo4j graph model with ER/EER concepts, nodes correspond to entities, node labels
correspond to entity types and subclasses, relationships correspond to relationship instances,
relationship types correspond to relationship types, and properties correspond to attributes.
One main difference is that the Neo4j graph model serves as the basis for an actual high-performance
distributed database system, whereas the ER/EER model is mainly used for database design.
Nodes and relationships can be created by calling appropriate Neo4j operations from the various
Neo4j APIs. The high-level syntax for creating nodes and relationships uses the Neo4j CREATE
command, which is part of the high-level declarative query language Cypher.
When a node is created, the node label can be specified. It is also possible to create nodes
without any labels. In Figure (a), the node labels are EMPLOYEE, DEPARTMENT, PROJECT,
and LOCATION, and the created nodes correspond to some of the data from the COMPANY
database in the figure below, with a few modifications; for example, EmpId is used instead of SSN,
and only a small subset of the data is used for illustration purposes.
Properties are enclosed in curly brackets { … }. It is possible that some nodes have multiple
labels;
Example: the same node can be labeled as PERSON and EMPLOYEE and MANAGER by
listing all the labels separated by the colon symbol as follows:
PERSON:EMPLOYEE:MANAGER.
■ Relationships and relationship types. Figure (b) shows example relationships in Neo4j based
on the COMPANY database in the figure above.
The → specifies the direction of the relationship, but the relationship can be traversed in either
direction.
The relationship types (labels) in Figure (b) are WorksFor, Manager, LocatedIn, and WorksOn;
only relationships with the relationship type WorksOn have properties (Hours) in Figure (b).
■ Paths.
A path is typically specified by a start node, followed by one or more relationships, leading to
one or more end nodes that satisfy the pattern.
Graphs can be created and used without a schema, but in Neo4j version 2.0, a few schema-
related functions were added.
The main features related to schema creation involve creating indexes and constraints based on
the labels and properties.
For example, it is possible to create the equivalent of a key constraint on a property of a label, so
all nodes in the collection of nodes associated with the label must have unique values for that
property.
When a node is created, the Neo4j system creates an internal unique system-defined identifier for
each node.
To retrieve individual nodes using other properties of the nodes efficiently, the user can create
indexes for the collection of nodes that have a particular label.
For example, Empid can be used to index nodes with the EMPLOYEE label, Dno to index the
nodes with the DEPARTMENT label, and Pno to index the nodes with the PROJECT label.
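Example (illustrative sketch): the Python fragment below uses the official neo4j driver (an assumption; the URI, credentials, and data values are hypothetical and only loosely follow the COMPANY example). The constraint and index statements use the older Cypher syntax that matches the Neo4j 2.x-era schema features described above.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Key-like constraint and an index on label properties (optional schema features)
    session.run("CREATE CONSTRAINT ON (e:EMPLOYEE) ASSERT e.Empid IS UNIQUE")
    session.run("CREATE INDEX ON :DEPARTMENT(Dno)")

    # Nodes with labels; properties are given as a map pattern in curly brackets
    session.run("CREATE (:EMPLOYEE {Empid: 2, Lname: 'Wong', Fname: 'Franklin'})")
    session.run("CREATE (:DEPARTMENT {Dno: 5, Dname: 'Research'})")

    # A directed relationship of type WorksFor between the two nodes
    session.run(
        "MATCH (e:EMPLOYEE {Empid: 2}), (d:DEPARTMENT {Dno: 5}) "
        "CREATE (e)-[:WorksFor]->(d)")

driver.close()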
2. The Cypher Query Language of Neo4j
Neo4j has a high-level query language, Cypher.
There are declarative commands for creating nodes and relationships (Figures (a) and (b)) and
for finding nodes and relationships based on specifying patterns.
A Cypher query is made up of clauses. When a query has several clauses, the result from one
clause can be the input to the next clause in the query.
Query 1 in Figure (d) shows the query that retrieves the locations for department number 5.
MATCH specifies the pattern and the query variables (d and loc), and RETURN specifies the query
result to be retrieved by referring to the query variables.
Query 2 has three variables (e, w, and p), and returns the projects and hours per week that the
employee with Empid = 2 works on.
Query 3 returns the employees who work on the project with Pno = 2, along with their hours per week.
Query 4 illustrates the ORDER BY clause and returns all employees and the projects they work
on, sorted by Ename.
It is also possible to limit the number of returned results by using the LIMIT clause as in query 5,
which only returns the first 10 answers.
Query 6 illustrates the use of WITH and aggregation, although the WITH clause can be used to
separate clauses in a query even if there is no aggregation.
Query 6 also illustrates the WHERE clause to specify additional conditions, and the query
returns the employees who work on more than two projects, and the number of projects each
employee works on.
Query 7 is similar to query 5 but returns the nodes and relationships only, and so the query result
can be displayed as a graph using Neo4j’s visualization tool.
Query 8 shows how to add more properties to a node by adding a Job property to an employee
node.
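Example (illustrative sketch): Cypher queries along the lines of those described above, run through the neo4j Python driver; the property names (Dname, Ename, Pname, Hours, Job) are assumptions based on the COMPANY example.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # MATCH pattern + RETURN: the locations of department number 5
    for rec in session.run(
            "MATCH (d:DEPARTMENT {Dno: 5})-[:LocatedIn]->(loc) "
            "RETURN d.Dname AS dname, loc"):
        print(rec["dname"], rec["loc"])

    # ORDER BY and LIMIT: employees and the projects they work on, first 10 by name
    rows = session.run(
        "MATCH (e:EMPLOYEE)-[w:WorksOn]->(p:PROJECT) "
        "RETURN e.Ename AS ename, p.Pname AS pname, w.Hours AS hours "
        "ORDER BY ename LIMIT 10")
    print(rows.values())

    # WITH + aggregation + WHERE: employees who work on more than two projects
    rows = session.run(
        "MATCH (e:EMPLOYEE)-[:WorksOn]->(p:PROJECT) "
        "WITH e, count(p) AS numOfProjs WHERE numOfProjs > 2 "
        "RETURN e.Ename AS ename, numOfProjs")
    print(rows.values())

    # Adding a property to an existing node (compare query 8)
    session.run("MATCH (e:EMPLOYEE {Empid: 2}) SET e.Job = 'Engineer'")

driver.close()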
Neo4j interfaces and distributed system characteristics. Neo4j has two main versions: the enterprise
edition, which comes with additional capabilities, and the community edition.
Some of the additional features of Neo4j are:
■ Enterprise edition vs. community edition.
Both editions support the Neo4j graph data model and storage system, the Cypher graph query
language, and several other interfaces, including a high-performance native API; language drivers
for several popular programming languages, such as Java, Python, and PHP; and the
REST (Representational State Transfer) API.
The enterprise edition supports additional features for enhancing performance, such as caching
and clustering of data and locking.
Neo4j has a graph visualization interface, so that a subset of the nodes and edges in a database graph
can be displayed as a graph.
■ Master-slave replication. Neo4j can be configured on a cluster of nodes; the data and indexes are
fully replicated on each node in the cluster.
Various ways of synchronizing the data between master and slave nodes can be configured in the
distributed cluster.
■ Caching. A main memory cache can be configured to store the graph data for improved
performance.
Questions :
1. What are the data modeling concepts used in column-based NOSQL systems and Hbase?
2. What are the main CRUD operations in Hbase?
3. Discuss the storage and distributed system methods used in Hbase.
4. What are the data modeling concepts used in the graph-oriented NOSQL system Neo4j?
5. What is the query language for Neo4j?
6. Discuss the interfaces and distributed systems characteristics of Neo4j.