Database Management Systems (BCS403)
MODULE – 5
A lock is a variable associated with a data item that describes the status of the item with respect to
possible operations that can be applied to it. Generally, there is one lock for each data item in the
database. Locks are used as a means of synchronizing the access by concurrent transactions to the
database items.
Binary Locks:
▪ A binary lock can have two states or values: locked and unlocked (or 1 and 0).
▪ A distinct lock is associated with each database item X. If the value of the lock on X is 1, item X
cannot be accessed by a database operation that requests the item. If the value of the lock on X is 0,
the item can be accessed when requested, and the lock value is changed to 1.
▪ The current value (or state) of the lock associated with item X is referred to as lock(X).
▪ Two operations, lock_item and unlock_item, are used with binary locking.
▪ A transaction requests access to an item X by first issuing a lock_item(X) operation. If LOCK(X)
= 1, the transaction is forced to wait. If LOCK(X) = 0, LOCK(X) is set to 1 (the transaction locks the
item) and the transaction is allowed to access item X.
▪ When the transaction is through using the item, it issues an unlock_item(X) operation, which
sets LOCK(X) back to 0 (unlocks the item) so that X may be accessed by other transactions.
Hence, a binary lock enforces mutual exclusion on the data item.
▪ A description of the lock_item(X) and unlock_item(X) operations is shown in Figure 1.
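The following is a minimal Python sketch (not the textbook's Figure 1 itself) of how a lock manager might implement binary lock_item and unlock_item; the class and dictionary names are illustrative.

import threading

class BinaryLockManager:
    def __init__(self):
        self._cond = threading.Condition()
        self._lock_value = {}              # LOCK(X): True means 1 (locked), False/absent means 0 (unlocked)

    def lock_item(self, x):
        with self._cond:
            while self._lock_value.get(x, False):   # if LOCK(X) = 1, the transaction must wait
                self._cond.wait()
            self._lock_value[x] = True              # LOCK(X) <- 1: the transaction now holds the lock

    def unlock_item(self, x):
        with self._cond:
            self._lock_value[x] = False             # LOCK(X) <- 0
            self._cond.notify_all()                 # wake up any waiting transactions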
The lock manager module of the DBMS can enforce these rules. Between the lock_item(X) and
unlock_item(X) operations in transaction T, T is said to hold the lock on item X. At most one
transaction can hold the lock on a particular item. Thus no two transactions can access the same item
concurrently.
read_lock(X):
B: if LOCK(X) = “unlocked”
then begin LOCK(X) ← “read-locked”;
no_of_reads(X) ← 1
end
else if LOCK(X) = “read-locked”
then no_of_reads(X) ← no_of_reads(X) + 1
else begin
wait (until LOCK(X) = “unlocked” and the lock manager
wakes up the transaction);
go to B
end;
write_lock(X):
B: if LOCK(X) = “unlocked”
then LOCK(X) ← “write-locked”
else begin
wait (until LOCK(X) = “unlocked” and the lock manager
wakes up the transaction);
go to B
end;
unlock(X):
if LOCK(X) = “write-locked”
then begin LOCK(X) ← “unlocked”;
wakeup one of the waiting transactions, if any
end
else if LOCK(X) = “read-locked”
then begin
no_of_reads(X) ← no_of_reads(X) −1;
if no_of_reads(X) = 0
then begin LOCK(X) ← “unlocked”;
wakeup one of the waiting transactions, if any
end
end;
Figure 2: Locking and unlocking operations for two-mode (read/write, or shared/exclusive) locks.
▪ When the shared/exclusive locking scheme is used, the system must enforce the following rules:
1. A transaction T must issue the operation read_lock(X) or write_lock(X) before any
read_item(X) operation is performed in T.
2. A transaction T must issue the operation write_lock(X) before any
write_item(X) operation is performed in T.
3. A transaction T must issue the operation unlock(X) after all read_item(X) and
write_item(X) operations are completed in T.
4. A transaction T will not issue a read_lock(X) operation if it already holds a read (shared) lock or
a write (exclusive) lock on item X. This rule may be relaxed for downgrading of locks.
5. A transaction T will not issue a write_lock(X) operation if it already holds a read (shared) lock
or write (exclusive) lock on item X. This rule may also be relaxed for upgrading of locks, as we
discuss shortly.
6. A transaction T will not issue an unlock(X) operation unless it already holds a read (shared) lock
or a write (exclusive) lock on item X.
Y; consequently, when T2′ issues its read_lock(X), it is forced to wait until T1′ releases the lock
by issuing unlock(X) in the schedule. This can lead to deadlock.
Figure 4: Transactions T1′ and T2′, which are the same as T1 and T2 in Figure 3 but follow the two-
phase locking protocol. They can produce a deadlock.
❖ If every transaction in a schedule follows the two-phase locking protocol, the schedule is guaranteed to
be serializable. The locking protocol, by enforcing two-phase locking rules, also enforces
serializability.
❖ Two-phase locking may limit the amount of concurrency that can occur in a schedule because a
transaction T may not be able to release an item X after it is through using it if T must lock an
additional item Y later; or, conversely, T must lock the additional item Y before it needs it so that it
can release X. Hence, X must remain locked by T until all items that the transaction needs to read or
write have been locked; only then can X be released by T. Meanwhile, another transaction seeking to
access X may be forced to wait, even though T is done with X; conversely, if Y is locked earlier than it
is needed, another transaction seeking to access Y is forced to wait even though T is not using Y yet.
This is the price for guaranteeing serializability of all schedules without having to check the schedules
themselves.
❖ The two-phase locking protocol guarantees serializability (that is, every schedule that is permitted is
serializable) but it does not permit all possible serializable schedules (that is, some serializable
schedules will be prohibited by the protocol).
❖ Conservative (static) 2PL requires a transaction to lock all the items it accesses before the transaction begins execution, by predeclaring its read-set and write-set. If any of the predeclared items cannot be locked, the transaction does not lock any item; instead, it waits until all the items are available for locking.
❖ Conservative 2PL is a deadlock-free protocol. But it is difficult to use in practice because of the need
to predeclare the read-set and write-set, which is not possible in some situations.
❖ Strict 2PL guarantees strict schedules where a transaction T does not release any of its exclusive
(write) locks until after it commits or aborts. Hence, no other transaction can read or write an item that
is written by T unless T has committed, leading to a strict schedule for recoverability.
❖ Strict 2PL is not deadlock-free.
❖ Rigorous 2PL also guarantees strict schedules. In this variation, a transaction T does not release any of
its locks (exclusive or shared) until after it commits or aborts, and so it is easier to implement than
strict 2PL.
❖ Difference between strict and rigorous 2PL: the former holds write locks until after it commits or aborts, whereas the
latter holds all locks (read and write) until after it commits or aborts.
❖ Difference between conservative and rigorous 2PL: the former must lock all its items before it starts,
so once the transaction starts it is in its shrinking phase; the latter does not unlock any of its items until
after it terminates (by committing or aborting), so the transaction is in its expanding phase until it
ends.
❖ Usually the concurrency control subsystem itself is responsible for generating the read_lock and
write_lock requests.
Example: Suppose the system is to enforce the strict 2PL protocol. Then, whenever transaction T
issues a read_item(X), the system calls the read_lock(X) operation on behalf of T. If the state
of LOCK(X) is write_locked by some other transaction T′, the system places T in the waiting queue
for item X; otherwise, it grants the read_lock(X) request and permits the read_item(X)
operation of T to execute. On the other hand, if transaction T issues a write_item(X), the system
calls the write_lock(X) operation on behalf of T. If the state of LOCK(X) is write_locked or
read_locked by some other transaction T′, the system places T in the waiting queue for item X; if the
state of LOCK(X) is read_locked and T itself is the only transaction holding the read lock on X, the
system upgrades the lock to write_locked and permits the write_item(X) operation by T. Finally,
if the state of LOCK(X) is unlocked, the system grants the write_lock(X) request and permits the
write_item(X) operation to execute. After each action, the system must update its lock table
appropriately.
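A rough Python sketch of the lock-manager behaviour just described, under simplifying assumptions (no deadlock handling, a simplified lock table); the class, field, and method names are illustrative and not from any particular DBMS.

class Strict2PLLockManager:
    def __init__(self):
        self.read_holders = {}    # item -> set of transactions holding a read (shared) lock
        self.write_holder = {}    # item -> transaction holding the write (exclusive) lock, or None
        self.wait_queue = {}      # item -> list of transactions waiting for the item

    def request_read(self, t, x):
        # read_lock(X) issued on behalf of T when it performs read_item(X)
        if self.write_holder.get(x) not in (None, t):
            self.wait_queue.setdefault(x, []).append(t)   # X is write_locked by another T': block T
            return False
        self.read_holders.setdefault(x, set()).add(t)     # grant read_lock(X)
        return True

    def request_write(self, t, x):
        # write_lock(X) issued on behalf of T when it performs write_item(X)
        other_readers = self.read_holders.get(x, set()) - {t}
        if self.write_holder.get(x) not in (None, t) or other_readers:
            self.wait_queue.setdefault(x, []).append(t)   # a conflicting lock is held by another T': block T
            return False
        self.write_holder[x] = t                          # grant write_lock(X); upgrades T's own read lock if any
        return True

    def release_all(self, t):
        # Under strict 2PL, T's write locks are released only at commit or abort
        # (rigorous 2PL would also hold the read locks until then).
        for x, holder in list(self.write_holder.items()):
            if holder == t:
                self.write_holder[x] = None
        for holders in self.read_holders.values():
            holders.discard(t)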
❖ Disadvantages of locking: Locking is generally considered to have a high overhead, because every
read or write operation is preceded by a system locking request. The use of locks can also cause two
additional problems: deadlock and starvation.
Figure 5: Illustrating the deadlock problem. (a) A partial schedule of T1′ and T2′ that is
in a state of deadlock. (b) A wait-for graph for the partial schedule in (a).
❖ In wait-die, an older transaction is allowed to wait for a younger transaction, whereas a younger
transaction requesting an item held by an older transaction is aborted and restarted. In the wound-wait
approach, a younger transaction is allowed to wait for an older one, whereas an older transaction
requesting an item held by a younger transaction preempts the younger transaction by aborting it.
❖ Both schemes end up aborting the younger of the two transactions (the transaction that started later)
that may be involved in a deadlock, assuming that this will waste less processing.
❖ These two techniques are deadlock-free, since in wait-die, transactions only wait for younger
transactions so no cycle is created. Similarly, in wound-wait, transactions only wait for older
transactions so no cycle is created. But both techniques may cause some transactions to be aborted and
restarted needlessly, even though those transactions may never actually cause a deadlock.
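The two rules can be summarised in a small Python sketch; the ts values stand for transaction timestamps (smaller means older), and the function names are illustrative.

def wait_die(requester_ts, holder_ts):
    # An older (smaller timestamp) requester waits; a younger requester is aborted ("dies").
    return "wait" if requester_ts < holder_ts else "abort requester"

def wound_wait(requester_ts, holder_ts):
    # An older requester preempts ("wounds") the younger holder; a younger requester waits.
    return "abort holder" if requester_ts < holder_ts else "wait"

print(wait_die(5, 9))     # older transaction requesting -> wait
print(wound_wait(9, 5))   # younger transaction requesting -> wait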
❖ Cautious waiting: Suppose that transaction Ti tries to lock an item X but cannot do so because X is locked by some
other transaction Tj with a conflicting lock. The cautious waiting rule is: if Tj is not blocked (not waiting for some other
locked item), then Ti is blocked and allowed to wait; otherwise, abort Ti. Cautious waiting is deadlock-free, which can be
seen by considering the time b(T) at which each blocked transaction T was blocked: if the two
transactions Ti and Tj above both become blocked and Ti is waiting for Tj, then b(Ti) < b(Tj),
since Ti can only wait for Tj at a time when Tj is not blocked itself. Hence, the blocking times form a
total ordering on all blocked transactions, so no cycle that causes deadlock can occur.
Deadlock Detection:
❖ In the deadlock detection approach, the system checks whether a state of deadlock actually exists rather than
preventing it. This approach is attractive if there is little interference among transactions, that is, if
different transactions rarely access the same items at the same time. This can happen if transactions are short and
each transaction locks only a few items, or if the transaction load is light.
❖ If transactions are long and each transaction uses many items, or if the transaction load is heavy, it
may be advantageous to use a deadlock prevention scheme.
❖ The system can construct and maintain a wait-for graph to detect a state of deadlock. One node is
created in the wait-for graph for each transaction that is currently executing. Whenever a transaction Ti
is waiting to lock an item X that is currently locked by a transaction Tj, a directed edge (Ti → Tj) is
created in the wait-for graph. When Tj releases the lock(s) on the items that Ti was waiting for, the
directed edge is dropped from the wait-for graph. A state of deadlock occurs if and only if the wait-for
graph has a cycle.
This approach has the problem of determining when the system should check for a deadlock. The wait-for graph can be
checked for a cycle every time an edge is added to it, but this may cause excessive
overhead. Criteria such as the number of currently executing transactions or the period of time several
transactions have been waiting to lock items may be used instead to decide when to check for a cycle.
Figure 5(b) shows the wait-for graph for the (partial) schedule shown in Figure 5(a). If the system is
in a state of deadlock, some of the transactions causing the deadlock must be aborted. Choosing which
transactions to abort is known as victim selection. The algorithm for victim selection should generally
avoid selecting transactions that have been running for a long time and that have performed many
updates, and it should try instead to select transactions that have not made many changes (younger
transactions).
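A possible Python sketch of the cycle check on a wait-for graph; the adjacency-set representation is an assumption for illustration, not any particular DBMS's implementation.

def has_cycle(wait_for):
    """wait_for maps each transaction to the set of transactions it is waiting for."""
    visited, in_stack = set(), set()

    def dfs(t):
        visited.add(t)
        in_stack.add(t)
        for u in wait_for.get(t, ()):
            if u in in_stack or (u not in visited and dfs(u)):
                return True
        in_stack.discard(t)
        return False

    return any(t not in visited and dfs(t) for t in wait_for)

print(has_cycle({"T1'": {"T2'"}, "T2'": {"T1'"}}))   # True: the cycle T1' -> T2' -> T1' means deadlock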
❖ Timeouts: A scheme to deal with deadlock is the use of timeouts. In this method, if a transaction waits
for a period longer than a system-defined timeout period, the system assumes that the transaction may
be deadlocked and aborts it—regardless of whether a deadlock actually exists.
Starvation:
▪ When locking is used, starvation can occur when a transaction cannot proceed for an indefinite period
of time while other transactions in the system continue normally.
▪ This may occur if the waiting scheme for locked items is unfair in that it gives priority to some
transactions over others.
▪ One solution for starvation is to have a fair waiting scheme, such as using a first-come-first-served
queue; transactions are enabled to lock an item in the order in which they originally requested the lock.
▪ Another scheme allows some transactions to have priority over others but increases the priority of a
transaction the longer it waits, until it eventually gets the highest priority and proceeds.
▪ Starvation can also occur because of victim selection if the algorithm selects the same transaction as
victim repeatedly, thus causing it to abort and never finish execution.
▪ The algorithm can use higher priorities for transactions that have been aborted multiple times to avoid
this problem.
▪ The wait-die and wound-wait schemes avoid starvation, because they restart a transaction that has
been aborted with its same original timestamp, so the possibility that the same transaction is aborted
repeatedly is slim.
Questions :
1. Discuss the problems of deadlock and starvation, and the different approaches to dealing with these
problems.
2. Describe the wait-die and wound-wait protocols for deadlock prevention.
3. Describe the cautious waiting, no waiting, and timeout protocols for deadlock prevention.
1. Timestamps:
▪ A timestamp is a unique identifier created by the DBMS to identify a transaction. Timestamp values are assigned in the
order in which the transactions are submitted to the system, so a timestamp can be thought of as the transaction start
time; the timestamp of transaction T is referred to as TS(T).
▪ Timestamps can be generated in two ways. One is to use a counter that is incremented each time its value is assigned to
a transaction. Another is to use the current date/time value of the system clock and ensure that no two timestamp values are
generated during the same tick of the clock.
The timestamp ordering algorithm associates with each database item X two timestamp (TS) values:
1. read_TS(X): The read timestamp of item X is the largest timestamp among all the timestamps of transactions
that have successfully read item X, that is, read_TS(X) = TS(T), where T is the youngest transaction that has read X
successfully.
2. write_TS(X): The write timestamp of item X is the largest of all the timestamps of transactions
that have successfully written item X—that is, write_TS(X) = TS(T), where T is the youngest
transaction that has written X successfully. Based on the algorithm, T will also be the last
transaction to write item X.
Basic Timestamp Ordering (TO): Whenever some transaction T tries to issue a read_item(X) or a
write_item(X) operation, the basic TO algorithm compares the timestamp of T with read_TS(X)
and write_TS(X) to ensure that the timestamp order of transaction execution is not violated. If this
order is violated, then transaction T is aborted and resubmitted to the system as a new transaction with a
new timestamp. If T is aborted and rolled back, any transaction T1 that may have used a value written by T
must also be rolled back. Similarly, any transaction T2 that may have used a value written by T1 must also
be rolled back, and so on. This effect is known as cascading rollback and is one of the problems associated
with basic TO, since the schedules produced are not guaranteed to be recoverable. An additional protocol
must be enforced to ensure that the schedules are recoverable, cascadeless, or strict.
The concurrency control algorithm must check whether conflicting operations violate the timestamp
ordering in the following two cases:
1. Whenever a transaction T issues a write_item(X) operation: If read_TS(X) > TS(T) or write_TS(X) > TS(T),
then abort and roll back T and reject the operation; otherwise, execute the write_item(X) operation of T and set
write_TS(X) to TS(T).
2. Whenever a transaction T issues a read_item(X) operation: If write_TS(X) > TS(T), then abort and roll back T and
reject the operation; otherwise, execute the read_item(X) operation of T and set read_TS(X) to the larger of TS(T)
and the current read_TS(X).
Whenever the basic TO algorithm detects two conflicting operations that occur in the incorrect order, it
rejects the later of the two operations by aborting the transaction that issued it. The schedules produced by
basic TO are hence guaranteed to be conflict serializable. Deadlock does not occur with timestamp
ordering. But cyclic restart (and hence starvation) may occur if a transaction is continually aborted and
restarted.
Strict Timestamp Ordering (TO): Strict TO ensures that the schedules are both strict (for easy
recoverability) and (conflict) serializable. In this variation, a transaction T that issues a read_item(X) or
write_item(X) such that TS(T) > write_TS(X) has its read or write operation delayed until the
transaction T′ that wrote the value of X (hence TS(T′) = write_TS(X)) has committed or aborted.
To implement this algorithm, it is necessary to simulate the locking of an item X that has been written by
transaction T′ until T′ is either committed or aborted. This algorithm does not cause deadlock, since T
waits for T′ only if TS(T)>TS(T′).
Thomas’s Write Rule: A modification of the basic TO algorithm, known as Thomas’s write rule, does not
enforce conflict serializability, but it rejects fewer write operations by modifying the checks for the
write_item(X) operation as follows:
1. If read_TS(X) > TS(T), then abort and roll back T and reject the operation.
2. If write_TS(X) > TS(T), then do not execute the write operation but continue processing. This is
because some transaction with timestamp greater than TS(T)—and hence after T in the timestamp
ordering—has already written the value of X. Thus, we must ignore the write_item(X) operation of
T because it is already outdated and obsolete. Notice that any conflict arising from this situation
would be detected by case (1).
3. If neither the condition in part (1) nor the condition in part (2) occurs, then execute the
write_item(X) operation of T and set write_TS(X) to TS(T).
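A compact Python sketch of the basic TO checks, with Thomas's write rule applied to writes; the dictionaries model read_TS(X), write_TS(X) and the item values, and a missing timestamp is treated as 0 purely for illustration.

read_ts, write_ts, value = {}, {}, {}    # read_TS(X), write_TS(X), current value of X

def read_item(t_ts, x):
    if write_ts.get(x, 0) > t_ts:
        return "abort"                   # a younger transaction already wrote X: roll back T
    read_ts[x] = max(read_ts.get(x, 0), t_ts)
    return value.get(x)

def write_item(t_ts, x, v):
    if read_ts.get(x, 0) > t_ts:
        return "abort"                   # case 1: a younger transaction already read X
    if write_ts.get(x, 0) > t_ts:
        return "ignore"                  # case 2: Thomas's write rule, the outdated write is skipped
    value[x] = v                         # case 3: perform the write
    write_ts[x] = t_ts
    return "ok"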
Questions :
1. What is a timestamp? How does the system generate timestamps?
2. Discuss the timestamp ordering protocol for concurrency control. How does strict timestamp
ordering differ from basic timestamp ordering?
❖ Multiversion concurrency control techniques keep copies (versions) of the old values of a data item when the item is
updated. When a transaction requests to read an item, the appropriate version is chosen to maintain the
serializability of the currently executing schedule.
❖ One reason for keeping multiple versions is that some read operations that would be rejected in other
techniques can still be accepted by reading an older version of the item to maintain serializability. When
a transaction writes an item, it writes a new version and the old version(s) of the item is retained. Some
multiversion concurrency control algorithms use the concept of view serializability rather than conflict
serializability.
❖ Drawback of multiversion techniques: More storage is needed to maintain multiple versions of the
database items. In some cases, older versions can be kept in a temporary store. It is also possible that
older versions may have to be maintained anyway—for example, for recovery purposes. Some database
applications may require older versions to be kept to maintain a history of the changes of data item
values. The extreme case is a temporal database, which keeps track of all changes and the times at
which they occurred. In such cases, there is no additional storage penalty for multiversion techniques,
since older versions are already maintained.
In the multiversion technique based on timestamp ordering, several versions X1, X2, … , Xk of each data item X are
maintained. For each version Xi, the system keeps write_TS(Xi), the timestamp of the transaction that wrote version
Xi, and read_TS(Xi), the largest timestamp among the transactions that have successfully read version Xi. To ensure
serializability, the following two rules are used:
1. If transaction T issues a write_item(X) operation, and version i of X has the highest
write_TS(Xi) of all versions of X that is also less than or equal to TS(T), and read_TS(Xi) > TS(T),
then abort and roll back transaction T; otherwise, create a new version Xj of X with
read_TS(Xj) = write_TS(Xj) = TS(T).
2. If transaction T issues a read_item(X) operation, find the version i of X that has the highest
write_TS(Xi) of all versions of X that is also less than or equal to TS(T); then return the
value of Xi to transaction T, and set the value of read_TS(Xi) to the larger of TS(T) and the
current read_TS(Xi).
A read_item(X) is always successful as seen in case 2, since it finds the appropriate version Xi to
read based on the write_TS of the various existing versions of X. In case 1, however, transaction T
may be aborted and rolled back. This happens if T attempts to write a version of X that should have
been read by another transaction T′ whose timestamp is read_TS(Xi); however, T′ has already read
version Xi, which was written by the transaction with timestamp equal to write_TS(Xi). If this
conflict occurs, T is rolled back; otherwise, a new version of X, written by transaction T, is created. If
T is rolled back, cascading rollback may occur. Hence, to ensure recoverability, a transaction T should
not be allowed to commit until after all the transactions that have written some version that T has read
have committed.
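A small Python sketch of the multiversion read rule (case 2); the data layout, names, and values are illustrative only.

versions = {
    # item -> list of versions; each version keeps its value, write_TS(Xi) and read_TS(Xi)
    "X": [{"value": 5, "write_ts": 10, "read_ts": 12},
          {"value": 8, "write_ts": 20, "read_ts": 20}],
}

def mv_read(t_ts, item):
    # Case 2: choose the version with the highest write_TS(Xi) that is <= TS(T)
    candidates = [v for v in versions[item] if v["write_ts"] <= t_ts]
    chosen = max(candidates, key=lambda v: v["write_ts"])   # assumes at least one qualifying version exists
    chosen["read_ts"] = max(chosen["read_ts"], t_ts)        # update read_TS(Xi)
    return chosen["value"]

print(mv_read(15, "X"))   # a transaction with TS = 15 reads the version written at timestamp 10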
Figure 6: Lock compatibility tables.(a) Lock compatibility table for read/write locking scheme.
(b) Lock compatibility table for read/write/certify locking scheme.
An entry of Yes means that if a transaction T holds the type of lock specified in the column header
on item X and if transaction T′ requests the type of lock specified in the row header on the same item
X, then T′ can obtain the lock because the locking modes are compatible. An entry of No in the table
indicates that the locks are not compatible, so T′ must wait until T releases the lock. In the standard
locking scheme, once a transaction obtains a write lock on an item, no other transactions can access
that item.
❖ Multiversion 2PL will allow other transactions T′ to read an item X while a single transaction T holds
a write lock on X. This is accomplished by maintaining two versions for each item X:
▪ The committed version X, which must always have been written by some committed transaction.
▪ A second local version X′, which can be created when a transaction T acquires a write lock on X.
Other transactions can continue to read the committed version of X while T holds the write lock.
Transaction T can write the value of X′ as needed, without affecting the value of the committed
version X. But once T is ready to commit, it must obtain a certify lock on all items that it currently
holds write locks on before it can commit; this is another form of lock upgrading. The certify lock is
not compatible with read locks, so the transaction may have to delay its commit until all its write-
locked items are released by any reading transactions in order to obtain the certify locks. Once the
certify locks—which are exclusive locks—are acquired, the committed version X of the data item is
set to the value of version X′, version X′ is discarded and the certify locks are then released. The lock
compatibility table for this scheme is shown in Figure 6(b).
❖ In the multiversion 2PL scheme, reads can proceed concurrently with a single write operation which is
not permitted under the standard 2PL schemes. The cost is that a transaction may have to delay its
commit until it obtains exclusive certify locks on all the items it has updated.
❖ This scheme avoids cascading aborts, since transactions are only allowed to read the version X that
was written by a committed transaction. But deadlocks may occur and must be handled by various
techniques.
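The read/write/certify compatibility checks of Figure 6(b) can be sketched as a small Python table; the entries reproduce the standard matrix described above, and the function name is illustrative.

COMPATIBLE = {
    # (lock held by T, lock requested by T'): can T' be granted the lock?
    ("read", "read"): True,     ("read", "write"): True,     ("read", "certify"): False,
    ("write", "read"): True,    ("write", "write"): False,   ("write", "certify"): False,
    ("certify", "read"): False, ("certify", "write"): False, ("certify", "certify"): False,
}

def can_grant(held_mode, requested_mode):
    return COMPATIBLE[(held_mode, requested_mode)]

print(can_grant("write", "read"))     # True: readers still see the committed version X
print(can_grant("read", "certify"))   # False: the writer's commit waits until readers release X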
Questions :
1. How do optimistic concurrency control techniques differ from other concurrency control
techniques? Why are they called validation or certification techniques? Discuss the typical
phases of an optimistic concurrency control method.
❖ In optimistic (validation-based) concurrency control techniques, no checking is done while the transaction is
executing; updates are applied to local copies of the data items, and at the end of transaction execution a validation
phase checks whether any of the transaction’s updates violate serializability. Certain information needed by the
validation phase must be kept by the system. If serializability is not violated, the transaction is committed and the
database is updated from the local copies; otherwise, the transaction is aborted and then restarted later.
❖ There are three phases for this concurrency control protocol:
1. Read phase: A transaction can read values of committed data items from the database. However,
updates are applied only to local copies (versions) of the data items kept in the transaction
workspace.
2. Validation phase: Checking is performed to ensure that serializability will not be violated if the
transaction updates are applied to the database.
3. Write phase: If the validation phase is successful, the transaction updates are applied to the
database; otherwise, the updates are discarded and the transaction is restarted.
❖ The optimistic concurrency control protocol does all the checks at once. Hence, transaction execution proceeds
with a minimum of overhead until the validation phase is reached. If there is little interference among
transactions, most will be validated successfully. If there is much interference, many transactions that
execute to completion will have their results discarded and must be restarted later. Under such
circumstances, optimistic techniques do not work well.
❖ The techniques are called optimistic because they assume that little interference will occur and hence
most transactions will be validated successfully, so that there is no need to do checking during
transaction execution. This assumption is generally true in many transaction processing workloads.
❖ The optimistic protocol described uses transaction timestamps and also requires that the
write_sets and read_sets of the transactions be kept by the system. Additionally, start and end
times for the three phases need to be kept for each transaction.
❖ In the validation phase for transaction Ti, the protocol checks that Ti does not interfere with any
recently committed transactions or with any other concurrent transactions that have started their
validation phase. The validation phase for Ti checks that, for each such transaction Tj that is either
recently committed or is in its validation phase, one of the following conditions holds:
1. Transaction Tj completes its write phase before Ti starts its read phase.
2. Ti starts its write phase after Tj completes its write phase, and the read_set of Ti has no items
in common with the write_set of Tj.
3. Both the read_set and write_set of Ti have no items in common with the write_set of
Tj, and Tj completes its read phase before Ti completes its read phase.
When validating transaction Ti against each one of the transactions Tj, the first condition is checked
first, since (1) is the simplest condition to check. Only if condition (1) is false is condition (2) checked, and
only if (2) is false is condition (3) checked. If any one of these three conditions holds with each
transaction Tj, there is no interference and Ti is validated successfully. If none of these three conditions
holds for any one Tj, the validation of transaction Ti fails (because Ti and Tj may violate serializability)
and so Ti is aborted and restarted later because interference with Tj may have occurred.
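A hedged Python sketch of the validation test of Ti against one transaction Tj, following the three conditions above; transactions are modelled as dictionaries with illustrative field names (read_start, read_end, write_start, write_end, read_set, write_set).

def validate(ti, tj):
    # Condition 1: Tj completes its write phase before Ti starts its read phase.
    if tj["write_end"] < ti["read_start"]:
        return True
    # Condition 2: Ti starts its write phase after Tj completes its write phase,
    # and read_set(Ti) has no items in common with write_set(Tj).
    if ti["write_start"] > tj["write_end"] and not (ti["read_set"] & tj["write_set"]):
        return True
    # Condition 3: neither read_set(Ti) nor write_set(Ti) intersects write_set(Tj),
    # and Tj completes its read phase before Ti completes its read phase.
    if (not ((ti["read_set"] | ti["write_set"]) & tj["write_set"])
            and tj["read_end"] < ti["read_end"]):
        return True
    return False   # possible interference: abort and restart Ti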
❖ The basic idea of snapshot isolation is that a transaction sees the data items that it reads based on the committed
values of the items in the database snapshot (or database state) when the transaction starts. Snapshot isolation ensures
that a transaction, or, in some cases, the database statement, will only see the records that were committed in
the database at the time the transaction started. Any insertions, deletions, or updates that occur after
the transaction starts will not be seen by the transaction.
❖ Snapshot isolation does not allow the problems of dirty read and nonrepeatable read to occur. Certain
anomalies that violate serializability can occur when snapshot isolation is used as the basis for
concurrency control.
❖ In this scheme, read operations do not require read locks to be applied to the items, thus reducing the
overhead associated with two-phase locking. But write operations do require write locks. Thus, for
transactions that have many reads, the performance is much better than 2PL. When writes do occur,
the system will have to keep track of older versions of the updated items in a temporary version store
(also known as tempstore), with the timestamps of when the version was created. This is necessary so
that a transaction that started before the item was written can still read the value (version) of the item
that was in the database snapshot when the transaction started.
❖ To keep track of versions, items that have been updated will have pointers to a list of recent versions
of the item in the tempstore, so that the correct item can be read for each transaction. The tempstore
items will be removed when no longer needed, so a method to decide when to remove unneeded
versions will be needed.
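A minimal Python sketch of reading from such a temporary version store under snapshot isolation; the structure, names, and timestamps are illustrative.

tempstore = {
    # item -> list of (commit_timestamp, value), oldest first; values are illustrative
    "X": [(5, 100), (12, 150)],
}

def snapshot_read(item, tx_start_ts):
    # No read lock is taken; the transaction sees the newest version committed
    # before its own start timestamp (its snapshot).
    visible = [(ts, v) for ts, v in tempstore[item] if ts <= tx_start_ts]
    return max(visible)[1] if visible else None

print(snapshot_read("X", 10))   # a transaction that started at time 10 sees 100, not 150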
❖ Variations of this method have been used in several commercial and open source DBMSs, including
Oracle and PostGRES. If the users require guaranteed serializability, then the problems with anomalies
that violate serializability will have to be solved by the programmers/software engineers by analyzing
the set of transactions to determine which types of anomalies can occur, and adding checks that do not
permit these anomalies. This can place a burden on the software developers when compared to the
DBMS enforcing serializability in all cases.
❖ Variations of snapshot isolation (SI) techniques, known as serializable snapshot isolation (SSI), have
been proposed and implemented in some of the DBMSs that use SI as their primary concurrency
control method. For example, recent versions of the PostGRES DBMS allow the user to choose
between basic SI and SSI. The tradeoff is ensuring full serializability with SSI versus living with
possible rare anomalies but having better performance with basic SI.
Questions :
1. What is snapshot isolation? What are the advantages and disadvantages of concurrency control
based on snapshot isolation?
The particular choice of data item type can affect the performance of concurrency control and
recovery.
Consider, for example, the case where the data item size is one disk block. If a transaction T needs to lock a record B,
it must lock the whole disk block X that contains B. Now, if another
transaction S wants to lock a different record C that happens to reside in the same disk block X in a
conflicting lock mode, it is forced to wait. If the data item size was a single record instead of a disk
block, transaction S would be able to proceed, because it would be locking a different data item
(record).
❖ The smaller the data item size, the larger the number of items in the database. Because every item is
associated with a lock, the system will have a larger number of active locks to be handled by the lock
manager. More lock and unlock operations will be performed, causing a higher overhead. In addition,
more storage space will be required for the lock table. For timestamps, storage is required for the
read_TS and write_TS for each data item, and there will be similar overhead for handling a large
number of items.
❖ The best item size depends on the types of transactions involved. If a typical transaction accesses a
small number of records, it is advantageous to have the data item granularity be one record. If a
transaction typically accesses many records in the same file, it may be better to have block or file
granularity so that the transaction will consider all those records as one (or a few) data items.
A granularity hierarchy such as the one in Figure 7 (database, files, pages, records) can be used to illustrate a
multiple granularity level 2PL protocol, with shared/exclusive locking
modes, where a lock can be requested at any level. Additional types of locks will be needed to support
such a protocol efficiently. Consider the following scenario, which refers to the example in Figure 7.
Suppose transaction T1 wants to update all the records in file f1, and T1 requests and is granted an
exclusive lock for f1. Then all of f1’s pages (p11 through p1n), and the records contained on those
pages, are locked in exclusive mode. This is beneficial for T1 because setting a single file-level lock is
more efficient than setting n page-level locks or having to lock each record individually.
Now suppose another transaction T2 only wants to read record r1nj from page p1n of file f1; then T2
would request a shared record-level lock on r1nj. However, the database system (the transaction
manager or the lock manager) must verify the compatibility of the requested lock with already held
locks. This can be verified by traversing the tree from the leaf r1nj to p1n to f1 to db. If at any time a
conflicting lock is held on any of those items, then the lock request for r1nj is denied and T2 is blocked
and must wait. This traversal would be fairly efficient.
If transaction T2’s request came before transaction T1’s request, the shared record lock is granted to T2
for r1nj, but when T1’s file-level lock is requested, it can be time-consuming for the lock manager to
check all nodes (pages and records) that are descendants of node f1 for a lock conflict. This would be
very inefficient and would defeat the purpose of having multiple granularity level locks.
❖ To make multiple granularity level locking practical, intention locks are needed. The idea behind
intention locks is for a transaction to indicate, along the path from the root to the desired node, what
type of lock (shared or exclusive) it will require from one of the node’s descendants.
There are three types of intention locks:
1. Intention-shared (IS) indicates that one or more shared locks will be requested on some descendant node(s).
2. Intention-exclusive (IX) indicates that one or more exclusive locks will be requested on some descendant node(s).
3. Shared-intention-exclusive (SIX) indicates that the current node is locked in shared mode but that one or more
exclusive locks will be requested on some descendant node(s).
The compatibility table of the three intention locks, and the actual shared and exclusive locks, is
shown in Figure 8.
❖ An appropriate locking protocol must be used in addition to the three types of intention locks. The
multiple granularity locking (MGL) protocol consists of the following rules:
1. The lock compatibility (based on Figure 8) must be adhered to.
2. The root of the tree must be locked first, in any mode.
3. A node N can be locked by a transaction T in S or IS mode only if the parent of node N is already
locked by transaction T in either IS or IX mode.
4. A node N can be locked by a transaction T in X, IX, or SIX mode only if the parent of node N is
already locked by transaction T in either IX or SIX mode.
5. A transaction T can lock a node only if it has not unlocked any node (to enforce the 2PL protocol).
6. A transaction T can unlock a node, N, only if none of the children of node N are currently locked
by T.
Rule 1 states that conflicting locks cannot be granted. Rules 2, 3, and 4 state the conditions when a
transaction may lock a given node in any of the lock modes. Rules 5 and 6 of the MGL protocol
enforce 2PL rules to produce serializable schedules.
Basically, the locking starts from the root and goes down the tree until the node that needs to be locked
is encountered, whereas unlocking starts from the locked node and goes up the tree until the root itself
is unlocked.
❖ The multiple granularity level protocol is especially suited when processing a mix of transactions that
include:
(1) short transactions that access only a few items (records or fields) and
(2) long transactions that access entire files.
In this environment, less transaction blocking and less locking overhead are incurred by such a
protocol when compared to a single-level granularity locking approach.
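The MGL checks can be sketched in Python as a compatibility matrix (in the spirit of Figure 8) plus the parent-mode tests of rules 3 and 4; the table entries follow the standard intention-lock matrix, and the names are illustrative.

MGL_COMPATIBLE = {
    "IS":  {"IS": True,  "IX": True,  "S": True,  "SIX": True,  "X": False},
    "IX":  {"IS": True,  "IX": True,  "S": False, "SIX": False, "X": False},
    "S":   {"IS": True,  "IX": False, "S": True,  "SIX": False, "X": False},
    "SIX": {"IS": True,  "IX": False, "S": False, "SIX": False, "X": False},
    "X":   {"IS": False, "IX": False, "S": False, "SIX": False, "X": False},
}

def parent_mode_ok(requested, parent_mode):
    # Rule 3: S or IS on a node needs IS or IX on the parent.
    # Rule 4: X, IX, or SIX on a node needs IX or SIX on the parent.
    if requested in ("S", "IS"):
        return parent_mode in ("IS", "IX")
    if requested in ("X", "IX", "SIX"):
        return parent_mode in ("IX", "SIX")
    return False

print(MGL_COMPATIBLE["IX"]["IS"])    # True: another transaction may still place an IS lock
print(parent_mode_ok("X", "IX"))     # True: a file-level IX allows a record-level X underneath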
Questions :
Two of the main reasons why many big data applications do not use a traditional relational (SQL) DBMS are:
(1) SQL systems offer too many services (powerful query language, concurrency control, etc.),
which such an application may not need; and
(2) a structured data model such as the traditional relational model may be too restrictive.
Although newer relational systems do have more complex object-relational modeling options, they still
require schemas, which are not required by many of the NOSQL systems.
Consider an application such as Facebook, with millions of users who submit posts, many with images
and videos; then these posts must be displayed on pages of other users using the social media
relationships among the users. User profiles, user relationships, and posts must all be stored in a huge
collection of data stores, and the appropriate posts must be made available to the sets of users that have
signed up to see these posts. Some of the data for this type of application is not suitable for a traditional
relational system and typically needs multiple types of databases and data storage systems. Some of the
organizations that were faced with these data management and storage applications decided to develop
their own systems:
➢ Google developed a proprietary NOSQL system known as BigTable, which is used in many of
Google’s applications that require vast amounts of data storage, such as Gmail, Google Maps,
and Web site indexing. Apache HBase is an open source NOSQL system based on similar
concepts. Google’s innovation led to the category of NOSQL systems known as column-based
or wide column stores; they are also sometimes referred to as column family stores.
➢ Amazon developed a NOSQL system called DynamoDB that is available through Amazon’s
cloud services. This innovation led to the category known as key-value data stores or sometimes
key-tuple or key-object data stores.
➢ Facebook developed a NOSQL system called Cassandra, which is now open source and known
as Apache Cassandra. This NOSQL system uses concepts from both key-value stores and
column-based systems.
➢ Other software companies started developing their own solutions and making them available to
users who need these capabilities—for example, MongoDB and CouchDB, which are classified
as document-based NOSQL systems or document stores.
➢ Another category of NOSQL systems is the graph-based NOSQL systems, or graph databases;
these include Neo4J and GraphBase, among others.
➢ Some NOSQL systems, such as OrientDB, combine concepts from many of the categories
discussed above.
In addition to the newer types of NOSQL systems listed above, it is also possible to classify database
systems based on the object model or on the native XML model as NOSQL systems, although they may
not have the high-performance and replication characteristics of the other types of NOSQL systems.
1. Scalability: There are two kinds of scalability in distributed systems: horizontal and vertical. In NOSQL
systems, horizontal scalability is generally used, where the distributed system is expanded by adding
more nodes for data storage and processing as the volume of data grows. Vertical scalability refers to
expanding the storage and computing power of existing nodes. In NOSQL systems, horizontal
scalability is employed while the system is operational, so techniques for distributing the existing data
among new nodes without interrupting system operation are necessary.
2. Availability, Replication and Eventual Consistency: Many applications that use NOSQL systems
require continuous system availability. To accomplish this, data is replicated over two or more nodes
in a transparent manner, so that if one node fails, the data is still available on other nodes. Replication
improves data availability and can also improve read performance, because read requests can often be
serviced from any of the replicated data nodes. However, write performance becomes more
cumbersome because an update must be applied to every copy of the replicated data items; this can
slow down write performance if serializable consistency is required. Many NOSQL applications do not
require serializable consistency, so more relaxed forms of consistency known as eventual consistency
are used.
3. Replication Models: Two major replication models used in NOSQL systems are master-slave and
master-master replication. Master-slave replication requires one copy to be the master copy; all write
operations must be applied to the master copy and then propagated to the slave copies, usually using
eventual consistency (the slave copies will eventually be the same as the master copy). For read, the
master-slave paradigm can be configured in various ways. One configuration requires all reads to also
be at the master copy, so this would be similar to the primary site or primary copy methods of
distributed concurrency control, with similar advantages and disadvantages. Another configuration
would allow reads at the slave copies but would not guarantee that the values are the latest writes,
since writes to the slave nodes can be done after they are applied to the master copy. The master-
master replication allows reads and writes at any of the replicas but may not guarantee that reads at
nodes that store different copies see the same values. Different users may write the same data item
concurrently at different nodes of the system, so the values of the item will be temporarily inconsistent.
A reconciliation method to resolve conflicting write operations of the same data item at different nodes
must be implemented as part of the master-master replication scheme.
4. Sharding of Files: In many NOSQL applications, files (or collections of data objects) can have many
millions of records (or documents or objects), and these records can be accessed concurrently by
thousands of users. So it is not practical to store the whole file in one node. Sharding of the file records
is often employed in NOSQL systems. This serves to distribute the load of accessing the file records to
multiple nodes. The combination of sharding the file records and replicating the shards works in
tandem to improve load balancing as well as data availability.
5. High-performance data access: To find an individual record or object among millions of data records or documents,
NOSQL systems typically use either hashing or range partitioning on the object keys; the key of an object is used to
determine the node and location where the object is stored, and for applications that require range queries on the keys,
range partitioning is preferred. Other indexes can also be used to locate objects based on attribute conditions different from
the key K.
2) NOSQL characteristics related to data models and query languages: NOSQL systems emphasize
performance and flexibility over modeling power and complex querying.
1. Not Requiring a Schema: The flexibility of not requiring a schema is achieved in many NOSQL
systems by allowing semi-structured, self-describing data (a small JSON example is given after this list). The users can specify a partial schema in
some systems to improve storage efficiency, but it is not required to have a schema in most of the
NOSQL systems. As there may not be a schema to specify constraints, any constraints on the data
would have to be programmed in the application programs that access the data items. There are various
languages for describing semistructured data, such as JSON (JavaScript Object Notation) and XML
(Extensible Markup Language). JSON is used in several NOSQL systems, but other methods for
describing semi-structured data can also be used.
2. Less Powerful Query Languages: Many applications that use NOSQL systems may not require a
powerful query language such as SQL, because search (read) queries in these systems often locate
single objects in a single file based on their object keys. NOSQL systems typically provide a set of
functions and operations as a programming API (application programming interface), so reading and
writing the data objects is accomplished by calling the appropriate operations by the programmer. In
many cases, the operations are called CRUD operations, for Create, Read, Update, and Delete. In other
cases, they are known as SCRUD because of an added Search (or Find) operation. Some NOSQL
systems also provide a high-level query language, but it may not have the full power of SQL; only a
subset of SQL querying capabilities would be provided. In particular, many NOSQL systems do not
provide join operations as part of the query language itself; the joins need to be implemented in the
application programs.
3. Versioning: Some NOSQL systems provide storage of multiple versions of the data items, with the
timestamps of when the data version was created.
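As mentioned under item 1 above, here is a small illustrative JSON example of schema-less, self-describing records; the field names and values are made up.

import json

posts = [
    {"_id": "post1", "author": "u17", "text": "hello", "tags": ["intro"]},
    {"_id": "post2", "author": "u42", "image_url": "http://example.com/x.jpg", "likes": 12},
]
# The two records have different fields; without a schema this is not a violation.
print(json.dumps(posts, indent=2))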
NOSQL systems are usually divided into the following categories:
1. Document-based NOSQL systems: These systems store data in the form of documents using well-known formats, such as
JSON (JavaScript Object Notation). Documents are accessible via their
document id, but can also be accessed rapidly using other indexes.
2. NOSQL key-value stores: These systems have a simple data model based on fast access by the key
to the value associated with the key; the value can be a record or an object or a document or even
have a more complex data structure.
3. Column-based or wide column NOSQL systems: These systems partition a table by column into
column families (a form of vertical partitioning) where each column family is stored in its own
files. They also allow versioning of data values.
4. Graph-based NOSQL systems: Data is represented as graphs, and related nodes can be found by
traversing the edges using path expressions.
Additional categories can be added as follows to include some systems that are not easily categorized
into the above four categories, as well as some other types of systems that have been available even
before the term NOSQL became widely used.
5. Hybrid NOSQL systems: These systems have characteristics from two or more of the above four
categories.
6. Object databases
7. XML databases
Even keyword-based search engines store large amounts of data with fast search access, so the stored
data can be considered as large NOSQL big data stores.
CAP THEOREM
❖ Concurrency control in distributed databases (DDBS) is required to enforce the ACID properties
(atomicity, consistency, isolation, durability) of transactions that are running concurrently.
❖ In a system with data replication, concurrency control becomes more complex because there can be
multiple copies of each data item. So if an update is applied to one copy of an item, it must be applied
to all other copies in a consistent manner.
❖ The possibility exists that one copy of an item X is updated by a transaction T1 whereas another copy
is updated by a transaction T2, so two inconsistent copies of the same item exist at two different nodes
in the distributed system. If two other transactions T3 and T4 want to read X, each may read a different
copy of item X. There are distributed concurrency control methods that do not allow this inconsistency
among copies of the same data item, thus enforcing serializability and hence the isolation property in the presence of
multiple copies of the data items.
❖ The three desirable properties referred to in the CAP theorem are consistency, availability, and partition tolerance.
Availability means that each read or write request for a data item will either be processed successfully or will receive
a message that the operation cannot be completed.
❖ Partition tolerance means that the system can continue operating if the network connecting the nodes
has a fault that results in two or more partitions, where the nodes in each partition can only
communicate among each other.
❖ Consistency means that the nodes will have the same copies of a replicated data item visible for
various transactions.
❖ The word consistency in CAP and its use in ACID do not refer to exactly the same concept. In CAP,
the term consistency refers to the consistency of the values in different copies of the same data item in
a replicated distributed system. In ACID, it refers to the fact that a transaction will not violate the
integrity constraints specified on the database schema. However, if we consider that the consistency of
replicated copies is a specified constraint, then the two uses of the term consistency would be related.
❖ The CAP theorem states that it is not possible to guarantee all three of the desirable properties—
consistency, availability, and partition tolerance—at the same time in a distributed system with data
replication. If this is the case, then the distributed system designer would have to choose two
properties out of the three to guarantee. It is generally assumed that in many traditional (SQL)
applications, guaranteeing consistency through the ACID properties is important.
❖ In a NOSQL distributed data store, a weaker consistency level is often acceptable, and guaranteeing
the other two properties (availability, partition tolerance) is important. Hence, weaker consistency
levels are often used in NOSQL systems instead of guaranteeing serializability. In particular, a form of
consistency known as eventual consistency is often adopted in NOSQL systems.
Questions :
1. For which types of applications were NOSQL systems developed?
2. What are the main categories of NOSQL systems? List a few of the NOSQL systems in each
category.
3. What are the main characteristics of NOSQL systems in the areas related to data models and
query languages?
4. What are the main characteristics of NOSQL systems in the areas related to distributed
systems and distributed databases?
5. What is the CAP theorem? Which of the three properties (consistency, availability, partition
tolerance) are most important in NOSQL systems?
6. What are the similarities and differences between using consistency in CAP versus using
consistency in ACID?
❖ Example based on COMPANY database: The operation createCollection is used to create each
collection. For example, the following command can be used to create a collection called project to
hold PROJECT objects from the COMPANY database:
db.createCollection("project", { capped : true, size : 1310720, max : 500 } )
The first parameter “project” is the name of the collection, which is followed by an optional document
that specifies collection options. In the example, the collection is capped; this means it has upper limits
on its storage space (size) and number of documents (max). The capping parameters help the system
choose the storage options for each collection.
❖ For the example, create another document collection called worker to hold information about the
EMPLOYEEs who work on each project. For example:
db.createCollection("worker", { capped : true, size : 5242880, max : 2000 } )
❖ Each document in a collection has a unique ObjectId field, called _id, which is automatically
indexed in the collection unless the user explicitly requests no index for the _id field. The value of
ObjectId can be specified by the user, or it can be system-generated if the user does not specify an _id
field for a particular document.
▪ System-generated ObjectIds have a specific format, which combines the timestamp when the
object is created (4 bytes, in an internal MongoDB format), the node id (3 bytes), the process id
(2 bytes), and a counter (3 bytes) into a 12-byte Id value.
▪ User-generated ObjectIds can have any value specified by the user as long as it uniquely
identifies the document and so these Ids are similar to primary keys in relational systems.
❖ A collection does not have a schema. The structure of the data fields in documents is chosen based on
how documents will be accessed and used, and the user can choose a normalized design (similar to
normalized relational tuples) or a denormalized design (similar to XML documents or complex
objects). Interdocument references can be specified by storing in one document the ObjectId or
ObjectIds of other related documents.
❖ Figure 10(a) shows a simplified MongoDB document showing some of the data from the COMPANY
database example. In the example, the _id values are user-defined, and the documents whose _id starts
with P (for project) will be stored in the “project” collection, whereas those whose _id starts with W
(for worker) will be stored in the “worker” collection. In Figure 10(a), the workers information is
embedded in the project document; so there is no need for the “worker” collection. This is known as
the denormalized pattern, which is similar to creating a complex object or an XML document. A list of
values that is enclosed in square brackets [ … ] within a document represents a field whose value is an
array.
❖ Another option is to use the design in Figure 10(b), where worker references are embedded in the
project document, but the worker documents themselves are stored in a separate “worker” collection.
❖ A third option in Figure 10(c) would use a normalized design, similar to First Normal Form relations.
The choice of which design option to use depends on how the data will be accessed. The design in
Figure 10(c) is not the general normalized design for a many-to-many relationship, such as the one
between employees and projects; rather, we would need three collections for “project”, “employee”,
and “works_on”.
❖ In the design in Figure 10(c), an EMPLOYEE who works on several projects would be represented by
multiple worker documents with different _id values; each document would represent the employee as
worker for a particular project. This is similar to the design decisions for XML schema design. The
typical document-based system does not have a schema, so the design rules would have to be followed
whenever individual documents are inserted into a collection.
Figure 10: Example of simple documents in MongoDB. (a) Denormalized document design with embedded
subdocuments. (b) Embedded array of document references. (c) Normalized documents. (d) Inserting the
documents in (c) into their collections.
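A hedged Python sketch (assuming a local MongoDB server and the pymongo driver) of storing and reading back a denormalized project document in the style of Figure 10(a); the field names and values are illustrative.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["company"]

db["project"].insert_one({
    "_id": "P1",
    "Pname": "ProductX",
    "Plocation": "Bellaire",
    "Workers": [                       # embedded array of worker subdocuments, as in Figure 10(a)
        {"Ename": "John Smith", "Hours": 32.5},
        {"Ename": "Joyce English", "Hours": 20.0},
    ],
})

print(db["project"].find_one({"_id": "P1"}))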
❖ In MongoDB, individual CRUD operations on a single document are atomic; when atomicity across multiple documents is
required, a two-phase commit method is used to ensure atomicity and consistency of multidocument transactions.
Replication in MongoDB: The concept of replica set is used in MongoDB to create multiple copies of
the same data set on different nodes in the distributed system, and it uses a variation of the master-slave
approach for replication.
Example: to replicate a particular document collection C, replica set will have one primary copy of the
collection C stored in one node N1, and at least one secondary copy (replica) of C stored at another
node N2. Additional copies can be stored in nodes N3, N4, etc., as needed, but the cost of storage and
update (write) increases with the number of replicas. The total number of participants in a replica set
must be at least three, so if only one secondary copy is needed, a participant in the replica set known as
an arbiter must run on the third node N3. The arbiter does not hold a replica of the collection but
participates in elections to choose a new primary if the node storing the current primary copy fails. If
the total number of members in a replica set is n (one primary plus i secondaries, for a total of n = i + 1),
then n must be an odd number; if it is not, an arbiter is added to ensure the election process works
correctly if the primary fails.
In MongoDB replication, all write operations must be applied to the primary copy and then propagated
to the secondaries. For read operations, the user can choose the particular read preference for their
application. The default read preference processes all reads at the primary copy, so all read and write
operations are performed at the primary node. In this case, secondary copies are mainly to make sure
that the system continues operation if the primary fails, and MongoDB can ensure that every read
request gets the latest document value. To increase read performance, it is possible to set the read
preference so that read requests can be processed at any replica (primary or secondary); however, a read
at a secondary is not guaranteed to get the latest version of a document because there can be a delay in
propagating writes from the primary to the secondaries.
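A hedged pymongo sketch of setting the read preference when connecting to a replica set; the host names and the replica-set name are hypothetical.

from pymongo import MongoClient

# Hypothetical three-node replica set "rs0"; reads may go to a secondary and be slightly stale.
client = MongoClient(
    "mongodb://node1:27017,node2:27017,node3:27017/"
    "?replicaSet=rs0&readPreference=secondaryPreferred"
)
doc = client["company"]["project"].find_one({"_id": "P1"})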
Sharding in MongoDB: When a collection holds a very large number of documents or requires a large
storage space, storing all the documents in one node can lead to performance problems, particularly if
there are many user operations accessing the documents concurrently using various CRUD operations.
▪ Sharding of the documents in the collection (also known as horizontal partitioning) divides the
documents into disjoint partitions known as shards. This allows the system to add more nodes as
needed by a process known as horizontal scaling of the distributed system, and to store the shards
of the collection on different nodes to achieve load balancing. Each node will process only those
operations pertaining to the documents in the shard stored at that node. Also, each shard will
contain fewer documents than if the entire collection were stored at one node, thus further
improving performance.
▪ Two ways to partition a collection into shards in MongoDB are range partitioning and hash
partitioning. Both require that the user specify a particular document field to be used as the basis
for partitioning the documents into shards. The partitioning field known as the shard key in
MongoDB must have two characteristics: it must exist in every document in the collection, and it
must have an index. The ObjectId can be used, but any other field possessing these two
characteristics can also be used as the basis for sharding. The values of the shard key are divided
into chunks either through range partitioning or hash partitioning, and the documents are partitioned
based on the chunks of shard key values.
▪ Range partitioning creates the chunks by specifying a range of key values; for example, if the shard
key values ranged from one to ten million, it is possible to create ten ranges—1 to 1,000,000;
1,000,001 to 2,000,000; . . . ; 9,000,001 to 10,000,000—and each chunk would contain the key
values in one range. Hash partitioning applies a hash function h(K) to each shard key K, and the
partitioning of keys into chunks is based on the hash values. In general, if range queries are
commonly applied to a collection (for example, retrieving all documents whose shard key value is
between 200 and 400), then range partitioning is preferred because each range query will typically
be submitted to a single node that contains all the required documents in one shard. If most searches
retrieve one document at a time, hash partitioning may be preferable because it randomizes the
distribution of shard key values into chunks (a sketch of both options appears at the end of this subsection).
▪ When sharding is used, MongoDB queries are submitted to a module called the query router,
which keeps track of which nodes contain which shards based on the particular partitioning method
used on the shard keys. The query (CRUD operation) will be routed to the nodes that contain the
shards that hold the documents that the query is requesting. If the system cannot determine which
shards hold the required documents, the query will be submitted to all the nodes that hold shards of
the collection.
Sharding and replication are used together; sharding focuses on improving performance via load
balancing and horizontal scalability, whereas replication focuses on ensuring system availability when
certain nodes fail in the distributed system.
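Example (illustrative sketch): the commands below, issued through pymongo against a mongos query router of a sharded cluster (an assumption; the database, collection, and shard key names are hypothetical), show how a collection can be sharded with a range-partitioned or a hash-partitioned shard key.

from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")   # connect to the query router

client.admin.command("enableSharding", "company")     # allow sharding for the database

# Range partitioning: chunks are contiguous ranges of ProjectId values,
# so a range query is usually routed to a single shard.
client.admin.command("shardCollection", "company.project", key={"ProjectId": 1})

# Hash partitioning (alternative): chunks are based on h(WorkerId), which spreads
# the keys evenly and suits workloads that retrieve one document at a time.
client.admin.command("shardCollection", "company.worker", key={"WorkerId": "hashed"})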
NOSQL Key-Value Stores
The data model used in key-value stores is relatively simple, and in many of these systems, there is no
query language but rather a set of operations that can be used by the application programmers.
The key is a unique identifier associated with a data item and is used to locate this data item
rapidly.
The value is the data item and it can have very different formats for different key-value storage
systems.
➢ In some systems, the value is just a string or array of bytes, and the application using the
key-value store has to interpret the structure of the data value.
➢ In other systems, standard formatted data is allowed;
Example: structured data rows (tuples) similar to relational data, or semistructured data
using JSON or some other self-describing data format.
The main characteristic of key-value stores is the fact that every value (data item) must be
associated with a unique key and that retrieving the value by supplying the key must be very fast.
1. DynamoDB Overview
The DynamoDB system is an Amazon product and is available as part of Amazon’s AWS/SDK
platforms (Amazon Web Services/Software Development Kit).
It can be used as part of Amazon’s cloud computing services, for the data storage component.
DynamoDB data model. The basic data model in DynamoDB uses the concepts of tables, items, and
attributes. A table in DynamoDB does not have a schema; it holds a collection of self-describing items.
Each item will consist of a number of (attribute, value) pairs, and attribute values can be single-valued
or multivalued. A table will hold a collection of items, and each item is a self-describing record
(or object). DynamoDB also allows the user to specify the items in JSON format, and the
system will convert them to the internal storage format of DynamoDB. When a table is created, it is
required to specify a table name and a primary key; the primary key will be used to rapidly locate the
items in the table. Thus, the primary key is the key and the item is the value for the DynamoDB
key-value store. The primary key attribute must exist in every item in the table. The primary key can be one
of the following two types:
■ A single attribute.
The DynamoDB system will use this attribute to build a hash index on the items in the table.
The items are not ordered in storage by the value of the hash attribute.
■ A pair of attributes.
The primary key will be a pair of attributes (A, B): attribute A will be used for hashing, and
because there can be multiple items with the same value of A, the B values will be used for
ordering the records that share the same A value.
A table with this type of key can have additional secondary indexes defined on its attributes.
For example, if we want to store multiple versions of some type of items in a table, we could use
ItemID as hash and Date or Timestamp (when the version was created) as range in a hash and
range type primary key.
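Example (illustrative sketch): a hash-and-range table of this kind could be created with boto3, the AWS SDK for Python (an assumption; the table, attribute names, and values are hypothetical, and AWS credentials are assumed to be configured).

import boto3

dynamodb = boto3.resource("dynamodb")

# ItemId is the hash attribute and CreatedAt is the range attribute,
# so multiple versions of the same item are ordered by creation time.
table = dynamodb.create_table(
    TableName="ItemVersions",
    KeySchema=[
        {"AttributeName": "ItemId", "KeyType": "HASH"},
        {"AttributeName": "CreatedAt", "KeyType": "RANGE"},
    ],
    AttributeDefinitions=[
        {"AttributeName": "ItemId", "AttributeType": "S"},
        {"AttributeName": "CreatedAt", "AttributeType": "N"},
    ],
    BillingMode="PAY_PER_REQUEST",
)
table.wait_until_exists()

# Items are self-describing (attribute, value) pairs; no schema beyond the key.
table.put_item(Item={"ItemId": "doc-17", "CreatedAt": 1700000000, "Status": "draft"})
item = table.get_item(Key={"ItemId": "doc-17", "CreatedAt": 1700000000})["Item"]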
2. Voldemort Key-Value Distributed Data Store
Voldemort keeps a collection of (key, value) pairs in a store; assume the store is called s. The basic
interface for data storage and retrieval is very simple and includes three operations: get, put, and delete.
The operation s.put(k, v) inserts an item as a key-value pair with key k and value v.
The operation s.delete(k) deletes the item whose key is k from the store, and the operation v =
s.get(k) retrieves the value v associated with key k.
The application can use these basic operations to implement whatever functionality it requires. At the
basic storage level, both keys and values are arrays of bytes (strings).
The values v in the (k, v) items can be specified in JSON (JavaScript Object Notation), and the
system will convert between JSON and the internal storage format.
The Serializer class must be provided by the user and will include operations to convert the user
format into a string of bytes for storage as a value, and to convert back a string (array of bytes)
retrieved via s.get(k) into the user format.
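Example (illustrative sketch): the toy Python store below is not the Voldemort API itself; it only mimics the get/put/delete interface and the role of a user-supplied serializer that converts between the application format and the byte strings kept as values.

import json

class JsonSerializer:
    # Converts between the application's dict format and the bytes stored as values.
    def to_bytes(self, obj):
        return json.dumps(obj).encode("utf-8")

    def to_object(self, raw):
        return json.loads(raw.decode("utf-8"))

class KeyValueStore:
    # Toy in-memory store exposing the get/put/delete interface described above.
    def __init__(self, serializer):
        self._data = {}
        self._ser = serializer

    def put(self, k, v):
        self._data[k] = self._ser.to_bytes(v)      # values are kept as byte arrays

    def get(self, k):
        return self._ser.to_object(self._data[k])

    def delete(self, k):
        self._data.pop(k, None)

s = KeyValueStore(JsonSerializer())
s.put("emp:101", {"Ename": "John Smith", "Dno": 5})
print(s.get("emp:101"))        # {'Ename': 'John Smith', 'Dno': 5}
s.delete("emp:101")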
A variation of the data distribution algorithm known as consistent hashing is used in Voldemort
for data distribution among the nodes in the distributed cluster of nodes.
A hash function h(k) is applied to the key k of each (k, v) pair, and h(k) determines where the
item will be stored.
The method assumes that h(k) is an integer value, usually in the range 0 to Hmax = 2^n − 1, where
n is chosen based on the desired range for the hash values.
This method is best visualized by considering the range of all possible integer hash values 0 to
Hmax to be evenly distributed on a circle (or ring).
The nodes in the distributed system are then also located on the same ring; usually each node
will have several locations on the ring (see the figure). The positioning of the points on the ring that
represent the nodes is done in a pseudorandom manner.
An item (k, v) will be stored on the node whose position in the ring follows the position of h(k)
on the ring in a clockwise direction.
In Figure (a), we assume there are three nodes in the distributed cluster labeled A, B, and C,
where node C has a bigger capacity than nodes A and B.
On the circle, two instances each of A and B are placed, and three instances of C (because of its
higher capacity), in a pseudorandom manner to cover the circle.
Figure (a) indicates which (k, v) items are placed in which nodes based on the h(k) values.
The h(k) values that fall in the parts of the circle marked as range 1 in Figure (a) will have their
(k, v) items stored in node A because that is the node whose label follows h(k) on the ring in a
clockwise direction;
Those in range 2 are stored in node B; and those in range 3 are stored in node C.
This scheme allows horizontal scalability because when a new node is added to the distributed
system, it can be added in one or more locations on the ring depending on the node capacity. Only
a limited percentage of the (k, v) items will be reassigned to the new node from the existing
nodes based on the consistent hashing placement algorithm. Also, those items assigned to the new
node may not all come from only one of the existing nodes because the new node can have
multiple locations on the ring.
For example, if a node D is added and it has two placements on the ring as shown in Figure (b),
then some of the items from nodes B and C would be moved to node D.
The items whose keys hash to range 4 on the circle (Figure (b)) would be migrated to node
D. This scheme also allows replication by placing the specified number of replicas of an item on
successive nodes on the ring in a clockwise direction. Sharding is built into the method:
different items in the store (file) are located on different nodes in the distributed cluster, which
means the items are horizontally partitioned (sharded) among the nodes in the distributed system.
When a node fails, its load of data items can be distributed to the other existing nodes whose
labels follow the labels of the failed node in the ring.
Nodes with higher capacity can have more locations on the ring, as illustrated by node C in
Figure (a), and thus store more items than smaller-capacity nodes.
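Example (illustrative sketch): a minimal consistent-hashing ring in Python, assuming MD5 as the hash function and the node names A, B, and C from the figure; the number of ring positions per node models its capacity.

import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes_with_positions):
        # nodes_with_positions: dict mapping node name -> number of ring positions
        self.ring = []                       # sorted list of (hash value, node) pairs
        for node, positions in nodes_with_positions.items():
            for i in range(positions):
                bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    @staticmethod
    def _hash(key):
        # Map a key to an integer position on the ring
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        # The item is stored on the first node clockwise from h(key)
        idx = bisect.bisect_right(self.ring, (self._hash(key), ""))
        if idx == len(self.ring):            # wrap around the ring
            idx = 0
        return self.ring[idx][1]

# Node C gets more ring positions because of its higher capacity
ring = ConsistentHashRing({"A": 2, "B": 2, "C": 3})
print(ring.node_for("item-42"))              # -> one of A, B, or C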
Voldemort uses a method similar to the one developed for DynamoDB for consistency in the
presence of replicas.
Concurrent write operations by different processes are allowed, so when items are replicated,
two or more different values can be associated with the same key at different nodes.
Consistency is achieved when the item is read by using a technique known as versioning and
read repair.
Concurrent writes are allowed, but each write is associated with a vector clock value.
When a read occurs, it is possible that different versions of the same value (associated with the
same key) are read from different nodes.
If the system can reconcile the versions to a single final value, it passes that value to the read;
otherwise, more than one version can be passed back to the application, which reconciles the
various versions into one version based on the application semantics and gives this reconciled
value back to the nodes.
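Example (illustrative sketch): the Python fragment below is only a toy illustration of vector-clock reconciliation; a version is discarded during read repair if some other version dominates it (is at least as new on every node), and truly concurrent versions are all returned to the application.

def dominates(vc1, vc2):
    # True if vector clock vc1 is at least as new as vc2 on every node
    nodes = set(vc1) | set(vc2)
    return all(vc1.get(n, 0) >= vc2.get(n, 0) for n in nodes)

def reconcile(versions):
    # versions: list of (vector clock, value) pairs read from different replicas
    return [
        (vc, val) for vc, val in versions
        if not any(dominates(other, vc) and other != vc for other, _ in versions)
    ]

print(reconcile([({"A": 2, "B": 1}, "v2"), ({"A": 1, "B": 1}, "v1")]))
# -> [({'A': 2, 'B': 1}, 'v2')]   the dominated version is repaired away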
Oracle key-value store. Oracle has one of the well-known SQL relational database systems, and Oracle
also offers a system based on the key-value store concept; this system is called the Oracle NoSQL
Database.
Redis key-value cache and store. Redis differs from the other systems because it caches its data in
main memory to further improve performance.
It offers master-slave replication and high availability, and it also offers persistence by backing up the
cache to disk.
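Example (illustrative sketch): basic Redis usage through the redis-py client (an assumption; the host and key names are hypothetical).

import redis

r = redis.Redis(host="localhost", port=6379)
r.set("emp:101:name", "John Smith")      # value is held in main memory on the server
print(r.get("emp:101:name"))             # b'John Smith'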
Apache Cassandra. Cassandra is a NOSQL system that is not easily categorized into one category; it is
sometimes characterized as a column-based system and sometimes as a key-value store.
Questions :
1. What are the data modeling concepts used in MongoDB? What are the main CRUD operations
of MongoDB?
2. Discuss how replication and sharding are done in MongoDB.
3. Discuss the data modeling concepts in DynamoDB.
4. Describe the consistent hashing schema for data distribution, replication, and sharding. How are
consistency and versioning handled in Voldemort?
Column-Based or Wide Column NOSQL Systems
The Google distributed storage system for big data, known as BigTable, is a well-known example of
this class of NOSQL systems and it is used in many Google applications that require large amounts of
data storage, such as Gmail.
BigTable uses the Google File System (GFS) for data storage and distribution. An open source system
known as Apache Hbase is similar to Google BigTable, but it uses HDFS (Hadoop Distributed File
System) for data storage. HDFS is used in many cloud computing applications.
Hbase can also use Amazon’s Simple Storage System (known as S3) for data storage.
Cassandra is an example of a column-based NOSQL system, but it can also be characterized as a
key-value store.
BigTable (and Hbase) is described as a sparse multidimensional distributed persistent sorted map,
where the word map means a collection of (key, value) pairs (the key is mapped to the value).
One of the main differences that distinguish column-based systems from key-value stores is the
nature of the key.
In column-based systems such as Hbase, the key is multidimensional and has several
components:
a combination of table name, row key, column, and timestamp.
The column is composed of two components: column family and column qualifier.
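Example (illustrative sketch): conceptually, a column-based store can be viewed as a map from a multidimensional key to a value; the Python dictionary below is only an illustration, with hypothetical table, row, column family, qualifier, and timestamp values.

cell_store = {
    # (table, row key, column family, column qualifier, timestamp) -> value
    ("EMPLOYEE", "row1", "Name", "Fname", 1416902396091): "John",
    ("EMPLOYEE", "row1", "Name", "Lname", 1416902396091): "Smith",
    ("EMPLOYEE", "row1", "Address", "City", 1416902396091): "Houston",
}
print(cell_store[("EMPLOYEE", "row1", "Name", "Fname", 1416902396091)])   # John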
1. Hbase Data Model and Versioning
• Hbase data model.
The data model in Hbase organizes data using the concepts of namespaces, tables, column
families, column qualifiers, columns, rows, and data cells.
The concept of column family is somewhat similar to vertical partitioning, because columns
(attributes) that belong to the same column family, and so tend to be accessed together, are
stored in the same files. Each column family of a table is stored in its own files using the
HDFS file system.
• Versions and Timestamps.
Hbase can keep several versions of a data item, along with the timestamp associated with each
version.
The timestamp is a long integer number that represents the system time when the version was
created, so newer versions have larger timestamp values.
Hbase uses midnight ‘January 1, 1970 UTC’ as timestamp value zero, and uses a long integer
that measures the number of milliseconds since that time as the system timestamp value.
It is also possible for the user to define the timestamp value explicitly in a Date format rather
than using the system-generated timestamp.
If the timestamp is left out of a retrieval request, the latest version of the item is retrieved unless a
default number of versions is specified, say the latest three versions.
The default number of versions to be retrieved, as well as the default number of versions that
the system needs to keep, are parameters that can be specified during table creation.
• Namespaces. A namespace is a collection of tables.
A namespace basically specifies a collection of one or more tables that are typically used
together by user applications, and it corresponds to a database that contains a collection of
tables in relational terminology.
2. Hbase CRUD Operations
Hbase only provides low-level CRUD operations. It is the responsibility of the application
programs to implement more complex operations, such as joins between rows in different tables.
The create operation creates a new table and specifies one or more column families associated with
that table.
The put operation is used for inserting new data or new versions of existing data items.
The get operation is for retrieving the data associated with a single row in a table, and the scan
operation retrieves all the rows.
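Example (illustrative sketch): the create, put, get, and scan operations just described, issued from Python through the third-party happybase client (an assumption; it requires an Hbase Thrift server, and the table, column family, and row key names are hypothetical).

import happybase

conn = happybase.Connection("localhost")

# create: a new table with two column families; keep up to 3 versions of Name cells
conn.create_table("employee", {"Name": dict(max_versions=3), "Address": dict()})
table = conn.table("employee")

# put: insert new data (or new versions of existing cells)
table.put(b"row1", {b"Name:Fname": b"John",
                    b"Name:Lname": b"Smith",
                    b"Address:City": b"Houston"})

# get: retrieve the data of a single row
print(table.row(b"row1"))

# scan: iterate over all the rows of the table
for row_key, data in table.scan():
    print(row_key, data)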
3. Hbase Storage and Distributed System Concepts
Each Hbase table is divided into regions, where each region holds a range of the row keys in the table;
a region is assigned to a region server (node) to manage. Each region will have a number of stores,
where each column family is assigned to one store within the region.
A master server (master node) is responsible for monitoring the region servers and for splitting a table
into regions and assigning regions to region servers.
Hbase uses the Apache Zookeeper open source system for services related to managing the naming,
distribution, and synchronization of the Hbase data on the distributed Hbase server nodes, as well as for
related coordination services.
Hbase uses Apache HDFS (Hadoop Distributed File System) for distributed file services. Hbase
is built on top of both HDFS and Zookeeper.
Zookeeper can have several replicas on several nodes for availability, and it keeps the data it
needs in main memory to speed access to the master servers and region servers.
NOSQL Graph Databases and Neo4j
Neo4j is an open source graph database system; its data model organizes data using the concepts of
nodes and relationships, and both nodes and relationships can have properties.
Nodes can have labels; the nodes that have the same label are grouped into a collection that
identifies a subset of the nodes in the database graph for querying purposes.
A node can have zero, one, or several labels. Relationships are directed; each relationship has a start
node and end node as well as a relationship type, which serves a similar role to a node label by
identifying similar relationships that have the same relationship type.
Properties can be specified via a map pattern, which is made of one or more “name : value” pairs
enclosed in curly brackets; for example {Lname : ‘Smith’, Fname : ‘John’, Minit : ‘B’}. In
conventional graph theory, nodes and relationships are generally called vertices and edges.
Comparing the Neo4j graph model with ER/EER concepts, nodes correspond to entities, node labels
correspond to entity types and subclasses, relationships correspond to relationship instances,
relationship types correspond to relationship types, and properties correspond to attributes.
One main difference is that the Neo4j graph model serves as the basis for an actual high-performance
distributed database system, whereas the ER/EER model is mainly used for database design.
Nodes and relationships can be created by calling appropriate Neo4j operations from the various
Neo4j APIs. The high-level syntax for creating nodes and relationships uses the Neo4j CREATE
command, which is part of the high-level declarative query language Cypher.
When a node is created, the node label can be specified. It is also possible to create nodes
without any labels. In Figure (a), the node labels are EMPLOYEE, DEPARTMENT, PROJECT,
and LOCATION, and the created nodes correspond to some of the data from the COMPANY
database in the figure below, with a few modifications; for example, EmpId is used instead of SSN,
and only a small subset of the data is used for illustration purposes.
Properties are enclosed in curly brackets { … }. It is possible that some nodes have multiple
labels;
Example: the same node can be labeled as PERSON and EMPLOYEE and MANAGER by
listing all the labels separated by the colon symbol as follows:
PERSON:EMPLOYEE:MANAGER.
■ Relationships and relationship types. Figure (b) shows example relationships in Neo4j based
on the COMPANY database in the figure above.
The → specifies the direction of the relationship, but the relationship can be traversed in either
direction.
The relationship types (labels) in Figure (b) are WorksFor, Manager, LocatedIn, and WorksOn;
only relationships with the relationship type WorksOn have properties (Hours) in Figure (b).
■ Paths.
A path is typically specified by a start node, followed by one or more relationships, leading to
one or more end nodes that satisfy the pattern.
Graphs can be created and used without a schema, but in Neo4j version 2.0, a few schema-
related functions were added.
The main features related to schema creation involve creating indexes and constraints based on
the labels and properties.
For example, it is possible to create the equivalent of a key constraint on a property of a label, so
all nodes in the collection of nodes associated with the label must have unique values for that
property.
When a node is created, the Neo4j system creates an internal unique system-defined identifier for
each node.
To retrieve individual nodes using other properties of the nodes efficiently, the user can create
indexes for the collection of nodes that have a particular label.
For example, Empid can be used to index nodes with the EMPLOYEE label, Dno to index the
nodes with the DEPARTMENT label, and Pno to index the nodes with the PROJECT label.
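Example (illustrative sketch): the Python fragment below uses the official neo4j driver (an assumption; the URI, credentials, and data values are hypothetical and only loosely follow the COMPANY example). The constraint and index statements use the older Cypher syntax that matches the Neo4j 2.x-era schema features described above.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Key-like constraint and an index on label properties (optional schema features)
    session.run("CREATE CONSTRAINT ON (e:EMPLOYEE) ASSERT e.Empid IS UNIQUE")
    session.run("CREATE INDEX ON :DEPARTMENT(Dno)")

    # Nodes with labels; properties are given as a map pattern in curly brackets
    session.run("CREATE (:EMPLOYEE {Empid: 2, Lname: 'Wong', Fname: 'Franklin'})")
    session.run("CREATE (:DEPARTMENT {Dno: 5, Dname: 'Research'})")

    # A directed relationship of type WorksFor between the two nodes
    session.run(
        "MATCH (e:EMPLOYEE {Empid: 2}), (d:DEPARTMENT {Dno: 5}) "
        "CREATE (e)-[:WorksFor]->(d)")

driver.close()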
2. The Cypher Query Language of Neo4j
Neo4j has a high-level query language, Cypher.
There are declarative commands for creating nodes and relationships (Figures (a) and (b)) and
for finding nodes and relationships based on specifying patterns.
A Cypher query is made up of clauses. When a query has several clauses, the result from one
clause can be the input to the next clause in the query.
Query 1 in Figure (d) shows the query that retrieves the locations for department number 5.
MATCH specifies the pattern and the query variables (d and loc), and RETURN specifies the query
result to be retrieved by referring to the query variables.
Query 2 has three variables (e, w, and p), and returns the projects and hours per week that the
employee with Empid = 2 works on.
Query 3 returns the employees who work on the project with Pno = 2, along with their hours per week.
Query 4 illustrates the ORDER BY clause and returns all employees and the projects they work
on, sorted by Ename.
It is also possible to limit the number of returned results by using the LIMIT clause as in query 5,
which only returns the first 10 answers.
Query 6 illustrates the use of WITH and aggregation, although the WITH clause can be used to
separate clauses in a query even if there is no aggregation.
Query 6 also illustrates the WHERE clause to specify additional conditions, and the query
returns the employees who work on more than two projects, and the number of projects each
employee works on.
Query 7 is similar to query 5 but returns the nodes and relationships only, and so the query result
can be displayed as a graph using Neo4j’s visualization tool.
Query 8 shows how to add more properties to a node by adding a Job property to an employee
node.
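Example (illustrative sketch): Cypher queries along the lines of those described above, run through the neo4j Python driver; the property names (Dname, Ename, Pname, Hours, Job) are assumptions based on the COMPANY example.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # MATCH pattern + RETURN: the locations of department number 5
    for rec in session.run(
            "MATCH (d:DEPARTMENT {Dno: 5})-[:LocatedIn]->(loc) "
            "RETURN d.Dname AS dname, loc"):
        print(rec["dname"], rec["loc"])

    # ORDER BY and LIMIT: employees and the projects they work on, first 10 by name
    rows = session.run(
        "MATCH (e:EMPLOYEE)-[w:WorksOn]->(p:PROJECT) "
        "RETURN e.Ename AS ename, p.Pname AS pname, w.Hours AS hours "
        "ORDER BY ename LIMIT 10")
    print(rows.values())

    # WITH + aggregation + WHERE: employees who work on more than two projects
    rows = session.run(
        "MATCH (e:EMPLOYEE)-[:WorksOn]->(p:PROJECT) "
        "WITH e, count(p) AS numOfProjs WHERE numOfProjs > 2 "
        "RETURN e.Ename AS ename, numOfProjs")
    print(rows.values())

    # Adding a property to an existing node (compare query 8)
    session.run("MATCH (e:EMPLOYEE {Empid: 2}) SET e.Job = 'Engineer'")

driver.close()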
Neo4j interfaces and distributed system characteristics. Neo4j has two main versions: the enterprise
edition, which comes with additional capabilities, and the community edition.
Some of the additional features of Neo4j are:
■ Enterprise edition vs. community edition.
Both editions support the Neo4j graph data model and storage system, the Cypher graph query
language, and several other interfaces, including a high-performance native API; language drivers
for several popular programming languages, such as Java, Python, and PHP; and the
REST (Representational State Transfer) API.
The enterprise edition supports additional features for enhancing performance, such as caching
and clustering of data and locking.
Neo4j has a graph visualization interface, so that a subset of the nodes and edges in a database graph
can be displayed as a graph.
■ Master-slave replication. Neo4j can be configured on a cluster of nodes; the data and indexes are
fully replicated on each node in the cluster.
Various ways of synchronizing the data between master and slave nodes can be configured in the
distributed cluster.
■ Caching. A main memory cache can be configured to store the graph data for improved
performance.
Questions :
1. What are the data modeling concepts used in column-based NOSQL systems and Hbase?
2. What are the main CRUD operations in Hbase?
3. Discuss the storage and distributed system methods used in Hbase.
4. What are the data modeling concepts used in the graph-oriented NOSQL system Neo4j?
5. What is the query language for Neo4j?
6. Discuss the interfaces and distributed systems characteristics of Neo4j.