Unit No.
4
Parallel Databases
Unit No.4 Parallel Databases
4.1. Introduction to Parallel Databases
4.2. Parallel Database Architectures
4.3. I/O parallelism
4.4. Inter-query and Intra-query parallelism
4.5. Inter-operational and Intra-operational parallelism
4.6. Key elements of parallel database processing: Speed-up, Scale-up Synchronization and Locking
4.1. Introduction to Parallel Databases
Nowadays organizations need to handle a huge amount of data with a high transfer rate. For such requirements, the client-
server or centralized system is not efficient.
• With the need to improve the efficiency of the system, the concept of the parallel database comes in picture.
• A parallel database system seeks to improve the performance of the system through parallelizing concept.
A parallel database is one which involves multiple processors and working in parallel on the database used to provide the
services. A parallel database system seeks to improve performance through parallelization of various operations like loading
data, building index(Indexing is a data structure technique to efficiently retrieve records from the database files) and
evaluating queries parallel systems improve processing and I/O speeds by using multiple CPU’s and disks in parallel.
Serial Computing
Need :
• Multiple resources like CPUs and Disks are used in parallel. The operations are performed simultaneously, as
opposed to serial processing.
• A parallel server can allow access to a single database by users on multiple machines.
• It also performs many parallelization operations like data loading, query processing, building indexes, and
evaluating queries.
• Parallel databases improve processing and input/output speeds by using multiple CPUs and disks in parallel. Centralized
and client–server database systems are not powerful enough to handle such applications.
• In parallel processing, many operations are performed simultaneously, as opposed to serial processing, in which the
computational steps are performed sequencially.
4.2. Parallel Database Architectures
• The basic idea behind Parallel DB is to carry out evaluation steps in parallel whenever is possible.
• There are many opportunities for parallelism in RDBMS.
3 main architectures have been proposed for building parallel DBMS.
1. Shared Memory
2. Shared Disk
3. Shared Nothing
1. Shared Memory
• Shared-memory or shared-everything style is the traditional approach of implementing an RDBMS on
SMP(shared-memory multiprocessing) hardware.
• The key point of this technique is that a single RDBMS server can probably apply all processors, access all memory, and
access the entire database, thus providing the client with a consistent single system image.
• It is relatively simple to implement and has been very successful up to the point where it runs into the scalability
limitations of the shared-everything architecture.
2. Shared Disk Architecture
• Shared-disk architecture implements a concept of shared ownership of the entire database between RDBMS
servers, each of which is running on a node of a distributed memory system.
• Each RDBMS server can read, write, update, and delete information from the same shared database, which
would need the system to implement a form of a distributed lock manager (DLM).
• DLM components can be found in hardware, the operating system, and separate software layer, all depending on
the system vendor.
3. Shared-Nothing Architecture
• In a shared-nothing distributed memory environment, the data is partitioned across all disks, and the DBMS is
"partitioned" across multiple co-servers, each of which resides on individual nodes of the parallel system and has
an ownership of its disk and thus its database partition.
• A shared-nothing RDBMS parallelizes the execution of a SQL query across multiple processing nodes.
• Each processor has its memory and disk and communicates with other processors by exchanging messages
and data over the interconnection network.
Parallelism in Query in DBMS
Parallelism in a query allows us to parallel execution of multiple queries by decomposing them into the parts that work in
parallel. This can be achieved by shared-nothing architecture. Parallelism is also used in fastening the process of a query
execution as more and more resources like processors and disks are provided. We can achieve parallelism in a query by the
following methods :
1. I/O parallelism
2.Inter-query parallelism
3.Intra-query parallelism
4.Inter-operation parallelism
5.Intra-operation parallelism
1. I/O parallelism
In it simplest form, I/O parallelism refers to reducing the time required to retrieve relations from disk by partitioning the
relations on multiple disks.
• It is a form of parallelism in which the relations are partitioned on multiple disks a motive to reduce the retrieval time of
relations from the disk.
• The most common form of data partitioning in a parallel database environment is horizontal partitioning. In horizontal
partitioning, the tuples of a relation are divided (or declustered) among many disks, so that each tuple resides on one disk.
Within, the data inputted is partitioned and then processing is done in parallel with each partition. The results are merged
after processing all the partitioned data. It is also known as data-partitioning.
Range partitioning –
In range partitioning, it issues continuous attribute value ranges to each disk. For example, we have 3 disks numbered 0,
1, and 2 in range partitioning, and may assign relation with a value that is less than 5 to disk0, values between 5-40 to
disk1, and values that are greater than 40 to disk2. It has some advantages, like it involves placing shuffles containing
attribute values that fall within a certain range on the disk. See figure 1: Range partitioning given below:
•Round-robin partitioning –
In Round Robin partitioning, the relations are studied in any order. The ith tuple is sent to the disk number(i % n). So, disks
take turns receiving new rows of data. This technique ensures the even distribution of tuples across disks and is ideally
suitable for applications that wish to read the entire relation sequentially for each query.
•Schema partitioning –
In schema partitioning, different tables within a database are placed on different disks. See figure 2 below:
2.Inter-query parallelism
In Inter-query parallelism, there is an execution of multiple transactions by each CPU. It is called parallel
transaction processing.
• DBMS uses transaction dispatching to carry inter query parallelism. We can also use some different methods,
like efficient lock management. In this method, each query is run sequentially, which leads to slowing down the
running of long queries.
• In such cases, DBMS must understand the locks held by different transactions running on different processes.
Inter query parallelism on shared disk architecture performs best when transactions that execute in parallel do
not accept the same data.
• Also, it is the easiest form of parallelism in DBMS, and there is an increased transaction throughput.
3.Intra-query parallelism
Intra-query parallelism refers to the execution of a single query in a parallel process on different CPUs using a
shared-nothing paralleling architecture technique. This uses two types of approaches:
• First approach –
In this approach, each CPU can execute the identical task against some data portion.
• Second approach –
In this approach, the task can be divided into different sectors with each CPU executing a distinct subtask.
4. Inter-operation parallelism
When different operations in a query expression are executed in parallel, then it is called inter-operation parallelism. They
are of two types –
•Pipelined parallelism –
In pipeline parallelism, the output row of one operation is consumed by the second operation even before the first
operation has produced the entire set of rows in its output. Also, it is possible to run these two operations simultaneously
on different CPUs, so that one operation consumes tuples in parallel with another operation, reducing them. It is useful for
the small number of CPUs and avoids writing of intermediate results to disk.
•Independent parallelism –
In this parallelism, the operations in query expressions that are not dependent on each other can be executed in parallel.
This parallelism is very useful in the case of the lower degree of parallelism.
5. Intra-operation parallelism
Intra-operation parallelism is a sort of parallelism in which we parallelize the execution of each individual operation of a
task like sorting, joins, projections, and so on. The level of parallelism is very high in intra-operation parallelism. This type
of parallelism is natural in database systems. Let’s take an SQL query example:
SELECT * FROM Vehicles ORDER BY Model_Number;
In the above query, the relational operation is sorting and since a relation can have a large number of records in it, the
operation can be performed on different subsets of the relation in multiple processors, which reduces the time required
to sort the data.
4.6. Key elements of parallel database processing: Speed-up, Scale-up ,Synchronization and Locking
1. Speedup
Speedup is the extent to which more hardware can perform the same task in less time than the original system. With
added hardware, speedup holds the task constant and measures time savings. Figure 1-5 shows how each parallel
hardware system performs half of the original task in half the time required to perform it on a single system.
Figure 1-5 Speedup
With good speedup, additional processors reduce system response time. You can measure speedup using this formula:
2.Scale-up
Scaleup is the factor that expresses how much more work can be done in the same time period by a larger system. With
added hardware, a formula for scaleup holds the time constant, and measures the increased size of the job which can be
done.
Figure 1-6 Scaleup
If transaction volumes grow and you have good scale-up, you can keep response time constant by adding hardware
resources such as CPUs. You can measure scaleup using this formula:
3. Synchronization
Coordination of concurrent tasks is called synchronization. Synchronization is necessary for correctness. The key to successful
parallel processing is to divide tasks so very little synchronization is necessary. The less synchronization necessary, the better
the speedup and scaleup.
4. Locks
Locks are resource control mechanisms that synchronize tasks. Many different types of locking mechanisms are required to
synchronize tasks required by parallel processing.
The Integrated Distributed Lock Manager (Integrated DLM, or IDLM) is the internal locking facility used with OPS. It
coordinates resource sharing between nodes running a parallel server. The instances of a parallel server use the IDLM to
communicate with each other and coordinate modification of database resources. Each node operates independently of
other nodes, except when contending for the same resource.
The IDLM allows applications to synchronize access to resources such as data, software, and peripheral devices, so
concurrent requests for the same resource are coordinated among applications running on different nodes.
Forms of Query Parallism
Interquery Parallism