KEMBAR78
Distributed DBMS Reliability Notes | PDF | Databases | Load Balancing (Computing)
0% found this document useful (0 votes)
99 views10 pages

Distributed DBMS Reliability Notes

Uploaded by

laharipriya6435
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
99 views10 pages

Distributed DBMS Reliability Notes

Uploaded by

laharipriya6435
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 10

III Year I Sem - CSE Distributed Databases - Class Notes 2024-2025

MODULE – IV
Distributed DBMS Reliability: Reliability concepts and measures, fault-tolerance in distributed
systems, failures in Distributed DBMS, local & distributed reliability protocols, site failures and
network partitioning.
Parallel Database Systems: Parallel database system architectures, parallel data placement, parallel
query processing, load balancing, database clusters.

Discuss the reasons for failures in distributed systems and explain the
1 4 L2 4
types of failures in distributed DBMS.

In a distributed database system, we need to deal with four types of failures: transaction failures (aborts), site
(system) failures, media (disk) failures, and communication line failures. Some of these are due to hardware
and others are due to software. The ratio of hardware failures vary from study to study and range from 18%
to over 50%. Soft failures make up more than 90% of all hardware system failures.

The occurrence of soft failures is significantly higher than that of hard failures and most of the software
failures are transient and therefore suggesting that a dump and restart may be sufficient to recover without
any need to “repair” the software.

In a distributed database system, we need to deal with four types of failures:


1. transaction failures (aborts),
2. site (system) failures,
3. media (disk) failures, and
4. communication line failures.
Some of these are due to hardware and others are due to software.

Transaction Failures:
• Transactions can fail for a number of reasons.
• Failure can be due to an error in the transaction caused by incorrect input data as well as the
detection of a present or potential deadlock.
• Furthermore, some concurrency control algorithms do not permit a transaction to proceed or even to
wait if the data that they attempt to access are currently being accessed by another transaction.
• This might also be considered a failure.
• The usual approach to take in cases of transaction failure is to abort the transaction, thus resetting the
database to its state prior to the start of this transaction.

Site (System) Failures:


• The reasons for system failure can be a hardware or software failure.
• A system failure is always assumed to result in the loss of main memory contents.
• Therefore, any part of the database that was in main memory buffers is lost as a result of a system
failure.
• However, the database that is stored in secondary storage is assumed to be safe and correct.
• In distributed database terminology, system failures are typically referred to as site failures, since
they result in the failed site being unreachable from other sites in the distributed system.
• We typically differentiate between partial and total failures in a distributed system.
➢ Total failure refers to the simultaneous failure of all sites in the distributed system;
➢ partial failure indicates the failure of only some sites while the others remain operational.

Media Failures:
• Media failure refers to the failures of the secondary storage devices that store the database.
• Such failures may be due to operating system errors, as well as to hardware faults such as head
crashes or controller failures.

1 P. V Ramana Murthy
LEE; B.E(Comp); M.Tech(CS); (Ph.D(CSE));
Malla Reddy Engineering College (Autonomous)
III Year I Sem - CSE Distributed Databases - Class Notes 2024-2025

• The important point from the perspective of DBMS reliability is that all or part of the database that is
on the secondary storage is considered to be destroyed and inaccessible.
• Duplexing of disk storage and maintaining archival copies of the database are common techniques
that deal with this sort of catastrophic problem.
• Media failures are frequently treated as problems local to one site.

Communication Failures:
• The three types of failures above are common to both centralized and distributed DBMSs.
• Communication failures, however, are unique to the distributed case.
• There are a number of types of communication failures.
• The most common ones are the errors in the messages, improperly ordered messages, lost (or
undeliverable) messages, and communication line failures.
• The first two errors are the responsibility of the computer network.
• Lost or undeliverable messages are typically the consequence of communication line failures or
(destination) site failures.
• If a communication line fails, in addition to losing the message(s) in transit, it may also divide the
network into two or more disjoint groups.
• This is called network partitioning.
• If the network is partitioned, the sites in each partition may continue to operate.
• In this case, executing transactions that access data stored in multiple partitions becomes a major
issue.

2 Discuss in detail about reliability concept and majors. 4 L2 4

System, State, and Failure:


• Reliability refers to a system that consists of a set of components.
• The system has a state, which changes as the system operates.
• Any deviation of a system from the behavior described in the specification is considered a failure.
• For example, in a distributed transaction manager the specification may state that only serializable
schedules for the execution of concurrent transactions should be generated.
• If the transaction manager generates a non-serializable schedule, we say that it has failed.
• Any error in the internal states of the components of a system or in the design of a system is called a fault
in the system.
• Thus, a fault causes an error that results in a system failure (Fig).

Reliability and Availability:


• Reliability refers to the probability that the system under consideration does not experience any failures
in a given time interval.
• It is typically used to describe systems that cannot be repaired or where the operation of the system is so
critical that no downtime for repair can be tolerated.
• Formally, the reliability of a system, R(t), is defined as the following conditional probability:
R(t) = Pr(0 failures in time [0, t] | no failures at t = 0)

• Availability, A(t), refers to the probability that the system is operational according to its specification at a
given point in time t.

2 P. V Ramana Murthy
LEE; B.E(Comp); M.Tech(CS); (Ph.D(CSE));
Malla Reddy Engineering College (Autonomous)
III Year I Sem - CSE Distributed Databases - Class Notes 2024-2025

• A number of failures may have occurred prior to time t, but if they have all been repaired, the system is
available at time t.
• Obviously, availability refers to systems that can be repaired.

Mean Time between Failures/Mean Time to Repair:


• MTBF is the expected time between subsequent failures in a system with repair.
• MTBF can be calculated either from empirical data or from the reliability function as Since R(t) is related
to the system failure rate, there is a direct relationship between MTBF and the failure rate of a system.
• MTTR is the expected time to repair a failed system.
• It is related to the repair rate as MTBF is related to the failure rate.
• Using these two metrics, the steady-state availability of a system with exponential failure and repair rates
can be specified as
A = MTBF / (MTBF + MTTR)
• System failures may be latent, in that a failure is typically detected some time after its occurrence.
• This period is called error latency, and the average error latency time over a number of identical systems
is called mean time to detect (MTTD).
• Figure depicts the relationship of various reliability measures with the actual occurrences of faults.

3 Explain an overview of distributed reliability protocols. 4 L2 4

• The distributed reliability protocols aim to maintain the atomicity and durability of distributed
transactions that execute over a number of databases.
• The protocols address the distributed execution of the begin transaction, read, write, abort, commit, and
recover commands.
• The implementation of distributed reliability protocols assume that at the originating site of a transaction
there is a coordinator process and at each site where the transaction executes there are participant
processes.
• Thus, the distributed reliability protocols are implemented between the coordinator and the participants.
• The reliability techniques in distributed database systems consist of commit, termination, and recovery
protocols.
• Both of these commands need to be executed differently in a distributed DBMS than in a centralized
DBMS.
• Termination and recovery protocols are two opposite faces of the recovery problem: given a site failure,
termination protocols address how the operational sites deal with the failure, whereas recovery protocols
deal with the procedure that the process (coordinator or participant) at the failed site has to go through to
recover its state once the site is restarted.
• In the case of network partitioning, the termination protocols take the necessary measures to terminate
the active transactions that execute at different partitions, while the recovery protocols address the
establishment of mutual consistency of replicated databases following reconnection of the partitions of
the network.
3 P. V Ramana Murthy
LEE; B.E(Comp); M.Tech(CS); (Ph.D(CSE));
Malla Reddy Engineering College (Autonomous)
III Year I Sem - CSE Distributed Databases - Class Notes 2024-2025

• The primary requirement of commit protocols is that they maintain the atomicity of distributed
transactions.
• This means that even though the execution of the distributed transaction involves multiple sites, some
of which might fail while executing, the effects of the transaction on the distributed database is all-or-
nothing. This is called atomic commitment.
• We would prefer the termination protocols to be non-blocking.
• A protocol is non-blocking if it permits a transaction to terminate at the operational sites without
waiting for recovery of the failed site.
• This would significantly improve the response-time performance of transactions.
• Distributed recovery protocols are independent.
• Independent recovery protocols determine how to terminate a transaction that was executing at the
time of a failure without having to consult any other site.
• Existence of such protocols would reduce the number of messages that need to be exchanged during
recovery.
• Note that the existence of independent recovery protocols would imply the existence of nonblocking
termination protocols, but the reverse is not true.

4 Describe short note on fault-tolerance in distributed systems. 4 L2 4


• Fault Tolerance is required in order to provide below four features.
1. Availability: Availability is defined as the property where the system is readily available for its use at
any time.
2. Reliability: Reliability is defined as the property where the system can work continuously without any
failure.
3. Safety: Safety is defined as the property where the system can remain safe from unauthorized access
even if any failure occurs.
4. Maintainability: Maintainability is defined as the property states that how easily and fastly the failed
node or system can be repaired.

Phases of Fault Tolerance in distribution system:

1. Fault Detection: Continuous monitoring to detect faults as soon as they occur.


2. Fault Diagnosis: Identifying the root cause and nature of the fault.
3. Evidence Generation: Preparing a report based on the diagnosis.
4. Assessment: Analyzing the damages caused by the faults.
5. Recovery: Making the system fault-free and restoring it to a stable state

1. Fault Detection
• Fault Detection is the first phase where the system is monitored continuously.
• The outcomes are being compared with the expected output.
• During monitoring if any faults are identified they are being notified.
• These faults can occur due to various reasons such as hardware failure, network failure, and software
issues.

4 P. V Ramana Murthy
LEE; B.E(Comp); M.Tech(CS); (Ph.D(CSE));
Malla Reddy Engineering College (Autonomous)
III Year I Sem - CSE Distributed Databases - Class Notes 2024-2025

• The main aim of the first phase is to detect these faults as soon as they occur so that the work being
assigned will not be delayed.

2. Fault Diagnosis
• Fault diagnosis is the process where the fault that is identified in the first phase will be diagnosed
properly in order to get the root cause and possible nature of the faults.
• Fault diagnosis can be done manually by the administrator or by using automated Techniques in order to
solve the fault and perform the given task.

3. Evidence Generation
• Evidence generation is defined as the process where the report of the fault is prepared based on the
diagnosis done in an earlier phase.
• This report involves the details of the causes of the fault, the nature of faults, the solutions that can be
used for fixing, and other alternatives and preventions that need to be considered.

4. Assessment
• Assessment is the process where the damages caused by the faults are analyzed.
• It can be determined with the help of messages that are being passed from the component that has
encountered the fault.
• Based on the assessment further decisions are made.

5. Recovery
• Recovery is the process where the aim is to make the system fault free.
• It is the step to make the system fault free and restore it to state forward recovery and backup
recovery.
• Some of the common recovery techniques such as reconfiguration and resynchronization can be used.

Explain general architecture of a parallel database system and shared


5 4 L2 4
memory architecture.

general architecture:
The General architecture of parallel database system contains the following these subsystems.

5 P. V Ramana Murthy
LEE; B.E(Comp); M.Tech(CS); (Ph.D(CSE));
Malla Reddy Engineering College (Autonomous)
III Year I Sem - CSE Distributed Databases - Class Notes 2024-2025

1. Session Manager.
• It plays the role of a transaction monitor, providing support for client interactions with the server.
• In particular, it performs the connections and disconnections between the client processes and the two
other subsystems.
• Therefore, it initiates and closes user sessions (which may contain multiple transactions).
• In case of OLTP sessions, the session manager is able to trigger the execution of pre-loaded transaction
code within data manager modules.
2. transaction Manager.
• It receives client transactions related to query compilation and execution.
• It can access the database directory that holds all meta-information about data and programs.
• The directory itself should be managed as a database in the server.
• Depending on the transaction, it activates the various compilation phases, triggers query execution, and
returns the results as well as error codes to the client application.
• Because it supervises transaction execution and commit, it may trigger the recovery procedure in case of
transaction failure.
• To speed up query execution, it may optimize and parallelize the query at compile-time.
3. Data Manager.
• It provides all the low-level functions needed to run compiled queries in parallel, i.e., database operator
execution, parallel transaction support, cache management, etc.
• If the transaction manager is able to compile dataflow control, then synchronization and communication among
data manager modules is possible.
• Otherwise, transaction control and synchronization must be done by a transaction manager module.

Shared-Memory Architecture:
• In the shared-memory approach, any processor has access to any memory module or disk unit through a
fast interconnect (e.g., a high-speed bus or a cross-bar switch).
• All processors are under the control of a single operating system.
• Current mainframe designs and symmetric multiprocessors (SMP) follow this approach.

.
• All shared-memory parallel database products can exploit inter-query parallelism to provide high
transaction throughput and intra-query parallelism to reduce response time of decision-support queries.
• Shared-memory has two strong advantages: simplicity and load balancing.
• In particular, inter-query parallelism comes for free.
• Intra-query parallelism requires some parallelization but remains simple.
• Load balancing is easy to achieve since it can be achieved at run-time using the shared-memory by
allocating each new task to the least busy processor.
• Shared-memory has three problems: high cost, limited extensibility and low availability.
• High cost is incurred by the interconnect that requires fairly complex hardware because of the need to
link each processor to each memory module or disk.
• With faster processors, conflicting accesses to the shared-memory increase rapidly and degrade
performance.
• Extensibility is limited to a few tens of processors, typically up to 16 for the best cost/performance using
4-processor boards.
• Finally, since the memory space is shared by all processors, a memory fault may affect most processors
thereby availability.
6 P. V Ramana Murthy
LEE; B.E(Comp); M.Tech(CS); (Ph.D(CSE));
Malla Reddy Engineering College (Autonomous)
III Year I Sem - CSE Distributed Databases - Class Notes 2024-2025

Classify parallel execution problems? Explain each with suitable


6 4 L2 4
example.

• The principal problems introduced by parallel query execution are


1. initialization,
2. interference and
3. skew.

Initialization:
• Before the execution takes place, an initialization step is necessary.
• It includes process creation and initialization, communication initialization, etc.
• The duration of this step is proportional to the degree of parallelism.
Interferences.
• A highly parallel execution can be slowed down by interference.
• Interference occurs when several processors simultaneously access the same resource, hardware or
software.
Skew.
• Load balancing problems can appear with intra-operator parallelism (variation in partition size), namely
data skew, and inter-operator parallelism (variation in the complexity of operators).
• The effects of skewed data distribution on a parallel execution can be classified as follows.
• Attribute value skew (AVS), tuple placement skew (TPS), Selectivity skew (SS), Redistribution skew
(RS), join product skew (JPS).

• Solutions to these problems can be obtained at the intra- and inter-operator levels
Intra-Operator Load Balancing
• Good intra-operator load balancing depends on the degree of parallelism and the allocation of
processors for the operator.
• The skew problem makes it hard for a parallel query optimizer to make this decision statically (at
compile-time) as it would require a very accurate and detailed cost model.
• Therefore, the main solutions rely on
➢ adaptive or
➢ specialized techniques
• that can be incorporated in a hybrid query optimizer.

Adaptive techniques:
• The main idea is to statically decide on an initial allocation of the processors to the operator (using a
cost model) and, at execution time, adapt to skew using load reallocation.
• A simple approach to load reallocation is to detect the oversized partitions and partition them again onto
several processors to increase parallelism.

Specialized techniques:
• Parallel join algorithms can be specialized to deal with skew.
• One approach is to use multiple join algorithms, each specialized for a different degree of skew, and to
determine, at execution time, which algorithm is best.
• It relies on two main techniques: range partitioning and sampling.

Inter-Operator Load Balancing


• In order to obtain good load balancing at the inter-operator level, it is necessary to choose, for each
operator, how many and which processors to assign for its execution.
• This should be done taking into account pipeline parallelism, which requires interoperator
communication.
• Finally, the processors associated with the latest operators in a pipeline chain may remain idle a
significant time.
• This is called the pipeline delay problem.
7 P. V Ramana Murthy
LEE; B.E(Comp); M.Tech(CS); (Ph.D(CSE));
Malla Reddy Engineering College (Autonomous)
III Year I Sem - CSE Distributed Databases - Class Notes 2024-2025

7 Distinguish between parallel and distributed database systems. 4 L4 4

Parallel Database Distributed Database


In parallel databases, processes are tightly In distributed databases, the sites are loosely coupled and
coupled and constitutes a single database share no physical components i.e., distributed database is
system i.e., the parallel database is a centralized our geographically departed, and data are distributed at
database and data reside in a single location several locations.
In parallel databases, query processing and In distributed databases, query processing and transaction is
transaction is complicated. more complicated.
In parallel databases, it’s not applicable. In distributed databases, a local and global transaction can
be transformed into distributed database systems
In parallel databases, the data is partitioned In distributed databases, each site preserve a local database
among various disks so that it can be retrieved system for faster processing due to the slow interconnection
faster. between sites
In parallel databases, there are 3 types of Distributed databases are generally a kind of shared-nothing
architecture: shared memory, shared disk, and architecture
shared shared-nothing.
In parallel databases, query optimization is In distributed databases, query Optimisation techniques
more complicated. may be different at different sites and are easy to maintain
In parallel databases, data is generally not In distributed databases, data is replicated at any number of
copied. sites to improve the performance of systems
Parallel databases are generally homogeneous in Distributed databases may be homogeneous or
nature heterogeneous in nature.
Skew is the major issue with the increasing Blocking due to site failure and transparency are the major
degree of parallelism in parallel databases. issues in distributed databases.

Classify the issues related to query processing and load balancing in


8 4 L2 4
parallel database systems? Discuss each.

Issues related to query processing can be classified as


1. Query Parallelism
2. Parallel Algorithms for Data Processing
3. Parallel Query Optimization

Query Parallelism:
• Parallel query execution can exploit two forms of parallelism: inter- and intra-query.
• Inter-query parallelism enables the parallel execution of multiple queries generated by concurrent
transactions, in order to increase the transactional throughput.
• Within a query (intra-query parallelism), inter-operator and intra-operator parallelism are used to
decrease response time.
• Inter-operator parallelism is obtained by executing in parallel several operators of the query tree on
several processors
• while with intra operator parallelism, the same operator is executed by many processors, each one
working on a subset of the data.

Intra-operator Parallelism:
• Intra-operator parallelism is based on the decomposition of one operator in a set of independent sub-
operators, called operator instances.
• This decomposition is done using static and/or dynamic partitioning of relations.

8 P. V Ramana Murthy
LEE; B.E(Comp); M.Tech(CS); (Ph.D(CSE));
Malla Reddy Engineering College (Autonomous)
III Year I Sem - CSE Distributed Databases - Class Notes 2024-2025

• Each operator instance will then process one relation partition, also called a bucket.
• The operator decomposition frequently benefits from the initial partitioning of the data.
Ex:
• The select operator can be directly decomposed into several select operators, each on a different
partition, and no redistribution is required .
• Note that if the relation is partitioned on the select attribute, partitioning properties can be used to
eliminate some select instances.
• For example, in an exact-match select, only one select instance will be executed if the relation was
partitioned by hashing (or range) on the select attribute.

Inter-operator Parallelism:
• Two forms of inter-operator parallelism can be exploited.
• With pipeline parallelism, several operators with a producer-consumer link are executed in parallel.
• For instance, the select operator in Fig will be executed in parallel with the join operator.
• The advantage of such execution is that the intermediate result is not materialized, thus saving
memory and disk accesses.
• In the example of Fig, only S may fit in memory.
• Independent parallelism is achieved when there is no dependency between the operators that are
executed in parallel.
• For instance, the two select operators of Fig can be executed in parallel.
• This form of parallelism is very attractive because there is no interference between the
processors.

load balancing:
• Good load balancing is crucial for the performance of a parallel system.
• Balancing the load of different transactions and queries among different nodes is essential to maximize
throughput.
• Although the parallel query optimizer incorporates decisions on how to execute a parallel execution
plan, load balancing can be hurt by several problems incurring at execution time.
• Solutions to these problems can be obtained at the intra- and inter-operator levels

Intra-Operator Load Balancing


• Good intra-operator load balancing depends on the degree of parallelism and the allocation of
processors for the operator.

9 P. V Ramana Murthy
LEE; B.E(Comp); M.Tech(CS); (Ph.D(CSE));
Malla Reddy Engineering College (Autonomous)
III Year I Sem - CSE Distributed Databases - Class Notes 2024-2025

• The skew problem makes it hard for a parallel query optimizer to make this decision statically (at
compile-time) as it would require a very accurate and detailed cost model.
• Therefore, the main solutions rely on
➢ adaptive or
➢ specialized techniques
• that can be incorporated in a hybrid query optimizer.

Adaptive techniques:
• The main idea is to statically decide on an initial allocation of the processors to the operator (using a
cost model) and, at execution time, adapt to skew using load reallocation.
• A simple approach to load reallocation is to detect the oversized partitions and partition them again onto
several processors to increase parallelism.

Specialized techniques:
• Parallel join algorithms can be specialized to deal with skew.
• One approach is to use multiple join algorithms, each specialized for a different degree of skew, and to
determine, at execution time, which algorithm is best.
• It relies on two main techniques: range partitioning and sampling.

Inter-Operator Load Balancing


• In order to obtain good load balancing at the inter-operator level, it is necessary to choose, for each
operator, how many and which processors to assign for its execution.
• This should be done taking into account pipeline parallelism, which requires interoperator
communication.
• Finally, the processors associated with the latest operators in a pipeline chain may remain idle a
significant time.
• This is called the pipeline delay problem.

10 P. V Ramana Murthy
LEE; B.E(Comp); M.Tech(CS); (Ph.D(CSE));
Malla Reddy Engineering College (Autonomous)

You might also like