Distributed Databases
A DDBS is concerned not only with the data itself but also with the structure of the data. If, despite the existence of a network, the database resides at only one node of the network, then the problems we face are no different from those of maintaining a centralized DBMS.

Here goes an example: consider an engineering firm that has offices in Boston, Waterloo, Paris, and San Francisco. The firm runs projects and maintains a database about them and about its employees (for example, project data and employee data).

Let us assume that the database is relational and stored in four relations. The first two hold the employee data and the project data. The third relation stores salary information: SAL(TITLE, AMT). The fourth relation records which projects are assigned to which employees, together with the duration and the responsibility of each assignment: ASG(ENO, PNO, RESP, DUR).

Depending on the query, the system may have to search the databases at Boston, Paris, and so on.
To speed up query processing, we partition each of the relations and store each partition at a different site. This is known as fragmentation. Sometimes we also duplicate some of this data at other sites, for performance and reliability reasons. The result is a distributed database that is both fragmented and replicated.

Fully transparent access means that users can still pose their queries as specified above, without paying any attention to the fragmentation, location, or replication of data.
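As a concrete illustration, the sketch below fragments the ASG relation horizontally by site and routes a query over the fragments. The site assignments and sample tuples are hypothetical, and a real distributed DBMS would perform this routing inside its query processor:

# Horizontal fragmentation of ASG(ENO, PNO, RESP, DUR): each fragment
# holds the assignments stored at one site (hypothetical placement).
ASG_BOSTON = [("E1", "P1", "Manager", 12), ("E2", "P1", "Analyst", 24)]
ASG_PARIS = [("E3", "P2", "Consultant", 6), ("E4", "P3", "Engineer", 48)]
FRAGMENTS = {"Boston": ASG_BOSTON, "Paris": ASG_PARIS}

def select_asg(predicate):
    """Answer a query over ASG transparently: the user never names a
    site; the system scans every fragment and unions the results."""
    result = []
    for site, fragment in FRAGMENTS.items():
        result.extend(t for t in fragment if predicate(t))
    return result

# Who works on project P1? The caller is unaware of the fragmentation.
print(select_asg(lambda t: t[1] == "P1"))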
i. Data Independence

Data independence is a fundamental form of transparency. It is the capacity to change the database schema (the structure or description of the data) at one level of a database system without affecting the schema at the next higher level. A database system follows a multilayer architecture to store this metadata. There are two types of data independence: logical data independence, which concerns changes to the logical schema, and physical data independence, which concerns changes to the physical schema, that is, to how the data are managed and stored internally.

ii. Network Transparency / Distribution Transparency

Beyond the data itself, the user should be protected from the operational details of the network: a user should be able to access a resource (an application program or data) without needing to know whether the resource is located on the local machine or on a remote one.

iii. Replication Transparency

Replication transparency ensures that the replication of databases is hidden from the users. It enables users to query a table as if only a single copy of the table existed.

iv. Fragmentation Transparency

Fragmentation divides each database relation into smaller fragments and treats each fragment as a separate database object, for reasons of performance, availability, and reliability. Fragmentation transparency hides the fact that the table the user is querying is actually a fragment, or a union of some fragments.

So, to provide easy and efficient access to the DBMS, we need to have full transparency.

2) Reliability (Reliable Access to Data) Through Distributed Transactions

Distributed DBMSs are designed to improve reliability by having replicated components, so that the failure of a single component does not bring the entire system to a halt. Proper care is taken so that, instead of losing access entirely when one part fails, users may still be permitted to access the other parts of the distributed database. This is also the basis of support for distributed transactions.

A transaction is a basic unit of consistent and reliable computing, consisting of a sequence of database operations executed as an atomic action.

Let us take an example of a transaction based on the engineering firm. Assume there is an application that updates the salaries of all the employees by 10%, and that it executes concurrently with another program accessing the same data. Three problems arise.

First, we would like the system to be able to synchronize the concurrent execution of these two programs: if the requested data is replicated, the system chooses one of the stored copies for access and makes sure that the effect of an update is reflected on each and every copy of that data item.

Second, if some site fails (through software or hardware malfunction), or if some communication link fails while the update is executing, the system must make sure that the effects will be reflected on the data residing at the failing or unreachable sites as soon as the system can recover from the failure.

Third, since each site cannot have instantaneous information on the actions currently being carried out at the other sites, the synchronization of transactions across multiple sites has to be taken care of.
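A minimal sketch of the salary-update example follows. It applies the update to every site's fragment and either commits everywhere or rolls back everywhere; the in-memory sites and the simple undo log are assumptions for illustration, not the protocol an actual distributed DBMS would use:

# Each site holds a fragment of the salary data (hypothetical values).
sites = {
    "Boston": {"E1": 40000, "E2": 34000},
    "Paris": {"E3": 27000},
}

def raise_salaries(pct):
    """Atomic all-or-nothing update across sites: on any error, undo
    every change already applied so no site sees a partial update."""
    undo_log = []  # entries of (site, employee, old_salary)
    try:
        for site, fragment in sites.items():
            for eno, sal in fragment.items():
                undo_log.append((site, eno, sal))
                fragment[eno] = round(sal * (1 + pct))
    except Exception:
        for site, eno, old in reversed(undo_log):  # roll back
            sites[site][eno] = old
        raise

raise_salaries(0.10)  # the 10% raise from the example
print(sites)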
3) Improved Performance

Since each site handles only a portion of the database, contention for CPU and I/O services is not as severe as for centralized databases, and localization of the data reduces remote access delays. Distributed systems also implement the inherent parallelism of applications: they offer both inter-query and intra-query parallelism. Inter-query parallelism is the ability to execute multiple queries at the same time. Intra-query parallelism is achieved by breaking up a single query into a number of subqueries, each of which is executed at a different site, accessing a different part of the distributed database.

4) Easier System Expansion

It normally costs much less to put together a system of "smaller" computers with the equivalent power of a single big machine.

Distributed DBMS Architecture

The architecture of a system defines its structure: the components of the system are identified, the function of each component is specified, and the interrelationships and interactions among these components are defined.

1) ANSI/SPARC Architecture

In late 1972, the Computer and Information Processing Committee (X3) of the American National Standards Institute (ANSI) established a Study Group on DBMS under the Standards Planning and Requirements Committee (SPARC). The mission of the study group was to study the feasibility of setting up standards. The study group issued its interim report in 1975 and its final report in 1977. The architectural framework proposed in these reports is known as the "ANSI/SPARC architecture," its full title being the "ANSI/X3/SPARC DBMS Framework."

There are three views of data: the external view, which is that of the end user, who might be a programmer; the internal view, that of the system or machine; and the conceptual view, that of the enterprise.

The lowest level of the architecture is the internal view, which deals with the physical definition and organization of data. The location of data on different storage devices and the access mechanisms used to reach and manipulate the data are the issues dealt with at this level. The external view is concerned with how users view the database. In between these two ends is the conceptual schema, which is an abstract definition of the database used to represent the data and the relationships among the data. The main advantage of this architecture is that it supports data independence: the separation of the schemas leads to both physical and logical data independence.

Architectural Models for Distributed DBMSs

There are various possible ways in which a distributed DBMS may be architected. We use a classification that organizes the systems with respect to (1) the autonomy of local systems, (2) their distribution, and (3) their heterogeneity.
Autonomy

Autonomy refers to the distribution of control; it indicates the degree to which individual DBMSs can operate independently. Autonomy is a function of a number of factors, such as 1) whether the component systems (the individual DBMSs) exchange information, 2) whether they can independently execute transactions, and 3) whether one is allowed to modify them.

Requirements of an autonomous system have been specified as follows:

1. The local operations of the individual DBMSs should not be affected by their participation in the distributed system.

2. The manner in which the individual DBMSs process queries should not be affected by the execution of global queries that access multiple databases.

3. System consistency or operation should not be compromised when individual DBMSs join or leave the distributed system.

The dimensions of autonomy can be specified as follows:

1. Design autonomy: individual DBMSs are free to use the data models and transaction management techniques that they prefer.

2. Communication autonomy: each of the individual DBMSs is free to make its own decision as to what type of information it wants to provide to the other DBMSs.

3. Execution autonomy: each DBMS can execute the transactions that are submitted to it in any way that it wants to.

Viewed along the autonomy dimension, there are three alternatives: tight integration, semiautonomous systems, and total isolation.

1. Tight integration: a single image of the entire database is available to any user who wants to share the information, which may reside in multiple databases. In these tightly integrated systems, the data managers are implemented so that one of them is in control of the processing of each user request, even if that request is serviced by more than one data manager.

2. Semiautonomous: these systems are not fully autonomous, because they need to be modified to enable them to exchange information with one another.

3. Total isolation: the individual systems are stand-alone DBMSs that know neither of the existence of other DBMSs nor how to communicate with them. In such systems, the processing of user transactions that access multiple databases is especially difficult.

Distribution

Distribution is the physical distribution of data over multiple sites. There are many ways DBMSs have been distributed. We abstract these alternatives into two classes:

1. Client/server distribution: the sites on a network are distinguished as "clients" and "servers," and communication duties are shared between the client machines and the servers.

2. Peer-to-peer distribution: there is no distinction between client machines and servers. Each machine has full DBMS functionality and can communicate with other machines to execute queries and transactions.

Heterogeneity

Heterogeneity refers to the uniformity or dissimilarity of the data models, system components, and databases. It occurs in various forms in distributed systems, ranging from hardware heterogeneity and differences in networking protocols to variations in data managers. It may appear in the data models, the query languages, and the transaction management protocols. Representing data with different modeling tools creates heterogeneity. Heterogeneity in query languages involves the use of completely different data access paradigms; even though SQL is the standard relational query language, there are many different implementations of it.

Client/Server Systems

Client/server DBMSs entered the computing scene at the beginning of the 1990's. They divide functionality into two classes: 1) server functions and 2) client functions.
The more sophisticated client/server architecture is one where there are multiple servers in the system, called the multiple client/multiple server approach. In this approach, two alternative management strategies are possible: either each client manages its own connections to the appropriate servers, or each client knows only its "home server," which then communicates with other servers as required. The first approach simplifies the server code, but it loads the client machines with additional responsibilities, which leads to what are called "heavy client" systems.

Peer-to-Peer Systems

The enterprise view of the data is described by the global conceptual schema (GCS), which is global because it describes the logical structure of the data at all the sites. To handle data fragmentation and replication, the logical organization of the data at each site needs to be described as well. Therefore, there needs to be a third layer in the architecture, the local conceptual schema (LCS). Finally, user applications and user access to the database are supported by external schemas (ESs).

Data independence is supported, since this model is an extension of ANSI/SPARC. Location and replication transparencies are supported by the local and global conceptual schemas. Network transparency, on the other hand, is supported by the global conceptual schema.

The detailed components of a distributed DBMS are shown in the accompanying figure (not reproduced here). One component handles the interaction with users, and another deals with the storage.

The user processor consists of four elements:

1. The user interface handler, which is responsible for interpreting user commands as they come in and for formatting the result data as it is sent back to the user.

2. The semantic data controller.

3. The global query optimizer and decomposer.

4. The distributed execution monitor.

The data processor consists of three elements:

1. The local query optimizer, which acts as the access path selector and is responsible for choosing the best access path to any data item.

2. The local recovery manager, which is responsible for making sure that the local database remains consistent even when failures occur.

3. The run-time support processor, which is the interface to the operating system and contains the database buffer (or cache) manager, responsible for maintaining the main memory buffers and managing the data accesses.

In peer-to-peer systems, both the user processor modules and the data processor modules reside on each machine.

Distributed Database Design

Alternative Design Strategies

Two major strategies have been identified for designing distributed databases: the top-down approach and the bottom-up approach.
In the top-down approach, conceptual design is the process by which the enterprise is examined to determine entity types and the relationships among these entities. From the conceptual design step comes the definition of the global conceptual schema. The global conceptual schema (GCS) and the access pattern information collected as a result of view design are the inputs to the distribution design step. The objective at this stage is to design the local conceptual schemas (LCSs) by distributing the entities over the sites of the distributed system. The last step in the design process is the physical design, which maps the local conceptual schemas to the physical storage devices available at the corresponding sites.

Design and development is an ongoing activity, requiring constant monitoring and periodic adjustment and tuning; for that reason the process ends with an observation and monitoring step. Its result is some form of feedback, which may mean backing up to one of the earlier steps in the design.

Distribution Design Issues

The main issues are: 1. Why fragment at all? 2. How should we fragment? 3. How much should we fragment? 4. Is there any way to test the correctness of decomposition? 5. How should we allocate? 6. What is the necessary information for fragmentation and allocation?

There are three rules that we follow during fragmentation which, together, ensure that the database does not undergo semantic change during fragmentation: a) completeness, b) reconstruction, and c) disjointness. (A sketch of these correctness checks is given below.)

1) Reasons for Fragmentation

The decomposition of a relation into fragments, each being treated as a unit, permits a number of transactions to execute concurrently. Fragmentation also enables the parallel execution of a single query, by dividing it into a set of subqueries that operate on fragments. Thus fragmentation typically increases the level of concurrency; this form of concurrency is referred to as query concurrency.

2) Fragmentation Alternatives

Relation instances are essentially tables, so the issue is one of finding alternative ways of dividing a table into smaller ones. There are two alternatives: dividing it horizontally or dividing it vertically. A vertical fragmentation may also be followed by a horizontal one, or vice versa, producing a tree-structured partitioning. Since the two types of partitioning strategies are applied one after the other, this alternative is called hybrid fragmentation (also known as mixed or nested fragmentation).
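The sketch below illustrates the three correctness rules for a horizontal fragmentation of a table; the sample tuples and the two-fragment split are hypothetical:

# A toy relation and a horizontal fragmentation of it into two fragments.
ASG = {("E1", "P1"), ("E2", "P1"), ("E3", "P2")}
F1 = {t for t in ASG if t[1] == "P1"}  # fragment: project P1 tuples
F2 = {t for t in ASG if t[1] != "P1"}  # fragment: all other tuples
fragments = [F1, F2]

# Completeness: every tuple of the relation appears in some fragment.
complete = all(any(t in f for f in fragments) for t in ASG)

# Reconstruction: for horizontal fragmentation, the union of the
# fragments yields the original relation.
reconstructable = set().union(*fragments) == ASG

# Disjointness: no tuple appears in more than one fragment.
disjoint = all(f1.isdisjoint(f2) for i, f1 in enumerate(fragments)
               for f2 in fragments[i + 1:])

print(complete, reconstructable, disjoint)  # True True True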
1) Allocation Alternatives

After the database is fragmented, one has to decide on the allocation of the fragments to the various sites on the network. When data are allocated, they may either be replicated or maintained as a single copy. The reasons for replication are reliability and the efficiency of read-only queries. Whether to replicate depends on the ratio of read-only queries to update queries.

Allocation

The allocation of resources across the nodes of a computer network, that is, the placement of the individual files or fragments, is a big task.

Allocation Problem

The allocation problem is defined with respect to two measures: 1. minimal cost, and 2. performance.

There are two fundamental fragmentation strategies, horizontal and vertical; furthermore, there is the possibility of nesting fragments in a hybrid fashion:

1. Horizontal fragmentation
2. Vertical fragmentation
3. Hybrid fragmentation

Hybrid fragmentation applies the two fundamental strategies one after the other.

Query Processing

The objective of query processing in a distributed context is to transform a high-level query on a distributed database into an efficient execution strategy expressed in a low-level language on local databases. We assume that the high-level language is relational calculus, while the low-level language is an extension of relational algebra with communication operators. An important aspect of query processing is query optimization.
Among the transformations of the high-level query, we seek one that optimizes (minimizes) resource consumption. A good measure of resource consumption is the total cost that will be incurred in processing the query:

1. The CPU cost is incurred when performing operators on data in main memory.

2. The I/O cost is the time necessary for disk accesses. This cost can be minimized by reducing the number of disk accesses, through fast access methods to the data and efficient use of main memory (buffer management).

3. The communication cost is the time needed for exchanging data between the sites participating in the execution of the query.
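A total-cost function of the kind just described can be sketched as a weighted sum; the weights and the unit costs below are hypothetical, chosen only to show the shape of such a model:

# Total cost = weighted combination of CPU, I/O, and communication costs.
# The coefficients are illustrative; a real optimizer calibrates them
# against the actual hardware and network.
W_CPU, W_IO, W_COMM = 1.0, 10.0, 100.0  # relative unit costs

def total_cost(tuples_processed, pages_read, bytes_shipped):
    cpu = W_CPU * tuples_processed          # in-memory operator work
    io = W_IO * pages_read                  # disk accesses
    comm = W_COMM * (bytes_shipped / 1024)  # inter-site data transfer
    return cpu + io + comm

# On a wide area network W_COMM dominates, which is why shipping less
# data (for example, via a semijoin) is the main source of savings.
print(total_cost(tuples_processed=5000, pages_read=120, bytes_shipped=65536))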
Characterization of Query Processors

It is quite difficult to evaluate and compare query processors, in both centralized and distributed systems, because they may differ in many aspects. Here are some important characteristics of query processors that can be used as a basis for comparison: 1) languages, 2) types of optimization, 3) optimization timing, 4) statistics, 5) decision sites, 6) exploitation of the network topology, 7) exploitation of replicated fragments, and 8) use of semijoins. The first four characteristics hold for both centralized and distributed query processors; the last four are particular to the distributed query processors of distributed DBMSs.

2) Types of Optimization

Query optimization aims at choosing the "best" point in the solution space of all possible execution strategies. An immediate method is to search the solution space, predict the cost of each strategy, and select the strategy with minimum cost. The problem is that the solution space can be large; that is, there may be many equivalent strategies, even with a small number of relations. Nevertheless, an "exhaustive" search approach is often used, whereby (almost) all possible execution strategies are considered.

3) Optimization Timing

A query may be optimized at different times: optimization can be done statically or dynamically. Statically means before executing the query; dynamically means as the query is executed. The main advantage of dynamic over static query optimization is that the actual sizes of intermediate relations are available to the query processor, thereby minimizing the probability of a bad choice.

4) Statistics

The effectiveness of query optimization relies on statistics on the database. Dynamic query optimization requires statistics in order to choose which operators should be done first. Static query optimization is even more demanding, since the sizes of intermediate relations must also be estimated based on statistical information.
5) Decision Sites

When optimization is performed, either a single site or several sites may participate in the selection of the strategy to be applied for answering the query. In the centralized decision approach, a single site generates the strategy; alternatively, the decision process can be distributed among the various sites participating in the elaboration of the best strategy. The centralized approach is simpler but requires knowledge of the entire distributed database, whereas the distributed approach requires only local information. Hybrid approaches, where one site makes the major decisions and other sites make local decisions, are also frequent.

6) Exploitation of the Network Topology

The network topology is generally exploited by the distributed query processor. With wide area networks, the cost function to be minimized can be restricted to the data communication cost. With local area networks, communication costs are comparable to I/O costs.

7) Exploitation of Replicated Fragments

A distributed relation is usually divided into relation fragments. Distributed queries expressed on global relations are mapped into queries on the physical fragments of the relations by translating relations into fragments. We call this process localization, because its main function is to localize the data involved in the query.

8) Use of Semijoins

The semijoin operator has the important property of reducing the size of its operand relation. A semijoin is particularly useful for improving the processing of distributed join operators, since it reduces the size of the data exchanged between sites. The early distributed DBMSs, built for slow wide area networks, made extensive use of semijoins; some later systems, built for faster networks, do not employ semijoins.
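The size-reducing effect of the semijoin can be sketched as follows; the two relations, their home sites, and the join attribute are hypothetical:

# Distributed join of EMP (site 1) with ASG (site 2) on ENO, using a
# semijoin: ship only the join-attribute values, then only matching tuples.
EMP = [("E1", "J. Doe"), ("E2", "M. Smith"), ("E3", "A. Lee")]   # site 1
ASG = [("E1", "P1"), ("E3", "P2")]                               # site 2

# Step 1: site 1 projects EMP on ENO and ships this (small) set to site 2.
emp_enos = {e[0] for e in EMP}

# Step 2: site 2 computes the semijoin ASG ⋉ EMP, shrinking what it ships.
asg_reduced = [a for a in ASG if a[0] in emp_enos]

# Step 3: site 1 joins EMP with the reduced ASG; the result equals the
# full join, but fewer tuples crossed the network.
join = [(e, a) for e in EMP for a in asg_reduced if e[0] == a[0]]
print(join)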
The input to distributed query processing is a query on global data expressed in relational calculus. This query is posed on global (distributed) relations, meaning that data distribution is hidden. Four main layers are involved. The first three layers map the input query into an optimized distributed query execution plan; they perform the functions of query decomposition, data localization, and global query optimization. The fourth layer performs distributed query execution, executing the plan and returning the answer to the query; it is done by the local sites together with the control site.

1) Query Decomposition

The first layer decomposes the calculus query into an algebraic query on global relations. It is the first phase of query processing, transforming a relational calculus query into a relational algebra query. The information needed for this transformation is found in the global conceptual schema, which describes the global relations. Query decomposition proceeds as four successive steps: (1) normalization, (2) analysis, (3) elimination of redundancy, and (4) rewriting.

First, the calculus query is rewritten in a normalized form that is suitable for subsequent manipulation. Second, the normalized query is analyzed semantically, so that incorrect queries are detected and rejected as early as possible. Third, the correct query (still expressed in relational calculus) is simplified, for example by eliminating redundant predicates. Fourth, the calculus query is restructured as an algebraic query; the algebraic query generated by this layer is good in the sense that the worst executions are avoided.

Normalization

The goal of normalization is to transform the query into a normalized form that facilitates further processing. There are two possible normal forms for the query qualification: conjunctive (∧) normal form and disjunctive (∨) normal form (see the example below).

Analysis

Query analysis enables the rejection of normalized queries for which further processing is either impossible or unnecessary. The main reasons for rejection are that the query is type incorrect or semantically incorrect. When one of these cases is detected, the query is simply returned to the user with an explanation; otherwise, query processing is continued.

Rewriting

The last step of query decomposition rewrites the query in relational algebra.
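As a hypothetical example of the two normal forms, a qualification over ASG such as "the employee works on project P1 with a duration of 12 or 24 months" can be written either way:

\[
\underbrace{(\mathrm{PNO} = \text{"P1"}) \land \big((\mathrm{DUR} = 12) \lor (\mathrm{DUR} = 24)\big)}_{\text{conjunctive normal form}}
\;\equiv\;
\underbrace{\big((\mathrm{PNO} = \text{"P1"}) \land (\mathrm{DUR} = 12)\big) \lor \big((\mathrm{PNO} = \text{"P1"}) \land (\mathrm{DUR} = 24)\big)}_{\text{disjunctive normal form}}
\]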
2) Data Localization

The localization layer translates an algebraic query on global relations into an algebraic query expressed on physical fragments, using the data distribution information in the fragment schema; it determines which fragments are involved in the query. (To simplify this discussion, we do not consider the fact that data fragments may be replicated.) A global relation can be reconstructed by applying the fragmentation rules and then deriving a program, called a localization program, of relational algebra operators that acts on the fragments. Localization can thus be viewed as replacing the leaves of the operator tree of the distributed query with subtrees corresponding to the localization programs. We call the query obtained this way the localized query.

3) Global Query Optimization

The input to the third layer is an algebraic query on fragments. The goal of query optimization is to find an execution strategy for the query that is close to optimal, and to avoid bad strategies; the strategy produced by the optimizer is referred to as the optimal strategy (or optimal ordering). Query optimization minimizes a cost function, often defined in terms of time units; the execution cost is expressed as a weighted combination of I/O, CPU, and communication costs. The output of this layer is an optimized algebraic query: a query execution plan consisting of the algebraic query specified on fragments together with the communication operations that support the execution of the query over the fragment sites. It is represented and saved (for future executions) as a distributed query execution plan.

The optimizer can be seen in terms of three components: a search space, a cost model, and a search strategy. The search space is the set of execution plans that represent the input query. All plans produce the same result, but they differ in the execution order of the operations and in the way the operations are implemented, and therefore in their performance. The cost model predicts the cost of a given execution plan; to be accurate, the cost model must have good knowledge of the distributed execution environment. The search strategy explores the search space and selects the best plan, using the cost model.

Query execution plans are abstracted by means of operator trees, which define the order in which the operations are executed; the trees are enriched with additional information, such as the best algorithm chosen for each operation. For a given query, the search space can thus be defined as the set of equivalent operator trees. To characterize query optimizers, we concentrate on join trees, which are operator trees whose operators are joins or Cartesian products, because permutations of the join order have the most important effect on the performance of relational queries. Each join tree can be assigned a cost based on the estimated cost of each operator, and the selection of the optimal strategy requires the prediction of these execution costs (the total cost).
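A minimal sketch of exploring such a search space follows: it enumerates the possible join orders of three relations and picks the cheapest under a toy cost model (the cardinalities and the cost formula are hypothetical):

from itertools import permutations

# Estimated cardinalities of three relations and of their pairwise joins
# (hypothetical statistics of the kind the cost model relies on).
card = {"EMP": 400, "ASG": 1000, "PROJ": 100,
        ("EMP", "ASG"): 1000, ("ASG", "PROJ"): 1000, ("EMP", "PROJ"): 40000}

def join_size(done, nxt):
    """Crude estimate: use a known pair size if one exists, otherwise
    assume a Cartesian product with the smallest relation joined so far."""
    sizes = [card[k] for rel in done
             for k in ((rel, nxt), (nxt, rel)) if k in card]
    return min(sizes) if sizes else card[nxt] * min(card[r] for r in done)

best = None
for order in permutations(["EMP", "ASG", "PROJ"]):
    cost, done = 0, [order[0]]
    for rel in order[1:]:
        cost += join_size(done, rel)  # cost ~ sum of intermediate sizes
        done.append(rel)
    if best is None or cost < best[0]:
        best = (cost, order)

print(best)  # the cheapest join ordering under this toy model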
A join tree that starts with a Cartesian product (join tree (c) in the notes' original figure, not reproduced here) may have a much higher cost than the other join trees. If a query is large, some restrictions are placed on the search space. The most common heuristic is to perform selections and projections when accessing the base relations. Another common heuristic is to avoid Cartesian products that are not required by the query; with this heuristic, an operator tree such as (c) would not be part of the search space considered by the optimizer.

The optimization timing, which can be dynamic, static, or hybrid, is a good basis for classifying query optimization techniques.

Dynamic query optimization: the QEP is dynamically constructed by the query optimizer, which makes calls to the DBMS execution engine for executing the query's operations. Thus, there is no need for a cost model.

Static query optimization: there is a clear separation between the generation of the QEP at compile time and its execution by the DBMS execution engine. Thus, an accurate cost model is key to predicting the costs of candidate QEPs.

The minimization of communication costs makes distributed query optimization more complex than its centralized counterpart. Centralized query optimization is nevertheless a prerequisite to understanding distributed query optimization, for three reasons. First, a distributed query is translated into local queries, each of which is processed in a centralized way. Second, distributed query optimization techniques are often extensions of the techniques for centralized systems. Finally, centralized query optimization is a simpler problem.

4) Distributed Query Execution

The last layer is performed by all the sites that hold fragments involved in the query. Each subquery executing at one site, called a local query, is optimized using the local schema of that site and then executed.

DDB UNIT-3

Reliability refers both to the resiliency of a system to various types of failures and to its capability to recover from them. A transaction is a unit of consistent and reliable computation: it takes a database, performs an action on it, and generates a new version of the database, causing a state transition. A transaction is considered to be made up of a sequence of read and write operations on the database, together with computation steps; it may be thought of as a program with embedded database access queries.
Properties of Transactions

The consistency and reliability aspects of transactions are due to four properties: (1) atomicity, (2) consistency, (3) isolation, and (4) durability.

Atomicity

Atomicity requires that if the execution of a transaction is interrupted by any sort of failure, the DBMS is responsible for determining what to do with the transaction upon recovery from the failure. There are two possible courses of action: the transaction can be terminated by completing the remaining actions, or it can be terminated by undoing all the actions that have already been executed. Two types of failures are relevant. A transaction itself may fail due to input data errors, deadlocks, or other factors; maintaining transaction atomicity in the presence of this type of failure is commonly called transaction recovery. The second type of failure is caused by system crashes, such as media failures, processor failures, communication link breakages, power outages, and so on; ensuring transaction atomicity in the presence of system crashes is called crash recovery.

Consistency

The consistency of a transaction is simply its correctness; in other words, a transaction is a correct program that maps one consistent database state to another. There is a classification of consistency that groups databases into four levels. Based on the concept of dirty data, the degrees are defined as follows:

Degree 2: Transaction T sees degree 2 consistency if: 1. T does not overwrite dirty data of other transactions; 2. T does not commit any writes before EOT; and 3. T does not read dirty data from other transactions.

Degree 1: Transaction T sees degree 1 consistency if: 1. T does not overwrite dirty data of other transactions; and 2. T does not commit any writes before EOT.

Degree 0: Transaction T sees degree 0 consistency if: 1. T does not overwrite dirty data of other transactions.

The point of defining multiple levels of consistency is to provide application programmers the flexibility to define transactions that operate at different levels. Consequently, while some transactions operate at the degree 3 consistency level, others may operate at lower levels.

Isolation

Isolation ensures that multiple transactions can execute concurrently without leading to an inconsistent database state. Consider, for example, the following two transactions, both of which increment x and y:

T1: Read(x); x ← x + 1; Write(x); Read(y); y ← y + 1; Write(y); Commit

T2: Read(x); Read(y); x ← x + 1; y ← y + 1; Write(x); Write(y); Commit

Durability

Durability refers to that property of transactions which ensures that once a transaction commits, its results are permanent and cannot be erased from the database; the DBMS ensures that the results of a transaction will survive subsequent system failures.

Transactions have been classified according to a number of criteria. The first criterion is the duration of transactions: transactions may be classified as online or batch, two classes that are also called short-life and long-life transactions. Transactions can also be classified according to their structure, distinguishing broad categories of increasing complexity: 1. flat transactions, 2. nested transactions, and 3. workflow models (which are combinations of various nested forms). Flat transactions have a single start point (Begin_transaction) and a single termination point (End_transaction).
Transactions may also be characterized by their read and write patterns; for example, if a transaction is both two-step and restricted, it is called a restricted two-step transaction.

Nested transactions offer two main advantages: 1. they provide a higher level of concurrency among transactions, since a transaction consists of a number of other transactions and more concurrency is therefore possible within a single transaction; and 2. it is possible to recover independently from failures of each subtransaction.

Workflows

For modeling business activities, flat and nested transactions are not sufficient, so workflows came into existence. A workflow is "a collection of tasks organized to accomplish some business process." Three types of workflows have been identified.

Distributed Concurrency Control

The distributed concurrency control mechanism of a distributed DBMS ensures that the consistency of the database is maintained in a multiuser distributed environment. In this chapter we make two major assumptions: the distributed system is fully reliable and does not experience any failures (of hardware or software), and the database is not replicated.

Two remarks about conflicting operations are in order. First, read operations do not conflict with each other; we can therefore talk about two types of conflicts: read-write (or write-read), and write-write. Second, the two operations can belong to the same transaction or to two different transactions. The existence of a conflict between two operations indicates that their order of execution is important; the ordering of two read operations, by contrast, is insignificant.

Concurrency Control Mechanisms and Algorithms

There are a number of ways in which the concurrency control approaches can be classified. One grouping puts the concurrency control mechanisms into two broad classes: 1. pessimistic concurrency control methods, and 2. optimistic concurrency control methods. Pessimistic algorithms synchronize the concurrent execution of transactions early in their execution life cycle.
Pessimistic concurrency control algorithms assume that conflicts between transactions are quite frequent, and they do not permit a transaction to access a data item if there is a conflicting transaction that accesses that data item. Thus the operation of a transaction follows this sequence of phases: validation (V), read (R), computation (C), write (W).

Deadlocks are reasoned about with the wait-for graph (WFG), whose nodes are transactions and whose edge Ti → Tj means that Ti is waiting for Tj. Using the WFG, it is easy to state the condition for the occurrence of a deadlock: a deadlock occurs when the WFG contains a cycle.
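A minimal sketch of checking the WFG condition follows; the graph representation and the example edges are hypothetical:

# Wait-for graph: wfg[T] is the set of transactions that T waits for.
wfg = {"T1": {"T2"}, "T2": {"T3"}, "T3": {"T1"}}  # a 3-cycle: deadlock

def has_deadlock(wfg):
    """Detect a cycle in the wait-for graph by depth-first search."""
    visited, on_stack = set(), set()

    def dfs(t):
        visited.add(t)
        on_stack.add(t)
        for u in wfg.get(t, ()):
            if u in on_stack or (u not in visited and dfs(u)):
                return True  # back edge found: a cycle, hence a deadlock
        on_stack.discard(t)
        return False

    return any(t not in visited and dfs(t) for t in wfg)

print(has_deadlock(wfg))  # True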
Deadlock Prevention

Deadlock prevention methods guarantee that deadlocks cannot occur in the first place. The transaction manager checks a transaction when it is first initiated and does not permit it to proceed if it may cause a deadlock. To perform this check, it is required that all of the data items that will be accessed by a transaction be predeclared. The transaction manager then permits a transaction to proceed only if all the data items that it will access are available; otherwise, the transaction is not permitted to proceed.
DISTRIBUTED DATABASES UNIT-4

Distributed DBMS Reliability

A reliable distributed database management system is one that can continue to process user requests even when the underlying system is unreliable: even when components of the distributed computing environment fail, a reliable distributed DBMS should be able to continue executing user requests without violating database consistency. We therefore discuss the reliability features of a distributed DBMS.

Reliability Concepts and Measures

(i) System, State, and Failure

Reliability refers to a system that consists of a set of components. The system has a state, which changes as the system operates. Any deviation of a system from the behavior described in its specification is considered a failure. For example, in a distributed transaction manager the specification may state that only serializable schedules for the execution of concurrent transactions should be generated; if the transaction manager generates a non-serializable schedule, we say that it has failed.

(ii) Reliability and Availability

Reliability refers to the probability that the system under consideration does not experience any failures in a given time interval. Formally, the reliability of a system, R(t), is defined as the following conditional probability:

R(t) = Pr{0 failures in time [0, t] | no failures at time 0}

If we assume that failures follow a Poisson distribution with a constant failure rate λ (which is usually the case for hardware), this formula reduces to R(t) = e^(-λt).

Availability, A(t), refers to the probability that the system is operational according to its specification at a given point in time t. Reliability and availability of a system are considered to be contradictory objectives.

(iii) Mean Time Between Failures (MTBF) / Mean Time to Repair (MTTR)

MTBF is the expected time between subsequent failures in a system with repair. MTBF can be calculated either from empirical data or, under the same assumptions as above, from the reliability function:

MTBF = ∫₀^∞ R(t) dt

MTTR is the expected time to repair a failed system, and MTTD is the mean time to detect a failure. Combining these measures gives the steady-state availability of a system with repair:

A = MTBF / (MTBF + MTTR)
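As a worked example with hypothetical figures, a site that fails on average once every 1000 hours and takes 10 hours to repair has

\[
A = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}} = \frac{1000}{1000 + 10} \approx 0.990,
\qquad
R(100) = e^{-\lambda t} = e^{-100/1000} \approx 0.905 .
\]

That is, the site is up about 99% of the time, yet the probability of surviving 100 hours without any failure is only about 90%, which is one way to see why reliability and availability are distinct objectives.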
Failures in Distributed DBMS

Designing a reliable system that can recover from failures requires identifying the types of failures with which the system has to deal. In a distributed database system, we need to deal with four types of failures: i. transaction failures (aborts), ii. site (system) failures, iii. media (disk) failures, and iv. communication line failures.
(i) Transaction failures

Transactions can fail for a number of reasons. Failure can be due to an error in the transaction caused by incorrect input data, or due to deadlocks. Some concurrency control algorithms do not permit a transaction to proceed, or even to wait, if the data that it attempts to access are currently being accessed by another transaction; this might also be considered a failure. The usual approach to take in cases of transaction failure is to abort the transaction, thus resetting the database to its state prior to the start of that transaction.

(ii) Site (system) failures

The reasons for system failure can be traced back to hardware or to software failures; these failures may be due to operating system errors as well as to hardware faults. A system failure is always assumed to result in the loss of main memory contents; therefore, any part of the database that was in main memory buffers is lost as a result of a system failure. In a distributed database, system failures are referred to as site failures, since they make a site unreachable. We differentiate between partial and total failures in a distributed system: total failure refers to the simultaneous failure of all sites in the distributed system, while partial failure indicates the failure of only some sites while the others remain operational.

(iii) Media failures

Media failure refers to the failure of the secondary storage devices that store the database. It means that all or part of the database that is on secondary storage is considered destroyed and inaccessible.

(iv) Communication failures

The three types of failures described above are common to both centralized and distributed DBMSs, whereas communication failures are unique to the distributed case. There are a number of types of communication failures: errors in the messages, improperly ordered messages, lost (or undeliverable) messages, and communication line failures.

Local and Distributed Reliability Protocols

Local reliability protocols

We first discuss the functions performed by the local recovery manager (LRM) that exists at each site. These functions maintain the atomicity and durability properties of local transactions, and they relate to the execution of the commands that are passed to the LRM: begin_transaction, read, write, commit, and abort. When the LRM wants to read a page of data on behalf of a transaction, it issues a fetch command, indicating the page that it wants to read. The buffer manager checks to see if that page is already in the buffer and, if so, makes it available to that transaction; if not, it reads the page from the stable database into an empty database buffer. In addition to the commands above, there is a sixth interface command to the LRM: recover. The recover command is the interface that the operating system has to the LRM; it is used during recovery from system failures, when the operating system asks the DBMS to recover the database to the state that existed when the failure occurred.
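The fetch path just described can be sketched as follows; the page identifiers, buffer size, and eviction rule are hypothetical simplifications:

stable_db = {"P1": "data-1", "P2": "data-2", "P3": "data-3"}  # on disk
buffer_pool = {}   # page id -> contents held in main memory
BUFFER_CAPACITY = 2

def fetch(page_id):
    """The LRM's fetch: serve the page from the buffer if present,
    otherwise read it from the stable database into a free buffer slot."""
    if page_id in buffer_pool:
        return buffer_pool[page_id]           # buffer hit
    if len(buffer_pool) >= BUFFER_CAPACITY:   # no empty buffer: evict one
        buffer_pool.pop(next(iter(buffer_pool)))
    buffer_pool[page_id] = stable_db[page_id]  # read from stable storage
    return buffer_pool[page_id]

fetch("P1"); fetch("P2"); fetch("P3")  # the third fetch evicts a page
print(sorted(buffer_pool))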
Distributed reliability protocols

The distributed versions of these protocols likewise aim to maintain the atomicity and durability of distributed transactions that execute over a number of databases. The protocols address the distributed execution of the begin_transaction, read, write, abort, commit, and recover commands; most of these commands are executed in the same manner as in a centralized system. We assume that at the originating site of a transaction there is a coordinator process, and that at each site where the transaction executes there are participant processes; the distributed reliability protocols are thus implemented between the coordinator and the participants.

Suppose that, during the execution of a distributed transaction, one of the sites involved in the execution fails; we would like the other sites to still be able to terminate the transaction. Recovery protocols, in turn, deal with the procedure that the process (coordinator or participant) at the failed site has to go through to recover its state once the site is restarted.
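The notes do not name a specific commit protocol, but the standard one implemented between a coordinator and its participants is two-phase commit (2PC). A minimal sketch, with in-memory participants standing in for remote sites, follows:

class Participant:
    def __init__(self, name, healthy=True):
        self.name, self.healthy = name, healthy

    def prepare(self):
        # Phase 1: vote YES only if this site can guarantee the commit.
        return self.healthy

    def commit(self):
        print(self.name, "commit")

    def abort(self):
        print(self.name, "abort")

def two_phase_commit(participants):
    # Phase 1 (voting): the coordinator collects a vote from every site.
    if all(p.prepare() for p in participants):
        decision = "commit"   # unanimous YES: global commit
    else:
        decision = "abort"    # any NO (or failure): global abort
    # Phase 2 (decision): the coordinator broadcasts the global decision.
    for p in participants:
        p.commit() if decision == "commit" else p.abort()
    return decision

print(two_phase_commit([Participant("Boston"), Participant("Paris")]))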
Parallel Database Systems

Very large databases are accessed through high numbers of concurrent transactions (for example, performing on-line orders in an electronic store) or through complex queries (for example, decision-support queries). The first kind of access is representative of On-Line Transaction Processing (OLTP) applications, while the second is representative of On-Line Analytical Processing (OLAP) applications. Supporting very large databases efficiently for either OLTP or OLAP can be addressed by combining parallel computing and distributed database management.

Parallel Database System Architectures

Objectives

Parallel database systems combine database management and parallel processing to increase performance and availability. A parallel database system can be loosely defined as a DBMS implemented on a parallel computer. The objectives of parallel database systems are covered by those of distributed DBMSs (performance, availability, extensibility). Ideally, a parallel database system should provide the following advantages:

High performance: obtained through parallel data management, query optimization, and load balancing (load balancing is the ability of the system to divide a given workload equally among all processors).

High availability: because a parallel database system consists of many redundant components, it can substantially increase data availability and fault tolerance.
A parallel database system typically contains three subsystems:

1. Session Manager: it provides support for client interactions with the server and performs the connections and disconnections between the client processes and the two other subsystems; it thus initiates and closes user sessions.

2. Transaction Manager: it receives the client transactions related to query compilation and execution. Depending on the transaction, it activates the various compilation phases, triggers query execution, and returns the results, as well as error codes, to the client application.

3. Data Manager: it provides all the low-level functions needed to run compiled queries in parallel: database operator execution, parallel transaction support, cache management, and so on.

Parallel DBMS Architectures

There are three basic parallel computer architectures, depending on how main memory or disk is shared: shared-memory, shared-disk, and shared-nothing.

1. Shared-disk: any processor has access to any disk unit through the interconnect, but exclusive (non-shared) access to its own main memory. Each processor-memory node is under the control of its own copy of the operating system.

2. Shared-nothing: each processor has exclusive access to its own main memory and disk unit(s). As with shared-disk, each processor-memory-disk node is under the control of its own copy of the operating system, so each node can be viewed as a local site.

Parallel Data Placement

Data placement in a parallel database system exhibits similarities with data fragmentation in distributed databases; we simply use the terms partitioning and partition instead of horizontal fragmentation and horizontal fragment. There are three basic strategies for data partitioning, sketched in code below: round-robin, hash, and range partitioning.

1. Round-robin partitioning is the simplest strategy; it ensures uniform data distribution and enables sequential access to a relation to be done in parallel.

2. Hash partitioning applies a hash function to some attribute, which yields the partition number. This strategy allows exact-match queries on the selection attribute to be processed by exactly one node, while all other queries are processed by all the nodes in parallel.

3. Range partitioning distributes tuples based on value intervals (ranges) of some attribute.
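A minimal sketch of the three partitioning strategies over a hypothetical four-node system:

N = 4  # number of nodes (hypothetical)

def round_robin(tuples):
    """The i-th tuple goes to node i mod N: uniform, order-based."""
    return [(i % N, t) for i, t in enumerate(tuples)]

def hash_partition(tuples, attr):
    """A hash of the partitioning attribute picks the node, so an
    exact-match query on attr touches exactly one node."""
    return [(hash(t[attr]) % N, t) for t in tuples]

def range_partition(tuples, attr, bounds=(10000, 20000, 30000)):
    """Value intervals of attr map to nodes; a range query touches only
    the nodes whose interval overlaps it (the bounds are illustrative)."""
    def node(v):
        return sum(v >= b for b in bounds)
    return [(node(t[attr]), t) for t in tuples]

emps = [{"ENO": "E%d" % i, "SAL": 8000 * i} for i in range(1, 6)]
print(range_partition(emps, "SAL"))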
Parallel query optimization focuses both on intra-operator parallelism (a single operator is distributed among multiple processors) and on inter-operator parallelism (a single query runs on multiple processors, with different operators of the query running on different processors). A parallel query optimizer can be seen as three components: a search space, a cost model, and a search strategy.

Load Balancing

Good load balancing is crucial for the performance of a parallel system: the response time of a set of parallel operators is that of the longest one, so minimizing the time of the longest operator is important for minimizing response time. Balancing the load of the different transactions and queries among the different nodes is also essential to maximize throughput. Solutions to these problems can be obtained at the intra- and inter-operator levels.
Database Clusters

A cluster can have a shared-disk or a shared-nothing architecture. Shared disk requires a special interconnect that provides a shared disk space to all nodes, with provision for cache consistency. Shared nothing can better support database autonomy without the additional cost of a special interconnect, and it can scale up to very large configurations.

UNIT-5 DISTRIBUTED DATABASES

Fundamental Object Concepts and Models

An object DBMS is a system that uses the "object" as its fundamental modeling concept, so that information is represented in the form of objects; all object DBMSs are built around this fundamental concept. An object represents a real entity in the system being modeled. It is represented as a triple (OID, state, interface), where the OID is the object identifier, the state is some representation of the current state of the object, and the interface defines the behavior of the object.

Object Distributed Design

The two important aspects of distribution design are fragmentation and allocation. An object is defined by its state and its methods, so we can fragment the state, the method definitions, and the method implementations; furthermore, the objects in a class extent can also be fragmented and placed at different sites.

1. Horizontal Class Partitioning: given a class C to partition, we create classes C1, ..., Cn, each of which takes the instances of C that satisfy a particular partitioning predicate.

2. Vertical Class Partitioning: vertical fragmentation is considerably more complicated. Given a class C, fragmenting it vertically into C1, ..., Cm produces a number of classes, each of which contains some of the attributes and some of the methods. Thus, each of the fragments is less defined than the original class.

3. Path Partitioning: path partitioning is a concept describing the clustering of all the objects forming a composite object into one partition. A path partition consists of grouping the objects of all the domain classes that correspond to all the instance variables in the subtree rooted at the composite object.
Object Architectural Issues

The preferred architectural model for object DBMSs has been client/server. The unit of communication between the clients and the server is an issue in object DBMSs. Since data are shared by many clients, the management of client cache buffers for data consistency becomes a serious concern. And since objects may be composite or complex, there are possibilities for prefetching component objects when an object is requested.

Alternative Client/Server Architectures

Two main types of client/server architectures have been proposed: 1. object servers and 2. page servers.
2.page servers
•The distinction is partly based on the granularity of •Garbage collection is a problem that arises in object
data that are shipped between the clients and the databases due to reference-based sharing.
servers, and partly on the functionality provided to
the clients and servers. •Indeed, in many object DBMSs,the only way to
delete an object is to delete all references to it.
Object management
Object query processing
•Object management includes tasks such as object
identifier management,pointer swizzling,object •Almost all object query processors and optimizers
migration,deletion of objects tasks at the server that have been proposed to date use techniques
developed for relational systems.
•object identifier management: object
identifiers(OIDs)are system-generated and used to •Consequently,it is possible to claim that distributed
uniquely identify every object. object query processing and optimization techniques
require the extension of centralized object query
•The implementation object identifier has two processing and optimization with the distribution
common solutions,based on either physical or logical approaches(discussed earlier)
identifiers
•Objects can(and usually do)have complex
•The physical identifier(POID)approach equates the structures where by the state of an object references
OID with the physical address of the corresponding another object. Accessing such complex objects
object. The address can be a disk page address involves path expressions
OODBMS vs. ORDBMS

1. In an object-oriented database, the data is stored in the form of objects; in a relational database, data is stored in the form of tables, which contain rows and columns.

2. In an OODBMS, relationships are represented by references via the object identifier (OID); in an ORDBMS, connections between two relations are represented by foreign key attributes.

3. An OODBMS handles larger and more complex data than an RDBMS; an ORDBMS handles comparatively simpler data.

4. In an OODBMS, the data management language is typically incorporated into a programming language such as C++; in relational database systems there are separate data manipulation languages, such as SQL.

5. An OODBMS stores data as objects; an ORDBMS stores data in entries described as tables.