Amrita Vishwa Vidyapeetham,
Mysuru Campus
School of Computing
Advanced Databases
Distributed Databases
Dr.Thilagaraj.T
Assistant Professor
Distributed Database System(DDBS)
• To form a distributed database system (DDBS), the files must
be
• structured,
• logically interrelated, and
• physically distributed across multiple sites.
• Distributed DBMS is defined as
• the software system that permits management of the
DDBS
• makes distribution transparent to the users
Distributed Database System(DDBS)
• Not a collection of files
• Individually store at each node of a computer network
• Access should be a common interface
• Physical distribution creates problems
• Two or more processor shares some form of memory
• primary memory or secondary – shared or tightly coupled
• Shared memory multiprocessor
• Shared disk multiprocessor
Shared-everything and Shared-nothing
• Architecture – Each processor hash primary and secondary
memories as well as peripherals
• Communicate with other processors – high speed (bus or
switch) (Fig. 1.5)
• Quite similar to distributed environment
• OS will control remaining process
Distributed Database System
• Database resides only one node of the network(Fig. 1.6)
• Transmission delay
• Data distributed among number of sites(Fig 1.7)
What constitutes a DDB?
• Connection of database nodes over a computer network.
– Transmit data and commands among sites
• Logical interrelation of the connected databases.
– Essentially information are logically related
• Possible absence of homogeneity among connected nodes.
(LAN, WAN, Topologies)
– Not necessary all nodes be identical
Transparency
• Idea of hiding implementation details from end users
• It offers a lot of flexibility to end user/application developers
• Little or no awareness
• In DDB – Data and software are distributed – Transparencies
Data distribution and replication among distributed
databases
Types of Transparencies
1. Data organization transparency(Distribution or Network)
– Freedom from operational details
– Placement of data in distributed systems
– Local Transparency
• Independent - Location of data
• Location of node
– Naming Transparency
• Named object accessed without additional specification
Types of Transparencies
2. Replication Transparency
– Figure
– Same copy of data objects
– Better availability
– Performance
– Reliability
– Makes users unaware of multiple copies
3. Fragmentation Transparency
– Horizontal Fragmentation
• A subset of tuples in that relations
– Vertical Fragmentation
• Keeps only certain attributes of the relation
Types of Transparencies
• Design Transparency and Execution Transparency
– To freedom from knowing – design
– Through design – Execution happens
Scalability and Partition Tolerance
Expand its capacity without interruption
• Horizontal Scalability
– Expanding no. of nodes in distributed systems
– Possible to distribute loads and data from existing to new nodes
• Vertical Scalability
– Expanding the capacity of individual nodes in the system
– Storage Capacity
– Processing power of node
Reliability
• A system is running at certain time point
• Faults, errors and failures associated with it
• Errors -> Fault
• Provides transactional support through queries
Advantages of Distributed Databases
• Improved ease and flexibility of application development
• Increased availability
• Improved performance
• Easier expansion via scalability
Architecture Models for Distributed DBMSs
Two types of multiprocessor system architecture:
• Shared memory(tightly coupled) architecture:
– Multiple processors share secondary (disk) storage
– Also share primary memory
• Shared disk(loosely coupled) architecture:
– Multiple processors share secondary(disk) storage
– But each has their own primary memory
Shared-nothing architecture
• Every processor has its own primary and secondary(disk)
memory
• No common memory exists
• The processors communicate over high speed interconnection
network(bus or switch)
• It resembles a distributed database computing environment
• Difference in mode of operation
• Symmetry and homogeneity of hardware and OS at each node
• Environment of parallel database
Parallel database management
• Both tightly coupled and loosely coupled architecture used
• Data is shared, not database
• Homogenous distributed database system, the data is
distributed but all servers run the same DBMS software.
• Heterogeneous distributed databases dissimilar sites run under
the control of different DBMSs.
Illustrates a parallel database(shared nothing)
Illustrates a centralized database with distributed
access
A Pure distributed database
General Architecture of Pure Databases
• Global Conceptual Schema(GCS) – Provides network
transparency
• To accommodate potential heterogeneity in the DDB – each
node have its own Local Internal Schema(LIS)
• Local organization of data at each site by Local Conceptual
Schema (LCS).
• LCS, GCS – Fragmentation, replication
• Global query compiler references GCS
• Global query optimizer refer both global and local queries
General Architecture of Pure Databases
• Estimates cost based on response time (CPU, I/O, network
latencies)
• Estimated sizes
• Joins
• Select minimum cost of execution
• Local DBMS have local optimizer, transaction manager,
execution engines, local system catalog
• Global transactional manager – coordinating execution across
multiple sites in conjunction with local transactional manager
Schema architecture of distributed databases
Five level schema architecture in a federated database system for
managing distributed heterogeneous and autonomous databases
FDBS
• Local schema – Conceptual schema of a component database
• Component schema – Translate local schema to Common Data
Model(CDM) – generating mapping to transform commands
• Export schema – Export the data belonging to applications
• Federated schema – global schema or view, result of
integrating all the sharable export schemas
• External schema – defines scheme for a user group or an
application
Overview of Three-Tier Client/Server Architecture
Presentation layer(Client)
• Provides user interface and interacts with the user
• Web interface or forms to client to interface with application
• HTML, XHTML, CSS, Flash, MathML, Scalable Vector
Graphics (SVG), Java, JavaScript, Adobe Flex
• Handles user input, output, navigation
• Static and dynamic web pages
• Web interface – communicates - HTTP
Application layer (business logic)
• Programs application logic
• Clients – queries
• Security checks and identity verification
• Interact with more than one databases or data sources
• ODBC, JDBC, SQL/CLI (Command Line Input)
Database server
• Handles query and update request from the application layer
• Send results
• Query results (XML)
• PSM(Persistent Stored Modules) – Manage database with
procedural language
Autonomy
• It refers to distribution of control, not a data
• Individual DBMS can operate independently
– Whether the component system exchange information
– Whether they can independently execute transactions
– Whether one is allowed to modify them
Requirements of an autonomous systems:
• Local operation of the individual DBMS
– not affected by their participation in multidatabase system
• The manner in which the individual DBMS process queries
and optimize them
– should not be affected by execution of global queries that
access multiple databases
• System consistency or operation
– should not be compromised when individual DBMSs join
or leave the multidatabase groups
Dimensions of Autonomy
• Design autonomy:
– Individual DBMS can use the data models and transaction
management techniques that they prefer
• Communication autonomy:
– DBMS is free to make its own decision and need to provide
information for other DBMS or Software
• Execution autonomy:
– Each DBMS can execute the transactions that are submitted
Autonomy-Alternatives
• Tight integration
– Single-image of entire database is available
– Users can share information resides in multiple databases
– Users Perspective - Data is logically centralized in one
database
– Data manager is implemented to control process and user
requests
Autonomy-Alternatives
• Semiautonomous
– It consists of DBMS operates independently
– Data is locally sharable
– Not fully autonomous – they need to modify to enable them
to exchange information
• Total isolation
• Transactions that access multiple databases is especially
difficult since there is no global control.
Distribution
• Distribution of control
• Distribution dimension of the taxonomy deals with data
• Physical distribution of data over multiple sites
Two classes:
• Client/Server distribution
• Peer-to-peer distribution(or full distribution)
Client/Server distribution
• Concentrates data management duties at servers
• Clients focus on providing the application environment
including user interface
• Communication duties shared between clients and servers
• Client server DBMSs – distributing functionality
• Client server architecture
Peer-to-peer systems
• No distinction of client machines versus servers
• Each machine has full DBMS functionality and communicate
with others to execute queries and transactions
• Also called as fully distributed
Heterogeneity
• Heterogeneity is applied to the network, computer hardware,
operating system and implementation of different developers.
• A key component of the heterogeneous distributed system
client-server environment is middleware.
• Middleware is a set of services that enables application and
end-user to interacts with each other across a heterogeneous
distributed system.
Fragmentation in Distributed DBMS
• Fragmentation is a process of dividing the whole or full
database into various subtables or sub relations so that data can
be stored in different systems.
• The small pieces of sub relations or subtables are
called fragments.
• These fragments are called logical data units and are stored at
various sites.
• It must be made sure that the fragments are such that they can
be used to reconstruct the original relation (i.e, there isn’t any
loss of data).
• In the fragmentation process, let’s say, If a table T is
fragmented and is divided into a number of fragments say T1,
T2, T3….TN.
• The fragments contain sufficient information to allow the
restoration of the original table T.
• This restoration can be done by the use of UNION or JOIN
operation on various fragments. This process is called data
fragmentation.
• All of these fragments are independent which means these
fragments can not be derived from others.
• The users needn’t be logically concerned about fragmentation
is called fragmentation Independence or we can
say fragmentation transparency.
Advantages :
• As the data is stored close to the usage site, the efficiency of
the database system will increase
• Local query optimization methods are sufficient for some
queries as the data is available locally
• In order to maintain the security and privacy of the database
system, fragmentation is advantageous
Disadvantages :
• Access speeds may be very high if data from different
fragments are needed
• If we are using recursive fragmentation, then it will be very
expensive
We have three methods for data fragmenting of a table:
• Horizontal fragmentation
• Vertical fragmentation
• Mixed or Hybrid fragmentation
Horizontal fragmentation
• Horizontal fragmentation refers to the process of dividing a
table horizontally by assigning each row or (a group of rows)
of relation to one or more fragments.
• These fragments are then be assigned to different sides in the
distributed system.
• Some of the rows or tuples of the table are placed in one
system and the rest are placed in other systems.
• The rows that belong to the horizontal fragments are
specified by a condition on one or more attributes of the
relation.
Horizontal fragmentation
• In relational algebra horizontal fragmentation on table T, can
be represented as follows:
• Note that a union operation can be performed on the fragments to construct
table T. Such a fragment containing all the rows of table T is called
a complete horizontal fragment.
Vertical Fragmentation
• Vertical fragmentation refers to the process of decomposing a table
vertically by attributes are columns.
• In this fragmentation, some of the attributes are stored in one
system and the rest are stored in other systems.
• This is because each site may not need all columns of a table. In
order to take care of restoration, each fragment must contain the
primary key field(s) in a table.
• The fragmentation should be in such a manner that we can rebuild
a table from the fragment by taking the natural JOIN operation and
to make it possible we need to include a special attribute
called Tuple-id to the schema.
• For this purpose, a user can use any super key. And by this, the
tuples or rows can be linked together.
Mixed Fragmentation
• The combination of vertical fragmentation of a table followed by further
horizontal fragmentation of some fragments is called mixed or hybrid
fragmentation.
• For defining this type of fragmentation we use the SELECT and the PROJECT
operations of relational algebra.
• In some situations, the horizontal and the vertical fragmentation isn’t enough
to distribute data for some applications and in that conditions, we need a
fragmentation called a mixed fragmentation.
Mixed fragmentation can be done in two different ways:
• The first method is to first create a set or group of horizontal fragments and
then create vertical fragments from one or more of the horizontal fragments.
• The second method is to first create a set or group of vertical fragments and
then create horizontal fragments from one or more of the vertical fragments.
The original relation can be obtained by the combination of JOIN and UNION
operations which is given as follows:
Resource Allocation
The optimality can be defined with respect to two measures
1. Minimal cost
The cost function consists of
1. the cost of storing each Fi at a site Sj
2. the cost of querying Fi at site Sj ,
3. the cost of updating Fi at all sites where it is stored, and
4. the cost of data communication.
The allocation problem, then, attempts to find an allocation scheme that
minimizes a combined cost function.
2. Performance
– The allocation strategy is designed to maintain a performance metric.
– To minimize the response time and to maximize the system
throughput at each site.
Information requirements
Information requirements
Database Information
• To perform horizontal fragmentation, we defined the
selectivity of minterms(single condition or set of simple
predicates to predict data).
• We now need to extend that definition to fragments, and
define the selectivity of a fragment Fj with respect to query qi
• This is the number of tuples of Fj that need to be accessed in
order to process qi .
• Another piece of necessary information on the database
fragments is their size. The size of a fragment Fj is given by
Application Information
• Most of the application-related information is already
compiled during the fragmentation activity, but a few more
are required by the allocation model.
• the number of read accesses that a query qi makes to a
fragment Fj during its execution
• And its counterpart for the update accesses
• These may, for example, count the number of block accesses
required by the query
Events happened in sites o(i)
Site Information
Network Information
Allocation Model
the processing component, PC, consists of three cost factors, the access cost (AC), the
integrity enforcement cost (IE) (communication and processing cost), and the
concurrency control cost (CC)(the activity of co- ordinating concurrent accesses):