Distributed Database Design
Seminary of Distributed and Parallel Database Management Systems
Diego Godoy Gonzlez Cueto
Table of Contents
1.-Design Problem 1.1 What's design? 1.2 Organization of distributed systems
4.- Allocation 4.1 Allocation Problem 4.2 Optimality Measures 4.3 Alternatives 4.4 Information Requirements 4.4.1 Database
1.2.1 Level of sharing
1.2.2 Behavior of access patterns 1.2.3 Level of knowledge
4.4.2 Application
4.4.3 Site 4.4.4 Network 5.- Bottom-Up Strategy 5.1 Data Warehouse Approach 5.2 Design Process 5.3 Approaches 5.4 Relationship Schemes
1.2.4 Framework of Distribution
5.5 Generation Process
2
5.5.1 Matching/Integration/Mapping
What is design?
What is design?
Design is the creation of a plan or convention for the construction of an object or a system (as in architectural blueprints, engineering drawings, business processes, circuit diagrams and sewing patterns).
In order to be functional and optimize or improve a certain characteristic.
4
Design Problem
Design problem of distributed systems: Making decisions about the placement of data and programs across the sites of a computer network.
In DDBMS, the distribution of applications involves:
 Distribution of the DDBMS software  Distribution of applications that run on the database
5
Organization of distributed systems
It has been suggested that the organization of distributed systems can be investigated along three orthogonal dimensions.
Level of sharing
Behavior of access patterns Level of knowledge (on access pattern behavior)
Level of sharing
There are three possibilities in terms of this level:
No Sharing: Each application and its data execute at one site, and there is no communication with any other program or access to any data file at other sites. Level data sharing: All the programs are replicated at all the sites, but data files are not. Data-plus-program sharing: Both data and programs may be shared, meaning that a program at 7 a given site can request a service from another
Behavior of access patterns
It is possible to identify two alternatives in terms of this level. The access patterns of user requests may be: Static: So that they don't change over time. Dynamic: Which can change over time. The relationship between the distributed database design and query processing is established in this dimension.
Level of knowledge
The possibilities within this level are:
Without Information: It is not known by the designer how the users will access the database.
Complete Information: The access patterns can be predicted without deviations.
Partial Information: There are deviations from the predictions.
Framework of Distribution
10
Design Strategies
Top-down approach - Designing systems from scratch - Homogeneous systems - Better for tightly integrated homogeneous DDBMS
Bottom-up approach
- The databases already exist at a number of sites
- The databases should be connected to solve common tasks
11
Top-Down Design Process
Requirements Analysis: Defines the environment of the system. System Requirements (Objectives): Performance, reliability and availability, economics, and expandability (flexibility). Requirements document is input to two parallel activities: - Conceptual Design: Determine entity types and relationships among these entities. Entity Analysis/Functional Analysis
12
Top-Down Design Process
Distribution Design: Takes as inputs the results collected from View Design (Global Conceptual Schema -GCS- and Access Patterns).
In order to design the Local Conceptual Schema (LCS):
Distributing the entities (relations) over the sites of the of the distributed system.
13
Design Schema
14
Distribution of Relations
We first divide them into sub-relations called fragments. Therefore, the distribution design activity splits in two steps: - Fragmentation - Allocation Physical Design: Maps the LCS to the physical storage devices available at the corresponding sites.
Reasons to have observation and monitoring?
15
Distribution Design Issues
Why fragment at all? How should we fragment? How much should we fragment? Is there any way to test the correctness of decomposition?
How should we allocate?
What is the necessary information for fragmentation and allocation?
16
Reasons for Fragmentation
From a distribution viewpoint, there is really no need to fragment data. The important issue is to have an appropriate unit of distribution. A relation is not a suitable unit, because: - Application views are usually subsets of relations. - If the applications that have views defined on a given relation reside at different sites, two 17 alternatives can be followed (one site or some/all of
Fragmentation Difficulties
If the applications have conflicting requirements that prevent decomposition of the relation into mutually exclusive fragments, those applications whose views are defined on more than one fragment may suffer performance degradation. As a result of fragmentation, attributes participating in a dependency may be decomposed into different fragments that might be allocated to different sites. In this case, even the simpler task of checking for dependencies would result in chasing after data in 18 a number of sites.
Fragmentation Alternatives
Relation instances are essentially tables, so the issue is one of finding alternative ways of dividing a table into smaller ones. There are two ways to do so:
Horizontally
Vertically Hybrid/Mixed (Nested)
19
Fragmentation Alternatives
20
Fragmentation Alternatives
- Horizontal
- Vertical
21
Degree of Fragmentation
The degree of fragmentation goes from one of two extremes:
Not to fragment at all.
To fragment to the level of individual tuples (horizontal) or to the level of individual attributes (vertical). It is need to find a suitable level of fragmentation that is a compromise between the two extremes. Such a level can only be defined with respect to the 22
Degree of Fragmentation
How to do so?
In general, the applications need to be characterized with respect to a number of parameters.
According to the values of these parameters, individual fragments can be identified.
23
Correctness Rule of Fragmentation
There are three rules that, together, ensure the database does not undergo semantic change during fragmentation:
Completeness Reconstruction Disjointness
24
Completeness
For a relation instance R decomposed into fragments FR = {R1, R2, . . . ,Rn}, each data item that can be found in R can also be found in one or more of Ris
Ensures that the data in a global relation are mapped into fragments without any loss.
Horizontal  Tuple Vertical - Attribute
25
Reconstruction
For a relation R decomposed into fragments FR = {R1, R2, . . . ,Rn}, it should be possible to define a relational operator such that:
R = Ri,  Ri  FR
Ensures that constraints defined on the data in the form of dependencies are preserved. ( will be different for different forms of fragmentation)
26
Disjointness
For a relation R horizontally decomposed into fragments FR = {R1, R2, . . . ,Rn} and data item di is in Rj, it is not in any other fragment Rk (k  j).
It ensures that the horizontal fragments are disjoint.
If vertically decomposed, primary key attributes are repeated in all it's fragments (for reconstruction). (For these case disjointness is defined only in nonprimary key attributes).
27
Allocation Problem
Assume that there are a set of fragments:
F = {F1, F2, . . . , Fn } And a distributed system consisting of sites: S = {S1, S2 , . . . , Sm } On which a set of applications:
Q = {q1, q2, . . . , qq } is running.
The allocation problem involves finding the optimal distribution of F to S.
28
Optimality Measures
The optimality can be defined with respect of two measures:
Minimal Cost: The cost function consists of the cost of storing each Fi at a site Sj, the cost of querying Fi at site Sj, the cost of updating Fi at all sites where it is stored, and the cost of data communication. (Scheme that minimizes a combined cost function) Performance: The allocation strategy is designed to maintain a performance metric. Two well-known 29
Allocation
Assuming that the database is fragmented properly, one has to decide on the allocation of the fragments to various sites on the network.
If the data is allocated the data can be:
Replicated Maintained as a single copy
30
Allocation Alternatives
The reasons for Replication are:
Reliability: With multiple copies, comes good chances for some copies to be accessible even with system failures. Efficiency of read-only queries: Queries that access the same data item can be executed in parallel (since there are copies in multiple sites). However the execution of an updated query, brings troubles, since the system has to check all the copies 31 were updated correctly.
Allocation Alternatives
Partitioned DB: Non-Replicated database contains fragments that are allocated to sites, and there is only one copy of any fragment on the network.
In case of replication:
The database exists in its entirety at each site (Fully Replicated DB). Fragments are distributed to the sites in such a way that copies of a fragment may reside in multiple sites (Partially Replicated DB).
32
Allocation Alternatives
33
Information Requirements
Too many factors contribute to an optimal design.
The logical organization of the database. The location of the applications. The access characteristics of the applications to the database.
The properties of the computer systems at each site.
All have an influence on distribution decisions. And this complicates the formulation of the distribution problem. 34
Information Requirements
The information needed for distribution design can be divided into four categories:
Database information.
Application information.
Communication network information. (Quantitative nature) Computer system information. (Quantitative nature)
35
Database Information
We now need to define the selectivity of a fragment Fj with respect to query qi. This is the number of tuples of Fj that need to be accessed in order to process qi.
SELi (Fj). Another necessary information on the database fragments is their size. The size of a fragment Fj is given by:
SIZE(Fj) = CARD(Fj)  LENGTH(Fj)
36
Application Information
The two important measures are the number of read accesses that a query qi makes to a fragment Fj during its execution (RRij), and its counterpart for the update accesses (URij).
We also need to define two matrices UM and RM, with elements uij and rij, respectively. A vector O to specifies the originating site of query qi.
37
Site Information
For each computer site, we need to know its storage and processing capacity. (Elaborate functions or by estimates)
The unit cost of storing data at site Sk.
There is also a need to specify the cost of processing one unit of work at site Sk. (The work unit should be identical to that of the RR and UR measures)
38
Network Information
In our model we assume the existence of a simple network where the cost of communication is defined in terms of one frame of data (cost per frame between sites Si and Sj). To enable the calculation of the number of messages, we use the size (in bytes) of one frame.
39
Bottom-Up Design Process
In this case, a number of databases already exist, and the design task involves integrating them into one database.
The process consists of integrating local databases with their (local) schemes into a global database with its global conceptual schema (GCS). Database integration can be either physical or logical.
The integration is aided by extract-transform-load (ETL) tools that enable extraction of data from
40
Data Warehouse Approach
41
Approaches
GCS Defined first: In which case the bottom-up design involves mapping LCSs to this schema. (This is the case in data warehouses)
GCS defined as an integration of parts of LCSs: In this case, the bottom-up design involves both the generation of the GCS and the mapping of individual LCSs to this GCS.
42
Relationship of Schemes
If the GCS is defined up-front, the relationship between the GCS and the LCS can be of two fundamental types:
Local-as-View (LAV): The GCS definition exists, and each LCS is treated as a view definition over it. (The results are constrained by the objects in the local DBMSs).
Global-as-View (GAV): The GCS is defined as a set of views over the LCSs. (The query results are constrained to the set of 43 objects that are defined in the GCS).
GAV and LAV Mapping
44
Database Generation Process
Then the intermediate schemes are used to generate a GCS.
Schema Matching: to determine the syntactic and semantic correspondences among the translated LCS elements. Schema Integration: of the common schema elements into a global conceptual (mediated) schema if one has not yet been defined. Schema Mapping: determines how to map the elements of each LCS to the other elements of the
45
Database Generation Process
46
Scheme Heterogeneity
Structural conflicts occur in four possible ways:
Type conflicts: when the same object is represented by an attribute in one schema and by an entity (relation) in another Dependency conflicts: when different relationship modes are used to represent the same thing in different schemes. Key conflicts: when different candidate keys are available and different primary keys are selected in different schemes.
47
Integration Methodologies
This step is only necessary if a GCS has not already been defined and matching was performed on individual LCSs.
Binary: involve the manipulation of two schemes at a time. These can occur in a stepwise (ladder) fashion or in a purely binary fashion, where each schema is integrated with one other.
Nary: offers more flexibility (more information is available) and is more general (the number of schemes can be varied depending on the integrators48 preferences).
Integration Methodologies
Binary
Nary
49
Mapping Methodologies
Once a GCS is defined, it is necessary to identify how the data from each of the local databases (source) can be mapped to GCS (target) while preserving semantic consistency (as defined by both the source and the target). Although schema matching has identified the correspondences between the LCSs and the GCS, it may not have identified explicitly how to obtain the global database from the local ones.
50
Mapping Methodologies
Mapping Creation: the process of creating explicit queries that map data from a local database to the global data.
Mapping Maintenance: the detection and correction of mapping inconsistencies resulting from schema evolution.
51
Data Cleaning
Errors in source databases can always occur, requiring cleaning in order to correctly answer user queries. Cleaning is performed as the global database is created. (Warehouse) The process needs to be performed during query processing when data are returned from the source databases. (Integration Systems)
52