Distributed DBMSs - Concepts and Design
1
Chapter 24 - Objectives
• Concepts.
• Advantages and disadvantages of
  distributed databases.
• Functions and architecture for a
  DDBMS.
• Distributed database design.
• Levels of transparency.
• Comparison criteria for DDBMSs.
                                 2
Concepts
Distributed Database
   A logically interrelated collection of
   shared data (and a description of this
   data), physically distributed over a
   computer network.
Distributed DBMS
   Software system that permits the
   management of the distributed database
   and makes the distribution transparent to
   users.
                                           3
    Concepts
• Collection of logically-related shared data.
• Data split into fragments.
• Fragments may be replicated.
• Fragments/replicas allocated to sites.
• Sites linked by a communications network.
• Data at each site is under control of a DBMS.
• DBMSs        handle      local    applications
  autonomously.
• Each DBMS participates in at least one
  global application.
                                                   4
Distributed DBMS
                   5
Distributed Processing
  A centralized database that can be
  accessed over a computer network.
                                   6
  Parallel DBMS
  A DBMS running across multiple processors
  and disks designed to execute operations in
  parallel, whenever possible, to improve
  performance.
• Based on premise that single processor
  systems can no longer meet requirements
  for cost-effective scalability, reliability, and
  performance.
• Parallel DBMSs link multiple, smaller
  machines to achieve same throughput as
  single,    larger machine,      with greater
  scalability and reliability.
                                                     7
Parallel DBMS
• Main architectures for parallel DBMSs
  are:
  – Shared memory,
  – Shared disk,
  – Shared nothing.
                                          8
Parallel DBMS
                (a) shared
                memory
                (b) shared disk
                (c) shared
                nothing
                              9
Advantages of DDBMSs
• Reflects organizational structure
• Improved     shareability   and   local
  autonomy
• Improved availability
• Improved reliability
• Improved performance
• Economics
• Modular growth
                                        10
Disadvantages of DDBMSs
•   Complexity
•   Cost
•   Security
•   Integrity control more difficult
•   Lack of standards
•   Lack of experience
•   Database design more complex
                                       11
Types of DDBMS
• Homogeneous DDBMS
• Heterogeneous DDBMS
                        12
Homogeneous DDBMS
• All sites use same DBMS product.
• Much easier to design and manage.
• Approach       provides   incremental
  growth      and    allows   increased
  performance.
                                          13
Heterogeneous DDBMS
• Sites may run different DBMS products,
  with possibly different underlying data
  models.
• Occurs when sites have implemented their
  own    databases     and     integration is
  considered later.
• Translations required to allow for:
  – Different hardware.
  – Different DBMS products.
  – Different hardware and     different   DBMS
    products.
• Typical solution is to use gateways.
                                                  14
Open Database Access and Interoperability
• Open Group formed a Working Group to
  provide specifications that will create a
  database infrastructure environment where
  there is:
   – Common SQL API that allows client applications to be
     written that do not need to know vendor of DBMS they are
     accessing.
   – Common database protocol that enables DBMS from one
     vendor to communicate directly with DBMS from another
     vendor without the need for a gateway.
   – A common network protocol that allows communications
     between different DBMSs.
                                                            15
Open Database Access and Interoperability
• Most ambitious goal is to find a way
  to enable transaction to span DBMSs
  from different vendors without use of
  a gateway.
• Group has now evolved into DBIOP
  Consortium and are working in
  version 3 of DRDA (Distributed
  Relational Database Architecture)
  standard.
                                            16
Multidatabase System (MDBS)
  DDBMS in which each site maintains
  complete autonomy.
• DBMS that resides transparently on top of
  existing database and file systems and
  presents a single database to its users.
• Allows users to access and share data
  without requiring physical database
  integration.
• Unfederated MDBS (no local users) and
  federated MDBS.
                                              17
Overview of Networking
  Network - Interconnected collection
  of autonomous computers, capable of
  exchanging information.
• Local Area Network (LAN) intended for
  connecting computers at same site.
• Wide Area Network (WAN) used when
  computers or LANs need to be connected over
  long distances.
• WAN relatively slow and less reliable than
  LANs. DDBMS using LAN provides much
  faster response time than one using WAN.
                                                18
Overview of Networking
                         19
Functions of a DDBMS
• Expect DDBMS to have at least the
  functionality of a DBMS.
• Also to have following functionality:
  –   Extended communication services.
  –   Extended Data Dictionary.
  –   Distributed query processing.
  –   Extended concurrency control.
  –   Extended recovery services.
                                         20
Reference Architecture for DDBMS
• Due     to   diversity,  no   accepted
  architecture equivalent to ANSI/SPARC
  3-level architecture.
• A reference architecture consists of:
  –   Set of global external schemas.
  –   Global conceptual schema (GCS).
  –   Fragmentation schema and allocation schema.
  –   Set of schemas for each local DBMS conforming
      to 3-level ANSI/SPARC.
• Some    levels may    be    missing,
  depending on levels of transparency
  supported.
                                                  21
Reference Architecture for DDBMS
                                   22
Reference Architecture for MDBS
• In DDBMS, GCS is union of all local
  conceptual schemas.
• In FMDBS, GCS is subset of local
  conceptual schemas (LCS), consisting
  of data that each local system agrees
  to share.
• GCS of tightly coupled system
  involves integration of either parts of
  LCSs or local external schemas.
• FMDBS with no GCS is called loosely
  coupled.                               23
Reference Architecture for Tightly-Coupled
FMDBS
                                             24
Components of a DDBMS
                        25
Distributed Database Design
• Three key issues:
  – Fragmentation,
  – Allocation,
  – Replication.
                              26
Distributed Database Design
  Fragmentation
    Relation may be divided into a number of
    sub-relations, which are then distributed.
  Allocation
   Each fragment is stored at site with
   “optimal” distribution.
  Replication
   Copy of fragment may be maintained at
   several sites.
                                             27
Fragmentation
• Definition and allocation of fragments
  carried out strategically to achieve:
  – Locality of Reference.
  – Improved Reliability and Availability.
  – Improved Performance.
  – Balanced Storage Capacities and Costs.
  – Minimal Communication Costs.
• Involves analyzing most important
  applications,          based          on
  quantitative/qualitative information.
                                             28
Fragmentation
• Quantitative information may include:
  – frequency with which an application is
    run;
  – site from which an application is run;
  – performance criteria for transactions and
    applications.
• Qualitative information may include
  transactions that are executed by
  application, type of access (read or
  write), and predicates of read
  operations.
                                                29
Data Allocation
• Four alternative strategies regarding
  placement of data:
   – Centralized,
   – Partitioned (or Fragmented),
   – Complete Replication,
   – Selective Replication.
                                          30
Data Allocation
Centralized: Consists of single database and
  DBMS stored at one site with users
  distributed across the network.
Partitioned: Database partitioned into disjoint
  fragments, each fragment assigned to one
  site.
Complete       Replication:      Consists      of
  maintaining complete copy of database at
  each site.
Selective    Replication:     Combination      of
  partitioning, replication, and centralization.
                                                    31
Comparison     of   Strategies   for   Data
Distribution
                                          32
Why Fragment?
• Usage
  – Applications work with views rather than
    entire relations.
• Efficiency
  – Data is stored close to where it is most
    frequently used.
  – Data that is not needed by local
    applications is not stored.
                                           33
Why Fragment?
• Parallelism
  – With fragments as unit of distribution,
    transaction can be divided into several
    subqueries that operate on fragments.
• Security
  – Data not required by local applications is
    not stored and so not available to
    unauthorized users.
                                             34
Why Fragment?
• Disadvantages
  – Performance,
  – Integrity.
                   35
Correctness of Fragmentation
• Three correctness rules:
  – Completeness,
  – Reconstruction,
  – Disjointness.
                               36
Correctness of Fragmentation
Completeness
    If relation R is decomposed into fragments R1,
    R2, ... Rn, each data item that can be found in R
    must appear in at least one fragment.
Reconstruction
• Must be possible to define a relational
  operation that will reconstruct R from the
  fragments.
• Reconstruction for horizontal fragmentation
  is Union operation and Join for vertical .
                                                    37
Correctness of Fragmentation
Disjointness
• If data item di appears in fragment Ri, then it
  should not appear in any other fragment.
• Exception: vertical fragmentation, where
  primary key attributes must be repeated to
  allow reconstruction.
• For horizontal fragmentation, data item is a
  tuple.
• For vertical fragmentation, data item is an
  attribute.
                                               38
Types of Fragmentation
• Four types of fragmentation:
   – Horizontal,
   – Vertical,
   – Mixed,
   – Derived.
• Other possibility is no fragmentation:
   – If relation is small and not updated
     frequently, may be better not to fragment
     relation.                               39
Horizontal and Vertical Fragmentation
                                        40
Mixed Fragmentation
                      41
Horizontal Fragmentation
• Consists of a subset of the tuples of a
  relation.
• Defined using Selection operation of
  relational algebra:
          p(R)
• For example:
          P1 =  type=‘House’(PropertyForRent)
          P2 =  type=‘Flat’(PropertyForRent)
                                                 42
Horizontal Fragmentation
• This strategy is determined by looking at
  predicates used by transactions.
• Involves finding set of minimal (complete
  and relevant) predicates.
• Set of predicates is complete, if and only if,
  any two tuples in same fragment are
  referenced with same probability by any
  application.
• Predicate is relevant if there is at least one
  application    that   accesses      fragments
  differently.                                  43
 Vertical Fragmentation
• Consists of a subset of attributes of a
  relation.
• Defined using Projection operation of
  relational algebra:
           a1, ... ,an(R)
• For example:
    S1 = staffNo, position, sex, DOB, salary(Staff)
    S2 = staffNo, fName, lName, branchNo(Staff)
• Determined by establishing affinity of
  one attribute to another.
                                                       44
Mixed Fragmentation
• Consists of a horizontal fragment that
  is vertically fragmented, or a vertical
  fragment      that    is     horizontally
  fragmented.
• Defined using Selection and Projection
  operations of relational algebra:
          p(a1, ... ,an(R))   or
         a1, ... ,an(σp(R))
                                          45
Example - Mixed Fragmentation
  S1 = staffNo, position, sex, DOB, salary(Staff)
  S2 = staffNo, fName, lName, branchNo(Staff)
  S21 =  branchNo=‘B003’(S2)
  S22 =  branchNo=‘B005’(S2)
  S23 =  branchNo=‘B007’(S2)
                                                     46
Derived Horizontal Fragmentation
• A horizontal fragment that is based on
  horizontal fragmentation of a parent
  relation.
• Ensures that fragments that are
  frequently joined together are at same
  site.
• Defined using Semijoin operation of
  relational algebra:
    Ri = R   F   Si,     1iw         47
Example - Derived Horizontal Fragmentation
  S3 =  branchNo=‘B003’(Staff)
  S4 =  branchNo=‘B005’(Staff)
  S5 =  branchNo=‘B007’(Staff)
Could use derived fragmentation for
 Property:
  Pi = PropertyForRent branchNo Si, 3  i  5
                                             48
Derived Horizontal Fragmentation
• If relation contains more than one
  foreign key, need to select one as
  parent.
• Choice      can    be  based    on
  fragmentation used most frequently
  or fragmentation with better join
  characteristics.
                                       49
Distributed Database Design Methodology
1. Use normal methodology to produce a design for
   the global relations.
2. Examine topology of system to determine where
   databases will be located.
3. Analyze most important transactions and identify
   appropriateness        of       horizontal/vertical
   fragmentation.
4. Decide which relations are not to be fragmented.
5. Examine relations on 1 side of relationships and
   determine a suitable fragmentation schema.
   Relations on many side may be suitable for
   derived fragmentation.
                                                         50
Transparencies in a DDBMS
• Distribution Transparency
  – Fragmentation Transparency
  – Location Transparency
  – Replication Transparency
  – Local Mapping Transparency
  – Naming Transparency
                                 51
Transparencies in a DDBMS
• Transaction Transparency
  – Concurrency Transparency
  – Failure Transparency
• Performance Transparency
  – DBMS Transparency
• DBMS Transparency
                               52
Distribution Transparency
• Distribution transparency allows user to
  perceive database as single, logical entity.
• If DDBMS exhibits distribution transparency,
  user does not need to know:
  – data is fragmented (fragmentation transparency),
  – location of data items (location transparency),
  – otherwise call this local mapping transparency.
• With replication transparency, user                  is
  unaware of replication of fragments .
                                                        53
Naming Transparency
• Each item in a DDB must have a unique
  name.
• DDBMS must ensure that no two sites
  create a database object with same name.
• One solution is to create central name
  server. However, this results in:
  – loss of some local autonomy;
  – central site may become a bottleneck;
  – low availability; if the central site fails, remaining
    sites cannot create any new objects.
                                                         54
Naming Transparency
• Alternative solution - prefix object with
  identifier of site that created it.
• For example, Branch created at site S1 might be
  named S1.BRANCH.
• Also need to identify each fragment and its
  copies.
• Thus, copy 2 of fragment 3 of Branch created at
  site    S1     might      be    referred to  as
  S1.BRANCH.F3.C2.
• However, this results in loss of distribution
  transparency.
                                                55
Naming Transparency
• An approach that resolves these
  problems uses aliases for each
  database object.
• Thus, S1.BRANCH.F3.C2 might be
  known as LocalBranch by user at site
  S1.
• DDBMS has task of mapping an alias to
  appropriate database object.
                                      56
Transaction Transparency
• Ensures that all distributed transactions
  maintain distributed database’s integrity and
  consistency.
• Distributed transaction accesses data stored
  at more than one location.
• Each transaction is divided into number of
  subtransactions, one for each site that has to
  be accessed.
• DDBMS must ensure the indivisibility of both
  the global transaction and each of the
  subtransactions.                             57
Example - Distributed Transaction
• T prints out names of all staff, using
  schema defined above as S1, S2, S21,
  S22,    and    S23.     Define     three
  subtransactions TS3, TS5, and TS7 to
  represent agents at sites 3, 5, and 7.
                                         58
Concurrency Transparency
• All transactions must execute independently and
  be logically consistent with results obtained if
  transactions executed one at a time, in some
  arbitrary serial order.
• Same fundamental principles as for centralized
  DBMS.
• DDBMS must ensure both global and local
  transactions do not interfere with each other.
• Similarly, DDBMS must ensure consistency of all
  subtransactions of global transaction.
                                                 59
Classification of Transactions
• In   IBM’s    Distributed Relational
  Database Architecture (DRDA), four
  types of transactions:
   – Remote request
   – Remote unit of work
   – Distributed unit of work
   – Distributed request.
                                     60
Classification of Transactions
                                 61
Concurrency Transparency
• Replication makes concurrency more
  complex.
• If a copy of a replicated data item is
  updated, update must be propagated to all
  copies.
• Could propagate changes as part of original
  transaction, making it an atomic operation.
• However, if one site holding copy is not
  reachable, then transaction is delayed until
  site is reachable.
                                             62
Concurrency Transparency
• Could limit update propagation to
  only those sites currently available.
  Remaining sites updated when they
  become available again.
• Could allow updates to copies to
  happen asynchronously, sometime
  after the original update. Delay in
  regaining consistency may range
  from a few seconds to several hours.
                                          63
Failure Transparency
• DDBMS must ensure atomicity and
  durability of global transaction.
• Means ensuring that subtransactions of
  global transaction either all commit or all
  abort.
• Thus, DDBMS must synchronize global
  transaction      to     ensure    that    all
  subtransactions          have      completed
  successfully before recording a final
  COMMIT for global transaction.
• Must do this in presence of site and network
  failures.
                                              64
Performance Transparency
• DDBMS must perform as if it were a
  centralized DBMS.
  – DDBMS       should     not   suffer   any
    performance      degradation     due     to
    distributed architecture.
  – DDBMS should determine most cost-
    effective strategy to execute a request.
                                                  65
Performance Transparency
• Distributed Query Processor (DQP)
  maps data request into ordered
  sequence of operations on local
  databases.
• Must       consider       fragmentation,
  replication, and allocation schemas.
• DQP has to decide:
  – which fragment to access;
  – which copy of a fragment to use;
  – which location to use.
                                             66
Performance Transparency
• DQP produces execution strategy
  optimized with respect to some cost
  function.
• Typically, costs associated with a
  distributed request include:
  – I/O cost;
  – CPU cost;
  – communication cost.
                                    67
Performance Transparency - Example
Property(propNo, city)   10000 records in London
Client(clientNo,maxPrice)     100000    records    in
  Glasgow
Viewing(propNo, clientNo)     1000000    records   in
  London
SELECT p.propNo
FROM Property p INNER JOIN
(Client c INNER JOIN Viewing v ON c.clientNo =
  v.clientNo)
  ON p.propNo = v.propNo
WHERE p.city=‘Aberdeen’ AND c.maxPrice > 200000;
                                                   68
Performance Transparency - Example
Assume:
• Each tuple in each relation is 100
  characters long.
• 10 renters with maximum price
  greater than £200,000.
• 100 000 viewings for properties in
  Aberdeen.
• Computation       time    negligible
  compared to communication time.      69
Performance Transparency - Example
                                     70
Date’s 12 Rules for a DDBMS
0.    Fundamental Principle
     To the user, a distributed system should look
     exactly like a nondistributed system.
1.    Local Autonomy
2.    No Reliance on a Central Site
3.    Continuous Operation
4.    Location Independence
5.    Fragmentation Independence
6.    Replication Independence
                                                71
Date’s 12 Rules for a DDBMS
7.    Distributed Query Processing
8.    Distributed Transaction Processing
9.    Hardware Independence
10.   Operating System Independence
11.   Network Independence
12.   Database Independence
• Last four rules are ideals.
                                           72