W3
Schema Matching and Mapping
II Architecture: Virtualization Layer Approach
II Architecture: A Data Warehousing Approach
Information Integration Key Challenges
•   Managing different platforms:
     •   Identifying relevant information from multiple data sources
     •   Logical specification of data desired
     •   Handle dynamic arrival and departure of data sources
•   Automated data transformations:
     •   Data curation,
     •   Defining and working with data quality
• Schema and Data heterogeneity:
   • Integrating diverse information from the recorded state of the business within
     cost and skill constraints
   • schema mapping, data mapping, information discovery, …
• Uniform (or source specific) query access to data sources
• Distributed query processing and optimization
• Consolidating, transforming, and mining data for analysis.
Can AI help in Information Integration cycle?
     • Reduce the effort needed to set up an data curation, integration tasks.
     • Enable the system to perform gracefully with uncertainty (e.g., on the
       web/noisy source/…)
       Steps involved in
Data Source Selection, Data Acquisition,
Understanding, Cleansing, Transforming
                         Data Source Selection
   Once you have the set of questions for querying integrated
    data, the next step is to identify the data sources
   Must evaluate the data sources to know which to select
    – Is it complete? In terms of coverage? In terms of missing data?
    – Is it correct?
    – Is it clean?
    – Is it structured? If not, can you extract structured data out of it easily and
      accurately or expose text document as an attribute?
    – Is it up to date? How quickly is it changing?
    – Hard to answer some of these questions until you have acquired some data
      from the sources
                        Data Acquisition
   Then need to acquire the data from the sources
    – This is highly non-trivial
    – Types of data sources:
        Structured data sources: relational databases, XML files, …
        Textual data: emails, memos, documents, etc.
        Web data: need to crawl, maybe can get a dump, API may exist
        Other types of data: excel, pdf, images, etc.
    – Being able to extract data from such sources is non-trivial, time
      consuming
    – Build connectors, wrappers,…
                      Data Acquisition
   Then need to acquire the data from the sources
    – Some of the data come from within the company
        Need to go talk with data owner, convincing him/her, get help
        Can take months – acquisition due to legal and compliance
         reasons.
    – Some of the data come from outside the company
        Public data, Open source data, Paid data
          – Pros: clean, quick
          -- Cons: untrustworthy, noisy, expensive
        Understanding, Cleaning, & Transformation
                          Do These for Each Source, then Integrate
   For data from each source
     – Data problems
          –   missing values
          –   incorrect values, illegal values, outliers
          –   synonyms
          –   misspellings
          –   conflicting data (eg, age and birth year)
          –   wrong value formats
          –   variations of values
          –   duplicate tuples
     – understand current vs ideal schema/data
          – Attribute values profiling, relationship between attributes, integrity constraints…
     – Tools exist for data profiling, relationship discovery,…
     – compare the two and identify possible problems
          – violations of constraints for the ideal schema/data
     – clean and transform
     – possibly enrich/enhance
   Integrate data from the multiple sources
     – schema matching/merging, data matching/merging
         –    misspelt names
         –    violating constraints (key, uniqueness, foreign key, etc)
           The Schema Matching Problem                                                      Query
                                                                                   Global Schema
 Given two input schemas in any data model and, optionally, auxiliary
 information, compute a mapping between schema elements of the two
 input schemas that passes user validation.
                                                                                                    in
                                                                                 Source-1           Source-2
                                                    BookInfo
       Books                                        ID              char(15) Primary Key
                                                    AuthorID        integer references AuthorInfo
           ISBN       char(15) key                  BookTitle       varchar(150)
           Title      varchar(100)                  ListPrice       float
                                                    DiscountPrice   float
           Author varchar(50)
           MarkedPrice float
                                                    AuthorInfo
 πISBN,Title,MarkedPrice(Books)                     AuthorID    integer   key
 = πID,BookTitle,ListPrice(BookInfo)                LastName    varchar(25)
 πAuthor(Books) = πFirstName+LastName(AuthorInfo)   FirstName   varchar(25)
Books= πID, BookTitle, FirstName+LastName, ListPrice(BookInfo ⋈ AuthorInfo)
              Inputs to Matching Technique
• Schema structure          • Constraints: data type, keys, nullability
• Attribute names
• Synonyms                     •    Acronyms
  Code = Id = Num = No             ◦ PO = Purchase Order
  Zip=PIN = Postal [code]          ◦ UOM = Unit of Measure
  Node = Server                    ◦ SS# = Social Security Number
• Data instances
  Attributes match if they have similar instances or value
  distributions
Schema-based hybrid matching algorithm
        Based on combining multiple approaches that use only schema (no instances)
Input: Two schema graphs
Output: Similarity matrix and candidate mapping
• Linguistic matching: compare attributes based on names, data types, etc
    •   Use a thesaurus to help match names by identifying short-forms (Qty for Quantity),
        acronyms (UoM for UnitOfMeasure) and synonyms (Bill and Invoice). The result is a
        linguistic similarity coefficient, Lsim, between each pair of elements.
• Structure matching: compare elements based on the similarity of their contexts or
   vicinities. The result is a structural similarity coefficient, Ssim, for each pair of elements.
• Compute the Weighted similarity: Wsim = w * Lsim + (1 – w) * Ssim
• Mapping generation: a mapping is created by choosing pairs of schema
  elements with maximal weighted similarity.
                                 Example
            PO                              PurchaseOrder
                   POLines             Items
 POShipTo                                                DeliverTo
                   Item              Item
                                                  Name    Address
Name        City
       Street
                          Line      ItemNumber           City Street
                          UoM      UnitOfMeasure
                          Qty          Quantity
                           Linguistic Matching
• Tokenization of names
   • PurchaseOrder  purchase + order
• Expansion of acronyms
   • UOM  unit + of + measure
• Clustering based on keywords and data-types
   • Street, City, POAddress  Address
• Linguistic similarity
   • Pair-wise comparison of elements that belong to the same cluster
   • Token similarity = f(string matching, synonyms score)
   • Token set similarity = average (best matching token similarity)
• Thesaurus: acronymns, synonyms, stop words and categories
Structure Matching
                             PO                                           PurchaseOrder
                                     POLines                        Items
            POShipTo                                                                             DeliverTo
                                     Item                        Item
                                                                                     Name         Address
         Name                 City
                    Street                    Line              ItemNumber
                                                                                                 City Street
                                              UoM           UnitOfMeasure
                                              Qty                       Quantity
Tree Match Algorithm (Bottom-up Traversal)
•   Atomic elements (leaves) are similar if
                                                                    •    Mutually dependent formulation
     •   Linguistically and data-type similar
                                                                          •    Leaves determine internal node similarity
     •   Their contexts, i.e., ancestors, are similar
                                                                          •    Similarity of internal nodes leads to increase in
•   Compound elements (non-leaves) are similar if
                                                                               leaf similarity
     •   Linguistically similar
     •   Elements in their context, i.e., subtrees rooted at the elements, are similar
Collective Schema Matching
                                     allcars.com
                                            craigslist auto
    [He+, SIGMOD’03]: Build mediated schema for a domain by clustering elements in
    multiple schemas
                                                                  craigslist auto
    Schema Mapping
•    Global schema defined in terms of sources
     (global schema centric or Global-As-View                     Query
     (GAV))
     •  Query reformulation easier
     • Any change in sources, needs change in             Global Schema
       global schema
     • Global relations cannot model any information
       not present in at least one source.
•    Sources defined in terms of global schema
     (source-centric or Local-As-View (LAV))
                                                       Source-1           Source-2
         • High modularity and extensibility (if the
           global schema is well designed, when a
           source changes, only its definition is
           affected)
         • Query reformulation complex
         • It allows to add a source to the system
           independently of other sources.
             Example
Example taken from Dr. Subbarao Kambhampati’s lecture notes.
Global-as-View (GAV)
 Q1: The ‘BookStore.com’ is an aggregator that provides a single stop solution to their clients
for buying books. The business has agreement with three suppliers, and the schema of each
supplier is given below. Note, each supplier may have a different price for the same book. The
primary key in each schema is underlined.
Supplier1(ISBN, Book_Name, Book_Price)
Supplier2(ISBN, BTitle, BPrice)
Supplier3(ISBN, Book_Title, Price)
a. Design the global schema of the BookStore.com where (1) the client can create his/her
   profile (Name, Home#, St-Name, Area, City, PIN, Mobile), (2) the client can view/search all
   his/her current and past transactions (Order#, Date, ISBN, Book-Title, Price, Supplier-
   Name), and (3) the client can search the books availability by <title> and/or <ISBN> and can
   also view the supplier name. (3)
b. Write the schema mapping between the global schema (obtained in part (a)) and local
schemas using GAV (2)
c. The BookStore.com decides to show the supplier name selling the book at the lowest price
among all, that is, 1 record per ISBN unless more than one supplier has the same lowest price.
this case, it displays information of all suppliers. For this constraint, design the schema and its
mapping to global schema (obtained in part (a)).
Local-as-View (LAV)
              GAV                        vs.             LAV
•   Not modular                            •   Modular--adding new sources is easy
     – Addition of new sources
       changes the mediated schema
                                           •   Very flexible--power of the entire
•   Can be awkward to write mediated           query language available to describe
    schema without loss of information         sources
•   Query reformulation easy               •   Reformulation is hard
     – reduces to view unfolding                – Involves answering queries only
       (polynomial)
                                                  using views
     – Can build hierarchies of
       mediated schemas
                                           •   Best when
•   Best when
                                                – Many, relatively unknown data
     – Few, stable, data sources                  sources
     – well-known to the mediator (e.g.         – possibility of addition/deletion of
       corporate integration)                     sources
         • Garlic, TSIMMIS,                         • Information Manifold,
           HERMES                                     InfoMaster, Emerac, Havasu
Clio: Schema Discovery and Mapping for Integration
 Clio Grows Up: From Research Prototype to Industrial Tool
CLIO Architecture
                    (SQL/XQuery/XSLT/…)