PRINCIPLES OF
DATA INTEGRATION
Seyed Ahmad Seyed Mousavi
What is Data Integration?
Provide uniform access to data available in multiple,
autonomous, heterogeneous and distributed data sources
2
Data Integration: A Higher-level Abstraction
Query Independence of:
• source & location
Mediated Schema • data model, syntax
• semantic variations
•…
Semantic
Mappings
S1 S2 S3
<cd> <title> The best of … </title>
SSN
123-45-6789
234-56-7890
Name
Charles
Dan
…
Category
undergrad
grad
…
SSN
123-45-6789
123-45-6789
234-56-7890
CID
CSE444
CSE444
CSE142
… <artist> Carreras </artist>
<artist> Pavarotti </artist>
…
… <artist> Domingo </artist>
CID Name Quarter
CSE444 Databases fall <price> 19.95 </price>
CSE541 Operating systems winter
</cd>
3
Applications of Data Integration
Business
Science
Government
The Web
Pretty much everywhere
4
Application Area : Science
Sequenceable Structured
Phenotype Gene
Entity Vocabulary
Nucleotide
Protein
Sequence
Swiss-
OMIM HUGO GO
Prot
Gene- Locus-
Entrez
Clinics Link
Hundreds of biomedical data sources available; growing rapidly!
5
The Deep Web
Millions of high quality HTML forms out there
Each form has its own special interface
Hard to explore data across sites.
Goal (for some domains):
A single interface into a multitude of deep-web
sources.
6
Create a single site to search for jobs/rentals/… 7
Goals of Data Integration
Uniform query access to a set of data sources
Handle:
Scale of sources: from tens to millions
Heterogeneity
Autonomy
Semi-structured
Unstructured
8
Different Levels of Integration
9
Why is it Hard?
Distributed query processing
Schema heterogeneity
Convincing people to share
Security, privacy and performance implications.
10
Data integration architectures
Data warehousing:
integrate by bringing the data into a single
physical warehouse
Virtual data integration:
leave the data at the sources and access it at
query time.
11
Advantages of Mediated Schema approache
Data Freshness (low latency - almost realtime)
Higher Agility
Less costlier :
Lot of infrastructure cost can be saved since data
localization not required
12
Disadvantages of Mediated Schema approaches
Semantic conflicts :
The meaning of "net profit" can be different in
different systems
Slower query response:
due to network bandwidth
workload on the source system
13
Virtual Data Integration Architecture
Leave the data in the sources.
When a query comes in:
Determine the relevant sources to the query
Break down the query into sub-queries for the
sources.
Get the answers from the sources, and
combine them appropriately.
Data is fresh
14
Virtual Data Integration Architecture(2)
Mediated Schema
or Warehouse Query reformulation/
Query over materialized data
Wrapper / Wrapper / Wrapper / Wrapper /
Extractor Extractor Extractor Extractor
RDBMS 1 RDBMS 2
HTML1 XML1
15
Virtual Data Integration Architecture(3)
16
Example
Movie (title ,director ,year, genre )
Actors ( title , actor )
Plays (movie ,location ,startTime)
Reviews (title, rating ,description )
S1 S2 S3 S4 S5
Movies (name, Cinemas (place , CinemasInNYC CinemasInSF Reviews (title,
actors ,director, movie, start) (cinema ,title, (location ,movie, date ,grade ,
genre) startTime) startingTime) review)
17
Wrappers
Sources export data in different formats
Wrappers are custom-built programs that
transform data from the source native format to
something acceptable to the mediator
XML
HTML
<book>
<b> Introduction to DB </b> <title> Introduction to DB </title>
<i> Phil Bernstein </i> <author> Phil Bernstein </author>
<i> Eric Newcomer </i> <author> Eric Newcomer </author>
Addison Wesley, 1999 <publisher> Addison Wesley </publisher>
<year> 1999 </year>
</book>
18
18
Wrappers(2)
Complexity of wrapper depends on nature of data
source
Maintenance problems – have to change if source
interface changes
19
Data Source Catalog
Contains meta-information about sources
Logical source contents (books, new cars)
Source capabilities (can answer SQL queries)
Source completeness (has all books)
20
Query Reformulation
Users pose queries over the mediated schema
Reformulation:
Queries over the mediated schema have to be
rewritten as queries over the source schemas
How would we do the reformulation?
21
Approachs for Schema Mapping
Q
Mediated Schema
GAV LAV GLAV
Q’ Q’ Q’
Source Source Source
22
Approachs for Schema Mapping(2)
Global-as-view (GAV):
express the mediated schema relations as a set of
views over the data source relations
Local-as-view (LAV):
express the source relations as views over the
mediated schema.
23
Global-as-View: Example 1
Mediated schema: Express mediator schema
Movie(title, dir, year, genre). relations as views over
Schedule(cinema, title, time). source relations
Create View Movie
Sources:
[S1(title,dir,year,genre)]
[S2(title, dir,year,genre)]
S3 [T1(title,dir) ,T2(title,year,genre)]
where S3.title=S4.title
24
Global-as-View: Example 1
Mediated schema: Sources:
[S1(title,dir,year,genre)] Schedule(cinema, title, time).
[S2(title, dir,year,genre)] Movie(title, dir, year, genre).
S3[T1 (title,dir),T2(title,year,genre)]
Create View Movie AS select * from S1
union select * from S2
union select S3.T1.title, S3.T1.dir, S3.T2.year, S3.T2.genre
from S3.T1,S3.T2
re where S3.T1.title=S3.T2.title
25
Global-as-View: Example 2
Mediated schema: Sources:
Movie(title, dir, year, genre). [S1(title,dir,year)]
Schedule(cinema, title, time). [S2(title, dir,genre)]
Create View Movie AS
select title, dir, year, NULL
from S1
union
select title, dir, NULL, genre
from S2
26
Global-as-View: Example 3
Mediated schema:
Movie(title, dir, year, genre).
Schedule(cinema, title, time).
Source : S4(cinema, genre)
Create View Movie AS
select NULL, NULL, NULL, genre
from S4
Create View Schedule AS
select cinema, NULL, NULL
from S4.
But what if we want to find which cinemas are
playing comedies? 27
Local-as-View: Example 1
Mediated
Mediatedschema:
schema: Express source schema
relations as views over
Movie(title, dir, year, genre).
mediator relations
Schedule(cinema, title, time).
Create View S1 AS
select * from Movie
Create View S3 AS
S1(title,dir,year,genre)
select title, dir from Movie
S3(title,dir)
Create View S5 AS
select title, dir, year S5(title,dir,year), year >1960, genre=“Comedy
from Movie
where year > 1960 AND genre=“Comedy”
Sources are “materialized views” of
mediator schema 28
Local-as-View: Example 2
Mediated schema: Express source schema
relations as views over
Movie(title, dir, year, genre), mediator relations
Schedule(cinema, title, time).
Source : S4(cinema, genre)
Create View S4 as
select cinema, genre
from Movie m, Schedule s
where m.title=s.title
Now if we want to find which cinemas are
playing comedies, there is hope! 29
GAV vs. LAV
• Not modular • Modular--adding new
– Addition of new sources is easy
sources changes the
mediated schema
• Reformulation is hard
• Query reformulation • Best when
easy – possibility of
addition/deletion of
• Best when sources
– Few, stable data
sources
30
Query Processing
Query Query reformulation
Query optimizer
Replanning request
Execution engine
wrapper wrapper wrapper wrapper wrapper
source source source source source 31
Reformulation in LAV
Query: Find all the years in which
Zhang Yimou released movies.
Mediated schema:
Select year Movie(title, dir, year, genre).
from movie M
Schedule(cinema, title, time).
where M.dir=Yimou
Q(Y) :- movie(T,D,Y,G),D=Yimou Sources
:S1(title,dir,year,genre)
QY) :- S1(T,D,Y,G) , D=Yimou (1) S3(title,dir)
S5(title,dir,year), year >1960
Q(Y) :- S5(T,D,Y) , D=Yimou (2)
32
Reformulation Algorithms
Bucket Algorithm: will check all possible combinations
Create a bucket for each subgoal g in the query.
The bucket contains view atoms that contribute to g.
Cartesian product of buckets
Followed by “containment” check
33
Bucket Algorithm : Example
Q( ID, dir ) : Movie( ID, title, year , genre), Re venues( ID, amount ),
Director ( ID, dir ), amount $100 M
V1 ( I , Y ) : Movie( I , T , Y , G ), Re venues( I , A), I 5000, A $200M
V2 ( I , A) : Movie( I , T , Y , G ), Re venues( I , A)
V3 ( I , A) : Re venues( I , A), A $50M
V4 ( I , D, Y ) : Movie( I , T , Y , G ), Director ( I , D), I 3000
View atoms that can contribute to Movie:
V1(ID,year), V2(ID,A’), V4(ID,D’,year)
34
The Buckets and Cartesian product
Movie(ID,title, Revenues(ID, Director(ID,dir)
year,genre) amount)
V1(ID,year) V1(ID,Y’) V4(ID,Dir,Y’)
V2(ID,A’) V2(ID,amount)
V4(ID,D’,year)
Consider first candidate rewriting: first V1 subgoal is
redundant, and V1 and V4 are mutually exclusive.
q1'(ID,dir) : -V1(ID, year),V1(ID, y'),V4 (ID,dir, y')
35
Next Candidate Rewriting
Movie(ID,title, Revenues(ID, Director(ID,dir)
year,genre) amount)
V1(ID,year) V1(ID,Y’) V4(ID,Dir,Y’)
V2(ID,A’) V2(ID,amount)
V4(ID,D’,year)
q2 '(ID,dir) : -V2 (ID, A'),V2 (ID,amount),V4 (ID,dir, y')
q2 '(ID,dir) : -V2 (ID,amount),V4 (ID,dir, y'),
amount ³ $100M 36
TSIMMIS Project
TSIMMIS acronym of “The Stanford-IBM Manager of
Multiple Information Sources”
translate queries and information (source wrappers)
combine information from several sources (mediator)
Global-as-View (Gav)
To add a new source is not easy – need to specify it in the
mediator model
37
P2P Data Integration
38
P2P Data Integration
39
P2P Data Integration
Data integration is not based on a global schema.
P2P systems do not require a centralized management
The peer schema is the schema that is visible to other peers.
peer mappings relate the schemas of multiple peers
40
Question
42