0% found this document useful (0 votes)

4 views42 pages

Data Integration

Data integration provides uniform access to data from multiple, heterogeneous sources, enabling users to query diverse datasets seamlessly. It faces challenges such as schema heterogeneity and distributed query processing, and can be implemented through architectures like data warehousing or virtual data integration. Applications span various fields including business, science, and government, with methods like Global-as-View and Local-as-View for schema mapping to facilitate data integration.

Uploaded by

Ahmad Mousavi

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

4 views42 pages

Data Integration

Uploaded by

Ahmad Mousavi

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 42

PRINCIPLES OF

DATA INTEGRATION

Seyed Ahmad Seyed Mousavi

What is Data Integration?

Provide uniform access to data available in multiple,

autonomous, heterogeneous and distributed data sources
2
Data Integration: A Higher-level Abstraction

Query Independence of:

• source & location
Mediated Schema • data model, syntax
• semantic variations
•…
Semantic
Mappings
S1 S2 S3
<cd> <title> The best of … </title>
SSN
123-45-6789
234-56-7890
Name
Charles
Dan
…
Category
undergrad
grad
…
SSN
123-45-6789
123-45-6789
234-56-7890
CID
CSE444
CSE444
CSE142
… <artist> Carreras </artist>
<artist> Pavarotti </artist>
…
… <artist> Domingo </artist>
CID Name Quarter
CSE444 Databases fall <price> 19.95 </price>
CSE541 Operating systems winter
</cd>
3
Applications of Data Integration
 Business
 Science
 Government
 The Web
 Pretty much everywhere

4
Application Area : Science
Sequenceable Structured
Phenotype Gene
Entity Vocabulary

Nucleotide
Protein
Sequence

Swiss-
OMIM HUGO GO
Prot

Gene- Locus-
Entrez
Clinics Link

Hundreds of biomedical data sources available; growing rapidly!

5
The Deep Web
 Millions of high quality HTML forms out there

 Each form has its own special interface

 Hard to explore data across sites.

 Goal (for some domains):

 A single interface into a multitude of deep-web
sources.

6
Create a single site to search for jobs/rentals/… 7
Goals of Data Integration
 Uniform query access to a set of data sources
 Handle:
 Scale of sources: from tens to millions
 Heterogeneity
 Autonomy
 Semi-structured
 Unstructured

8
Different Levels of Integration

9
Why is it Hard?

 Distributed query processing

 Schema heterogeneity

 Convincing people to share

 Security, privacy and performance implications.

10
Data integration architectures
 Data warehousing:
integrate by bringing the data into a single
physical warehouse

 Virtual data integration:

leave the data at the sources and access it at
query time.

11
Advantages of Mediated Schema approache
 Data Freshness (low latency - almost realtime)

 Higher Agility

 Less costlier :
Lot of infrastructure cost can be saved since data
localization not required

12
Disadvantages of Mediated Schema approaches

 Semantic conflicts :
The meaning of "net profit" can be different in
different systems

 Slower query response:

 due to network bandwidth
 workload on the source system

13
Virtual Data Integration Architecture
 Leave the data in the sources.

 When a query comes in:

 Determine the relevant sources to the query
 Break down the query into sub-queries for the
sources.
 Get the answers from the sources, and
combine them appropriately.

 Data is fresh

14
Virtual Data Integration Architecture(2)
Mediated Schema
or Warehouse Query reformulation/
Query over materialized data

Wrapper / Wrapper / Wrapper / Wrapper /

Extractor Extractor Extractor Extractor

RDBMS 1 RDBMS 2
HTML1 XML1
15
Virtual Data Integration Architecture(3)

16
Example

Movie (title ,director ,year, genre )

Actors ( title , actor )
Plays (movie ,location ,startTime)
Reviews (title, rating ,description )

S1 S2 S3 S4 S5
Movies (name, Cinemas (place , CinemasInNYC CinemasInSF Reviews (title,
actors ,director, movie, start) (cinema ,title, (location ,movie, date ,grade ,
genre) startTime) startingTime) review)

17
Wrappers
 Sources export data in different formats
 Wrappers are custom-built programs that
transform data from the source native format to
something acceptable to the mediator
XML
HTML
<book>
 Introduction to DB <title> Introduction to DB </title>
 Phil Bernstein <author> Phil Bernstein </author>
 Eric Newcomer <author> Eric Newcomer </author>
Addison Wesley, 1999 <publisher> Addison Wesley </publisher>
<year> 1999 </year>
</book>
18
18
Wrappers(2)
 Complexity of wrapper depends on nature of data
source

 Maintenance problems – have to change if source

interface changes

19
Data Source Catalog
Contains meta-information about sources

 Logical source contents (books, new cars)

 Source capabilities (can answer SQL queries)

 Source completeness (has all books)

20
Query Reformulation
 Users pose queries over the mediated schema

 Reformulation:
Queries over the mediated schema have to be
rewritten as queries over the source schemas

 How would we do the reformulation?

21
Approachs for Schema Mapping
Q

Mediated Schema

GAV LAV GLAV

Q’ Q’ Q’
Source Source Source

22
Approachs for Schema Mapping(2)
 Global-as-view (GAV):
express the mediated schema relations as a set of
views over the data source relations

 Local-as-view (LAV):
express the source relations as views over the
mediated schema.

23
Global-as-View: Example 1
Mediated schema: Express mediator schema
Movie(title, dir, year, genre). relations as views over
Schedule(cinema, title, time). source relations
Create View Movie
Sources:
[S1(title,dir,year,genre)]
[S2(title, dir,year,genre)]
S3 [T1(title,dir) ,T2(title,year,genre)]

where S3.title=S4.title
24
Global-as-View: Example 1
Mediated schema: Sources:
[S1(title,dir,year,genre)] Schedule(cinema, title, time).
[S2(title, dir,year,genre)] Movie(title, dir, year, genre).
S3[T1 (title,dir),T2(title,year,genre)]

Create View Movie AS select * from S1

union select * from S2
union select S3.T1.title, S3.T1.dir, S3.T2.year, S3.T2.genre
from S3.T1,S3.T2
re where S3.T1.title=S3.T2.title

25
Global-as-View: Example 2
Mediated schema: Sources:
Movie(title, dir, year, genre). [S1(title,dir,year)]
Schedule(cinema, title, time). [S2(title, dir,genre)]

Create View Movie AS

select title, dir, year, NULL
from S1
union
select title, dir, NULL, genre
from S2

26
Global-as-View: Example 3
Mediated schema:
Movie(title, dir, year, genre).
Schedule(cinema, title, time).
Source : S4(cinema, genre)
Create View Movie AS
select NULL, NULL, NULL, genre
from S4
Create View Schedule AS
select cinema, NULL, NULL
from S4.
But what if we want to find which cinemas are
playing comedies? 27
Local-as-View: Example 1
Mediated
Mediatedschema:
schema: Express source schema
relations as views over
Movie(title, dir, year, genre).
mediator relations
Schedule(cinema, title, time).
Create View S1 AS
select * from Movie
Create View S3 AS
S1(title,dir,year,genre)
select title, dir from Movie
S3(title,dir)
Create View S5 AS
select title, dir, year S5(title,dir,year), year >1960, genre=“Comedy

from Movie
where year > 1960 AND genre=“Comedy”
Sources are “materialized views” of
mediator schema 28
Local-as-View: Example 2
Mediated schema: Express source schema
relations as views over
Movie(title, dir, year, genre), mediator relations

Schedule(cinema, title, time).

Source : S4(cinema, genre)
Create View S4 as
select cinema, genre
from Movie m, Schedule s
where m.title=s.title

Now if we want to find which cinemas are

playing comedies, there is hope! 29
GAV vs. LAV
• Not modular • Modular--adding new
– Addition of new sources is easy
sources changes the
mediated schema
• Reformulation is hard

• Query reformulation • Best when

easy – possibility of
addition/deletion of
• Best when sources
– Few, stable data
sources

30
Query Processing
Query Query reformulation

Query optimizer

Replanning request
Execution engine

wrapper wrapper wrapper wrapper wrapper

source source source source source 31

Reformulation in LAV
Query: Find all the years in which
Zhang Yimou released movies.
Mediated schema:
Select year Movie(title, dir, year, genre).
from movie M
Schedule(cinema, title, time).
where M.dir=Yimou

Q(Y) :- movie(T,D,Y,G),D=Yimou Sources

:S1(title,dir,year,genre)
QY) :- S1(T,D,Y,G) , D=Yimou (1) S3(title,dir)

S5(title,dir,year), year >1960

Q(Y) :- S5(T,D,Y) , D=Yimou (2)

32
Reformulation Algorithms
Bucket Algorithm: will check all possible combinations

 Create a bucket for each subgoal g in the query.

 The bucket contains view atoms that contribute to g.

 Cartesian product of buckets

 Followed by “containment” check

33
Bucket Algorithm : Example

Q( ID, dir ) :  Movie( ID, title, year , genre), Re venues( ID, amount ),
Director ( ID, dir ), amount  $100 M
V1 ( I , Y ) :  Movie( I , T , Y , G ), Re venues( I , A), I  5000, A  $200M
V2 ( I , A) :  Movie( I , T , Y , G ), Re venues( I , A)
V3 ( I , A) :  Re venues( I , A), A  $50M
V4 ( I , D, Y ) :  Movie( I , T , Y , G ), Director ( I , D), I  3000

View atoms that can contribute to Movie:

V1(ID,year), V2(ID,A’), V4(ID,D’,year)
34
The Buckets and Cartesian product

Movie(ID,title, Revenues(ID, Director(ID,dir)

year,genre) amount)
V1(ID,year) V1(ID,Y’) V4(ID,Dir,Y’)

V2(ID,A’) V2(ID,amount)

V4(ID,D’,year)

Consider first candidate rewriting: first V1 subgoal is

redundant, and V1 and V4 are mutually exclusive.

q1'(ID,dir) : -V1(ID, year),V1(ID, y'),V4 (ID,dir, y')

35
Next Candidate Rewriting

Movie(ID,title, Revenues(ID, Director(ID,dir)

year,genre) amount)
V1(ID,year) V1(ID,Y’) V4(ID,Dir,Y’)

V2(ID,A’) V2(ID,amount)

V4(ID,D’,year)

q2 '(ID,dir) : -V2 (ID, A'),V2 (ID,amount),V4 (ID,dir, y')

q2 '(ID,dir) : -V2 (ID,amount),V4 (ID,dir, y'),
amount ³ $100M 36
TSIMMIS Project
 TSIMMIS acronym of “The Stanford-IBM Manager of
Multiple Information Sources”

 translate queries and information (source wrappers)

 combine information from several sources (mediator)

 Global-as-View (Gav)

 To add a new source is not easy – need to specify it in the

mediator model

37
P2P Data Integration

38
P2P Data Integration

39
P2P Data Integration