Chapter 12
DATA AND KNOWLEDGE INTEGRATION FOR
E-GOVERNMENT
Eduard Hovy
Information Sciences Institute, University of Southern California, Marina del Rey, California,
U.S.A. (hovy@isi.edu)
CHAPTER OVERVIEW
Data integration is one of the most significant IT problems facing government today. Using
information technology, government agencies have collected vastly more data than was ever
possible to collect before. Unfortunately, for lack of standardization, most government data
today exists in thousands of different formats and resides in hundreds of systems and
versions, spread across dozens of agencies. This situation makes the data almost impossible to
find, re-use, and build upon after further data collection. A considerable amount of research
has been performed over the past decades to overcome this problem. Within Digital
Government, several projects have focused on government data collections. Three principal
approaches have been followed: (1) direct access, using information retrieval techniques; (2)
metadata reconciliation, using ontology alignment techniques; and (3) data mapping, using
information theoretic techniques. This chapter discusses each approach, and provides specific
examples of the last two.
1. INTRODUCTION
Government is, in principle, a data-intensive enterprise: in the ideal case,
the more data available about some issue, organization, or individual, the
better the decisions government agencies can make. Information technology has
enabled government agencies to collect vastly more data than was ever
possible before. But this ability comes at a cost: in order to be useful, the
data must be properly organized, standardized, and maintained. Unfortunately,
the situation most characteristic of present-day government, in almost all its
branches, is that much data has been collected but not stored in a uniform
way, in a common representation, or using a standardized system. That is, the
data may reside in hundreds of different
formats, systems, and versions. While the information might be somewhere,
the user often doesn’t know where to find it, how to access it, or how to
convert all variations of it to a single useful format.
It is therefore no surprise that one of the most significant problems
experienced by government agencies is data and knowledge integration.
Reconciling the differences across data collections is not a trivial matter.
Data and information integration involves several aspects: recognizing that
two data sets describe the ‘same’ topic; understanding their differences (in
numerous ways, including specificity, accuracy, and coverage); creating a
common description or framework of metadata for them; establishing
equivalences by performing data mapping across the data collections; and
possibly converting the actual data from one or more data sources into a
common form.
One of the principal problems facing efforts to integrate nonhomogeneous
data sets is terminology standardization: what one agency calls salary
another might call income, and a third might call wages (even while it may
have something else entirely that it calls salary). Defining exactly what data
has been obtained, and making sure that the actual method of data capture
employed did in fact accurately observe the specifications, is a task for
specialists, and may require some very sophisticated analysis and
description. It is not uncommon for specialists in different government
agencies to spend weeks or even months understanding precisely what
differences exist between their respective data collections, even when to the
untrained eye the collections seem essentially identical. For example, when
recording gasoline prices in a given area, it matters not only at which
locations the prices are measured, but also whether they are measured every
Tuesday or only once a month. Determining
how significant the differences in measurement are, and deciding how to
reconcile them (simple numerical average? Average weighted by volume
sold?) into a single number, is a matter of interpretation, and may easily have
unexpected policy consequences: after all, the results will presumably be used
by some lawmaker, reported in some publication, or picked up by the press. The
data integration process is thus not only technically difficult but also fraught
with potentially unexpected legal and social ramifications.
Several IT researchers have studied data and information integration for
e-Government data collections. As can be expected, most of them have avoided
making the definitive final integration themselves, instead providing tools
and/or methods that government specialists can use to make their own
integration decisions. Three principal approaches exist:
• Direct access, using information retrieval techniques
• Metadata reconciliation, using ontology alignment techniques
• Data mapping, using information theoretic techniques
We discuss these in the next section, and provide specific examples of
the last two in Section 3.
2. OVERVIEW OF THE FIELD
2.1 Direct Access using Information Retrieval
Direct access methods do not aim to provide a single uniform perspective
over the data sources. Rather, along the lines of web search technology like
Google, the IT tools return to the user all data pertinent to his or her request,
after which the user must decide what to do. As one might expect, this
approach works best for textual, not numerical, data. Typically, the technology
inspects metadata and the text accompanying the metadata, such as
documentation, commentary, or footnotes (Gravano et al., 1994; Lewis and
Hayes, 1994; Pyreddy and Croft, 1997; Xu and Callan, 1998).
Experience with traditional forms of metadata, such as controlled
vocabularies, shows that such metadata is expensive and time-consuming to
produce, that authors often resist creating it, and that information consumers often have
difficulty relating their information need to pre-specified ontologies or
controlled vocabularies. Controlled vocabularies and relatively static
metadata ontologies are difficult to update and hence not really suitable to
support the rapid integration of new information that must be easy for the
general population to use and that must be maintained at moderate expense.
To address this problem, one approach is to try to generate metadata
automatically, using language models (lists of basic vocabulary, phrases,
names, etc., with frequency information) instead of ontologies or controlled
vocabularies. These language models are extracted by counting the words
and phrases appearing in the texts accompanying the data collections. Most
retrieval systems use term frequency, document frequency, and document
length statistics.
This approach has been adopted by information retrieval researchers
(Callan et al., 1995; Ponte and Croft, 1998). It is based on older work on the
automatic categorization of information relative to a controlled vocabulary
or classification hierarchy (Lewis and Hayes, 1994; Larkey, 1999). Ponte
and Croft (1998) infer a language model for each document and estimate
the probability of generating the query according to each of these models.
Documents are then ranked according to these probabilities. Research by
Callan et al. (1999) shows that language models enable relatively accurate
database selection. More details of this approach, and a comparison with the
following one, appear in (Callan et al., 2001).
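To make the language-model idea concrete, the following sketch builds a unigram model from the text accompanying each database and ranks the databases by the smoothed likelihood of generating a query. It is a simplification under stated assumptions, not the code of any of the systems cited above; the database names and texts are invented.

```python
# A minimal sketch of language-model-based database selection, in the spirit
# of Ponte and Croft (1998) and Callan et al. (1999). Illustrative only.
import math
import re
from collections import Counter

def language_model(text):
    """Build a unigram term-frequency model from the text accompanying a database."""
    terms = re.findall(r"[a-z]+", text.lower())
    return Counter(terms), len(terms)

def query_log_likelihood(query, model, total, coll_model, coll_total, lam=0.5):
    """Score a database by the smoothed probability that its language model
    generates the query; Jelinek-Mercer smoothing avoids zero probabilities."""
    score = 0.0
    for term in re.findall(r"[a-z]+", query.lower()):
        p_db = model[term] / total if total else 0.0
        p_coll = coll_model[term] / coll_total if coll_total else 0.0
        p = lam * p_db + (1 - lam) * p_coll
        score += math.log(p) if p > 0 else math.log(1e-12)  # floor for unseen terms
    return score

# Usage: rank candidate databases for a user query by query likelihood.
docs = {
    "gasoline_prices": "weekly retail gasoline price survey by region and grade",
    "labor_statistics": "monthly wages income and employment figures by sector",
}
coll_counts, coll_total = language_model(" ".join(docs.values()))
models = {name: language_model(text) for name, text in docs.items()}
query = "average gasoline price"
ranked = sorted(
    docs,
    key=lambda n: query_log_likelihood(query, models[n][0], models[n][1],
                                       coll_counts, coll_total),
    reverse=True,
)
print(ranked)  # the gasoline database should rank first
```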
2.2 Metadata Reconciliation
Almost all data collections are accompanied by metadata that provides
some definitional information (at the very least, the names and types of the
data collected). Given several data collections in a domain, people
often attempt to enforce (or at least enable) standardization of nomenclature,
and facilitate interoperability of IT across data sources, by creating
centralized metadata descriptions that provide the overarching data
‘framework’ for the whole domain. When the metadata for a specific data
resource is integrated with this centralized framework, the data becomes
interpretable in the larger context, and can therefore be compared to, and
used in tandem with, data from other data collections similarly connected.
In the US, the government has funded several metadata initiatives,
including the Government Information Locator Service (GILS) (http://www.
gils.net/) and the Advanced Search Facility (ASF) (http://asf.gils.net/). These
initiatives seek to establish a structure of cooperation and standards between
agencies, including defining structural information (formats, encodings, and
links). However, they do not focus on the actual creation of metadata, and do
not define the algorithms needed to generate metadata.
A large amount of research has been devoted to the problem of creating
general metadata frameworks for a domain, linking individual data
collections’ metadata to a central framework, and providing user access to
the most appropriate data source, given a query (Baru et al., 1999; Doan
et al., 2001; Ambite and Knoblock, 2000; French et al., 1999; Arens et al.,
1996). Two major approaches have been studied. In the first, called
global-as-view, the global model is defined and used as a view over the various
data sources. This approach first appeared in Multibase and later in TSIMMIS
(Chawathe et al., 1994). In the second, called local-as-view or sometimes
view rewriting, the sources are used as views on the global model (Levy,
1998). The disadvantage of the first approach is that the user must reengineer
the definitions of the global model whenever any of the sources change or
when new sources are added. The view rewriting approach does not suffer
from this problem, but instead must face the problem of rewriting queries
into data access plans for all the other sources using views, a problem that is
NP-hard or worse.
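A toy contrast of the two styles may help; the tables, column names, and mappings below are hypothetical and greatly simplified, intended only to illustrate the direction of the view definitions rather than any particular mediator system.

```python
# In global-as-view (GAV), the global relation is written directly as a query
# over the sources, so adding a source means re-editing this definition. In
# local-as-view (LAV), each source is described in terms of the global model,
# and queries must be rewritten against those descriptions at run time.

# Two hypothetical sources that disagree on terminology (salary vs. wages).
agency_a = [{"person": "p1", "salary": 52000}]
agency_b = [{"person": "p2", "wages": 48000}]

# Global-as-view: the global relation Income(person, income) is defined as a
# view over the concrete sources.
def global_income_gav():
    for row in agency_a:
        yield {"person": row["person"], "income": row["salary"]}
    for row in agency_b:
        yield {"person": row["person"], "income": row["wages"]}

# Local-as-view: each source is declared as a view on the global model; a
# query planner would use these descriptions to rewrite a global query.
source_descriptions = {
    "agency_a": {"covers": "Income", "rename": {"salary": "income"}},
    "agency_b": {"covers": "Income", "rename": {"wages": "income"}},
}

print(list(global_income_gav()))
```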
Below we describe an example of a hybrid approach that defines the data
sources in terms of the global model and then compiles the source
descriptions into axioms that define the global model in terms of the
individual sources. These axioms can be efficiently instantiated at run-time
to determine the most appropriate rewriting to answer a query automatically.
This approach combines the flexibility of the view rewriting with the
efficiency of the query processing in Multibase and TSIMMIS.
To date, the general approach of integration using metadata to find
similarities between entities within or across heterogeneous data sources
has always required some manual effort. Despite some promising recent work,
the automated creation of such mappings at high accuracy and high coverage
is still in its infancy, since equivalences and differences manifest themselves
at all levels, from individual data values through metadata to the explanatory
text surrounding the data collection as a whole.
2.3 Data Mapping
Formally defined metadata may provide a great deal of useful
information about a data source, and thereby greatly facilitate the work of
the IT specialist required to integrate it with another data source. But all too
often, the metadata is sketchy, and sometimes even such basic information as
the data column headings is given only as some kind of abbreviated code. In addition, such
auxiliary data can be outdated, irrelevant, overly domain specific, or simply
non-existent. A general-purpose solution to this problem cannot therefore
rely on such auxiliary data. All one can count on is the data itself: a set of
observations describing the entities.
A very recent approach to data integration skirts metadata altogether, and
focuses directly on the data itself. Necessarily, this data-driven paradigm
requires some method to determine which individual data differences are
significant and which are merely typical data value variations. To date, the
approach focuses on numerical data only. The general paradigm is to
employ statistical / information theoretic techniques to calculate average or
characteristic values for data (sub)sets, to then determine which values are
unusual with respect to their (sub)set, and to compare the occurrences of
unusual values across comparable data collections in order to find
corresponding patterns. From such patterns, likely data alignments are then
proposed for manual validation. Davis et al. (2005) describe a supervised
learning algorithm for discovering aliases in multi-relational domains. Their
method has two stages: high recall is obtained by first learning a set of
rules using Inductive Logic Programming (ILP), and these rules are then
used as the features of a Bayesian network classifier. In many domains,
however, training data is unavailable.
A different approach uses Mutual Information, an information theoretic
measure of the degree to which one data value predicts another. Kang and
Naughton (2003) begin with a known alignment to match unaligned columns
after schema- and instance-based matching fails. Given two columns A.x and
B.x in databases A and B that are known to be aligned, they use Mutual
Information to compute the association strength between column A.x and
each other column in A, and between column B.x and each other column in B. The
assumption is that highly associated columns from A and B are the best
candidates for alignment. Also using Mutual Information, the work of Pantel
et al. (2005), which we describe in more detail below, appears to be very
promising.
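The following sketch illustrates the dependency-based matching idea in simplified form; it is not the algorithm of Kang and Naughton (2003), and the column names and values are invented for illustration.

```python
# Given one column in each database already known to be aligned (the anchor),
# compute the mutual information between the anchor and every other column,
# and propose alignments between columns whose association strengths with the
# anchor are most similar.
import math
from collections import Counter

def mutual_information(xs, ys):
    """MI between two columns, estimated from their joint value frequencies."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

def association_profile(table, anchor):
    """Association strength of every non-anchor column with the anchor column."""
    return {col: mutual_information(table[anchor], vals)
            for col, vals in table.items() if col != anchor}

# Hypothetical databases A and B; column "region" is the known alignment.
A = {"region": ["n", "n", "s", "s"], "grade": ["reg", "reg", "prem", "prem"]}
B = {"region": ["n", "s", "n", "s"], "fuel_type": ["r", "p", "r", "p"]}

profile_a = association_profile(A, "region")
profile_b = association_profile(B, "region")
# Pair each column of A with the column of B whose association strength is closest.
for col_a, s_a in profile_a.items():
    best = min(profile_b, key=lambda col_b: abs(profile_b[col_b] - s_a))
    print(col_a, "->", best)
```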
3. TWO EXAMPLES
In order to provide some detail, we describe EDC, an example database
access planning system based on metadata reconciliation, and SIFT-Guspin,
an example of the data mapping approach.
3.1 The EDC System
(Parts of this section were written by Jose Luis Ambite and Andrew Philpot.)
The Energy Data Collection (EDC) project (Ambite et al., 2001; 2002)
focused on providing access to a large amount of data about gasoline prices
and weekly volumes of sale, collected in several quite different databases by
government researchers at the US Energy Information Administration (EIA),
the Bureau of Labor Statistics (BLS), the Census Bureau, and the California
Energy Commission. In all, over 50,000 data tables were materialized and
used in the final EDC system. The system could be accessed via various
interfaces, including cascaded menus, a natural language question analyzer
for English and Spanish (Philpot et al., 2004), and an ontology (metadata)
browser. Other research in this project focused on data aggregation to
integrate data collected at various granularities (Bruno et al., 2002), query
and result caching for rapid access to very large data collections (Ross,
2002), and the automated extraction of ontology terms from data glossary
definitions (Klavans et al., 2002).
The principal problem was to develop a system that could present a
single unified view of all the disparate, heterogeneous data, in such a way
as to support the needs both of experts and of users relatively unfamiliar with
the data, such as journalists or educators, while also being formally specified
so as to be used by an automated data access planner. This planner, inherited
from the SIMS project (Ambite and Knoblock, 2000; Arens et al., 1996),
used Artificial Intelligence techniques to decompose the user’s query into a
set of subqueries, each addressed to a specific database, and to recompose
the results obtained from the various sources into a single response.
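As a rough illustration of this decompose-and-recompose idea (not the SIMS or EDC planner itself; the source names, attributes, and rows below are invented), a query over several attributes can be split among sources according to attribute coverage and the partial results joined on a shared key:

```python
# Toy illustration only: each hypothetical source advertises the attributes it
# covers; a query is split into per-source subqueries keyed on a shared
# attribute, and the partial answers are joined back together.

SOURCES = {
    "eia_prices":  {"attrs": {"region", "price"},  "rows": [{"region": "CA", "price": 2.15}]},
    "carb_volume": {"attrs": {"region", "volume"}, "rows": [{"region": "CA", "volume": 900}]},
}

def plan(query_attrs, key="region"):
    """Assign each requested attribute to some source that covers it."""
    subqueries = {}
    for attr in query_attrs:
        for name, src in SOURCES.items():
            if attr in src["attrs"]:
                subqueries.setdefault(name, {key}).add(attr)
                break
    return subqueries

def execute(subqueries, key="region"):
    """Run each subquery and join the partial results on the shared key."""
    combined = {}
    for name, attrs in subqueries.items():
        for row in SOURCES[name]["rows"]:
            combined.setdefault(row[key], {}).update(
                {a: row[a] for a in attrs if a in row})
    return list(combined.values())

# One combined row for region CA carrying both price and volume.
print(execute(plan({"price", "volume"})))
```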
This research took the following approach. Rather than building domain
models from scratch, the researchers adopted USC/ISI’s 70,000-node
terminology taxonomy (a very simple ontology) called SENSUS as the
overarching meta-model and extended it to incorporate new energy-related
domain models. To speed up this process, they developed automated
concept-to-ontology alignment algorithms (Hovy et al., 2001), as well as
algorithms that extracted terms from data sources and clustered them in
order to jump-start model building (Klavans et al., 2002; Hovy et al., 2003).
In order to connect SENSUS terms with the individual metadata models
of each source database, a domain model of approximately 500 nodes was
created manually to represent the concepts present in the EDC gasoline
domain, and manually connected to the various metadata models. This
model was then semi-automatically linked into SENSUS using a new type of
ontology link called generally-associated-with (GAW) that held between
concepts in the ontology and domain model concepts. GAW links enabled
the user while browsing to rapidly proceed from high-level (quite general
and perhaps inaccurate) concepts to the (very specific and precise) domain
model concepts associated with real data in the databases. In contrast to the
links between data sources and domain model concepts, which were logical
equivalences as required to ensure the correctness of SIMS reasoning, the
semantics of GAW links was purposely vague. Such vagueness allowed a
domain model concept (such as Price) to be connected to several very
disparate SENSUS concepts (such as Price, Cost, Money, Charge, Dollar,
Amount, Fee, Payment, Paying, etc.). Clearly, while these links cannot
support automated inference, they can support the non-expert user, allowing
him or her to start browsing or query formation with whatever terms are
most familiar. In addition, the vague semantics had a fortunate side effect, in
that it facilitated the automated alignment of concepts from the domain model to
SENSUS.
A considerable amount of effort was devoted to developing semi-
automated term-to-term alignment discovery algorithms (Hovy et al., 2001).
These algorithms fell into three classes: name matches, with various
heuristics for decomposing term names; definition matches, which considered
term definitions and definitional descriptions from associated documents;
and dispersal matches, which considered the relative locations in SENSUS of
groups of candidate matches. A fairly extensive series of experiments
focused on determining the optimal parameter settings for these algorithms,
using three sets of data: the abovementioned EDC gasoline data, the
NHANES collection of 20,000 rows of 1238 fields from a survey by the
National Center for Health Statistics, and (for control purposes) a set of 60
concepts in 3 clusters extracted from SENSUS. Although the alignment
techniques were never very accurate, they did significantly shorten the time
required to connect SENSUS to the domain model, compared to manual
insertion, and they were quite well suited for creating GAW links. Research in (semi-)
automated ontology alignment is an ongoing and popular endeavor; see
http://www.atl.lmco.com/projects/ontology/.
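As an illustration of the simplest of the three classes of heuristic mentioned above, the following sketch implements a basic name match: term names are decomposed into word-like tokens and candidate pairs are scored by token overlap. The actual EDC algorithms (Hovy et al., 2001) were considerably more elaborate, and the term names shown here are invented.

```python
# A minimal name-matching heuristic: decompose term names and score candidate
# ontology concepts by Jaccard overlap of their token sets. Illustrative only.
import re

def tokens(term):
    """Split a term name on underscores, hyphens, and camelCase boundaries."""
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", term)
    return set(re.split(r"[\s_\-]+", spaced.lower())) - {""}

def name_match_score(a, b):
    """Jaccard overlap between the token sets of two term names."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

domain_terms = ["retail_gasoline_price", "sales_volume"]
ontology_terms = ["GasolinePrice", "Price", "VolumeOfSale", "Payment"]
for d in domain_terms:
    ranked = sorted(ontology_terms, key=lambda o: name_match_score(d, o), reverse=True)
    print(d, "->", ranked[0], round(name_match_score(d, ranked[0]), 2))
```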
3.2 The SIFT-Guspin System
(Parts of this section were written by Patrick Pantel and Andrew Philpot.)
Not all data is equally useful for comparison—some observations are
much more informative and important than others. This work uses Pointwise
Mutual Information to calculate the information content (approximately, the
unpredictability) of individual data items in various data collections, and
then compares groupings of unusual (i.e., unpredictable in surprising ways)
ones across collections. In simple terms, the hypothesis of this work is that
correspondences of unusual values are much more indicative of likely data
alignments than correspondences that arise due to ‘random’ variation.
When assessing the similarity between entities, important observations
should be weighed higher than less important ones. Shannon’s theory of
information (Shannon, 1948) provides a metric, called Pointwise Mutual
Information, that measures the degree to which one event predicts another. More
precisely, the formula measures the amount of information one event x gives
about another event y, where P(x) denotes the probability that x occurs, P(y)
the probability that y occurs, and P(x,y) the probability that they both occur:
$$\mathrm{mi}(x, y) = \log \frac{P(x, y)}{P(x)\,P(y)}$$
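As a minimal illustration of this formula (not the SIFT implementation; the entities and observations below are invented), pointwise mutual information can be estimated from a list of co-occurring (entity, observation) pairs:

```python
# Pointwise mutual information estimated from co-occurrence counts.
import math
from collections import Counter

def pmi(pairs, x, y):
    """mi(x, y) = log P(x, y) / (P(x) P(y)), estimated from (x, y) pairs."""
    n = len(pairs)
    p_xy = Counter(pairs)[(x, y)] / n
    p_x = Counter(a for a, _ in pairs)[x] / n
    p_y = Counter(b for _, b in pairs)[y] / n
    return math.log(p_xy / (p_x * p_y)) if p_xy > 0 else float("-inf")

# A rare observation shared by an entity is more informative (higher PMI)
# than a common one.
observations = [("facility_1", "solvent_x"), ("facility_1", "diesel"),
                ("facility_2", "diesel"), ("facility_3", "diesel")]
print(pmi(observations, "facility_1", "solvent_x"))  # > 0: unusual, informative
print(pmi(observations, "facility_1", "diesel"))     # <= 0: common, less informative
```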
Given a method of ranking observations according to their relative
importance, one also needs a comparison metric for determining the
similarity between two entities. An important requirement is that the metric
not be too sensitive to unseen observations. That is, the absence of a
matching observation should not indicate dissimilarity as strongly as the
presence of one indicates similarity. Since not all distance metrics make this
distinction (Euclidean distance, for example, does not), a good choice is the
cosine coefficient, a common metric in which the similarity between each
pair of entities $e_i$ and $e_j$ is given by:
$$\mathrm{sim}(e_i, e_j) = \frac{\sum_{o} \mathrm{mi}(e_i, o) \times \mathrm{mi}(e_j, o)}{\sqrt{\sum_{o} \mathrm{mi}(e_i, o)^2} \times \sqrt{\sum_{o} \mathrm{mi}(e_j, o)^2}}$$
where o ranges through all possible observations. This formula measures the
cosine of the angle between two pointwise mutual information vectors: a
similarity of 0 indicates orthogonal (unrelated) vectors whereas a similarity
of 1 indicates identical vectors.
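A small sketch of this comparison follows, with invented mi vectors represented as dictionaries mapping observations to scores; it is an illustration of the cosine coefficient above, not the SIFT or Guspin code.

```python
# Cosine of the angle between two sparse mi vectors; observations missing from
# one vector simply contribute nothing to the numerator.
import math

def cosine(vec_i, vec_j):
    shared = set(vec_i) & set(vec_j)
    num = sum(vec_i[o] * vec_j[o] for o in shared)
    norm_i = math.sqrt(sum(v * v for v in vec_i.values()))
    norm_j = math.sqrt(sum(v * v for v in vec_j.values()))
    return num / (norm_i * norm_j) if norm_i and norm_j else 0.0

# Two columns that share their unusual (high-mi) observations look similar;
# a column with different unusual values does not.
col_a = {"solvent_x": 2.1, "diesel": 0.2, "benzene": 1.7}
col_b = {"solvent_x": 1.9, "diesel": 0.1, "benzene": 1.5}
col_c = {"propane": 2.0, "diesel": 0.3}
print(round(cosine(col_a, col_b), 2))  # close to 1
print(round(cosine(col_a, col_c), 2))  # close to 0
```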
Pantel et al. (2005; 2006) use individual data sets from various
Environmental Protection Agency (EPA) offices in California and the US. In
one experiment, they align data measuring air (pollution) quality collected
by several of California’s Air Quality Management Districts with
corresponding data compiled by the central California Air Resources Board
(CARB). The process of aligning all 26 districts’ data with CARB’s
database, currently performed manually, takes about one year. The SIFT
system, developed by Pantel and colleagues, used Pointwise Mutual
Information to align the 2001 data collections (which cover facilities,
devices, processes, permitting history, criteria, and toxic emissions) of the
air pollution control districts of Santa Barbara County, Ventura County, and
San Diego County, with that of CARB, over the period of a few weeks in
total, once the system was set up and the data downloaded.
The Santa Barbara County data contained about 300 columns, and the
corresponding CARB data collection approximately the same number; a
completely naïve algorithm would thus have to consider approximately
90,000 candidate alignments in the worst case. Using Pointwise Mutual
Information, SIFT suggested 295 alignments, of which 75% were correct. In
fact, there were 306 true alignments, of which SIFT identified 221 (or 72%).
Whenever the system managed to find a correct alignment for a given
column, the alignment was found within the topmost two ranked candidate
alignments. Considering only two candidate alignments for each possible
column obviously greatly reduces the number of possible validation
decisions required of a human expert. Assuming that each of the 90,000
candidate alignments must be considered (in practice, many alignments are
easily rejected by human experts) and that for each column the system were
to output at most k alignments, a human expert would have to inspect
only k × 300 alignments. For k = 2, only 0.67% of the possible alignment
decisions must be inspected, representing an enormous saving in time.
The Guspin system, also developed at ISI, has been used to identify
duplicates within several databases of air quality measurements compiled by
the US EPA, including the CARB and AQMDs emissions inventories as
well as EPA’s Facilities Registry System (FRS). In summary, Guspin’s
performance on the CARB and Santa Barbara County Air Pollution Control
District 2001 emissions inventories was:
• with 100% accuracy, Guspin extracted 50% of the matching facilities;
• with 90% accuracy, Guspin extracted 75% of the matching facilities;
• when the top five mappings returned by Guspin for a given facility were
considered, Guspin extracted 89% of the matching facilities with 92% accuracy.
4. CONCLUSION
The problem of data and information integration is widespread in
government and industry, and it is getting worse as legacy systems continue
to appear. The absence of efficient, large-scale, practical solutions to the
problem, and the promise especially of information-theoretic techniques
for comparing sets of data values, make this an extremely rewarding and
potentially high-payoff area for future research. Direct access and metadata
alignment approaches appear to be rather inaccurate and/or still require
considerable human effort. In contrast, the approach of finding possible
alignments across data collections by statistical measures on the actual data
itself holds great promise for the future.
ACKNOWLEDGEMENTS
The author wishes to thank Bruce Croft and Jamie Callan for help with
Section 2.1, Jose Luis Ambite and Andrew Philpot for help with Section 3.1,
and Patrick Pantel and Andrew Philpot for help with Section 3.2.
Writing this paper was funded in part by NSF grant no. EIA-0306899,
dated 08/12/2003, awarded to the author under the NSF’s Digital
Government program.
REFERENCES
Ambite J.L. and C.A. Knoblock. 2000. Flexible and Scalable Cost-Based Query Planning in
Mediators: A Transformational Approach. Artificial Intelligence Journal, 118(1–2).
Ambite, J.L., Y. Arens, E.H. Hovy, A. Philpot, L. Gravano, V. Hatzivassiloglou, and J.L.
Klavans. 2001. Simplifying Data Access: The Energy Data Collection Project. IEEE
Computer 34(2), February.
Ambite, J.L., Y. Arens, L. Gravano, V. Hatzivassiloglou, E.H. Hovy, J.L. Klavans, A.
Philpot, U. Ramachandran, K. Ross, J. Sandhaus, D. Sarioz, A. Singla, and B. Whitman.
2002. Data Integration and Access: The Digital Government Research Center’s Energy
Data Collection (EDC) Project. In W. McIver and A.K. Elmagarmid (eds.), Advances in
Digital Government, pp. 85–106. Dordrecht: Kluwer.
Arens, Y., C.A. Knoblock and C.-N. Hsu. 1996. Query Processing in the SIMS Information
Mediator. In A. Tate (ed.), Advanced Planning Technology. Menlo Park: AAAI Press.
Baru, C., A. Gupta, B. Ludaescher, R. Marciano, Y. Papakonstantinou, and P. Velikhov.
1999. XML-Based Information Mediation with MIX. Proceedings of Exhibitions
Program of ACM SIGMOD International Conference on Management of Data.
Bruno, N., S. Chaudhuri, and L. Gravano. 2002. Top-k Selection Queries over Relational
Databases: Mapping Strategies and Performance Evaluation. ACM Transactions on
Database Systems, 27(2), June 2002.
Callan, J., M. Connell, and A. Du. 1999. Automatic Discovery of Language Models for Text
Databases. Proceedings of the 1999 ACM-SIGMOD International Conference on
Management of Data, pp. 479–490.
Callan, J., Z. Lu, and W.B. Croft. 1995. Searching distributed collections with inference
networks. Proceedings of the Eighteenth Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, pp. 21–28.
Callan, J., W.B. Croft, and E.H. Hovy. 2001. Two Approaches toward Solving the Problem of
Access to Distributed and Heterogeneous Data. In DGnOnline, online magazine for
Digital Government. http://www.dgrc.org/dg-online/.
Chawathe, S., H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman, and
J. Widom. 1994. The TSIMMIS Project: Integration of Heterogeneous Information
Sources. Proceedings of IPSJ Conference, pp. 7–18.
Davis, J., I. Dutra, D. Page, and V.S. Costa. 2005. Establishing Identity Equivalence in Multi-
Relational Domains. Proceedings of the International Conference on Intelligence Analysis.
Doan, A., P. Domingos, and A.Y. Halevy. 2001. Reconciling Schemas of Disparate Data
Sources: A Machine-learning Approach. Proceedings of SIGMOD-2001, pp. 509–520.
French, J.C., A.L. Powell, J. Callan, C.L. Viles, T. Emmitt, and K.J. Prey. 1999. Comparing
the performance of database selection algorithms. Proceedings of the 22nd Annual
International ACM SIGIR Conference on Research and Development in Information
Retrieval.
Gravano, L., H. García-Molina, and A. Tomasic. 1994. The effectiveness of GLOSS for the
text database discovery problem. Proceedings of SIGMOD 94, pp. 126–137.
Hovy, E.H., A. Philpot, J.L. Klavans, U. Germann, P.T. Davis, and S.D. Popper. 2003.
Extending Metadata Definitions by Automatically Extracting and Organizing Glossary
Definitions. Proceedings of the dg.o 2003 conference. Boston, MA.
Hovy, E.H., A. Philpot, J.-L. Ambite, Y. Arens, J.L. Klavans, W. Bourne, and D. Sarioz.
2001. Data Acquisition and Integration in the DGRC’s Energy Data Collection Project.
Proceedings of the NSF’s National Conference on Digital Government dg.o 2001.
Kang, J. and J.F. Naughton. 2003. On Schema Matching with Opaque Column Names and
Data Values. Proceedings of SIGMOD-2003.
Klavans, J.L., P.T. Davis, and S. Popper. 2002. Building Large Ontologies using Web-
Crawling and Glossary Analysis Techniques. Proceedings of the NSF’s National
Conference on Digital Government dg.o 2002.
Larkey, L. 1999. A Patent Search and Classification System. Proceedings of Digital Libraries
(DL 99).
Levy, A.Y. 1998. The Information Manifold Approach to Data Integration. IEEE Intelligent
Systems (September/October), pp. 11–16.
Lewis, D. and P. Hayes (eds.). 1994. Special issue on Text Categorization, ACM Transactions
on Information Systems, 12(3).
Pantel, P., A. Philpot, and E.H. Hovy. 2005. An Information Theoretic Model for Database
Alignment. Proceedings of Conference on Scientific and Statistical Database Management
(SSDBM-05), pp. 14–23.
Philpot, A., E.H. Hovy, and L. Ding. 2004. Multilingual DGRC AskCal: Querying Energy
Time Series in English, Spanish, and Mandarin Chinese. System demo and description in
Proceedings of the NSF’s National Conference on Digital Government dg.o 2004.
Ponte, J. and W.B. Croft. 1998. A Language Modeling Approach to Information Retrieval.
Proceedings of the 21st International Conference on Research and Development in
Information Retrieval, pp. 275–281.
Pyreddy, P. and W.B. Croft. 1997. TINTIN: A System for Retrieval in Text Tables.
Proceedings of the ACM Conference on Digital Libraries, pp. 193–200.
Ross, K.A. 2002. Conjunctive Selection Conditions in Main Memory. Proceedings of the
2002 PODS Conference.
Shannon, C.E. 1948. A Mathematical Theory of Communication. Bell System Technical
Journal, 27, pp. 379–423 and 623–656.
Xu, J. and J. Callan. 1998. Effective retrieval of distributed collections. Proceedings of the
21st Annual International ACM SIGIR Conference on Research and Development in
Information Retrieval, pp. 112–120.
SUGGESTED READINGS
• Access to textual data using Information Retrieval methods. A general
overview appears in Turtle, H.R. and W.B. Croft, 1992. A comparison of
text retrieval models. Computer Journal, 35(3), pp. 279–290. An early
work on language models for IR is Ponte, J. 1998. A Language Modeling
Approach to Information Retrieval. Ph.D. thesis, Computer Science
Department, University of Massachusetts.
• Automated database access using an Artificial Intelligence planner to
decompose user queries into database queries from individual databases
and recompose them, within the approach of a central integrating
metadata model. See Ambite J.L. and C.A. Knoblock. 2000. Flexible
and Scalable Cost-Based Query Planning in Mediators: A
Transformational Approach. Artificial Intelligence Journal, 118(1–2).
• General method of applying Pointwise Mutual Information to discover
mappings across individual columns or rows of numerical data. See
Pantel, P., A. Philpot, & E.H. Hovy. 2005. Data Alignment and
Integration. IEEE Computer 38(12), pp. 43–51. For more details, see
Pantel, P., A. Philpot, and E.H. Hovy. 2005. An Information Theoretic
Model for Database Alignment. Proceedings of Conference on Scientific
and Statistical Database Management (SSDBM-05), pp. 14–23.
QUESTIONS FOR DISCUSSION
1. Which of the three methods for data and information integration is most
suitable for the following kinds of data?
• Collections of workplace safety regulations and associated backup
documents (studies, etc.)
• Daily numerical readings of traffic density and flow in a city
• Databases about crop characteristics (text) and annual growth
(numbers)
Why, in each case?
2. Let each student create a hierarchical (ontology-like) metadata model of
foods (fruits, vegetables, meats, dairy products, baked foods, etc.), where
each type contains between 1 and 4 defining characteristics. Include at
least 20 different individual foods, and group them into types. Compare
the various models for both content and organization. Are some models
better than others? If so, why? If not, why not?
3. How would you use a search engine like Google to locate the most
appropriate data table from a collection that contains only numerical
information?
4. Build a system to compare numerical databases. First try simple column-
by-column comparisons, such as counting the cells whose values are equal; then
implement Pointwise Mutual Information and the cosine similarity metric
and use that. Download data from the EIA’s site http://www.eia.gov that
contains pages in which data from various states has been combined.
Also download some of the sources and evaluate your system’s results
against the combinations produced by the EIA.