Chapter 12
DATA AND KNOWLEDGE INTEGRATION FOR
E-GOVERNMENT
Eduard Hovy
Information Sciences Institute, University of Southern California, Marina del Rey, California,
U.S.A. (hovy@isi.edu)
CHAPTER OVERVIEW
Data integration is one of the most significant IT problems facing government today. Using
information technology, government agencies have collected vastly more data than was ever
possible to collect before. Unfortunately, for lack of standardization, most government data
today exists in thousands of different formats and resides in hundreds of systems and
versions, spread across dozens of agencies. This situation makes the data almost impossible to
find, re-use, and build upon after further data collection. A considerable amount of research
has been performed over the past decades to overcome this problem. Within Digital
Government, several projects have focused on government data collections. Three principal
approaches have been followed: (1) direct access, using information retrieval techniques; (2)
metadata reconciliation, using ontology alignment techniques; and (3) data mapping, using
information theoretic techniques. This chapter discusses each approach, and provides specific
examples of the last two.
1. INTRODUCTION
Government is, in principle, a data-intensive enterprise: in the ideal case,
the more data available about some issue, organization, or individual, the
better the decisions government agencies can make. Information technology has
enabled government agencies to collect vastly more data than was ever
possible before. But this ability comes at a cost: in order to be useful, the
data must be properly organized, standardized, and maintained. Unfortunately,
the situation most characteristic of present-day government, in almost all its
branches, is that much data has been collected but not stored in a uniform
way, in a common representation, or using a standardized system. That is, the
data may reside in hundreds of different
formats, systems, and versions. While the information might be somewhere,
the user often doesn’t know where to find it, how to access it, or how to
convert all variations of it to a single useful format.
It is therefore no surprise that one of the most significant problems
experienced by government agencies is data and knowledge integration.
Reconciling the differences across data collections is not a trivial matter.
Data and information integration involves several aspects: recognizing that
two data sets describe the ‘same’ topic; understanding their differences (in
numerous ways, including specificity, accuracy, and coverage); creating a
common description or framework of metadata for them; establishing
equivalences by performing data mapping across the data collections; and
possibly converting the actual data from one or more data sources into a
common form.
One of the principal problems facing efforts to integrate nonhomogeneous
data sets is terminology standardization: what one agency calls salary
another might call income, and a third might call wages (even while it may
have something else entirely that it calls salary). Defining exactly what data
has been obtained, and making sure that the actual method of data capture
employed did in fact accurately observe the specifications, is a task for
specialists, and may require some very sophisticated analysis and
description. It is not uncommon for specialists in different government
agencies to spend weeks or even months understanding precisely what
differences exist between their respective data collections, even when to the
untrained eye the collections seem essentially identical. For example, when
recording gasoline prices in a given area, it matters not only at which
locations the prices are measured, but also whether they are measured every
Tuesday or only once a month. Determining
how significant the differences in measurement are, and deciding how to
reconcile them (simple numerical average? Average weighted by volume
sold?) into a single number, is a matter of interpretation, and may easily have
unexpected policy consequences: after all, the results will presumably be used
by some lawmaker, reported in some publication, or picked up by the press. The
data integration process is thus not only technically difficult but also fraught
with potentially unexpected legal and social ramifications.
Several IT researchers have studied data and information integration for
e-Government data collections. As can be expected, most of them have avoided
making the definitive final integration themselves, instead providing tools
and/or methods that government specialists can use to make their own
integration decisions. Three principal approaches exist:
• Direct access, using information retrieval techniques
• Metadata reconciliation, using ontology alignment techniques
• Data mapping, using information theoretic techniques
We discuss these in the next section, and provide specific examples of
the last two in Section 3.
2. OVERVIEW OF THE FIELD
2.1 Direct Access using Information Retrieval
Direct access methods do not aim to provide a single uniform perspective
over the data sources. Rather, along the lines of web search technology like
Google, the IT tools return to the user all data pertinent to his or her request,
after which the user must decide what to do. As one might expect, this
approach works best for textual, not numerical, data. Typically, the technology
inspects metadata and the text accompanying the metadata, such as
documentation, commentary, or footnotes (Gravano et al., 1994; Lewis and
Hayes, 1994; Pyreddy and Croft, 1997; Xu and Callan, 1998).
Experience with traditional forms of metadata, such as controlled
vocabularies, shows that such metadata is expensive and time-consuming to
produce, that authors often resist creating it, and that information consumers often have
difficulty relating their information need to pre-specified ontologies or
controlled vocabularies. Controlled vocabularies and relatively static
metadata ontologies are difficult to update and hence not really suitable to
support the rapid integration of new information that must be easy for the
general population to use and that must be maintained at moderate expense.
To address this problem, one approach is to try to generate metadata
automatically, using language models (lists of basic vocabulary, phrases,
names, etc., with frequency information) instead of ontologies or controlled
vocabularies. These language models are extracted by counting the words
and phrases appearing in the texts accompanying the data collections. Most
retrieval systems use term frequency, document frequency, and document
length statistics.
This approach has been adopted by information retrieval researchers
(Callan et al., 1995; Ponte and Croft, 1998). It is based on older work on the
automatic categorization of information relative to a controlled vocabulary
or classification hierarchy (Lewis and Hayes, 1994; Larkey, 1999). Ponte
and Croft (1998) infer a language model for each document and estimate
the probability of generating the query according to each of these models.
Documents are then ranked according to these probabilities. Research by
Callan et al. (1999) shows that language models enable relatively accurate
database selection. More details of this approach, and a comparison with the
following one, appear in (Callan et al., 2001).
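To make the language-model idea concrete, the following sketch builds a unigram model from the text accompanying each database and ranks the databases by the smoothed likelihood of generating a query. It is a simplification under stated assumptions, not the code of any of the systems cited above; the database names and texts are invented.

```python
# A minimal sketch of language-model-based database selection, in the spirit
# of Ponte and Croft (1998) and Callan et al. (1999). Illustrative only.
import math
import re
from collections import Counter

def language_model(text):
    """Build a unigram term-frequency model from the text accompanying a database."""
    terms = re.findall(r"[a-z]+", text.lower())
    return Counter(terms), len(terms)

def query_log_likelihood(query, model, total, coll_model, coll_total, lam=0.5):
    """Score a database by the smoothed probability that its language model
    generates the query; Jelinek-Mercer smoothing avoids zero probabilities."""
    score = 0.0
    for term in re.findall(r"[a-z]+", query.lower()):
        p_db = model[term] / total if total else 0.0
        p_coll = coll_model[term] / coll_total if coll_total else 0.0
        p = lam * p_db + (1 - lam) * p_coll
        score += math.log(p) if p > 0 else math.log(1e-12)  # floor for unseen terms
    return score

# Usage: rank candidate databases for a user query by query likelihood.
docs = {
    "gasoline_prices": "weekly retail gasoline price survey by region and grade",
    "labor_statistics": "monthly wages income and employment figures by sector",
}
coll_counts, coll_total = language_model(" ".join(docs.values()))
models = {name: language_model(text) for name, text in docs.items()}
query = "average gasoline price"
ranked = sorted(
    docs,
    key=lambda n: query_log_likelihood(query, models[n][0], models[n][1],
                                       coll_counts, coll_total),
    reverse=True,
)
print(ranked)  # the gasoline database should rank first
```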
2.2 Metadata Reconciliation
Almost all data collections are accompanied by metadata that provides
some definitional information (at the very least, the names and types of the
data collected). Given several data collections in a domain, people
often attempt to enforce (or at least enable) standardization of nomenclature,
and facilitate interoperability of IT across data sources, by creating
centralized metadata descriptions that provide the overarching data
‘framework’ for the whole domain. When the metadata for a specific data
resource is integrated with this centralized framework, the data becomes
interpretable in the larger context, and can therefore be compared to, and
used in tandem with, data from other data collections similarly connected.
In the US, the government has funded several metadata initiatives,
including the Government Information Locator Service (GILS) (http://www.
gils.net/) and the Advanced Search Facility (ASF) (http://asf.gils.net/). These
initiatives seek to establish a structure of cooperation and standards between
agencies, including defining structural information (formats, encodings, and
links). However, they do not focus on the actual creation of metadata, and do
not define the algorithms needed to generate metadata.
A large amount of research has been devoted to the problem of creating
general metadata frameworks for a domain, linking individual data
collections’ metadata to a central framework, and providing user access to
the most appropriate data source, given a query (Baru et al., 1999; Doan
et al., 2001; Ambite and Knoblock, 2000; French et al., 1999; Arens et al.,
1996). Two major approaches have been studied. In the first, called
global-as-view, the global model is defined and used as a view over the various
data sources. This approach first appeared in Multibase and later in TSIMMIS
(Chawathe et al., 1994). In the second, called local-as-view or sometimes
view rewriting, the sources are used as views on the global model (Levy,
1998). The disadvantage of the first approach is that the user must reengineer
the definitions of the global model whenever any of the sources change or
when new sources are added. The view rewriting approach does not suffer
from this problem, but instead must face the problem of rewriting queries
into data access plans for all the other sources using views, a problem that is
NP-hard or worse.
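A toy contrast of the two styles may help; the tables, column names, and mappings below are hypothetical and greatly simplified, intended only to illustrate the direction of the view definitions rather than any particular mediator system.

```python
# In global-as-view (GAV), the global relation is written directly as a query
# over the sources, so adding a source means re-editing this definition. In
# local-as-view (LAV), each source is described in terms of the global model,
# and queries must be rewritten against those descriptions at run time.

# Two hypothetical sources that disagree on terminology (salary vs. wages).
agency_a = [{"person": "p1", "salary": 52000}]
agency_b = [{"person": "p2", "wages": 48000}]

# Global-as-view: the global relation Income(person, income) is defined as a
# view over the concrete sources.
def global_income_gav():
    for row in agency_a:
        yield {"person": row["person"], "income": row["salary"]}
    for row in agency_b:
        yield {"person": row["person"], "income": row["wages"]}

# Local-as-view: each source is declared as a view on the global model; a
# query planner would use these descriptions to rewrite a global query.
source_descriptions = {
    "agency_a": {"covers": "Income", "rename": {"salary": "income"}},
    "agency_b": {"covers": "Income", "rename": {"wages": "income"}},
}

print(list(global_income_gav()))
```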
Below we describe an example of a hybrid approach that defines the data
sources in terms of the global model and then compiles the source
descriptions into axioms that define the global model in terms of the
individual sources. These axioms can be efficiently instantiated at run-time
to determine the most appropriate rewriting to answer a query automatically.
This approach combines the flexibility of the view rewriting with the
efficiency of the query processing in Multibase and TSIMMIS.
To date, the general approach of integration using metadata to find
similarities between entities within or across heterogeneous data sources
has always required some manual effort. Despite some promising recent work,
the automated creation of such mappings at high accuracy and high coverage
is still in its infancy, since equivalences and differences manifest themselves
at all levels, from individual data values through metadata to the explanatory
text surrounding the data collection as a whole.
2.3 Data Mapping
Formally defined metadata may provide a great deal of useful
information about a data source, and thereby greatly facilitate the work of
the IT specialist required to integrate it with another data source. But all too
often, the metadata is sketchy, and sometimes even such basic information as
the data column headings is given only as some kind of abbreviated code. In addition, such
auxiliary data can be outdated, irrelevant, overly domain specific, or simply
non-existent. A general-purpose solution to this problem cannot therefore
rely on such auxiliary data. All one can count on is the data itself: a set of
observations describing the entities.
A very recent approach to data integration skirts metadata altogether, and
focuses directly on the data itself. Necessarily, this data-driven paradigm
requires some method to determine which individual data differences are
significant and which are merely typical data value variations. To date, the
approach focuses on numerical data only. The general paradigm is to
employ statistical / information theoretic techniques to calculate average or
characteristic values for data (sub)sets, to then determine which values are
unusual with respect to their (sub)set, and to compare the occurrences of
unusual values across comparable data collections in order to find
corresponding patterns. From such patterns, likely data alignments are then
proposed for manual validation. Davis et al. (2005) describe a supervised
learning algorithm for discovering aliases in multi-relational domains. Their
method has two stages: high recall is obtained by first learning a set of
rules using Inductive Logic Programming (ILP), and these rules are then
used as the features of a Bayesian network classifier. In many domains,
however, training data is unavailable.
A different approach uses Mutual Information, an information theoretic
measure of the degree to which one data value predicts another. Kang and
Naughton (2003) begin with a known alignment to match unaligned columns
after schema- and instance-based matching fails. Given two columns A.x and
B.x in databases A and B that are known to be aligned, they use Mutual
Information to compute the association strength between column A.x and
each other column in A, and between column B.x and each other column in B. The
assumption is that highly associated columns from A and B are the best
candidates for alignment. Also using Mutual Information, the work of Pantel
et al. (2005), which we describe in more detail below, appears to be very
promising.
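The following sketch illustrates the dependency-based matching idea in simplified form; it is not the algorithm of Kang and Naughton (2003), and the column names and values are invented for illustration.

```python
# Given one column in each database already known to be aligned (the anchor),
# compute the mutual information between the anchor and every other column,
# and propose alignments between columns whose association strengths with the
# anchor are most similar.
import math
from collections import Counter

def mutual_information(xs, ys):
    """MI between two columns, estimated from their joint value frequencies."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

def association_profile(table, anchor):
    """Association strength of every non-anchor column with the anchor column."""
    return {col: mutual_information(table[anchor], vals)
            for col, vals in table.items() if col != anchor}

# Hypothetical databases A and B; column "region" is the known alignment.
A = {"region": ["n", "n", "s", "s"], "grade": ["reg", "reg", "prem", "prem"]}
B = {"region": ["n", "s", "n", "s"], "fuel_type": ["r", "p", "r", "p"]}

profile_a = association_profile(A, "region")
profile_b = association_profile(B, "region")
# Pair each column of A with the column of B whose association strength is closest.
for col_a, s_a in profile_a.items():
    best = min(profile_b, key=lambda col_b: abs(profile_b[col_b] - s_a))
    print(col_a, "->", best)
```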
3. TWO EXAMPLES
In order to provide some detail, we describe EDC, an example database
access planning system based on metadata reconciliation, and SIFT-Guspin,
an example of the data mapping approach.
3.1 The EDC System
(Parts of this section were written by Jose Luis Ambite and Andrew Philpot.)
The Energy Data Collection (EDC) project (Ambite et al., 2001; 2002)
focused on providing access to a large amount of data about gasoline prices
and weekly volumes of sale, collected in several quite different databases by
government researchers at the US Energy Information Administration (EIA),
the Bureau of Labor Statistics (BLS), the Census Bureau, and the California
Energy Commission. In all, over 50,000 data tables were materialized and
used in the final EDC system. The system could be accessed via various
interfaces, including cascaded menus, a natural language question analyzer
for English and Spanish (Philpot et al., 2004), and an ontology (metadata)
browser. Other research in this project focused on data aggregation to
integrate data collected at various granularities (Bruno et al., 2002), query
and result caching for rapid access to very large data collections (Ross,
2002), and the automated extraction of ontology terms from data glossary
definitions (Klavans et al., 2002).
The principal problem was to develop a system that could present a
single unified view of all the disparate, heterogeneous data, in such a way
as to support the needs both of experts and of users relatively unfamiliar with
the data, such as journalists or educators, while also being formally specified
so as to be used by an automated data access planner. This planner, inherited
from the SIMS project (Ambite and Knoblock, 2000; Arens et al., 1996),
used Artificial Intelligence techniques to decompose the user’s query into a
set of subqueries, each addressed to a specific database, and to recompose
the results obtained from the various sources into a single response.
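As a rough illustration of this decompose-and-recompose idea (not the SIMS or EDC planner itself; the source names, attributes, and rows below are invented), a query over several attributes can be split among sources according to attribute coverage and the partial results joined on a shared key:

```python
# Toy illustration only: each hypothetical source advertises the attributes it
# covers; a query is split into per-source subqueries keyed on a shared
# attribute, and the partial answers are joined back together.

SOURCES = {
    "eia_prices":  {"attrs": {"region", "price"},  "rows": [{"region": "CA", "price": 2.15}]},
    "carb_volume": {"attrs": {"region", "volume"}, "rows": [{"region": "CA", "volume": 900}]},
}

def plan(query_attrs, key="region"):
    """Assign each requested attribute to some source that covers it."""
    subqueries = {}
    for attr in query_attrs:
        for name, src in SOURCES.items():
            if attr in src["attrs"]:
                subqueries.setdefault(name, {key}).add(attr)
                break
    return subqueries

def execute(subqueries, key="region"):
    """Run each subquery and join the partial results on the shared key."""
    combined = {}
    for name, attrs in subqueries.items():
        for row in SOURCES[name]["rows"]:
            combined.setdefault(row[key], {}).update(
                {a: row[a] for a in attrs if a in row})
    return list(combined.values())

# One combined row for region CA carrying both price and volume.
print(execute(plan({"price", "volume"})))
```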
This research took the following approach. Rather than building domain
models from scratch, the researchers adopted USC/ISI’s 70,000-node
terminology taxonomy (a very simple ontology) called SENSUS as the
overarching meta-model and extended it to incorporate new energy-related
domain models. To speed up this process, they developed automated
concept-to-ontology alignment algorithms (Hovy et al., 2001), as well as
algorithms that extracted terms from data sources and clustered them in
order to jump-start model building (Klavans et al., 2002; Hovy et al., 2003).
In order to connect SENSUS terms with the individual metadata models
of each source database, a domain model of approximately 500 nodes was
created manually to represent the concepts present in the EDC gasoline
domain, and manually connected to the various metadata models. This
model was then semi-automatically linked into SENSUS using a new type of
ontology link called generally-associated-with (GAW) that held between
concepts in the ontology and domain model concepts. GAW links enabled
the user while browsing to rapidly proceed from high-level (quite general
and perhaps inaccurate) concepts to the (very specific and precise) domain
model concepts associated with real data in the databases. In contrast to the
links between data sources and domain model concepts, which were logical
equivalences as required to ensure the correctness of SIMS reasoning, the
semantics of GAW links was purposely vague. Such vagueness allowed a
domain model concept (such as Price) to be connected to several very
disparate SENSUS concepts (such as Price, Cost, Money, Charge, Dollar,
Amount, Fee, Payment, Paying, etc.). Clearly, while these links cannot
support automated inference, they can support the non-expert user, allowing
him or her to start browsing or query formation with whatever terms are
most familiar. In addition, the vague semantics had a fortunate side effect, in
that it facilitated the automated alignment of concepts from the domain model to
SENSUS.
A considerable amount of effort was devoted to developing semi-
automated term-to-term alignment discovery algorithms (Hovy et al., 2001).
These algorithms fell into three classes: name matches, with various
heuristics for decomposing term names; definition matches, which considered
term definitions and definitional descriptions from associated documents;
and dispersal matches, which considered the relative locations in SENSUS of
groups of candidate matches. A fairly extensive series of experiments
focused on determining the optimal parameter settings for these algorithms,
using three sets of data: the abovementioned EDC gasoline data, the
NHANES collection of 20,000 rows of 1238 fields from a survey by the
National Center for Health Statistics, and (for control purposes) a set of 60
concepts in 3 clusters extracted from SENSUS. Although the alignment
techniques were never very accurate, they did significantly shorten the time
required to connect SENSUS to the domain model, compared to manual
insertion, and they were quite well suited for creating GAW links. Research in (semi-)
automated ontology alignment is an ongoing and popular endeavor; see
http://www.atl.lmco.com/projects/ontology/.
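As an illustration of the simplest of the three classes of heuristic mentioned above, the following sketch implements a basic name match: term names are decomposed into word-like tokens and candidate pairs are scored by token overlap. The actual EDC algorithms (Hovy et al., 2001) were considerably more elaborate, and the term names shown here are invented.

```python
# A minimal name-matching heuristic: decompose term names and score candidate
# ontology concepts by Jaccard overlap of their token sets. Illustrative only.
import re

def tokens(term):
    """Split a term name on underscores, hyphens, and camelCase boundaries."""
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", term)
    return set(re.split(r"[\s_\-]+", spaced.lower())) - {""}

def name_match_score(a, b):
    """Jaccard overlap between the token sets of two term names."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

domain_terms = ["retail_gasoline_price", "sales_volume"]
ontology_terms = ["GasolinePrice", "Price", "VolumeOfSale", "Payment"]
for d in domain_terms:
    ranked = sorted(ontology_terms, key=lambda o: name_match_score(d, o), reverse=True)
    print(d, "->", ranked[0], round(name_match_score(d, ranked[0]), 2))
```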
3.2 The SIFT-Guspin System
(Parts of this section were written by Patrick Pantel and Andrew Philpot.)
Not all data is equally useful for comparison—some observations are
much more informative and important than others. This work uses Pointwise
Mutual Information to calculate the information content (approximately, the
unpredictability) of individual data items in various data collections, and
then compares groupings of unusual (i.e., unpredictable in surprising ways)
ones across collections. In simple terms, the hypothesis of this work is that
correspondences of unusual values are much more indicative of likely data
alignments than correspondences that arise due to ‘random’ variation.
When assessing the similarity between entities, important observations
should be weighed higher than less important ones. Shannon’s theory of
information (Shannon, 1948) provides a metric, called Pointwise Mutual
Information, that measures the degree to which one event predicts another. More
precisely, the formula measures the amount of information one event x gives
about another event y, where P(x) denotes the probability that x occurs, P(y)
the probability that y occurs, and P(x,y) the probability that they both occur:
$$\mathrm{mi}(x, y) = \log \frac{P(x, y)}{P(x)\,P(y)}$$
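As a minimal illustration of this formula (not the SIFT implementation; the entities and observations below are invented), pointwise mutual information can be estimated from a list of co-occurring (entity, observation) pairs:

```python
# Pointwise mutual information estimated from co-occurrence counts.
import math
from collections import Counter

def pmi(pairs, x, y):
    """mi(x, y) = log P(x, y) / (P(x) P(y)), estimated from (x, y) pairs."""
    n = len(pairs)
    p_xy = Counter(pairs)[(x, y)] / n
    p_x = Counter(a for a, _ in pairs)[x] / n
    p_y = Counter(b for _, b in pairs)[y] / n
    return math.log(p_xy / (p_x * p_y)) if p_xy > 0 else float("-inf")

# A rare observation shared by an entity is more informative (higher PMI)
# than a common one.
observations = [("facility_1", "solvent_x"), ("facility_1", "diesel"),
                ("facility_2", "diesel"), ("facility_3", "diesel")]
print(pmi(observations, "facility_1", "solvent_x"))  # > 0: unusual, informative
print(pmi(observations, "facility_1", "diesel"))     # <= 0: common, less informative
```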
Given a method of ranking observations according to their relative
importance, one also needs a comparison metric for determining the
similarity between two entities. An important requirement is that the metric
not be too sensitive to unseen observations. That is, the absence of a
matching observation should not indicate dissimilarity as strongly as the
presence of one indicates similarity. Since not all distance metrics make this
distinction (Euclidean distance, for example, does not), a good choice is the
cosine coefficient, a common metric in which the similarity between each
pair of entities $e_i$ and $e_j$ is given by:
$$\mathrm{sim}(e_i, e_j) = \frac{\sum_{o} \mathrm{mi}(e_i, o) \times \mathrm{mi}(e_j, o)}{\sqrt{\sum_{o} \mathrm{mi}(e_i, o)^2} \times \sqrt{\sum_{o} \mathrm{mi}(e_j, o)^2}}$$
where o ranges through all possible observations. This formula measures the
cosine of the angle between two pointwise mutual information vectors: a
similarity of 0 indicates orthogonal (unrelated) vectors whereas a similarity
of 1 indicates identical vectors.
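A small sketch of this comparison follows, with invented mi vectors represented as dictionaries mapping observations to scores; it is an illustration of the cosine coefficient above, not the SIFT or Guspin code.

```python
# Cosine of the angle between two sparse mi vectors; observations missing from
# one vector simply contribute nothing to the numerator.
import math

def cosine(vec_i, vec_j):
    shared = set(vec_i) & set(vec_j)
    num = sum(vec_i[o] * vec_j[o] for o in shared)
    norm_i = math.sqrt(sum(v * v for v in vec_i.values()))
    norm_j = math.sqrt(sum(v * v for v in vec_j.values()))
    return num / (norm_i * norm_j) if norm_i and norm_j else 0.0

# Two columns that share their unusual (high-mi) observations look similar;
# a column with different unusual values does not.
col_a = {"solvent_x": 2.1, "diesel": 0.2, "benzene": 1.7}
col_b = {"solvent_x": 1.9, "diesel": 0.1, "benzene": 1.5}
col_c = {"propane": 2.0, "diesel": 0.3}
print(round(cosine(col_a, col_b), 2))  # close to 1
print(round(cosine(col_a, col_c), 2))  # close to 0
```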
Pantel et al. (2005; 2006) use individual data sets from various
Environmental Protection Agency (EPA) offices in California and the US. In
one experiment, they align data measuring air (pollution) quality collected
by several of California’s Air Quality Management Districts with
corresponding data compiled by the central California Air Resources Board
(CARB). The process of aligning all 26 districts’ data with CARB’s
database, currently performed manually, takes about one year. The SIFT
system, developed by Pantel and colleagues, used Pointwise Mutual
Information to align the 2001 data collections (which cover facilities,
devices, processes, permitting history, criteria, and toxic emissions) of the
air pollution control districts of Santa Barbara County, Ventura County, and
San Diego County, with that of CARB, over the period of a few weeks in
total, once the system was set up and the data downloaded.
The Santa Barbara County data contained about 300 columns, and the
corresponding CARB data collection approximately the same number; a
completely naïve algorithm would thus have to consider approximately
90,000 candidate alignments in the worst case. Using Pointwise Mutual
Information, SIFT suggested 295 alignments, of which 75% were correct. In
fact, there were 306 true alignments, of which SIFT identified 221 (or 72%).
Whenever the system managed to find a correct alignment for a given
column, the alignment was found within the topmost two ranked candidate
alignments. Considering only two candidate alignments for each possible
column obviously greatly reduces the number of possible validation
decisions required of a human expert. Assuming that each of the 90,000
candidate alignments must be considered (in practice, many alignments are
easily rejected by human experts) and that for each column the system were
to output at most k alignments, a human expert would have to inspect
only k × 300 alignments. For k = 2, only 0.67% of the possible alignment
decisions must be inspected, representing an enormous saving in time.
The Guspin system, also developed at ISI, has been used to identify
duplicates within several databases of air quality measurements compiled by
the US EPA, including the CARB and AQMDs emissions inventories as
well as EPA’s Facilities Registry System (FRS). In summary, Guspin’s
performance on the CARB and Santa Barbara County Air Pollution Control
District 2001 emissions inventories was:
• with 100% accuracy, Guspin extracted 50% of the matching facilities;
• with 90% accuracy, Guspin extracted 75% of the matching facilities;
• when the top five mappings returned by Guspin for a given facility were
considered, Guspin extracted 89% of the matching facilities with 92% accuracy.
4. CONCLUSION
The problem of data and information integration is widespread in
government and industry, and it is getting worse as legacy systems continue
to appear. The absence of efficient, large-scale, practical solutions to the
problem, and the promise especially of information-theoretic techniques
for comparing sets of data values, make this an extremely rewarding and
potentially high-payoff area for future research. Direct access and metadata
alignment approaches appear to be rather inaccurate and/or still require
considerable human effort. In contrast, the approach of finding possible
alignments across data collections by statistical measures on the actual data
itself holds great promise for the future.
ACKNOWLEDGEMENTS
The author wishes to thank Bruce Croft and Jamie Callan for help with
Section 2.1, Jose Luis Ambite and Andrew Philpot for help with Section 3.1,
and Patrick Pantel and Andrew Philpot for help with Section 3.2.
Writing this paper was funded in part by NSF grant no. EIA-0306899,
dated 08/12/2003, awarded to the author under the NSF’s Digital
Government program.
REFERENCES
Ambite J.L. and C.A. Knoblock. 2000. Flexible and Scalable Cost-Based Query Planning in
Mediators: A Transformational Approach. Artificial Intelligence Journal, 118(1–2).
Ambite, J.L., Y. Arens, E.H. Hovy, A. Philpot, L. Gravano, V. Hatzivassiloglou, and J.L.
Klavans. 2001. Simplifying Data Access: The Energy Data Collection Project. IEEE
Computer 34(2), February.
Ambite, J.L., Y. Arens, L. Gravano, V. Hatzivassiloglou, E.H. Hovy, J.L. Klavans, A.
Philpot, U. Ramachandran, K. Ross, J. Sandhaus, D. Sarioz, A. Singla, and B. Whitman.
2002. Data Integration and Access: The Digital Government Research Center’s Energy
Data Collection (EDC) Project. In W. McIver and A.K. Elmagarmid (eds.), Advances in
Digital Government, pp. 85–106. Dordrecht: Kluwer.
Arens, Y., C.A. Knoblock and C.-N. Hsu. 1996. Query Processing in the SIMS Information
Mediator. In A. Tate (ed.), Advanced Planning Technology. Menlo Park: AAAI Press.
Baru, C., A. Gupta, B. Ludaescher, R. Marciano, Y. Papakonstantinou, and P. Velikhov.
1999. XML-Based Information Mediation with MIX. Proceedings of Exhibitions
Program of ACM SIGMOD International Conference on Management of Data.
Bruno, N., S. Chaudhuri, and L. Gravano. 2002. Top-k Selection Queries over Relational
Databases: Mapping Strategies and Performance Evaluation. ACM Transactions on
Database Systems, 27(2), June 2002.
Callan, J., M. Connell, and A. Du. 1999. Automatic Discovery of Language Models for Text
Databases. Proceedings of the 1999 ACM-SIGMOD International Conference on
Management of Data, pp. 479–490.
Callan, J., Z. Lu, and W.B. Croft. 1995. Searching distributed collections with inference
networks. Proceedings of the Eighteenth Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, pp. 21–28.
Callan, J., W.B. Croft, and E.H. Hovy. 2001. Two Approaches toward Solving the Problem of
Access to Distributed and Heterogeneous Data. In DGnOnline, online magazine for
Digital Government. http://www.dgrc.org/dg-online/.
Chawathe, S., H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman, and
J. Widom. 1994. The TSIMMIS Project: Integration of Heterogeneous Information
Sources. Proceedings of IPSJ Conference, pp. 7–18.
Davis, J., I. Dutra, D. Page, and V.S. Costa. 2005. Establishing Identity Equivalence in Multi-
Relational Domains. Proceedings of the International Conference on Intelligence Analysis.
Doan, A., P. Domingos, and A.Y. Halevy. 2001. Reconciling Schemas of Disparate Data
Sources: A Machine-learning Approach. Proceedings of SIGMOD-2001, pp. 509–520.
French, J.C., A.L. Powell, J. Callan, C.L. Viles, T. Emmitt, and K.J. Prey. 1999. Comparing
the performance of database selection algorithms. Proceedings of the 22nd Annual
International ACM SIGIR Conference on Research and Development in Information
Retrieval.
Gravano, L., H. García-Molina, and A. Tomasic. 1994. The effectiveness of GLOSS for the
text database discovery problem. Proceedings of SIGMOD 94, pp. 126–137.
Hovy, E.H., A. Philpot, J.L. Klavans, U. Germann, P.T. Davis, and S.D. Popper. 2003.
Extending Metadata Definitions by Automatically Extracting and Organizing Glossary
Definitions. Proceedings of the dg.o 2003 conference. Boston, MA.
Hovy, E.H., A. Philpot, J.-L. Ambite, Y. Arens, J.L. Klavans, W. Bourne, and D. Sarioz.
2001. Data Acquisition and Integration in the DGRC’s Energy Data Collection Project.
Proceedings of the NSF’s National Conference on Digital Government dg.o 2001.
Kang, J. and J.F. Naughton. 2003. On Schema Matching with Opaque Column Names and
Data Values. Proceedings of SIGMOD-2003.
Klavans, J.L., P.T. Davis, and S. Popper. 2002. Building Large Ontologies using Web-
Crawling and Glossary Analysis Techniques. Proceedings of the NSF’s National
Conference on Digital Government dg.o 2002.
Larkey, L. 1999. A Patent Search and Classification System. Proceedings of Digital Libraries
(DL 99).
Levy, A.Y. 1998. The Information Manifold Approach to Data Integration. IEEE Intelligent
Systems (September/October), pp. 11–16.
Lewis, D. and P. Hayes (eds.). 1994. Special issue on Text Categorization, ACM Transactions
on Information Systems, 12(3).
Pantel, P., A. Philpot, and E.H. Hovy. 2005. An Information Theoretic Model for Database
Alignment. Proceedings of Conference on Scientific and Statistical Database Management
(SSDBM-05), pp. 14–23.
Philpot, A., E.H. Hovy, and L. Ding. 2004. Multilingual DGRC AskCal: Querying Energy
Time Series in English, Spanish, and Mandarin Chinese. System demo and description in
Proceedings of the NSF’s National Conference on Digital Government dg.o 2004.
Ponte, J. and W.B. Croft. 1998. A Language Modeling Approach to Information Retrieval.
Proceedings of the 21st International Conference on Research and Development in
Information Retrieval, pp. 275–281.
Pyreddy, P. and W.B. Croft. 1997. TINTIN: A System for Retrieval in Text Tables.
Proceedings of the ACM Conference on Digital Libraries, pp. 193–200.
Ross, K.A. 2002. Conjunctive Selection Conditions in Main Memory. Proceedings of the
2002 PODS Conference.
Shannon, C.E. 1948. A Mathematical Theory of Communication. Bell System Technical
Journal, 27, pp. 379–423 and 623–656.
Xu, J. and J. Callan. 1998. Effective retrieval of distributed collections. Proceedings of the
21st Annual International ACM SIGIR Conference on Research and Development in
Information Retrieval, pp. 112–120.
SUGGESTED READINGS
• Access to textual data using Information Retrieval methods. A general
overview appears in Turtle, H.R. and W.B. Croft, 1992. A comparison of
text retrieval models. Computer Journal, 35(3), pp. 279–290. An early
work on language models for IR is Ponte, J. 1998. A Language Modeling
Approach to Information Retrieval. Ph.D. thesis, Computer Science
Department, University of Massachusetts.
• Automated database access using an Artificial Intelligence planner to
decompose user queries into database queries from individual databases
and recompose them, within the approach of a central integrating
metadata model. See Ambite J.L. and C.A. Knoblock. 2000. Flexible
and Scalable Cost-Based Query Planning in Mediators: A
Transformational Approach. Artificial Intelligence Journal, 118(1–2).
• General method of applying Pointwise Mutual Information to discover
mappings across individual columns or rows of numerical data. See
Pantel, P., A. Philpot, & E.H. Hovy. 2005. Data Alignment and
Integration. IEEE Computer 38(12), pp. 43–51. For more details, see
Pantel, P., A. Philpot, and E.H. Hovy. 2005. An Information Theoretic
Model for Database Alignment. Proceedings of Conference on Scientific
and Statistical Database Management (SSDBM-05), pp. 14–23.
QUESTIONS FOR DISCUSSION
1. Which of the three methods for data and information integration is most
suitable for the following kinds of data?
• Collections of workplace safety regulations and associated backup
documents (studies, etc.)
• Daily numerical readings of traffic density and flow in a city
• Databases about crop characteristics (text) and annual growth
(numbers)
Why, in each case?
2. Let each student create a hierarchical (ontology-like) metadata model of
foods (fruits, vegetables, meats, dairy products, baked foods, etc.), where
each type contains between 1 and 4 defining characteristics. Include at
least 20 different individual foods, and group them into types. Compare
the various models for both content and organization. Are some models
better than others? If so, why? If not, why not?
3. How would you use a search engine like Google to locate the most
appropriate data table from a collection that contains only numerical
information?
4. Build a system to compare numerical databases. First try simple column-
by-column comparisons, such as counting the cells whose values are equal; then
implement Pointwise Mutual Information and the cosine similarity metric
and use that. Download data from the EIA’s site http://www.eia.gov that
contains pages in which data from various states has been combined.
Also download some of the sources and evaluate your system’s results
against the combinations produced by the EIA.