Introduction To Statistical Relational Learning
edited by
Lise Getoor
Ben Taskar
© 2007 Massachusetts Institute of Technology
All rights reserved. No part of this book may be reproduced in any form by any electronic
or mechanical means (including photocopying, recording, or information storage and retrieval)
without permission in writing from the publisher.
Contents
Series Foreword
Preface
1 Introduction
Lise Getoor, Ben Taskar
1.1 Overview
1.2 Brief History of Relational Learning
1.3 Emerging Trends
1.4 Statistical Relational Learning
1.5 Chapter Map
1.6 Outlook
2 Graphical Models in a Nutshell
Daphne Koller, Nir Friedman,
2.1 Introduction
2.2 Representation
2.3 Inference
2.4 Learning
2.5 Conclusion
5 Probabilistic Relational Models
Lise Getoor, Nir Friedman, Daphne Koller, Avi Pfeffer, Ben Taskar
5.1 Introduction
5.2 PRM Representation
5.3 The Difference between PRMs and Bayesian Networks
5.4 PRMs with Structural Uncertainty
5.5 Probabilistic Model of Link Structure
5.6 PRMs with Class Hierarchies
5.7 Inference in PRMs
5.8 Learning
5.9 Conclusion
6 Relational Markov Networks
Ben Taskar, Pieter Abbeel, Ming-Fai Wong, Daphne Koller
6.1 Introduction
6.2 Relational Classification and Link Prediction
6.3 Graph Structure and Subgraph Templates
6.4 Undirected Models for Classification
6.5 Learning the Models
6.6 Experimental Results
6.7 Discussion and Conclusions
7 Probabilistic Entity-Relationship Models, PRMs,
David Heckerman, Chris Meek, Daphne Koller
7.1 Introduction
7.2 Background: Graphical Models
7.3 The Basic Ideas
7.4 Probabilistic Entity-Relationship Models
7.5 Plate Models
7.6 Probabilistic Relational Models
7.7 Technical Details
7.8 Extensions and Future Work
8 Relational Dependency Networks
Jennifer Neville, David Jensen
8.1 Introduction
8.2 Dependency Networks
8.3 Relational Dependency Networks
8.4 Experiments
8.5 Related Work
8.6 Discussion and Future Work
9 Logic-based Formalisms for Statistical Relational Learning
James Cussens
9.1 Introduction
9.2 Representation
9.3 Inference
9.4 Learning
9.5 Conclusion
10.1 Introduction
10.2 On Bayesian Networks and Logic Programs
10.3 Bayesian Logic Programs
10.4 Extensions of the Basic Framework
10.5 Learning Bayesian Logic Programs
10.6 Balios - The Engine for Basic Logic Programs
10.7 Related Work
10.8 Conclusions
Contributors
Index
Series Foreword
The goal of building systems that can adapt to their environments and learn from
their experience has attracted researchers from many fields, including computer
science, engineering, mathematics, physics, neuroscience, and cognitive science.
Out of this research has come a wide variety of learning techniques that have
the potential to transform many scientific and industrial fields. Recently, several
research communities have converged on a common set of issues surrounding
supervised, unsupervised, and reinforcement learning problems. The MIT Press series
on Adaptive Computation and Machine Learning seeks to unify the many diverse
strands of machine learning research and to foster high-quality research and
innovative applications.
Thomas Dietterich
Preface
The goal of this book is to bring together important research at the intersection
of statistical, logical and relational learning. The material in the collection is
aimed at graduate students and researchers in machine learning and artificial
intelligence. While by no means exhaustive, the articles introduce a wide variety of
recent approaches to combining expressive knowledge representation and statistical
learning.
The idea for this book emerged from a series of successful workshops addressing
these issues:
Learning Statistical Models from Relational Data (SRL2000) at the National
Conference on Artificial Intelligence (AAAI-2000), organized by Lise Getoor and
David Jensen.
Learning Statistical Models from Relational Data (SRL2003) at the International
Joint Conference on Artificial Intelligence (IJCAI-2003), organized by Lise
Getoor and David Jensen.
Statistical Relational Learning and its Connections to Other Fields (SRL2004)
at the International Conference on Machine Learning (ICML2004), organized by
Tom Dietterich, Lise Getoor and Kevin Murphy.
Probabilistic, Logical and Relational Learning - Towards a Synthesis, Dagstuhl
Seminar 2005, organized by Luc De Raedt, Thomas Dietterich, Lise Getoor and
Stephen Muggleton.
Open Problems in Statistical Relational Learning (SRL2006) at the International
Conference on Machine Learning (ICML2006), organized by Alan Fern, Lise
Getoor, and Brian Milch.
We would like to thank all of the participants at these workshops for their
intellectual contributions and also for creating a warm and welcoming research
community coming together from several distinct research areas.
In addition, there have been several other closely related workshops, including
the series of workshops on Multi-Relational Data Mining held in conjunction with
the Knowledge Discovery and Data Mining Conference beginning in 2002 organized
by Saso Dzeroski, Luc De Raedt, Stefan Wrobel, and Hendrik Blockeel.
This volume contains invited contributions from leading researchers in this new
research area. Each chapter has been reviewed by at least two anonymous reviewers.
We are very grateful to all the authors for their high quality contributions and to
all the reviewers for helping to clarify and improve this work.
In addition to thanking the workshop participants, book contributors and
reviewers, we would like to thank our advisors: Daphne Koller, our PhD advisor;
Stuart Russell, Lise Getoor's MS advisor; and Michael Jordan, Ben Taskar's
postdoctoral advisor. Lise Getoor would also like to thank David Jensen; besides being
one of the people responsible for the name "statistical relational learning," David
has been a great mentor, workshop co-organizer and friend. We would also like
to thank Tom Dietterich, Pedro Domingos, and David Heckerman, who have been
very encouraging in developing this book. Luc De Raedt, Kristian Kersting, Stephen
Muggleton, Saso Dzeroski and Hendrik Blockeel have been especially encouraging
members from the inductive logic programming and relational learning community.
Lise would also like to thank her inquisitive graduate students, members of the
LINQs group at the University of Maryland, College Park, for their participation
in this project. Finally, on a more personal note, Lise would like to thank Pete for
his unwavering support and Ben would like to thank Anat for being his rock.
1 Introduction
We outline the major themes, problems and approaches that define the subject of
the book: statistical relational learning. While the problems of statistical learning
and relational representation and reasoning have a fairly long history on their own
in artificial intelligence research, the synthesis of the approaches is currently a
burgeoning field. We briefly sketch the background and the recent developments
presented in the book.
1.1 Overview
The vast majority of statistical learning literature assumes the data is represented
by points in a high-dimensional space. For any particular isolated task, such as
learning to detect a face in an image or classify an email message as spam or not,
we can usually construct the relevant low-level features (e.g., pixels, filters, words,
URLs) and solve the problem using standard tools for the vector representation.
While extremely useful for development of elegant and general algorithms and
analysis, this abstraction hides the rich logical structure of the underlying data
that is crucial for solving more general and complex problems. We may want not
only to detect a face in an image but also to recognize that, for example, it is the
face of a tall woman who is spiking a volleyball or a little boy jumping into a puddle.
Or, in the case of email, we might want to detect that an email message is
not only not spam but is a request from our supervisor to meet tomorrow with
three colleagues or an invitation to the downstairs neighbor's birthday party next
Sunday. We are ultimately interested in not just answering an isolated yes/no
question, but in producing and manipulating structured representations of the data,
involving objects described by attributes and participating in relationships, actions,
and events. The challenge is to develop formalisms, models, and algorithms that
enable effective and robust reasoning about this type of object-relational structure
of the data.
Dealing with real data, like images and text, inevitably requires the ability to
handle the uncertainty that arises from noise and incomplete information (e.g.,
occlusions, misspellings). In relational problems, uncertainty arises on many levels.
Beyond uncertainty about the attributes of an object, there may be uncertainty
about an object's type, the number of objects, and the identity of an object (what
kind, which, and how many entities are depicted or written about), as well as
relationship membership, type, and number (which entities are related, how, and
how many times). Solving interesting relational learning tasks robustly requires
sophisticated treatment of uncertainty at these multiple levels of representation.
In this book, we present the growing body of work on a variety of statistical
models that target relational learning tasks. The goal of these representations is
to express probabilistic models in a compact and intuitive way that reflects the
relational structure of the domain and, ideally, supports efficient learning and
inference. The majority of these models are based on combinations of graphical
models, probabilistic grammars, and logical formulae.
1.2 Brief History of Relational Learning
from larger databases [9]. These rules are often used for prediction and may
have a probabilistic interpretation. The ILP community has had successes in
a number of application areas including discovery of 2D structural alerts for
mutagenicity/carcinogenicity [22], 3D pharmacophore discovery for drug design
[10], and analysis of chemical databases [7].
1.3 Emerging Trends
Recently, both the ILP community and the statistical ML community have begun
to incorporate aspects of the complementary technology. Many ILP researchers are
developing stochastic and probabilistic representations and algorithms [31, 21, 6]. In
more traditional ML circles, researchers who have in the past focused on
attribute-value or propositional learning algorithms are exploring methods for
incorporating relational information [5, 32, 4]. It is our hope that this trend will
continue, and
that the work presented in this book will provide a bridge connecting relational and
statistical learning.
Among the strong motivations for using a relational model is its ability to
model dependencies between related instances. Intuitively, we would like to use
our information about one object to help us reach conclusions about other, related
objects. For example, in web data, we should be able to propagate information
about the topic of a document to documents it has links to and documents that link
to it. These, in turn, would propagate information to yet other documents. Many
researchers have proposed a process along the lines of this relational influence
propagation idea [3, 44, 32]. Chakrabarti et al. [3] describe a relaxation labeling
algorithm that makes use of the neighboring link information. The algorithm begins
with the labeling given by a text-based classifier constructed from the training set. It
then uses the estimated class of neighboring documents to update the distribution of
the document being classified. The intuitions underlying these procedural systems
can be given declarative semantics using probabilistic graphical models [46, 15, 47].
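The influence-propagation idea can be sketched in a few lines. The following is only a toy illustration in the spirit of relaxation labeling, not the algorithm of Chakrabarti et al. [3]: the link graph, the text-based class distributions, the mixing weight `alpha`, and the averaging update are all assumptions made for the example.

```python
# Toy sketch of relational influence propagation (relaxation-labeling style).
# The graph, the text-based priors, and the update rule are illustrative assumptions.

def propagate(neighbors, text_probs, alpha=0.5, iterations=10):
    """Repeatedly mix each document's text-based class distribution
    with the average of its neighbors' current distributions."""
    current = {doc: dict(p) for doc, p in text_probs.items()}
    for _ in range(iterations):
        updated = {}
        for doc, prior in text_probs.items():
            nbrs = neighbors.get(doc, [])
            if not nbrs:
                # No links: fall back on the text-based estimate alone.
                updated[doc] = dict(prior)
                continue
            mixed = {}
            for label in prior:
                nbr_avg = sum(current[n][label] for n in nbrs) / len(nbrs)
                mixed[label] = (1 - alpha) * prior[label] + alpha * nbr_avg
            total = sum(mixed.values())
            updated[doc] = {label: v / total for label, v in mixed.items()}
        current = updated
    return current

# Links are treated as undirected, so information flows in both directions.
neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
text_probs = {
    "a": {"sports": 0.5, "politics": 0.5},  # the text alone is uninformative
    "b": {"sports": 0.9, "politics": 0.1},
    "c": {"sports": 0.8, "politics": 0.2},
}
result = propagate(neighbors, text_probs)
```

In this toy graph, document `a` has an uninformative text-based estimate, so its final distribution is pulled toward the class favored by its linked neighbors.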
1.4 Statistical Relational Learning
still extract meaningful statistics from the data to use in our statistical inference
procedures.
Model selection is a challenging SRL problem. Similar to work in propositional
graphical models, many approaches make use of some type of heuristic search
through the model space. Methods for scoring propositional graphical models have
been extended for SRL learning [12, 13]. The search can make use of certain biases
defined over the model space, such as allowing dependencies only among attributes
of related instances according to the entity-relationship model, or using binding
patterns to constrain which clauses are considered for addition to the probabilistic rules.
Certain common issues arise repeatedly, in different guises, in a number of the
SRL systems. One of the most common issues is feature construction and
aggregation. The rich variety in structure combined with the need for a compact
parameterization gives rise to the need to construct relational features or aggregates [12]
which capture the local neighborhood of a random variable. Because it is infeasible
to explicitly define factors over all potential neighborhoods, aggregates provide
an intuitive way of describing the relational neighborhood. Common aggregates
include taking the mean or mode of some neighboring attribute, taking the min
or the max, or simply counting the number of neighbors. More complex,
domain-specific aggregates are also possible. Aggregation has also been studied as a means
for propositionalizing a relational classification problem [25, 23, 26]. Within the SRL
community, Perlich and Provost [36, 37] have studied aggregation extensively and
Popescul and Ungar [42] have worked on statistical predicate invention.
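As a small illustration of the common aggregates listed above, the sketch below maps a variable-size bag of neighbor attribute values to a fixed set of features. The function name `aggregate_neighborhood` and the citation-rating example are hypothetical and do not come from any of the systems discussed.

```python
from statistics import mean, mode

def aggregate_neighborhood(values):
    """Summarize a variable-size bag of neighbor attribute values
    with a fixed-size set of aggregate features."""
    if not values:
        return {"count": 0, "mean": None, "mode": None, "min": None, "max": None}
    return {
        "count": len(values),   # number of neighbors
        "mean": mean(values),
        "mode": mode(values),   # most common value
        "min": min(values),
        "max": max(values),
    }

# Hypothetical example: ratings of the papers cited by one paper.
cited_ratings = [3, 5, 5, 2]
features = aggregate_neighborhood(cited_ratings)
```

Whatever the number of neighbors, the result has the same five entries, which is what allows a compact parameterization over neighborhoods of varying size.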
Structural uncertainty is another common issue that researchers have begun
investigating. Many of the early SRL approaches consider the case where there is a
single logical interpretation, or relational skeleton, which defines the set of random
variables, and there is a probability distribution over the instantiations of the
random variables. Structural uncertainty supports uncertainty over the relational
interpretation. Koller and Pfeffer [24] introduced several forms, including number
uncertainty, where there is a distribution over the number of related objects. Getoor
et al. [16] studied learning models with structural uncertainty, and showed how
these representations could be supported by a probabilistic logic-based system [14].
Pasula and Russell [35] studied identity uncertainty, a form of structural uncertainty
which allows modeling uncertainty about the identity of a reference. Most of these
models rely on a closed-world assumption to define the semantics for the models.
More recently, Milch et al. [29] have investigated the use of nonparametric models
which allow an infinite number of objects and support an open-world model (see
chapter 13 for details). Other recent flexible approaches include the infinite
relational models of Kemp et al. [20] and Xu et al. [50].
1.5 Chapter Map
The book begins with several introductory chapters providing tutorials for the
material which many of the later chapters build upon. Chapter 2 is on graphical models
and covers the basics of representation, inference, and learning in both directed and
undirected models. Chapter 3 by Dzeroski describes ILP. ILP, unlike many other ML
approaches, has traditionally dealt with multi-relational data. The learned models
are typically described by sets of relational rules called logic programs, and the
methods can make use of logical background knowledge. Chapter 4 by Sutton and
McCallum covers conditional random fields (CRFs), a very popular class of models
for structured supervised learning. An advantage of CRFs is that the models are
optimized for predictive performance on only the subset of variables of interest.
The chapter provides a tutorial on training and inference in CRFs, with particular
attention given to the important special case of linear CRFs. The chapter concludes
with a discussion of applications to information extraction.
The next set of chapters describes several frame-based SRL approaches.
Chapter 5 provides an introduction to probabilistic relational models (PRMs). PRMs are
directed graphical models which can capture dependencies among objects and
uncertainty over the relational structure. In addition to describing the representation,
the chapter describes algorithms for inference and learning. Chapter 6 describes
relational Markov networks (RMNs), which are essentially CRFs lifted to the
relational setting. A particularly relevant advantage of RMNs over PRMs is that
acyclicity requirements do not hinder modeling complex, non-causal correlations
concisely; however, as in the non-relational case, this comes at the price of more
expensive parameter estimation. Another advantage of RMNs, like CRFs, is that they
are well suited to discriminative training. Algorithms for inference and learning are
given. Chapter 7, by Heckerman et al., describes a graphical language for
probabilistic entity-relationship models (PERs). One of the contributions of this chapter
is its discussion of the relationship between PERs, PRMs, and plate models. Plate
models [2, 17] were introduced in the statistics community as a graphical
representation for hierarchical models. They can represent the repeated, shared, or tied
parameters in a hierarchical graphical model. PERs synthesize these approaches.
The chapter describes a directed version of PERs, DAPERs, and gives a number
of illustrative examples. Chapter 8, by Neville and Jensen, describes relational
dependency networks (RDNs). RDNs extend propositional dependency networks
to relational domains, and, like dependency networks, have some advantages over
directed graphical models and undirected models. This chapter describes the
representation, inference, and learning algorithms and presents results on several data
sets.
The next four chapters describe logic-based formalisms for SRL. An introductory
chapter, chapter 9 by Cussens, surveys this area, describing work on some of the
early logic-based formalisms such as Poole's work on probabilistic Horn abduction
[39] and independent choice logic [40], Ngo and Haddawy's work on probabilistic
knowledge bases [34], Sato's work on the PRISM system [43], and Ng and
Subrahmanian's work on probabilistic logic programming [33]. Cussens compares and
contrasts these approaches and describes some of the common representational
issues, making connections to approaches described in later chapters. Chapter 10, by
Kersting and De Raedt, describes Bayesian logic programs (BLPs). Their approach
combines Bayesian networks and logic programs to upgrade them to a
representation which overcomes the propositional nature of Bayesian networks and the purely
logical nature of logic programs. This chapter gives an introduction to BLPs,
describing both a Bayesian logic programming tool and a graphical representation for
them. Chapter 11, by Muggleton and Pahlavi, describes stochastic logic programs
(SLPs). SLPs were originally introduced as a means of extending the expressiveness
of stochastic grammars to the level of logic programs. The chapter provides several
example programs and describes both parameter estimation and structure
learning. Chapter 12, by Domingos and Richardson, describes Markov logic. Markov
logic combines Markov networks and first-order logic. First-order logic formulae
are given weights; the formulae define a log-linear model with a feature for each
grounding of the logical formulae with the appropriate weights. The relationship
between many of the other SRL approaches and Markov logic networks (MLNs) is
discussed, along with several common SRL tasks such as collective inference, link
prediction, and object identification. Inference and learning in MLNs are presented.
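To make the log-linear reading of weighted formulae concrete, the sketch below enumerates every possible world of a two-person domain with a single weighted formula, Smokes(x) => Cancer(x), and scores each world by exp(weight × number of true groundings). The domain, the formula, and the weight value are assumptions chosen for illustration, and brute-force enumeration is used only because the example is tiny; it is not the inference method presented in chapter 12.

```python
import itertools
import math

people = ["anna", "bob"]
weight = 1.5  # assumed weight for the formula Smokes(x) => Cancer(x)

def n_true_groundings(world):
    """Count the groundings of Smokes(x) => Cancer(x) that hold in a world.
    A world maps each ground atom (predicate, person) to True or False."""
    return sum(
        1 for p in people
        if (not world[("Smokes", p)]) or world[("Cancer", p)]
    )

# All ground atoms and all 2**4 = 16 truth assignments to them.
atoms = [(pred, p) for pred in ("Smokes", "Cancer") for p in people]
worlds = [
    dict(zip(atoms, values))
    for values in itertools.product([False, True], repeat=len(atoms))
]

# Unnormalized log-linear score: exp(weight * number of true groundings).
scores = [math.exp(weight * n_true_groundings(w)) for w in worlds]
Z = sum(scores)                  # partition function
probs = [s / Z for s in scores]  # probability of each world
```

A world that violates one grounding is less probable than one that violates none by a factor of exp(weight), so a formula with a finite weight acts as a soft constraint rather than a hard one.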
Many of the approaches discussed so far have made, either implicitly or
explicitly, several practical assumptions (the closed-world assumption, domain closure,
unique names) about the underlying logical interpretation in order to define the
underlying semantics. Chapter 13, by Milch et al., describes BLOG, a system
especially tailored toward cases in which these assumptions are not appropriate. BLOG
models define stochastic processes for generating worlds; inference in these models
is done via a sampling process. Chapter 14, by Pfeffer, describes IBAL, a functional
programming language for probabilistic AI. IBAL supports a rich decision-theoretic
framework which includes probabilistic reasoning and utility maximization. The
chapter describes the syntax and semantics of IBAL, along with a
sophisticated inference algorithm which exploits both lazy evaluation and memoization for
efficient inference.
One of the issues that comes up in many of the approaches is the need to perform
effective inference in large-scale probabilistic models. Many of the approaches can
make use of lifted inference, inference which is done at the level of the first-order
representation directly, rather than at the propositional level. Chapter 15 describes
first-order variable elimination, an algorithm for lifted probabilistic inference, and
presents recent results.
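The benefit of reasoning above the propositional level can be suggested with a simple counting argument. The sketch below is not first-order variable elimination; it only contrasts enumerating every ground assignment with grouping interchangeable individuals by count. The setting (n individuals, each independently true with probability p) and both function names are assumptions made for the example.

```python
import itertools
import math

def prob_at_least_k_ground(n, p, k):
    """Propositional view: enumerate all 2**n joint assignments."""
    total = 0.0
    for assignment in itertools.product([0, 1], repeat=n):
        ones = sum(assignment)
        if ones >= k:
            total += p ** ones * (1 - p) ** (n - ones)
    return total

def prob_at_least_k_lifted(n, p, k):
    """Lifted view: the individuals are interchangeable, so group the
    assignments by how many individuals are true (n + 1 cases, not 2**n)."""
    return sum(
        math.comb(n, j) * p ** j * (1 - p) ** (n - j)
        for j in range(k, n + 1)
    )
```

Both functions return the same probability, but the first does work exponential in n while the second does work linear in n; exploiting this kind of symmetry is the intuition behind lifted inference.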
One of the issues that comes up in each of the learning algorithms is the need
for feature generation and selection. Chapter 16, by Popescul and Ungar, examines
this issue in the context of structured generalized linear regression (SGLR). They
address the need for an integrated approach to feature generation and selection.
Chapter 17, by Davis et al., addresses a related issue, the need for view learning to
support feature generation and selection. They describe two approaches and present
results on a mammography analysis system.
Chapter 18, by Fern et al., surveys recent work in reinforcement learning in relational domains. There has been a lot of recent work on relational learning within the
reinforcement learning setting and our collection does not try to comprehensively
cover its scope. Instead we have chosen a representative contribution describing a
1.6 Outlook
In this introduction we have touched on a number of the common themes and
issues that will be developed in greater detail in the following chapters. While a
single unified framework has yet to emerge, we believe that the book highlights
the commonalities, and clarifies some of the important differences among proposed
approaches. Along the way, important representational and algorithmic issues are
identified.
Statistical relational learning is a young and exciting field. There are many
opportunities to develop new methods and apply the tools to compelling real-world
problems. We hope this book will provide an introduction to the field, and stimulate
further research, development, and applications.
References
[1] R. Braz, E. Amir, and D. Roth. Lifted first-order probabilistic inference. In
Proceedings of the International Joint Conference on Artificial Intelligence,
2005.
[2] W. Buntine. Operations for learning with graphical models. Journal of
Artificial Intelligence Research, 3:159-225, 1994.
[6] J. Cussens. Loglinear models for first-order probabilistic reasoning. In
Proceedings of the Conference on Uncertainty in Artificial Intelligence, 1999.
[7] L. Dehaspe, H. Toivonen, and R.D. King. Finding frequent substructures in
chemical compounds. In International Conference on Knowledge Discovery
and Data Mining, 1998.
[8] T. Dietterich and R. S. Michalski. Inductive learning of structural descriptions:
Evaluation criteria and comparative review of selected methods. Artificial
Intelligence, 16:257-294, 1986.
[9] S. Dzeroski and N. Lavrac, editors. Relational Data Mining. Kluwer, Berlin,
2001.
[10] P. Finn, S. Muggleton, D. Page, and A. Srinivasan. Discovery of
pharmacophores using the inductive logic programming system Progol. Machine
Learning, 30(1-2):241-270, 1998.
[11] D. A. Forsyth and J. Ponce. Computer Vision: A Modern Approach. Prentice
Hall, Upper Saddle River, NJ, 2002.
[12] N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic
relational models. In Proceedings of the International Joint Conference on
Artificial Intelligence, 1999.
[13] L. Getoor. Learning Statistical Models from Relational Data. PhD thesis,
Stanford University, Stanford, CA, 2001.
[14] L. Getoor and J. Grant. PRL: A probabilistic relational language. Machine
Learning Journal, 62(1-2):7-31, 2006.
[15] L. Getoor, E. Segal, B. Taskar, and D. Koller. Probabilistic models of text
and link structure for hypertext classification. In Proceedings of the IJCAI
Workshop on Text Learning: Beyond Supervision, 2001.
[16] L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning probabilistic
models of link structure. Journal of Machine Learning Research, 3:679-707,
2002.
[17] W. Gilks, A. Thomas, and D. Spiegelhalter. A language and program for
complex Bayesian modeling. The Statistician, 43:169-177, 1994.
[18] F. Hayes-Roth and J. McDermott. Knowledge acquisition from structural
descriptions. In Proceedings of the International Joint Conference on Artificial
Intelligence, 1997.
[19] D. Heckerman, D. Chickering, C. Meek, R. Rounthwaite, and C. Kadie.
Dependency networks for inference, collaborative filtering and data visualization.
Journal of Machine Learning Research, 1:49-75, 2000.
[20] C. Kemp, J. Tenenbaum, T. Griffiths, T. Yamada, and N. Ueda. Learning
systems of concepts with an infinite relational model. In Proceedings of the
National Conference on Artificial Intelligence, 2006.
2 Graphical Models in a Nutshell
Probabilistic graphical models are an elegant framework which combines
uncertainty (probabilities) and logical structure (independence constraints) to compactly
represent complex, real-world phenomena. The framework is quite general in that
many of the commonly proposed statistical models (Kalman filters, hidden Markov
models, Ising models) can be described as graphical models. Graphical models have
enjoyed a surge of interest in the last two decades, due both to the flexibility and
power of the representation and to the increased ability to effectively learn and
perform inference in large networks.
2.1
Introduction
Graphical models [11, 3, 5, 9, 7] have become an extremely popular tool for modeling uncertainty. They provide a principled approach to dealing with uncertainty
through the use of probability theory, and an effective approach to coping with
complexity through the use of graph theory. The two most common types of graphical models are Bayesian networks (also called belief networks or causal networks)
and Markov networks (also called Markov random fields (MRFs)).
At a high level, our goal is to efficiently represent a joint distribution P over
some set of random variables $\mathcal{X} = \{X_1, \ldots, X_n\}$. Even in the simplest case where
these variables are binary-valued, a joint distribution requires the specification of
$2^n$ numbers: the probabilities of the $2^n$ different assignments of values $x_1, \ldots, x_n$.
However, it is often the case that there is some structure in the distribution that
allows us to factor the representation of the distribution into modular components.
The structure that graphical models exploit is the independence properties that
exist in many real-world phenomena.
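As a rough illustration of the savings, the following Python sketch compares the size of a full joint table over n binary variables with the number of table entries needed when each variable depends directly on only a few others. The function names and the chain-shaped structure are assumptions made for this illustration; they are not taken from the text.

```python
def joint_table_size(n: int) -> int:
    """Number of entries in a full joint table over n binary variables."""
    return 2 ** n


def factored_size(parent_counts: list) -> int:
    """Entries needed if every binary variable stores one probability
    (that of being true) per joint assignment of its binary parents."""
    return sum(2 ** k for k in parent_counts)


if __name__ == "__main__":
    n = 20
    # Hypothetical structure: a chain, so the first variable has no parent
    # and every other variable has exactly one.
    chain = [0] + [1] * (n - 1)
    print(joint_table_size(n), factored_size(chain))
```

The gap grows exponentially with n as long as the number of direct dependencies per variable stays bounded.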
The independence properties in the distribution can be used to represent such
high-dimensional distributions much more compactly. Probabilistic graphical models provide a general-purpose modeling language for exploiting this type of structure
in our representation. Inference in probabilistic graphical models provides us with
the mechanisms for gluing all these components back together in a probabilistically
coherent manner. Effective learning, both parameter estimation and model selection, in probabilistic graphical models is enabled by the compact parameterization.
This chapter provides a compact graphical models tutorial based on [8]. We cover
representation, inference, and learning. Our tutorial is not comprehensive; for more
details see [8, 11, 3, 5, 9, 4, 6].
2.2
Representation
The two most common classes of graphical models are Bayesian networks and
Markov networks. The underlying semantics of Bayesian networks are based on
directed graphs and hence they are also called directed graphical models. The
underlying semantics of Markov networks are based on undirected graphs; Markov
networks are also called undirected graphical models. It is possible, though less
common, to use a mixed directed and undirected representation (see, for example,
the work on chain graphs [10, 2]); however, we will not cover them here.
Basic to our representation is the notion of conditional independence:
Definition 2.1
Let X, Y, and Z be sets of random variables. X is conditionally independent of
Y given Z in a distribution P if
$$P(X = x, Y = y \mid Z = z) = P(X = x \mid Z = z)\, P(Y = y \mid Z = z)$$
for all values $x \in \mathrm{Val}(X)$, $y \in \mathrm{Val}(Y)$, and $z \in \mathrm{Val}(Z)$.
In the case where P is understood, we use the notation $(X \perp Y \mid Z)$ to say that X
is conditionally independent of Y given Z. If it is clear from the context, sometimes
we say "independent" when we really mean "conditionally independent."
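The definition can be checked numerically on a small joint table. The sketch below is a minimal illustration, assuming a joint distribution stored as a Python dictionary from assignment tuples to probabilities; the helper names and the example numbers are hypothetical. To avoid dividing by P(Z = z), it tests the equivalent condition P(x, y, z) P(z) = P(x, z) P(y, z).

```python
from itertools import product


def marginal(joint, keep):
    """Sum a joint table (assignment tuple -> probability) down to the positions in `keep`."""
    out = {}
    for assignment, p in joint.items():
        key = tuple(assignment[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


def is_cond_independent(joint, xs, ys, zs, tol=1e-9):
    """Check (X ⊥ Y | Z), where xs, ys, zs are tuples of variable positions."""
    pxyz = marginal(joint, xs + ys + zs)
    pxz, pyz, pz = marginal(joint, xs + zs), marginal(joint, ys + zs), marginal(joint, zs)
    nx, ny = len(xs), len(ys)
    for xz, p_xz in pxz.items():
        x, z = xz[:nx], xz[nx:]
        for yz, p_yz in pyz.items():
            y, z2 = yz[:ny], yz[ny:]
            if z2 != z:
                continue
            # P(x, y, z) P(z) must equal P(x, z) P(y, z)
            if abs(pxyz.get(x + y + z, 0.0) * pz[z] - p_xz * p_yz) > tol:
                return False
    return True


# Hypothetical example: X influences Z, and Z influences Y (all binary).
p_x = {0: 0.6, 1: 0.4}
p_z_given_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
p_y_given_z = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.1, 1: 0.9}}
# Tuple positions: 0 = X, 1 = Y, 2 = Z
joint = {(x, y, z): p_x[x] * p_z_given_x[x][z] * p_y_given_z[z][y]
         for x, y, z in product((0, 1), repeat=3)}

print(is_cond_independent(joint, (0,), (1,), (2,)))  # X and Y given Z
print(is_cond_independent(joint, (0,), (1,), ()))    # X and Y with nothing observed
```

In this example X and Y are conditionally independent given Z but not marginally independent, which is why the set Z matters in the definition.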
2.2.1
Bayesian Networks
The core of the Bayesian network representation is a directed acyclic graph (DAG)
G. The nodes of G are the random variables in our domain and the edges correspond,
intuitively, to direct influence of one node on another. One way to view this graph is
as a data structure that provides the skeleton for representing the joint distribution
compactly in a factorized way.
Let G be a BN graph over the variables $X_1, \ldots, X_n$. Each random variable $X_i$
in the network has an associated conditional probability distribution (CPD) or local
probabilistic model. The CPD for $X_i$, given its parents in the graph (denoted $\mathrm{Pa}_{X_i}$),
is $P(X_i \mid \mathrm{Pa}_{X_i})$. It captures the conditional probability of the random variable,
given its parents in the graph. CPDs can be described in a variety of ways. A
common, but not necessarily compact, representation for a CPD is a table which
contains a row for each possible set of values for the parents of the node describing
the probability of different values for $X_i$. These are often referred to as table CPDs,
and are tables of multinomial distributions. Other possibilities are to represent
the distributions via a tree structure (called, appropriately enough, tree-structured
CPDs), or using an even more compact representation such as a noisy-OR or noisy-MAX.
[Figure 2.1(a): nodes Pneumonia, Tuberculosis, Lung Infiltrates, XRay, Sputum Smear. Figure 2.1(b) tables: P(P) = 0.05; P(T) = 0.02; P(I | P, T) with entries 0.8, 0.6, 0.2, 0.01; P(X | I) with entries 0.8, 0.6; P(S | T) with entries 0.8, 0.6.]
Figure 2.1 (a) A simple Bayesian network showing two potential diseases, Pneumonia and Tuberculosis, either of which may cause a patient to have Lung Infiltrates.
The lung infiltrates may show up on an XRay; there is also a separate Sputum
Smear test for tuberculosis. All of the random variables are Boolean. (b) The same
Bayesian network, together with the conditional probability tables. The probabilities shown are the probability that the random variable takes the value true (given
the values of its parents); the conditional probability that the random variable is
false is simply 1 minus the probability that it is true.
Example 2.1
Consider the simple Bayesian network shown in figure 2.1. This is a toy example
indicating the interactions between two potential diseases, pneumonia and tuberculosis. Both of them may cause a patient to have lung infiltrates. There are two
tests that can be performed. An x-ray can be taken, which may indicate whether
the patient has lung infiltrates. There is a separate sputum smear test for tuberculosis. Figure 2.1(a) shows the dependency structure among the variables. All of the
variables are assumed to be Boolean. Figure 2.1(b) shows the conditional probability
distributions for each of the random variables. We use initials P, T, I, X, and S
for shorthand. At the roots, we have the prior probability of the patient having
each disease. The probability that the patient does not have the disease a priori
is simply 1 minus the probability he or she has the disease; for simplicity only the
probabilities for the true case are shown. Similarly, the conditional probabilities
for the non-root nodes give the probability that the random variable is true, for
different possible instantiations of the parents.
Definition 2.2
Let G be a Bayesian network graph over the variables $X_1, \ldots, X_n$. We say that a
distribution $P_B$ over the same space factorizes according to G if $P_B$ can be expressed
as a product
$$P_B(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}_{X_i}). \qquad (2.1)$$
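To make the factorization concrete, the following Python sketch evaluates the product (2.1) for the network of figure 2.1. The numbers are those listed in figure 2.1(b), but the pairing of each number with a particular parent configuration is an assumption, since the row labels of the tables are not legible here; the helper names are likewise hypothetical.

```python
from itertools import product

# Probability that each variable is true, given its parents.
# ASSUMPTION: the order of the table rows is (true, true), (true, false),
# (false, true), (false, false) for P(I | P, T), and true-then-false otherwise.
P_P = 0.05
P_T = 0.02
P_I = {(True, True): 0.8, (True, False): 0.6, (False, True): 0.2, (False, False): 0.01}
P_X = {True: 0.8, False: 0.6}   # indexed by the value of I
P_S = {True: 0.8, False: 0.6}   # indexed by the value of T


def bern(p_true, value):
    """Probability of `value` for a Boolean variable that is true with probability p_true."""
    return p_true if value else 1.0 - p_true


def joint(p, t, i, x, s):
    """P(P, T, I, X, S) as the product of the five CPDs, one per node."""
    return (bern(P_P, p) * bern(P_T, t) * bern(P_I[(p, t)], i)
            * bern(P_X[i], x) * bern(P_S[t], s))


total = sum(joint(*a) for a in product([True, False], repeat=5))
print(total)                                 # sums to 1 up to rounding
print(joint(True, False, True, True, False)) # one entry of the joint
```

Ten numbers specify all 32 entries of the joint; the sum over all assignments equals 1 for any choice of CPD values, which is a quick sanity check on the factorization.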
Figure 2.2 (a) An indirect causal effect; (b) an indirect evidential effect; (c) a
common cause; (d) a common effect. Each panel is drawn over the nodes X, Y, and Z.
Example 2.3
The BN in figure 2.1(a) describes the following local Markov assumptions: $(P \perp T \mid \emptyset)$,
$(T \perp P \mid \emptyset)$, $(X \perp \{P, T, S\} \mid I)$, and $(S \perp \{P, I, X\} \mid T)$.
These are not the only independence assertions that are encoded by a network.
A general procedure called d-separation (which stands for directed separation) can
answer whether an independence assertion must hold in any distribution consistent
with the graph G. However, note that other independencies may hold in some
distributions consistent with G; these are due to flukes in the particular choice of
parameters of the network (which is why they hold in only some of the distributions).
Returning to our definition of d-separation, it is useful to view probabilistic
influence as a flow in the graph. Our analysis here tells us when influence from
X can flow through Z to affect our beliefs about Y. We will consider flow along
(undirected) paths in the graph.
Consider a simple three-node path $X - Z - Y$. If influence can flow from X to Y
via Z, we say that the path $X - Z - Y$ is active. There are four cases:
Causal path $X \to Z \to Y$: active if and only if Z is not observed.
Evidential path $X \leftarrow Z \leftarrow Y$: active if and only if Z is not observed.
Common cause $X \leftarrow Z \to Y$: active if and only if Z is not observed.
Common effect $X \to Z \leftarrow Y$: active if and only if either Z or one of Z's
descendants is observed.
A structure where $X \to Z \leftarrow Y$ (as in figure 2.2(d)) is also called a v-structure.
Example 2.4
In the BN from figure 2.1(a), the path $P \to I \to X$ is active if I is not
observed. On the other hand, the path $P \to I \leftarrow T$ is active if I is observed.
Now consider a longer path $X_1 - \cdots - X_n$. Intuitively, for influence to flow
from $X_1$ to $X_n$, it needs to flow through every single node on the trail. In other
words, $X_1$ can influence $X_n$ if every two-edge path $X_{i-1} - X_i - X_{i+1}$ along the trail
allows influence to flow. We can summarize this intuition in the following definition:
Definition 2.4
Let G be a BN structure, and $X_1 - \cdots - X_n$ a path in G. Let E be a subset of
nodes of G. The path $X_1 - \cdots - X_n$ is active given evidence E if
whenever we have a v-structure $X_{i-1} \to X_i \leftarrow X_{i+1}$, then $X_i$ or one of its
descendants is in E;
no other node along the path is in E.
Our flow intuition carries through to graphs in which there is more than one
path between two nodes: one node can influence another if there is any path along
which influence can flow. Putting these intuitions together, we obtain the notion
of d-separation, which provides us with a notion of separation between nodes in a
directed graph (hence the term d-separation, for directed separation):
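Definition 2.5
Let $\mathbf{X}$, $\mathbf{Y}$, $\mathbf{Z}$ be three sets of nodes in G. We say that $\mathbf{X}$ and $\mathbf{Y}$ are d-separated
given $\mathbf{Z}$, denoted $\text{d-sep}_G(\mathbf{X}; \mathbf{Y} \mid \mathbf{Z})$, if there is no active path between any node
$X \in \mathbf{X}$ and $Y \in \mathbf{Y}$ given $\mathbf{Z}$.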
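The definitions of active paths and d-separation translate almost directly into code. The Python sketch below is a minimal illustration for single query nodes on small graphs: it enumerates every simple undirected path and applies the two conditions of definition 2.4 to each interior node. The function names are hypothetical, path enumeration is exponential in general (practical implementations use a linear-time reachability procedure instead), and the query nodes are assumed not to be in the evidence set.

```python
def descendants(children, node):
    """All nodes reachable from `node` by following directed edges."""
    seen, stack = set(), [node]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def simple_paths(neighbors, start, goal, path=None):
    """Yield every simple path from start to goal, ignoring edge direction."""
    path = path or [start]
    if start == goal:
        yield list(path)
        return
    for nxt in neighbors[start]:
        if nxt not in path:
            path.append(nxt)
            yield from simple_paths(neighbors, nxt, goal, path)
            path.pop()


def is_active(path, edges, children, evidence):
    """Definition 2.4 applied to every interior node of the path."""
    for a, b, c in zip(path, path[1:], path[2:]):
        if (a, b) in edges and (c, b) in edges:          # v-structure a -> b <- c
            if b not in evidence and not (descendants(children, b) & evidence):
                return False
        elif b in evidence:                               # any other observed node blocks
            return False
    return True


def d_separated(edges, x, y, evidence):
    """True if no path between x and y is active given the evidence set."""
    evidence = set(evidence)
    nodes = {n for edge in edges for n in edge}
    children = {n: [b for a, b in edges if a == n] for n in nodes}
    neighbors = {n: [b for a, b in edges if a == n] + [a for a, b in edges if b == n]
                 for n in nodes}
    return not any(is_active(p, edges, children, evidence)
                   for p in simple_paths(neighbors, x, y))


# Edges of the network in figure 2.1(a), written as (parent, child).
EDGES = {("P", "I"), ("T", "I"), ("I", "X"), ("T", "S")}
print(d_separated(EDGES, "P", "T", set()))    # v-structure at I, nothing observed
print(d_separated(EDGES, "P", "T", {"X"}))    # a descendant of I is observed
print(d_separated(EDGES, "X", "S", {"T"}))    # T blocks the only path
```

Observing X activates the path between P and T because X is a descendant of the common effect I, matching the fourth case in the list of three-node paths.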
Finally, an important theorem which relates the independencies which hold in a
distribution to the factorization of a distribution is the following:
Theorem 2.6
Let G be a BN graph over a set of random variables X and let P be a joint
distribution over the same space. If all the local Markov properties associated with
G hold in P , then P factorizes according to G.
Theorem 2.7
Let G be a BN graph over a set of random variables X and let P be a joint
distribution over the same space. If P factorizes according to G, then all the local
Markov properties associated with G hold in P .
2.2.3
Markov Networks
The second common class of probabilistic graphical models is called a Markov network or a Markov random field. The models are based on undirected graphical
models. These models are useful in modeling a variety of phenomena where one
cannot naturally ascribe a directionality to the interaction between variables. Furthermore, the undirected models also offer a different and often simpler perspective
on directed models, both in terms of the independence structure and the inference
task.
A representation that implements this intuition is that of an undirected graph.
As in a Bayesian network, the nodes in the graph of a Markov network graph
H represent the variables, and the edges correspond to some notion of direct
probabilistic interaction between the neighboring variables.
The remaining question is how to parameterize this undirected graph. The graph
structure represents the qualitative properties of the distribution. To represent the
$$P_H(X_1, \ldots, X_n) = \frac{1}{Z} P'_H(X_1, \ldots, X_n),$$
where
$$P'_H(X_1, \ldots, X_n) = \pi_1[D_1] \times \pi_2[D_2] \times \cdots \times \pi_m[D_m]$$
and
$$Z = \sum_{X_1, \ldots, X_n} P'_H(X_1, \ldots, X_n).$$
The logarithmic representation ensures that the probability distribution is positive. Moreover, the logarithmic parameters can take any real value.
A subclass of Markov networks that arises in many contexts is that of pairwise
Markov networks, representing distributions where all of the factors are over single
variables or pairs of variables. More precisely, a pairwise Markov network over a
graph H is associated with a set of node potentials $\{\pi[X_i] : i = 1, \ldots, n\}$ and a set of
edge potentials $\{\pi[X_i, X_j] : (X_i, X_j) \in H\}$. The overall distribution is (as always)
the normalized product of all of the potentials (both node and edge). Pairwise
MRFs are attractive because of their simplicity, and because interactions on edges
are an important special case that often arises in practice.
Example 2.5
Figure 2.3(a) shows a simple Markov network. This toy example has random
variables describing the tuberculosis status of four patients. Patients that have been
in contact are linked by undirected edges. The edges indicate the possibilities for the
disease transmission. For example, Patient 1 has been in contact with Patient 2
and Patient 3, but has not been in contact with Patient 4. Figure 2.3(b) shows the
same Markov network, along with the node and edge potentials. We use P1, P2,
P3, and P4 for shorthand. In this case, all of the node and edge potentials are the
same, but this is not a requirement. The node potentials show that the patients
are much more likely to be uninfected. The edge potentials capture the intuition
that it is most likely for two people to have the same infection state: either both
infected, or both not. Furthermore, it is more likely that they are both not infected.
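A pairwise Markov network of this size can be evaluated by brute force, which makes the role of the normalizing constant explicit. The Python sketch below uses the contact structure of the example (edges between patients 1 and 2, 1 and 3, 2 and 4, 3 and 4). The potential values are illustrative assumptions chosen to match the qualitative description above, not a transcription of figure 2.3(b), and the function names are hypothetical.

```python
from itertools import product

# Patients are indexed 0..3; each variable is True if the patient is infected.
EDGES = [(0, 1), (0, 2), (1, 3), (2, 3)]


def node_potential(infected):
    # ASSUMED values: being uninfected is strongly favored.
    return 0.2 if infected else 100.0


def edge_potential(a, b):
    # ASSUMED values: agreement is favored, and "both uninfected" most of all.
    if a != b:
        return 0.5
    return 1.0 if a else 2.0


def unnormalized(assignment):
    """Product of all node and edge potentials for one joint assignment."""
    value = 1.0
    for infected in assignment:
        value *= node_potential(infected)
    for i, j in EDGES:
        value *= edge_potential(assignment[i], assignment[j])
    return value


assignments = list(product([False, True], repeat=4))
Z = sum(unnormalized(a) for a in assignments)   # partition function


def prob(assignment):
    """Normalized probability of a joint assignment."""
    return unnormalized(assignment) / Z


print(max(assignments, key=prob))   # most probable joint state
```

Summing over all 16 assignments is feasible only for tiny networks; the inference section discusses why this computation is hard in general.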
2.2.4
As in the case of Bayesian networks, the graph structure in a Markov network can
be viewed as encoding a set of independence assumptions. Intuitively, in Markov
networks, probabilistic influence flows along the undirected paths in the graph,
but is blocked if we condition on the intervening nodes. We can define two sets
of independence assumptions, the local Markov properties and the global Markov
properties.
Figure 2.3 (a) Markov network over TB Patient 1, TB Patient 2, TB Patient 3, and TB Patient 4. (b) The same network with node potentials over P1, P2, P3, and P4 and edge potentials over (P1, P2), (P1, P3), (P2, P4), and (P3, P4).
The local Markov properties are associated with each node in the graph and are
based on the intuition that we can block all influences on a node by conditioning
on its immediate neighbors.
Definition 2.10
Let H be an undirected graph. Then for each node $X \in \mathcal{X}$, the Markov blanket of
X, denoted $N_H(X)$, is the set of neighbors of X in the graph (those that share an
edge with X). We define the local Markov independencies associated with H to be
$$I_\ell(H) = \{(X \perp \mathcal{X} - \{X\} - N_H(X) \mid N_H(X)) : X \in \mathcal{X}\}.$$
In other words, the Markov assumptions state that X is independent of the rest of
the nodes in the graph given its immediate neighbors.
Example 2.6
The MN in figure 2.3(a) describes the following local Markov assumptions: $(P_1 \perp P_4 \mid \{P_2, P_3\})$,
$(P_2 \perp P_3 \mid \{P_1, P_4\})$, $(P_3 \perp P_2 \mid \{P_1, P_4\})$, $(P_4 \perp P_1 \mid \{P_2, P_3\})$.
To define the global Markov properties, we begin by defining active paths in
undirected graphs.
Definition 2.11
Let H be a Markov network structure, and $X_1 - \cdots - X_k$ be a path in H. Let
$E \subseteq \mathcal{X}$ be a set of observed variables. The path $X_1 - \cdots - X_k$ is active given E if
none of the $X_i$'s, $i = 1, \ldots, k$, is in E.
Using this notion, we can define a notion of separation in the undirected graph.
This is the analogue of d-separation; note how much simpler it is.
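Definition 2.12
We say that a set of nodes $\mathbf{Z}$ separates $\mathbf{X}$ and $\mathbf{Y}$ in H, denoted $\mathrm{sep}_H(\mathbf{X}; \mathbf{Y} \mid \mathbf{Z})$,
if there is no active path between any node $X \in \mathbf{X}$ and $Y \in \mathbf{Y}$ given $\mathbf{Z}$. We define
the global Markov assumptions associated with H to be
$$I(H) = \{(\mathbf{X} \perp \mathbf{Y} \mid \mathbf{Z}) : \mathrm{sep}_H(\mathbf{X}; \mathbf{Y} \mid \mathbf{Z})\}.$$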
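Separation in an undirected graph reduces to a reachability test: remove the observed nodes and check whether any node of one set can still reach the other. The Python sketch below is one way to write this, using breadth-first search on the four-patient network of figure 2.3(a); the function name and the adjacency-dictionary representation are assumptions made for the illustration.

```python
from collections import deque


def separated(adjacency, xs, ys, zs):
    """True if every path from a node in xs to a node in ys passes through zs."""
    blocked = set(zs)
    frontier = deque(x for x in xs if x not in blocked)
    seen = set(frontier)
    while frontier:
        node = frontier.popleft()
        if node in ys:
            return False          # found an active path
        for nb in adjacency[node]:
            if nb not in blocked and nb not in seen:
                seen.add(nb)
                frontier.append(nb)
    return True


# Contact structure of the four patients in figure 2.3(a).
ADJ = {"P1": ["P2", "P3"], "P2": ["P1", "P4"],
       "P3": ["P1", "P4"], "P4": ["P2", "P3"]}

print(separated(ADJ, {"P1"}, {"P4"}, {"P2", "P3"}))  # both routes blocked
print(separated(ADJ, {"P1"}, {"P4"}, {"P2"}))        # the route via P3 stays open
```

Unlike d-separation, no special case for v-structures is needed, which is the simplification the text points out.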
As in the case of Bayesian networks, we can make a connection between the local
Markov properties and the global Markov properties. The assumptions are in fact
equivalent, but only for positive distributions. (Informally, a distribution is positive
if every possible joint instantiation has probability > 0.)
We begin with the analogue to theorem 2.7, which asserts that a Gibbs distribution satisfies the global independencies associated with the graph.
Theorem 2.13
Let P be a distribution over X , and H a Markov network structure over X . If P is
a Gibbs distribution over H, then all the local Markov properties associated with
H hold in P .
The other direction, which goes from the global independence properties of a
distribution to its factorization, is known as the Hammersley-Clifford theorem.
Unlike for Bayesian networks, this direction does not hold in general. It only holds
under the additional assumption that P is a positive distribution.
Theorem 2.14
Let P be a positive distribution over X , and H a Markov network graph over X .
If all of the independence constraints implied by H hold in P , then P is a Gibbs
distribution over H.
This result shows that, for positive distributions, the global Markov property
implies that the distribution factorizes according to the network structure. Thus,
for this class of distributions, we have that a distribution P factorizes over a Markov
network H if and only if all of the independencies implied by H hold in P . The
positivity assumption is necessary for this result to hold.
2.3
Inference
Both directed and undirected graphical models represent a full joint probability
distribution over X . We describe some of the main query types one might expect
to answer with a joint distribution, and discuss the computational complexity of
answering such queries using a graphical model.
The most common query type is the standard conditional probability query,
P (Y | E = e). Such a query consists of two parts: the evidence, a subset E of
We assume that we are dealing with a set of factors F over a set of variables $\mathcal{X}$.
This set of factors defines a possibly unnormalized function
$$P_F(\mathcal{X}) = \prod_{\phi \in F} \phi. \qquad (2.2)$$
For a Bayesian network without evidence, the factors are simply the CPDs, and the
distribution $P_F$ is a normalized distribution. For a Bayesian network B with evidence E = e, the factors are the CPDs restricted to e, and $P_F(\mathcal{X}) = P_B(\mathcal{X}, e)$. For
a Markov network H (with or without evidence), the factors are the (restricted)
compatibility potentials, and $P_F$ is the unnormalized distribution $P'_H$ before dividing by the partition function. It is important to note, however, that most of
the operations that one can perform on a normalized distribution can also be performed on an unnormalized one. Thus, we can marginalize $P_F$ on a subset of the
variables by summing out the others. We can also consider a conditional probability
$P_F(X \mid Y) = P_F(X, Y)/P_F(Y)$. Thus, for the purposes of this section, we treat
$P_F$ as a distribution, ignoring the fact that it may not be normalized.
In the worst case, the complexity of probabilistic inference is unavoidable. Below,
we assume that the set of factors $\{\phi \in F\}$ of the graphical model defining the desired
distribution can be specified in a polynomial number of bits (in terms of the number
of variables).
Theorem 2.15
The following decision problems are NP-complete:
Given a distribution $P_F$ over $\mathcal{X}$, a variable $X \in \mathcal{X}$, and a value $x \in \mathrm{Val}(X)$,
decide whether $P_F(X = x) > 0$.
Given a distribution $P_F$ over $\mathcal{X}$ and a number $\tau$, decide whether there exists an
assignment x to $\mathcal{X}$ such that $P_F(x) > \tau$.
The following problem is #P-complete:
Given a distribution $P_F$ over $\mathcal{X}$, a variable $X \in \mathcal{X}$, and a value $x \in \mathrm{Val}(X)$,
compute $P_F(X = x)$.
These results seem like very bad news: every type of inference in graphical
models is NP-hard or harder. In fact, even the simple problem of computing
the distribution over a single binary variable is NP-hard. Assuming (as seems
increasingly likely) that the best computational performance we can achieve for
NP-hard problems is exponential in the worst case, there seems to be no hope for
efficient algorithms for even the simplest type of inference. However, as we discuss
below, the worst-case blowup can often be avoided. For all other models, we will
resort to approximate inference techniques. Note that the worst-case results for
approximate inference are also negative:
Theorem 2.16
The following problem is NP-hard for any $\epsilon \in (0, 1/2)$: Given a distribution $P_F$
over $\mathcal{X}$, a variable $X \in \mathcal{X}$, and a value $x \in \mathrm{Val}(X)$, find a number $\tau$, such that
$|P_F(X = x) - \tau| \leq \epsilon$.
Fortunately, many types of exact inference can be performed efficiently for a
very important class of graphical models (low treewidth) we define below. For a
large number of models, however, exact inference is intractable and we resort to
approximations. Broadly speaking, there are two major frameworks for probabilistic
inference: optimization-based and sampling-based. Exact inference algorithms have
been historically derived from the dynamic programming perspective, by carefully
avoiding repeated computations. We take a somewhat unconventional approach here
by presenting exact and approximate inference in a unified optimization framework.
We thus start out by considering approximate inference and then present conditions
under which it yields exact results.
2.3.1
Inference as Optimization
The methods that fall into an optimization framework are based on a simple
conceptual principle: define a target class of easy distributions $\mathcal{Q}$, and then search
for a particular instance Q within that class which is the best approximation to
$P_F$. Queries can then be answered using inference on Q rather than on $P_F$. The
specific algorithms that have been considered in the literature differ in many details.
However, most of them can be viewed as optimizing a target function for measuring
the quality of approximation.
Suppose that we want to approximate $P_F$ with another distribution Q. Intuitively,
we want to choose the approximation Q to be close to $P_F$. There are many
possible ways to measure the distance between two distributions, such as the
Euclidean distance ($L_2$), or the $L_1$ distance. Our main challenge, however, is that
our aim is to avoid performing inference with the distribution $P_F$; in particular, we
cannot effectively compute marginal distributions in $P_F$. Hence, we need methods
that allow us to optimize the distance (technically, divergence) between Q and
$P_F$ without answering hard queries in $P_F$. A priori, this requirement may seem
impossible to satisfy. However, it turns out that there exists a distance measure,
the relative entropy (or KL-divergence), that allows us to exploit the structure
of $P_F$ without performing reasoning with it.
Recall that the relative entropy between $P_1$ and $P_2$ is defined as
$$\mathbb{D}(P_1 \| P_2) = \mathbb{E}_{P_1}\left[\ln \frac{P_1(\mathcal{X})}{P_2(\mathcal{X})}\right].$$
The relative entropy is always non-negative, and equal to 0 if and
only if $P_1 = P_2$. Thus, we can use it as a distance measure, and choose to find an
approximation Q to $P_F$ that minimizes the relative entropy. However, the relative
entropy is not symmetric: $\mathbb{D}(P_1 \| P_2) \neq \mathbb{D}(P_2 \| P_1)$. A priori, it might appear that
$\mathbb{D}(P_F \| Q)$ is a more appropriate measure for approximate inference, as one of the
main information-theoretic justifications for relative entropy is the number of bits
lost when coding a true message distribution $P_F$ using an (approximate) estimate Q.
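The two properties stated above, non-negativity and asymmetry, are easy to observe numerically. The Python sketch below computes the relative entropy between two discrete distributions given as lists of probabilities; the function name and the example distributions are assumptions chosen for illustration, and the second argument is assumed to be positive wherever the first is.

```python
import math


def relative_entropy(p, q):
    """D(p || q) = sum_x p(x) ln(p(x) / q(x)), in nats.

    Terms with p(x) = 0 contribute nothing; q(x) must be positive
    wherever p(x) is positive.
    """
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)


# Hypothetical distributions over three outcomes.
p = [0.5, 0.4, 0.1]
q = [0.8, 0.1, 0.1]

print(relative_entropy(p, q))   # positive
print(relative_entropy(q, p))   # also positive, but a different value
print(relative_entropy(p, p))   # zero for identical distributions
```

Because the two directions give different values, the choice between minimizing D(P_F || Q) and D(Q || P_F) is a genuine modeling decision rather than a matter of notation.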
Before considering approximate inference methods, we illustrate the use of a variational approach to derive an exact inference procedure. The concepts we introduce
here will serve in discussion of the following approximate inference methods.
The goal of exact inference here will be to compute marginals of the distribution.
To achieve this goal, we will need to make sure that the set of distributions $\mathcal{Q}$ is
expressive enough to represent the target distribution $P_F$. Instead of approximating
$P_F$, the solution of the optimization problem transforms the representation of the
distribution from a product of factors into a more useful form Q that directly yields
the desired marginals.
Figure 2.4 (a) Chain-structured Bayesian network and equivalent Markov network. (b) Grid-structured Markov network (a 4 x 4 grid over the nodes $A_{1,1}$ through $A_{4,4}$).
To accomplish this, we will need to optimize over the set of distributions $\mathcal{Q}$ that
include $P_F$. Then, if we search over this set, we are guaranteed to find a distribution
Q for which $\mathbb{D}(Q \| P_F) = 0$, which is therefore the unique global optimum of our
energy functional. We will represent this set using an undirected graphical model
called the clique tree, for reasons that will be clear below.
Consider the undirected graph corresponding to the set of factors F. In this
graph, nodes are connected if they appear together in a factor. Note that if a factor
is the CPD of a directed graphical model, then the family will be a clique in the
graph, so its connectivity is denser than the original directed graph since parents
have been connected (moralized). The key property for exact inference in the graph
is chordality:
Definition 2.18
Let $X_1 - X_2 - \cdots - X_k - X_1$ be a loop in the graph; a chord in the loop is an edge
connecting $X_i$ and $X_j$ for two nonconsecutive nodes $X_i$, $X_j$. An undirected graph
H is said to be chordal if any loop $X_1 - X_2 - \cdots - X_k - X_1$ for $k \geq 4$ has a chord.
In other words, the longest minimal loop (one that has no shortcut) is a triangle.
Thus, chordal graphs are often also called triangulated.
The simplest (and most commonly used) chordal graphs are chain-structured
(see figure 2.4(a)). What if the graph is not chordal? For example, grid-structured
graphs are commonly used in computer vision for pixel-labeling problems (see figure 2.4(b)). To make a graph chordal (triangulate it), fill-in edges are added to
short-circuit loops. There are generally many ways to do this and finding the least
number of edges to fill is NP-hard. However, good heuristic algorithms for this
problem exist [12, 1].
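Chordality can be tested without enumerating loops. The Python sketch below relies on a standard characterization that is not stated in the text: a graph is chordal exactly when its nodes can be removed one at a time such that, at the moment of removal, the neighbors of the removed node are pairwise connected (such a node is called simplicial). The function name and the example graphs are assumptions made for the illustration.

```python
def is_chordal(adjacency):
    """Repeatedly delete a simplicial node; the graph is chordal iff this empties it."""
    adj = {v: set(nbs) for v, nbs in adjacency.items()}
    while adj:
        simplicial = next(
            (v for v, nbs in adj.items()
             if all(b in adj[a] for a in nbs for b in nbs if a != b)),
            None)
        if simplicial is None:
            return False          # every remaining node sits on a chordless loop
        for nb in adj.pop(simplicial):
            adj[nb].discard(simplicial)
    return True


chain = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
square = {"A": ["B", "D"], "B": ["A", "C"], "C": ["B", "D"], "D": ["A", "C"]}
# The same square with one fill-in edge between B and D.
triangulated = {"A": ["B", "D"], "B": ["A", "C", "D"],
                "C": ["B", "D"], "D": ["A", "B", "C"]}

print(is_chordal(chain), is_chordal(square), is_chordal(triangulated))
```

The square is the smallest grid-like loop; adding a single fill-in edge turns it into two triangles, which is the triangulation step described above.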
We now define a cluster graph, the backbone of the graphical data structure
needed to perform inference. Each node in the cluster graph is a cluster, which
is associated with a subset of variables; the graph contains undirected edges that
connect clusters whose scopes have some nonempty intersection.
Definition 2.19
A cluster graph K for a set of factors F over $\mathcal{X}$ is an undirected graph, each of
whose nodes i is associated with a subset $C_i \subseteq \mathcal{X}$. A cluster graph must be family-preserving: each factor $\phi \in F$ must be associated with a cluster $C_i$, denoted
$\alpha(\phi)$, such that $\mathrm{Scope}[\phi] \subseteq C_i$. Each edge between a pair of clusters $C_i$ and $C_j$ is
associated with a sepset $S_{i,j} = C_i \cap C_j$. A singly connected cluster graph (a tree)
is called a cluster tree.
Definition 2.20
Let T be a cluster tree over a set of factors F. We say that T has the running
intersection property if, whenever there is a variable X such that $X \in C_i$ and
$X \in C_j$, then X is also in every cluster in the (unique) path in T between $C_i$ and
$C_j$. A cluster tree that satisfies the running intersection property is called a clique
tree.
Theorem 2.21
Every chordal graph G has a clique tree T.
Constructing the clique tree from a chordal graph is actually relatively easy:
(1) find maximal cliques of the graph (this is easy in chordal graphs) and (2)
run a maximum spanning tree algorithm on the appropriate clique graph. More
specifically, we build an undirected graph whose nodes are the maximal cliques,
and where every pair of nodes $C_i$, $C_j$ is connected by an edge whose weight is
$|C_i \cap C_j|$.
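Step (2) of this construction can be sketched in a few lines of Python. The code below assumes the maximal cliques are already given, runs Kruskal's algorithm for a maximum-weight spanning tree with weights $|C_i \cap C_j|$, and then checks the running intersection property directly. The function names and the example cliques are hypothetical; the example corresponds to a small chordal graph over the variables A to F.

```python
from itertools import combinations


def clique_tree(cliques):
    """Maximum spanning tree over the clique graph (Kruskal with union-find)."""
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    candidates = sorted(
        ((len(cliques[i] & cliques[j]), i, j)
         for i, j in combinations(range(len(cliques)), 2)),
        reverse=True)
    tree = []
    for weight, i, j in candidates:
        if weight > 0 and find(i) != find(j):
            parent[find(i)] = find(j)
            tree.append((i, j))
    return tree


def has_running_intersection(cliques, tree):
    """For each variable, the clusters containing it must form a connected subtree."""
    for var in set().union(*cliques):
        holders = {i for i, c in enumerate(cliques) if var in c}
        start = next(iter(holders))
        seen, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for i, j in tree:
                for a, b in ((i, j), (j, i)):
                    if a == node and b in holders and b not in seen:
                        seen.add(b)
                        stack.append(b)
        if seen != holders:
            return False
    return True


# Hypothetical maximal cliques of a chordal graph.
CLIQUES = [frozenset("ABC"), frozenset("BCD"), frozenset("CDE"), frozenset("DF")]
TREE = clique_tree(CLIQUES)
print(TREE, has_running_intersection(CLIQUES, TREE))
```

A spanning tree chosen without regard to the weights can violate the property: connecting the cliques in the order {A,B,C}, {D,F}, {B,C,D}, {C,D,E} places a cluster without B on the path between the two clusters that contain B.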
Because of this correspondence, we can define a very important characteristic of
a graph, which is critical to the complexity of exact inference:
Definition 2.22
The treewidth of a chordal graph is the size of the largest clique minus 1. The
treewidth of an untriangulated graph is the minimum treewidth of all of its triangulations.
Note that the treewidth of a chain in figure 2.4(a) is 1 and the treewidth of the
grid in figure 2.4(b) is 4.
2.3.2.1
Suppose we are given a clique tree T for $P_F$. That is, T satisfies the running
intersection property and the family preservation property. Moreover, suppose we
are given a set of potentials $Q = \{\pi_i\} \cup \{\mu_{i,j} : (C_i - C_j) \in T\}$, where $C_i$ denotes
clusters in T, $S_{i,j}$ denote separators along edges in T, $\pi_i$ is a potential over $C_i$, and
$\mu_{i,j}$ is a potential over $S_{i,j}$. The set of potentials defines a distribution Q according
to T by the formula
$$Q(\mathcal{X}) = \frac{\prod_{C_i \in T} \pi_i}{\prod_{(C_i - C_j) \in T} \mu_{i,j}}. \qquad (2.4)$$
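$$\tilde{F}[P_F, Q] = \sum_{i} \mathbb{E}_{\pi_i}[\ln \pi_i^0] + \sum_{C_i \in T} \mathbb{H}_{\pi_i}(C_i) - \sum_{(C_i - C_j) \in T} \mathbb{H}_{\mu_{i,j}}(S_{i,j}), \qquad (2.5)$$
where $\pi_i^0 = \prod_{\phi : \alpha(\phi) = i} \phi$.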
Before we prove that the energy functional is equivalent to its factored form,
let us first understand its form. The first term is a sum of terms of the form $\mathbb{E}_{\pi_i}[\ln \pi_i^0]$.
Recall that $\pi_i^0$ is a factor (not necessarily a distribution) over the scope $C_i$, that is,
a function from $\mathrm{Val}(C_i)$ to $\mathbb{R}^+$. Its logarithm is therefore a function from $\mathrm{Val}(C_i)$
to $\mathbb{R}$. The clique potential $\pi_i$ is a distribution over $\mathrm{Val}(C_i)$. We can therefore
compute the expectation, $\sum_{c_i} \pi_i[c_i] \ln \pi_i^0[c_i]$. The last two terms are entropies of the
distributions (the potentials and messages) associated with the clusters and
sepsets in the tree.
Proposition 2.26
If Q is a set of calibrated potentials for T, and Q is defined by (2.4), then
$\tilde{F}[P_F, Q] = F[P_F, Q]$.
Using this form of the energy, we can now define the optimization problem. We first
need to define the space over which we are optimizing. If Q is factorized according
to T, we can represent it by a set of calibrated potentials. Calibration is essentially
a constraint on the potentials, as a clique tree is calibrated if neighboring potentials
agree on the marginal distribution on their joint subset. Thus, we pose the following
constrained optimization procedure:
CTree-Optimize
Find Q that maximize $\tilde{F}[P_F, Q]$ subject to
$$\sum_{C_i \setminus S_{i,j}} \pi_i = \mu_{i,j}, \quad \forall (C_i - C_j) \in T; \qquad (2.6)$$
$$\sum_{C_i} \pi_i = 1, \quad \forall C_i \in T. \qquad (2.7)$$
The constraints (2.6) and (2.7) ensure that the potentials in Q are calibrated and
represent legal distributions. It can be shown that the objective function is strictly
concave in the variables $\pi$, $\mu$. The constraints define a convex set (linear subspace),
so this optimization problem has a unique maximum. Since Q can represent $P_F$,
this maximum is attained when $\mathbb{D}(Q \| P_F) = 0$.
2.3.2.2
Fixed-Point Characterization
We can now prove that the stationary points of this constrained optimization
function (the points at which the gradient is orthogonal to all the constraints)
can be characterized by a set of self-consistent equations.
Recall that a stationary point of a function is either a local maximum, a local
minimum, or a saddle point. In this optimization problem, there is a single global
maximum. Although we do not show it here, we can show that it is also the
single stationary point. We can therefore define the global optimum declaratively,
as a set of equations, using standard methods based on Lagrange multipliers.
As we now show, this declarative formulation gives rise to a set of equations
which precisely corresponds to message-passing steps in the clique tree, a standard
inference procedure usually derived via dynamic programming.
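Theorem 2.27
A set of potentials Q is a stationary point of CTree-Optimize if and only if there
exists a set of factors $\{\delta_{i \to j}[S_{i,j}] : (C_i - C_j) \in T\}$ such that
$$\delta_{i \to j} \propto \sum_{C_i - S_{i,j}} \pi_i^0 \prod_{k \in N_{C_i} - \{j\}} \delta_{k \to i} \qquad (2.8)$$
$$\pi_i \propto \pi_i^0 \prod_{j \in N_{C_i}} \delta_{j \to i} \qquad (2.9)$$
$$\mu_{i,j} = \delta_{j \to i}\, \delta_{i \to j}, \qquad (2.10)$$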
[Figure 2.5(a), K3: clusters 1: {A, B, C}, 2: {B, C, D}, 3: {B, D, F}, 4: {B, E}, 5: {D, E}, and univariate clusters 6: A, 7: B, 8: C, 9: D, 10: E, 11: F. Figure 2.5(b), K4: the same clusters plus 12: {B, C}.]
Figure 2.5 Two additional examples of generalized cluster graphs for a Markov
network with potentials over {A, B, C}, {B, C, D}, {B, D, F}, {B, E}, and {D, E}. (a)
Bethe factorization. (b) Capturing interactions between {A, B, C} and {B, C, D}.
between the cluster $C_{(i,j)}$ that corresponds to the edge $X_i - X_j$ and the clusters
$C_i$ and $C_j$ that correspond to the univariate factors over $X_i$ and $X_j$.
As there is a direct correspondence between the clusters in the cluster graphs and
variables or edges in the original Markov network, it is often convenient to think
of the propagation steps as operations on the original network. Moreover, as each
pairwise cluster has only two neighbors, we consider two propagation steps along
the path $C_i - C_{(i,j)} - C_j$ as propagating information between $X_i$ and $X_j$. Indeed,
early versions of generalized belief propagation were stated in these terms. This
algorithm is known as loopy belief propagation, as it uses propagation steps used by
algorithms for Markov trees, except that it was applied to networks with loops.
A natural question is how to extend this method to networks that are more
complex than pairwise Markov networks. Once we have larger potentials, they may
overlap in ways that result in complex interactions among them.
One simple construction creates a bipartite graph. The first layer consists of
large clusters, with one cluster for each factor $\phi$ in F, whose scope is Scope[$\phi$].
These clusters ensure that we satisfy the family-preservation property. The second
layer consists of small univariate clusters, one for each random variable. Finally,
we place an edge between each univariate cluster X on the second layer and each
cluster in the first layer that includes X; the scope of this edge is X itself. For a
concrete example, see figure 2.5(a).
We can easily verify that this is a proper cluster graph. First, by construction it
satisfies the family-preserving property. Second, the edges that mention a variable
X form a star-shaped subgraph with edges from the univariate cluster with scope
X to all the large clusters that contain X. We will call this construction the Bethe
approximation (for reasons that will be clarified below). The construction of this
cluster graph is simple and can easily be automated.
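As an illustration of how this construction might be automated, the following is a minimal Python sketch. The function name `bethe_cluster_graph`, the integer cluster ids, and the data layout are assumptions made here for illustration; they are not part of the text. The example input uses the factor scopes of figure 2.5.

```python
def bethe_cluster_graph(factor_scopes):
    """Build the two-layer (Bethe) cluster graph for a list of factor scopes.

    Returns (clusters, edges): clusters maps a cluster id to its scope
    (a frozenset of variable names); edges is a list of
    (factor_cluster_id, variable_cluster_id, sepset) triples.
    """
    clusters = {}
    # First layer: one large cluster per factor, carrying the factor's scope.
    for idx, scope in enumerate(factor_scopes, start=1):
        clusters[idx] = frozenset(scope)
    # Second layer: one univariate cluster per random variable.
    variables = sorted(set().union(*factor_scopes))
    var_cluster = {}
    for idx, var in enumerate(variables, start=len(factor_scopes) + 1):
        clusters[idx] = frozenset([var])
        var_cluster[var] = idx
    # Edges: link each univariate cluster to every factor cluster containing
    # its variable; the scope (sepset) of the edge is the variable itself.
    edges = []
    for idx, scope in enumerate(factor_scopes, start=1):
        for var in sorted(scope):
            edges.append((idx, var_cluster[var], frozenset([var])))
    return clusters, edges


# Factor scopes of the Markov network used in figure 2.5.
scopes = [{"A", "B", "C"}, {"B", "C", "D"}, {"B", "D", "F"}, {"B", "E"}, {"D", "E"}]
clusters, edges = bethe_cluster_graph(scopes)
```

With this numbering, the factor clusters receive ids 1 through 5 and the univariate clusters ids 6 through 11, and the edges that mention any one variable form a star around its univariate cluster.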
So far, our discussion of belief propagation has been entirely procedural, and motivated purely by similarity to message-passing algorithms for cluster trees. Is there
any formal justification for this approach? Is there a sense in which we can view
this algorithm as providing an approximation to the exact inference task? In this
section, we show that belief propagation can be justified using the energy function
formulation. Specifically, the messages passed by generalized belief propagation can
be derived from fixed-point equations for the stationary points of an approximate
version of the energy functional of (2.3). As we shall see, this formulation provides
significant insight into the generalized belief propagation algorithm. It allows us to
better understand the convergence properties of generalized belief propagation, and
to characterize its convergence points. It also suggests generalizations of the algorithm which have better convergence properties, or that optimize a more accurate
approximation to the energy functional.
Our construction will be similar to the one in section 2.3.2 for exact inference.
However, there are some differences. As we saw, the calibrated cluster graph
maintains the information in PF . However, the resulting cluster potentials are
not, in general, the marginals of PF . In fact, these cluster potentials may not
represent the marginals of any single coherent joint distribution over X . Thus, we
can think of generalized belief propagation as constructing a set of pseudo-marginal
distributions, each one over the variables in one cluster. These pseudo-marginals are
calibrated, and therefore locally consistent with each other, but are not necessarily
marginals of a single underlying joint distribution.
The energy functional $F[P_F, Q]$ has terms involving the entropy of an entire joint
distribution; thus, it cannot be used to evaluate the quality of an approximation
defined in terms of (possibly incoherent) pseudo-marginals. However, the factored
free energy functional $\tilde{F}[P_F, Q]$ is defined in terms of entropies of clusters and
messages, and is therefore well-defined for pseudo-marginals Q. Thus, we can write
down an optimization problem as before:
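CGraph-Optimize
Find Q
that maximize $\tilde{F}[P_F, Q]$
subject to
\[
\sum_{C_i \setminus S_{i,j}} \pi_i = \mu_{i,j}, \quad \forall (C_i - C_j) \in T; \tag{2.11}
\]
\[
\sum_{C_i} \pi_i = 1, \quad \forall C_i \in T. \tag{2.12}
\]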
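Theorem 2.28
A set of potentials Q is a stationary point of CGraph-Optimize if and only if for
every edge $(C_i - C_j) \in T$ there exist factors $\delta_{i \to j}[S_{i,j}]$ and $\delta_{j \to i}[S_{i,j}]$ such that
\[
\delta_{i \to j} \propto \sum_{C_i - S_{i,j}} \pi_i^0 \prod_{k \in N_{C_i} - \{j\}} \delta_{k \to i}, \tag{2.13}
\]
\[
\pi_i \propto \pi_i^0 \prod_{j \in N_{C_i}} \delta_{j \to i}, \tag{2.14}
\]
\[
\mu_{i,j} = \delta_{j \to i}\, \delta_{i \to j}. \tag{2.15}
\]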
This theorem shows that we can characterize convergence points of the energy
function in terms of the original potentials and messages between clusters. We
can, once again, define a procedural variant, in which we initialize $\delta_{i \to j}$, and then
iteratively use (2.13) to redefine each $\delta_{i \to j}$ in terms of the current values of other
$\delta_{k \to i}$. Theorem 2.28 shows that convergence points of this procedure are related to
stationary points of $\tilde{F}[P_F, Q]$.
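The procedural variant can be sketched for the special case of a pairwise Markov network, where the two propagation steps through each pairwise cluster are collapsed into a single variable-to-variable message. This is a minimal illustration under that assumption, not the general cluster-graph algorithm; the names `loopy_bp` and `exact_marginals`, the synchronous update schedule, and the toy potentials are all choices made here. On a tree-structured network, such as the chain below, the beliefs at convergence coincide with the exact marginals, which the brute-force routine is used to check.

```python
from itertools import product


def loopy_bp(node_pot, edge_pot, iters=50):
    """Synchronous sum-product message passing on a pairwise Markov network.

    node_pot: {var: [potential for each value]}
    edge_pot: {(a, b): table} with table[x_a][x_b]
    Returns normalized beliefs {var: [belief for each value]}.
    """
    neighbors = {v: [] for v in node_pot}
    msg = {}  # msg[(src, dst)][x_dst], initialized to all ones
    for a, b in edge_pot:
        neighbors[a].append(b)
        neighbors[b].append(a)
        msg[(a, b)] = [1.0] * len(node_pot[b])
        msg[(b, a)] = [1.0] * len(node_pot[a])

    def pairwise(src, dst, xs, xd):
        if (src, dst) in edge_pot:
            return edge_pot[(src, dst)][xs][xd]
        return edge_pot[(dst, src)][xd][xs]

    for _ in range(iters):
        new_msg = {}
        for src, dst in msg:
            out = []
            for xd in range(len(node_pot[dst])):
                total = 0.0
                for xs in range(len(node_pot[src])):
                    # Initial potentials times all incoming messages except dst's.
                    term = node_pot[src][xs] * pairwise(src, dst, xs, xd)
                    for k in neighbors[src]:
                        if k != dst:
                            term *= msg[(k, src)][xs]
                    total += term
                out.append(total)
            z = sum(out)
            new_msg[(src, dst)] = [m / z for m in out]  # normalize for stability
        msg = new_msg

    beliefs = {}
    for v in node_pot:
        b = list(node_pot[v])
        for k in neighbors[v]:
            b = [bi * mi for bi, mi in zip(b, msg[(k, v)])]
        z = sum(b)
        beliefs[v] = [bi / z for bi in b]
    return beliefs


def exact_marginals(node_pot, edge_pot):
    """Brute-force marginals by enumerating every joint assignment."""
    names = list(node_pot)
    marg = {v: [0.0] * len(node_pot[v]) for v in names}
    for assignment in product(*(range(len(node_pot[v])) for v in names)):
        x = dict(zip(names, assignment))
        weight = 1.0
        for v in names:
            weight *= node_pot[v][x[v]]
        for (a, b), table in edge_pot.items():
            weight *= table[x[a]][x[b]]
        for v in names:
            marg[v][x[v]] += weight
    return {v: [w / sum(ws) for w in ws] for v, ws in marg.items()}


# Hypothetical chain A - B - C with strictly positive potentials.
node_pot = {"A": [1.0, 2.0], "B": [1.0, 1.0], "C": [3.0, 1.0]}
edge_pot = {("A", "B"): [[4.0, 1.0], [1.0, 4.0]], ("B", "C"): [[2.0, 1.0], [1.0, 2.0]]}
beliefs = loopy_bp(node_pot, edge_pot)
exact = exact_marginals(node_pot, edge_pot)
```

On networks with loops the same updates can be run unchanged, but the resulting beliefs are then only pseudo-marginals, and convergence is not guaranteed.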
It is relatively easy to verify that $\tilde{F}[P_F, Q]$ is bounded from above, and thus
this function must have a maximum. There are two cases: the maximum is either
an interior point or a boundary point (some of the probabilities in Q are 0). In the
former case the maximum is also a stationary point, which implies that it satisfies
the condition of theorem 2.28. In the latter case, the maximum is not necessarily
a stationary point. This situation, however, is very rare in practice, and can be
guaranteed not to arise if we make some fairly benign assumptions.
It is important to understand what these results imply, and what they do not.
The results imply only that the convergence points of generalized belief propagation
are stationary points of the free energy function. They do not imply that we can
reach these convergence points by applying belief propagation steps. In fact, there
is no guarantee that the message-passing steps of generalized belief propagation
necessarily improve the free energy objective: a message-passing step may increase
or decrease the energy functional. (In fact, if generalized belief propagation were
guaranteed to monotonically improve the functional, then it would necessarily
always converge.)
What are the implications of this result? First, it provides us with a declarative
semantics for generalized belief propagation in terms of optimization of a target
functional. This declarative semantics opens the way to investigate other computational approaches for optimizing the same functional. We discuss some of these
approaches below.
This result also allows us to understand what properties are important for this
type of approximation, and subsequently to design other approximations that may
be more accurate, or better in some other way. As a concrete example, recall that,
in our discussion of generalized cluster graphs, we required the running intersection
property. This property has two important implications. First, that the set of
clusters that contain some variable X are connected; hence, the marginal over X
will be the same in all of these clusters at the calibration point. Second, that there
is no cycle of clusters and sepsets all of which contain X. We can motivate this
assumption intuitively, by noting that it prevents us from allowing information
about X to cycle endlessly through a loop. The free energy function analysis
provides a more formal justification. To understand it, consider first the form of the
factored free energy functional when our cluster graph K has the form of the Bethe
approximation. Recall that in the Bethe approximation graph there are two layers:
one consisting of clusters that correspond to factors in F, and the other consisting
of univariate clusters. When the cluster graph is calibrated, these univariate clusters
have the same distribution as the separators between them and the factors in the
first layer. As such, we can combine together the entropy terms for all the separators
labeled by X and the associated univariate cluster and rewrite the free energy, as
follows:
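Proposition 2.29
If $Q = \{\pi_\phi : \phi \in F\} \cup \{\pi_i(X_i)\}$ is a calibrated set of potentials for K for a Bethe
approximation cluster graph with clusters $\{C_\phi : \phi \in F\} \cup \{X_i : X_i \in \mathcal{X}\}$, then
\[
\tilde{F}[P_F, Q] = \sum_{\phi \in F} \mathbb{E}_{\pi_\phi}[\ln \phi] + \sum_{\phi \in F} \mathbb{H}_{\pi_\phi}(C_\phi) - \sum_i (d_i - 1)\, \mathbb{H}_{\pi_i}(X_i), \tag{2.16}
\]
where $d_i$ is the number of factors in F whose scope contains $X_i$.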
As we discussed above, another approach to dealing with the worst-case combinatorial explosion of exact inference in graphical models is via sampling-based methods.
In these methods, we approximate the joint distribution as a set of instantiations
to all or some of the variables in the network. These instantiations, often called
samples, represent part of the probability mass.
The general framework for most of the discussion is as follows. Consider some
distribution $P(\mathcal{X})$, and assume we want to estimate the probability of some event
Y = y relative to P, for some $Y \subseteq \mathcal{X}$ and $y \in \mathrm{Val}(Y)$. More generally, we might
want to estimate the expectation of some function $f(\mathcal{X})$ relative to P; this task
is a generalization, as we can choose $f(\xi) = 1\{\xi\langle Y \rangle = y\}$. We approximate this
expectation by generating a set of M samples, estimating the value of the function
or its expectation relative to each of the generated samples, and then aggregating
the results.
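The aggregation step can be sketched as a plain sample average. The snippet below is a minimal illustration, assuming that independent samples can be drawn directly from P; the toy joint distribution, the helper names, and the sample size are hypothetical choices made here.

```python
import random


def estimate_expectation(sampler, f, num_samples, rng):
    """Average f over num_samples independent draws from sampler(rng)."""
    total = 0.0
    for _ in range(num_samples):
        total += f(sampler(rng))
    return total / num_samples


# Hypothetical joint distribution over two binary variables (A, B).
JOINT = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
STATES = list(JOINT)
WEIGHTS = [JOINT[s] for s in STATES]


def sample_joint(rng):
    """Draw one full instantiation (a, b) from the toy joint distribution."""
    return rng.choices(STATES, weights=WEIGHTS, k=1)[0]


def indicator_b1(sample):
    """The indicator function for the event B = 1."""
    return 1.0 if sample[1] == 1 else 0.0


rng = random.Random(0)
# The true probability of B = 1 under JOINT is 0.1 + 0.3 = 0.4.
p_b1 = estimate_expectation(sample_joint, indicator_b1, 20000, rng)
```

Replacing the indicator with any other function of the sample estimates the corresponding expectation in the same way.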
2.3.4.1
Definition 2.30
A Markov chain is defined via a state space Val(X) and a transition probability
model, which defines, for every state $x \in \mathrm{Val}(X)$, a next-state distribution over
Val(X). The transition probability of going from x to x' is denoted $T(x \to x')$.
This transition probability applies whenever the chain is in state x.
We note that, in this definition and in the subsequent discussion, we restrict
attention to homogeneous Markov chains, where the system dynamics do not change
over time.
We can imagine a random sampling process that defines a sequence of states
$x^{(0)}, x^{(1)}, x^{(2)}, \ldots$. As the transition model is random, the state of the process at
step t can be viewed as a random variable $X^{(t)}$. We assume that the initial state
$X^{(0)}$ is distributed according to some initial state distribution $P^{(0)}(X^{(0)})$. We can
now define distributions over the subsequent states $P^{(1)}(X^{(1)}), P^{(2)}(X^{(2)}), \ldots$ using
the chain dynamics:
\[
P^{(t+1)}(X^{(t+1)} = x') = \sum_{x \in \mathrm{Val}(X)} P^{(t)}(X^{(t)} = x)\, T(x \to x'). \tag{2.17}
\]
Intuitively, the probability of being at state x' at time t + 1 is the sum over all
possible states x that the chain could have been in at time t of the probability of
being in state x times the probability that the chain took a transition from x to
x'.
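The update in (2.17) is a vector-matrix product, which can be sketched in a few lines. The three-state transition model below is a hypothetical example chosen for illustration, and the function name `step` is an assumption.

```python
def step(dist, T):
    """One application of (2.17): new[x'] = sum over x of dist[x] * T[x][x']."""
    n = len(dist)
    return [sum(dist[x] * T[x][x_next] for x in range(n)) for x_next in range(n)]


# Hypothetical 3-state transition model; row x holds T(x -> x') and sums to 1.
T = [
    [0.25, 0.00, 0.75],
    [0.00, 0.70, 0.30],
    [0.50, 0.50, 0.00],
]

dist = [1.0, 0.0, 0.0]  # P^(0): the chain starts in state 0 with certainty
for _ in range(200):
    dist = step(dist, T)
```

For this particular model, repeated application settles at the distribution (0.2, 0.5, 0.3), which is left unchanged by a further step.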
As the process converges, we would expect $P^{(t+1)}$ to be close to $P^{(t)}$. Using (2.17),
we obtain
\[
P^{(t)}(x') \approx P^{(t+1)}(x') = \sum_{x \in \mathrm{Val}(X)} P^{(t)}(x)\, T(x \to x').
\]
At convergence, we would expect the resulting distribution $\pi(X)$ to be an equilibrium relative to the transition model; i.e., the probability of being in a state
is the same as the probability of transitioning into it from a randomly sampled
predecessor. Formally:
Definition 2.31
A distribution $\pi(X)$ is a stationary distribution for a Markov chain T if it satisfies
\[
\pi(X = x') = \sum_{x \in \mathrm{Val}(X)} \pi(X = x)\, T(x \to x'). \tag{2.18}
\]
Definition 2.32
A Markov chain is said to be regular if there exists some number k such that, for
every $x, x' \in \mathrm{Val}(X)$, the probability of getting from x to x' in exactly k steps is
greater than 0.
The following result can be shown to hold:
Theorem 2.33
If a finite-state Markov chain T is regular, then it has a unique stationary distribution.
Ensuring regularity is usually straightforward. Two simple conditions that together guarantee regularity in finite-state Markov chains are:
(1) It is possible to get from any state to any state using a positive probability path
in the state graph.
(2) For each state x, there is a positive probability of transitioning from x to x in
one step (a self-loop).
These two conditions together are sufficient but not necessary to guarantee regularity. However, they often hold in the chains used in practice.
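Regularity in the sense of definition 2.32 can be checked directly for a small chain by tracking which k-step transitions have positive probability. The sketch below is one way to do so; the function name and the stopping rule are choices made here. The stopping rule relies on a standard bound (due to Wielandt): if some power of an n-state transition matrix is strictly positive, then so is the power (n - 1)^2 + 1.

```python
def regularity_exponent(T, max_power=None):
    """Return the smallest k such that every k-step transition has positive
    probability, or None if no such k exists within the search limit."""
    n = len(T)
    # Only the pattern of positive entries matters for regularity.
    base = [[T[i][j] > 0 for j in range(n)] for i in range(n)]
    reach = [row[:] for row in base]  # positive entries of the k-th power, k = 1
    limit = max_power or (n - 1) ** 2 + 1
    for k in range(1, limit + 1):
        if all(all(row) for row in reach):
            return k
        # Advance to the (k + 1)-step positivity pattern.
        reach = [
            [any(reach[i][m] and base[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)
        ]
    return None


# Hypothetical examples: a regular 3-state chain and a deterministic 2-cycle.
regular_chain = [
    [0.25, 0.00, 0.75],
    [0.00, 0.70, 0.30],
    [0.50, 0.50, 0.00],
]
two_cycle = [[0.0, 1.0], [1.0, 0.0]]

k_regular = regularity_exponent(regular_chain)
k_cycle = regularity_exponent(two_cycle)
```

The two-cycle shows why the self-loop condition matters: every state is reachable from every other, but only in an odd or an even number of steps, so no single k works for all pairs.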
2.3.4.2
The theory of Markov chains provides a general framework for generating samples
from a target distribution $\pi$. In this section, we discuss the application of this
framework to the sampling tasks encountered in probabilistic graphical models. In
this case, we typically wish to generate samples from the posterior distribution
P(X | E = e). Thus, we wish to define a chain for which P(X | e) is the stationary
distribution. Clearly, there are many ways of defining such a chain. We focus on
the most common approaches.
In graphical models, we define the states of the Markov chain to be instantiations
$\xi$ to $\mathcal{X}$ that are compatible with e; i.e., all of the states $\xi$ in the Markov chain
satisfy $\xi\langle E \rangle = e$. The states in our Markov chain are therefore some subset of
the possible assignments to the variables $\mathcal{X}$. In order to define a Markov chain, we
need to define a process that transitions from one state to the other, converging to
a stationary distribution $\pi(\xi)$ which is the desired posterior distribution $P(\xi \mid e)$.
In the case of graphical models, our state space has a factorized structure:
each state is an assignment to several variables. When defining a transition model
over this state space, we can consider a fully general case, where a transition can
go from any state to any state. However, it is often convenient to decompose the
transition model, considering transitions that only update a single component of
the state vector at a time, i.e., only a value for a single variable. In this case,
as in several other settings, we often define a set of transition models $T_1, \ldots, T_k$,
each with its own dynamics. In certain cases, the different transition models are
necessary, because no single transition model on its own suffices to ensure regularity.
In other cases, having multiple transition models simply makes the state space more
connected, and therefore speeds the convergence to a stationary distribution.
There are several ways of combining these multiple transition models into a single
chain. One common approach is simply to randomly select between them at each
step, using any distribution. Thus, for example, at each step, we might select one
of $T_1, \ldots, T_k$, each with probability 1/k. Alternatively, we can simply cycle over the
different transition models, taking each one in turn. Clearly, this approach does not
define a homogeneous chain, as the transition model used in step i is different from
the one used in step i + 1. However, we can simply view the process as defining a
single transition model T each of whose steps is an aggregate step, consisting of
first taking $T_1$, then $T_2$, ..., through $T_k$.
In the case of graphical models, we define $X = \mathcal{X} - E = \{X_1, \ldots, X_k\}$. We
define a multiple transition chain, where we have a local transition model $T_i$ for
each variable $X_i \in X$. Let $U_i = X - \{X_i\}$, and let $u_i$ denote an instantiation
to $U_i$. The model $T_i$ takes a state $(u_i, x_i)$ and transitions to a state of the form
$(u_i, x_i')$. As we discussed above, we can combine the different local transition models
into a single global model in various ways.
2.3.4.3
Gibbs Sampling
Gibbs sampling is one simple yet effective Markov chain for factored state spaces,
which is particularly efficient for graphical models. We define the local transition
model $T_i$ as follows. Intuitively, we simply forget the value of $X_i$ in the current
state, and sample a new value for $X_i$ from its posterior given the rest of the current
state. More precisely, let $(u_i, x_i)$ be a state in the chain. We define
\[
T((u_i, x_i) \to (u_i, x_i')) = P(x_i' \mid u_i). \tag{2.19}
\]
Note that the transition probability does not depend on the current value $x_i$ of $X_i$,
but only on the remaining state $u_i$.
The Gibbs chain is defined via a set of local transition models; we use the
multistep transition model to combine them. Note that the different local transitions
are taken consecutively; i.e., having changed the value for a variable X1 , the value
for X2 is sampled based on the new value. Also note that we are only collecting a
single sample for every sequence where each local transition has been taken once.
This chain is guaranteed to be regular whenever the distribution is positive,
so that every value of $X_i$ has positive probability given an assignment $u_i$ to the
remaining variables. In this case, we can get from any state to any state in at most
k local transition steps, where $k = |\mathcal{X} - E|$. Positivity is, however, not necessary;
there are many examples of nonpositive distributions where the Gibbs chain is
regular. It is also easy to show that the posterior distribution P(X | e) is a
stationary distribution of this process.
Gibbs sampling is particularly well suited to many graphical models, where we
can compute the transition probability $P(X_i \mid u_i)$ very efficiently. In particular, as
we now show, this distribution can be computed based only on the Markov blanket of $X_i$.
We show this analysis for a Markov network; the extension to Bayesian networks is
straightforward. In general, we can decompose the probability of an instantiation
as follows:
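\[
P(x_1, \ldots, x_n) = \frac{1}{Z} \prod_j \phi_j[C_j] = \frac{1}{Z} \prod_{j : X_i \in C_j} \phi_j[C_j] \prod_{j : X_i \notin C_j} \phi_j[C_j].
\]
For shorthand, let $\phi_j[x_i, u_i]$ denote $\phi_j[x_i, u_i\langle C_j \rangle]$. We can now compute
\[
P(x_i' \mid u_i) = \frac{P(x_i', u_i)}{\sum_{x_i''} P(x_i'', u_i)} = \frac{\prod_{C_j \ni X_i} \phi_j[x_i', u_i]}{\sum_{x_i''} \prod_{C_j \ni X_i} \phi_j[x_i'', u_i]}. \tag{2.20}
\]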
This last expression uses only the clique potentials involving $X_i$, and depends only
on the instantiation in $u_i$ of $X_i$'s Markov blanket. In the case of Bayesian networks,
this expression reduces to a formula involving only the CPDs of $X_i$ and its children,
and its value, again, depends only on the assignment in $u_i$ to the Markov blanket
of $X_i$. It can thus be computed very efficiently.
We note that the Markov chain defined by a graphical model is not necessarily
regular, and might not converge to a unique stationary distribution. It turns out
that this type of situation can only arise if the distribution defined by the graphical
model is nonpositive, i.e., if the CPDs or clique potentials have entries with the
value 0.
Theorem 2.34
Let H be a Markov network such that all of the clique potentials are strictly positive.
Then the Gibbs-sampling Markov chain is regular.
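The Gibbs transition for a Markov network can be sketched as follows. This is a minimal illustration, not a tuned implementation: the representation of factors as (scope, table) pairs, the function names, the toy potentials, and the numbers of burn-in and sampling sweeps are all assumptions made here. The conditional is computed from only those factors whose scope contains the variable being resampled, as in (2.20).

```python
import random


def gibbs_conditional(var, state, factors, domain):
    """P(var | all other variables), using only factors that mention var."""
    weights = []
    for value in domain[var]:
        trial = dict(state)
        trial[var] = value
        w = 1.0
        for scope, table in factors:
            if var in scope:  # factors not mentioning var cancel in the ratio
                w *= table[tuple(trial[v] for v in scope)]
        weights.append(w)
    z = sum(weights)
    return [w / z for w in weights]


def gibbs_sweep(state, factors, domain, rng):
    """Apply the local transitions in turn, each seeing the newest values."""
    for var in domain:
        probs = gibbs_conditional(var, state, factors, domain)
        state[var] = rng.choices(domain[var], weights=probs, k=1)[0]
    return state


# Hypothetical chain-structured network A - B - C with strictly positive potentials.
domain = {"A": [0, 1], "B": [0, 1], "C": [0, 1]}
factors = [
    (("A", "B"), {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}),
    (("B", "C"), {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0}),
]

rng = random.Random(0)
state = {"A": 0, "B": 0, "C": 0}
for _ in range(500):              # burn-in sweeps, discarded
    gibbs_sweep(state, factors, domain, rng)
hits, num_samples = 0, 20000
for _ in range(num_samples):      # one sample per full sweep
    gibbs_sweep(state, factors, domain, rng)
    hits += state["A"]
p_a1 = hits / num_samples         # the exact value for this toy network is 18/32
```

Because all potentials here are strictly positive, theorem 2.34 applies and the chain is regular. In this toy network the conditional for A depends only on B, its sole neighbor.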
2.3.4.4
Generating Samples
The burn-in time for a large Markov chain is often quite large. Thus, the naive
algorithm described above has to execute a large number of sampling steps for
every usable sample. However, a key observation is that, if $x^{(t)}$ is sampled from $\pi$,
then $x^{(t+1)}$ is also sampled from $\pi$. Thus, once we have run the chain long enough
that we are sampling from the stationary distribution (or a distribution close to it),
we can continue generating samples from the same trajectory, and obtain a large
number of samples from the stationary distribution.
More formally, assume that we use $x^{(0)}, \ldots, x^{(T)}$ as our burn-in phase, and then
collect M samples $x^{(T+1)}, \ldots, x^{(T+M)}$. Thus, we have collected a data set D where
$x_m = x^{(T+m)}$, for m = 1, ..., M. Assume, for simplicity, that $x^{(T+1)}$ is sampled
from $\pi$, and hence so are all of the samples in D. It follows that for any function
f, $\frac{1}{M} \sum_{m=1}^{M} f(x_m)$ is an unbiased estimator for $\mathbb{E}_{\pi(X)}[f(X)]$.
The key problem, of course, is that consecutive samples from the same trajectory
are correlated. Thus, we cannot expect the same performance as we would from
M independent samples from $\pi$. In other words, the variance of the estimator is
significantly higher than that of an estimator generated by M independent samples
from $\pi$, as discussed above.
One solution to this problem is not to collect consecutive samples from the chain.
Rather, having collected a sample $x^{(T)}$, we let the chain run for a while, and collect
a second sample $x^{(T+d)}$ for some appropriate choice of d. For d large enough, $x^{(T)}$
and $x^{(T+d)}$ are only slightly correlated, and we can view them as independent
samples from $\pi$. However, the time d required for forgetting the correlation is
clearly related to the mixing time of the chain. Thus, chains that are slow to mix
initially also require larger d in order to produce close-to-independent samples.
Nevertheless, the samples do come from the correct distribution for any value of d,
and hence it is often better to compromise and use a shorter d than it is to use a
shorter burn-in time T. This method thus allows us to collect a larger number of
usable samples with fewer transitions of the Markov chain.
In fact, we can often make even better use of the samples generated using this
single-chain approach. Although the samples between $x^{(T)}$ and $x^{(T+d)}$ are not
independent samples, there is no reason to discard them. That is, using all of the
samples $x^{(T)}, x^{(T+1)}, \ldots, x^{(T+d)}$ produces a provably better estimator than using
just the two samples $x^{(T)}$ and $x^{(T+d)}$: our variance is always no higher if we use all
of the samples we generated rather than a subset. Thus, the strategy of picking only
a subset of the samples is useful primarily in settings where there is a significant
cost associated with using each sample (e.g., the evaluation of f is costly), so that
we might want to reduce the overall number of samples used.
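The two ways of using a single trajectory can be compared on a small example. The sketch below assumes a hypothetical two-state chain whose stationary distribution is (2/3, 1/3); the function names, the burn-in length, and the spacing d = 10 are illustrative choices.

```python
import random


def run_chain(T, x0, num_steps, rng):
    """Simulate a finite-state chain; T[x] is the next-state distribution from x."""
    states = list(range(len(T)))
    x, path = x0, []
    for _ in range(num_steps):
        x = rng.choices(states, weights=T[x], k=1)[0]
        path.append(x)
    return path


def mcmc_estimate(path, f, burn_in, thin=1):
    """Average f over the post-burn-in samples, keeping every thin-th one."""
    kept = path[burn_in::thin]
    return sum(f(x) for x in kept) / len(kept)


# Hypothetical two-state chain; its stationary distribution is (2/3, 1/3).
T = [[0.9, 0.1], [0.2, 0.8]]
rng = random.Random(1)
path = run_chain(T, x0=0, num_steps=60000, rng=rng)


def in_state_one(x):
    """f(x) = 1 if the chain is in state 1, so that E[f] = 1/3."""
    return float(x)


est_all = mcmc_estimate(path, in_state_one, burn_in=1000)            # every sample
est_thin = mcmc_estimate(path, in_state_one, burn_in=1000, thin=10)  # every 10th
```

Both estimates target the same expectation. The thinned estimate evaluates f ten times less often, which is the trade-off that matters only when evaluating f is costly.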
2.3.4.6
Discussion
2.4
Learning
Next, we turn our attention to learning graphical models [4, 6]. There are two
variants of the learning task: parameter estimation and structure learning. In the
parameter estimation task, we assume that the qualitative dependency structure
of the graphical model is known; i.e., in the directed model case, G is given, and
in the undirected case, H is given. In this case, the learning task is simply to fill
in the parameters that define the CPDs of the attributes or the parameters which
define the potential functions of the Markov network. In the structure learning task,
there is no additional required input (although the user can, if available, provide
prior knowledge about the structure, e.g., in the form of constraints). The goal is
to extract a Bayesian network or Markov network, structure as well as parameters,
from the training data alone. We discuss each of these problems in turn.
2.4.1
We begin with learning the parameters for a Bayesian network where the dependency structure is known. In other words, we are given the structure G that determines the set of parents for each random variable, and our task is to learn the
parameters $\theta_G$ that define the CPDs for this structure. Our learning is based on a
particular training set D = {x1 , . . . , xm }, which, for now, we will assume is complete
(i.e., each instance is fully observed, there are no missing values). While this task is
relatively straightforward, it is of interest in and of itself. In addition, it is a crucial
component in the structure learning algorithm described in section 2.4.3.
There are two approaches to parameter estimation: maximum likelihood estimation (MLE) and Bayesian approaches. The key ingredient for both is the likelihood
function: the probability of the data given the model. This function captures the
response of the probability distribution to changes in the choice of parameters.
For a Bayesian network structure G the likelihood of a parameter set $\theta_G$ is
\[
L(\theta_G : D) = P(D \mid \theta_G).
\]
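2.4.1.1
\[
L(\theta_G : D) = \prod_{j=1}^{m} P(x^j : \theta_G)
= \prod_{j=1}^{m} \prod_{i=1}^{n} P(x_i^j \mid \mathrm{Pa}_{x_i^j} : \theta_G)
= \prod_{i=1}^{n} \left[ \prod_{j=1}^{m} P(x_i^j \mid \mathrm{Pa}_{x_i^j} : \theta_G) \right].
\]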
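We will use $\theta_{X_i \mid \mathrm{Pa}_i}$ to denote the subset of parameters that determine $P(X_i \mid \mathrm{Pa}_i)$.
In the case where the parameters are disjoint (each CPD is parameterized by a
separate set of parameters that do not overlap), we can maximize each
parameter set independently. We can write the likelihood as follows:
\[
L(\theta_G : D) = \prod_{i=1}^{n} L_i(\theta_{X_i \mid \mathrm{Pa}_i} : D), \qquad
L_i(\theta_{X_i \mid \mathrm{Pa}_i} : D) = \prod_{j=1}^{m} P(x_i^j \mid \mathrm{Pa}_{x_i^j} : \theta_{X_i \mid \mathrm{Pa}_i}).
\]
For a variable X with parents U and a tabular CPD, the local likelihood is
\[
L_X(\theta_{X \mid U} : D) = \prod_{j=1}^{m} \theta_{x^j \mid u^j}
= \prod_{u \in \mathrm{Val}(U)} \left[ \prod_{x \in \mathrm{Val}(X)} \theta_{x \mid u}^{N_{u,x}} \right], \tag{2.21}
\]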
where $N_{u,x}$ is the number of times X = x and $\mathrm{Pa}_i = u$ in D. That is, we have
grouped together all the occurrences of $\theta_{x \mid u}$ in the product over all instances.
We need to maximize this term under the constraints that, for each choice of
value for the parents U, the conditional probability is legal:
\[
\sum_{x} \theta_{x \mid u} = 1 \quad \text{for all } u.
\]
These constraints imply that the choice of value for $\theta_{x \mid u}$ can impact the choice of
value for $\theta_{x' \mid u}$. However, the choices of parameters given different values u of U
are independent of each other. Thus, we can maximize each of the terms in square
brackets in (2.21) independently.
We can thus further decompose the local likelihood function for a tabular CPD
into a product of simple likelihood functions. It is easy to see that each of these
likelihood functions is a multinomial likelihood. The counts in the data for the
different outcomes x are simply $\{N_{u,x} : x \in \mathrm{Val}(X)\}$. We can then immediately
use the MLE parameters for a multinomial, which are simply
\[
\hat{\theta}_{x \mid u} = \frac{N_{u,x}}{N_u},
\]
where we use the fact that $N_u = \sum_x N_{u,x}$.
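The counting estimate can be sketched directly. The snippet below assumes complete data given as a list of dictionaries; the function name `mle_tabular_cpd` and the toy data set are hypothetical.

```python
from collections import Counter, defaultdict


def mle_tabular_cpd(data, child, parents):
    """MLE theta[u][x] = N_{u,x} / N_u for P(child | parents) from complete data.

    data is a list of dicts mapping variable names to observed values.
    """
    joint_counts = Counter()    # N_{u,x}
    parent_counts = Counter()   # N_u
    for row in data:
        u = tuple(row[p] for p in parents)
        joint_counts[(u, row[child])] += 1
        parent_counts[u] += 1
    theta = defaultdict(dict)
    for (u, x), n_ux in joint_counts.items():
        theta[u][x] = n_ux / parent_counts[u]
    return dict(theta)


# Hypothetical complete data for a two-variable network X -> Y.
data = (
    [{"X": 0, "Y": 0}] * 3 + [{"X": 0, "Y": 1}] * 1 +
    [{"X": 1, "Y": 0}] * 1 + [{"X": 1, "Y": 1}] * 3
)
theta_x = mle_tabular_cpd(data, "X", [])            # root node: empty parent set
theta_y_given_x = mle_tabular_cpd(data, "Y", ["X"])
```

Each parent configuration u is handled independently of the others. Outcomes that never occur with a given u do not appear in the table, so their estimated probability is zero.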
2.4.1.2
In many cases, maximum likelihood parameter estimation is not robust, as it overfits the training data. The Bayesian approach uses a prior distribution over the
parameters to smooth the irregularities in the training data, and is therefore significantly more robust. As we will see in section 2.4.3, the Bayesian framework also
gives us a good metric for evaluating the quality of different candidate structures.
Roughly speaking, the Bayesian approach introduces a prior over the unknown
parameters, allowing us to specify a joint distribution over the unknown parameters
and the data instances, and performs Bayesian conditioning, using the data as
evidence, to compute a posterior distribution over these parameters.
Consider the following simple example: we want to estimate parameters for a
simple network with two variables X and Y, where X is the parent of Y. Our
training data consists of observations $x^j, y^j$ for j = 1, ..., m. In addition, assume
that our CPDs are represented as multinomials and we have unknown parameter
vectors $\theta_X$, $\theta_{Y \mid x^0}$, and $\theta_{Y \mid x^1}$.
The dependencies between these variables are described in the network of figure 2.6. This is the meta-Bayesian network that describes our learning setup. This
Bayesian network structure immediately reveals several points. For example, the
instances are independent given the unknown parameters. In addition, a common
assumption made is that the individual parameter variables are a priori independent. That is, we believe that knowing the value of one parameter tells us nothing
about another. This is called parameter independence. The suitability of this assumption depends on the domain, and it should be considered with care.
If we accept parameter independence, we can draw an important conclusion.
Complete data d-separates the parameters for different CPDs. Given the data set
D, we can determine the posterior over $\theta_X$ independently of the posterior over
$\theta_{Y \mid X}$. Once we solve each problem separately, we can combine the results. This
is the analogous result to the likelihood decomposition for MLE of
section 2.4.1.1.
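[Figure 2.6 graphic: parameter nodes $\theta_X$, $\theta_{Y \mid x^0}$, and $\theta_{Y \mid x^1}$, together with instance nodes X[1], Y[1], X[2], Y[2], ..., X[M], Y[M].]
Figure 2.6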
Consider, for example, the learning setting described in figure 2.6, where we take
both X and Y to be binary. We need to represent the posterior over $\theta_X$ and $\theta_{Y \mid X}$ given
the data. If we use a Dirichlet prior over $\theta_X$, $\theta_{Y \mid x^0}$, and $\theta_{Y \mid x^1}$, then the posterior
$P(\theta_X \mid x^1, \ldots, x^M)$ can also be represented as a Dirichlet distribution.
Suppose that $P(\theta_X)$ is a Dirichlet prior with hyperparameters $\alpha_{x^0}$ and $\alpha_{x^1}$,
$P(\theta_{Y \mid x^0})$ is a Dirichlet prior with hyperparameters $\alpha_{y^0 \mid x^0}$ and $\alpha_{y^1 \mid x^0}$, and $P(\theta_{Y \mid x^1})$
is a Dirichlet prior with hyperparameters $\alpha_{y^0 \mid x^1}$ and $\alpha_{y^1 \mid x^1}$.
As in the decomposition of the likelihood function in section 2.4.1.1, the likelihood
terms that involve $\theta_{Y \mid x^0}$ depend on all the data elements for which $x^j = x^0$, and
the terms that involve $\theta_{Y \mid x^1}$ depend on all the data elements for which $x^j = x^1$.
We can decompose the joint distribution over parameters and data as follows:
\[
P(\theta_G, D) = P(\theta_X)\, L_X(\theta_X : D)
\cdot P(\theta_{Y \mid x^1}) \prod_{j : x^j = x^1} P(y^j \mid x^j : \theta_{Y \mid x^1})
\cdot P(\theta_{Y \mid x^0}) \prod_{j : x^j = x^0} P(y^j \mid x^j : \theta_{Y \mid x^0}).
\]
This induces a predictive model in which, for the next instance, we have that
\[
P(X_i[m+1] = x_i \mid U[m+1] = u, D) = \frac{\alpha_{x_i \mid u} + N_{x_i,u}}{\sum_{x_i'} \left( \alpha_{x_i' \mid u} + N_{x_i',u} \right)}. \tag{2.22}
\]
Putting this all together, we can see that for computing the probability of a
new instance, we can use a single network parameterized as usual, via a set of
multinomials, but ones computed as in (2.22).
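The predictive rule (2.22) amounts to adding the hyperparameters to the counts before normalizing. A minimal sketch for a single parent configuration u follows; the function name, the value labels, and the numbers are hypothetical.

```python
def dirichlet_predictive(counts, alphas):
    """Predictive distribution of (2.22) for one parent configuration u.

    counts[x] is the count N_{x,u}; alphas[x] is the hyperparameter alpha_{x|u}.
    Values missing from counts are treated as having count zero.
    """
    total = sum(alphas[x] + counts.get(x, 0) for x in alphas)
    return {x: (alphas[x] + counts.get(x, 0)) / total for x in alphas}


# Hypothetical case: three observations of y0 and none of y1, uniform prior.
counts = {"y0": 3}
alphas = {"y0": 1.0, "y1": 1.0}
predictive = dirichlet_predictive(counts, alphas)
```

Here the maximum likelihood estimate would assign probability 0 to y1, whereas the predictive distribution assigns it 1/5; this is the smoothing effect of the prior.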
2.4.2
Note that the gradient is zero when the counts of the data correspond exactly
with the expected counts predicted by the model. In practice, a prior on the
parameters is used to help avoid overfitting. The standard prior is a diagonal
Gaussian, $\theta \sim N(0, \sigma^2 I)$, which adds an additional term of $-\theta_{i,u}/\sigma^2$ to the gradient.
To compute the probability $P(u_i \mid \theta)$ needed to evaluate the gradient, we need
to perform inference in the Markov network. Unlike in Bayesian networks, where
parameters of intractable (large treewidth) graphs can be estimated by simple
counting because of local normalization, the undirected case requires inference
even during the learning stage. This is one of the prices of the flexibility of global
normalization in Markov networks. See further discussion in chapter 4. Because
of this added complexity, maximum-likelihood learning of the Markov network
structure is much more expensive and much less investigated; we will focus below
on Bayesian networks.
2.4.3
Next we consider the problem of learning the structure of a Bayesian network. There
are three broad classes of algorithms for BN structure learning:
Constraint-based approaches. These approaches view a Bayesian network as
a representation of independencies. They try to test for conditional dependence
and independence in the data, and then find a network that best explains these
dependencies and independencies. Constraint-based methods are quite intuitive;
they closely follow the definition of a Bayesian network. A potential disadvantage
of these methods is that they can be sensitive to failures in individual independence
tests.
Score-based approaches. These approaches view a Bayesian network as specifying a statistical model, and then address learning as a model selection problem.
These all operate on the same principle: we define a hypothesis space of potential models (the set of possible network structures we are willing to consider)
and a scoring function that measures how well the model fits the observed
data. Our computational task is then to find the highest-scoring network structure. The space of Bayesian networks is a combinatorial space, consisting of a
superexponential number of structures, $2^{O(n^2)}$. Therefore, even with a scoring
function, it is not clear how one can find the highest-scoring network. There are
very special cases where we can find the optimal network. In general, however,
the problem is NP-hard, and we resort to heuristic search techniques. Score-based
methods consider the whole structure at once, and are therefore less sensitive to
individual failures and are better at making compromises between the extent to
which variables are dependent in the data and the cost of adding the edge.
The disadvantage of the score-based approaches is that they are in general not
guaranteed to find the optimal solution.
Bayesian model averaging approaches. The third class of approaches does not
attempt to learn a single structure. They are based on a Bayesian framework
describing a distribution over possible structures and try to average the prediction
of all possible structures. Since the number of structures is immense, performing
this task seems impossible. For some classes of models this can be done efficiently,
and for others we need to resort to approximations.
In this chapter, we focus on the second approach, score-based approaches to
structure selection. For details about the other approaches, see [8].
2.4.3.1
Structure Scores
As discussed above, score-based methods approach the problem of structure learning as an optimization problem. We define a score function that can score each
candidate structure with respect to the training data, and then search for a high-scoring structure. As can be expected, one of the most important decisions we
must make in this framework is the choice of scoring function. In this subsection,
we discuss two of the most obvious choices.
The Likelihood Score. A natural choice for scoring function is the likelihood
function, which we used for parameter estimation. This measures the probability
of the data given a model; thus, it seems intuitive to find a model that would make
the data as probable as possible.
Assume that we want to maximize the likelihood of the model. Our goal is to
find both a graph G and parameters $\theta_G$ that maximize the likelihood. It is easy to
show that to find the maximum-likelihood $(G, \theta_G)$ pair, we should find the graph
structure G that achieves the highest likelihood when we use the MLE parameters
for G. We therefore define
\[
\mathrm{score}_L(G : D) = \ell(\langle G, \hat{\theta}_G \rangle : D),
\]
where $\ell(\langle G, \hat{\theta}_G \rangle : D)$ is the logarithm of the likelihood function, and $\hat{\theta}_G$ are the
maximum-likelihood parameters for G. (It is typically easier to deal with the
logarithm of the likelihood.)
The problem with the likelihood score is that it overfits the training data. It
will learn a model that precisely fits the specifics of the empirical distribution in
our training set. This model captures both dependencies that are true of the
underlying distribution, and dependencies that are artifacts of the specific set of
instances that were given as training data. It therefore fails to generalize well to
new data cases: these are sampled from the underlying distribution, which is not
identical to the empirical distribution in our training set.
However, it is reasonable to use the maximum-likelihood score when there are additional mechanisms that disallow overcomplicated structures, for example, when learning networks with a fixed indegree. Such a limitation can constrain the tendency
to overfit when using the maximum-likelihood score.
Bayesian Score An alternative scoring function is based on Bayesian considerations. Recall that the main principle of the Bayesian approach is that, whenever we
have uncertainty over anything, we should place a distribution over it. In this case,
we have uncertainty both over structure and over parameters. We therefore define
a structure prior $P(G)$ that puts a prior probability on different graph structures,
and a parameter prior $P(\theta_G \mid G)$ that puts a probability on different choices of
parameters once the graph is given. By Bayes' rule, the posterior over structures is
$$P(G \mid D) = \frac{P(D \mid G)\,P(G)}{P(D)},$$
where, as usual, the denominator is simply a normalizing factor that does not help
distinguish between different structures. Then, we define the Bayesian score as
$$\mathrm{score}_B(G : D) = \log P(D \mid G) + \log P(G). \tag{2.23}$$
The ability to ascribe a prior over structures gives us a way of preferring some
structures over others. For example, we can penalize dense structures more than
sparse ones. It turns out, however, that this prior term in the score is almost irrelevant
compared to the first term. This first term, $P(D \mid G)$, takes into consideration
our uncertainty over the parameters:
$$P(D \mid G) = \int P(D \mid \theta_G, G)\,P(\theta_G \mid G)\,d\theta_G, \tag{2.24}$$
where $P(D \mid \theta_G, G)$ is the likelihood of the data given the network $\langle G, \theta_G \rangle$ and
$P(\theta_G \mid G)$ is our prior distribution over different parameter values for the network
$G$. This term is the marginal likelihood of the data given the structure, since we
marginalize out the unknown parameters.
Note that the marginal likelihood is different from the maximum-likelihood score.
Both terms examine the likelihood of the data given the structure. The maximum-likelihood score returns the maximum of this function. In contrast, the marginal
likelihood is the average value of this function, where we average based on the prior
measure $P(\theta_G \mid G)$.
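To make the contrast between the maximum and the average concrete, the following minimal sketch compares the two quantities for a single binary variable with no parents. The function names and the choice of a Beta(1, 1) prior (a Dirichlet prior over two values) are assumptions made for this illustration only.

```python
from math import lgamma, log

def log_max_likelihood(heads, tails):
    """Log-likelihood of the data at the MLE parameter theta = heads / (heads + tails)."""
    m = heads + tails
    total = 0.0
    if heads:
        total += heads * log(heads / m)
    if tails:
        total += tails * log(tails / m)
    return total

def log_marginal_likelihood(heads, tails, a=1.0, b=1.0):
    """Log of the likelihood averaged over theta under a Beta(a, b) prior."""
    return (lgamma(a + b) - lgamma(a) - lgamma(b)
            + lgamma(a + heads) + lgamma(b + tails)
            - lgamma(a + b + heads + tails))

# Three observations: two of one value, one of the other.
best = log_max_likelihood(2, 1)          # log(4/27)
average = log_marginal_likelihood(2, 1)  # log(1/12)
print(best, average)
```

Because an average can never exceed a maximum, the marginal likelihood is always bounded above by the maximum-likelihood value for the same structure.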
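Instantiating this further, if we consider a network with Dirichlet priors, such
that $P(\theta_{X_i \mid \mathrm{pa}_i} \mid G)$ has hyperparameters $\{\alpha^G_{x_i^j \mid \mathbf{u}_i} : j = 1, \ldots, |X_i|\}$, then we have
that
$$P(D \mid G) = \prod_i \prod_{\mathbf{u}_i \in \mathrm{Val}(\mathrm{Pa}^G_{X_i})} \frac{\Gamma(\alpha^G_{X_i \mid \mathbf{u}_i})}{\Gamma(\alpha^G_{X_i \mid \mathbf{u}_i} + N_{\mathbf{u}_i})} \prod_{x_i^j \in \mathrm{Val}(X_i)} \frac{\Gamma(\alpha^G_{x_i^j \mid \mathbf{u}_i} + N_{x_i^j, \mathbf{u}_i})}{\Gamma(\alpha^G_{x_i^j \mid \mathbf{u}_i})},$$
where $\alpha^G_{X_i \mid \mathbf{u}_i} = \sum_j \alpha^G_{x_i^j \mid \mathbf{u}_i}$. In practice, we use the logarithm of this formula, which
is more manageable to compute numerically.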
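As a minimal sketch of the logarithmic form, the contribution of a single variable and its parents can be accumulated with the log-gamma function, which avoids overflow in the Gamma terms. The function name and the nested-dictionary layout for counts and hyperparameters are hypothetical choices for this illustration.

```python
from math import lgamma, exp

def log_family_marginal(counts, alpha):
    """Log marginal likelihood contribution of one variable given its parents.

    counts[u][x] -- number of instances with parent assignment u and value x
    alpha[u][x]  -- Dirichlet hyperparameter for the same (u, x) pair
    (both layouts are assumptions of this sketch)
    """
    total = 0.0
    for u, row in counts.items():
        a_u = sum(alpha[u].values())  # summed hyperparameters for this u
        n_u = sum(row.values())       # number of instances with this u
        total += lgamma(a_u) - lgamma(a_u + n_u)
        for x, n_xu in row.items():
            total += lgamma(alpha[u][x] + n_xu) - lgamma(alpha[u][x])
    return total

# A binary variable with one binary parent and all hyperparameters set to 1.
counts = {0: {0: 2, 1: 1}, 1: {0: 0, 1: 1}}
alpha = {0: {0: 1.0, 1: 1.0}, 1: {0: 1.0, 1: 1.0}}
score = log_family_marginal(counts, alpha)
print(exp(score))  # about 0.0417 (= 1/24)
```

Summing this quantity over all variables gives the logarithm of the product above.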
The Bayesian score is biased toward simpler structures, but as it gets more data,
it is willing to recognize that a more complex structure is necessary. In other words,
it trades off fit to data with model complexity. To understand this behavior, it is useful to
consider an approximation to the Bayesian score that better exposes its fundamental
properties.
Theorem 2.35
If we use a Dirichlet parameter prior for all parameters in our network, then, as
$M \to \infty$, we have that
$$\log P(D \mid G) = \ell(\hat{\theta}_G : D) - \frac{\log M}{2}\,\mathrm{Dim}[G] + O(1),$$
PaGXi )
Search
All of the scores we have considered are decomposable. Another property that is
shared by all these scores is score equivalence: if $G$ is independence-equivalent to
$G'$, then $\mathrm{score}(G : D) = \mathrm{score}(G' : D)$.
There are several special cases where structure learning is tractable. We won't go
into full details, but two important cases are: (1) learning tree-structured networks
and (2) learning networks with known ordering over the variables.
A network is tree-structured if each variable has at most one parent. In this case,
for decomposable, score-equivalent scores, we can construct an undirected graph,
where the weight on an edge $X_i - X_j$ is the change in network score if we add
$X_i$ as the parent of $X_j$ (note that, because of score-equivalence, this is the same as
the change if we add $X_j$ as parent of $X_i$). We can find a maximum-weight spanning tree of
this graph in polynomial time. We can transform the undirected spanning tree into
a directed spanning tree by choosing an arbitrary root, and directing edges away
from the root.
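The two steps can be sketched as follows, assuming the edge weights (score changes) have already been computed. The sketch uses Kruskal's algorithm, which is one standard way to find a maximum-weight spanning tree; the function names and the toy weights are hypothetical. As a design choice of this sketch, edges that do not improve the score are skipped, so the result may be a forest, which still satisfies the at-most-one-parent condition.

```python
from collections import defaultdict, deque

def max_spanning_forest(n, weights):
    """Kruskal's algorithm over nodes 0..n-1; weights maps (i, j) to a score gain."""
    parent = list(range(n))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    chosen = []
    for (i, j), w in sorted(weights.items(), key=lambda kv: -kv[1]):
        if w <= 0:
            break  # remaining edges would not improve the score
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            chosen.append((i, j))
    return chosen

def orient_from_root(n, edges, root=0):
    """Direct edges away from the root of each component; returns {child: parent}."""
    adj = defaultdict(list)
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    parent_of, seen = {}, set()
    for start in [root] + [v for v in range(n) if v != root]:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    parent_of[w] = v
                    queue.append(w)
    return parent_of

weights = {(0, 1): 5.0, (1, 2): 3.0, (0, 2): 1.0, (2, 3): -0.5}
edges = max_spanning_forest(4, weights)
tree = orient_from_root(4, edges, root=0)
print(edges, tree)  # [(0, 1), (1, 2)] {1: 0, 2: 1}
```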
Another interesting tractable case is the problem of learning a BN structure
consistent with some known total order $\prec$ over $X$ and bounded indegree $d$. In other
words, we restrict attention to structures $G$ where if $X_i \in \mathrm{Pa}^G_{X_j}$ then $X_i \prec X_j$
and $|\mathrm{Pa}^G_{X_j}| \le d$. For some domains, finding an ordering such as this is relatively
straightforward; for example, a temporal flow over the order in which variables take
on their values. In this case, for each $X_i$ we can evaluate each possible parent set
of size at most $d$ from $\{X_1, \ldots, X_{i-1}\}$. This is polynomial in $n$ (but exponential in $d$).
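A minimal sketch of this enumeration is given below. It relies on the score being decomposable, so that each variable's parent set can be chosen independently; the function name, the `family_score` callback, and the toy score table are assumptions made for illustration.

```python
from itertools import combinations

def best_parents_given_order(order, d, family_score):
    """For each variable, pick the best parent set of size at most d among its predecessors."""
    result = {}
    for pos, x in enumerate(order):
        preds = order[:pos]
        best_set, best_val = (), family_score(x, ())
        for k in range(1, min(d, len(preds)) + 1):
            for cand in combinations(preds, k):
                val = family_score(x, cand)
                if val > best_val:
                    best_set, best_val = cand, val
        result[x] = best_set
    return result

def toy_score(x, parents):
    """Hypothetical local score: a fit bonus minus one unit per parent."""
    bonus = {("B", ("A",)): 2.0, ("C", ("A", "B")): 3.0, ("C", ("B",)): 2.5}
    return bonus.get((x, parents), 0.0) - 1.0 * len(parents)

structure = best_parents_given_order(["A", "B", "C"], 2, toy_score)
print(structure)  # {'A': (), 'B': ('A',), 'C': ('B',)}
```

Since every candidate parent set contains only predecessors in the order, the resulting structure is acyclic by construction and no legality check is needed.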
Unfortunately, the general case, finding an optimally scoring $G^*$, for bounded
indegree $d \ge 2$, is NP-hard. Instead of aiming for an algorithm that will always find
the highest-scoring network, we resort to heuristic algorithms that attempt to find
the best network, but are not guaranteed to do so.
To define the heuristic search algorithm, we must define the search space and
search procedure. We can think of a search space as a graph, where each vertex or
node is a candidate network structure to be considered, and edges denote possible
moves that the search procedure can perform. The search procedure defines an
algorithm that explores the search space, without necessarily seeing all of it. The
simplest search procedure is the greedy one: whenever it is at a node, it chooses to
move to the neighbor that has the highest score, until it reaches a node that has a
better score than all of its neighbors.
To elaborate further, in our case a node in the search space is a complete network
structure $G$ over $X$. There is a tradeoff between how densely each node is connected and
how effective the search will be. If each node has few neighbors, then the search
procedure has to consider only a few options at each point of the search. Thus, it can
afford to evaluate each of these options. However, paths from the initial node to a
good one might be long and complex. On the other hand, if each node has many
neighbors, there are short paths from each point to another, but we might not be
able to find them, because we don't have time to evaluate all of the options at each
step.
A good tradeoff for this problem chooses reasonably few neighbors for each node,
but ensures that the diameter of the search space remains small. A natural choice
of neighbors of a network structure is a set of structures that are identical to it
except for small local modifications. The most commonly used operators which
define the local modifications are
add an edge;
delete an edge;
reverse an edge.
In other words, if we consider the node G, then the neighboring nodes in the
search space are those where we change one edge, either by adding one, deleting
one, or reversing the orientation of one. We only consider operations that result
in legal networks (i.e., acyclic networks satisfying any constraints such as bounded
indegree).
Procedure Greedy-Structure-Search (
    G,      // initial network structure
    D,      // fully observed dataset
    score,  // score
    O       // a set of search operators
)
    Gbest ← G
    do
        G ← Gbest
        Progress ← false
        for each operator o ∈ O
            Go ← o(G)  // result of applying o on G
            if Go is legal structure then
                if score(Go : D) > score(Gbest : D) then
                    Gbest ← Go
                    Progress ← true
    while Progress
    return Gbest
Figure 2.7 Greedy structure search algorithm, with an arbitrary scoring function
score(G : D).
This definition of search space is quite natural and has several desirable properties. First, notice that the diameter of the search space is at most $n^2$. That is,
there is a relatively short path between any two networks we choose. To see this,
note that if we consider traversing a path from $G_1$ to $G_2$, we can start by deleting
all edges in $G_1$ that do not appear in $G_2$, and then we can add the edges that are
in $G_2$ and not in $G_1$. Clearly, the number of steps we take is bounded by the total
number of edges we can have, $n^2$.
Second, recall that the score of a network G is a sum of local scores. The operations
we consider result in changing only one local score term (in the case of addition
or deletion of an edge) or two (in the case of edge reversal). Thus, they result in a
local change in the score; the main mass of the score remains the same. This
implies that there is some sense of continuity in the score of neighboring nodes.
The search methods most commonly used are local search procedures. Such
search procedures are characterized by the following design: they keep a current
candidate node. At each iteration they explore some of the neighboring nodes, and
then decide to make a step to one of the neighbors and make it the current
candidate. These iterations are repeated until some termination condition is met. In other
words, local search procedures can be thought of as keeping one pointer into the
search space and moving it around.
One of the simplest, and most often used, search procedures is the greedy hill-climbing
procedure. The intuition is simple. As the name suggests, at each step we take the
step that leads to the largest improvement in the score. The actual details of the
procedure are shown in figure 2.7. We pick an initial network structure $G$ as a
starting point; this network can be the empty one, a random choice, the best tree,
or a network obtained from some prior knowledge. We compute its score. We then
consider all of the neighbors of $G$ in the space (all of the legal networks obtained
by applying a single operator to $G$) and compute the score for each of them. We
then apply the change that leads to the best improvement in the score. We continue
this process until no modification improves the score.
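The procedure can be sketched in a few lines of runnable code. This is an illustration under stated assumptions rather than a reference implementation: a structure is represented as a frozenset of directed edges, legality is taken to mean acyclicity only, and the scoring function `toy_score` is a hypothetical stand-in that rewards edges of a fixed target graph instead of evaluating data.

```python
from itertools import permutations

def is_acyclic(nodes, edges):
    """Repeatedly strip nodes with no incoming edges; failure means a cycle."""
    remaining, es = set(nodes), set(edges)
    while remaining:
        sources = [v for v in remaining if not any(j == v for _, j in es)]
        if not sources:
            return False
        remaining -= set(sources)
        es = {(i, j) for i, j in es if i in remaining}
    return True

def neighbors(nodes, edges):
    """All legal structures one add, delete, or reverse away from `edges`."""
    for i, j in permutations(nodes, 2):
        if (i, j) in edges:
            yield edges - {(i, j)}                # delete i -> j
            cand = (edges - {(i, j)}) | {(j, i)}  # reverse i -> j
        elif (j, i) in edges:
            continue                              # handled when visiting (j, i)
        else:
            cand = edges | {(i, j)}               # add i -> j
        if is_acyclic(nodes, cand):
            yield cand

def greedy_structure_search(nodes, start, score):
    """Move to the best-scoring neighbor until no neighbor improves the score."""
    current = frozenset(start)
    current_score = score(current)
    while True:
        best_nb = max(neighbors(nodes, current), key=score, default=None)
        if best_nb is None or score(best_nb) <= current_score:
            return current, current_score
        current, current_score = best_nb, score(best_nb)

# Hypothetical score: +1 for each edge of a target graph, -1 for any other edge.
target = frozenset({("A", "B"), ("B", "C")})

def toy_score(g):
    return len(g & target) - len(g - target)

found, found_score = greedy_structure_search(["A", "B", "C"], frozenset(), toy_score)
print(sorted(found), found_score)  # [('A', 'B'), ('B', 'C')] 2
```

With a real decomposable score, only the one or two local terms touched by an operator need to be recomputed, rather than the full score as done here.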
We can improve on the performance of greedy hill-climbing by using more clever
search algorithms. Some common extensions are:
TABU search: Keep a list of K most recently visited structures and avoid them,
i.e., apply the best move that leads to a structure not on the list. This approach
deals with local maxima whose hill has fewer than K structures.
Random restarts: Once stuck, apply some fixed number of random edge changes
and then restart the greedy search. At the end of the search, select the best
structure encountered anywhere on the trajectory. This approach can escape from
the basin of one local maximum to another.
Simulated annealing: Evaluate operators in random order. If the randomly
selected operator induces an uphill step, move to the resulting structure. (Note:
it does not have to be the best of the current neighbors.) If the operator induces a
downhill step, apply it with probability inversely proportional to the reduction in
score. A temperature parameter determines the probability of taking downhill
steps. As the search progresses, the temperature decreases, and the algorithm
becomes less likely to take a downhill step.
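As a small illustration of the first extension, the following sketch implements TABU search over an abstract search space. To stay short, it uses a one-dimensional toy landscape instead of network structures; the function names, the landscape, and the step limit are all assumptions of this sketch.

```python
from collections import deque

def tabu_search(start, neighbors, score, k=3, steps=20):
    """Move to the best neighbor not among the k most recently visited states."""
    current = best = start
    tabu = deque([start], maxlen=k)  # oldest entries fall off automatically
    for _ in range(steps):
        options = [s for s in neighbors(current) if s not in tabu]
        if not options:
            break
        current = max(options, key=score)  # may be a downhill move
        tabu.append(current)
        if score(current) > score(best):
            best = current
    return best

# Toy landscape: a local maximum at index 1 and the global maximum at index 5.
landscape = [0, 2, 1, 0, 3, 5, 4]

def line_neighbors(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(landscape)]

def height(i):
    return landscape[i]

print(tabu_search(0, line_neighbors, height, k=3))  # 5
print(tabu_search(0, line_neighbors, height, k=1))  # 1
```

With k = 3 the list is long enough to force the search across the dip between the two maxima, whereas with k = 1 the search oscillates around the local maximum, which matches the remark about hills with fewer than K structures.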
2.5
Conclusion
This chapter presented a condensed description of graphical models, including their
representation, inference algorithms, and learning algorithms. Many topics have not
been covered; we refer the reader to [8] for a more complete description.
References
[1] A. Becker and D. Geiger. A sufficiently fast algorithm for finding close to
optimal clique trees. Artificial Intelligence, 125(1-2):3–17, 2001.
Sašo Džeroski
3.1
Introduction
From a knowledge discovery in databases (KDD) perspective, we can say that
inductive logic programming (ILP) is concerned with the development of techniques
and tools for relational data mining (RDM). While typical data mining approaches
find patterns in a given single table, relational data mining approaches find patterns
in a given relational database. In a typical relational database, data resides in
multiple tables. ILP tools can be applied directly to such multi-relational data to
find patterns that involve multiple relations. This is a distinguishing feature of ILP
approaches: most other data mining approaches can only deal with data that resides
in a single table and require preprocessing to integrate data from multiple tables
(e.g., through joins or aggregation) into a single table before they can be applied.
Integrating data from multiple tables through joins or aggregation can cause loss
of meaning or information. Suppose we are given the relation customer(CustID, Name,
Age, SpendsALot) and the relation purchase(CustID, ProductID, Date, Value,
PaymentMode), where each customer can make multiple purchases, and we are interested in characterizing customers that spend a lot. Integrating the two relations
via a natural join will give rise to a relation purchase1 where each row corresponds
to a purchase and not to a customer. One possible aggregation would give rise to
the relation customer1(CustID, Age, NofPurchases, TotalValue, SpendsALot).
In this case, however, some information has clearly been lost during the aggregation process.
The following pattern can be discovered by an ILP system if the relations
customer and purchase are considered together.
customer(CID, Name, Age, yes) ←
    Age > 30 ∧
    purchase(CID, PID, D, Value, PM) ∧
    PM = credit_card ∧ Value > 100.
This pattern says: a customer spends a lot if she is older than 30, has purchased
a product of value more than 100, and paid for it by credit card. It would not
be possible to induce such a pattern from either of the relations purchase1 and
customer1 considered on their own.
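The point can be illustrated with a minimal sketch that evaluates the body of the pattern directly over both relations. All rows below are made up for illustration, and the function name `pattern_body_holds` is hypothetical.

```python
# (CustID, Name, Age, SpendsALot) -- made-up rows
customers = [
    (1, "ann", 45, "yes"),
    (2, "bob", 28, "no"),
    (3, "eve", 52, "no"),
]
# (CustID, ProductID, Date, Value, PaymentMode) -- made-up rows
purchases = [
    (1, "p1", "2006-01-03", 150, "credit_card"),
    (1, "p2", "2006-02-11", 20, "cash"),
    (2, "p3", "2006-02-12", 300, "credit_card"),
    (3, "p4", "2006-03-01", 170, "cash"),
    (3, "p5", "2006-03-02", 40, "credit_card"),
]

def pattern_body_holds(cust):
    """Age > 30 and some single purchase has Value > 100 paid by credit card."""
    cid, _name, age, _label = cust
    return age > 30 and any(
        c == cid and pm == "credit_card" and value > 100
        for c, _pid, _date, value, pm in purchases
    )

predicted = {c[0] for c in customers if pattern_body_holds(c)}
actual = {c[0] for c in customers if c[3] == "yes"}
print(predicted, actual)  # {1} {1}
```

Customers 1 and 3 both have two purchases and are older than 30, and customer 3 has the higher total value, so the aggregated attributes of customer1 do not reveal that only customer 1 has a single purchase meeting both conditions.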
Besides the ability to deal with data stored in multiple tables directly, ILP systems are usually able to take into account generally valid background (domain)
knowledge in the form of a logic program. The ability to take into account background knowledge and the expressive power of the language of discovered patterns
are also distinctive for ILP.
Note that data mining approaches that find patterns in a given single table are
referred to as attribute-value or propositional learning approaches, as the patterns
they find can be expressed in propositional logic. ILP approaches are also referred to
as first-order learning approaches, or relational learning approaches, as the patterns
they find are expressed in the relational formalism of first-order logic. A more
detailed discussion of the single table assumption, the problems resulting from it,
and how a relational representation alleviates these problems can be found in [49]
and in chapter 4 of [15].
The remainder of this chapter first introduces the basics of logic programming and
relates logic programming terminology to database terminology. It then discusses
the major settings for, tasks of, and approaches to ILP and RDM. The tasks of
learning relational classification rules, decision trees, and association rules and
approaches to solving them are discussed in the following three sections. Relational
distance-based approaches are covered next. The chapter concludes with a brief
discussion of recent trends in ILP and RDM research.
3.2
Logic Programming
We first briefly describe the basic logic programming terminology and relate it
to database terminology, then proceed with a more complete introduction to logic
programming. The latter discusses both the syntax and semantics of logic programs.
While syntax defines the language of logic programs, semantics is concerned with
assigning meaning (truth-values) to such statements. Proof theory focuses on
(deductive) reasoning with such statements.
For a thorough treatment of logic programming we refer to the standard textbook
of Lloyd [31]. The overview below is mostly based on the comprehensive and easily
readable text by Hogger [22].
3.2.1
Logic programs consist of clauses. We can think of clauses as first-order rules, where
the conclusion part is termed the head and the condition part the body of the clause.
The head and body of a clause consist of atoms, an atom being a predicate applied
to some arguments, which are called terms. In Datalog, terms are variables and
constants, while in general they may consist of function symbols applied to other
terms. Ground clauses have no variables.
Consider the clause father(X, Y) ∨ mother(X, Y) ← parent(X, Y). It reads: if
X is a parent of Y, then X is the father of Y or X is the mother of Y (∨ stands for
logical or). parent(X, Y) is the body of the clause, and father(X, Y) ∨ mother(X, Y)
is the head. parent, father, and mother are predicates, X and Y are variables,
and parent(X, Y), father(X, Y), and mother(X, Y) are atoms. We adopt the
Prolog [4] syntax and start variable names with capital letters. Variables in clauses
are implicitly universally quantified. The above clause thus stands for the logical
formula ∀X∀Y : father(X, Y) ∨ mother(X, Y) ∨ ¬parent(X, Y). Clauses are also
viewed as sets of literals, where a literal is an atom or its negation. The above clause
is then the set {father(X, Y), mother(X, Y), ¬parent(X, Y)}.
As opposed to full clauses, definite clauses contain exactly one atom in the
head. As compared to definite clauses, program clauses can also contain negated
atoms in the body. The clause in the paragraph above is a full clause; the clause
ancestor(X, Y) ← parent(Z, Y) ∧ ancestor(X, Z) is a definite clause (∧ stands
for logical and). It is also a recursive clause, since it defines the relation ancestor in
terms of itself and the relation parent. The clause mother(X, Y) ← parent(X, Y) ∧
not male(X) is a program clause.
A set of clauses is called a clausal theory. Logic programs are sets of program
clauses. A set of program clauses with the same predicate in the head is called a
predicate definition. Most ILP approaches learn predicate definitions.
A predicate in logic programming corresponds to a relation in a relational
database. An n-ary relation p is formally defined as a set of tuples [47], i.e., a
subset of the Cartesian product of n domains D1 × D2 × . . . × Dn, where a domain
(or a type) is a set of values. It is assumed that a relation is finite unless stated
otherwise. A relational database (RDB) is a set of relations.
Thus, a predicate corresponds to a relation, and the arguments of a predicate
correspond to the attributes of a relation. The major difference is that the attributes
of a relation are typed (i.e., a domain is associated with each attribute). For
example, in the relation lives_in(X, Y), we may want to specify that X is of type
person and Y is of type city. Database clauses are typed program clauses.
Table 3.1
DB terminology                  LP terminology
relation name p                 predicate symbol p
attribute of relation p         argument of predicate p
tuple ⟨a1, . . . , an⟩          ground fact p(a1, . . . , an)
relation p, a set of tuples     predicate p, defined extensionally
                                by a set of ground facts
relation q, defined as a view   predicate q, defined intensionally
                                by a set of rules (clauses)
A deductive database (DDB) is a set of database clauses. In DDBs, relations
can be defined extensionally as sets of tuples (as in RDBs) or intensionally as
sets of database clauses. Database clauses use variables and function symbols in
predicate arguments and the language of DDBs is substantially more expressive
than the language of RDBs [31, 47]. A deductive Datalog database consists of
definite database clauses with no function symbols.
Table 3.1 relates basic database and logic programming terms. For a full treatment of logic programming, RDBs, and DDBs, we refer the reader to [31] and [47].
3.2.2
The basic concepts of logic programming include the language (syntax) of logic
programs, as well as notions from model and proof theory (semantics). The syntax defines what are legal sentences/statements in the language of logic programs.
Model theory (semantics) is concerned with assigning meaning (truth-values) to
such statements. Proof theory focuses on (deductive) reasoning with such statements.
3.2.2.1
Model theory is concerned with attributing meaning (truth-value) to sentences (well-formed formulae) in a first-order language. Informally, the sentence is mapped to
some statement about a chosen domain through a process known as interpretation.
An interpretation is determined by the set of ground facts (ground atomic formulae)
to which it assigns the value true. Sentences involving variables and quantifiers are
interpreted by using the truth-values of the ground atomic formulae and a fixed set
of rules for interpreting logical operations and quantifiers, such as "¬F is true if and
only if F is false."
An interpretation which gives the value true to a sentence is said to satisfy the
sentence; such an interpretation is called a model for the sentence. An interpretation
which does not satisfy a sentence is called a counter-model for that sentence. By
extension, we also have the notion of a model (counter-model) for a set of sentences
(e.g., for a clausal theory): an interpretation is a model for the set if and only if it
is a model for each of the set's members. A sentence (set of sentences) is satisfiable
if it has at least one model; otherwise it is unsatisfiable.
Proof theory focuses on (deductive) reasoning with logic programs. Whereas model
theory considers the assignment of meaning to sentences, proof theory considers
the generation of sentences (conclusions) from other sentences (premises). More
specifically, proof theory considers the derivability of sentences in the context of
some set of inference rules, i.e., rules for sentence derivation. Formally, an inference
system consists of an initial set S of sentences (axioms) and a set R of inference
rules.
Using the inference rules, we can derive new sentences from S and/or other
derived sentences. The fact that sentence s can be derived from S is denoted S ⊢ s.
A proof is a sequence s_1, s_2, . . . , s_n, such that each s_i is either in S or derivable using
R from S and s_1, . . . , s_{i−1}. Such a proof is also called a derivation or deduction. Note
that the above notions are of an entirely syntactic nature. They are directly relevant
to the computational aspects of automated deductive inference.
The set of inference rules R defines the derivability relation ⊢. A set of inference
rules is sound if the corresponding derivability relation is a subset of the logical
implication relation, i.e., for all S and s, if S ⊢ s, then S ⊨ s. It is complete if
the other direction of the implication holds, i.e., for all S and s, if S ⊨ s, then
S ⊢ s. The properties of soundness and completeness establish a relation between
the notions of syntactic (⊢) and semantic (⊨) entailment in logic programming and
first-order logic. When the set of inference rules is both sound and complete, the
two notions coincide.
Resolution comprises a single inference rule applicable to clausal-form logic. From
any two clauses having an appropriate form, resolution derives a new clause as their
consequence. For example, the clauses daughter(X, Y) ← female(X), parent(Y, X)
and female(sonja) ← resolve into daughter(sonja, Y) ← parent(Y, sonja). Resolution is sound: every resolvent is implied by its parents. It is also refutation complete: the empty clause is derivable by resolution from any set S of Horn clauses if
S is unsatisfiable.
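The resolution step in the example can be reproduced with a minimal sketch for function-free (Datalog) definite clauses. The representation is an assumption of this sketch: atoms are tuples, clauses are (head, body) pairs, variables are strings starting with a capital letter, and the two clauses are assumed to share no variable names, so no renaming is performed.

```python
def is_var(term):
    """Prolog convention: variable names start with a capital letter."""
    return term[:1].isupper()

def walk(term, subst):
    """Follow variable bindings until reaching a constant or unbound variable."""
    while is_var(term) and term in subst:
        term = subst[term]
    return term

def unify(a, b):
    """Most general unifier of two function-free atoms, or None if none exists."""
    if a[0] != b[0] or len(a) != len(b):
        return None
    subst = {}
    for x, y in zip(a[1:], b[1:]):
        x, y = walk(x, subst), walk(y, subst)
        if x == y:
            continue
        if is_var(x):
            subst[x] = y
        elif is_var(y):
            subst[y] = x
        else:
            return None  # two different constants
    return subst

def substitute(atom, subst):
    return (atom[0],) + tuple(walk(t, subst) for t in atom[1:])

def resolve(clause, other, index):
    """Resolve body atom `index` of `clause` against the head of `other`."""
    head, body = clause
    other_head, other_body = other
    subst = unify(body[index], other_head)
    if subst is None:
        return None
    new_body = body[:index] + other_body + body[index + 1:]
    return substitute(head, subst), [substitute(a, subst) for a in new_body]

c1 = (("daughter", "X", "Y"), [("female", "X"), ("parent", "Y", "X")])
c2 = (("female", "sonja"), [])
resolvent = resolve(c1, c2, 0)
print(resolvent)
# (('daughter', 'sonja', 'Y'), [('parent', 'Y', 'sonja')])
```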
3.3
One of the most basic and most often considered tasks in machine learning is the
task of inductive concept learning (table 3.2). Given U, a universal set of objects
(observations), a concept C is a subset of objects in U, C ⊆ U. For example, if U is
the set of all patients in a given hospital, C could be the set of all patients diagnosed
with hepatitis A. The task of inductive concept learning is defined as follows: Given
instances and non-instances of concept C, find a hypothesis (classifier) H able to
tell whether x ∈ C, for each x ∈ U.
To define the task of inductive concept learning more precisely, we need to specify
U, the space of instances (examples), as well as the space of hypotheses considered.
This is done through specifying the languages of examples (L_E) and concept
descriptions (L_H). In addition, a coverage relation covers(H, e) has to be specified,
which tells us when an example e is considered to belong to the concept represented
by hypothesis H. Examples that belong to the target concept are termed positive;
those that do not are termed negative. Given positive and negative examples, we
want hypotheses that are complete (cover all positive examples) and consistent (do
not cover negative examples).
Table 3.2
Given:
    a language of examples L_E
    a language of concept descriptions L_H
    a covers relation between L_H and L_E, defining when
    an example e is covered by a hypothesis H: covers(H, e)
    sets of positive P and negative N examples described in L_E
Find hypothesis H from L_H, such that
    completeness: H covers all positive examples p ∈ P
    consistency: H does not cover any negative example n ∈ N
Looking at concept learning in a logical framework, De Raedt [11] considers
three settings for concept learning. The key aspect that varies in these settings is
the notion of coverage, but the languages L_E and L_H vary as well. We characterize
these for each of the three settings below.
In learning from entailment, the coverage relation is defined as covers(H, e) iff
H ⊨ e. The hypothesis logically entails the example. Here H is a clausal theory
and e is a clause.
In learning from interpretations, we have covers(H, e) iff e is a model of H. The
example has to be a model of the hypothesis. H is a clausal theory and e is a
Herbrand interpretation.
In learning from satisfiability, covers(H, e) iff H ∧ e ⊭ ⊥. The example and the
hypothesis taken together have to be satisfiable. Here both H and e are clausal
theories.
The setting of learning from entailment, introduced by Muggleton [34], is the
one that has received the most attention in the field of ILP. The alternative ILP
setting of learning from interpretations was proposed by De Raedt and Džeroski
[14]: this setting is a natural generalization of propositional learning. Many learning
algorithms for propositional learning have been upgraded to the learning from
interpretations ILP setting. Finally, the setting of learning from satisfiability was
introduced by Wrobel and Džeroski [50], but has rarely been used in practice due
to computational complexity problems.
De Raedt [11] also discusses the relationships among the three settings for concept
learning. Learning from finite interpretations reduces to learning from entailment.
Learning from entailment reduces to learning from satisfiability. Learning from
interpretations is thus the easiest and learning from satisfiability the hardest of the
three settings.
As introduced above, the logical settings for concept learning do not take into
account background knowledge, one of the essential ingredients of ILP. However,
the definitions of the settings are easily extended to take it into account. Given
background knowledge B, which in its most general form can be a clausal theory,
The most commonly addressed task in ILP is the task of learning logical definitions
of relations [40], where tuples that belong or do not belong to the target relation
are given as examples. From training examples ILP then induces a logic program
(predicate definition) corresponding to a view that defines the target relation in
terms of other relations that are given as background knowledge. This classical
ILP task is addressed, for instance, by the seminal MIS system [44] (rightfully
considered as one of the most influential ancestors of ILP) and one of the best
known ILP systems FOIL [40].
Given is a set of examples, i.e., tuples that belong to the target relation p (positive
examples) and tuples that do not belong to p (negative examples). Given are also
background relations (or background predicates) q_i that constitute the background
knowledge and can be used in the learned definition of p. Finally, a hypothesis
language, specifying syntactic restrictions on the definition of p, is also given (either
explicitly or implicitly). The task is to find a definition of the target relation p that
is consistent and complete, i.e., explains all the positive and none of the negative
tuples.
Formally, given is a set of examples E = P ∪ N, where P contains positive and N
negative examples, and background knowledge B. The task is to find a hypothesis
H such that ∀e ∈ P : B ∧ H ⊨ e (H is complete) and ∀e ∈ N : B ∧ H ⊭ e (H
is consistent), where ⊨ stands for logical implication or entailment. This setting,
introduced by Muggleton [34] (and discussed in the previous section), is thus also
called learning from entailment.
In the most general formulation, each e, as well as B and H, can be a clausal
theory. In practice, each e is most often a ground example (tuple), B is a relational
database (which may or may not contain views), and H is a definite logic program.
The semantic entailment (⊨) is in practice replaced with syntactic entailment (⊢) or
provability, where the resolution inference rule (as implemented in Prolog) is most
often used to prove examples from a hypothesis and the background knowledge. In
learning from entailment, a positive fact is explained if it can be found among the
answer substitutions for h produced by a query ?- b on database B, where h ← b
is a clause in H. In learning from interpretations, a clause h ← b from H is true in
the minimal Herbrand model of B if the query b ∧ ¬h fails on B.
As an illustration, consider the task of defining relation
daughter(X, Y), which states that person X is a daughter of person Y, in terms of
the background knowledge relations female and parent. These relations are given
in table 3.3. There are two positive and two negative examples of the target relation
daughter. In the hypothesis language of definite program clauses it is possible to
Background knowledge
parent(ann, mary).
parent(ann, tom).
parent(tom, eve).
parent(tom, ian).
female(ann).
female(mary).
female(eve).
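Completeness and consistency with respect to such background facts can be checked mechanically. The sketch below tests the clause daughter(X, Y) ← female(X), parent(Y, X) from the resolution example against the background facts listed above. The positive and negative example tuples are illustrative assumptions chosen to agree with those facts, and the function name `covers` is hypothetical.

```python
# Background knowledge, copied from the facts listed above.
parent = {("ann", "mary"), ("ann", "tom"), ("tom", "eve"), ("tom", "ian")}
female = {"ann", "mary", "eve"}

def covers(x, y):
    """daughter(X, Y) <- female(X), parent(Y, X), evaluated on the background facts."""
    return x in female and (y, x) in parent

# Illustrative examples (assumed, not taken from the table).
positives = [("mary", "ann"), ("eve", "tom")]
negatives = [("tom", "ann"), ("eve", "ann")]

complete = all(covers(x, y) for x, y in positives)        # covers every positive
consistent = not any(covers(x, y) for x, y in negatives)  # covers no negative
print(complete, consistent)  # True True
```

Here provability is replaced by direct lookup in the fact sets, which suffices because the clause is nonrecursive and the background knowledge consists of ground facts only.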
3.3.3
Initial efforts in ILP focused on relational rule induction, more precisely on concept
learning in first-order logic and synthesis of logic programs; cf. [34]. An overview
of early work is given in the textbook on ILP by Lavrač and Džeroski [30].
Representative early ILP systems addressing this task are Cigol [36], Foil [40],
Golem [37], and Linus [29]. More recent representative ILP systems are Progol
[35] and Aleph [46].
State-of-the-art ILP approaches now span most of the spectrum of data mining
tasks and use a variety of techniques to address these. The distinguishing features
of using multiple relations directly and discovering patterns expressed in first-order
logic are present throughout: the ILP approaches can thus be viewed as upgrades
of traditional approaches. Van Laer and De Raedt [48] (chapter 10 of [15]) present
a case study of upgrading a propositional approach to classification rule induction
to first-order logic. Note, however, that upgrading to first-order logic is non-trivial:
the expressive power of first-order logic implies computational costs and much work
is needed in balancing the expressive power of the pattern languages used and the
computational complexity of the data mining algorithm looking for such patterns.
This search for a balance between the two has occupied much of the ILP research
in the last ten years.
Present ILP approaches to multi-class classification involve the induction of
relational classification rules (ICL [48]), as well as first-order logical decision trees in
Tilde [3] and S-Cart [26]. ICL upgrades the propositional rule inducer CN2 [6].
Tilde and S-Cart upgrade decision tree induction as implemented in C4.5 [41] and
Cart [5]. A nearest-neighbor approach to relational classification is implemented
in Ribl [21] and its successor Ribl2.
Relational regression approaches upgrade propositional regression tree and rules
approaches. Tilde and S-Cart, as well as Ribl2, can handle continuous classes.
Fors [23] learns decision lists (ordered sets of rules) for relational regression.
The main nonpredictive or descriptive data mining tasks are clustering and
discovery of association rules. These have also been addressed in a first-order
logic setting. The Ribl distance measure has been used to perform hierarchical
agglomerative clustering in Rdbc, as well as k-means clustering (see section 3.7).
Section 3.6 describes a relational approach to the discovery of frequent queries and
query extensions, a first-order version of association rules.
With such a wide arsenal of RDM techniques, there is also a variety of practical
applications. ILP has been successfully applied to discover knowledge from relational data and background knowledge in the areas of molecular biology (including
drug design, protein structure prediction, and functional genomics), environmental sciences, traffic control, and natural language processing. An overview of such
applications is given by Džeroski [19] (chapter 14 in [15]).
3.3.4
One of the early approaches to ILP, implemented in the ILP system LINUS [29],
is based on the idea that the use of background knowledge can introduce new
attributes for learning. The learning problem is transformed from relational to attribute-value form and solved by an attribute-value learner. An advantage of this
approach is that data mining algorithms that work on a single table (and this
is the majority of existing data mining algorithms) become applicable after the
transformation.
This approach, however, is feasible only for a restricted class of ILP problems.
Thus, the hypothesis language of LINUS is restricted to function-free program
clauses which are typed (each variable is associated with a predetermined set of
values), constrained (all variables in the body of a clause also appear in the head),
and nonrecursive (the predicate symbol in the head does not appear in any of the
literals in the body).
The LINUS algorithm, which solves ILP problems by transforming them into
propositional form, consists of the following three steps:
The learning problem is transformed from relational to attribute-value form.
The transformed learning problem is solved by an attribute-value learner.
The induced hypothesis is transformed back into relational form.
The above algorithm allows for a variety of approaches developed for propositional problems, including noise-handling techniques in attribute-value algorithms,
such as CN2 [7], to be used for learning relations. It is illustrated on the simple
ILP problem of learning family relations. The task is to define the target relation
daughter(X, Y), which states that person X is a daughter of person Y, in terms of
the background knowledge relations female, male, and parent.
Table 3.4
Training examples         Background knowledge
daughter(mary, ann). ⊕    parent(X, Y) ← mother(X, Y).   mother(ann, mary).   female(ann).
daughter(eve, tom).  ⊕    parent(X, Y) ← father(X, Y).   mother(ann, tom).    female(mary).
daughter(tom, ann).  ⊖                                   father(tom, eve).    female(eve).
daughter(eve, ann).  ⊖                                   father(tom, ian).
All the variables are of the type person, defined as person = {ann, eve,
ian, mary, tom}. There are two positive and two negative examples of the target
relation. The training examples and the relations from the background knowledge
are given in table 3.3. However, since the LINUS approach can use nonground
background knowledge, let us assume that the background knowledge from table 3.4
is given.
Table 3.5
Class   Variables      Propositional features
C       X      Y       f(X)    f(Y)    m(X)    m(Y)    p(X,X)  p(X,Y)  p(Y,X)  p(Y,Y)
⊕       mary   ann     true    true    false   false   false   false   true    false
⊕       eve    tom     true    false   false   true    false   false   true    false
⊖       tom    ann     false   true    true    false   false   false   true    false
⊖       eve    ann     true    true    false   false   false   false   false   false
The first step of the algorithm, i.e., the transformation of the ILP problem into
attribute-value form, is performed as follows. The possible applications of the background predicates on the arguments of the target relation are determined, taking
into account argument types. Each such application introduces a new attribute. In
our example, all variables are of the same type person. The corresponding attribute-value learning problem is given in table 3.5, where f stands for female, m for male,
and p for parent. The attribute-value tuples are generalizations (relative to the given
background knowledge) of the individual facts about the target relation.
In table 3.5, variables stand for the arguments of the target relation, and propositional features denote the newly constructed attributes of the propositional learning
task. When learning function-free clauses, only the new attributes (propositional
features) are considered for learning.
In the second step, an attribute-value learning program induces the following
if-then rule from the tuples in table 3.5:
Class = ⊕ if [female(X) = true] ∧ [parent(Y, X) = true]
In the last step, the induced if-then rules are transformed into clauses. In our
example, we get the following clause:
daughter(X, Y) ← female(X), parent(Y, X).
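The three LINUS steps can be sketched on the daughter example as follows. This is a minimal Python illustration, not the LINUS implementation. The male facts are an assumption (the feature table implies male(tom), but the facts are not listed in the background knowledge), and the attribute-value learner is replaced by a hypothetical exhaustive search for the shortest conjunction of true-valued features that holds on exactly the positive tuples.

```python
from itertools import combinations, product

# Background knowledge of the family example. The male facts are an assumption
# made for this sketch; they are implied by the feature table, not listed above.
mother = {("ann", "mary"), ("ann", "tom")}
father = {("tom", "eve"), ("tom", "ian")}
female = {"ann", "mary", "eve"}
male = {"tom", "ian"}

def parent(a, b):
    # parent(X, Y) <- mother(X, Y).   parent(X, Y) <- father(X, Y).
    return (a, b) in mother or (a, b) in father

# Training examples as (X, Y, class), with True for a positive example.
examples = [("mary", "ann", True), ("eve", "tom", True),
            ("tom", "ann", False), ("eve", "ann", False)]

def propositionalize(x, y):
    """Step 1: apply each background predicate to the target arguments."""
    value = {"X": x, "Y": y}
    row = {}
    for v in "XY":
        row[f"female({v})"] = value[v] in female
    for v in "XY":
        row[f"male({v})"] = value[v] in male
    for a, b in product("XY", repeat=2):
        row[f"parent({a},{b})"] = parent(value[a], value[b])
    return row

def learn_rule(table):
    """Step 2 (stand-in learner): shortest conjunction of features that is
    true on exactly the positive tuples."""
    features = list(table[0][0])
    for size in range(1, len(features) + 1):
        for combo in combinations(features, size):
            if all(all(row[f] for f in combo) == label for row, label in table):
                return combo
    return None

def to_clause(rule):
    """Step 3: turn the if-then rule back into a clause."""
    return "daughter(X,Y) :- " + ", ".join(rule) + "."

table = [(propositionalize(x, y), label) for x, y, label in examples]
rule = learn_rule(table)
print(to_clause(rule))
```

On the four training examples the search settles on the features female(X) and parent(Y,X), which corresponds to the clause induced above.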
The LINUS approach has been extended to handle determinate clauses [16, 30],
which allow the introduction of determinate new variables (which have a unique
value for each training example). There also exist a number of other approaches to
propositionalization, some of them very recent: an overview is given by Kramer et
al. [28] (chapter 11 of [15]).
Let us emphasize again, however, that it is in general not possible to transform an
ILP problem into a propositional (attribute-value) form efficiently. De Raedt [12]
treats the relation between attribute-value learning and ILP in detail, showing that
propositionalization of some more complex ILP problems is possible, but results
in attribute-value problems that are exponentially large. This has also been the
main reason for the development of a variety of new RDM and ILP techniques by
upgrading propositional approaches.
3.3.5
search for patterns valid in the given data. The key differences lie in the representation of data and patterns, refinement operators/generality relationships, and
testing coverage (i.e., whether a rule explains an example).
Van Laer and De Raedt [48] explicitly formulate a recipe for upgrading propositional algorithms to deal with relational data and patterns. The key idea is to keep
as much of the propositional algorithm as possible and upgrade only the key notions. For rule induction, the key notions are the refinement operator and coverage
relationship. For distance-based approaches, the notion of distance is the key one.
By carefully upgrading the key notions of a propositional algorithm, an RDM/ILP
algorithm can be developed that has the original propositional algorithm as a special case.
The recipe has been followed (more or less exactly) to develop ILP systems for
rule induction, well before it was formulated explicitly. The well-known FOIL [40]
system can be seen as an upgrade of the propositional rule induction program CN2
[7]. Another well-known ILP system, PROGOL [35], can be viewed as upgrading
the AQ approach [33] to rule induction.
More recently, the upgrading approach has been used to develop a number of
RDM approaches that address data mining tasks other than binary classification.
These include the discovery of frequent Datalog patterns and relational association
rules [9] (chapter 8 of [15]), [8], the induction of relational decision trees (structural
classification and regression trees [27] and first-order logical decision trees [3]), and
relational distance-based approaches to classification and clustering ([25], chapter
9 of [15], [21]). The algorithms developed have as special cases well-known propositional algorithms, such as the APRIORI algorithm for finding frequent patterns;
the CART and C4.5 algorithms for learning decision trees; k-nearest neighbor classification, hierarchical and k-medoids clustering. In the following two sections, we
briefly review how the propositional approaches for association rule discovery and
decision tree induction have been lifted to a relational framework, highlighting the
key differences between the relational algorithms and their propositional counterparts.
3.4
From a data mining perspective, the task described above is a binary classification
task, where one of two classes is assigned to the examples (tuples): ⊕ (positive) or
hypothesis H := ∅
repeat {covering}
    clause c := p(X1, ..., Xn) ←
    repeat {specialization}
        build the set S of all refinements of c
        c := the best element of S (according to a heuristic)
    until stopping criterion is satisfied (B ∪ H ∪ {c} is consistent)
    add c to H
    delete all examples from P entailed by B ∪ H ∪ {c}
until stopping criterion is satisfied (B ∪ H ∪ {c} is complete)
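The two nested loops can be sketched in Python on the daughter example. This is a simplified illustration under several assumptions: refinements only add one of four hand-picked literals over the head variables, coverage is tested directly on the background facts rather than by entailment, and the heuristic (covered positives minus covered negatives) is a placeholder, not the heuristic of any particular ILP system.

```python
# Background knowledge B of the family example.
parent = {("ann", "mary"), ("ann", "tom"), ("tom", "eve"), ("tom", "ian")}
female = {"ann", "mary", "eve"}

# Candidate body literals, each a test on the head arguments (X, Y).
LITERALS = {
    "female(X)": lambda x, y: x in female,
    "female(Y)": lambda x, y: y in female,
    "parent(X, Y)": lambda x, y: (x, y) in parent,
    "parent(Y, X)": lambda x, y: (y, x) in parent,
}

def covers(body, example):
    """A clause covers an example if every body literal holds for it."""
    return all(LITERALS[lit](*example) for lit in body)

def learn_clause(pos, neg):
    """Specialization loop: add literals until no negative example is covered."""
    body = []
    while any(covers(body, e) for e in neg):
        candidates = [lit for lit in LITERALS if lit not in body]
        if not candidates:
            return None  # no consistent clause in this small language

        def score(lit):
            refined = body + [lit]
            p = sum(covers(refined, e) for e in pos)
            n = sum(covers(refined, e) for e in neg)
            return (p - n, p)

        body.append(max(candidates, key=score))
    return body

def covering(pos, neg):
    """Covering loop: learn clauses until all positive examples are covered."""
    hypothesis, remaining = [], list(pos)
    while remaining:
        body = learn_clause(remaining, neg)
        if body is None or not any(covers(body, e) for e in remaining):
            break
        hypothesis.append(body)
        remaining = [e for e in remaining if not covers(body, e)]
    return hypothesis

POS = [("mary", "ann"), ("eve", "tom")]
NEG = [("tom", "ann"), ("eve", "ann")]
hypothesis = covering(POS, NEG)
print(hypothesis)
```

For this data a single clause suffices: the inner loop first adds female(X), which still covers one negative example, and then parent(Y, X), after which the clause is consistent and covers both positive examples.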
Having described how to learn sets of clauses by using the covering algorithm for
clause/rule set induction, let us now look at some of the mechanisms underlying
single clause/rule induction. In order to search the space of relational rules (program
clauses) systematically, it is useful to impose some structure upon it, e.g., an
ordering. One such ordering is based on θ-subsumption, defined below.
A substitution θ = {V1/t1, ..., Vn/tn} is an assignment of terms ti to variables
Vi. Applying a substitution θ to a term, atom, or clause F yields the instantiated
term, atom, or clause Fθ where all occurrences of the variables Vi are simultaneously
replaced by the term ti. Let c and c′ be two program clauses. Clause c θ-subsumes
c′ if there exists a substitution θ, such that cθ ⊆ c′ [39].
To illustrate the above notions, consider the clause c = daughter(X, Y) ←
parent(Y, X). Applying the substitution θ = {X/mary, Y/ann} to clause c yields
cθ = daughter(mary, ann) ← parent(ann, mary).
Clauses can be viewed as sets of literals: the clausal notation daughter(X, Y) ←
parent(Y, X) thus stands for {daughter(X, Y), ¬parent(Y, X)} where all variables
are assumed to be universally quantified, ¬ denotes logical negation, and the commas denote disjunction. According to the definition, clause c θ-subsumes c′ if there is
a substitution θ that can be applied to c such that every literal in the resulting clause
occurs in c′. Clause c θ-subsumes c′ = daughter(X, Y) ← female(X), parent(Y, X)
under the empty substitution θ = ∅, since {daughter(X, Y), ¬parent(Y, X)} is a
proper subset of {daughter(X, Y), ¬female(X), ¬parent(Y, X)}. Furthermore, under the substitution θ = {X/mary, Y/ann}, clause c θ-subsumes the clause c″ =
daughter(mary, ann) ← female(mary), parent(ann, mary), parent(ann, tom).
θ-subsumption introduces a syntactic notion of generality. Clause c is at least
as general as clause c′ (c ≤ c′) if c θ-subsumes c′. Clause c is more general than
c′ (c < c′) if c ≤ c′ holds and c′ ≤ c does not. In this case, we say that c′ is a
specialization of c and c is a generalization of c′. If the clause c′ is a specialization
of c, then c′ is also called a refinement of c.
Under a semantic notion of generality, c is more general than c′ if c logically
entails c′ (c |= c′). If c θ-subsumes c′, then c |= c′. The reverse is not always true.
The syntactic, θ-subsumption-based, generality is computationally more feasible.
Namely, semantic generality is in general undecidable. Thus, syntactic generality is
frequently used in ILP systems.
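The subsumption test itself can be sketched as a small backtracking search. The representation below is an assumption made for illustration: a clause is a list of literals, each literal is a tuple whose first element is the (possibly negated) predicate name, variables are strings starting with an uppercase letter, and terms are function-free. Variables of the subsumed clause are treated as constants.

```python
def is_var(term):
    """Variables start with an uppercase letter, as in Prolog."""
    return term[0].isupper()

def match(literal, target, theta):
    """Extend theta so that literal under theta equals target, or return None."""
    if literal[0] != target[0] or len(literal) != len(target):
        return None
    theta = dict(theta)
    for s, t in zip(literal[1:], target[1:]):
        if is_var(s):
            if theta.setdefault(s, t) != t:
                return None  # variable already bound to a different term
        elif s != t:
            return None
    return theta

def theta_subsumes(c, d, theta=None):
    """Return a substitution mapping every literal of c into d, or None."""
    theta = {} if theta is None else theta
    if not c:
        return theta
    first, rest = c[0], c[1:]
    for target in d:
        extended = match(first, target, theta)
        if extended is not None:
            result = theta_subsumes(rest, d, extended)
            if result is not None:
                return result
    return None

# The clauses of the running example; "~" marks a negated (body) literal.
c = [("daughter", "X", "Y"), ("~parent", "Y", "X")]
c1 = [("daughter", "X", "Y"), ("~female", "X"), ("~parent", "Y", "X")]
c2 = [("daughter", "mary", "ann"), ("~female", "mary"),
      ("~parent", "ann", "mary"), ("~parent", "ann", "tom")]

print(theta_subsumes(c, c2))
print(theta_subsumes(c1, c))
```

The first call finds the substitution {X/mary, Y/ann}. The second returns None, so c is more general than c1 but not the other way round.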
The relation ≤ defined by θ-subsumption introduces a lattice on the set of
reduced clauses [39]: this enables ILP systems to prune large parts of the search
space. θ-subsumption also provides the basis for clause construction by top-down
searching of refinement graphs and bounding the search of refinement graphs
from below by using a bottom clause (which can be constructed as least general
generalizations, i.e., least upper bounds of example clauses in the θ-subsumption
lattice).
3.4.3
Most ILP approaches search the hypothesis space of program clauses in a top-down manner, from general to specific hypotheses, using a θ-subsumption-based
specialization operator. A specialization operator is usually called a refinement
operator [44]. Given a hypothesis language L, a refinement operator ρ maps a
clause c to a set of clauses ρ(c) which are specializations (refinements) of c:
ρ(c) = {c′ | c′ ∈ L, c < c′}.
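Figure 3.1 [Figure not reproduced: a refinement graph with the clause daughter(X, Y) ← at the top. The refinements shown include daughter(X, Y) ← X = Y, daughter(X, Y) ← female(X), daughter(X, Y) ← parent(Y, X), and daughter(X, Y) ← parent(X, Z). The clause daughter(X, Y) ← female(X) is refined further into daughter(X, Y) ← female(X), female(Y) and daughter(X, Y) ← female(X), parent(Y, X).]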
A refinement operator typically computes only the set of minimal (most general)
specializations of a clause under θ-subsumption. It employs two basic syntactic
operations:
apply a substitution to the clause, and
add a literal to the body of the clause.
The hypothesis space of program clauses is a lattice, structured by the θ-subsumption generality ordering. In this lattice, a refinement graph can be defined
as a directed, acyclic graph in which nodes are program clauses and arcs correspond
to the basic refinement operations: substituting a variable with a term, and adding
a literal to the body of a clause.
Figure 3.1 depicts a part of the refinement graph for the family relations problem
defined in table 3.3, where the task is to learn a definition of the daughter relation
in terms of the relations female and parent.
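A refinement operator for this problem can be sketched in a few lines of Python. The sketch is deliberately partial and rests on assumptions: clause bodies are lists of literal strings, the substitution X/Y is written as an equality literal, and only literals over the head variables are generated. Literals that introduce new variables, such as parent(X, Z) in figure 3.1, are not covered.

```python
from itertools import product

# Background predicates and their arities for the family relations problem.
PREDICATES = {"female": 1, "parent": 2}

def refinements(head_vars, body):
    """Return the bodies obtained by adding one literal over the head variables."""
    candidates = []
    # Unifying two head variables, written here as an equality literal.
    for i, a in enumerate(head_vars):
        for b in head_vars[i + 1:]:
            candidates.append(f"{a} = {b}")
    # Applying each background predicate to every tuple of head variables.
    for name, arity in PREDICATES.items():
        for args in product(head_vars, repeat=arity):
            candidates.append(f"{name}({', '.join(args)})")
    return [body + [lit] for lit in candidates if lit not in body]

# Refinements of the clause daughter(X, Y) <- with an empty body.
level1 = refinements(["X", "Y"], [])
for body in level1:
    print("daughter(X, Y) <-", ", ".join(body))
```

Calling refinements again on the body ["female(X)"] yields, among others, the body ["female(X)", "parent(Y, X)"], i.e., the target clause at the bottom of figure 3.1.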
At the top of the refinement graph (lattice) is the clause with an empty body
c = daughter(X, Y) ←. The refinement operator ρ generates the refinements of c,
which are of the form ρ(c) = {daughter(X, Y) ← L}, where L is one of the following
literals:
literals having as arguments the variables from the head of the clause: X = Y (applying a substitution X/Y), female(X), female(Y), parent(X, X), parent(X,
Y), parent(Y, X), and parent(Y, Y), and
3.5
Without loss of generality, we can say the task of relational prediction is defined
by a two-place target predicate target(ExampleID, ClassVar), which has as arguments an example ID and the class variable, and a set of background knowledge
predicates/relations. Depending on whether the class variable is discrete or continuous, we talk about relational classification or regression. Relational decision trees
are one approach to solving this task.
An example of a relational decision tree is given in figure 3.3. It predicts the maintenance action A to be taken on machine M (maintenance(M, A)), based on parts
the machine contains (haspart(M, X)), their condition (worn(X)), and ease of replacement (irreplaceable(X)). The target predicate here is maintenance(M, A),
the class variable is A, and background knowledge predicates are haspart(M, X),
worn(X), and irreplaceable(X).
Figure 3.2 A relational regression tree for predicting the degradation time
LogHLT of a chemical compound C (target predicate degrades(C, LogHLT)).
[Figure not reproduced. Surviving labels: the test atom(C, A3, o) and the leaves LogHLT=7.82, LogHLT=7.51, LogHLT=6.08, and LogHLT=6.73.]
Relational decision trees have much the same structure as propositional decision trees. Internal nodes contain tests, while leaves contain predictions for the
class value. If the class variable is discrete/continuous, we talk about relational
classification/regression trees. For regression, linear equations may be allowed in
the leaves instead of constant class-value predictions: in this case we talk about
relational model trees.
The tree in figure 3.3 is a relational classification tree, while the tree in
figure 3.2 is a relational regression tree. The latter predicts the degradation
time (the logarithm of the mean half-life time in water [18]) of a chemical
compound from its chemical structure, where the structure is represented by the
atoms in the compound and the bonds between them. The target predicate is
degrades(C, LogHLT), the class variable LogHLT, and the background knowledge predicates are atom(C, AtomID, Element) and bond(C, A1, A2, BondType).
The test at the root of the tree atom(C, A1, cl) asks if the compound C has a
chlorine atom A1 and the test along the left branch checks whether the chlorine
atom A1 is connected to a nitrogen atom A2.
As can be seen from the above examples, the major difference between propositional and relational decision trees is in the tests that can appear in internal nodes.
In the relational case, tests are queries, i.e., conjunctions of literals with existentially
quantified variables, e.g., atom(C, A1, cl) and haspart(M, X), worn(X). Relational
trees are binary: each internal node has a left (yes) and a right (no) branch. If the
query succeeds, i.e., if there exists an answer substitution that makes it true, the
yes branch is taken.
It is important to note that variables can be shared among nodes, i.e., a variable
introduced in a node can be referred to in the left (yes) subtree of that node. For
example, the X in irreplaceable(X) refers to the machine part X introduced in the
root node test haspart(M, X), worn(X). Similarly, the A1 in bond(C, A1, A2, BT )
refers to the chlorine atom introduced in the root node atom(C, A1, cl). One cannot
refer to variables introduced in a node in the right (no) subtree of that node.
For example, referring to the chlorine atom A1 in the right subtree of the tree in
figure 3.2 makes no sense, as going along the right (no) branch means that the
compound contains no chlorine atoms.
The actual test that has to be executed in a node is the conjunction of the
literals in the node itself and the literals on the path from the root of the tree
to the node in question. For example, the test in the node irreplaceable(X)
in figure 3.3 is actually haspart(M, X), worn(X), irreplaceable(X). In other
words, we need to send the machine back to the manufacturer for maintenance only if it has a part which is both worn and irreplaceable. Similarly,
the test in the node bond(C, A1, A2, BT), atom(C, A2, n) in figure 3.2 is in fact
atom(C, A1, cl), bond(C, A1, A2, BT), atom(C, A2, n). As a consequence, one cannot transform relational decision trees to logic programs in the fashion "one clause
per leaf" (unlike propositional decision trees, where a transformation "one rule per
leaf" is possible).
Table 3.8 A decision list representation of the relational regression tree for
predicting the biodegradability of a compound, given in figure 3.2
degrades(C, LogHLT) ← atom(C, A1, cl),
    bond(C, A1, A2, BT), atom(C, A2, n), LogHLT = 7.82, !
degrades(C, LogHLT) ← atom(C, A1, cl),
    LogHLT = 7.51, !
degrades(C, LogHLT) ← atom(C, A3, o),
    LogHLT = 6.08, !
degrades(C, LogHLT) ← LogHLT = 6.73.
Table 3.9 A logic program corresponding to the tree in figure 3.3
a(M) ← haspart(M, X), worn(X), irreplaceable(X)
b(M) ← haspart(M, X), worn(X)
maintenance(M, A) ← not b(M), A = no_maintenance
maintenance(M, A) ← b(M), not a(M), A = repair_in_house
maintenance(M, A) ← a(M), A = send_back
Relational decision trees can be easily transformed into first-order decision lists,
which are ordered sets of clauses (clauses in logic programs are unordered). When
applying a decision list to an example, we always take the first clause that applies
and return the answer produced. When applying a logic program, all applicable
clauses are used and a set of answers can be produced. First-order decision lists can
be represented by Prolog programs with cuts (!) [4]: cuts ensure that only the first
applicable clause is used.
A decision list is produced by traversing the relational regression tree in a depth-first fashion, going down left branches first. At each leaf, a clause is output that
contains the prediction of the leaf and all the conditions along the left (yes) branches
leading to that leaf. A decision list obtained from the tree in figure 3.3 is given in
table 3.7. For the first clause (send_back), the conditions in both internal nodes
are output, as the left branches out of both nodes have been followed to reach
the corresponding leaf. For the second clause, only the condition in the root is
output: to reach the repair_in_house leaf, the left (yes) branch out of the root has
been followed, but the right (no) branch out of the irreplaceable(X) node has been
followed. A decision list produced from the relational regression tree in figure 3.2
is given in table 3.8.
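The first-clause-wins behavior of the decision list in table 3.8 can be mimicked in Python with a chain of early returns, each playing the role of a clause ending in a cut. The data representation is an assumption made for this sketch: a compound is given by a dictionary of atom(C, AtomID, Element) facts and a set of bond triples, and bond facts are matched in the direction in which they are stored. The example compounds are hypothetical.

```python
def predict_log_hlt(atoms, bonds):
    """Apply the decision list of table 3.8 to one compound.

    atoms maps atom ids to elements (the atom facts of the compound).
    bonds is a set of (A1, A2, BondType) triples (the bond facts).
    """
    chlorines = {a for a, element in atoms.items() if element == "cl"}
    # Clause 1: atom(C, A1, cl), bond(C, A1, A2, BT), atom(C, A2, n)
    if any(a1 in chlorines and atoms.get(a2) == "n" for a1, a2, _ in bonds):
        return 7.82
    # Clause 2: atom(C, A1, cl)
    if chlorines:
        return 7.51
    # Clause 3: atom(C, A3, o)
    if "o" in atoms.values():
        return 6.08
    # Clause 4: default prediction
    return 6.73

# Hypothetical compounds used only to exercise the four clauses.
cl_bonded_to_n = ({"a1": "cl", "a2": "n"}, {("a1", "a2", "single")})
cl_and_distant_n = ({"a1": "cl", "a2": "c", "a3": "n"},
                    {("a1", "a2", "single"), ("a2", "a3", "single")})
oxygen_only = ({"a1": "o", "a2": "c"}, {("a1", "a2", "double")})
carbon_only = ({"a1": "c"}, set())

for compound in (cl_bonded_to_n, cl_and_distant_n, oxygen_only, carbon_only):
    print(predict_log_hlt(*compound))
```

The second compound shows why the variable A1 must be shared between the tests: it contains both a chlorine and a nitrogen atom, but they are not bonded to each other, so the first clause fails and the second applies.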
Generating a logic program from a relational decision tree is more complicated. It
requires the introduction of new predicates. We will not describe the transformation
process in detail, but rather give an example. A logic program, corresponding to
the tree in figure 3.3, is given in table 3.9.
3.5.2
The two major algorithms for inducing relational decision trees are upgrades of
the two most famous algorithms for inducing propositional decision trees. SCART
[26, 27] is an upgrade of CART [5], while TILDE [3, 13] is an upgrade of C4.5
[41]. According to the upgrading recipe, both SCART and TILDE have their
propositional counterparts as special cases. The actual algorithms thus closely follow
Table 3.10
decision trees
procedure DivideAndConquer(TestsOnYesBranchesSofar, DeclarativeBias, Examples)
if TerminationCondition(Examples)
then
    NewLeaf = CreateNewLeaf(Examples)
    return NewLeaf
else
    PossibleTestsNow = GenerateTests(TestsOnYesBranchesSofar, DeclarativeBias)
    BestTest = FindBestTest(PossibleTestsNow, Examples)
    (Split1, Split2) = SplitExamples(Examples, TestsOnYesBranchesSofar, BestTest)
    LeftSubtree = DivideAndConquer(TestsOnYesBranchesSofar ∧ BestTest, DeclarativeBias, Split1)
    RightSubtree = DivideAndConquer(TestsOnYesBranchesSofar, DeclarativeBias, Split2)
    return [BestTest, LeftSubtree, RightSubtree]
CART and C4.5. Here we illustrate the differences between SCART and CART by
looking at the TDIDT (top-down induction of decision trees) algorithm of SCART
(table 3.10).
Given a set of examples, the TDIDT algorithm first checks if a termination
condition is satisfied, e.g., if all examples belong to the same class c. If yes, a
leaf is constructed with an appropriate prediction, e.g., assigning the value c to the
class variable. Otherwise a test is selected among the possible tests for the node at
hand, examples are split into subsets according to the outcome of the test, and tree
construction proceeds recursively on each of the subsets. A tree is thus constructed
with the selected test at the root and the subtrees resulting from the recursive calls
attached to the respective branches.
The major difference in comparison to the propositional case is in the possible
tests that can be used in a node. While in CART these remain (more or less)
the same regardless of where the node is in the tree (e.g., A = v or A < v for
each attribute and attribute value), in SCART the set of possible tests crucially
depends on the position of the node in the tree. In particular, it depends on the tests
along the path from the root to the current node, more precisely on the variables
appearing in those tests and the declarative bias. To emphasize this, we can think
of a GenerateTests procedure being separately employed before evaluating the
tests. The inputs to this procedure are the tests on positive branches from the root
to the current node and the declarative bias. These are also inputs to the top level
TDIDT procedure.
The declarative bias in SCART contains statements of the form
schema(CofL,TandM), where CofL is a conjunction of literals and TandM is a
list of type and mode declarations for the variables in those literals. Two such
statements, used in the induction of the regression tree in figure 3.2, are as follows:
schema((bond(V, W, X, Y), atom(V, X, Z)), [V:chemical:+, W:atomid:+,
X:atomid:-, Y:bondtype:-, Z:element: =]), and schema(bond(V, W, X,
Y), [V:chemical:+, W:atomid:+, X:atomid:-, Y:bondtype: =]). In the
lists, each variable in the conjunction is followed by its type and mode declaration:
+ denotes that the variable must be bound (i.e., appear in TestsOnYesBranchesSofar), - that it must not be bound, and = that it must be replaced by a constant
value.
Assuming we have taken the left branch out of the root in figure 3.2,
TestsOnYesBranchesSofar = atom(C, A1, cl). Taking the declarative bias with
the two schema statements above, the only choices for replacing the variables V
and W in the schemata are the variables C and A1, respectively. The possible
tests at this stage are thus of the form bond(C, A1, A2, BT), atom(C, A2, E),
where E is replaced with an element (such as cl - chlorine, s - sulphur, or n - nitrogen), or of the form bond(C, A1, A2, BT), where BT is replaced with a bond
type (such as single, double, or aromatic). Among the possible tests, the test
bond(C, A1, A2, BT), atom(C, A2, n) is chosen.
The approaches to relational decision tree induction are among the fastest
multi-relational data mining approaches. They have been successfully applied to a
3.6
Dehaspe and colleagues [8], [9] (chapter 8 of [15]) consider patterns in the form
of Datalog queries, which reduce to SQL queries. A Datalog query has the form
?- A1, A2, ..., An, where the Ai's are logical atoms.
An example Datalog query is
?- person(X), parent(X, Y), hasPet(Y, Z).
This query on a Prolog database containing predicates person, parent, and hasPet
is equivalent to the SQL query
select Person.Id, Parent.Kid, HasPet.Aid
from Person, Parent, HasPet
where Person.Id = Parent.Pid
and Parent.Kid = HasPet.Pid
on a database containing relations Person with argument Id, Parent with
arguments Pid and Kid, and HasPet with arguments Pid and Aid. This query
finds triples (x, y, z), where child y of person x has pet z.
Datalog queries can be viewed as a relational version of itemsets (which are sets
of items occurring together). Consider the itemset {person, parent, child, pet}. The
market-basket interpretation of this pattern is that a person, a parent, a child, and
a pet occur together. This is also partly the meaning of the above query. However,
the variables X, Y , and Z add extra information: the person and the parent are
the same, the parent and the child belong to the same family, and the pet belongs
to the child. This illustrates the fact that queries are a more expressive variant of
itemsets.
To discover frequent patterns, we need to have a notion of frequency. Given
that we consider queries as patterns and that queries can have variables, it is not
immediately obvious what the frequency of a given query is. This is resolved by
specifying an additional parameter of the pattern discovery task, called the key. The
key is an atom which has to be present in all queries considered during the discovery
process. It determines what is actually counted. In the above query, if person(X) is
the key, we count persons; if parent(X, Y ) is the key, we count (parent,child) pairs;
and if hasPet(Y, Z) is the key, we count (owner,pet) pairs. This is described more
precisely below.
Submitting a query Q = ?- A1, A2, ..., An with variables {X1, ..., Xm} to a Datalog database r corresponds to asking whether a grounding substitution exists (which
replaces each of the variables in Q with a constant), such that the conjunction
A1, A2, ..., An holds in r. The answer to the query produces answering substitutions θ = {X1/a1, ..., Xm/am} such that Qθ succeeds. The set of all answering
substitutions obtained by submitting a query Q to a Datalog database r is denoted
answerset(Q, r).
The absolute frequency of a query Q is the number of answer substitutions θ
for the variables in the key atom for which the query Qθ succeeds in the given
database, i.e., a(Q, r, key) = |{θ ∈ answerset(key, r) | Qθ succeeds w.r.t. r}|. The
relative frequency (support) can be calculated as f(Q, r, key) = a(Q, r, key)/|{θ ∈
answerset(key, r)}|. Assuming the key is person(X), the absolute frequency for
our query involving parents, children, and pets can be calculated by the following
SQL statement:
select count(distinct Person.Id)
from Person, Parent, HasPet
where Person.Id = Parent.Pid
and Parent.Kid = HasPet.Pid
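The effect of the key on what is counted can be checked with a small in-memory database. The sketch below uses Python's sqlite3 module. The table contents are hypothetical, and the counting statement is one valid way of writing the query in SQLite, not necessarily the form used by any RDM system.

```python
import sqlite3

# A tiny hypothetical database with the three relations used in the text.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table Person (Id text);
    create table Parent (Pid text, Kid text);
    create table HasPet (Pid text, Aid text);
    insert into Person values ('ann'), ('tom'), ('eve'), ('ian');
    insert into Parent values ('ann', 'tom'), ('tom', 'eve'), ('tom', 'ian');
    insert into HasPet values ('eve', 'cat1'), ('ian', 'dog1'), ('ann', 'cat2');
""")

# The join corresponding to ?- person(X), parent(X, Y), hasPet(Y, Z).
JOIN = """
    from Person, Parent, HasPet
    where Person.Id = Parent.Pid
    and Parent.Kid = HasPet.Pid
"""

def scalar(sql):
    """Run a query that returns a single value."""
    return conn.execute(sql).fetchone()[0]

# Key person(X): count the distinct persons for which the query succeeds.
absolute = scalar("select count(distinct Person.Id)" + JOIN)
key_answers = scalar("select count(distinct Id) from Person")
relative = absolute / key_answers

# Key parent(X, Y): count the distinct (parent, child) pairs instead.
pairs = scalar(
    "select count(*) from (select distinct Parent.Pid, Parent.Kid" + JOIN + ")"
)

print(absolute, relative, pairs)
```

In this data only tom has children with pets, so the absolute frequency with key person(X) is 1 out of 4 persons, while with key parent(X, Y) two (parent,child) pairs are counted.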
Association rules have the form A → C and the intuitive market-basket interpretation "customers that buy A typically also buy C." If itemsets A and C have
supports fA and fC, respectively, the confidence of the association rule is defined
to be cA→C = fC/fA. The task of association rule discovery is to find all association rules A → C, where fC and cA→C exceed prespecified thresholds (minsup and
minconf).
Association rules are typically obtained from frequent itemsets. Suppose we have
two frequent itemsets A and C, such that A ⊂ C, where C = A ∪ B. If the support
of A is fA and the support of C is fC, we can derive an association rule A → B,
which has confidence fC/fA. Treating the arrow as implication, note that we can
derive A → C from A → B (A → A and A → B implies A → A ∧ B, i.e., A → C).
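As a small worked example of this derivation, the following Python sketch computes rules A → B and their confidences fC/fA from a table of itemset supports. The itemsets and counts are hypothetical, and supports are given as absolute counts so that the ratios are easy to verify by hand.

```python
# Hypothetical absolute supports (number of baskets containing each itemset).
support = {
    frozenset({"bread"}): 12,
    frozenset({"butter"}): 10,
    frozenset({"bread", "butter"}): 9,
}

def association_rules(support, minconf):
    """Derive rules A -> B from frequent itemsets A and C, where A is a
    proper subset of C and B = C - A; the confidence is f_C / f_A."""
    rules = []
    for a, f_a in support.items():
        for c, f_c in support.items():
            if a < c:  # a is a proper subset of c
                confidence = f_c / f_a
                if confidence >= minconf:
                    rules.append((sorted(a), sorted(c - a), confidence))
    return rules

rules = association_rules(support, minconf=0.8)
print(rules)
```

Here butter → bread has confidence 9/10 and passes the threshold, while bread → butter has confidence 9/12 and is discarded.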
Relational association rules can be derived in a similar manner from frequent
Datalog queries. From two frequent queries Q1 = ?- l1, ..., lm and Q2 = ?- l1, ..., lm,
lm+1, ..., ln, where Q1 θ-subsumes Q2, we can derive a relational association rule
Q1 → Q2. Since Q2 extends Q1, such a relational association rule is named a query
extension.
The task of discovering frequent queries is addressed by the RDM system WARMR
[8]. WARMR takes as input a database r, a frequency threshold minfreq, and
declarative language bias L. L specifies a key atom and input-output modes for
predicates/relations, discussed below.
WARMR upgrades the well-known APRIORI algorithm for discovering frequent
patterns, which performs levelwise search [1] through the lattice of itemsets. APRIORI starts with the empty set of items and at each level l considers sets of items
of cardinality l. The key to the efficiency of APRIORI lies in the fact that a large
frequent itemset can only be generated by adding an item to a frequent itemset.
Candidates at level l + 1 are thus generated by adding items to frequent itemsets
obtained at level l. Further efficiency is achieved using the fact that all subsets of a
frequent itemset have to be frequent: only candidates that pass this test have their
frequency determined by scanning the database.
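The levelwise search just described can be sketched compactly in Python. This is an illustrative version, not an optimized implementation: minsup is assumed to be an absolute count, the transactions are hypothetical, and the database scan is a plain loop over all transactions.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Levelwise search for itemsets contained in at least minsup transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))

    def count(itemset):
        # One scan of the database for a single candidate.
        return sum(itemset <= t for t in transactions)

    frequent = {}
    level = [frozenset()]  # start with the empty set of items
    while level:
        # Scan the database and keep the frequent candidates of this level.
        current = {c: count(c) for c in level}
        current = {c: n for c, n in current.items() if n >= minsup}
        frequent.update(current)
        # Generate candidates by adding one item to a frequent itemset.
        candidates = {c | {i} for c in current for i in items if i not in c}
        # Prune candidates that have an infrequent subset one item smaller.
        level = [c for c in candidates
                 if all(frozenset(s) in current
                        for s in combinations(c, len(c) - 1))]
    frequent.pop(frozenset(), None)  # the empty itemset is not reported
    return frequent

baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
result = apriori(baskets, minsup=2)
print(result)
```

In this run the candidate {a, b, c} is never counted against the database: its subset {b, c} is infrequent, so it is pruned before the scan.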
In analogy to APRIORI, WARMR searches the lattice of Datalog queries for
queries that are frequent in the given database r. In analogy to itemsets, a more
complex (specific) frequent query Q2 can only be generated from a simpler (more
general) frequent query Q1 (where Q1 is more general than Q2 if Q1 θ-subsumes
Q2; see section 3.4.2 for a definition of θ-subsumption). WARMR thus starts with
the query ?- key at level 1 and generates candidates for frequent queries at level
l + 1 by refining (adding literals to) frequent queries obtained at level l.
Table 3.11
WARMR
warmode_key(person(-)).
warmode(parent(+, -)).
warmode(hasPet(+, cat)).
warmode(hasPet(+, dog)).
warmode(hasPet(+, lizard)).
Suppose we are given a Prolog database containing the predicates person, parent,
and hasPet, and the declarative bias in table 3.11. The latter contains the key atom
person(X) and input-output modes for the relations parent and hasPet. Input-output modes specify whether a variable argument of an atom in a query has to
appear earlier in the query (+), must not (-), or may, but need not (±). Input-output modes thus place constraints on how queries can be refined, i.e., what atoms
may be added to a given query.
Given the above, WARMR starts the search of the refinement graph of queries
at level 1 with the query ?- person(X). At level 2, the literals parent(X, Y),
hasPet(X, cat), hasPet(X, dog), and hasPet(X, lizard) can be added to this query,
yielding the queries ?- person(X), parent(X, Y), ?- person(X), hasPet(X, cat),
?- person(X), hasPet(X, dog), and ?- person(X), hasPet(X, lizard). Taking the
first of the level 2 queries, the following literals are added to obtain level 3 queries:
parent(Y, Z) (note that parent(Y, X) cannot be added, because X already appears
in the query being refined), hasPet(Y, cat), hasPet(Y, dog), and hasPet(Y, lizard).
While all subsets of a frequent itemset must be frequent in APRIORI, not all
subqueries of a frequent query need be frequent queries in WARMR. Consider the
query ?- person(X), parent(X, Y), hasPet(Y, cat) and assume it is frequent. The
subquery ?- person(X), hasPet(Y, cat) is not allowed, as it violates the declarative
bias constraint that the first argument of hasPet has to appear earlier in the query.
This causes some complications in pruning the generated candidates for frequent
queries: WARMR keeps a list of infrequent queries and checks whether the generated
candidates are subsumed by a query in this list. The WARMR algorithm is given
in table 3.12.
WARMR upgrades APRIORI to a multi-relational setting following the upgrading
recipe (see section 3.3.5). The major differences are in finding the frequency of
queries (where we have to count answer substitutions for the key atom) and the
candidate query generation (by using a refinement operator and declarative bias).
WARMR has APRIORI as a special case: if we only have predicates of zero arity
(with no arguments), which correspond to items, WARMR can be used to discover
frequent itemsets.
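The zero-arity special case can be sketched in a few lines of Python. This is only an illustration of the level-wise loop, not the WARMR implementation: the function name `frequent_itemsets`, the representation of transactions as Python sets, and the use of the empty itemset in the role of the query ?- key are all assumptions made for this example. Refinement adds one item, pruning discards candidates that contain an infrequent set, and a candidate is never generated twice.

```python
def frequent_itemsets(transactions, minfreq):
    """Level-wise search over itemsets, mirroring the WARMR loop."""
    items = sorted(set().union(*transactions))
    frequent, infrequent = [], []
    candidates = [frozenset()]          # plays the role of the query "?- key"
    while candidates:
        survivors = []
        for cand in candidates:
            # Frequency = number of transactions that contain the candidate.
            freq = sum(1 for t in transactions if cand <= t)
            (survivors if freq >= minfreq else infrequent).append(cand)
        frequent.extend(survivors)
        # Refinement: add one item to each frequent set from this level.
        candidates = []
        for q in survivors:
            for item in items:
                if item in q:
                    continue
                refined = q | {item}
                if any(bad <= refined for bad in infrequent):
                    continue            # more specific than an infrequent set
                if refined in candidates or refined in frequent:
                    continue            # equivalent set already generated
                candidates.append(refined)
    return [s for s in frequent if s]   # drop the empty starting set


transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
result = frequent_itemsets(transactions, minfreq=2)
```

With these four hypothetical transactions, the set {b, c} is infrequent, so the candidate {a, b, c} is pruned without ever being counted.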
More importantly, WARMR has as special cases a number of approaches that
extend the discovery of frequent itemsets with, e.g., hierarchies on items [45], as
well as approaches to discovering sequential patterns [2], including general
episodes [32]. The individual approaches mentioned make use of the specific properties
of the patterns considered (very limited use of variables) and are more efficient
than WARMR for the particular tasks they address. The high expressive power
of the language of patterns considered has its computational costs, but it also has
the important advantage that a variety of different pattern types can be explored
without any changes in the implementation.
Table 3.12 The WARMR algorithm
1. Initialize level $d := 1$
2. Initialize the set of candidate queries $Q_1 := \{$?- key$\}$
3. Initialize the set of (in)frequent queries $F := \emptyset$; $I := \emptyset$
4. While $Q_d$ not empty
5.    Find frequency of all queries $Q \in Q_d$
6.    Move those with frequency below minfreq to $I$
7.    Update $F := F \cup Q_d$
8.    Compute new candidates: $Q_{d+1}$ = WARMRgen($L; I; F; Q_d$)
9.    Increment $d$
10. Return $F$
Function WARMRgen($L; I; F; Q_d$);
1. Initialize $Q_{d+1} := \emptyset$
2. For each $Q_j \in Q_d$, and for each refinement $Q'_j \in L$ of $Q_j$:
      Add $Q'_j$ to $Q_{d+1}$, unless:
      (i) $Q'_j$ is more specific than some query $\in I$, or
      (ii) $Q'_j$ is equivalent to some query $\in Q_{d+1} \cup F$
3. Return $Q_{d+1}$
WARMR can be (and has been) used to perform propositionalization, i.e., to
transform MRDM problems to propositional (single table) form. WARMR is first
used to discover frequent queries. In the propositional form, examples correspond
to answer substitutions for the key atom and the binary attributes are the frequent
queries discovered. An attribute is true for an example if the corresponding query
succeeds for the corresponding answer substitution. This approach has been applied
with considerable success to the tasks of predictive toxicology [10] and genome-wide
prediction of protein functional class [24].
3.7
Figure 3.3 A relational decision tree, predicting the class variable A in the target
predicate maintenance(M, A).
Propositional distance measures are defined between examples that have the form
of vectors of attribute values. They essentially sum up the differences between the
examples' values along each of the dimensions of the vectors. Given two examples
$x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$, their distance might be calculated as
\[
\mathrm{distance}(x, y) = \sum_{i=1}^{n} \mathrm{difference}(x_i, y_i)/n,
\]
\[
\mathrm{difference}(x_i, y_i) =
\begin{cases}
|x_i - y_i| & \text{if continuous,} \\
0 & \text{if discrete and } x_i = y_i, \\
1 & \text{otherwise.}
\end{cases}
\]
In a relational representation, an example (also called instance or case) can
be described by a set of facts about multiple relations. A fact of the target
predicate of the form target(ExampleID, A1, ..., An) specifies an instance
through its ID and properties, and additional information can be specified through
background knowledge predicates. In table 3.13, the target predicate
member(PersonID, A, G, I, MT) specifies information on members of a particular club,
which includes age, gender, income, and membership type. The background
predicates car(OwnerID, CT, TS, M) and house(OwnerID, DistrictID, Y, S) provide
information on property owned by club members: for cars this includes car
type, top speed, and manufacturer; for houses the district, construction year,
and size. Additional information is available on districts through the predicate
district(DistrictID, P, S, C), i.e., the popularity, size, and country of the district.
Table 3.13
The basic idea behind the RIBL [21] distance measure is as follows. To calculate
the distance between two objects/examples, their properties are taken into account
first (at depth 0). Next (at depth 1), objects immediately related to the two
original objects are taken into account, or more precisely, the distances between
the corresponding related objects. At depth 2, objects related to those at depth 1
are taken into account, and so on, until a user-specified depth limit is reached.
In our example, when calculating the distance between e1 = member(person1, 45,
male, 20, gold) and e2 = member(person2, 30, female, 10, platinum), the
properties of the persons (age, gender, income, membership type) are first compared and
differences between them calculated and summed (as in the propositional case). At
depth 1, cars and houses owned by the two persons are compared, i.e., distances
between them are calculated. At depth 2, the districts where the houses reside are
taken into account when calculating the distances between houses. Before beginning
to calculate distances, RIBL collects all facts related to a person into a so-called
case. The case for person1 generated with a depth limit of 2 is given in figure 3.4.
Let us calculate the distance between the two club members according to the
distance measure: $d(e_1, e_2) = \frac{1}{5}(d(person1, person2) + d(45, 30) + d(male, female) +
d(20, 10) + d(gold, platinum))$. With a depth limit of 0, the identifiers person1
and person2 are treated as discrete values, $d(person1, person2) = 1$, and we have
$d(e_1, e_2) = (1 + (45 - 30)/100 + 1 + (20 - 10)/50 + 1)/5 = 0.67$; the denominators
100 and 50 denote the highest possible differences in age and income.
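The depth-0 computation can be reproduced with a short Python sketch. The function names `difference` and `distance` and the `ranges` tuple are illustrative choices for this example, not part of RIBL; continuous attributes are scaled by the highest possible difference (100 for age, 50 for income), as in the calculation above.

```python
def difference(a, b, value_range=None):
    """Per-attribute difference: scaled |a - b| if continuous, 0/1 if discrete."""
    if value_range is not None:
        return abs(a - b) / value_range
    return 0.0 if a == b else 1.0


def distance(x, y, ranges):
    """Average of the per-attribute differences (depth limit 0)."""
    assert len(x) == len(y) == len(ranges)
    return sum(difference(a, b, r) for a, b, r in zip(x, y, ranges)) / len(x)


e1 = ("person1", 45, "male", 20, "gold")
e2 = ("person2", 30, "female", 10, "platinum")
# None marks a discrete attribute; a number is the largest possible difference.
ranges = (None, 100, None, 50, None)

d = distance(e1, e2, ranges)  # (1 + 0.15 + 1 + 0.2 + 1) / 5
```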
To calculate d(person1, person2) at level 1, we collect the facts directly related
to the two persons and partition them according to the predicates. Thus we have
Figure 3.4
Once we have a relational distance measure, we can easily adapt classical statistical
approaches to prediction and clustering, such as the nearest-neighbor method and
hierarchical agglomerative clustering, to work on relational data. This is precisely
what has been done with the RIBL distance measure.
The original RIBL [21] addresses the problem of prediction, more precisely
classification. It uses the k-nearest neighbor method in conjunction with the RIBL
distance measure to solve the problem addressed. RIBL was successfully applied to
the practical problem of diterpene structure elucidation [17], where it outperformed
propositional approaches as well as a number of other relational approaches.
RIBL2 [25] upgrades the RIBL distance measure by considering lists and
terms as elementary types, much like discrete and numeric values. Edit distances
are used for these, while the RIBL distance measure is followed otherwise.
RIBL2 has been used to predict mRNA signal structure and to automatically
discover previously uncharacterized mRNA signal structure classes [25].
Two clustering approaches have been developed that use the RIBL distance
measure [25]. RDBC uses hierarchical agglomerative clustering, while FORC adapts
the k-means approach. The latter relies on finding cluster centers, which is easy for
numeric vectors but far from trivial in the relational case. FORC thus uses the
k-medoids method, which defines a cluster center as the existing case/example that
has the smallest sum of squared distances to all other cases in the cluster and only
uses distance information.
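The medoid definition needs nothing but a distance function, which a minimal Python sketch makes concrete. The function name `medoid` and the numeric toy cluster are assumptions for illustration only; in the relational setting `dist` would be a relational distance such as the RIBL measure.

```python
def medoid(cluster, dist):
    """Return the member with the smallest sum of squared distances to the rest."""
    return min(cluster, key=lambda c: sum(dist(c, other) ** 2 for other in cluster))


# Toy example with numbers standing in for cases and |a - b| as the distance.
cluster = [1, 2, 3, 10]
center = medoid(cluster, lambda a, b: abs(a - b))
```

Unlike a mean, the returned center is always an existing case, so no averaging of relational descriptions is ever required.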
3.8
References
[1] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo. Fast
discovery of association rules. In U. Fayyad, G. Piatetsky-Shapiro, P. Smyth,
4.1
Introduction
Relational data has two characteristics: first, statistical dependencies exist between
the entities we wish to model, and second, each entity often has a rich set of features
that can aid classification. For example, when classifying web documents, the page's
text provides much information about the class label, but hyperlinks define a
relationship between pages that can improve classification [55]. Graphical models
are a natural formalism for exploiting the dependence structure among entities.
Traditionally, graphical models have been used to represent the joint probability
distribution $p(y, x)$, where the variables $y$ represent the attributes of the entities
that we wish to predict, and the input variables $x$ represent our observed knowledge
about the entities. But modeling the joint distribution can lead to difficulties when
using the rich local features that can occur in relational data, because it requires
modeling the distribution $p(x)$, which can include complex dependencies. Modeling
these dependencies among inputs can lead to intractable models, but ignoring them
can lead to reduced performance.
A solution to this problem is to directly model the conditional distribution $p(y \mid x)$,
which is sufficient for classification. This is the approach taken by conditional
random fields (CRFs) [24]. A CRF is simply a conditional distribution $p(y \mid x)$ with
4.2
Graphical Models
4.2.1
Definitions
the form
\[
p(x, y) = \frac{1}{Z} \prod_{A} \Psi_A(x_A, y_A), \tag{4.1}
\]
for some real-valued parameter vector $\theta_A$, and for some set of feature functions or
sufficient statistics $\{f_{Ak}\}$. This form ensures that the family of distributions over $V$
parameterized by $\theta$ is an exponential family. Much of the discussion in this chapter
actually applies to exponential families in general.
A directed graphical model, also known as a Bayesian network, is based on a
directed graph $G = (V, E)$. A directed model is a family of distributions that
factorize as
\[
p(y, x) = \prod_{v \in V} p(v \mid \pi(v)), \tag{4.4}
\]
where $\pi(v)$ denotes the parents of $v$ in $G$.
4.2.2
Classification
First we discuss the problem of classification, that is, predicting a single class
variable $y$ given a vector of features $x = (x_1, x_2, \ldots, x_K)$. One simple way to
accomplish this is to assume that once the class label is known, all the features
are independent. The resulting classifier is called the naive Bayes classifier. It is
based on a joint probability model of the form
\[
p(y, x) = p(y) \prod_{k=1}^{K} p(x_k \mid y). \tag{4.5}
\]
This model can be described by the directed model shown in figure 4.1 (left). We
can also write this model as a factor graph, by defining a factor $\Psi(y) = p(y)$, and
a factor $\Psi_k(y, x_k) = p(x_k \mid y)$ for each feature $x_k$. This factor graph is shown in
figure 4.1 (right).
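As a small illustration of (4.5), the following Python sketch multiplies one prior factor and one factor per feature, then normalizes over the class to obtain a posterior. The class names, the two binary features, and all probability values are made up for this example.

```python
import itertools

# Hypothetical model: two classes, two binary features.
prior = {"spam": 0.4, "ham": 0.6}
cond = [
    {"spam": {0: 0.2, 1: 0.8}, "ham": {0: 0.7, 1: 0.3}},  # p(x_1 | y)
    {"spam": {0: 0.5, 1: 0.5}, "ham": {0: 0.9, 1: 0.1}},  # p(x_2 | y)
]


def joint(y, x):
    """p(y, x) = p(y) * prod_k p(x_k | y), the product of the factors."""
    p = prior[y]
    for k, x_k in enumerate(x):
        p *= cond[k][y][x_k]
    return p


def posterior(x):
    """p(y | x), obtained by normalizing the joint over the class variable."""
    scores = {y: joint(y, x) for y in prior}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}


# The joint sums to 1 over every class and every feature assignment.
total = sum(joint(y, x) for y in prior for x in itertools.product((0, 1), repeat=2))
post = posterior((1, 1))
```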
Another well-known classifier that is naturally represented as a graphical model is
logistic regression (sometimes known as the maximum entropy classifier in the NLP
community). In statistics, this classifier is motivated by the assumption that the log
probability, $\log p(y \mid x)$, of each class is a linear function of $x$, plus a normalization
constant. This leads to the conditional distribution:
\[
p(y \mid x) = \frac{1}{Z(x)} \exp\left\{ \lambda_y + \sum_{j=1}^{K} \lambda_{y,j} x_j \right\}, \tag{4.6}
\]
where $Z(x) = \sum_y \exp\{\lambda_y + \sum_{j=1}^{K} \lambda_{y,j} x_j\}$ is a normalizing constant, and $\lambda_y$ is a
bias weight that acts like $\log p(y)$ in naive Bayes. Rather than using one vector per
class, as in (4.6), we can use a different notation in which a single set of weights is
shared across all the classes. The trick is to define a set of feature functions that are
nonzero only for a single class. To do this, the feature functions can be defined as
$f_{y',j}(y, x) = \mathbf{1}_{\{y' = y\}} x_j$ for the feature weights and $f_{y'}(y, x) = \mathbf{1}_{\{y' = y\}}$ for the bias
weights. Now we can use $f_k$ to index each feature function $f_{y',j}$, and $\lambda_k$ to index
its corresponding weight $\lambda_{y',j}$. Using this notational trick, the logistic regression
model becomes:
\[
p(y \mid x) = \frac{1}{Z(x)} \exp\left\{ \sum_{k=1}^{K} \lambda_k f_k(y, x) \right\}. \tag{4.7}
\]
We introduce this notation because it mirrors the usual notation for CRFs.
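That (4.6) and (4.7) describe the same model can be checked numerically. The sketch below, with hypothetical class names and weights, builds the class-specific indicator features and flattens the per-class biases and weight vectors into a single vector; the helper names `per_class_form`, `build_features`, and `shared_weight_form` are invented for this example.

```python
import math


def per_class_form(x, bias, weights):
    """Equation (4.6): one bias and one weight vector per class."""
    scores = {y: bias[y] + sum(w * xj for w, xj in zip(weights[y], x)) for y in bias}
    z = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / z for y, s in scores.items()}


def build_features(classes, num_inputs):
    """Feature functions that are nonzero only for a single class."""
    feats = []
    for c in classes:
        feats.append(lambda y, x, c=c: 1.0 if y == c else 0.0)            # bias feature
        for j in range(num_inputs):
            feats.append(lambda y, x, c=c, j=j: x[j] if y == c else 0.0)  # input feature
    return feats


def shared_weight_form(x, classes, feats, lam):
    """Equation (4.7): a single weight vector indexed by k."""
    scores = {y: sum(l * f(y, x) for l, f in zip(lam, feats)) for y in classes}
    z = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / z for y, s in scores.items()}


classes = ["person", "location"]
bias = {"person": 0.5, "location": -0.2}
weights = {"person": [1.0, -2.0], "location": [0.3, 0.8]}
x = [1.5, 0.5]

feats = build_features(classes, len(x))
lam = []
for c in classes:                 # same ordering as in build_features
    lam.append(bias[c])
    lam.extend(weights[c])

p_a = per_class_form(x, bias, weights)
p_b = shared_weight_form(x, classes, feats, lam)
```

Both functions return the same distribution over classes; only the bookkeeping of the weights differs.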
4.2.2.2
Sequence Models
Classifiers predict only a single class variable, but the true power of graphical
models lies in their ability to model many variables that are interdependent. In this
section, we discuss perhaps the simplest form of dependency, in which the output
variables are arranged in a sequence. To motivate this kind of model, we discuss
an application from NLP, the task of named-entity recognition (NER). NER is the
problem of identifying and classifying proper names in text, including locations,
such as China; people, such as George Bush; and organizations, such as the United
Nations. The NER task is, given a sentence, first to segment which words are part
of entities, and then to classify each entity by type (person, organization, location,
and so on). The challenge of this problem is that many named entities are too rare
to appear even in a large training set, and therefore the system must identify them
based only on context.
One approach to NER is to classify each word independently as one of either
Person, Location, Organization, or Other (meaning not an entity). The
problem with this approach is that it assumes that given the input, all of the
named-entity labels are independent. In fact, the named-entity labels of neighboring words
are dependent; for example, while New York is a location, New York Times is an
organization.
This independence assumption can be relaxed by arranging the output variables
in a linear chain. This is the approach taken by HMMs [42]. An HMM models a
sequence of observations $X = \{x_t\}_{t=1}^{T}$ by assuming that there is an underlying
sequence of states $Y = \{y_t\}_{t=1}^{T}$ drawn from a finite state set $S$. In the named-entity
example, each observation $x_t$ is the identity of the word at position $t$, and each state
$y_t$ is the named-entity label, that is, one of the entity types Person, Location,
Organization, and Other.
To model the joint distribution $p(y, x)$ tractably, an HMM makes two
independence assumptions. First, it assumes that each state depends only on its immediate
predecessor, that is, each state $y_t$ is independent of all its ancestors $y_1, y_2, \ldots, y_{t-2}$
given its previous state $y_{t-1}$. Second, an HMM assumes that each observation
variable $x_t$ depends only on the current state $y_t$. With these assumptions, we can
specify an HMM using three probability distributions: first, the distribution $p(y_1)$
over initial states; second, the transition distribution $p(y_t \mid y_{t-1})$; and finally, the
observation distribution $p(x_t \mid y_t)$. That is, the joint probability of a state sequence
$y$ and an observation sequence $x$ factorizes as
\[
p(y, x) = \prod_{t=1}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t), \tag{4.8}
\]
where, to simplify notation, we write the initial state distribution $p(y_1)$ as $p(y_1 \mid y_0)$.
In NLP, HMMs have been used for sequence labeling tasks such as part-of-speech
tagging, named-entity recognition, and information extraction.
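The factorization (4.8) translates directly into code. In the Python sketch below, the two-state, two-word model and all of its probabilities are invented for illustration; `hmm_joint` multiplies one transition term and one observation term per position, using the initial distribution at the first position.

```python
import itertools

# Hypothetical toy model; the numbers are illustrative only.
STATES = ("Person", "Other")
WORDS = ("bush", "said")
init = {"Person": 0.3, "Other": 0.7}
trans = {"Person": {"Person": 0.4, "Other": 0.6},
         "Other": {"Person": 0.2, "Other": 0.8}}
emit = {"Person": {"bush": 0.9, "said": 0.1},
        "Other": {"bush": 0.2, "said": 0.8}}


def hmm_joint(ys, xs):
    """p(y, x) = prod_t p(y_t | y_{t-1}) p(x_t | y_t), with p(y_1) at t = 1."""
    p, prev = 1.0, None
    for y, x in zip(ys, xs):
        p_state = init[y] if prev is None else trans[prev][y]
        p *= p_state * emit[y][x]
        prev = y
    return p


p_example = hmm_joint(("Person", "Other"), ("bush", "said"))  # 0.3 * 0.9 * 0.6 * 0.8

# Sanity check: the joint sums to 1 over all length-3 state and word sequences.
total = sum(hmm_joint(ys, xs)
            for ys in itertools.product(STATES, repeat=3)
            for xs in itertools.product(WORDS, repeat=3))
```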
4.2.3
An important difference between naive Bayes and logistic regression is that naive
Bayes is generative, meaning that it is based on a model of the joint distribution
$p(y, x)$, while logistic regression is discriminative, meaning that it is based on
a model of the conditional distribution $p(y \mid x)$. In this section, we discuss the
differences between generative and discriminative modeling, and the advantages of
discriminative modeling for many tasks. For concreteness, we focus on the examples
of naive Bayes and logistic regression, but the discussion in this section actually
applies in general to the differences between generative models and CRFs.
The main difference is that a conditional distribution $p(y \mid x)$ does not include
a model of $p(x)$, which is not needed for classification anyway. The difficulty in
modeling $p(x)$ is that it often contains many highly dependent features, which are
difficult to model. For example, in named-entity recognition, an HMM relies on only
one feature, the word's identity. But many words, especially proper names, will not
have occurred in the training set, so the word-identity feature is uninformative. To
label unseen words, we would like to exploit other features of a word, such as its
capitalization, its neighboring words, its prefixes and suffixes, its membership in
predetermined lists of people and locations, and so on.
To include interdependent features in a generative model, we have two choices:
enhance the model to represent dependencies among the inputs, or make simplifying
independence assumptions, such as the naive Bayes assumption. The first approach,
enhancing the model, is often difficult to do while retaining tractability. For
example, it is hard to imagine how to model the dependence between the capitalization of
a word and its suffixes, nor do we particularly wish to do so, since we always observe
the test sentences anyway. The second approach, adding independence assumptions
among the inputs, is problematic because it can hurt performance. For example,
although the naive Bayes classifier performs surprisingly well in document
classification, it performs worse on average across a range of applications than logistic
regression [7].
Furthermore, even when naive Bayes has good classification accuracy, its
probability estimates tend to be poor. To understand why, imagine training
naive Bayes on a data set in which all the features are repeated, that is,
Figure 4.2 Diagram of the relationship between naive Bayes, logistic regression,
HMMs, linear-chain CRFs, generative models, and general CRFs.
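(4.10)
\[
p_g(y, x; \theta) = p_g(x; \theta)\, p_g(y \mid x; \theta), \tag{4.11}
\]
where $p_g(x; \theta)$ and $p_g(y \mid x; \theta)$ are computed by inference, i.e., $p_g(x; \theta) = \sum_y p_g(y, x; \theta)$
and $p_g(y \mid x; \theta) = p_g(y, x; \theta)/p_g(x; \theta)$.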
Now, compare this generative model to a discriminative model over the same family
of joint distributions. To do this, we define a prior $p(x)$ over inputs, such that $p(x)$
could have arisen from $p_g$ with some parameter setting. That is, $p(x) = p_c(x; \theta') =
\sum_y p_g(y, x; \theta')$. We combine this with a conditional distribution $p_c(y \mid x; \theta)$ that
could also have arisen from $p_g$, that is, $p_c(y \mid x; \theta) = p_g(y, x; \theta)/p_g(x; \theta)$. Then the
resulting distribution is
\[
p_c(y, x) = p_c(x; \theta')\, p_c(y \mid x; \theta). \tag{4.12}
\]
By comparing (4.11) with (4.12), it can be seen that the conditional approach has
more freedom to fit the data, because it does not require that $\theta = \theta'$. Intuitively,
because the parameters $\theta$ in (4.11) are used in both the input distribution and the
conditional, a good set of parameters must represent both well, potentially at the
cost of trading off accuracy on $p(y \mid x)$, the distribution we care about, for accuracy
on $p(x)$, which we care less about.
In this section, we have discussed the relationship between naive Bayes and
logistic regression in detail because it mirrors the relationship between HMMs and
linear-chain CRFs. Just as naive Bayes and logistic regression are a
generative-discriminative pair, there is a discriminative analogue to HMMs, and this analogue
is a particular type of CRF, as we explain next. The analogy between naive Bayes,
logistic regression, generative models, and CRFs is depicted in figure 4.2.
4.3
Figure 4.3
chain CRFs, motivating them from HMMs. Then, we discuss parameter estimation
(section 4.3.2) and inference (section 4.3.3) in linear-chain CRFs.
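4.3.1
\[
p(y, x) = \frac{1}{Z} \exp\left\{ \sum_{t} \sum_{i,j \in S} \lambda_{ij} \mathbf{1}_{\{y_t = i\}} \mathbf{1}_{\{y_{t-1} = j\}} + \sum_{t} \sum_{i \in S} \sum_{o \in O} \mu_{oi} \mathbf{1}_{\{y_t = i\}} \mathbf{1}_{\{x_t = o\}} \right\}, \tag{4.13}
\]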
where $\theta = \{\lambda_{ij}, \mu_{oi}\}$ are the parameters of the distribution, and can be any real
numbers. Every HMM can be written in this form, as can be seen simply by setting
$\lambda_{ij} = \log p(y' = i \mid y = j)$ and so on. Because we do not require the parameters to
be log probabilities, we are no longer guaranteed that the distribution sums to 1,
unless we explicitly enforce this by using a normalization constant $Z$. Despite this
added flexibility, it can be shown that (4.13) describes exactly the class of HMMs
in (4.8); we have added flexibility to the parameterization, but we have not added
any distributions to the family.
We can write (4.13) more compactly by introducing the concept of feature
functions, just as we did for logistic regression in (4.7). Each feature function
has the form $f_k(y_t, y_{t-1}, x_t)$. In order to duplicate (4.13), there needs to be one
feature $f_{ij}(y, y', x) = \mathbf{1}_{\{y = i\}} \mathbf{1}_{\{y' = j\}}$ for each transition $(i, j)$ and one feature
$f_{io}(y, y', x) = \mathbf{1}_{\{y = i\}} \mathbf{1}_{\{x = o\}}$ for each state-observation pair $(i, o)$. Then we can write
an HMM as
\[
p(y, x) = \frac{1}{Z} \exp\left\{ \sum_{t} \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, x_t) \right\}. \tag{4.14}
\]
Again, (4.14) defines exactly the same family of distributions as (4.13), and therefore
as the original HMM equation (4.8).
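The last step is to write the conditional distribution $p(y \mid x)$ that results from the
HMM (4.14). This is
\[
p(y \mid x) = \frac{p(y, x)}{\sum_{y'} p(y', x)}
= \frac{\exp\left\{ \sum_{t} \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, x_t) \right\}}{\sum_{y'} \exp\left\{ \sum_{t} \sum_{k=1}^{K} \lambda_k f_k(y'_t, y'_{t-1}, x_t) \right\}}. \tag{4.15}
\]
(4.17)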
We have just seen that if the joint $p(y, x)$ factorizes as an HMM, then the
associated conditional distribution $p(y \mid x)$ is a linear-chain CRF. This HMM-like
CRF is pictured in figure 4.3. Other types of linear-chain CRFs are also useful,
however. For example, in an HMM, a transition from state $i$ to state $j$ receives the
same score, $\log p(y_t = j \mid y_{t-1} = i)$, regardless of the input. In a CRF, we can allow
the score of the transition $(i, j)$ to depend on the current observation vector, simply
by adding a feature $\mathbf{1}_{\{y_t = j\}} \mathbf{1}_{\{y_{t-1} = i\}} \mathbf{1}_{\{x_t = o\}}$. A CRF with this kind of transition
feature, which is commonly used in text applications, is pictured in figure 4.4.
To indicate in the definition of linear-chain CRF that each feature function
can depend on observations from any time step, we have written the observation
argument to $f_k$ as a vector $\mathbf{x}_t$, which should be understood as containing all the
components of the global observations $\mathbf{x}$ that are needed for computing features
at time $t$. For example, if the CRF uses the next word $x_{t+1}$ as a feature, then the
feature vector $\mathbf{x}_t$ is assumed to include the identity of word $x_{t+1}$.
Finally, note that the normalization constant $Z(x)$ sums over all possible state
sequences, an exponentially large number of terms. Nevertheless, it can be computed
efficiently by forward-backward, as we explain in section 4.3.3.
4.3.2
Parameter Estimation
In this section we discuss how to estimate the parameters $\theta = \{\lambda_k\}$ of a
linear-chain CRF. We are given i.i.d. training data $D = \{x^{(i)}, y^{(i)}\}_{i=1}^{N}$, where each
$x^{(i)} = \{x_1^{(i)}, x_2^{(i)}, \ldots, x_T^{(i)}\}$ is a sequence of inputs, and each $y^{(i)} = \{y_1^{(i)}, y_2^{(i)}, \ldots, y_T^{(i)}\}$
is a sequence of the desired predictions. Thus, we have relaxed the i.i.d. assumption
within each sequence, but we still assume that distinct sequences are independent.
(In section 4.4, we will see how to relax this assumption as well.)
Parameter estimation is typically performed by penalized maximum likelihood.
Because we are modeling the conditional distribution, the following log-likelihood,
sometimes called the conditional log-likelihood, is appropriate:
\[
\ell(\theta) = \sum_{i=1}^{N} \log p(y^{(i)} \mid x^{(i)}). \tag{4.18}
\]
(4.19)
the two terms on the right-hand side are decoupled, that is, the value of $\theta'$ does
not affect the optimization over $\theta$. If we do not need to estimate $p(x)$, then we can
simply drop the second term, which leaves (4.18).
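After substituting the CRF model (4.16) into the likelihood (4.18), we get the
following expression:
\[
\ell(\theta) = \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(y_t^{(i)}, y_{t-1}^{(i)}, x_t^{(i)}) - \sum_{i=1}^{N} \log Z(x^{(i)}). \tag{4.20}
\]
With a penalty based on the Euclidean norm of $\theta$, the regularized log-likelihood is
\[
\ell(\theta) = \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(y_t^{(i)}, y_{t-1}^{(i)}, x_t^{(i)}) - \sum_{i=1}^{N} \log Z(x^{(i)}) - \sum_{k=1}^{K} \frac{\lambda_k^2}{2\sigma^2}. \tag{4.21}
\]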
The notation for the regularizer is intended to suggest that regularization can also
be viewed as performing maximum a posteriori estimation of $\theta$, if $\theta$ is assigned
a Gaussian prior with mean 0 and covariance $\sigma^2 I$. The parameter $\sigma^2$ is a free
parameter which determines how much to penalize large weights. Determining the
best regularization parameter can require a computationally intensive parameter
sweep. Fortunately, often the accuracy of the final model does not appear to be
sensitive to changes in $\sigma^2$, even when $\sigma^2$ is varied up to a factor of 10. An alternative
choice of regularization is to use the $\ell_1$ norm instead of the Euclidean norm, which
corresponds to an exponential prior on parameters [17]. This regularizer tends to
encourage sparsity in the learned parameters.
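In general, the function $\ell(\theta)$ cannot be maximized in closed form, so numeric
optimization is used. The partial derivatives of (4.21) are
\[
\frac{\partial \ell}{\partial \lambda_k} = \sum_{i=1}^{N} \sum_{t=1}^{T} f_k(y_t^{(i)}, y_{t-1}^{(i)}, x_t^{(i)}) - \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{y, y'} f_k(y, y', x_t^{(i)})\, p(y, y' \mid x^{(i)}) - \frac{\lambda_k}{\sigma^2}. \tag{4.22}
\]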
The first term is the expected value of $f_k$ under the empirical distribution:
\[
\tilde{p}(y, x) = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}_{\{y = y^{(i)}\}} \mathbf{1}_{\{x = x^{(i)}\}}. \tag{4.23}
\]
The second term, which arises from the derivative of $\log Z(x)$, is the expectation
of $f_k$ under the model distribution $p(y \mid x; \theta)\, \tilde{p}(x)$. Therefore, at the unregularized
maximum likelihood solution, when the gradient is zero, these two expectations are
equal. This pleasing interpretation is a standard result about maximum-likelihood
estimation in exponential families.
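The "observed minus expected feature counts" form of the gradient can be checked on a tiny chain. The Python sketch below is an illustration under stated assumptions: the three features, the two training sequences, the weights, and the fixed start label `Y0` are all invented, and $Z(x)$ and the expectations are computed by brute-force enumeration over label sequences (feasible only for toy sizes) rather than by forward-backward. It compares the analytic gradient of the regularized objective with central finite differences.

```python
import itertools
import math

LABELS = (0, 1)
Y0 = 0  # assumed fixed start label


def feature_counts(y, x):
    """Sum over t of three hypothetical features f_k(y_t, y_{t-1}, x_t)."""
    counts, prev = [0.0, 0.0, 0.0], Y0
    for y_t, x_t in zip(y, x):
        counts[0] += 1.0 if y_t == prev else 0.0   # label repeats
        counts[1] += 1.0 if y_t == x_t else 0.0    # label matches observation
        counts[2] += 1.0 if y_t == 1 else 0.0      # bias toward label 1
        prev = y_t
    return counts


def score(y, x, lam):
    return sum(l * c for l, c in zip(lam, feature_counts(y, x)))


def objective(data, lam, sigma2):
    """Regularized conditional log-likelihood, with brute-force log Z(x)."""
    value = -sum(l * l for l in lam) / (2.0 * sigma2)
    for x, y in data:
        seqs = itertools.product(LABELS, repeat=len(x))
        log_z = math.log(sum(math.exp(score(s, x, lam)) for s in seqs))
        value += score(y, x, lam) - log_z
    return value


def gradient(data, lam, sigma2):
    """Observed counts minus expected counts, minus lambda_k / sigma^2."""
    grad = [-l / sigma2 for l in lam]
    for x, y in data:
        seqs = list(itertools.product(LABELS, repeat=len(x)))
        weights = [math.exp(score(s, x, lam)) for s in seqs]
        z = sum(weights)
        observed = feature_counts(y, x)
        for k in range(len(lam)):
            expected = sum(w / z * feature_counts(s, x)[k] for s, w in zip(seqs, weights))
            grad[k] += observed[k] - expected
    return grad


def numeric_gradient(data, lam, sigma2, eps=1e-6):
    grad = []
    for k in range(len(lam)):
        up, down = list(lam), list(lam)
        up[k] += eps
        down[k] -= eps
        grad.append((objective(data, up, sigma2) - objective(data, down, sigma2)) / (2 * eps))
    return grad


data = [((0, 1, 1), (0, 1, 1)), ((1, 0, 1, 0), (1, 1, 1, 0))]  # (x, y) pairs
lam = [0.3, -0.5, 0.1]
g_exact = gradient(data, lam, sigma2=10.0)
g_numeric = numeric_gradient(data, lam, sigma2=10.0)
```

Expected counts over whole sequences are used here instead of the pairwise marginals $p(y, y' \mid x^{(i)})$; the two are equivalent because each feature count is a sum over positions.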
Now we discuss how to optimize $\ell(\theta)$. The function $\ell(\theta)$ is concave, which follows
from the convexity of functions of the form $g(x) = \log \sum_i \exp x_i$. Concavity is
extremely helpful for parameter estimation, because it means that every local
optimum is also a global optimum. Adding regularization ensures that $\ell$ is strictly
concave, which implies that it has exactly one global optimum.
Perhaps the simplest approach to optimize $\ell$ is steepest ascent along the gradient
(4.22), but this requires too many iterations to be practical. Newton's method
converges much faster because it takes into account the curvature of the likelihood,
but it requires computing the Hessian, the matrix of all second derivatives. The size
of the Hessian is quadratic in the number of parameters. Since practical applications
often use tens of thousands or even millions of parameters, even storing the full
Hessian is not practical.
Instead, current techniques for optimizing (4.21) make approximate use of
second-order information. Particularly successful have been quasi-Newton methods such
as BFGS [3], which compute an approximation to the Hessian from only the first
derivative of the objective function. A full $K \times K$ approximation to the Hessian still
requires quadratic size, however, so a limited-memory version of BFGS is used, due
to Byrd et al. [6]. As an alternative to limited-memory BFGS, conjugate gradient
4.3.3
Inference
There are two common inference problems for CRFs. First, during training,
computing the gradient requires marginal distributions for each edge $p(y_t, y_{t-1} \mid x)$, and
computing the likelihood requires $Z(x)$. Second, to label an unseen instance, we
compute the most likely (Viterbi) labeling $y^* = \arg\max_y p(y \mid x)$. In linear-chain
CRFs, both inference tasks can be performed efficiently and exactly by variants
of the standard dynamic-programming algorithms for HMMs. In this section, we
briefly review the HMM algorithms, and extend them to linear-chain CRFs. These
standard inference algorithms are described in more detail by Rabiner [42].
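First, we introduce notation which will simplify the forward-backward recursions.
An HMM can be viewed as a factor graph $p(y, x) = \prod_t \Psi_t(y_t, y_{t-1}, x_t)$ where $Z = 1$,
and the factors are defined as
\[
\Psi_t(j, i, x) \stackrel{\text{def}}{=} p(y_t = j \mid y_{t-1} = i)\, p(x_t = x \mid y_t = j). \tag{4.24}
\]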
Now, we review the HMM forward algorithm, which is used to compute the
probability $p(x)$ of the observations. The idea behind forward-backward is to first
rewrite the naive summation $p(x) = \sum_y p(x, y)$ using the distributive law:
\begin{align}
p(x) &= \sum_{y} \prod_{t=1}^{T} \Psi_t(y_t, y_{t-1}, x_t) \tag{4.25} \\
&= \sum_{y_T} \sum_{y_{T-1}} \Psi_T(y_T, y_{T-1}, x_T) \sum_{y_{T-2}} \Psi_{T-1}(y_{T-1}, y_{T-2}, x_{T-1}) \sum_{y_{T-3}} \cdots \tag{4.26}
\end{align}
Now we observe that each of the intermediate sums is reused many times during
the computation of the outer sum, and so we can save an exponential amount of
work by caching the inner sums.
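This leads to defining a set of forward variables $\alpha_t$, each of which is a vector
of size $M$ (where $M$ is the number of states) which stores one of the intermediate
sums. These are defined as
\begin{align}
\alpha_t(j) &\stackrel{\text{def}}{=} p(x_{\langle 1 \ldots t \rangle}, y_t = j) \tag{4.27} \\
&= \sum_{y_{\langle 1 \ldots t-1 \rangle}} \Psi_t(j, y_{t-1}, x_t) \prod_{t'=1}^{t-1} \Psi_{t'}(y_{t'}, y_{t'-1}, x_{t'}), \tag{4.28}
\end{align}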
where the summation over $y_{\langle 1 \ldots t-1 \rangle}$ ranges over all assignments to the sequence
of random variables $y_1, y_2, \ldots, y_{t-1}$. The alpha values can be computed by the
recursion
\[
\alpha_t(j) = \sum_{i \in S} \Psi_t(j, i, x_t)\, \alpha_{t-1}(i), \tag{4.29}
\]
with initialization $\alpha_1(j) = \Psi_1(j, y_0, x_1)$. (Recall that $y_0$ is the fixed initial state of
the HMM.) It is easy to see that $p(x) = \sum_{y_T} \alpha_T(y_T)$ by repeatedly substituting the
recursion (4.29) to obtain (4.26). A formal proof would use induction.
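A minimal Python sketch of the forward recursion, checked against the naive exponential-time sum, is given below. The two-state model, its probabilities, and the fixed initial state named "start" are assumptions made for this example; `psi` plays the role of the HMM factor, a transition probability times an observation probability.

```python
import itertools

# Hypothetical two-state HMM; "start" is the fixed initial state y_0.
STATES = ("A", "B")
SYMBOLS = ("u", "v")
Y0 = "start"
trans = {
    "start": {"A": 0.6, "B": 0.4},
    "A": {"A": 0.7, "B": 0.3},
    "B": {"A": 0.2, "B": 0.8},
}
emit = {"A": {"u": 0.9, "v": 0.1}, "B": {"u": 0.3, "v": 0.7}}


def psi(j, i, x):
    """HMM factor: p(y_t = j | y_{t-1} = i) * p(x_t = x | y_t = j)."""
    return trans[i][j] * emit[j][x]


def forward(xs):
    """p(x) by the forward recursion; cost is linear in the sequence length."""
    alpha = {j: psi(j, Y0, xs[0]) for j in STATES}
    for x in xs[1:]:
        alpha = {j: sum(psi(j, i, x) * alpha[i] for i in STATES) for j in STATES}
    return sum(alpha.values())


def brute_force(xs):
    """p(x) by summing the joint over every state sequence (exponential cost)."""
    total = 0.0
    for ys in itertools.product(STATES, repeat=len(xs)):
        p, prev = 1.0, Y0
        for y, x in zip(ys, xs):
            p *= psi(y, prev, x)
            prev = y
        total += p
    return total


xs = ("u", "v", "v")
p_forward = forward(xs)
p_naive = brute_force(xs)
```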
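The backward recursion is exactly the same, except that in (4.26), we push in
the summations in reverse order. This results in the definition
\begin{align}
\beta_t(i) &\stackrel{\text{def}}{=} p(x_{\langle t+1 \ldots T \rangle} \mid y_t = i) \tag{4.30} \\
&= \sum_{y_{\langle t+1 \ldots T \rangle}} \prod_{t'=t+1}^{T} \Psi_{t'}(y_{t'}, y_{t'-1}, x_{t'}), \tag{4.31}
\end{align}
and the recursion
\[
\beta_t(i) = \sum_{j \in S} \Psi_{t+1}(j, i, x_{t+1})\, \beta_{t+1}(j), \tag{4.32}
\]
with initialization $\beta_T(i) = 1$.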
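By combining results from the forward and backward recursions, we can compute
the marginal distributions needed for the gradient (4.22). Applying the distributive
law again, we see that
\begin{align}
p(y_{t-1}, y_t \mid x) &\propto \Psi_t(y_t, y_{t-1}, x_t) \left( \sum_{y_{\langle 1 \ldots t-2 \rangle}} \prod_{t'=1}^{t-1} \Psi_{t'}(y_{t'}, y_{t'-1}, x_{t'}) \right) \left( \sum_{y_{\langle t+1 \ldots T \rangle}} \prod_{t'=t+1}^{T} \Psi_{t'}(y_{t'}, y_{t'-1}, x_{t'}) \right) \notag \\
&\propto \alpha_{t-1}(y_{t-1})\, \Psi_t(y_t, y_{t-1}, x_t)\, \beta_t(y_t). \tag{4.34}
\end{align}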
Finally, to compute the globally most probable assignment $y^* = \arg\max_y p(y \mid x)$,
we observe that the trick in (4.26) still works if all the summations are replaced by
maximization. This yields the Viterbi recursion:
\[
\delta_t(j) = \max_{i \in S} \Psi_t(j, i, x_t)\, \delta_{t-1}(i). \tag{4.35}
\]
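The Viterbi recursion can be sketched in Python as follows, with backpointers added to recover the maximizing sequence (the recursion itself only gives its score). The two-state model, its probabilities, and the fixed initial state "start" are hypothetical; a brute-force search over all label sequences is included as a check.

```python
import itertools

# Hypothetical two-state HMM; "start" is the fixed initial state y_0.
STATES = ("A", "B")
Y0 = "start"
trans = {
    "start": {"A": 0.6, "B": 0.4},
    "A": {"A": 0.7, "B": 0.3},
    "B": {"A": 0.2, "B": 0.8},
}
emit = {"A": {"u": 0.9, "v": 0.1}, "B": {"u": 0.3, "v": 0.7}}


def psi(j, i, x):
    return trans[i][j] * emit[j][x]


def viterbi(xs):
    """Most probable state sequence and its joint probability."""
    delta = {j: psi(j, Y0, xs[0]) for j in STATES}
    backpointers = []
    for x in xs[1:]:
        new_delta, pointer = {}, {}
        for j in STATES:
            best = max(STATES, key=lambda i: psi(j, i, x) * delta[i])
            new_delta[j] = psi(j, best, x) * delta[best]
            pointer[j] = best
        delta = new_delta
        backpointers.append(pointer)
    last = max(STATES, key=lambda j: delta[j])
    path = [last]
    for pointer in reversed(backpointers):   # follow backpointers to the start
        path.append(pointer[path[-1]])
    path.reverse()
    return tuple(path), delta[last]


def brute_force_best(xs):
    """Exhaustive search over all state sequences, for checking only."""
    def joint(ys):
        p, prev = 1.0, Y0
        for y, x in zip(ys, xs):
            p *= psi(y, prev, x)
            prev = y
        return p
    best = max(itertools.product(STATES, repeat=len(xs)), key=joint)
    return best, joint(best)


xs = ("u", "v", "v")
path, prob = viterbi(xs)
```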
Now that we have described the forward-backward and Viterbi algorithms for
HMMs, the generalization to linear-chain CRFs is fairly straightforward. The
forward-backward algorithm for linear-chain CRFs is identical to the HMM version,
except that the transition weights $\Psi_t(j, i, x_t)$ are defined differently. We observe that
the CRF model (4.16) can be rewritten as
\[
p(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{T} \Psi_t(y_t, y_{t-1}, x_t), \tag{4.36}
\]
where we define
\[
\Psi_t(y_t, y_{t-1}, x_t) = \exp\left\{ \sum_{k} \lambda_k f_k(y_t, y_{t-1}, x_t) \right\}. \tag{4.37}
\]
With that definition, the forward recursion (4.29), the backward recursion (4.32),
and the Viterbi recursion (4.35) can be used unchanged for linear-chain CRFs.
Instead of computing $p(x)$ as in an HMM, in a CRF the forward and backward
recursions compute $Z(x)$.
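The following Python sketch runs forward-backward with exponentiated-feature factors, so the recursions return $Z(x)$ and, after dividing by it, the pairwise marginals. The two features, the weights `LAM`, the binary label set, and the fixed start label `Y0` are assumptions made for illustration; $Z(x)$ is also computed by brute force as a check.

```python
import itertools
import math

LABELS = (0, 1)
Y0 = 0               # assumed fixed start label
LAM = (0.8, 1.5)     # hypothetical feature weights


def psi(j, i, x_t):
    """CRF factor: exp of a weighted sum of two illustrative features."""
    feats = (1.0 if j == i else 0.0,      # label repeats
             1.0 if j == x_t else 0.0)    # label matches observation
    return math.exp(sum(l * f for l, f in zip(LAM, feats)))


def forward_backward(x):
    """Return Z(x) and the pairwise marginals p(y_{t-1}, y_t | x)."""
    T = len(x)
    alpha = [{j: psi(j, Y0, x[0]) for j in LABELS}]
    for t in range(1, T):
        alpha.append({j: sum(psi(j, i, x[t]) * alpha[t - 1][i] for i in LABELS)
                      for j in LABELS})
    beta = [dict() for _ in range(T)]
    beta[T - 1] = {i: 1.0 for i in LABELS}
    for t in range(T - 2, -1, -1):
        beta[t] = {i: sum(psi(j, i, x[t + 1]) * beta[t + 1][j] for j in LABELS)
                   for i in LABELS}
    z = sum(alpha[T - 1].values())
    pairs = [{(i, j): alpha[t - 1][i] * psi(j, i, x[t]) * beta[t][j] / z
              for i in LABELS for j in LABELS}
             for t in range(1, T)]
    return z, pairs


def brute_force_z(x):
    """Z(x) by summing the unnormalized score of every label sequence."""
    total = 0.0
    for ys in itertools.product(LABELS, repeat=len(x)):
        weight, prev = 1.0, Y0
        for y_t, x_t in zip(ys, x):
            weight *= psi(y_t, prev, x_t)
            prev = y_t
        total += weight
    return total


x = (1, 1, 0, 1)
z, pairs = forward_backward(x)
```

Each dictionary in `pairs` is a distribution over adjacent label pairs; these are the quantities that enter the second term of the gradient.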
A final inference task that is useful in some applications is to compute a marginal
probability $p(y_t, y_{t+1}, \ldots, y_{t+k} \mid x)$ over a range of nodes. For example, this is useful
for measuring the model's confidence in its predicted labeling over a segment of
input. This marginal probability can be computed efficiently using constrained
forward-backward, as by Culotta and McCallum [12].
4.4
CRFs in General
In this section, we define CRFs with general graphical structure, as they were
introduced originally [24]. Although initial applications of CRFs used linear chains,
there have been many later applications of CRFs with more general graphical
structures. Such structures are especially useful for relational learning, because
they allow relaxing the i.i.d. assumption among entities. Also, although CRFs have
typically been used for across-network classification, in which the training and
testing data are assumed to be independent, we will see that CRFs can be used for
within-network classification as well, in which we model probabilistic dependencies
between the training and testing data.
The generalization from linear-chain CRFs to general CRFs is fairly
straightforward. We simply move from using a linear-chain factor graph to a more general
factor graph, and from forward-backward to more general (perhaps approximate)
inference algorithms.
4.4.1
Model
\[
p(y \mid x) = \frac{1}{Z(x)} \prod_{\Psi_A \in G} \exp\left\{ \sum_{k=1}^{K(A)} \lambda_{Ak} f_{Ak}(y_A, x_A) \right\}. \tag{4.38}
\]
In addition, practical models rely extensively on parameter tying. For
example, in the linear-chain case, often the same weights are used for the factors
$\Psi_t(y_t, y_{t-1}, x_t)$ at each time step. To denote this, we partition the factors of $G$
into $C = \{C_1, C_2, \ldots, C_P\}$, where each $C_p$ is a clique template whose parameters are
tied. This notion of clique template generalizes that in Taskar et al. [55], Sutton
et al. [54], and Richardson and Domingos [43]. Each clique template $C_p$ is a set
of factors which has a corresponding set of sufficient statistics $\{f_{pk}(x_p, y_p)\}$ and
parameters $\theta_p \in \mathbb{R}^{K(p)}$. Then the CRF can be written as
\[
p(y \mid x) = \frac{1}{Z(x)} \prod_{C_p \in C} \prod_{\Psi_c \in C_p} \Psi_c(x_c, y_c; \theta_p), \tag{4.39}
\]
where each factor is parameterized as
\[
\Psi_c(x_c, y_c; \theta_p) = \exp\left\{ \sum_{k=1}^{K(p)} \lambda_{pk} f_{pk}(x_c, y_c) \right\}, \tag{4.40}
\]
and the normalization function is
\[
Z(x) = \sum_{y} \prod_{C_p \in C} \prod_{\Psi_c \in C_p} \Psi_c(x_c, y_c; \theta_p). \tag{4.41}
\]
Applications of CRFs
CRFs have been applied to a variety of domains, including text processing, computer vision, and bioinformatics. In this section, we discuss several applications,
highlighting the different graphical structures that occur in the literature.
One of the first large-scale applications of CRFs was by Sha and Pereira [49], who
matched state-of-the-art performance on segmenting noun phrases in text. Since
then, linear-chain CRFs have been applied to many problems in NLP, including
named-entity recognition [30], feature induction for NER [28], identifying protein
names in biology abstracts [48], segmenting addresses in webpages [13], finding
semantic roles in text [45], identifying the sources of opinions [8], Chinese word
segmentation [38], Japanese morphological analysis [22], and many others.
In bioinformatics, CRFs have been applied to RNA structural alignment [47]
and protein structure prediction [25]. Semi-Markov CRFs [46] add somewhat more
flexibility in choosing features, which may be useful for certain tasks in information
extraction and especially bioinformatics.
General CRFs have also been applied to several tasks in NLP. One promising
application is to perform multiple labeling tasks simultaneously. For example,
Sutton et al. [54] show that a two-level dynamic CRF for part-of-speech tagging and
noun phrase chunking performs better than solving the tasks one at a time. Another
application is to multilabel classification, in which each instance can have multiple
class labels. Rather than learning an independent classifier for each category,
Ghamrawi and McCallum [16] present a CRF that learns dependencies between
the categories, resulting in improved classification performance. Finally, the skip-chain CRF, which we present in section 4.5, is a general CRF that represents
long-distance dependencies in information extraction.
An interesting graphical CRF structure has been applied to the problem of proper
noun coreference, that is, of determining which mentions in a document, such as
"Mr. President" and "he", refer to the same underlying entity. McCallum and Wellner
[31] learn a distance metric between mentions using a fully connected CRF in
which inference corresponds to graph partitioning. A similar model has been used
to segment handwritten characters and diagrams [11, 40].
In some applications of CRFs, efficient dynamic programs exist even though
the graphical model is difficult to specify. For example, McCallum et al. [33] learn
the parameters of a string-edit model in order to discriminate between matching
and nonmatching pairs of strings. Also, there is work on using CRFs to learn
distributions over the derivations of a grammar [44, 9, 51, 57]. A potentially useful
unifying framework for this type of model is provided by case-factor diagrams [27].
In computer vision, several authors have used grid-shaped CRFs [18, 23] for
labeling and segmenting images. Also, for recognizing objects, Quattoni et al.
[41] use a tree-shaped CRF in which latent variables are designed to recognize
characteristic parts of an object.
4.4.3
Parameter Estimation
Parameter estimation for general CRFs is essentially the same as for linear-chains,
except that computing the model expectations requires more general inference
algorithms. First, we discuss the fully observed case, in which the training and
testing data are independent, and the training data is fully observed. In this case
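the conditional log-likelihood is given by
\[
\ell(\theta) = \sum_{C_p \in \mathcal{C}} \sum_{\Psi_c \in C_p} \sum_{k=1}^{K(p)} \lambda_{pk} f_{pk}(x_c, y_c) - \log Z(x). \tag{4.42}
\]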
It is worth noting that the equations in this section do not explicitly sum over
training instances, because if a particular application happens to have i.i.d. training
instances, they can be represented by disconnected components in the graph G.
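The partial derivative of the log-likelihood with respect to a parameter $\lambda_{pk}$
associated with a clique template $C_p$ is
\[
\frac{\partial \ell}{\partial \lambda_{pk}} = \sum_{\Psi_c \in C_p} f_{pk}(x_c, y_c) - \sum_{\Psi_c \in C_p} \sum_{y'_c} f_{pk}(x_c, y'_c)\, p(y'_c \mid x). \tag{4.43}
\]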
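The gradient (4.43) is "observed counts minus expected counts", which can be checked numerically on a tiny model. The sketch below is an illustration under stated assumptions: a hypothetical three-variable loop with one tied template, features that ignore the input $x$ for brevity, and expectations computed by enumerating all joint assignments (equivalent to summing clique marginals, but practical only for toy graphs).

```python
import itertools
import math

LABELS = (0, 1)
CLIQUES = [(0, 1), (1, 2), (0, 2)]  # hypothetical loop; one tied template

def feats(y_c):
    # Two features per clique: [labels agree, both labels equal 1]
    return [float(y_c[0] == y_c[1]), float(y_c[0] == 1 and y_c[1] == 1)]

def total_feats(y):
    # F_pk(x, y): sum over cliques c in the template of f_pk(x_c, y_c)
    out = [0.0, 0.0]
    for i, j in CLIQUES:
        for k, v in enumerate(feats((y[i], y[j]))):
            out[k] += v
    return out

def dot(theta, f):
    return sum(t * v for t, v in zip(theta, f))

def log_z(theta):
    return math.log(sum(math.exp(dot(theta, total_feats(y)))
                        for y in itertools.product(LABELS, repeat=3)))

def log_lik(theta, y):
    # Conditional log-likelihood: score of the observed y minus log Z
    return dot(theta, total_feats(y)) - log_z(theta)

def gradient(theta, y):
    # Observed feature totals minus their expectation under the model
    lz = log_z(theta)
    expected = [0.0, 0.0]
    for y2 in itertools.product(LABELS, repeat=3):
        f = total_feats(y2)
        p = math.exp(dot(theta, f) - lz)
        for k in range(2):
            expected[k] += p * f[k]
    return [o - e for o, e in zip(total_feats(y), expected)]
```

Comparing `gradient` against a central finite difference of `log_lik` is a cheap sanity check worth running on any new CRF implementation before training on real data.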
The function $\ell(\theta)$ has many of the same properties as in the linear-chain case.
First, the zero-gradient conditions can be interpreted as requiring that the sufficient statistics $F_{pk}(x, y) = \sum_{\Psi_c} f_{pk}(x_c, y_c)$ have the same expectations under
the empirical distribution and under the model distribution. Second, the function
$\ell(\theta)$ is concave, and can be efficiently maximized by second-order techniques such
The first question is how even to compute the marginal likelihood $\ell(\theta)$, because if
there are many variables $w$, the sum cannot be computed directly. The key is to
realize that we need to compute $\log \sum_w p(y, w \mid x)$ not for any possible assignment
$y$, but only for the particular assignment that occurs in the training data. This
motivates taking the original CRF (4.44), and clamping the variables $Y$ to their
observed values in the training data, yielding a distribution over $w$:
\[
p(w \mid y, x) = \frac{1}{Z(y, x)} \prod_{C_p \in \mathcal{C}} \prod_{\Psi_c \in C_p} \Psi_c(x_c, w_c, y_c; \theta_p), \tag{4.46}
\]
where the normalization factor is
\[
Z(y, x) = \sum_{w} \prod_{C_p \in \mathcal{C}} \prod_{\Psi_c \in C_p} \Psi_c(x_c, w_c, y_c; \theta_p). \tag{4.47}
\]
This new normalization constant Z(y, x) can be computed by the same inference
algorithm that we use to compute Z(x). In fact, Z(y, x) is easier to compute,
because it sums only over w, while Z(x) sums over both w and y. Graphically, this
amounts to saying that clamping the variables y in the graph G can simplify the
structure among w.
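The clamping idea can be sketched in a few lines. The toy model below (two latent variables $w$, two outputs $y$, hand-picked agreement factors) is a hypothetical example, and both normalizers are computed by enumeration. The point is only that the clamped sum ranges over $w$ alone, and that the marginal probability of $y$ is the ratio of the two normalizers.

```python
import itertools
import math

LABELS = (0, 1)

def log_score(w, y, x):
    # Hypothetical log-factors: w-y agreement, y-x agreement, w0-w1 agreement
    s = 1.0 * sum(float(wi == yi) for wi, yi in zip(w, y))
    s += 2.0 * sum(float(yi == xi) for yi, xi in zip(y, x))
    s += 0.5 * float(w[0] == w[1])
    return s

def z_full(x):
    # Z(x): sums over both w and y
    return sum(math.exp(log_score(w, y, x))
               for w in itertools.product(LABELS, repeat=2)
               for y in itertools.product(LABELS, repeat=2))

def z_clamped(y, x):
    # Z(y, x): y is fixed to its observed value, so only w is summed out
    return sum(math.exp(log_score(w, y, x))
               for w in itertools.product(LABELS, repeat=2))

x = (1, 0)
# Marginal p(y|x) = sum_w p(y, w|x) = Z(y, x) / Z(x)
marginal = {y: z_clamped(y, x) / z_full(x)
            for y in itertools.product(LABELS, repeat=2)}
```

In a real model both normalizers would come from the same (possibly approximate) inference routine, run once on the full graph and once on the clamped graph.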
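The marginal likelihood can then be computed as
\[
p(y \mid x) = \frac{1}{Z(x)} \sum_{w} \prod_{C_p \in \mathcal{C}} \prod_{\Psi_c \in C_p} \Psi_c(x_c, w_c, y_c; \theta_p) = \frac{Z(y, x)}{Z(x)}. \tag{4.48}
\]
For any function $f(\lambda)$, we have
\[
\frac{df}{d\lambda} = f(\lambda) \frac{d \log f}{d\lambda}, \tag{4.49}
\]
which can be seen by applying the chain rule to $\log f$ and rearranging. Applying
this to the marginal likelihood $\ell(\theta) = \log \sum_w p(y, w \mid x)$ yields
\begin{align}
\frac{\partial \ell}{\partial \lambda_{pk}} &= \frac{1}{\sum_{w} p(y, w \mid x)} \sum_{w} \frac{\partial}{\partial \lambda_{pk}} \big[ p(y, w \mid x) \big] \tag{4.50} \\
&= \sum_{w} p(w \mid y, x) \frac{\partial}{\partial \lambda_{pk}} \big[ \log p(y, w \mid x) \big]. \tag{4.51}
\end{align}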
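This is the expectation of the fully observed gradient, where the expectation is
taken over $w$. This expression simplifies to
\[
\frac{\partial \ell}{\partial \lambda_{pk}} = \sum_{\Psi_c \in C_p} \sum_{w'_c} p(w'_c \mid y, x)\, f_k(y_c, x_c, w'_c) - \sum_{\Psi_c \in C_p} \sum_{w'_c, y'_c} p(w'_c, y'_c \mid x_c)\, f_k(y'_c, x_c, w'_c). \tag{4.52}
\]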
This gradient requires computing two different kinds of marginal probabilities.
The first term contains a marginal probability $p(w'_c \mid y, x)$, which is exactly a
marginal distribution of the clamped CRF (4.46). The second term contains a
different marginal $p(w'_c, y'_c \mid x_c)$, which is the same marginal probability required
in a fully observed CRF. Once we have computed the gradient, $\ell$ can be maximized
by standard techniques such as conjugate gradient. In our experience, conjugate
gradient tolerates violations of convexity better than limited-memory BFGS, so it
may be a better choice for latent-variable CRFs.
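Alternatively, $\ell$ can be optimized using EM. At each iteration $j$ in the EM
algorithm, the current parameter vector $\theta^{(j)}$ is updated as follows. First, in the
E-step, an auxiliary function $q(w)$ is computed as $q(w) = p(w \mid y, x; \theta^{(j)})$. Second,
in the M-step, a new parameter vector $\theta^{(j+1)}$ is chosen as
\[
\theta^{(j+1)} = \arg\max_{\theta'} \sum_{w'} q(w') \log p(y, w' \mid x; \theta'). \tag{4.53}
\]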
The direct maximization algorithm and the EM algorithm are strikingly similar.
This can be seen by substituting the definition of $q$ into (4.53) and taking derivatives. The gradient is almost identical to the direct gradient (4.52). The only difference is that in EM, the distribution $p(w \mid y, x)$ is obtained from a previous, fixed
parameter setting rather than from the argument of the maximization. We are unaware of any empirical comparison of EM to direct optimization for latent-variable
CRFs.
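The similarity can be made explicit with a short derivation. As an assumption, take the M-step objective to be the usual expected complete-data log-likelihood, $Q(\theta') = \sum_w q(w) \log p(y, w \mid x; \theta')$, with $q$ held fixed. Differentiating with respect to a weight $\lambda'_{pk}$ and using $\sum_w q(w) = 1$ gives

```latex
\begin{align*}
\frac{\partial Q}{\partial \lambda'_{pk}}
  &= \sum_{w} q(w) \sum_{\Psi_c \in C_p} f_k(y_c, x_c, w_c)
     - \frac{\partial \log Z(x; \theta')}{\partial \lambda'_{pk}} \\
  &= \sum_{\Psi_c \in C_p} \sum_{w'_c} q(w'_c)\, f_k(y_c, x_c, w'_c)
     - \sum_{\Psi_c \in C_p} \sum_{w'_c, y'_c} p(w'_c, y'_c \mid x; \theta')\, f_k(y'_c, x_c, w'_c).
\end{align*}
```

Here $q(w'_c) = p(w'_c \mid y, x; \theta^{(j)})$ is a clique marginal computed at the previous parameters, whereas the direct gradient evaluates the same marginal at the current parameters. The second term is identical in both cases.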
4.4.4
Inference
4.4.5
Discussion
This section contains miscellaneous remarks about CRFs. First, it is easily seen that
the logistic regression model (4.7) is a CRF with a single output variable. Thus,
CRFs can be viewed as an extension of logistic regression to arbitrary graphical
structures.
Linear-chain CRFs were originally introduced as an improvement to the maximum-entropy Markov model (MEMM) [32], which is essentially a Markov model in which
the transition distributions are given by a logistic regression model. MEMMs can
exhibit the problems of label bias [24] and observation bias [20]. Both of these
problems can be readily understood graphically: the directed model of an MEMM
implies that for all time steps $t$, the observation $x_t$ is marginally independent of
the labels $y_{t-1}$, $y_{t-2}$, and so on, an independence assumption which is usually
strongly violated in sequence modeling. Sometimes this assumption can be effectively avoided by including information from previous time steps as features, and
this explains why MEMMs have had success in some NLP applications.
Although we have emphasized the view of a CRF as a model of the conditional
distribution, one could view it as an objective function for parameter estimation of
joint distributions. As such, it is one objective among many, including generative
likelihood, pseudolikelihood [4], and the maximum-margin objective [56, 2]. Another
related discriminative technique for structured models is the averaged perceptron,
which has been especially popular in the natural language community [10], in large
part because of its ease of implementation. To date, there has been little careful
comparison of these, especially CRFs and max-margin approaches, across different
structures and domains.
Given this view, it is natural to imagine training directed models by conditional
likelihood, and in fact this is commonly done in the speech community, where it is
called maximum mutual information training. However, it is no easier to maximize
the conditional likelihood in a directed model than an undirected model, because in
a directed model the conditional likelihood requires computing log p(x), which plays
the same role as Z(x) in the CRF likelihood. In fact, training is more complex in a
directed model, because the model parameters are constrained to be probabilities,
constraints which can make the optimization problem more difficult. This is in stark
contrast to the joint likelihood, which is much easier to compute for directed models
than undirected models (although recently several efficient parameter estimation
techniques have been proposed for undirected factor graphs, such as Abbeel et al.
[1] and Wainwright et al. [60]).
4.4.6
Implementation Concerns
There are a few implementation techniques that can help both training time and
accuracy of CRFs, but are not always fully discussed in the literature. Although
these apply especially to language applications, they are also useful more generally.
First, when the predicted variables are discrete, the features $f_{pk}$ are ordinarily
chosen to have a particular form:
\[
f_{pk}(y_c, x_c) = \mathbf{1}_{\{y_c = \tilde{y}_c\}}\, q_{pk}(x_c). \tag{4.54}
\]
In other words, each feature is nonzero only for a single output configuration $\tilde{y}_c$, but
as long as that constraint is met, then the feature value depends only on the input
observation. Essentially, this means that we can think of our features as depending
only on the input $x_c$, but that we have a separate set of weights for each output
configuration. This feature representation is also computationally efficient, because
computing each $q_{pk}$ may involve nontrivial text or image processing, and it need be
evaluated only once for every feature that uses it. To avoid confusion, we refer to
the functions $q_{pk}(x_c)$ as observation functions rather than as features. Examples of
observation functions are "word $x_t$ is capitalized" and "word $x_t$ ends in ing".
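The factorization in (4.54) can be sketched as follows. The two observation functions are the examples named in the text; the label set, the function names, and the block-per-label layout of the feature vector are illustrative assumptions rather than a prescribed implementation.

```python
def is_capitalized(x, t):
    # Observation function: "word x_t is capitalized"
    return 1.0 if x[t][:1].isupper() else 0.0

def ends_in_ing(x, t):
    # Observation function: "word x_t ends in ing"
    return 1.0 if x[t].endswith("ing") else 0.0

OBSERVATION_FUNCTIONS = [is_capitalized, ends_in_ing]
LABELS = ["PERSON", "OTHER"]  # hypothetical label set

def observations(x, t):
    """Evaluate every q_k(x, t) once; the result is shared by all labels."""
    return [q(x, t) for q in OBSERVATION_FUNCTIONS]

def node_features(y_t, x, t):
    """f(y_t, x) = 1{y_t = label} * q_k(x, t), one block of q values per label."""
    q = observations(x, t)
    vec = []
    for label in LABELS:
        indicator = 1.0 if y_t == label else 0.0
        vec.extend(indicator * v for v in q)
    return vec

sentence = ["Robert", "is", "speaking"]
person_at_0 = node_features("PERSON", sentence, 0)
other_at_2 = node_features("OTHER", sentence, 2)
```

Only the block belonging to the chosen label is nonzero, which is the sense in which there is "a separate set of weights for each output configuration".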
This representation can lead to a large number of features, which can have
significant memory and time requirements. For example, in matching state-of-the-art results on a standard natural language task, Sha and Pereira [49] use 3.8 million features. Not
all of these features are ever nonzero in the training data. In particular, some
observation functions $q_{pk}$ are nonzero only for certain output configurations. This
point can be confusing: One might think that such features can have no effect on
the likelihood, but actually they do affect $Z(x)$, so putting a negative weight on
them can improve the likelihood by making wrong answers less likely. In order to
save memory, however, sometimes these unsupported features, that is, those which
never occur in the training data, are removed from the model. In practice, however,
including unsupported features typically results in better accuracy.
In order to get the benefits of unsupported features with less memory, we have
had success with an ad hoc technique for selecting only a few unsupported features.
The main idea is to add unsupported features only for likely paths, as follows: first
train a CRF without any unsupported features, stopping after only a few iterations;
then add unsupported features $f_{pk}(y_c, x_c)$ for cases where $x_c$ occurs in the training
data, and $p(y_c \mid x) > \epsilon$ for some threshold $\epsilon$. McCallum [28] presents a more principled method of feature
selection for CRFs.
Second, if the observations are categorical rather than ordinal, that is, if they
are discrete but have no intrinsic order, it is important to convert them to binary
features. For example, it makes sense to learn a linear weight on $f_k(y, x_t)$ when $f_k$
is 1 if $x_t$ is the word "dog" and 0 otherwise, but not when $f_k$ is the integer index
of word $x_t$ in the text's vocabulary. Thus, in text applications, CRF features are
typically binary; in other application areas, such as vision and speech, they are
more commonly real-valued.
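A minimal sketch of the conversion, with a made-up three-word vocabulary: the integer index is an arbitrary code with no meaningful magnitude, whereas each binary indicator can receive its own weight.

```python
def one_hot(word, vocabulary):
    """Binary indicator features: one per vocabulary entry."""
    return [1.0 if word == v else 0.0 for v in vocabulary]

vocabulary = ["cat", "dog", "walking"]   # hypothetical vocabulary
integer_index = vocabulary.index("dog")  # arbitrary code, not a magnitude
binary = one_hot("dog", vocabulary)      # a separate weight per word
```

A linear weight on `integer_index` would force the model to treat "walking" as twice as much of something as "dog", which is meaningless; the indicator encoding avoids this.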
Third, in language applications, it is sometimes helpful to include redundant
factors in the model. For example, in a linear-chain CRF, one may choose to include
both edge factors $\Psi_t(y_t, y_{t-1}, x_t)$ and variable factors $\Psi_t(y_t, x_t)$. Although one could
define the same family of distributions using only edge factors, the redundant node
factors provide a kind of backoff, which is useful when there is too little data.
In language applications, there is always too little data, even when hundreds of
thousands of words are available.
Finally, often the probabilities involved in forward-backward and belief propagation become too small to be represented within numeric precision. There are two
standard approaches to this common problem. One approach is to normalize each
of the vectors $\alpha_t$ and $\beta_t$ to sum to 1, thereby magnifying small values. A second
approach is to perform computations in the logarithmic domain, e.g., the forward
recursion becomes
\[
\log \alpha_t(j) = \bigoplus_{i \in S} \Big( \log \Psi_t(j, i, x_t) + \log \alpha_{t-1}(i) \Big), \tag{4.55}
\]
where $\oplus$ is the operator $a \oplus b = \log(e^a + e^b)$. At first, this does not seem much of
an improvement, since numeric precision is lost when computing $e^a$ and $e^b$. But $\oplus$
can be computed as
\[
a \oplus b = a + \log(1 + e^{b-a}) = b + \log(1 + e^{a-b}), \tag{4.56}
\]
which can be much more numerically stable, particularly if we pick the version of
the identity with the smaller exponent. CRF implementations often use the log-space approach because it makes computing $Z(x)$ more convenient, but in some
applications, the computational expense of taking logarithms is an issue, making
normalization preferable.
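The log-domain recursion (4.55) with the stable form of $\oplus$ from (4.56) can be sketched as below. The table layout `log_psi[t][j][i]` and the convention of a fixed start state 0 before the first position are our assumptions for the sake of a runnable example.

```python
import math

def log_add(a, b):
    """a (+) b = log(e^a + e^b), exponentiating only the non-positive gap."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    hi, lo = (a, b) if a >= b else (b, a)
    return hi + math.log1p(math.exp(lo - hi))

def forward_log(log_psi, n_states):
    """Return log Z(x), where log_psi[t][j][i] = log Psi_t(j, i, x_t)."""
    # Assumed convention: all mass starts in state 0 before position 1.
    log_alpha = [0.0 if i == 0 else -math.inf for i in range(n_states)]
    for table in log_psi:
        new_alpha = []
        for j in range(n_states):
            acc = -math.inf
            for i in range(n_states):
                acc = log_add(acc, table[j][i] + log_alpha[i])
            new_alpha.append(acc)
        log_alpha = new_alpha
    total = -math.inf
    for value in log_alpha:
        total = log_add(total, value)
    return total

# 200 positions, 2 states, every factor equal to 1e-5: Z = (2e-5)**200,
# which underflows to 0.0 as a float but is fine in the log domain.
T, S = 200, 2
log_psi = [[[math.log(1e-5)] * S for _ in range(S)] for _ in range(T)]
log_z = forward_log(log_psi, S)
```

Choosing `hi` as the larger argument guarantees the exponent `lo - hi` is at most zero, which is the "smaller exponent" version of the identity.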
4.5
Skip-Chain CRFs
In this section, we present a case study of applying a general CRF to a practical
natural language problem. In particular, we consider a problem in information
extraction, the task of building a database automatically from unstructured text.
Recent work in extraction has often used sequence models, such as HMMs and
linear-chain CRFs, which model dependencies only between neighboring labels, on
the assumption that those dependencies are the strongest.
But sometimes it is important to model certain kinds of long-range dependencies
between entities. One important kind of dependency within information extraction
occurs on repeated mentions of the same field. When the same entity is mentioned
more than once in a document, such as "Robert Booth", in many cases all mentions
have the same label, such as Seminar-Speaker. We can take advantage of this
fact by favoring labelings that treat repeated words identically, and by combining
features from all occurrences so that the extraction decision can be made based on
global information. Furthermore, identifying all mentions of an entity can be useful
in itself, because each mention might contain different useful information. However,
most extraction systems, whether probabilistic or not, do not take advantage of
this dependency, instead treating the separate mentions independently.
Figure 4.5
4.5.1
Model
Table 4.1: Input features $q_k(x, t)$
$w_t = w$
$w_t$ matches [A-Z][a-z]+
$w_t$ matches [A-Z][A-Z]+
$w_t$ matches [A-Z]
$w_t$ matches [A-Z]+
$w_t$ matches [A-Z]+[a-z]+[A-Z]+[a-z]
$w_t$ appears in list of first names, last names, honorifics, etc.
$w_t$ appears to be part of a time followed by a dash
$w_t$ appears to be part of a time preceded by a dash
$w_t$ appears to be part of a date
$T_t = T$
$q_k(x, t + \delta)$ for all $k$ and $\delta \in [-4, 4]$
words that belong to the same stem class, or have small edit distance. In addition,
we must be careful not to include too many skip edges, because this could result
in a graph that makes approximate inference difficult. So we need to use similarity
metrics that result in a sufficiently sparse graph. In the experiments below, we focus
on named-entity recognition, so we connect pairs of identical capitalized words.
Formally, the skip-chain CRF is defined as a general CRF with two clique
templates: one for the linear-chain portion, and one for the skip edges. For a sentence
$x$, let $\mathcal{I} = \{(u, v)\}$ be the set of all pairs of sequence positions for which there are
skip edges. For example, in the experiments reported here, $\mathcal{I}$ is the set of indices of
all pairs of identical capitalized words. Then the probability of a label sequence $y$
given an input $x$ is modeled as
\[
p_\theta(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{T} \Psi_t(y_t, y_{t-1}, x) \prod_{(u,v) \in \mathcal{I}} \Psi_{uv}(y_u, y_v, x), \tag{4.57}
\]
where $\Psi_t$ are the factors for linear-chain edges, and $\Psi_{uv}$ are the factors over skip
edges. These factors are defined as
\begin{align}
\Psi_t(y_t, y_{t-1}, x) &= \exp\left\{ \sum_{k} \lambda_{1k} f_{1k}(y_t, y_{t-1}, x, t) \right\}, \tag{4.58} \\
\Psi_{uv}(y_u, y_v, x) &= \exp\left\{ \sum_{k} \lambda_{2k} f_{2k}(y_u, y_v, x, u, v) \right\}, \tag{4.59}
\end{align}
where $\theta_1 = \{\lambda_{1k}\}_{k=1}^{K_1}$ are the parameters of the linear-chain template, and $\theta_2 = \{\lambda_{2k}\}_{k=1}^{K_2}$ are the parameters of the skip template. The full set of model parameters
is $\theta = \{\theta_1, \theta_2\}$.
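As a rough sketch of the model structure in (4.57), the code below builds the skip-edge set for identical capitalized words and sums chain and skip log-factors into an unnormalized score. The factor functions are supplied by the caller and are hypothetical; connecting every pair of repeated occurrences (rather than, say, only adjacent repeats) is our reading of "all pairs" and should be treated as an assumption.

```python
def skip_edges(tokens):
    """Return the set I as a sorted list of pairs (u, v), u < v,
    of positions holding identical capitalized tokens."""
    positions = {}
    for t, tok in enumerate(tokens):
        if tok[:1].isupper():
            positions.setdefault(tok, []).append(t)
    edges = []
    for occ in positions.values():
        for a in range(len(occ)):
            for b in range(a + 1, len(occ)):
                edges.append((occ[a], occ[b]))
    return sorted(edges)

def log_score(y, tokens, log_psi_chain, log_psi_skip):
    """Unnormalized log-probability: chain terms plus skip-edge terms.

    log_psi_chain(y_t, y_prev, tokens, t) and
    log_psi_skip(y_u, y_v, tokens, u, v) are caller-supplied log-factors.
    """
    total = 0.0
    prev = None  # no predecessor label at the first position
    for t, y_t in enumerate(y):
        total += log_psi_chain(y_t, prev, tokens, t)
        prev = y_t
    for u, v in skip_edges(tokens):
        total += log_psi_skip(y[u], y[v], tokens, u, v)
    return total

tokens = "Speaker : Robert Booth . Robert Booth is manager".split()
edges = skip_edges(tokens)
```

Normalizing this score requires $Z(x)$, which is where the loops introduced by the skip edges make exact inference expensive.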
As described in section 4.4.6, both the linear-chain features and skip-chain
features are factorized into indicator functions of the outputs and observation
functions, as in (4.54). In general the observation functions $q_k(x, t)$ can depend
on arbitrary positions of the input string. For example, a useful feature for NER is
$q_k(x, t) = 1$ if and only if $x_{t+1}$ is a capitalized word.
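The observation functions for the skip edges are chosen to combine the observations from each endpoint. Formally, we define the feature functions for the skip
edges to factorize as
\[
f_{2k}(y_u, y_v, x, u, v) = \mathbf{1}_{\{y_u = \tilde{y}_u\}}\, \mathbf{1}_{\{y_v = \tilde{y}_v\}}\, q_k(x, u, v). \tag{4.60}
\]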
This choice allows the observation functions $q_k(x, u, v)$ to combine information from
the neighborhood of $y_u$ and $y_v$. For example, one useful feature is $q_k(x, u, v) = 1$
if and only if $x_u = x_v =$ "Booth" and $x_{v-1} =$ "Speaker:". This can be a useful
feature if the context around $x_u$, such as "Robert Booth is manager of control
engineering. . . ," may not make clear whether or not Robert Booth is presenting a
talk, but the context around $x_v$ is clear, such as "Speaker: Robert Booth."¹
Because the loops in a skip-chain CRF can be long and overlapping, exact
inference is intractable for the data we consider. The running time required by exact
inference is exponential in the size of the largest clique in the graph's junction tree.
In junction trees created from the seminars data, 29 of the 485 instances have a
maximum clique size of 10 or greater, and 11 have a maximum clique size of 14
or greater. (The worst instance has a clique with 61 nodes.) These cliques are far
too large to perform inference exactly. For reference, representing a single factor
that depends on 14 variables requires more memory than can be addressed in a
32-bit architecture. Instead, we perform approximate inference using loopy belief
propagation, which was mentioned in section 4.4.4. We use an asynchronous tree-based schedule known as tree-based reparameterization (TRP) [59].
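The memory claim can be checked with a line of arithmetic. The sketch below assumes five labels per variable (the four fields plus a background label); that count is our assumption, since the label set size is not stated at this point in the text.

```python
n_labels = 5                       # assumption: 4 fields + 1 "other" label
clique_size = 14
entries = n_labels ** clique_size  # one table entry per joint assignment
addressable_bytes = 2 ** 32        # size of a 32-bit address space
too_large = entries > addressable_bytes
```

Under this assumption the table has about 6.1 billion entries, so even at one byte per entry it exceeds the roughly 4.3 billion bytes addressable with 32 bits.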
4.5.2
Results
1. This example is taken from an actual error made by a linear-chain CRF on the seminars
data set. We present results from this data set in section 4.5.2.
Field
stime       96.0   97.5   96.7
etime       98.8   97.5   97.2
location    87.1   88.3   88.1
speaker     76.9   77.3   80.4
overall     89.7   90.2   90.6

Table 4.3
Field       Linear-chain   Skip-chain
stime       12.6           17
etime       3.2            5.2
location    6.4            0.6
speaker     30.2           4.8
useful to find both such mentions, because different information can occur in the
surrounding context of each mention: for example, the first mention might be near
an institutional affiliation, while the second mentions that Smith is a professor.
We evaluate a skip-chain CRF with skip edges between identical capitalized
words. The motivation for this is that the hardest aspect of this data set is
identifying speakers and locations, and capitalized words that occur multiple times
in a seminar announcement are likely to be either speakers or locations.
Table 4.1 shows the list of input features we used. For a skip edge $(u, v)$, the
input features we used were the disjunction of the input features at $u$ and $v$, that
is,
\[
q_k(x, u, v) = q_k(x, u) \lor q_k(x, v), \tag{4.61}
\]
where $\lor$ is binary or. All of our results are averaged over five-fold cross-validation
with an 80/20 split of the data. We report results from both a linear-chain CRF
and a skip-chain CRF with the same set of input features.
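The disjunction in (4.61) amounts to an elementwise OR of the two endpoints' binary observation vectors. A minimal sketch, with a hypothetical function name and made-up input vectors:

```python
def skip_observations(q_u, q_v):
    """q_k(x, u, v) = q_k(x, u) OR q_k(x, v), for each binary feature k."""
    return [int(bool(a) or bool(b)) for a, b in zip(q_u, q_v)]

# Hypothetical observation vectors at the two endpoints of a skip edge
q_at_u = [1, 0, 0]
q_at_v = [0, 0, 1]
q_skip = skip_observations(q_at_u, q_at_v)
```

The skip factor therefore sees any evidence that fires at either mention, which is how context near one occurrence can inform the label of the other.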
We calculate precision and recall as²
P =
2. Previous work on this data set has traditionally measured precision and recall per
document, that is, from each document the system extracts only one field of each type.
Because the goal of the skip-chain CRF is to extract all mentions in a document, these
metrics are inappropriate, so we cannot compare with this previous work. Peshkin and
Pfeffer [39] do use the per-token metric (personal communication), so our comparison is
fair in that respect.
4.5.3
Related Work
Recently, Bunescu and Mooney [5] have used a relational Markov network to
collectively classify the mentions in a document, achieving increased accuracy by
learning dependencies between similar mentions. In their work, however, candidate
phrases are extracted heuristically, which can introduce errors if a true entity is
4.6
Conclusion
CRFs are a natural choice for many relational problems because they allow both
graphically representing dependencies between entities, and including rich observed
features of entities. In this chapter, we have presented a tutorial on CRFs, covering
both linear-chain models and general graphical structures. Also, as a case study in
CRFs for collective classification, we have presented the skip-chain CRF, a type of
general CRF that performs joint segmentation and collective labeling on a practical
language understanding task.
The main disadvantage of CRFs is the computational expense of training. Although CRF training is feasible for many real-world problems, the need to perform
inference repeatedly during training becomes a computational burden when there
are a large number of training instances, when the graphical structure is complex,
when there are latent variables, or when the output variables have many outcomes.
One focus of current research [1, 53, 60] is on more efficient parameter estimation
techniques.
Acknowledgments
We thank Tom Minka and Jerod Weinman for helpful conversations, and we thank
Francine Chen and Benson Limketkai for useful comments. This work was supported
in part by the Center for Intelligent Information Retrieval; in part by the Defense
Advanced Research Projects Agency (DARPA), the Department of the Interior,
NBC, Acquisition Services Division, under contract number NBCHD030010; and
in part by the Central Intelligence Agency, the National Security Agency, and the
National Science Foundation under NSF grants #IIS-0427594 and #IIS-0326249.
Any opinions, findings and conclusions or recommendations expressed in this
material are the authors' and do not necessarily reflect those of the sponsors.
References
[1] P. Abbeel, D. Koller, and A. Y. Ng. Learning factor graphs in polynomial time
and sample complexity. In Proceedings of the Conference on Uncertainty in
Artificial Intelligence, 2005.
[2] Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov support vector
machines. In Proceedings of the International Conference on Machine Learning,
2003.
[3] D. Bertsekas. Nonlinear Programming. Athena Scientific, Nashua, NH, 2nd
edition, 1999.
[4] J. Besag. Efficiency of pseudolikelihood estimation for simple Gaussian fields.
Biometrika, 64(3):616–618, 1977.
[5] R. Bunescu and R. J. Mooney. Collective information extraction with relational
Markov networks. In Proceedings of the Annual Meeting of the Association for
Computational Linguistics, 2004.
[6] R. Byrd, J. Nocedal, and R. Schnabel. Representations of quasi-Newton matrices and their use in limited memory methods. Mathematical Programming,
63(2):129–156, 1994. ISSN 0025-5610.
[7] R. Caruana and A. Niculescu-Mizil. An empirical comparison of supervised
learning algorithms using different performance metrics. Technical Report
TR2005-1973, Cornell University, Ithaca, NY, 2005.
[8] Y. Choi, C. Cardie, E. Riloff, and S. Patwardhan. Identifying sources of
opinions with conditional random fields and extraction patterns. In Proceedings
of Human Language Technology Conference and North American Chapter of
the Association for Computational Linguistics, 2005.
[9] S. Clark and J. Curran. Parsing the WSJ using CCG and log-linear models.
In Proceedings of the Annual Meeting of the Association for Computational
Linguistics, 2004.
[24] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the
International Conference on Machine Learning, 2001.
[25] Y. Liu, J. Carbonell, P. Weigele, and V. Gopalakrishnan. Segmentation conditional random fields (SCRFs): A new approach for protein fold recognition.
In Proceedings of the ACM International Conference on Research in Computational Molecular Biology, 2005.
[26] R. Malouf. A comparison of algorithms for maximum entropy parameter
estimation. In Proceedings of the Conference on Natural Language Learning,
2002.
[27] D. McAllester, M. Collins, and F. Pereira. Case-factor diagrams for structured
probabilistic modeling. In Proceedings of the Conference on Uncertainty in
Artificial Intelligence, 2004.
[28] A. McCallum. Efficiently inducing features of conditional random fields. In
Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2003.
[29] A. McCallum and D. Jensen. A note on the unification of information
extraction and data mining using conditional-probability, relational models.
In IJCAI'03 Workshop on Learning Statistical Models from Relational Data,
2003.
[30] A. McCallum and W. Li. Early results for named entity recognition with
conditional random fields, feature induction and web-enhanced lexicons. In
Proceedings of the Conference on Natural Language Learning, 2003.
[31] A. McCallum and B. Wellner. Conditional models of identity uncertainty
with application to noun coreference. In Proceedings of Neural Information
Processing Systems, 2005.
[32] A. McCallum, D. Freitag, and F. Pereira. Maximum entropy Markov models
for information extraction and segmentation. In Proceedings of the International Conference on Machine Learning, 2000.
[33] A. McCallum, K. Bellare, and F. Pereira. A conditional random field for
discriminatively-trained finite-state string edit distance. In Proceedings of the
Conference on Uncertainty in Artificial Intelligence, 2005.
[34] T. Minka. Discriminative models, not discriminative training. Technical Report MSR-TR-2005-144, Microsoft Research, October 2005.
ftp://ftp.research.microsoft.com/ pub/tr/TR-2005-144.pdf .
[35] T. P. Minka. A comparison of numerical optimizers for logistic regression.
Technical report, Dept. of Statistics, Carnegie Mellon University, Pittsburgh,
2003.
[36] A. Ng and M. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Proceedings of Neural Information
Processing Systems, 2002.
http://www.cs.umass.edu/ casutton/publications.html.
[52] C. Sutton and A. McCallum. Collective segmentation and labeling of distant
entities in information extraction. Technical Report TR # 04-49, University of
Massachusetts, 2004. Presented at ICML Workshop on Statistical Relational
Learning and Its Connections to Other Fields.
[53] C. Sutton and A. McCallum. Piecewise training of undirected models. In
Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2005.
[54] C. Sutton, K. Rohanimanesh, and A. McCallum. Dynamic conditional random
fields: Factorized probabilistic models for labeling and segmenting sequence
data. In Proceedings of the International Conference on Machine Learning,
2004.
[55] B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for
relational data. In Proceedings of the Conference on Uncertainty in Artificial
Intelligence, 2002.
[56] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In
Proceedings of Neural Information Processing Systems, 2004.
[57] P. Viola and M. Narasimhan. Learning to extract information from semistructured text using a discriminative context free grammar. In Proceedings of
the ACM International Conference on Information Retrieval, 2005.
[58] S.V.N. Vishwanathan, N. Schraudolph, M. Schmidt, and K. Murphy. Accelerated training of conditional random fields with stochastic meta-descent. In
Proceedings of the International Conference on Machine Learning, 2006.
[59] M. Wainwright, T. Jaakkola, and A. Willsky. Tree-based reparameterization
for approximate estimation on graphs with cycles. In Proceedings of Neural
Information Processing Systems, 2001.
[60] M. Wainwright, T. Jaakkola, and A. Willsky. Tree-reweighted belief propagation and approximate ML estimation by pseudo-moment matching. In Ninth
Workshop on Artificial Intelligence and Statistics, 2003.
[61] H. Wallach. Efficient training of conditional random fields. MSc thesis,
University of Edinburgh, 2002.
[62] B. Wellner, A. McCallum, F. Peng, and M. Hay. An integrated, conditional
model of information extraction and coreference with application to citation
graph construction. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2004.
[63] J. Yedidia, W. T. Freeman, and Y. Weiss. Constructing free energy approximations and generalized belief propagation algorithms. Technical Report
TR2004-040, Mitsubishi Electric Research Laboratories, Cambridge, MA, 2004.
Lise Getoor, Nir Friedman, Daphne Koller, Avi Pfeffer and Ben Taskar
Probabilistic relational models (PRMs) are a rich representation language for structured statistical models. They combine a frame-based logical representation with
probabilistic semantics based on directed graphical models (Bayesian networks).
This chapter gives an introduction to probabilistic relational models, describing semantics for attribute uncertainty, structural uncertainty, and class uncertainty. For
each case, learning algorithms and some sample results are presented.
5.1
Introduction
Over the last decade, Bayesian networks have been used with great success in
a wide variety of real-world and research applications. However, despite their
success, Bayesian networks are often inadequate for representing large and complex
domains. A Bayesian network for a given domain involves a prespecified set of
random variables, whose relationship to each other is fixed in advance. Hence, a
Bayesian network cannot be used to deal with domains where we might encounter a
varying number of entities in a variety of configurations. This limitation of Bayesian
networks is a direct consequence of the fact that they lack the concept of an object
(or domain entity). Hence, they cannot represent general principles about multiple
similar objects which can then be applied in multiple contexts.
Probabilistic relational models (PRMs) [13, 18] extend Bayesian networks with
the concepts of objects, their properties, and relations between them. In a way,
they are to Bayesian networks as relational logic is to propositional logic. A PRM
specifies a template for a probability distribution over a database. The template
includes a relational component that describes the relational schema for our domain,
and a probabilistic component that describes the probabilistic dependencies that
hold in our domain. A PRM has a coherent formal semantics in terms of probability
distributions over sets of relational logic interpretations. Given a set of ground
objects, a PRM specifies a probability distribution over a set of interpretations
involving these objects (and perhaps other objects as well). A PRM, together with
5.2
PRM Representation
The two components of PRM syntax are a logical description of the domain
of discourse and a probabilistic graphical model template which describes the
probabilistic dependencies in the domain. Here we describe the logical description
of the domain as a relational schema, although it can be transformed into either a
frame-based representation or a logic-based syntax in a relatively straightforward
manner. Our probabilistic graphical component is depicted pictorially, although
it can also be represented in a logical formalism; for example in the probabilistic
relational language of [10]. We begin by describing the syntax and semantics for
PRMs which have the simplest form of uncertainty, attribute uncertainty, and then
move on to describing various forms of structural uncertainty.
5.2.1
Relational Language
The relational language allows us to describe the kinds of objects in our domain.
For example, figure 5.1(a) shows the schema for a simple domain that we will be
using as our running example. The domain is that of a university, and contains
professors, students, courses, and course registrations. The classes in the schema
are Professor, Student, Course, and Registration.
More formally, a schema for a relational model describes a set of classes, X =
{X1 , . . . , Xn }. Each class is associated with a set of descriptive attributes. For
example, professors may have descriptive attributes such as popularity and teaching
ability; courses may have descriptive attributes such as rating and difficulty.
The set of descriptive attributes of a class X is denoted A(X). Attribute A of
class X is denoted X.A, and its space of values is denoted V(X.A). We assume
here that value spaces are finite. For example, the Student class has the descriptive
[Figure 5.1 graphic. (a) Schema: Professor (Popularity, Teaching-Ability); Student (Intelligence, Ranking); Course (Instructor, Rating, Difficulty); Registration (Course, Student, Grade, Satisfaction). (b) Instance: students John Doe and Jane Doe, Prof. Gump, courses Phil101 and Phil142, and registration records such as #5639.]
Figure 5.1 (a) A relational schema for a simple university domain. The underlined
attributes are reference slots of the class and the dashed lines indicate the types
of objects referenced. (b) An example instance of this schema. Here we do not
show the values of the reference slots; we simply use dashed lines to indicate the
relationships that hold between objects.
attributes Intelligence and Ranking. The value space for Student.Intelligence in this
example is {high, low}.
In addition, we need a method for allowing an object to refer to another object.
For example we may want a course to have a reference to the instructor of the
course. And a registration record should refer both to the associated course and to
the student taking the course.
The simplest way of achieving this effect is using reference slots. Specifically, each
class is associated with a set of reference slots. The set of reference slots of a class X
is denoted R(X). We use X.ρ to denote the reference slot ρ of X. Each reference slot
ρ is typed, i.e., the schema specifies the range type of object that may be referenced.
More formally, for each ρ in R(X), the domain type Dom[ρ] is X and the range type
Range[ρ] is Y for some class Y in X. For example, the class Course has reference
slot Instructor with range type Professor, and class Registration has reference slots
Course and Student. In figure 5.1(a) the reference slots are underlined.
There is a direct mapping between our representation and that of relational
databases. Each class corresponds to a single table and each attribute corresponds
to a column. Our descriptive attributes correspond to standard attributes in the
table, and our reference slots correspond to attributes that are foreign keys (key
attributes of another table).
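The schema of the running example can be written down directly as data. The following is a minimal sketch, assuming a hypothetical `ClassSchema` record and a helper `foreign_keys`; neither name comes from the chapter, and every value space other than that of Student.Intelligence is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ClassSchema:
    """One class X: descriptive attributes A(X) with value spaces, and reference slots R(X)."""
    name: str
    attributes: dict                            # attribute name -> tuple of values, V(X.A)
    slots: dict = field(default_factory=dict)   # slot name -> name of the range class


# Value spaces other than Student.Intelligence are assumptions for illustration.
SCHEMA = {
    c.name: c
    for c in [
        ClassSchema("Professor", {"Popularity": ("low", "high"),
                                  "Teaching-Ability": ("low", "medium", "high")}),
        ClassSchema("Student", {"Intelligence": ("high", "low"),
                                "Ranking": ("low", "average", "high")}),
        ClassSchema("Course", {"Rating": ("low", "high"), "Difficulty": ("low", "high")},
                    slots={"Instructor": "Professor"}),
        ClassSchema("Registration", {"Grade": ("A", "B", "C"), "Satisfaction": (1, 2, 3)},
                    slots={"Course": "Course", "Student": "Student"}),
    ]
}


def foreign_keys(schema):
    """List (table, column, referenced table); every slot must be typed by a known class."""
    keys = []
    for cls in schema.values():
        for slot, target in cls.slots.items():
            if target not in schema:
                raise ValueError(f"{cls.name}.{slot} refers to unknown class {target}")
            keys.append((cls.name, slot, target))
    return keys
```

Each entry returned by `foreign_keys` corresponds to one foreign-key column in the relational-database reading of the schema.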
For each reference slot ρ, we can define an inverse slot ρ⁻¹, which is interpreted
as the inverse function of ρ. For example, we can define an inverse slot for the
Student slot of Registration and call it Registered-In. Note that this is not a one-to-one relation, but returns a set of Registration objects. More formally, if Dom[ρ]
is X and Range[ρ] is Y, then Dom[ρ⁻¹] is Y and Range[ρ⁻¹] is X.
Finally, we define the notion of a slot chain, which allows us to compose slots,
defining functions from objects to other objects to which they are indirectly related. More precisely, we define a slot chain ρ_1, ..., ρ_k to be a sequence of slots
(inverse or otherwise) such that for all i, Range[ρ_i] = Dom[ρ_{i+1}]. For example,
Student.Registered-In.Course.Instructor can be used to denote a student's instructors. Note that a slot chain describes a set of objects from a class (see footnote 1).
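A slot chain can be evaluated by composing set-valued lookups. The sketch below assumes a toy instance stored in plain dictionaries; the object ids, the links between them, and the helper names `follow` and `follow_inverse` are all hypothetical.

```python
# Toy instance: each object is a dict whose reference slots hold object ids.
registrations = {
    "r1": {"Student": "jane", "Course": "phil101"},
    "r2": {"Student": "jane", "Course": "phil142"},
    "r3": {"Student": "john", "Course": "phil101"},
}
courses = {"phil101": {"Instructor": "gump"}, "phil142": {"Instructor": "gump"}}


def follow(table, slot, ids):
    """Apply a reference slot to a set of object ids; the result is a set."""
    return {table[i][slot] for i in ids}


def follow_inverse(table, slot, ids):
    """Apply the inverse slot: all objects in `table` whose `slot` points into `ids`."""
    return {i for i, obj in table.items() if obj[slot] in ids}


def instructors_of(student):
    """Evaluate Student.Registered-In.Course.Instructor for one student."""
    regs = follow_inverse(registrations, "Student", {student})  # Registered-In
    taken = follow(registrations, "Course", regs)
    return follow(courses, "Instructor", taken)
```

Because every step returns a set, a professor who teaches two of a student's courses appears only once, which matches the set (rather than multiset) reading of slot chains.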
The relational framework we have just described is motivated primarily by the
concepts of relational databases, although some of the notation is derived from
frame-based and object-oriented systems. However, the framework is a fully general
one, and is equivalent to the standard vocabulary and semantics of relational logic.
5.2.2
Schema Instantiation
Probabilistic Model
1. It is also possible to define slot chains as multi-sets of objects; here we have found it
sufficient to make them sets of objects, but there may be domains where multi-sets are
desirable.
[Figure 5.2 graphic. (a) The objects of figure 5.1(b), with unobserved attribute values shown as ???. (b) Dependency arrows among Teaching-Ability, Popularity, Rating, Difficulty, Intelligence, Ranking, Satisfaction, and Grade; two of the dependencies pass through AVG aggregates.]
Figure 5.2 (a) The relational skeleton for the university domain. (b) The PRM
dependency structure for our university example.
each attribute of each object in the skeleton. A PRM then specifies a probability
distribution over completions I of the skeleton.
A PRM consists of two components: the qualitative dependency structure, S,
and the parameters associated with it, θ_S. The dependency structure is defined by
associating with each attribute X.A a set of parents Pa(X.A). These correspond
to formal parents; they will be instantiated in different ways for different objects
in X. Intuitively, the parents are attributes that are direct influences on X.A. In
figure 5.2(b), the arrows define the dependency structure.
We distinguish between two types of formal parents. The attribute X.A can depend on another probabilistic attribute B of X. This formal dependence induces
a corresponding dependency for individual objects: for any object x in σ_r(X), x.A
will depend probabilistically on x.B. For example, in figure 5.2(b), a professor's
Popularity depends on her Teaching-Ability. The attribute X.A can also depend
on attributes of related objects X.K.B, where K is a slot chain. In figure 5.2(b),
the grade of a student depends on Registration.Student.Intelligence and Registration.Course.Difficulty. Or we can have a longer slot chain, for example, the dependence of student satisfaction on Registration.Course.Instructor.Teaching-Ability.
In addition, we can have a dependence of student ranking on Student.Registered-In.Grade. To understand the semantics of this formal dependence for an individual
object x, recall that x.K represents the set of objects that are K-relatives of x.
Except in cases where the slot chain is guaranteed to be single-valued, we must
specify the probabilistic dependence of x.A on the multiset {y.B : y ∈ x.K}.
For example, a student's rank depends on the grades in the courses in which he
or she is registered. However, each student may be enrolled in a different number
of courses, and we will need a method of compactly representing these complex
dependencies.
The notion of aggregation from database theory gives us an appropriate tool to
address this issue: x.A will depend probabilistically on some aggregate property of
this multiset. There are many natural and useful notions of aggregation of a set: its
mode (most frequently occurring value); its mean value (if values are numerical);
[Figure 5.3 graphic: the dependency structure of figure 5.2(b), annotated with the two CPDs below.]

(a) P(Registration.Grade | Difficulty, Intelligence)

D, I    A     B     C
h, h    0.5   0.4   0.1
h, l    0.1   0.5   0.4
l, h    0.8   0.1   0.1
l, l    0.3   0.6   0.1

(b) P(Student.Ranking | AVG of Student.Registered-In.Grade)

avg     l     m     h
A       0.1   0.2   0.7
B       0.2   0.4   0.4
C       0.6   0.3   0.1
Figure 5.3 (a) The CPD for Registration.Grade (b) The CPD for an aggregate
dependency of Student.Ranking on Student.Registered-In .Grade .
its median, maximum, or minimum (if values are ordered); its cardinality; etc.
In the preceding example, we can have a student's ranking depend on her grade
point average (GPA), or the average grade in her courses (or in the case where the
grades are represented as letters, we may use median; in our example we blur the
distinction and assume that average is defined appropriately).
More formally, our language allows a notion of an aggregate γ; γ takes a multiset
of values of some ground type, and returns a summary of it. The type of the
aggregate can be the same as that of its arguments. However, we allow other types
as well, e.g., an aggregate that reports the size of the set. We allow X.A to have
as a parent γ(X.K.B); the semantics is that for any x ∈ X, x.A will depend on
the value of γ(x.K.B). In our example PRM, there are two aggregate dependencies
defined, one that specifies that the ranking of a student depends on the average of
her grades and one that specifies that the rating of a course depends on the average
satisfaction of students in the course.
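The aggregates mentioned here are ordinary functions from a multiset of values to a summary value. A minimal sketch follows, assuming letter grades ordered C < B < A; the function names and the choice of the lower median for even-sized multisets are illustrative assumptions, not part of the PRM language.

```python
from collections import Counter
from statistics import median_low

GRADE_ORDER = ["C", "B", "A"]  # assumed ordering, worst to best


def agg_mode(values):
    """Most frequently occurring value (ties go to the value seen first)."""
    return Counter(values).most_common(1)[0][0]


def agg_median(values, order):
    """Median of ordinal values: rank them, take the lower middle rank, map back."""
    ranks = [order.index(v) for v in values]
    return order[median_low(ranks)]


def agg_cardinality(values):
    """An aggregate whose type differs from that of its arguments: the multiset size."""
    return len(values)
```

For a student with grades A, B, A, `agg_median` returns the summary value on which her ranking would depend; note that `agg_mode` and `agg_median` raise an exception on an empty multiset, a case a full implementation would have to treat separately.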
Given a set of parents Pa(X.A) for X.A, we can define a local probability model
for X.A. We associate X.A with a conditional probability distribution (CPD) that
specifies P(X.A | Pa(X.A)). We require that the CPDs are legal. Figure 5.3 shows
two CPDs. Let U be the set of parents of X.A, U = Pa(X.A). Each of these parents
U_i (whether a simple attribute in the same relation or an aggregate of a set of
K-relatives) has a set of values V(U_i) in some ground type. For each tuple of values
u ∈ V(U), we specify a distribution P(X.A | u) over V(X.A). This entire set of
parameters comprises θ_S.
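A tabular CPD is a map from tuples of parent values to distributions over the child's values. The sketch below encodes the CPD of figure 5.3(a), assuming that the abbreviations h and l stand for high and low and that "legal" means each row is a proper distribution; the helper name `is_legal` is hypothetical.

```python
import math

# Rows are keyed by (Course.Difficulty, Student.Intelligence); numbers from figure 5.3(a).
GRADE_CPD = {
    ("high", "high"): {"A": 0.5, "B": 0.4, "C": 0.1},
    ("high", "low"):  {"A": 0.1, "B": 0.5, "C": 0.4},
    ("low", "high"):  {"A": 0.8, "B": 0.1, "C": 0.1},
    ("low", "low"):   {"A": 0.3, "B": 0.6, "C": 0.1},
}


def is_legal(cpd, values):
    """Assumed reading of 'legal': each row is a distribution over exactly `values`."""
    return all(
        set(row) == set(values)
        and all(p >= 0.0 for p in row.values())
        and math.isclose(sum(row.values()), 1.0)
        for row in cpd.values()
    )
```

The four rows together are the parameters for Registration.Grade; the same table is reused for every registration object.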
Definition 5.2
A probabilistic relational model (PRM) Π for a relational schema R is defined as
follows. For each class X ∈ X and each descriptive attribute A ∈ A(X), we have:
a set of parents Pa(X.A) = {U_1, ..., U_l}, where each U_i has the form X.B or
γ(X.K.B), where K is a slot chain and γ is an aggregate of X.K.B;
PRM Semantics
P(I \mid \sigma_r, \Pi) = \prod_{X_i} \prod_{A \in \mathcal{A}(X_i)} \prod_{x \in \sigma_r(X_i)} P(I_{x.A} \mid I_{\mathrm{Pa}(x.A)})    (5.1)
This expression is very similar to the chain rule for Bayesian networks. There
are three primary differences. First, our random variables are the attributes of a
set of objects. Second, the set of parents of a random variable can vary according
to the relational context of the object, that is, the set of objects to which it is related.
Third, the parameters are shared; the parameters of the local probability models
for attributes of objects in the same class are identical.
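The product in (5.1) can be evaluated directly once an instance is complete. The following sketch covers only three attributes of the university example; the priors for Intelligence and Difficulty are made-up numbers, the Grade CPD is the one in figure 5.3(a), and the function name `log_prob` is hypothetical.

```python
import math

P_INTELLIGENCE = {"high": 0.3, "low": 0.7}   # assumed prior, for illustration only
P_DIFFICULTY = {"high": 0.5, "low": 0.5}     # assumed prior, for illustration only
P_GRADE = {                                  # keyed by (Difficulty, Intelligence)
    ("high", "high"): {"A": 0.5, "B": 0.4, "C": 0.1},
    ("high", "low"):  {"A": 0.1, "B": 0.5, "C": 0.4},
    ("low", "high"):  {"A": 0.8, "B": 0.1, "C": 0.1},
    ("low", "low"):   {"A": 0.3, "B": 0.6, "C": 0.1},
}


def log_prob(students, courses, registrations):
    """Sum of log CPD entries over every attribute of every object."""
    lp = 0.0
    for s in students.values():
        lp += math.log(P_INTELLIGENCE[s["Intelligence"]])
    for c in courses.values():
        lp += math.log(P_DIFFICULTY[c["Difficulty"]])
    for r in registrations:
        # The parents of r.Grade are found by following r's reference slots.
        difficulty = courses[r["Course"]]["Difficulty"]
        intelligence = students[r["Student"]]["Intelligence"]
        # Parameter sharing: one Grade CPD serves all registration objects.
        lp += math.log(P_GRADE[(difficulty, intelligence)][r["Grade"]])
    return lp


students = {"jane": {"Intelligence": "high"}, "john": {"Intelligence": "low"}}
courses = {"phil101": {"Difficulty": "low"}}
registrations = [
    {"Student": "jane", "Course": "phil101", "Grade": "A"},
    {"Student": "john", "Course": "phil101", "Grade": "B"},
]
```

The loop over registrations shows two of the three differences from a Bayesian network: the actual parents depend on the relational context, and the parameters are shared across objects of the same class.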
5.2.5
As in any definition of this type, we have to take care that the resulting function
from instances to numbers does indeed define a coherent probability distribution,
i.e., where the sum of the probability of all instances is 1. In Bayesian networks,
where the joint probability is also a product of CPDs, this requirement is satisfied
if the dependency graph is acyclic: a variable is not an ancestor of itself. A similar
condition is sufficient to ensure coherence in PRMs as well.
5.2.5.1
We want to ensure that our probabilistic dependencies are acyclic, so that a random
variable does not depend, directly or indirectly, on its own value. To do so, we can
consider the graph of dependencies among attributes of objects in the skeleton,
which we will call the instance dependency graph, G_{σ_r}.
Definition 5.4
The instance dependency graph G_{σ_r} for a PRM Π and a relational skeleton σ_r has
a node for each descriptive attribute of each object x ∈ σ_r(X) in each class X ∈ X.
Each x.A has the following edges:
1. Type I edges: For each formal parent of x.A, X.B, we introduce an edge from
x.B to x.A.
2. Type II edges: For each formal parent X.K.B, and for each y ∈ x.K, we define
an edge from y.B to x.A.
Type I edges correspond to intra-object dependencies and type II edges correspond
to inter-object dependencies. We say that a dependency structure S is acyclic
relative to a relational skeleton σ_r if the instance dependency graph G_{σ_r} over the
variables x.A is acyclic. In this case, we are guaranteed that the PRM defines a
coherent probabilistic model over complete instantiations I consistent with σ_r:
Theorem 5.5
Let Π be a PRM whose dependency structure S is acyclic relative to a relational
skeleton σ_r. Then Π and σ_r define a coherent probability distribution over instantiations I that extend σ_r via (5.1).
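The acyclicity condition can be checked mechanically by building the instance dependency graph and testing it for cycles. The sketch below assumes a hypothetical encoding in which each formal parent is given as a tuple and slot chains are plain Python functions; the cycle test uses `graphlib.TopologicalSorter` from the standard library (Python 3.9 or later).

```python
from graphlib import CycleError, TopologicalSorter


def instance_dependency_graph(skeleton, formal_parents):
    """
    skeleton: {class name: list of object ids}.
    formal_parents: list of (cls, attr, chain, parent_attr); `chain` is None for a
    parent on the same object (type I) or a function x -> related objects (type II).
    Returns {node: set of parent nodes}, where a node is (object id, attribute).
    """
    preds = {}
    for cls, attr, chain, parent_attr in formal_parents:
        for x in skeleton[cls]:
            related = [x] if chain is None else chain(x)
            preds.setdefault((x, attr), set()).update((y, parent_attr) for y in related)
    return preds


def is_acyclic(preds):
    try:
        TopologicalSorter(preds).prepare()   # raises CycleError if a cycle exists
        return True
    except CycleError:
        return False


# Hypothetical skeleton: one student with two registrations.
skeleton = {"Student": ["jane"], "Registration": ["r1", "r2"]}
student_of = {"r1": "jane", "r2": "jane"}
registered_in = {"jane": ["r1", "r2"]}
formal_parents = [
    ("Registration", "Grade", lambda r: [student_of[r]], "Intelligence"),
    ("Student", "Ranking", lambda s: registered_in[s], "Grade"),
]
graph = instance_dependency_graph(skeleton, formal_parents)
```

This check is relative to one skeleton; a different skeleton for the same PRM requires building and testing a new graph.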
5.2.5.2
Figure 5.4 [Graphic: the class dependency graph over Teaching-Ability, Popularity, Rating, Intelligence, Ranking, Difficulty, Satisfaction, and Grade.]
1. Type I edges: For any attribute X.A and any of its parents X.B, we introduce an
edge from X.B to X.A.
2. Type II edges: For any attribute X.A and any of its parents X.K.B we introduce
an edge from Y.B to X.A, where Y = Range[X.K].
Figure 5.4 shows the dependency graph for our school domain.
The most obvious approach for using the class dependency graph is simply to
require that it be acyclic. This requirement is equivalent to assuming a stratification
among the attributes of the different classes, and requiring that the parents of an
attribute precede it in the stratification ordering. As theorem 5.7 shows, if the
class dependency graph is acyclic, we can never have that x.A depends (directly or
indirectly) on itself.
Theorem 5.7
If the class dependency graph G_Π is acyclic for a PRM Π, then for any skeleton σ_r,
the instance dependency graph is acyclic.
The following corollary follows immediately:
Corollary 5.8
Let Π be a PRM whose class dependency structure S is acyclic. For any relational
skeleton σ_r, Π and σ_r define a coherent probability distribution over instantiations
I that extend σ_r via (5.1).
For example, if we examine the PRM of figure 5.2(b), we can easily convince ourselves that we cannot create a cycle in any instance. Indeed, as we saw in figure 5.4,
the class dependency graph is acyclic. Note, however, that if we introduce additional
dependencies we can create cycles. For example, if we make Professor.Teaching-Ability depend on the rating of courses she teaches (e.g., if high teaching ratings
increase her motivation), then the resulting class dependency graph is cyclic, and
there is no stratification order that is consistent with the PRM structure. An inability to stratify the class dependency graph implies that there are skeletons for
which the PRM will induce a distribution with cyclic dependencies.
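A stratification ordering, when one exists, is simply a topological order of the class dependency graph. The sketch below encodes only the dependencies of the university PRM that are named in the text (the figure may contain others) and uses a hypothetical helper `stratification`.

```python
from graphlib import CycleError, TopologicalSorter

# Class dependency graph as {attribute: set of parent attributes}.
CLASS_GRAPH = {
    "Professor.Popularity": {"Professor.Teaching-Ability"},
    "Registration.Grade": {"Student.Intelligence", "Course.Difficulty"},
    "Registration.Satisfaction": {"Professor.Teaching-Ability"},
    "Student.Ranking": {"Registration.Grade"},
    "Course.Rating": {"Registration.Satisfaction"},
}


def stratification(graph):
    """Return an ordering in which parents precede children, or None if cyclic."""
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError:
        return None


# Adding the dependency of Teaching-Ability on Rating closes a cycle:
# Teaching-Ability -> Satisfaction -> Rating -> Teaching-Ability.
CYCLIC_GRAPH = {**CLASS_GRAPH, "Professor.Teaching-Ability": {"Course.Rating"}}
```

With these inputs, `stratification(CLASS_GRAPH)` yields an ordering and `stratification(CYCLIC_GRAPH)` yields `None`, mirroring the two cases discussed above.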
[Figure 5.5 graphic. (a) Class Person (P-chromosome, M-chromosome, Blood Type), drawn also in its Father and Mother roles, and class Blood Test (Contaminated, Result). (b) Nodes Person.M-chromosome, Person.P-chromosome, Person.BloodType, BloodTest.Contaminated, and BloodTest.Result.]
Figure 5.5 (a) A simple PRM for the genetics domain. (b) The corresponding dependency graph. Dashed edges correspond to green dependencies, dotted edges
correspond to yellow dependencies, and solid edges correspond to red dependencies.
5.2.5.3
In some important cases, a cycle in the class dependency graph is not problematic:
it will not result in a cyclic instance dependency graph. This can be the case when
we have additional domain constraints on the form of skeletons we may encounter.
Consider, for example, a simple genetic model of the inheritance of a single gene
that determines a person's blood type, shown in figure 5.5(a). Each person has
two copies of the chromosome containing this gene, one inherited from her mother,
and one inherited from her father. There is also a possibly contaminated test that
attempts to recognize the person's blood type. Our schema contains two classes:
Person and BloodTest. Class Person has reference slots Mother and Father and
descriptive attributes Gender, P-Chromosome (the chromosome inherited from the
father), and M-Chromosome (inherited from the mother). BloodTest has a reference
slot Test-Of (not shown explicitly in the gure) that points to the owner of the test,
and descriptive attributes Contaminated and Result.
In our genetic model, the genotype of a person depends on the genotype of
her parents; thus, at the class level, we have Person.P-Chromosome depending
directly on Person.P-Chromosome. As we can see in figure 5.5(b), this dependency
results in a cycle that clearly violates the acyclicity requirements of our simple class
dependency graph. However, it is clear to us that the dependencies in this model are
not actually cyclic for any skeleton that we will actually encounter in this domain.
The reason is that, in legitimate skeletons for this schema, a person cannot be
his own ancestor, which disallows the situation of the person's genotype depending
(directly or indirectly) on itself. In other words, although the model appears to be
cyclic at the class level, we know that this cyclicity is always resolved at the level
of individual objects.
Our ability to guarantee that the cyclicity is resolved relies on some prior
knowledge that we have about the domain. We want to allow the user to give us
information such as this, so that we can make stronger guarantees about acyclicity
and allow richer dependency structures in the PRM. In particular, the user can
specify that certain reference slots are guaranteed acyclic. In our genetics example,
Father and Mother are guaranteed acyclic; cycles involving these attributes may in
fact be legal. Moreover, they are mutually guaranteed acyclic, so that compositions
of the slots are also guaranteed acyclic. Figure 5.5(b) shows the class dependency
graph for the genetics domain, with guaranteed acyclic edges shown as dashed
edges.
We allow the user to assert that certain reference slots R_ga = {ρ_1, ..., ρ_k} are
guaranteed acyclic; i.e., we are guaranteed that there is a partial ordering ≺_ga such
that if y is a ρ-relative for some ρ ∈ R_ga of x, then y ≺_ga x. We say that a slot
chain K is guaranteed acyclic if each of its component ρ's is guaranteed acyclic.
This prior knowledge allows us to guarantee the legality of certain dependency
models. We start by building a colored class dependency graph that describes the
direct dependencies between the attributes.
Definition 5.9
The colored class dependency graph G_Π for a PRM Π has the following edges:
1. Yellow edges: If X.B is a parent of X.A, we have a yellow edge X.B → X.A.
2. Green edges: If γ(X.K.B) is a parent of X.A, Y = Range[X.K], and K is
guaranteed acyclic, we have a green edge Y.B → X.A.
3. Red edges: If γ(X.K.B) is a parent of X.A, Y = Range[X.K], and K is not
guaranteed acyclic, we have a red edge Y.B → X.A.
Note that there might be several edges, perhaps of different colors, between two
attributes.
The intuition is that dependency along green edges relates objects that are
ordered by an acyclic order. Thus, these edges by themselves or combined with
intra-object dependencies (yellow edges) cannot cause a cyclic dependency. We
must, however, take care with other dependencies, for which we do not have
prior knowledge, as these might form a cycle. This intuition suggests the following
definition:
Definition 5.10
A (colored) dependency graph is stratified if every cycle in the graph contains at
least one green edge and no red edges.
Theorem 5.11
If the colored class dependency graph is stratified for a PRM Π, then for any
skeleton σ_r, the instance dependency graph is acyclic.
In other words, if the colored dependency graph of S and R_ga is stratified, then
for any skeleton σ_r for which the slots in R_ga are jointly acyclic, S defines a coherent
probability distribution over assignments to σ_r.
This notion of stratification generalizes the two special cases we considered
above. When we do not have any guaranteed acyclic relations, all the edges in
the dependency graph are colored either yellow or red. Then the graph is stratified
if and only if it is acyclic. In the genetics example, all the parent relations would
be in R_ga. The only edges involved in cycles are green edges.
We can also support multiple guaranteed acyclic relations by using different
shades of green for each set of guaranteed acyclic relations. Then a cycle is safe
as long as it contains at most one shade of green edge.
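The stratification test of definition 5.10 reduces to two reachability checks: no red edge may lie on a cycle, and the graph with the green edges removed must be acyclic. The sketch below is one way to implement that reading for a single shade of green; the edge list for the genetics example is an assumption reconstructed from the description of the model, and the function names are hypothetical.

```python
def reachable(adj, start):
    """Nodes reachable from `start` by following at least one edge."""
    seen, stack = set(), [start]
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def is_stratified(edges):
    """edges: list of (source, target, color) with color in {yellow, green, red}."""
    full, non_green = {}, {}
    for u, v, color in edges:
        full.setdefault(u, set()).add(v)
        if color != "green":
            non_green.setdefault(u, set()).add(v)
    # A red edge u -> v lies on a cycle exactly when u is reachable from v.
    for u, v, color in edges:
        if color == "red" and u in reachable(full, v):
            return False
    # Every cycle must use a green edge, so the non-green subgraph must be acyclic.
    return all(u not in reachable(non_green, u) for u in non_green)


# Assumed edges for the genetics example (Mother and Father are guaranteed acyclic).
GENETICS = [
    ("Person.M-chromosome", "Person.M-chromosome", "green"),   # via Mother
    ("Person.P-chromosome", "Person.M-chromosome", "green"),   # via Mother
    ("Person.M-chromosome", "Person.P-chromosome", "green"),   # via Father
    ("Person.P-chromosome", "Person.P-chromosome", "green"),   # via Father
    ("Person.M-chromosome", "Person.BloodType", "yellow"),
    ("Person.P-chromosome", "Person.BloodType", "yellow"),
    ("Person.BloodType", "BloodTest.Result", "red"),           # via Test-Of
    ("BloodTest.Contaminated", "BloodTest.Result", "yellow"),
]
```

The sketch does not model multiple shades of green; handling them would require tracking which guaranteed acyclic relation each green edge belongs to.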
5.3
5.4
5.5
Figure 5.6 [Graphic: a scientific paper whose bibliography entries (1, 2, 3) are unresolved references into a document collection.]
introduce X_E to denote the set of classes that represent entities, and X_R to denote
those that represent relationships. We note that the distinctions are prior knowledge
about the domain, and are therefore part of the domain specification. We use the
generic term object to refer both to entities and to relationships.
5.5.1
Reference Uncertainty
Consider a simple citation domain illustrated in figure 5.6. Here we have a document
collection. Each document has a bibliography that references some of the other
documents in the collection. We may know the number of citations made by each
document (i.e., it is outside the probabilistic model). By observing the citations
that are made, we can use the links to reach conclusions about other attributes in
the model. For example, by observing the number of citations to papers of various
topics, we may be able to infer something about the topic of the citing paper.
Figure 5.7(a) shows a simple schema for this domain. We have two classes, Paper
and Cites. The Paper class has information about the topic of the paper and the
words contained in the paper. For now, we simply have an attribute for each word
that is true if the word occurs in the paper and false otherwise. The Cites class
represents the citation of one paper, the Cited paper, by another paper, the Citing
paper. (In the figure, for readability, we show the Paper class twice.) In this model,
we assume that the set of objects is prespecified, but relations among them, i.e.,
reference slots, are subject to probabilistic choices. Thus, rather than being given
a full relational skeleton σ_r, we assume that we are given an object skeleton σ_o.
The object skeleton specifies only the objects σ_o(X) in each class X ∈ X, but
not the values of the reference slots. In our example, the object skeleton specifies
the objects in class Paper and the objects in class Cites, but the reference slots of
the Cites relation, Cites.Cited and Cites.Citing, are unspecified. In other words, the
probabilistic model does not provide a model of the total number of citation links,
but only a distribution over their endpoints. Figure 5.7(b) shows an object skeleton
for the citation domain.
[Figure 5.7 graphic. (a) Classes Paper (Topic, Words) and Cites (Cited, Citing), with Paper drawn twice. (b) Paper objects P1 to P5 with Topic values (AI, Theory, or ???) and Cites objects whose reference slots are marked ???.]
Figure 5.7 (a) A relational schema for the citation domain. (b) An object skeleton
for the citation domain.
5.5.1.1
Probabilistic Model
In the case of reference uncertainty, we specify a probabilistic model for the value
of the reference slots X.ρ. The domain of a reference slot X.ρ is the set of keys
(unique identifiers) of the objects in the class Y to which X.ρ refers. Thus, we need
to specify a probability distribution over the set of all objects in Y. For example,
for Cites.Cited, we must specify a distribution over the objects in class Paper.
A naive approach is to simply have the PRM specify a probability distribution
directly over the objects σ_o(Y) in Y. For example, for Cites.Cited, we would have
to specify a distribution over the primary keys of Paper. This approach has two
major flaws. Most obviously, this distribution would require a parameter for each
object in Y, leading to a very large number of parameters. This is a problem both
from a computational perspective (the model becomes very large) and from
a statistical perspective (we often would not have enough data to make robust
estimates for the parameters). More importantly, we want our dependency model
to be general enough to apply over all possible object skeletons σ_o; a distribution
defined in terms of the objects within a specific object skeleton would not apply to
others.
In order to achieve a general and compact representation, we use the attributes
of Y to define the probability distribution. In this model, we partition the class Y
into subsets labeled ψ_1, ..., ψ_m according to the values of some of its attributes,
and specify a probability for choosing each partition, i.e., a distribution over the
partitions. We then select an object within that partition uniformly.
For example, consider a description of movie theater showings as in figure 5.8(a).
For the foreign key Shows.Movie, we can partition the class Movie by Genre,
indicating that a movie theater first selects the genre of movie it wants to show,
and then selects uniformly among the movies with the selected genre. For example,
a movie theater may be much more likely to show a movie which is a thriller
than a foreign movie. Having selected, for example, to show a thriller, the theater
then selects the actual movie to show uniformly from within the set of thrillers.
In addition, just as in the case of descriptive attributes, the partition choice can
[Figure 5.8 graphic. (a) Classes Theater, Movie, and Shows; Movie is partitioned into M1 (Movie.Genre = foreign) and M2 (Movie.Genre = thriller), and the CPD of the selector is conditioned on Theater.Type (megaplex or art theater). (b) Classes Paper (Topic, Words) and Cites (Citing, Cited); Paper is partitioned by Topic into P1 and P2, and the CPD of the selector is conditioned on the Topic of the citing paper: Theory gives (P1, P2) = (0.1, 0.9) and AI gives (0.99, 0.01).]
Figure 5.8 (a) An example of reference uncertainty for a movie theater's showings.
(b) A simple example of reference uncertainty in the citation domain.
depend on other attributes in our model. Thus, the selector attribute can have
parents. As illustrated in the figure, the choice of movie genre might depend on
the type of theater. Consider another example in our citation domain. As shown in
figure 5.8(b), we can partition the class Paper by Topic, indicating that the topic
of a citing paper determines the topics of the papers it cites; and then the cited
paper is chosen uniformly among the papers with the selected topic.
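The two-stage choice described above (first a partition, then a uniform pick inside it) can be sketched as a small sampler. The paper ids, their topics, and the numbers in `SELECTOR_CPD` below are illustrative assumptions rather than values taken from the chapter, and `sample_cited` is a hypothetical name.

```python
import random
from collections import defaultdict

# Hypothetical object skeleton: paper id -> Topic.
PAPERS = {"P1": "Theory", "P2": "Theory", "P3": "AI", "P4": "Theory", "P5": "AI"}

# Illustrative selector CPD: topic of the citing paper -> distribution over partitions.
SELECTOR_CPD = {
    "Theory": {"Theory": 0.9, "AI": 0.1},
    "AI": {"Theory": 0.01, "AI": 0.99},
}


def sample_cited(citing_topic, papers, rng):
    """Pick a partition from the selector CPD, then a paper uniformly inside it."""
    partitions = defaultdict(list)
    for paper_id, topic in sorted(papers.items()):
        partitions[topic].append(paper_id)
    row = SELECTOR_CPD[citing_topic]
    labels = list(row)
    chosen = rng.choices(labels, weights=[row[label] for label in labels], k=1)[0]
    return rng.choice(partitions[chosen])   # raises IndexError if the partition is empty


rng = random.Random(0)
draws = [sample_cited("AI", PAPERS, rng) for _ in range(1000)]
ai_fraction = sum(PAPERS[p] == "AI" for p in draws) / len(draws)
```

The number of parameters depends only on the number of partitions and parent values, not on the number of papers, which is what lets the same model apply to any object skeleton.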
We make this intuition precise by defining, for each slot ρ, a partition function
Ψ_ρ. We place several restrictions on the partition function which are captured in
the following definition:
Definition 5.12
Let X.ρ be a reference slot with domain Y. Let Ψ_ρ : Y → Dom[Ψ_ρ] be a function
where Dom[Ψ_ρ] is a finite set of labels. We say that Ψ_ρ is a partition function for
ρ if there is a subset of the attributes of Y, P[ρ] ⊆ A(Y), such that for any y ∈ Y
and any y′ ∈ Y, if the values of the attributes P[ρ] of y and y′ are the same, i.e., for
each A ∈ P[ρ], y.A = y′.A, then Ψ_ρ(y) = Ψ_ρ(y′). We refer to P[ρ] as the partition
attributes for ρ.
Thus, the values of the partition attributes are all that is required to determine the
partition to which an object belongs.
In our first example, Ψ_{Shows.Movie} : Movie → {foreign, thriller} and the partition
attributes are P[Shows.Movie] = {Genre}. In the second example, Ψ_{Cites.Cited} :
Paper → {AI, Theory} and the partition attributes are P[Cites.Cited] = {Topic}.
There are a number of natural methods for specifying the partition function.
It can be defined simply by having one partition for each possible combination
of values of the partition attributes, i.e., one partition for each value in the cross
product of the partition attribute values. Our examples above take this approach.
In both cases, there is only a single partition attribute, so specifying the partition
function in this manner is not too unwieldy, but for larger collections of partition
attributes or for partition attributes with large domains, this method for defining
the partitioning function may be problematic. A more flexible and scalable approach
is to define the partition function using a decision tree built over the partition
attributes. In this case, there is one partition for each of the leaves in the decision
tree.
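The two ways of specifying a partition function can be contrasted in a few lines. In the sketch below, the second partition attribute (Popularity), the tree shape, and the leaf labels are hypothetical; the tree also assumes that every non-foreign movie is a thriller.

```python
def cross_product_partition(obj, partition_attrs):
    """Label = the tuple of partition-attribute values (one partition per combination)."""
    return tuple(obj[a] for a in partition_attrs)


def tree_partition(movie):
    """A hypothetical decision tree over the partition attributes Genre and Popularity."""
    if movie["Genre"] == "foreign":
        return "foreign"                 # leaf 1: Popularity is not consulted
    if movie["Popularity"] == "high":
        return "popular-thriller"        # leaf 2
    return "other-thriller"              # leaf 3


movies = [
    {"Title": "m1", "Genre": "foreign", "Popularity": "high"},
    {"Title": "m2", "Genre": "foreign", "Popularity": "low"},
    {"Title": "m3", "Genre": "thriller", "Popularity": "high"},
]
cross_labels = {cross_product_partition(m, ["Genre", "Popularity"]) for m in movies}
tree_labels = {tree_partition(m) for m in movies}
```

Both functions depend only on the partition attributes, as definition 5.12 requires; the tree simply merges combinations that the cross product would keep apart.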
Each possible value ψ determines a subset of Y from which the value of ρ (the
referent) will be selected. For a particular instantiation I of the database, we use
I(Y_ψ) to represent the set of objects in I(Y) that fall into the partition ψ.
We now represent a probabilistic model over the values of ρ by specifying a
distribution over possible partitions, which encodes how likely the reference value
of ρ is to fall into one partition versus another. We formalize our intuition above
by introducing a selector attribute S_ρ, whose domain is Dom[Ψ_ρ]. The specification
of the probabilistic model for the selector attribute S_ρ is the same as that of any
other attribute: it has a set of parents and a CPD. In our earlier example, the CPD
of Shows.S_Movie might have as a parent Theater.Type. For each instantiation of the
parents, we have a distribution over Dom[S_ρ]. The choice of value for S_ρ determines
the partition Y_ψ from which the reference value of ρ is chosen; the choice of reference
value for ρ is uniformly distributed within this set.
Definition 5.13
A probabilistic relational model Π with reference uncertainty over a relational
schema R has the same components as in definition 5.2. In addition, for each
reference slot ρ ∈ R(X) with Range[ρ] = Y, we have:
a partition function Ψ_ρ with a set of partition attributes P[ρ] ⊆ A(Y);
a new selector attribute S_ρ within X which takes on values in the range of Ψ_ρ;
a set of parents and a CPD for S_ρ.
To define the semantics of this extension, we must define the probability of
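reference slots as well as descriptive attributes:
P(I \mid \sigma_o, \Pi) = \prod_{X \in \mathcal{X}} \prod_{x \in \sigma_o(X)} \left[ \prod_{A \in \mathcal{A}(X)} P(x.A \mid \mathrm{Pa}(x.A)) \prod_{\rho \in \mathcal{R}(X),\, y = x.\rho} \frac{P(x.S_\rho = \psi[y] \mid \mathrm{Pa}(x.S_\rho))}{|I(Y_{\psi[y]})|} \right]    (5.2)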
where ψ[y] refers to Ψ_ρ(y), the partition that the partition function assigns y.
Note that the last term in (5.2) depends on I in three ways: the interpretation of
x.ρ = y, the values of the attributes P[ρ] within the object y, and the size of Y_{ψ[y]}.
The above probability is not well-defined if there are no objects in a partition, so
in that case we define it to be zero.
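The contribution of a single reference slot to (5.2) is the selector probability of the referent's partition divided by the size of that partition. The sketch below computes this quantity for every candidate referent; the function name, the paper objects, and the selector rows are assumptions made for illustration.

```python
from collections import Counter


def referent_probabilities(selector_row, objects, label_of):
    """
    selector_row: P(x.S = psi | parents of x.S) for the relevant parent values.
    objects: {object id: attribute dict} for the range class Y.
    label_of: the partition function, mapping an object to its label psi.
    Returns {y: P(x.S = psi[y] | ...) / |I(Y_psi[y])|}.
    """
    sizes = Counter(label_of(o) for o in objects.values())
    return {
        y: selector_row.get(label_of(o), 0.0) / sizes[label_of(o)]
        for y, o in objects.items()
    }


papers = {p: {"Topic": t} for p, t in
          [("P1", "Theory"), ("P2", "Theory"), ("P3", "AI"), ("P4", "Theory"), ("P5", "AI")]}


def topic(paper):
    return paper["Topic"]


probs = referent_probabilities({"Theory": 0.9, "AI": 0.1}, papers, topic)
# A selector row that puts mass on a partition with no objects ("Systems" here).
leaky = referent_probabilities({"Theory": 0.7, "AI": 0.1, "Systems": 0.2}, papers, topic)
```

In the second call no paper falls into the Systems partition, so that selector value contributes zero and the referent probabilities sum to 0.8 rather than 1, which is the situation the convention above addresses.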
5.5.2
model. The associated ground Bayesian network will therefore be cumbersome and
not particularly intuitive. We define our coherence constraints using an instance
dependency graph, relative to our PRM and object skeleton.
Definition 5.14
The instance dependency graph for a PRM $\Pi$ and an object skeleton $\sigma_o$ is a
graph $G_{\sigma_o}$ with the nodes and edges described below. For each class X and each
$x \in \sigma_o(X)$, we have the following nodes:
a node x.A for every descriptive attribute X.A;
a node $x.\rho$ and a node $x.S_\rho$, for every reference slot $X.\rho$.
The dependency graph contains five types of edges:
Type I edges: Consider any attribute (descriptive or selector) X.A and formal
parent X.B. We define an edge $x.B \to x.A$, for every $x \in \sigma_o(X)$.
Type II edges: Consider any attribute (descriptive or selector) X.A and formal
parent X.K.B where Dom[X.K] = Y. We define an edge $y.B \to x.A$, for every
$x \in \sigma_o(X)$ and $y \in \sigma_o(Y)$.
Type III edges: Consider any attribute X.A and formal parent X.K.B, where
$K = \rho_1, \ldots, \rho_k$, and Dom[$\rho_i$] = $X_i$. We define an edge $x.\rho_1 \to x.A$, for every
$x \in \sigma_o(X)$. In addition, for each i > 1, we add an edge $x_i.\rho_i \to x.A$ for every
$x_i \in \sigma_o(X_i)$ and for every $x \in \sigma_o(X)$.
Type IV edges: Consider any slot $X.\rho$ and partition attribute $Y.B \in \mathcal{P}[\rho]$ for
Y = Range[$\rho$]. We define an edge $y.B \to x.S_\rho$ for every $x \in \sigma_o(X)$ and $y \in \sigma_o(Y)$.
Type V edges: Consider any slot $X.\rho$. We define an edge $x.S_\rho \to x.\rho$ for every
$x \in \sigma_o(X)$.
We say that a dependency structure S is acyclic relative to an object skeleton $\sigma_o$
if the directed graph $G_{\sigma_o}$ is acyclic.
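To make the acyclicity condition concrete, here is a small Python sketch that builds a few instance-level edges for a toy skeleton and tests the resulting graph with Kahn's algorithm. The object names (`s1`, `t1`, `t2`), the attribute names, and the particular dependencies are assumptions chosen to resemble the movie theater example; only the edge types follow the definition.

```python
from collections import defaultdict, deque

def is_acyclic(edges):
    """Return True if the directed graph given as (src, dst) pairs has no cycle."""
    succ = defaultdict(set)
    indeg = defaultdict(int)
    nodes = set()
    for u, v in edges:
        nodes.update((u, v))
        if v not in succ[u]:
            succ[u].add(v)
            indeg[v] += 1
    # Kahn's algorithm: repeatedly remove nodes with no incoming edges.
    queue = deque(n for n in nodes if indeg[n] == 0)
    removed = 0
    while queue:
        n = queue.popleft()
        removed += 1
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    return removed == len(nodes)

# Hypothetical skeleton: one Shows object with a Theater slot, two theaters.
shows, theaters = ["s1"], ["t1", "t2"]
edges = []
for s in shows:
    for t in theaters:
        edges.append((f"{t}.Type", f"{s}.S_Theater"))   # type IV
        edges.append((f"{t}.Location", f"{s}.Profit"))  # type II: every y.B
    edges.append((f"{s}.S_Theater", f"{s}.Theater"))    # type V
    edges.append((f"{s}.Theater", f"{s}.Profit"))       # type III

acyclic = is_acyclic(edges)
# Letting a theater's Type depend on the show's Profit closes a cycle.
cyclic = is_acyclic(edges + [("s1.Profit", "t1.Type")])
```

The second call fails the test because the added edge closes the loop s1.Profit, t1.Type, s1.S_Theater, s1.Theater, s1.Profit.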
Intuitively, type I edges correspond to intra-object dependencies and type II edges
to inter-object dependencies. These are the same edges that we had in the dependency graph for regular PRMs, except that they also apply to selector attributes.
Moreover, there is an important difference in our treatment of type II edges. In this
case, the skeleton does not specify the value of $x.\rho$, and hence we cannot determine
from the skeleton on which object y the attribute x.A actually depends. Therefore,
our instance dependency graph must include an edge from every attribute y.B.
Type III edges represent the fact that the actual choice of parent for x.A depends
on the value of the slots used to define it. When the parent is defined via a slot
chain, the actual choice depends on the values of all the slots along the chain. Since
we cannot determine the particular object from the skeleton, we must include an
edge from every slot $x_i.\rho_i$ potentially included in the chain.
Type V edges represent the dependency of a slot on the attributes defining the
associated partition. To see why this dependence is required, we observe that our
choice of reference value for $x.\rho$ depends on the values of the partition attributes
Figure 5.9 (a) A PRM for the movie theater example. The partition attributes
are indicated using dashed lines. (b) The dependency graph for the movie theater
example. The different edge types are labeled.
Figure 5.10 [Two panels, each labeled "Document Collection".]
5.5.3
Existence Uncertainty
The second form of structural uncertainty we introduce is called existence uncertainty. In this case, we make no assumptions about the number of links that exist.
The number of links that exist and the identity of the links are all part of the
probabilistic model and can be used to make inferences about other attributes in
our model. In our citation example above, we might assume that the set of papers
is part of our background knowledge, but we want to provide an explicit model for
the presence or absence of citations. Unlike the reference uncertainty model of the
previous section, we do not assume that the total number of citations is fixed, but
rather that each potential citation can be present or absent.
[Figure 5.11(a): papers P1 through P5, each with a Topic attribute (Theory or AI); the potential citations between them are marked "???". Figure 5.11(b): the classes Paper (Topic, Words) and Cites (Exists), with the following CPD for Cites.Exists.]
Citer.Topic   Cited.Topic   False   True
Theory        Theory        0.995   0.005
Theory        AI            0.999   0.001
AI            Theory        0.997   0.003
AI            AI            0.993   0.008
Figure 5.11 (a) An entity skeleton for the citation domain. (b) A CPD for the
Exists attribute of Cites.
5.5.3.1
The object skeleton used for reference uncertainty assumes that the number of
objects in each relation is known. Thus, if we consider a division of objects into
entities and relations, the number of objects in classes of both types is fixed.
Existence uncertainty assumes even less background information than specified by
the object skeleton. Specifically, we assume that the number of relationship objects
is not fixed in advance. This situation is illustrated in figure 5.10.
We assume that we are given only an entity skeleton $\sigma_e$, which specifies the set
of objects in our domain only for the entity classes. Figure 5.11(a) shows an entity
skeleton for the citation example. Our basic approach is to allow other objects
within the model (those in the relationship classes) to be undetermined, i.e.,
their existence can be uncertain. In other words, we introduce into the model all
of the objects that can potentially exist in it; with each of them, we associate a
special binary variable that tells us whether the object actually exists or not. We
call entity classes determined and relationship classes undetermined.
To specify the set of potential objects, we note that relationship classes typically
represent many-many relationships; they have at least two reference slots, which
refer to determined classes. For example, our Cites class has the two reference
slots, Citing and Cited. Thus the potential domain of the Cites class in a given
instantiation I is $I(\text{Paper}) \times I(\text{Paper})$. Each potential object x in this class has
the form Cites[$y_1, y_2$]. Each such object is associated with a binary attribute x.E
that specifies whether paper $y_1$ did or did not cite paper $y_2$.
Definition 5.18
Consider a schema with determined and undetermined classes, and let $\sigma_e$ be an
entity skeleton over this schema. We define the induced relational skeleton, $\sigma_r[\sigma_e]$,
to be the relational skeleton that contains the following objects:
If X is a determined class, then $\sigma_r[\sigma_e](X) = \sigma_e(X)$.
Let X be an undetermined class with reference slots $\rho_1, \ldots, \rho_k$ whose range types
are $Y_1, \ldots, Y_k$ respectively. Then $\sigma_r[\sigma_e](X)$ contains an object $X[y_1, \ldots, y_k]$ for
all tuples $\langle y_1, \ldots, y_k \rangle \in \sigma_r[\sigma_e](Y_1) \times \cdots \times \sigma_r[\sigma_e](Y_k)$.
The relations in $\sigma_r[\sigma_e]$ are defined in the obvious way: Slots of objects of determined
classes are taken from the entity skeleton. Slots of objects of undetermined classes
are induced from the object definition: $X[y_1, \ldots, y_k].\rho_i$ is $y_i$.
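The construction of the induced relational skeleton amounts to a Cartesian product over the ranges of the reference slots. The sketch below illustrates this for the citation domain; the function name, the dictionary layout, and the three paper identifiers are assumptions made for the example, and the undetermined classes are assumed to be listed so that every range class is already in the skeleton.

```python
from itertools import product

def induced_skeleton(entity_skeleton, undetermined):
    """Build the induced relational skeleton.

    entity_skeleton: {determined class: list of objects}
    undetermined:    {undetermined class: {slot name: range class}}
    Every tuple of range objects yields one potential object, stored here
    as a {slot name: referenced object} dictionary.
    """
    skeleton = {cls: list(objs) for cls, objs in entity_skeleton.items()}
    for cls, slot_ranges in undetermined.items():
        ranges = [skeleton[r] for r in slot_ranges.values()]
        skeleton[cls] = [dict(zip(slot_ranges, combo)) for combo in product(*ranges)]
    return skeleton

entity = {"Paper": ["P1", "P2", "P3"]}
undetermined = {"Cites": {"Citing": "Paper", "Cited": "Paper"}}
skel = induced_skeleton(entity, undetermined)
num_potential = len(skel["Cites"])
```

With three papers there are nine potential Cites objects, including the self-citations Cites[P1, P1] and so on, since the product is taken over all pairs.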
To ensure that the semantics of schemata with undetermined classes is well-defined, we need a few tools. Specifically, we need to ensure that the set of potential
objects is well-defined and finite. It is clear that if we allow cyclic references (e.g.,
an undetermined class with a reference to itself), then the set of potential objects
is not finite. To avoid such situations, we need to put some requirements on the
schema.
Definition 5.19
A set of classes $\mathcal{X}$ is stratified if there exists a partial ordering $\prec$ over the classes
such that for any reference slot $X.\rho$ with range type Y, $Y \prec X$.
Lemma 5.20
If the set of undetermined classes in a schema is stratified, then given any entity
skeleton $\sigma_e$ the number of potential objects in any undetermined class is finite.
As discussed, each undetermined class X has a special existence attribute X.E whose
values are V(E) = {true, false}. For uniformity of notation, we introduce an E
attribute for all classes; for classes that are determined, the E value is defined to
be always true. We require that all of the reference slots of a determined class X
have a range type which is also a determined class.
For a PRM with stratified undetermined classes, we define an instantiation to
be an assignment of values to the attributes, including the Exists attribute, of all
potential objects.
5.5.3.2
Probabilistic Model
We now specify the probabilistic model defined by the PRM. By treating the Exists
attributes as standard descriptive attributes, we can essentially build our definition
directly on top of the definition of standard PRMs.
Specifically, the existence attribute for an undetermined class is treated in the
same way as a descriptive attribute in our dependency model, in that it can have
parents and children, and has an associated CPD. Figure 5.11(b) illustrates a CPD
for the Cites.Exists attribute. In this example, the existence of a citation depends
on the topic of the citing paper and the topic of the cited paper; e.g., it is more
likely that citations will exist between papers with the same topic.
Using the induced relational skeleton and treating the existence events as descriptive attributes, we have set things up so that (5.1) applies with minor changes.
There are two important changes to the definition of the distribution:
We want to enforce that x.E = false if $x.\rho.E$ = false for one of the slots $\rho$ of X.
Suppose that X has the slots $\rho_1, \ldots, \rho_k$; we define the effective CPD for X.E as
follows. Let $\mathrm{Pa}^*(X.E) = \mathrm{Pa}(X.E) \cup \{X.\rho_1.E, \ldots, X.\rho_k.E\}$, and define
$$P^*(X.E \mid \mathrm{Pa}^*(X.E)) = \begin{cases} P(X.E \mid \mathrm{Pa}(X.E)) & \text{if } X.\rho_i.E = \text{true},\ \forall i = 1, \ldots, k, \\ 0 & \text{otherwise.} \end{cases}$$
We want to decouple the attributes of nonexistent objects from the rest
of the PRM. Thus, if X.A is a descriptive attribute, we define $\mathrm{Pa}^*(X.A) =
\mathrm{Pa}(X.A) \cup \{X.E\}$, and
$$P^*(X.A \mid \mathrm{Pa}^*(X.A)) = \begin{cases} P(X.A \mid \mathrm{Pa}(X.A)) & \text{if } X.E = \text{true}, \\ \dfrac{1}{|V(X.A)|} & \text{otherwise.} \end{cases}$$
It is easy to verify that in both cases $P^*(X.A \mid \mathrm{Pa}^*(X.A))$ is a legal conditional
distribution.
In effect, these constraints specify a new PRM $\Pi^*$, in which we treat X.E as a
standard descriptive attribute. For each attribute (including the Exists attribute),
we define the parents of X.A in $\Pi^*$ to be $\mathrm{Pa}^*(X.A)$ and the associated CPD to be
$P^*(X.A \mid \mathrm{Pa}^*(X.A))$.
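The two modifications can be sketched as small wrappers around the original CPD entries. The Python below is an illustration under stated assumptions: the function names are hypothetical, the first function returns the probability that the object exists (the zero case is read as applying to E = true), and CPD rows are represented as plain dictionaries.

```python
def effective_exists_prob(p_exists, slot_targets_exist):
    """Effective probability that X.E is true.

    p_exists is the original CPD entry P(X.E = true | Pa(X.E)); the result
    is 0 as soon as one of the objects referenced by X's slots does not exist.
    """
    return p_exists if all(slot_targets_exist) else 0.0

def effective_attr_dist(base_dist, exists):
    """Effective CPD row for a descriptive attribute X.A.

    If the object exists, the original row is used; otherwise the attribute
    is decoupled from the model by a uniform distribution over V(X.A).
    """
    if exists:
        return dict(base_dist)
    return {v: 1.0 / len(base_dist) for v in base_dist}

# P(Exists = true) for a Theory paper citing a Theory paper, from figure 5.11(b).
p_both_exist = effective_exists_prob(0.005, [True, True])
p_one_missing = effective_exists_prob(0.005, [True, False])
# Hypothetical attribute row, used only to show the uniform fallback.
uniform_row = effective_attr_dist({"Theory": 0.6, "AI": 0.4}, exists=False)
```

For the Cites class the first case never triggers, because both slots refer to the determined class Paper; it matters when an undetermined class refers to another undetermined class.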
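Given an entity skeleton $\sigma_e$, a PRM with existence uncertainty $\Pi$ specifies a distribution over a set of instantiations I consistent with $\sigma_r[\sigma_e]$:
$$P(I \mid \sigma_e, \Pi) = P(I \mid \sigma_r[\sigma_e], \Pi^*) = \prod_{X \in \mathcal{X}} \prod_{x \in \sigma_r[\sigma_e](X)} \prod_{A \in \mathcal{A}(x)} P^*(x.A \mid \mathrm{Pa}^*(x.A)) \qquad (5.3)$$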
We can similarly define the class dependency graph for a PRM $\Pi$ with existence
uncertainty using the corresponding notions for the standard PRM $\Pi^*$. As there, we
require that the class dependency graph $G_{\Pi^*}$ is acyclic. One immediate consequence
of this requirement is that the schema is stratified.
Lemma 5.21
If the class dependency graph $G_{\Pi^*}$ is acyclic, then there is a stratification of the
undetermined classes.
Based on this definition, we can prove the following result:
Theorem 5.22
Let $\Pi$ be a PRM with existence uncertainty and an acyclic class dependency graph.
Let $\sigma_e$ be an entity skeleton. Then (5.3) defines a coherent distribution on all
instantiations I of the induced relational skeleton $\sigma_r[\sigma_e]$.
5.6
jects. Subclasses allow us to specialize the probabilistic model for some instances
of a class. For example, if we have a class movie in our relational schema, we
might consider subclasses of movies, such as documentaries, action movies, British
comedies, etc. The popularity of an action movie (a subclass of movies) may depend on its budget, whereas the popularity of a documentary (another subclass
of movies) may depend on the reputation of the director. Subclassing allows us to
model probabilistic dependencies at the appropriate level of detail. For example,
we can have the parents of the popularity attribute in the action movie subclass
be different from the parents of the same attribute in the documentary subclass. In
addition, subclassing allows additional dependency paths to be represented in the
model that would not be allowed in a PRM that does not support subclasses. For
example, whether a person enjoys action movies may depend on whether she enjoys
documentaries. PRMs-CH provide a general mechanism that allows us to define a
rich set of dependencies.
To motivate our extensions, consider a simple PRM for the movie domain. Let
us restrict attention to the three classes, Person, Movie, and Vote. We can have
the attributes of Vote depending on attributes of the person voting (via the slot
Vote.Voter ) and on attributes of the movie (via the slot Vote.Movie). However,
given the attributes of all the people and the movie in the model, the different
votes are (conditionally) i.i.d.
5.6.1
Class Hierarchies
Our aim is to refine the notion of a class, such as Movie, into finer subclasses,
such as action movies, comedy, documentaries, etc. Moreover, we want to allow
recursive refinements of this structure, so that we might refine action movies into
the subclasses spy movies, car chase movies, and kung-fu movies.
A class hierarchy for a class X defines an IS-A hierarchy for objects from class
X. The root of the class hierarchy is simply class X itself. The subclasses of X are
organized into an inheritance hierarchy. The leaves of the class hierarchy describe
basic classes; these are the most specific characterization of objects that occur
in the database. The interior nodes describe abstractions of the base-level classes.
The intent is that the class hierarchy is designed to capture useful and meaningful
abstractions in a particular domain.
More formally, a hierarchy H[X] for a class X is a rooted directed acyclic graph
defined by a subclass relation $\prec$ over a finite set of subclasses C[X]. For $c, d \in C[X]$,
if $c \prec d$, we say that $X_c$ is a direct subclass of $X_d$, and $X_d$ is a direct superclass of
$X_c$. The root of the tree is the class X. Class $\top$ corresponds to the original class
X. We define $\prec^*$ to be the transitive closure of $\prec$; if $c \prec^* d$, we say that $X_c$ is a
subclass of $X_d$. For example, figure 5.12 shows the simple class hierarchy for the
Movie class.
We denote the subclasses of the hierarchy by C[H[X]]. We achieve subclassing
for a class X by requiring that there be an additional subclass indicator attribute
X.Class that determines the subclass to which an object belongs. Thus, if c is a
subclass, then $I(X_c)$ contains all objects $x \in X$ for which $x.\text{Class} \prec^* c$, i.e., all
objects that are in some class which is a subclass of c. In our example, Movie has
a subclass indicator variable Movie.Class with possible values
{Comedy, Action-Movie, Documentary, Spy-Movie, Car-Chase-Movie, Kung-Fu-Movie}.
Figure 5.12 [Class hierarchy rooted at Movie, with nodes Comedy, Action-Movie, Documentary, Spy-Movie, Car-Chase-Movie, and Kung-Fu-Movie.]
Subclasses allow us to make finer distinctions when constructing a probabilistic
model. In particular, they allow us to specialize CPDs for different subclasses in
the hierarchy.
Definition 5.23
A probabilistic relational model with subclass hierarchy is defined as follows. For
each class $X \in \mathcal{X}$, we have
a class hierarchy $H[X] = (C[X], \prec)$;
a subclass indicator attribute X.Class such that V(X.Class) = C[H[X]];
a CPD for X.Class;
for each subclass $c \in C[X]$ and attribute $A \in \mathcal{A}(X)$ we have either
a set of parents $\mathrm{Pa}_c(X.A)$ and a CPD that describes $P(X.A \mid \mathrm{Pa}_c(X.A))$; or
an inherited indicator that specifies that the CPD for X.A in c is inherited
from its direct superclass. The root of the hierarchy cannot have the inherited
indicator.
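One way to picture the inherited indicator is as a lookup that walks up the hierarchy until it reaches a subclass that specifies its own parents and CPD. The following Python sketch assumes a tree-shaped hierarchy (each subclass has a single direct superclass) and treats a missing entry as the inherited indicator. The `parent_of` and `cpds` dictionaries and the parent sets in them are hypothetical.

```python
def resolve_cpd(subclass, attribute, parent_of, cpds):
    """Return (defining subclass, CPD spec) for `attribute` as seen from `subclass`.

    parent_of: {subclass: direct superclass}; the root has no entry.
    cpds:      {(subclass, attribute): CPD spec}; a missing entry plays the
               role of the inherited indicator.
    """
    c = subclass
    while (c, attribute) not in cpds:
        if c not in parent_of:
            # The root may not carry the inherited indicator.
            raise KeyError(f"no CPD for {attribute} at the root of the hierarchy")
        c = parent_of[c]
    return c, cpds[(c, attribute)]

parent_of = {
    "Comedy": "Movie",
    "Action-Movie": "Movie",
    "Documentary": "Movie",
    "Spy-Movie": "Action-Movie",
    "Car-Chase-Movie": "Action-Movie",
    "Kung-Fu-Movie": "Action-Movie",
}
cpds = {
    ("Movie", "Popularity"): {"parents": []},
    ("Action-Movie", "Popularity"): {"parents": ["Budget"]},
}
spy_source, spy_cpd = resolve_cpd("Spy-Movie", "Popularity", parent_of, cpds)
comedy_source, comedy_cpd = resolve_cpd("Comedy", "Popularity", parent_of, cpds)
```

In this toy setup a spy movie inherits the budget-dependent CPD specified for action movies, while a comedy falls back to the CPD at the root.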
With the introduction of subclass hierarchies, we can refine our probabilistic
dependencies. Before, each attribute X.A had an associated CPD. Now, if we like,
we can specialize the CPD for an attribute within a particular subclass. We can
associate a different CPD with the attributes of different subclasses. For example,
the attribute Action-Movie.Popularity may have a different conditional distribution
from the attribute Documentary.Popularity. Further, the distribution for each of
the attributes may depend on a completely different set of parents. Continuing
our earlier example, if the popularity of an action movie depends on its budget,
this dependency, we need a mechanism for constructing slot chains that restrict
the types of objects along the path to belong to specific subclasses. Recall that a
reference slot $\rho$ is a function from Dom[$\rho$] to Range[$\rho$], i.e., from X to Y. We can
introduce refinements of a slot reference by restricting the types of the objects in
the range.
Definition 5.24
Let $\rho$ be a slot (reference or inverse) of X with range Y. Let d be a subclass of Y.
A refined slot reference $\rho_{\langle d \rangle}$ for $\rho$ to d is a relation between X and Y:
for $x \in X$ and $y \in Y$, $y \in x.\rho_{\langle d \rangle}$ if and only if $y \in Y_d$ and $y \in x.\rho$.
Returning to our earlier example, suppose that we have subclasses of Movie:
Comedy, Action-Movie, and Documentary. In addition, suppose we also have subclasses of Vote: Comedy-Vote, Action-Vote, and Documentary-Vote. To get from
a person to her votes, we use the inverse slot reference Person.Votes. Now we
can construct refinements of Person.Votes: $\text{Votes}_{\langle \text{Comedy-Vote} \rangle}$, $\text{Votes}_{\langle \text{Action-Vote} \rangle}$, and
$\text{Votes}_{\langle \text{Documentary-Vote} \rangle}$.
Let us name these slots Comedy-Votes, Action-Votes, and Documentary-Votes.
To specify the dependency of a person's rankings for documentaries on their rankings for action movies we can say that Documentary-Vote.Rank has a parent which is
the person's action movie rankings: (Documentary-Vote.Person.Action-Votes.Rank).
5.6.3
The introduction of subclasses brings the benefit that we can now provide a smooth
transition from the PRM, a class-based probabilistic model, to models that are
more similar to Bayesian networks. To see this, suppose our subclass hierarchy
for movies is very deep and starts with the general class and ends in the most
refined levels with particular movie instances. Thus, at the most refined version
of the model we can define the preferences of a person by either class-based
dependency (the probability of enjoying documentary movies depends on whether
the individual enjoys action movies) or instance-based dependency (the probability
of enjoying Terminator II depends on whether the individual enjoys The Hunt for
Red October). The latter model is essentially the same as the Bayesian network
models learned by Breese et al. [2] in the context of collaborative filtering for TV
programs.
In addition, the new flexibility in defining refined slot references allows us to
make interesting combinations of these types of dependencies. For example, whether
an individual enjoys a particular movie (e.g., True Lies) can be enough to predict
whether she watches a whole other category of movies (e.g., James Bond movies).
5.6.4
Semantics
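Using this definition, the semantics for PRM-CH are given by the following equation:
$$P(I \mid \sigma_r, \Pi) = \prod_{X \in \mathcal{X}} \prod_{x \in \sigma_r(X)} P(x.\text{Class}) \prod_{A \in \mathcal{A}(X)} P(x.A \mid \mathrm{Pa}_{x.c}(x.A)). \qquad (5.4)$$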
Figure 5.13 (a) A simple PRM with class hierarchies for the movie domain. (b)
The class dependency graph for this PRM.
Definition 5.28
The class dependency graph for a PRM with class hierarchy $\Pi_{CH}$ has the following
set of nodes for each $X \in \mathcal{X}$:
for each subclass $c \in C[X]$ and attribute $A \in \mathcal{A}(X)$, a node $X_c.A$;
a node for the subclass indicator X.Class;
and the following edges:
Type I edges: For any node $X_c.A$ and formal parent $X_c.B \in \mathrm{Pa}_c(X_c.A)$ we have
an edge $X_c.B \to X_c.A$.
Type II edges: For any attribute $X_c.A$ and formal parent $X_c.\rho.B \in \mathrm{Pa}_c(X_c.A)$,
where Range[$\rho$] = Y, we have an edge $Y.B \to X_c.A$.
Type III edges: For any attribute $X_c.A$, and for any direct superclass d, $c \prec d$,
we add an edge $X_c.A \to X_d.A$.
Figure 5.13 shows a simple class dependency graph for our movie example. The
PRM-CH is given in figure 5.13(a) and the class dependency graph is shown in
figure 5.13(b).
It is now easy to show that if this class dependency graph is acyclic, then the
instance dependency graph is acyclic.
Lemma 5.29
If the class dependency graph is acyclic for a PRM with class hierarchies $\Pi_{CH}$, then
for any relational skeleton $\sigma_r$, the colored instance dependency graph is acyclic.
And again we have the following corollary:
Corollary 5.30
Let $\Pi_{CH}$ be a PRM with class hierarchies whose class dependency structure S is
acyclic. For any relational skeleton $\sigma_r$, $\Pi_{CH}$ and $\sigma_r$ define a coherent probability
distribution over instantiations I that extend $\sigma_r$ via (5.4).
5.7
Inference in PRMs
An important aspect of any probabilistic representation is the support for making
inferences; having made some observations, how do we condition on these observations and update our probabilistic model? Inference in PRMs supports many
interesting patterns of reasoning. Oftentimes we can view the inference as influence
flowing between the interrelated objects. Consider a simple example of inference
about a particular student in our school PRM. A priori we may believe a student
is likely to be smart. We may observe his grades in several courses and see that
for the most part he received A's, but in one class he received a C. This may cause
us to slightly reduce our belief that the student is smart, but it will not change
it significantly. However, if we find that most of the other students that took the
course received high grades, we then may believe that the course is an easy course.
Since it is unlikely that a smart student got a low grade in an easy course, our
probability for the student being smart now goes down substantially.
There are several potential approaches for performing inference effectively in
PRMs. In a few cases, particularly when the skeleton is small, or it results in a
network with low tree width, we can do exact inference in the ground Bayesian
network. In other cases, when there are certain types of regularities in the ground
Bayesian network, we can still perform exact inference by carefully exploiting and
reusing computations. And in cases where the ground Bayesian network is very
large and we cannot exploit regularities in its structure, we resort to approximate
inference.
5.7.1
Exact Inference
We can always resort to exact inference on the ground Bayesian network, but
the ground Bayesian network may be very large and thus this inference may
prove intractable. Under certain circumstances, inference algorithms can exploit
the model structure to make inference tractable. Previous work on inference in
structured probabilistic models [14, 19, 18] shows how effective inference can be
done for a number of different structured probabilistic models. The algorithms make
use of the structure imposed by the class hierarchy to decompose the distribution
and effectively reuse computation.
There are two ways in which aspects of the structure can be used to make
inference more efficient. The first structural aspect is the natural encapsulation
of objects that occurs in a well-designed class hierarchy. Ideally, the interactions
between objects will occur via a small number of object attributes, and the majority
of interactions between attributes will be encapsulated within the class. This can
provide a natural decomposition of the model suitable for inference. The complexity
of the inference will depend on the width of the connections between objects; if
the width is small, we are guaranteed an efficient procedure.
The second structural aspect that is used to make inference efficient is the fact
that similar objects occur many times in the model. Pfeffer et al. [19] describe
a recursive inference algorithm that caches the computations that are done for
fragments of the model; these computations then need only be performed once, and we
can reuse them for another object occurring in the same context. We can think of
this object as a generic object, which occurs repeatedly in the model. Exploiting
these structural aspects of the model allows Pfeffer et al. [19] to achieve impressive
speedups; in a military battlespace domain the structured inference was orders of
magnitude faster than the standard Bayesian network exact inference algorithm.
5.7.2
Approximate Inference
Unfortunately, the methods used in the inference algorithm above often are not
applicable for the PRMs we study. In the majority of cases, there are no generic
objects that can be exploited. Unlike standard Bayesian network inference, we
cannot decompose this task into separate inference tasks over the objects in the
model, as they are typically all correlated. Thus, inference in the PRM requires
inference over the ground network defined by instantiating a PRM for a particular
skeleton.
In general, the ground network can be fairly complex, involving many objects that
are linked in various ways. (For example, in some of our experiments, the networks
involve hundreds of thousands of nodes.) Exact inference over these networks is
clearly impractical, so we must resort to approximate inference. We use belief
propagation (BP), a local message-passing algorithm, introduced by Pearl [17]. The
algorithm is guaranteed to converge to the correct marginal probabilities for each
node only for singly connected Bayesian networks. However, empirical results [16]
show that it often converges in general networks, and when it does the marginals
are a good approximation to the correct posterior.
We provide a brief outline of one variant of BP, and refer the reader to [20, 16, 15]
for more details. Consider a Bayesian network over some set of nodes (which in our
case would be the variables x.A). We first convert the graph into a family graph,
with a node $F_i$ for each variable $V_i$ in the Bayesian network, containing $V_i$ and its
parents. Two nodes are connected if they have some variable in common. The CPD
of $V_i$ is associated with $F_i$. Let $\phi_i$ represent the factor defined by the CPD; i.e., if $F_i$
contains the variables $V, Y_1, \ldots, Y_k$, then $\phi_i$ is a function from the domains of these
variables to [0, 1]. We also define $\psi_i$ to be a factor over $V_i$ that encompasses our
evidence about $V_i$: $\psi_i(V_i) \equiv 1$ if $V_i$ is not observed. If we observe $V_i = v$, we have
that $\psi_i(v) = 1$ and 0 elsewhere. Our posterior distribution is then $\alpha \prod_i \phi_i \times \prod_i \psi_i$,
where $\alpha$ is a normalizing constant.
The BP algorithm is now very simple. At each iteration, all the family nodes
simultaneously send messages to all of their neighbors, as follows:
$$m_{ij}(F_i \cap F_j) \leftarrow \alpha \sum_{F_i - F_j} \phi_i \cdot \psi_i \cdot \prod_{k \in N(i) - \{j\}} m_{ki},$$
where $\alpha$ is a (different) normalizing constant and N(i) is the set of families that
are neighbors of $F_i$ in the family graph. This process is repeated until the beliefs
converge. At any point in the algorithm, our marginal distribution about any family
$F_i$ is $b_i = \alpha\, \phi_i \cdot \psi_i \cdot \prod_{k \in N(i)} m_{ki}$. Each iteration is linear in the number of edges in the
Bayesian network. While the algorithm is not guaranteed to converge, it typically
converges after just a few iterations. After convergence, the $b_i$ give us approximate
marginal distributions over each of the families in the ground network.
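The message-passing scheme can be sketched in a few dozen lines of Python. This is an illustrative implementation under simplifying assumptions, not the code used in the experiments: factors are stored as dictionaries from joint assignments to values, a fixed number of iterations replaces a convergence test, and evidence is assumed consistent so that no normalizer is zero. The two-variable network in the example (A is the parent of B) and its numbers are invented.

```python
from itertools import product

def run_bp(domains, families, evidence, iterations=20):
    """Belief propagation on a family graph.

    domains:  {variable: list of values}
    families: {name: (vars, table)}; vars[0] is the child V_i, the rest are
              its parents, and table maps each joint assignment (ordered
              like vars) to the CPD value phi_i.
    evidence: {variable: observed value}; defines the indicator factors psi_i.
    """
    names = list(families)

    # Local potential phi_i * psi_i for every family.
    local = {}
    for i in names:
        vars_i, table = families[i]
        child = vars_i[0]
        local[i] = {
            a: p if evidence.get(child, a[0]) == a[0] else 0.0
            for a, p in table.items()
        }

    # Two families are neighbors when they share a variable (F_i & F_j).
    sep = {}
    for i in names:
        for j in names:
            shared = tuple(v for v in families[i][0] if v in families[j][0])
            if i != j and shared:
                sep[(i, j)] = shared

    # Start from uniform messages over each separator.
    msgs = {
        edge: {s: 1.0 for s in product(*(domains[v] for v in shared))}
        for edge, shared in sep.items()
    }

    def restrict(i, assignment, shared):
        """Project an assignment of F_i onto the variables in `shared`."""
        vars_i = families[i][0]
        return tuple(assignment[vars_i.index(v)] for v in shared)

    def combine(i, exclude=None):
        """phi_i * psi_i times all incoming messages except the one from `exclude`."""
        out = {}
        for a, p in local[i].items():
            for (k, target), m in msgs.items():
                if target == i and k != exclude:
                    p *= m[restrict(i, a, sep[(k, i)])]
            out[a] = p
        return out

    for _ in range(iterations):
        new_msgs = {}
        for (i, j), shared in sep.items():
            summed = {s: 0.0 for s in msgs[(i, j)]}
            for a, p in combine(i, exclude=j).items():
                summed[restrict(i, a, shared)] += p  # sum out F_i - F_j
            z = sum(summed.values())
            new_msgs[(i, j)] = {s: v / z for s, v in summed.items()}
        msgs = new_msgs  # all messages are updated simultaneously

    beliefs = {}
    for i in names:
        b = combine(i)
        z = sum(b.values())
        beliefs[i] = {a: v / z for a, v in b.items()}
    return beliefs

# Toy network: P(A=1) = 0.3, P(B=1 | A=1) = 0.9, P(B=1 | A=0) = 0.2; observe B = 1.
domains = {"A": [0, 1], "B": [0, 1]}
families = {
    "F_A": (("A",), {(0,): 0.7, (1,): 0.3}),
    "F_B": (("B", "A"), {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.1, (1, 1): 0.9}),
}
beliefs = run_bp(domains, families, evidence={"B": 1})
posterior_a1 = beliefs["F_A"][(1,)]
```

Because this family graph has a single edge, the belief for A agrees with the exact posterior 0.27 / 0.41; on family graphs with loops the same code returns only an approximation.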
5.8
Learning
Next, we turn our attention to learning a PRM. In the learning problem, our input
contains a relational schema that describes the basic vocabulary in the domain:
the set of classes, the attributes associated with the different classes, and the
possible types of relations between objects in the different classes. For simplicity,
in the description that follows, we assume the training data consists of a fully
specified instance of that schema; if there are missing values, then an expectation
maximization (EM) algorithm is needed as well. We begin by describing learning
PRMs with attribute uncertainty, next describe the extensions to support learning
PRMs with structural uncertainty, and then describe support for learning PRMs
with class hierarchies.
We assume that the training instance is given in the form of a relational database.
Although our approach would also work with other representations (e.g., a set of
ground facts completed using the closed-world assumption), the efficient querying
ability of relational databases is particularly helpful in our framework, and makes
it possible to apply our algorithms to large data sets.
There are two components of the learning task: parameter estimation and structure learning. In the parameter estimation task, we assume that the qualitative
dependency structure of the PRM is known; i.e., the input consists of the schema
and training database (as above), as well as a qualitative dependency structure
S. The learning task is only to fill in the parameters that define the CPDs of the
attributes. In the structure learning task, the dependency structure is not provided
(although the user can, if available, provide prior knowledge about the structure,
e.g., in the form of constraints) and the goal is to extract an entire PRM, structure
as well as parameters, from the training database alone. We discuss each of these
problems in turn.
5.8.1
Parameter Estimation
We begin with learning the parameters for a PRM where the dependency structure
is known. In other words, we are given the structure S that determines the set of
parents for each attribute, and our task is to learn the parameters $\theta_S$ that define the
CPDs for this structure. Our learning is based on a particular training set, which we
will take to be a complete instance I. While this task is relatively straightforward, it
is of interest in and of itself. In addition, it is a crucial component in the structure-learning algorithm described in the next section.
The key ingredient in parameter estimation is the likelihood function: the probability of the data given the model. This function captures the response of the
probability distribution to changes in the parameters. For a PRM,
the likelihood of a parameter set $\theta_S$ is $L(\theta_S \mid I, \sigma, S) = P(I \mid \sigma, S, \theta_S)$. As usual,
we typically work with the log of this function:
$$l(\theta_S \mid I, \sigma, S) = \log P(I \mid \sigma, S, \theta_S) = \sum_{X_i} \sum_{A \in \mathcal{A}(X_i)} \left[ \sum_{x \in \sigma(X_i)} \log P(I_{x.A} \mid I_{\mathrm{Pa}(x.A)}) \right] \qquad (5.5)$$
The key insight is that this equation is very similar to the log-likelihood of data
given a Bayesian network [11]. In fact, it is the likelihood function of the Bayesian
network induced by the PRM given the skeleton. The main difference from standard
Bayesian network parameter learning is that parameters for different nodes in the
network are forced to be identical: the parameters are shared, or tied.
5.8.1.1
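We can still use the well-understood theory of learning from Bayesian networks.
Consider the task of performing maximum likelihood parameter estimation. Here,
our goal is to find the parameter setting $\theta_S$ that maximizes the likelihood $L(\theta_S \mid
I, \sigma, S)$ for a given I, $\sigma$, and S. This estimation is simplified by the decomposition
of the log-likelihood function into a summation of terms corresponding to the various
attributes of the different classes:
$$l(\theta_S \mid I, \sigma, S) = \sum_{X_i} \sum_{A \in \mathcal{A}(X_i)} \left[ \sum_{x \in \sigma(X_i)} \log P(I_{x.A} \mid I_{\mathrm{Pa}(x.A)}) \right] = \sum_{X_i} \sum_{A \in \mathcal{A}(X_i)} \sum_{v \in V(X.A)} \sum_{u \in V(\mathrm{Pa}(X.A))} C_{X.A}[v, u] \cdot \log \theta_{v \mid u} \qquad (5.6)$$
where $C_{X.A}[v, u]$ is the number of times we observe $I_{x.A} = v$ and $I_{\mathrm{Pa}(x.A)} = u$. Each
of the terms in the above sum can be maximized independently of the rest. Hence,
maximum likelihood estimation reduces to independent maximization problems,
one for each CPD.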
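For a multinomial CPD, each of these independent maximizations has a closed form: the estimate is the count for (v, u) divided by the total count for the parent configuration u. The Python sketch below illustrates this on a handful of invented rows that stand in for the joined registration, student, and course data; the attribute values are hypothetical.

```python
from collections import Counter, defaultdict

def mle_cpd(records, child, parents):
    """Maximum likelihood estimate of P(child = v | parents = u) from counts.

    records: list of dicts, one per object x of the class, already joined
             with the attributes of its parents.
    Returns {(v, u): estimate}, where u is a tuple of parent values.
    """
    counts = Counter((r[child], tuple(r[p] for p in parents)) for r in records)
    totals = defaultdict(int)
    for (v, u), c in counts.items():
        totals[u] += c
    return {(v, u): c / totals[u] for (v, u), c in counts.items()}

rows = [
    {"grade": "A", "intelligence": "high", "difficulty": "low"},
    {"grade": "A", "intelligence": "high", "difficulty": "low"},
    {"grade": "B", "intelligence": "high", "difficulty": "low"},
    {"grade": "C", "intelligence": "low", "difficulty": "high"},
]
theta = mle_cpd(rows, "grade", ["intelligence", "difficulty"])
```

Because the parameters are tied, every registration object contributes to the same count table, regardless of which student or course it belongs to.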
For many parametric models, such as the exponential family, maximum likelihood
estimation can be done via sufficient statistics that summarize the data. In the case
of multinomial CPDs, these are just the counts we described above, $C_{X.A}[v, u]$, the
number of times we observe each of the different values v, u that the attribute X.A
and its parents can jointly take.
An important property of the database setting is that we can easily compute
sufficient statistics. To compute $C_{X.A}[v, v_1, \ldots, v_k]$, we simply query over the class
X and its parents' classes, and project onto the appropriate set of attributes. For
example, to learn the parameters for the grade CPD from our school example, we
can compute the sufficient statistics with the following SQL query:
    SELECT grade, intelligence, difficulty, COUNT(*)
    FROM registration r, student s, course c
    WHERE r.student = s.student_id   -- same join key as in the view below
      AND r.course = c.course_id     -- key name assumed by analogy
    GROUP BY grade, intelligence, difficulty
In some cases, it is useful to materialize a view that can be used to compute
the sufficient statistics. This is beneficial when the relationship between the child
attribute and the parent attribute is many-one rather than one-one or one-many.
For example, consider the dependence of attributes of Student on attributes of
Registration. In our example PRM, a student's ranking depends on the student's
grades. In this case we would construct a view using the following SQL query:
    CREATE VIEW v1 AS
    SELECT s.*, AVG(r.grade) AS ave_grade,
           AVG(r.satisfaction) AS ave_satisfaction
    FROM student s, registration r
    WHERE s.student_id = r.student
    GROUP BY s.student_id
To compute the statistics we would then project on the appropriate attributes
from view v1:
    SELECT ranking, ave_grade, COUNT(*)
    FROM v1
    GROUP BY ranking, ave_grade
Thus both the creation of the view and the process of counting occurrences can
be computed using simple database queries, and can be executed efficiently. The
view creation for each combination of classes is done once during the full learning
algorithm (we will see exactly at which point this is done in the next section when
we describe the search). If the tables being joined are indexed on the appropriate
set of foreign keys, the construction of this view is efficient: the number of rows
in the resulting table is the size of the child attribute's table; in our example this
is |Student|. Computing the sufficient statistics can be done in one pass over the
resulting table. The size of the resulting table is simply the number of unique
combinations of attribute values. We are careful to cache sufficient statistics so they
are only computed once. In some cases, we can compute new sufficient statistics
from a previously cached set of sufficient statistics; we make use of this in our
algorithm as well.
5.8.1.2
(For more details see [7].) If X.A can take on k values, then the prior is
\[ P(\theta_{X.A|u}) = \mathrm{Dir}(\theta_{X.A|u} \mid \alpha_1, \ldots, \alpha_k). \]
For a parameter prior satisfying these two assumptions, the posterior also has
this form. That is, it is a product of independent Dirichlet distributions over the
parameters $\theta_{X.A|u}$. In other words,
\[ P(\theta_{X.A|u} \mid I, \sigma, S) = \mathrm{Dir}\big(\theta_{X.A|u} \mid \alpha_{X.A}[v_1, u] + C_{X.A}[v_1, u], \ldots, \alpha_{X.A}[v_k, u] + C_{X.A}[v_k, u]\big). \]
Now that we have the posterior, we can compute the probability of new data. In
the case where the new instance is conditionally independent of the old instances
given the parameter values (which is always the case in Bayesian network models,
but may not be true here), then the probability of the new data case can be
conveniently rewritten using the expected parameters:
Proposition 5.31
Assuming multinomial CPDs, prior independence, and Dirichlet priors, with hyperparameters $\alpha_{X.A}[v, u]$, we have that
\[ E_\theta\big[P(X.A = v \mid \mathrm{Pa}(X.A) = u) \mid I\big] = \frac{C_{X.A}[v, u] + \alpha_{X.A}[v, u]}{\sum_{i=1}^{k} \big(C_{X.A}[v_i, u] + \alpha_{X.A}[v_i, u]\big)}. \]
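As a small illustration of proposition 5.31, the sketch below computes the expected parameters for a single parent instantiation u from counts and hyperparameters. The function name and the dictionary layout are assumptions made for this example only.

```python
def expected_parameters(counts, alphas):
    """Posterior-mean multinomial parameters for one parent instantiation u.

    counts: dict mapping each value v to C[v, u]
    alphas: dict mapping each value v to the Dirichlet hyperparameter alpha[v, u]
    """
    if counts.keys() != alphas.keys():
        raise ValueError("counts and alphas must cover the same values")
    # Denominator: sum over all values of (count + hyperparameter).
    total = sum(counts[v] + alphas[v] for v in counts)
    return {v: (counts[v] + alphas[v]) / total for v in counts}
```

For example, with counts {A: 3, B: 1, C: 0} and a uniform prior of 1 for each value, the estimates are 4/7, 2/7, and 1/7. No value receives probability zero, even if it was never observed.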
This suggests that the Bayesian estimate for $\theta_S$ should be computed using this
formula as well. Unfortunately, the expected parameter is not the proper Bayesian
solution for computing the probability of new data in the case where the new data
instance is not independent of previous data given the parameters. Suppose that
we want to use the posterior to evaluate the probability of an instance $I'$ of
another skeleton $\sigma'$. If there are two instances $x$ and $x'$ of the class X such that
$v_{I'}(\mathrm{Pa}(x.A)) = v_{I'}(\mathrm{Pa}(x'.A))$, then we will be relying on the same multinomial
parameter vector twice. Using the chain rule, we see that the second probability
depends on the posterior of the parameters after seeing the training data and
the first instance. In other words, the probability of a relational database given a
distribution over parameter values is not identical to the probability of the data set
when we have a point estimate of the parameters (i.e., when we act as though we
know their values). However, if the posterior is sharply peaked (i.e., we have a strong
prior, or we have seen many training instances), we can approximate the solution
using the expected parameters of proposition 5.31. We use this approximation in
our computation of the estimates for the parameters.
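The gap between the exact Bayesian answer and the point-estimate approximation can be checked with a small calculation. The sketch below is illustrative and its function names are hypothetical. It assumes two new instances that share one Dirichlet-distributed parameter vector and compares the chain-rule probability that both take value v with the square of the expected parameter.

```python
from fractions import Fraction

def prob_both_exact(count_v, alpha_v, total):
    """P(x.A = v, x'.A = v | I) when x and x' share one parameter vector.

    total is the sum over all values of (alpha + count). By the chain rule,
    the second factor conditions on the first new instance as well.
    """
    first = Fraction(count_v + alpha_v, total)
    second = Fraction(count_v + alpha_v + 1, total + 1)
    return first * second

def prob_both_point_estimate(count_v, alpha_v, total):
    """Same probability when the expected parameter is treated as known."""
    p = Fraction(count_v + alpha_v, total)
    return p * p
```

With one observation of v out of two and unit hyperparameters over two values, the exact value is 3/10 while the point estimate gives 1/4. With about two thousand observations the two differ by roughly 0.0001, which reflects the sharply peaked posterior described above.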
5.8.1.3
Structure Learning
Legal Structures
We saw in section 5.2.5.2 that we could construct a class dependency graph for
a PRM, and the PRM defined a coherent probability distribution if the class
dependency graph was stratified. During learning it is straightforward to maintain
this structure, and consider only models whose dependency structure passes this
test.
Now that we know which structures are legal, we need to decide how to evaluate
different structures in order to pick one that fits the data well. We adapt Bayesian
model selection methods to our framework. We would like to find the MAP
(maximum a posteriori) structure. Formally, we want to compute the posterior
probability of a structure S given an instantiation I. Using Bayes' rule we have that
\[ P(S \mid I, \sigma) \propto P(I \mid S, \sigma)\, P(S \mid \sigma). \]
This score is composed of two parts: the prior probability of the structure, and the
probability of the data assuming that structure.
The first component is $P(S \mid \sigma)$, which defines a prior over structures. We
assume that the choice of structure is independent of the skeleton, and thus
$P(S \mid \sigma) = P(S)$. In the context of Bayesian networks, we often use a simple
uniform prior over possible dependency structures. Unfortunately, this assumption
does not work in our setting. The problem is that there may be infinitely many
possible structures.² In our genetics example, a person's genotype can depend on
the genotype of his parents, or of his grandparents, or of his great-grandparents,
etc. A simple and natural solution penalizes long indirect slot chains, by having
log P(S) proportional to the negative of the sum of the lengths of the chains K appearing in S.
The second component is the marginal likelihood:
\[ P(I \mid S, \sigma) = \int P(I \mid S, \theta_S, \sigma)\, P(\theta_S \mid S)\, d\theta_S. \]
If we use a parameter-independent Dirichlet prior (as above), this integral decomposes into a product of integrals, each of which has a simple closed-form solution.
This is a simple generalization of the ideas used in the Bayesian score for Bayesian
networks [12].
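Proposition 5.32
If I is a complete assignment, and $P(\theta_S \mid S)$ satisfies parameter independence and
is a Dirichlet with hyperparameters $\alpha_{X.A}[v, u]$, then the marginal likelihood of I
given S is
\[ P(I \mid S, \sigma) = \prod_i \prod_{A \in \mathcal{A}(X_i)} \left[ \prod_{u \in V(\mathrm{Pa}(X_i.A))} \frac{\Gamma\big(\sum_v \alpha[v]\big)}{\Gamma\big(\sum_v (\alpha[v] + C[v])\big)} \prod_v \frac{\Gamma(\alpha[v] + C[v])}{\Gamma(\alpha[v])} \right], \qquad (5.7) \]
where $\alpha[v]$ and $C[v]$ abbreviate $\alpha_{X_i.A}[v, u]$ and $C_{X_i.A}[v, u]$, and $\Gamma(x) = \int_0^\infty t^{x-1} e^{-t}\, dt$.
2. Although there are only a finite number that are reasonable to consider for a given
skeleton.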
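In practice the score in (5.7) is evaluated in log space to avoid overflow in the Gamma function. The following is a minimal sketch of the local (per-attribute) log score, assuming the counts are already available as one list per parent instantiation. The function names and the uniform hyperparameter are assumptions for illustration.

```python
from math import lgamma

def log_dirichlet_multinomial(counts, alphas):
    """Log of the Gamma-function ratio in (5.7) for one parent instantiation u."""
    a_sum = sum(alphas)
    n_sum = sum(counts)
    score = lgamma(a_sum) - lgamma(a_sum + n_sum)
    for c, a in zip(counts, alphas):
        score += lgamma(a + c) - lgamma(a)
    return score

def log_family_score(count_table, alpha=1.0):
    """Local log score of one attribute: sum over parent instantiations u.

    count_table maps each parent instantiation u to the list of counts C[v, u].
    A uniform hyperparameter `alpha` is assumed for every value.
    """
    return sum(
        log_dirichlet_multinomial(counts, [alpha] * len(counts))
        for counts in count_table.values()
    )
```

Summing `log_family_score` over all attributes of all classes gives the log marginal likelihood; adding log P(S) gives the full structure score.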
Structure Search
Now that we have both a test for determining whether a structure is legal, and
a scoring function that allows us to evaluate different structures, we need only
provide a procedure for finding legal high-scoring structures. For Bayesian networks,
we know that this task is NP-hard [3]. As PRM learning is at least as hard as
Bayesian network learning (a Bayesian network is simply a PRM with one class
and no relations), we cannot hope to find an efficient procedure that always finds
the highest-scoring structure. Thus, we must resort to heuristic search.
As is standard in Bayesian network learning [11], we use a greedy local search
procedure that maintains a current candidate structure and iteratively modifies
it to increase the score. At each iteration, we consider a set of simple local
transformations to the current structure, score all of them, and pick the one with
the highest score. In the case where we are learning multinomial CPDs, the three
operators we use are: add edge, delete edge, and reverse edge. In the case where we
are learning tree CPDs, following [4], our operators consider only transformations to
the CPD trees. The tree structure induces the dependency structure, as the parents
of X.A are simply those attributes that appear in its CPD tree. In this case, the
two operators we use are: split, which replaces a leaf in a CPD tree by an internal node
with two leaves; and trim, which replaces the subtree at an internal node by a single leaf.
The simplest heuristic search algorithm is a greedy hill-climbing search, using our
score as a metric. We maintain our current candidate structure and iteratively improve it. At each iteration, we consider the appropriate set of local transformations
to that structure, score all of them, and pick the one with the highest score.
We refer to this simple algorithm as the greedy algorithm. There are several
common variants to improve the robustness of hill-climbing methods. One is to
make use of random restarts to deal with local maxima. In this algorithm, when we
reach a local maximum, we take some fixed number of random steps, and then we
restart our search process. Another common approach is to make use of a tabu list,
which keeps track of the most recent states visited, and allows only steps which do
not return to a recently visited state. A more sophisticated approach is to make use
of a simulated annealing style of algorithm which uses the following procedure: in
the early phases of the search we are likely to take random steps (rather than the
best step), but as the search proceeds (i.e., the temperature cools) we are less likely
to take random steps and more likely to take the best greedy step. The algorithms we
have used are either the simple greedy algorithm or a simple randomized algorithm.
Regardless of the specific heuristic search algorithm used, an important component of the search is the scoring of candidate structures. As in Bayesian networks,
the decomposability property of the score has significant impact on the computational efficiency of the search algorithm. First, we decompose the score into a sum
of local scores corresponding to individual attributes and their parents. (This local score of an individual attribute is exactly the logarithm of the term in square
brackets in (5.7).) Now, if our search algorithm considers a modification to our
current structure where the parent set of a single attribute X.A is different, only
the component of the score associated with X.A will change. Thus, we need only
reevaluate this particular component, leaving the others unchanged; this results in
major computational savings.
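The interaction between greedy search and a decomposable score can be sketched as follows. This is an illustrative simplification, not the procedure used by the authors. It supports only the add and delete operators (reverse edge is omitted), the legality test is passed in as an assumed callback, and all names are hypothetical.

```python
def greedy_search(attributes, candidates, local_score, is_legal=lambda s: True):
    """Greedy hill-climbing over parent sets using a decomposable score.

    attributes:  iterable of attribute names
    candidates:  dict mapping each attribute to its set of potential parents
    local_score: function (attribute, frozenset of parents) -> float
    is_legal:    function (structure dict) -> bool, e.g. a stratification test
    """
    cache = {}

    def score(attr, parents):
        # Each local score is computed once and then reused.
        key = (attr, parents)
        if key not in cache:
            cache[key] = local_score(attr, parents)
        return cache[key]

    structure = {a: frozenset() for a in attributes}
    while True:
        best_gain, best_move = 0.0, None
        for attr, parents in structure.items():
            for p in candidates[attr]:
                new_parents = parents ^ {p}  # add p if absent, delete it if present
                # Only the family of `attr` changes, so only its term is rescored.
                gain = score(attr, new_parents) - score(attr, parents)
                if gain > best_gain and is_legal({**structure, attr: new_parents}):
                    best_gain, best_move = gain, (attr, new_parents)
        if best_move is None:
            return structure  # local maximum: no legal move improves the score
        structure[best_move[0]] = best_move[1]
```

Because a move touches a single family, the gain is a difference of two local scores, and the cache ensures that no family is scored twice over the whole search.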
However, there are still a very large number of possible structures to consider.
We propose a heuristic search algorithm that addresses this issue. At a high level,
the algorithm proceeds in phases. At each phase k, we have a set of potential
parents $\mathrm{Pot}_k(X.A)$ for each attribute X.A. We then do a standard structure search
restricted to the space of structures in which the parents of each X.A are in
$\mathrm{Pot}_k(X.A)$. The advantage of this approach is that we can precompute the view
corresponding to X.A, $\mathrm{Pot}_k(X.A)$; most of the expensive computations (the joins
and the aggregation required in the definition of the parents) are precomputed in
these views. The sufficient statistics for any subset of potential parents can easily be
derived from this view. The above construction, together with the decomposability
of the score, allows the steps of the search (say, greedy hill-climbing) to be done
very efficiently.
The success of this approach depends on the choice of the potential parents.
Clearly, a bad initial choice can result in poor structures. Following [8], which
examines a similar approach in the context of learning Bayesian networks, we
propose an iterative approach that starts with some structure (possibly one where
each attribute does not have any parents), and selects the sets $\mathrm{Pot}_k(X.A)$ based
on this structure. We then apply the search procedure and get a new, higher-scoring structure. We choose new potential parents based on this new structure
and reiterate, stopping when no further improvement is made.
It remains only to discuss the choice of $\mathrm{Pot}_k(X.A)$ at the different phases. Perhaps
the simplest approach is to begin by setting $\mathrm{Pot}_1(X.A)$ to be the set of attributes
in X. In successive phases, $\mathrm{Pot}_{k+1}(X.A)$ would consist of all of $\mathrm{Pa}_k(X.A)$, as well
as all attributes that are related to X via slot chains of length < k. Of course, these
new attributes may require aggregation; we may either specify the appropriate
aggregator or search over the space of possible aggregators.
This scheme expands the set of potential parents at each iteration. In some cases,
however, it may result in a large set of potential parents. In such cases we may want to
use a more refined algorithm that only adds parents to $\mathrm{Pot}_{k+1}(X.A)$ if they seem to
add value beyond $\mathrm{Pa}_k(X.A)$. There are several reasonable ways of evaluating the
additional value provided by new parents. Some of these are discussed by Friedman
et al. [8] in the context of learning Bayesian networks. These results suggest that
we should evaluate a new potential parent by measuring the change of score for
the family of X.A if we add (X.K.B) to its current parents. We can then choose
the highest scoring of these, as well as the current parents, to be the new set of
potential parents. This approach would allow us to significantly reduce the size of
the potential parent set, and thereby of the resulting view, while typically causing
insignificant degradation in the quality of the learned model.
5.8.4
Next, we describe how to extend the basic PRM learning algorithm to deal with
structural uncertainty. For PRMs with reference uncertainty, we additionally
attempt to learn the rules that govern the link models. For PRMs with existence
uncertainty, we learn the probability of existence of relationship objects.
5.8.4.1
The extension to scoring required to deal with reference uncertainty is not a difficult
one. Once we fix the partitions defined by the attributes $P[\rho]$, a CPD for $S_\rho$
compactly defines a distribution over values of $\rho$. Thus, scoring the success in
predicting the value of $\rho$ can be done efficiently using standard Bayesian methods
used for attribute uncertainty (e.g., using a standard Dirichlet prior over values of
$\rho$).
The extension to search the model space for incorporating reference uncertainty
involves expanding our search operators to allow the addition (and deletion) of
attributes to the partition definition for each reference slot. Initially, the partition of
the range class for a slot $X.\rho$ is not given in the model. Therefore, we must also
search for the appropriate set of attributes $P[\rho]$. We introduce two new operators,
refine and abstract, which modify the partition by adding and deleting attributes
from $P[\rho]$. Initially, $P[\rho]$ is empty for each $\rho$. The refine operator adds an attribute
into $P[\rho]$; the abstract operator deletes one. As mentioned earlier, we can define
the partition simply by looking at the cross product of the values for each of
the partition attributes, or using a decision tree. In the case of a decision tree,
refine adds a split to one of the leaves and abstract removes a split. These newly
introduced operators are treated by the search algorithm in exactly the same way
as the standard edge-manipulation operators: the change in the score is evaluated
for each possible operator, and the algorithm selects the best one to execute.
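For the cross-product form of the partition, the effect of refine and abstract can be sketched in a few lines. The data layout (objects as dictionaries of attribute values) and the function names are assumptions for illustration; the decision-tree variant is not shown.

```python
from collections import defaultdict

def partition(objects, attrs):
    """Group object names by their values on the partition attributes.

    objects: dict mapping object name -> dict of attribute values
    attrs:   tuple of partition attributes (the empty tuple gives one cell)
    """
    cells = defaultdict(list)
    for name, obj in objects.items():
        cells[tuple(obj[a] for a in attrs)].append(name)
    return dict(cells)

def refine(attrs, attr):
    """Add an attribute to the partition definition (no-op if already present)."""
    return attrs if attr in attrs else attrs + (attr,)

def abstract(attrs, attr):
    """Delete an attribute from the partition definition."""
    return tuple(a for a in attrs if a != attr)
```

Starting from the empty attribute set, every object of the range class falls in a single cell; each refine step splits cells by one more attribute, and abstract merges them again.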
We note that, as usual, the decomposition of the score can be exploited to
substantially speed up the search. In general, the score change resulting from an
operator $\omega$ is reevaluated only after applying an operator $\omega'$ that modifies the parent
or partition set of an attribute that $\omega$ modifies. This is also true when we consider
operators that modify the parent of selector attributes.
5.8.4.2
The extension of the Bayesian score to PRMs with existence uncertainty is
straightforward; the exists attribute is simply a new descriptive attribute. The only new
issue is how to compute sufficient statistics that include existence attributes x.E
without explicitly enumerating all the nonexistent entities. We perform this computation by counting, for each possible instantiation of Pa(X.E), the number of
potential objects with that instantiation, and subtracting the actual number of
objects x with that parent instantiation.
Let u be a particular instantiation of Pa(X.E). To compute $C_{X.E}[\mathit{true}, u]$, we
can use a standard database query to compute how many objects $x \in \sigma(X)$ have
Pa(x.E) = u. To compute $C_{X.E}[\mathit{false}, u]$, we need to compute the number of
potential entities. We can do this without explicitly considering each $(x_1, \ldots, x_k) \in I(Y_1) \times \cdots \times I(Y_k)$
by decomposing the computation as follows: Let $\rho$ be a reference
slot of X with $\mathrm{Range}[\rho] = Y$. Let $\mathrm{Pa}_\rho(X.E)$ be the subset of parents of X.E along
slot $\rho$ and let $u_\rho$ be the corresponding instantiation. We count the number of y
consistent with $u_\rho$. If $\mathrm{Pa}_\rho(X.E)$ is empty, this count is simply $|I(Y)|$. The product
of these counts is the number of potential entities. To compute $C_{X.E}[\mathit{false}, u]$, we
simply subtract $C_{X.E}[\mathit{true}, u]$ from this number.
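The counting argument above can be written out directly. The following sketch assumes each reference slot's range is available as a list of attribute dictionaries and that the parents of X.E along a slot are given as attribute-value constraints; the names are hypothetical.

```python
from math import prod

def count_consistent(objects, constraints):
    """Number of objects y that agree with the parent values along one slot.

    With no constraints (no parents along this slot) this is simply len(objects).
    """
    return sum(all(obj[a] == v for a, v in constraints.items()) for obj in objects)

def existence_counts(true_count, ranges, constraints_per_slot):
    """Sufficient statistics C[true, u] and C[false, u] for one instantiation u.

    true_count:           number of existing objects x with Pa(x.E) = u
    ranges:               dict slot -> list of objects in the slot's range class
    constraints_per_slot: dict slot -> {attribute: value} taken from u
    """
    # The number of potential entities is the product of the per-slot counts.
    potential = prod(
        count_consistent(ranges[slot], constraints_per_slot.get(slot, {}))
        for slot in ranges
    )
    return {"true": true_count, "false": potential - true_count}
```

The cost is linear in the sizes of the range classes, rather than in the size of their cross product.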
No extensions to the search algorithm are required to handle existence uncertainty. We simply introduce the new attributes X.E, and integrate them into the
search space. Our search algorithm now considers operators that add, delete, or
reverse edges involving the exists attributes. As usual, we enforce coherence using
the class dependency graph. In addition to having an edge from Y.E to X.E for
every slot $\rho \in R(X)$ whose range type is Y, when we add an edge from Y.B to
X.A, we add an edge from Y.E to X.E and an edge from Y.E to X.A.
5.8.5
Learning PRM-CHs
We now turn to learning PRMs with class hierarchies. We examine two scenarios:
in one case the class hierarchies are given as part of the input and in the other, in
addition to learning the PRM, we also must learn the class hierarchy. The learning
algorithms use the same criteria for scoring the models; however, the search space
is significantly different.
5.8.6
We begin with the simpler learning with class hierarchies scenario, where we assume
that the class hierarchy is given as part of the input. As in section 5.8, we restrict
attention to fully observable data sets. Hence, we assume that, in our training set,
the class of each object is given. Without this assumption, the subclass indicator
attribute would play the role of a hidden variable, greatly complicating the learning
algorithm.
As discussed above, we need a scoring function that allows us to evaluate different
candidate structures, and a search procedure that searches over the space of possible
structures. The scoring function remains largely unchanged. For each object x in
each class X, we have the basic subclass c to which it belongs. For each attribute A
of this object, the probabilistic model then specifies the subclass d of X from which
c inherits the CPD of X.A. Then x.A contributes only to the sufficient statistics for
the CPD of $X_d.A$. With that recomputation of the sufficient statistics, the Bayesian
score can now be computed unchanged.
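The recomputation of sufficient statistics under a class hierarchy can be sketched as follows. The representation is an assumption made for this example: the hierarchy is a child-to-parent dictionary, the set `specialized` lists the (class, attribute) pairs that own their own CPD, and for simplicity every CPD of the attribute is taken to have the same parent attributes.

```python
from collections import Counter

def cpd_owner(subclass, attr, parent_of, specialized):
    """Walk up from `subclass` to the nearest class whose CPD for attr is specialized."""
    c = subclass
    while (c, attr) not in specialized:
        c = parent_of[c]  # the root is assumed to be specialized
    return c

def hierarchy_counts(objects, attr, parent_attrs, parent_of, specialized):
    """Count (owner class d, value of attr, parent values u) over all objects.

    Each object contributes only to the CPD of the class d from which its
    basic subclass inherits the CPD of the attribute.
    """
    counts = Counter()
    for obj in objects:
        d = cpd_owner(obj["class"], attr, parent_of, specialized)
        u = tuple(obj[p] for p in parent_attrs)
        counts[(d, obj[attr], u)] += 1
    return counts
```

Once the counts are grouped by owner class, each group can be scored exactly as in the flat case.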
Next we extend our search algorithm to make use of the subclass hierarchy. First,
we extend our phased search to allow the introduction of new subclasses. Then, we
introduce a new set of operators. The new operators allow us to refine and abstract
the CPDs of attributes in our model, using our class hierarchy to guide us.
5.8.6.1
New subclasses can be introduced at any point in the search. We may construct
all the subclasses at the start of our search, or we may consider introducing them
more gradually, perhaps at each phase of the search. Regardless of when the new
subclasses are introduced, the search space is greatly expanded, and care must be
taken to avoid the construction of an intractable search problem. Here we describe
the mechanics of the introduction of the new subclasses.
For each new subclass introduced, each attribute for the subclass is associated
with a CPD. A CPD can be marked as either inherited or specialized. Initially,
only the CPDs for attributes of X are marked as specialized; all the other CPDs
are inherited. Our original search operators (those that add and delete parents)
can be applied to attributes at all levels of the class hierarchy. However, we
only allow parents to be added and deleted from attributes whose CPDs have been
specialized. Note that any change to the parents of an attribute is propagated to
any descendants of the attribute whose CPDs are marked as inherited from this
attribute.
Next, we introduce the operators Specialize and Inherit. If $X_c.A$ currently has
an inherited CPD, we can apply Specialize($X_c.A$). This has two effects. First, it
recomputes the parameters of that CPD to utilize only the sufficient statistics of
the subclass c. To understand this point, assume that $X_c.A$ was being inherited
from $X_d$ prior to the specialization. The CPD of $X_d.A$ was being computed using all
objects in $I(X_d)$. After the change, the CPD will be computed using just the objects
in $I(X_c)$. The second effect of the operator is that it makes the CPD modifiable,
in that we can now add new parents or delete them. The Inherit operator has the
opposite effect.
In addition, when a new subclass is introduced, we construct new refined slot
references that make use of the subclass. Let D be a newly introduced subclass
of Y. For each reference slot $\rho$ of some class X with range Y, we introduce a
new refined slot reference $\rho_{\langle D \rangle}$. In addition, we add each reference slot of Y to D;
however, we refine the domain from Y to D. In other words, we have the new
reference slot $\rho'$, where $\mathrm{Dom}[\rho'] = D$ and $\mathrm{Range}[\rho'] = X$.
5.8.6.2
We next examine the case where the subclass hierarchies are not given as part of
the input. In this case, we will learn them at the same time we are learning the
PRM.
As above, we wish to avoid the problem of learning from partially observable data.
Hence, we need to assume that the basic subclasses are observed in the training set.
At first glance, this requirement seems incompatible with our task definition: if the
class hierarchy is not known, how can we observe subclasses in the training data?
We resolve this problem by defining our class hierarchy based on the standard class
attributes. For example, movies might be associated with an attribute specifying the
genre: action, drama, or documentary. If our search algorithm decides that this
attribute is a useful basis for forming subclasses, we would define subclasses
deterministically from its values. Another attribute might be the reputation of the
director. The algorithm might choose to refine the class hierarchy by partitioning
sitcoms according to the values of this attribute. Note that, in this case, the class
hierarchy depends on an attribute of a related class, not the class itself.
We implement this approach by requiring that the subclass indicator attribute
be a deterministic function of its parents. These parents are the attributes used to
define the subclass hierarchy. In our example, Movie.Class would have as parents
Movie.Genre and Movie.Director.Reputation. Note that, as the function defining the
subclass indicator variable is required to be deterministic, the subclass is effectively
observed in the training data (due to the assumption that all other attributes are
observed).
We restrict attention to decision-tree CPDs. The leaves in the decision tree
represent the basic subclasses, and the attributes used for splitting the decision
tree are the parents of the subclass indicator variable. We can allow binary splits
that test whether an attribute has a particular value, or, if we find it necessary, we
can allow a split on all possible values of an attribute.
The decision tree gives a simple algorithm for determining the subclass of an
object. In order to build the decision tree during our search, we introduce a new
operator Split(X, c, X.K.B), where c is a leaf in the current decision tree for X.Class
and X.K.B is the attribute on which we will split that subclass.
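The deterministic subclass assignment and the Split operator can be illustrated with a tiny tree representation. This is a sketch under assumptions: a tree is either a string (a leaf naming a basic subclass) or a pair (attribute, children), splits are on all values of the attribute, and the leaf naming scheme is invented for the example.

```python
def subclass_of(tree, obj):
    """Follow the decision tree to the leaf (basic subclass) for one object."""
    while not isinstance(tree, str):
        attr, children = tree
        tree = children[obj[attr]]
    return tree

def split(tree, leaf, attr, values):
    """Return a new tree in which `leaf` is replaced by a split on `attr`.

    One new leaf is created per value; all other parts of the tree are kept.
    """
    if isinstance(tree, str):
        if tree != leaf:
            return tree
        return (attr, {v: f"{leaf}/{attr}={v}" for v in values})
    a, children = tree
    return (a, {v: split(sub, leaf, attr, values) for v, sub in children.items()})
```

For instance, starting from the single leaf "Movie", a split on genre followed by a split of the drama leaf on the director's reputation assigns each movie to exactly one basic subclass, so the subclass is observed whenever the splitting attributes are.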
Note that this step expands the space of models that can be considered, but in
isolation does not change the score of the model. Thus, if we continue to use a purely
greedy search, we would never take these steps. There are several approaches for
addressing this problem. One is to use some lookahead for evaluating the quality of
such a step. Another is to use various heuristics for guiding us toward worthwhile
splits. For example, if an attribute is the common parent of many other attributes
within Xc , it may be a good candidate on which to split.
The other operators, Specialize and Inherit, remain the same; they simply use the
subclasses defined by the decision tree.
5.9
Conclusion
In this chapter we have described a comprehensive framework for learning a statistical model from relational data. We have presented a method for the automatic
construction of a PRM from an existing database. Our method learns a structured
statistical model directly from the relational database, without requiring the data
to be flattened into a fixed attribute-value format. We have shown how to perform
parameter estimation, developed a scoring criterion for use in structure selection,
and defined the model search space. We have also provided algorithms for guaranteeing the coherence of the learned model.
References
[1] C. Boutilier, N. Friedman, M. Goldszmidt, and D. Koller. Context-specific
independence in Bayesian networks. In Proceedings of the Conference on
Uncertainty in Artificial Intelligence, 1996.
[2] J. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive
algorithms for collaborative filtering. In Proceedings of the Conference on
Uncertainty in Artificial Intelligence, 1998.
[3] D. Chickering. Learning Bayesian networks is NP-complete. In Artificial
Intelligence and Statistics, 1996.
[4] D. Chickering, D. Heckerman, and C. Meek. A Bayesian approach to learning
Bayesian networks with local structure. In Proceedings of the Conference on
Uncertainty in Artificial Intelligence, 1997.
[5] D. Cohn and T. Hofmann. The missing link - a probabilistic model of document
content and hypertext connectivity. In Proceedings of Neural Information
Processing Systems, 2001.
[6] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam,
and S. Slattery. Learning to extract symbolic knowledge from the World Wide
Web. In Proceedings of the National Conference on Artificial Intelligence, 1998.
[7] M. H. DeGroot. Optimal Statistical Decisions. McGraw-Hill, New York, 1970.
[8] N. Friedman, I. Nachman, and D. Pe'er. Learning of Bayesian network structure
from massive datasets: The sparse candidate algorithm. In Proceedings of
the Conference on Uncertainty in Artificial Intelligence, 1999.
[9] L. Getoor. Learning Statistical Models from Relational Data. PhD thesis,
Stanford University, Stanford, CA, 2001.
[10] L. Getoor and J. Grant. PRL: A probabilistic relational language. Machine
Learning Journal, 62(1-2):7-31, 2006.
[11] D. Heckerman. A tutorial on learning with Bayesian networks. In M. I. Jordan,
editor, Learning in Graphical Models, pages 301-354. MIT Press, Cambridge,
MA, 1998.
[12] D. Heckerman, D. Geiger, and D. Chickering. Learning Bayesian networks:
The combination of knowledge and statistical data. Machine Learning, 20:
197-243, 1995.
[13] D. Koller and A. Pfeffer. Probabilistic frame-based systems. In Proceedings
of the Conference on Uncertainty in Artificial Intelligence, 1998.
[14] D. Koller and A. Pfeffer. Object-oriented Bayesian networks. In Proceedings
of the Conference on Uncertainty in Artificial Intelligence, 1997.
[15] D. MacKay, R. McEliece, and J. Cheng. Turbo decoding as an instance
of Pearl's belief propagation algorithm. IEEE Journal on Selected Areas in
Communication, 16(2):140-152, 1997.
[16] K. Murphy and Y. Weiss. Loopy belief propagation for approximate inference:
An empirical study. In Proceedings of the Conference on Uncertainty in
Artificial Intelligence, 1999.
[17] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann,
San Francisco, 1988.
[18] A. Pfeffer. Probabilistic Reasoning for Complex Systems. PhD thesis, Stanford
University, Stanford, CA, 2000.
[19] A. Pfeffer, D. Koller, B. Milch, and K. Takusagawa. SPOOK: A system for
probabilistic object-oriented knowledge representation. In Proceedings of the
Conference on Uncertainty in Artificial Intelligence, 1999.
[20] Y. Weiss. Correctness of local probability propagation in graphical models
with loops. Neural Computation, 12(1):1-41, 2000.
One of the key challenges for statistical relational learning is the design of a representation language that allows flexible modeling of complex relational interactions.
Many of the formalisms presented in this book are based on directed graphical models (probabilistic relational models, probabilistic entity-relationship models, Bayesian logic programs). In this chapter, we present a probabilistic modeling
framework that builds on undirected graphical models (also known as Markov random fields or Markov networks). Undirected models address two limitations of the
previous approach. First, undirected models do not impose the acyclicity constraint
that hinders representation of many important relational dependencies in directed
models. Second, undirected models are well suited for discriminative training, where
we optimize the conditional likelihood of the labels given the features, which generally improves classification accuracy. We show how to train these models effectively, and how to use approximate probabilistic inference over the learned model
for collective classification and link prediction. We provide experimental results on
hypertext and social network domains, showing that accuracy can be significantly
improved by modeling relational dependencies.¹
6.1
Introduction
We focus on supervised learning as a motivation for our framework. The vast
majority of work in statistical classification methods has focused on flat data,
that is, data consisting of identically structured entities, typically assumed to be i.i.d.
However, many real-world data sets are innately relational: hyperlinked webpages,
cross-citations in patents and scientific papers, social networks, medical records,
and more. Such data consists of entities of different types, where each entity type is
characterized by a different set of attributes. Entities are related to each other via
different types of links, and the link structure is an important source of information.
Consider a collection of hypertext documents that we want to classify using
some set of labels. Most naively, we can use a bag-of-words model, classifying each
webpage solely using the words that appear on the page. However, hypertext has a
very rich structure that this approach loses entirely. One document has hyperlinks
to others, typically indicating that their topics are related. Each document also
has internal structure, such as a partition into sections; hyperlinks that emanate
from the same section of the document are even more likely to point to similar
documents. When classifying a collection of documents, these are important cues
that can potentially help us achieve better classification accuracy. Therefore, rather
than classifying each document separately, we want to provide a form of collective
classification, where we simultaneously decide on the class labels of all of the entities
together, and thereby can explicitly take advantage of the correlations between the
labels of related entities.
Another challenge arises from the task of predicting which entities are related to
which others and what are the types of these relationships. For example, in a data
set consisting of a set of hyperlinked university webpages, we might want to predict
not just which page belongs to a professor and which to a student, but also which
professor is which student's advisor. In some cases, the existence of a relationship
will be predicted by the presence of a hyperlink between the pages, and we will have
only to decide whether the link reflects an advisor-advisee relationship. In other
cases, we might have to infer the very existence of a link from indirect evidence,
such as a large number of coauthored papers.
We propose the use of a joint probabilistic model for an entire collection of
related entities. Following the work of Lafferty et al. [13], we base our approach on
discriminatively trained undirected graphical models, or Markov networks [17]. We
introduce the framework of relational Markov networks (RMNs), which compactly
defines a Markov network over a relational data set. The graphical structure of
an RMN is based on the relational structure of the domain, and can easily model
complex patterns over related entities. For example, we can represent a pattern
where two linked documents are likely to have the same topic. We can also capture
patterns that involve groups of links: for example, consecutive links in a document
tend to refer to documents with the same label. As we show, the use of an undirected
graphical model avoids the difficulties of defining a coherent generative model for
graph structures in directed models. It thereby allows us tremendous flexibility in
representing complex patterns.
Undirected models lend themselves well to discriminative training, where we optimize the conditional likelihood of the labels given the features. Discriminative
training, given sufficient data, generally provides significant improvements in classification accuracy over generative training (see [23]). We provide an effective parameter estimation algorithm for RMNs which uses conjugate gradient combined
with approximate probabilistic inference (belief propagation [17, 14, 12]) for estimating the gradient. We also show how to use approximate probabilistic inference
over the learned model for collective classification and link prediction. We provide
experimental results on a webpage classification and social network task, showing
significant gains in accuracy arising both from the modeling of relational dependencies and the use of discriminative training.
6.2
not actually exist. We denote this event using a binary existence attribute Exists,
which is true if the link between the associated entities exists and false otherwise.
In our example, our model may contain a potential link $\ell$ for each pair of webpages,
and the value of the variable $\ell$.Exists determines whether the link actually exists or
not. The link prediction task now reduces to the problem of predicting the existence
attributes of these link objects.
6.3
this approach works well for classifying scientific documents, using both the words
in the title and abstract and the citation-link structure.
However, the application of this idea to other domains, such as webpages, is
problematic since there are many cycles in the link graph, leading to cycles in the
induced Bayesian network, which is therefore not a coherent probabilistic model.
Getoor et al. [8] suggest an approach where we do not include direct dependencies
between the labels of linked webpages, but rather treat links themselves as random
variables. Every pair of pages has a potential link, which may or may not exist
in the data. The model defines the probability of the link existence as a function
of the labels of the two endpoints. In this link existence model, labels have no
incoming edges from other labels, and the cyclicity problem disappears. This model,
however, has other fundamental limitations. In particular, the resulting Bayesian
network has a random variable for each potential link: $N^2$ variables for collections
containing N pages. This quadratic blowup occurs even when the actual link graph
is very sparse. When N is large (e.g., the set of all webpages), a quadratic growth is
intractable. Even more problematic are the inherent limitations on the expressive
power imposed by the constraint that the directed graph must represent a coherent
generative model over graph structures. The link existence model assumes that the
presence of each edge is a conditionally independent event. Representing more
complex patterns involving correlations between multiple edges is very difficult. For
example, if two pages point to the same page, it is more likely that they point to
each other as well. Such interactions between many overlapping triples of links do
not fit well into the generative framework.
Furthermore, directed models such as Bayesian networks and PRMs are usually
trained to optimize the joint probability of the labels and other attributes, while the
goal of classification is a discriminative model of labels given the other attributes.
The advantage of training a model only to discriminate between labels is that
it does not have to trade off between classification accuracy and modeling the
joint distribution over nonlabel attributes. In many cases, discriminatively trained
models are more robust to violations of independence assumptions and achieve
higher classification accuracy than their generative counterparts.
In our experiments, we found that the combination of a relational language with
a probabilistic graphical model provides a very flexible framework for modeling
complex patterns common in relational graphs. First, as observed by Getoor et al.
[7], there are often correlations between the attributes of entities and the relations
in which they participate. For example, in a social network, people with the same
hobby are more likely to be friends.
We can also exploit correlations between the labels of entities and the relation
type. For example, only students can be teaching assistants in a course. We can
easily capture such correlations by introducing cliques that involve these attributes.
Importantly, these cliques are informative even when attributes are not observed
in the test data. For example, if we have evidence indicating an advisor-advisee
relationship, our probability that X is a faculty member increases, and thereby our
belief that X participates in a teaching assistant link with some entity Z decreases.
We also found it useful to consider richer subgraph templates over the link graph.
One useful type of template is a similarity template, where objects that share a
certain graph-based property are more likely to have the same label. Consider, for
example, a professor X and two other entities Y and Z. If X's webpage mentions Y
and Z in the same context, it is likely that the X-Y relation and the X-Z relation are
of the same type; for example, if Y is Professor X's advisee, then probably so is Z.
Our framework accommodates these patterns easily, by introducing pairwise cliques
between the appropriate relation variables.
Another useful type of subgraph template involves transitivity patterns, where
the presence of an A-B link and of a B-C link increases (or decreases) the likelihood
of an A-C link. For example, students often assist in courses taught by their advisor.
Note that this type of interaction cannot be accounted for by just using pairwise
cliques. By introducing cliques over triples of relations, we can capture such patterns
as well. We can incorporate even more complicated patterns, but of course we are
limited by the ability of belief propagation to scale up as we introduce larger cliques
and tighter loops in the Markov network.
We note that our ability to model these more complex graph patterns relies on
our use of an undirected Markov network as our probabilistic model. In contrast,
the approach of Getoor et al. [8] uses directed graphical models (Bayesian networks
and PRMs [11]) to represent a probabilistic model of both relations and attributes.
Their approach easily captures the dependence of link existence on attributes of
entities. But the constraint that the probabilistic dependency graph be a directed
acyclic graph makes it hard to see how we would represent the subgraph patterns
described above. For example, for the transitivity pattern, we might consider simply
directing the correlation edges between link existence variables arbitrarily. However,
it is not clear how we would then parameterize a link existence variable for a link
that is involved in multiple triangles. See [20] for further discussion.
6.4
Markov Networks
Definition 6.1
Let $G = (V, E)$ be an undirected graph with a set of cliques $C(G)$. Each $c \in C(G)$
is associated with a set of nodes $V_c$ and a clique potential $\phi_c(V_c)$, which is a nonnegative function defined on the joint domain of $V_c$. Let $\Phi = \{\phi_c(V_c)\}_{c \in C(G)}$. The
Markov net $(G, \Phi)$ defines the distribution $P(v) = \frac{1}{Z} \prod_{c \in C(G)} \phi_c(v_c)$, where $Z$ is
the partition function, a normalization constant given by $Z = \sum_{v'} \prod_{c \in C(G)} \phi_c(v'_c)$.
Each potential $\phi_c$ is simply a table of values for each assignment $v_c$ that defines
a compatibility between values of variables in the clique. The potential is often
represented by a log-linear combination of a small set of features:
$$\phi_c(v_c) = \exp\Big\{\sum_i w_i f_i(v_c)\Big\} = \exp\{w_c \cdot f_c(v_c)\}.$$
The simplest and most common form of a feature is the indicator function
$f(V_c) \equiv \delta(V_c = v_c)$. However, features can be arbitrary logical predicates of the
variables of the clique, $V_c$. For example, if the variables are binary, a feature might
signify the parity or whether the variables are all the same value. More generally,
the features can be real-valued functions, not just binary predicates. See further
discussion of features at the end of section 6.4.
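The definition above can be made concrete on a very small network. The following is a minimal sketch, not the authors' implementation: it assumes three binary variables on a chain, two edge cliques, a single "all values equal" feature per clique, and hypothetical weights, and it computes the partition function by brute-force enumeration (feasible only for tiny networks).

```python
import itertools
import math

# Toy Markov network: three binary variables on a chain 0 - 1 - 2.
# The cliques are the two edges; the weights are hypothetical.
cliques = [(0, 1), (1, 2)]
weights = [1.0, 0.5]


def same(values):
    """Indicator feature: 1.0 if all variables in the clique agree."""
    return 1.0 if len(set(values)) == 1 else 0.0


def potential(w, values):
    """Log-linear clique potential phi_c(v_c) = exp(w * f(v_c))."""
    return math.exp(w * same(values))


def unnormalized(v):
    """Product of clique potentials for a full assignment v."""
    p = 1.0
    for (i, j), w in zip(cliques, weights):
        p *= potential(w, (v[i], v[j]))
    return p


assignments = list(itertools.product([0, 1], repeat=3))
Z = sum(unnormalized(v) for v in assignments)  # partition function
P = {v: unnormalized(v) / Z for v in assignments}
```

Assignments in which neighboring variables agree receive more probability mass, which is the "compatibility" reading of the potentials.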
We will abbreviate the log-linear representation as follows:
$$\log P(v) = \sum_c w_c \cdot f_c(v_c) - \log Z = w \cdot f(v) - \log Z;$$
of all of the entities in an instantiation given the relational structure and the content
attributes. (We provide the definitions directly for the conditional case, as the
unconditional case is a special case where the set of content attributes is empty.)
Roughly speaking, it specifies the cliques and potentials between attributes of
related entities at a template level, so a single model provides a coherent distribution
for any collection of instances from the schema.
For example, suppose that pages with the same label tend to link to each other,
as in figure 6.1. We can capture this correlation between labels by introducing,
for each link, a clique between the labels of the source and the target page. The
potential on the clique will have higher values for assignments that give a common
label to the linked pages.
To specify what cliques should be constructed in an instantiation, we will define
a notion of a relational clique template. A relational clique template specifies tuples
of variables in the instantiation by using a relational query language. For our link
example, we can write the template as a kind of SQL query:
SELECT doc1.Category, doc2.Category
FROM Doc doc1, Doc doc2, Link link
WHERE link.From = doc1.Key and link.To = doc2.Key
Note the three clauses that define a query: the FROM clause specifies the cross
product of entities to be filtered by the WHERE clause and the SELECT clause
picks out the attributes of interest. Our definition of clique templates contains the
corresponding three parts.
Definition 6.3
A relational clique template $C = (F, W, S)$ consists of three components:
$F = \{F_i\}$: a set of entity variables, where an entity variable $F_i$ is of type $E(F_i)$.
$W(F.R)$: a Boolean formula using conditions of the form $F_i.R_j = F_k.R_l$.
$F.S \subseteq F.X \cup F.Y$: a selected subset of content and label attributes in $F$.
For the clique template corresponding to the SQL query above, $F$ consists
of doc1, doc2, and link of types Doc, Doc, and Link, respectively. $W(F.R)$ is
link.From = doc1.Key $\wedge$ link.To = doc2.Key and $F.S$ is doc1.Category and
doc2.Category.
A clique template specifies a set of cliques in an instantiation $I$:
$$C(I) \equiv \{c = f.S : f \in I(F) \wedge W(f.r)\},$$
where $f$ is a tuple of entities $\{f_i\}$ in which each $f_i$ is of type $E(F_i)$;
$I(F) = I(E(F_1)) \times \ldots \times I(E(F_n))$ denotes the cross product of entities in the instantiation;
the clause $W(f.r)$ ensures that the entities are related to each other in specified
ways; and finally, $f.S$ selects the appropriate attributes of the entities. Note that
the clique template does not specify the nature of the interaction between the
attributes; that is determined by the clique potentials, which will be associated
with the template.
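To illustrate how a template selects cliques, here is a minimal sketch of the link template evaluated over a toy instantiation. The document names, the representation of links as (From, To) pairs, and the function name `link_template` are assumptions made for illustration; a real system would push the WHERE clause into a join rather than enumerate the full cross product.

```python
from itertools import product

# Toy instantiation I: documents identified by key, links as (From, To) pairs.
docs = ["d1", "d2", "d3"]
links = [("d1", "d2"), ("d2", "d3")]


def link_template(docs, links):
    """Return the cliques selected by the link template."""
    cliques = []
    # FROM: cross product of the entity variables doc1, doc2, link.
    for doc1, doc2, link in product(docs, docs, links):
        # WHERE: link.From = doc1.Key and link.To = doc2.Key.
        if link[0] == doc1 and link[1] == doc2:
            # SELECT: the Category attributes of the two documents.
            cliques.append((f"{doc1}.Category", f"{doc2}.Category"))
    return cliques


cliques = link_template(docs, links)
```

Each returned pair names the two label variables that will share a clique; the template says nothing about the potential placed on that clique.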
This definition of a clique template is very flexible, as the WHERE clause of
a template can be an arbitrary predicate. It allows modeling complex relational
patterns on the instantiation graphs. To continue our webpage example, consider
another common pattern in hypertext: links in a webpage tend to point to pages of
the same category. This pattern can be expressed by the following template:
SELECT doc1.Category, doc2.Category
FROM Doc doc1, Doc doc2, Link link1, Link link2
WHERE link1.From = link2.From and link1.To = doc1.Key
and link2.To = doc2.Key and not doc1.Key = doc2.Key
Depending on the expressive power of our template definition language, we
may be able to construct very complex templates that select entire subgraph
structures of an instantiation. We can easily represent patterns involving three (or
more) interconnected documents without worrying about the acyclicity constraint
imposed by directed models. Since the clique templates do not explicitly depend on
the identities of entities, the same template can select subgraphs whose structure
is fairly different. The RMN allows us to associate the same clique potential
parameters with all of the subgraphs satisfying the template, thereby allowing
generalization over a wide range of different structures.
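Definition 6.4
A relational Markov network $M = (\mathbf{C}, \Phi)$ specifies a set of clique templates $\mathbf{C}$ and
corresponding potentials $\Phi = \{\phi_C\}_{C \in \mathbf{C}}$ to define a conditional distribution:
$$P(I.y \mid I.x, I.r) = \frac{1}{Z(I.x, I.r)} \prod_{C \in \mathbf{C}} \prod_{c \in C(I)} \phi_C(I.x_c, I.y_c),$$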
$\sum_{c \in C(I)} f_C(I.x_c, I.y_c)$
is the sum over all appearances of the template $C(I)$ in the instantiation, and $f$ is
the vector of all $f_C$.
Given a particular instantiation $I$ of the schema, the RMN $M$ produces an
unrolled Markov network over the attributes of entities in $I$. The cliques in the
unrolled network are determined by the clique templates $\mathbf{C}$. We have one clique for
each $c \in C(I)$, and all of these cliques are associated with the same clique potential
$\phi_C$. In our webpage example, an RMN with the link feature described above would
define a Markov net in which, for every link between two pages, there is an edge
between the labels of these pages. Figure 6.1 illustrates a simple instance of this
unrolled Markov network.
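As a rough illustration of unrolling, the sketch below builds the conditional distribution over page labels for a three-page instantiation, reusing one weight for every clique that the link template selects. The page names, node scores (standing in for potentials on observed content), and weight value are hypothetical, and the distribution is computed by brute-force enumeration rather than by the approximate inference used in the chapter.

```python
import itertools
import math

labels = ["course", "student"]
pages = ["p1", "p2", "p3"]
links = [("p1", "p2"), ("p2", "p3")]  # one clique per link (link template)

# Hypothetical log-potentials of each label given the page content.
node_score = {
    "p1": {"course": 2.0, "student": 0.0},
    "p2": {"course": 0.0, "student": 0.0},
    "p3": {"course": 0.0, "student": 0.5},
}
w_same = 1.0  # single shared weight for "linked pages share a label"


def log_score(y):
    """w . f(x, y): node terms plus the shared link-template term."""
    s = sum(node_score[p][y[p]] for p in pages)
    # The same parameter is reused for every clique the template selects.
    s += w_same * sum(1.0 for a, b in links if y[a] == y[b])
    return s


def conditional():
    """P(y | x, r) by enumerating all joint labelings of the pages."""
    ys = [dict(zip(pages, combo))
          for combo in itertools.product(labels, repeat=len(pages))]
    scores = [math.exp(log_score(y)) for y in ys]
    z = sum(scores)
    return [(y, s / z) for y, s in zip(ys, scores)]


dist = conditional()
marginal_p2_course = sum(p for y, p in dist if y["p2"] == "course")
```

Page p2 has no content evidence of its own, yet its marginal leans toward course because it is linked to p1; this is the collective effect the shared link potential is meant to capture.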
Note that we leave the clique potentials to be specified using arbitrary sets of
feature functions. A common set is the complete table of indicator functions, one
for each instantiation of the discrete-valued variables in the clique. However, this
results in a large number of parameters (exponential in the number of variables).
Often, as we encounter in our experiments, only a subset of the instantiations is
of interest or many instantiations are essentially equivalent because of symmetries.
For example, in an edge potential between labels of two webpages linked from a
given page, we might want to have a single feature tracking whether the two labels
are the same. In the case of triad cliques enforcing transitivity, we might constrain
features to be symmetric functions with respect to the variables. In the presence of
continuous-valued variables, features are often a predicate on the discrete variables
multiplied by a continuous value. We do not prescribe a language for specifying
features (as does Markov logic; see chapter 11), although in our implementation,
we use a combination of logical formulae and custom-designed functions.
6.5
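and our task is to compute the weights $w$ for the potentials $\Phi$. In the learning task,
we are given some training set $D$ where both the content attributes and the labels
are observed. Any particular setting for $w$ fully specifies a probability distribution
$P_w$ over $D$, so we can use the likelihood as our objective function, and attempt to
find the weight setting that maximizes the likelihood (ML) of the labels given other
attributes. However, to help avoid overfitting, we assume a prior over the weights
(a zero-mean Gaussian), and use maximum a posteriori (MAP) estimation. More
precisely, we assume that different parameters are a priori independent and define
$$p(w_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\{-w_i^2 / 2\sigma^2\}.$$
For a data set of instances $d \in D$, the log of the resulting MAP objective and its gradient are
$$L(w, D) = \sum_{d \in D} \log P_w(y_d \mid x_d) - \frac{\|w\|_2^2}{2\sigma^2} + C,$$
$$\nabla L(w, D) = \sum_{d \in D} \big( f(x_d, y_d) - \mathbb{E}_{P_w}[f(x_d, Y_d)] \big) - \frac{w}{\sigma^2}.$$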
The last term is the shrinking effect of the prior and the other two terms are the
difference between the expected feature counts and the empirical feature counts,
where the expectation is taken relative to $P_w$:
$$\mathbb{E}_{P_w}[f(x_d, Y_d)] = \sum_{y'_d} f(x_d, y'_d) P_w(y'_d \mid x_d).$$
Thus, ignoring the effect of the prior, the gradient is zero when empirical and
expected feature counts are equal (see footnote 2). The prior term gives the smoothing we expect
from the prior: small weights are preferred in order to reduce overfitting. Note that
the sum over $y'_d$ is just over the possible categorizations for one data sample every
time.
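The flat-case gradient described above (empirical minus expected feature counts, minus a prior term) can be sketched for a small multinomial log-linear classifier. The feature map, the three-label setup, the toy data, and the prior variance below are assumptions chosen for illustration only; the per-instance expectation is a sum over the labels of that one instance.

```python
import math

LABELS = [0, 1, 2]
SIGMA2 = 1.0  # hypothetical prior variance sigma^2


def features(x, y):
    """f(x, y): the input vector x placed in the block for label y."""
    f = [0.0] * (len(x) * len(LABELS))
    for i, xi in enumerate(x):
        f[y * len(x) + i] = xi
    return f


def dot(w, f):
    return sum(wi * fi for wi, fi in zip(w, f))


def posterior(w, x):
    """P_w(y | x) for every label y (softmax of w . f(x, y))."""
    scores = [dot(w, features(x, y)) for y in LABELS]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def objective(w, data):
    """Conditional log-likelihood minus the Gaussian prior penalty."""
    ll = sum(math.log(posterior(w, x)[y]) for x, y in data)
    return ll - dot(w, w) / (2 * SIGMA2)


def gradient(w, data):
    """Empirical minus expected feature counts, minus w / sigma^2."""
    g = [-wi / SIGMA2 for wi in w]
    for x, y in data:
        p = posterior(w, x)
        empirical = features(x, y)
        for k in range(len(w)):
            expected_k = sum(p[yy] * features(x, yy)[k] for yy in LABELS)
            g[k] += empirical[k] - expected_k
    return g


data = [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 1.0], 2)]
w = [0.1 * k for k in range(6)]
g = gradient(w, data)
```

A finite-difference comparison against `objective` is a cheap way to check such a gradient before handing it to a conjugate gradient routine.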
2. The solution of ML estimation with log-linear models is also the solution to the dual
problem of maximum entropy estimation with constraints that empirical and expected
feature counts must be equal [4].
6.5.2
Learning RMNs
The analysis for the relational setting is very similar. Now, our data set $D$ is actually
a single instantiation $I$, where the same parameters are used multiple times, once
for each different entity that uses a feature. A particular choice of parameters $w$
specifies a particular RMN, which induces a probability distribution $P_w$ over the
unrolled Markov network. The product of the likelihood of $I$ and the parameter
prior defines our objective function, whose gradient $\nabla L(w, I)$ again consists of the
empirical feature counts minus the expected feature counts and a smoothing term
due to the prior:
$$f(I.y, I.x, I.r) - \mathbb{E}_w[f(I.Y, I.x, I.r)] - \frac{w}{\sigma^2},$$
where the expectation $\mathbb{E}_w[f(I.Y, I.x, I.r)]$ is
$$\sum_{I.y'} f(I.y', I.x, I.r) P_w(I.y' \mid I.x, I.r).$$
This last formula reveals a key difference between the relational and the flat
case: the sum over $I.y'$ involves the exponential number of assignments to all the
label attributes in the instantiation. In the flat case, the probability decomposes
as a product of probabilities for individual data instances, so we can compute the
expected feature count for each instance separately. In the relational case, these
labels are correlated; indeed, this correlation was our main goal in defining this
model. Hence, we need to compute the expectation over the joint assignments to all
the entities together. Computing these expectations over an exponentially large set
is the expensive step in calculating the gradient. It requires that we run inference
on the unrolled Markov network.
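The cost of the joint expectation is easy to see on a toy case. The sketch below assumes a triangle of three linked pages with binary labels, a single shared "linked pages agree" feature, and a hypothetical prior variance; the expected count is computed by enumerating every joint labeling, which is exactly the step that becomes infeasible as the instantiation grows.

```python
import itertools
import math

LABELS = [0, 1]
links = [(0, 1), (1, 2), (0, 2)]  # toy instantiation: a triangle of pages
observed_y = (0, 0, 1)            # training labels I.y
SIGMA2 = 1.0                      # hypothetical prior variance


def f(y):
    """Shared feature: number of links whose endpoints have the same label."""
    return sum(1.0 for a, b in links if y[a] == y[b])


def expected_count(w):
    """E_w[f(I.Y)]: a sum over all joint labelings of the pages."""
    ys = list(itertools.product(LABELS, repeat=3))
    scores = [math.exp(w * f(y)) for y in ys]
    z = sum(scores)
    return sum(s / z * f(y) for y, s in zip(ys, scores))


def gradient(w):
    """Empirical count minus expected count minus w / sigma^2."""
    return f(observed_y) - expected_count(w) - w / SIGMA2


g0 = gradient(0.0)
```

With n pages and k labels the enumeration visits k^n labelings, so in practice the expectation is approximated by running inference on the unrolled network.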
6.5.3
The inference task in our conditional Markov networks is to compute the posterior distribution over the label variables in the instantiation given the content
variables. Exact algorithms for inference in graphical models can execute this process efficiently for specific graph topologies such as sequences, trees, and other low
treewidth graphs. However, the networks resulting from domains such as our hypertext classification task are very large (in our experiments, they contain tens
of thousands of nodes) and densely connected. Exact inference is completely intractable in these cases.
We therefore resort to approximate inference. There is a wide variety of approximation schemes for Markov networks, including sampling and variational methods.
We chose to use belief propagation (BP) for its simplicity and relative efficiency and
accuracy. BP is a local message passing algorithm introduced by Pearl [17] and
later related to turbo-coding by McEliece et al. [14]. It is guaranteed to converge to
the correct marginal probabilities for each node only for singly connected Markov
networks. Empirical results [15] show that it often converges in general networks,
and when it does, the marginals are a good approximation to the correct posteriors.
As our results in section 6.6 show, this approach works well in our domain. We refer
the reader to chapter 2 in this book for a detailed description of the BP algorithm.
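For readers who want a concrete picture of the message passing, the following is a minimal sketch of sum-product BP on a pairwise Markov network with synchronous updates and a fixed iteration count. It is a generic textbook formulation, not the implementation used in the experiments; the function name, data layout, and toy potentials are assumptions. On a singly connected network such as the chain in the usage lines it yields the exact marginals; on loopy networks the result is an approximation and convergence is not guaranteed.

```python
import math


def belief_propagation(nodes, edges, node_pot, edge_pot, n_states, iters=50):
    """Sum-product BP on a pairwise Markov network; returns node marginals.

    node_pot[i][s] is the node potential; edge_pot[(i, j)][s][t] is the edge
    potential for x_i = s, x_j = t.
    """
    nbrs = {i: [] for i in nodes}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)

    def pot(i, j, s, t):
        # Look up the edge potential regardless of the stored direction.
        if (i, j) in edge_pot:
            return edge_pot[(i, j)][s][t]
        return edge_pot[(j, i)][t][s]

    # msgs[(i, j)][t]: message from i to j about state t of j (uniform start).
    msgs = {(i, j): [1.0 / n_states] * n_states for i in nodes for j in nbrs[i]}
    for _ in range(iters):
        new = {}
        for (i, j) in msgs:
            out = []
            for t in range(n_states):
                total = 0.0
                for s in range(n_states):
                    prod = node_pot[i][s] * pot(i, j, s, t)
                    for k in nbrs[i]:
                        if k != j:
                            prod *= msgs[(k, i)][s]
                    total += prod
                out.append(total)
            z = sum(out)
            new[(i, j)] = [v / z for v in out]  # normalize for stability
        msgs = new

    beliefs = {}
    for i in nodes:
        b = [node_pot[i][s] * math.prod(msgs[(k, i)][s] for k in nbrs[i])
             for s in range(n_states)]
        z = sum(b)
        beliefs[i] = [v / z for v in b]
    return beliefs


# Usage on a three-node chain with hypothetical potentials.
nodes = [0, 1, 2]
edges = [(0, 1), (1, 2)]
node_pot = {0: [3.0, 1.0], 1: [1.0, 1.0], 2: [1.0, 2.0]}
same = [[2.0, 1.0], [1.0, 2.0]]  # favors equal labels on linked nodes
edge_pot = {(0, 1): same, (1, 2): same}
beliefs = belief_propagation(nodes, edges, node_pot, edge_pot, n_states=2)
```

A practical implementation would also monitor the change in messages between iterations and stop early once it falls below a tolerance.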
6.6
Experimental Results
We present experiments with collective classification and link prediction, in both
hypertext and social network data.
6.6.1
Experiments on WebKB
We experimented with our framework on the WebKB data set [3], which is an
instance of our hypertext example. The data set contains webpages from four different computer science departments: Cornell, Texas, Washington, and Wisconsin.
Each page has a label attribute, representing the type of webpage, which is one of
course, faculty, student, project, or other. The data set is problematic in that the
category other is a grab bag of pages of many different types. The number of pages
classified as other is quite large, so that a baseline algorithm that simply always
selected other as the label would get an average accuracy of 75%. We could restrict
attention to just the pages with the four other labels, but in a relational classification setting, the deleted webpages might be useful in terms of their interactions
with other webpages. Hence, we compromised by eliminating all other pages with
fewer than three outlinks, making the number of other pages commensurate with
the other categories (see footnote 3). For each page, we have access to the entire HTML of the
page and the links to other pages. Our goal is to collectively classify webpages into
one of these five categories. In all of our experiments, we learn a model from three
schools and test the performance of the learned model on the remaining school,
thus evaluating the generalization performance of the different models.
Unfortunately, we cannot directly compare our accuracy results with previous
work because different papers use different subsets of the data and different training/test splits. However, we compare to standard text classifiers such as naive Bayes,
logistic regression, and support vector machines, which have been demonstrated to
be successful on this data set [9].
3. The resulting category distribution is: course (237), faculty (148), other (332), research project (82), and student (542). The numbers of remaining pages for each school are: Cornell
(280), Texas (292), Washington (315), and Wisconsin (454). The numbers of links for each
school are: Cornell (574), Texas (574), Washington (728), and Wisconsin (1614).
[Figure 6.2: plots not reproduced. (a) Test Error of Logistic, Naive Bayes, and Svm, with series Words and Words+Meta. (b) Test Error on Cor, Tex, Wash, Wisc, and AVG, with series Logistic, Link, Section, and Link+Section.]
6.6.1.1
Flat Models
The simplest approach we tried predicts the categories based on just the text content
on the webpage. The text of the webpage is represented using a set of binary
attributes that indicate the presence of different words on the page. We found that
stemming and feature selection did not provide much benefit and simply pruned
words that appeared in fewer than three documents in each of the three schools
in the training data. We also experimented with incorporating metadata: words
appearing in the title of the page, in anchors of links to the page, and in the
last header before a link to the page [24]. Note that metadata, although mostly
originating from pages linking into the considered page, are easily incorporated as
features, i.e., the resulting classification task is still flat feature-based classification.
Our first experimental setup compares three well-known text classifiers, Naive
Bayes, linear support vector machines (Svm; see footnote 4), and logistic regression (Logistic),
using words and metawords. The results, shown in figure 6.2(a), show that the
two discriminative approaches outperform Naive Bayes. Logistic and Svm give very
similar results. The average error over the four schools was reduced by around 4%
by introducing the metadata attributes.
4. We trained one-against-others SVM for each category and during testing, picked the
category with the largest margin.
6.6.1.2
Relational Models
Figure 6.3
category, none, assigned when the two most frequent categories of the links are less
than a factor of 2 apart. In the entire data set, the breakdown of labels for the
sections we found is: course (40), faculty (24), other (187), research project (11),
student (71), and none (17). Note that these labels are hidden in the test data, so
the learning algorithm now also has to learn to predict section labels. Although not
our final aim, correct prediction of section labels is very helpful. Words appearing
in the last header before the section are used to better predict the section label by
introducing a clique over these words and section labels.
We compared the performance of Link, Section, and Section+Link (a combined
model which uses both types of cliques) on the task of predicting webpage labels,
relative to the baseline of flat logistic regression with metadata. Our experiments
used MAP estimation with a Gaussian prior on the feature weights with standard
deviation of 0.3. Figure 6.2(b) compares the average error achieved by the different
models on the four schools, training on three and testing on the fourth. We see
that incorporating any type of relational information consistently gives significant
improvement over the baseline model. The Link model incorporates more relational
interactions, but each is a weaker indicator. The Section model ignores links outside
of coherent sections, but each of the links it includes is a very strong indicator. In
general, we see that the Section model performs slightly better. The joint model
is able to combine benefits from both and generally outperforms all of the other
models. The only exception is for the task of classifying the Wisconsin data. In
this case, the joint Section+Link model contains many links, as well as some large
tightly connected loops, so belief propagation did not converge for a subset of nodes.
Hence, the results of the inference, which was stopped at a fixed arbitrary number
of iterations, were highly variable and resulted in lower accuracy.
6.6.1.3
[Plot not reproduced: Test Error on Cor, Tex, Wash, Wisc, and AVG, with series Exists+Naive Bayes, Exists+Logistic, and Link.]
Figure 6.4 Comparison of generative and discriminative relational models. Exists+Naive Bayes is completely generative. Exists+Logistic is generative in the links,
but locally discriminative in the page labels given the local features (words, metawords). The Link model is completely discriminative.
on both pages' labels. We can also consider an alternative Exists+Logistic model that
uses a discriminative model for the connection between page label and words,
i.e., uses logistic regression for the conditional probability distribution of page label
given words. This model has equivalent expressive power to the naive Bayes model
but is discriminatively rather than generatively trained. Finally, the Link model is
a fully discriminative (undirected) variant we have presented earlier, which uses a
discriminative model for the label given both words and link existence. The results,
shown in figure 6.4, show that discriminative training provides a significant improvement in accuracy: the Link model outperforms Exists+Logistic, which in turn
outperforms Exists+Naive Bayes.
As illustrated in table 6.1, the gain in accuracy comes at some cost in training
time: for the generative models, parameter estimation is closed form while the
discriminative models are trained using conjugate gradient, where each iteration
requires inference over the unrolled RMN. On the other hand, both types of
models require inference when the model is used on new data; the generative
model constructs a much larger, fully connected network, resulting in significantly
longer testing times. We also note that the situation changes if some of the data
is unobserved in the training set. In this case, generative training also requires an
iterative procedure (such as the expectation-maximization (EM) algorithm) where
each iteration uses the significantly more expensive inference.
Table 6.1 (caption and units not recovered)
            Links    Links+Section    Exists+NB
Training    1530     6060             1
Testing     7        10               100
6.6.2
We collected and manually labeled a new relational data set inspired by WebKB [3].
Our data set consists of computer science department webpages from three schools:
Stanford, Berkeley, and MIT. A total of 2954 pages are labeled into one of eight
categories: faculty, student, research scientist, staff, research group, research project,
course, and organization (organization refers to any large entity that is not a
research group). Owned pages, which are owned by an entity but are not the main
page for that entity, were manually assigned to that entity. The average distribution
of classes across schools is: organization (9%), student (40%), research group (8%),
faculty (11%), course (16%), research project (7%), research scientist (5%), and
staff (3%).
We established a set of candidate links between entities based on evidence of a
relation between them. One type of evidence for a relation is a hyperlink from an
entity page or one of its owned pages to the page of another entity. A second type
of evidence is a virtual link: We assigned a number of aliases to each page using
the page title, the anchor text of incoming links, and email addresses of the entity
involved. Mentioning an alias of a page on another page constitutes a virtual link.
The resulting set of 7161 candidate links were labeled as corresponding to one of
five relation types: advisor (faculty, student), member (research group/project,
student/faculty/research scientist), teach (faculty/research scientist/staff, course),
TA (student, course), part-of (research group, research project); or none,
denoting that the link does not correspond to any of these relations.
The observed attributes for each page are the words on the page itself and the
metawords on the page: the words in the title, section headings, and anchors to the
page from other pages. For links, the observed attributes are the anchor text, text
just before the link (hyperlink or virtual link), and the heading of the section in
which the link appears.
Our task is to predict the relation type, if any, for all the candidate links. We
tried two settings for our experiments: with page categories observed (in the test
data) and page categories unobserved. For all our experiments, we trained on two
schools and tested on the remaining school.
Observed entity labels We first present results for the setting with observed
page categories. Given the page labels, we can rule out many impossible relations;
the resulting label breakdown among the candidate links is: none (38%), member
(34%), part-of (4%), advisor (11%), teach (9%), TA (5%).
There is a huge range of possible models that one can apply to this task. We
selected a set of models that we felt represented some range of patterns that
manifested in the data.
Link-Flat is our baseline model, predicting links one at a time using multinomial
logistic regression. This is a strong classifier, and its performance is competitive
with other classifiers (e.g., support vector machines). The features used by this
model are the labels of the two linked pages and the words on the links going from
one page and its owned pages to the other page. The number of features is around
1000.
[Figure 6.5: plots not reproduced. (a) Accuracy on ber, mit, sta, and ave, with series Flat, Triad, Section, and Section & Triad. (b) Accuracy on ber, mit, sta, and ave, with series Flat and Neigh.]
The relational models try to improve upon the baseline model by modeling the
interactions between relations and predicting relations jointly. The Section model
introduces cliques over relations whose links appear consecutively in a section on a
page. This model tries to capture the pattern that similarly related entities (e.g.,
advisees, members of projects) are often listed together on a webpage. This pattern
is a type of similarity template, as described in section 6.3. The Triad model is a
type of transitivity template, as discussed in section 6.3. Specifically, we introduce
cliques over sets of three candidate links that form a triangle in the link graph. The
Section & Triad model includes the cliques of the two models above.
As shown in figure 6.5(a), both the Section and Triad models outperform the flat
model, and the combined model has an average accuracy gain of 2.26%, or 10.5%
relative reduction in error. As we only have three runs (one for each school), we
cannot meaningfully analyze the statistical significance of this improvement.
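As a quick consistency check on the two reported numbers, assume (this is not stated in the text) that the 2.26% gain is an absolute difference in average accuracy and that the relative reduction is computed from the same averages. Writing $e_{\text{flat}}$ for the flat model's average error, which is a symbol introduced here only for illustration:

```latex
\frac{2.26}{e_{\text{flat}}} \approx 0.105
\quad\Longrightarrow\quad
e_{\text{flat}} \approx \frac{2.26}{0.105} \approx 21.5\%
```

Under that assumption the flat model's average accuracy would be near 78.5% and the combined model's near 80.7%. These are back-of-envelope values implied by the reported figures, not numbers reported in the experiments.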
As an example of the interesting inferences made by the models, we found a
student-professor pair that was misclassified by the Flat model as none (there is only
a single hyperlink from the student's page to the advisor's) but correctly identified
by both the Section and Triad models. The Section model utilizes a paragraph on the
student's webpage describing his or her research, with a section of links to research
groups and the link to his or her advisor. Examining the parameters of the Section
model clique, we found that the model learned that it is likely for people to mention
their research groups and advisors in the same section. By capturing this trend, the
Section model is able to increase the confidence of the student-advisor relation. The
Triad model corrects the same misclassification in a different way. Using the same
example, the Triad model makes use of the information that both the student and
the teacher belong to the same research group, and the student TAed a class taught
by his advisor. It is important to note that none of the other relations are observed
in the test data, but rather the model bootstraps its inferences.
[Figure 6.6: plot not reproduced. Results on ber, mit, sta, and ave, with series Phased (Flat/Flat), Phased (Neigh/Flat), Phased (Neigh/Sec), Joint+Neigh, and Joint+Neigh+Sec.]
Unobserved entity labels When the labels of pages are not known during
relations prediction, we cannot rule out possible relations for candidate links based
on the labels of participating entities. Thus, we have many more candidate links that
do not correspond to any of our relation types (e.g., links between an organization
and a student). This makes the existence of relations a very low-probability event,
with the following breakdown among the potential relations: none (71%), member
(16%), part-of (2%), advisor (5%), teach (4%), TA (2%). In addition, when we
construct a Markov network in which page labels are not observed, the network
is much larger and denser, making the (approximate) inference task much harder.
Thus, in addition to models that try to predict page entity and relation labels
simultaneously, we also tried a two-phase approach, where we first predict page
categories, and then use the predicted labels as features for the model that predicts
relations.
For predicting page categories, we compared two models. The Entity-Flat model
is a multinomial logistic regression that uses words and metawords from the page
and its owned pages in separate bags of words. The number of features is roughly
10,000. The Neighbors model is a relational model that exploits another type of
similarity template: pages with similar URLs often belong to the same category or
tightly linked categories (research group/project, professor/course). For each page,
two pages with URLs closest in edit distance are selected as neighbors, and we
introduced pairwise cliques between neighboring pages. Figure 6.5(b) shows that
the Neighbors model clearly outperforms the Flat model across all schools, by an
average of 4.9% accuracy gain.
Figure 6.7 (a) Average precision-recall breakeven point for 10%, 25%, 50% observed links. (b)
Average precision-recall breakeven point for each fold of school residences at 25% observed links.
[Plots not reproduced; legend: flat, compatibility; panel (a) horizontal axis: 10% observed, 25% observed, 50% observed; panel (b) horizontal axis: DD, JL, TX, 67, FG, LM, BC, SS; vertical axis ticks from 0.4 to 0.75]
Given the page categories, we can now apply the different models for link
classification. Thus, the Phased (Flat/Flat) model uses the Entity-Flat model to
classify the page labels, and then the Link-Flat model to classify the candidate
links using the resulting entity labels. The Phased (Neighbors/Flat) model uses the
Neighbors model to classify the entity labels, and then the Link-Flat model to classify
the links. The Phased (Neighbors/Section) model uses the Neighbors model to classify the
entity labels and then the Section model to classify the links.
We also tried two models that predict page and relation labels simultaneously.
The Joint + Neighbors model is simply the union of the Neighbors model for page
categories and the Flat model for relation labels given the page categories. The Joint
+ Neighbors + Section model additionally introduces the cliques that appeared in
the Section model between links that appear consecutively in a section on a page.
We train the joint models to predict both page and relation labels simultaneously.
As the proportion of the none relation is so large, we use the probability
of none to define a precision-recall curve. If this probability is less than some
threshold, we predict the most likely label (other than none); otherwise we predict
the most likely label (including none). As usual, we report results at the precision-recall
breakeven point on the test data. Figure 6.6 shows the breakeven points
achieved by the different models on the three schools. Relational models, both
phased and joint, did better than flat models on the average. However, performance
varies from school to school and for both joint and phased models, performance on
one of the schools is worse than that of the flat model.
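The thresholding rule used to trace out the precision-recall curve can be sketched as follows. This is only an illustration of the rule as stated in the text: the function name predict_relation, the dictionary representation of the label distribution, and the example probabilities are assumptions, not part of the original experiments.

```python
def predict_relation(probs, threshold):
    """Pick a relation label from a dict mapping label -> probability.

    If p(none) is below the threshold, 'none' is excluded before taking
    the most likely label; otherwise all labels (including 'none') compete.
    """
    if probs["none"] < threshold:
        candidates = {k: v for k, v in probs.items() if k != "none"}
    else:
        candidates = probs
    return max(candidates, key=candidates.get)


# Hypothetical label distribution for one candidate link.
example = {"none": 0.40, "advisor": 0.35, "member": 0.25}

# Sweeping the threshold trades recall of real relations against precision.
high_threshold_label = predict_relation(example, threshold=0.5)
low_threshold_label = predict_relation(example, threshold=0.3)
```

Raising the threshold makes the rule more willing to predict a relation other than none, so varying it moves along the precision-recall curve.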
6.6.3
The data set we used has been collected by a portal website at a large university that
hosts an online community for students [1]. Among other services, it allows students
to enter information about themselves, create lists of their friends, and browse the
social network. Personal information includes residence, gender, major, and year, as
well as favorite sports, music, books, social activities, etc. We focused on the task of
predicting the friendship links between students from their personal information
and a subset of their links. We selected students living in sixteen different residences
or dorms and restricted the data to the friendship links only within each residence,
eliminating interresidence links from the data to generate independent training/test
splits. Each residence has about fifteen to twenty-five students and an average
student lists about 25% of his or her housemates as friends.
We used an eight-fold train-test split, where we trained on fourteen residences and
tested on two. Predicting links between two students from just personal information
alone is a very difficult task, so we tried a more realistic setting, where some
proportion of the links is observed in the test data, and can be used as evidence for
predicting the remaining links. We used the following proportions of observed links
in the test data: 10%, 25%, and 50%. The observed links were selected at random,
and the results we report are averaged over five folds of these random selection
trials.
Using just the observed portion of links, we constructed the following flat features:
for each student, the proportion of students in the residence that list him/her and
the proportion of students he/she lists; for each pair of students, the proportion of
other students they have as common friends. The values of the proportions were
discretized into four bins. These features capture some of the relational structure
and dependencies between links: Students who list (or are listed by) many friends
in the observed portion of the links tend to have links in the unobserved portion as
well. More importantly, having friends in common increases the likelihood of a link
between a pair of students.
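The flat link features described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the text does not give the bin boundaries, so four equal-width bins are assumed; "common friends" is taken to mean other students listed by both members of the pair; and the student names and observed links are invented.

```python
def to_bin(proportion, n_bins=4):
    """Discretize a proportion in [0, 1] into n_bins equal-width bins (assumed)."""
    return min(int(proportion * n_bins), n_bins - 1)


def link_features(students, observed):
    """Compute binned flat features from observed directed (lister, listed) links."""
    n_others = len(students) - 1
    lists = {s: {b for (a, b) in observed if a == s} for s in students}
    listed_by = {s: {a for (a, b) in observed if b == s} for s in students}

    # Per-student features: (proportion listing s, proportion s lists), binned.
    per_student = {
        s: (to_bin(len(listed_by[s]) / n_others), to_bin(len(lists[s]) / n_others))
        for s in students
    }

    # Per-pair feature: proportion of other students both members list, binned.
    per_pair = {}
    for i, a in enumerate(students):
        for b in students[i + 1:]:
            common = (lists[a] & lists[b]) - {a, b}
            per_pair[(a, b)] = to_bin(len(common) / (len(students) - 2))
    return per_student, per_pair


# Hypothetical residence with five students and a few observed links.
students = ["ann", "bob", "cat", "dan", "eve"]
observed = {("ann", "cat"), ("bob", "cat"), ("ann", "dan"),
            ("bob", "dan"), ("cat", "ann")}
per_student, per_pair = link_features(students, observed)
```

In this toy residence, ann and bob both list cat and dan, so the pair (ann, bob) falls in a high common-friends bin even though neither lists the other.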
The Flat model uses logistic regression with the above features as well as personal
information about each user. In addition to the individual characteristics of the two
people, we also introduced a feature for each match of a characteristic; for example,
both people are computer science majors or both are freshmen.
The Compatibility model uses a type of similarity template, introducing cliques
between each pair of links emanating from each person. Similarly to the Flat model,
these cliques include a feature for each match of the characteristics of the two
potential friends. This model captures the tendency of a person to have friends
who share many characteristics (even though the person might not possess them).
For example, a student may be friends with several computer science majors, even
though he is not a CS major himself. We also tried models that used transitivity
templates, but the approximate inference with 3-cliques often failed to converge or
produced erratic results.
Figure 6.7(a) compares the average precision-recall breakeven point achieved by the
different models at the three different settings of observed links. Figure 6.7(b) shows
the performance on each of the eight folds containing two residences each. Using
a paired t-test, the Compatibility model outperforms Flat with p-values 0.0036,
0.00064, and 0.054 for the 10%, 25%, and 50% settings, respectively.
6.7
References
[1] L. Adamic, O. Buyukkokten, and E. Adar. A social network caught in the web.
http://www.hpl.hp.com/shl/papers/social/, 2002.
[2] S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization
using hyperlinks. In Proceedings of ACM International Conference on Management of Data, 1998.
[3] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam,
and S. Slattery. Learning to extract symbolic knowledge from the World Wide
Web. In Proceedings of the National Conference on Artificial Intelligence, 1998.
[4] S. Della Pietra, V. Della Pietra, and J. Lafferty. Inducing features of random
fields. IEEE Transactions on Pattern Analysis and Machine Intelligence,
19(4):380-393, 1997.
[5] L. Egghe and R. Rousseau. Introduction to Informetrics. Elsevier, Amsterdam,
1990.
[6] N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic
relational models. In Proceedings of the International Joint Conference on
In this chapter, we introduce a graphical language for relational data called the
probabilistic entity-relationship (PER) model. The model is an extension of the
entity-relationship model, a common model for the abstract representation of
database structure. We concentrate on the directed version of this model: the
directed acyclic probabilistic entity-relationship (DAPER) model. The DAPER
model is closely related to the plate model and the probabilistic relational model
(PRM), existing models for relational data. The DAPER model is more expressive
than either existing model, and also helps to demonstrate their similarity. In
addition to describing the new language, we discuss important facets of modeling
relational data, including the use of restricted relationships, self relationships, and
probabilistic relationships. Many examples are provided.
7.1 Introduction
For over a century, statistical modeling has focused primarily on flat data, that is, data
that can be encoded naturally in a single two-dimensional table having rows and
columns. The disciplines of pattern recognition, machine learning, and data mining
have had a similar focus. Notable exceptions include hierarchical models (e.g., [11])
and spatial statistics (e.g., [1]). Over the last decade, however, perhaps due to the
ever-increasing volumes of data being stored in databases, the modeling of nonflat
or relational data has increased significantly. During this time, several graphical
languages for relational data have emerged, including plate models (e.g., [3, 9]) and
probabilistic relational models (PRMs) (e.g., [5]). These models are to relational
data what ordinary graphical models (e.g., directed acyclic graphs and undirected
graphs) are to flat data.
In this chapter, we introduce a new graphical model for relational data: the
probabilistic entity-relationship (PER) model. This model class is more expressive
7.2
to be written as
$$p(\mathbf{x}) = \prod_{i=1}^{n} p(x_i \mid \mathbf{pa}_i), \tag{7.1}$$
where pa_i are the attributes corresponding to the parents of node X_i. The local
distributions of the DAG model are the set of conditional probability distributions
p(x_i | pa_i), i = 1, . . . , n. Thus, a DAG model for X specifies the joint distribution
for X.
An example DAG model structure for attributes (X, Y, Z, W) is shown in figure 7.1(a).
The structure (i.e., the missing arcs) encodes the independencies: (1) X
and Z are independent given Y, and (2) (Y, Z) and W are independent given X.
We note that DAG models can be interpreted as a generative model for the data. In
our example, we can generate a sample for (X, Y, Z, W) by first sampling X, then
Y and W given X, and finally Z given Y.
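The factorization and the generative reading of the example can be made concrete with a small sketch. The conditional probability values below are invented for illustration (the text specifies only the structure), and the helper names bern, joint, sample, and p_z_given are hypothetical.

```python
import random
from itertools import product

# Hypothetical local distributions for binary attributes (illustrative numbers).
P_X = 0.3                        # p(X = 1)
P_Y_GIVEN_X = {0: 0.2, 1: 0.7}   # p(Y = 1 | X = x)
P_Z_GIVEN_Y = {0: 0.1, 1: 0.6}   # p(Z = 1 | Y = y)
P_W_GIVEN_X = {0: 0.5, 1: 0.9}   # p(W = 1 | X = x)


def bern(p_one, value):
    """Probability of a binary value under a Bernoulli with p(1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


def joint(x, y, z, w):
    """Joint probability as the product of local distributions."""
    return (bern(P_X, x) * bern(P_Y_GIVEN_X[x], y)
            * bern(P_Z_GIVEN_Y[y], z) * bern(P_W_GIVEN_X[x], w))


def sample(rng):
    """Ancestral sampling: X first, then Y and W given X, then Z given Y."""
    x = int(rng.random() < P_X)
    y = int(rng.random() < P_Y_GIVEN_X[x])
    w = int(rng.random() < P_W_GIVEN_X[x])
    z = int(rng.random() < P_Z_GIVEN_Y[y])
    return x, y, z, w


def p_z_given(x, y):
    """p(Z = 1 | X = x, Y = y), computed from the joint by summing out W."""
    num = sum(joint(x, y, 1, w) for w in (0, 1))
    den = sum(joint(x, y, z, w) for z in (0, 1) for w in (0, 1))
    return num / den


total = sum(joint(*v) for v in product((0, 1), repeat=4))
draws = [sample(random.Random(seed)) for seed in range(5)]
```

Because the missing arc from X to Z encodes independence (1), p_z_given(0, y) and p_z_given(1, y) agree for each y, whatever numbers are placed in the tables.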
As we shall see, when working with relational data, it is often necessary to express
constraints or restrictions among attributes. Such restrictions can be encoded in a
DAG model, which we review here.
As a simple example, suppose we have a generative story for binary (0/1)
attributes X, Y, Z, and W that can be described by the DAG model structure
shown in figure 7.1(a). In addition, suppose we know that at most two of these
attributes take on the value 1. We can add this restriction to the model as shown
in figure 7.1(b). Here, we have added a binary node named R. Associated with this
node (not shown in the figure) is a local distribution wherein R = 1 with probability
1 when at most two of its parents take on value 1, and with probability zero
otherwise. To encode the restriction, we set R = 1. Note that R is a deterministic
attribute. That is, given the parents of R, R is known with certainty. As is commonly
done in the graphical modeling literature, we indicate deterministic nodes with
double ovals (see footnote 1).
Assuming that the restriction always holds (that is, R is always equal to 1), it
is not meaningful to work with the joint distribution p(x, y, z, w, r). Instead, the
appropriate distribution to make inferences with is
$$p(x, y, z, w \mid r = 1) \propto p(x)\, p(y \mid x)\, p(z \mid y)\, p(w \mid x)\, p(r = 1 \mid x, y, z, w), \tag{7.2}$$
where the constant of proportionality is 1/p(r = 1).
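As a rough illustration of conditioning on the restriction, the sketch below enumerates all sixteen configurations of (X, Y, Z, W), multiplies the unrestricted joint by the deterministic factor for R = 1, and renormalizes. The probability tables are invented for illustration, and the names joint, r_factor, and posterior are hypothetical.

```python
from itertools import product

# Hypothetical local distributions for binary attributes (illustrative numbers).
P_X = 0.3
P_Y_GIVEN_X = {0: 0.2, 1: 0.7}
P_Z_GIVEN_Y = {0: 0.1, 1: 0.6}
P_W_GIVEN_X = {0: 0.5, 1: 0.9}


def bern(p_one, value):
    return p_one if value == 1 else 1.0 - p_one


def joint(x, y, z, w):
    """Unrestricted joint p(x) p(y|x) p(z|y) p(w|x)."""
    return (bern(P_X, x) * bern(P_Y_GIVEN_X[x], y)
            * bern(P_Z_GIVEN_Y[y], z) * bern(P_W_GIVEN_X[x], w))


def r_factor(x, y, z, w):
    """p(r = 1 | x, y, z, w): 1 when at most two attributes equal 1, else 0."""
    return 1.0 if x + y + z + w <= 2 else 0.0


unnormalized = {v: joint(*v) * r_factor(*v) for v in product((0, 1), repeat=4)}
p_r1 = sum(unnormalized.values())          # p(r = 1), the normalizing constant
posterior = {v: u / p_r1 for v, u in unnormalized.items()}
```

Configurations that violate the restriction, such as (1, 1, 1, 0), receive probability zero, and the remaining mass is rescaled by 1/p(r = 1).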
Readers familiar with directed factor-graph models [4] will recognize that this
distribution for (X, Y, Z, W ) can be encoded by a directed factor-graph model in
which node R is replaced by the factor f (x, y, z, w) = p(r = 1|x, y, z, w). More
generally, the factor-graph model is perhaps a more natural model for situations
1. DAG models can also be used to encode soft restrictions. For example, if we know that
zero, one, two, three, and four of the attributes (X, Y, Z, W) take on the value 1 with probabilities
p0, p1, p2, p3, and p4, respectively, we can encode this soft restriction using the DAG model
structure in figure 7.1(b) where R is no longer deterministic and has the appropriate local
probability distribution.
Figure 7.1 (a) A DAG model. (b) A similar DAG model with an added restriction
among the attributes. [Figure not reproduced.]
7.3
In this example, we can think of individual students (e.g., john, mary) and individual
courses (e.g., cs107, stat10) as entities (see footnote 3). Naturally, there will be many
students and courses in the database. We refer to the set of students (e.g.,
{john, mary, . . .}) as an entity set. The set of courses (e.g., {cs107, stat10, . . .}) is
another entity set. Most important, because an ER model can be built before any
data is collected, we need the concept of an entity class: a reference to a set of
entities without a specification of the entities in the set. In our example, the entity
classes are Student and Course.
A relationship is a list of entities. In our example, a possible relationship is the
pair (john, cs107), meaning that john took the course cs107. Using nomenclature
similar to that for entities, we talk about relationship sets and relationship classes.
A relationship set is a collection of like relationships, that is, a collection of
relationships each relating entities from a fixed list of entity classes. In our example,
we have the relationship set of student-course pairs. A relationship class refers to
an unspecified set of like relationships. In our example, we have the relationship
class Takes.
The IQ of john and the difficulty of cs107 are examples of attributes. We use the
term attribute class to refer to an unspecified collection of like attributes. In our
example, Student has the single attribute class Student.IQ and Course has the single
attribute class Course.Diff. Relationships also can have attributes; and relationship
classes can have attribute classes. In our example, Takes has the attribute class
Takes.Grade.
An ER model for the structure of a database graphically depicts entity classes,
relationship classes, attribute classes, and their interconnections. An ER model for
Example 7.1 is shown in figure 7.2(a). The entity classes (Student and Course) are
shown as rectangular nodes; the relationship class (Takes) is shown as a diamond-shaped
node; and the attribute classes (Student.IQ, Course.Diff, and Takes.Grade)
are shown as oval nodes. Attribute classes are connected to their corresponding
entity or relationship class, and the relationship class is connected to its associated
entity classes. (Solid edges are customary in ER models. Here, we use dashed edges
so that we can later use solid edges to denote probabilistic dependencies.)
An ER model describes the potential attributes and relationships in a database. It
says little about actual data. A skeleton for a set of entity and relationship classes is
a specification of the entities and relationships associated with a particular database.
That is, a skeleton for a set of entity and relationship classes is a collection of
corresponding entity and relationship sets. An example skeleton for our university
database example is shown in figure 7.2(b).
An ER model applied to a skeleton defines a specific set of attributes. In particular,
for every entity class and every attribute class of that entity class, an attribute
is defined for every entity in the class; and for every relationship class and every
attribute class of that relationship class, an attribute is defined for every relationship
in the class. The attributes defined by the ER model in figure 7.2(a) applied to the
skeleton in figure 7.2(b) are shown in figure 7.2(c). In what follows, we use ER model
to mean both the ER diagram (the graph in figure 7.2(a)) and the mechanism by
which attributes are generated from skeletons.
3. In a real database, longer names would be needed to define unique students and courses.
We keep the names short in our example to make reading easier.
Figure 7.2 [figure not reproduced: (a) ER model with entity classes Student and Course, relationship class Takes, and attribute classes IQ, Diff, and Grade; (b) skeleton with students john and mary, courses cs107 and stat10, and Takes pairs (john, cs107), (mary, cs107), and (mary, stat10); (c) attributes cs107.Diff, stat10.Diff, john.IQ, mary.IQ, T(john,cs107).G, T(mary,cs107).G, and T(mary,stat10).G]
A skeleton still says nothing about the values of attributes. An instance for an
ER model consists of (1) a skeleton for the entity and relationship classes in that
model, and (2) an assignment of a value to every attribute generated by the ER
model and the skeleton. That is, an instance of an ER model is an actual database.
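The mechanism by which an ER model generates attributes from a skeleton can be sketched directly. The dictionary layout and the function name ground_attributes are assumptions chosen for illustration; the classes, entities, and relationships are those of the university example.

```python
# ER model: classes and their attribute classes (structure only, no data).
er_model = {
    "entity_classes": {"Student": ["IQ"], "Course": ["Diff"]},
    "relationship_classes": {
        "Takes": {"over": ("Student", "Course"), "attributes": ["Grade"]},
    },
}

# Skeleton: the entity sets and relationship sets of one particular database.
skeleton = {
    "Student": ["john", "mary"],
    "Course": ["cs107", "stat10"],
    "Takes": [("john", "cs107"), ("mary", "cs107"), ("mary", "stat10")],
}


def ground_attributes(er_model, skeleton):
    """Define one attribute per (entity or relationship, attribute class) pair."""
    attrs = []
    for cls, names in er_model["entity_classes"].items():
        for entity in skeleton[cls]:
            for name in names:
                attrs.append(f"{entity}.{name}")
    for cls, spec in er_model["relationship_classes"].items():
        for tup in skeleton[cls]:
            for name in spec["attributes"]:
                attrs.append(f"{cls}({','.join(tup)}).{name}")
    return attrs


attributes = ground_attributes(er_model, skeleton)
```

An instance would then pair this skeleton with a value for each of the seven generated attributes.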
Let us now turn to the probabilistic modeling of relational data. To do so, we
introduce a specific type of probabilistic ER model: the DAPER model. Roughly
speaking, a DAPER model is an ER model with directed (solid) arcs among the
attribute classes that represent probabilistic dependencies among corresponding
attributes, and local distribution classes that define local distributions for attributes.
Recall that an ER model applied to a skeleton defines a set of attributes. Similarly,
a DAPER model applied to a skeleton defines a set of attributes as well as
a DAG model for these attributes. Thus, a DAPER model can be thought of as a
language for expressing conditional independence among unrealized attributes that
eventually become realized given a skeleton.
As with the ER diagram and model, we sometimes distinguish between a DAPER
diagram, which consists of the graph only, and the DAPER model, which consists of
the diagram, the local distribution classes, and the mechanism by which a DAPER
model defines a DAG model given a skeleton.
Example 7.2
In the university database (Example 7.1), a student's grade in a course depends
both on the student's IQ and on the difficulty of the course.
The DAPER model (or diagram) for this example is shown in figure 7.3(a). The
model extends the ER model in figure 7.2 with the addition of arc classes and
local distribution classes. In particular, there is an arc class from Student.IQ to
Takes.Grade and an arc class from Course.Diff to Takes.Grade. Each arc class
is denoted by a solid directed arc. A local distribution class for Takes.Grade (not
shown) represents the probabilistic dependence of grade on IQ and difficulty.
Just as we expand attribute classes in a DAPER model to attributes in a
DAG model given a skeleton, we expand arc classes to arcs. In doing so, we
sometimes want to limit the arcs that are added to a DAG model. In the current
problem, for example, we want to draw an arc from attribute c.Diff for course c to
attribute Takes(s, c').Grade for course c' and any student s only when c = c'. This
limitation is achieved by adding a constraint to the arc class, namely, the constraint
course[Diff] = course[Grade] (see figure 7.3(a)). Here, the terms course[Diff] and
course[Grade] refer to the entities c and c', respectively, that is, the entities associated
with the attributes at the ends of the arc.
The arc class from Student.IQ to Takes.Grade has a similar constraint:
student[IQ] = student[Grade]. This constraint says that we draw an arc from attribute
s.IQ for student s = student[IQ] to Takes(s', c).Grade for student s' = student[Grade]
and any course c only when s = s'. As we shall see, constraints in DAPER models
can be quite expressive; for example, they may include first-order expressions on
entities and relationships.
Figure 7.3(c) shows the DAG (structure) generated by the application of
the DAPER model in figure 7.3(a) to the skeleton in figure 7.3(b). (The attribute
names in the DAG model are abbreviated.) The arc from stat10.Diff to
Takes(mary,cs107).Grade, for example, is disallowed by the constraint on the arc class from
Course.Diff to Takes.Grade.
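The expansion of the two arc classes under their constraints can be sketched as follows. This is a minimal illustration rather than a general DAPER implementation: the constraints are hard-coded as equality tests, and the variable names are hypothetical.

```python
# Skeleton of the university example.
students = ["john", "mary"]
courses = ["cs107", "stat10"]
takes = [("john", "cs107"), ("mary", "cs107"), ("mary", "stat10")]

arcs = []
for (s, c) in takes:
    head = f"Takes({s},{c}).Grade"
    # Arc class Course.Diff -> Takes.Grade, constraint course[Diff] = course[Grade].
    for c_tail in courses:
        if c_tail == c:
            arcs.append((f"{c_tail}.Diff", head))
    # Arc class Student.IQ -> Takes.Grade, constraint student[IQ] = student[Grade].
    for s_tail in students:
        if s_tail == s:
            arcs.append((f"{s_tail}.IQ", head))

# Collect the parents of each ground attribute that has any.
parents = {}
for tail, head in arcs:
    parents.setdefault(head, []).append(tail)
```

Each grade attribute ends up with exactly two parents, and candidate arcs such as the one from stat10.Diff to Takes(mary,cs107).Grade are filtered out by the constraint.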
Regardless of what skeleton we use, the DAG model generated by the DAPER
model in figure 7.3(a) will be acyclic. In general, as we show in section 7.7, if the
attribute classes and arc classes in the DAPER diagram form an acyclic graph,
then the DAG model generated from any skeleton for the DAPER model will be
acyclic. Weaker conditions are also sufficient to guarantee acyclicity. We describe
one in section 7.7.
In general, a local distribution class for an attribute class is a specification from
which local distributions for attributes corresponding to the attribute class can be
constructed, when a DAPER model is expanded to a DAG model. In our example,
the local distribution class for Takes.Grade, written p(Takes.Grade | Student.IQ,
Course.Diff), is a specification from which the local distributions for Takes(s, c).Grade,
for all students s and courses c, can be constructed. In our example, each attribute
Takes(s, c).Grade will have two parents: s.IQ and c.Diff. Consequently, the local
distribution class need only be a single local probability distribution. We discuss
more complex situations in section 7.4.
Whereas most of this chapter concentrates on issues of representation, the
problems of probabilistic inference, learning local distributions, and learning model
structure are also of interest. For all of these problems, it is natural to extend
the concept of an instance to that of a partial instance: an instance in which
some of the attributes do not have values. A simple approach for performing
probabilistic inference about attributes in a DAPER model given a partial instance
is to (1) explicitly construct a ground graph, (2) instantiate known attributes from
the partial instance, and (3) apply standard probabilistic inference techniques to
the ground graph to compute the quantities of interest. One can improve upon
this simple approach by utilizing the additional structure provided by a relational
model, for example, by caching inferences in subnetworks. Koller and Pfeffer [15],
for example, have done preliminary work in this direction. With regard to learning,
note that from a Bayesian perspective, learning about both the local distributions
and model structure can be viewed as probabilistic inference about (missing)
attributes (e.g., parameters) from a partial instance. In addition, there has been
substantial research on learning PRMs (e.g., [8]) and much of this work is applicable
to DAPER models.
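The simple three-step approach to inference (ground, instantiate, infer) can be sketched on a tiny ground graph. Everything numeric here is an assumption: attributes are taken to be binary, the probability tables are invented, and brute-force enumeration stands in for "standard probabilistic inference techniques". The skeleton has two students who both take cs107.

```python
from itertools import product

# Hypothetical local distribution class, shared by every Takes(s, c).Grade.
P_IQ = 0.5      # p(s.IQ = 1) for every student s
P_DIFF = 0.5    # p(c.Diff = 1) for every course c
P_GRADE = {(0, 0): 0.5, (0, 1): 0.1, (1, 0): 0.9, (1, 1): 0.5}  # p(G=1 | IQ, Diff)

# Step 1: ground graph for students john and mary, both taking cs107.
variables = ["john.IQ", "mary.IQ", "cs107.Diff",
             "T(john,cs107).G", "T(mary,cs107).G"]


def bern(p_one, value):
    return p_one if value == 1 else 1.0 - p_one


def joint(a):
    """Joint probability of a full assignment a (dict: attribute -> 0/1)."""
    p = bern(P_IQ, a["john.IQ"]) * bern(P_IQ, a["mary.IQ"])
    p *= bern(P_DIFF, a["cs107.Diff"])
    for s in ("john", "mary"):
        p_grade = P_GRADE[(a[f"{s}.IQ"], a["cs107.Diff"])]
        p *= bern(p_grade, a[f"T({s},cs107).G"])
    return p


def query(target, evidence):
    """Steps 2 and 3: fix the observed attributes, then sum out the rest."""
    num = den = 0.0
    for values in product((0, 1), repeat=len(variables)):
        a = dict(zip(variables, values))
        if any(a[k] != v for k, v in evidence.items()):
            continue
        p = joint(a)
        den += p
        if a[target] == 1:
            num += p
    return num / den


# Partial instance: only john's grade is known (a low grade, coded 0).
prior = query("T(mary,cs107).G", {})
posterior = query("T(mary,cs107).G", {"T(john,cs107).G": 0})
diff_posterior = query("cs107.Diff", {"T(john,cs107).G": 0})
```

Observing john's low grade raises the probability that cs107 is difficult, which in turn lowers the predicted probability of a high grade for mary; the two grades are coupled only through the shared course attribute.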
We shall explore PER models in much more detail in subsequent sections. Here,
let us examine two alternate languages for relational data: plate models and PRMs.
Plate models were developed independently by Buntine [3] and the BUGS team
(e.g., [9]) as a language for compactly representing graphical models in which there
are repeated measurements. We know of no formal definition of a plate model, and
so we provide one here. This definition deviates slightly from published examples of
plate models, but it enhances the expressivity of such models while retaining their
essence (see section 7.5).
According to our definition, plate and DAPER models are equivalent. The
invertible mapping from a DAPER to a plate model is as follows. Each entity
class in a DAPER model is drawn as a large rectangle, called a plate. The plate
is labeled with the entity-class name. Plates are allowed to intersect or overlap. A
relationship class for a set of entity classes is drawn at the named intersection of
the plates corresponding to those entities. If there is more than one relationship
class among the same set of entity classes, the plates are drawn such that there
is a distinct intersection for each of the relationship classes. Attribute classes of
an entity class are drawn as ovals inside the rectangle corresponding to the entity
but outside any intersection. Attribute classes associated with a relationship class
are drawn in the intersection corresponding to the relationship class. Arc classes
and constraints are drawn just as they are in DAPER models. In addition, local
distribution classes are specified just as they are in DAPER models.
Figure 7.3 [figure not reproduced: (a) DAPER diagram with entity classes Student and Course, relationship class Takes, attribute classes IQ, Diff, and Grade, and arc-class constraints course[Diff] = course[Grade] and student[IQ] = student[Grade]; (b) skeleton with students john and mary, courses cs107 and stat10, and Takes pairs (john, cs107), (mary, cs107), and (mary, stat10); (c) DAG over cs107.Diff, stat10.Diff, john.IQ, mary.IQ, T(john,cs107).G, T(mary,cs107).G, and T(mary,stat10).G]
The plate model corresponding to the DAPER model in figure 7.3(a) is shown in
figure 7.4(a). The two rectangles are the plates corresponding to the Student and
Course entity classes. The single relationship class between Student and Course,
Takes, is represented as the named intersection of the two plates. The attribute
class Student.IQ is drawn inside the Student plate and outside the Course plate;
the attribute class Course.Diff is drawn inside the Course plate and outside the
Student plate; and the attribute class Takes.Grade is drawn in the intersection of
the Student and Course plates. The arc classes and their constraints are identical to
those in the DAPER model.
PRMs were developed in [5] explicitly for the purpose of representing relational
data. The PRM extends the relational model (another commonly used representation
for the structure of a database) in much the same way as the PER model
extends the ER model. In this chapter, we shall define directed PRMs such that
they are equivalent to DAPER models and, hence, plate models. This definition
deviates from the one given by, e.g., [5], but enhances the expressivity of the language
as previously defined (see section 7.6).
The invertible mapping from a DAPER model to a directed PRM (by our
definition) takes place in two stages. First, the ER model component of the DAPER
model is mapped to a relational model in a standard way (e.g., see [19]). In
particular, both entity and relationship classes are represented as tables. Foreign
keys, or what Getoor et al. [8] call reference slots, are used in the relationship-class
tables to encode the ER connections in the ER model. Attribute classes
for entity and relationship classes are represented as attributes or columns in the
corresponding tables of the relational model. Second, the probabilistic components
of the DAPER model are mapped to those of the directed PRM. In particular, arc
classes and constraints are drawn just as they are in the DAPER model.
The directed PRM corresponding to the DAPER model in figure 7.3(a) is shown
in figure 7.4(b). (The local distribution for Takes.Grade is not shown.) The Student
entity class and its attribute class Student.IQ appear in a table, as do the Course
entity class and its attribute class Course.Diff. The Takes relationship class and its
attribute class Takes.Grade are shown as a table containing the foreign keys Student
and Course. The arc classes and their constraints are drawn just as they are in the
DAPER model.
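The first stage of the mapping (ER classes to tables with foreign keys) can be sketched with Python's built-in sqlite3 module. The schema below is an assumed rendering of the university example: the key column name id and the TEXT column types are illustrative choices, and attribute values are left NULL so that the database holds only a skeleton.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript("""
CREATE TABLE Student (id TEXT PRIMARY KEY, IQ TEXT);
CREATE TABLE Course  (id TEXT PRIMARY KEY, Diff TEXT);
-- The relationship class becomes a table whose foreign keys (reference
-- slots) point at the entity tables it relates.
CREATE TABLE Takes (
    Student TEXT REFERENCES Student(id),
    Course  TEXT REFERENCES Course(id),
    Grade   TEXT,
    PRIMARY KEY (Student, Course)
);
""")
conn.executemany("INSERT INTO Student (id) VALUES (?)", [("john",), ("mary",)])
conn.executemany("INSERT INTO Course (id) VALUES (?)", [("cs107",), ("stat10",)])
conn.executemany(
    "INSERT INTO Takes (Student, Course) VALUES (?, ?)",
    [("john", "cs107"), ("mary", "cs107"), ("mary", "stat10")],
)

# Following the foreign keys recovers the parents of each Takes.Grade
# attribute, which is what the two arc-class constraints express.
rows = conn.execute("""
    SELECT t.Student, t.Course, s.id, c.id
    FROM Takes t
    JOIN Student s ON s.id = t.Student
    JOIN Course  c ON c.id = t.Course
    ORDER BY t.Student, t.Course
""").fetchall()
parents = {
    f"Takes({s},{c}).Grade": (f"{sid}.IQ", f"{cid}.Diff")
    for s, c, sid, cid in rows
}
```

In this relational form, the constraints student[IQ] = student[Grade] and course[Diff] = course[Grade] correspond to joining the Takes table to the entity tables along its two foreign keys.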
7.4 Fundamentals
Figure 7.4 [figure not reproduced: (a) the plate model and (b) the directed PRM corresponding to the DAPER model in figure 7.3(a), each showing the classes Student, Course, and Takes, the attribute classes IQ, Diff, and Grade, and the constraints course[Diff] = course[Grade] and student[IQ] = student[Grade]]
[Figure 7.5 not reproduced: (a) entity classes Disease and Symptom, each with attribute class Present; (b) attributes d1.Present, d2.Present, d3.Present, s1.Present, s2.Present, and s3.Present]
DAPER model applied to a skeleton in which there are three diseases and three
symptoms is shown in figure 7.5(b).
We give this example first to emphasize that arc classes need not have constraints.
Now, let us see what happens when we include such constraints.
Example 7.4
Extending example 7.3, suppose a physician has identified the possible causes of
each symptom.
The DAPER model for example 7.4 is shown in figure 7.6(a). With respect to the
model in figure 7.5(a), there is now the relationship class Causes, where Causes(d, s)
is true if the physician has identied disease d as a possible cause of symptom s.
Also new is the constraint Causes(d, s) on the arc class. This constraint says that,
when we expand the DAPER model to a DAG model given a skeleton, we draw
an arc from d.Present to s.Present only when Causes(d, s) holds. Note that, in the
diagram we use d and s to refer to the entities associated with Disease.Present
and Symptom.Present, respectively. In what follows, we will continue to make strong
abbreviations as in this example, although such abbreviations are not required and
may be undesirable for computer implementations of the PER language.
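The effect of the Causes(d, s) constraint on the expansion can be sketched as follows, using the skeleton of figure 7.6(b). The helper name expand and the representation of a constraint as a Python predicate are assumptions made for illustration.

```python
diseases = ["d1", "d2", "d3"]
symptoms = ["s1", "s2", "s3"]
# Relationship set for Causes, as listed in the skeleton of figure 7.6(b).
causes = {("d1", "s1"), ("d1", "s2"), ("d1", "s3"), ("d2", "s2"), ("d3", "s3")}


def expand(diseases, symptoms, constraint):
    """Draw an arc d.Present -> s.Present for each pair satisfying the constraint."""
    return [
        (f"{d}.Present", f"{s}.Present")
        for d in diseases
        for s in symptoms
        if constraint(d, s)
    ]


# No constraint: every disease is a parent of every symptom.
unconstrained = expand(diseases, symptoms, lambda d, s: True)

# Constraint Causes(d, s): only the physician-identified causes become parents.
constrained = expand(diseases, symptoms, lambda d, s: (d, s) in causes)

parents = {s: [d for d in diseases if (d, s) in causes] for s in symptoms}
```

With this skeleton, s1 has one parent while s2 and s3 have two, so the ground attributes of a single attribute class can have parent sets of different sizes.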
In the next two examples, we consider more complex constraints.
Example 7.5
Extending example 7.3 in a different way, suppose the physician has identified both
primary (major) and secondary (minor) causes of disease.
The DAPER model for example 7.5 is shown in figure 7.7(a). There are now two
relationship classes, Primary (1°) Causes and Secondary (2°) Causes, between
the two entity classes, and the constraint is a disjunctive one: 1° Causes(d, s) ∨
2° Causes(d, s). This constraint says that, when the DAPER model is expanded to
a DAG model given a skeleton, an arc is drawn from d.Present to s.Present only
when d is a primary and/or secondary cause of s.
Figure 7.6 (a) A DAPER model for an incomplete bipartite graph of diseases and
symptoms. (b) A possible skeleton identifying diseases, symptoms, and potential
causes of symptoms. (c) A DAG model resulting from the expansion of the DAPER
model to the skeleton. [Figure not reproduced; the Causes pairs listed in (b) are (d1, s1), (d1, s2), (d1, s3), (d2, s2), and (d3, s3).]
Example 7.6
Extending example 7.3 in a different way, suppose that both diseases and symptoms
have category labels, that is, labels drawn from the same set of categories. The possible
causes of a symptom are diseases that have at least one category in common with
that symptom.
The DAPER model for this example is shown in figure 7.7(b). Here, we have
introduced a third entity class, Category, whose entities have relationships with
Disease and Symptom. In particular, R1(d, c) holds when disease d is in category
c; and R2(s, c) holds when symptom s is in category c. In this model, the arc class
has the constraint ∃c R1(d, c) ∧ R2(s, c), where c is an arbitrary entity in Category.
Thus, when the DAPER model is expanded to a DAG given a skeleton, an arc will
be drawn from d.Present to s.Present only when d and s share at least one category.
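A constraint with an existential quantifier can be evaluated by searching over the entities of the quantified class. The sketch below does this for the shared-category constraint; the diseases, symptoms, categories, and relationship sets are all invented for illustration, and shares_category is a hypothetical helper name.

```python
# Hypothetical skeleton: entities and the relationship sets R1 and R2.
diseases = ["flu", "ulcer"]
symptoms = ["cough", "fever", "nausea"]
categories = ["respiratory", "systemic", "digestive"]
r1 = {("flu", "respiratory"), ("flu", "systemic"), ("ulcer", "digestive")}  # R1(d, c)
r2 = {("cough", "respiratory"), ("fever", "systemic"),
      ("nausea", "digestive")}                                             # R2(s, c)


def shares_category(d, s):
    """Evaluate: there exists a category c with R1(d, c) and R2(s, c)."""
    return any((d, c) in r1 and (s, c) in r2 for c in categories)


# d and s are bound to the tail and head entities; c is the quantified variable.
arcs = [
    (f"{d}.Present", f"{s}.Present")
    for d in diseases
    for s in symptoms
    if shares_category(d, s)
]
```

The disjunctive constraint of example 7.5 could be handled the same way, with the predicate testing membership in either of two relationship sets.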
To understand how constraints are written and used in general, consider a
DAPER model with an arc class from X.A to Y.B. When this model is expanded
to a ground graph given a skeleton, depending on the constraint, we might draw
an arc from x.A to y.B for any x and y in the skeleton. To determine whether we
do so, we look at the tail and head entities associated with this putative arc. The
tail entities of the putative arc from x.A to y.B are the set of entities associated
with x. If X is an entity class, then the tail entity is just the entity x. If X is
a relationship class, then the tail entities are those entities in the relationship
tuple x. Similarly, the head entities of this arc are the set of entities associated
with y. For example, given the DAPER model and skeleton in figure 7.3 for the
university database, the tail and head entities of the putative arc from john.IQ to
Takes(john,cs107).Grade are (john) and (john,cs107), respectively. A constraint on
the arc class from X.A to Y.B in a DAPER model is any first-order expression
involving entities and relationship classes in the DAPER model such that the
expression is bound when the tail and head entities are taken to be constants.
To determine whether we draw an arc from x.A to y.B, we evaluate the first-order
expression using the tail and head entities of the putative arc. It must evaluate
to true or false. We draw the arc from x.A to y.B only if the expression is true.
Continuing with the same university database example, let us determine whether
to draw an arc from john.IQ to Takes(john,cs107).Grade. The relevant constraint,
student[IQ] = student[Grade], references the tail entity student[IQ] = john and
the head entity student[Grade] = john. Thus, the expression evaluates to true and
we draw the arc.
[Figure 7.7: (a) A DAPER model relating Disease.Present to Symptom.Present through
the relationship classes 1° Causes and 2° Causes, with an arc constraint over
1° Causes(d, s) and 2° Causes(d, s). (b) A DAPER model relating Disease.Present to
Symptom.Present through Category, with relationship classes R1 and R2 and the
quantified arc constraint ∃c R1(d, c) ∧ R2(s, c).]
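The arc-drawing rule for the university example can be sketched as follows. The skeleton matches the one used in the chapter's university figures (john and mary; cs107 and stat10); the data layout, the `label` helper, and the encoding of constraints as Python functions are assumptions made for this illustration.

```python
# Skeleton for the university example: entity sets and one relationship set.
skeleton = {
    "Student": ["john", "mary"],
    "Course": ["cs107", "stat10"],
    "Takes": [("john", "cs107"), ("mary", "cs107"), ("mary", "stat10")],
}


def label(cls, x, attr):
    """Name a ground attribute, e.g. john.IQ or Takes(john,cs107).Grade."""
    if isinstance(x, tuple):
        return f"{cls}({','.join(x)}).{attr}"
    return f"{x}.{attr}"


# Each arc class: (tail class, tail attribute, head class, head attribute,
# constraint evaluated on the tail and head entities).
arc_classes = [
    # student[IQ] = student[Grade]
    ("Student", "IQ", "Takes", "Grade", lambda s, t: s == t[0]),
    # course[Diff] = course[Grade]
    ("Course", "Diff", "Takes", "Grade", lambda c, t: c == t[1]),
]


def ground_arcs(skeleton, arc_classes):
    """Draw an arc x.A -> y.B whenever the constraint evaluates to true."""
    arcs = []
    for tail_cls, tail_attr, head_cls, head_attr, holds in arc_classes:
        for x in skeleton[tail_cls]:
            for y in skeleton[head_cls]:
                if holds(x, y):
                    arcs.append(
                        (label(tail_cls, x, tail_attr), label(head_cls, y, head_attr))
                    )
    return arcs


arcs = ground_arcs(skeleton, arc_classes)
```

Every putative arc is tested, and only those whose constraint is true survive, so john.IQ points to Takes(john,cs107).Grade but not to Takes(mary,cs107).Grade.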
Next, let us consider the local distribution class. A local distribution class for
attribute class X.A is any specification from which the local distributions for
attribute x.A, for any entity or relationship x in class X, may be constructed. In
figure 7.3(c), each attribute for a student's grade in a course has two parents: one
attribute corresponding to the difficulty of the course and another corresponding to
the IQ of the student. Consequently, the local distribution class for Takes.Grade in
the DAPER model can be a single (ordinary) local distribution. In general, however,
a more complicated specification is needed. For example, in the ground graph
of figure 7.6(c), the attribute s1.Present has one parent, whereas the attributes
s2.Present and s3.Present have two parents. Consequently, the local distribution
class for Symptom.Present must be something more than a single local distribution.
In general, a local distribution class for X.A may take the form of an enumeration
of local distributions. In our example, we could specify a local distribution for every
possible parent set of s.Present for every symptom s in every possible skeleton. Of
course, such enumerations are cumbersome. Instead, a local distribution class is
typically expressed as a canonical distribution such as noisy OR, logistic, or linear
regression. Friedman et al. [5] refer to such specifications as aggregators.
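A noisy OR illustrates why a canonical distribution can serve as a local distribution class: one rule yields a local distribution for any number of parents. The sketch below is a standard noisy-OR formulation; the function name, the `leak` parameter, and the numeric strengths are assumptions for illustration, not values from the chapter.

```python
def noisy_or(parent_states, strengths, leak=0.0):
    """Return P(child = true | parents) under a noisy OR.

    parent_states: iterable of booleans, one per parent (any number of parents).
    strengths: probability that each active parent alone turns the child on.
    leak: probability that the child is on when no parent is active.
    """
    p_false = 1.0 - leak
    for active, strength in zip(parent_states, strengths):
        if active:
            p_false *= 1.0 - strength
    return 1.0 - p_false


# One parent (as for s1.Present) and two parents (as for s2.Present).
p_one_parent = noisy_or([True], [0.8])
p_two_parents = noisy_or([True, True], [0.8, 0.5])
```

The same function covers the one-parent and two-parent symptoms, which is exactly what an enumeration of local distributions would otherwise have to spell out case by case.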
So far, we have considered only DAPER models in which all attributes derive
from attribute classes. In practice, however, it is often convenient to include
(ordinary) attributes in a DAPER model. For example, in a Bayesian approach to
learning the conditional probability distribution of Takes.Grade given Student.IQ
and Course.Diff in example 7.2, we may add to the DAPER model an ordinary
attribute θ corresponding to this uncertain distribution, as shown in figure 7.8(a).
(If Grade is binary, for example, θ would correspond to the parameter of a Bernoulli
distribution.) The ground graph obtained from this DAPER model applied to the
skeleton in figure 7.8(b) is shown in figure 7.8(c). Note that the attribute θ appears
only once in the ground graph and that, because there is no annotation on the arc
class from θ to Takes.Grade, there is an arc from θ to each grade attribute.
Although this view makes DAPER models easy to understand, formally, we do
not allow such models to contain (ordinary) attributes. Instead, we specify that,
for any DAPER model, (1) there is an entity class, Global, that is not drawn; (2)
for any skeleton, this entity class has precisely one entity; and (3) every attribute
class not connected explicitly to some visible entity class is connected to Global.
This view is equivalent to the informal one just presented, but leads to simpler
definitions and notation in our formal treatment of DAPER models in section 7.7.
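7.4.2 Restricted Relationships
[Figure (number and caption not recovered): (a) a DAPER model with entity classes
Student (IQ) and Course (Diff), relationship class Takes (Grade), and arc constraints
c[D] = c[G] and s[IQ] = s[G]; (b) a skeleton with Takes pairs (john, cs107),
(mary, cs107), and (mary, stat10); (c) the resulting ground graph over cs107.Diff,
stat10.Diff, john.IQ, mary.IQ, T(john,cs107).G, T(mary,cs107).G, and T(mary,stat10).G.]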
Note that, due to the many-to-one restriction in this problem, we could equivalently
attach the attribute class O to In rather than to Patient. A DAPER model
equivalent to the one in figure 7.9(a) is shown in figure 7.9(c).
Figure 7.9 (a) A DAPER model for patient outcomes across multiple hospitals
(example 7.7). (b) The ground graph (a hierarchical model structure) for a skeleton
containing m hospitals and n_i patients in hospital i applied to the DAPER model
in (a). (c) A DAPER model equivalent to the one in (a).
Example 7.8
The occurrence of words in a document is used to infer its topic. The occurrences
of words are mutually independent given the document topic. Document topics are
i.i.d. given multinomial parameters θ_t. The occurrence of word w in a document with
topic t is i.i.d. given t and Bernoulli parameters θ_{w|t}.
This example is commonly referred to as binary naive Bayes classification [18]. A
DAPER model for this problem is shown in figure 7.10. The entity classes Document
and Word are related by the single relationship class F. The attribute classes are
Document.Topic representing the topic of a document, Word.θ_{w|t} representing the
set of Bernoulli parameters θ_{w|t} for a word, and F(d, w).In representing whether
word w is in document d. The relationship class F is restricted to be a Full
relationship class. That is, in any allowed skeleton, all pairs (document, word) must
be represented (see footnote 4). We indicate this restriction on the DAPER diagram
by placing the annotation Full next to the relationship class. As we shall see in what
follows, the Full restriction is useful in many situations.
4. In a practical database implementation, this relationship would be encoded sparsely,
despite the Full restriction. That is, relationship (d, w) would be stored in the database
only when word w appears in document d.
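The role of the Full restriction can be seen in a small sketch of binary naive Bayes inference. The vocabulary, topics, and parameter values below are hypothetical, and the document is stored sparsely (only the words that appear), as the footnote suggests; the loop over the whole vocabulary is what the Full relationship class expresses.

```python
import math

# Hypothetical topics, vocabulary, and parameters (illustrative values only).
topics = ["sports", "politics"]
vocab = ["ball", "vote", "team"]
theta_t = {"sports": 0.5, "politics": 0.5}
theta_w_given_t = {
    ("ball", "sports"): 0.8, ("ball", "politics"): 0.1,
    ("vote", "sports"): 0.1, ("vote", "politics"): 0.7,
    ("team", "sports"): 0.7, ("team", "politics"): 0.2,
}


def topic_posterior(words_in_doc):
    """Posterior over topics given the sparse list of words present."""
    present = set(words_in_doc)
    log_post = {}
    for t in topics:
        lp = math.log(theta_t[t])
        # Full relationship: every (document, word) pair contributes,
        # including words that are absent from the document.
        for w in vocab:
            p = theta_w_given_t[(w, t)]
            lp += math.log(p if w in present else 1.0 - p)
        log_post[t] = lp
    # Normalize in a numerically stable way.
    top = max(log_post.values())
    unnorm = {t: math.exp(lp - top) for t, lp in log_post.items()}
    total = sum(unnorm.values())
    return {t: v / total for t, v in unnorm.items()}


posterior = topic_posterior(["ball", "team"])
```

Absent words carry evidence too, through the factor 1 - θ_{w|t}, which is why the model needs F(d, w).In for every pair even though the database stores only the pairs where the word appears.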
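[Figure 7.10: DAPER model with entity classes Document (attribute Topic) and Word
(attribute θ_{w|t}), the Full relationship class F with attribute In, and arc
constraints d[T] = d[In] and w[θ_{w|t}] = w[In].]
7.4.3 Self Relationships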
Self relationships are relationships that relate like entities (and perhaps other
entities as well). A self-relationship class is one that contains self relationships.
Examples of self-relationship classes are common in databases: people are managers
of other people, cities are near other cities, timestamps follow timestamps, and so
on. ER models can represent self relationships in a natural manner. The extension
to PER models is also straightforward, as we illustrate with the following three
examples.
Example 7.9
In the university database example (example 7.2), a student's grade in a course
depends on whether an advisor of the student is a friend of a teacher of the course.
The ER model for the data in this example is shown in figure 7.11(a). With
respect to the ER model in figure 7.2(a), Professor is a new entity class and Advises,
Teaches, and F are new relationship classes. Advises(p, s) means that professor p
is an advisor of student s. Teaches(p, c) means that professor p teaches course c.
(Students may have more than one advisor and courses may have more than one
teacher.)
The relationship class F is introduced to model whether one professor is a friend of
another. F is our first example of a self-relationship class: it contains relationships
between professor pairs. The two dashed lines connecting F and the Professor entity
class in the diagram indicate that F is a self-relationship class. F has one attribute
class F.Friend, where the attribute F(p, pf).Friend is true if professor pf is a friend
of professor p. Note that F has the Full constraint so that we can model whether
any one professor is a friend of another. Also note that F(p1, p2).Friend may be true
while F(p2, p1).Friend may be false.
The DAPER model for this example, including the new probabilistic relationship
between F.Friend and Takes.Grade, is shown in figure 7.11(b). The constraint on
the arc class from F.Friend to Takes.Grade is Teaches(p, c) ∧ Advises(pf, s). Thus,
in any ground graph generated from this model, there is an arc from attribute
F(p, pf).Friend to attribute Takes(s, c).Grade whenever a teacher of the course is p
[Figure 7.11: (a) ER model with entity classes Professor, Course (Diff), and Student
(IQ) and relationship classes F (Full, attribute Friend), Teaches, Takes (Grade), and
Advises. (b) DAPER model adding the arc class from F.Friend to Takes.Grade with
constraint Teaches(p, c) ∧ Advises(pf, s), alongside the constraints c[D] = c[G] and
s[IQ] = s[G]. (c) The same model with Professor copied as Professor (Advisor) and
Professor (Teacher).]
Probabilistic Relationships
Figure 7.12 (a) The DAPER model representation of a hidden Markov model.
(b) The same model in which Slice is copied. [Diagram labels: entity class Slice,
relationship class Next(s, s+1) with restriction Order, arc constraints Next(s, s+1)
and s[H] = s[X].]
Figure 7.13 (a) The DAPER model for gene transmission through inheritance.
(b) The same model in which Person is copied. [Diagram labels: entity class Person
(attribute Gene), relationship class Family(pc, pm, pf) with restriction 2DAG, and
quantified arc constraints over Fam(pc, pm, pf).]
Figure 7.14 (a) An ER model for a citation database. (b) A DAPER model for
the situation where citations are uncertain. [Diagram labels: Paper (Citing) and
Paper (Cited) with attribute Topic, Full relationship class Cites(pcg, pcd) with
attribute Exists, arc constraints p[T] = pcg[E] and p[T] = pcd[E].]
Figure 7.15 A DAPER model for the situation where citations are uncertain and
limited to ten per paper.
Figure 7.15 shows the DAPER model for this example,
where the Cites relationship class is both uncertain and restricted. As discussed in
section 7.2, we encode the restrictions using instantiated deterministic nodes. With
respect to figure 7.14(b), we have added a binary attribute class Paper.<=10.
The double oval associated with this attribute class indicates that this attribute
expands to deterministic attributes in a ground graph. In particular, a ground
graph attribute p.<=10 will have parents Cites(pcg, pcd).Exists, for all pcd, and
will be true exactly when ten or fewer of these parents are true. To encode the
restriction, we set p.<=10 to true for every p when performing inference in the
ground graph.
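The effect of instantiating a deterministic count node can be checked by brute force on a tiny case. The sketch below assumes, purely for illustration, that the Exists attributes of one citing paper are independent Bernoulli variables with the given prior probabilities, and it uses a limit of 1 instead of 10 so that the enumeration stays small; neither assumption comes from the chapter.

```python
from itertools import product


def constrained_posterior(exists_probs, limit):
    """Distribution over Exists assignments given that the '<= limit' node is true.

    exists_probs: prior probability that each candidate citation exists
    (assumed independent here, which is a simplifying assumption).
    """
    weights = {}
    for assignment in product([False, True], repeat=len(exists_probs)):
        if sum(assignment) > limit:
            continue  # the deterministic node would be false; ruled out by evidence
        w = 1.0
        for exists, p in zip(assignment, exists_probs):
            w *= p if exists else 1.0 - p
        weights[assignment] = w
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


# Three candidate cited papers, at most one citation allowed.
post = constrained_posterior([0.5, 0.5, 0.5], limit=1)
```

Setting the deterministic node to true simply removes the assignments that violate the restriction and renormalizes the rest, which is how evidence on p.<=10 enforces the limit during inference.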
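Figure 7.16 A DAPER model for the situation where only the cited papers are
uncertain. [Diagram labels: Paper (Citing) and Paper (Cited) with attribute Topic,
entity class Cites with deterministic attribute MutEx, relationship class R1 with
constraint R1(p, c), Full relationship class R2 with attribute Exists, arc constraints
c[E] = c[M] and p[T] = p[E].]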
Example 7.14 Partial Relationship Existence
Modifying example 7.12 once again, the citation database now has a complete set
of citations, but some of the citations are so garbled that the identities of some of
the cited papers are uncertain.
One way to think about this uncertainty is that the relationships Cites(pcg, pcd)
are uncertain only in their second argument. Getoor et al. [8] refer to this
uncertainty as reference uncertainty and present a special mechanism for representing it
in PRMs. We take an alternative approach that uses only concepts that we have
already discussed.
A DAPER model for this example is shown in figure 7.16. With respect to the
DAPER model in figure 7.14(b), we have added the entity class Cites, and the
relationship classes R1 and R2 between Paper and Cites. An entity pair in Cites
corresponds to a citation: a citing and a cited paper. R1(pcg, c) holds when paper
pcg is the citing paper in c, and R2(pcd, c) holds when pcd is the cited paper in
c. The relationship class R1 is a restricted (many-to-one) relationship class. In
contrast, the relationship class R2 is a probabilistic relationship class, restricted to
be Full. The uncertainty in this relationship class is encoded with the attribute
class R2.Exists, where R2(pcd, c).Exists is true precisely when citation c cites paper
pcd. To model the restriction that the possible cited papers of c are mutually
exclusive, we first introduce the deterministic attribute class Cites.MutEx. In any
ground graph obtained from this DAPER model, c.MutEx will be true exactly
when one of its parents R2(pcd, c).Exists is true. For any inference we perform with
the ground graph, we set c.MutEx to true for every citation c.
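To see what conditioning on c.MutEx = true does, consider a simplified case in which the Exists attributes of one citation are independent Bernoulli variables a priori (an assumption made only for this sketch; in the DAPER model they also depend on paper topics). The probability that paper i is the only one cited is p_i times the product of (1 - p_j) over the other papers, which is proportional to the odds p_i / (1 - p_i). The function name and the numbers below are hypothetical.

```python
def cited_paper_distribution(exists_probs):
    """Distribution over which paper is cited, given exactly one Exists is true.

    Assumes independent priors strictly below 1, so each odds ratio is finite.
    """
    odds = [p / (1.0 - p) for p in exists_probs]
    total = sum(odds)
    return [o / total for o in odds]


# Two candidate cited papers with prior probabilities 0.5 and 0.75.
dist = cited_paper_distribution([0.5, 0.75])
```

Under these assumptions the Full relationship class with Exists attributes and an instantiated MutEx node behaves like a single categorical choice of the cited paper, which is the behavior that reference uncertainty is meant to capture.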
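[Figure 7.17: the plate model corresponding to the DAPER model in figure 7.12(b),
with plates Slice and Slice (+1), intersection Next(s, s+1) with restriction Order,
and arc constraints Next(s, s+1) and s[H] = s[X].]
7.5 Plate Models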
In this section, we revisit our definition of the plate model, give examples, and
describe how our definition differs from previously published examples.
As discussed in section 7.3, we define the plate model by giving an invertible
mapping from DAPER to plate model. Thus, the two model types are equivalent in
the sense that they can represent the same conditional independence relationships
for any given skeleton.
Summarizing the mapping from DAPER to plate model given in section 7.3,
entity classes are drawn as large named rectangles called plates; a relationship class
for a set of entity classes is drawn at the named intersection of the corresponding
plates; each attribute class is drawn inside the rectangle corresponding to its entity
or relationship class; and arc classes and constraints are drawn just as they are
in DAPER models. For example, as we have discussed, the DAPER model in
figure 7.3(a) has the corresponding plate model in figure 7.4(a). As another example,
the DAPER model for the HMM shown in figure 7.12(b) has the corresponding plate
model in figure 7.17. Note that, because plate models represent relationship classes
as the intersection of plates, plates (corresponding to entity classes) must be copied
when the model contains self-relationship classes.
The plate model corresponding to the DAPER model for the patient-hospital
example in figure 7.9(a) is shown in figure 7.18(a). In this plate model, there are
no attributes in the Patient plate outside the intersection. Thus, one can move the
Patient plate fully inside the Hospital plate, yielding the diagram in figure 7.18(b).
We allow this nesting in our framework. Furthermore, plates may be nested to an
arbitrary depth. This convention corresponds to one found in published examples
of plate models.
There are three differences between plate models as we have defined them
and traditional plate models, that is, plate models as they have been described in
the literature. In all three cases, our definition provides a more expressive language.
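[Figure 7.18: (a) the plate model for the patient-hospital example, with plates
Hospital and Patient and intersection In; (b) the same model with the Patient plate
nested inside the Hospital plate (labeled Patient/In); (c) the traditional plate model
for the same example, in which the arc-class constraint is implicit.]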
One, in traditional plate models, an arc class emanating from an attribute class in
a plate cannot leave that plate. Given this constraint, any arc class from attribute
class E.X must point either to attribute class E.Y or to attribute class R.Y, where
R is nested inside E.
Two, when a traditional plate model is expanded to a ground graph, arcs are
drawn only between attributes corresponding to the same entity. To be more precise,
consider a plate model containing the arc class from E.X to E.Y. In a traditional
plate model, the arc class implicitly has the constraint e[X] = e[Y]. Similarly,
consider a plate model containing the arc class from E.X to R.Y where R is
nested inside E, possibly many levels deep. Because R is nested inside E, for
any relationship r ∈ R, the entities associated with r must uniquely determine
an e ∈ E. Let r(e) be the set of the relationships r that uniquely determine e.
Now, when this traditional plate model is expanded to a ground graph, arcs are
drawn from e.X to r.Y only when r ∈ r(e). As an example, consider figure 7.18(c),
which shows the traditional plate model for the patient-hospital example. Here,
E = Hospital, R = In, and r(h) = ∪_p {(h, p)} for all hospitals h. Thus, the arc class
from Hospital.θ to In(h, p).O has the constraint h[θ] = h[O]. This constraint is
implicit (see figure 7.18(c)).
Three, traditional plate models contain no arc-class constraints other than the
implicit ones just described.
The DAPER and plate model (as we have defined them) are equivalent. Nonetheless,
in some situations, a DAPER model may be easier to understand than an
equivalent plate model, and vice versa. When there are many entity and relationship
classes (plates and intersections), DAPER models are often easier to understand.
In particular, drawing intersections when there are many plates can be difficult
(although not impossible; see [10]). In contrast, when there are few entities and the
nesting convention can be used, plates are often easier to understand.
7.6
[Figure 7.19: diagram with classes Course (Diff), Student (IQ), and Takes (Grade),
in which the nodes labeled Takes.Course.Diff and Takes.Student.IQ are connected to
Takes.Grade.]
those who prefer to design databases with relational models may prefer the DAPER
model for probabilistic modeling, as DAPER models make explicit the distinction
between entities and relationships.
7.7 Technical Details
In this section, we formalize many of the concepts we have described. In addition,
we state and prove a few relevant facts.
We use $\mathcal{E}$ and $\mathcal{R}$ to denote the sets of entity and relationship classes, respectively.
We use E and R (sometimes with subscripts) to denote an entity and relationship
class, respectively, and X to denote an arbitrary class in $\mathcal{E} \cup \mathcal{R}$. We use $\sigma(E)$ and
$\sigma(R)$ to denote an entity and relationship set, respectively, and $\sigma(X)$ to denote
an arbitrary $\sigma(E)$ or $\sigma(R)$. We use e and r to denote a particular entity and
relationship, respectively, and x to denote an arbitrary entity or relationship. We
use X.A to denote the attribute class A associated with class X, and A(X) to
denote the set of attribute classes associated with class X. We use x.A to denote
an attribute associated with entity or relationship x, and A(x) to denote the set of
attributes associated with x. Each attribute class and attribute is associated with
a domain, that is, a set of possible values. The domain of x.A is the same as the
domain of X.A for every x ∈ X.
First, we define the ER model in the following series of definitions.
Definition 7.1
An entity-relationship diagram for entity classes $\mathcal{E}$, relationship classes $\mathcal{R}$, and
attribute classes $\mathcal{A}$ is a graph in which rectangular nodes correspond to entity classes,
diamond nodes correspond to relationship classes, and oval nodes correspond to
Definition 7.8
A ground graph for a DPER diagram and skeleton $\sigma_{ER}$ for $\mathcal{E}$, $\mathcal{R}$, and $\mathcal{A}$ is a directed
graph constructed as follows. For every attribute in $A(\sigma_{ER})$, there is a corresponding
node in the graph. For any attribute $x.A \in A(\sigma_{ER})$, its parent set pa(x.A) consists of
those attributes $y.B \in A(y)$ such that there is an arc class from Y.B to X.A and
the expression $C_{AB}(e(x), e(y))$ is true.
Definition 7.9
Given $\Sigma_{ER}$, a set of skeletons for $\mathcal{E}$, $\mathcal{R}$, and $\mathcal{A}$, a DPER diagram for $\mathcal{E}$, $\mathcal{R}$, and $\mathcal{A}$
is acyclic with respect to $\Sigma_{ER}$ if, for every $\sigma_{ER} \in \Sigma_{ER}$, the ground graph for the
DPER diagram and $\sigma_{ER}$ is acyclic.
Theorem 7.10
If the probabilistic arcs of a DPER diagram for $\mathcal{E}$, $\mathcal{R}$, and $\mathcal{A}$ form an acyclic graph,
then the DPER diagram is acyclic with respect to $\Sigma_{ER}$ for any $\Sigma_{ER}$.
Proof Suppose the theorem is false. Consider a cyclic ground graph for some
skeleton. Denote the attributes in the cycle by $(x_1.A_1 \to x_2.A_2 \to \dots \to x_n.A_n)$
where $x_1.A_1 = x_n.A_n$. For each attribute $x_i.A_i$ there is an associated attribute
class $X_i.A_i$. From definition 7.8, we know that there must be an arc class from
$X_i.A_i$ to $X_{i+1}.A_{i+1}$. Because $X_1.A_1 = X_n.A_n$, there must be a cycle in the DPER
diagram, which is a contradiction. Q.E.D.
Friedman et al. [5] prove something equivalent.
Definition 7.11
A directed acyclic probabilistic entity-relationship (DAPER) model for entity classes
$\mathcal{E}$, relationship classes $\mathcal{R}$, attribute classes $\mathcal{A}$, and skeletons $\Sigma_{ER}$ consists of (1) a
DPER diagram for $\mathcal{E}$, $\mathcal{R}$, and $\mathcal{A}$ that is acyclic with respect to every $\sigma_{ER} \in \Sigma_{ER}$,
and (2) a local distribution class, denoted P(X.A|PA(X.A)), for each attribute
class X.A. Each local distribution class is a collection of information sufficient to
determine a local distribution p(x.A|pa(x.A)) for any $x.A \in A(\sigma_{ER})$. For every
$\sigma_{ER} \in \Sigma_{ER}$, the DAPER model specifies a DAG model for $A(\sigma_{ER})$. The structure
of this DAG model is the ground graph of the DPER diagram for $\sigma_{ER}$. The local
distributions of this DAG model are the local distributions p(x.A|pa(x.A)).
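An immediate consequence of definition 7.11 is that, given $\mathcal{D}$, a DAPER model for
$\mathcal{E}$, $\mathcal{R}$, $\mathcal{A}$, and $\Sigma_{ER}$, and a skeleton $\sigma_{ER} \in \Sigma_{ER}$, we can write the joint distribution
for $A(\sigma_{ER})$ as follows:
$$p(I_{ERA} \mid \sigma_{ER}, \mathcal{D}) = \prod_{X \in \mathcal{E} \cup \mathcal{R}} \; \prod_{x \in \sigma(X)} \; \prod_{A \in A(X)} p(x.A \mid \mathrm{pa}(x.A)). \tag{7.3}$$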
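Equation (7.3) says that the joint probability of an instance is the product of the local distributions of all ground attributes. A minimal sketch for a ground graph with one student and one course follows; the attribute values and all probabilities are hypothetical numbers chosen for illustration.

```python
# Ground graph: each attribute maps to its tuple of parents.
parents = {
    "john.IQ": (),
    "cs107.Diff": (),
    "Takes(john,cs107).Grade": ("john.IQ", "cs107.Diff"),
}

# Hypothetical local distributions.
iq_dist = {"high": 0.3, "low": 0.7}
diff_dist = {"hard": 0.4, "easy": 0.6}
grade_cpt = {
    ("high", "hard"): {"A": 0.5, "B": 0.5},
    ("high", "easy"): {"A": 0.9, "B": 0.1},
    ("low", "hard"): {"A": 0.1, "B": 0.9},
    ("low", "easy"): {"A": 0.4, "B": 0.6},
}

# p(x.A | pa(x.A)) for each ground attribute.
local = {
    "john.IQ": lambda value, pa: iq_dist[value],
    "cs107.Diff": lambda value, pa: diff_dist[value],
    "Takes(john,cs107).Grade": lambda value, pa: grade_cpt[pa][value],
}


def joint_prob(instance):
    """Product over ground attributes of p(x.A | pa(x.A)), as in (7.3)."""
    prob = 1.0
    for attr, pa_names in parents.items():
        pa_values = tuple(instance[p] for p in pa_names)
        prob *= local[attr](instance[attr], pa_values)
    return prob


instance = {"john.IQ": "high", "cs107.Diff": "easy", "Takes(john,cs107).Grade": "A"}
p = joint_prob(instance)  # 0.3 * 0.6 * 0.9
```

In a full DAPER model the triple product in (7.3) ranges over classes, their entities or relationships, and their attribute classes; the loop over `parents` here plays the role of all three at once because the ground attributes have already been enumerated.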
In the remainder of this section, we describe a condition weaker than the one in
theorem 7.10 that guarantees the creation of acyclic ground graphs from a DPER
model. In this discussion, we use $R(e_1, \dots, e_n)$ to denote a particular relationship
in a relationship set $\sigma(R)$.
Definition 7.12
A relationship class R is a self-relationship class with respect to entity class E if a
relationship in R contains two or more references to entities in the entity class E.
Definition 7.13
A projected pairwise self-relationship class is obtained from a self-relationship class
by projecting two of the entities in the relationships that are from the same entity
class.
For example, the Family relationship class is a self-relationship class that can be
projected into the Father-Child relationship class and the Mother-Child relationship
class; both are projected pairwise self-relationship classes.
Definition 7.14
Given skeleton $\sigma_{ER}$ for $\mathcal{E}$ and $\mathcal{R}$, a relationship set $\sigma(R)$ for a self-relationship class
R is cyclic if there exists a projected pairwise self-relationship class $R'$ for some
entity set E containing entities $e_1, \dots, e_n$ such that $R'(e_1, e_2), \dots, R'(e_{n-1}, e_n)$ and
$R'(e_n, e_1)$. If a relationship set is not cyclic, it is acyclic.
Definition 7.15
An arc class in a DPER model is called a self arc if both the head and tail of the
arc are the same attribute class. A self-arc class is simple if there is exactly one
entity class associated with the attribute class associated with the self arc.
Theorem 7.16
If (1) the arc classes excluding the self arcs of the DPER diagram for $\mathcal{E}$, $\mathcal{R}$, and $\mathcal{A}$
form an acyclic graph, (2) every self-arc class is simple and has a constraint that has
no disjunctions, has no negations, and contains a self-relationship class for the entity
class associated with the self arc, and (3) for every self-relationship class R, $\sigma(R)$
is acyclic for every $\sigma_{ER} \in \Sigma_{ER}$, then the DPER diagram is acyclic with respect to
$\Sigma_{ER}$.
Proof Suppose the theorem is false. Consider a ground graph G for some skeleton
$\sigma_{ER}$ containing a shortest cycle $(x_1.A_1, \dots, x_n.A_n)$ where $x_1.A_1 = x_n.A_n$. Suppose
that the cycle contains at least two distinct attribute classes, that is, $X_i.A_i \neq
X_{i+1}.A_{i+1}$ for some i. This implies that there must be a cycle in the DPER diagram with
the self-arc classes removed; however, from condition (1) and theorem 7.10 this cannot
be the case. Therefore, all of the attribute classes in the cycle must be the same
and must be included due to a single self-arc class. Due to condition (2), the cycle
in the self-arc class must imply a cyclic self-relationship class, but this contradicts
condition (3). Q.E.D.
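Condition (3) of theorem 7.16 can be checked mechanically for a given skeleton. The sketch below projects a Family(child, mother, father) relationship set onto Mother-Child pairs and tests whether the resulting pairs form a directed cycle in the sense of definition 7.14. The names and the tuple layout are hypothetical, and the depth-first search is a generic cycle test rather than a procedure given in the chapter.

```python
def is_cyclic(pairs):
    """True if the pairs R'(e_i, e_j), read as directed edges, contain a cycle."""
    graph = {}
    for a, b in pairs:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, [])
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY:
                return True  # back edge: e_1, ..., e_n, e_1
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in graph)


# Hypothetical Family relationship set: (child, mother, father).
family = [("carol", "alice", "bob"), ("dave", "carol", "erik")]

# Projected pairwise self-relationship classes.
mother_child = [(mother, child) for child, mother, father in family]
father_child = [(father, child) for child, mother, father in family]

family_is_acyclic = not is_cyclic(mother_child) and not is_cyclic(father_child)
```

A genuine pedigree yields acyclic projections, so a self arc on Person.Gene constrained by Family cannot create a cycle in the ground graph; a corrupted skeleton in which two people are each other's parents would be flagged as cyclic.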
7.8
class is a Boolean expression over attribute-states that may take head and tail
entities as arguments.
Example 7.15 Identity Uncertainty
We have video images of multiple cars of different colors. We know how many
cars there are and have zero or more observations of each car's color, but we are
uncertain about what observations go with what cars.
Pasula and Russell [18] describe this example as having identity uncertainty. We
can represent this example using the contingent DAPER model in figure 7.21(a).
The two entity classes, Car and Observation, are related by the relationship class
Of, where Of(o, c) holds when observation o corresponds to car c. The probabilistic
relationship Of has the many-to-one restriction: an observation is associated with
exactly one car. As in previous examples, the many-to-one restriction is represented
by the Full relationship class Of, together with the attribute class Of.Exists and
the deterministic node MutEx (which is set to true). The arc class from Car.Color
to Observation.Color is annotated with the ordered pair (Of(o, c), Of(o, c).Exists =
true). The first component says that we draw an arc from c.Color to o.Color
only when Of(o, c) is true. (In this case, this constraint is vacuous because the
relationship class Of is Full.) The second component says that, when we draw such
an arc, we add to it the state constraint Of(o, c).Exists = true. Figure 7.21(b) shows
the expansion of this contingent DAPER model to a contingent DAG model for a
skeleton containing one car and two observations. Note that, because there is only
one car, the MutEx nodes are redundant and can be omitted.
In this example, we know how many cars there are. If we do not, we can place
a probability distribution on the number of cars and stipulate that the DAPER
model in figure 7.21(a) should be applied to each possible number of cars.
Let us now discuss possibilities for relational modeling with undirected models. A
commonly used (nonrelational) undirected model is the undirected graphical (UG)
model. This model class has more than one definition, and the definitions coincide
only for positive distributions [17]. Here, we define a UG for attributes X with
joint distribution p(x) as a model having two components: (1) an undirected graph
(the model structure) whose nodes are in one-to-one correspondence with X, and
(2) a collection of non-negative clique functions $\phi_m(x_m)$, $m = 1, \dots, M$, where m
indexes the maximal cliques of the graph and $X_m$ are the attributes in X in the
mth maximal clique, such that
$$p(x) = c \prod_{m=1}^{M} \phi_m(x_m). \tag{7.4}$$
The term c is a normalization constant. As is the case for the DAG model, the UG
model for X defines the joint distribution for X. The clique functions are sometimes
called potentials.
A UG model for (X, Y, Z) is shown in figure 7.22(a). The graph has a single
maximal clique consisting of all three attributes, and hence represents an arbitrary
distribution for these attributes.
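The factorization in (7.4) and the role of the normalization constant c can be illustrated numerically. The sketch below uses a chain X - Y - Z with two maximal cliques rather than the single-clique graph of figure 7.22(a), so that the factorization is not trivial; the graph, the binary domains, and the `agree` clique function are assumptions made for this example.

```python
from itertools import product


def ug_joint(variables, cliques):
    """Joint distribution of binary variables from clique functions, as in (7.4).

    cliques: list of (variable-name tuple, non-negative clique function).
    """
    unnorm = {}
    for values in product([0, 1], repeat=len(variables)):
        x = dict(zip(variables, values))
        weight = 1.0
        for names, phi in cliques:
            weight *= phi(*(x[name] for name in names))
        unnorm[values] = weight
    c = 1.0 / sum(unnorm.values())  # normalization constant
    return {values: c * weight for values, weight in unnorm.items()}


def agree(a, b):
    """Hypothetical clique function that favors equal neighbors."""
    return 2.0 if a == b else 1.0


joint = ug_joint(["X", "Y", "Z"], [(("X", "Y"), agree), (("Y", "Z"), agree)])
```

The clique functions are not probabilities on their own; only after multiplying them and dividing by the sum over all joint states (here 18) does one obtain p(x). The same function also covers a hypergraph factorization if each hyperarc is passed as one entry of `cliques`.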
Figure 7.21 (a) A contingent DAPER model for example 7.15, an example of
identity uncertainty. (b) A contingent DAG model resulting from the expansion of
the model in (a) given a skeleton containing one car and two observations.
A related but more general undirected model is the hierarchical log-linear graphical
(HLLG) model. An HLLG model is a model having two components: (1) an
undirected hypergraph (the model structure) whose nodes are in one-to-one
correspondence with X, and (2) a collection of potentials $\phi_h(x_h)$, $h = 1, \dots, H$, where
h indexes the hyperarcs of the graph and $x_h$ are the attributes in X of the hth
hyperarc, such that
$$p(x) = c \prod_{h=1}^{H} \phi_h(x_h). \tag{7.5}$$
Again, an HLLG model for X defines the joint distribution for X. In this chapter,
we represent a hyperarc as a triangle connecting multiple nodes with undirected
edges. For example, figure 7.22(b) shows an HLLG model with a single hyperarc.
By virtue of (7.4) and (7.5), both UG and HLLG model structures define
factorization constraints on distributions. In this sense, HLLG models are more
general than UG models. That is, given any UG model structure, there exists an
HLLG model structure that can encode the same factorization constraints, but not
vice versa. For example, the UG structure in figure 7.22(a) has the equivalent HLLG
model structure shown in figure 7.22(b). In contrast, the HLLG model structure
shown in figure 7.22(c) encodes the factorization constraint
$$p(x, y, z) = c \, \phi_1(x, y) \, \phi_2(y, z) \, \phi_3(x, z),$$
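[Figure 7.22: (a) a UG model for (X, Y, Z); (b) the equivalent HLLG model structure
with a single hyperarc; (c) an HLLG model structure with pairwise hyperarcs.]
[Figure (number and caption not recovered): (a) a model with entity class Variable
and self-relationship class Neigh(v1, v2) with restriction UpperT; (b) a skeleton for
Neigh with columns v1 and v2; (c) the resulting graph over a.X, b.X, and c.X.]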
Finally, there are numerous classes of graphical models that we have not yet
explored, including mixed directed and undirected models (e.g., see [17]); directed
factor-graph models [4]; influence diagrams [14]; and dependency networks [13]. The
development of PER models that expand to models in these classes also provides
opportunities for research.
Acknowledgments
We thank David Blei, Tom Dietterich, Brian Milch, and Ben Taskar for useful
comments.
References
[1] J. Besag. Spatial interaction and the statistical analysis of lattice systems.
Journal of the Royal Statistical Society, 36:192-236, 1974.
[2] C. Boutilier, N. Friedman, M. Goldszmidt, and D. Koller. Context-specific
independence in Bayesian networks. In Proceedings of the Conference on
Uncertainty in Artificial Intelligence, 1996.
[3] W. Buntine. Operations for learning with graphical models. Journal of
Artificial Intelligence Research, 2:159-225, 1994.
Intelligence, 2003.
[5] N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic
relational models. In Proceedings of the International Joint Conference on
Artificial Intelligence, 1999.
[6] R. Fung and R. Shachter. Contingent belief networks. 1990.
[7] A. Gelman, J. Carlin, H. Stern, and D. Rubin.
Chapman and Hall, London, 1995.
Recent work on graphical models for relational data has demonstrated significant
improvements in classification and inference when models represent the dependencies
among instances. Despite its use in conventional statistical models, the assumption
of instance independence is contradicted by most relational data sets.
For example, in citation data there are dependencies among the topics of a paper's
references, and in genomic data there are dependencies among the functions
of interacting proteins. In this chapter we present relational dependency networks
(RDNs), a graphical model that is capable of expressing and reasoning with such
dependencies in a relational setting. We discuss RDNs in the context of relational
Bayes networks and relational Markov networks and outline the relative strengths
of RDNs, namely, the ability to represent cyclic dependencies, simple methods for
parameter estimation, and efficient structure learning techniques. The strengths of
RDNs are due to the use of pseudo-likelihood learning techniques, which estimate an
efficient approximation of the full joint distribution. We present learned RDNs for
a number of real-world data sets and evaluate the models in a prediction context,
showing that RDNs identify and exploit cyclic relational dependencies to achieve
significant performance gains over conventional conditional models.
8.1 Introduction
Many data sets routinely captured by businesses and organizations are relational in
nature, yet until recently most machine learning research has focused on flattened
propositional data. Instances in propositional data record the characteristics of
homogeneous and statistically independent objects; instances in relational data
record the characteristics of heterogeneous objects and the relations among those
objects. Examples of relational data include citation graphs, the World Wide
Web, genomic structures, fraud detection data, epidemiology data, and data on
interrelated people, places, and events extracted from text documents.
1. Several previous papers [e.g., 8, 10] use the term probabilistic relational model to refer
to a specific model that is now often called a relational Bayesian network [Koller, personal
communication]. In this paper, we use PRM in its more recent and general sense.
2. We use the term relational Bayesian network to refer to Bayesian networks that have
been upgraded to model relational databases. The term has also been used by Jaeger [13]
acyclicity constraint of the model. While domain knowledge can sometimes be used
to structure the autocorrelation in an acyclic manner, often an acyclic ordering
is unknown or does not exist. For example, in genetic pedigree analysis there
is autocorrelation among the genes of relatives [20]. In this domain, the causal
relationship is from ancestor to descendent so we can use the temporal parentchild relationship to structure the dependencies in an acyclic manner (i.e., parents
genes will never be inuenced by the genes of their children). However, given a set
of hyperlinked webpages, there is little information to use to determine the causal
direction of the dependency between their topics. In this case, we can only represent
an (undirected) correlation between the topics of two pages, not a (directed) causal
relationship. The acyclicity constraint of directed PRMs precludes the learning of
arbitrary autocorrelation dependencies and thus severely limits the applicability of
these models in relational domains.
Undirected PRMs, such as relational Markov networks (RMNs) [39], can represent and reason with arbitrary forms of autocorrelation. However, research on these
models has focused primarily on parameter estimation and inference procedures.
The current RMN learning algorithm does not select features; model structure
must be prespecified by the user. While in principle it is possible for RMN techniques to learn cyclic autocorrelation dependencies, inefficient parameter estimation makes this difficult in practice. Because parameter estimation requires multiple
rounds of inference over the entire data set, it is impractical to incorporate it as
a subcomponent of feature selection. Recent work on conditional random fields for
sequence analysis includes a feature selection algorithm [24] that could be extended
for RMNs. However, the algorithm abandons estimation of the full joint distribution and uses pseudo-likelihood estimation, which makes the approach tractable
but removes some of the advantages of reasoning with the full joint distribution.
In this chapter, we outline relational dependency networks (RDNs), an extension
of dependency networks (DNs) [11] for relational data. RDNs can represent and
reason with the cyclic dependencies required to express and exploit autocorrelation
during collective inference. In this regard, they share certain advantages of RMNs
and other undirected models of relational data [4, 6]. Also, to our knowledge, RDNs
are the first PRM capable of learning cyclic autocorrelation dependencies. RDNs
offer a relatively simple method for structure learning and parameter estimation,
which results in models that are easier to understand and interpret. In this regard
they share certain advantages of RBNs and other directed models [37, 12]. The
primary distinction between RDNs and other existing PRMs is that RDNs are an
approximate model. RDN models approximate the full joint distribution and thus
are not guaranteed to specify a coherent probability distribution. However, the
quality of the approximation will be determined by the data available for learning:
if the models are learned from large data sets, and combined with Monte Carlo
inference techniques, the approximation should not be a disadvantage.
2. (continued) to refer to Bayesian networks where the nodes correspond to relations and their values
represent possible interpretations of those relations in a specific domain.
We start by reviewing the details of DNs for propositional data. Then we describe
the general characteristics of PRM models and outline the specifics of RDN learning
and inference procedures. We evaluate RDN learning and inference algorithms on
both synthetic and real-world data sets, presenting learned RDNs for subjective
evaluation and evaluating the models in a prediction context. Of particular note,
all the real-world data sets exhibit multiple autocorrelation dependencies that were
automatically discovered by the RDN learning algorithm. Finally, we review related
work and conclude with a discussion of future directions.
8.2
Dependency Networks
Graphical models represent a joint distribution over a set of variables. The primary
distinction between representations such as Bayesian networks and Markov networks and DNs is that DNs are an approximate representation. DNs approximate
the joint distribution with a set of conditional probability distributions (CPDs) that
are learned independently. This approach to learning results in significant efficiency
gains over exact models. However, because the CPDs are learned independently, DN
models are not guaranteed to specify a consistent joint distribution. This precludes
DNs from being used to infer causal relationships and limits the applicability of
exact inference techniques. Nevertheless, DNs can encode predictive relationships
(i.e., dependence and independence), and Gibbs sampling inference techniques [e.g.,
27] can be used to recover a full joint distribution, regardless of the consistency of
the local CPDs.
8.2.1
DN Representation
8.3
distribution can be recovered through Gibbs sampling (see below for details). From
the joint distribution, we can extract any probabilities of interest.
8.2.2
DN Learning
Both the structure and parameters of DN models are determined through learning
the local CPDs. The DN learning algorithm learns a separate distribution for each
variable $X_i$, conditioned on the other variables in the data (i.e., $\mathbf{X} - \{X_i\}$). Any
conditional learner can be used for this task (e.g., logistic regression, decision trees).
The CPD is included in the model as $P(v_i)$ and the variables selected by the
conditional learner form the parents of $X_i$ (e.g., if $p(x_i \mid \mathbf{x} - \{x_i\}) = \alpha x_j + \beta x_k$, then
$pa_i = \{x_j, x_k\}$). The parents are then reflected in the edges of $G$ appropriately. If
the conditional learner is not selective (i.e., the algorithm does not select a subset
of the features), the DN model will be fully connected (i.e., $pa_i = \mathbf{x} - \{x_i\}$). In
order to build understandable DNs, it is desirable to use a selective learner that
will learn CPDs that use a subset of the variables.
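The learning procedure above can be illustrated with a minimal sketch. It is not the implementation used in this chapter: it assumes discrete data given as tuples, uses an empirical mutual-information threshold as a stand-in for a selective conditional learner, and stores each CPD as a conditional frequency table. The names `mutual_information`, `learn_dn`, and `threshold` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def mutual_information(rows, i, j):
    """Empirical mutual information (in nats) between columns i and j."""
    n = len(rows)
    pij = Counter((r[i], r[j]) for r in rows)
    pi = Counter(r[i] for r in rows)
    pj = Counter(r[j] for r in rows)
    return sum((c / n) * math.log(c * n / (pi[a] * pj[b]))
               for (a, b), c in pij.items())


def learn_dn(rows, threshold=0.05):
    """Fit one CPD per variable, independently of the others.

    Returns (parents, cpds): parents[i] lists the variables selected for X_i,
    and cpds[i] maps a tuple of parent values to a distribution over X_i.
    """
    m = len(rows[0])
    parents, cpds = {}, {}
    for i in range(m):
        # Selective step (assumed criterion): keep informative variables only.
        pa = [j for j in range(m)
              if j != i and mutual_information(rows, i, j) > threshold]
        counts = defaultdict(Counter)
        for r in rows:
            counts[tuple(r[j] for j in pa)][r[i]] += 1
        cpds[i] = {ctx: {val: c / sum(cnt.values()) for val, c in cnt.items()}
                   for ctx, cnt in counts.items()}
        parents[i] = pa
    return parents, cpds


# X0 and X1 always agree; X2 is independent of both.
rows = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)] * 5
parents, cpds = learn_dn(rows)
print(parents)
```

In this toy data set X0 and X1 select each other as parents, which yields a cycle that a Bayesian network could not represent, while X2 receives no parents because the learner is selective.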
8.2.3
DN Inference
8.3
the utility of the relational information. Third, the ability to represent cycles
in a network facilitates reasoning with autocorrelation, a common characteristic
of relational data. In addition, whereas the need for approximate inference is a
disadvantage of DNs for propositional data, due to the complexity of relational
model graphs in practice, all PRMs use approximate inference.
RDNs extend DNs to work with relational data in much the same way that RBNs
extend Bayesian networks and RMNs extend Markov networks. These extensions
take a graphical model formalism and upgrade [17] it to a first-order logic representation with an entity-relationship model. We start by describing the general
characteristics of PRMs and then discuss the details of RDNs in this context.
8.3.1
Figure 8.1
generalization from a single instance (i.e., one data graph) by decomposing the data
graph into multiple examples of each item type (e.g., all paper objects), and building
a joint model of dependencies between and among attributes of each type.
As in conventional graphical models, each node is associated with a probability
distribution conditioned on the other variables. Parents of $X_k^t$ are either (1) other
attributes associated with type $t_k$ (e.g., paper topic depends on paper type), or (2)
attributes associated with items of type $t_j$ where items $t_j$ are related to items $t_k$ in
$G_D$ (e.g., paper topic depends on author rank). For the latter type of dependency, if
the relation between $t_k$ and $t_j$ is one-to-many, the parent consists of a set of attribute
values (e.g., author ranks). In this situation, current PRM models use aggregation
functions to generalize across heterogeneous items (e.g., one paper may have two
authors while another may have five). Aggregation functions are used to either map
sets of values into single values, or to combine a set of probability distributions into
a single distribution.
Consider the RDN model graph $G_M$ in figure 8.1(b). It models the data in
figure 8.1(a), which has two object types: paper and author. In $G_M$, each item
type is represented by a plate, and each attribute of each item type is represented
as a node. Edges characterize the dependencies among the attributes at the type
level. The representation uses a modified plate notation: dependencies among
attributes of the same object are contained inside the rectangle and arcs that cross
the boundary of the rectangle represent dependencies among attributes of related
objects. For example, $month_i$ depends on $type_i$, while $avgrank_j$ depends on the
$type_k$ and $topic_k$ for all papers $k$ related to author $j$ in $G_D$.
There is a nearly limitless range of dependencies that could be considered by
algorithms learning PRM models. In propositional data, learners model a fixed
set of attributes intrinsic to each object. In contrast, in relational data, learners
must decide how much to model (i.e., how much of the relational neighborhood
around an item can influence the probability distribution of an item's attributes).
For example, a paper's topic may depend on the topics of other papers written by its
authors, but what about the topics of the references in those papers or the topics
of other papers written by coauthors of those papers? Two common approaches to
limiting search in the space of relational dependencies are (1) exhaustive search of
all dependencies within a fixed-distance neighborhood (e.g., attributes of items up
to k links away), or (2) greedy iterative-deepening search, expanding the search in
the neighborhood in directions where the dependencies improve the likelihood.
Finally, during inference, a PRM uses a model graph $G_M$ and a data graph
$G_D$ to instantiate an inference graph $G_I = (V_I, E_I)$ in a process sometimes called
"rollout." The rollout procedure used by PRMs to produce $G_I$ is nearly identical
to the process used to instantiate sequence models such as hidden Markov models.
$G_I$ represents the probabilistic dependencies among all the variables in a single test
set (here $G_D$ is usually different from the data graph $G'_D$ used for training). The structure of $G_I$ is
determined by both $G_D$ and $G_M$: each item-attribute pair in $G_D$ gets a separate,
local copy of the appropriate CPD from $G_M$. The relations in $G_D$ constrain the way
that $G_M$ is rolled out to form $G_I$. PRMs can produce inference graphs with wide
variation in overall and local structure because the structure of $G_I$ is determined
by the specific data graph, which typically has nonuniform structure. For example,
figure 8.2 shows the RDN from figure 8.1b rolled out over a data set of three authors
and three papers, where $P_1$ is authored by $A_1$ and $A_2$, $P_2$ is authored by $A_2$ and
$A_3$, and $P_3$ is authored by $A_3$. Notice that there is a variable number of authors
per paper. This illustrates why current PRMs use aggregation in their CPDs: for
example, the CPD for paper-type must be able to deal with a variable number of
author ranks.
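The rollout step can be sketched in a few lines. The data graph below is the three-paper, three-author example just described. The type-level model is an assumption: it contains only the two dependencies stated in the text (month depends on type; avgrank depends on the type and topic of related papers), not the full model of figure 8.1b. The function name `rollout` and the dictionary encoding are hypothetical.

```python
# Data graph: which authors wrote which papers (from the example in the text).
authored_by = {"P1": ["A1", "A2"], "P2": ["A2", "A3"], "P3": ["A3"]}

# Type-level dependencies, child <- parents. "self" means an attribute of the
# same object; "related" means an attribute of objects linked in the data graph.
# Only the dependencies named in the text are included (assumption).
model = {
    ("paper", "type"): [],
    ("paper", "topic"): [],
    ("paper", "month"): [("paper", "type", "self")],
    ("author", "avgrank"): [("paper", "type", "related"),
                            ("paper", "topic", "related")],
}


def rollout(model, authored_by):
    """Instantiate one node per item-attribute pair and copy template edges."""
    papers = sorted(authored_by)
    authors = sorted({a for alist in authored_by.values() for a in alist})
    items = {"paper": papers, "author": authors}
    related = {p: list(alist) for p, alist in authored_by.items()}
    for p, alist in authored_by.items():
        for a in alist:
            related.setdefault(a, []).append(p)
    nodes, edges = [], []
    for (item_type, attr), template_parents in model.items():
        for item in items[item_type]:
            child = (item, attr)
            nodes.append(child)
            for _, parent_attr, relation in template_parents:
                sources = [item] if relation == "self" else related[item]
                edges += [((s, parent_attr), child) for s in sources]
    return nodes, edges


nodes, edges = rollout(model, authored_by)
n_parents = {a: sum(1 for _, child in edges if child == (a, "avgrank"))
             for a in ("A1", "A2", "A3")}
print(len(nodes), len(edges), n_parents)
```

The number of parents of each avgrank node differs across authors because authors have different numbers of papers, which is the situation that aggregation in the CPDs has to handle.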
Figure 8.2
8.3.2
RDN Representation
the qualitative component ($G_D$) of the RDN; it does not depict the quantitative
component ($P$) of the model, which consists of CPDs that use aggregation functions.
Although conditional independence is inferred using an undirected view of the
graph, bidirected edges are useful for representing the set of variables in each CPD.
For example, in figure 8.1b the CPD for Year contains Topic but the CPD for Topic
does not contain Type. This depicts any inconsistencies that result from the RDN
learning technique.
8.3.3
RDN Learning
Learning a PRM model consists of two tasks: learning the dependency structure
among the attributes of each object type, and estimating the parameters of the local
probability models for an attribute given its parents. Relatively efficient techniques
exist for learning both the structure and parameters of RBN models. However, these
techniques exploit the requirement that the CPDs factor the full distribution, a
requirement that imposes acyclicity constraints on the model and precludes the
learning of arbitrary autocorrelation dependencies. On the other hand, although in
principle it is possible for RMN techniques to learn cyclic autocorrelation dependencies, inefficiencies due to calculating the normalizing constant $Z$ in undirected
models make this difficult in practice. Calculation of $Z$ requires a summation over
all possible states $X$. When modeling the joint distribution of propositional data,
the number of states is exponential in the number of attributes (i.e., $O(2^m)$). When
modeling the joint distribution of relational data, the number of states is exponential in the number of attributes and the number of instances. If there are $N$
objects, each with $m$ attributes, then the total number of states is $O(2^{Nm})$. For
any reasonable-size data set, a single calculation of $Z$ is an enormous computational
burden. Feature selection generally requires repeated parameter estimation while
measuring the change in likelihood affected by each attribute, which would require
recalculation of Z on each iteration.
The RDN learning algorithm uses a more efficient alternative: estimating the
set of conditional distributions independently rather than jointly. This approach is
based on pseudo-likelihood techniques [2], which were developed for modeling spatial
data sets with similar autocorrelation dependencies. Pseudo-likelihood estimation
avoids the complexities of estimating $Z$ and the requirement of acyclicity. In addition, this approach can utilize existing techniques for learning CPDs of relational
data such as first-order Bayesian classifiers [7], structural logistic regression [35], or
ACORA [34].
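A small arithmetic sketch makes the contrast concrete. It assumes binary attributes, as in the $O(2^{Nm})$ count above, and compares the number of joint states summed over in a single calculation of $Z$ with the number of local conditional terms in a pseudo-likelihood (one per attribute value). The function names are hypothetical.

```python
def joint_states(n_objects, n_attrs, n_values=2):
    """Number of joint states summed over when computing Z."""
    return n_values ** (n_objects * n_attrs)


def pseudolikelihood_terms(n_objects, n_attrs):
    """Number of local conditional terms, one per attribute value."""
    return n_objects * n_attrs


for n, m in [(10, 3), (1000, 4)]:
    digits = len(str(joint_states(n, m)))
    print(f"N={n}, m={m}: the joint state count has {digits} decimal digits; "
          f"the pseudo-likelihood has {pseudolikelihood_terms(n, m)} terms")
```

Even for 1000 objects with four binary attributes each, the joint state count is far beyond enumeration, whereas the pseudo-likelihood grows only linearly in the number of attribute values.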
Instead of optimizing the log-likelihood of the full joint distribution, we optimize
the pseudo-loglikelihood for each variable independently, conditioned on all other
attribute values in the data:
$$PL(G_D; \theta) = \prod_{t \in T} \; \prod_{X_i^t \in \mathbf{X}^t} \; \prod_{v : T(v) = t} p(x_{vi}^t \mid pa_{x_{vi}^t}) \tag{8.1}$$
Table 8.1
Use P to form GM .
Figure 8.3 (a) Example QGraph query: Textual annotations specify match conditions on attribute values; numerical annotations (e.g., [0..]) specify constraints on
the cardinality of matched objects (e.g., zero or more authors), and (b) matching
subgraph.
8.3.3.1
Queries
The queries specify the relational neighborhoods that will be considered by the
conditional learner R, and their structure defines a typing over instances in the
database. Subgraphs are extracted from a larger graph database using the visual
query language QGraph [3]. Queries allow for variation in the number and types
of objects and links that form the subgraphs and return collections of all matching
subgraphs from the database.
For example, consider the query in figure 8.3a. The query specifies match
criteria for a target item (paper) and its local relational neighborhood (authors and
references). The example query matches all research papers that were published in
1995 and returns for each paper a subgraph that includes all authors and references
associated with the paper. Figure 8.3b shows a hypothetical match to this query:
a paper with two authors and seven references.
The query defines a typing over the objects of the database (e.g., people that have
authored a paper are categorized as authors) and specifies the relevant relational
context for the target item type in the model. For example, given this query the
learner R would model the distribution of a paper's attributes given the attributes
of the paper itself and the attributes of its related authors and references. The
queries are a means of restricting model search. Instead of setting a depth limit on
the extent of the search, the analyst has a more flexible means with which to limit
the search (e.g., we can consider other papers written by the paper's authors but
not other authors of the paper's references).
8.3.3.2
The conditional relational learner R is used for both parameter estimation and
structure learning in RDNs. The variables selected by R are reflected in the edges
of G appropriately. If R selects all of the available attributes, the RDN model will
be fully connected.
In principle, any conditional relational learner can be used as a subcomponent
to learn the individual CPDs. In this chapter, we discuss the use of two different conditional models: relational Bayesian classifiers (RBCs) [32] and relational
probability trees (RPTs) [31].
Relational Bayesian classifiers. RBCs extend Bayesian classifiers to a relational
setting. RBC models treat heterogeneous relational subgraphs as a homogeneous
set of attribute multisets. For example, when considering the references of a single
paper the publication dates of those references form multisets of varying size (e.g.,
{1995, 1995, 1996}, {1975, 1986, 1998, 1998}). The RBC assumes each value of
a multiset is independently drawn from the same multinomial distribution.6 This
approach is designed to mirror the independence assumption of the naive Bayesian
classifier. In addition to the conventional assumption of attribute independence, the
RBC also assumes attribute value independence within each multiset.
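The multiset independence assumption can be illustrated with a minimal sketch. It is not the RBC implementation from [32]: it handles a single class variable, represents each example as a dictionary of attribute multisets, and adds Laplace smoothing (the `alpha` parameter), which is an assumption made here to avoid zero probabilities. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_rbc(examples, alpha=1.0):
    """examples: list of (label, {attribute: list of multiset values})."""
    prior = Counter(label for label, _ in examples)
    counts = defaultdict(Counter)      # (label, attribute) -> value counts
    vocab = defaultdict(set)           # attribute -> values seen in training
    for label, attrs in examples:
        for attr, values in attrs.items():
            counts[(label, attr)].update(values)
            vocab[attr].update(values)
    return prior, counts, vocab, alpha


def log_score(model, label, attrs):
    """log p(label) plus one log-factor per multiset value."""
    prior, counts, vocab, alpha = model
    score = math.log(prior[label] / sum(prior.values()))
    for attr, values in attrs.items():
        c = counts[(label, attr)]
        denom = sum(c.values()) + alpha * len(vocab[attr])
        for v in values:               # values are treated as independent draws
            score += math.log((c[v] + alpha) / denom)
    return score


def predict(model, attrs):
    return max(model[0], key=lambda label: log_score(model, label, attrs))


examples = [
    ("old", {"ref_year": [1975, 1986]}),
    ("old", {"ref_year": [1975, 1975, 1986]}),
    ("new", {"ref_year": [1995, 1996, 1995]}),
    ("new", {"ref_year": [1996]}),
]
model = train_rbc(examples)
print(predict(model, {"ref_year": [1995, 1995]}))
```

Because every value in a multiset contributes its own factor, multisets of different sizes are scored by the same model without any aggregation step.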
For a given item type $T$, the query scope specifies the set of item types $T_R$
that form the relevant relational neighborhood for $T$. For example, in figure 8.3(a)
$T = paper$ and $T_R = \{paper, author, reference, authorof, cites\}$. To estimate the
CPD for attribute $X$ on items $T$ (e.g., paper topic), the model considers all the
attributes associated with the types in $T_R$. RBCs are non-selective models so all
the attributes are included as parents:
$$p(x \mid pa_x) \propto \prod_{t' \in T_R} \; \prod_{X_i^{t'} \in \mathbf{X}^{t'}} \; \prod_{v \in T_R(x)} p(x_{vi}^{t'} \mid x) \, p(x)$$
Relational probability trees. RPTs are selective models that extend classification trees to a relational setting. RPT models also treat heterogeneous relational
subgraphs as a set of attribute multisets, but instead of modeling the multisets as
independent values drawn from a multinomial, the RPT algorithm uses aggregation functions to map a set of values into a single feature value. For example, when
considering the publication dates of references of a research paper the RPT could
construct a feature that tests whether the average publication date was after 1995.
Figure 8.4 provides an example RPT learned on citation data.
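Aggregation features of this kind are simple to write down. The sketch below assumes a multiset is given as a Python list and shows a few aggregators (average, mode, count, proportion, degree) together with threshold tests that turn them into binary features. The function names and the example thresholds are illustrative, not the RPT algorithm's own feature constructor.

```python
from collections import Counter


def average(values):
    return sum(values) / len(values) if values else None


def mode(values):
    """Most frequent value in the multiset (None for an empty multiset)."""
    return Counter(values).most_common(1)[0][0] if values else None


def count(values, pred):
    """Number of values satisfying the predicate."""
    return sum(1 for v in values if pred(v))


def proportion(values, pred):
    """Fraction of values satisfying the predicate."""
    return count(values, pred) / len(values) if values else 0.0


def degree(values):
    """Size of the multiset, e.g. the number of related objects."""
    return len(values)


# Publication years of one paper's references.
ref_years = [1975, 1986, 1998, 1998]

features = {
    "avg(year) > 1995": average(ref_years) > 1995,
    "mode(year) == 1998": mode(ref_years) == 1998,
    "count(year > 1992) >= 2": count(ref_years, lambda y: y > 1992) >= 2,
    "proportion(year > 1992) > 10%": proportion(ref_years, lambda y: y > 1992) > 0.10,
    "degree > 3": degree(ref_years) > 3,
}
print(features)
```

Each feature maps a variable-size multiset to a single value, so the same tree can be applied to papers with any number of references.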
6. Alternative constructions are possible but prior work [32] has shown this approach
achieves superior performance over a wide range of conditions.
Figure 8.4
The RPT algorithm automatically constructs and searches over aggregated relational features to model the distribution of the target variable $X$. The algorithm
constructs features from the attributes associated with the types specified in the
query. The algorithm considers four classes of aggregation functions to group multiset values: Mode, Count, Proportion, Degree. For discrete attributes, the algorithm
constructs features for all unique values of an attribute. For continuous attributes,
the algorithm constructs features for a number of different discretizations, binning the values by frequency (e.g., year > 1992). Count, proportion, and degree
features consider a number of different thresholds (e.g., proportion(A) > 10%).
Feature scores are calculated using chi-square to measure correlation between the
feature and the class. The algorithm uses prepruning in the form of a p-value cutoff and a depth cutoff to limit tree size. All experiments reported herein used
$\alpha = 0.05/|attributes|$, depth cutoff = 7, and considered ten thresholds and discretizations per feature.
The RPT learning algorithm adjusts for biases toward particular features due to
degree disparity and autocorrelation in relational data [14, 15]. We have shown that
RPTs build significantly smaller trees than other conditional models and achieve
equivalent, or better, performance [31]. These characteristics of RPTs are crucial
for learning understandable RDN models and have a direct impact on inference
efficiency because smaller trees limit the size of the final inference graph.
8.3.4
RDN Inference
The RDN inference graph GI is potentially much larger than the original data
graph. To model the full joint distribution there must be a separate node (and CPD)
for each attribute value in GD . To construct GI , the set of template CPDs in P
is rolled out over the test-set data graph. Each item-attribute pair gets a separate,
local copy of the appropriate CPD. Consequently, the total number of nodes in
the inference graph will be $\sum_{v \in V_D} |\mathbf{X}^{T(v)}| + \sum_{e \in E_D} |\mathbf{X}^{T(e)}|$. Rollout facilitates
generalization across data graphs of varying size: we can learn the CPD templates
from one data graph and apply the model to a second data graph with a different
number of objects by rolling out more CPD copies. This approach is analogous to
other graphical models that tie distributions across the network and roll out copies
of model templates (e.g., hidden Markov models).
We use Gibbs sampling for inference in RDN models. Gibbs sampling can be
used to extract a unique joint distribution, regardless of the consistency of the
model [11].
Table 8.2 outlines the inference algorithm. To estimate a joint distribution, we
start by rolling out the model GM onto the target data set GD , forming the inference
graph GI . The values of all unobserved variables are initialized to values drawn from
their prior distributions. Gibbs sampling then iteratively relabels each unobserved
variable by drawing from its local conditional distribution, given the current state
of the rest of the graph. After a sufficient number of iterations (burn-in), the values
will be drawn from a stationary distribution and we can use the samples to estimate
probabilities of interest.
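The sampling loop can be sketched for a toy dependency network. This is a minimal illustration, not the chapter's implementation: it assumes two binary variables whose hypothetical local CPDs each favor agreement with the other variable, initializes values arbitrarily rather than from prior distributions, and uses the chain length (2000) and burn-in (100) reported for the experiments as defaults. The function name `gibbs` and the CPD encoding are assumptions.

```python
import random


def gibbs(cpds, n_vars, iters=2000, burnin=100, seed=0):
    """Gibbs sampling over binary variables.

    cpds[v](state) returns P(x_v = 1 | current values of the other variables).
    """
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n_vars)]   # arbitrary initial state
    samples = []
    for i in range(iters):
        order = list(range(n_vars))
        rng.shuffle(order)                            # random visiting order
        for v in order:
            # Relabel x_v by drawing from its local conditional distribution.
            x[v] = 1 if rng.random() < cpds[v](x) else 0
        if i >= burnin:
            samples.append(tuple(x))
    return samples


# Hypothetical local CPDs: each variable agrees with the other with prob. 0.9.
cpds = {
    0: lambda x: 0.9 if x[1] == 1 else 0.1,
    1: lambda x: 0.9 if x[0] == 1 else 0.1,
}
samples = gibbs(cpds, n_vars=2)
agree = sum(a == b for a, b in samples) / len(samples)
print(len(samples), round(agree, 3))
```

Any probability of interest, such as the marginal of one variable or the probability that the two variables agree, is then estimated as a frequency over the retained samples.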
For prediction tasks we are often interested in the marginal probabilities associated with a single variable X (e.g., paper topic). Although Gibbs sampling may
be a relatively inefficient approach to estimating the probability associated with a
joint assignment of values of X (e.g., when |X| is large), it is often reasonably fast
to estimate the marginal probabilities for each X.
There are many implementation issues that can improve the estimates obtained
from a Gibbs sampling chain, such as length of burn-in and number of samples.
For the experiments reported in this chapter we used fixed-length chains of 2000
samples (each iteration relabels every value sequentially) with burn-in set at 100.
Empirical inspection indicated that the majority of chains had converged by 500
samples.
8.4
Experiments
The experiments in this section demonstrate the utility of RDNs as a joint model of
relational data. First, we use synthetic data to assess the impact of training-set size
and autocorrelation on RDN learning and inference, showing that accurate models
can be learned at reasonable data set sizes and that the model is robust to varying
levels of autocorrelation. Next, we learn RDN models of three real-world data sets
to illustrate the types of domain knowledge that the models discover automatically.
In addition, we evaluate RDN models in a prediction context, where only a single
attribute is unobserved in the test set, and report significant performance gains
compared to two conditional models.
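Table 8.2
    $E_I \leftarrow E_I \cup \{e_{ij}\}$
For each $v \in V_I$:
    Randomly initialize $x_v$ to an arbitrary value
$S \leftarrow \emptyset$
For $i \in iter$:
    For each $v \in V_I$, in random order:
        Resample $x'_v$ from $p(x_v \mid \mathbf{x} - \{x_v\})$
        $x_v \leftarrow x'_v$
    If $i > burnin$:
        $S \leftarrow S \cup \{\mathbf{x}\}$
Use samples $S$ to estimate probabilities of interest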
8.4.1
RDN Learning
The first set of synthetic experiments examines the effectiveness of the RDN learning
algorithm. Theoretical analysis indicates that, in the limit, the true parameters will
Figure 8.5
learned $RDN_{RPT}$ models. The bottom row reports experiments with data generated
from an $RDN_{RBC}$, where we learned $RDN_{RBC}$ models.
These experiments show that the learned $RDN_{RPT}$ models are a good approximation to the true model by the time training-set size reaches 500, and that RDN
learning is robust with respect to varying levels of autocorrelation. As expected,
however, when training-set size is small, the RDNs are a better approximation for
data sets with low levels of autocorrelation (see figure 8.5a).
There appears to be little difference between the $RDN_{RPT}$ and $RDN_{RBC}$ when
autocorrelation is low, but otherwise the $RDN_{RBC}$ needs significantly more data
to estimate the parameters accurately. This may be in part due to the model's lack
of selectivity, which necessitates the estimation of a greater number of parameters.
However, there is little improvement even when we increase the size of the training
sets to 10,000 objects. Furthermore, the discrepancy between the estimated model
and the true model is greatest when autocorrelation is moderate. This indicates
that the inaccuracies may be due to the naive Bayes independence assumption and
its tendency to produce biased probability estimates [40].
8.4.1.2
RDN Inference
The second set of synthetic experiments evaluates the RDN inference procedure in
a prediction context, where only a single attribute is unobserved in the test set. We
generated data in the manner described above and learned RDNs for X1 . At each
autocorrelation level, we generated ten training sets (size 500) and learned RDNs.
For each training set, we generated ten test sets (size 250) and used the learned
models to infer marginal probabilities for the class labels of the test-set instances.
To evaluate the predictions, we report area under the ROC curve (AUC).7 These
experiments used the same levels of autocorrelation outlined above.
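For reference, AUC can be computed directly from predicted marginal probabilities as the fraction of positive-negative pairs that are ranked correctly, with ties counted as one half. The sketch below assumes binary class labels encoded as 0 and 1; the function name `auc` and the example scores are illustrative only.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise ranking of positives vs. negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]   # hypothetical inferred marginals
print(auc(labels, scores))      # three of the four pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which makes the measure insensitive to the choice of classification threshold.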
We compare the performance of three types of models. First, we measure the
performance of RPT and RBC models. These are conditional models that reason
about each instance independently and do not use the class labels of related instances. Next, we measure the performance of the two RDN models described above:
$RDN_{RBC}$ and $RDN_{RPT}$. These are collective models that reason about instances
jointly, using the inferences about related instances to improve overall performance.
Lastly, we measure performance of the two RDN models while allowing the true
labels of related instances to be used during inference. This demonstrates the level
of performance possible if the RDNs could infer the true labels of related instances
with perfect accuracy. We refer to these as ceiling models: $RDN^{ceil}_{RBC}$ and $RDN^{ceil}_{RPT}$.
7. Squared-loss results are qualitatively similar to the AUC results reported in figure 8.6.
Figure 8.6
Note that conditional models can reason about autocorrelation dependencies in
a limited manner by using the attributes of related instances. For example, if there
is a correlation between the words on a webpage and its topic, and the topics of
hyperlinked webpages are autocorrelated, then we can improve the inference about
a single page by modeling the contents of its neighboring pages. Recent work has
shown that collective models are a low-variance means of reducing bias that work
by modeling the autocorrelation dependencies directly [16]. Conditional models are
also able to exploit autocorrelation dependencies through modeling the attributes
of related instances, but variance increases dramatically as the number of attributes
increases.
During inference we varied the number of known class labels in the test set, measuring performance on the remaining unlabeled instances. This serves to illustrate
model performance as the amount of information seeding the inference process increases. We expect performance to be similar when other information seeds the
inference process, for example, when some labels can be inferred from intrinsic attributes, or when weak predictions about many related instances serve to constrain
the system. Figure 8.6 graphs AUC results for each of the models as the level of
known class labels is varied.
In all configurations, $RDN_{RPT}$ performance is equivalent to, or better than, RPT
performance. This indicates that even modest levels of autocorrelation can be exploited to improve predictions using $RDN_{RPT}$ models. $RDN_{RPT}$ performance is
indistinguishable from that of $RDN^{ceil}_{RPT}$ except when autocorrelation is high and
there are no labels to seed inference. In this situation, there is little information to
constrain the system during inference so the model cannot fully exploit the autocorrelation dependencies. When there is no information to anchor the predictions,
there will be an identifiability problem: symmetric labelings that are highly autocorrelated, but with opposite values, will be equally likely. In situations where
there is little seed information, identifiability problems can bias RDN performance
toward random.
Figure 8.7
In contrast, $RDN_{RBC}$ performance is superior to RBC performance only when
there is moderate to high autocorrelation and sufficient seed information. When
autocorrelation is low, the RBC model is comparable to both the $RDN^{ceil}_{RBC}$
and $RDN_{RBC}$ models. Even when autocorrelation is moderate or high, RBC
performance is still relatively high. Since the RBC model is low-variance and there
are only four attributes in our data sets, it is not surprising that the RBC model
is able to exploit autocorrelation to improve performance. What is more surprising
is that $RDN_{RBC}$ requires substantially more seed information than $RDN_{RPT}$ in
order to reach ceiling performance. This indicates that our choice of model should
take test-set characteristics (e.g., number of known labels) into consideration.
8.4.2
We learned RDN models for three real-world relational data sets to illustrate the
types of domain knowledge that can be garnered, and evaluated the models in a
prediction context, where the values of a single attribute are unobserved. Figure 8.7
depicts the objects and relations in each data set.
The first data set is drawn from the Internet Movie Database (IMDb: www.imdb.com).
We collected a sample of 1382 movies released in the United States between 1996
and 2001, with their associated actors, directors, and studios. In total, this sample
contains approximately 42,000 objects and 61,000 links.
The second data set is drawn from Cora, a database of computer science research papers extracted automatically from the web using machine learning techniques [25]. We selected the set of 4330 machine learning papers along with associated authors, cited papers, and journals. The resulting collection contains approximately 13,000 objects and 26,000 links. For classification, we sampled the 1669
papers published between 1993 and 1998.
The third data set is from the National Association of Securities Dealers (NASD)
[33]. It is drawn from NASD's Central Registration Depository (CRD) system, which contains data on approximately 3.4 million securities brokers, 360,000
branches, 25,000 firms, and 550,000 disclosure events. Disclosures record disciplinary information on brokers, including information on civil judicial actions, customer complaints, and termination actions. Our analysis was restricted to small and
moderate-size firms with fewer than fifteen brokers, each of whom has an approved
NASD registration. We selected a set of 10,000 brokers who were active in the years
1997-2001, along with 12,000 associated branches, firms, and disclosures.
8.4.2.1
RDN Models
The RDN models in figures 8.8, 8.9, and 8.10 continue with the RDN representation
introduced in figure 8.1b. Each item type is represented by a separate plate. Arcs
inside a plate represent dependencies among the attributes of a single object, and
arcs crossing the boundaries of plates represent dependencies among attributes of
related objects. An arc from x to y indicates the presence of one or more features
of x in the conditional model learned for y. When the dependency is on attributes
of objects more than a single link away, the arc is labeled with a small rectangle to
indicate the intervening related-object type. For example, in figure 8.8 movie genre
is influenced by the genres of other movies made by the movie's director, so the arc
is labeled with a small D rectangle.
In addition to dependencies among attribute values, relational learners may also
learn dependencies between the structure of relations (edges in GD ) and attribute
values. Degree relationships are represented by a small black circle in the corner
of each plate; arcs from this circle indicate a dependency between the number of
related objects and an attribute value of an object. For example, in figure 8.8 movie
receipts are influenced by the number of actors in the movie.
For each data set, we learned RDNs using queries that include all neighbors up to
two links away in the data graph. For example in the IMDb, when learning a model
of movie attributes we considered the attributes of associated actors, directors,
producers, and studios, as well as movies related to those objects.
On the IMDb data, we learned an RDN model for ten discrete attributes including
actor gender and movie opening weekend receipts (> $2 million). Figure 8.8 shows
the resulting RDN model. Four of the attributes (movie receipts, movie genre,
actor birth year, and director first movie year) exhibit autocorrelation dependencies. Exploiting this type of dependency has been shown to significantly improve
classification accuracy of RMNs compared to RBNs, which cannot model cyclic
dependencies [39]. However, to exploit autocorrelation, RMNs must be instantiated with the appropriate clique templates; to date there is no RMN algorithm for
learning autocorrelation dependencies. RDNs are the first PRM capable of learning
cyclic autocorrelation dependencies.
Figure 8.8
On the Cora data, we learned an RDN model for seven attributes including
paper topic (e.g., neural networks) and journal name prefix (e.g., IEEE). Figure 8.9
shows the resulting RDN model. Again we see that four of the attributes exhibit
autocorrelation. Note that when a dependency is on attributes of objects a single
link away, the arc is unlabeled. For example, the unlabeled self-loops on paper
variables indicate dependencies on the same variables in cited papers. In particular,
the topic of a paper depends not only on the topics of other papers that it cites
but also on the topics of other papers written by the authors. This model is a good
reflection of our domain knowledge about machine learning papers.
Figure 8.9
On the NASD data, we learned an RDN model for eleven attributes including
broker is-problem and disclosure type (e.g., customer complaint). Figure 8.10
shows the resulting RDN model. Again we see that four of the attributes exhibit
autocorrelation. Subjective inspection by NASD analysts indicates that the RDN
has automatically uncovered statistical relationships that confirm the intuition of
domain experts. These include temporal autocorrelation of risk (past problems are
indicators of future problems) and relational autocorrelation of risk among brokers
at the same branch. Indeed, fraud and malfeasance are usually social phenomena,
communicated and encouraged by the presence of other individuals who also wish to
commit fraud [5]. Importantly, this evaluation was facilitated by the interpretability
of the RDN model: experts are more likely to trust, and make regular use of,
models they can understand.
Figure 8.10
8.4.2.2
Prediction
We evaluated the learned models on prediction tasks in order to assess (1) whether
autocorrelation dependencies among instances can be used to improve model accuracy,
and (2) whether the RDN models, using Gibbs sampling, can effectively infer
labels for a network of instances. To do this, we compared the same three classes
of models used in section 8.4.1: RPTs and RBCs, RDNs, and ceiling RDNs.
Figure 8.11 shows AUC results for each of the models on the three prediction
tasks. Figure 8.11a graphs the results of the RDN_RPT models, compared to the
RPT conditional model. Figure 8.11b graphs the results of the RDN_RBC models,
compared to the RBC conditional model. We used the following prediction tasks:
movie receipts for IMDb, paper topic for Cora, and broker is-problem for NASD.
The graphs show the AUC for the most prevalent class, averaged over a number
of training/test splits. We used temporal samples where we learned models on one
year of data and applied the model to the subsequent year. We used two-tailed,
paired t-tests to assess the significance of the AUC results obtained from the trials.
The t-tests compare the RDN results to each of the other two models with a null
hypothesis of no difference in the AUC.
When using the RPT as the conditional learner (figure 8.11a), RDN performance
is superior to RPT performance on all tasks. The difference is statistically significant
for two of the three tasks. This indicates that autocorrelation is both present in the
data and identified by the RDN models. The RPT can sometimes use attributes of
related items to effectively represent and reason with autocorrelation dependencies.
Figure 8.11 AUC results for (a) RDN_RPT and RPT models, and (b) RDN_RBC
and RBC models. Asterisks denote model performance that is significantly different
(p < 0.10) from RDN_RPT and RDN_RBC.
However, in some cases the attributes other than the class label contain little
information about the class labels of related instances. This is the case for Cora:
RPT performance is close to random because no other attributes influence paper
topic (see figure 8.9). On all tasks, the RDN models achieve comparable performance
to the ceiling models. This indicates that the RDN model achieved the same level
of performance as if it had access to the true labels of related objects. On the
NASD data, the RDN performance is slightly higher than that of the ceiling model.
We note, however, that the ceiling model only represents a probabilistic ceiling;
the RDN may perform better if an incorrect prediction for one object improves
inferences about related objects.
Similarly, when using the RBC as the conditional learner (figure 8.11b), the
performance of RDN models is superior to the RBC models on all tasks, and the
difference is statistically significant for two of the tasks. However, the RDN models
achieve comparable performance to the ceiling models on only one of the tasks. This
may be another indication that RDN models combined with a nonselective conditional
learner (e.g., RBCs) will experience increased variance during the Gibbs sampling
process, and thus they may need more seed information during inference to achieve
near-ceiling performance. We should note that although the RDN_RBC models do not
significantly outperform the RDN_RPT models on any of the tasks, the RDN_RBC^Ceil is
significantly higher than RDN_RPT^Ceil for Cora and IMDb. This indicates that, when
there is enough seed information, RDN_RBC models may achieve significant performance
gains over RDN_RPT models.
8.5
Related Work
8.5.1
Probabilistic relational models are one class of models for density estimation in
relational data sets. Examples of PRMs include RBNs and RMNs.
As outlined in section 8.3.1, learning and inference in PRMs involve a data graph
$G_D$, a model graph $G_M$, and an inference graph $G_I$. All PRMs model data that can
be represented as a graph (i.e., $G_D$). PRMs use different approximation techniques
for inference in $G_I$ (e.g., Gibbs sampling, loopy belief propagation [26]), but they
all use a similar process for rolling out an inference graph $G_I$. Consequently, PRMs
differ primarily with respect to the representation of the model graph $G_M$ and how
that model is learned.
The RBN learning algorithm [10] for the most part uses standard Bayesian
network techniques for parameter estimation and structure learning. One notable
exception is that the learning algorithm must check for legal structures that are
guaranteed to be acyclic when rolled out for inference on arbitrary data graphs. In
addition, instead of exhaustive search of the space of relational dependencies, the
structure learning algorithm uses greedy iterative-deepening, expanding the search
in directions where the dependencies improve the likelihood.
The strengths of RBNs include understandable knowledge representations and
efficient learning techniques. For relational tasks, with a huge space of possible
dependencies, selective models are easier to interpret and understand than nonselective
models. Closed-form parameter estimation techniques allow for efficient
structure learning (i.e., feature selection). Also, because reasoning with relational
models requires more space and computational resources, efficient learning techniques
make relational modeling both practical and feasible.
The directed acyclic graph structure is the underlying reason for the efficiency
of RBN learning. As discussed in section 8.1, the acyclicity requirement precludes
the learning of arbitrary autocorrelation dependencies and limits the applicability
of these models in relational domains. RDN models enjoy the strengths of RBNs
(namely, understandable knowledge representation and efficient learning) without
being constrained by an acyclicity requirement.
The RMN learning algorithm [39] uses maximum a posteriori parameter estimation
with Gaussian priors, modifying Markov network learning techniques. The algorithm
assumes that the clique templates are prespecified and thus does not search
for the best structure. Because the user supplies a set of relational dependencies to
consider (i.e., clique templates), it simply optimizes the potential functions for the
specified templates.
RMNs are not hampered by an acyclicity constraint, so they can represent and
reason with arbitrary forms of autocorrelation. This is particularly important for
reasoning in relational data sets where autocorrelation dependencies are nearly
ubiquitous and often cannot be structured in an acyclic manner. However, the
A second class of models for density estimation consists of extensions to conventional
logic programming that support probabilistic reasoning in first-order logic
environments. We will refer to this class of models as probabilistic logic models
(PLMs). Examples of PLMs include Bayesian logic programs [18] and Markov logic
networks (MLNs) [36].
PLMs represent a joint probability distribution over the groundings of a first-order
knowledge base. The first-order knowledge base contains a set of first-order
formulae, and the PLM model associates a set of weights/probabilities with each of
the formulae. Combined with a set of constants representing objects in the domain,
PLM models specify a probability distribution over possible truth assignments
to groundings of the first-order formulae. Learning a PLM consists of two tasks:
generating the relevant first-order clauses, and estimating the weights/probabilities
associated with each clause.
Within this class of models, MLNs are most similar in nature to RDNs. In
MLNs, each node is a grounding of a predicate in a first-order knowledge base,
and features correspond to first-order formulae and their truth-values. Learning
an MLN consists of estimating the feature weights and selecting which features
to include in the final structure. The input knowledge base defines the relevant
relational neighborhood, and the algorithm restricts the search by limiting the
number of distinct variables in a clause, using a weighted pseudo-likelihood scoring
function for feature selection [19].
MLNs ground out to undirected Markov networks. In this sense, they are quite
similar to RMNs, sharing the same strengths and weaknesses: they are capable of
representing cyclic autocorrelation relationships but suffer from the complexity of
full joint inference during learning, which decreases efficiency. Kok and Domingos
[19] have recently demonstrated the promise of efficient pseudo-likelihood structure
learning techniques. Our future work will investigate the performance tradeoffs
between RDN and MLN approaches to pseudo-likelihood estimation for learning.
8.5.3
Collective Inference
interact with the level of autocorrelation and local model characteristics to impact
performance, future work will attempt to quantify these effects more formally.
We also presented learned RDNs for a number of real-world relational domains,
demonstrating another strength of RDNs: their understandable and intuitive
knowledge representation. Comprehensible models are a cornerstone of the
knowledge discovery process, which seeks to identify novel and interesting patterns
in large data sets. Domain experts are more willing to trust, and make regular
use of, understandable models, particularly when the induced models are used
to support additional reasoning. Understandable models also aid analysts' assessment
of the utility of the additional relational information, potentially reducing the
cost of information gathering and storage and the need for data transfer among
organizations, and so increasing the practicality and feasibility of relational modeling.
Future work will compare RDN models to RMNs and MLNs in order to quantify
the performance tradeoffs for using pseudo-likelihood functions rather than full
likelihood functions for both parameter estimation and structure learning, particularly
over data sets with varying levels of autocorrelation. Based on theoretical analysis
of pseudo-likelihood estimation (e.g., [9]), we expect there to be little difference
when autocorrelation is low and increased variance when autocorrelation is high. If
this is the case, there will need to be enough training data to withstand the increase
in variance. Alternatively, bagging techniques may be a means of reducing variance
with only a moderate increase in computational cost. In either case, the simplicity
and relative efficiency of RDN methods are a clear win for learning models in
relational domains.
Acknowledgments
We acknowledge the invaluable assistance of A. Shapira, and helpful comments
from C. Loiselle. This effort is supported by DARPA and NSF under contract
numbers IIS0326249 and HR0011-04-1-0013. The U.S. Government is authorized to
reproduce and distribute reprints for governmental purposes notwithstanding any
copyright notation hereon. The views and conclusions contained herein are those
of the authors and should not be interpreted as necessarily representing the official
policies or endorsements either expressed or implied of DARPA, NSF, or the U.S.
Government.
References
[1] A. Bernstein, S. Clearwater, and F. Provost. The relational vector-space model
and industry classification. In Proceedings of the IJCAI-2003 Workshop on
Learning Statistical Models from Relational Data, 2003.
James Cussens
9.1
Introduction
This chapter provides a high-level and selective overview of formalisms which
incorporate both logic and probability. Naturally, the focus is on those formalisms
which fall within the ambit of statistical relational learning (SRL) or which have
influenced formalisms used for SRL. Learning (in the AI sense) is the central topic
of this book, but in order to understand existing and potential learning algorithms
for the formalisms discussed, it is necessary to understand what is represented by
each formalism: we need to know what is to be learned before examining how to
do the learning. Consequently, in this chapter there is a strong focus on issues of
representation.
It is worth stating some important questions concerning logic and probability
which will not be addressed here. First, logic in the general nontechnical sense of
a method of rational reasoning includes probabilistic reasoning quite naturally since
humans are required to reason in uncertain situations. Thus two of the historically
Although logic-based SRL formalisms reject Carnap's attempt to use logic to determine
probabilities, his use of possible worlds to provide semantics for probabilistic
statements is widely followed. Recall that to interpret terms and formulae of a
(standard, nonprobabilistic) first-order language L it is necessary to consider
L-structures, also known as L-interpretations, or more poetically, possible worlds. An
L-structure is a set (the domain) together with functions and relations. Each function
(resp. predicate) symbol in the language has a corresponding function (resp.
relation) on the domain. Standard (Tarskian) first-order semantics defines when a
particular L-formula is true in a particular L-structure. For example, the formula
flies(tweety) is true in a given L-structure iff the individual which the constant
tweety denotes is an element of the set which the predicate symbol flies denotes.
To explain possible world semantics for probabilistic statements we will follow the
account of Halpern [17]. Using Halpern's notation, the probability that
flies(tweety) is true is denoted by the term w(flies(tweety)).¹ The proposition
that this probability is 0.8 is represented by the formula w(flies(tweety)) = 0.8.
As Halpern notes, it is not useful to ask whether a probabilistic statement such
as w(flies(tweety)) = 0.8 is true in some particular L-structure. In any given
L-structure, tweety either flies or does not. Instead we have to ask whether a probability
distribution over L-structures satisfies w(flies(tweety)) = 0.8 or not. A rigorous
account of how to answer this question is given by Halpern [17] but the basic idea
is simple: a probability distribution $\mu$ over L-structures (possible worlds) satisfies
w(flies(tweety)) = 0.8 iff the set of worlds in which flies(tweety) is true has
probability 0.8 according to $\mu$. Note that this means that w(flies(tweety)) is a
marginal probability since it can be computed by summing over possible-world
probabilities.
1. Note on terminology: Throughout the rest of this chapter the term "the probability of
F" (where F is a first-order formula) will be used as an abbreviation for "the probability
that F is true."
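The possible-worlds reading can be made concrete with a minimal Python sketch. The two ground atoms, the four worlds, and the numbers in `MU` are illustrative assumptions (they are not taken from Halpern [17]); the only point is that w(flies(tweety)) is obtained by summing the probabilities of the worlds in which the atom is true.

```python
from itertools import product

# Hypothetical ground atoms; each possible world assigns a truth value to both.
ATOMS = ("flies(tweety)", "bird(tweety)")

# All four possible worlds, each a dict from atom to truth value.
WORLDS = [dict(zip(ATOMS, values))
          for values in product([True, False], repeat=len(ATOMS))]

# An assumed distribution over the worlds, in the order generated above:
# (T, T), (T, F), (F, T), (F, F). The numbers are made up and sum to 1.
MU = [0.7, 0.1, 0.15, 0.05]


def w(atom):
    """Marginal probability of `atom`: total mass of the worlds where it is true."""
    return sum(p for world, p in zip(WORLDS, MU) if world[atom])


def satisfies(atom, value, tol=1e-9):
    """Does the distribution MU satisfy the statement w(atom) = value?"""
    return abs(w(atom) - value) <= tol


if __name__ == "__main__":
    print(w("flies(tweety)"))
    print(satisfies("flies(tweety)", 0.8))
```

Note that the question "is w(flies(tweety)) = 0.8 true?" is asked of the whole distribution `MU`, never of a single world, which mirrors the point made above.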
9.2
Representation
Probability-logic formalisms take one of two routes to defining probabilities. In the
directed approach there is a nonempty set of formulae all of whose probabilities
are explicitly stated: call these probabilistic facts, similarly to Sato [39]. Other
probabilities are defined recursively with the probabilistic facts acting as base cases.
A probability-logic model using the directed approach will be closely related to a
recursive graphical model (Bayesian net). Most probability-logic formalisms fall into
this category: for example, probabilistic logic programming (PLP) [30]; probabilistic
Horn abduction (PHA) [36] and its later expansion the independent choice logic
(ICL) [37]; probabilistic knowledge bases (PKBs) [31]; Bayesian logic programs
(BLPs) (see chapter 10); relational Bayesian networks (RBNs) [20]; stochastic logic
programs (SLPs) (see chapter 11) and the PRISM system [40].
The second, less common, approach is undirected, where no formula has its
probability explicitly stated. Relational Markov networks (RMNs) (see chapter 6)
and Markov logic networks (MLNs) (see chapter 12) are examples of this approach.
In the undirected approach, the probability of each possible world is defined in terms
of its features, where each feature has an associated real-valued parameter. For
example, in the case of MLNs each feature is associated with a first-order formula:
the value of the feature for a given world is simply the number of true ground
instances of the formula in that world. Such approaches have much in common with
undirected probabilistic models such as Markov networks. For example, to compute
(perhaps conditional) probabilities of individual formulae, inference techniques from
Markov networks can be used. See chapters 5 and 12 for further details.
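As an illustration of the undirected approach, the following sketch builds a tiny log-linear distribution over possible worlds in the style of an MLN. The domain (two constants), the single formula smokes(x) -> cancer(x), and the weight 1.5 are all assumptions chosen for illustration. The feature value is the number of true ground instances of the formula, and each world's probability is taken proportional to the exponential of the weighted feature value, which is the usual log-linear form of a Markov network. Worlds are enumerated exhaustively, so this is only feasible for toy domains.

```python
import math
from itertools import product

CONSTANTS = ("anna", "bob")          # hypothetical domain
PREDICATES = ("smokes", "cancer")    # hypothetical unary predicates
GROUND_ATOMS = [(p, c) for p in PREDICATES for c in CONSTANTS]
WEIGHT = 1.5                         # assumed weight of the single formula


def feature(world):
    """Number of true ground instances of smokes(x) -> cancer(x) in `world`."""
    return sum(1 for c in CONSTANTS
               if (not world[("smokes", c)]) or world[("cancer", c)])


def all_worlds():
    """Enumerate every truth assignment to the ground atoms."""
    for values in product([False, True], repeat=len(GROUND_ATOMS)):
        yield dict(zip(GROUND_ATOMS, values))


def world_distribution():
    """Return (world, probability) pairs with probability exp(w * n) / Z."""
    worlds = list(all_worlds())
    scores = [math.exp(WEIGHT * feature(wd)) for wd in worlds]
    z = sum(scores)                  # partition function
    return [(wd, s / z) for wd, s in zip(worlds, scores)]


def probability(query):
    """Probability that `query` (a Boolean function of a world) holds."""
    return sum(p for wd, p in world_distribution() if query(wd))
```

A conditional probability such as P(cancer(anna) | smokes(anna)) can then be read off as a ratio of two calls to `probability`; in realistic domains this exhaustive sum would be replaced by Markov network inference.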
9.2.1
Here we will focus on formalisms using the more common directed approach.
The most basic requirement of such formalisms is to explicitly state that a given
ground atomic formula has some probability of being true: a statement such as
w(flies(tweety)) = 0.8 should be expressible. This is indeed the case for PLP,
PHA/ICL, PKB, and PRISM. In all these cases, possible worlds semantics are
explicitly invoked.
From a statistical point of view, asserting that w(flies(tweety)) = 0.8 amounts to
viewing flies(tweety) as a binary variable taking the values TRUE and FALSE. In
many applications a restriction to binary variables would be very inconvenient
values(flies(tweety),[yes,no]).
:- set_sw(flies(tweety),0.8+0.2).
Figure 9.1
9.2.2
Even at the rudimentary level of defining probabilities for atomic formulae some
of the power of first-order methods is apparent. The basic point is that by using
variables we can define probabilities for whole families of related atomic formulae.
To take an example from Ngo and Haddawy [31], the PKB formula
$P(\mathit{nbrhd}(X, \mathit{bad})) = 0.3 \leftarrow \mathit{in\_CALI}(X)$
makes up part of the definition of the distribution of random variables nbrhd(X)
for those X where in_CALI(X) is true. Informally, the formula says: "For those in
California, there is probability 0.3 of living in a bad neighborhood." The formula
in_CALI(X) is known as a context literal. Asserting that a context literal is true
amounts to stating that it is true in all possible worlds, or equivalently, restricting
the set of possible worlds under consideration to those where it is true. Thus the
preceding formula can be translated into Halpern's syntax as
$\forall x : w(\mathit{nbrhd}(x, \mathit{bad})) = 0.3 \leftarrow w(\mathit{in\_CALI}(x)) = 1.$
If, for example, we had that w(in_CALI(bob)) = 1, meaning it is certain that
Bob lives in California, then it immediately follows that w(nbrhd(bob, bad)) = 0.3,
meaning Bob lives in a bad neighborhood with probability 0.3.
Such a mixture of probabilistic literals (i.e., P(nbrhd(X, bad))) and nonprobabilistic
literals (i.e., in_CALI(X)), where the latter states what is true in all worlds,
is common. In PLP [30], stating what is true in all worlds is made explicit. The
above formula would be written
$\mathit{nbrhd}(X, \mathit{bad}) : [0.3, 0.3] \leftarrow \mathit{in\_CALI}(X) : [1, 1]$
and would have the same informal intended meaning. As this example indicates,
in PLP probability intervals are represented. To show the sort of formulae that
can be expressed in PLP and what they mean, tables 9.1, 9.2, and 9.3 show
three example PLP formulae, their informal meaning, and their representation in
Halpern's notation, respectively.³
Table 9.1 PLP formulae
dog(X) : [1, 1]
dog(X) : [V1, V2]
Table 9.3
$w(\mathit{dog}(x)) = 1$
$w(\mathit{dog}(x)) \in [v_1, v_2]$
9.2.3
So far we have been looking mainly at formulae which directly define probabilities
for atomic formulae. Directly stating all probabilities of interest is too restrictive
and so formalisms provide mechanisms for defining probabilities which must be
inferred rather than just looked up.
For directed approaches, there are two basic ways in which this is done: using
conditional probabilities and using logical rules. PKB [31], for example, focuses on
the former approach allowing formulae such as
$P(\mathit{bglry}(X, \mathit{yes}) \mid \mathit{nbd}(X, \mathit{bad})) = 0.6 \leftarrow \mathit{in\_CALI}(X),$ (9.1)
(9.3)
(9.4)
(9.5)
].
(9.6)
This results in a strictly stronger formula: (9.2) only affects those known to
be Californians, whereas (9.6) states a conditional probability that applies to all
individuals. Formally, (9.6) $\models$ (9.2) but (9.2) $\not\models$ (9.6). To see this abbreviate (9.6)
x : [ p(x)
w(in CALI(x)) = 1)
x : [ q(x)
x : [ q(x)
w(in CALI(x)) = 1
(9.2)
(9.7)
Secondly, although the connection to Bayesian networks is more direct when using conditional probabilities, it is also straightforward to encode Bayesian networks
using the rule-based approach [36], since both cases share an underlying directedness.
9.2.4
So far we have considered probabilities on the truth values of atomic formulae only.
It is necessary to go further and have a mechanism for defining probabilities on (at
least) conjunctions of atomic formulae. This defines the joint distribution over the
truth values of atomic formulae. If each possible world has a conjunction that it
alone satisfies, this will give us a complete distribution over possible worlds.
One approach is to assume independence in all cases where this is possible, an
approach going back to Boole:
The events whose probabilities are given are to be regarded as independent of
any connexion but such as is either expressed, or necessarily implied, in the
data . . . ([4], pp. 256-7.)
Where alternatives (in the Poole sense) are used, it is clear that the atomic formulae
in any given alternative are highly dependent. Equally, those formulae on either
side of an implication must be dependent. However, we are at liberty to assume
that formulae from different alternatives are independent and this is what is done
in PHA/ICL and PRISM; indeed this is why the independent choice logic is so
called. Note that this is only possible because these formalisms disallow inferred
probabilities, like happy(tweety), from appearing in alternatives.
When a formalism (implicitly) defines a Bayesian network whose nodes are
ground atomic formulae, then the probability of any conjunction is just the probability
of the relevant joint instantiation of the Bayesian net in the normal way.
A quite different way of combining Bayesian networks with logic is provided by
relational Bayesian networks [20]. Each node in an RBN corresponds to a relation⁴
instead of to an atomic formula. The possible values for a relation r are the possi-
Not all probability-logic formalisms are framed in terms of possible worlds. The
hallmark of Bayesian logic programs [22] is a one-to-one mapping between ground
atomic formulae and random variables where there is no restriction on what these
random variables might be. In particular a random variable need not represent the
probability with which the ground atomic formula is true; indeed it need not be
binary. One advantage of this design decision is that continuous random variables
can be represented. There is, however, a logical aspect to BLPs which has associated
semantics. In BLPs, first-order clauses are used, together with combining rules, to
define the structure of a BLP in much the same way that parent-child edges define
the structure of a Bayesian network. Essentially, a ground instance of an atomic
formula in the head of a clause corresponds to a child node, whereas those in
the body are its parents. Using logical formulae to define the structure of a large
(possibly infinite) Bayesian network in this way means that logical methods can be
used to reason about the structure of the network. More on BLPs can be found in
chapter 10 of this book.
The example of BLPs shows that it can be fruitful to use first-order logic
as a convenient way of representing and manipulating data (and models) with
complex structure, without too much concern about what the resulting probability
distributions mean. Much, but not all, of the work on SLPs [28, 7] takes this
view. The easiest way to understand SLPs is by relating them to stochastic
context-free grammars (SCFGs) as Muggleton [28] did in the original paper. In an SCFG
each grammar rule has an associated probability. This provides a mechanism for
probabilistically generating strings from the grammar: when there is a choice
of grammar rules for expanding a nonterminal, one is chosen according to the
probabilities. Any derivation in the grammar thus has a probability which is simply
the product of the probabilities of all rules used in the derivation. The probability
of any string is given by the sum of the probabilities of all derivations which
generate that string. SLPs lift this basic idea to logic programs: probabilities are
attached to first-order clauses, thus defining probabilities for proofs. SLPs are more
complex than SCFGs since they are not generally context-free: not all sequences of
clauses constitute a proof; some end in failure. There are different ways of dealing
with this (one option is to use backtracking), which define different probability
distributions [8]. More on SLPs can be found in chapter 11 of this book.
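The SCFG half of this comparison can be sketched in a few lines of Python. The grammar below (S -> S S with probability 0.3, S -> a with probability 0.7) is a hypothetical example in Chomsky normal form, chosen to be ambiguous so that a string can have several derivations. The function sums, over all derivations, the product of the rule probabilities, using the standard inside (CYK-style) recursion. It covers SCFGs only; the failure handling that distinguishes SLPs is not modeled.

```python
from collections import defaultdict

# Hypothetical SCFG in Chomsky normal form; the probabilities of the rules
# for S sum to 1. The grammar is ambiguous on purpose.
BINARY_RULES = {("S", ("S", "S")): 0.3}   # S -> S S
LEXICAL_RULES = {("S", "a"): 0.7}         # S -> a


def string_probability(tokens, start="S"):
    """Sum over all derivations of the product of the rule probabilities."""
    n = len(tokens)
    if n == 0:
        return 0.0
    # inside[(i, j)][A] = probability that nonterminal A derives tokens[i:j]
    inside = defaultdict(lambda: defaultdict(float))
    for i, tok in enumerate(tokens):
        for (lhs, terminal), p in LEXICAL_RULES.items():
            if terminal == tok:
                inside[(i, i + 1)][lhs] += p
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            for k in range(i + 1, j):          # split point
                for (lhs, (left, right)), p in BINARY_RULES.items():
                    inside[(i, j)][lhs] += (
                        p * inside[(i, k)][left] * inside[(k, j)][right]
                    )
    return inside[(0, n)][start]
```

For the string "aaa" there are two derivations (left- and right-branching), each using S -> S S twice and S -> a three times, so the result is 2 x 0.3^2 x 0.7^3.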
A semantics-independent approach appears pragmatic and flexible: is there not
the problem that a formalism with possible-worlds semantics cannot model probability
distributions over spaces other than possible worlds? In fact, the distinction
between the two approaches is not so fundamental since, with a little imagination,
any probability distribution can be viewed as one over some set of possible worlds.
Conversely, having possible-worlds semantics certainly does not stop a formalism
being applicable to real-world problems.
Moreover, imposing a possible-world semantics on a formalism can provide a
useful bridge to related formalisms. For example, Cussens [9] provides a possible-worlds
semantics to SLPs by translating SLPs into PRISM programs, the latter
already having possible-world semantics. This amounts to mapping each proof to
a possible world. A characterization of the sort of possible-world distributions thus
defined is given by Sato and Kameya [40]. The connection between PHA/ICL and
PRISM can then be used to connect SLPs with PHA/ICL.
9.3
Inference
Having defined a probability distribution in a logic-based formalism there remains
the problem of computing probabilities to answer specific queries, such as "What's
the probability that Tweety flies?" This problem is generally known as inference
and the term is particularly apposite for a logic-based formalism, since for such
formalisms it is possible to exploit nonprobabilistic logical inference to perform
complex probabilistic computations. Here we will only consider inference for those
formalisms (such as PHA/ICL and PRISM) which use logical implication to define
probability distributions, since in such cases normal first-order inference can be
used particularly directly to compute probabilities.
Consider, first, standard logical inference, using the first-order logical theory H
in (9.8) by way of example. H defines possible output sequences (via the hmm/2
predicate) for a hidden Markov model (HMM) whose parameters are yet to be
defined. The HMM has two states (s0 and s1) and two symbols in its output
alphabet (a and b). Both states can emit both symbols and all four possible state
transitions are possible. The only formula of any interest is the second which uses
the cons function symbol to encode a nonempty sequence.
$\forall s : \mathit{hmm}(s, \mathit{null})$
(9.8)
values(tr(S),[s0,s1,stop]).
:- set_sw(tr(s0),0.3+0.4+0.3).
:- set_sw(tr(s1),0.4+0.1+0.5).
values(out(S),[a,b]).
:- set_sw(out(s0),0.3+0.7).
:- set_sw(out(s1),0.6+0.4).
hmm(S,[X|Y]) :-
    msw(out(S),X),      % emit symbol X from state S
    msw(tr(S),T),       % choose the next state T (or stop)
    (
      T == stop, Y = []
    ;
      T \= stop, hmm(T,Y)
    ).
Figure 9.2
| ?- probf(hmm(s0,[a,b,a])).
hmm(s0,[a,b,a])
<=> hmm(s0,[b,a]) & msw(out(s0),1,a) & msw(tr(s0),1,s0)
v hmm(s1,[b,a]) & msw(out(s0),1,a) & msw(tr(s0),1,s1)
hmm(s0,[b,a])
<=> hmm(s0,[a]) & msw(out(s0),2,b) & msw(tr(s0),2,s0)
v hmm(s1,[a]) & msw(out(s0),2,b) & msw(tr(s0),2,s1)
hmm(s1,[b,a])
<=> hmm(s0,[a]) & msw(out(s1),3,b) & msw(tr(s1),3,s0)
v hmm(s1,[a]) & msw(out(s1),3,b) & msw(tr(s1),3,s1)
hmm(s0,[a])
<=> msw(out(s0),a) & msw(tr(s0),4,stop)
hmm(s1,[a])
<=> msw(out(s1),a) & msw(tr(s1),4,stop)
yes
| ?- prob(hmm(s0,[a,b,a]),P).
P = 0.012429?
yes
Figure 9.3 Using abduction to compute a probability. The PRISM output has
been altered so that the second argument on msw/3 is explicit.
is true. Note that these formulae are guaranteed to be mutually exclusive since
msw(tr(s0), 1, s0) and msw(tr(s0), 1, s1) are defined to be alternatives. The next
three lines state when hmm(s0, [b, a]) is true, and so on. It is not difficult to
see that there are exactly four mutually exclusive conjunctions of msw/3 facts
which (together with the rules) entail hmm(s0, [a, b, a]), one for each choice of the
second and third states. For example, one of
these conjunctions is msw(out(s0), 1, a), msw(tr(s0), 1, s0), msw(out(s0), 2, b),
msw(tr(s0), 2, s0), msw(out(s0), 3, a), msw(tr(s0), 4, stop). Since the msw/3
probabilistic facts are defined as independent, the probability of each conjunction
is simply a product of the probabilities of the conjuncts. The probability of
hmm(s0, [a, b, a]) is just the sum of these four products, which, as figure 9.3
shows, happens to be 0.012429. Naturally, the sum is computed by dynamic programming
similar to the variable elimination algorithm used in Bayesian networks.
Sophisticated logic programming tabling technology can be exploited to do this
elegantly and efficiently.
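The computation can be reproduced outside PRISM with a short Python sketch. The switch distributions are copied from the program in figure 9.2; everything else (the function names, and the use of memoization as a stand-in for tabling) is an assumption made for illustration and is not PRISM's implementation. `hmm_prob` is the dynamic-programming version, while `explanations` enumerates the mutually exclusive explanations by brute force, one per state path, so that the two results can be compared. Both functions assume a nonempty tuple of symbols.

```python
from functools import lru_cache
from itertools import product

# Switch distributions copied from the PRISM program in figure 9.2.
OUT = {"s0": {"a": 0.3, "b": 0.7}, "s1": {"a": 0.6, "b": 0.4}}
TR = {"s0": {"s0": 0.3, "s1": 0.4, "stop": 0.3},
      "s1": {"s0": 0.4, "s1": 0.1, "stop": 0.5}}
STATES = ("s0", "s1")


@lru_cache(maxsize=None)
def hmm_prob(state, symbols):
    """P(hmm(state, symbols)); subgoals are memoized, loosely like tabling."""
    head, rest = symbols[0], symbols[1:]
    if not rest:
        # Last symbol: emit it, then the transition switch must pick stop.
        return OUT[state][head] * TR[state]["stop"]
    return OUT[state][head] * sum(
        TR[state][nxt] * hmm_prob(nxt, rest) for nxt in STATES
    )


def explanations(state, symbols):
    """Enumerate (state path, probability) pairs, one per explanation."""
    result = []
    for tail in product(STATES, repeat=len(symbols) - 1):
        path = (state,) + tail
        p = 1.0
        for t, sym in enumerate(symbols):
            p *= OUT[path[t]][sym]
            if t + 1 < len(path):
                p *= TR[path[t]][path[t + 1]]
            else:
                p *= TR[path[t]]["stop"]
        result.append((path, p))
    return result


if __name__ == "__main__":
    seq = ("a", "b", "a")
    print(hmm_prob("s0", seq))
    print(sum(p for _, p in explanations("s0", seq)))
```

The brute-force enumeration grows exponentially with the sequence length, whereas the memoized recursion evaluates each (state, suffix) pair once, which is the saving that tabling provides.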
It should be stressed that this example of probabilistic inference was able to
exploit the restrictions on PRISM programs that all the probabilistic ground atoms
in the body of each clause are probabilistically independent and the clauses defining
a probabilistic predicate are probabilistically exclusive [42], as well as the CWA. In
other cases inference is much harder. For example, inference in PLP requires linear
programming to deal with the inequalities involved and the linear program LP(P)
needed for a PLP program P contains exponentially many linear programming
variables w.r.t. the size of the Herbrand base of P [29]. (The Herbrand base is the
set of all ground atomic formulae expressible in the language used to define P.)
Naturally, one option for hard inference problems is to resort to approximate
methods. For example, Angelopoulos and Cussens [1], used an SLP to represent
a prior probability distribution over classication trees similarly to the way that
the PRISM program above dened a distribution over HMM outputs. Since they
adopt a Bayesian approach, learning reduces to probabilistic inference and so the
key problem is to compute posterior probabilities: probabilities conditional on the
observed data. Using exact inference to compute such probabilities (for example,
the posterior class distribution for a test example) seems a hopeless task, so instead,
the Metropolis-Hastings algorithm is used to sample from the posterior and thus
to produce approximations to the desired probabilities.
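To make the idea of sampling-based approximation concrete, here is a minimal, generic Metropolis-Hastings sketch in Python. It is not the sampler of Angelopoulos and Cussens [1]: the three "models" and their unnormalized posterior weights are hypothetical, and the proposal is simply uniform over a small finite set. The point is only that relative frequencies in the sample approximate posterior probabilities without the normalizing constant ever being computed.

```python
import random


def metropolis_hastings(unnormalized, states, n_samples, seed=0):
    """Sample from a distribution known only up to a constant factor.

    `unnormalized(state)` returns a positive weight; proposals are drawn
    uniformly from `states`, which makes the proposal symmetric.
    """
    rng = random.Random(seed)
    current = states[0]
    samples = []
    for _ in range(n_samples):
        proposal = rng.choice(states)
        # Acceptance probability for a symmetric proposal.
        accept = min(1.0, unnormalized(proposal) / unnormalized(current))
        if rng.random() < accept:
            current = proposal
        samples.append(current)
    return samples


# Hypothetical unnormalized posterior weights over three candidate models.
WEIGHTS = {"t1": 1.0, "t2": 2.0, "t3": 7.0}
```

Calling `metropolis_hastings(WEIGHTS.get, list(WEIGHTS), 50000)` and counting how often each model occurs gives estimates of the normalized posterior, here roughly 0.1, 0.2, and 0.7.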
9.4 Learning
Having considered how probability-logic formalisms represent probability distributions and how inference can be used to compute probabilities of interest, we can
now turn to the issue of learning a model from data. As always, we consider the
observed data as a sample generated by some unknown true model whose identity we wish to learn (or rather estimate). In some cases only the parameters of
the model are unknown. In the general case both the structure and parameters of
the true model are unknown. These two cases are considered in section 9.4.1 and
section 9.4.2, respectively.
The paper by De Raedt and Kersting [10] provides an excellent overview of
learning in probability-logic formalisms. The current section is complementary to
De Raedt and Kersting's broad survey since (1) for the sake of concreteness it
examines parameter estimation in some detail for a particular formalism (PRISM),
and (2) it discusses the use of probabilities in pre-SRL inductive logic programming
(ILP). An examination of pre-SRL ILP is useful since it seems likely that some of
the techniques found there may be useful for more recent formalisms.
Before focusing on these two areas it is worth mentioning two key points about
learning in probability-logic formalisms which are provided by De Raedt and
Kersting [10].
Much of the machinery for learning Bayesian networks can be used to learn
directed probability-logic models such as BLPs. When first-order clauses are used
for the structure of a directed model, then specialization and generalization of
clauses corresponds to using a macro-operator for adding and deleting arcs in a
Bayesian network. Parameter estimation for logical directed models corresponds
to parameter estimation with tied parameters in a normal Bayesian network.
This is because one first-order clause typically represents a collection of network
fragments in the underlying Bayesian network via its ground instances.
The probabilistic models associated with a PHA/ICL, PRISM, or SLP model
depend on parameters associated with many predicates in the underlying logic
program. This means that structure learning for such models is, in general, at least
as hard as multiple-predicate learning / theory revision in ILP, which is known
to be a hard problem. However, there exists work on learning SLPs in a restricted
setting [27], and also work on applying grammatical inference techniques to learn
SLPs [3].
As for other types of statistical inference, the key to learning in probability-logic
formalisms is the likelihood function: the probability of the observed data as a
function of the model. If the structure is fixed, then the likelihood is just a function
of the model parameters. If a probability-logic formalism defines a distribution
over possible worlds, then ideally we would like the data to be a collection of
independent observations of possible worlds, each viewed as a sample drawn from
the unknown true model. The probability of each world can then be computed
using inference (section 9.3) and the likelihood of the data is just a product of these
probabilities. In the case of alternative-based formalisms like PHA/ICL, PRISM,
and SLPs, each data point would then be associated with a unique conjunction of
atomic choices. Maximum likelihood estimation of the multinomial distribution over
each alternative is then possible by simple counting in the normal way. A Bayesian
approach using Dirichlet priors is equally simple.
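The "simple counting" for fully observed data can be sketched as follows. This is an illustrative Python fragment, not PRISM code: the function name `estimate_switch` and the observed outcomes are hypothetical. With `alpha = 0` it returns the maximum likelihood estimate for one alternative (one multinomial); with `alpha > 0` it returns the posterior mean under a symmetric Dirichlet prior with that pseudo-count.

```python
from collections import Counter


def estimate_switch(observed_outcomes, outcomes, alpha=0.0):
    """Estimate the distribution over `outcomes` of a single alternative.

    alpha = 0   : maximum likelihood estimate (relative frequencies).
    alpha > 0   : posterior mean under a symmetric Dirichlet(alpha) prior.
    """
    counts = Counter(observed_outcomes)
    total = len(observed_outcomes) + alpha * len(outcomes)
    return {o: (counts[o] + alpha) / total for o in outcomes}
```

For example, if the atomic choices observed for a transition alternative were s0, s1, s1, stop, the maximum likelihood estimate is (0.25, 0.5, 0.25), while a Dirichlet pseudo-count of 1 pulls the estimate toward uniform, giving (2/7, 3/7, 2/7).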
However, in many cases each observation is a ground atomic formula, which is
true in many worlds. This means the data-generating process is best viewed in
terms of missing data: the true model generates a world, but we do not get to
see this world, only some ground atomic formula that is true in it. The rest of the
information required to determine the sampled world is missing. Unsurprisingly,
the expectation maximization (EM) algorithm is generally used in such situations.
hmm_out(X) :- hmm(s0,1,X).
hmm(S,N,[X|Y]) :-
    msw(out(S),X),
    msw(tr(S),T),
    (
      T == stop, Y = []
    ;
      T \= stop, NN is N+1, hmm(T,NN,Y)
    ).
Figure 9.4 PRISM program such that only one ground instance of hmm_out/1 is
true in each possible world.
An alternative approach is presented by Kok and Domingos [24]. Here the data
is contained in a (multitable) relational database. Each row in each table defines
a ground atomic formula in the usual way. The entire database is equivalent to a
conjunction of all these ground atomic formulae (so it is equivalent to a Datalog
Prolog program). Using the CWA, this defines a unique world: all formulae which
are not consequences of the conjunction are deemed false. (This unique world is the
minimal Herbrand model of the associated Prolog program.) So, on the one hand
we have only a single observation, but on the other it is an entire world that is
observed. See Kok and Domingos [24] and chapter 11 of this book for further details.
9.4.1 Parameter Estimation
| ?- learn([hmm_out([a,b,a,a]),hmm_out([b,b]),hmm_out([a,a,b])]).
..
Finished learning
Number of iterations: 13.
Final likelihood:-9.440724
Total learning time: 0.01 seconds.
All solution search time: 0.01 seconds.
Total table space used: 6304 out of 240000000 bytes
Type show_sw to show the probability distributions.
yes
| ?- show_sw
Switch tr(s1): unfixed: s0 (0.235899) s1 (0.000014) stop (0.764086)
Switch out(s1): unfixed: a (0.254708) b (0.745291)
Switch tr(s0): unfixed: s0 (0.226177) s1 (0.773818) stop (0.000003)
Switch out(s0): unfixed: a (0.788360) b (0.211639)
Figure 9.5 EM learning with PRISM. (I have edited the output to reduce the
precision of parameters.)
Figure 9.5 shows a run of PRISM using the EM algorithm to estimate the parameters
of the HMM using a data set of three examples. The algorithm was initialized with
the values shown in figure 9.2. Abduction is used once to produce the data structure
shown in figure 9.3. This can then be used in each iteration of the EM algorithm
to compute the expected values required in the E-step of the EM algorithm.
So far a very simple and hopefully familiar example, HMM parameter estimation,
has been used to explain parameter estimation in PRISM. Of course, there is no
pressing reason to use SRL for this problem. The whole point of SRL is to address
problems outside the remit of more standard approaches. So now consider a simple
elaboration of the HMM learning problem which highlights some of the flexibility
of a logic-based approach. Suppose that the HMM is constrained so that not all
outputs from the HMM are permitted. This amounts to altering the definition of
hmm_out/1 to
hmm_out(X) :- hmm(s0,1,X), constraint(X),    (9.9)
where the predicate constraint/1 is any predicate which can be defined using
(clausal) first-order logic. Now there will be worlds in which no ground instance
of the target predicate is true: these worlds will not be associated with a possible
data point. To take a very simple example, if constraint/1 were defined thus:
constraint(X) :- X = [Y,Y|Z],
then the possible world illustrated by figure 9.3 would no longer entail hmm_out([a, b, a]),
since the first two elements differ. The distribution over ground instances of
hmm_out/1 is now a conditional one: conditional on the logically defined constraint being satisfied. This turns out to be an exponential-family distribution
where the partition function Z is the probability that the constraint is true in a
world sampled from the original, unconditional distribution.
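The role of the partition function Z can be illustrated with a small Python sketch. The unconditional distribution below is a hypothetical finite table over output lists (a real HMM defines a distribution over infinitely many lists), and `first_two_equal` mirrors the example definition of constraint/1 given above. Z is the total unconditional probability of the worlds satisfying the constraint, and the conditional distribution is obtained by renormalizing by Z.

```python
def condition_on_constraint(distribution, constraint):
    """Return (Z, conditional distribution) for worlds satisfying `constraint`."""
    z = sum(p for world, p in distribution.items() if constraint(world))
    conditional = {world: p / z
                   for world, p in distribution.items() if constraint(world)}
    return z, conditional


def first_two_equal(outputs):
    """Python analogue of constraint(X) :- X = [Y,Y|Z]."""
    return len(outputs) >= 2 and outputs[0] == outputs[1]


# Hypothetical unconditional probabilities of four output lists (sum to 1).
UNCONDITIONAL = {
    ("a",): 0.2,
    ("a", "a"): 0.3,
    ("a", "b", "a"): 0.1,
    ("b", "b", "a"): 0.4,
}
```

With these numbers Z = 0.7, the list [a, b, a] receives no conditional probability, and [a, a] has conditional probability 0.3/0.7.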
It is still possible to use the EM algorithm to search for maximum likelihood
estimates of the parameters of such a distribution; it is just that the generative
characterization of this conditional distribution is more complicated. We assume,
as always, that worlds are sampled from the true underlying distribution. If no
ground instance of the target predicate is true in a sampled world, then that world
is rejected and no data point is generated; otherwise the unique ground instance of
the target predicate which is true in that world is added to the data. Viewing the
observed data as being generated in this fashion we have an extra sort of missing
data: the worlds which were entirely rejected. So the data is now truncated data.
Fortunately, Dempster et al. [13] show that the EM algorithm is applicable even
when the data has been truncated like this. The method was applied to SLPs by
Cussens [7] under the name failure-adjusted maximization (FAM) and is used in
the most recent version of the PRISM system [41].
9.4.2 Structure Learning
tasks akin to program synthesis. For example, figure 9.6 shows a clause induced
using the ALEPH ILP system (from example input that comes with that system
to demonstrate ILP learning of classification trees). The clause probabilistically
classifies days according to whether they are suitable for playing or not by the
simple expedient of putting probability in the background. No negative examples
are used to induce such a rule: the key is to declare that the class B is an output to
be computed from the day A which is an input. In the ALEPH and Progol systems
this is done with the declaration in figure 9.7.
:- modeh(1,class(+day,-class)).
Figure 9.7 ALEPH declaration that the class/2 predicate takes an input (indicated
by the +) of type day and generates an output (indicated by the -) of type class.
It is possible to use an ILP algorithm to search for rules and then build some
probabilistic model from these rules afterward. One option is to use a combining
rule to compute probabilities for test examples entailed by more than one induced
rule. (See chapter 10 for further details on combining rules.) Pompe and Kononenko
[34] use a naive Bayes model to combine first-order classification rules, with a later
approach splitting induced first-order rules to better approximate the naive Bayes
assumption [35]. This work is an example of the often used technique of viewing
first-order rules (or parts of rules) as features for a nonlogical probabilistic model.
If induced rules are going to be used eventually as the structural component of
a probabilistic model, then naturally it is better that the algorithm searching for
rules is designed to find rules suitable for this purpose.
A more thoroughly probabilistic approach is to use ILP techniques as subroutines
in an algorithm that directly learns a probabilistic model from data. This is the
approach taken by Dehaspe [11] with his MACCENT algorithm. The goal of
MACCENT is to learn a conditional distribution giving a distribution over classes
(C) for any given example (I). MACCENT uses the ILP framework of learning
from interpretations where each example I is a Prolog program. There are thus
connections to the approach of Kok and Domingos [24] mentioned earlier. The
9.5 Conclusion
In this chapter we have looked at the "big three" issues of representation, inference,
and learning for probability-logic models, with a focus on representation. What
is exciting about the current interest in SRL is that techniques for all three of
these (often originating from different communities) are coming together to produce
powerful techniques for learning from structured and relational data. (It is worth
noting that there are initiatives with similar goals originating from the statistical
community [16], although there logical approaches are not currently used.) The
applications which involve such data are many: almost any real-world
problem for which standard ILP is a reasonable choice (and many more besides) is
also a target for SRL. To take just three recent examples, Frasconi et al. [15] applied
a declarative kernel approach to (1) predicting mutagenicity, (2) information
extraction, and (3) prediction of mRNA signal structure; Lodhi and Muggleton [26]
applied failure-adjusted maximization to learn SLPs to model metabolic pathways;
and Riedel and Klein [38] learnt MLNs based on discourse representation structures
of a sentence to extract gene-protein interactions from annotated Medline abstracts.
References
[1] N. Angelopoulos and J. Cussens. Exploiting informative priors for Bayesian
classification and regression trees. In Proceedings of the International Joint
Conference on Artificial Intelligence, 2005.
[2] A. Arnauld and P. Nicole. Port-Royal Logic. Translated by Bobbs-Merrill,
Indianapolis, IN, 1964.
[20] Manfred Jaeger. Relational Bayesian networks. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 1997.
[21] A. Karalic and I. Bratko. First order regression. Machine Learning, 26(2-3):
147-176, 1997. ISSN 0885-6125.
[22] K. Kersting and L. De Raedt. Bayesian logic programs. Technical Report
151, University of Freiburg, Freiburg, Germany, April 2001.
[23] J. Keynes. A Treatise on Probability. Macmillan, London, 1921.
[24] S. Kok and P. Domingos. Learning the structure of Markov logic networks.
In Proceedings of the International Conference on Machine Learning, 2005.
[25] J. Lloyd. Foundations of Logic Programming. Springer, Berlin, second edition,
1987.
[26] H. Lodhi and S. Muggleton. Modelling metabolic pathways using stochastic
logic programs-based ensemble methods. In Proceedings of the International
Conference on Computational Methods in System Biology, 2004.
[27] S. Muggleton. Learning the structure and parameters of stochastic logic
programs. In Inductive Logic Programming, 2002.
[28] S. Muggleton. Stochastic logic programs. In L. De Raedt, editor, Advances in
Inductive Logic Programming, volume 32 of Frontiers in Artificial Intelligence
and Applications, pages 254-264. IOS Press, Amsterdam, 1996.
[29] R. Ng and V.S. Subrahmanian. A semantical framework for supporting
subjective and conditional probabilities in deductive databases. Journal of
Automated Reasoning, 10(2):191-235, 1993.
[30] R. Ng and V.S. Subrahmanian. Probabilistic logic programming. Information
and Computation, 101(2):150-201, 1992.
[31] L. Ngo and P. Haddawy. Answering queries from context-sensitive probabilistic knowledge bases. Theoretical Computer Science, 171:147-171, 1997.
[32] N. Nilsson. Probabilistic logic. Artificial Intelligence, 28:71-87, 1986.
[33] Gordon D. Plotkin. A note on inductive generalization. Machine Intelligence,
5:153-163, 1970.
[34] U. Pompe and I. Kononenko. Naive Bayesian classifier within ILP-R. In
Inductive Logic Programming, 1995.
[35] U. Pompe and I. Kononenko. Probabilistic first-order classification. In
Inductive Logic Programming, 1997.
[39] T. Sato. A statistical learning method for logic programs with distribution
semantics. In Inductive Logic Programming, 1995.
[40] T. Sato and Y. Kameya. Parameter learning of logic programs for symbolic-statistical modeling. Journal of Artificial Intelligence Research, 15:391-454,
2001.
[41] T. Sato, Y. Kameya, and N. Zhou. Generative modeling with failure in
PRISM. In Proceedings of the International Joint Conference on Artificial
Intelligence, 2005.
[42] N. Zhou, T. Sato, and Y. Kameya. A Reference Guide to PRISM Version
1.7, March 2004.
10.1 Introduction
In recent years, there has been a significant interest in integrating probability theory with first-order logic and relational representations (see De Raedt and Kersting
[5] for an overview). Muggleton [30] and Cussens [4] have upgraded stochastic grammars toward stochastic logic programs, Sato and Kameya [42] have introduced probabilistic distributional semantics for logic programs, and Domingos and Richardson
[9] have upgraded Markov networks toward Markov logic networks. Another research stream, including Poole's independent choice logic [38], Ngo and Haddawy's
Probabilistic-Logic Programs [34], Jaeger's relational Bayesian networks [17], and
Pfeffer's probabilistic relational models [37], concentrates on first-order logical and
relational extensions of Bayesian networks.
Bayesian networks [36] are one of the most important, efficient, and elegant
frameworks for representing and reasoning with probabilistic models. They have
been applied to many real-world problems in diagnosis, forecasting, automated
vision, sensor fusion, and manufacturing control [16]. A Bayesian network specifies
a joint probability distribution over a finite set of random variables and consists of
two components:
1. a qualitative or logical one that encodes the local influences among the random
variables using a directed acyclic graph, and
2. a quantitative one that encodes the probability densities over these local influences.
Despite these interesting properties, Bayesian networks also have a major limitation:
they are essentially propositional representations. Indeed, imagine modeling
the localization of genes/proteins as was the task at the KDD Cup 2001 [3]. When
using a Bayesian network, every gene is a single random variable. There is no way of
formulating general probabilistic regularities among the localizations of the genes
such as
the localization L of gene G is influenced by the localization L' of another gene
G' that interacts with G.
The propositional nature and limitations of Bayesian networks are similar to those
of traditional attribute-value learning techniques, which have motivated a lot of
work on upgrading these techniques within inductive logic programming. This in
turn also explains the interest in upgrading Bayesian networks toward using first-order logical representations.
Bayesian logic programs unify Bayesian networks with logic programming, which
allows the propositional character of Bayesian networks and the purely logical
nature of logic programs to be overcome. From a knowledge representation point of
view, Bayesian logic programs can be distinguished from alternative frameworks by
having logic programs (i.e., definite clause programs, which are sometimes called
"pure" Prolog programs), as well as Bayesian networks, as an immediate special
case. This is realized through the use of a small but powerful set of primitives.
Indeed, the underlying idea of Bayesian logic programs is to establish a one-to-one
mapping between ground atoms and random variables, and between the immediate
consequence operator and the direct influence relation. Therefore, Bayesian logic
programs can also handle domains involving structured terms as well as continuous
random variables.
In addition to reviewing Bayesian logic programs, this chapter
contributes a graphical representation for Bayesian logic programs;
describes its implementation in the Bayesian logic programs tool Balios; and
shows how purely logical predicates as well as aggregate functions are employed
within Bayesian logic programs.
Figure 10.1 The graphical structure of a Bayesian network modeling the inheritance of blood types within a particular family.
10.2 Bayesian Networks
influence among the random variables. It represents the joint probability distribution
P(x1 , . . . , xn ) over a fixed, finite set {x1 , . . . , xn } of random variables. Each random
variable xi possesses a finite set S(xi ) of mutually exclusive states. Figure 10.1
shows the graph of a Bayesian network modeling our blood type example for a particular family. The familial relationship, which is taken from Jensen's stud farm example [19], forms the basis for the graph. The network encodes, e.g., that Dorothy's
blood type is influenced by the genetic information of her parents Ann and Brian.
The set of possible states of bt(dorothy) is S(bt(dorothy)) = {a, b, ab, 0}; the
sets of possible states of pc(dorothy) and mc(dorothy) are S(pc(dorothy)) =
S(mc(dorothy)) = {a, b, 0}. The same holds for ann and brian. The direct predecessors of a node x, the parents of x, are denoted by Pa(x). For instance,
Pa(bt(ann)) = {pc(ann), mc(ann)}.
A Bayesian network stipulates the following conditional independence assumption.
Proposition 10.1 Independence Assumption of Bayesian Networks
Each node xi in the graph is conditionally independent of any subset A of nodes
that are not descendants of xi given a joint state of Pa(xi ), i.e.,
P(xi | A, Pa(xi )) = P(xi | Pa(xi )) .
For example, bt(dorothy) is conditionally independent of bt(ann) given a joint
state of its parents {pc(dorothy), mc(dorothy)}. Any pair (xi , Pa(xi )) is called the
family of xi , denoted as Fa(xi ); e.g., bt(dorothy)'s family is
(bt(dorothy), {pc(dorothy), mc(dorothy)}) .
Because of the conditional independence assumption, we can write down the joint
probability density as
$$P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \mathrm{Pa}(x_i))$$
by applying the independence assumption 10.1 to the chain rule expression of the
joint probability distribution. Thereby, we associate with each node xi of the graph
the conditional probability distribution P(xi | Pa(xi )), denoted as cpd(xi ). The
conditional probability distributions in our blood type domain are:
[Tables of conditional probability distributions: P(bt(dorothy)) given mc(dorothy) and pc(dorothy); P(mc(dorothy)) given mc(ann) and pc(ann); P(mc(ann)).]
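The factorization of the joint distribution into one conditional distribution per node can be checked on a tiny fragment of the blood type network. The Python sketch below is an illustration under assumptions: it covers only the three nodes mc(ann), pc(ann), and bt(ann); the prior over genes is hypothetical; and the conditional distribution for bt(ann) places 0.97 on the genetically expected blood type and 0.01 on each other type, a hypothetical pattern chosen for illustration.

```python
from itertools import product

GENES = ["a", "b", "0"]
BLOOD_TYPES = ["a", "b", "ab", "0"]
# Parents of each node in the graph (a three-node fragment only).
PARENTS = {"mc(ann)": [], "pc(ann)": [], "bt(ann)": ["mc(ann)", "pc(ann)"]}


def expected_blood_type(m, p):
    """Blood type determined by the two inherited genes."""
    genes = {m, p} - {"0"}
    if not genes:
        return "0"
    if genes == {"a", "b"}:
        return "ab"
    return genes.pop()


def cpd(node, value, parent_values):
    """P(node = value | parents = parent_values); all numbers are hypothetical."""
    if node in ("mc(ann)", "pc(ann)"):
        return {"a": 0.3, "b": 0.3, "0": 0.4}[value]
    return 0.97 if value == expected_blood_type(*parent_values) else 0.01


def joint(assignment):
    """Product over all nodes of P(node | parents), as in the factorization."""
    p = 1.0
    for node, parents in PARENTS.items():
        p *= cpd(node, assignment[node], [assignment[q] for q in parents])
    return p
```

Summing `joint` over all 3 x 3 x 4 joint states gives 1, which is what the factorization guarantees whenever every local table is a proper conditional distribution.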
10.2.2 Logic Programs
To introduce logic programs, consider figure 10.2, containing two programs, grandparent and nat. Formally speaking, we have that grandparent/2, parent/2, and
nat/1 are predicates (with their arity, i.e., number of arguments, listed explicitly). Furthermore, jef, paul, and ann are constants and X, Y, and Z are variables. All constants and variables are also terms. In addition, there exist structured terms, such as s(X), which contains the functor s/1 of arity 1 and the
term X. Constants are often considered as functors of arity 0. Atoms are predicate symbols followed by the necessary number of terms, e.g., parent(jef, paul),
nat(s(X)), parent(X, Z), etc. We are now able to define the key concept of a (definite) clause. Clauses are formulae of the form A :- B1 , . . . , Bm where A and the
Bi are logical atoms and all variables are understood to be universally quantified. For example, the clause grandparent(X, Y) :- parent(X, Z), parent(Z, Y) can
be read as "X is the grandparent of Y if X is a parent of Z and Z is a parent of
Y." Let us call this clause c. We call grandparent(X, Y) the head(c) of this clause,
and parent(X, Z), parent(Z, Y) the body(c). Clauses with an empty body, such as
parent(jef, paul), are called facts. A (definite) clause program (or logic program for
short) consists of a set of clauses. In figure 10.2, there are thus two logic programs,
one defining grandparent/2 and one defining nat/1.
parent(jef,paul).
parent(paul,ann).
grandparent(X,Y) :- parent(X,Z), parent(Z,Y).

nat(0).
nat(s(X)) :- nat(X).
Figure 10.2 Two logic programs, grandparent and nat.
all occurrences of the variables Vi are simultaneously replaced by the term ti , e.g.,
cθ is grandparent(ann, Y) :- parent(ann, Z), parent(Z, Y).
The Herbrand base of a logic program T , denoted as HB(T ), is the set of all
ground atoms constructed with the predicate, constant, and function symbols in
the alphabet of T . For example, HB(nat) = {nat(0), nat(s(0)), nat(s(s(0))), ...}
and
HB(grandparent) =
{parent(ann, ann), parent(jef, jef),
parent(paul, paul), parent(ann, jef), parent(jef, ann), ...,
grandparent(ann, ann), grandparent(jef, jef), ...}.
A Herbrand interpretation for a logic program T is a subset of HB(T ). The least
Herbrand model LH(T ) (which constitutes the semantics of the logic program)
consists of all facts f ∈ HB(T ) such that T logically entails f , i.e., T |= f .
Various methods exist to compute the least Herbrand model. We merely sketch its
computation through the use of the well-known immediate consequence operator $T_B$.
The operator $T_B$ is the function on the set of all Herbrand interpretations of B such
that for any such interpretation I we have
$$T_B(I) = \{A\theta \mid \text{there is a substitution } \theta \text{ and a clause } A \text{ :- } A_1, \ldots, A_n \text{ in } B \text{ such that } A\theta \text{ :- } A_1\theta, \ldots, A_n\theta \text{ is ground and for } i = 1, \ldots, n: A_i\theta \in I\}.$$
Now, for range-restricted clauses, the least Herbrand model can be obtained using
the following procedure:
1: Initialize LH := ∅
2: repeat
3:     LH := $T_B$(LH)
4: until LH does not change anymore
At this point the reader may want to verify that LH(nat) = HB(nat) and
LH(grandparent) =
{parent(jef, paul), parent(paul, ann), grandparent(jef, ann)}.
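The fixpoint procedure can be tried out with a short Python sketch. It rests on simplifying assumptions that are not part of the text: atoms are tuples such as ("parent", "jef", "paul"), variables are strings starting with an uppercase letter, clauses are range-restricted, and structured terms such as s(X) are not supported (so the nat program, whose least Herbrand model is infinite, cannot be run). The function names are hypothetical.

```python
def is_var(term):
    """Variables are written with an initial uppercase letter, as in Prolog."""
    return term[:1].isupper()


def match(atom, fact, subst):
    """Extend `subst` so that `atom` becomes the ground `fact`, else None."""
    if atom[0] != fact[0] or len(atom) != len(fact):
        return None
    subst = dict(subst)
    for term, value in zip(atom[1:], fact[1:]):
        if is_var(term):
            if subst.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return subst


def immediate_consequences(program, interpretation):
    """One application of the immediate consequence operator."""
    derived = set()
    for head, body in program:
        substs = [{}]
        for atom in body:  # join the body atoms against the interpretation
            substs = [s2 for s in substs for fact in interpretation
                      if (s2 := match(atom, fact, s)) is not None]
        for s in substs:
            derived.add((head[0],) + tuple(s.get(t, t) for t in head[1:]))
    return derived


def least_herbrand_model(program):
    """Iterate the operator from the empty set until nothing changes."""
    lh = set()
    while True:
        new = immediate_consequences(program, lh)
        if new == lh:
            return lh
        lh = new


GRANDPARENT = [
    (("parent", "jef", "paul"), []),
    (("parent", "paul", "ann"), []),
    (("grandparent", "X", "Y"), [("parent", "X", "Z"), ("parent", "Z", "Y")]),
]
```

Running `least_herbrand_model(GRANDPARENT)` yields the two parent facts after the first iteration and adds grandparent(jef, ann) in the second, after which the set no longer changes.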
10.3
1. Haddawy [14] and Langley [27] have a similar view on Bayesian networks. For instance,
Langley does not represent Bayesian networks graphically but rather uses the notation of
propositional definite clause programs.
pc(ann).
pc(brian).
mc(ann).
mc(brian).
mc(dorothy) :- mc(ann), pc(ann).
pc(dorothy) :- mc(brian), pc(brian).
bt(ann) :- mc(ann), pc(ann).
bt(brian) :- mc(brian), pc(brian).
bt(dorothy) :- mc(dorothy), pc(dorothy).
Figure 10.3 A propositional clause program encoding the structure of the blood
type Bayesian network in figure 10.1.
the structure of the blood type Bayesian network in figure 10.1. Observe that the
random variables in the Bayesian network correspond to logical atoms. Furthermore, the direct influence relation corresponds to the immediate consequence operator. Now, imagine another totally separated family, which could be described by a
similar Bayesian network. The graphical structure and associated conditional probability distribution for the two families are controlled by the same intensional regularities. But these overall regularities cannot be captured by a traditional Bayesian
network. So we need a way to represent these overall regularities.
Because this problem is akin to that with propositional logic and the structure
of Bayesian networks can be represented using propositional clauses, the approach
taken in Bayesian logic programs is to upgrade these propositional clauses encoding
the structure of the Bayesian network to proper first-order clauses.
10.3.1 Representation Language
Applying the above-mentioned idea leads to the central notion of a Bayesian clause.
Definition 10.2 Bayesian Clause
A Bayesian (definite) clause c is an expression of the form A | A1 , . . . , An where
n ≥ 0, the A, A1 , . . . , An are Bayesian atoms (see below) and all Bayesian atoms
are (implicitly) universally quantified. When n = 0, c is called a Bayesian fact and
expressed as A.
So the differences between a Bayesian clause and a logical clause are:
1. the atoms p(t1 , . . . , tl ) and predicates p/l arising are Bayesian, which means that
they have an associated (finite²) set S(p/l) of possible states, and
2. we use | instead of :- to highlight the conditional probability distribution.
2. For the sake of simplicity we consider finite random variables, i.e., random variables
having a finite set S of states. However, because the semantics rely on Bayesian networks,
the ideas easily generalize to discrete and continuous random variables (modulo the
restrictions well-known for Bayesian networks).
For instance, consider the Bayesian clause c bt(X) | mc(X), pc(X) where S(bt/1) =
{a, b, ab, 0} and S(mc/1) = S(pc/1) = {a, b, 0}. Intuitively, a Bayesian predicate
p/l generically represents a set of random variables. More precisely, each Bayesian
ground atom g over p/l represents a random variable over the states S(g) :=
S(p/l). For example, bt(ann) represents the blood type of a person named Ann
as a random variable over the states {a, b, ab, 0}. Apart from that, most logical
notions carry over to Bayesian logic programs. So we will speak of Bayesian
predicates, terms, constants, substitutions, propositions, ground Bayesian clauses,
Bayesian Herbrand interpretations, etc. For the sake of simplicity we will sometimes
omit the term Bayesian as long as no ambiguities arise. We will assume that all
Bayesian clauses c are range-restricted, i.e., Var(head(c)) ⊆ Var(body(c)). Range
restriction is often imposed in the database literature; it allows one to avoid
the derivation of nonground true facts (cf. section 10.2.2). As already indicated
while discussing figure 10.3, a set of Bayesian clauses encodes the qualitative or
structural component of the Bayesian logic programs. More precisely, ground atoms
correspond to random variables, and the set of random variables encoded by a
particular Bayesian logic program corresponds to its least Herbrand domain. In
addition, the direct influence relation corresponds to the immediate consequence.
In order to represent a probabilistic model we also associate with each Bayesian
clause c a conditional probability distribution cpd(c) encoding P(head(c) |
body(c)); cf. figure 10.4. To keep the exposition simple, we will assume that cpd(c)
is represented as a table. More elaborate representations such as decision trees
or rules would be possible too. The distribution cpd(c) generically represents the
conditional probability distributions associated with each ground instance cθ of the
clause c.
In general, one may have many clauses. Consider clauses $c_1$ and $c_2$
bt(X) | mc(X).
bt(X) | pc(X).
and assume corresponding substitutions $\theta_i$ that ground the clauses $c_i$ such that
$\mathrm{head}(c_1\theta_1) = \mathrm{head}(c_2\theta_2)$. In contrast to bt(X) | mc(X), pc(X), they specify $\mathrm{cpd}(c_1\theta_1)$
and $\mathrm{cpd}(c_2\theta_2)$, but not the desired distribution $P(\mathrm{head}(c_1\theta_1) \mid \mathrm{body}(c_1) \cup \mathrm{body}(c_2))$.
The standard solution to obtain the distribution required is so-called combining
rules.
Definition 10.3 Combining Rule
A combining rule is a function that maps finite sets of conditional probability
distributions $\{P(A \mid A_{i1}, \ldots, A_{in_i}) \mid i = 1, \ldots, m\}$ onto one (combined) conditional
probability distribution $P(A \mid B_1, \ldots, B_k)$ with $\{B_1, \ldots, B_k\} \subseteq \bigcup_{i=1}^{m} \{A_{i1}, \ldots, A_{in_i}\}$.
We assume that for each Bayesian predicate p/l there is a corresponding combining
rule cr(p/l), such as noisy-or (see, e.g., [18]) or average. The latter assumes
$n_1 = \ldots = n_m$ and $S(A_{ij}) = S(A_{kj})$, and computes the average of the distributions
over S(A) for each joint state over $\prod_j S(A_{ij})$; see also section 10.3.2.
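The two combining rules named above can be sketched in Python for a single joint state of the parents. This is a simplified illustration, not the Balios implementation: `noisy_or` takes, for each active cause, the probability that this cause alone makes a Boolean effect true, and `average` takes several distributions over the same state set. The function names and all numbers used with them are hypothetical.

```python
def noisy_or(p_true_given_each_cause):
    """P(effect = true) when each active cause independently triggers the effect."""
    p_false = 1.0
    for p in p_true_given_each_cause:
        p_false *= 1.0 - p  # the effect is false only if every cause fails
    return 1.0 - p_false


def average(distributions):
    """Pointwise average of distributions defined over the same states."""
    n = len(distributions)
    return {state: sum(d[state] for d in distributions) / n
            for state in distributions[0]}
```

For instance, two causes that each trigger the effect with probability 0.5 give `noisy_or([0.5, 0.5]) == 0.75`, whereas averaging the two corresponding distributions leaves the probability of the effect at 0.5.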
By now, we are able to formally define Bayesian logic programs.
m(ann, dorothy).
f(brian, dorothy).
pc(ann).
pc(brian).
mc(ann).
mc(brian).
mc(X) | m(Y, X), mc(Y), pc(Y).
pc(X) | f(Y, X), mc(Y), pc(Y).
bt(X) | mc(X), pc(X).

mc(X)   pc(X)   P(bt(X))
a       a       (0.97, 0.01, 0.01, 0.01)
b       a       (0.01, 0.01, 0.97, 0.01)
...     ...     ...
0       0       ...

m(Y, X)   mc(Y)   pc(Y)   P(mc(X))
true      a       ...     (0.98, 0.01, 0.01)
true      b       ...     (0.01, 0.98, 0.01)
...       ...     ...     ...
false     ...     ...     ...

Figure 10.4 The Bayesian logic program blood type encoding our genetic domain.
For each Bayesian predicate, the identity is the combining rule. The conditional
probability distributions associated with the Bayesian clauses bt(X) | mc(X), pc(X)
and mc(X) | m(Y, X), mc(Y), pc(Y) are represented as tables. The other distributions are
correspondingly defined. The Bayesian predicates m/2 and f/2 have as possible states
{true, false}.
Declarative Semantics
the logical sense, i.e., if the Bayesian logic program B is interpreted as a logical
program. They are the so-called relevant random variables, the random variables
over which a probability distribution is well-defined by B, as we will see. The atoms
not belonging to the least Herbrand model are irrelevant. Now, to each node x in
DG(B) we associate the combined conditional probability distribution which is
the result of applying the combining rule cr(p/n) of the corresponding Bayesian
predicate p/n to the set of cpd(cθ)'s where head(cθ) = x and {x} ∪ body(cθ) ⊆
LH(B). Consider
cold.
flu.
malaria.
fever | cold.
fever | flu.
fever | malaria.
where all Bayesian predicates have {true, false} as states, and noisy-or as combining
rule. The dependency graph is
For instance, the dependency graph of the blood type program as shown in Figures 10.5 and 10.6 encodes that the random variable bt(dorothy) is independent
of pc(ann) given a joint state of pc(dorothy), mc(dorothy). Using this assumption
the following proposition (taken from [21]) holds:
Proposition 10.6 Semantics
Let B be a Bayesian logic program. If
1. LH(B) ≠ ∅,
2. DG(B) is acyclic, and
3. each node in DG(B) is influenced by a finite set of random variables,
then B specifies a unique probability distribution PB over LH(B).
To see this, note that the least Herbrand model LH(B) always exists, is unique, and
countable. Thus, DG(B) exists and is unique, and due to condition (3) the combined
probability distribution for each node of DG(B) is computable. Furthermore,
because of condition (1) a total order π on DG(B) exists, so that one can see
B together with π as a stochastic process over LH(B). An induction argument
over π together with condition (2) allows one to conclude that the family of
finite-dimensional distributions of the process is projective (cf. [2]), i.e., the joint
probability distribution over each finite subset S ⊆ LH(B) is uniquely defined and
Σ_y P(S, x = y) = P(S). Thus, the preconditions of Kolmogorov's theorem [2, p.
307] hold, and it follows that B given π specifies a probability distribution P over
LH(B). This proves the proposition because the total order π used for the induction
is arbitrary.
A program B satisfying the conditions (1), (2), and (3) of proposition 10.6 is called
well-defined. A well-defined Bayesian logic program B specifies a joint distribution
over the random variables in the least Herbrand model LH(B). As with Bayesian
networks, the joint distribution over these random variables can be factored to
$$P(LH(B)) = \prod_{x \in LH(B)} P(x \mid Pa(x)),$$
where the parent relation Pa is according to the dependency graph.
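The factorization over the dependency graph can be made concrete with a small Python sketch. The network (cold and flu as parents of fever), the variable names, and all probability values are hypothetical and serve only to show how the product of local distributions yields a joint probability.

```python
# Parents of each random variable in a tiny, hypothetical dependency graph.
PARENTS = {"cold": (), "flu": (), "fever": ("cold", "flu")}

# P(x = true | parent states); every number here is an assumed example value.
CPD = {
    "cold": {(): 0.1},
    "flu": {(): 0.05},
    "fever": {
        (False, False): 0.01,
        (False, True): 0.8,
        (True, False): 0.3,
        (True, True): 0.86,
    },
}


def joint(assignment):
    """Joint probability of a full assignment as a product of P(x | Pa(x))."""
    p = 1.0
    for var, parents in PARENTS.items():
        parent_states = tuple(assignment[q] for q in parents)
        p_true = CPD[var][parent_states]
        p *= p_true if assignment[var] else 1.0 - p_true
    return p


if __name__ == "__main__":
    print(joint({"cold": True, "flu": False, "fever": True}))
```

Because the graph is acyclic and every variable has finitely many parents, summing joint over all assignments gives 1, which is the finite analogue of the well-definedness conditions above.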
The blood type Bayesian logic program in figure 10.4 is an example of a well-defined
Bayesian logic program. Its grounded version is shown in figure 10.5.
It essentially encodes the original blood type Bayesian network of figures 10.1
and 10.3. The only differences are the two predicates m/2 and f/2, which can be
in one of the logical states true and false. Using these predicates and an
appropriate set of Bayesian facts (the extension) one can encode the Bayesian
network for any family. This situation is akin to that in deductive databases, where
the intension (the clauses) encodes the overall regularities and the extension
(the facts) the specific context of interest. By interchanging the extension, one can
swap contexts (in our case, families).
Figure 10.6 The structure of the Bayesian network represented by the grounded
blood type Bayesian logic program in figure 10.5. The structure of the Bayesian
network coincides with the dependency graph. Omitting the dashed nodes yields
the original Bayesian network of figure 10.1.
10.3.3 Procedural Semantics
That the support network of a finite set X ⊆ LH(B) is sufficient to compute P(X)
follows from the following theorem (taken from [21]):
Theorem 10.9 Support Network
Let N be a possibly infinite Bayesian network, let Q be nodes of N, and E = e,
E ⊂ N, be some evidence. The computation of P(Q | E = e) does not depend on
any node x of N which is not a member of the support network N(Q ∪ E).
To compute the support network N({q}) of a single variable q efficiently, let us
look at logic programs from a proof-theoretic perspective. From this perspective,
a logic program can be used to prove that certain atoms or goals (see below) are
logically entailed by the program. Provable ground atoms are members of the least
Herbrand model.
Proofs are typically constructed using the SLD-resolution procedure, which we will
now briefly introduce. Given a goal :-G1, G2, ..., Gn and a clause G:-L1, ..., Lm such that
G1θ = Gθ, applying SLD resolution yields the new goal :-L1θ, ..., Lmθ, G2θ, ..., Gnθ.
A successful refutation, i.e., a proof of a goal, is then a sequence of resolution steps
yielding the empty goal, i.e., :- . Failed proofs do not end in the empty goal. For
instance, in our running example, bt(dorothy) is true, because of the following
refutation:
:-bt(dorothy)
:-mc(dorothy), pc(dorothy)
:-m(ann, dorothy), mc(ann), pc(ann), pc(dorothy)
:-mc(ann), pc(ann), pc(dorothy)
:-pc(ann), pc(dorothy)
:-pc(dorothy)
:-f(brian, dorothy), mc(brian), pc(brian)
:-mc(brian), pc(brian)
:-pc(brian)
:-
Resolution is employed by many theorem provers (such as Prolog). Indeed, when
given the goal bt(dorothy), Prolog would compute the above successful resolution
refutation and answer that the goal is true.
Figure 10.7 The rule graph for the blood type Bayesian network. On the right-hand
side the local probability model associated with node R9 is shown, i.e.,
the Bayesian clause bt_dorothy|mc_dorothy, pc_dorothy with associated conditional
probability table.
The set of all proofs of :-bt(dorothy) captures all information needed to compute
N({bt(dorothy)}). More exactly, the set of all ground clauses employed to prove
bt(dorothy) constitutes the families of the support network N({bt(dorothy)}).
For :-bt(dorothy), they are the ground clauses shown in figure 10.5. To build the
support network, we only have to gather all ground clauses used to prove the query
variable and have to combine multiple copies of ground clauses with the same head
variable and have to combine multiple copies of ground clauses with the same head
using corresponding combining rules. To summarize, the support network N ({q})
can be computed as follows:
1
2
3
Applying this to :-bt(dorothy) yields the support network as shown in figure 10.6.
Furthermore, the method can easily be extended to compute the support network
for P(Q | E = e). We simply compute all proofs of :-q, q ∈ Q, and :-e, e ∈ E. The
resulting support network can be fed into any (exact or approximate) Bayesian
network engine to compute the resulting (conditional) probability distribution of the
query. To minimize the size of the support network, one might also apply Shachter's
Bayes-Ball algorithm [43].
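As a rough illustration of how the support network is gathered, the following Python sketch walks backward from the query atoms over an already grounded program. It is not the SLD-resolution procedure itself: it assumes the ground clauses are given up front, and the dictionary GROUND_CLAUSES is our own reading of the blood type program for the family ann, brian, and dorothy. The helper name support_network is hypothetical.

```python
from collections import deque

# Ground Bayesian clauses as head -> list of bodies (assumed grounding of the
# blood type program; facts have an empty body).
GROUND_CLAUSES = {
    "m(ann,dorothy)": [[]],
    "f(brian,dorothy)": [[]],
    "mc(ann)": [[]],
    "pc(ann)": [[]],
    "mc(brian)": [[]],
    "pc(brian)": [[]],
    "mc(dorothy)": [["m(ann,dorothy)", "mc(ann)", "pc(ann)"]],
    "pc(dorothy)": [["f(brian,dorothy)", "mc(brian)", "pc(brian)"]],
    "bt(ann)": [["mc(ann)", "pc(ann)"]],
    "bt(brian)": [["mc(brian)", "pc(brian)"]],
    "bt(dorothy)": [["mc(dorothy)", "pc(dorothy)"]],
}


def support_network(queries, clauses=GROUND_CLAUSES):
    """Map every atom needed to prove the queries to its sorted parent atoms.

    Several clauses with the same head are merged into one parent set; the
    corresponding distributions would be merged by the combining rule.
    """
    parents = {}
    todo = deque(queries)
    while todo:
        atom = todo.popleft()
        if atom in parents:
            continue  # already expanded
        bodies = clauses[atom]
        parents[atom] = sorted({b for body in bodies for b in body})
        todo.extend(parents[atom])
    return parents


if __name__ == "__main__":
    for node, pa in support_network(["bt(dorothy)"]).items():
        print(node, "<-", pa)
```

For a query with evidence, the same function can be called with the query and evidence atoms together, mirroring the extension to P(Q | E = e) described above.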
10.4
10.4.1 Graphical Representation
Figure 10.9
Bayesian logic program, the graph can be viewed as a rule graph as known from
database theory. Ovals represent Bayesian predicates, and boxes denote Bayesian
clauses. More precisely, given a (propositional) Bayesian logic program B with
Bayesian clauses Ri ≡ hi | bi1, ..., bim, there are edges from Ri to hi and from
bij to Ri. Furthermore, to each Bayesian clause node, we associate the corresponding
Bayesian clause as a Bayesian network fragment. Indeed, the graphical model
in figure 10.7 represents the propositional Bayesian logic program of figure 10.5.
In order to represent first-order Bayesian logic programs graphically, we have
to encode Bayesian atoms and their variable bindings in the associated local
probability models. Indeed, logical terms can naturally be represented graphically.
They form trees. For instance, the term t(s(1, 2), X) corresponds to the tree with
root t, whose children are s (with children 1 and 2) and X.
Logical variables such as X are encoded as white ovals. Constants and functors such
as 1, 2, s, and t are represented as white boxes. Bayesian atoms are represented
as gradient gray ovals containing the predicate name such as pc. Arguments of
atoms are treated as placeholders for terms. They are represented as white circles
on the boundary of the ovals (ordered from left to right). The term appearing in the
argument is represented by an undirected edge between the white oval representing
the argument and the root of the tree encoding the term (we start in the argument
and follow the tree until reaching variables).
As an example, consider the Bayesian logic program in figure 10.8. It models
the blood type domain. The graphical representation indeed conveys the meaning of
the Bayesian clause R7: the paternal genetic information pc(Person) of a person
is influenced by the maternal mc(M) and the paternal pc(M) genetic information of
the person's father.
Figure 10.10 The blood type Bayesian logic program distinguishing between
Bayesian (gradient gray ovals) and logical atoms (solid gray ovals).
As another example, consider figure 10.9, which shows the use of functors to
represent dynamic probabilistic models. More precisely, it shows an HMM [39].
HMMs are extremely popular for analyzing sequential data. Application areas
include computational biology, user modeling, speech recognition, empirical natural
language processing, and robotics.
At each Time, the system is in a state hidden(Time). The time-independent
probability of being in some state at the next time next(Time) given that the
system was in a state at TimePoint is captured in the Bayesian clause R2. Here,
the next time point is represented as functor next/1. In HMMs, however, we do not
have direct access to the states hidden(Time). Instead, we measure some properties
obs(Time) of the states. The measurement is quantified in Bayesian clause R3. The
dependency graph of the Bayesian logic program directly encodes the well-known
Bayesian network structure of HMMs:
10.4.2 Logical Atoms
Reconsider the blood type Bayesian logic program in figure 10.8. The mother/2
and father/2 relations are not really random variables but logical ones because
they are always in the same state, namely true, with probability 1, and can depend
only on other logical atoms. These predicates form a kind of logical background
theory. Therefore, when predicates are declared to be logical, one need
not represent them in the conditional probability distributions. Consider the blood
type Bayesian logic program in figure 10.10. Here, mother/2 and father/2 are
declared to be logical. Consequently, the conditional probability distribution
associated with the definition of, e.g., pc/1 takes only pc(Father) and mc(Father) into
account but not f(Father, Person). It applies only to those substitutions for which
f(Father, Person) is true, i.e., in the least Herbrand model. This can efficiently be
checked using any Prolog engine. Furthermore, one may omit these logical atoms
from the induced support network. More importantly, logical predicates provide
the user with the full power of Prolog. In the blood type Bayesian logic program of
figure 10.10, the logical background knowledge defines the founder/1 relation as
founder(Person):-\+(mother(_, Person); father(_, Person)).
Here, \+ denotes negation, the symbol _ represents an anonymous variable which
is treated as a new, distinct variable each time it is encountered, and the semicolon
denotes a disjunction. The rest of the Bayesian logic program is essentially as
in figure 10.4. Instead of explicitly listing pc(ann), mc(ann), pc(brian), mc(brian)
in the extensional part we have pc(P)|founder(P) and mc(P)|founder(P) in the
intensional part.
The full power of Prolog is also useful to elegantly encode dynamic probabilistic
models. Figure 10.11 (a) shows the generic structure of an HMM where the discrete
time is now encoded as next/2 in the logical background theory using standard
Prolog predicates:
next(X, Y):-integer(Y), Y > 0, X is Y - 1.
Prolog's predefined predicates (such as integer/1) avoid a cumbersome representation
of the dynamics via the successor functor 0, next(0), next(next(0)), ... Imagine
querying ?- obs(100) using the successor functor,
?- obs(next(next(...(next(0))...))).
Whereas HMMs define probability distributions over regular languages, probabilistic
context-free grammars (PCFGs) [29] define probability distributions over context-free
languages. Application areas of PCFGs include, e.g., natural language processing
and computational biology. For instance, mRNA sequences constitute context-free
languages. Consider, e.g., the following PCFG
terminal([A|B], A, B).
0.3 : sentence(A, B):-terminal(A, a, C), terminal(C, b, B).
0.7 : sentence(A, B):-terminal(A, a, C), sentence(C, D), terminal(D, b, B).
defining a distribution over {a^n b^n}. The grammar is represented as a probabilistic
definite clause grammar where the terminal symbols are encoded in the logical
background theory via the first rule terminal([A|B], A, B).
Figure 10.11 Two dynamic Bayesian logic programs. (a) The generic structure of a
hidden Markov model, more elegantly represented than in figure 10.9,
using next(X, Y):-integer(Y), Y > 0, X is Y - 1. (b) A probabilistic context-free
grammar over {a^n b^n}. The logical background theory defines terminal/3 as
terminal([A|B], A, B).
A PCFG defines a stochastic process with leftmost rewriting, i.e., refutation steps
as transitions. Words, say aabb, are parsed by querying ?- sentence([a, a, b, b], []).
The third rule yields ?- terminal([a, a, b, b], a, C), sentence(C, D), terminal(D, b, []).
Applying the first rule yields ?- sentence([a, b, b], D), terminal(D, b, []) and the second
rule ?- terminal([a, b, b], a, C), terminal(C, b, D), terminal(D, b, []). Applying
the first rule three times yields a successful refutation. The probability of a
refutation is the product of the probability values associated with clauses used in the
refutation; in our case 0.7 · 0.3 = 0.21. The probability of aabb then is the sum of the
probabilities of all successful refutations. This is also the basic idea underlying
Muggleton's stochastic logic programs [30], which extend the PCFGs to definite
clause logic.
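The computation of a word's probability under this grammar can be sketched in Python as follows. The function mirrors the two labeled sentence clauses directly rather than running a resolution engine; the name sentence_prob and the string encoding of words are our own choices for illustration.

```python
def sentence_prob(word):
    """Sum of the probabilities of all derivations of `word` from sentence.

    Mirrors the two labeled clauses:
      0.3 : sentence -> a b
      0.7 : sentence -> a sentence b
    """
    if len(word) < 2 or word[0] != "a" or word[-1] != "b":
        return 0.0
    total = 0.0
    if len(word) == 2:
        total += 0.3  # first labeled clause: the word is exactly "ab"
    # second labeled clause: strip the outer a ... b and recurse
    total += 0.7 * sentence_prob(word[1:-1])
    return total


if __name__ == "__main__":
    for w in ["ab", "aabb", "aaabbb", "abab"]:
        print(w, sentence_prob(w))
```

Since every word a^n b^n has exactly one derivation here, the sum reduces to a single product, 0.7^(n-1) · 0.3, and words outside the language receive probability 0.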
Figure 10.11 (b) shows the {a^n b^n} PCFG represented as a Bayesian logic program.
The Bayesian clauses are the clauses of the corresponding definite clause
grammar. In contrast to PCFGs, however, we associate a complete conditional
probability distribution, namely (0.3, 0.7) and (0.7, 0.3; 0.0, 1.0), to the Bayesian clauses.
For the query ?- sentence([a, a, b, b], []), the following Markov chain is induced
(omitting logical atoms):
10.4.3 Aggregate Functions
deterministically computed; cf. Bayesian clause R5. In turn, the student's rank/1
probabilistically depends on her averaged rank; cf. R6.
The use of aggregate functions is inspired by probabilistic relational models [37].
As we will show in the related work section, it is easy to model probabilistic
relational models using aggregates in Bayesian logic programs.
10.5
For Bayesian logic programs, a data case D_i ∈ D has two parts, a logical and a
probabilistic part. The logical part of a data case is a Herbrand interpretation. For
instance, the following set of atoms constitutes a Herbrand interpretation for the
blood type Bayesian logic program.
{m(ann, dorothy), f(brian, dorothy), pc(ann), mc(ann), bt(ann),
pc(brian), mc(brian), bt(brian), pc(dorothy), mc(dorothy), bt(dorothy)}
This (logical) interpretation can be seen as the least Herbrand model of an unknown
Bayesian logic program. In general, data cases specify different sets of relevant
random variables, depending on the given extensional context. If we accept that
the genetic laws are the same for different families, then a learning algorithm should
transform such extensionally defined predicates into intensionally defined ones, thus
compressing the interpretations. This is precisely what inductive logic programming
techniques [31] do. The key assumption underlying any inductive technique is
that the rules that are valid in one interpretation are likely to hold for other
interpretations. It thus seems clear that techniques for learning from interpretations
can be adapted for learning the logical structure of Bayesian logic programs.
So far, we have specified the logical part of the learning problem: we are looking
for a set H of Bayesian clauses given a set D of data cases such that all data cases
are a model of H. The hypotheses H in the space ℋ of hypotheses are sets of
Bayesian clauses. However, we have to be more careful. A candidate set H ∈ ℋ
has to be acyclic on the data, which implies that for each data case the induced
Bayesian network has to be acyclic.
Consider the task of performing maximum likelihood learning, i.e., score_D(H) =
P(D|H). As in many cases, it is more convenient to work with the logarithm of
this function, i.e., score_D(H) = LL(D, H) := log P(D|H). It can be shown (see [22]
for more details) that the likelihood of a Bayesian logic program coincides with the
likelihood of the support network induced over D. Thus, learning Bayesian logic
programs basically reduces to learning Bayesian networks. The main differences are
the ways to estimate the parameters and to traverse the hypothesis space.
10.5.2.1 Parameter Estimation
l=1
where θ denotes substitutions such that D_l is a model of cθ.
10.5.2.2
3. Most combining rules commonly employed in Bayesian networks, such as noisy-or, are
decomposable.
Figure 10.14 Balios, the engine for Bayesian logic programs. (a) Graphical
representation of the university Bayesian logic program. (b) Textual representation of
Bayesian clauses with associated conditional probability distributions. (c) Computed
support network and probabilities for a probabilistic query.
10.6
10.7 Related Work
In the last ten years, there has been a lot of work done at the intersection of probability theory, logic programming, and machine learning [38, 14, 41, 30, 34, 17, 26,
1, 24, 9]; see [5] for an overview. Instead of giving a probabilistic characterization
of logic programming such as [32], this research highlights the machine learning
aspect and is known under the names of statistical relational learning (SRL) [11, 8],
probabilistic logic learning (PLL) [5], or probabilistic inductive logic programming
(PILP) [6]. Bayesian logic programs belong to the SRL line of research which
extends Bayesian networks. They are motivated and inspired by the formalisms
discussed in [38, 14, 34, 17, 10, 25]. We will now investigate these relationships in
more detail.
Probabilistic logic programs [33, 34] also adopt a logic program syntax, the
concept of the least Herbrand model to specify the relevant random variables, and
SLD resolution to develop a query-answering procedure. Whereas Bayesian logic
programs view atoms as random variables, probabilistic logic programs view them
as states of random variables. For instance,
P (burglary(Person, yes) | neighbourhood(Person, average)) = 0.4
states that the a posteriori probability of a burglary in Person's house given that
Person has an average neighborhood is 0.4. Thus, instead of conditional probability
distributions, conditional probability values are associated with clauses.
Treating atoms as states of random variables has several consequences: (1)
Exclusivity constraints such as
false ← neighbourhood(X, average), neighbourhood(X, bad)
4. It is possible, but complicated to model domains having more than two values.
5. To simplify the discussion, we will further ignore these equality constraints here.
10.8 Conclusions
We have described Bayesian logic programs, their representation language, their
semantics, and a query-answering process, and briefly touched upon learning Bayesian
logic programs from data.
Bayesian logic programs combine Bayesian networks with definite clause logic.
The main idea of Bayesian logic programs is to establish a one-to-one mapping
between ground atoms in the least Herbrand model and random variables. The
least Herbrand model of a Bayesian logic program together with its direct influence
relation is viewed as a (possibly infinite) Bayesian network. Bayesian logic programs
inherit the advantages of both Bayesian networks and definite clause logic, including
the strict separation of qualitative and quantitative aspects. Moreover, the strict
separation facilitated the introduction of a graphical representation, which stays
close to the graphical representation of Bayesian networks.
Indeed, Bayesian logic programs can naturally model any type of Bayesian
network (including those involving continuous variables) as well as any type of
pure Prolog program (including those involving functors). We also demonstrated
that Bayesian logic programs can model HMMs and stochastic grammars, and
investigated their relationship to other first-order extensions of Bayesian networks.
We have also presented the Balios tool, which employs the graphical as well as
the logical notations for Bayesian logic programs. It is available at
http://www.informatik.uni-freiburg.de/~kersting/profile/
and the authors invite the reader to employ it.
Acknowledgments
The authors thank Uwe Dick for implementing the Balios system. This research
was partly supported by the European Union IST programme under contract number IST-2001-33053 and FP6-508861, APRIL I & II (Application of Probabilistic
Inductive Logic Programming).
References
[1] C. R. Anderson, P. Domingos, and D. S. Weld. Relational Markov models and
their application to adaptive web navigation. In International Conference on
Knowledge Discovery and Data Mining, 2002.
[2] Heinz Bauer. Wahrscheinlichkeitstheorie, 4th edition. Walter de Gruyter,
Berlin, 1991.
[3] J. Cheng, C. Hatzis, M.A. Krogel, S. Morishita, D. Page, and J. Sese. KDD
Cup 2001 report. SIGKDD Explorations, 3(2):47–64, 2002.
[4] J. Cussens. Loglinear models for first-order probabilistic reasoning. In
Proceedings of the Conference on Uncertainty in Artificial Intelligence, 1999.
[5] L. De Raedt and K. Kersting. Probabilistic logic learning. ACM-SIGKDD
Explorations: Special Issue on Multi-Relational Data Mining, 5(1):31–48, 2003.
[6] L. De Raedt and K. Kersting. Probabilistic inductive logic programming. In
Proceedings of the International Conference on Algorithmic Learning Theory,
pages 19–36, 2004.
[7] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from
incomplete data via the EM algorithm. Journal of the Royal Statistical Society,
B 39:1–39, 1977.
[8] T. Dietterich, L. Getoor, and K. Murphy, editors. Working Notes of the
ICML-2004 Workshop on Statistical Relational Learning and its Connections to Other
Fields (SRL-04), 2004.
[9] P. Domingos and M. Richardson. Markov Logic: A Unifying Framework for
Statistical Relational Learning. In Proceedings of the ICML-2004 Workshop
on Statistical Relational Learning and its Connections to Other Fields, pages
49–54, 2004.
[10] N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic
relational models. In Proceedings of the International Joint Conference on
Artificial Intelligence, 1999.
[11] L. Getoor and D. Jensen, editors. Working Notes of the IJCAI-2003 Workshop
on Learning Statistical Models from Relational Data (SRL-03), 2003.
[12] L. Getoor, N. Friedman, D. Koller, and A. Pfeffer. Learning probabilistic
relational models. In S. Dzeroski and N. Lavrac, editors, Relational Data
Mining, pages 307–335. Kluwer, 2001.
[13] W. R. Gilks, A. Thomas, and D. J. Spiegelhalter. A language and program
for complex bayesian modelling. The Statistician, 43, 1994.
[14] P. Haddawy. Generating Bayesian networks from probabilistic logic knowledge
bases. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 1994.
Stochastic logic programs (SLPs) provide a simple scheme for representing probability
distributions over structured objects. Other papers have concentrated on
technical issues related to the semantics and machine learning of SLPs. By contrast,
this chapter provides a tutorial for the use of SLPs as a means of representing
probability distributions over structured objects such as sequences, graphs, and
plans.
11.1 Introduction
11.1.1
An algorithm generally describes an entirely deterministic series of actions. However, apart from their use in describing algorithms, logic programs can also describe
Probabilistic Non-Determinism
Consider the following non-deterministic logic program representation of the outcome of tossing a two-sided coin.
coin(head).
coin(tail).
This logic program can be interpreted as saying that when the coin is tossed it will
either come up as heads or tails. However, the logic program does not state the
frequency with which we can expect these two outcomes to occur. By associating
probability labels with the clauses we get the following stochastic logic program
(SLP) [9] representation of a fair coin (a coin with equal probability outcomes of
heads and tails).
0.5: coin(head).
0.5: coin(tail).
Given the goal :- coin(X) we would now expect the outcomes X = head and
X = tail to occur randomly with probability 0.5 in each case. Here we can view X
as a random variable in the statistical sense.
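To make the sampling reading of the labeled clauses concrete, here is a small Python sketch that answers the goal :- coin(X) by drawing one clause according to its label. The list COIN, the function name sample_clause, and the use of a seeded generator are assumptions made for this illustration; they are not part of any SLP system.

```python
import random
from collections import Counter

# The two labeled clauses of the fair coin, as (label, outcome) pairs.
COIN = [(0.5, "head"), (0.5, "tail")]


def sample_clause(labeled_clauses, rng):
    """Pick one clause with probability equal to its label."""
    labels = [p for p, _ in labeled_clauses]
    if abs(sum(labels) - 1.0) > 1e-9:
        raise ValueError("labels of a predicate should sum to 1")
    outcomes = [value for _, value in labeled_clauses]
    return rng.choices(outcomes, weights=labels, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make runs repeatable
    counts = Counter(sample_clause(COIN, rng) for _ in range(10000))
    print(counts)
```

Over many draws the relative frequencies of head and tail approach the labels, which is the sense in which X behaves as a random variable.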
11.2
11.2.1
The game involves a player and a banker. The player starts with a quantity of N
counters and the banker with M counters. Until the player chooses to stop he does
the following repeatedly.
1. The player pays an entrance fee (F counters) to the banker.
2. The player rolls a six-sided dice and gets the value D.
3. The banker rewards the player with D counters.
11.2.2
Below we show how this game can be represented as an SLP. The form of SLP used
below is known as an impure SLP. An impure SLP is one in which not every definite
clause has a probability label. Those without a probability label are treated as
normal logic program clauses. Let us start with the unlabeled part of the program.
play(State) :-
    act(stop(State,State)).
play(State) :-
    act(pay_entrance(State,State1)),
    act(dice_reward(State1,State2)),
    play(State2).
act(X) :- X.
Here we see the general playing strategy for the game. Every action such as
pay_entrance and dice_reward is conducted by the predicate act. Each such action
transforms one state into another. Play proceeds by recursing via the second clause
until the stop action is taken using the first clause.
11.2.3
The stop action simply prints out the playing state, which is represented as a
two-element list consisting of the number of counters held by the Player and Banker
respectively.
The pay_entrance action reduces the player's counters by 4 and increases the
banker's counters by 4.
The dice_reward action increases the player's counters by the value D of the rolled
dice and decreases the banker's counters by D.
11.2.4
The dice represent the only probabilistic element of the game. A fair dice is
represented as follows.
1/6: roll_dice(1).
1/6: roll_dice(2).
1/6: roll_dice(3).
1/6: roll_dice(4).
1/6: roll_dice(5).
1/6: roll_dice(6).
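The game described above can be simulated with a short Python sketch. The entrance fee of 4 and the two-element state come from the text; the fixed number of rounds is a hypothetical stopping policy chosen only so the example terminates, since in the SLP the stop clause may be chosen at any point.

```python
import random

ENTRANCE_FEE = 4  # fee used by the pay_entrance action in the text


def pay_entrance(state):
    """Move the entrance fee from the player to the banker."""
    player, banker = state
    return [player - ENTRANCE_FEE, banker + ENTRANCE_FEE]


def dice_reward(state, rng):
    """Roll a fair six-sided dice and move D counters to the player."""
    player, banker = state
    d = rng.randint(1, 6)  # each face has probability 1/6
    return [player + d, banker - d]


def play(state, rounds, rng):
    """Play a fixed number of rounds (assumed policy), then stop."""
    for _ in range(rounds):
        state = dice_reward(pay_entrance(state), rng)
    return state


if __name__ == "__main__":
    rng = random.Random(1)
    print(play([100, 100], 10, rng))
```

Each round changes the player's holdings by D - 4, so the expected change per round is 3.5 - 4 = -0.5 counters, while the total number of counters in the game stays constant.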
11.2.5
Our blackjack game model is very close to the real blackjack game, as described in
Wikipedia [14]. We consider this version as a simplification of the real game in the
sense that it does not include bets and money. It involves only one player and the
player does not have any strategy.
Let us now describe the specifications of the game. Blackjack hands are scored
by their point total. The hand with the highest total wins as long as it does not go
over 21, which is called a bust. Cards 2 through 10 are worth their face value,
and face cards (Jack, Queen, King) are also worth 10. An ace counts as 11 unless
it would bust a hand, in which case it counts as 1.
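The scoring rule just described can be sketched in Python as follows. The card encoding is an assumption made for this example: values 1 to 13, with 1 for an ace and 11, 12, 13 for Jack, Queen, and King. The function names are hypothetical.

```python
def hand_score(values):
    """Score a blackjack hand given card values 1..13 (1 = ace, 11..13 = faces).

    Aces are first counted as 11 and then demoted to 1, one at a time,
    for as long as the hand would otherwise bust.
    """
    total = 0
    soft_aces = 0
    for v in values:
        if v == 1:
            total += 11
            soft_aces += 1
        elif v >= 11:
            total += 10  # Jack, Queen, King
        else:
            total += v
    while total > 21 and soft_aces:
        total -= 10  # count one ace as 1 instead of 11
        soft_aces -= 1
    return total


def is_bust(values):
    """A hand is bust when its best score still exceeds 21."""
    return hand_score(values) > 21


if __name__ == "__main__":
    print(hand_score([1, 13]), hand_score([1, 5, 10]), is_bust([10, 9, 5]))
```

Demoting aces only while the total exceeds 21 gives the highest score that does not bust whenever such a score exists.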
In our version there is only one player. His goal is to beat the dealer, by having
the higher, unbusted hand. Note that if the player busts, he loses, even if the dealer
also busts. If the player's and the dealer's hands have the same point value, this is
known as a push, and neither player nor dealer wins the hand.
The dealer deals the cards, in our version from one deck of cards. The dealer
gives two cards to the player and to himself.
A two-card hand of 21 (an ace plus a ten-value card) is called a blackjack or a
natural, and is an automatic winner.
If the dealer has a blackjack and the player does not, the dealer wins automatically.
If the player has a blackjack and the dealer does not, the player wins
automatically. If the player and dealer both have blackjack, it is a tie (push). If
neither side has a blackjack, in our version the strategy of the player is always to
stand; then the dealer plays his hand. He must hit until he has at least 17, regardless
of what the player has. The dealer may hit until he has a maximum of five cards
in his hand.
The parameters of the game that could be modified, defining another version of
the game, are:
the number of decks of cards;
the maximum number of cards in a hand;
the strategy of the player;
the number of players.
11.2.7
We show here how we can represent the game in Prolog. Such an implementation
is a useful step toward the SLP representation for several reasons. First, since SLPs
lift the concept of logic programs, representing the game in Prolog allows us to translate
it in order to obtain an SLP representation of the game. Secondly, it is interesting
to see the difference in expressivity between logic programs and SLPs. Finally, the
Prolog implementation permits us to experimentally verify the correctness of our
representation.
Let us present the entry clause of the program.
game(Result,PScore,PHand,DScore,DHand) :-
    State0 = [[],[],[]],
    act(first_2_cards(State0,State1)),
    act(rest_of_game(State1,State2)),
    end_of_game(State2,Result,PScore,PHand,DScore,DHand).
act(X) :- X.
The general playing strategy has the same structure as for the simple game of chance
described above. Indeed, every action such as first_2_cards and rest_of_game is
conducted by the predicate act. Each such action transforms one state into another.
The predicate end_of_game does not represent an action but calculates, given the
final state of the game, the result, returning also the scores and the hands of the
player and the dealer as its last arguments.
A playing state is represented as a list of three lists. The first list represents the
player hand, the second the dealer hand, and the third all the cards already dealt.
The rest of the program defines each of the predicates which are mentioned in
the body of the clause. For instance, let us present the definition of the predicate
rest_of_game.
rest_of_game(State,State2) :-
    act(p_turn(State,State1)),
    act(d_turn(State1,State2)).
We need to introduce two other predicates: p_turn and d_turn. For instance, d_turn
represents the dealer's turn after he has received his first two cards. In this phase he
asks for extra cards until he stands. This corresponds to the following two clauses.
d_turn(State,State) :-
    d_stands(State).
d_turn(State,State2) :-
    \+ d_stands(State),
    act(d_deal_card(State,State1)),
    act(d_turn(State1,State2)).
The predicate d_deal_card represents the action of dealing a card to the dealer.
Therefore, it requires taking a card from the deck of cards. Taking a card is
represented by the following clause.
pick_card(Cards,Card) :-
    random_card(Card),
    non_member(Card,Cards).
where
random_card((C,V)) :-
    repeat,
    random(1,5,C),
    random(1,14,V).
random is a built-in Prolog predicate which simulates the choice of a number
between two bounds.
11.2.8
SLP is the statistical relational learning (SRL) framework that is arguably the
closest to logic programs in terms of declarativeness. Therefore SLP is the most
expressive framework to translate the blackjack game into. Indeed, there are two
types of clauses in the Prolog program that require two different types of treatment.
The Prolog clauses without any random aspect are not modified in the SLP
representation. Obviously we do not restrict ourselves to the notion of pure SLPs
but instead we allow for impure SLP representations.
The random aspects of the program are transformed into labeled clauses. However,
taking a card from a deck of cards is the only probabilistic element of the game.
Therefore, the Prolog implementation and the SLP representation of the blackjack
game are virtually identical. The sole use of the predicate random is replaced by
several labeled clauses expressing that taking a card from a deck is a random
action. Let us show how the action of taking a card from the deck is translated.
Compared to the Prolog implementation of this action described above, the
random predicates have to be replaced in the SLP representation by labeled clauses.
The clauses for choose_color determine the color of the card and the clauses for
choose_value determine the value of the card.
pick_card(Cards,Card) :-
    random_card(Card),
    non_member(Card,Cards).
random_card((C,V)) :-
    repeat,
    choose_color(C),
    choose_value(V).
0.25: choose_color(1).
0.25: choose_color(2).
0.25: choose_color(3).
0.25: choose_color(4).
0.07692308: choose_value(1).
0.07692308: choose_value(2).
0.07692308: choose_value(3).
0.07692308: choose_value(4).
0.07692308: choose_value(5).
0.07692308: choose_value(6).
0.07692308: choose_value(7).
0.07692308: choose_value(8).
0.07692308: choose_value(9).
0.07692308: choose_value(10).
0.07692308: choose_value(11).
0.07692308: choose_value(12).
0.07692308: choose_value(13).
Thus the SLP representation is almost as expressive as the Prolog program. Since
SLPs are logically oriented, it is relatively easy to understand the rules of the game
given the SLP representation.
11.2.9
Let us now present how a modification in the parameters of the game description
would affect this model.
If we added other decks of cards, we would have to add an argument to the description of a card. A card would be defined as C=(Value,Color,Number_of_Deck).
We would have to modify this in the relevant clauses but the number of these
clauses is limited. We would also have to add a predicate choose_deck defined
like choose_color and choose_value.
If we allowed for more cards in a hand, we would only have to replace 5 by the
new number in the definition of d_stands.
If we wanted to assign a cleverer game strategy for the player, we would only have
to modify the p_stands predicate.
We would have to make more important changes if we wanted to change the
number of players. Indeed, we would have to add equivalent predicates for all the
predicates that model the actions of the players.
Thanks to its great expressivity and compactness, the SLP representation would
not be modified much when changing the parameters of the game compared to
other frameworks.
11.3
Stochastic Grammars
The initial inspiration for SLPs in [9] was the idea of lifting stochastic grammars
to the expressive level of logic programs. In this section we show the relationship
between stochastic grammars and SLPs.
11.3.1
Stochastic Automata
Stochastic automata, otherwise called hidden Markov models [11], have found many
applications in speech recognition. An example is shown in figure 11.1. Stochastic
automata are defined by a 5-tuple $A = \langle Q, \Sigma, q_0, F, \delta \rangle$. $Q$ is a set of states. $\Sigma$ is an
alphabet of symbols. $q_0$ is the initial state and $F \subseteq Q$ ($F = \{q_2\}$ in figure 11.1) is
the set of final states. $\delta : (Q \setminus F) \times \Sigma \to Q \times [0, 1]$ is a stochastic transition function
which associates probabilities with labeled transitions between states. The sum of
probabilities associated with transitions from any state $q \in (Q \setminus F)$ is 1.
In the following $\lambda$ represents the empty string. The transition function $\delta^* :
(Q \setminus F) \times \Sigma^* \to Q \times [0, 1]$ is defined as follows. $\delta^*(q, \lambda) = \langle q, 1 \rangle$. $\delta^*(q, au) = \langle q_{au}, p_a p_u \rangle$
Figure 11.1 Stochastic automaton. [States q0, q1, q2; transition labels 0.4: a, 0.6: b, 0.7: b, 0.3: c.]
0.4 : $q_0 \to a\,q_0$
0.6 : $q_0 \to b\,q_1$
0.7 : $q_1 \to b\,q_1$
0.3 : $q_1 \to c\,q_2$
1.0 : $q_2 \to \lambda$
Figure 11.2
Labeled Productions
0.5 : $S \to \lambda$
0.5 : $S \to aSb$
Figure 11.3
11.3.3
Stochastic context-free grammars [4] can be treated in the same way as the labeled
productions of the last section. However, the following differences exist between the
regular and context-free cases.
To allow for the expression of context-free grammars the right-hand sides of the
production rules are allowed to consist of arbitrary strings of terminals and
nonterminals.
Since context-free grammars can have more than one derivation of a particular
string u, the probability of u is the sum of the probabilities of the individual
derivations of u.
The analogue of Theorem 11.3 holds only in relation to the length of the derivation, not the length of the generated string.
Example 11.2
The language $a^n b^n$. Figure 11.3 shows a stochastic context-free grammar $G$
expressed over the language $a^n b^n$. The probabilities of generated strings are as
follows. $Pr(\lambda|G) = 0.5$, $Pr(ab|G) = 0.25$, $Pr(aabb|G) = 0.125$.
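The probabilities in this example can be checked with a short Python sketch. It assumes the two labeled productions of figure 11.3 (probability 0.5 for the empty production and 0.5 for S → aSb); because this grammar gives every string at most one derivation, the sum over derivations reduces to a single product. The function name is hypothetical.

```python
def prob_anbn(s: str) -> float:
    """Probability of string s under 0.5: S -> (empty), 0.5: S -> aSb."""
    if s == "":
        return 0.5                         # apply the empty production
    if len(s) >= 2 and s[0] == "a" and s[-1] == "b":
        return 0.5 * prob_anbn(s[1:-1])    # apply S -> aSb, recurse on the middle
    return 0.0                             # s is not of the form a^n b^n

print(prob_anbn(""), prob_anbn("ab"), prob_anbn("aabb"))
```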
0.5 : nate(0) ←
0.5 : nate(s(N)) ← nate(N)
Figure 11.4
11.4.1
For SLPs the stochastic refutation of a goal is analogous to the stochastic generation
of a string from a set of labeled production rules. Suppose that $P$ is an SLP.
Then $n(P)$ will be used to express the logic program formed by dropping all the
probability labels from clauses in $P$. A stochastic SLD procedure will be used to
define a probability distribution over the Herbrand base of $n(P)$. The stochastic
SLD derivation of atom $a$ is as follows. Suppose $\leftarrow g$ is a unit goal with the same
predicate symbol as $a$, no function symbols, and distinct variables. Next suppose
that there exists an SLD refutation of $\leftarrow g$ with answer substitution $\theta$ such that
$g\theta = a$. Since all clauses in $n(P)$ are range-restricted, $\theta$ is necessarily a ground
substitution. The probability of each clause selection in the refutation is as follows.
Suppose the first atom in the subgoal $\leftarrow g'$ can unify with the heads of stochastic
clauses $p_1 : C_1, \dots, p_n : C_n$, and stochastic clause $p_i : C_i$ is chosen in the refutation.
Then the probability of this choice is $\frac{p_i}{p_1 + \dots + p_n}$. The probability of the derivation of
$a$ is the product of the probability of the choices in the refutation. As with stochastic
context-free grammars, the probability of $a$ is then the sum of the probabilities of
the derivations of $a$.
This stochastic SLD strategy corresponds to a distributional semantics [13] for
P . That is, each atom a in the success set of n(P ) is assigned a nonzero probability
(due to the completeness of SLD derivation). For each predicate symbol q the
probabilities of atoms in the success set of n(P ) corresponding to q sum to 1 (the
proof of this is analogous to theorem 11.1).
11.4.2
Polynomial Distributions
It is reasonable to ask whether theorem 11.3 extends in some form to SLPs. The
distributions described in [10] include both those that decay exponentially over the
length of formulae and those that decay polynomially. SLPs can easily be used to
describe an exponential decay distribution over the natural numbers as follows.
Example 11.3
Exponential distribution. Figure 11.4 shows a recursive SLP $P$ which describes an exponential distribution over the natural numbers expressed in Peano
arithmetic form. The probabilities of atoms are as follows. $Pr(nate(0)|P) =
0.5$, $Pr(nate(s(0))|P) = 0.25$, and $Pr(nate(s(s(0)))|P) = 0.125$. In general,
$Pr(nate(N)|P) = 2^{-N-1}$.
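The stochastic SLD computation for this example can be sketched in Python as follows. The sketch assumes the two clauses of figure 11.4, each labeled 0.5, and that both heads unify with the selected atom at every step, so each choice is normalized over both labels. The function names are hypothetical.

```python
def choice_probability(labels, i):
    """Probability of selecting clause i among the clauses whose heads unify."""
    return labels[i] / sum(labels)

def prob_nate(n: int) -> float:
    """Probability of nate(s^n(0)): n recursive choices, then the base clause."""
    labels = [0.5, 0.5]   # labels of nate(0) and nate(s(N)) <- nate(N)
    p = 1.0
    for _ in range(n):
        p *= choice_probability(labels, 1)     # recursive clause chosen
    return p * choice_probability(labels, 0)   # base clause ends the refutation

print([prob_nate(n) for n in range(3)])
```

Each atom has a single refutation here, so no sum over derivations is needed; the values agree with the closed form $2^{-N-1}$.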
However, SLPs can also be used to define a polynomially decaying distribution over
the natural numbers as follows.
Example 11.4
Polynomial distribution. Figure 11.5 shows a recursive SLP $P$ which describes a
polynomial distribution over the natural numbers expressed in reverse binary form.
Numbers are constructed by first choosing the length of the binary representation
and then filling out the binary expression by repeated tossing of a fair coin. Since
the probability of choosing a number $N$ of length $\log_2(N)$ is roughly $2^{-\log_2(N)}$ and
there are $2^{\log_2(N)}$ such numbers, each with equal probability, $Pr(natp(N)|P) \approx
2^{-2\log_2(N)} = N^{-2}$.
11.5
Learning Techniques
We will now briefly introduce the existing learning techniques for SLPs.
We begin with a description of the data used for learning. After defining
the two notions, we then turn to parameter estimation and finally to structure
learning.
11.5.1
Data Used
For SLP, as for stochastic context-free grammars, the evidence used for learning is
facts or even clauses.
11.5.2
Parameter Estimation
The aim of parameter estimation is, given a set of examples, to infer the values $\lambda^*$
of the parameters $\lambda$ (which represent the quantitative part of the model) that best
justify the set of examples. We will focus on maximum likelihood estimation
(MLE), which tries to find $\lambda^* = \arg\max_{\lambda} P(E|L, \lambda)$. However, the MLE cannot be
calculated exactly when data is missing, so the expectation maximization (EM) algorithm
is the most commonly used technique.
As described in [12], EM assumes that the parameters have been initialized (e.g.,
at random) and then iteratively performs the following two steps until convergence:
E-Step: on the basis of the observed data and the present parameters of the model,
compute a distribution over all possible completions of each partially observed
data case.
M-Step: Using each completion as a fully-observed data case weighted by its probability, compute the updated parameter using (weighted) frequency counting.
For SLP, one uses the failure-adjusted maximization (FAM) algorithm introduced
by Cussens [3]. The parameters are learned from the evidence, whereas the
logical part of the SLP is given. The examples consist of atoms for a predicate p and
are logically entailed by the SLP, since they are generated from the target SLP. In
order to estimate the parameters, SLD trees are computed for each example. Each
path from root to leaf is considered as one of the possible completions. Then, one
weights these completions with the product of probabilities associated with the
clauses that are used in the completions. Finally, one obtains the improved
estimates for each clause by dividing the clause's expected counts by the sum of
the expected counts of clauses for the same predicate.
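The final normalization step described above can be sketched in Python. This is only the re-estimation step, not the full FAM algorithm: the expected counts are assumed to have been computed already from the SLD trees, and the clause names and counts in the usage lines are made up for illustration.

```python
from collections import defaultdict

def reestimate(expected_counts, predicate_of):
    """Divide each clause's expected count by the total for its predicate.

    expected_counts: dict mapping clause name -> expected count (float)
    predicate_of:    dict mapping clause name -> predicate of the clause head
    Assumes every predicate has a strictly positive total expected count.
    """
    totals = defaultdict(float)
    for clause, count in expected_counts.items():
        totals[predicate_of[clause]] += count
    return {clause: count / totals[predicate_of[clause]]
            for clause, count in expected_counts.items()}

# Hypothetical expected counts for the two nate clauses.
counts = {"nate_base": 30.0, "nate_rec": 10.0}
heads = {"nate_base": "nate", "nate_rec": "nate"}
labels = reestimate(counts, heads)
```

After the update, the labels of the clauses for each predicate sum to 1, as required of an SLP.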
11.5.3
Structure Learning
Given a set of examples $E$ and a language bias $B$, which determines the set of
possible hypotheses, one searches for a hypothesis $H^* \in B$ such that
1. $H^*$ logically covers the examples $E$, i.e., $cover(H^*, E)$, and
2. the hypothesis $H^*$ is optimal w.r.t. some scoring function $score$, i.e., $H^* =
\arg\max_{H \in B} score(H, E)$.
The hypotheses are of the form $(L, \lambda)$ where $L$ is the logical part and $\lambda$ the vector
of parameter values defined in section 11.5.2.
The existing approaches use a heuristic search through the space of hypotheses.
Hill-climbing or beam search are typical methods that are applied until the candidate hypothesis satisfies the two conditions defined above. One applies refinement
operators during the steps in the search space.
For SLPs, as described in [12], structure learning involves applying a refinement
operator at the theory level (i.e., considering multiple predicates) under entailment.
This amounts to theory revision in inductive logic programming. Because this problem is known to be
very hard, the existing approaches have been restricted to learning missing clauses for a
single predicate. Muggleton [7] introduced a two-phase approach that separates the
structure learning aspects from the parameter estimation phase. In a more recent
approach, Muggleton [8] presents an initial attempt to integrate both phases for
single predicate learning.
11.6
Conclusion
Stochastic logic programs provide a simple scheme for representing probability
distributions over structured objects. This chapter provides a tutorial for the use of
SLPs as a means of representing probability distributions over structured objects
such as sequences, graphs, and plans.
SLPs were initially applied to the problem of learning from positive examples
only [6]. This required the implementation of the following function which defines
the generality of a hypothesis.
$$g(H) = \sum_{x \in H} D_X(x).$$
The generality is thus the sum of the probability of all instances of hypothesis $H$.
Clearly such a sum can be infinite. However, if a large enough sample is generated
from $D_X$ (implemented as an SLP), then the proportion of the sample entailed by
$H$ gives a good approximation of $g(H)$.
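The sampling approximation can be sketched in Python as follows. As an illustrative assumption, $D_X$ is taken to be the exponential distribution over natural numbers from example 11.3, and the hypothesis is taken to cover exactly the even numbers; neither choice comes from [6]. For this pair the exact generality is 2/3, which the estimate should approach.

```python
import random

def sample_nate(rng):
    """Sample N with probability 2^-(N+1), as in the exponential SLP."""
    n = 0
    while rng.random() < 0.5:   # choose the recursive clause with probability 0.5
        n += 1
    return n

def estimate_generality(covers, sampler, n_samples, rng):
    """Proportion of sampled instances that the hypothesis covers."""
    hits = sum(1 for _ in range(n_samples) if covers(sampler(rng)))
    return hits / n_samples

rng = random.Random(0)
estimate = estimate_generality(lambda n: n % 2 == 0, sample_nate, 20000, rng)
```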
Acknowledgments
Our thanks for useful discussions on the topics in this chapter with James Cussens,
Kristian Kersting, Jianzhong Chen, and Hiroaki Watanabe. This work was supported by the Esprit IST project Application of Probabilistic Inductive Logic
Programming II (APRIL II) and the DTI Beacon project, Metalog - Integrated
Machine Learning of Metabolic Networks Applied to Predictive Toxicology.
References
[1] I. Bratko. Prolog for Artificial Intelligence. Addison-Wesley, London, 1986.
[2] W.F. Clocksin and C.S. Mellish. Programming in Prolog. Springer-Verlag,
Berlin, 1981.
[3] J. Cussens. Parameter estimation in stochastic logic programs. Machine
Learning, 44(3):245–271, 2001.
on
the
Blackjack
game,
Interest in statistical relational learning (SRL) has grown rapidly in recent years.
Several key SRL tasks have been identified, and a large number of approaches have
been proposed. Increasingly, a unifying framework is needed to facilitate transfer of
knowledge across tasks and approaches, to compare approaches, and to help bring
structure to the field. We propose Markov logic as such a framework. Syntactically,
Markov logic is indistinguishable from first-order logic, except that each formula
has a weight attached. Semantically, a set of Markov logic formulae represents a
probability distribution over possible worlds, in the form of a log-linear model with
one feature per grounding of a formula in the set, with the corresponding weight.
We show how approaches like probabilistic relational models, knowledge-based
model construction, and stochastic logic programs can be mapped into Markov
logic. We also show how tasks like collective classification, link prediction, link-based clustering, social network modeling, and object identification can be concisely
formulated in Markov logic. Finally, we develop learning and inference algorithms
for Markov logic, and report experimental results on a link prediction task.
12.1
SRL approaches have been proposed, including knowledge-based model construction [55, 39, 29], stochastic logic programs [37, 9], PRISM [51], MACCENT [12],
probabilistic relational models [17], relational Markov models [1], relational Markov
networks [53], relational dependency networks [38], structural logistic regression
[44], relational generation functions [7], constraint logic programming for probabilistic knowledge (CLP(BN)) [50], and others.
While the variety of problems and approaches in the field is valuable, it makes
it difficult for researchers, students, and practitioners to identify, learn, and apply
the essentials. In particular, for the most part, the relationships between different
approaches and their relative strengths and weaknesses remain poorly understood,
and innovations in one task or application do not easily transfer to others, slowing
down progress. There is thus an increasingly pressing need for a unifying framework,
a common language for describing and relating the different tasks and approaches.
To be most useful, such a framework should satisfy the following desiderata:
1. The framework must incorporate both first-order logic and probabilistic graphical
models. Otherwise some current or future SRL approaches will fall outside its
scope.
2. SRL problems should be representable clearly and simply in the framework.
3. The framework must facilitate the use of domain knowledge in SRL. Because
the search space for SRL algorithms is very large even by AI standards, domain
knowledge is critical to success. Conversely, the ability to incorporate rich domain
knowledge is one of the most attractive features of SRL.
4. The framework should facilitate the extension to SRL of techniques from statistical learning, inductive logic programming, probabilistic inference, and logical
inference. This will speed progress in SRL by taking advantage of the large extant
literature in these areas.
In this chapter we propose Markov logic as a framework that we believe meets
all of these desiderata. We begin by briefly reviewing the necessary background
in Markov networks (section 12.2) and first-order logic (section 12.3). We then
introduce Markov logic (section 12.4) and describe how several SRL approaches
and tasks can be formulated in this framework (sections 12.5 and 12.6). Next,
we show how techniques from logic, probabilistic inference, statistics and inductive
logic programming can be used to obtain practical inference and learning algorithms
for Markov logic (sections 12.7 and 12.8). Finally, we illustrate the application of
these algorithms in a real-world link prediction task (section 12.9) and conclude
(section 12.10).
12.2
Markov Networks
A Markov network (also known as a Markov random field) is a model for the joint
distribution of a set of variables $X = (X_1, X_2, \dots, X_n) \in \mathcal{X}$ [41]. It is composed of
an undirected graph $G$ and a set of potential functions $\phi_k$. The graph has a node
for each variable, and the model has a potential function for each clique in the
graph. A potential function is a non-negative real-valued function of the state of
the corresponding clique. The joint distribution represented by a Markov network
is given by
$$P(X = x) = \frac{1}{Z} \prod_k \phi_k(x_{\{k\}}), \qquad (12.1)$$
where $x_{\{k\}}$ is the state of the kth clique (i.e., the state of the variables that appear in
that clique). $Z$, known as the partition function, is given by $Z = \sum_{x \in \mathcal{X}} \prod_k \phi_k(x_{\{k\}})$.
Markov networks are often conveniently represented as log-linear models, with each
clique potential replaced by an exponentiated weighted sum of features of the state,
leading to
$$P(X = x) = \frac{1}{Z} \exp\Big( \sum_j w_j f_j(x) \Big). \qquad (12.2)$$
A feature may be any real-valued function of the state. This chapter will focus on
binary features, $f_j(x) \in \{0, 1\}$. In the most direct translation from the potential-function form (12.1), there is one feature corresponding to each possible state $x_{\{k\}}$
of each clique, with its weight being $\log \phi_k(x_{\{k\}})$. This representation is exponential
in the size of the cliques. However, we are free to specify a much smaller number
of features (e.g., logical functions of the state of the clique), allowing for a more
compact representation than the potential-function form, particularly when large
cliques are present. Markov logic networks (MLNs) will take advantage of this.
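To make the log-linear form (12.2) concrete, the following Python sketch enumerates all states of a small network with binary variables and normalizes the exponentiated weighted feature sums. The two features and their weights are arbitrary choices made for illustration, and brute-force enumeration is only feasible for a handful of variables.

```python
import itertools
import math

def log_linear_distribution(n_vars, features, weights):
    """Return P(X = x) for every state x of n_vars binary variables."""
    states = list(itertools.product([0, 1], repeat=n_vars))
    # Unnormalized score exp(sum_j w_j f_j(x)) for each state.
    scores = [math.exp(sum(w * f(x) for w, f in zip(weights, features)))
              for x in states]
    z = sum(scores)   # partition function Z
    return {x: s / z for x, s in zip(states, scores)}

# Illustrative features: "the two variables agree" and "the first variable is 1".
features = [lambda x: int(x[0] == x[1]), lambda x: x[0]]
weights = [1.0, 0.5]
dist = log_linear_distribution(2, features, weights)
```

With these weights the state (1, 1) satisfies both features and therefore receives the highest probability.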
Inference in Markov networks is #P-complete [49]. The most widely used method
for approximate inference in Markov networks is Markov chain Monte Carlo
(MCMC) [20], and in particular Gibbs sampling, which proceeds by sampling each
variable in turn given its Markov blanket. (The Markov blanket of a node is the
minimal set of nodes that renders it independent of the remaining network; in a
Markov network, this is simply the node's neighbors in the graph.) Marginal probabilities are computed by counting over these samples; conditional probabilities are
computed by running the Gibbs sampler with the conditioning variables clamped
to their given values. Another popular method for inference in Markov networks is
belief propagation [57].
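A minimal Gibbs sampler for a log-linear model over binary variables might look as follows in Python. This is a sketch under simplifying assumptions: each variable is resampled conditioned on all other variables (which gives the same conditional as conditioning on its Markov blanket, since features not involving the variable cancel), the features and weights are arbitrary illustrative choices, and no convergence diagnostics are included.

```python
import math
import random

def score(x, features, weights):
    """Weighted feature sum, i.e. the log of the unnormalized probability."""
    return sum(w * f(x) for w, f in zip(weights, features))

def gibbs_marginals(n_vars, features, weights, n_sweeps, rng, burn_in=100):
    """Estimate P(X_i = 1) for each variable by Gibbs sampling."""
    x = [rng.randint(0, 1) for _ in range(n_vars)]
    counts = [0] * n_vars
    for sweep in range(burn_in + n_sweeps):
        for i in range(n_vars):
            x[i] = 1
            s1 = score(x, features, weights)
            x[i] = 0
            s0 = score(x, features, weights)
            p1 = 1.0 / (1.0 + math.exp(s0 - s1))   # P(X_i = 1 | all other variables)
            x[i] = 1 if rng.random() < p1 else 0
        if sweep >= burn_in:                       # count samples after burn-in
            for i in range(n_vars):
                counts[i] += x[i]
    return [c / n_sweeps for c in counts]

features = [lambda x: int(x[0] == x[1]), lambda x: x[0]]
weights = [1.0, 0.5]
marginals = gibbs_marginals(2, features, weights, 20000, random.Random(0))
```

To compute conditional probabilities, the same loop would simply skip the clamped variables and leave them at their given values.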
Maximum likelihood or maximum a posteriori (MAP) estimates of Markov network weights cannot be computed in closed form, but, because the log-likelihood
is a concave function of the weights, they can be found efficiently using standard
12.3
First-Order Logic
A first-order knowledge base (KB) is a set of sentences or formulae in first-order logic
[18]. Formulae are constructed using four types of symbols: constants, variables,
functions, and predicates. Constant symbols represent objects in the domain of
interest (e.g., people: Anna, Bob, Chris, etc.). Variable symbols range over the
objects in the domain. Function symbols (e.g., MotherOf) represent mappings from
tuples of objects to objects. Predicate symbols represent relations among objects in
the domain (e.g., Friends) or attributes of objects (e.g., Smokes). An interpretation
specifies which objects, functions, and relations in the domain are represented by
which symbols. Variables and constants may be typed, in which case variables range
only over objects of the corresponding type, and constants can only represent
objects of the corresponding type. For example, the variable x might range over
people (e.g., Anna, Bob, etc.), and the constant C might represent a city (e.g.,
Seattle, Tokyo, etc.).
A term is any expression representing an object in the domain. It can be a
constant, a variable, or a function applied to a tuple of terms. For example, Anna,
x, and GreatestCommonDivisor(x, y) are terms. An atomic formula or atom is a
predicate symbol applied to a tuple of terms (e.g., Friends(x, MotherOf(Anna))).
Formulae are recursively constructed from atomic formulae using logical connectives
and quantifiers. If $F_1$ and $F_2$ are formulae, the following are also formulae: $\neg F_1$
(negation), which is true iff $F_1$ is false; $F_1 \land F_2$ (conjunction), which is true iff both
$F_1$ and $F_2$ are true; $F_1 \lor F_2$ (disjunction), which is true iff $F_1$ or $F_2$ is true; $F_1 \Rightarrow F_2$
(implication), which is true iff $F_1$ is false or $F_2$ is true; $F_1 \Leftrightarrow F_2$ (equivalence), which
is true iff $F_1$ and $F_2$ have the same truth-value; $\forall x\, F_1$ (universal quantification),
which is true iff $F_1$ is true for every object x in the domain; and $\exists x\, F_1$ (existential
quantification), which is true iff $F_1$ is true for at least one object x in the domain.
Parentheses may be used to enforce precedence. A positive literal is an atomic
formula; a negative literal is a negated atomic formula. The formulae in a KB are
implicitly conjoined, and thus a KB can be viewed as a single large formula. A
ground term is a term containing no variables. A ground atom or ground predicate
is an atomic formula all of whose arguments are ground terms. A possible world or
Herbrand interpretation assigns a truth value to each possible ground predicate.
A formula is satisfiable iff there exists at least one world in which it is true. The
basic inference problem in first-order logic is to determine whether a knowledge
base KB entails a formula $F$, i.e., if $F$ is true in all worlds where KB is true
(denoted by $KB \models F$). This is often done by refutation: KB entails $F$ iff $KB \cup \neg F$
is unsatisfiable. (Thus, if a KB contains a contradiction, all formulae trivially follow
from it, which makes painstaking knowledge engineering a necessity.) For automated
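Table 12.1 Example of a first-order knowledge base and MLN. Fr() is short for
Friends(), Sm() for Smokes(), and Ca() for Cancer()
English | First-order logic | Clausal form | Weight
Friends of friends are friends. | $\forall x \forall y \forall z\; Fr(x, y) \land Fr(y, z) \Rightarrow Fr(x, z)$ | $\neg Fr(x, y) \lor \neg Fr(y, z) \lor Fr(x, z)$ | 0.7
Friendless people smoke. | $\forall x\; (\neg(\exists y\; Fr(x, y)) \Rightarrow Sm(x))$ | $Fr(x, g(x)) \lor Sm(x)$ | 2.3
Smoking causes cancer. | $\forall x\; Sm(x) \Rightarrow Ca(x)$ | $\neg Sm(x) \lor Ca(x)$ | 1.5
If two people are friends, either both smoke or neither does. | $\forall x \forall y\; Fr(x, y) \Rightarrow (Sm(x) \Leftrightarrow Sm(y))$ | $\neg Fr(x, y) \lor Sm(x) \lor \neg Sm(y)$ | 1.1
 | | $\neg Fr(x, y) \lor \neg Sm(x) \lor Sm(y)$ | 1.1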
12.4
Markov Logic
A first-order KB can be seen as a set of hard constraints on the set of possible
worlds: if a world violates even one formula, it has zero probability. The basic idea
in Markov logic is to soften these constraints: when a world violates one formula in
the KB it is less probable, but not impossible. The fewer formulae a world violates,
the more probable it is. Each formula has an associated weight that reflects how
strong a constraint it is: the higher the weight, the greater the difference in log
probability between a world that satisfies the formula and one that does not, other
things being equal. We call a set of formulae in Markov logic a Markov logic network.
MLNs define probability distributions over possible worlds [21] as follows.
Definition 12.1
An MLN $L$ is a set of pairs $(F_i, w_i)$, where $F_i$ is a formula in first-order logic and
$w_i$ is a real number. Together with a finite set of constants $C = \{c_1, c_2, \dots, c_{|C|}\}$,
it defines a Markov network $M_{L,C}$ ((12.1) and (12.2)) as follows:
1. $M_{L,C}$ contains one binary node for each possible grounding of each predicate
appearing in $L$. The value of the node is 1 if the ground atom is true, and 0
otherwise.
2. $M_{L,C}$ contains one feature for each possible grounding of each formula $F_i$ in $L$.
The value of this feature is 1 if the ground formula is true, and 0 otherwise. The
weight of the feature is the $w_i$ associated with $F_i$ in $L$.
The syntax of the formulae in an MLN is the standard syntax of first-order
logic [18]. Free (unquantified) variables are treated as universally quantified at the
outermost level of the formula.
An MLN can be viewed as a template for constructing Markov networks. Given
different sets of constants, it will produce different networks, and these may be of
widely varying size, but all will have certain regularities in structure and parameters,
given by the MLN (e.g., all groundings of the same formula will have the same
weight). We call each of these networks a ground Markov network to distinguish it
from the first-order MLN. From definition 12.1 and (12.1) and (12.2), the probability
distribution over possible worlds $x$ specified by the ground Markov network $M_{L,C}$
is given by
$$P(X = x) = \frac{1}{Z} \exp\Big( \sum_i w_i n_i(x) \Big) = \frac{1}{Z} \prod_i \phi_i(x_{\{i\}})^{n_i(x)}, \qquad (12.3)$$
where $n_i(x)$ is the number of true groundings of $F_i$ in $x$, $x_{\{i\}}$ is the state (truth
values) of the atoms appearing in $F_i$, and $\phi_i(x_{\{i\}}) = e^{w_i}$. Notice that, although we
defined MLNs as log-linear models, they could equally well be defined as products
of potential functions, as the second equality above shows. This will be the most
convenient approach in domains with a mixture of hard and soft constraints (i.e.,
where some formulae hold with certainty, leading to zero probabilities for some
worlds).
Figure 12.1 [Graph over the ground atoms Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B), Smokes(A), Smokes(B), Cancer(A), Cancer(B).]
The graphical structure of $M_{L,C}$ follows from definition 12.1: there is an edge
between two nodes of $M_{L,C}$ iff the corresponding ground atoms appear together
in at least one grounding of one formula in $L$. Thus, the atoms in each ground
formula form a (not necessarily maximal) clique in $M_{L,C}$. Figure 12.1 shows the
graph of the ground Markov network defined by the last two formulae in table 12.1
and the constants Anna and Bob. Each node in this graph is a ground atom (e.g.,
Friends(Anna, Bob)). The graph contains an arc between each pair of atoms that
appear together in some grounding of one of the formulae. $M_{L,C}$ can now be used
to infer the probability that Anna and Bob are friends given their smoking habits,
the probability that Bob has cancer given his friendship with Anna and whether
she has cancer, etc.
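For a network this small, such queries can be answered by brute-force enumeration of all possible worlds. The Python sketch below does so for the last two formulae of table 12.1 over the constants A and B, assuming the weights 1.5 and 1.1 from the table and assuming that the friendship formula is handled as two clauses of weight 1.1 each. All function names are hypothetical, and enumeration is exponential in the number of ground atoms, so this is illustrative only.

```python
import itertools
import math

CONSTANTS = ["A", "B"]
ATOMS = ([("Smokes", c) for c in CONSTANTS]
         + [("Cancer", c) for c in CONSTANTS]
         + [("Friends", a, b) for a in CONSTANTS for b in CONSTANTS])

def n_smoking_causes_cancer(w):
    """True groundings of  not Sm(x) or Ca(x)."""
    return sum(int((not w[("Smokes", x)]) or w[("Cancer", x)]) for x in CONSTANTS)

def n_friend_clause(w, flip):
    """True groundings of one clause of Fr(x,y) => (Sm(x) <=> Sm(y))."""
    total = 0
    for x, y in itertools.product(CONSTANTS, repeat=2):
        a, b = (y, x) if flip else (x, y)   # which smoker literal is negated
        total += int((not w[("Friends", x, y)]) or w[("Smokes", a)]
                     or (not w[("Smokes", b)]))
    return total

def world_weight(w):
    """Unnormalized probability exp(sum_i w_i n_i(x)) from (12.3)."""
    return math.exp(1.5 * n_smoking_causes_cancer(w)
                    + 1.1 * n_friend_clause(w, False)
                    + 1.1 * n_friend_clause(w, True))

WORLDS = [dict(zip(ATOMS, values))
          for values in itertools.product([False, True], repeat=len(ATOMS))]

def probability(event, given=lambda w: True):
    """P(event | given), computed by summing world weights."""
    num = sum(world_weight(w) for w in WORLDS if given(w) and event(w))
    den = sum(world_weight(w) for w in WORLDS if given(w))
    return num / den

p_cancer_given_smokes = probability(lambda w: w[("Cancer", "A")],
                                    given=lambda w: w[("Smokes", "A")])
```

Because Cancer(A) appears only in the grounding of the smoking formula for A, the conditional probability above depends only on the weight 1.5.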
Each state of $M_{L,C}$ represents a possible world. A possible world is a set of
objects, a set of functions (mappings from tuples of objects to objects), and a
set of relations that hold between those objects; together with an interpretation,
they determine the truth-value of each ground atom. The following assumptions
ensure that the set of possible worlds for $(L, C)$ is finite, and that $M_{L,C}$ represents
a unique, well-defined probability distribution over those worlds, irrespective of
the interpretation and domain. These assumptions are quite reasonable in most
practical applications, and greatly simplify the use of MLNs. For the remaining
cases, we discuss below the extent to which each one can be relaxed.
Assumption 1
Unique names. Different constants refer to different objects [18].
Assumption 2
Domain closure. The only objects in the domain are those representable using the
constant and function symbols in $(L, C)$ [18].
Assumption 3
Known functions. For each function appearing in $L$, the value of that function
applied to every possible tuple of arguments is known, and is an element of $C$.
Table 12.2 Grounding of a formula under assumptions 1–3
function Ground(F, C)
  inputs: F, a formula in first-order logic
          C, a set of constants
  output: G_F, a set of ground formulae
  calls: CNF(F, C), which converts F to conjunctive normal form, replacing
         existentially quantified formulae by disjunctions of their groundings over C
  F ← CNF(F, C)
  G_F = ∅
  for each clause F_j ∈ F
    G_j = {F_j}
    for each variable x in F_j
      for each clause F_k(x) ∈ G_j
        G_j ← (G_j \ F_k(x)) ∪ {F_k(c_1), F_k(c_2), . . . , F_k(c_|C|)},
          where F_k(c_i) is F_k(x) with x replaced by c_i ∈ C
    G_F ← G_F ∪ G_j
  for each ground clause F_j ∈ G_F
    repeat
      for each function f(a_1, a_2, . . .) all of whose arguments are constants
        F_j ← F_j with f(a_1, a_2, . . .) replaced by c, where c = f(a_1, a_2, . . .)
    until F_j contains no functions
  return G_F
This last assumption allows us to replace functions by their values when grounding formulae. Thus the only ground atoms that need to be considered are those
having constants as arguments. The infinite number of terms constructible from all
functions and constants in $(L, C)$ (the Herbrand universe of $(L, C)$) can be ignored,
because each of those terms corresponds to a known constant in $C$, and atoms
involving them are already represented as the atoms involving the corresponding
constants. The possible groundings of a predicate in definition 12.1 are thus obtained simply by replacing each variable in the predicate with each constant in $C$,
and replacing each function term in the predicate by the corresponding constant.
Table 12.2 shows how the groundings of a formula are obtained given assumptions 1–3. If a formula contains more than one clause, its weight is divided equally
among the clauses, and a clause's weight is assigned to each of its groundings.
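The variable-substitution step of the grounding procedure can be sketched in Python for a single function-free clause. The representation of literals as (sign, predicate, arguments) triples is an assumption made for this illustration; the sketch does not cover conversion to conjunctive normal form, existential quantifiers, or the replacement of function terms.

```python
import itertools

def ground_clause(literals, variables, constants):
    """Return all groundings of a clause over the given constants.

    literals:  list of (sign, predicate, args); sign is True for a positive literal
    variables: names in args that are variables (everything else is a constant)
    """
    groundings = []
    for values in itertools.product(constants, repeat=len(variables)):
        theta = dict(zip(variables, values))   # one substitution of constants
        groundings.append([(sign, pred, tuple(theta.get(a, a) for a in args))
                           for sign, pred, args in literals])
    return groundings

# The clause  not Sm(x) or Ca(x)  grounded over two constants.
smoking = [(False, "Sm", ("x",)), (True, "Ca", ("x",))]
ground_smoking = ground_clause(smoking, ["x"], ["Anna", "Bob"])

# A clause with two variables yields |C|^2 groundings.
friends = [(False, "Fr", ("x", "y")), (True, "Sm", ("x",)), (False, "Sm", ("y",))]
ground_friends = ground_clause(friends, ["x", "y"], ["Anna", "Bob"])
```

A clause with v distinct variables produces $|C|^v$ groundings, which is why ground networks can grow quickly with the number of constants.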
Assumption 1 (unique names) can be removed by introducing the equality
predicate (Equals(x, y), or x = y for short) and adding the necessary axioms to the
MLN: equality is reflexive, symmetric, and transitive; for each unary predicate P,
$\forall x \forall y\; x = y \Rightarrow (P(x) \Leftrightarrow P(y))$; and similarly for higher-order predicates and functions
[18]. The resulting MLN will have a node for each pair of constants, whose value
is 1 if the constants represent the same object and 0 otherwise; these nodes will
be connected to each other and to the rest of the network by arcs representing the
axioms above. Notice that this allows us to make probabilistic inferences about the
equality of two constants. We have successfully used this as the basis of an approach
to object identification (see section 12.6.5).
If the number $u$ of unknown objects is known, assumption 2 (domain closure) can
be removed simply by introducing $u$ arbitrary new constants. If $u$ is unknown but
finite, assumption 2 can be removed by introducing a distribution over $u$, grounding
the MLN with each number of unknown objects, and computing the probability of a
formula $F$ as $P(F) = \sum_{u=0}^{u_{max}} P(u) P(F \mid M^u_{L,C})$, where $M^u_{L,C}$ is the ground MLN with
$u$ unknown objects. An infinite $u$ requires extending MLNs to the case $|C| = \infty$.
Let $H_{L,C}$ be the set of all ground terms constructible from the function symbols in
$L$ and the constants in $L$ and $C$ (the Herbrand universe of $(L, C)$). Assumption 3
(known functions) can be removed by treating each element of $H_{L,C}$ as an additional
constant and applying the same procedure used to remove the unique names
assumption. For example, with a function G(x) and constants A and B, the MLN
will now contain nodes for G(A) = A, G(A) = B, etc. This leads to an infinite number
of new constants, requiring the corresponding extension of MLNs. However, if we
restrict the level of nesting to some maximum, the resulting MLN is still finite.
To summarize, assumptions 1–3 can be removed as long as the domain is finite.
We believe it is possible to extend MLNs to infinite domains (see Jaeger [27]), but
this is an issue of chiefly theoretical interest, and we leave it for future work. In the
remainder of this chapter we proceed under assumptions 1–3, except where noted.
A first-order KB can be transformed into an MLN simply by assigning a weight
to each formula. For example, the clauses and weights in the last two columns of
table 12.1 constitute an MLN. According to this MLN, other things being equal, a
world where $n$ friendless people are nonsmokers is $e^{(2.3)n}$ times less probable than
a world where all friendless people smoke. Notice that all the formulae in table 12.1
are false in the real world as universally quantified logical statements, but capture
useful information on friendships and smoking habits, when viewed as features of
a Markov network. For example, it is well-known that teenage friends tend to have
similar smoking habits [35]. In fact, an MLN like the one in table 12.1 succinctly
represents a type of model that is a staple of social network analysis [54].
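The factor quoted above for friendless nonsmokers can be derived directly from (12.3). As a sketch, let $x_0$ be a world in which all friendless people smoke and $x_n$ a world that differs only in that $n$ friendless people are nonsmokers, and let $F_2$ denote the formula "friendless people smoke" with weight $w_2 = 2.3$. The labels $x_0$, $x_n$, and $F_2$ are introduced here for illustration, and "other things being equal" is read as meaning that the counts $n_i$ of all other formulae are unchanged.

```latex
\begin{align*}
n_2(x_n) &= n_2(x_0) - n
  && \text{($n$ groundings of $F_2$ become false)} \\
\frac{P(X = x_0)}{P(X = x_n)}
  &= \frac{\exp\big(\sum_i w_i\, n_i(x_0)\big)/Z}{\exp\big(\sum_i w_i\, n_i(x_n)\big)/Z}
   = \exp\big(w_2\,[\,n_2(x_0) - n_2(x_n)\,]\big) \\
  &= e^{(2.3)\,n}.
\end{align*}
```

The partition function $Z$ cancels because both worlds belong to the same ground network.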
It is easy to see that MLNs subsume essentially all propositional probabilistic
models, as detailed below.
Proposition 12.2
Every probability distribution over discrete or finite-precision numeric variables can
be represented as a Markov logic network.
Proof Consider first the case of Boolean variables $(X_1, X_2, \dots, X_n)$. Define a
predicate of zero arity $R_h$ for each variable $X_h$, and include in the MLN $L$ a
formula for each possible state of $(X_1, X_2, \dots, X_n)$. This formula is a conjunction
of $n$ literals, with the hth literal being $R_h()$ if $X_h$ is true in the state, and $\neg R_h()$
otherwise. The formula's weight is $\log P(X_1, X_2, \dots, X_n)$. (If some states have zero
probability, use instead the product form (see (12.3)), with $\phi_i()$ equal to the
probability of the ith state.) Since all predicates in $L$ have zero arity, $L$ defines the
same Markov network $M_{L,C}$ irrespective of $C$, with one node for each variable $X_h$.
For any state, the corresponding formula is true and all others are false, and thus
(12.3) represents the original distribution (notice that Z = 1). The generalization
to arbitrary discrete variables is straightforward, by dening a zero-arity predicate
for each value of each variable. Similarly for nite-precision numeric variables, by
noting that they can be represented as Boolean vectors.
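The construction in the proof can be checked numerically on a toy case. The sketch below is only an illustration: the joint distribution and the function names are made up here, and it assumes every state has nonzero probability so that the log weights are finite.

```python
import itertools
import math

def mln_from_distribution(joint):
    """One conjunction per state; its weight is log P(state)."""
    return {state: math.log(p) for state, p in joint.items()}

def mln_probabilities(weights, n):
    """Evaluate exp(sum_i w_i n_i(x)) / Z over all 2^n worlds."""
    worlds = list(itertools.product([False, True], repeat=n))
    # The conjunction built for a state is true only in the world equal to it.
    scores = {x: math.exp(sum(w for s, w in weights.items() if s == x))
              for x in worlds}
    z = sum(scores.values())
    return {x: score / z for x, score in scores.items()}, z

# Hypothetical joint distribution over two Boolean variables (X1, X2).
joint = {
    (False, False): 0.1, (False, True): 0.2,
    (True, False): 0.3, (True, True): 0.4,
}
weights = mln_from_distribution(joint)
probs, z = mln_probabilities(weights, n=2)
print(z, probs)
```

Because exactly one formula is true in each world, the unnormalized score of a world is its original probability, which is why the partition function comes out as 1.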
Of course, compact factored models like Markov networks and Bayesian networks can still be represented compactly by MLNs, by defining formulae for the
corresponding factors (arbitrary features in Markov networks, and states of a node
and its parents in Bayesian networks).2
First-order logic (with assumptions 1–3 above) is the special case of Markov logic
obtained when all weights are equal and tend to infinity, as described below.
Proposition 12.3
Let KB be a satisfiable knowledge base, L be the MLN obtained by assigning
weight w to every formula in KB, C be the set of constants appearing in KB,
$P_w(x)$ be the probability assigned to a (set of) possible world(s) x by $M_{L,C}$, $X_{KB}$
be the set of worlds that satisfy KB, and F be an arbitrary formula in first-order
logic. Then:
1. $\forall x \in X_{KB}$: $\lim_{w \to \infty} P_w(x) = |X_{KB}|^{-1}$
$\forall x \notin X_{KB}$: $\lim_{w \to \infty} P_w(x) = 0$
2. For all F, $KB \models F$ iff $\lim_{w \to \infty} P_w(F) = 1$.
Proof Let k be the number of ground formulae in $M_{L,C}$. By Equation 12.3, if $x \in X_{KB}$,
then $P_w(x) = e^{kw}/Z$, and if $x \notin X_{KB}$ then $P_w(x) \le e^{(k-1)w}/Z$. Thus all
$x \in X_{KB}$ are equiprobable and $\lim_{w \to \infty} P(X \setminus X_{KB})/P(X_{KB}) \le \lim_{w \to \infty} (|X \setminus X_{KB}|/|X_{KB}|)\, e^{-w} = 0$,
proving part 1. By definition of entailment, $KB \models F$ iff
every world that satisfies KB also satisfies F. Therefore, letting $X_F$ be the set of
worlds that satisfy F, if $KB \models F$, then $X_{KB} \subseteq X_F$ and $P_w(F) = \sum_{x \in X_F} P_w(x) \ge P_w(X_{KB})$.
Since, from part 1, $\lim_{w \to \infty} P_w(X_{KB}) = 1$, this implies that if $KB \models F$,
then $\lim_{w \to \infty} P_w(F) = 1$. The inverse direction of part 2 is proved by noting that
if $\lim_{w \to \infty} P_w(F) = 1$, then every world with nonzero probability must satisfy F,
and this includes every world in $X_{KB}$.
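To see the limit at work, the following sketch evaluates $P_w$ for growing w on a small propositional KB. The KB, the query, and the helper names are hypothetical choices for illustration; with zero-arity predicates each formula has a single grounding, so the exponent is w times the number of satisfied formulas.

```python
import itertools
import math

def world_probabilities(formulas, n_atoms, w):
    """P_w(x) proportional to exp(w * number of formulas true in x)."""
    worlds = list(itertools.product([False, True], repeat=n_atoms))
    counts = [sum(bool(f(x)) for f in formulas) for x in worlds]
    top = max(counts)  # subtract the maximum count for numerical stability
    scores = [math.exp(w * (c - top)) for c in counts]
    z = sum(scores)
    return {x: s / z for x, s in zip(worlds, scores)}

# Hypothetical KB over atoms (A, B, C): {A v B, ~A v C}.
kb = [lambda x: x[0] or x[1], lambda x: (not x[0]) or x[2]]
satisfying = [x for x in itertools.product([False, True], repeat=3)
              if all(f(x) for f in kb)]
query = lambda x: x[1] or x[2]  # B v C, which this KB entails

for w in (1.0, 5.0, 30.0):
    p = world_probabilities(kb, 3, w)
    mass_on_kb = sum(p[x] for x in satisfying)
    p_query = sum(px for x, px in p.items() if query(x))
    print(f"w={w:5.1f}  P(X_KB)={mass_on_kb:.6f}  P(F)={p_query:.6f}")
```

As w grows, the mass concentrates uniformly on the satisfying worlds and the probability of the entailed query approaches 1, as the proposition states.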
In other words, in the limit of all equal infinite weights, the MLN represents a
uniform distribution over the worlds that satisfy the KB, and all entailment queries
can be answered by computing the probability of the query formula and checking
whether it is 1. Even when weights are finite, first-order logic is "embedded" in
Markov logic in the following sense. Assume without loss of generality that all
weights are non-negative. (A formula with a negative weight w can be replaced
by its negation with weight $-w$.) If the KB composed of the formulae in an
MLN L (negated, if their weight is negative) is satisfiable, then, for any C, the
satisfying assignments are the modes of the distribution represented by $M_{L,C}$. This
is because the modes are the worlds x with maximum $\sum_i w_i n_i(x)$ (see Equation 12.3), and
this expression is maximized when all groundings of all formulae are true (i.e., the
KB is satisfied). Unlike an ordinary first-order KB, however, an MLN can produce
useful results even when it contains contradictions. An MLN can also be obtained
by merging several KBs, even if they are partly incompatible. This is potentially
useful in areas like the Semantic Web [2] and mass collaboration [46].
2. While some conditional independence structures can be compactly represented with
directed graphs but not with undirected ones, they still lead to compact models in the
form of Equation 12.3 (i.e., as products of potential functions).
It is interesting to see a simple example of how Markov logic generalizes first-order
logic. Consider an MLN containing the single formula $\forall x\; R(x) \Rightarrow S(x)$ with weight
w, and C = {A}. This leads to four possible worlds: $\{\neg R(A), \neg S(A)\}$, $\{\neg R(A), S(A)\}$,
$\{R(A), \neg S(A)\}$, and $\{R(A), S(A)\}$. From Equation 12.3 we obtain that $P(\{R(A), \neg S(A)\}) = 1/(3e^w + 1)$
and the probability of each of the other three worlds is $e^w/(3e^w + 1)$.
(The denominator is the partition function Z; see section 12.2.) Thus, if w > 0, the
effect of the MLN is to make the world that is inconsistent with $\forall x\; R(x) \Rightarrow S(x)$
less likely than the other three. From the probabilities above we obtain that
$P(S(A)|R(A)) = 1/(1 + e^{-w})$. When $w \to \infty$, $P(S(A)|R(A)) \to 1$, recovering the
logical entailment.
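The four-world example is small enough to enumerate directly. The sketch below (function names are illustrative, not part of any MLN package) recomputes the world probabilities and the conditional $P(S(A)|R(A))$ from the exponential form of Equation 12.3.

```python
import itertools
import math

def implication_world_probs(w):
    """World probabilities for the single formula R(A) => S(A) with weight w."""
    scores = {}
    for r, s in itertools.product([False, True], repeat=2):
        n_true = int((not r) or s)  # 1 if the only grounding is satisfied
        scores[(r, s)] = math.exp(w * n_true)
    z = sum(scores.values())  # partition function, 3 e^w + 1
    return {x: v / z for x, v in scores.items()}

def p_s_given_r(w):
    """P(S(A) | R(A)) obtained from the joint over the four worlds."""
    p = implication_world_probs(w)
    return p[(True, True)] / (p[(True, True)] + p[(True, False)])

for w in (0.0, 1.0, 5.0, 20.0):
    print(w, implication_world_probs(w)[(True, False)], p_s_given_r(w))
```

At w = 0 all four worlds are equally likely and the conditional is 0.5; as w increases the conditional approaches 1.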
In practice, we have found it useful to add each predicate to the MLN as a unit
clause. In other words, for each predicate $R(x_1, x_2, \ldots)$ appearing in the MLN,
we add the formula $\forall x_1, x_2, \ldots\; R(x_1, x_2, \ldots)$ with some weight $w_R$. The weight
of a unit clause can (roughly speaking) capture the marginal distribution of the
corresponding predicate, leaving the weights of the non-unit clauses free to model
only dependencies between predicates.
When manually constructing an MLN or interpreting a learned one, it is useful
to have an intuitive understanding of the weights. The weight of a formula F is
simply the log odds between a world where F is true and a world where F is false,
other things being equal. However, if F shares variables with other formulae, as
will typically be the case, it may not be possible to keep the truth-values of those
formulae unchanged while reversing F's. In this case there is no longer a one-to-one
correspondence between weights and probabilities of formulae.3 Nevertheless, the
probabilities of all formulae collectively determine all weights, if we view them
as constraints on a maximum entropy distribution, or treat them as empirical
probabilities and learn the maximum likelihood weights (the two are equivalent)
[13]. Thus a good way to set the weights of an MLN is to write down the probability
with which each formula should hold, treat these as empirical frequencies, and learn
the weights from them using the algorithm in section 12.8. Conversely, the weights
3. This is an unavoidable side effect of the power and flexibility of Markov networks. In
Bayesian networks, parameters are probabilities, but at the cost of greatly restricting the
ways in which the distribution may be factored. In particular, potential functions must be
conditional probabilities, and the directed graph must have no cycles. The latter condition
is particularly troublesome to enforce in relational extensions [53].
12.5
SRL Approaches
Because of the simplicity and generality of Markov logic, many representations used
in SRL can be easily mapped into it. In this section, we informally do this for a
representative sample of these approaches. The goal is not to capture all of their
many details, but rather to help bring structure to the field. Further, converting
these representations to Markov logic brings a number of new capabilities and
advantages, and we also discuss these.
12.5.1
Knowledge-based model construction (KBMC) is a combination of logic programming and Bayesian networks [55, 39, 29]. As in Markov logic, nodes in KBMC represent ground predicates. Given a Horn KB, KBMC answers a query by finding all
possible backward-chaining proofs of the query and evidence predicates from each
other, constructing a Bayesian network over the ground predicates in the proofs,
and performing inference over this network. The parents of a predicate node in
the network are deterministic AND nodes representing the bodies of the clauses
that have that node as head. The conditional probability of the node given these
is specified by a combination function (e.g., noisy OR, logistic regression, arbitrary
conditional probability table (CPT)). Markov logic generalizes KBMC by allowing arbitrary formulas (not just Horn clauses) and inference in any direction. It
also sidesteps the thorny problem of avoiding cycles in the Bayesian networks constructed by KBMC, and obviates the need for ad hoc combination functions for
clauses with the same consequent.
A KBMC model can be translated into Markov logic by writing down a set of
formulae for each first-order predicate $P_k(\ldots)$ in the domain. Each formula is a
conjunction containing $P_k(\ldots)$ and one literal per parent of $P_k(\ldots)$ (i.e., per first-order predicate appearing in a Horn clause having $P_k(\ldots)$ as the consequent).
A subset of these literals are negated; there is one formula for each possible
combination of positive and negative literals. The weight of the formula is $w = \log[p/(1 - p)]$,
where p is the conditional probability of the child predicate when the
corresponding conjunction of parent literals is true, according to the combination
function used. If the combination function is logistic regression, it can be represented
using only a linear number of formulae, taking advantage of the fact that a logistic
4. Conversely, joint distributions can be built up from classifiers (e.g., [23]), but this would
be a significant extension of MACCENT.
12.5.3
12.5.5
In structural logistic regression (SLR) [44], the predictors are the output of SQL
queries over the input data. In the same way that a logistic regression model can be
viewed as a discriminatively trained Markov network, an SLR model can be viewed
as a discriminatively trained MLN.5
12.5.6
Large graphical models with repeated structure are often compactly represented
using plates [4]. Markov logic allows plates to be specified using universal quantification. In addition, it allows individuals and their relations to be explicitly represented (see Cussens [8]), and context-specific independences to be compactly written
down, instead of left implicit in the node models. More recently, Heckerman et al.
[24] have proposed probabilistic entity relationship (ER) models, a language based
on ER models that combines the features of plates and PRMs; this language can
be mapped into Markov logic in the same way that ER models can be mapped into
first-order logic. Probabilistic ER models allow logical expressions as constraints
on how ground networks are constructed, but the truth-values of these expressions
have to be known in advance; Markov logic allows uncertainty over all logical expressions.
12.5.8
BLOG
Milch et al. [36] have proposed a language, called BLOG (Bayesian Logic), designed
to avoid making the unique names and domain closure assumptions. A BLOG
program specifies procedurally how to generate a possible world, and does not allow
arbitrary first-order knowledge to be easily incorporated. Also, it only specifies the
structure of the model, leaving the parameters to be specified by external calls.
BLOG models are directed graphs and need to avoid cycles, which substantially
complicates their design. We saw in the previous section how to remove the
unique names and domain closure assumptions in Markov logic. (When there are
unknown objects of multiple types, a random variable for the number of each
5. Use of SQL aggregates requires that their definitions be imported into Markov logic.
12.6
SRL Tasks
Many SRL tasks can be concisely formulated in Markov logic, making it possible to
see how they relate to each other, and to develop algorithms that are simultaneously
applicable to all. In this section we exemplify this with five key tasks: collective
classification, link prediction, link-based clustering, social network modeling, and
object identification.
12.6.1
Collective Classification
The goal of ordinary classification is to predict the class of an object given its
attributes. Collective classification also takes into account the classes of related
objects (e.g., [6, 53, 38]). Attributes can be represented in Markov logic as predicates
of the form A(x, v), where A is an attribute, x is an object, and v is the value
of A in x. The class is a designated attribute C, representable by C(x, v), where
v is x's class. Classification is now simply the problem of inferring the truth-value of C(x, v) for all x and v of interest given all known A(x, v). Ordinary
classification is the special case where $C(x_i, v)$ and $C(x_j, v)$ are independent for all
$x_i$ and $x_j$ given the known A(x, v). In collective classification, the Markov blanket
of $C(x_i, v)$ includes other $C(x_j, v)$, even after conditioning on the known A(x, v).
Relations between objects are represented by predicates of the form $R(x_i, x_j)$. A
number of interesting generalizations are readily apparent; for example, $C(x_i, v)$ and
$C(x_j, v)$ may be indirectly dependent via unknown predicates, possibly including the
$R(x_i, x_j)$ predicates themselves.
12.6.2
Link Prediction
The goal of link prediction is to determine whether a relation exists between two
objects of interest (e.g., whether Anna is Bob's Ph.D. advisor) from the properties of
those objects and possibly other known relations (e.g., see Popescul and Ungar [44]).
The formulation of this problem in Markov logic is identical to that of collective
classification, with the only difference that the goal is now to infer the value of
R(xi , xj ) for all object pairs of interest, instead of C(x, v). The task used in our
experiments is an example of link prediction (see section 12.9).
12.6.3
Link-Based Clustering
The goal of clustering is to group together objects with similar attributes. In model-based
clustering, we assume a generative model $P(X) = \sum_C P(C)\, P(X|C)$, where
X is an object, C ranges over clusters, and P(C|X) is X's degree of membership
in cluster C. In link-based clustering, objects are clustered according to their links
(e.g., objects that are more closely related are more likely to belong to the same
cluster), and possibly according to their attributes as well (e.g., see Flake et al.
[16]). This problem can be formulated in Markov logic by postulating an unobserved
predicate C(x, v) with the meaning "x belongs to cluster v," and having formulas
in the MLN involving this predicate and the observed ones (e.g., $R(x_i, x_j)$ for links
and A(x, v) for attributes). Link-based clustering can now be performed by learning
the parameters of the MLN, and cluster memberships are given by the probabilities
of the C(x, v) predicates conditioned on the observed ones.
12.6.4
Social networks are graphs where nodes represent social actors (e.g., people) and
arcs represent relations between them (e.g., friendship). Social network analysis
[54] is concerned with building models relating actors' properties and their links.
For example, the probability of two actors forming a link may depend on the
similarity of their attributes, and conversely two linked actors may be more likely
to have certain properties. These models are typically Markov networks, and can
be concisely represented by formulas like $\forall x \forall y \forall v\; R(x, y) \Rightarrow (A(x, v) \Leftrightarrow A(y, v))$,
where x and y are actors, R(x, y) is a relation between them, A(x, v) represents
an attribute of x, and the weight of the formula captures the strength of the
correlation between the relation and the attribute similarity. For example, a model
stating that friends tend to have similar smoking habits can be represented by the
formula $\forall x \forall y\; Friends(x, y) \Rightarrow (Smokes(x) \Leftrightarrow Smokes(y))$ (table 12.1). As well as
encompassing existing social network models, Markov logic allows richer ones to
be easily stated (e.g., by writing formulas involving multiple types of relations and
multiple attributes, as well as more complex dependencies between them).
12.6.5
Object Identification
Dependencies between record matches and field matches can then be represented by
formulas like $\forall x \forall y\; x = y \Leftrightarrow f_i(x) = f_i(y)$, where x and y are records and $f_i(x)$
is a function returning the value of the ith field of record x. We have successfully
applied this approach to deduplicating the Cora database of computer science papers [52]. Because it allows information to propagate from one match decision (i.e.,
one grounding of x = y) to another via fields that appear in both pairs of records,
it effectively performs collective object identification, and in our experiments outperformed the traditional method of making each match decision independently of
all others. For example, matching two references may allow us to determine that
"ICML" and "MLC" represent the same conference, which in turn may help us to
match another pair of references where one contains "ICML" and the other "MLC."
Markov logic also allows additional information to be incorporated into a deduplication system easily, modularly, and uniformly. For example, transitive closure is
incorporated by adding the formula $\forall x \forall y \forall z\; x = y \wedge y = z \Rightarrow x = z$, with a weight
that can be learned from data.
12.7
Inference
We now show how inference in Markov logic can be carried out. Markov logic can
answer arbitrary queries of the form "What is the probability that formula $F_1$ holds
given that formula $F_2$ does?" If $F_1$ and $F_2$ are two formulae in first-order logic, C
is a finite set of constants including any constants that appear in $F_1$ or $F_2$, and L
is an MLN, then
$$P(F_1 | F_2, L, C) = P(F_1 | F_2, M_{L,C}) = \frac{P(F_1 \wedge F_2 | M_{L,C})}{P(F_2 | M_{L,C})} = \frac{\sum_{x \in X_{F_1} \cap X_{F_2}} P(X = x | M_{L,C})}{\sum_{x \in X_{F_2}} P(X = x | M_{L,C})} \tag{12.4}$$
where $X_{F_i}$ is the set of worlds where $F_i$ holds, and $P(x | M_{L,C})$ is given by Equation 12.3.
Ordinary conditional queries in graphical models are the special case of (12.4) where
all predicates in $F_1$, $F_2$, and L are zero-arity and the formulae are conjunctions.
The question of whether a knowledge base KB entails a formula F in first-order
logic is the question of whether $P(F | L_{KB}, C_{KB,F}) = 1$, where $L_{KB}$ is the MLN
obtained by assigning infinite weight to all the formulae in KB, and $C_{KB,F}$ is the
set of all constants appearing in KB or F. The question is answered by computing
$P(F | L_{KB}, C_{KB,F})$ by (12.4), with $F_2$ = True.
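Equation 12.4 can be evaluated by brute force on a tiny ground network, which makes the definition concrete even though it does not scale. The following is a minimal sketch under stated assumptions: worlds are tuples of truth values over a fixed atom ordering, ground formulas are given as (weight, function) pairs, and all names are hypothetical.

```python
import itertools
import math

def conditional_probability(n_atoms, weighted_formulas, f1, f2):
    """P(F1 | F2) as a ratio of sums over worlds; Z cancels in the ratio."""
    num = den = 0.0
    for x in itertools.product([False, True], repeat=n_atoms):
        score = math.exp(sum(w * bool(f(x)) for w, f in weighted_formulas))
        if f2(x):
            den += score
            if f1(x):
                num += score
    return num / den

# Atom order (hypothetical): R(A), S(A), R(B), S(B).
# The formula "forall x R(x) => S(x)" with weight w has two groundings.
w = 2.0
ground = [
    (w, lambda x: (not x[0]) or x[1]),
    (w, lambda x: (not x[2]) or x[3]),
]
p = conditional_probability(4, ground,
                            f1=lambda x: x[1],            # S(A)
                            f2=lambda x: x[0] and x[2])   # R(A) and R(B)
print(p)
```

The loop visits all $2^n$ worlds, so this is only usable for a handful of ground atoms.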
Computing (12.4) directly will be intractable in all but the smallest domains.
Since Markov logic inference subsumes probabilistic inference, which is #P-complete, and logical inference in finite domains, which is NP-complete, no better
results can be expected. However, many of the large number of techniques for
efficient inference in either case are applicable to Markov logic. Because Markov
logic allows fine-grained encoding of knowledge, including context-specific independences, inference in it may in some cases be more efficient than inference in an
ordinary graphical model for the same domain. On the logic side, the probabilistic
semantics of Markov logic allows for approximate inference, with the corresponding
potential gains in efficiency.
Table 12.3
function ConstructNetwork(F1, F2, L, C)
  inputs: F1, a set of ground atoms with unknown truth-values (the "query")
          F2, a set of ground atoms with known truth-values (the "evidence")
          L, a Markov logic network
          C, a set of constants
  output: M, a ground Markov network
  calls: MB(q), the Markov blanket of q in M_{L,C}
  G ← F1
  while F1 ≠ ∅
    for all q ∈ F1
      if q ∉ F2
        F1 ← F1 ∪ (MB(q) \ G)
        G ← G ∪ MB(q)
      F1 ← F1 \ {q}
  return M, the ground Markov network composed of all nodes in G, all arcs between
    them in M_{L,C}, and the features and weights on the corresponding cliques
In principle, $P(F_1 | F_2, L, C)$ can be approximated using an MCMC algorithm
that rejects all moves to states where $F_2$ does not hold, and counts the number of
samples in which $F_1$ holds. However, even this is likely to be too slow for arbitrary
formulae. Instead, we provide an inference algorithm for the case where $F_1$ and $F_2$
are conjunctions of ground literals. While less general than (12.4), this is the most
frequent type of query in practice, and the algorithm we provide answers it far more
efficiently than a direct application of (12.4). Investigating lifted inference (where
queries containing variables are answered without grounding them) is an important
direction for future work (see Jaeger [26] and Poole [42] for initial results). The
algorithm proceeds in two phases, analogous to knowledge-based model construction
[55]. The first phase returns the minimal subset M of the ground Markov network
required to compute $P(F_1 | F_2, L, C)$. The algorithm for this is shown in table 12.3.
The size of the network returned may be further reduced, and the algorithm sped
up, by noticing that any ground formula which is made true by the evidence can be
ignored, and the corresponding arcs removed from the network. In the worst case,
the network contains $O(|C|^a)$ nodes, where a is the largest predicate arity in the
domain, but in practice it may be much smaller.
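The first phase (table 12.3) can be sketched in a few lines of Python. This is an illustrative rendering, not the authors' implementation: it assumes the ground formulas are already available as sets of ground-atom names, derives Markov blankets from them, and uses made-up atoms in the example.

```python
def markov_blanket(q, ground_formulas):
    """Atoms that share at least one ground formula with q."""
    blanket = set()
    for atoms in ground_formulas:
        if q in atoms:
            blanket |= atoms
    blanket.discard(q)
    return blanket

def construct_network(query, evidence, ground_formulas):
    """Grow the node set outward from the query, stopping at evidence atoms."""
    frontier = set(query)   # plays the role of F1 in table 12.3
    nodes = set(query)      # plays the role of G
    while frontier:
        q = frontier.pop()
        if q not in evidence:
            mb = markov_blanket(q, ground_formulas)
            frontier |= mb - nodes
            nodes |= mb
    # Keep only the ground formulas whose atoms all lie in the node set.
    cliques = [atoms for atoms in ground_formulas if atoms <= nodes]
    return nodes, cliques

# Hypothetical ground formulas, each given as the set of atoms it mentions.
ground_formulas = [
    frozenset({"Smokes(A)", "Friends(A,B)", "Smokes(B)"}),
    frozenset({"Smokes(B)", "Friends(B,C)", "Smokes(C)"}),
    frozenset({"Smokes(C)", "Cancer(C)"}),
]
nodes, cliques = construct_network(
    {"Smokes(A)"}, {"Friends(A,B)", "Smokes(B)"}, ground_formulas)
print(sorted(nodes), len(cliques))
```

With Smokes(B) observed, the expansion stops after one step; if it is left unobserved, the whole chain down to Cancer(C) is pulled into the network.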
The second phase performs inference on this network, with the nodes in $F_2$ set
to their values in $F_2$. Our implementation uses Gibbs sampling, but any inference
method may be employed. The basic Gibbs step consists of sampling one ground
atom given its Markov blanket. The Markov blanket of a ground atom is the set
of ground predicates that appear in some grounding of a formula with it. The
probability of a ground atom $X_l$ when its Markov blanket $B_l$ is in state $b_l$ is
$$P(X_l = x_l | B_l = b_l) = \frac{\exp\left(\sum_{f_i \in F_l} w_i f_i(X_l = x_l, B_l = b_l)\right)}{\exp\left(\sum_{f_i \in F_l} w_i f_i(X_l = 0, B_l = b_l)\right) + \exp\left(\sum_{f_i \in F_l} w_i f_i(X_l = 1, B_l = b_l)\right)} \tag{12.5}$$
where $F_l$ is the set of ground formulae that $X_l$ appears in, and $f_i(X_l = x_l, B_l = b_l)$
is the value (0 or 1) of the feature corresponding to the ith ground formula when
$X_l = x_l$ and $B_l = b_l$. For sets of atoms of which exactly one is true in any given
world (e.g., the possible values of an attribute), blocking can be used (i.e., one atom
is set to true and the others to false in one step, by sampling conditioned on their
collective Markov blanket). The estimated probability of a conjunction of ground
literals is simply the fraction of samples in which the ground literals are true, after
the Markov chain has converged. Because the distribution is likely to have many
modes, we run the Markov chain multiple times. When the MLN is in clausal
form, we minimize burn-in time by starting each run from a mode found using
MaxWalkSat, a local search algorithm for the weighted satisfiability problem (i.e.,
finding a truth assignment that maximizes the sum of weights of satisfied clauses)
[28]. When there are hard constraints (clauses with infinite weight), MaxWalkSat
finds regions that satisfy them, and the Gibbs sampler then samples from these
regions to obtain probability estimates.
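The Gibbs step of Equation 12.5 can be sketched as follows. This is a simplified illustration rather than the implementation described above: it runs a single chain from an all-false start instead of a MaxWalkSat mode, omits blocking, and represents each ground formula by a hypothetical (weight, atoms, function) triple.

```python
import math
import random

def prob_true(atom, state, formulas):
    """Equation 12.5 for X_l = 1, using only ground formulas containing the atom."""
    def score(value):
        trial = {**state, atom: value}
        return math.exp(sum(w * bool(f(trial))
                            for w, atoms, f in formulas if atom in atoms))
    s0, s1 = score(False), score(True)
    return s1 / (s0 + s1)

def gibbs_marginals(atoms, evidence, formulas, sweeps=20000, burn_in=1000, seed=0):
    """Estimate marginals of non-evidence atoms as sample frequencies."""
    rng = random.Random(seed)
    state = {a: evidence.get(a, False) for a in atoms}  # assumed all-false start
    free = [a for a in atoms if a not in evidence]
    counts = {a: 0 for a in free}
    for t in range(burn_in + sweeps):
        for a in free:
            state[a] = rng.random() < prob_true(a, state, formulas)
        if t >= burn_in:
            for a in free:
                counts[a] += state[a]
    return {a: c / sweeps for a, c in counts.items()}

# Single ground formula R(A) => S(A) with weight 2 (hypothetical example).
w = 2.0
formulas = [(w, {"R(A)", "S(A)"}, lambda s: (not s["R(A)"]) or s["S(A)"])]
estimate = gibbs_marginals(["R(A)", "S(A)"], {}, formulas)
print(estimate)
```

For this two-atom network the estimate of P(S(A)) can be compared with the exact value $2e^w/(3e^w + 1)$ obtained by enumeration.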
12.8
Learning
We learn MLN weights from one or more relational databases. (For brevity, the
treatment below is for one database, but the generalization to many is trivial.) We
make a closed-world assumption [18]: if a ground atom is not in the database, it is
assumed to be false. If there are n possible ground atoms, a database is effectively
a vector $x = (x_1, \ldots, x_l, \ldots, x_n)$ where $x_l$ is the truth value of the lth ground
atom ($x_l = 1$ if the atom appears in the database, and $x_l = 0$ otherwise). Given
a database, MLN weights can in principle be learned using standard methods,
as follows. If the ith formula has $n_i(x)$ true groundings in the data x, then by
Equation 12.3 the derivative of the log-likelihood with respect to its weight is
$$\frac{\partial}{\partial w_i} \log P_w(X = x) = n_i(x) - \sum_{x'} P_w(X = x')\, n_i(x') \tag{12.6}$$
true groundings of the ith formula in the data and its expectation according to the
current model. Unfortunately, counting the number of true groundings of a formula
in a database is intractable, even when the formula is a single clause, as stated in
the following proposition (due to Dan Suciu).
Proposition 12.4
Counting the number of true groundings of a first-order clause in a database is
#P-complete in the length of the clause.
Proof Counting satisfying assignments of propositional monotone 2-CNF is #P-complete [49]. This problem can be reduced to counting the number of true
groundings of a first-order clause in a database as follows. Consider a database
composed of the ground atoms R(0, 1), R(1, 0), and R(1, 1). Given a monotone
2-CNF formula, construct a formula $\Phi$ that is a conjunction of predicates of the
form $R(x_i, x_j)$, one for each disjunct $x_i \vee x_j$ appearing in the CNF formula. (For
example, $(x_1 \vee x_2) \wedge (x_3 \vee x_4)$ would yield $R(x_1, x_2) \wedge R(x_3, x_4)$.) There is a one-to-one correspondence between the satisfying assignments of the 2-CNF and the
true groundings of $\Phi$. The latter are the false groundings of the clause formed by
disjoining the negations of all the $R(x_i, x_j)$, and thus can be counted by counting
the number of true groundings of this clause and subtracting it from the total
number of groundings.
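The reduction in the proof can be replayed on small instances. The sketch below (helper names are illustrative) counts the satisfying assignments of a monotone 2-CNF directly, and again by counting true groundings of the clause built from negated $R(x_i, x_j)$ literals over the three-atom database.

```python
import itertools

# R(0,1), R(1,0), R(1,1) are in the database; R(0,0) is false (closed world).
DATABASE = {(0, 1), (1, 0), (1, 1)}

def count_sat(cnf, n_vars):
    """Satisfying assignments of a monotone 2-CNF given as index pairs (i, j)."""
    return sum(all(x[i] or x[j] for i, j in cnf)
               for x in itertools.product([0, 1], repeat=n_vars))

def count_true_groundings_of_clause(cnf, n_vars):
    """True groundings of the clause: OR over (i, j) of not R(x_i, x_j)."""
    return sum(any((x[i], x[j]) not in DATABASE for i, j in cnf)
               for x in itertools.product([0, 1], repeat=n_vars))

def count_sat_via_clause(cnf, n_vars):
    """Total groundings minus true groundings of the clause."""
    return 2 ** n_vars - count_true_groundings_of_clause(cnf, n_vars)

# (x1 v x2) ^ (x3 v x4), with variables indexed from 0.
cnf = [(0, 1), (2, 3)]
print(count_sat(cnf, 4), count_sat_via_clause(cnf, 4))
```

Both counts agree because a grounding makes $R(x_i, x_j)$ true exactly when the disjunct $x_i \vee x_j$ is satisfied.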
$$P^{*}_w(X = x) = \prod_{l=1}^{n} P_w(X_l = x_l | MB_x(X_l)) \tag{12.7}$$
where $MB_x(X_l)$ is the state of the Markov blanket of $X_l$ in the data. The gradient
of the pseudo-log-likelihood is
$$\frac{\partial}{\partial w_i} \log P^{*}_w(X = x) = \sum_{l=1}^{n} \left[ n_i(x) - P_w(X_l = 0 | MB_x(X_l))\, n_i(x_{[X_l=0]}) - P_w(X_l = 1 | MB_x(X_l))\, n_i(x_{[X_l=1]}) \right] \tag{12.8}$$
where $n_i(x_{[X_l=0]})$ is the number of true groundings of the ith formula when we
force $X_l = 0$ and leave the remaining data unchanged, and similarly for $n_i(x_{[X_l=1]})$.
Computing this expression (or (12.7)) does not require inference over the model.
We optimize the pseudo-log-likelihood using the limited-memory BFGS algorithm
[33]. The computation can be made more efficient in several ways:
The sum in (12.8) can be greatly sped up by ignoring predicates that do not
appear in the ith formula.
The counts ni (x), ni (x[Xl=0] ), and ni (x[Xl=1] ) do not change with the weights, and
need only be computed once (as opposed to in every iteration of BFGS).
Ground formulas whose truth-value is unaffected by changing the truth-value of
any single literal may be ignored, since then $n_i(x) = n_i(x_{[X_l=0]}) = n_i(x_{[X_l=1]})$. In
particular, this holds for any clause which contains at least two true literals. This
can often be the great majority of ground clauses.
To combat overfitting, we penalize the pseudo-likelihood with a Gaussian prior
on each weight.
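A direct, unoptimized rendering of the pseudo-log-likelihood and its gradient is sketched below. It is only an illustration under stated assumptions: the database is a tuple of 0/1 values, each formula is represented by a hypothetical counting function $n_i$, the conditional is computed from whole-world counts (terms not involving $X_l$ cancel), and the Gaussian prior and the speedups listed above are left out.

```python
import math

def flipped(x, l, value):
    """Copy of database x with the l-th ground atom forced to value."""
    y = list(x)
    y[l] = value
    return tuple(y)

def cond_prob(x, l, value, weights, counts):
    """P_w(X_l = value | rest of x), as in the factors of (12.7)."""
    s = [math.exp(sum(w * n(flipped(x, l, v)) for w, n in zip(weights, counts)))
         for v in (0, 1)]
    return s[value] / (s[0] + s[1])

def pseudo_log_likelihood(x, weights, counts):
    return sum(math.log(cond_prob(x, l, x[l], weights, counts))
               for l in range(len(x)))

def pll_gradient(x, weights, counts):
    """Gradient of the pseudo-log-likelihood, term by term as in (12.8)."""
    grad = []
    for n_i in counts:
        g = 0.0
        for l in range(len(x)):
            p0 = cond_prob(x, l, 0, weights, counts)
            g += (n_i(x) - p0 * n_i(flipped(x, l, 0))
                  - (1 - p0) * n_i(flipped(x, l, 1)))
        grad.append(g)
    return grad

# Atom order (hypothetical): R(A), S(A), R(B), S(B).
counts = [
    lambda x: int(not x[0] or x[1]) + int(not x[2] or x[3]),  # R(x) => S(x)
    lambda x: x[1] + x[3],                                    # unit clause S(x)
]
weights = [0.7, -0.3]
x = (1, 1, 1, 0)
grad = pll_gradient(x, weights, counts)
print(pseudo_log_likelihood(x, weights, counts), grad)
```

Since the counts do not depend on the weights, a practical version would cache them once, as noted in the list above.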
When we know a priori which predicates will be evidence, MLN weights can also
be learned discriminatively [52].
ILP techniques can be used to learn additional clauses, refine the ones already in
the MLN, or learn an MLN from scratch. Here we use the CLAUDIEN system for
this purpose [10]. Unlike most other ILP systems, which learn only Horn clauses,
CLAUDIEN is able to learn arbitrary first-order clauses, making it well suited to
Markov logic. Also, by constructing a particular language bias, we are able to direct
CLAUDIEN to search for refinements of the MLN structure. Alternatively, MLN
structure can be learned by directly optimizing pseudo-likelihood [30].
12.9
Experiments
We have empirically tested the algorithms described in the previous sections using
a database describing the Department of Computer Science and Engineering at the
University of Washington (UW-CSE). The domain consists of 12 predicates and
2707 constants divided into 10 types. Types include: publication (342 constants),
person (442), course (176), project (153), academic quarter (20), etc. Predicates
include: Professor(person), Student(person), Area(x, area) (with x ranging over
publications, persons, courses, and projects), AuthorOf(publication, person),
AdvisedBy(person, person), YearsInProgram(person, years), CourseLevel(course, level), TaughtBy(course, person, quarter), TeachingAssistant(course, per-
Systems
In order to evaluate Markov logic, which uses logic and probability for inference,
we wished to compare it with methods that use only logic or only probability. We
were also interested in automatic induction of clauses using ILP techniques. This
section gives details of the comparison systems used.
12.9.1.1
Logic
One important question we aimed to answer with the experiments is whether adding
probability to a logical KB improves its ability to model the domain. Doing this
requires observing the results of answering queries using only logical inference, but
this is complicated by the fact that computing log-likelihood and the area under
the precision-recall curve requires real-valued probabilities, or at least some measure
of "confidence" in the truth of each ground atom being tested. We thus used the
following approach. For a given knowledge base KB and set of evidence atoms E,
let $X_{KB \cup E}$ be the set of worlds that satisfy $KB \cup E$. The probability of a query
atom q is then defined as $P(q) = |X_{KB \cup E \cup q}| / |X_{KB \cup E}|$, the fraction of $X_{KB \cup E}$ in which q is
true.
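For a handful of ground atoms this definition can be evaluated by enumeration. The sketch below is an illustration only; the atoms, the two-rule KB, and the function name are hypothetical and are not taken from the UW-CSE knowledge base.

```python
import itertools

def logical_query_probability(atoms, kb, evidence, query):
    """Fraction of worlds satisfying KB and evidence in which the query atom is true."""
    satisfying = with_query = 0
    for values in itertools.product([False, True], repeat=len(atoms)):
        world = dict(zip(atoms, values))
        if (all(world[a] == v for a, v in evidence.items())
                and all(f(world) for f in kb)):
            satisfying += 1
            with_query += world[query]
    if satisfying == 0:
        raise ValueError("KB and evidence are jointly unsatisfiable")
    return with_query / satisfying

atoms = ["Professor(P)", "Student(M)", "AdvisedBy(M,P)"]
kb = [
    lambda s: (not s["AdvisedBy(M,P)"]) or s["Student(M)"],    # advisee is a student
    lambda s: (not s["AdvisedBy(M,P)"]) or s["Professor(P)"],  # advisor is a professor
]
p = logical_query_probability(atoms, kb, {"Student(M)": True}, "AdvisedBy(M,P)")
print(p)
```

The explicit error for an empty set of satisfying worlds corresponds to the zero-denominator case that arises when the KB is inconsistent.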
A more serious problem arises if the KB is inconsistent (which was indeed the case
with the KB we collected from volunteers). In this case the denominator of P(q) is
zero. (Also, recall that an inconsistent KB trivially entails any arbitrary formula.)
To address this, we redefine $X_{KB \cup E}$ to be the set of worlds that satisfy the
maximum possible number of ground clauses. We use Gibbs sampling to sample
from this set, with each chain initialized to a mode using WalkSat. At each Gibbs
step, the step is taken with probability: 1 if the new state satisfies more clauses than
the current one (since that means the current state should have 0 probability), 0.5
if the new state satisfies the same number of clauses (since the new and old state
then have equal probability), and 0 if the new state satisfies fewer clauses. We
then use only the states with the maximum number of satisfied clauses to compute
probabilities. Notice that this is equivalent to using an MLN built from the KB and
with all equal infinite weights.
12.9.1.2
Probability
The other question we wanted to answer with these experiments is whether existing (propositional) probabilistic models are already powerful enough to be used
in relational domains without the need for the additional representational power
provided by MLNs. In order to use such models, the domain must first be propositionalized by defining features that capture useful information about it. Creating
good attributes for propositional learners in this highly relational domain is a difficult problem. Nevertheless, as a tradeoff between incorporating as much potentially
relevant information as possible and avoiding extremely long feature vectors, we defined two sets of propositional attributes: order-1 and order-2. The former involves
characteristics of individual constants in the query predicate, and the latter involves
characteristics of relations between the constants in the query predicate.
For the order-1 attributes, we defined one variable for each (a, b) pair, where a is
an argument of the query predicate and b is an argument of some predicate with the
same value as a. The variable is the fraction of true groundings of this predicate
in the data. Some examples of first-order attributes for AdvisedBy(Matt, Pedro)
are: whether Pedro is a student, the fraction of publications that are published by
Pedro, the fraction of courses for which Matt was a teaching assistant, etc.
The order-2 attributes were defined as follows: for a given (ground) query predicate $Q(q_1, q_2, \ldots, q_k)$, consider all sets of k predicates and all assignments of constants $q_1, q_2, \ldots, q_k$ as arguments to the k predicates, with exactly one constant per
predicate (in any order). For instance, if Q is AdvisedBy(Matt, Pedro) then one
such possible set would be {TeachingAssistant(_, Matt, _), TaughtBy(_, Pedro, _)}.
This forms $2^k$ attributes of the example, each corresponding to a particular truth assignment to the k predicates. The value of an attribute is the number of times, in the
training data, the set of predicates have that particular truth assignment, when their
unassigned arguments are all filled with the same constants. For example, consider
filling the above empty arguments with "CSE546" and "Autumn 0304." The resulting
set, {TeachingAssistant(CSE546, Matt, Autumn 0304), TaughtBy(CSE546, Pedro,
Autumn 0304)} has some truth assignment in the training data (e.g., {True,True},
{True,False}, . . . ). One attribute is the number of such sets of constants that create
the truth assignment {True,True}, another for {True,False}, and so on. Some examples of second-order attributes generated for the query AdvisedBy(Matt, Pedro)
are: how often Matt is a teaching assistant for a course that Pedro taught (as well
as how often he is not), how many publications Pedro and Matt have coauthored,
etc.
The resulting 28 order-1 attributes and 120 order-2 attributes (for the All Info
case) were discretized into five equal-frequency bins (based on the training set).
We used two propositional learners: naive Bayes [14] and Bayesian networks [22]
with structure and parameters learned using the VFBN2 algorithm [25] with a
maximum of four parents per node. The order-2 attributes helped the naive Bayes
classifier but hurt the performance of the Bayesian network classifier, so below we
report results using the order-1 and order-2 attributes for naive Bayes, and only
the order-1 attributes for Bayesian networks.
12.9.1.3
Our original KB was acquired from volunteers, but we were also interested in
whether it could have been developed automatically using ILP methods. As mentioned earlier, we used CLAUDIEN to induce a KB from data. CLAUDIEN was
run with: local scope; minimum accuracy of 0.1; minimum coverage of 1; maximum
complexity of 10; and breadth-first search. CLAUDIEN's search space is defined by
its language bias. We constructed a language bias which allowed: a maximum of
three variables in a clause; unlimited predicates in a clause; up to two non-negated
appearances of a predicate in a clause, and two negated ones; and use of knowledge of predicate argument types. To minimize search, the equality predicates (e.g.,
SamePerson) were not used in CLAUDIEN, and this improved its results.
Besides inducing clauses from the training data, we were also interested in using
data to automatically refine the KB provided by our volunteers. CLAUDIEN does
not support this feature directly, but it can be emulated by an appropriately
constructed language bias. For each clause in the KB, we allowed
CLAUDIEN to (1) remove any number of the literals, (2) add up to v new variables,
and (3) add up to l new literals. We ran CLAUDIEN for 24 hours on a Sun-Blade
1000 for each (v, l) in the set {(1, 2), (2, 3), (3, 4)}. All three gave nearly identical
results; we report the results with v = 3 and l = 4.
12.9.1.4
Markov Logic
Our results compare the above systems to Markov logic. The MLNs were trained
using a Gaussian weight prior with zero mean and unit variance, and with the
weights initialized at the mode of the prior (zero). For optimization, we used the
FORTRAN implementation of L-BFGS from Zhu et al. [58] and Byrd et al. [5],
leaving all parameters at their default values, and with a convergence criterion (ftol)
of $10^{-5}$. Inference was performed using Gibbs sampling as described in section 12.7,
with ten parallel Markov chains, each initialized to a mode of the distribution using
MaxWalkSat. The number of Gibbs steps was determined using the criterion of
DeGroot and Schervish [11, pp. 707 and 740-741]. Sampling continued until we
reached a confidence of 95% that the probability estimate was within 1% of the true
value in at least 95% of the nodes (ignoring nodes which are always true or false). A
minimum of 1000 and maximum of 500,000 samples was used, with one sample per
complete Gibbs pass through the variables. Typically, inference converged within
5000 to 100,000 passes. The results were insensitive to variation in the convergence
thresholds.
12.9.2
Results
12.9.2.1
Training with MC-MLE
Our initial system used MC-MLE to train MLNs, with ten Gibbs chains, and each
ground atom being initialized to true with the corresponding first-order predicate's
probability of being true in the data. Gibbs steps may be taken quite quickly by
noting that few counts of satisfied clauses will change on any given step. On the
UW-CSE domain, our implementation took 4-5 ms per step. We used the maximum
across all predicates of the Gelman criterion R [20] to determine when the chains
had reached their stationary distribution. In order to speed convergence, our Gibbs
sampler preferentially samples atoms that were true in either the data or the initial
state of the chain. The intuition behind this is that most atoms are always false,
and sampling repeatedly from them is inefficient. This improved convergence by
approximately an order of magnitude over uniform selection of atoms. Despite
these optimizations, the Gibbs sampler took a prohibitively long time to reach
a reasonable convergence threshold (e.g., R = 1.01). After running for 24 hours
(approximately 2 million Gibbs steps per chain), the average R-value across training
sets was 3.04, with no one training set having reached an R-value less than 2 (other
than briefly dipping to 1.5 in the early stages of the process). Considering this must
be done iteratively as L-BFGS searches for the minimum, we estimate it would
take anywhere from 20 to 400 days to complete the training, even with a weak
convergence threshold such as R = 2.0. Experiments confirmed the poor quality
of the models that resulted if we ignored the convergence threshold and limited
the training process to less than ten hours. With a better choice of initial state,
approximate counting, and improved MCMC techniques such as the Swendsen-Wang algorithm [15], MC-MLE may become practical, but it is not a viable option
for training in the current version. (Notice that during learning MCMC is performed
over the full ground network, which is too large to apply MaxWalkSat to.)
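For readers unfamiliar with the convergence statistic R, the following Python sketch computes the standard potential scale reduction factor for a single scalar quantity tracked in several parallel chains. It follows the usual textbook form of the statistic; whether the experiments used exactly this variant is an assumption, and the function name `gelman_rubin` and the toy chains are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def gelman_rubin(chains):
    """Potential scale reduction factor for m chains of n samples each.

    Values near 1 suggest the chains agree; large values suggest they
    have not yet mixed.
    """
    n = len(chains[0])
    chain_means = [mean(c) for c in chains]
    # W: average within-chain sample variance.
    within = mean([variance(c) for c in chains])
    # B: between-chain variance, scaled by the chain length.
    between = n * variance(chain_means)
    # Pooled estimate of the posterior variance.
    pooled = (n - 1) / n * within + between / n
    return sqrt(pooled / within)

# Two chains of 0/1 samples of one ground atom.
agreeing = gelman_rubin([[0, 1, 0, 1], [1, 0, 1, 0]])
disagreeing = gelman_rubin([[0, 0, 0, 1], [1, 1, 1, 0]])
```

In the setting described above one such value would be computed per predicate, and the maximum over predicates compared against the threshold.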
12.9.2.2
Inference
Inference was also quite quick. Inferring the probability of all AdvisedBy(x, y) atoms
in the All Info case took 3.3 minutes in the AI test set (4624 atoms), 24.4 in
graphics (3721), 1.8 in programming languages (784), 10.4 in systems (5476), and
1.6 in theory (2704). The number of Gibbs passes ranged from 4270 to 500,000,
and averaged 124,000. This amounts to 18 ms per Gibbs pass and approximately
200,000-500,000 Gibbs steps per second. The average time to perform inference in
the Partial Info case was 14.8 minutes (vs. 8.3 in the All Info case).
12.9.2.4
Comparison of Systems
We compared twelve systems: the original KB (KB); CLAUDIEN (CL); CLAUDIEN with the original KB as language bias (CLB); the union of the original KB and
CLAUDIEN's output in both cases (KB+CL and KB+CLB); an MLN with each
of the above KBs (MLN(KB), MLN(CL), MLN(CLB), MLN(KB+CL), and MLN(KB+CLB));
naive Bayes (NB); and a Bayesian network learner (BN). Add-one smoothing of
probabilities was used in all cases.
Table 12.4 summarizes the results, and figure 12.2 shows precision-recall curves
for all areas (i.e., averaged over all AdvisedBy(x, y) predicates). MLNs are clearly
more accurate than the alternatives, showing the promise of this approach. The
purely logical and purely probabilistic methods often suffer when intermediate
predicates have to be inferred, while MLNs are largely unaffected. Naive Bayes
performs well in AUC in some test sets, but very poorly in others; its CLLs
are uniformly poor. CLAUDIEN performs poorly on its own, and produces no
improvement when added to the KB in the MLN. Using CLAUDIEN to refine the
KB typically performs worse in AUC but better in CLL than using CLAUDIEN
from scratch; overall, the best-performing logical method is KB+CLB, but its
results fall well short of the best MLNs. The general drop-off in precision at around
50% recall is attributable to the fact that the database is very incomplete, and only
allows identifying a minority of the AdvisedBy relations. Inspection reveals that the
occasional smaller drop-offs in precision at very low recalls are due to students who
graduated or changed advisors after coauthoring many publications with them.

Table 12.4
                      All Info                        Partial Info
System         AUC             CLL              AUC             CLL
MLN(KB)        0.215±0.0172    -0.052±0.004     0.224±0.0185    -0.048±0.004
MLN(KB+CL)     0.152±0.0165    -0.058±0.005     0.203±0.0196    -0.045±0.004
MLN(KB+CLB)    0.011±0.0003    -3.905±0.048     0.011±0.0003    -3.958±0.048
MLN(CL)        0.035±0.0008    -2.315±0.030     0.032±0.0009    -2.478±0.030
MLN(CLB)       0.003±0.0000    -0.052±0.005     0.023±0.0003    -0.338±0.002
KB             0.059±0.0081    -0.135±0.005     0.048±0.0058    -0.063±0.004
KB+CL          0.037±0.0012    -0.202±0.008     0.028±0.0012    -0.122±0.006
KB+CLB         0.084±0.0100    -0.056±0.004     0.044±0.0064    -0.051±0.005
CL             0.048±0.0009    -0.434±0.012     0.037±0.0001    -0.836±0.017
CLB            0.003±0.0000    -0.052±0.005     0.010±0.0001    -0.598±0.003
NB             0.054±0.0006    -1.214±0.036     0.044±0.0009    -1.140±0.031
BN             0.015±0.0006    -0.072±0.003     0.015±0.0007    -0.215±0.003
Figure 12.2 Precision and recall for all areas: All Info (upper graph) and Partial
Info (lower graph). Each graph plots precision (vertical axis) against recall
(horizontal axis) for MLN(KB), MLN(KB+CL), KB, KB+CL, CL, NB, and BN.
12.10
Conclusion
The rapid growth in the variety of SRL approaches and tasks has led to the need for
a unifying framework. In this chapter we propose Markov logic as a candidate for
such a framework. Markov logic combines first-order logic and Markov networks and
allows a wide variety of SRL tasks and approaches to be formulated in a common
language. Initial experiments with an implementation of Markov logic have yielded
good results. Software implementing Markov logic and learning and inference
algorithms for it is available at http://www.cs.washington.edu/ai/alchemy.
Acknowledgments
We are grateful to Julian Besag, Vitor Santos Costa, James Cussens, Nilesh
Dalvi, Alan Fern, Alon Halevy, Mark Handcock, Henry Kautz, Kristian Kersting,
Tian Sang, Bart Selman, Dan Suciu, Jeremy Tantrum, and Wei Wei for helpful
discussions. This research was partly supported by ONR grant N00014-02-1-0408
and by a Sloan Fellowship awarded to P. D. We used the VFML library in our
experiments (http://www.cs.washington.edu/dm/vfml/).
References
[1] C. Anderson, P. Domingos, and D. Weld. Relational Markov models and their
application to adaptive Web navigation. In Proceedings of the Eighth ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining,
pages 143-152, Edmonton, Canada, 2002. ACM Press.
[2] T. Berners-Lee, J. Hendler, and O. Lassila. The Semantic Web. Scientific
American, 284(5):34-43, 2001.
[3] J. Besag. Statistical analysis of non-lattice data. The Statistician, 24:179-195,
1975.
[4] W. Buntine. Operations for learning with graphical models. Journal of
Artificial Intelligence Research, 2:159-225, 1994.
[5] R. H. Byrd, P. Lu, and J. Nocedal. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific and Statistical Computing,
16(5):1190-1208, 1995.
[6] S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization
using hyperlinks. In Proceedings of ACM International Conference on Management of Data, 1998.
[7] C. Cumby and D. Roth. Feature extraction languages for propositionalized
relational learning. In Proceedings of the IJCAI-2003 Workshop on Learning
Statistical Models from Relational Data, 2003.
[8] J. Cussens. Individuals, relations and structures in probabilistic models. In
Proceedings of the IJCAI-2003 Workshop on Learning Statistical Models from
Relational Data, 2003.
[9] J. Cussens. Loglinear models for first-order probabilistic reasoning. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 1999.
[10] L. De Raedt and L. Dehaspe. Clausal discovery. Machine Learning, 26:99-146,
1997.
[11] M. H. DeGroot and M. J. Schervish. Probability and Statistics, 3rd edition.
Addison Wesley, Boston, 2002.
[12] L. Dehaspe. Maximum entropy modeling with clausal constraints. In Proceedings of the International Conference on Inductive Logic Programming, 1997.
[13] S. Della Pietra, V. Della Pietra, and J. Lafferty. Inducing features of random
fields. IEEE Transactions on Pattern Analysis and Machine Intelligence,
19:380-392, 1997.
[14] P. Domingos and M. Pazzani. On the optimality of the simple Bayesian
classifier under zero-one loss. Machine Learning, 29:103-130, 1997.
[15] R.G. Edwards and A.G. Sokal. Generalization of the Fortuin-Kasteleyn-Swendsen-Wang representation and Monte Carlo algorithm. Physical Review
D, 38:2009-2012, 1988.
[16] G. W. Flake, S. Lawrence, and C. L. Giles. Efficient identification of Web
communities. In International Conference on Knowledge Discovery and Data
Mining, 2000.
[17] N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic
relational models. In Proceedings of the International Joint Conference on
Artificial Intelligence, 1999.
[18] M. R. Genesereth and N. J. Nilsson. Logical Foundations of Artificial Intelligence. Morgan Kaufmann, San Mateo, CA, 1987.
[19] C. J. Geyer and E. A. Thompson. Constrained Monte Carlo maximum
likelihood for dependent data. Journal of the Royal Statistical Society, Series
B, 54(3):657-699, 1992.
[20] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, editors. Markov Chain
Monte Carlo in Practice. Chapman and Hall, London, 1996.
[21] J. Halpern. An analysis of first-order logics of probability. Artificial Intelligence, 46:311-350, 1990.
[22] D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks:
The combination of knowledge and statistical data. Machine Learning,
20:197-243, 1995.
[23] D. Heckerman, D. M. Chickering, C. Meek, R. Rounthwaite, and C. Kadie.
Dependency networks for inference, collaborative filtering, and data visualization. Journal of Machine Learning Research, 1:49-75, 2000.
[24] D. Heckerman, C. Meek, and D. Koller. Probabilistic entity-relationship
models, PRMs, and plate models. In Proceedings of the ICML-2004 Workshop
on Statistical Relational Learning and Its Connections to Other Fields, 2004.
[25] G. Hulten and P. Domingos. Mining complex models from arbitrarily large
databases in constant time. In International Conference on Knowledge Discovery and Data Mining, 2002.
[26] M. Jaeger. On the complexity of inference about probabilistic relational
models. Artificial Intelligence, 117:297-308, 2000.
13.1
Introduction
Human beings and AI systems must convert sensory input into some understanding
of what is going on in the world around them. That is, they must make inferences
about the objects and events that underlie their observations. No prespecified list of
objects is given; the agent must infer the existence of objects that were not known
initially to exist.
In many AI systems, this problem of unknown objects is engineered away or
resolved in a preprocessing step. However, there are important applications where
the problem is unavoidable. Population estimation, for example, involves counting a
population by sampling from it randomly and measuring how often the same object
is resampled; this would be pointless if the set of objects were known in advance.
Record linkage, a task undertaken by an industry of more than 300 companies,
involves matching entries across multiple databases. These companies exist because
of uncertainty about the mapping from observations to underlying objects. Finally,
multitarget tracking systems perform data association, connecting, say, radar blips
to hypothesized aircraft.
Probability models for such tasks are not new: Bayesian models for data association have been used since the 1960s [29]. The models are written in English and
mathematical notation and converted by hand into special-purpose code. This can
result in inflexible models of limited expressiveness: for example, tracking systems
assume independent trajectories with linear dynamics, and record linkage systems
assume a naive Bayes model for fields in records. It seems natural, therefore, to seek
a formal language in which to express probability models that allow for unknown
objects.
Recent achievements in the field of probabilistic graphical models [24] illustrate
the benefits that can be expected from adopting a formal language: general-purpose
inference algorithms, more sophisticated models, and techniques for automated
model selection (structure learning). However, graphical models only describe fixed
sets of random variables with fixed dependencies among them; they become awkward in scenarios with unknown objects. There has also been significant work on
first-order probabilistic languages (FOPLs), which explicitly represent objects and
the relations between them. We review some of this work in section 13.7. However,
most FOPLs make the assumptions of unique names, requiring that the symbols
or terms of the language all refer to distinct objects, and domain closure, requiring
that no objects exist besides the ones referred to by terms in the language. These
assumptions are inappropriate for problems such as multitarget tracking, where we
may want to reason about objects that are observed multiple times or that are
not observed at all. Those FOPLs that do support unknown objects often do so in
limited and ad hoc ways. In this chapter, we describe Bayesian logic (Blog) [19], a
new language that compactly and intuitively defines probability distributions over
outcomes with varying sets of objects.
We begin in section 13.2 with three example problems, each of which involves
possible worlds with varying object sets and identity uncertainty. We show Blog
models for these problems and give initial, informal descriptions of the probability
distributions that they define. Section 13.3 observes that the possible worlds in
these scenarios are naturally viewed as model structures of first-order logic. It then
defines precisely the set of possible worlds corresponding to a Blog model. The
key idea is a generative process that constructs a world by adding objects whose
existence and properties depend on those of objects already created. In such a
process, the existence of objects may be governed by many random variables, not
just a single population size variable. Section 13.4 discusses exactly how a Blog
model specifies a probability distribution over possible worlds.
Section 13.5 solves a previously unnoticed probabilistic Skolemization problem:
how to specify evidence about objects, such as radar blips, that one didn't know
existed. Finally, section 13.6 briefly discusses inference in unbounded outcome
spaces, stating a sampling algorithm and a completeness theorem for a large class
of Blog models and giving experimental results on one particular model.
13.2
Examples
In this section we examine three typical scenarios with unknown objects: simplified
versions of the population estimation, record linkage, and multitarget tracking
problems mentioned above. In each case, we provide a short Blog model that,
when combined with a suitable inference engine, constitutes a working solution for
the problem in question.
Example 13.1
An urn contains an unknown number of balls, say, a number chosen from a Poisson
distribution. Balls are equally likely to be blue or green. We draw some balls from
the urn, observing the color of each and replacing it. We cannot tell two identically
colored balls apart; furthermore, observed colors are wrong with probability 0.2.
How many balls are in the urn? Was the same ball drawn twice?
 1   ...
 2   random Color TrueColor(Ball);
 3-6 ...
 7   #Ball ~ Poisson[6]();
 8-9 ...
10   ObsColor(d)
11       if (BallDrawn(d) != null) then
12           ~ TabularCPD[[0.8, 0.2], [0.2, 0.8]](TrueColor(BallDrawn(d)));

Figure 13.1 Blog model for balls in an urn (Example 13.1) with four draws.
The Blog model for this problem, shown in Figure 13.1, describes a stochastic
process for generating worlds. The first 4 lines introduce the types of objects in these
worlds (colors, balls, and draws) and the functions that can be applied to these
objects. For each function, the model specifies a type signature in a syntax similar to
that of C or Java. For instance, line 2 specifies that TrueColor is a random function
that takes a single argument of type Ball and returns a value of type Color. Lines
5-7 specify what objects may exist in each world. In every world, there are exactly
two distinct colors, blue and green, and there are exactly four draws. These are the
guaranteed objects. On the other hand, different worlds have different numbers of
balls, so the number of balls that exist is chosen from a prior: a Poisson with mean
6. Each ball is then given a color, as specified on line 8. Properties of the four draws
are filled in by choosing a ball (line 9) and an observed color for that ball (lines
10-12). The probability of the generated world is the product of the probabilities
of all the choices made.
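The generative process just described can be mimicked in a few lines of Python. This is only a sketch of the process for Example 13.1, not Blog itself or its inference engine: the function names, the dictionary layout of a world, and the handling of an empty urn (both draw functions take the value null, modeled as `None`) are assumptions.

```python
import math
import random

COLORS = ("Blue", "Green")

def sample_poisson(rng, mean):
    """Draw from a Poisson distribution (Knuth's method, fine for small means)."""
    limit = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def sample_world(rng, num_draws=4, mean_balls=6.0, noise=0.2):
    """Generate one world and the log-probability of the choices made."""
    # Number of balls: Poisson with mean 6.
    n = sample_poisson(rng, mean_balls)
    log_p = -mean_balls + n * math.log(mean_balls) - math.lgamma(n + 1)
    # Each ball is blue or green with equal probability.
    true_color = {}
    for i in range(1, n + 1):
        true_color[("Ball", i)] = rng.choice(COLORS)
        log_p += math.log(0.5)
    ball_drawn, obs_color = {}, {}
    for d in range(1, num_draws + 1):
        draw = f"Draw{d}"
        if n == 0:
            # Assumption: with an empty urn both functions are null.
            ball_drawn[draw] = obs_color[draw] = None
            continue
        ball = ("Ball", rng.randint(1, n))   # uniform choice of a ball
        log_p -= math.log(n)
        correct = rng.random() >= noise      # observed color is right w.p. 0.8
        truth = true_color[ball]
        other = COLORS[1 - COLORS.index(truth)]
        ball_drawn[draw] = ball
        obs_color[draw] = truth if correct else other
        log_p += math.log(1 - noise if correct else noise)
    return {"true_color": true_color, "ball_drawn": ball_drawn,
            "obs_color": obs_color, "log_prob": log_p}

world = sample_world(random.Random(0))
```

The accumulated `log_prob` is the logarithm of the product of the probabilities of all the choices made, which is the probability assigned to the generated world.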
 1   ...
 2   random String Name(Researcher);
 3   random String Title(Publication);
 4   random Publication PubCited(Citation);
 5   random String Text(Citation);
 6-7 ...
 8   #Researcher ~ NumResearchersPrior();
 9   #Publication(Author = r) ~ NumPubsPrior();
10   Name(r) ~ NamePrior();
11   Title(p) ~ TitlePrior();
12   ...
13   Text(c) ~ NoisyCitationGrammar(Title(PubCited(c)),
14                                  Name(Author(PubCited(c))));

Figure 13.2
Example 13.2
We have a collection of citations that refer to publications in a certain field. What
publications and researchers exist, with what titles and names? Who wrote which
publication, and to which publication does each citation refer? For simplicity, we
just consider the title and author-name strings in these citations, which are subject
to errors of various kinds, and we assume only single-author publications.
Figure 13.2 shows a Blog model for this example, based on the model in [23].
The Blog model defines the following generative process. First, sample the total
number of researchers from some distribution; then, for each researcher r, sample
the number of publications by that researcher. Sample the researchers' names and
publications' titles from appropriate prior distributions. Then, for each citation,
sample the publication cited by choosing uniformly at random from the set of publications. Finally, generate the citation text with a noisy formatting distribution
that allows for errors and abbreviations in the title and author names.
 1-6 ...
 7   #Aircraft ~ NumAircraftPrior();
 8   State(a, t)
 9       if t = 0 then ~ InitState()
10       else ~ StateTransition(State(a, Pred(t)));
11   #Blip(Source = a, Time = t) ~ DetectionCPD(State(a, t));
12   #Blip(Time = t) ~ NumFalseAlarmsPrior();
13   ApparentPos(b)
14       if (Source(b) = null) then ~ FalseAlarmDistrib()
15       else ~ ObsCPD(State(Source(b), Time(b)));

Figure 13.3
Example 13.3
An unknown number of aircraft exist in some volume of airspace. An aircraft's
state (position and velocity) at each time step depends on its state at the previous
time step. We observe the area with radar: aircraft may appear as identical blips
on a radar screen. Each blip gives the approximate position of the aircraft that
generated it. However, some blips may be false detections, and some aircraft may
not be detected. What aircraft exist, and what are their trajectories? Are there any
aircraft that are not observed?
The Blog model for this scenario (Figure 13.3) describes the following process:
first, sample the number of aircraft in the area. Then, for each time step t (starting
at t = 0), choose the state (position and velocity) of each aircraft given its state at
time t - 1. Also, for each aircraft a and time step t, possibly generate a radar blip
b with Source(b) = a and Time(b) = t. Whether a blip is generated or not depends
on the state of the aircraft; thus the number of objects in the world depends on
certain objects' attributes. Also, at each step t, generate some false-alarm blips
b' with Time(b') = t and Source(b') = null. Finally, sample the position for each
blip given the true state of its source aircraft (or using a default distribution for a
false-alarm blip).
13.3
The possible outcomes for examples 13.1 through 13.3 are structures containing
many related objects, with the set of objects and the relations among them varying
from outcome to outcome. We will treat these outcomes formally as model structures
of first-order logic. A model structure provides interpretations for the symbols of a
first-order language; each sentence of the first-order language can be evaluated to
yield a truth-value in each model structure.
In Example 13.1, the language has function symbols such as TrueColor(b) for the
true color of ball b; BallDrawn(d) for the ball drawn on draw d; and Draw1 for
the first draw. (Usually, first-order languages are described as having predicate,
function, and constant symbols. For conciseness, we view all symbols as function
symbols; predicates are just functions that return a Boolean value, and constants are
just zero-ary functions.) To eliminate meaningless random variables, we use typed
logical languages. Each Blog model uses a language with a particular set of types,
such as Ball and Draw. Blog also has some built-in types that are available in all
models, namely Boolean, NaturalNum, Integer, String, Real, and RkVector (for each
$k \geq 2$). Each function symbol $f$ has a type signature $(\tau_0, \ldots, \tau_k)$, where $\tau_0$ is the
return type of $f$ and $\tau_1, \ldots, \tau_k$ are the argument types. The type Boolean receives
special syntactic treatment: if the return type of a function $f$ is Boolean, then terms
of the form $f(t_1, \ldots, t_k)$ constitute atomic formulae, which can be combined using
logical operators and placed inside quantifiers.
The logical languages used in Blog are also free: a function is not required to
apply to all tuples of arguments, even if they are appropriately typed [16]. For
instance, in Example 13.3, the function Source usually maps blips to aircraft, but
it is not applicable if the blip is a false detection. We adopt the convention that
when a function is not applicable to some arguments, it returns the special value
null. Any function that receives null as an argument also returns null, and an atomic
formula that evaluates to null is treated as false.
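The null convention can be made concrete with a short Python sketch. Representing functions as dictionaries keyed by argument tuples, modeling null as `None`, and the helper names `apply_fn` and `holds` are assumptions made for illustration; they are not part of Blog.

```python
def apply_fn(table, *args):
    """Evaluate a partial function stored as a dict keyed by argument tuples."""
    if any(a is None for a in args):
        return None            # null in, null out
    return table.get(args)     # None when the function is not applicable

def holds(value):
    """Truth value of an atomic formula: null is treated as false."""
    return value is True

# Toy interpretation for the tracking example (hypothetical values).
source = {("Blip1",): ("Aircraft", 2)}        # Blip2 is a false detection
time_of = {("Blip1",): 8, ("Blip2",): 8}
state = {(("Aircraft", 2), 8): "state-of-aircraft-2-at-t8"}
detected = {(("Aircraft", 2),): True}

state_for_blip1 = apply_fn(state, apply_fn(source, "Blip1"), apply_fn(time_of, "Blip1"))
state_for_blip2 = apply_fn(state, apply_fn(source, "Blip2"), apply_fn(time_of, "Blip2"))
```

Because Source is not applicable to the false detection, every term built on top of it evaluates to null as well, and any atomic formula over it is false.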
The truth of any first-order sentence is determined by a model structure for the
corresponding language. A model structure specifies the extension of each type and
the interpretation for each function symbol:
Definition 13.1
A model structure $\omega$ of a typed, free, first-order language consists of an extension
$[\tau]^{\omega}$ for each type $\tau$, which may be an arbitrary set, and an interpretation $[f]^{\omega}$ for
each function symbol $f$. If $f$ has return type $\tau_0$ and argument types $\tau_1, \ldots, \tau_k$, then
equal to $[\mathrm{BallDrawn}]^{\omega}(\mathrm{Draw2})$ in one structure (such as Figure 13.4(a)) but not
another (such as Figure 13.4(b)). The set of balls, $[\mathrm{Ball}]^{\omega}$, can also vary between
structures, as Figure 13.4 illustrates. The purpose of a Blog model is to define a
probability distribution over such structures. Because any sentence can be evaluated
as true or false in each model structure, a distribution over model structures
implicitly defines the probability that $\varphi$ is true for each sentence $\varphi$ in the logical
language.

Figure 13.4 Three model structures, (a), (b), and (c), for the language of Figure 13.1.
Shaded circles represent balls that are blue; shaded squares represent draws where the
drawn ball appeared blue (unshaded means green). Arrows represent the BallDrawn
function from draws to balls.
13.3.2
$[f]^{\omega}(o_1, \ldots, o_k)$. For instance, in a simplified version of Example 13.1 where the urn
contains a known set of balls {Ball1, . . . , Ball8} and we make four draws, the RVs are
TrueColor[Ball1], . . . , TrueColor[Ball8], BallDrawn[Draw1], . . . , BallDrawn[Draw4],
and ObsColor[Draw1], . . . , ObsColor[Draw4]. The possible worlds are in one-to-one
correspondence with full instantiations of these basic RVs. Thus, a joint distribution
for the basic RVs defines a distribution over possible worlds.
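As a quick check on this correspondence, the sketch below enumerates the basic RVs of the simplified urn model and counts their full instantiations. The tuple encoding of RV names and the variable names are assumptions chosen for the example.

```python
import math

balls = [f"Ball{i}" for i in range(1, 9)]
draws = [f"Draw{i}" for i in range(1, 5)]
colors = ["Blue", "Green"]

# Map each basic RV to the list of values it can take.
basic_rvs = {}
for b in balls:
    basic_rvs[("TrueColor", b)] = colors
for d in draws:
    basic_rvs[("BallDrawn", d)] = balls
    basic_rvs[("ObsColor", d)] = colors

# Each full instantiation of the basic RVs is one possible world.
num_worlds = math.prod(len(values) for values in basic_rvs.values())
```

With eight balls and four draws there are 16 basic RVs and $2^8 \cdot 8^4 \cdot 2^4 = 2^{24}$ possible worlds.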
13.3.3
Unknown Objects
In general, a Blog model defines a generative process in which objects are added
iteratively to a world. To describe such processes, we first introduce origin function
declarations (see footnote 1), such as lines 5-6 of Figure 13.3. Unlike other functions, origin
functions such as Source or Time have their values set when an object is added.
An origin function must take a single argument of some type $\tau$ (namely Blip in the
example); it is then called a $\tau$-origin function.
Generative steps that add objects to the world are described by number statements, such as line 11 of Figure 13.3:
#Blip(Source = a, Time = t) ~ DetectionCPD(State(a, t));
This statement says that for each aircraft a and time step t, the process adds some
number of blips, and each of these added blips b has the property that Source(b) = a
and Time(b) = t. In general, the beginning of a number statement has the form
$\#\tau(g_1 = x_1, \ldots, g_k = x_k)$,
where $\tau$ is a type, $g_1, \ldots, g_k$ are $\tau$-origin functions, and $x_1, \ldots, x_k$ are logical
variables. (For types that are generated ab initio with no origin functions, the empty
parentheses are omitted, as in Figure 13.1.) The inclusion of a number statement
means that for each appropriately typed tuple of objects $o_1, \ldots, o_k$, the generative
process adds some random number (possibly zero) of objects $q$ of type $\tau$ such that
$[g_i]^{\omega}(q) = o_i$ for $i = 1, \ldots, k$. Note that the types of the generating objects $o_1, \ldots, o_k$
are the return types of $g_1, \ldots, g_k$.
Object generation can even be recursive: objects can generate other objects of
the same type. For instance, consider a model of sexual reproduction in which
every male-female pair of individuals produces some number of offspring. We could
represent this with the number statement:
#Individual(Mother = m, Father = f)
    if Female(m) & !Female(f) then ~ NumOffspringPrior();
1. In [19] we used the term "generating function," but we have now adopted the term
"origin function" because it seems clearer.
We can also view number statements more declaratively:
Definition 13.2
Let $\omega$ be a model structure of $L_M$, and consider a number statement for type $\tau$
with origin functions $g_1, \ldots, g_k$. An object $q \in [\tau]^{\omega}$ satisfies this number statement
applied to $o_1, \ldots, o_k$ in $\omega$ if $[g_i]^{\omega}(q) = o_i$ for $i = 1, \ldots, k$, and $[g]^{\omega}(q) = \mathrm{null}$ for all
other $\tau$-origin functions $g$.
Note that if a number statement for type $\tau$ omits one of the $\tau$-origin functions,
then this function takes on the value null for all objects satisfying that number
statement. For instance, Source is null for objects satisfying the
false-detection number statement on line 12 of Figure 13.3:
#Blip(Time = t) ~ NumFalseAlarmsPrior();
Also, a Blog model cannot contain two number statements with the same set of
origin functions. This ensures that, in any given model structure, each object o
has exactly one generation history, which can be found by tracing back the origin
functions on o.
The set of possible worlds $\Omega_M$ is the set of model structures that can be
constructed by M's generative process. To complete the picture, we must explain
not only how many objects are added on each step, but also what these objects are. It
turns out to be convenient to define the generated objects as follows: when a number
statement with type $\tau$ and origin functions $g_1, \ldots, g_k$ is applied to generating
objects $o_1, \ldots, o_k$, the generated objects are tuples $\{(\tau, (g_1, o_1), \ldots, (g_k, o_k), n) :
n = 1, \ldots, N\}$, where $N$ is the number of objects generated. Thus in Example 13.3,
the aircraft are pairs (Aircraft, 1), (Aircraft, 2), etc., and the blips generated by
aircraft are nested tuples such as (Blip, (Source, (Aircraft, 2)), (Time, 8), 1). The tuple
encodes the object's generation history; of course, it is purely internal to the
semantics and remains invisible to the user.
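The tuple representation is easy to mirror in Python. The sketch below builds the generated objects for one application of a number statement and reads an origin function back off an object; the helper names `generate_objects` and `origin` are hypothetical, and null is again modeled as `None`.

```python
def generate_objects(type_name, origins, count):
    """Objects produced by one application of a number statement.

    origins: list of (origin_function_name, generating_object) pairs.
    Returns the tuples (type, (g1, o1), ..., (gk, ok), n) for n = 1..count.
    """
    return [(type_name, *origins, n) for n in range(1, count + 1)]

def origin(obj, function_name):
    """Value of an origin function on obj, or None (null) if it was omitted."""
    for name, value in obj[1:-1]:
        if name == function_name:
            return value
    return None

aircraft = generate_objects("Aircraft", [], 2)
blips = generate_objects("Blip", [("Source", aircraft[1]), ("Time", 8)], 1)
false_alarms = generate_objects("Blip", [("Time", 8)], 2)
```

Here `blips[0]` is the nested tuple (Blip, (Source, (Aircraft, 2)), (Time, 8), 1) from the text, and tracing `origin` back through an object recovers its generation history.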
Definition 13.3
The universe of a type $\tau$ in a Blog model $M$, denoted $U_M(\tau)$, consists of the
guaranteed objects of type $\tau$ as well as all nested tuples of type $\tau$ that can be
generated from the guaranteed objects through finitely many recursive applications
of number statements.
As the following definition stipulates, in each possible world the extension of $\tau$ is
some subset of $U_M(\tau)$.
Definition 13.4
For a Blog model $M$, the set of possible worlds $\Omega_M$ is the set of model structures
$\omega$ of $L_M$ such that
4. for every type $\tau$, each element of $[\tau]^{\omega}$ satisfies some number statement applied to
some objects in $\omega$.
Note that by part 3 of this definition, the number of objects generated by any
given application of a number statement in world $\omega$ is a finite number $N$. However,
a world can still contain infinitely many nonguaranteed objects if some number
statements are applied recursively: then the world may contain tuples that are
nested to depths 1, 2, 3, . . ., with no upper bound. Infinitely many objects can
also result if number statements are triggered for every natural number, like the
statements that generate radar blips in Example 13.3.
With a fixed set of objects, it was easy to define a set of basic RVs such that a
full instantiation of the basic RVs uniquely identified a possible world. To achieve
the same effect with unknown objects, we need two kinds of basic RVs:
Definition 13.5
For a Blog model $M$, the set $V_M$ of basic random variables consists of:
for each random function $f$ with type signature $(\tau_0, \ldots, \tau_k)$ and each tuple
of objects $(o_1, \ldots, o_k) \in U_M(\tau_1) \times \cdots \times U_M(\tau_k)$, a function application RV
ing objects with tuples might seem unnecessarily complicated, but it becomes
very helpful when we define a Bayes net over the basic RVs (which we do
in section 13.4.2). For instance, in the aircraft tracking example, the parent
of ApparentPos[(Blip, (Source, (Aircraft, 2)), (Time, 8), 1)] is State[(Aircraft, 2), 8]. It
might seem more elegant to assign numbers to objects as they are generated, so
that the extension of each type in each possible world would be simply a prefix
of the natural numbers. Specifically, we could number the aircraft arbitrarily, and
then number the radar blips lexicographically by aircraft and time step. Then we
would have basic RVs such as ApparentPos[23], representing the apparent aircraft
position for blip 23. But blip 23 could be generated by any aircraft at any time
step. In fact, the parents of ApparentPos[23] would have to include all the #Blip
and State variables in the model. So defining objects as tuples yields a much simpler
Bayes net.
13.4
Dependency Statements
Dependency and number statements specify exactly how the steps are carried out
in our generative process. Consider the dependency statement for State(a, t) from
Figure 13.3:
State(a, t)
if t = 0 then ~ InitState()
else ~ StateTransition(State(a, Pred(t)));
This statement is applied for every basic RV of the form State[a, t] where a ∈
U_M(Aircraft) and t ∈ ℕ. If t = 0, the conditional distribution for State[a, t]
is given by the elementary conditional probability distribution (CPD) InitState;
otherwise it is given by the elementary CPD StateTransition, which takes
State(a, Pred(t)) as an argument. These elementary CPDs define distributions over
objects of type R6Vector (the return type of State). In our implementation, elementary
CPDs are Java classes with a method getProb that returns the probability of
a particular value given a list of CPD arguments, and a method sampleVal that
samples a value given the CPD arguments.
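The two-method interface described above can be sketched in a few lines. The following is only an illustrative Python analogue, not the chapter's Java implementation: the class names `ElementaryCPD` and `TabularCPD`, the snake_case method names, and the probabilities in the table are all assumptions made for this example.

```python
import random
from abc import ABC, abstractmethod


class ElementaryCPD(ABC):
    """Hypothetical analogue of an elementary CPD class (names are assumed)."""

    @abstractmethod
    def get_prob(self, args, value):
        """Return the probability of `value` given the list of CPD arguments."""

    @abstractmethod
    def sample_val(self, args, rng):
        """Sample a value given the list of CPD arguments."""


class TabularCPD(ElementaryCPD):
    """Discrete CPD stored as: tuple of argument values -> {value: probability}."""

    def __init__(self, table):
        self.table = table

    def get_prob(self, args, value):
        return self.table[tuple(args)].get(value, 0.0)

    def sample_val(self, args, rng):
        dist = self.table[tuple(args)]
        values = list(dist)
        weights = [dist[v] for v in values]
        return rng.choices(values, weights=weights, k=1)[0]


# Illustrative noisy-observation CPD; the 0.8 / 0.2 entries are assumed numbers.
obs_color = TabularCPD({
    ("Blue",): {"Blue": 0.8, "Green": 0.2},
    ("Green",): {"Blue": 0.2, "Green": 0.8},
})
rng = random.Random(0)
sample = obs_color.sample_val(["Blue"], rng)
```

Keeping scoring (`get_prob`) and sampling (`sample_val`) on the same object lets one CPD serve both likelihood computations and forward sampling.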
A dependency statement begins with a function symbol f and a tuple of logical
variables x1, ..., xk representing the arguments to this function. In a number
statement, the variables x1, ..., xk represent the generating objects. In either case,
the rest of the statement consists of a sequence of clauses. When the statement is
not abbreviated, the syntax for the first clause is
if cond then ~ elem-cpd(arg1, ..., argN)
Declarative Semantics
Figure 13.5: Bayes net for the Blog model in Figure 13.1. The ellipses and dashed arrows indicate that there are infinitely many TrueColor[b] nodes. (Node labels shown: #Ball[]; TrueColor[(Ball, 1)], TrueColor[(Ball, 2)], TrueColor[(Ball, 3)]; BallDrawn[Draw1], BallDrawn[Draw4]; ObsColor[Draw1], ObsColor[Draw4].)
BN is acyclic and each variable has finitely many ancestors, then these probability
assignments define a unique distribution [14].
The difficulty is that in the BN corresponding to a Blog model, variables often
have infinite parent sets. For instance, the BN for Example 13.1 (shown partially
in Figure 13.5) has an infinite number of basic RVs of the form TrueColor[b]: if it
had only a finite number N of these RVs, it could not represent outcomes with
more than N balls. Furthermore, each of these TrueColor[b] RVs is a parent of each
ObsColor[d] RV, since if BallDrawn[d] happens to be b, then the observed color on
draw d depends directly on the color of ball b. So the
ObsColor[d] nodes have infinitely many parents. In such a model, assigning
probabilities to finite instantiations that are closed under the parent relation
does not define a unique distribution: in particular, it tells us nothing about the
ObsColor[d] variables.
We required instantiations to be closed under the parent relation so that the
factors p_X(σ_X | σ_Pa(X)) would be well-defined. But we may not need the values of
all of X's parents in order to determine the conditional distribution for X. For
instance, knowing BallDrawn[d] = (Ball, 13) and TrueColor[(Ball, 13)] = Blue is
sufficient to determine the distribution for ObsColor[d]: the colors of all the other balls
are irrelevant in this context. We can read off this context-specific independence
from the dependency statement for ObsColor in Figure 13.1 by noting that the
instantiation (BallDrawn[d] = (Ball, 13), TrueColor[(Ball, 13)] = Blue) determines the
value of the sole CPD argument TrueColor(BallDrawn(d)). We say this instantiation
supports the variable ObsColor[d] (see [20]).
Definition 13.7
An instantiation σ supports a basic RV V of the form f[o1, ..., ok] or
#τ[g1 = o1, ..., gk = ok] if all possible worlds consistent with σ agree on (1) whether
all the objects o1, ..., ok exist, and, if so, on (2) the applicable clause in the
dependency or number statement for V and the values for the CPD arguments in that
clause.
Note that some RVs, such as #Ball [] in Example 13.1, are supported by the
empty instantiation. We can now generalize the notion of being closed under the
parent relation.
Definition 13.8
A finite instantiation σ is self-supporting if its instantiated variables can be
numbered X_1, ..., X_N such that for each n ≤ N, the restriction of σ to {X_1, ..., X_{n-1}}
supports X_n.
This definition lets us give semantics to Blog models in a way that is meaningful
even when the corresponding BNs contain infinite parent sets. We will write
p_V(v | σ) for the probability that V's dependency or number statement assigns
to the value v, given an instantiation σ that supports V.
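Definition 13.9
A distribution P over Ω_M satisfies a Blog model M if for every finite,
self-supporting instantiation σ with vars(σ) ⊆ V_M:
$$P(\Omega_\sigma) = \prod_{n=1}^{N} p_{X_n}\left(\sigma_{X_n} \mid \sigma_{\{X_1, \ldots, X_{n-1}\}}\right) \qquad (13.1)$$
where Ω_σ is the set of possible worlds consistent with σ, and X_1, ..., X_N is a
numbering of the variables of σ as in Definition 13.8.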
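To make the product in (13.1) concrete, here is a minimal Python sketch for a one-draw urn model. It multiplies one factor per variable, in an order where each variable is supported by the variables instantiated before it. The prior over #Ball, the color prior, the 0.8 observation accuracy, and all function names are assumptions chosen for illustration; they are not taken from Figure 13.1.

```python
def num_ball_dist(sigma):
    # Supported by the empty instantiation (illustrative prior).
    return {1: 0.5, 2: 0.5}


def ball_drawn_dist(sigma):
    n = sigma.get("#Ball")
    if n is None:
        return None  # not supported until #Ball is instantiated
    return {b: 1.0 / n for b in range(1, n + 1)}


def true_color_dist(ball):
    def dist(sigma):
        n = sigma.get("#Ball")
        # This sketch simply refuses to score variables whose ball is not
        # known to exist.
        if n is None or ball > n:
            return None
        return {"Blue": 0.5, "Green": 0.5}
    return dist


def obs_color_dist(sigma):
    b = sigma.get("BallDrawn[1]")
    if b is None:
        return None
    c = sigma.get(f"TrueColor[{b}]")
    if c is None:
        return None  # only the drawn ball's color is needed
    other = "Green" if c == "Blue" else "Blue"
    return {c: 0.8, other: 0.2}


def cpd_for(var):
    if var.startswith("TrueColor["):
        return true_color_dist(int(var[len("TrueColor["):-1]))
    return {"#Ball": num_ball_dist,
            "BallDrawn[1]": ball_drawn_dist,
            "ObsColor[1]": obs_color_dist}[var]


def probability(ordered_instantiation):
    """Product of factors over a numbering X_1, ..., X_N of the instantiation."""
    sigma, p = {}, 1.0
    for var, value in ordered_instantiation:
        dist = cpd_for(var)(sigma)
        if dist is None:
            raise ValueError(f"{var} is not supported by the earlier variables")
        p *= dist.get(value, 0.0)
        sigma[var] = value
    return p


p = probability([("#Ball", 2), ("BallDrawn[1]", 2),
                 ("TrueColor[2]", "Blue"), ("ObsColor[1]", "Blue")])
```

Note that TrueColor[1] never appears: the instantiation is self-supporting without it, which is exactly the context-specific independence that lets the semantics avoid infinite parent sets.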
13.6
Inference
Because the set of basic RVs of a Blog model can be infinite, it is not obvious that
inference for well-defined Blog models is even decidable. However, the generative
process intuition suggests a rejection sampling algorithm. We present this algorithm
not because it is particularly efficient, but because it demonstrates the decidability
of inference for a large class of Blog models (see Theorem 13.12 below) and
illustrates several issues that any Blog inference algorithm must deal with. At
the end of this section, we present experimental results from a somewhat more
efficient likelihood weighting algorithm.
13.6.1
Rejection sampling
13.6.2
Termination Criteria
In order to generate each sample, the algorithm above repeatedly instantiates the
first variable that is supported but not yet instantiated, until it instantiates all
the query and evidence variables. When can we be sure that this will take a finite
amount of time? The first way this process could fail to terminate is if it goes into
an infinite loop while checking whether a particular variable is supported. This
happens if the program ends up enumerating an infinite set while evaluating a
set expression or quantified formula. We can avoid this by ensuring that all such
expressions in the Blog model are finite once origin function restrictions are taken
into account.
The sample generator also fails to terminate if it never constructs an instantiation
that supports a particular query or evidence variable. To see how this can happen,
consider calling the subroutine described above to sample a variable V. If V is not
supported, the subroutine will realize this when it encounters a variable U that is
relevant but not instantiated. Now consider a graph over basic variables where we
draw an edge from U to V when the evaluation process for V hits U in this way. If
a variable is never supported, then it must be part of a cycle in this graph, or part
of a receding chain of variables V1 ← V2 ← ··· that is extended infinitely.
The graph constructed in this way varies from sample to sample: for instance,
sometimes the evaluation process for ObsColor[d] will hit TrueColor[(Ball, 7)], and
sometimes it will hit TrueColor[(Ball, 13)]. However, we can rule out cycles and
infinite receding chains in all these graphs by considering a more abstract graph
over function symbols and types (along the same lines as the dependency graph of
[15, 4]).
4. This left-to-right evaluation scheme does not always detect that a formula is
determined: for instance, on φ ∨ ψ, it returns "undetermined" if φ is undetermined but ψ is
true, even though φ ∨ ψ must be true in this case.
Figure 13.6: Symbol graphs for (a) the urn-and-balls model in Figure 13.1; (b) the bibliographic model in Figure 13.2; (c) the aircraft tracking model in Figure 13.3.
Definition 13.11
The symbol graph for a Blog model M is a directed graph whose nodes are the
types and random function symbols of M, where the parents of a type τ or function
symbol f are
the random function symbols that occur on the right-hand side of the dependency
statement for f or some number statement for τ;
the types of variables that are quantified over in formulae or set expressions on
the right-hand side of such a statement;
the types of the arguments for f or the return types of origin functions for τ.
The symbol graphs for our three examples are shown in Figure 13.6. If the
sampling subroutine for a basic RV V hits a basic RV U, then there must be
an edge from U's function symbol (or type, if U is a number RV) to V's function
symbol (or type) in the symbol graph. This property, along with ideas from [20],
allows us to prove the following:
Theorem 13.12
Suppose M is a Blog model where
1. uncountable built-in types do not serve as function arguments or as the return
types of origin functions;
2. each quantified formula and set expression ranges over a finite set once origin
function restrictions are taken into account;
3. the symbol graph is acyclic.
Then M is well-defined. Also, for any evidence instantiation e and query variable
Q, the rejection sampling algorithm described in section 13.6.1 converges to the
posterior P(Q|e) defined by the model, taking finite time per sampling step.
The criteria in Theorem 13.12 are very conservative: in particular, when we
construct the symbol graph, we ignore all structure in the dependency statements and
just check for the occurrence of function and type symbols. These criteria are
satisfied by the models in Figures 13.1 and 13.2. However, the aircraft tracking model in
Figure 13.3 does not satisfy the criteria because its symbol graph (Figure 13.6(c))
contains a self-loop from State to State. The criteria do not exploit the fact that
State(a, t) depends only on State(a, Pred(t)), and the nonrandom function Pred is
acyclic. Friedman et al. [4] have already dealt with this issue in the context of
probabilistic relational models; their algorithm can be adapted to obtain a stronger
version of Theorem 13.12 that covers the aircraft tracking model.
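Condition 3 of the theorem is a plain cycle check on the symbol graph, which can be sketched as a depth-first search. In the Python below, the function name `find_cycle` is hypothetical and the edge lists are an assumed reading of Definition 13.11 applied to the urn and aircraft models; only the State self-loop is stated explicitly in the text.

```python
def find_cycle(parents):
    """Return a list of nodes tracing a cycle, or None if the graph is acyclic.

    `parents` maps each node (type or function symbol) to its parent nodes.
    """
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}
    stack = []

    def visit(node):
        color[node] = GREY
        stack.append(node)
        for parent in parents.get(node, ()):
            state = color.get(parent, WHITE)
            if state == GREY:  # back edge: parent is on the current path
                return stack[stack.index(parent):] + [parent]
            if state == WHITE:
                found = visit(parent)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in list(parents):
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Assumed symbol graphs (edges inferred from Definition 13.11, not from the figure).
urn = {
    "Ball": [], "Draw": [],
    "TrueColor": ["Ball"],
    "BallDrawn": ["Draw", "Ball"],
    "ObsColor": ["Draw", "BallDrawn", "TrueColor"],
}
aircraft = {
    "Aircraft": [], "NaturalNum": [],
    "State": ["Aircraft", "NaturalNum", "State"],  # self-loop noted in the text
    "Blip": ["Aircraft", "NaturalNum"],
    "ApparentPos": ["Blip", "State"],
}

urn_cycle = find_cycle(urn)
aircraft_cycle = find_cycle(aircraft)
```

The check is purely syntactic, which is why it rejects the aircraft model even though State(a, t) only ever refers to the earlier time step Pred(t).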
13.6.3
Experimental results
Milch et al. [20] describe a guided likelihood weighting algorithm that uses backward
chaining from the query and evidence nodes to avoid sampling irrelevant variables.
This algorithm can also be adapted to Blog models. We applied this algorithm
to Example 13.1, asserting that 10 balls were drawn and all appeared blue, and
querying the number of balls in the urn. Figure 13.7(a) shows that when the prior
for the number of balls is uniform over {1, ..., 8}, the posterior puts more weight
on small numbers of balls; this makes sense because the more balls there are in the
urn, the less likely it is that they are all blue. Figure 13.7(b), using a Poisson(6)
prior, shows a similar but less pronounced effect.
Note that in Figure 13.7, the posterior probabilities computed by the likelihood
weighting algorithm are very close to the exact values (computed by exhaustive
enumeration of possible worlds with up to 170 balls). We were able to obtain
this level of accuracy using runs of 20,000 samples with the uniform prior, and
100,000 samples using the Poisson prior. On a Linux workstation with a 3.2GHz
Pentium 4 processor, the runs with the uniform prior took about 35 seconds (571
samples/second), and those with the Poisson prior took about 170 seconds (588
samples/second). Such results could not be obtained using any algorithm that
constructed a single fixed BN, since the number of potentially relevant TrueColor[b]
variables is infinite in the Poisson case.
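For the uniform prior, the exact posterior can be computed by direct enumeration, which shows why small urns are favored. The Python sketch below rests on assumptions not stated in this section: each ball is blue with probability 0.5, draws are uniform with replacement, and an observed color matches the true color with probability 0.8. The names `likelihood_all_blue`, `p_blue`, and `p_correct` are hypothetical.

```python
from math import comb


def likelihood_all_blue(n_balls, n_draws=10, p_blue=0.5, p_correct=0.8):
    """P(all draws appear blue | n_balls), summing over the number k of blue balls."""
    total = 0.0
    for k in range(n_balls + 1):
        p_k = comb(n_balls, k) * p_blue**k * (1 - p_blue)**(n_balls - k)
        frac_blue = k / n_balls
        # Probability that a single draw appears blue, given k blue balls.
        p_obs_blue = frac_blue * p_correct + (1 - frac_blue) * (1 - p_correct)
        total += p_k * p_obs_blue**n_draws
    return total


prior = {n: 1 / 8 for n in range(1, 9)}  # uniform over {1, ..., 8}
unnormalized = {n: prior[n] * likelihood_all_blue(n) for n in prior}
z = sum(unnormalized.values())
posterior = {n: w / z for n, w in unnormalized.items()}
```

Under these assumptions the likelihood of ten blue observations falls as the urn grows, because a larger urn is less likely to be almost entirely blue, so the posterior mass shifts toward small numbers of balls.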
Figure 13.7: [plots not reproduced] Panels (a) and (b), each plotting Probability against Number of balls in urn.
13.7
Related Work
Gaifman [5] was the first to suggest defining a probability distribution over
first-order model structures. Halpern [10] defines a language in which one can make
statements about such distributions: for instance, that the probability of the set of
worlds that satisfy Flies(Tweety) is 0.8. Probabilistic logic programming [22] can be
seen as an application of this approach to Horn-clause knowledge bases. Such an
approach only defines constraints on distributions, rather than defining a unique
distribution.
Most FOPLs that define unique distributions fix the set of objects and the
interpretations of (non-Boolean) function symbols. Examples include relational
Bayesian networks [12] and Markov logic models [3]. Prolog-based languages such
as probabilistic Horn abduction [26], PRISM [28], and Bayesian logic programs [14]
work with Herbrand models, where the objects are in one-to-one correspondence
with the ground terms of the language (a consequence of the unique names and
domain closure assumptions).
There are a few FOPLs that allow explicit reference uncertainty, i.e., uncertainty
about the interpretations of function symbols. Among these are two languages that
use indexed RVs rather than logical notation: BUGS [7] and indexed probability
diagrams (IPDs) [21]. Reference uncertainty can also be represented in probabilistic
relational models (PRMs) [15], where a "single-valued complex slot" corresponds
to an uncertain unary function. PRMs are unfortunately restricted to unary
functions (attributes) and binary predicates (relations). Probabilistic entity-relationship
models [11] lift this restriction, but represent reference uncertainty using relations
(such as Drawn(d, b)) and special mutual exclusivity constraints, rather than with
References
[1] P. Carbonetto, J. Kisyński, N. de Freitas, and D. Poole. Nonparametric
Bayesian logic. In Proceedings of the Conference on Uncertainty in Artificial
Intelligence, 2005.
[2] E. Charniak and R. P. Goldman. A Bayesian model of plan recognition.
Artificial Intelligence, 64(1):53-79, 1993.
[3] P. Domingos and M. Richardson. Markov logic: A unifying framework for
statistical relational learning. In ICML Workshop on Statistical Relational
Learning and Its Connections to Other Fields, 2004.
[4] N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic
relational models. In Proceedings of the International Joint Conference on
Artificial Intelligence, 1999.
[5] H. Gaifman. Concerning measures in first order calculi. Israel Journal of
Mathematics, 2:1-18, 1964.
[6] L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning probabilistic
models of relational structure. In Proceedings of the International Conference
on Machine Learning, 2001.
[7] W. R. Gilks, A. Thomas, and D. J. Spiegelhalter. A language and program for
complex Bayesian modelling. The Statistician, 43(1):169-177, 1994.
[8] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, editors. Markov Chain
Monte Carlo in Practice. Chapman and Hall, London, 1996.
[9] H. Haario, E. Saksman, and J. Tamminen. An adaptive Metropolis algorithm.
Bernoulli, 7:223-242, 2001.
[10] J. Y. Halpern. An analysis of first-order logics of probability. Artificial
Intelligence, 46:311-350, 1990.
[11] D. Heckerman, C. Meek, and D. Koller. Probabilistic models for relational
data. Technical Report MSR-TR-2004-30, Microsoft Research, Seattle, WA,
2004.
[12] M. Jaeger. Complex probabilistic modeling with recursive relational Bayesian
networks. Annals of Math and Artificial Intelligence, 32:179-220, 2001.
[13] S. Jain and R. M. Neal. A split-merge Markov chain Monte Carlo procedure
for the Dirichlet process mixture model. Journal of Computational and
Graphical Statistics, 13:158-182, 2004.
[14] K. Kersting and L. De Raedt. Adaptive Bayesian logic programs. In
Proceedings of the International Conference on Inductive Logic Programming,
2001.
[15] D. Koller and A. Pfeffer. Probabilistic frame-based systems. In Proceedings
of the National Conference on Artificial Intelligence, 1998.
[16] K. Lambert. Free logics, philosophical issues in. In E. Craig, editor, Routledge
Encyclopedia of Philosophy. Routledge, London, 1998.
[17] K. B. Laskey and P. C. G. da Costa. Of starships and Klingons: Bayesian
logic for the 23rd century. In Proceedings of the Conference on Uncertainty in
Artificial Intelligence, 2005.
[18] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller.
Equations of state calculations by fast computing machines. Journal of
Chemical Physics, 21:1087-1092, 1953.
[19] B. Milch, B. Marthi, S. Russell, D. Sontag, D. L. Ong, and A. Kolobov. BLOG:
Probabilistic models with unknown objects. In Proceedings of the International
Joint Conference on Artificial Intelligence, 2005.
[20] B. Milch, B. Marthi, D. Sontag, S. Russell, D. L. Ong, and A. Kolobov.
Approximate inference for infinite contingent Bayesian networks. In Tenth
International Workshop on Artificial Intelligence and Statistics, 2005.
[21] E. Mjolsness. Labeled graph notations for graphical models. Technical Report
04-03, School of Information and Computer Science, University of California,
Irvine, 2004.
[22] R. T. Ng and V. S. Subrahmanian. Probabilistic logic programming.
Information and Computation, 101(2):150-201, 1992.
[23] H. Pasula, B. Marthi, B. Milch, S. Russell, and I. Shpitser. Identity
uncertainty and citation matching. In Proceedings of Neural Information Processing
Systems, 2003.
[24] J. Pearl. Probabilistic Reasoning in Intelligent Systems, revised edition.
Morgan Kaufmann, San Francisco, 1988.
[25] A. Pfeffer. IBAL: A probabilistic rational programming language. In
Proceedings of the International Joint Conference on Artificial Intelligence, 2001.
[26] D. Poole. Probabilistic Horn abduction and Bayesian networks. Artificial
Intelligence, 64(1):81-129, 1993.
[27] A. Popescul, L. H. Ungar, S. Lawrence, and D. M. Pennock. Statistical
relational learning for document mining. In Proceedings of the IEEE International
Conference on Data Mining, 2003.
[28] T. Sato and Y. Kameya. Parameter learning of logic programs for
symbolic-statistical modeling. Journal of Artificial Intelligence Research, 15:391-454,
2001.
[29] R. W. Sittler. An optimal data association problem in surveillance theory.
IEEE Transactions on Military Electronics, MIL-8:125-139, 1964.
[30] G. C. G. Wei and M. A. Tanner. A Monte Carlo implementation of the EM
algorithm and the poor man's data augmentation algorithms. Journal of the
American Statistical Association, 85:699-704, 1990.
Avi Pfeffer
14.1
Introduction
In a rational programming language, a program specifies a situation encountered
by an agent; evaluating the program amounts to computing what a rational agent
would believe or do in the situation. Rational programming combines the
advantages of declarative representations with features of programming languages such
as modularity, compositionality, and type systems. A system designer need not
reinvent the algorithms for deciding what the system should do in each possible
situation it encounters. It is sufficient to declaratively describe the situation, and
leave the sophisticated inference algorithms to the implementors of the language.
One can think of Prolog as a rational programming language, focused on
computing the beliefs of an agent that uses logical deduction. In the past few years there has
been a shift in AI toward specifications of rational behavior in terms of probability
and decision theory. There is therefore a need for a natural, expressive,
general-purpose, and easy-to-program language for probabilistic modeling. This chapter
presents IBAL, a probabilistic rational programming language. IBAL, pronounced
"eyeball," stands for Integrated Bayesian Agent Language. As its name suggests,
it integrates various aspects of probability-based rational behavior, including
probabilistic reasoning, Bayesian parameter estimation, and decision-theoretic utility
maximization. This chapter will focus on the probabilistic representation and
reasoning capabilities of IBAL, and not discuss the learning and decision-making aspects.
High-level probabilistic languages have generally fallen into two categories. The
first category is rule-based [19, 13, 5]. In this approach the general idea is to associate
logic-programming-like rules with noise factors. A rule describes how one first-order
term depends on other terms. Given a specific query and a set of observations, a
Bayesian network (BN) can be constructed describing a joint distribution over all
the first-order variables in the domain.
The second category of language is object-based [7, 8, 10]. In this approach, the
world is described in terms of objects and the relationships between them. Objects
have attributes, and the probabilistic model describes how the attributes of an
object depend on other attributes of the same object and on attributes of related
objects. The model specifies a joint probability distribution over the attributes of
all objects in the domain.
This chapter explores a different approach to designing high-level probabilistic
languages. IBAL is a functional language for specifying probabilistic models. Models
in IBAL look like programs in a functional programming language. In the functional
approach, a model is a description of a computational process. The process
stochastically generates a value, and the meaning of the model is the distribution over the
value generated by the process.
The functional approach, as embodied in IBAL, has a number of attractive
features. First of all, it is an extremely natural way to describe a probabilistic
model. To construct a model, one simply has to provide a description of the way
the world works. Describing the generative process explicitly is the most direct way
to describe a generative model. Second, IBAL is highly expressive. It builds on top
of a Turing-complete programming language, so that every generative model that
can reasonably be described computationally can be described in IBAL. Third, by
basing probabilistic modeling languages on programming languages, we are able
to enjoy the benefits of a programming language, such as a type system and type
inference. Furthermore, by building on the technology of functional languages, we
are able to utilize all their features, such as lambda abstraction and higher-order
functions.
In addition, the use of a functional programming framework provides an elegant
and uniform language with which to describe all aspects of a model. All levels of a
model can be described in the language, including the low-level probabilistic
dependencies and the high-level structure. This is in contrast to rule-based approaches,
in which combination rules describe how the different rules fit together. It is also in
contrast to object-based languages, in which the low-level structure is represented
using conditional probability tables and a different language is used for high-level
structure. Furthermore, PRMs use special syntax to handle uncertainty over the
relational structure. This means that each such feature must be treated as a special
case, with special purpose inference algorithms. In IBAL, special features are
encoded using the language syntax, and the general-purpose inference algorithm is
applied to handle them.
IBAL is an ideal rapid prototyping language for developing new probabilistic
models. Several examples are provided that show how easy it is to express models
in the language. These include well-known models as well as new models. IBAL has
been implemented, and made publicly available at
http://www.eecs.harvard.edu/~avi/IBAL.
The chapter begins by presenting the IBAL language. The initial focus is on the
features that allow description of generative probabilistic models. After presenting
examples, the chapter presents the declarative semantics of IBAL.
When implementing a highly expressive reasoning language, the question of
inference comes to the forefront. Because IBAL is capable of expressing many
different frameworks, its inference algorithm should generalize the algorithms of
those frameworks. If, for example, a BN is encoded in IBAL, the IBAL inference
algorithm should perform the same operations as a BN inference algorithm. This
chapter describes the IBAL inference algorithm and shows how it generalizes many
existing frameworks, including Bayesian networks, hidden Markov models (HMMs),
and stochastic context free grammars (SCFGs). Seven desiderata for a
general-purpose inference algorithm are presented, and it is shown how IBAL's algorithm
satisfies all of them simultaneously.
14.2
Basic Expressions
Example 14.1
It is important to note that in an expression of the form let x = e1 in e2 the variable
x is assigned a specific value in the experiment; any stochastic choices made while
evaluating e1 are resolved, and the result is assigned to x. For example, consider
let z = dist [ 0.5 : true, 0.5 : false ] in
z & z
The value of z is resolved to be either true or false, and the same value is used in
the two places in which z appears in z & z. Thus the whole expression evaluates
to true with probability 0.5, not 0.25, which is what the result would be if z were
reevaluated each time it appears. Thus the let construct provides a way to make
different parts of an expression probabilistically dependent, by making them both
mention the same variable.
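The difference between the two readings can be checked by enumeration. The following Python sketch is an illustration only and does not model IBAL's evaluator; the variable names are made up for the example.

```python
from itertools import product

flip_half = {True: 0.5, False: 0.5}  # dist [ 0.5 : true, 0.5 : false ]

# let semantics: z is resolved once and the same value is used twice.
p_shared = sum(p for z, p in flip_half.items() if z and z)

# Hypothetical alternative: the distribution is re-evaluated at each occurrence.
p_reevaluated = sum(
    flip_half[a] * flip_half[b]
    for a, b in product(flip_half, repeat=2)
    if a and b
)
```

Here `p_shared` is 0.5 and `p_reevaluated` is 0.25, matching the two numbers discussed above.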
Example 14.2
This example illustrates the use of a higher-order function. It begins by defining
two functions, one corresponding to the toss of a fair coin and one describing a toss
of a biased coin. It then defines a higher-order function, whose return value is one
of the first two functions. This corresponds to the act of deciding which kind of
coin to toss. The example then defines a variable named c whose value is either
the fair or biased function. It then defines two variables x and y to be different
applications of the function contained in c. The variables x and y are conditionally
independent of each other given the value of c. Note, by the way, that in this example
the functions take zero arguments.
let fair = lambda () -> dist [ 0.5 : heads, 0.5 : tails ] in
let biased = lambda () -> dist [ 0.9 : heads, 0.1 : tails ] in
let pick = lambda () -> dist [ 0.5 : fair, 0.5 : biased ] in
let c = pick () in
let x = c () in
let y = c () in
<x:x, y:y>
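The joint distribution over x and y defined by this program can be enumerated directly. The Python sketch below is an illustration under the program's stated probabilities; it mimics the generative process by hand rather than running IBAL.

```python
fair = {"heads": 0.5, "tails": 0.5}
biased = {"heads": 0.9, "tails": 0.1}
pick = [(0.5, fair), (0.5, biased)]  # c is one of the two coin functions

# Sum over the choice of coin; given the coin, x and y are independent tosses.
joint = {}
for p_c, coin in pick:
    for x, p_x in coin.items():
        for y, p_y in coin.items():
            joint[(x, y)] = joint.get((x, y), 0.0) + p_c * p_x * p_y

p_x_heads = joint[("heads", "heads")] + joint[("heads", "tails")]
p_both_heads = joint[("heads", "heads")]
```

The probability that both tosses come up heads is 0.53, while the product of the marginals is 0.7 × 0.7 = 0.49. So x and y are independent only once c is known, not marginally.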
14.2.2
Observations
The previous section presented the basic constructs for describing generative probabilistic models. Using the constructs above, one can describe any stochastic experiment that generatively produces values. The language presented so far can express
many common models, such as BNs, probabilistic relational models, HMMs, dynamic Bayesian networks, and SCFGs. All these models are generative in nature.
The richness of the model is encoded in the way the values are generated.
IBAL also provides the ability to describe conditional models, in which the
generative probability distribution is conditioned on certain observations being
satisfied. IBAL achieves this by allowing observations to be encoded explicitly
Syntactic Sugar
In addition to the basic constructs described above, IBAL provides a good deal of
syntactic sugar. The sugar does not increase the expressive power of the language,
but makes it considerably easier to work with. The syntactic sugar is presented
here, because it will be used in many of the later examples.
The let syntax is extended to make it easy to define functions. The syntax
let f(x1, ..., xn) = e is equivalent to let f = fix f(x1, ..., xn) = e.
Thus far, every IBAL construct has been an expression. Indeed, everything in
IBAL can be written as an expression, and presenting everything as expressions
simplifies the presentation. A real IBAL program, however, also contains definitions.
A block is a piece of IBAL code consisting of a sequence of variable definitions.
Example 14.4
For example, we can rewrite our coins example using definitions.
fair() = dist [ 0.5 : heads, 0.5 : tails ]
biased() = dist [ 0.9 : heads, 0.1 : tails ]
pick() = dist [ 0.5 : fair, 0.5 : biased ]
c = pick()
x = c()
y = c()
The value of this block is a tuple containing a component for every variable defined
in the block, i.e., fair, biased, pick, c, x, and y.
Bernoulli and uniform random variables are so common that a special notation
is created for them. The expression flip α is shorthand for dist [α : true, 1 − α :
false]. The expression uniform n is short for dist [1/n : 0, ..., 1/n : n − 1].
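As a quick illustration of what these two shorthands expand to, the Python sketch below represents each distribution as a dictionary from values to probabilities. The function names mirror the IBAL keywords but are otherwise hypothetical, and exact fractions are used only to keep the arithmetic transparent.

```python
from fractions import Fraction


def flip(alpha):
    """Bernoulli: true with probability alpha, false with probability 1 - alpha."""
    return {True: alpha, False: 1 - alpha}


def uniform(n):
    """Uniform over the integers 0, ..., n - 1, each with probability 1/n."""
    return {i: Fraction(1, n) for i in range(n)}


quarter = flip(Fraction(1, 4))
four_sided = uniform(4)
```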
IBAL provides basic operators for working with values. These include logical
operators for working with Boolean values and arithmetic operators for integer
values. IBAL also provides an equality operator that tests any two values for
equality. Operator notation is equivalent to function application, where the relevant
functions are built in.
Dot notation can be used to reference nested components of variables. For
example, x.a.b means the component named b of the component named a of
the variable named x. This notation can appear anywhere a variable appears. For
example, in an observation one can say obs x.a = true in y. This is equivalent
to saying
let z = x.a in obs z = true in y.
Patterns can be used to match sets of values. A pattern may be
an atomic value (Boolean, integer, or string), which matches itself;
the special pattern *, which matches any value;
a variable, which matches any value, binding the variable to the matched value
in the process;
a tuple of patterns, which matches any tuple value such that each component
pattern matches the corresponding component value.
For example, the pattern <2, *, y> matches the value <2, true, h>, binding
y to h in the process. A pattern can appear in an observation. For example,
obs x = <2,*,y> in true conditions the experiment on the value of x matching
the pattern.
Patterns also appear in case expressions, which allow the computation to branch
depending on the value of a variable. The general syntax of case expressions is
case e0 of
#p1 : e1
...
#pn : en
where the p_i are patterns and the e_i are expressions. The meaning, in terms of a
stochastic experiment, is to begin by evaluating e0. Then its value is matched to
each of the patterns in turn. If the value matches p_1, the result of the experiment
is the result of e_1. If the value does not match p_1 through p_{i-1} and it does match
p_i, then e_i is the result. It is an error for the value not to match any pattern. A
case expression can be rewritten as a series of nested if expressions.
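The first-match rule and the four kinds of pattern can be sketched as a small matcher. This Python code is an illustration of the semantics described above, not IBAL's implementation; `WILDCARD`, `Var`, `match`, and `case` are hypothetical names, and Python tuples stand in for IBAL tuples.

```python
WILDCARD = object()  # stands for the special pattern *


class Var:
    """A pattern variable; matching binds its name to the matched value."""

    def __init__(self, name):
        self.name = name


def match(pattern, value, bindings):
    if pattern is WILDCARD:
        return True
    if isinstance(pattern, Var):
        bindings[pattern.name] = value
        return True
    if isinstance(pattern, tuple):
        return (isinstance(value, tuple)
                and len(pattern) == len(value)
                and all(match(p, v, bindings) for p, v in zip(pattern, value)))
    # Atomic value: must be the same type and equal (so 1 does not match True).
    return type(pattern) is type(value) and pattern == value


def case(value, clauses):
    """Try each (pattern, body) in order; the first matching clause wins."""
    for pattern, body in clauses:
        bindings = {}
        if match(pattern, value, bindings):
            return body(bindings)
    raise ValueError("no pattern matched")  # an error, as in the text


result = case((2, True, "h"), [
    ((1, WILDCARD, Var("y")), lambda b: "first clause"),
    ((2, WILDCARD, Var("y")), lambda b: b["y"]),
])
```

Here the first clause fails on its first component, so the second clause is used and `y` is bound to "h".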
The case expression is useful for describing conditional probability tables as are
used in BNs. In this case, the expression e0 is a tuple consisting of the parents of the
node, each of the patterns p_i matches a specific set of values of the parents, and the
corresponding expression e_i is the conditional distribution over the node given the
values of the parents. It is also possible to define a pattern that matches whenever
a subset of the variables takes on specified values, regardless of the values of other
variables. Such a pattern can be used to define conditional probability tables with
context-specific independence, where only some of the parents are relevant in certain
circumstances, depending on the values of other parents.
In addition to tuples, IBAL provides algebraic data types (ADTs) for creating
structured data. An ADT is a data type with several variants. Each variant has a
tag and a set of fields. ADTs are very useful in defining recursive data types such
as lists and trees. For example, the list type has two variants. The first is Nil and
has no fields. The second is Cons and has a field representing the head of the list
and a further field representing the remainder of the list.
Example 14.5
Using the list type, we can easily define a stochastic context free grammar. First
we define the append function that appends two lists. Then, for each nonterminal
in the grammar we define a function corresponding to the act of generating a string
with that nonterminal. For example,
append(x,y) =
case x of
# Nil -> y
# Cons(a,z) -> Cons(a, append(z,y))
term(x) = Cons(x,Nil)
s() = dist [0.6:term(a);
0.4:append(s(),t())]
t() = dist [0.9:term(b);
0.1:append(t(),s())]
We can then examine the beginning of a string generated by the grammar using
the take function:
take(n,x) =
case(n,x) of
# (0,_) -> Nil
# (_,Nil) -> Nil
# (_,Cons(y,z)) -> Cons(y,take(n-1,z))
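For readers who want to experiment outside IBAL, the grammar above can be mimicked by a forward sampler. The following is only a sketch in Python, not IBAL's exact inference: generators stand in for lazily constructed lists, and the names s, t, and take mirror the IBAL definitions while the rng argument and the seed are assumptions made for the illustration.

```python
import random
from itertools import islice

def s(rng):
    """Lazily yield the symbols of one string generated from nonterminal s."""
    if rng.random() < 0.6:
        yield "a"                # corresponds to term(a)
    else:
        yield from s(rng)        # corresponds to append(s(), t())
        yield from t(rng)

def t(rng):
    """Lazily yield the symbols of one string generated from nonterminal t."""
    if rng.random() < 0.9:
        yield "b"                # corresponds to term(b)
    else:
        yield from t(rng)        # corresponds to append(t(), s())
        yield from s(rng)

def take(n, symbols):
    """Return at most the first n symbols; the rest of the string is never generated."""
    return list(islice(symbols, n))

rng = random.Random(0)           # fixed seed, chosen arbitrarily
prefix = take(3, s(rng))
```

Because the generators are consumed only as far as take requires, the sampler never expands the part of the derivation that lies beyond the requested prefix.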
IBAL is a strongly typed language. The language includes type declarations that
declare new types, and data declarations that define algebraic data types. The type
system is based on that of ML. The type language will not be presented here, but
it will be used in the examples, where it will be explained.
In some cases, it is useful to define a condition as being erroneous. For example,
when one tries to take the head of an empty list, an error condition should result.
IBAL provides an expression error s, where s is a string, to signal an error
condition. This expression takes on the special value ERROR: s, which belongs to
every type and can only be used to indicate errors.
Finally, IBAL allows comments in programs. A comment is anything beginning
with a // through to the end of the line.
14.3
Examples
Example 14.6
Encoding a BN is easy and natural in IBAL. We include a definition for each variable
in the network. A case expression is used to encode the conditional probability table
for a variable. For example,
burglary = flip 0.01;
earthquake = flip 0.001;
alarm = case <burglary, earthquake> of
# <false, false> : flip 0.01
# <false, true> : flip 0.1
# <true, false> : flip 0.7
# <true, true> : flip 0.8
We can also easily encode conditional probability tables with structure. For
example, we may want the alarm variable to have a noisy-or structure:
alarm = flip 0.01 // leak probability
      | earthquake & flip 0.1
      | burglary & flip 0.7
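To see what conditional probability table the noisy-or definition induces, the three independent causes can be combined by hand: the alarm stays off only if the leak and every active cause all fail to trigger it. The sketch below does this in Python; the constant and function names are assumptions introduced for the illustration.

```python
from itertools import product

LEAK = 0.01                  # flip 0.01
P_EARTHQUAKE_CAUSE = 0.1     # earthquake & flip 0.1
P_BURGLARY_CAUSE = 0.7       # burglary & flip 0.7

def p_alarm(burglary, earthquake):
    """P(alarm = true | burglary, earthquake) under the noisy-or definition."""
    p_no_alarm = 1.0 - LEAK
    if earthquake:
        p_no_alarm *= 1.0 - P_EARTHQUAKE_CAUSE
    if burglary:
        p_no_alarm *= 1.0 - P_BURGLARY_CAUSE
    return 1.0 - p_no_alarm

# The full table, indexed by (burglary, earthquake).
cpt = {(b, e): p_alarm(b, e) for b, e in product([False, True], repeat=2)}
```

Note that the resulting table differs from the explicit table of example 14.6; the noisy-or form needs only three parameters instead of four.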
We may also create variables with context-specific independence. Context-specific
independence is the case where a variable depends on a parent for some values of the
other parents but not others. For example, if we introduce variables representing
whether or not John is at home and John calls, John calling is dependent on
the alarm only in the case that John is at home. IBAL's pattern syntax is very
convenient for capturing context-specific independence. The symbol * is used as
the pattern that matches all values, when we don't care about the value of a specific
variable:
john_home = flip 0.5
john_calls = case <john_home, alarm> of
# <false,*> : false
# <true,false> : flip 0.001
# <true,true> : flip 0.7
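The first-match reading of the case expression, together with the * wildcard, can be imitated in a few lines of Python. This is a sketch only: the WILDCARD marker, the table layout, and the function names are assumptions, and each branch is summarized by its probability that john_calls is true.

```python
WILDCARD = "*"

def matches(pattern, value):
    """A pattern matches when every component is the wildcard or equals the value."""
    return all(p == WILDCARD or p == v for p, v in zip(pattern, value))

# (pattern over <john_home, alarm>, probability that john_calls is true)
JOHN_CALLS_CASES = [
    ((False, WILDCARD), 0.0),    # <false,*> : false
    ((True, False), 0.001),      # <true,false> : flip 0.001
    ((True, True), 0.7),         # <true,true> : flip 0.7
]

def p_john_calls(john_home, alarm):
    """Patterns are tried in order; the first one that matches decides the result."""
    for pattern, p_true in JOHN_CALLS_CASES:
        if matches(pattern, (john_home, alarm)):
            return p_true
    raise ValueError("no pattern matched")   # mirrors the error for an unmatched value
```

When john_home is false the value of alarm is never consulted, which is exactly the context-specific independence the pattern expresses.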
Example 14.7
Markov chains can easily be encoded in IBAL. Here we present an example where
the states are integers. The sequence of states produced by the model is represented
as a List. The first line of the program defines the List data type:
data List [a] = Nil | Cons (a, List [a])
This declaration states that List is a parameterized type, taking on the type
parameter a. That is, for any type a, List [a] is also a type. It then goes on to
state that a List [a] can be one of two things: it can be Nil, or it can be the Cons
of two arguments, the first of type a and the second of type List [a].
Given a sequence of states represented as a List, it is useful to be able to examine
a particular state in the sequence. The standard function nth does this.
nth (n,l) : (Int, List [a]) -> a =
case l of
# Cons (x,xs) : if n==0 then x else nth (n-1,xs)
# Nil : error "Too short";
The first line of nth includes a typing rule. It states that nth is a function taking
two arguments, where the first is an integer and the second is a List [a], and
returning a value of type a.
Next, we define the types to build up a Markov model. A Markov model consists of
two functions, an initialization function and a transition function. The initialization
function takes zero arguments and produces a state. The transition function takes
a state argument and produces a state. Markov models are parameterized by the
type of the state, which is here called a.
type Init [a] = () -> a;
type Trans [a] = (a) -> a;
type Markov [a] = < init : Init [a], trans : Trans [a] >;
Given a Markov model, we can realize it to produce a sequence of states.
realize (m) : (Markov [a]) -> List [a] =
let f(x) = Cons (x, f(m.trans (x))) in
f(m.init ());
Thus far, the definitions have been abstract, applying to every Markov model.
Now we define a particular Markov model by supplying definitions for the
initialization and transition functions. Note that the state here is an integer, so the
state space is infinite. The state can be any type whatsoever, including algebraic data
types like lists or trees.
random_walk : Markov [Int] =
< init : lambda () -> 0,
trans : lambda (n) -> dist [ 0.5 : n++, 0.5 : n-- ] >;
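The realize function builds an infinite list, which is only usable because it is consumed lazily. A rough Python analogue, given here as a sampling sketch rather than as IBAL's inference, uses a generator for the sequence; the names realize and random_walk follow the IBAL program, and the seed and the use of islice are assumptions.

```python
import random
from itertools import islice

def realize(init, trans):
    """Yield the (conceptually infinite) sequence of states of a Markov model."""
    state = init()
    while True:
        yield state
        state = trans(state)

rng = random.Random(1)           # fixed seed, chosen arbitrarily

# init always returns 0; trans moves up or down by one with equal probability.
random_walk = realize(lambda: 0, lambda n: n + rng.choice([1, -1]))

# Only the states that are actually requested are ever generated.
first_states = list(islice(random_walk, 5))
```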
count(p, s) =
case s of
# Nil : 0
# Cons(x,xs) :
if p x
then 1 + count(p, xs)
else count(p, xs)
In addition to count, we can easily define universal and existential quantifiers
and other aggregates.
Example 14.9
IBAL is an ideal language in which to rapidly prototype new probabilistic models. Here we illustrate using a recently developed kind of model, the repetition
model [15]. A repetition model is used to describe a sequence of elements in which
repetition of elements from earlier in the sequence is a common occurrence. It is
attached to an existing sequence model such as an n-gram or an HMM. Here we
describe the repetition HMM.
In a repetition HMM, there is a hidden state that evolves according to a
Markov process, just as in an ordinary HMM. An observation is generated at each
time point. With some probability ρ, the observation is generated from memory,
meaning that a previous observation is reused for the current observation. With
the remaining probability 1 − ρ, the observation is generated from the hidden state
according to the observation model of the HMM. This model captures the fact that
there is an underlying generative process as described by the HMM, but this process
is sometimes superseded by repeating elements that have previously appeared.
Repetition is a key element of music, and repetition models have successfully been
applied to modeling musical rhythm.
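The generative process just described can be sketched as a forward sampler. The Python below is an illustration under stated assumptions, not the chapter's IBAL program: the function name, the rho parameter name, the rule that the very first observation always comes from the hidden state, and the toy three-state model are all assumptions.

```python
import random

def sample_repetition_hmm(init, trans, obs, rho, length, rng):
    """Sample `length` observations from a repetition HMM."""
    state = init(rng)
    observations = []
    for _ in range(length):
        if observations and rng.random() < rho:
            # Generated from memory: reuse a uniformly chosen earlier observation.
            observations.append(rng.choice(observations))
        else:
            # Generated from the hidden state by the ordinary observation model.
            observations.append(obs(state, rng))
        state = trans(state, rng)    # the hidden state evolves in either case
    return observations

# A deterministic toy model so the effect of rho is easy to see.
toy_init = lambda rng: 0
toy_trans = lambda s, rng: (s + 1) % 3
toy_obs = lambda s, rng: "abc"[s]

never_repeat = sample_repetition_hmm(toy_init, toy_trans, toy_obs, 0.0, 4, random.Random(0))
always_repeat = sample_repetition_hmm(toy_init, toy_trans, toy_obs, 1.0, 4, random.Random(0))
```

With rho set to 0 the model degenerates to the underlying HMM, and with rho set to 1 every observation after the first is a copy of an earlier one.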
To describe a repetition HMM in IBAL, we first need a function to select a random
element from a sequence. The function nth takes an integer argument and selects
the given element of the sequence. We then let the argument range uniformly over
the length of the sequence, which is passed as an argument to the select function.
nth(n, seq) =
  case seq of
    # Cons(x,xs) :
        if n == 0
        then x
        else nth(n-1, xs)
    # Nil : error "Too short"
select(length, seq) = nth(uniform length, seq)
Similarly to the way we defined Markov models earlier, a repetition HMM takes
init, trans, and obs functions as arguments. The parameter ρ must be supplied.
If we used all of IBAL's features it could be a learnable parameter. In our example
14.4
Semantics
In specifying the semantics of the language, it is sufficient to provide semantics
for the core expressions, since the syntactic sugar is naturally induced from them.
The semantics is distributional: the meaning of a program is specified in terms of
a probability distribution over values.
14.4.1
Distributional Semantics
We use the notation M[e] to denote the meaning of expression e under the
distributional semantics. The meaning function takes as argument a probability
distribution over environments and returns a probability distribution over
values. We write M[e] Δ v to denote the probability of v under the meaning of e
when the distribution over environments is Δ. We also use the notation M[e] ε v to
denote the probability of v under the meaning of e when the probability distribution
over environments assigns positive probability only to ε.
We now define the meaning function for different types of expressions. The
meaning of a constant expression is given by
\[
M[v']\,\Delta\,v =
\begin{cases}
1 & \text{if } v = v', \\
0 & \text{otherwise.}
\end{cases}
\]
The probability that referring to a variable produces a value is obtained simply
by summing over environments in which the variable has the given value:
\[
M[x]\,\Delta\,v = \sum_{\varepsilon \,:\, \varepsilon(x) = v} \Delta(\varepsilon).
\]
The meaning of an if expression is defined as follows. We first take the sum over
all environments of the meaning of the expression in the particular environments.
The reason we need to do this is that the meanings of the if clause and of
the then and else clauses are correlated by the environment. Therefore we need
to specify the particular environment before we can break up the meaning into
the meanings of the subexpressions. Given the environments, however, the
subexpressions become conditionally independent, so we can multiply their meanings
together.
\[
M[\texttt{if } e_1 \texttt{ then } e_2 \texttt{ else } e_3]\,\Delta\,v =
\sum_{\varepsilon} \Delta(\varepsilon)
\Bigl[ (M[e_1]\,\varepsilon\,\mathit{true})\,(M[e_2]\,\varepsilon\,v)
     + (M[e_1]\,\varepsilon\,\mathit{false})\,(M[e_3]\,\varepsilon\,v) \Bigr]
\]
The distributional semantics of a dist expression simply states that the
probability of a value under a dist expression is the weighted sum of the probability of
the value under the different branches:
\[
M[\texttt{dist}[p_1 : e_1, \ldots, p_n : e_n]]\,\Delta\,v = \sum_{i} p_i \,(M[e_i]\,\Delta\,v).
\]
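The equations for constants, variables, if, and dist are simple enough to execute directly. The following Python sketch does so for a toy expression representation; the tuple encoding of expressions, the representation of a distribution over environments as a list of (environment, probability) pairs, and the function name meaning are all assumptions made for the illustration.

```python
from collections import defaultdict

def meaning(expr, delta):
    """Return {value: probability} for expr under delta, a list of (env, prob) pairs."""
    kind = expr[0]
    result = defaultdict(float)
    if kind == "const":
        # Probability 1 for the constant, assuming delta is normalized.
        result[expr[1]] += sum(p for _, p in delta)
    elif kind == "var":
        # Sum the mass of the environments in which the variable has each value.
        for env, p in delta:
            result[env[expr[1]]] += p
    elif kind == "if":
        _, test, then_e, else_e = expr
        # Fix one environment at a time so test and branches stay correlated.
        for env, p in delta:
            point = [(env, 1.0)]
            m_test = meaning(test, point)
            for branch, truth in ((then_e, True), (else_e, False)):
                for v, q in meaning(branch, point).items():
                    result[v] += p * m_test.get(truth, 0.0) * q
    elif kind == "dist":
        # Weighted sum over the branches.
        for weight, sub in expr[1]:
            for v, q in meaning(sub, delta).items():
                result[v] += weight * q
    else:
        raise ValueError(kind)
    return dict(result)

delta = [({"x": True}, 0.3), ({"x": False}, 0.7)]
correlated = meaning(("if", ("var", "x"), ("var", "x"), ("const", True)), delta)
mixture = meaning(("dist", [(0.6, ("const", "a")), (0.4, ("const", "b"))]), delta)
```

In the first query the test and the then branch both refer to x, so the result is true with probability 1; multiplying the marginal meanings of the subexpressions without fixing the environment would give a different and incorrect answer.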
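lambda and fix expressions are treated as constants whose values are closures.
The only difference is that the closure specifies an environment, so we take the
probability that the current environment is the closure environment.
\[
M[\texttt{lambda } x_1, \ldots, x_n \to e]\,\Delta\,v =
\begin{cases}
\Delta(\varepsilon) & \text{if } v = \{\text{args} = x_1, \ldots, x_n;\ \text{body} = e;\ \text{env} = \varepsilon\}, \\
0 & \text{otherwise}
\end{cases}
\]
\[
M[\texttt{fix } f(x_1, \ldots, x_n) \to e]\,\Delta\,v =
\begin{cases}
\Delta(\varepsilon) & \text{if } v = \{\text{args} = x_1, \ldots, x_n;\ \text{body} = e;\ \text{env} = \varepsilon[f/v]\}, \\
0 & \text{otherwise}
\end{cases}
\]
\[
M[e_0(e_1, \ldots, e_n)]\,\Delta\,v =
\sum_{\varepsilon} \Delta(\varepsilon)
\sum_{v_0, v_1, \ldots, v_n}
\Bigl( \prod_{i=0}^{n} M[e_i]\,\varepsilon\,v_i \Bigr)
\bigl( M[e]\;\varepsilon'[x_1/v_1, \ldots, x_n/v_n]\;v \bigr)
\]
where v_0 is the closure with arguments x_1, ..., x_n, body e, and environment ε'.
\[
M[\langle x_1 : e_1, \ldots, x_n : e_n \rangle]\,\Delta\,v =
\begin{cases}
\sum_{\varepsilon} \Delta(\varepsilon) \prod_{i=1}^{n} M[e_i]\,\varepsilon\,v_i
  & \text{if } v = \langle x_1 : v_1, \ldots, x_n : v_n \rangle, \\
0 & \text{otherwise}
\end{cases}
\]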
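Finally, the probability of a comparison being true is derived by taking the sum,
over all possible values, of the probability that both expressions produce the value.
\[
M[e_1 \texttt{ == } e_2]\,\Delta\,v =
\begin{cases}
p & \text{if } v = \mathit{true}, \\
1 - p & \text{if } v = \mathit{false}, \\
0 & \text{otherwise}
\end{cases}
\]
\[
\text{where } p = \sum_{\varepsilon} \Delta(\varepsilon) \sum_{v'} (M[e_1]\,\varepsilon\,v')\,(M[e_2]\,\varepsilon\,v').
\]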
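The distributional semantics captures observations quite simply. The effect of an
observation is to condition the distribution Δ over environments on the observation
holding. When the probability that the observation holds is zero, the probability of
the expression is defined to be zero.
\[
M[\texttt{obs } x = v' \texttt{ in } e]\,\Delta\,v =
\begin{cases}
\dfrac{\sum_{\varepsilon \,:\, \varepsilon(x) = v'} \Delta(\varepsilon)\,(M[e]\,\varepsilon\,v)}{P(x = v')}
  & \text{if } P(x = v') > 0, \\[1ex]
0 & \text{if } P(x = v') = 0
\end{cases}
\]
\[
\text{where } P(x = v') = \sum_{\varepsilon \,:\, \varepsilon(x) = v'} \Delta(\varepsilon).
\]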
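Conditioning the distribution over environments is a small operation in its own right. As a sketch, assuming the same list-of-pairs representation of a distribution over environments used for illustration purposes (the function name condition is hypothetical):

```python
def condition(delta, var, value):
    """Condition a distribution over environments on var == value.

    delta is a list of (environment, probability) pairs. If the observation has
    probability zero, an empty distribution is returned, so every value of the
    enclosing expression receives probability zero.
    """
    mass = sum(p for env, p in delta if env[var] == value)
    if mass == 0:
        return []
    return [(env, p / mass) for env, p in delta if env[var] == value]

delta = [
    ({"x": True, "y": True}, 0.2),
    ({"x": True, "y": False}, 0.2),
    ({"x": False, "y": True}, 0.6),
]
posterior = condition(delta, "x", True)
p_y_given_x = sum(p for env, p in posterior if env["y"])
```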
14.4.2
Lazy Semantics
g(x) =
case x of
# Cons(y,z) -> y
g(f())
The function f() defines an infinite sequence of true and false elements. The
function g() then returns the first element in the sequence. When g is applied to
f, the body of g specifies that only the first component of its argument is required.
Therefore, when evaluating f, only its first component will be evaluated. That can
be done by examining a single flip.
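The claim that a single flip suffices can be checked with a small lazy program. In the Python sketch below a generator plays the role of the lazily evaluated list; the flip counter and the flip probability of 0.5 are assumptions, since they serve only to make the amount of work visible.

```python
import random

flips = 0                        # counts how many flips are actually examined

def flip(p, rng):
    global flips
    flips += 1
    return rng.random() < p

def f(rng):
    """An infinite sequence of true and false elements, produced on demand."""
    while True:
        yield flip(0.5, rng)

def g(seq):
    """Return the first element of the sequence; nothing else is demanded."""
    return next(seq)

first = g(f(random.Random(0)))
```

An eager evaluator would try to build the whole of f() before calling g and would never terminate.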
The distributional semantics presented earlier is agnostic about whether it is
eager or lazy. It simply presents a set of equations, and says nothing about how
the equations are evaluated. Both eager and lazy interpretations are possible. The
meaning of an expression under either interpretation is only well-defined when the
process of evaluating it converges. The eager and lazy semantics do not necessarily
agree. The eager semantics may diverge in some cases where the lazy semantics
produces a result. However, if the eager semantics converges, the lazy semantics
will produce the same result.
14.5
14.6
Related Approaches
Previous approaches to inference in high-level probabilistic languages have generally
fallen into four categories. On one side are approaches that use approximate
inference, particularly Markov chain Monte Carlo methods. This is the approach
used in BUGS [23] and the approach taken by Pasula and Russell in their first-order
probabilistic logic [14]. While exact inference may be intractable for many models,
and approximate strategies are therefore needed, the goal of this chapter is to push
exact inference as far as possible.
The first generation of high-level probabilistic languages generally used the
knowledge-based model construction (KBMC) approach (e.g. [19, 13, 10, 5]). In
this approach, a knowledge base describes the general probabilistic mechanisms.
These are combined with ground facts to produce a BN for a specific situation. A
standard BN inference algorithm is then used to answer queries.
This approach generally satisfies only the first of the above desiderata. Since a
BN is constructed, any independence will be represented in that network, and can
be exploited by the BN algorithm. The second desideratum can also be satisfied, if
a BN algorithm that exploits low-level structure is used, and the BN construction
process is able to produce that structure. Since the construction process creates one
large BN, any structure resulting from weakly interacting components is lost, so the
third desideratum is not satisfied. Similarly, when there is repetition in the domain
the large BN contains many replicated components, and the fourth desideratum is
not satisfied. Satisfaction of the remaining desiderata depends on the details of the
BN construction process. The most common approach is to grow the network using
backward chaining, starting at the query and the evidence. If any of these lead to
an infinite regress, the process will fail.
Sato and Kameya [22] present a more advanced version of this approach that
achieves some of the aims of this paper. They use a tabling procedure to avoid
performing redundant computations. In addition, their approach is query-directed.
However they do not exploit low-level independence or weak interaction between
objects, nor do they utilize observations or support.
More recent approaches take one of two tacks. The first is to design a probabilistic
representation language as a programming language, whether a functional
language [9, 18] or logic programming [12]. The inference algorithms presented for
these languages are similar to evaluation algorithms for ordinary programming
languages, using recursive descent on the structure of programs. The programming
language approach has a number of appealing properties. First, the evaluation strategy
is natural and familiar. Second, a programming language provides the fine-grained
representational control with which to describe low-level structure. Third, simple
solutions are suggested for many of the desiderata. For example, high-level structure
can be represented in the structure of a program, with different functions
representing different components. As for exploiting repetition, this can be achieved by
the standard technique of memoization. When a function is applied to a given set
of arguments, the result is cached, and retrieved whenever the same function is
applied to the same arguments. Meanwhile, lazy evaluation can be used to exploit
the query to make a computation simpler.
However, approaches based on programming languages have a major drawback.
They do not do a good job of exploiting independence. Koller et al. [9] made an effort
to exploit independence by maintaining a list of variables shared by different parts of
the computation. The resulting algorithm is much more difficult to understand, and
the solution is only partial. Given a BN encoded in their language, the algorithm can
be viewed as performing variable elimination (VE) using a particular elimination
order: namely, from the last variable in the program upward. It is well-known that
the cost of VE is highly dependent on the elimination order, so the algorithm is
exponentially more expensive for some families of models than an algorithm that
can use any order.
In addition, while these approaches suggest solutions to many of the desiderata,
actually integrating them into a single implementation is difficult. For example,
Koller et al. [9] suggested using both memoization and lazy evaluation, believing
that since both were standard techniques their combination would be simple. In
fact it turns out that implementing both simultaneously is considered extremely
difficult! The final three desiderata are all variations on the idea that knowledge
can be used to simplify computation. The general approach was captured by the
term evidence-finite computation in [9]. However, this catchall term fails to capture
the distinctions between the different ways knowledge can be exploited. A careful
implementation of the algorithm in [9] showed that it achieved termination only in
a relatively small number of possible cases. In particular it failed to exploit support
and observations.
The final approach to high-level probabilistic inference is to use a structured
inference algorithm. In this approach, used in object-oriented Bayesian networks
and relational probabilistic models [17, 16], a BN fragment is provided for each
model component, and the components are related to each other in various ways.
Rather than constructing a single BN to represent an entire domain, inference
works directly on the structured model, using a standard BN algorithm to work
within each component. The approach was designed explicitly to exploit high-level
structure and repetition. In addition, because a standard BN algorithm is used,
this approach exploits independence. However, it does not address the final three
desiderata. An anytime approximation algorithm [6] was provided for dealing with
infinitely recursive models, but it is not an exact inference algorithm.
In addition, this approach does not do as well as one might hope at exploiting
low-level structure. One might rely on the underlying BN inference algorithm to
exploit whatever structure it can. For example, if it is desired to exploit noisy-or
structure, the representation should explicitly encode such structure, and the BN
algorithm should take advantage of it. The problem with this approach is that
it requires a special-purpose solution for each possible structure, and high-level
languages make it easy to specify new structures. A case in point is the structure
arising from quantification over a set of objects. In the SPOOK system [17], an
object A can be related to a set of objects B, and the properties of A can depend
on an aggregate property of B. If implemented naively, A will depend on each of
the objects in B, so its conditional probability table will be exponential in the size
of B. As shown in [17], the relationship between A and B can be decomposed in
such a way that the representation and inference are linear in the size of B. Special
purpose code had to be written in SPOOK to capture this structure, but it is easy
to specify in IBAL, as described in example 14.8, so it would be highly beneficial if
IBAL's inference algorithm can exploit it automatically.
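The idea behind such a decomposition can be illustrated with a count aggregate. Instead of building one table over all members of B, the distribution of a running count is updated one member at a time, so no table over joint assignments to B is ever formed. The Python sketch below assumes, purely for illustration, that the members of B satisfy the property independently with known probabilities; the function name is hypothetical and this is not the SPOOK or IBAL implementation.

```python
def count_distribution(probs):
    """Distribution of the number of true elements among independent Booleans.

    probs[i] is the probability that the i-th member of B has the property.
    Returns a list whose k-th entry is P(count = k).
    """
    dist = [1.0]                         # before any member is seen, count = 0
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, q in enumerate(dist):
            nxt[k] += q * (1.0 - p)      # this member does not have the property
            nxt[k + 1] += q * p          # this member has the property
        dist = nxt
    return dist

two_coins = count_distribution([0.5, 0.5])
```

Each step touches at most |B| + 1 running counts, so the work is polynomial rather than exponential in the size of B; for a Boolean aggregate such as an existential quantifier the running state has only two values and the pass is linear.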
14.7
Inference
14.7.1
Inference Overview
If we examine the desiderata of section 14.5, we see that they fall into two
categories. Exploiting repetition, queries, support, and evidence all require avoiding
unnecessary computation, while exploiting structure and independence require
performing the necessary computation as efficiently as possible. One of the main
insights gained during the development of IBAL's inference algorithm is that
simultaneously trying to satisfy all the desiderata can lead to quite complex code.
The inference process can be greatly simplified by recognizing the two different kinds
of desiderata, and dividing the inference process into two phases. The first phase
is responsible for determining exactly what computations need to be performed,
while the second phase is responsible for performing them efficiently.
This division of labor is reminiscent of the symbolic probabilistic inference (SPI)
algorithm for BN inference [11], in which the first phase finds a factoring of
the probability expression, and the second phase solves the expression using the
factoring. However, there is a marked difference between the two approaches. In
SPI, the goal of the first phase is to find the order in which terms should be
multiplied. In IBAL, the first phase determines which computations need to be
performed, but not their order. That is left for the variable elimination algorithm
in the second phase. Indeed, SPI could be used in the second phase of IBAL as the
algorithm that computes probabilities.
The first phase of IBAL operates directly on programs, and produces a data
structure called the computation graph. This rooted directed acyclic graph contains
a node for every distinct computation to be performed. A computation consists of
an expression to be evaluated, and the supports of free variables in the expression.
The computation graph contains an edge from one node to another if the second
node represents a computation for a subexpression that is required for the rst
node.
The second phase of the algorithm traverses the computation graph, solving every
node. A solution for a node is a conditional probability distribution over the value of
the expression given the values of the free variables, assuming that the free variables
have values in the given supports. The solution is computed bottom-up. To solve a
node, the solutions of its children are combined to form the solution for the node.
On the surface, the design seems similar to that of the KBMC approaches.
They both create a data structure, and then proceed to solve it. The IBAL
approach shares with KBMC the idea of piggybacking on top of existing BN
technology. However, the two approaches are fundamentally different. In KBMC,
the constructed BN contains a node for every random variable occurring in the
solution. By contrast, IBAL's computation graph contains a node for every distinct
14.7.2
First Phase
It is the task of the first phase to construct the computation graph, containing a
node for every computation that has to be performed. At the end of the phase,
each node will contain an expression to be evaluated, annotated with the supports
of the free variables, and the support of the expression itself. The first phase begins
by propagating observations to all subexpressions that they affect. The result of
this operation is an annotated expression, where each expression is annotated with
the effective observation about its result. When the computation graph is later
constructed, the annotations will be used to restrict the supports of variables,
and possibly to restrict the set of computations that are required. Thus the
seventh desideratum of exploiting evidence is achieved. IBAL's observation
propagation process is sound but not complete. For an SCFG, it is able to infer,
when the output string is finite, that only a finite computation is needed to produce
it. The details are omitted here.
14.7.2.1
Lazy Memoization
stipulates that the supports only need to be the same on the required parts of the
arguments.
Unfortunately, the standard technique of memoization does not interact well with
lazy evaluation. The problem is that in memoization, when we want to create a new
node in the computation graph, we have to check if there is an existing node for the
same expression that has the same supports for the required parts of the arguments.
But we don't know yet what the required parts of the arguments are, or what their
supports are. Worse yet, with lazy evaluation, we may not yet know these things
for expressions that already have nodes. This issue is the crux of the difficulty with
combining lazy evaluation and memoization. In fact, no functional programming
language appears to implement both, despite the obvious appeal of these features.
A new evaluation strategy was developed for IBAL to achieve both laziness and
memoization together. The key idea is that when the graph is constructed for a
function application, the algorithm speculatively assumes that an argument is not
required. If it turns out that part of it is required, enough of the computation graph
is created for the required part, and the graph for the application is reconstructed,
again speculatively assuming that enough of the argument has been constructed.
This process continues until the speculation turns out to be correct. At each point,
we can check to see if there is a previously created node for the same expression
that uses as much as we think is required of the argument. At no point will we
create a node or examine part of the argument that is not required.
An important detail is that whenever it is discovered that an argument to the
function is required, this fact is stored in the cache. This way, the speculative evaluation is avoided if it has already been performed for the same partial arguments. In
general, the cache consists of a mapping from partial argument supports to either
a node in the computation graph or to a note specifying that another argument is
required.
For example, suppose we have a function
f(x,y,z) = if x then y else z
where the support of x is {true}, the support of y is {5,6}, and z is defined by
a divergent function. We first try to evaluate f with no arguments evaluated. We
immediately discover that x is needed, and store this fact in the cache. We obtain
the support of x, and attempt to evaluate f again. Now, since x must be true, we
discover that y is needed, and store this in the cache. We now attempt again to
evaluate f with the supports of x and y, and since z is not needed, we return with
a computation node, storing the fact that when x and y have the given supports,
the result is the given node. The contents of the cache after the evaluation has
completed are
Partial argument supports    Cache entry
f(x,y,z)                     Need x
f({true},y,z)                Need y
f({true},{5,6},z)            {5,6}
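The speculative loop in this example can be imitated in a few lines of Python. The sketch below is a simplification under stated assumptions: a computation node is represented only by its support, arguments are supplied as thunks that are called only when requested, and the names f_body, lazy_apply, and divergent are hypothetical. It is meant to show the shape of the cache, not IBAL's implementation.

```python
def f_body(known):
    """Support-level evaluation of f(x,y,z) = if x then y else z.

    `known` maps argument names to the supports obtained so far. The result is
    either ("need", name), asking for one more argument, or ("node", support).
    """
    if "x" not in known:
        return ("need", "x")
    result = set()
    for branch, truth in (("y", True), ("z", False)):
        if truth in known["x"]:
            if branch not in known:
                return ("need", branch)
            result |= known[branch]
    return ("node", frozenset(result))

def lazy_apply(body, arg_thunks, cache):
    """Speculate that no further argument is needed; retry when one turns out to be."""
    known = {}
    while True:
        key = frozenset(known.items())       # the partial argument supports
        if key not in cache:
            cache[key] = body(known)         # remember the node or the "need" note
        outcome = cache[key]
        if outcome[0] == "node":
            return outcome[1]
        name = outcome[1]
        known[name] = frozenset(arg_thunks[name]())   # evaluate only what is required

def divergent():
    raise RuntimeError("z should never be evaluated")

cache = {}
support = lazy_apply(
    f_body,
    {"x": lambda: {True}, "y": lambda: {5, 6}, "z": divergent},
    cache,
)
```

After the call the cache holds three entries, matching the table above, and the divergent argument z has never been touched. A second application with the same partial supports would be answered from the cache without re-running the body.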
14.7.2.2
Support Computation
Aside from issues of laziness and memoization, the support computation is fairly
straightforward, with the support of an expression being computed from the support
of its subexpressions and its free variables. For example, to compute the support of
dist [e1, ..., en], simply take the union of the supports of each of the ei.
Some care is taken to use the supports of some subexpressions to simplify the
computation of other subexpressions, so as to achieve the sixth desideratum of
exploiting supports. The most basic manifestation of this idea is the application
expression e1 e2, where we have functional uncertainty, i.e., uncertainty over the
identity of the function to apply. For such an expression, IBAL first computes the
support of e1 to see which functions can be applied. Then, for each value f in the
support of e1, IBAL computes the support of applying f to e2. Finally, the union of
all these supports is returned as the support of e1 e2. For another example, consider
an expression e of the form if e1 then e2 else e3. A naive implementation would
set the support of e to be the union of the supports of e2 and e3. IBAL is smarter,
and performs a form of short-circuiting: if true is not in the support of e1, the
support of e2 is not included in the support of e, and similarly for false and e3.
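The two rules just stated, union for dist and short-circuiting for if, can be written down directly. The Python sketch below uses a toy tuple encoding of expressions and a dictionary of supports for free variables; both are assumptions made for the illustration, and function application is left out.

```python
def support(expr, env_support):
    """Compute the set of possible values of expr given supports of its free variables."""
    kind = expr[0]
    if kind == "const":
        return {expr[1]}
    if kind == "var":
        return set(env_support[expr[1]])
    if kind == "dist":
        # Union of the supports of all branches; the weights do not matter here.
        return set().union(*(support(e, env_support) for _, e in expr[1]))
    if kind == "if":
        _, test, then_e, else_e = expr
        test_support = support(test, env_support)
        result = set()
        if True in test_support:         # then branch only if the test can be true
            result |= support(then_e, env_support)
        if False in test_support:        # else branch only if the test can be false
            result |= support(else_e, env_support)
        return result
    raise ValueError(kind)

branches = ("if", ("var", "x"), ("const", 5), ("const", 6))
short_circuited = support(branches, {"x": {True}})
both = support(branches, {"x": {True, False}})
mixed = support(("dist", [(0.5, ("const", 1)), (0.5, ("const", 2))]), {})
```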
14.7.3
Second Phase
In the second phase, the computation graph is solved from the bottom up. The
solution for each node is generally not represented directly. Rather, it is represented
as a set of factors. A factor mentions a set of variables, and defines a function from
the values of those variables to real numbers. The variables mentioned by the factors
in a solution include a special variable ∗ (pronounced "star") corresponding to the
value of the expression, the free variables X of the expression, and other variables Y.
The solution specified by a set of factors f_1, ..., f_n is
\[
P(\ast \mid x) = \frac{1}{Z} \sum_{y} \prod_{i} f_i(\ast, x, y),
\]
where Z is a normalizing factor.² The set of factors at any node is a compact,
implicit representation of the solution at that node. It is up to the solution algorithm
to decide which Y variables to keep around, and which to eliminate.
2. The f_i do not need to mention the same variables. The notation f_i(∗, x, y) denotes the
value of f_i when ∗, x, and y are projected onto the variables mentioned by f_i.
At various points in the computation, the algorithm eliminates some of the
intermediate variables Y, using VE [3] to produce a new set of factors over the
remaining variables. The root of the computation graph corresponds to the user's
query. At the root there are no free variables. To compute the final answer,
all variables other than ∗ are eliminated using VE, all remaining factors are
multiplied together, and the result is normalized. By using VE for the actual
process of computing probabilities, the algorithm achieves the first desideratum
of exploiting independence. The main point is that unlike other programming
language-based approaches, IBAL does not try to compute probabilities directly by
working with a program, but rather converts a program into the more manipulable
form of factors, and rests on tried and true technology for working with them.
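To make the meaning of a set of factors concrete, the sketch below evaluates the solution by brute force: it multiplies all factors at every joint assignment, sums out everything except the query variable, and normalizes. VE computes the same quantity more cheaply by pushing the sums inside the products; the brute-force version is used here only because it is short. The factor representation (a scope plus a table), the function name solve, and the use of the burglary network of example 14.6 as test data are assumptions made for the illustration.

```python
from itertools import product

def solve(factors, domains, query):
    """Return P(query) = (1/Z) * sum over the other variables of the product of factors."""
    names = list(domains)
    totals = {v: 0.0 for v in domains[query]}
    for assignment in product(*(domains[n] for n in names)):
        point = dict(zip(names, assignment))
        weight = 1.0
        for scope, table in factors:
            # Project the joint assignment onto the variables this factor mentions.
            weight *= table[tuple(point[n] for n in scope)]
        totals[point[query]] += weight
    z = sum(totals.values())             # the normalizing factor Z
    return {v: w / z for v, w in totals.items()}

BOOL = [False, True]
P_ALARM = {(False, False): 0.01, (False, True): 0.1, (True, False): 0.7, (True, True): 0.8}
alarm_table = {}
for (b, e), p in P_ALARM.items():
    alarm_table[(b, e, True)] = p
    alarm_table[(b, e, False)] = 1.0 - p

factors = [
    (("burglary",), {(True,): 0.01, (False,): 0.99}),
    (("earthquake",), {(True,): 0.001, (False,): 0.999}),
    (("burglary", "earthquake", "alarm"), alarm_table),
]
domains = {"burglary": BOOL, "earthquake": BOOL, "alarm": BOOL}
answer = solve(factors, domains, "alarm")
```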
In addition, this inference framework provides an easy method to satisfy the
third desideratum of exploiting the high-level structure of programs. As
discussed in section 14.5, high-level structure is represented in IBAL using functions.
In particular, the internals of a function are encapsulated inside the function, and
are conditionally independent of the external world given the function inputs and
outputs. From the point of view of VE, this means that we can safely eliminate all
variables internal to the function consecutively. This idea is implemented by using
VE to eliminate all variables internal to a function at the time the solution to the
function is computed.
14.7.3.1
Microfactors
The next step in IBAL inference is to translate a program into a set of microfactors,
and then perform VE. The goal is to produce factors that capture all the structure in
the program, including both the independence structure and the low-level structure.
The translation is expressed through a set of rules, each of which takes an
expression of a certain form and returns a set of microfactors. The notation T[e] is
used to denote the translation rule for expression e. Thus, for a constant expression
v the rule is³

T[v] =    ∗  | value
          v  | 1

3. For convenience, we omit the set brackets for singletons.
The Boolean constants and lambda and fix expressions are treated similarly.
For a variable expression, T[x], we need to make sure that the result has the same
value as x. If x is a simple variable, whose values are symbols, the rule is as follows.
Assuming v1, ..., vn are the values in the support of x, this is achieved with the
rule

T[x] =    ∗  | x  | value
          v1 | v1 | 1
          ...
          vn | vn | 1

Here, we exploit the fact that an assignment of values to variables not covered by
any row has value 0.
If x is a complex variable with multiple fields, each of which is itself complex, we
could use the above rule, considering all values in the cross-product space of the
fields of x. However, that is unnecessarily inefficient. Rather, for each field a of x,
we ensure separately that ∗.a is equal to x.a. If a itself is complex, we break that
equality up into fields. We end up with a factor like the one above for each simple
chain c defined on x. If we let the simple chains be c_1, ..., c_m, and the possible
values of c_i be v^i_1, ..., v^i_{n^i}, we get the rule

T[x] = ∪_{i=1..m}
          ∗.c_i     | x.c_i     | value
          v^i_1     | v^i_1     | 1
          ...
          v^i_{n^i} | v^i_{n^i} | 1

The total number of rows according to this method is \sum_{i=1}^{m} n^i, rather than
\prod_{i=1}^{m} n^i for the product method.
Next we turn to variable definitions. Recall that those are specified in IBAL
through a let expression of the form let x = e1 in e2. We need some notation: if F
is a set of factors, F^{c1}_{c2} denotes the same set as F, except that chain c1 is substituted
for c2 in all the factors in F. Now the rule for let is simple. We compute the factors
for e1, and replace ∗ with x. We then conjoin the factors for e2, with no additional
change. The full rule is⁴ T[let x = e1 in e2] = T[e1]^{x}_{∗} ∪ T[e2].
4. A fresh variable name is provided for the bound variable to avoid name clashes.
For if-then-else expressions, we proceed as follows. First we define a primitive
prim_if(x, y, z) that is the same as if but only operates on variables. Then we can
rewrite
if e1 then e2 else e3 =
  let x = e1 in
  let y = e2 in
  let z = e3 in
  prim_if (x, y, z)
Now, all we need is a translation rule for prim_if and we can invoke the above let
rule to translate all if expressions.5 Let the simple chains on y and z be c1 , . . . , cm .
(They must have the same set of simple chains for the program to be well typed.)
Using the same notation as before for the possible values of these chains, a naive
rule for prim_if is as follows:
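\[
T[\mathrm{prim\_if}(x, y, z)] \;=\; \bigcup_{i=1}^{m}
\begin{array}{|c|c|c|c|}
\hline
x & *.c_i & y.c_i & z.c_i \\
\hline
T & v^i_1 & v^i_1 & \text{any} \\
\vdots & \vdots & \vdots & \vdots \\
T & v^i_{n^i} & v^i_{n^i} & \text{any} \\
F & v^i_1 & \text{any} & v^i_1 \\
\vdots & \vdots & \vdots & \vdots \\
F & v^i_{n^i} & \text{any} & v^i_{n^i} \\
\hline
\end{array}
\]
This rule exploits the context-specific independence (CSI) present in any if expression: the outcome is independent of either the then clause or the else clause
given the value of the test. The CSI is captured in the "any" entries for the irrelevant
variables. However, we can do even better. This rule unites $y.c_i$ and $z.c_i$ in a single
factor. However, there is no row in which both are simultaneously relevant. We see
that if expressions satisfy a stronger property than CSI. To exploit this property,
the prim_if rule produces two factors for each $c_i$ whose product is equal to the
factor above.
\[
T[\mathrm{prim\_if}(x, y, z)] \;=\; \bigcup_{i=1}^{m}
\left\{\;
\begin{array}{|c|c|c|}
\hline
x & *.c_i & y.c_i \\
\hline
T & v^i_1 & v^i_1 \\
\vdots & \vdots & \vdots \\
T & v^i_{n^i} & v^i_{n^i} \\
F & \text{any} & \text{any} \\
\hline
\end{array}
\;,\quad
\begin{array}{|c|c|c|}
\hline
x & *.c_i & z.c_i \\
\hline
F & v^i_1 & v^i_1 \\
\vdots & \vdots & \vdots \\
F & v^i_{n^i} & v^i_{n^i} \\
T & \text{any} & \text{any} \\
\hline
\end{array}
\;\right\}
\]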
Note the last row in each of these factors. It is a way of indicating that the factor is
only relevant if $x$ has the appropriate value. For the first factor, if $x$ has the value
$F$, the factor has value 1 whatever the values of the other variables, and similarly
for the other factor. The number of rows in the factors for $c_i$ is two more than for
the previous method, because of the irrelevance rows. However, we have gained in
that $y.c_i$ and $z.c_i$ are no longer in the same factor. Considering all the $c_i$, the moral
graph for the second approach contains $m$ fewer edges than for the first approach.
Essentially, the variable $x$ is playing the role of a separator for all the pairs $y.c_i$ and
$z.c_i$. If we can avoid eliminating $x$ until as late as possible, we may never have to
connect many of the $y.c_i$ and $z.c_i$.
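The claim that the two smaller factors multiply to the naive one can be checked by brute force. The sketch below is an illustration rather than IBAL code: factors are written as Python functions returning 0 or 1, the chain values in `VALUES` are hypothetical, and Python's `True`/`False` stand in for the test values $T$ and $F$.

```python
from itertools import product

VALUES = ["a", "b", "c"]  # hypothetical values of one simple chain c_i

def naive(x, r, y, z):
    """Single factor over (x, *.c_i, y.c_i, z.c_i)."""
    return 1 if (r == y if x else r == z) else 0

def then_factor(x, r, y):
    """Ties *.c_i to y.c_i when x is true; irrelevant (value 1) when x is false."""
    return 1 if (not x or r == y) else 0

def else_factor(x, r, z):
    """Ties *.c_i to z.c_i when x is false; irrelevant (value 1) when x is true."""
    return 1 if (x or r == z) else 0

# Compare the naive factor with the product of the two factors on every row.
agree = all(
    naive(x, r, y, z) == then_factor(x, r, y) * else_factor(x, r, z)
    for x in (True, False)
    for r, y, z in product(VALUES, repeat=3)
)
print(agree)  # True
```

Because `then_factor` never mentions `z` and `else_factor` never mentions `y`, no factor links the two branches directly, which is the source of the saved moral-graph edges.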
None of the expression forms introduced so far contained uncertainty. Therefore,
every factor represented a zero-one function, in other words, a constraint on the
values of variables. Intermediate probabilities are finally introduced by the dist
expression, which has the form dist [p1 : e1 , . . . , pn : en ]. As in the case of if,
we introduce a primitive prim_dist (p1 , . . . , pn ), which selects an integer from 1
to n with the corresponding probability. We also use prim_case, which generalizes
the prim_if above to take an integer test with n possible outcomes. We can then
rewrite
dist [p1 : e1 , . . . , pn : en ] =
let x1 = e1 in
...
let xn = en in
let z = prim_dist (p1 , . . . , pn ) in
prim_case (z, [x1 , . . . , xn ])
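To complete the specification, we only need to provide rules for prim_dist and
prim_case. The prim_dist rule is extremely simple:
\[
T[\mathrm{prim\_dist}(p_1, \ldots, p_n)] \;=\;
\begin{array}{|c|c|}
\hline
* & \\
\hline
1 & p_1 \\
\vdots & \vdots \\
n & p_n \\
\hline
\end{array}
\]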
The prim_case rule generalizes the rule for prim_if above. It exploits the property
that no two of the $x_j$ can be relevant, because the dist expression selects only one
of them. This technique really comes into its own here. If there are $m$ different
chains defined on the result, as before, and $n$ different possible outcomes of the
dist expression, the number of edges removed from the moral graph is $m \times n$. The
rule is
\[
T[\mathrm{prim\_case}(z, [x_1, \ldots, x_n])] \;=\; \bigcup_{i=1}^{m} \bigcup_{j=1}^{n}
\begin{array}{|c|c|c|}
\hline
z & *.c_i & x_j.c_i \\
\hline
j & v^i_1 & v^i_1 \\
\vdots & \vdots & \vdots \\
j & v^i_{n^i} & v^i_{n^i} \\
\overline{\{j\}} & \text{any} & \text{any} \\
\hline
\end{array}
\]
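To see how the prim_dist and prim_case factors combine, the following sketch sums out the selector variable for a three-way dist expression. It is an illustration under stated assumptions: the probabilities, the branch values, and the function names are invented for the example, branches are constants rather than expressions, and outcomes are indexed from 0 rather than 1.

```python
probs = [0.2, 0.5, 0.3]                 # p_1..p_n of a hypothetical dist expression
branches = ["heads", "tails", "heads"]  # values of x_1..x_n (constants here)
results = ["heads", "tails"]            # possible values of the result variable

def prim_dist(z):
    """Factor over the selector z: the row for outcome z carries its probability."""
    return probs[z]

def prim_case(j, z, result):
    """Factor j: ties the result to x_j when z = j, and has value 1 otherwise."""
    return 1.0 if (z != j or result == branches[j]) else 0.0

def result_distribution():
    """Multiply all factors and sum out the selector z."""
    dist = {}
    for r in results:
        total = 0.0
        for z in range(len(probs)):
            weight = prim_dist(z)
            for j in range(len(probs)):
                weight *= prim_case(j, z, r)
            total += weight
        dist[r] = total
    return dist

distribution = result_distribution()
print(distribution)
```

Each prim_case factor mentions only the selector, the result, and one branch, so no factor ever links two branches to each other.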
The rules for record construction and field access expressions are relatively simple,
and are omitted. Observations are also very simple.
Next, we turn to the mechanism for applying functions. It also needs to be able
to handle functional uncertainty, that is, the fact that the function to be applied is itself
defined by an expression, over whose value we have uncertainty. To start with,
however, let us assume that we know which particular function we are applying to
a certain set of arguments. For a function $f$, let $f.x_1, \ldots, f.x_n$ denote its formal
arguments, and $f.b$ denote its body. Let $A[f, e_1, \ldots, e_n]$ denote the application of
$f$ to arguments defined by expressions $e_1, \ldots, e_n$. Then
\[
A[f, e_1, \ldots, e_n] \;=\; T\!\left[
\begin{array}{l}
\text{let } f.x_1 = e_1 \text{ in} \\
\quad\vdots \\
\text{let } f.x_n = e_n \text{ in} \\
f.b
\end{array}
\right].
\]
By the let rule presented earlier, this will convert $f.b$ into a set of factors that
mention the result variable $*$, the arguments $f.x_i$, and variables internal to the
body of $f$. Meanwhile, each of the $e_i$ is converted into a set of factors defining the
distribution over $f.x_i$.
\[
A[f, e_1, \ldots, e_n] \;=\; T[f.b] \;\cup\; \bigcup_{i} T[e_i]^{f.x_i}_{*}
\]
To exploit encapsulation, we want to eliminate all the variables that are internal
to the function call before passing the set of factors out to the next level. This can be
achieved simply by eliminating all temporary variables except for those representing
the f.xi from T [f.b]. Thus, a VE process is performed for every function application.
The result of performing VE is a conditional distribution over $*$ given the $f.x_i$.⁶
Normally in VE, once all the designated variables have been eliminated, the
remaining factors are multiplied together to obtain a distribution over the uneliminated variables. Here that is not necessary: performing VE returns a set of factors
over the uneliminated variables that is passed to the next level up in the computation. Delaying the multiplication can remove some edges from the moral graph at
the next level up.
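The idea of eliminating internal variables while returning a set of factors, rather than one multiplied-out table, can be sketched as follows. This is a minimal illustration, not IBAL's implementation: the factor representation (a tuple of variable names plus a table), the binary domain, and the variable names `arg`, `tmp`, `res1`, and `res2` are all assumptions made for the example.

```python
from itertools import product

DOMAIN = (0, 1)  # every variable is binary in this sketch

def multiply(f, g):
    """Pointwise product of two factors, each given as (variables, table)."""
    (fv, ft), (gv, gt) = f, g
    vs = tuple(dict.fromkeys(fv + gv))  # union of variables, order preserved
    table = {}
    for row in product(DOMAIN, repeat=len(vs)):
        env = dict(zip(vs, row))
        table[row] = ft[tuple(env[v] for v in fv)] * gt[tuple(env[v] for v in gv)]
    return vs, table

def sum_out(f, var):
    """Marginalize one variable out of a factor."""
    fv, ft = f
    keep = tuple(v for v in fv if v != var)
    table = {}
    for row, value in ft.items():
        key = tuple(x for v, x in zip(fv, row) if v != var)
        table[key] = table.get(key, 0.0) + value
    return keep, table

def eliminate(factors, var):
    """Eliminate var; factors that do not mention it pass through untouched."""
    touching = [f for f in factors if var in f[0]]
    rest = [f for f in factors if var not in f[0]]
    if not touching:
        return rest
    joint = touching[0]
    for f in touching[1:]:
        joint = multiply(joint, f)
    return rest + [sum_out(joint, var)]

# Hypothetical function body: tmp is internal, arg is the formal argument.
p_tmp = (("arg", "tmp"), {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8})
p_res1 = (("tmp", "res1"), {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.6})
p_res2 = (("arg", "res2"), {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.1, (1, 1): 0.9})

remaining = eliminate([p_tmp, p_res1, p_res2], "tmp")
print([variables for variables, _ in remaining])
```

After `tmp` is eliminated, two factors remain, one over (`arg`, `res2`) and one over (`arg`, `res1`). They are deliberately left unmultiplied, so `res1` and `res2` are not joined in a single table at the next level.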
Now suppose we have an application expression e0 (e1 , . . . , en ). The expression e0
does not have to name a particular function, and there may be uncertainty as to
its value. We need to consider all possible values of the function, and apply each of
those to the arguments. Let $F$ denote the support of $e_0$. Then for each $f_i \in F$, we
need to compute $A_i = A[f_i, e_1, \ldots, e_n]$ as above.
Now, we cannot simply take the union of the $A_i$ as part of the application result,
since we do not want to multiply factors in different $A_i$ together. The different $A_i$
represent the conditional distribution over the result for different function bodies.
We therefore need to condition $A_i$ on $F$ being $f_i$. This effect is achieved as follows.
Let $A^1_i, \ldots, A^m_i$ be the factors in $A_i$, and let $(r^j_1, p^j_1), \ldots, (r^j_{\ell_j}, p^j_{\ell_j})$ be the rows in
$A^j_i$. Then
\[
B_i \;=\; \bigcup_{j=1}^{m}
\begin{array}{|c|c|c|}
\hline
F & *,\, f_i.x_1, \ldots, f_i.x_n & \\
\hline
f_i & r^j_1 & p^j_1 \\
\vdots & \vdots & \vdots \\
f_i & r^j_{\ell_j} & p^j_{\ell_j} \\
\overline{\{f_i\}} & \text{any} & 1 \\
\hline
\end{array}
\]
6. There may also be variables that are free in the body of f and not bound by function
arguments. These should also not be eliminated.
In words, each $B^j_i$ is formed from the corresponding $A^j_i$ in two steps. First, $A^j_i$
is extended by adding a column for $F$, and setting its value to be equal to $f_i$. The
effect is to say that when $F$ is equal to $f_i$, we want $A^j_i$ to hold. Then, a row is added
saying that when $F$ is unequal to $f_i$, the other variables can take on any value and
the result will be 1. The effect is to say that $A^j_i$ does not matter when $F \neq f_i$. We
can now take the union of all the $B_i$. To complete the translation rule for function
application, we just have to supply the distribution over $F$:
\[
T[e_0(e_1, \ldots, e_n)] \;=\; \bigcup_{i} B_i \;\cup\; T[e_0]^{F}_{*}
\]
14.8
Beware unexpected interactions between goals! Koller et al. [9] blithely declared
that lazy evaluation and memoization would be used. In retrospect, combining
the two mechanisms was the single most difficult thing in the implementation.
This chapter has presented the probabilistic inference mechanism for IBAL, a
highly expressive probabilistic representation language. A number of apparently
conflicting desiderata for inference were presented, and it was shown how IBAL's
inference algorithm satisfies all of them. It is hoped that the development of IBAL
provides a service to the community in two ways. First, it provides a blueprint
for anyone who wants to build a first-order probabilistic reasoning system. Second,
and more important, it is a general-purpose system that has been released for
public use. In the future it will hopefully be unnecessary for designers of expressive
models to build their own inference engines. IBAL has successfully been tried
on BNs, HMMs (including infinite state-space models), stochastic grammars, and
probabilistic relational models. IBAL has also been used successfully as a teaching
tool in a probabilistic reasoning course at Harvard. Its implementation consists of
approximately 10,000 lines of code. It includes over fifty test examples, all of which
the inference engine is able to handle. IBAL's tutorial and reference manuals are
both over twenty pages long.
Of course, there are many models for which the techniques presented in this
chapter will be insufficient, and for which approximate inference is needed. The next
step of IBAL development is to provide approximate inference algorithms. IBAL's
inference mechanism already provides one way to do this. One can simply plug in
any standard BN approximate inference algorithm in place of VE whenever a set of
factors has to be simplified. However, other methods such as Markov chain Monte
Carlo will change the way programs are evaluated, and will require a completely
different approach.
References
[1] R. I. Bahar, E. A. Frohm, C. M. Gaona, G. D. Hachtel, E. Macii, A. Pardo, and F. Somenzi. Algebraic decision diagrams and their applications. In IEEE/ACM International Conference on Computer-Aided Design, 1993.
[2] T. Dean and K. Kanazawa. A model for reasoning about persistence and causation. Computational Intelligence, 5:142–150, 1989.
[3] R. Dechter. Bucket elimination: A unifying framework for probabilistic inference. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 1996.
[4] D. Heckerman and J. S. Breese. A new look at causal independence. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 1994.
[5] K. Kersting and L. de Raedt. Bayesian logic programs. In Proceedings of the Work-In-Progress Track at the 10th International Conference on Inductive Logic Programming, 2000.
[6] D. Koller and A. Pfeffer. Semantics and inference for recursive probability models. In Proceedings of the National Conference on Artificial Intelligence, 2000.
[7] D. Koller and A. Pfeffer. Object-oriented Bayesian networks. In Uncertainty in Artificial Intelligence (UAI), 1997.
[8] D. Koller and A. Pfeffer. Probabilistic frame-based systems. In Proceedings of the National Conference on Artificial Intelligence, 1998.
[9] D. Koller, D. McAllester, and A. Pfeffer. Effective Bayesian inference for stochastic programs. In Proceedings of the National Conference on Artificial Intelligence, 1997.
[10] K. B. Laskey and S. M. Mahoney. Network fragments: Representing knowledge for constructing probabilistic models. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 1997.
[11] Z. Li and B. D'Ambrosio. Efficient inference in Bayes networks as a combinatorial optimization problem. International Journal of Approximate Reasoning, 11, 1994.
[12] S. Muggleton. Stochastic logic programs. Journal of Logic Programming, 2001. Accepted subject to revision.
[13] L. Ngo and P. Haddawy. Answering queries from context-sensitive probabilistic knowledge bases. Theoretical Computer Science, 1996.
[14] H. Pasula and S. Russell. Approximate inference for first-order probabilistic languages. In Proceedings of the International Joint Conference on Artificial Intelligence, 2001.
[15] A. Pfeffer. Repeated observation models. In Proceedings of the National Conference on Artificial Intelligence, 2004.
[16] A. Pfeffer. Probabilistic Reasoning for Complex Systems. PhD thesis, Stanford University, 2000.
[17] A. Pfeffer, D. Koller, B. Milch, and K. T. Takusagawa. SPOOK: A system for probabilistic object-oriented knowledge representation. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 1999.
[18] D. Pless and G. Luger. Toward general analysis of recursive probability models. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2001.
[19] D. Poole. Probabilistic Horn abduction and Bayesian networks. Artificial Intelligence Journal, 64(1):81–129, 1993.
[20] D. Poole and N. L. Zhang. Exploiting contextual independence in probabilistic inference. Journal of Artificial Intelligence Research (JAIR), 2003.
[21] S. Sanghai, P. Domingos, and D. Weld. Dynamic probabilistic relational models. In Proceedings of the International Joint Conference on Artificial Intelligence, 2003.
[22] T. Sato and Y. Kameya. Parameter learning of logic programs for symbolic statistical modeling. Journal of Artificial Intelligence Research, 15:391–454, 2001.
[23] D. J. Spiegelhalter, A. Thomas, N. Best, and W. R. Gilks. BUGS 0.5: Bayesian inference using Gibbs sampling manual. Technical report, Institute of Public Health, Cambridge University, 1995.
Most probabilistic inference algorithms are specified and processed on a propositional level, even though many domains are better represented by first-order specifications that compactly stand for a class of propositional instantiations. In the last
fifteen years, many algorithms accepting first-order specifications have been proposed. However, these algorithms still perform inference on a mostly propositional
model, generated by the instantiation of first-order constructs. When this is done,
the rich and useful first-order structure is not explicit anymore. This first-order
representation and structure allow us to perform lifted inference, that is, inference
on the first-order representation directly, manipulating not only individuals but
also groups of individuals. This has the potential of greatly speeding up inference.
We precisely define the problem and present an algorithm that generalizes variable
elimination and manipulates first-order representations in order to perform lifted
inference.
15.1
Introduction
Probabilistic inference algorithms are widely employed in artificial intelligence.
Those based on graphical models such as Bayesian and Markov networks (BNs and
MNs, respectively) ([8]) are among the most popular. These models are specified by a
set of conditional probabilities (for BNs) or factors, also called potential functions
(for MNs). Both conditional probabilities and factors are defined over particular
subsets of the available random variables, and map assignments of those random
variables to positive real numbers (called potentials in MNs). For our purposes,
it will be helpful to think of graphical models in general and simply consider
conditional probabilities as a type of factor.
For example, in an application for document subject classification, one can specify a dependence between the random variables subject_apple and word_mac (which
indicate that the subject of the document is "apple" and that the word "mac" is
present in it) by defining a factor on their assignments. The higher the potential
for a given assignment to these random variables, the more likely it will be in the
joint distribution defined by the model.
A limitation of graphical models arises when the same dependence holds between
different subsets of random variables. For example, we might declare the dependence above to hold also between subject_microsoft and word_windows. In traditional
graphical models, we must use separate potential functions to do so, even though
the dependence is the same. This brings redundancy to the model and possibly
wasted computation. It is also an ad hoc mechanism since it does not cover other
sets of random variables exhibiting the same dependence (in this case, some other
company and product).
The root of this limitation is that graphical models are propositional (random
variables can be seen as analogous to propositions in logic), that is, they do not
allow quantifiers and parameterization of random variables by objects. A first-order
or relational language, on the other hand, does allow for these elements. With such
a language, we can specify a potential function that applies, for example, to all
tuples of random variables obtained by instantiating $X$ and $Y$ in the tuple
\[
subject(X),\ company(X),\ product(X, Y),\ word(Y). \tag{15.1}
\]
This way we not only cover both cases presented before, but also unforeseen ones,
with a single compact specification.
In the last fifteen years, many proposals for probabilistic inference algorithms
accepting first-order specifications have been presented ([7, 6, 1, 4, 10, 11], among
many others), most of which are based on the theoretical framework of Halpern [5].
However, these solutions still perform inference at a mostly propositional level;
they typically instantiate potential functions according to the objects relevant to
the present query, thus obtaining a regular graphical model on propositional random
variables, and then use a regular inference algorithm on this model. In domains
with a large number of objects this may be both costly and essentially unnecessary.
Suppose we have a medical application about the health of a large population,
with a random variable per person indicating whether they are sick with a certain
disease, and with a potential function representing the dependence between a person
being sick and that person getting hospitalized. To answer the query "what is the
probability that someone will be hospitalized?", an algorithm that depends on
propositionalization will instantiate a random variable per person. However, this
is not necessary, since one can calculate the same probability by reasoning about
individuals on a general level, simply using the population size, in order to answer
that query in a much shorter time. In fact, the time taken by the latter calculation
would not depend on the population size at all.
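The contrast can be made concrete with a toy version of the medical example. The sketch below is illustrative only: the potential values in `phi` and the function names are assumptions, each person contributes one `(sick, hospitalized)` pair of binary random variables, and persons are assumed not to interact.

```python
from itertools import product

# Hypothetical potential phi[(sick, hospitalized)], shared by every person.
phi = {(0, 0): 8.0, (0, 1): 0.5, (1, 0): 1.0, (1, 1): 2.0}

def ground_query(n):
    """P(someone is hospitalized) by enumerating every joint assignment."""
    states = list(phi)  # per-person (sick, hospitalized) pairs
    total = someone = 0.0
    for world in product(states, repeat=n):
        weight = 1.0
        for state in world:
            weight *= phi[state]
        total += weight
        if any(hospitalized for _, hospitalized in world):
            someone += weight
    return someone / total

def lifted_query(n):
    """Same probability, using only the per-person potential and the size n."""
    z = sum(phi.values())
    p_not_hospitalized = (phi[(0, 0)] + phi[(1, 0)]) / z
    return 1.0 - p_not_hospitalized ** n

print(ground_query(4), lifted_query(4))
```

`ground_query` visits $4^n$ joint assignments, while `lifted_query` performs a fixed number of arithmetic operations in which the population size appears only as an exponent.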
Naturally, it is possible to reformulate the problem so that it is solved in a
more efficient manner. However, this would require manually devising a process
specific to the model or query in question. It is desirable to have an algorithm that
can receive a general first-order model and automatically answer queries like these
without computational waste.
A first step in this direction was given by Poole [9], who proposes a generalized
version of the variable elimination algorithm [12] that is lifted, that is, deals
with groups of random variables at a first-order level. The algorithm receives
a specification in which parameterized random variables stand for all of their
instantiations and then eliminates them in a way that is equivalent to, but much
cheaper than, eliminating all their instantiations at once. For the parameterized
potential function (15.1), for example, one can eliminate $product(X, Y)$ in a single
step that would be equivalent to eliminating all of its instantiations.
The algorithm in Poole [9], however, applies only to certain types of models
because it uses a single elimination operation that can only eliminate parameterized
random variables containing all parameters present in the potential function (the
method can eliminate $product(X, Y)$ from (15.1) but not $company(X)$ because the
latter does not contain the parameter $Y$). As we will see later, Poole's algorithm uses
the operation we call inversion elimination. In addition to inversion elimination, we
have developed further operations (the main ones called counting elimination and
partial inversion) that broaden the applicability of lifted inference to a greater
extent ([2, 3]). These operations are combined to form the first-order variable
elimination (FOVE) algorithm presented in this chapter. The cases to which lifted
inference applies can be roughly summarized as those containing dependencies
where the sets of parameters of the parameterized random variables are disjoint or,
when this is not the case, where there is a set of parameters whose instantiations
create independent solvable cases. We specify these conditions in more detail when
explaining the operations, and further discuss applicability in section 15.6. When no
lifted inference operation applies to a specific part of a model, FOVE can still apply
standard propositional methods to that part, assuring completeness and limiting
propositional inference to only some parts of the model.
15.2
mally refer to atoms as parameterized random variables, they are not, technically
speaking, random variables, but stand for classes of them. A ground atom, however, denotes a random variable. Sometimes we call random variables ground to
emphasize their correspondence to ground atoms.
Logical variables are typed, with each type being a finite set of objects. We
denote the domain, or type, of a logical variable $X$ by $D_X$ and its cardinality by
$|X|$. In our examples, unless noted, all logical variables have the same type. Each
predicate $p$ also has its domain, $D_p$, which is the set of values that each of the
random variables with that predicate can take.
Formally, a parfactor $g$ is a tuple $(\phi_g, A_g, C_g)$, where $\phi_g$ is a potential function defined over atoms $A_g$ to be instantiated by all substitutions of its logical
variables satisfying a constraint $C_g$. A constraint is a pair $(F, V)$ where $F$ is
an equational formula on logical variables and $V$ is the set of logical variables
to be instantiated (some of them may not be in the formula). We sometimes denote a constraint by its formula $F$ alone, when the set of logical variables $V$ is
clear from context. Tautological formulas are represented by $\top$. For example, the
parfactor $(\phi, (p(X), q(X, Y)), (X \neq a, \{X, Y\}))$ applies $\phi$ to all instantiations of
$(p(X), q(X, Y))$ by substitutions of $X$ and $Y$ satisfying $X \neq a$. We denote the set
of substitutions satisfying $C$ by $[C]$.
While we are neutral as to how the potential functions are actually specified, logical formulas seem to be a convenient choice. For example, a weighted
formula $0.7 : epidemic(D) \Rightarrow sick(P, D)$ might represent a potential function
$\phi(epidemic(D), sick(P, D))$ with potential 0.7 for assignments in which the formula is true. This allows us to specify FOPMs by sets of weighted logical formulas
that are intuitive and simple to read, and is the approach taken by Markov logic
networks ([11]).
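A weighted formula can be turned into a potential function mechanically. The sketch below is one possible reading, not a definition from the chapter: it assumes the formula is an implication, and the potential assigned when the formula is false (`false_weight`, here 0.3) is an assumption, since the text fixes only the value for satisfying assignments.

```python
def potential_from_formula(formula, weight, false_weight):
    """Build a potential: `weight` where the formula holds, `false_weight` elsewhere."""
    def phi(*values):
        return weight if formula(*values) else false_weight
    return phi

def implies(epidemic, sick):
    """Truth value of epidemic => sick for one ground instance."""
    return (not epidemic) or sick

# false_weight=0.3 is a hypothetical choice made only for this illustration.
phi = potential_from_formula(implies, 0.7, 0.3)

for epidemic in (False, True):
    for sick in (False, True):
        print(epidemic, sick, phi(epidemic, sick))
```

Only the assignment with an epidemic and a healthy person violates the formula, so it is the only one that receives the lower potential.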
The projection $C_{|L}$ of a constraint $C = (F, V)$ onto a set of logical variables
$L$ is a constraint equivalent to $(\exists L'\, F, L)$ for $L' = V \setminus L$. Intuitively, $C_{|L}$ describes
the conditions posed by $C$ on $L$ alone, that is, the possible substitutions on $L$
that are part of substitutions in $[C]$. For example, $(X \neq a \wedge X \neq Y \wedge Y \neq
b, \{X, Y\})_{|\{X\}} = (X \neq a, \{X\})$. FOVE uses a constraint solver which is able to
solve several constraint problems, such as determining the number of solutions of a
constraint and its projection onto sets of logical variables.
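For small domains, the two constraint problems just mentioned can be solved by enumeration. The following sketch is a brute-force stand-in for a real constraint solver: the function names and the three-object domain are assumptions, and the constraint is taken to be $X \neq a \wedge X \neq Y \wedge Y \neq b$.

```python
from itertools import product

def solutions(formula, variables, domain):
    """[C]: every substitution of `variables` over `domain` satisfying the formula."""
    candidates = (dict(zip(variables, values))
                  for values in product(domain, repeat=len(variables)))
    return [theta for theta in candidates if formula(**theta)]

def project(formula, variables, domain, onto):
    """Substitutions on `onto` that can be extended to a substitution in [C]."""
    return sorted({tuple(theta[v] for v in onto)
                   for theta in solutions(formula, variables, domain)})

domain = ["a", "b", "c"]  # hypothetical type shared by X and Y

def constraint(X, Y):
    return X != "a" and X != Y and Y != "b"

count = len(solutions(constraint, ["X", "Y"], domain))
projection = project(constraint, ["X", "Y"], domain, ["X"])
print(count, projection)  # 3 [('b',), ('c',)]
```

The projection onto $X$ contains exactly the objects other than `a`, matching the constraint $X \neq a$.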
In certain contexts we wish to describe the class of random variables instantiated
from an atom with constraints on its logical variables (for example, the set of
random variables instantiated from $p(X, Y)$, with $X \neq a$). We call such pairs
of atoms and constraints constrained atoms, or c-atoms. The c-atoms of a
parfactor are the c-atoms formed by its atoms and its constraint.
Let $\alpha$ be a parfactor, c-atom, constraint, or a set of those. We define $RV(\alpha)$ to
be the set of (ground) random variables specified by $\alpha$, and $\alpha\theta$ denotes the result
of applying a substitution $\theta$ to $\alpha$. $[C_g]$ is also denoted by $\Theta_g$.
A FOPM is specified by a set of parfactors $G$ and the types of its logical variables.
Its semantics is a joint distribution defined on $RV(G)$ by the Markov network
formed by all the instantiations of parfactors. Thus it is proportional to the product
\[
\prod_{g \in G} \prod_{\theta \in \Theta_g} \phi_g(A_g\theta).
\]
For convenience, we denote $\prod_{\theta \in \Theta_g} \phi_g(A_g\theta)$ by $\Phi(g)$, and $\prod_{g \in G} \Phi(g)$ by $\Phi(G)$. Therefore
we can write the above as $P(RV(G)) \propto \Phi(G)$.
The most important inference task in graphical models is marginalization. For
FOPMs, it takes the following form: given a set of ground random variables $Q$,
calculate
\[
P(Q) \;\propto \sum_{RV(G) \setminus Q} \Phi(G), \tag{15.2}
\]
where the summation ranges over all assignments to $RV(G) \setminus Q$. Posterior probabilities can be calculated by representing evidence as additional parfactors on the
evidence atoms.
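The semantics and the marginalization task can be spelled out by brute force for a tiny model. The sketch below is a propositional baseline, not FOVE: the parfactor encoding (potential, atoms, logical variables, constraint), the boolean domains, and the example model with its 0.7/0.3 potential are all assumptions chosen for illustration.

```python
from itertools import product

def ground(parfactor, domain):
    """Instantiate a parfactor by every substitution satisfying its constraint."""
    phi, atoms, logvars, constraint = parfactor
    factors = []
    for values in product(domain, repeat=len(logvars)):
        theta = dict(zip(logvars, values))
        if constraint(theta):
            scope = tuple((pred,) + tuple(theta.get(a, a) for a in args)
                          for pred, args in atoms)
            factors.append((phi, scope))
    return factors

def joint(parfactors, domain):
    """Normalized product of all ground factors over boolean random variables."""
    factors = [f for g in parfactors for f in ground(g, domain)]
    rvs = sorted({rv for _, scope in factors for rv in scope})
    weights = {}
    for values in product((0, 1), repeat=len(rvs)):
        world = dict(zip(rvs, values))
        weight = 1.0
        for phi, scope in factors:
            weight *= phi(*(world[rv] for rv in scope))
        weights[values] = weight
    z = sum(weights.values())
    return rvs, {world: w / z for world, w in weights.items()}

def marginal(parfactors, domain, query):
    """P(query) obtained by summing the joint over all other random variables."""
    rvs, dist = joint(parfactors, domain)
    index = rvs.index(query)
    result = {0: 0.0, 1: 0.0}
    for world, p in dist.items():
        result[world[index]] += p
    return result

# Hypothetical model: one parfactor on (epidemic, sick(P)) with a trivial constraint.
def implication(e, s):
    return 0.7 if (not e or s) else 0.3

model = [(implication, [("epidemic", ()), ("sick", ("P",))], ["P"], lambda theta: True)]
people = ["ann", "bob"]
print(marginal(model, people, ("epidemic",)))
```

The enumeration in `joint` is exponential in the number of ground random variables, which is exactly the cost that lifted inference aims to avoid.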
The FOVE algorithm makes the simplifying assumption that the FOPM is
shattered w.r.t. the query $Q$. A set of c-atoms is shattered if the instantiations
of any pair of its elements are either identical or disjoint. A parfactor, or set of
parfactors, is shattered if the set of their c-atoms is shattered. A FOPM is shattered
w.r.t. a query $Q$ if the union of its c-atoms and those of the query is shattered. For
example, we can have c-atoms $(p(X), X \neq a)$, $(p(Y), Y \neq a)$ and $p(a)$ in a model,
but not $p(Y)$ and $p(a)$, because $RV(p(a)) \subset RV(p(Y))$ but $RV(p(a)) \neq RV(p(Y))$.
When a FOPM and query are not shattered, we can replace them by an equivalent
shattered FOPM and query through the process of shattering, detailed in section
15.5.2.
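The shattering condition can be tested directly on small domains by grounding each c-atom. The sketch below is a naive check, not the shattering procedure itself: the c-atom encoding (predicate, arguments, constraint), the convention that upper-case names are logical variables and lower-case names are constants, and the three-object domain are assumptions of this example.

```python
from itertools import combinations, product

def rv(c_atom, domain):
    """RV of a c-atom: the ground random variables it stands for."""
    pred, args, constraint = c_atom
    logvars = list(dict.fromkeys(a for a in args if a.isupper()))
    ground_atoms = set()
    for values in product(domain, repeat=len(logvars)):
        theta = dict(zip(logvars, values))
        if constraint(theta):
            ground_atoms.add((pred,) + tuple(theta.get(a, a) for a in args))
    return frozenset(ground_atoms)

def shattered(c_atoms, domain):
    """True if every pair of c-atoms has identical or disjoint instantiations."""
    sets = [rv(c, domain) for c in c_atoms]
    return all(s == t or not (s & t) for s, t in combinations(sets, 2))

domain = ["a", "b", "c"]
ok = [
    ("p", ("X",), lambda th: th["X"] != "a"),
    ("p", ("Y",), lambda th: th["Y"] != "a"),
    ("p", ("a",), lambda th: True),
]
bad = [
    ("p", ("Y",), lambda th: True),
    ("p", ("a",), lambda th: True),
]
print(shattered(ok, domain), shattered(bad, domain))  # True False
```

In the second set, $p(a)$ is one of the instantiations of $p(Y)$ without being all of them, so the pair overlaps without being identical.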
15.3
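We later show operations computing a parfactor $g'$ such that $\sum_{RV(E)} \Phi(G_E) =
\Phi(g')$. Once we have $g'$, the right-hand side of the above is equal to
\[
\sum_{(RV(G) \setminus RV(E)) \setminus Q} \Phi(G_{\neg E})\,\Phi(g')
\;=\; \sum_{(RV(G) \setminus RV(E)) \setminus Q} \Phi(G_{\neg E} \cup \{g'\})
\;=\; \sum_{RV(G') \setminus Q} \Phi(G')
\]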
Counting Elimination
We first show counting elimination on a specific example and later generalize it.
Consider the summation
\[
\sum_{RV(p(X))} \prod_{X,Y} \phi(p(X), p(Y)),
\]
where $p$ is a boolean predicate. (Note that the $X$ used under the summation is not
the same $X$ used by the product. $RV(p(X))$ is shorthand for all assignments over
the set $\{p(X) : X \in D_X\}$, so $X$ is locally used. In fact, we could have written
$RV(p(Y))$, or even $RV(p(Z))$, to the same effect. We choose to use $X$ or $Y$ to make
the link with the atom in the parfactor more obvious.)
Counting elimination is based on the following insight: because a parfactor will
typically only evaluate to a few different potentials, large groups of its instantiations
will evaluate to the same potential. So the summation is rewritten
\[
\sum_{RV(p(X))} \phi(0,0)^{|(0,0)|}\, \phi(0,1)^{|(0,1)|}\, \phi(1,0)^{|(1,0)|}\, \phi(1,1)^{|(1,1)|},
\]
where $|(v_1, v_2)|$ indicates the number of possible choices for $X$ and $Y$ so that $p(X) =
v_1$ and $p(Y) = v_2$ given the current assignment to $RV(p(X))$. These partition sizes
can be calculated by a combinatorial, or counting, argument. Assume we know $\vec{N}_p$,
a vector of integers that indicates how many random variables in $RV(p(X))$ are
currently assigned a particular value, that is, $\vec{N}_{p,i} = |\{r \in RV(p(X)) : r = i\}|$
for each $i \in D_p$. Naturally, $\sum_i \vec{N}_{p,i} = |RV(p(X))|$. Then there are $\vec{N}_{p,v_1}$ possible
values for $X$ (so that $p(X) = v_1$) and $\vec{N}_{p,v_2}$ distinct possible values for $Y$ (so that
$p(Y) = v_2$), so $|(v_1, v_2)| = \vec{N}_{p,v_1} \vec{N}_{p,v_2}$.
We take advantage of the fact that the values $|(v_1, v_2)|$ do not depend on the
particular assignments to $RV(p(X))$, but only on $\vec{N}_p$. This allows us to iterate over
the groups of assignments with the same $\vec{N}_p$ and do the calculation for the entire
group. We also take into account the group size, which is provided by the binomial
coefficient of $\vec{N}_p$, $\binom{|RV(p(X))|}{\vec{N}_{p,0}}$ (or, equivalently, $\binom{|RV(p(X))|}{\vec{N}_{p,1}}$). We then have
\[
\sum_{\vec{N}_p} \binom{|RV(p(X))|}{\vec{N}_{p,0}} \prod_{(v_1, v_2)} \phi(v_1, v_2)^{\vec{N}_{p,v_1} \vec{N}_{p,v_2}},
\]
which has a number of terms linear in $|RV(p(X))|$, as opposed to the previous
exponential number.
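The equality between the exponential sum and its counting form can be verified numerically on small domains. The sketch below is an illustration for a boolean predicate only: the potential values in `phi` and the function names are assumptions, and `n` plays the role of $|RV(p(X))|$.

```python
from itertools import product
from math import comb

# Hypothetical potential phi[(p(X), p(Y))] for a boolean predicate p.
phi = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 2.0, (1, 1): 1.5}

def brute_force(phi, n):
    """Sum over all 2^n assignments of the product over all pairs (X, Y)."""
    total = 0.0
    for assignment in product((0, 1), repeat=n):
        weight = 1.0
        for vx in assignment:
            for vy in assignment:
                weight *= phi[(vx, vy)]
        total += weight
    return total

def counting(phi, n):
    """Sum over counters N_p = (N_0, N_1), weighting each group by its size."""
    total = 0.0
    for n1 in range(n + 1):
        counts = (n - n1, n1)
        weight = 1.0
        for (v1, v2), value in phi.items():
            weight *= value ** (counts[v1] * counts[v2])
        total += comb(n, n1) * weight
    return total

print(brute_force(phi, 5), counting(phi, 5))
```

`brute_force` visits $2^n$ assignments, whereas `counting` iterates over only $n + 1$ counter values.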
Counting elimination is not a universal method. The counting argument presented
above requires that there be little interaction between the logical variables of atoms.
If a parfactor is on $p(X, Y), q(X, Z)$, for example, the counting argument does not
work because the choices for $(X, Z)$ depend on the particular $X$ chosen for $p(X)$; we
can no longer compute the number of choices using counters alone but need to know the
particular assignment to $RV(p(X))$. Generally, under counting elimination, choices
for one atom cannot constrain the choices for another atom (there are exceptions
to this rule, as for example just-different atoms, presented in [3]).
We now give the formal account of counting elimination, starting with some
preliminary definitions.
First, we define the notion of independent atoms given a constraint.
Intuitively, this happens when choosing a substitution for the logical variables of
one atom does not change the possible choices of substitutions for the other atom.
Let $\bar{X}_1$ and $\bar{X}_2$ be two sets of logical variables such that $\bar{X}_1 \cup \bar{X}_2 \subseteq V$. $\bar{X}_1$ is
independent from $\bar{X}_2$ given $C$ if, for any substitution $\theta_2 \in [C_{|\bar{X}_2}]$, $C_{|\bar{X}_1} \Leftrightarrow (C\theta_2)_{|\bar{X}_1}$.
$\bar{X}_1$ and $\bar{X}_2$ are independent given $C$ if $\bar{X}_1$ is independent from $\bar{X}_2$ given $C$ and
vice versa. Two atoms $p_1(\bar{X}_1)$ and $p_2(\bar{X}_2)$ are independent given $C$ if $\bar{X}_1$ and $\bar{X}_2$
are independent given $C$.
Finally, we define multinomial counters. Let $a$ be a c-atom with domain $D_a$.
Then the multinomial counter of $a$, $\vec{N}_a$, is a vector where $\vec{N}_{a,j}$ indicates how many
instantiations of $a$ are assigned the $j$-th value in $D_a$. The multinomial coefficient
\[
\vec{N}_a! \;=\; \frac{(\vec{N}_{a,1} + \cdots + \vec{N}_{a,|D_a|})!}{\vec{N}_{a,1}! \cdots \vec{N}_{a,|D_a|}!}
\]
is a generalization of binomial coefficients and indicates
how many assignments to $RV(a)$ exhibit the particular value distribution counted
by $\vec{N}_a$.
Counters can be applied to sets of c-atoms with the same general meaning. The
set of multinomial counters for a set of c-atoms $A$ is denoted $\vec{N}_A$, and the product
$\prod_{a \in A} \vec{N}_a!$ of their multinomial coefficients is denoted $\vec{N}_A!$.
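As a quick check of the definition, the multinomial coefficient of a counter can be computed and compared against a direct count of assignments. The function names below are assumptions made for this sketch.

```python
from itertools import product
from math import factorial

def multinomial(counts):
    """(N_1 + ... + N_k)! / (N_1! ... N_k!) for a counter given as a list."""
    result = factorial(sum(counts))
    for c in counts:
        result //= factorial(c)
    return result

def count_assignments(counts):
    """Count assignments whose value histogram equals `counts`, by enumeration."""
    n, k = sum(counts), len(counts)
    return sum(
        1
        for assignment in product(range(k), repeat=n)
        if [assignment.count(v) for v in range(k)] == list(counts)
    )

print(multinomial([2, 1, 1]), count_assignments([2, 1, 1]))  # 12 12
```

For a two-valued domain the same function reduces to the binomial coefficient, for example `multinomial([3, 2])` is 10.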
E
N
vDE
The theorem's proof reflects the argument given above. Counting elimination
brings a significant computational advantage because iterating over assignments is
exponential in $|RV(E)|$ while doing so over groups of assignments is only polynomial
in it.
It is important to notice that $E$ must contain all nonground c-atoms in $g$. Also, if
all c-atoms in $g$ are ground, $E$ can be any subset of them and we will have a simple
propositional summation, the same used in VE (counters over 1-random variable
c-atoms reduce to ordinary assignments).
15.3.2
Inversion
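\[
\begin{aligned}
&\sum_{q(o_1,o_1)} \cdots \sum_{q(o_n,o_n)} \phi(p(o_1), q(o_1,o_1)) \cdots \phi(p(o_n), q(o_n,o_n)) \\
&\qquad= \Bigl(\sum_{q(o_1,o_1)} \phi(p(o_1), q(o_1,o_1))\Bigr) \cdots \Bigl(\sum_{q(o_n,o_n)} \phi(p(o_n), q(o_n,o_n))\Bigr) \\
&\qquad= \prod_{XY} \sum_{q(X,Y)} \phi(p(X), q(X,Y))
\end{aligned}
\]
(by observing that the summation is the same for all $q(X, Y)$)
\[
= \prod_{XY} \phi'(p(X)).
\]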
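\[
\begin{aligned}
\sum_{RV(p(X,Y))} \prod_{X} \prod_{Y,Z} \phi(p(X,Y), p(X,Z))
&= \Bigl(\sum_{RV(p(o_1,Y))} \prod_{Y,Z} \phi(p(o_1,Y), p(o_1,Z))\Bigr) \cdots \Bigl(\sum_{RV(p(o_n,Y))} \prod_{Y,Z} \phi(p(o_n,Y), p(o_n,Z))\Bigr) \\
&= \prod_{X} \sum_{RV(p(X,Y))} \prod_{Y,Z} \phi(p(X,Y), p(X,Z))
\end{aligned}
\]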
15.3.2.1
Before we present the theorem formalizing inversion, we touch on a last issue. Consider
the inversion of $X$ resulting in the expression
\[
\prod_{X} \sum_{RV(p(X,Y))} \prod_{Y \neq X,\, Z \neq X,\, Y \neq a,\, Z \neq a} \phi(p(X, Y), p(X, Z)).
\]
0
1
(p(X, Y ), p(X, Z))
Inversion Formalization
(g) =
(gC ),
CUL (Cg )
where gC is the parfactor (g , Ag , C  Cg ) and using g  dened by the recursive
computation g   =
RV (E) (g), for  an arbitrary element of [C] (by the
denition of USCP, it does not matter which).
RV (E) g
RV (E) [Cg|L ]
(Ag )
1 0
(Ag 1  ) . . .
0
1
(Ag  )
# "
(Ag 1  ) . . .
RV (e1 m )
#
(Ag m  )
[Cg m ]
1
(Ag m  )
RV (en m ) [Cg m ]
RV (en ) [Cg ]
0
1
(Ag  )
"
RV (en m ) [Cg 1 ]
RV (e1 m )
RV (en 1 ) [Cg 1 ]
[Cg|L ] RV (e1 )
RV (e1 1 )
RV (en 1 )
0 
g ]
RV (e1 1 )
(Ag )
RV (e1 )
[C
(g) =
(gC ) =
(gC ).
CUL (Cg )
Note that condition 1 is used to ensure the summations on $RV(e_1\theta_1), \ldots, RV(e_n\theta_m)$
are indeed distinct. Condition 2 ensures that the innermost products are on distinct
sets of random variables and can therefore be factored out as shown.
15.3.3
The Algorithm
Figure 15.1 shows the main pseudocode for FOVE. The algorithm consists of
successively choosing eliminations $(E, \{L_1, \ldots, L_k\})$, consisting of a collection of
atoms $E$ to eliminate after performing a series of inversions based on sets $L_1, \ldots, L_k$
of logical variables. A possible way of choosing eliminations is presented in figure
15.2. It is presented separately from the main algorithm for clarity, but because
these two phases have many operations in common, actual implementations will
typically integrate them more tightly.
There are potentially many ways to choose eliminations. The one we present
starts by choosing an atom and checking if its inversion will produce a propositional
summation, since this is the most efficient case. If not, we successively add atoms
to $E$ until $G_E$ forms a parfactor where all atoms with logical variables are part
of $E$ (because counting elimination requires it). Then, for efficiency and to avoid
shared logical variables between atoms, we try to determine as many inversions as
possible, coded in the sequence $L_1, \ldots, L_k$, to be done before counting elimination
(or explicit summation when counting cannot be done).
Notation:
$LV(\alpha)$: logical variables in object $\alpha$.
$g$: parfactor $(\phi_g, A_g, C_g)$.
$U_L(C_g)$: USCP of $L$ with respect to $C_g$ (section 15.3.2.1).
$C_{|L}$: constraints projected to a set of logical variables $L$.
$G_E$: subset of parfactors $G$ which depend on $RV(E)$.
$G_{\neg E}$: subset of parfactors $G$ which do not depend on $RV(E)$.
$\top$: tautology constraint.
Figure 15.1
15.4
An experiment
We use the implementation available at http://l2r.cs.uiuc.edu/~cogcomp to
compare average run times between lifted and propositional inference (which produce
exactly the same results) for two different models while increasing the number
of objects in the domain. The first one, (I) in figure 15.3, answers the query $P(p)$
from a parfactor on $(p, q(X))$ and uses inversion elimination only. The inference
in (II) answers the query $P(r)$ from a parfactor on $(p(X), p(Y), r)$ and uses counting
elimination only. In both cases propositional inference starts taking very long before
lifted inference run times show any noticeable variation.
Figure 15.2
[Plots for figure 15.3: two panels, each showing average run time against domain size for Lifted and Ground inference.]
Figure 15.3 (I) Average run time for answering query $P(p)$ from a parfactor
on $(p, q(X))$, using inversion elimination, with domain size $|X|$ being gradually
increased. (II) Average run time for answering query $P(r)$ from a parfactor on
$(p(X), p(Y), r)$, using counting elimination, with domain size $|X| = |Y|$ being gradually increased.
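The gap between ground and lifted run times in case (I) can be illustrated with a small sketch. It assumes binary random variables and an arbitrary, hypothetical potential table `phi` (neither is specified in the text), and it is not the FOVE implementation referenced above. The unnormalized marginal on p is computed once by enumerating all 2^n ground assignments to q(x_1), ..., q(x_n), and once with the closed form that inversion makes available, (sum over q of phi(p, q))^n.

```python
from itertools import product

# Hypothetical potential on (p, q(x)); the numbers are arbitrary.
PHI = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 0.5}


def phi(p, q):
    return PHI[(p, q)]


def ground_marginal(n):
    """Unnormalized P(p): explicit sum over all 2**n assignments to q(x_1..x_n)."""
    result = []
    for p in (0, 1):
        total = 0.0
        for qs in product((0, 1), repeat=n):
            term = 1.0
            for q in qs:
                term *= phi(p, q)
            total += term
        result.append(total)
    return result


def lifted_marginal(n):
    """Same quantity after inverting sum and product: (sum_q phi(p, q)) ** n."""
    return [(phi(p, 0) + phi(p, 1)) ** n for p in (0, 1)]


def normalize(values):
    s = sum(values)
    return [v / s for v in values]
```

The ground version does work exponential in the domain size n, while the lifted version performs one small summation per value of p, which is the qualitative behavior the experiment reports.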
15.5
Auxiliary operations
15.5.1
Fusion
We have assumed in section 15.3 that we have operations to calculate $\sum_{RV(E)} \Phi(G_E)$,
but elimination operations calculate $\sum_{RV(E)} \Phi(g)$, for $g$ a single parfactor. Fusion
bridges this gap by computing, for any set of parfactors $G$, a single parfactor $fs(G)$
such that $\Phi(G) = \Phi(fs(G))$.
Fusion works by replacing the constraints of all parfactors in the set by a single,
common constraint which is the conjunction of them all. This guarantees that all
parfactors get instantiated by the same set of substitutions on a single set of logical
variables, which allows their products (in the expression for $\Phi(G)$) to be unified
under a single product. Note that not all parfactors contain all the logical variables,
so such a parfactor will be instantiated to the same ground factor by distinct substitutions (those
agreeing on the logical variables present in the parfactor, but disagreeing on some
of the others). In other words, some of the parfactors will have their number of
instantiations increased by this unification. For this reason, we also exponentiate
the potential function to the inverse of how many times the number of instantiations
was increased, keeping the final result the same as before.
This is illustrated in the example below:
\[
\Big[\prod_{D,P} \phi_1(e(D), s(D,P))\Big] \Big[\prod_{D'} \phi_2(e(D'))\Big]
= \prod_{D,P,D'} \phi_1(e(D), s(D,P))^{\frac{1}{|D'|}} \, \phi_2(e(D'))^{\frac{1}{|D,P|}}
\]
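\[
\prod_{g \in G} \prod_{\theta \in \Theta_g} \phi_g(A_g\theta)
= \prod_{\theta \in \Theta_G} \prod_{g \in G} \phi_g(A_g\theta)^{|\Theta_g|/|\Theta_G|}
\]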
While the above is correct, it is rather unnatural to have $e(D)$ and $e(D')$ be
distinct atoms. If a set of logical variables has the same possible substitutions, like
$D$ and $D'$ here, we can do something better:
\[
\Big[\prod_{D,P} \phi_1(e(D), s(D,P))\Big] \Big[\prod_{D'} \phi_2(e(D'))\Big]
= \prod_{D''} \Big[\Big(\prod_{P} \phi_1(e(D''), s(D'',P))\Big) \, \phi_2(e(D''))\Big]
= \prod_{D'',P} \phi_1(e(D''), s(D'',P)) \, \phi_2(e(D''))^{\frac{1}{|P|}}
\]
Formally, this process is similar to inversion with respect to $D''$. However, it does
require the additional previous step of first unifying distinct logical variables (with
identical sets of possible substitutions) into a single one (in the example, $D$ and
$D'$ are replaced by $D''$). For lack of space we omit the details of this improvement.
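The exponent bookkeeping in fusion is easy to get wrong, so a small numerical check may help. The sketch below assumes tiny hypothetical domains for D, D' and P, and replaces each ground factor by an arbitrary positive number (as if one assignment to the random variables had been fixed); none of these values come from the text. It compares the unfused product with the fused product, where each potential is raised to the inverse of the factor by which its number of instantiations grew, and with the variant in which D and D' are unified into a single logical variable.

```python
import math
from itertools import product

D = ["d1", "d2", "d3"]  # hypothetical domain shared by D and D'
P = ["p1", "p2"]        # hypothetical domain for P


def phi1(d, p):
    # Arbitrary positive stand-in for phi_1(e(D), s(D, P)) at a fixed assignment.
    return 1.0 + 0.1 * D.index(d) + 0.3 * P.index(p)


def phi2(d):
    # Arbitrary positive stand-in for phi_2(e(D')) at a fixed assignment.
    return 2.0 + 0.5 * D.index(d)


# Two separate parfactors: product over (D, P) times product over D'.
unfused = math.prod(phi1(d, p) for d in D for p in P) * math.prod(phi2(d2) for d2 in D)

# Fusion over the joint substitutions (D, P, D'):
# phi1 is repeated |D'| times, phi2 is repeated |D, P| = |D| * |P| times.
fused = math.prod(
    phi1(d, p) ** (1.0 / len(D)) * phi2(d2) ** (1.0 / (len(D) * len(P)))
    for d, p, d2 in product(D, P, D)
)

# After unifying D and D' into one variable, phi2 is repeated only |P| times.
unified = math.prod(phi1(d, p) * phi2(d) ** (1.0 / len(P)) for d in D for p in P)
```

All three quantities agree up to floating-point error, which is the property fusion must preserve.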
15.5.2
Shattering
In section 15.3 we mentioned the need for shattering, which we now discuss in more
detail. This need arises from c-atoms representing overlapping, but not identical,
classes of random variables. Consider the following marginalization over parfactors
$g_1$ and $g_2$ with potential functions $\phi_1$ and $\phi_2$ respectively:
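\[
\sum_{RV(p(X,Y))} \Big(\prod_{X,Y} \phi_1(p(X,Y), q)\Big) \Big(\prod_{Y} \phi_2(p(a,Y))\Big)
= \sum_{RV(p(X,Y))} \Big(\prod_{X \neq a,\, Y} \phi_1(p(X,Y), q)\Big) \Big(\prod_{Y} \phi_1(p(a,Y), q) \, \phi_2(p(a,Y))\Big)
\]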
Inversion often produces parfactors whose constraints mention logical variables not
present in their atoms. The first inversion example produces the expression below.
We can simplify it by observing that the actual value of $Y$ is irrelevant inside
the product. Only the number $|Y|$ of possible values for $Y$ will make a difference.
Therefore we can write
\[
\prod_{X,Y} \phi'(p(X)) = \prod_{X} \phi'(p(X))^{|Y|} = \prod_{X} \phi''(p(X)).
\]
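As a quick sanity check of this simplification, the sketch below uses hypothetical domains for X and Y and an arbitrary positive potential on p(X) at a fixed assignment (all illustrative assumptions). A product over (X, Y) of a factor that ignores Y equals the product over X of that factor raised to the power |Y|.

```python
import math

X = ["x1", "x2", "x3"]        # hypothetical domain for X
Y = ["y1", "y2", "y3", "y4"]  # hypothetical domain for Y


def phi_prime(x):
    # Arbitrary positive stand-in for phi'(p(X)) at a fixed assignment.
    return 1.5 + 0.25 * X.index(x)


# Product over all (X, Y) substitutions; Y never appears in the factor.
over_x_and_y = math.prod(phi_prime(x) for x in X for _ in Y)

# Equivalent product over X only, with the exponent |Y| absorbing the repeats.
over_x_only = math.prod(phi_prime(x) ** len(Y) for x in X)
```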
15.6
15.7
Future Directions
There are several possible directions for further development of FOVE. One of the
main ones is the incorporation of function symbols, both random (the color of an
object, for example) and interpreted (summation over integers), which will greatly
increase its expressivity and applicability.
In applications involving evidence over many objects (for example, the facts about
all the words in an English document), shattering may take a long time because all
parfactors have to be checked against it. The large number of objects involved may
create the need for numerous parfactor splittings. This is unfortunate because often
only some objects are truly relevant to the query. For example, analyzing only some
words and phrases in a document will often be enough to determine its subject.
Therefore a variant of FOVE that does only the necessary shattering, guided by
the inference process, is of great interest.
Finally, lifted FOVE operations do not cover all possible cases and explicit
summation may be required at times, so increasing their coverage is an important
direction.
15.8
Conclusion
Intuitive descriptions of models very often include first-order elements. When these
models are probabilistic, the dominant approach has been that of grounding the
model to a propositional one and solving it with a regular propositional algorithm.
This strategy loses the explicit representation of the model's first-order structure,
which can be used to great computational advantage, and which is computationally
hard to retrieve from the grounded model.
Acknowledgments
This work was partly supported by Cycorp in relation to the Cyc technology,
the Advanced Research and Development Activity (ARDA)'s Advanced Question
Answering for Intelligence (AQUAINT) program, NSF grant ITR-IIS-0085980, and
a Defense Advanced Research Projects Agency (DARPA) grant HR0011-05-1-0040.
References
[1] V. S. Costa, D. Page, M. Qazi, and J. Cussens. CLP(BN): Constraint logic
programming for probabilistic knowledge. In Proceedings of the Conference on
Uncertainty in Artificial Intelligence, 2003.
[2] R. de Salvo Braz, E. Amir, and D. Roth. Lifted first-order probabilistic
inference. In Proceedings of the International Joint Conference on Artificial
Intelligence, 2005.
[3] R. de Salvo Braz, E. Amir, and D. Roth. MPE and partial inversion in
lifted probabilistic variable elimination. In National Conference on Artificial
Intelligence, 2006.
[4] N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic
relational models. In Proceedings of the International Joint Conference on
Artificial Intelligence, 1999.
[5] J. Y. Halpern. An analysis of first-order logics of probability. In Proceedings
of the International Joint Conference on Artificial Intelligence, 1990.
[6] K. Kersting and L. De Raedt. Bayesian logic programs. In Proceedings of
the Work-in-Progress Track at the 10th International Conference on Inductive
Logic Programming, 2000.
[7] L. Ngo and P. Haddawy. Probabilistic logic programming and Bayesian
networks. In Asian Computing Science Conference, 1995.
[8] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible
Inference. Morgan Kaufmann, San Mateo, CA, 1988.
[9] D. Poole. First-order probabilistic inference. In Proceedings of the International
Joint Conference on Artificial Intelligence, 2003.
[10] D. Poole. Probabilistic Horn abduction and Bayesian networks. Artificial
Intelligence, 64(1):81–129, 1993.
[11] M. Richardson and P. Domingos. Markov logic networks. Technical report,
Department of Computer Science, University of Washington, 2004.
[12] N. L. Zhang and D. Poole. A simple approach to Bayesian network computations.
In Proceedings of the Tenth Biennial Canadian Artificial Intelligence
Conference, 1994.
Using rich sets of features generated from relational data often improves the predictive
accuracy of regression models. The number of feature candidates, however,
rapidly grows prohibitively large as richer feature spaces are explored. We present
a framework, structural generalized linear regression (SGLR), which flexibly integrates
feature generation with model selection, allowing (1) augmentation of relational
representation with cluster-derived concepts, and (2) dynamic control over
the search strategy used to generate features. Clustering increases the expressivity
of feature spaces by creating new concepts which contribute to the creation of new
features, and can lead to more accurate models. Dynamic feature generation, in
which decisions of which features to generate are based on the results of run-time
feature selection, can lead to the discovery of accurate models with significantly less
computation than generating all features in advance. We present experimental results
supporting these claims in two multirelational document mining applications:
document classification and link prediction.
16.1
Introduction
We present a statistical relational learning method, structural generalized linear regression
(SGLR), for building predictive regression models from relational databases
or domains with implicit relational structure such as collections of documents
linked by citations or hyperlinks. In SGLR, features are dynamically generated
by a refinement-graph style search over SQL queries, and tested for potential inclusion
into a generalized linear regression model, such as linear, logistic, or Poisson
regression. This approach has several advantages over more traditional logic-based
inductive logic programming (ILP) methods. The tables resulting from SQL queries
are easily aggregated in many ways, giving a rich space of quantitative, as well as
Boolean features. The resulting regression models are typically more accurate than
logical models. We also show how to automatically augment the original relational
schema with additional derived features, facilitating the search for compound features.
SGLR, like several related methods [23, 18, 9, 27], searches a space of feature-generating
expressions to find those which generate new predictive features. In
SGLR, a given relational database schema describing background data determines the structure of
a search over database queries. Features are generated in two steps: a refinement-graph-like
search of the space of SQL queries generates tables, which are then
aggregated into real-valued features, which are tested for inclusion in a generalized
linear model; i.e., each query generates a table, which in turn is aggregated to
produce scalar feature candidates, from which statistically significant predictors
are selected.
The initial relational schema is dynamically augmented with new relations containing
concepts derived by clustering the data in the tables. For example, clustering
documents by the words they contain or authors by venues they have published in
gives new concepts, namely topics (document clusters) or communities (author clusters),
and new relations between the original items and the clusters they occur in (documents
on a topic or authors in a community).
The main search is over the space of possible relational database queries, augmented
to include aggregate or statistical operators, groupings, richer join conditions,
and argmax-based queries. This search can be guided based on the types of
predictive features discovered so far. We show below that a very simple intelligent
search over the space of possible queries (and hence features) can result in discovery
of predictive features with far less computation than static (e.g., breadth-first)
search.
SGLR couples two elements helpful for successful learning: (1) a class of statistical
models which outperforms logic-based models and (2) principled means of choosing
what features to include in this model. Regression models are often more accurate
than recursive partitioning methods such as C4.5 or FOIL-style logic descriptions.
This difference is particularly apparent when there are vast numbers of potential features,
many of which contribute some signal, for example, when words are included
as features. Regression also allows us to use principled feature selection criteria
such as Akaike information criterion (AIC), Bayes information criterion (BIC), and
streaming feature selection (SFS) [4, 29, 33] to control against overfitting.
Figure 16.1 highlights the components of SGLR. Two main processes, relational
feature generation and statistical modeling, are dynamically coupled into a single
loop. Knowing the types of features selected so far by the statistical modeler allows
the query generation component to guide its search, focusing on promising subspaces
of the feature space. The search in the space of database queries involving
one or more relations produces feature candidates one at a time for consideration
by the statistical model selection component. The process results in a statistical
model where each selected feature is the evaluation of a database query encoding
a predictive data pattern in a given domain. We use logistic regression (or, equivalently,
maximum entropy modeling). Features are tested sequentially for inclusion
in the regression model, and accepted if they are statistically significant after using
a BIC [29] penalty to control against false discovery.
Figure 16.1
[Diagram labels: Learning Process; Control Module; model selection; y = f(X), where the x's are selected database queries; feedback; search control information; Search: Relational Feature Generation; Relational Database Engine; database query; feature columns.]
SGLR has several key characteristics which distinguish it from either pure
probabilistic network modeling or ILP:
The use of regression rather than logic allows the feature space to include
statistical summaries or aggregates, and more expressive substitutions through
nesting of intermediate aggregates (e.g., How many times does this paper cite
the most cited author in a conference to which it was submitted?).
We use clustering to dynamically extend the set of relations generating new
features. Clusters give better models of sparse data, improve scalability, and
produce representations not possible with standard aggregates [12]. For example,
one can cluster words based on co-occurrence in documents, giving topics, or
authors based on the number of papers they have published in the same venues,
giving communities. Once clusters are formed, they represent new relational
concepts which are added to the relational database schema, and then used
together with the original relations.
We use relational database management systems and SQL rather than Prolog.
Most real-world data lies in relational databases, with schemata and metainformation we can use. Relational database management systems incorporate decades
of work on optimization, giving better scalability.
Coupling generation and feature selection using discriminative modeling into a
single loop gives a more flexible search than propositionalization. Since the total
number and type of features is not known in advance, the search formulation
does lazy feature evaluation, allowing it to focus on more promising feature
subspaces, giving higher time efficiency. Space efficiency is achieved by not
storing pregenerated features, but rather considering them one by one as they
are generated, and keeping only the few selected features.
We present results on two sets of tasks which use the data from CiteSeer
(a.k.a. ResearchIndex), an online digital library of computer science papers [19].
CiteSeer contains a rich set of data, including paper titles, text of abstracts and
documents, citation information, author names and affiliations, and conference
or journal names. We represent CiteSeer as a relational database. For example,
citation information is represented as a binary relation between citing and cited
documents. Document authorship and publication venues of documents are also
binary relations, while word counts can be represented as a ternary relation.
16.1.1
SGLR uses clustering to derive new relations and adds them to the database
schema used in automatic generation of predictive features in statistical relational
learning. Entities and relationships derived from clusters increase the expressivity
of feature spaces by creating new first-class concepts. These concepts and relations
are added to the database schema, and thus are considered (potentially in multiple
combinations) during the search of the space of possible queries (figure 16.2).
For example, in CiteSeer, papers can be clustered based on words or citations
giving topics, and authors can be clustered based on documents they coauthor
giving communities. In addition to simpler grouping (e.g., Is this document on a
given topic?), such cluster-derived concepts become part of more complex feature
expressions (e.g. Does the database contain another document on the same topic
and published in the same conference?). The original database schema is implicitly
used to decide which entities to cluster and what sources of attributes to use,
possibly several per entity, creating alternative clusterings of the same objects. For
example, documents can be clustered using words and, separately, using citations.
Out of the large number of features generated, those which improve predictive
accuracy are kept in the model, as decided by statistical feature selection criteria.
Using cluster relations improves accuracy. Perhaps surprisingly, using cluster relations can
also lead to a more rapid discovery of predictive features.
Cluster-relation invention as described here differs importantly from aggregation,
which also creates new features from a relational representation [23, 26]. Aggregation allows one to summarize the information in a table returned from an SQL or
logic query into scalar values usable by a regression model, for example, computing
the average of a word count in all cited documents, or selecting a citing document
with max number of incoming links. The clusters, on the other hand, create new
relations in the database schema. The cluster relations are then used repeatedly to
generate new queries and hence tables and features.
16.1.2
SGLR also supports dynamic feature generation, in which the order in which
features are generated and evaluated is determined at run-time. Generating features
is by far the most computationally demanding part of SGLR. In the example
presented below, generating 100,000 features can take several CPU days due to
the extensive SQL queries, particularly the joins. Dynamic feature generation can
lead to discovery of predictive features with far less computation than generating
all features in advance. When using the appropriate complexity penalties, one can
still guarantee no overfitting, even when the order in which we generate features
and test them for inclusion in the model is dynamically determined based on which
features have so far been found to be predictive. This best-first search often vastly
reduces the number of computationally expensive feature evaluations.
Figure 16.2
[Diagram panels: DATABASE SCHEMA, CLUSTERING, EVALUATED TABLES, FEATURES (AGGREGATED). Only the last word of the caption survives: "candidates."]
Query expressions are assigned into multiple streams based on user-selected
properties of the feature expressions; for example, based on the aggregate operator
type. Since different sets of features are of different size (e.g., the number of different
words is much greater than the number of journals or the number of topics),
it is often easy to heuristically classify features into different streams. If feature
generation has a known cost, this can also be taken into account. At each iteration,
one of the streams is chosen to generate the next candidate feature, based on the
expected utility of the stream's features relative to those of other streams. For
example, a simple and effective rule is to select the next query to be evaluated
from the stream whose features have been included in the model in the highest
percentage.
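The selection rule in the last sentence can be sketched in a few lines. The `Stream` class, the `is_selected` callback (standing in for the statistical test that decides whether a feature enters the model), and the optimistic initial acceptance rate for untried streams are all illustrative assumptions, not details given in the text.

```python
class Stream:
    """A queue of candidate features sharing some user-selected property."""

    def __init__(self, name, candidates):
        self.name = name
        self.candidates = list(candidates)
        self.tried = 0
        self.accepted = 0

    def acceptance_rate(self):
        # Assumption: add-one smoothing so that untried streams start at 1.0
        # and every stream is sampled at least once.
        return (self.accepted + 1) / (self.tried + 1)


def run_dynamic_search(streams, is_selected, budget):
    """Evaluate up to `budget` features, always drawing from the stream whose
    features have so far entered the model in the highest percentage."""
    selected = []
    for _ in range(budget):
        live = [s for s in streams if s.candidates]
        if not live:
            break
        best = max(live, key=Stream.acceptance_rate)
        feature = best.candidates.pop(0)
        best.tried += 1
        if is_selected(feature):
            best.accepted += 1
            selected.append(feature)
    return selected
```

With one stream of useful features and one of useless ones, the useless stream is tried once and then ignored for as long as the useful stream keeps producing accepted features.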
16.1.3
Chapter Overview
The following section describes the SGLR methodology in some detail, including
how we cast feature generation as a search in the space of relational database
queries, how cluster relations are created, and how the feature space is searched
dynamically. Section 16.3 then describes two tasks using CiteSeer data which we
use to test SGLR: classifying documents into their publication venues, conferences,
16.2
Detailed Methodology
As described above, SGLR dynamically couples two main components: generation
of feature candidates via a search in the space of queries to a relational database,
and their selection for inclusion in a regression model using statistical model selection criteria. First, we give the high-level SGLR algorithm. Lines in italics are
the parts that do cluster-relation generation. We deliberately leave the stopping
criterion underspecified. Given the incremental nature of model building in SGLR,
deciding when to stop will often depend on the available CPU time and on the
accuracy achieved so far.
1:
2:
3:
4:
5:
6:
7:
8:
16.2.1
The language of nonrecursive first-order logic formulae maps directly into SQL and
relational algebra (see, e.g., [8]). Our implementation uses SQL for efficiency and
connectivity with relational database engines.
Throughout this paper we use the following schema:
cites(FromDoc, ToDoc),
author(Doc, Auth),
published_in(Doc, Venue),
word_count(Doc, Word, Int).
Domains, or types, used here are different from the primitive SQL types. The
specification of these domains in addition to the primitive SQL types is necessary
to guide the search process more efficiently.
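To make the schema concrete, the following sketch builds it in an in-memory SQLite database and evaluates one aggregated query: the average count of a given word in the documents cited by a document. The column name `Cnt`, the toy rows, and the helper names are assumptions made for illustration; the text does not say which database engine or column names the implementation uses.

```python
import sqlite3


def build_demo_db():
    """Create the four relations of the schema and load a few toy tuples."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE cites (FromDoc TEXT, ToDoc TEXT);
        CREATE TABLE author (Doc TEXT, Auth TEXT);
        CREATE TABLE published_in (Doc TEXT, Venue TEXT);
        CREATE TABLE word_count (Doc TEXT, Word TEXT, Cnt INTEGER);
        """
    )
    conn.executemany(
        "INSERT INTO cites VALUES (?, ?)",
        [("d", "a"), ("d", "b"), ("x", "c")],
    )
    conn.executemany(
        "INSERT INTO word_count VALUES (?, ?, ?)",
        [("a", "learning", 4), ("b", "learning", 2),
         ("c", "learning", 10), ("a", "model", 7)],
    )
    return conn


def avg_word_count_in_cited(conn, doc, word):
    """Average count of `word` over documents cited by `doc`.

    Returns None when the query result is empty (SQL AVG over no rows is NULL).
    Cited documents with no row for `word` are skipped, not counted as zero.
    """
    row = conn.execute(
        "SELECT AVG(w.Cnt) FROM cites AS c "
        "JOIN word_count AS w ON w.Doc = c.ToDoc "
        "WHERE c.FromDoc = ? AND w.Word = ?",
        (doc, word),
    ).fetchone()
    return row[0]
```

The None result for an empty table is one place where an aggregate is undefined and needs separate handling before it can be used as a regression feature.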
First-order expressions are treated as database queries resulting in a table of
all satisfying solutions, rather than a single Boolean value. The extended notation
supports aggregation over entire query results or over their individual columns.
Aggregate operators are subscripted with the corresponding variable name if applied
to an individual column, or are used without subscripts if applied to the entire
table. For example, an average count of the word "learning" in documents cited by
document d is denoted as:
author_of(d, Auth).
word_count(d, Word, Int).
cites(d,Doc).
author_of(d, Auth = "smith").
Figure 16.3
Table 16.1
refine(Query: q)
  Qref ← {}
  for each Ri ∈ R (i ∈ [1, n])
    Seq ← {}
    for each Aj ∈ Ri
      for each A ∈ {Ak | Ak ∈ attrib(q)} ∪ {Al | (Al in Ri) ∧ Al ≠ Aj}
        if (type(Aj) = type(A))
          Seq ← Seq ∪ {norm(Aj = A)}
      for each a ∈ dom(Aj)
        Seq ← Seq ∪ {norm(Aj = a)}
    for each S ∈ 2^Seq
      if ∃ (Ai = Aj) ∈ S such that Ai ∈ attrib(q) ∧ Aj ∉ attrib(q)
        Qref ← Qref ∪ {q' | q'.WHERE = q.WHERE ∪ {S} ∧ q'.FROM = q.FROM ∪ {Ri}}
  return Qref
and doc2, if they identify a target observation in the example above). In situations
where generated features can include references to other constants, dom(A) can
include all values of A, or a subset, e.g., the entries with the highest correlation with
the response variable or those above a count cutoff value. The following example of
a query about the target pair <d1, d2> references other constants in the domain
of document IDs; the query is nonempty when both d1 and d2 cite a particular
document d2370:
SELECT DISTINCT *
FROM cites R1, cites R2
WHERE R1.doc1=d1 AND R2.doc1=d2 AND R1.doc2=R2.doc2
AND R2.doc2=d2370
norm(Ai = Aj) alphanumerically orders Ai and Aj to avoid storing in Seq the
equivalent entries Ai = Aj and Aj = Ai. type(A) is the metatype of A, such as
Document in the examples above, rather than an SQL type such as String. The set of
equality conditions in query q is denoted by q.WHERE, e.g., a four-element set
corresponding to the latter query example:
{R1.doc1=d1, R2.doc1=d2, R1.doc2=R2.doc2, R2.doc2=d2370}
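A minimal sketch of the normalization step follows, assuming equality conditions are represented as strings; the actual representation in the implementation is not given in the text. Ordering the two sides before building the condition makes `A = B` and `B = A` collapse to one entry when conditions are stored in a set.

```python
def norm(left, right):
    """Return a canonical string for the equality `left = right`.

    The two sides are sorted as strings, so swapping them yields the same key.
    """
    a, b = sorted((str(left), str(right)))
    return f"{a}={b}"


# Equivalent conditions written in either order are stored only once.
conditions = {
    norm("R1.doc2", "R2.doc2"),
    norm("R2.doc2", "R1.doc2"),
    norm("R2.doc2", "d2370"),
}
```

Python's default string ordering is used here as a stand-in for "alphanumeric" ordering; any fixed total order would serve the same purpose.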
The refinement operator given in table 16.1 takes a query q as argument and
returns the set of its refinements, Qref. Refinement of a given query starts by
picking a relation instance in the database schema (loop starting at line 4). Adding
this relation results in its Cartesian product with the view of q (not included in the
As in predicate calculus, aggregates are not part of the abstract relational languages.
Practical systems, however, implement them as additional features. SQL supports
the use of aggregates which produce real values, rather than the more limited Boolean
features produced by logic-based approaches. Regression modeling makes full use
of these real-valued features.
As we described above, a node in our refinement graph is a query evaluating into a
table. These tables are in turn aggregated by multiple operators to produce features.
We use the aggregate operators common in relational language extensions: count,
ave, max, and min; binary logic-style features are included through the empty
aggregate operator. Aggregate operators are applied to an entire table or to its
columns, as appropriate given type restrictions, e.g., ave cannot be applied to a
column of a categorical type. When aggregate operators are not defined, e.g., the
average of an empty set, we use an interaction with a 1/0 (defined/not-defined)
indicator variable. Table 16.2 presents pseudocode of the aggregation procedure at
each search node (called for each observation i).
The use of aggregate operators in feature generation complicates pruning of the
search space. We use a hash function of partially evaluated feature columns to
avoid fully recomputing equivalent features. In general, determining equivalence
among relational expressions is known to be NP-complete, although polynomial
algorithms exist for restricted classes of expressions; see, e.g., [3, 22]. Equivalence
determination based on the homomorphism theorem for tableau query formalism,
essentially the class of conjunctive queries we consider before aggregation, is given
in [1]. Optimizations could be done by better avoiding generation of equivalent
queries. Children nodes in the refinement graph can, of course, reuse evaluations
Table 16.2
aggregate(View: v)
  // v is the evaluation of a search node query per observation i
  F ← {}
  // A is a set of aggregate operators
  for each Aggri ∈ A (i ∈ [1, n])
    // applicability of Aggri is determined by typing,
    // e.g., average cannot be applied to a categorical column
    if (defined(Aggri(v)))
      F ← F ∪ {Aggri(v)}
    for each column C ∈ v
      if (defined(Aggri(C)))
        F ← F ∪ {Aggri(C)}
  return F
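The aggregation procedure can be sketched in Python as follows. This is an illustrative reading, not the authors' code: the view is assumed to be a list of row dictionaries, the caller is assumed to name the numeric columns (standing in for the typing check), and undefined aggregates are encoded as a zero value paired with a 0/1 `defined_` indicator, following the description of the defined/not-defined indicator variable above.

```python
def aggregate(view, numeric_columns):
    """Turn one evaluated query result into scalar feature candidates.

    view: list of dict rows returned for a single observation.
    numeric_columns: column names to which ave/max/min may be applied.
    """
    features = {
        "count": float(len(view)),           # table-level aggregate
        "exists": 1.0 if view else 0.0,      # the "empty" (Boolean) aggregate
    }
    for col in numeric_columns:
        values = [row[col] for row in view]
        defined = bool(values)
        # Indicator lets the regression distinguish "undefined" from a true 0.
        features[f"defined_{col}"] = 1.0 if defined else 0.0
        features[f"ave_{col}"] = sum(values) / len(values) if defined else 0.0
        features[f"max_{col}"] = float(max(values)) if defined else 0.0
        features[f"min_{col}"] = float(min(values)) if defined else 0.0
    return features
```

For example, a view with two rows whose `Cnt` values are 4 and 2 yields a count of 2, an average of 3, a maximum of 4, and a minimum of 2, while an empty view yields zeros with the indicator set to 0.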
16.3
Experimental Evaluation
16.3.1
Table 16.3
Relation                                                          Size
PublishedIn(doc:Document, vn:Venue)                             60,646
Author(doc:Document, auth:Person)                              131,582
Citation(from:Document, to:Document)                           173,410
HasWord(doc:Document, word:Word)                             6,894,712
ClusterDocumentsByAuthors(doc:Document, clust:Clust0)           53,660
ClusterAuthorsByDocuments(auth:Person, clust:Clust1)            26,740
ClusterDocumentsByCitingDocuments(doc:Document, clust:Clust2)   31,603
ClusterDocumentsByCitedDocuments(doc:Document, clust:Clust3)    42,749
ClusterDocumentsByWords(doc:Document, clust:Clust4)             56,104
ClusterWordsByDocuments(word:Word, clust:Clust5)                 1,000
Cluster Creation
We use k-means (e.g., see [15]) to derive cluster relations; any other hard clustering
algorithm could also be used. The results of clustering are represented by binary
relations of the form <ClusteredEntity, ClusterID>.
Each many-to-many relation in the original schema can produce two distinct
cluster relations (e.g., clusters of words by documents or of documents by words).
Three out of the four relations in the schema presented above are many-to-many
(PublishedIn is not); this results in six new cluster relations. Since the
PublishedIn relation does not produce new clusters, nothing needs to be done
to exclude the attributes of entities in the venue prediction training and test sets
from participating in clustering. In link prediction, on the other hand, the relation
corresponding to the target concept, Citation, does produce clusters, so in this
case clustering is run without the links sampled for training and test sets.
k-means clustering requires the selection of k, the number of groups into which
the entities are clustered. In the experiments presented here we fix k equal to 100 in
all cluster relations except in ClusterWordsByDocuments, where only ten clusters
were used because there are roughly an order of magnitude fewer clustered words
than authors or documents (the vocabulary was limited to 1000 words).
The accuracy of resulting cluster-based models reported below could potentially be
improved if one is willing to incur the cost of generating clusters with different
values of k and testing the resulting features for model inclusion. One could also
generate clusters from the rest of the tables generated as the space of queries is
searched. For simplicity, we used only the first six such cluster relations. Table 16.3
summarizes the sizes of the four original and the six derived cluster relations.
For clustering, we use the tf-idf vector-space cosine similarity [28]. The measure
was originally designed for document similarity using word features, but we apply
it here to broader types of data. In the formulae below, d stands for any object we
want to cluster, and w are the attributes used to cluster d. For example, authors
d can be clustered using the documents w they write. Below we refer to d's as
documents and w's as words.
Each document $d$ is viewed as a vector whose dimensions correspond to the words
$w$ in the vocabulary; the vector elements are the tf-idf weights of the corresponding
words, where $\mathit{tfidf}(w, d) = \mathit{tf}(w, d) \cdot \mathit{idf}(w)$. In the original formulation, term
frequency $\mathit{tf}(w, d)$ is the number of times $w$ occurs in $d$. In the experiments
reported here we use binary $\mathit{tf}$ indicating whether or not $w$ occurs in $d$ (see footnote 2). Inverse
document frequency is $\mathit{idf}(w) = \log \frac{|D|}{\mathit{df}(w)}$, where $|D|$ is the number of documents in
a collection and $\mathit{df}(w)$ is the number of documents in which word $w$ occurs at least
once.
The similarity between two documents is then
\[
\mathit{sim}(d_i, d_j) = \frac{d_i \cdot d_j}{\|d_i\| \, \|d_j\|}.
\]
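The similarity measure can be sketched directly from these definitions. The code below assumes binary tf, the natural logarithm, and a sparse dictionary representation of vectors; the base of the logarithm and the data structures are assumptions, and the function names are hypothetical.

```python
import math


def tfidf_vectors(docs):
    """docs: dict mapping an object id to the set of attributes it contains.

    With binary tf, the weight of w in d is idf(w) = log(|D| / df(w)) if w
    occurs in d, and 0 otherwise (zero entries are simply not stored).
    """
    n = len(docs)
    df = {}
    for words in docs.values():
        for w in words:
            df[w] = df.get(w, 0) + 1
    idf = {w: math.log(n / count) for w, count in df.items()}
    return {d: {w: idf[w] for w in words} for d, words in docs.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(x * v[w] for w, x in u.items() if w in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

The same functions apply unchanged when the "documents" are authors and the "words" are the papers they wrote, which is how the measure is reused for the other cluster relations.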
We compare models learned from the feature space generated from the four original
noncluster relations with the models learned from the original four relations plus
six derived cluster relations (clustersNO and clustersYES models). Models are
learned with sequential feature selection using BIC [29], i.e., once each feature is
generated, it is added to the model permanently if the BIC-penalized error improves,
or is permanently dropped otherwise.
We use ten-fold reverse cross-validation to measure accuracy improvement from
using cluster relations. All observations are split equally into ten sets. Each of
the sets is used to train a model. Each of the models is tested on the remaining
90% of observations. This results in ten values per each tested level, which are
used to derive error bounds. In venue prediction, there are 10,000 observations:
5000 positive examples of <Document,Venue> target pairs uniformly sampled from
the relation PublishedIn, and 5000 negative examples where the document is
uniformly sampled from the remaining documents, and the venue is uniformly
2. We use binary tf for consistency with the relation HasWord; we do not use counts in
computing similarities since the original relation HasWord contains binary word occurrence
data. Other derived cluster relations use naturally binary attributes.
clustersNO
70
accuracy
70
clustersYES
clustersYES
clustersNO
50
60
60
accuracy
80
80
90
50
466
500
1000
1500
2000
# of features considered
2500
3000
3500
500
1000
1500
2000
2500
3000
3500
# of features considered
sampled from the domain of venues other than the true venue of the document.
Positive example pairs are removed from the background relation PublishedIn, as
well as the tuples involving documents sampled for the negative set. The size of the
background relation PublishedIn decreases by 10,000 after removing training and
test set tuples. In link prediction, the total number of observations is 5000: 2500
positive examples of <Document,Document> target pairs uniformly sampled from
the Citation relation, and 2500 negative examples uniformly sampled from empty
links in the citation graph. Positive example pairs are removed from the background
relation Citation. The size of the background relation Citation decreases by 2500,
the number of sampled positive examples.
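The reverse cross-validation split described above (train on one tenth, test on the remaining nine tenths) can be sketched as below. The function name, the shuffling, and the seed are assumptions made for illustration; the source does not say how observations were assigned to sets.

```python
import random

def reverse_cv_splits(n_obs, k=10, seed=0):
    """Yield (train, test) index lists: each fold trains, the other k-1 folds test."""
    idx = list(range(n_obs))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k near-equal folds
    for i, train in enumerate(folds):
        test = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

splits = list(reverse_cv_splits(10_000))
```

With 10,000 venue-prediction observations this gives ten models, each trained on 1000 observations and tested on 9000.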
A total of 3500 candidate features are considered in training each model. A numeric signature
of partially evaluated features is maintained to avoid fully generating numerically
equivalent features; note that this is different from avoiding syntactically equivalent
nodes of the search space: two different queries can produce numerically equivalent
feature columns, e.g., all zeros. Such repetition becomes common when feature
generation progresses deeper in the search space.
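One way to picture a numeric signature is a hash over the leading values of a feature column. The sketch below is an assumption about how such a check could work, not the authors' mechanism; `signature`, `unique_columns`, and the `prefix` parameter are hypothetical names.

```python
import hashlib
import struct

def signature(column, prefix=None):
    """Hash the first `prefix` values of a column (all values if prefix is None)."""
    head = list(column) if prefix is None else list(column)[:prefix]
    packed = struct.pack(f"{len(head)}d", *(float(v) for v in head))
    return hashlib.sha1(packed).hexdigest()

def unique_columns(candidates, prefix=None):
    """Keep the first feature seen for each signature; skip numeric repeats."""
    seen, kept = set(), []
    for name, column in candidates:
        sig = signature(column, prefix)
        if sig in seen:
            continue  # numerically equivalent to an earlier feature
        seen.add(sig)
        kept.append(name)
    return kept

kept = unique_columns([("q1", [0, 0, 0]), ("q2", [1, 0, 1]), ("q3", [0.0, 0.0, 0.0])])
```

A signature over only a prefix is cheaper but can wrongly merge columns that differ later, so a real system would need a policy for confirming matches.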
Figure 16.4 presents test accuracy learning curves for models learned with and
without cluster relations in venue prediction and link prediction respectively. Curve
coordinates are averages over the runs in ten-fold cross validation. The learning
curves show test-set accuracy changing with the number of features, in intervals of
250, generated and sequentially considered for model selection from the training set.
The average test set accuracy of the cluster-based models after exploring the entire
feature stream is 87.2% in venue prediction and 93.1% in link prediction, which is,
respectively, 4.75 and 3.22 percentage points higher than the average accuracy of
the models not using cluster relations.
Figure 16.5 Mean accuracy difference, accuracy(clustersYES) - accuracy(clustersNO), with
95% confidence intervals (bounds based on N = 10 points, t-distribution). Left: venue
prediction. Right: link prediction. Horizontal axis: number of features considered.
Figure 16.5 presents 95% confidence intervals of the difference in mean test accuracies
of clustersYES and clustersNO models in venue prediction and link prediction,
respectively. In venue prediction, after exploring approximately half of the
feature stream, the improvement in accuracy by the cluster-based models is
statistically significant at the 95% confidence level according to the t-test (confidence
intervals do not intersect y = 0). Early in feature generation, when considering
streams of about 1000 features, cluster-based models perform significantly
worse: at this phase, additional cluster-based features, while not yet significantly
improving accuracy, delay the discovery of significant noncluster-based features.
In link prediction, while the significance of the improvement from cluster-based
features is reduced early in the stream, it increases continuously throughout
the rest of the stream. At the end of the stream the improvement in accuracy of the
cluster-based model is 3.22 percentage points, statistically significant at the 99.8%
confidence level. The highest accuracies (after seeing 750 features by clustersNO
and after seeing 3500 features by clustersYES) also differ statistically: the
accuracy improvement in cluster-based models is 1.49 percentage points, significant at
the 99.9% confidence level.
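As a rough illustration of how such an interval can be derived from ten runs, the sketch below computes a t-based 95% interval for the mean per-run accuracy difference. Treating the runs as paired is an assumption, and the accuracy values are invented for the example; 2.262 is the two-sided 95% critical value of Student's t with 9 degrees of freedom.

```python
import math
import statistics

T_CRIT_95_DF9 = 2.262  # two-sided 95% critical value, Student's t, 9 degrees of freedom

def mean_diff_ci(acc_yes, acc_no, t_crit=T_CRIT_95_DF9):
    """Return (low, high) bounds for the mean of the per-run accuracy differences."""
    diffs = [a - b for a, b in zip(acc_yes, acc_no)]
    mean = statistics.mean(diffs)
    half_width = t_crit * statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean - half_width, mean + half_width

# Invented accuracies for ten runs, for illustration only.
acc_yes = [0.87, 0.88, 0.86, 0.88, 0.87, 0.89, 0.88, 0.87, 0.88, 0.87]
acc_no = [0.83] * 10
low, high = mean_diff_ci(acc_yes, acc_no)
significant = low > 0 or high < 0  # interval does not intersect y = 0
```

The default critical value only fits N = 10; other run counts need the matching t quantile passed in as `t_crit`.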
The average number of features selected in ten clustersYES models is 32.0 in
venue prediction and 32.3 in link prediction, respectively; 27.9 and 31.8 features on
average were selected into clustersNO models from equally many feature candidates
(3500). The BIC penalty used here allows a small amount of overfitting (see
figure 16.4); more recent penalty methods such as SFS [33] avoid this problem.
The improved accuracy of the cluster-based model in venue prediction comes
mostly from a single cluster-based feature. This feature was selected in all
cross-validation runs. It is a binary cluster-based feature which is "on" for a target
document/venue pair <D,V> if a document D1 exists in the cluster where D belongs
such that D1 is published in the same venue as D. Using a logic-based notation, the
Model
size[publishedIn( , V )]
exists[cites(D, D1), publishedIn(D1, V )]
exists[cites(D1, D), publishedIn(D1, V )]
exists[cites(D, D2), cites(D1, D2), publishedIn(D1, V )]
exists[author(D, A), author(D1, A), publishedIn(D1, V )]
both
both
both
both
both
clustersNO
3. Note that D1 is always distinct from D as the tuple with publication venue of document
D is removed from the background relation PublishedIn.
16.3.4
Up to this point, we presented models learned using breadth-first search of
the feature space. In this section we explore an alternative search strategy in which
separate streams are used to generate queries (and hence features), and new queries
are preferentially selected from those streams which have been most productive of
useful features. The database query evaluation used in feature generation dominates
the computational cost of our statistical relational learning methodology; thus,
intelligently deciding which queries to evaluate can have a significant effect on total
cost.
Feature generation in the SGLR framework consists of two steps: query expression
generation and query evaluation. The former is cheap, as it involves only syntactic
operations on query strings; the latter is computationally demanding. The experiment
is set up to test two strategies which differ in the order in which queries
are evaluated. In both strategies, query expressions are generated by breadth-first
search. The baseline, static strategy evaluates queries in the same order the
expressions appear in the search queue. The alternative, dynamic strategy enqueues
each query into one of several separate streams at the time its expression is
generated, and chooses the next feature to be evaluated from the stream with the
highest ratio:
(featuresAdded + 1)/(featuresTried + 1),
where featuresAdded is the number of features selected for addition to the model,
and featuresTried is the total number of features tried by feature selection in
this stream. Many other ranking methods could be used; this one has the virtue
of being simple and, for the realistic situation in which the density of predictive
features tends to decrease as one goes far into a stream, complete.
In the tests below, we use two streams. The first stream contains queries with
aggregate operators exists and count over the entire table. The second stream
contains features which are the counts of unique elements in individual columns.
We stop the experiment when one of the streams is exhausted.
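The dynamic strategy can be sketched as a loop that always draws from the stream with the highest (featuresAdded + 1)/(featuresTried + 1) ratio and stops when any stream runs dry. The `Stream` class, `dynamic_order`, and the `is_selected` callback (standing in for query evaluation plus feature selection) are hypothetical names for illustration; ties go to the stream listed first, which is an assumption.

```python
from collections import deque

class Stream:
    def __init__(self, name, queries):
        self.name = name
        self.queue = deque(queries)
        self.added = 0   # features selected into the model from this stream
        self.tried = 0   # features evaluated from this stream

    def ratio(self):
        return (self.added + 1) / (self.tried + 1)

def dynamic_order(streams, is_selected):
    """Return the order in which queries are evaluated under the dynamic strategy."""
    order = []
    while all(s.queue for s in streams):         # stop once any stream is exhausted
        best = max(streams, key=Stream.ratio)    # first stream wins ties
        query = best.queue.popleft()
        best.tried += 1
        if is_selected(query):                   # stands in for evaluation + selection
            best.added += 1
        order.append(query)
    return order

streams = [Stream("A", ["a1", "a2", "a3"]), Stream("B", ["b1", "b2", "b3"])]
order = dynamic_order(streams, is_selected=lambda q: q.startswith("b"))
```

In this toy run stream B is the only productive one, so after a single probe of stream A every later query is drawn from B.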
We report the difference in test-set accuracy between dynamic and static feature
generation. In each of four data sets, the difference in accuracy is plotted against the
number of features evaluated and considered by feature selection. We also kept track
of the CPU time required for each of these cases. The MySQL database engine was
used. Data sets 1, 2, and 3 took roughly 20,000 seconds, while data set 4 took 40,000
seconds. In all four cases, plots of accuracy vs. CPU time used were qualitatively
similar to the plots shown in figure 16.6.
In the experiments presented here, one of the two feature streams was a clear
winner, suggesting the heuristic for splitting features was effective. When the choice of
a good heuristic is difficult, dynamic feature generation, in the worst case, will
split features into equally good streams, and will asymptotically lead to the
same expected performance as the static feature generation by taking features from
different streams with equal likelihood.
[Figure 16.6: four panels plotting the accuracy difference (accuracyDynamic - accuracyStatic) against the number of features considered.]
16.4
16.5
Conclusion
We presented structural generalized linear regression and used its logistic regression
variant for analyzing document data from CiteSeer. SGLR combines the strengths
of generalized linear regression modeling (e.g., linear, logistic, and Poisson) with
the higher expressivity of features automatically generated from relational data
sources. New, potentially predictive features and relations in the database are
generated lazily, and selected with statistically rigorous criteria derived from the
regression model being built. SGLR is applicable to large domains with complex,
sparse and noisy data sources; these characteristics suggest focused, dynamic feature
generation from rich feature spaces, regression modeling, rigorous feature selection,
and the use of query and statistical optimizations, all of which contribute to the
expressivity, accuracy, and scalability of SGLR.
SGLR is attractive in offering a factored architecture which allows one to plug in
any additive statistical modeling tool and its corresponding feature selection
criterion. This contrasts with recursive subdivision methods in which one cannot easily
separate out search from modeling and feature selection. The factored architecture
offers many advantages, including support for dynamic feature selection.
We showed how clustering can be used to derive new concepts and relations
which augment the database schema used in the automatic generation of predictive
features in statistical relational learning. Clustering improves scalability
through dimensionality reduction. More importantly, entities derived from clusters
increase the expressivity of feature spaces by creating new first-class concepts
which contribute to the creation of new features in more complex ways.
For example, in CiteSeer, papers can be clustered based on words, giving topics.
Associated with each cluster (or concept) is a cluster relation (e.g.,
on_topic) which then becomes part of more complex feature expressions such as
exists[publishedIn(D1, V), on_topic(D, C), on_topic(D1, C)]. Such richer features
result in more accurate models than those built only from the original relational
concepts.
We also showed that dynamically deciding which features to generate can lead to
the discovery of predictive features with substantially less computation than
generating all features in advance, as done, for example, in propositionalization. Native
statistical feature selection criteria can give run-time feedback for determining the
order in which features are generated. Coupling feature generation to model
construction can significantly reduce computational costs. Some ILP systems, such as
Progol, also perform dynamic feature generation, albeit with logic models. Many
problem domains should benefit from SGLR or similar methods, including
modeling of social networks, bioinformatics, disclosure control in statistical databases,
and modeling of other hyperlinked domains, such as the web and databases of
patents and legal cases.
References
[1] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley,
Boston, 1995.
[2] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace
clustering of high dimensional data for data mining applications. In Proceedings
of ACM International Conference on Management of Data, 1998.
[3] A. V. Aho, Y. Sagiv, and J. D. Ullman. Equivalences among relational
expressions. SIAM Journal of Computing, 8(2):218-246, 1979.
[4] H. Akaike. Information theory and an extension of the maximum likelihood
principle. In Second International Symposium on Information Theory, 1973.
[5] A. Bernstein, S. Clearwater, and F. Provost. The relational vector-space model
and industry classification. In IJCAI Workshop on Learning Statistical Models
from Relational Data, 2003.
[6] H. Blockeel and L. Dehaspe. Cumulativity as inductive bias. In Workshop on
Data Mining, Decision Support, Meta-Learning and ILP at PKDD, 2000.
[7] H. Blockeel and L. De Raedt. Top-down induction of logical decision trees.
Artificial Intelligence, 101(1-2):285-297, 1998.
[8] S. Ceri, G. Gottlob, and L. Tanca. Logic Programming and Databases.
Springer-Verlag, Berlin, 1990.
[9] L. Dehaspe. Maximum entropy modeling with clausal constraints. In Proceedings
of the International Conference on Inductive Logic Programming, 1997.
[10] L. Dehaspe and H. Toivonen. Discovery of frequent datalog patterns. Data
Mining and Knowledge Discovery, 3(1):7-36, 1999.
[11] S. Dzeroski and N. Lavrac. An introduction to inductive logic programming.
In Saso Dzeroski and Nada Lavrac, editors, Relational Data Mining, pages
48-73. Springer-Verlag, Berlin, 2001.
Statistical relational learning (SRL) algorithms model joint probability distributions
over relational databases. However, current SRL techniques that operate on
databases are restricted to using only the fields and tables already in the database.
Yet, database users often define additional fields or tables, known as views, that can
be computed from the existing ones. We augment SRL algorithms by adding the
ability to learn new fields. We present two different approaches to view learning.
First, we develop a two-step approach where we search for all views of interest and
then build a statistical model incorporating the selected views. Second, we describe
SAYU-View, which integrates the view generation and model building steps. We
motivate view learning in the context of creating an expert system for mammography.
We show that view learning significantly improves the performance of the
expert system.
17.1
Introduction
Statistical relational learning (SRL) focuses on algorithms for learning statistical
models from relational databases. SRL advances beyond Bayesian network learning
and related techniques by handling domains with multiple tables, by representing
relationships between different rows of the same table, and by integrating data from
several distinct databases. Currently, SRL techniques can learn joint probability
distributions over the fields of a relational database with multiple tables.
Nevertheless, SRL techniques are constrained to use only the tables and fields already
in the database, without modification. In contrast, many human users of relational
databases find it beneficial to define alternative views of a database: further fields
or tables that can be computed from existing ones. This chapter shows that SRL
algorithms also can benefit from the ability to define new views. Namely, it shows
that view learning can be used for more accurate prediction of important fields in
the original database.
We augment SRL algorithms by adding the ability to learn new fields, intensionally
defined in terms of existing fields and intensional background knowledge. In
database terminology, these new fields constitute a learned view of the database.
We use inductive logic programming (ILP) to learn rules which intensionally define
the new fields. We present two different methods to accomplish this goal. The first
is a two-step approach where we search for all views of interest. This process is
expensive and does not necessarily guarantee selecting the most useful view. The
second framework, which we refer to as SAYU-View, has a tighter coupling between
view generation and view usage. Our results show that view learning can result in
significant benefits.
We present view learning in the specific application of creating an expert system
in mammography. We chose this application for a number of reasons. First, it is an
important practical application where there has been recent progress in collecting
sizable amounts of data. Second, we have access to an expert-developed system.
This provides a base reference against which we can evaluate our work [3]. Third, a
large proportion of examples are negative. This distribution skew is often found in
multi-relational applications. Last, our data consists of a single table. This allows
us to compare our techniques against standard propositional learning. In this case,
it is sufficient for view learning to extend an existing table with new fields, achieved
by using ILP to learn rules for unary predicates. For other applications, it may be
desirable to learn predicates of higher arity, which will correspond to learning a
view with new tables rather than new fields only.
17.2
[Figure 17.1: network diagram; image and caption not recovered. Surviving node labels include Disease, Age, HRT, Family hx, Breast Density, Tubular Density, Asymmetric Density, Architectural Distortion, LN, Skin Lesion, Mass Shape, Mass Size, Mass Margins, Mass Density, Mass Stability, Mass P/A/O, and Ca++ (calcification) nodes: Lucent Centered, Milk of Calcium, Dystrophic, Pleomorphic, Eggshell, Punctate, Popcorn, Fine/Linear, Round, Dermal, Amorphous, Rod-like.]
Table 17.1
Patient  Abnormality  Date  Mass Shape  ...  Mass Size  Location  Be/Mal
P1       1            5/02  Spic        ...  0.03       RU4
P1       2            5/04  Var         ...  0.04       RU4
P1       3            5/04  Spic        ...  0.04       LL4
...      ...          ...   ...         ...  ...        ...       ...
Figure 17.2 Hierarchy of learning types. Levels 1 and 2 are available through
ordinary Bayesian network learning algorithms, level 3 is available only through
state-of-the-art SRL techniques, and level 4 is described in this chapter.
reference to other rows in the table for the given patient, as well as intensional
background knowledge to define concepts such as "increases over time." Neither
rule can be captured by standard aggregation of existing fields.
Note that level 3 and level 4 learning would not be necessary if the database
initially contained all the potentially useful fields capturing information from other
relevant rows or tables. For example, the database might be initially constructed
to contain fields such as "slope of change in abnormality size at this location over
time," "average abnormality size on this mammogram," and so on. If humans can
identify all such potentially useful fields beforehand and
define views containing these, then level 3 and level 4 learning are unnecessary.
Nevertheless, the space of such possibly useful fields is quite large, and perhaps more
easily searched by computer via level 3 and level 4 learning. Certainly in the case of
the National Mammography Database standard [1], such fields were not available
because they had not been defined and populated in the database by the domain
experts, thus making level 3 and level 4 learning potentially useful.
17.3
17.4
Initial Experiments
The purposes of the experiments we conducted are twofold. First, we want to
determine if using SRL yields an improvement compared to propositional learning.
Second, we want to evaluate whether we see an improvement when moving up
a level in the hierarchy outlined in figure 17.2. We begin by learning a structure
with just the original attributes (level 2) and checking whether that performs better than using
the expert structure with trained parameters (level 1). Next, we add aggregate
features to our network, representing summaries of abnormalities found either in a
particular mammogram or for a particular patient. This corresponds to level 3, and
we test whether this improves over levels 1 and 2. Finally, we investigate doing level
4 learning through the two-step algorithm and compare its performance to levels 1
through 3.
We experimented with a number of structure learning algorithms for Bayesian
networks, including naive Bayes, tree-augmented naive (TAN) Bayes [11], and the
sparse candidate algorithm [13]. However, we obtained the best results with the
TAN algorithm in all experiments, so we will focus our discussion on TAN. In a
TAN network, each attribute can have at most one other parent in addition to
the class variable. The TAN model can be constructed in polynomial time with
a guarantee that the model maximizes the log-likelihood of the network structure
given the data set [14, 11].
17.4.1
We collected data for all screening and diagnostic mammography examinations that
were performed at the Froedtert and Medical College of Wisconsin Breast Imaging
Center between April 5, 1999 and February 9, 2004. It is important to note that
the data consists of a radiologist's interpretation of a mammogram and not the
raw image data. The radiologist reports conformed to the National Mammography
Database (NMD) standard established by the American College of Radiology. From
these reports, we followed the original network [3] to cull the 36 features deemed
to be relevant by coauthor Burnside, an expert mammographer.
To evaluate and compare these approaches, we used stratified ten-fold
cross-validation. We randomly divided the abnormalities into ten roughly equal-sized
sets, each with approximately one-tenth of the malignant abnormalities and
one-tenth of the benign abnormalities. When evaluating just the structure learning and
aggregation, nine folds were used for the training set. When performing aggregation,
we used binning to discretize the created features. We took care to only use the
examples from the training set to determine the bin widths. When performing view
learning, we had two steps in the learning process. In the first part, four folds of
data were used to learn the ILP rules. The remaining five folds were used to learn
the Bayes net structure and parameters.
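The fold layout for view learning can be sketched as below: ten stratified folds, with one held out for testing, four used for rule learning, and five for the Bayes net. Which four folds go to the rule learner is not stated in the text, so taking the first four of the remaining nine is an assumption, as are the function names and the toy label vector.

```python
import random

def stratified_folds(labels, k=10, seed=0):
    """Deal the indices of each class round-robin into k folds after shuffling."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

def two_step_partition(folds, test_fold, n_rule_folds=4):
    """Split the nine training folds into a rule-learning part and a Bayes-net part."""
    rest = [f for j, f in enumerate(folds) if j != test_fold]
    rule_part = [i for f in rest[:n_rule_folds] for i in f]
    net_part = [i for f in rest[n_rule_folds:] for i in f]
    return rule_part, net_part, list(folds[test_fold])

labels = [1] * 50 + [0] * 950            # toy skewed data: 1 = malignant, 0 = benign
folds = stratified_folds(labels)
rule_part, net_part, test_part = two_step_partition(folds, test_fold=0)
```

Keeping the rule-learning and network-learning folds disjoint means the network's parameters are not estimated on the same examples that shaped the rules.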
Patient  Abnormality  Date  Mass Shape  ...  Mass Size  Location  Average Patient Mass Size  Average Mammogram Mass Size  Be/Mal
P1       1            5/02  Spic        ...  0.03       RU4       0.0367                     0.03
P1       2            5/04  Var         ...  0.04       RU4       0.0367                     0.04
P1       3            5/04  Spic        ...  0.04       LL4       0.0367                     0.04
...      ...          ...   ...         ...  ...        ...       ...                        ...                          ...
Table 17.2 Database after aggregation on Mass Size field. Note the addition of
two new fields, Average Patient Mass Size and Average Mammogram Mass Size,
which represent aggregate features.
patient level. The second key is the combination of patient ID and mammogram
date, which returns all abnormalities for a patient on a specific mammogram,
providing aggregation on the mammogram level. To demonstrate this process, we
will work through an example of computing an aggregate feature for patient 1 in
the database given in table 17.1. We will aggregate on the Mass Size field and use
average as the aggregation function. Patient 1 has three abnormalities, one from a
mammogram in May 2002 and two from a mammogram in May 2004. To calculate
the aggregate on the patient level, we average the size for all three abnormalities,
which is 0.0367. To find the aggregate on the mammogram level for patient 1, we have
to perform two separate computations. First, we follow the link P1 and 5/02, which
yields abnormality 1. The average for this mammogram is simply 0.03. Second,
we follow the link P1 and 5/04, which yields abnormalities 2 and 3. The average
for these abnormalities is 0.04. Table 17.2 shows the database following construction
of these aggregate features.
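The worked example above can be reproduced with a small grouping routine. The dictionary field names (`patient`, `date`, `mass_size`, and the two `avg_*` targets) and the helper `add_average` are hypothetical; only the three Mass Size values, patient ID, and dates come from the tables.

```python
from collections import defaultdict

rows = [
    {"patient": "P1", "date": "5/02", "mass_size": 0.03},
    {"patient": "P1", "date": "5/04", "mass_size": 0.04},
    {"patient": "P1", "date": "5/04", "mass_size": 0.04},
]

def add_average(rows, key_fields, source, target):
    """Attach to every row the mean of `source` over all rows sharing its key."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[f] for f in key_fields)].append(row[source])
    for row in rows:
        values = groups[tuple(row[f] for f in key_fields)]
        row[target] = sum(values) / len(values)
    return rows

# Patient-level key: patient ID. Mammogram-level key: patient ID plus date.
add_average(rows, ["patient"], "mass_size", "avg_patient_mass_size")
add_average(rows, ["patient", "date"], "mass_size", "avg_mammogram_mass_size")
```

The two calls differ only in the key, which mirrors how the patient-level and mammogram-level aggregates are defined in the text.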
Level 4: View learning We used the ILP system Aleph [35] to implement level
4 learning. Aleph was asked to learn rules predictive of malignancy. We introduced
three new intensional tables into Aleph's background knowledge to take advantage
of relational information.
1. The prior Mammogram relation connects information about any prior abnormality
that a given patient may have.
2. The same Location relation is a specification of the previous predicate. It adds
the restriction that the prior abnormality must be in the same location as the
current abnormality. Radiology reports include information about the location of
abnormalities.
3. The in Same Mammogram relation incorporates information about other
abnormalities a patient may have on the current mammogram.
By default, Aleph is set up to generate rules that would fully explain the
examples. In contrast, our goal was to extract rules that would be beneficial as
new views. The major problem in implementing level 4 learning was how to select
rules that would best complement level 3 information. Clearly, Aleph's standard
coverage algorithm was not designed for this application. Instead, we chose to first
enumerate as many rules of interest as possible, and then choose interesting rules.
In order to obtain a varied set of rules, we ran Aleph under induce_max for
each fold. induce_max uses every positive example in each fold as a seed for the
search. Also note that induce_max does not discard previously covered examples
when scoring a new clause. Several thousand distinct rules were learned for each
fold, with each rule covering many more malignant cases than (incorrectly covering)
benign cases. We avoid the rule overfitting found by other authors [24] by doing
breadth-first search for rules and by setting a minimum limit on coverage.
Each seed generated anywhere from zero to tens of thousands of rules. Adding
all rules would mean introducing thousands of often redundant features. We
implemented the following algorithm:
1. We scanned all rules looking for duplicates and for rules that performed worse
than a more general rule. This step significantly reduced the number of rules to
consider.
2. We sorted rules according to their m-estimate.
3. We used a greedy algorithm that picks the rule with the highest m-estimate such
that it covers an unexplained training example. Furthermore, each rule needs to
cover a significant number of malignant cases. This step is similar to the standard
ILP greedy covering algorithm, except that we do not follow the original order of
the seed examples.
4. Last, we scanned the remaining rules, selecting those that covered a significant
number of examples, and that were different from all previous rules, even though
these rules would not cover any new examples.
It is important to note that the rule selection was an automated process. We
picked the top fifty clauses in our experiments, a number chosen from practical
considerations on the size of the Bayesian networks we would need to learn. The resulting
views were added as new features to the database.
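The four selection steps can be sketched roughly as below. Several simplifications are assumptions, not the authors' procedure: step 1 only removes rules with identical coverage (the generality check is omitted), "significant number" is modeled by a hypothetical `min_pos` threshold, and the m-estimate uses the common form (p + m * prior) / (p + n + m) with an assumed m = 2.

```python
def m_estimate(pos, neg, prior, m=2.0):
    """m-estimate of rule accuracy: (p + m * prior) / (p + n + m)."""
    return (pos + m * prior) / (pos + neg + m)

def select_rules(rules, prior, min_pos=2, limit=50):
    """rules: list of (name, covered_positive_ids, covered_negative_ids)."""
    # Step 1 (simplified): drop rules whose coverage duplicates an earlier rule.
    unique = {}
    for name, pos, neg in rules:
        unique.setdefault((frozenset(pos), frozenset(neg)), name)
    # Step 2: sort by m-estimate, best first (ties keep input order).
    ranked = sorted(
        ((name, set(pos), set(neg)) for (pos, neg), name in unique.items()),
        key=lambda r: m_estimate(len(r[1]), len(r[2]), prior),
        reverse=True,
    )
    # Step 3: greedy pass, keeping rules that explain a not-yet-covered positive.
    chosen, explained, leftover = [], set(), []
    for name, pos, _neg in ranked:
        if len(pos) >= min_pos and pos - explained:
            chosen.append(name)
            explained |= pos
        else:
            leftover.append((name, pos))
    # Step 4: add remaining rules with enough coverage, even if nothing new is covered.
    chosen += [name for name, pos in leftover if len(pos) >= min_pos]
    return chosen[:limit]

rules = [
    ("r1", {1, 2, 3}, set()),
    ("r2", {1, 2, 3}, set()),   # duplicate coverage of r1
    ("r3", {3, 4}, {9}),
    ("r4", {1, 2}, {9}),        # covers nothing new after r1
    ("r5", {5}, set()),         # too few positives
]
selected = select_rules(rules, prior=0.1)
```

Because the m-estimate of a rule does not change as examples become explained, one pass over the sorted list gives the same result as repeatedly picking the best remaining rule.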
17.4.3
Results
We present the results of our first experiment, comparing levels 1 and 2, using
both ROC and precision-recall curves. Figure 17.3 shows the ROC curve for these
experiments, and figure 17.4 shows the precision-recall curves. Because of our
skewed class distribution, due to the large number of benign cases, we prefer
precision-recall curves over ROC curves because they better show the number of
Figure 17.5
Figure 17.6
17.5
17.6
Figure 17.7
as a rule combiner only, not as a tool for view learning that adds fields to the
existing set of fields (features) in the database [8]. We have modified SAYU to take
advantage of the predefined features, yielding a more integrated approach to view
learning. We also report on a more natural design where SAYU starts from the
level 3 network. We call this approach SAYU-View. Figure 17.7 gives pseudocode
for the SAYU-View algorithm.
Figure 17.8
Within SAYU, the time to score a rule has increased. The Bayes net algorithm
has to learn a new network topology and new parameters each time we score a rule
(feature). Furthermore, inference must be performed to compute the score after
incorporating a new feature. The SAYU algorithm is strictly more expensive than
standard ILP, as SAYU also has to prove whether a rule covers each example in
order to create the new feature. To reflect the added cost, we use a time-based stop
criterion for the new algorithm. This criterion is described in further detail in [8].
For each fold, we use the times from the baseline experiments in [8], so that our
new approach to view learning takes the same time as the old approach. In practice,
our settings resulted in evaluating around 20,000 clauses for each fold, requiring on
average around four hours per fold on a Xeon 3 GHz class machine.
Figure 17.8 includes a comparison of SAYU-View to level 3 and the initial
approach to level 4. Again, we perform a two-tailed paired t-test on the area under
the precision-recall curve for levels of recall ≥ 0.5. SAYU-View performs significantly
better than both these approaches at the 99% confidence level. Although we do not
include the graph, SAYU-View performs significantly better than SAYU-TAN
(no initial features), also with a p-value < 0.01. SAYU-View also performs better
than level 1 and level 2 with a p-value < 0.01. With the integrated framework for
level 4, we now see significant improvement over lower levels of learning when we
ascend the hierarchy defined in figure 17.2.
Figure 17.9 shows the average area under the precision-recall curve (AUCPR) for
levels of recall ≥ 0.5 for level 3, the initial approach to level 4, and SAYU-View.
The average AUCPR for SAYU-View yields a 30% increase in the average AUCPR
over the initial approach to level 4. Furthermore, we see an increase in the average
AUCPR of 53% over level 3. Another way to look at these results is the potential
reduction of benign biopsies: procedures done on women without cancer. When
detecting 90% of cancers (i.e., recall = 0.9), SAYU-View achieves a 35% reduction
in benign biopsies over level 3 and a 39% reduction over the initial level 4 method.
Figure 17.9
17.7
Related Work
Research in SRL has advanced along two main lines: methods that allow graphical
models to represent relations, and frameworks that extend logic to handle
probabilities. Along the first line, probabilistic relational models, or PRMs, introduced
by Friedman et al., represent one of the first attempts to learn the structure of
graphical models while incorporating relational information [12]. Recently
Heckerman et al. have discussed extensions to PRMs and compared them to other graphical
models [16]. A statistical learning algorithm for probabilistic logic representations
was first given by Sato [33], and later Cussens [7] proposed a more general algorithm
to handle log-linear models. Additionally, Muggleton [21] has provided learning
algorithms for stochastic logic programs. The structure of the logic program is learned
using ILP techniques, while the parameters are learned using an algorithm scaled
up from that used for stochastic context-free grammars.
Newer representations garnering arguably the most attention are Bayesian logic
programs (BLPs) [18], relational Markov networks (RMNs) [37], constraint logic
programming with Bayes net constraints, or CLP(BN) [32], and Markov logic
networks (MLNs) [31]. MLNs are most similar to our approach. Nodes of MLNs
are the ground instances of the literals in the rule, and the arcs correspond to
the rules. One major difference is that, in our approach, nodes are the rules
themselves. Although we cannot work at the same level of detail, our approach
makes it straightforward to combine logical rules with other features, and we now
can take full advantage of propositional learning algorithms.
The present work builds upon previous work on using ILP for feature construction.
Such work treats ILP-constructed rules as Boolean features, re-represents each
example as a feature vector, and then uses a feature-vector learner to produce a
final classifier. To our knowledge, Pompe and Kononenko [25] were the first to apply
naive Bayes to combine clauses. Other work in this category was by Srinivasan and
King [36], who used rules as extra features for the task of predicting biological
activities of molecules from their atom and bond structures. Popescul and Ungar [26]
use k-means to derive cluster relations, which are then combined with the original
features through structural regression. In a different vein, relational decision
trees [23] use aggregation to provide extra features in a multi-relational setting, and
are close to our level 3 setting. Knobbe et al. [19] proposed numeric aggregates in
combination with logic-based feature construction for single attributes. Perlich and
Provost discuss several approaches for attribute construction using aggregates over
multi-relational features [24]. They also propose a hierarchy of levels of learning:
feature vectors, independent attributes on a table, multidimensional aggregation on
a table, and aggregation across tables. Some of the techniques in their hierarchy
could be applied to perform view learning in SRL.
Another approach for a tight coupling between rule learning and rule usage is
the recent work (done in parallel with ours) by Landwehr et al. [20]. That work
presented a new system called nFOIL. Several significant differences between the
two pieces of work appear to be the following. First,
nFOIL scores clauses by conditional log-likelihood rather than improvement in
classifier accuracy or classifier AUC (area under ROC or precision-recall curve).
Second, nFOIL can handle multiple-class classification tasks, which SAYU cannot.
Third, the present chapter reports experiments on data sets with significant class
skew, to which probabilistic classifiers are often sensitive. Fourth, this work looks
at TAN as opposed to naive Bayes. Finally, this work extends both [20] and [8] by
giving the network an initial feature set.
Another related piece of work is that by Popescul et al. [28, 27, 29] on structural
logistic regression. They use an ILP-like (refinement graph) search over rules,
expressed as database queries, to define new features. Differences from the present
work include their use of the new features within a logistic regression model rather
than a graphical model, and the fact that they do not update the logistic regression
model after adding each rule. A notable strength of their approach is that the
rule-learning process itself can include aggregation.
17.8
This chapter has shown that a simple form of view learning (treating rules
induced by a standard ILP system as the additional features of a new view)
yields improved performance over level 2 learning. Nevertheless, this improvement
is roughly equal to that obtained by level 3 learning (by aggregation, as might
be performed, for example, by a PRM). We have noted how this approach to view
learning is quite similar to earlier work using ILP for feature construction.
A more interesting form of view learning, or level 4 learning, is SAYU-View, which
closely integrates the ILP system and Bayesian network learning. It significantly
improves performance over both level 3 learning and the simple form of view
learning.
We believe many further improvements in view learning are possible. It makes
sense to include aggregates in the background knowledge for rule generation.
Alternatively, one can extend rules with aggregation operators, as proposed in
recent work by Vens et al. [38]. We have found the rule selection problem to
be nontrivial. Our greedy algorithm often generates rules that are too similar, and is not
guaranteed to maximize coverage. We would like to approach this problem as an
optimization problem weighing coverage, diversity, and accuracy.
Our approach of using ILP to learn new features for an existing table merely
scratches the surface of the potential for view learning. A more ambitious approach
would be to more closely integrate structure learning and view learning. A search
could be performed in which each move in the search space is either to modify the
probabilistic model or to refine the intentional definition of some field in the new
view. Going further still, one might learn an intentional definition for an entirely
new table. As a concrete example, for mammography one could learn rules defining
a binary predicate that identifies similar abnormalities. Because such a predicate
would represent a many-to-many relationship among abnormalities, a new table
would be required.
SRL algorithms provide a substantial extension to existing statistical learning
algorithms, such as Bayesian networks, by permitting statistical learning to be
applied directly to relational databases with multiple tables. Nevertheless, the
schemata for relational databases often are defined based on criteria other than
effectiveness of learning. If a schema is not the most appropriate for a given learning
task, it may be necessary to change it (by defining a new view) before applying
other SRL techniques. View learning, as presented in this chapter, provides an
automated capability to make such schema changes. Our approaches so far to view
learning build on existing ILP technology. We believe ILP-based view learning
can be greatly improved and extended, as outlined in the preceding paragraphs,
for example to learn entirely new tables. Furthermore, many approaches to view
learning outside of ILP remain to be explored.
Acknowledgments
Support for this research was partially provided by U.S. Air Force grant F3060201-2-0571. Elizabeth Burnside is supported by a General Electric Research in Radiology Academic Fellowship. Inês Dutra and Vítor Santos Costa did this work while
visiting the University of Wisconsin-Madison. Vítor Santos Costa was partially supported by the Fundação para a Ciência e Tecnologia. We thank Lisa Torrey, Mark
Goadrich, Rich Maclin, Jill Davis, and Allison Holloway for reading over drafts of
this chapter. We also thank the referees for their insightful comments.
References
[1] American College of Radiology. Breast imaging reporting and data system
(bi-rads), 2004. American College of Radiology.
[2] M. Brown, F. Houn, E. Sickles, and L. Kessler. Screening mammography in
community practice: Positive predictive value of abnormal findings and yield
of follow-up diagnostic procedures. American Journal of Roentgenology, 165:
1373-1377, 1995.
[3] E. Burnside, D. Rubin, and R. Shachter. A Bayesian network for screening
mammography. In American Medical Informatics Association, pages 106-110,
2000.
[4] E. Burnside, Y. Pan, C. Kahn, K. Shaffer, and D. Page. Training a Probabilistic
Expert System to Predict the Likelihood of Breast Cancer Using a Large Dataset
of Mammograms (abstract). Radiological Society of North America, 2004.
[5] E. Burnside, D. Rubin, and R. Shachter. Using a Bayesian network to predict
the probability and type of breast cancer represented by microcalcifications on
mammography. Medinfo, 2004:13-17, 2004.
[6] E. Burnside, J. Davis, V. Santos Costa, I. Dutra, C. Kahn, J. Fine, and D. Page.
Knowledge discovery from structured mammography reports using inductive
logic programming. In American Medical Informatics Association Symposium,
pages 96-100, 2005.
[7] J. Cussens. Parameter estimation in stochastic logic programs. Machine
Learning, 44(3):245-271, 2001.
18.1
Introduction
Many planning domains are most naturally represented in terms of objects and
relations among them. Accordingly, AI researchers have long studied algorithms for
planning and learning-to-plan in relational state and action spaces. These include,
for example, classical STRIPS domains such as the blocks world and logistics.
A common criticism of such domains and algorithms is the assumption of an
idealized, deterministic world model. This, in part, has led AI researchers to
study planning and learning within a decision-theoretic framework, which explicitly
handles stochastic environments and generalized reward-based objectives. However,
most of this work is based on explicit or propositional state-space models, and so far
has not demonstrated scalability to the large relational domains that are commonly
addressed in classical planning.
Intelligent agents must be able to simultaneously deal with both the complexity
arising from relational structure and the complexity arising from uncertainty. The
primary goal of this research is to move toward such agents by bridging the gap
between classical and decision-theoretic techniques.
In this chapter, we describe a straightforward and practical method for solving
very large, relational Markov decision processes (MDPs). Our work can be viewed
as a form of relational reinforcement learning (RRL) where we assume a strong
simulation model of the environment. That is, we assume access to a black-box
simulator, for which we can provide any (relationally represented) state/action pair
and receive a sample from the appropriate next-state and reward distributions. The
goal is to interact with the simulator in order to learn a policy for achieving high
expected reward. It is a separate challenge, not considered here, to combine our
work with methods for learning the environment simulator to avoid dependence on
being provided such a simulator.
Dynamic-programming approaches to finding optimal control policies in MDPs
[6, 25], using explicit (flat) state-space representations, break down when the state
space becomes extremely large. More recent work extends these algorithms to use
propositional [8, 11, 12, 9, 18, 21] as well as relational [10, 20] state-space representations. These extensions have significantly expanded the set of approachable
problems, but have not yet shown the capacity to solve large classical planning
problems such as the benchmark problems used in planning competitions [3], let
alone their stochastic variants (see section 18.6 for example benchmarks). One possible reason for this is that these methods are based on calculating and representing
value functions. For familiar STRIPS planning domains (among others), useful value
functions can be difficult to represent compactly, and their manipulation becomes
a bottleneck.
Most of the above techniques are purely deductive; that is, each value function
is guaranteed to have a certain level of accuracy. In contrast, in this work we will
focus on inductive techniques that make no such guarantees in practice. Existing
inductive forms of approximate policy iteration (API) utilize machine learning
to select compactly represented approximate value functions at each iteration of
dynamic programming [7]. As with any machine learning algorithm, the selection
of the hypothesis space, here a space of value functions, is critical to performance.
An example space used frequently is the space of linear combinations of a human-selected feature set.
To our knowledge, there has been no previous work that applies any form of
API to benchmark problems from classical planning, or their stochastic variants.1
1. Recent work in relational reinforcement learning has been applied to STRIPS problems
with much simpler goals than typical benchmark planning domains, and is discussed below
in section 18.7.
Again, one reason for this is the high complexity of typical value functions for these
large relational domains, making it difficult to specify good value-function spaces
that facilitate learning. By comparison, it is often much easier to compactly specify
good policies, and accordingly good policy spaces for learning. This observation
is the basis for recent work on inductive policy selection in relational planning
domains, both deterministic [29, 33], and probabilistic [51]. These techniques show
that useful policies can be learned using a policy-space bias described by a generic
(relational) knowledge representation language. Here we incorporate those ideas
into a novel variant of API that achieves significant success without representing or
learning approximate value functions. Of course, a natural direction for future work
is to combine policy-space techniques with value-function techniques, to leverage
the advantages of both.
Given an initial policy, our approach uses the simulation technique of policy
rollout [46] to generate trajectories of an improved policy. These trajectories are
then given to a classification learner, which searches for a classifier, or policy,
that matches the trajectory data, resulting in an approximately improved policy.
These two steps are iterated until no further improvement is observed. The resulting
algorithm can be viewed as a form of API where the iteration is carried out without
inducing approximate value functions.
By avoiding value-function learning, this algorithm addresses the representational
challenge of applying API to relational planning domains. However, another fundamental challenge is that, for nontrivial relational domains, API requires some form
of bootstrapping. In particular, for most STRIPS planning domains the reward,
which corresponds to achieving a goal condition, is sparsely distributed and unlikely to be reached by random exploration. Thus, initializing API with a random
or uninformed policy will likely result in no reward signal and hence no guidance
for policy improvement. One approach to bootstrapping is to rely on the user to
provide a good initial policy or heuristic that gives guidance toward achieving reward. Rather, in this work we develop a new automatic bootstrapping approach for
goal-based planning domains which does not require user intervention.
Our bootstrapping approach is based on the idea of random-walk problem distributions. For a given planning domain, such as the blocks world, this distribution
randomly generates a problem (i.e., an initial state and a goal) by selecting a random initial state and then executing a sequence of n random actions, taking the
goal condition to be a subset of properties from the resulting state. The problem
difficulty typically increases with n, and for small n (short random walks) even
random policies can uncover reward. Intuitively, a good policy for problems with
walk length n can be used to bootstrap API for problems with slightly longer walk
lengths. Our bootstrapping approach iterates this idea, by starting with a random
policy and very small n, and then gradually increasing the walk length until we
learn a policy for very long random walks. Such long-random-walk policies clearly
capture much domain knowledge, and can be used in various ways. Here, we show
that empirically such policies often perform well on problem distributions from
18.2
Problem Setup
We formulate our work in the framework of MDPs. While our primary motivation is
to develop algorithms for relational planning domains, we first describe our problem
setup and approach for a general, action-simulator-based MDP representation.
Later, in section 18.4, we describe a particular representation of planning domains
as relational MDPs and the corresponding relational instantiation of our approach.
Following and adapting Kearns et al. [27] and Bertsekas and Tsitsiklis [7], we
represent an MDP using a generative model $\langle S, A, T, R, I \rangle$, where $S$ is a finite set
of states, $A$ is a finite, ordered set of actions, and $T$ is a randomized action-simulation
algorithm that, given state $s$ and action $a$, returns a next state $s'$
according to some unknown probability distribution $P_T(s' \mid s, a)$. The component $R$
is a reward function that maps $S \times A$ to real numbers, with $R(s, a)$ representing the
reward for taking action $a$ in state $s$, and $I$ is a randomized initial-state algorithm
with no inputs that returns a state $s$ according to some unknown distribution $P_0(s)$.
We sometimes treat $I$ and $T(s, a)$ as random variables with distributions $P_0(\cdot)$ and
$P_T(\cdot \mid s, a)$ respectively.
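The generative model described above can be pictured as a small programming interface. The following Python sketch is illustrative only: the class name `GenerativeMDP`, its field names, and the two-state chain used to exercise it are assumptions made here, not part of the chapter. The state set S is left implicit, since the learner only ever sees states returned by the simulator.

```python
import random
from dataclasses import dataclass
from typing import Callable, Hashable, Sequence

State = Hashable
Action = Hashable


@dataclass
class GenerativeMDP:
    """Hypothetical container for the generative model <S, A, T, R, I>."""
    actions: Sequence[Action]                    # A: finite, ordered action set
    simulate: Callable[[State, Action], State]   # T: samples s' ~ P_T(. | s, a)
    reward: Callable[[State, Action], float]     # R(s, a)
    initial: Callable[[], State]                 # I: samples s ~ P_0(.)


def make_chain(rng: random.Random) -> GenerativeMDP:
    """Toy two-state chain, assumed here purely for illustration."""
    def simulate(s, a):
        # "go" moves 0 -> 1 with probability 0.8; state 1 is absorbing.
        if s == 0 and a == "go" and rng.random() < 0.8:
            return 1
        return s

    def reward(s, a):
        return 1.0 if s == 1 else 0.0

    return GenerativeMDP(["stay", "go"], simulate, reward, lambda: 0)


mdp = make_chain(random.Random(0))
s = mdp.initial()
for _ in range(5):
    s = mdp.simulate(s, "go")  # only sampled transitions are observed
```

The distributions P_0 and P_T never appear explicitly in the interface, which mirrors the assumption that they are unknown to the learner.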
For an MDP $M = \langle S, A, T, R, I \rangle$, a policy $\pi$ is a (possibly stochastic) mapping
from $S$ to $A$. The value function of $\pi$, denoted $V^{\pi}(s)$, represents the expected,
cumulative, discounted reward of following policy $\pi$ in $M$ starting from state $s$, and
(18.1)
We will measure the quality of a policy by the objective function $V(\pi) = E[V^{\pi}(I)]$,
giving the expected value obtained by that policy when starting from a randomly
drawn initial state. A common objective in MDP planning and reinforcement
learning is to find an optimal policy $\pi^{*} = \arg\max_{\pi} V(\pi)$. However, no automated
technique, including the one we present here, has to date been able to guarantee
finding an optimal policy in the relational planning domains we consider, in
reasonable running time.
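With only simulator access, the objective $V(\pi) = E[V^{\pi}(I)]$ can be approximated by sampling. The sketch below is a plain Monte Carlo estimate, not a procedure taken from the chapter: the function name `estimate_objective`, the truncation of each episode at a finite `horizon`, and the four-state line world are all assumptions introduced for illustration.

```python
import random


def estimate_objective(policy, initial, simulate, reward, gamma, horizon, episodes):
    """Monte Carlo estimate of V(pi) = E[V^pi(I)], truncated at `horizon` steps."""
    total = 0.0
    for _ in range(episodes):
        s = initial()                  # s ~ P_0
        discount, ret = 1.0, 0.0
        for _ in range(horizon):
            a = policy(s)
            ret += discount * reward(s, a)
            s = simulate(s, a)         # s' ~ P_T(. | s, a)
            discount *= gamma
        total += ret
    return total / episodes


# Hypothetical toy problem: states 0..3 on a line, reward 1 for acting in state 3.
rng = random.Random(0)
initial = lambda: rng.choice([0, 1])
simulate = lambda s, a: min(3, s + 1) if a == "right" else s
reward = lambda s, a: 1.0 if s == 3 else 0.0

v_right = estimate_objective(lambda s: "right", initial, simulate, reward, 0.9, 10, 200)
v_stay = estimate_objective(lambda s: "stay", initial, simulate, reward, 0.9, 10, 200)
```

In this toy problem the "stay" policy never reaches the rewarding state, so its estimate is exactly zero, while the "right" policy earns a positive estimate.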
It is a well-known fact that given a current policy $\pi$, we can define a new improved
policy
$PI^{\pi}(s) = \arg\max_{a \in A} Q^{\pi}(s, a)$
(18.2)
such that the value function of $PI^{\pi}$ is guaranteed to (1) be no worse than that of
$\pi$ at each state $s$, and (2) strictly improve at some state when $\pi$ is not optimal.
Policy iteration is an algorithm for computing optimal policies by iterating policy
improvement (PI) from any initial policy to reach a fixed point, which is guaranteed
to be an optimal policy. Each iteration of policy improvement involves two steps:
(1) policy evaluation, where we compute the value function $V^{\pi}$ of the current policy
$\pi$, and (2) policy selection, where, given $V^{\pi}$ from step 1, we select the action that
maximizes $Q^{\pi}(s, a)$ at each state, defining a new improved policy.
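The two steps of policy iteration can be sketched on a tiny explicit MDP. This is a minimal illustration under assumptions that differ from the chapter's setting: it assumes a known transition table `P[s][a] = {next_state: probability}` and reward table `R[s][a]` (the chapter assumes only a simulator), it uses the standard form $Q^{\pi}(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V^{\pi}(s')$, and the two-state problem and all function names are hypothetical.

```python
def evaluate(policy, P, R, gamma, sweeps=500):
    """Step 1 (policy evaluation): iteratively approximate V^pi."""
    V = {s: 0.0 for s in P}
    for _ in range(sweeps):
        V = {s: R[s][policy[s]]
                + gamma * sum(p * V[t] for t, p in P[s][policy[s]].items())
             for s in P}
    return V


def q_value(s, a, V, P, R, gamma):
    """Q^pi(s, a) computed from V^pi and the explicit model."""
    return R[s][a] + gamma * sum(p * V[t] for t, p in P[s][a].items())


def policy_iteration(P, R, gamma, policy):
    while True:
        V = evaluate(policy, P, R, gamma)
        improved = {}
        for s in P:  # Step 2 (policy selection): argmax over actions of Q^pi(s, a)
            best = max(P[s], key=lambda a: q_value(s, a, V, P, R, gamma))
            keep = policy[s]
            # Keep the current action on (near-)ties so the loop terminates.
            gain = q_value(s, best, V, P, R, gamma) - q_value(s, keep, V, P, R, gamma)
            improved[s] = best if gain > 1e-9 else keep
        if improved == policy:  # fixed point reached
            return policy, V
        policy = improved


# Hypothetical two-state problem: staying in "b" pays 1 per step.
P = {"a": {"stay": {"a": 1.0}, "move": {"b": 0.9, "a": 0.1}},
     "b": {"stay": {"b": 1.0}, "move": {"a": 1.0}}}
R = {"a": {"stay": 0.0, "move": 0.0},
     "b": {"stay": 1.0, "move": 0.0}}
policy, V = policy_iteration(P, R, 0.9, {"a": "stay", "b": "move"})
```

Starting from a policy that never collects reward, the loop reaches the fixed point in which the agent moves from "a" to "b" and then stays there.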
18.3
18.3.1
2. Under very strong assumptions, API can be shown to converge in the infinite limit to
a near-optimal policy [7].
3. In particular, the RRL work has considered a variety of value-function representations,
including relational regression trees, instance-based methods, and graph kernels, but none
of them have generalized well over varying numbers of objects.
such languages may provide useful policy-space biases for learning in API. However,
all prior API methods are based on approximating value functions and hence cannot
leverage these biases. With this motivation, we introduce a new form of API
that directly learns policies without directly representing or approximating value
functions.
18.3.2
A policy is simply a classifier that maps states to actions. Our API approach is
based on this view, and is motivated by recent work that casts policy selection
as a standard classification learning problem. In particular, given the ability to
observe trajectories of a target policy, we can use machine learning to select a
policy, or classifier, that mimics the target as closely as possible. This idea has
been studied previously under the name behavioral cloning [44]. Khardon [30]
studied this learning setting and provided PAC-like learnability results, showing
that under certain assumptions, a small number of trajectories is sufficient to learn
a policy whose value is close to that of the target. In addition, recent empirical
work, in relational planning domains [29, 33, 51], has shown that by using expressive
languages for specifying state-action mappings, good policies can be learned from
sample trajectories of good policies.
These results suggest that, given a policy $\pi$, if we can somehow generate trajectories of an improved policy, then we can learn an approximately improved policy
based on those trajectories. This idea is the basis of our approach. Figure 18.1 gives
pseudocode for our API variant, which starts with an initial policy $\pi_0$ and produces
a sequence of approximately improved policies. Each iteration involves two primary
steps: First, given the current policy $\pi$, the procedure Improved-Trajectories
(approximately) generates trajectories of the improved policy $\pi' = PI^{\pi}$. Second,
these trajectories are used as training data for the procedure Learn-Policy, which
returns an approximation of $\pi'$. We now describe each step in more detail.
Step 1: Generating Improved-Trajectories Given a base policy $\pi$,
the simulation technique of policy rollout [46, 7] computes an approximation
$\hat{\pi}(s)$ without the need to solve for $\pi'$ at all other states, and thus provides a
tractable way to approximately simulate the improved policy $\pi'$ in large state-space
MDPs. Often $\pi'$ is significantly better than $\pi$, and hence so is $\hat{\pi}$, which can lead
to substantially improved performance at a small cost. Policy rollout has provided
significant benefits in a number of application domains, including, for example,
backgammon [46], instruction scheduling [37], network congestion control [49], and
solitaire [50].
Policy rollout computes $\hat{\pi}(s)$, the estimate of $\pi'(s)$, by estimating $Q^{\pi}(s, a)$ for
each action $a$ and then taking the maximizing action to be $\hat{\pi}(s)$ as suggested by
API (n, w, h, M, π0, γ)
    π ← π0;
    loop
        T ← Improved-Trajectories(n, w, h, M, π);
        π ← Learn-Policy(T);
    until satisfied with π; // e.g., until change is small
    Return π;

Improved-Trajectories (n, w, h, M, π)
    // training set size n, sampling width w,
    // horizon h, MDP M, current policy π
    T ← ∅;
    repeat n times // generate n trajectories of improved policy
        t ← nil;
        s ← state drawn from I; // draw random initial state
        for i = 1 to h
            ⟨Q̂(s, a1), . . . , Q̂(s, am)⟩ ← Policy-Rollout(π, s, w, h, M);
            t ← t · ⟨s, π(s), Q̂(s, a1), . . . , Q̂(s, am)⟩; // concatenate sample to trajectory
            a ← action maximizing Q̂(s, a); // action of the improved policy at state s
            s ← state sampled from T(s, a); // simulate action of improved policy
        T ← T ∪ t;
    Return T;

Policy-Rollout (π, s, w, h, M)
    // policy π, state s, sampling width w, horizon h, MDP M
    for each action ai in A
        Q̂(s, ai) ← 0;
        repeat w times // Q̂(s, ai) is an average over w trajectories
            R ← R(s, ai);
            s' ← a state sampled from T(s, ai); // take action ai in s
            for j = 1 to h − 1 // take h − 1 steps of π, accumulating reward in R
                R ← R + γ^j R(s', π(s'));
                s' ← a state sampled from T(s', π(s'));
            Q̂(s, ai) ← Q̂(s, ai) + R; // include trajectory in average
        Q̂(s, ai) ← Q̂(s, ai) / w;
    Return ⟨Q̂(s, a1), . . . , Q̂(s, am)⟩
Figure 18.1 Pseudocode for our API algorithm. See section 18.4.3 for an instantiation of Learn-Policy called Learn-Decision-List.
actions selected by $\pi$ for $h - 1$ steps. The estimate of $Q^{\pi}(s, a)$ is then taken to be the
average of the cumulative discounted reward along each trajectory. The sampling
width $w$ and horizon $h$ are specified by the user, and control the tradeoff between
increased computation time for large values, and reduced accuracy for small values.
The procedure Improved-Trajectories uses rollout to generate $n$ trajectories of length $h$
of the improved policy $\hat{\pi}$, each trajectory beginning at a randomly
drawn initial state. Rather than just recording the sequence of states encountered
and actions selected by $\hat{\pi}$ along each trajectory, we store additional information
that is used by our policy-learning algorithm. In particular, the $i$th element of a
trajectory has the form $\langle s_i, \pi(s_i), \hat{Q}(s_i, a_1), \ldots, \hat{Q}(s_i, a_m) \rangle$, giving the $i$th state $s_i$
along the trajectory, the action selected by the current (unimproved) policy at $s_i$,
and the Q-value estimates $\hat{Q}(s_i, a)$ for each action. Thus each trajectory generated
by Improved-Trajectories records for each state the action selected by $\pi$ and the
Q-values for all actions. Note that given the Q-value information for $s_i$ the learning
algorithm can determine the approximately improved action $\hat{\pi}(s_i)$, by maximizing
over actions, if desired.
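The rollout and trajectory-generation procedures of figure 18.1 can be sketched in Python as follows. This is a hedged illustration rather than the authors' implementation: the function names, the decision to pass the simulator pieces as separate arguments, the assumption of a deterministic base policy, and the four-state line world are all choices made here for the example.

```python
def policy_rollout(policy, s, w, h, actions, simulate, reward, gamma):
    """Estimate Q^pi(s, a) for every action by averaging w sampled trajectories."""
    q_hat = {}
    for a in actions:
        total = 0.0
        for _ in range(w):
            ret = reward(s, a)
            nxt = simulate(s, a)                 # take action a in s
            for j in range(1, h):                # then h - 1 steps of pi
                a_pi = policy(nxt)
                ret += gamma ** j * reward(nxt, a_pi)
                nxt = simulate(nxt, a_pi)
            total += ret
        q_hat[a] = total / w
    return q_hat


def improved_trajectories(n, w, h, policy, initial, actions, simulate, reward, gamma):
    """Generate n length-h trajectories of the rollout policy."""
    trajectories = []
    for _ in range(n):
        t, s = [], initial()
        for _ in range(h):
            q_hat = policy_rollout(policy, s, w, h, actions, simulate, reward, gamma)
            t.append((s, policy(s), q_hat))      # element <s, pi(s), Q-hat(s, .)>
            a = max(actions, key=q_hat.get)      # action of the improved policy
            s = simulate(s, a)
        trajectories.append(t)
    return trajectories


# Hypothetical deterministic line world: states 0..3, reward 1 for acting in state 3.
actions = ["left", "right"]
simulate = lambda s, a: min(3, s + 1) if a == "right" else max(0, s - 1)
reward = lambda s, a: 1.0 if s == 3 else 0.0
always_left = lambda s: "left"

trajs = improved_trajectories(2, 1, 4, always_left, lambda: 2,
                              actions, simulate, reward, 0.9)
```

Even though the base policy always moves left, the rollout estimates at state 2 favor "right", so the generated trajectories follow the improved action while still recording the base policy's choice.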
Step 2: Learn-Policy Intuitively, we want Learn-Policy to select a new
policy that closely matches the training trajectories. In our experiments, we
use relatively simple learning algorithms based on greedy search within a space of
policies specified by a policy-language bias. In sections 18.4.2 and 18.4.3 we detail
the policy-language learning bias used by our technique, and the associated learning
algorithm. In Fern et al. [16] we provide a technical analysis of an idealized version
of this algorithm, providing guidance regarding the number of training trajectories,
horizon, and sampling width required to guarantee policy improvement with high
probability. We note that by labeling each training state in the trajectories with
the associated Q-values for each action, rather than simply with the best action,
we enable the learner to make more informed tradeoffs, focusing on accuracy at
states where wrong decisions have high costs, which was empirically useful. Also,
the inclusion of $\pi(s)$ in the training data enables the learner to adjust the data
relative to $\pi$, if desired; e.g., our learner uses a bias that focuses on states where
large improvement appears possible.
Finally, we note that for API to be effective, it is important that the initial
policy $\pi_0$ provide guidance toward improvement, i.e., $\pi_0$ must bootstrap the API
process. For example, in goal-based planning domains $\pi_0$ should reach a goal from
some of the sampled states. In section 18.5 we will discuss this important issue of
bootstrapping and introduce a new bootstrapping technique.
18.4
problem instance from a planning domain, and hence can be viewed as a form of
domain-specific control knowledge.
In this section, we first describe a straightforward way to view classical planning
domains (not just single problem instances) as relationally factored MDPs. Next,
we describe our relational policy space in which policies are compactly represented
as taxonomic decision lists. Finally, we present a heuristic learning algorithm for
this policy space.
18.4.1
For single-argument action types, many useful rules for planning domains take
the form of "apply action type A to any object in class C" [33]. For example, in
the blocks world a useful planning rule might be, "Pick up any clear block that
belongs on the table but is not on the table," or in a logistics world, "Unload
any object that is at its destination." This motivates the idea of using a formal
class description language for representing such classes or sets of objects, and
then learning policies that are represented via rules expressed in that language. In
particular, if the selected class description language can compactly encode useful
classes of objects, then we can learn rules for the policy by simply searching over
short class descriptions.
This idea was first explored by Martin and Geffner [33] who introduced the
use of decision lists of such rules, using description logic as a class description
language. Their experiments in the deterministic blocks world showed promising
results, highlighting the potential benefits of using class description languages to
represent policies. With that motivation, we consider a policy space that is similar
to the one used originally by Martin and Geffner, but generalized to handle multiple
action arguments. Also, for historical reasons, rather than use description logic as
our class description language, we use taxonomic syntax [35, 36], as described below.
Comparison Predicates For relational MDPs with world and goal predicates,
such as those corresponding to classical planning domains, it is often useful for
policies to compare the current state with the goal. To this end, we introduce a new
set of predicates, called comparison predicates, which are derived from the world
and goal predicates. For each world predicate p and corresponding goal predicate
gp, we introduce a new comparison predicate cp that is defined as the conjunction
of p and gp. That is, a comparison predicate fact is true if and only if both the
corresponding world and goal predicate facts are true. For example, in the blocks
world, the comparison predicate fact con(a, b) indicates that a is on b in both the
current state and the goal; i.e., on(a, b) and gon(a, b) are true.
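Deriving comparison facts amounts to intersecting the world facts with the goal facts, predicate by predicate. The sketch below assumes a hypothetical encoding in which a fact is a `(predicate_name, argument_tuple)` pair and in which goal and comparison predicates are named by prefixing "g" and "c" to the world predicate name, as in the on / gon / con example; neither the encoding nor the function name comes from the chapter.

```python
def comparison_facts(world, goal):
    """Return the facts c<p>(args) for which both p(args) and g<p>(args) hold."""
    return {("c" + pred, args)
            for (pred, args) in world
            if ("g" + pred, args) in goal}


# Hypothetical blocks-world state and goal.
world = {("on", ("a", "b")), ("on", ("b", "c")),
         ("on-table", ("c",)), ("clear", ("a",))}
goal = {("gon", ("a", "b")), ("gon", ("b", "d")), ("gon-table", ("c",))}

facts = comparison_facts(world, goal)
```

Here `facts` contains con(a, b) and con-table(c): the block a is on b both now and in the goal, and c is on the table in both, while on(b, c) has no matching goal fact.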
Taxonomic Syntax Taxonomic syntax provides a language for writing class
expressions that represent sets of objects with properties of interest and serve
as the fundamental pieces with which we build policies. Class expressions are
built from the MDP predicates (including comparison predicates if applicable)
and variables. In our policy representation, the variables will be used to denote
action arguments, and at run-time will be instantiated by objects. For simplicity
we only consider predicates of arity one and two, which we call primitive classes and
relations, respectively. When a domain contains predicates of arity three or more,
we automatically convert them to multiple auxiliary binary predicates. Given a list
of variables $X = (x_1, \ldots, x_k)$, the syntax of class expressions is given by
$C[X] ::= C_0 \mid x_i \mid \text{a-thing} \mid \neg C[X] \mid (R\ C[X]) \mid (\min R)$
$R ::= R_0 \mid R^{-1} \mid R^{*}$,
where $C[X]$ is a class expression, $R$ is a relation expression, $C_0$ is a primitive class,
$R_0$ is a primitive relation, and $x_i$ is a variable in $X$. Note that, for classical planning
domains, the primitive classes and relations can be world, goal, or comparison
predicates. We define the depth $d(C[X])$ of a class expression $C[X]$ to be one if
$C[X]$ is either a primitive class, a-thing, a variable, or (min $R$); otherwise we
define $d(\neg C[X])$ and $d(R\ C[X])$ to be $d(C[X]) + 1$, where $R$ is a relation expression
and $C[X]$ is a class expression. For a given relational MDP we denote by $C_d[X]$ the
set of all class expressions $C[X]$ that have a depth of $d$ or less.
The semantics of class expressions are given in terms of an MDP state $s$ and a
variable assignment $O = (o_1, \ldots, o_k)$, which assigns object $o_i$ to variable $x_i$. The
interpretation of $C[X]$ relative to $s$ and $O$ is a set of objects and is denoted by
$C[X]^{s,O}$. A primitive class $C_0$ is interpreted as the set of objects for which the
predicate symbol $C_0$ is true in $s$. For example, in the blocks world, the primitive
class expressions clear and gclear represent the sets of blocks that are clear in the
current world state and clear in the goal respectively. Likewise, a primitive relation
$R_0$ is interpreted as the set of all object tuples for which the relation $R_0$ holds in $s$.
For example, the primitive relation expression on represents the set of all pairs of
blocks $(o_1, o_2)$ such that $o_1$ is on $o_2$ in the current world state. The class expression
a-thing denotes the set of all objects in $s$. The class expression $x_i$, where $x_i$ is a
variable, is interpreted to be the singleton set $\{o_i\}$.
The interpretation of compound expressions is given by
$(\neg C[X])^{s,O} = \{o \mid o \notin C[X]^{s,O}\}$
$(R\ C[X])^{s,O} = \{o \mid \exists o' \in C[X]^{s,O} \text{ s.t. } (o', o) \in R^{s,O}\}$
$(\min R)^{s,O} = \{o \mid \exists o' \text{ s.t. } (o, o') \in R^{s,O},\ \nexists o' \text{ s.t. } (o', o) \in R^{s,O}\}$
$(R^{*})^{s,O} = \mathrm{ID} \cup \{(o_1, o_v) \mid \exists o_2, \ldots, o_{v-1} \text{ s.t. } (o_i, o_{i+1}) \in R^{s,O} \text{ for } 1 \le i < v\}$
$(R^{-1})^{s,O} = \{(o, o') \mid (o', o) \in R^{s,O}\}$,
where $C[X]$ is a class expression, $R$ is a relation expression, and ID is the identity
relation. Intuitively the class expression $(R\ C[X])$ denotes the set of objects that
are related through relation $R$ to some object in the set $C[X]$. For example, in
the blocks world, the expression (on on-table) denotes the set of blocks that are
currently on a block that is on the table. The expression $(R^{*}\ C[X])$ denotes the
set of objects that are related through some $R$ chain to an object in $C[X]$;
this constructor is important for representing recursive concepts. For example, the
expression (on$^{*}$ a), where a is a block, represents the set of blocks that are currently
above a. The expression (min $R$) denotes the set of objects that are minimal under
the relation $R$. For example, the expression (min on) represents the set of blocks
that have no blocks above them, and are on some other block (i.e., the set of clear
blocks).
The following class expressions are some examples of useful blocks-world concepts, given the primitive classes clear, gclear, holding, and con-table, along
with the primitive relations on, gon, and con.
(gon$^{-1}$ holding) has depth two, and denotes the block that we want under the
block being held.
(on$^{*}$ (on gclear)) has depth three, and denotes the blocks currently above blocks
that we want to make clear.
(con$^{*}$ con-table) has depth two, and denotes the set of blocks in well-constructed
towers.
(gon (con$^{*}$ con-table)) has depth three, and denotes the blocks that belong on
top of a currently well-constructed tower.
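Class expressions of this kind are easy to evaluate mechanically against a state. The following Python sketch is a hypothetical interpreter, not the authors' code: expressions are encoded as nested tuples (`("not", C)`, `("min", R)`, `(R, C)`, with relations written as a name, `("inv", R)`, or `("star", R)`), and a state is a dictionary of objects, primitive classes, and primitive relations. It reads (R C) as the objects o with (o, o') in R for some o' in C, the orientation under which the blocks-world examples above come out as described; that reading is an assumption of this sketch.

```python
def reflexive_transitive(pairs, objects):
    """R*: the identity relation plus every chain of R steps."""
    closure = {(o, o) for o in objects} | set(pairs)
    changed = True
    while changed:
        changed = False
        for (x, y) in list(closure):
            for (y2, z) in list(closure):
                if y == y2 and (x, z) not in closure:
                    closure.add((x, z))
                    changed = True
    return closure


def eval_rel(r, state):
    """Interpret a relation expression as a set of object pairs."""
    if isinstance(r, str):
        return state["relations"][r]              # primitive relation R0
    op, sub = r
    pairs = eval_rel(sub, state)
    if op == "inv":
        return {(y, x) for (x, y) in pairs}
    if op == "star":
        return reflexive_transitive(pairs, state["objects"])
    raise ValueError(op)


def eval_class(c, state, binding):
    """Interpret a class expression as a set of objects."""
    if c == "a-thing":
        return set(state["objects"])
    if isinstance(c, str):
        if c in binding:                          # variable x_i -> {o_i}
            return {binding[c]}
        return set(state["classes"][c])           # primitive class C0
    if c[0] == "not":
        return set(state["objects"]) - eval_class(c[1], state, binding)
    if c[0] == "min":
        pairs = eval_rel(c[1], state)
        return {o for (o, _) in pairs} - {o for (_, o) in pairs}
    r, sub = c                                    # (R C)
    targets = eval_class(sub, state, binding)
    return {o for (o, o2) in eval_rel(r, state) if o2 in targets}


# Hypothetical state: a is on b; b and c are on the table.
state = {"objects": {"a", "b", "c"},
         "classes": {"clear": {"a", "c"}, "on-table": {"b", "c"}},
         "relations": {"on": {("a", "b")}}}

on_a_table_block = eval_class(("on", "on-table"), state, {})
above_b = eval_class((("star", "on"), "x"), state, {"x": "b"})
```

In this state `on_a_table_block` is {a}, and `above_b` is {a, b}; b itself is included because R* contains the identity relation.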
Decision-List Policies We represent policies as decision lists of action-selection
rules. Each rule has the form $a(x_1, \ldots, x_k) : L_1, L_2, \ldots, L_m$, where $a$ is a $k$-argument
action type, the $L_i$ are literals, and the $x_i$ are action-argument variables. We will
For a given relational MDP, define $R_{d,l}$ to be the set of action-selection rules that
have a length of at most $l$ literals and whose class expressions have depth at most
$d$. Also, define $H_{d,l}$ to be the policy space defined by decision lists whose rules are
from $R_{d,l}$. Since the number of depth-bounded class expressions is finite there are
a finite number of rules, and hence $H_{d,l}$ is finite, though exponentially large. Our
implementation of Learn-Policy, as used in the main API loop, learns a policy in
$H_{d,l}$ for user-specified values of $d$ and $l$.
We use a Rivest-style decision-list learning approach [43], an approach also taken
by Martin and Geffner [33] for learning class-based policies. The primary difference
between Martin and Geffner [33] and our technique is the method for selecting
individual rules in the decision list. We use a greedy, heuristic search, while previous
work used an exhaustive enumeration approach. This difference allows us to find
rules that are more complex, at the potential cost of failing to find some good simple
rules that enumeration might discover.
Recall from section 18.3 that the training set given to Learn-Policy contains
trajectories of the rollout policy. Our learning algorithm, however, is not sensitive to
the trajectory structure (i.e., the order of trajectory elements) and thus, to simplify
our discussion, we will take the input to our learner to be a training set $D$ that
contains the union of all the trajectory elements. This means that for a trajectory
set that contains $n$ trajectories of length $h$, $D$ will contain a total of $n \cdot h$ training
examples. As described in section 18.3, each training example in $D$ has the form
$\langle s, \pi(s), \hat{Q}(s, a_1), \ldots, \hat{Q}(s, a_m) \rangle$, where $s$ is a state, $\pi(s)$ is the action selected in $s$
by the previous policy, and $\hat{Q}(s, a_i)$ is the Q-value estimate of $Q^{\pi}(s, a_i)$. Note that
in our experiments the training examples only contain values for the legal actions
in a state.
Given a training set $D$, a natural learning goal is to find a decision-list policy that
for each training example selects an action with the maximum estimated Q-value.
This learning goal, however, can be problematic in practice as often there are several
best (or close to best) actions as measured by the true Q-function. In such cases, due
to random sampling, the particular action that looks best according to the Q-value
estimates in the training set is arbitrary. Attempting to learn a concise policy that
matches these arbitrary actions will be difficult at best and likely impossible.
One approach [31] to avoiding this problem is to use statistical tests to determine
the actions that are "clearly the best" (positive examples) and the ones that are
"clearly not the best" (negative examples). The learner is then asked to find a policy
that is consistent with the positive and negative examples. While this approach has
shown some empirical success, it has the potential shortcoming of throwing away
most of the Q-value information. In particular, it may not always be possible to
find a policy that exactly matches the training data. In such cases, we would like
the learner to make informed tradeoffs regarding suboptimal actions, i.e., prefer
suboptimal actions that have larger Q-values. With this motivation, below we
describe a cost-sensitive decision-list learner that is sensitive to the full set of Q-values in $D$. The learning goal is roughly to find a decision list that selects actions
with large cumulative Q-values over the training set.
18.4.3.1
We say that a decision list $L$ covers a training example $\langle s, \pi(s), \hat{Q}(s, a_1), \ldots, \hat{Q}(s, a_m) \rangle$
if $L$ suggests an action in state $s$. Given a set of training examples $D$, we search for
a decision list that selects actions with high Q-value via an iterative set-covering
approach carried out by Learn-Decision-List. Decision-list rules are constructed
one at a time and in order until the list covers all of the training examples. Pseudocode for our algorithm is given in figure 18.2. Initially, the decision list is the null
list and does not cover any training examples. During each iteration, we search for a
high-quality rule $R$, with quality measured relative to the set of currently uncovered training examples. The selected rule is appended to the current decision list,
and the training examples newly covered by the selected rule are removed from the
training set. This process repeats until the list covers all of the training examples.
The success of this approach depends heavily on the function Learn-Rule, which
selects a good rule relative to the uncovered training examples. Typically, a good
rule is one that selects actions with the best (or close to best) Q-value and also
covers a significant number of examples.

Learn-Decision-List ($D, d, l, b$)
    // training set $D$, concept depth $d$, rule length $l$, beam width $b$
    $L \leftarrow$ nil;
    while ($D$ is not empty)
        $R \leftarrow$ Learn-Rule($D, d, l, b$);
        $D \leftarrow D - \{e \in D \mid R \text{ covers } e\}$;
        $L \leftarrow$ Extend-List($L, R$); // add $R$ to end of list
    Return $L$;

Learn-Rule ($D, d, l, b$)
    // training set $D$, concept depth $d$, rule length $l$, beam width $b$
    for each action type $a$ // compute rule for each action type $a$
        $R_a \leftarrow$ Beam-Search($D, d, l, b, a$);
    Return $\arg\max_{R_a}$ Hvalue($R_a, D$);

Beam-Search ($D, d, l, b, a$)
    // training set $D$, concept depth $d$, rule length $l$, beam width $b$, action type $a$
    $k \leftarrow$ arity of $a$;
    $X \leftarrow (x_1, \ldots, x_k)$; // $X$ is a sequence of action-argument variables
    $L \leftarrow \{(x \in C) \mid x \in X, C \in C_d[X]\}$; // set of depth-bounded candidate literals
    $B_0 \leftarrow \{a(X) : \text{nil}\}$; $i \leftarrow 1$; // initialize beam to a single rule with no literals
    loop
        $G = B_{i-1} \cup \{R \in R_{d,l} \mid R = \text{Add-Literal}(R', l), R' \in B_{i-1}, l \in L\}$;
        $B_i \leftarrow$ Beam-Select($G, b, D$); // select best $b$ heuristic values
        if $B_i = B_{i-1}$ then exit loop; // stop when there is no more improvement in heuristic
        $i \leftarrow i + 1$;
    Return $\arg\max_{R \in B_i}$ Hvalue($R, D$) // return best rule in final beam

Figure 18.2 Pseudocode for Learn-Decision-List, Learn-Rule, and Beam-Search.
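The set-covering loop described above can be sketched in a few lines of Python. This is a minimal illustration only: `learn_rule` and `covers` are hypothetical callables standing in for Learn-Rule and the coverage test, and the early-exit guard is an addition of this sketch rather than part of the chapter's algorithm.

```python
def learn_decision_list(examples, learn_rule, covers):
    """Greedy set covering: add rules until every example is covered.

    learn_rule(examples) -> rule and covers(rule, example) -> bool are
    hypothetical callables supplied by the caller.
    """
    remaining = list(examples)
    decision_list = []
    while remaining:
        rule = learn_rule(remaining)
        still_uncovered = [e for e in remaining if not covers(rule, e)]
        if len(still_uncovered) == len(remaining):
            # Guard added for this sketch: a rule that covers nothing would
            # otherwise make the loop run forever.
            break
        decision_list.append(rule)  # rules are kept in the order learned
        remaining = still_uncovered
    return decision_list
```

Because each rule is learned only against the examples left uncovered by earlier rules, the order of the list matters, which matches the decision-list semantics.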
18.4.3.2
The input to the rule learner Learn-Rule is a set of training examples, along with
depth and length parameters $d$ and $l$, and a beam width $b$. For each action type
$a$, the rule learner calls the routine Beam-Search to find a good rule $R_a$ in $R_{d,l}$
for action type $a$. Learn-Rule then returns the rule $R_a$ with the highest value as
measured by our heuristic, which is described later in this section.
For a given action type $a$, the procedure Beam-Search generates a beam
$B_0, B_1, \ldots$, where each $B_i$ is a set of rules in $R_{d,l}$ for action type $a$. The sets evolve by
specializing rules in previous sets by adding literals to them, guided by our heuristic
function. Search begins with the most general rule $a(X) : \text{nil}$, which allows any
action of type $a$ in any state. Search iteration $i$ produces a set $B_i$ that contains $b$
rules with the highest different heuristic values among those in the following set (see footnote 4):

$$G = B_{i-1} \cup \{R \in R_{d,l} \mid R = \text{Add-Literal}(R', l),\ R' \in B_{i-1},\ l \in L\},$$

where $L$ is the set of all possible literals with a depth of $d$ or less. This set includes
the current best rules (those in $B_{i-1}$) and also any rule in $R_{d,l}$ that can be formed
by adding a new literal to a rule in $B_{i-1}$. The search ends when no improvement
in heuristic value occurs, that is, when $B_i = B_{i-1}$. Beam-Search then returns the
best rule in $B_i$ according to the heuristic.
Heuristic Function For a training instance $\langle s, \pi(s), \hat{Q}(s, a_1), \ldots, \hat{Q}(s, a_m) \rangle$,
following Harmon and Baird [22], we define the Q-advantage of taking action $a_i$
instead of $\pi(s)$ in state $s$ by $\Delta(s, a_i) = \hat{Q}(s, a_i) - \hat{Q}(s, \pi(s))$. Likewise, the Q-advantage of a rule $R$ is the sum of the Q-advantages of actions allowed by $R$
in $s$. Given a rule $R$ and a set of training examples $D$, our heuristic function
Hvalue($R, D$) is equal to the number of training examples that the rule covers plus
the sum of all the Q-advantages of the rule over those training examples (see footnote 5). Using Q-advantage rather than Q-value focuses the learner toward instances where a large
improvement over the previous policy is possible. Naturally, one could consider
using different weights for the coverage and Q-advantage terms, possibly tuning
the weight automatically using validation data.
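The heuristic just described can be written out as a short Python sketch. The dictionary layout of a training example (keys `state`, `pi_action`, and `q`) and the `rule_allows` callable are assumptions made here for illustration; only the formula itself, coverage count plus summed Q-advantages, comes from the text.

```python
def q_advantage(example, action):
    """Q-advantage of `action` over the action chosen by the previous policy."""
    return example["q"][action] - example["q"][example["pi_action"]]


def hvalue(rule_allows, examples):
    """Coverage count plus summed Q-advantages of the actions a rule allows.

    rule_allows(state, action) -> bool is a hypothetical stand-in for testing
    whether a rule allows a legal action in a state.
    """
    total = 0.0
    for ex in examples:
        allowed = [a for a in ex["q"] if rule_allows(ex["state"], a)]
        if allowed:  # the rule covers this example
            total += 1 + sum(q_advantage(ex, a) for a in allowed)
    return total
```

For instance, a rule that allows only an action whose Q-advantage is zero still earns one unit for covering the example, which is the role of the coverage term discussed in footnote 5.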
4. Since many rules in $R_{d,l}$ are equivalent, we must prevent the beam from filling up
with semantically equivalent rules. Rather than deal with this problem via expensive
equivalence testing we take an ad hoc, but practically effective approach. We assume that
rules do not coincidentally have the same heuristic value, so that ones that do must be
equivalent. Thus, we construct beams whose members all have different heuristic values.
We choose between rules with the same value by preferring shorter rules, then choose
arbitrarily.
5. If the coverage term is not included, then covering a zero Q-advantage example is the
same as not covering it. But zero Q-advantage can be good (e.g., the previous policy is
optimal in that state).
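The beam search of this section, including the distinct-heuristic-value rule of footnote 4, might be sketched as follows. This is an illustrative simplification: rules are assumed to be hashable, sized objects such as tuples of literals, and `add_literal` and `hvalue` are hypothetical callables, with `add_literal` returning `None` when a specialization would leave the rule space.

```python
def beam_search(initial_rule, candidate_literals, add_literal, hvalue, beam_width):
    """Beam search over rules, keeping at most one rule per heuristic value.

    add_literal(rule, literal) returns a specialized rule, or None if the result
    falls outside the rule space; hvalue(rule) scores a rule. All names here
    are illustrative assumptions.
    """

    def select(candidates):
        best_by_value = {}
        # Visit shorter rules first so that ties on value keep the shorter rule.
        for rule in sorted(candidates, key=len):
            best_by_value.setdefault(hvalue(rule), rule)
        top_values = sorted(best_by_value, reverse=True)[:beam_width]
        return [best_by_value[v] for v in top_values]

    beam = [initial_rule]
    while True:
        candidates = set(beam)  # current best rules stay in the running
        for rule in beam:
            for lit in candidate_literals:
                new_rule = add_literal(rule, lit)
                if new_rule is not None:
                    candidates.add(new_rule)
        new_beam = select(candidates)
        if set(new_beam) == set(beam):  # beam unchanged: no more improvement
            return max(new_beam, key=hvalue)
        beam = new_beam
```

Keying the beam on heuristic value is what implements the footnote's assumption that rules with equal values are equivalent.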
18.5
Bootstrapping
There are two issues that are critical to the success of our API technique. First,
API is fundamentally limited by the expressiveness of the policy language and
the strength of the learner, which dictates its ability to capture the improved
policy described by the training data at each iteration. Second, API can only
yield improvement if Improved-Trajectories successfully generates training data
that describes an improved policy. For large classical planning domains, initializing
API with an uninformed random policy will typically result in essentially random
training data, which is not helpful for policy improvement. For example, consider
the MDP corresponding to the 20-block blocks world with an initial problem
distribution that generates random initial and goal states. In this case, a random
policy is unlikely to reach a goal state within any practical horizon time. Hence,
the rollout trajectories are unlikely to reach the goal, providing no guidance toward
learning an improved policy (i.e., a policy that can more reliably reach the goal).
Because we are interested in solving large domains such as this, providing guiding inputs to API is critical. In Fern et al. [15], we showed that by bootstrapping
API with the domain-independent heuristic of the planner FF [24], API was able
to uncover good policies for the blocks world, simplified logistics world (no planes),
and stochastic variants. This approach, however, is limited by the heuristic's ability
to provide useful guidance, which can vary widely across domains.
Here we describe a new bootstrapping procedure for goal-based planning domains, based on random walks, for guiding API toward good policies. Our planning
system, which is evaluated in section 18.6, is based on integrating this procedure
with API in order to find policies for goal-based planning domains. For non-goal-based MDPs, this bootstrapping procedure cannot be directly applied, and other
bootstrapping mechanisms must be used if necessary. This might include providing
an initial nontrivial policy, providing a heuristic function, or some form of reward
shaping [34]. Below, we first describe the idea of random-walk distributions. Next,
we describe how to use these distributions in the context of bootstrapping API,
giving a new algorithm LRW-API.
18.5.1
Random-Walk Distributions
Throughout we consider an MDP $M = \langle S, A, T, R, I \rangle$ that corresponds to a goal-based planning domain, as described in section 18.4.1. Recall that each state
$s \in S$ corresponds to a planning problem, specifying a world state (via world
facts) and a set of goal conditions (via goal facts). We will use the terms "MDP
state" and "planning problem" interchangeably. Note that, in this context, $I$ is a
distribution over planning problems. For convenience we will denote MDP states
as tuples $s = (w, g)$, where $w$ and $g$ are the sets of world facts and goal facts in $s$
respectively.
Given an MDP state $s = (w, g)$ and set of goal predicates $G$, we define $s|_G$ to be
the MDP state $(w, g')$ where $g'$ contains those goal facts in $g$ that are applications
of a predicate in $G$. Given $M$ and a set of goal predicates $G$, we define the $n$-step random walk problem distribution $RW_n(M, G)$ by the following stochastic
algorithm:
1. Draw a random state $s_0 = (w_0, g_0)$ from the initial state distribution $I$.
2. Starting at $s_0$, take $n$ uniformly random actions (see footnote 6), giving a state sequence
$(s_0, \ldots, s_n)$, where $s_n = (w_n, g_0)$ (recall that actions do not change goal facts).
At each uniformly random action selection, we assume that an extra "no-op"
action (that does not change the state) is selected with some fixed probability,
for reasons explained below.
3. Let $g$ be the set of goal facts corresponding to the world facts in $w_n$, so, e.g., if
$w_n = \{on(a, b), clear(a)\}$, then $g = \{gon(a, b), gclear(a)\}$. Return the planning
problem (MDP state) $(w_0, g)|_G$ as the output.
We will sometimes abbreviate $RW_n(M, G)$ by $RW_n$ when $M$ and $G$ are clear in
context.
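The three sampling steps above can be sketched in Python as follows. The encoding of facts as tuples such as `("on", "a", "b")`, the convention that a goal predicate is the world predicate prefixed with `g`, the three domain callables, and the default no-op probability are all assumptions of this sketch rather than details given in the chapter.

```python
import random


def sample_rw_problem(sample_initial, applicable_actions, apply_action,
                      n, goal_predicates, noop_prob=0.1, rng=random):
    """Draw one planning problem from an n-step random-walk distribution.

    Returns (w0, g): the initial world facts and the goal facts obtained from
    the end of the walk, restricted to `goal_predicates` (e.g. {"gon"}).
    """
    w0 = frozenset(sample_initial())
    w = w0
    for _ in range(n):
        actions = applicable_actions(w)
        if not actions or rng.random() < noop_prob:
            continue  # no-op: the world state is unchanged
        w = frozenset(apply_action(w, rng.choice(actions)))
    # Turn the final world facts into goal facts, keeping only predicates in G.
    goal = frozenset(
        ("g" + fact[0],) + tuple(fact[1:])
        for fact in w
        if "g" + fact[0] in goal_predicates
    )
    return w0, goal
```

Choosing only among applicable actions follows the practice described in footnote 6; a simulator that cannot enumerate applicable actions would need a different `applicable_actions`.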
Intuitively, to perform well on this distribution a policy must be able to achieve
facts involving the goal predicates that typically result after an $n$-step random walk
from an initial state. By restricting the set of goal predicates $G$ we can specify the
types of facts that we are interested in achieving, e.g., in the blocks world we may
only be interested in achieving facts involving the "on" predicate.
The random-walk distributions provide a natural way to span a range of problem
difficulties. Since longer random walks tend to take us "further" from an initial
state, for small $n$ we typically expect that the planning problems generated by
$RW_n$ will become more difficult as $n$ grows. However, as $n$ becomes large, the
problems generated will require far fewer than $n$ steps to solve, i.e., there will be
more direct paths from an initial state to the end state of a long random walk.
Eventually, since $S$ is finite, the problem difficulty will stop increasing with $n$.
A question raised by this idea is whether, for large $n$, good performance on
$RW_n$ ensures good performance on other problem distributions of interest in the
domain. In some domains, such as the simple blocks world (see footnote 7), good random-walk
performance does seem to yield good performance on other distributions of interest.
In other domains, such as the grid world (with keys and locked doors), intuitively,
a random walk is very unlikely to uncover a problem that requires unlocking a
sequence of doors.
6. In practice, we only select random actions from the set of applicable actions in a state
$s_i$, provided our simulator makes it possible to identify this set.
7. In the blocks world with large $n$, $RW_n$ generates various pairs of random block
configurations, typically pairing states that are far apart; clearly, a policy that performs
well on this distribution has captured significant information about the blocks world.
We believe that good performance on long random walks is often useful, but
is only addressing one component of the difficulty of many planning benchmarks.
To successfully address problems with other components of difficulty, a planner
will need to deploy orthogonal technology such as landmark extraction for setting
subgoals [23]. For example, in the grid world, if we could automatically set the
subgoal of possessing a key for the first door, a long random-walk policy could
provide a useful macro for getting that key.
For the purpose of developing a bootstrapping technique for API, we limit our
focus to finding good policies for long random walks. In our experiments, we define
"long" by specifying a large walk length $N$. Theoretically, the inclusion of the
"no-op" action in the definition of $RW_n$ ensures that the induced random-walk
Markov chain is aperiodic, and thus that the distribution over states reached
by increasingly long random walks converges to a stationary distribution (see footnote 8). Thus
$RW_* = \lim_{n \to \infty} RW_n$ is well-defined, and we take good performance on $RW_*$ to
be our goal.
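The effect of the no-op action on periodicity can be seen on a toy example that is not taken from the chapter: a two-state chain that always swaps states never settles, while the same chain with a small probability of staying put converges to the uniform distribution. The chain, the no-op probability of 0.1, and the helper names below are all illustrative assumptions.

```python
def step(dist, transition):
    """One step of a Markov chain: new[j] = sum_i dist[i] * transition[i][j]."""
    n = len(dist)
    return [sum(dist[i] * transition[i][j] for i in range(n)) for j in range(n)]


def add_noop(transition, p_noop):
    """Stay put with probability p_noop, otherwise move as before."""
    n = len(transition)
    return [[p_noop * (i == j) + (1 - p_noop) * transition[i][j]
             for j in range(n)] for i in range(n)]


# Two states that always swap: a periodic chain with period 2.
swap = [[0.0, 1.0], [1.0, 0.0]]
lazy = add_noop(swap, 0.1)

d_swap, d_lazy = [1.0, 0.0], [1.0, 0.0]
for _ in range(200):
    d_swap = step(d_swap, swap)  # keeps alternating between the two states
    d_lazy = step(d_lazy, lazy)  # approaches the stationary distribution
```

After an even number of steps the periodic chain is back where it started, whereas the chain with the no-op has essentially forgotten its starting state.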
18.5.2
Random-Walk Bootstrapping
LRW-API ($N, G, n, w, h, M, \pi_0, \gamma$)
    // max random-walk length $N$, goal predicates $G$
    // training set size $n$, sampling width $w$, horizon $h$,
    // MDP $M$, initial policy $\pi_0$, discount factor $\gamma$.
    $\pi \leftarrow \pi_0$; $n \leftarrow 1$;
    loop
        if $\widehat{SR}_\pi(n) > \tau$
            // Find harder $n$-step distribution for $\pi$.
            $n \leftarrow$ least $i \in [n, N]$ s.t. $\widehat{SR}_\pi(i) < \tau - \delta$, or $N$ if none;
        $M' = M[RW_n(M, G)]$;
        $T \leftarrow$ Improved-Trajectories($n, w, h, M', \pi$);
        $\pi \leftarrow$ Learn-Policy($T$);
    until satisfied with $\pi$
    Return $\pi$;

Figure 18.3 Pseudocode for LRW-API. $\widehat{SR}_\pi(n)$ estimates the success ratio of $\pi$
in planning domain $D$ on problems drawn from $RW_n(M, G)$ by drawing a set of
problems and returning the fraction solved by $\pi$. Constants $\tau$ and $\delta$ are described
in the text.
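The walk-length update inside the LRW-API loop can be isolated as a small Python function. This sketch assumes a hypothetical callable `success_ratio(i)` that plays the role of the success-ratio estimate on $i$-step random-walk problems; how that estimate is obtained (sample size, simulator) is left outside the sketch.

```python
def next_walk_length(success_ratio, n, max_n, tau, delta):
    """Walk-length update from the LRW-API loop.

    success_ratio(i) is a hypothetical callable estimating the current
    policy's success ratio on i-step random-walk problems.
    """
    if success_ratio(n) <= tau:
        return n  # the current distribution is still hard enough
    for i in range(n, max_n + 1):
        if success_ratio(i) < tau - delta:
            return i  # least walk length that is clearly harder
    return max_n  # the policy does well on every length up to N
```

The margin `delta` keeps the new distribution noticeably harder than the threshold `tau`, so that the walk length does not creep up by one step at a time.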
18.6
18.6.1.1
LRW Experiments
Our first set of experiments evaluates the ability of LRW-API to find good policies
for $RW_*$. Here we utilize a sampling width of one for rollout, since these are
deterministic domains. Recall that in each iteration of LRW-API we compute
an (approximately) improved policy and may also increase the walk length $n$ to
find a harder problem distribution. We continued iterating LRW-API until we
observed no further improvement. The training time per iteration is approximately
five hours. Though the initial training period is significant, once a policy is learned
it can be used to solve new problems very quickly, terminating in seconds with a
solution when one is found, even for very large problems.
Figure 18.4 provides data for each iteration of LRW-API in each of the seven
domains with the indicated parameter settings. The first column, for each domain,
indicates the iteration number (e.g., the Blocks World was run for eight iterations).
The second column records the walk length $n$ used for learning in the corresponding
iteration. The third and fourth columns record the success ratio (SR) and average
length (AL) of the policy learned at the corresponding iteration as measured on 100
problems drawn from $RW_n$ for the corresponding value of $n$ (i.e., the distribution
used for learning). When this SR exceeds $\tau$, the next iteration seeks an increased
walk length $n$. The fifth and sixth columns record the SR and AL of the same
policy, but measured on 100 problems drawn from the LRW target distribution
$RW_*$, which in these experiments is approximated by $RW_N$ for $N = 10{,}000$.
So, for example, we see that in the Blocks World there are a total of eight
iterations, where we learn at first for one iteration with $n = 4$, one more iteration
with $n = 14$, four iterations with $n = 54$, and then two iterations with $n = 334$.
At this point we see that the resulting policy performs well on $RW_*$. Further
iterations with $n = N$, not shown, showed no improvement over the policy found
after iteration 8. In other domains, we also observed no improvement after iterating
with $n = N$, and thus do not show those iterations. We note that all domains except
Logistics (see below) achieve policies with good performance on $RW_N$ by learning
on much shorter $RW_n$ distributions, indicating that we have indeed selected a large
enough value of $N$ to capture $RW_*$, as desired.
18.6.1.2
General Observations
For several domains, our learner bootstraps very quickly from short random-walk
problems, finding a policy that works well even for much longer random-walk
problems. These include Schedule, Briefcase, Gripper, and Elevator. Typically,
large problems in these domains have many somewhat independent subproblems
with short solutions, so that short random walks can generate instances of all the
different typical subproblems. In each of these domains, our best LRW policy is
found in a small number of iterations and performs comparably to FF on $RW_*$.
We note that FF is considered a very good domain-independent planner for these
domains, so we consider this a successful result.
Domain (size)         iter. #   n     RW_n SR   RW_n AL   RW_* SR   RW_* AL
Blocks World          1         4     0.92      2.0       0         0
                      2         14    0.94      5.6       0.10      41.4
                      3         54    0.56      15.0      0.17      42.8
                      4         54    0.78      15.0      0.32      40.2
                      5         54    0.88      33.7      0.65      47.0
                      6         54    0.98      25.1      0.90      43.9
                      7         334   0.84      45.6      0.87      50.1
                      8         334   0.99      37.8      1         43.3
                      FF                                  0.96      49.0
Freecell (4,2,2,4)    1         5     0.97      1.4       0.08      3.6
                      2         8     0.97      2.7       0.26      6.3
                      3         30    0.65      7.0       0.78      7.0
                      4         30    0.72      7.1       0.85      7.0
                      5         30    0.90      6.7       0.85      6.3
                      6         30    0.81      6.7       0.89      6.6
                      7         30    0.78      6.8       0.87      6.8
                      8         30    0.90      6.9       0.89      6.6
                      9         30    0.93      7.7       0.93      7.9
                      FF                                  1         5.4
Logistics (1,2,2,6)   1         5     0.86      3.1       0.25      11.3
                      2         45    0.86      6.5       0.28      7.2
                      3         45    0.81      6.9       0.31      8.4
                      4         45    0.86      6.8       0.28      8.9
                      5         45    0.76      6.1       0.28      7.8
                      6         45    0.76      5.9       0.32      8.4
                      7         45    0.86      6.2       0.39      9.1
                      8         45    0.76      6.9       0.31      11.0
                      9         45    0.70      6.1       0.19      7.8
                      10        45    0.81      6.1       0.25      7.6
                      ...       ...   ...       ...       ...       ...
                      43        45    0.74      6.4       0.25      9.0
                      44        45    0.90      6.9       0.39      9.3
                      45        45    0.92      6.6       0.38      9.4
                      FF                                  1         13
Schedule (20)         1         1     0.79      1         0.48      27
                      2         4     1         3.45      1         34
                      FF                                            36
Briefcase (10)        1         5     0.91      1.4       0         0
                      2         15    0.89      4.2       0.2       38
                      3         15    1         3.0       1         30
                      FF                                            28
Elevator (20,10)      1         20    1         4.0       1         26
                      FF                                            23
Gripper (10)          1         10    1         3.8       1         13
                      FF                                            13

Figure 18.4 Results for each iteration of LRW-API in seven deterministic planning domains. For each iteration, we show the walk length $n$ used for learning, along
with the success ratio (SR) and average length (AL) of the learned policy on both
$RW_n$ and $RW_*$. Note that larger SR and smaller AL are better. The final policy
shown in each domain performs above $\tau = 0.9$ SR on walks of length $N = 10{,}000$
(with the exception of Logistics), and further iteration does not improve the performance. For each benchmark we also show the SR and AL of the planner FF on
problems drawn from $RW_*$.
For two domains, Logistics (see footnote 9) and Freecell, our planner is unable to find a policy
with success ratio one on $RW_*$. We believe that this is a result of the limited
knowledge representation we allowed for policies for the following reasons. First,
we ourselves cannot write good policies for these domains within our current
policy language (see footnote 10). Second, the final learned decision lists for Logistics and Freecell
contain a much larger number of more specific rules than the lists learned in the
other domains. This indicates that the learner has difficulty finding general rules
within the language restrictions that are applicable to large portions of training
data, resulting in poor generalization. Third, the success ratio (not shown) for the
sampling-based rollout policy, i.e., the improved policy simulated by Improved-Trajectories, is substantially higher than that for the resulting learned policy that
becomes the policy of the next iteration. This indicates that Learn-Decision-List
is learning a much weaker policy than the sampling-based policy generating its
training data, indicating a weakness in either the policy language or the learning
algorithm. For example, in the Logistics domain, at iteration 8, the training data
for learning the iteration 9 policy is generated by a sampling rollout policy that
achieves success ratio 0.97 on 100 training problems drawn from the same $RW_{45}$
distribution, but the learned iteration 9 policy only achieves success ratio 0.70, as
shown in the figure at iteration 9. Extending our policy language to incorporate
the expressiveness that appears to be required in these domains will require a more
sophisticated learning algorithm, which is a point of future work.
In the remaining domain, the Blocks World, the bootstrapping provided by
increasingly long random walks appears particularly useful. The policies learned
at each of the walk lengths 4, 14, 54, and 334 are increasingly effective on the
target LRW distribution $RW_*$. For walks of length 54 and 334, it takes multiple
iterations to master the provided level of difficulty beyond the previous walk length.
Finally, upon mastering walk length 334, the resulting policy appears to perform
well for any walk length. The learned policy is modestly superior to FF on $RW_*$
in success ratio and average length.
18.6.1.3
In each domain we denote by $\pi_*$ the best learned LRW policy, i.e., the policy, from
each domain, with the highest performance on $RW_*$, as shown in figure 18.4. Figure
18.5 shows the performance of $\pi_*$, in comparison to FF, on the original intended
problem distributions for each of our domains. We measured the success ratio of
both systems by giving a time limit of 100 seconds to solve a problem. Here we
have attempted to select the largest problem sizes previously used in evaluation of
domain-specific planners (either in AIPS-2000 or in Bacchus and Kabanza [4]), as
well as show a smaller problem size for those cases where one of the planners we
show performed poorly on the large size. In each case, we use the problem generators
provided with the domains, and evaluate on 100 problems of each size.
9. In Logistics, the planner generates a long sequence of policies with similar, oscillating
success ratios that are elided from the figure with ellipses for space reasons.
10. For example, in Logistics, one of the important concepts is "the set containing all
packages on trucks such that the truck is in the package's goal city." However, the domain
is defined in such a way that this concept cannot be expressed within the language used
in our experiments.

Domain       Size    $\pi_*$ SR   $\pi_*$ AL   FF SR   FF AL
Blocks       (20)    1            54           0.81    60
             (50)    1            151          0.28    158
Elevator             1            112          1       98
Schedule     (50)    1            175          1       212
Briefcase    (10)    1            30           1       29
             (50)    1            162          0
Gripper      (50)    1            149          1       149

Figure 18.5 Success ratio (SR) and average length (AL) of $\pi_*$ and FF on the original problem distributions.
Overall, these results indicate that our learned, reactive policies are competitive
with the domain-independent planner FF. It is important to remember that these
policies are learned in a domain-independent fashion, and thus LRW-API can
be viewed as a general approach to generating domain-specific reactive planners.
On two domains, Blocks World and Briefcase, our learned policies substantially
outperform FF on success ratio, especially on large domain sizes. On three domains,
Elevator, Schedule, and Gripper, the two approaches perform quite similarly on
success ratio, with our approach superior in average length on Schedule but FF
superior in average length on Elevator.
On two domains, Logistics and Freecell, FF substantially outperforms our learned
policies on success ratio. We believe that this is partly due to an inadequate policy
language, as discussed above. We also believe, however, that another reason for
the poor performance is that the long-random-walk distribution $RW_*$ does not
correspond well to the standard problem distributions. This seems to be particularly
true for Freecell. The policy learned for Freecell (4,2,2,4) achieved a success ratio
of 93% on $RW_*$; however, for the standard distribution it only achieved 36%.
This suggests that $RW_*$ generates problems that are significantly easier than the
standard distribution. This is supported by the fact that the solutions produced
by FF on the standard distribution are on average twice as long as those produced
on $RW_*$. One likely reason for this is that it is easy for random walks to end up
in dead states in Freecell, where no actions are applicable. Thus the random-walk
distribution will typically produce many problems where the goals correspond to
such dead states. The standard distribution on the other hand will not treat such
dead states as goals.
18.6.2
Domain (size)      iter. #   n     RW_n SR   RW_n AL   RW_* SR   RW_* AL
Boxworld (10,5)    1         10    0.73      4.3       0.03      61.5
                   2         10    0.93      2.3       0.13      58.4
                   3         20    0.91      4.4       0.17      55.9
                   4         40    0.96      6.1       0.31      50.4
                   5         170   0.62      30.8      0.25      52.2
                   6         170   0.49      37.9      0.17      55.7
                   7         170   0.63      29.3      0.21      55
                   8         170   0.63      29.1      0.18      55.3
                   9         170   0.48      36.4      0.17      55.3
2.71
2.06
6.41
0.17 168.9
0.84 17.5
1 7.2
1
20
1.7
8.4
11.7
37.5
20.0
0.19
0.81
0.85
0.77
0.95
93.6
40.8
32.7
38.5
21.9
0.95 123
Figure 18.6
18.7
Related Work
Boutilier et al. [10] presented the first exact solution technique for relational MDPs
based on structured dynamic programming. However, a practical implementation
of the approach was not provided, primarily due to the need for the simplification
of first-order logic formulae. These ideas, however, served as the basis for a logic
programming-based system [28] that was successfully applied to blocks world
problems involving simple goals and a simplified logistics world. This style of
approach is inherently limited to domains where the exact value functions or
policies can be compactly represented in the chosen knowledge representation.
Unfortunately, this is not generally the case for the types of domains that we
consider here, particularly as the planning horizon grows. Nevertheless, providing
techniques such as these that directly reason about the MDP model is an important
direction. Note that our API approach essentially ignores the underlying MDP
model, and simply interacts with the MDP simulator as a black box.
An interesting research direction is to consider principled approximations of these
techniques that can discover good policies in more difficult domains. This has been
considered by Guestrin et al. [20], where a class-based MDP and value function
representation was used to compute an approximate value function that could
generalize across different sets of objects. Promising empirical results were shown
in a multiagent tactical battle domain. Presently the class-based representation
does not support some of the representation features that are commonly found in
classical planning domains (e.g., relational facts such as on(a, b) that change over
time), and thus is not directly applicable in these contexts. However, extending
this work to richer representations is an interesting direction. Its ability to "reason
globally" about a domain may give it some advantages compared to API.
Our approach is closely related to work in RRL [13], a form of online API that
learns relational value-function approximations. Q-value functions are learned in
the form of relational decision trees (Q-trees) and are used to learn corresponding
policies (P-trees). The RRL results clearly demonstrate the difficulty of learning
value-function approximations in relational domains. Compared to P-trees, Q-trees
tend to generalize poorly and be much larger. RRL has not yet demonstrated
scalability to problems as complex as those considered here; previous RRL blocks
world experiments include relatively simple goals (see footnote 11), which lead to value functions
that are much less complex than the ones here. For this reason, we suspect that
RRL would have difficulty in the domains we consider precisely because of the value-function approximation step that we avoid; however, this needs to be experimentally
tested.
We note, however, that our API approach has the advantage of using an unconstrained simulator, whereas RRL learns from irreversible world experience
(pure reinforcement learning). By using a simulator, we are able to estimate the
Q-values for all actions at each training state, providing us with rich training data.
Without such a simulator, RRL is not able to directly estimate the Q-value for
each action in each training state; thus, RRL learns a Q-tree to provide estimates
of the Q-value information needed to learn the P-tree. In this way, value-function
learning serves a more critical role when a simulator is unavailable. We believe that
in many relational planning problems, it is possible to learn a model or simulator
from world experience; in this case, our API approach can be incorporated as the
planning component of RRL. Otherwise, finding ways to either avoid learning or to
more effectively learn relational value functions in RRL is an interesting research
direction.
Researchers in classical planning have long studied techniques for learning to
improve planning performance. For a collection and survey of work on learning
for planning domains, see [39, 53]. Two primary approaches are to learn domain-specific control rules for guiding search-based planners (e.g., see [40, 48, 14, 26,
2, 1]), and, more closely related, to learn domain-specific reactive control policies
[29, 33, 51].
Regarding the latter, our work is novel in using API to iteratively improve stand-alone control policies. Regarding the former, in theory, search-based planners can
11. The most complex blocks world goal for RRL was to achieve on(A, B) in an n block
environment. We consider blocks world goals that involve all n blocks.
18.8
12. Here the initial state distribution is dictated by the policies at previous time steps,
which are held fixed. Likewise the actions selected along the rollout trajectories are dictated
by policies at future time steps, which are also held fixed.
have seen that limitations of our current policy language and learner are partly
responsible for some of the failures of our system. In such cases, we must either (1)
depend on the human to provide useful features to the system, or (2) extend the
policy language and develop more advanced learning techniques. Policy-language
extensions that we are considering include various extensions to the knowledge representation used to represent sets of objects in the domain (in particular, for route
finding in maps/grids), as well as non-reactive policies that incorporate search into
decision making.
As we consider ever more complex planning domains, it is inevitable that our
brute-force enumeration approach to learning policies from trajectories will not
scale. Presently our policy learner, as well as the entire API technique, makes no
attempt to use the definition of a domain when one is available. We believe that
developing a learner that can exploit this information to bias its search for good
policies is an important direction of future work. Recently, Gretton and Thiebaux
[19] have taken a step in this direction by using logical regression (based on a
domain model) to generate candidate rules for the learner. Developing tractable
variations of this approach is a promising research direction. In addition, exploring
other ways of incorporating a domain model into our approach and other model-blind approaches is critical. Ultimately, scalable AI planning systems will need
to combine experience with stronger forms of explicit reasoning.
Acknowledgments
We thank Lin Zhu for originally suggesting the idea of using random walks for
bootstrapping. This work was supported in part by NSF grants 9977981-IIS and
0093100-IIS.
References
[1] R. Aler, D. Borrajo, and P. Isasi. Using genetic programming to learn and
improve control knowledge. Artificial Intelligence, 141(1-2):29–56, 2002.
[2] J. Ambite, C. Knoblock, and S. Minton. Learning plan rewriting rules. In
Proceedings of the International Conference on Artificial Intelligence Planning
and Scheduling Systems, 2000.
[3] F. Bacchus. The AIPS '00 planning competition. AI Magazine, 22(3):57–62,
2001.
[4] F. Bacchus and F. Kabanza. Using temporal logics to express search control
knowledge for planning. Artificial Intelligence, 16:123–191, 2000.
[5] J. Bagnell, S. Kakade, A. Ng, and J. Schneider. Policy search by dynamic
programming. In Proceedings of Neural Information Processing Systems, 2003.
(2/3):141–160, 2002.
[38] S. Minton. Quantitative results concerning the utility of explanation-based
learning. In National Conference on Artificial Intelligence, 1988.
[39] S. Minton, editor. Machine Learning Methods for Planning. Morgan Kaufmann, San Francisco, CA, 1993.
[40] S. Minton, J. Carbonell, C. A. Knoblock, D. R. Kuokka, O. Etzioni, and
Y. Gil. Explanation-based learning: A problem solving perspective. Artificial
Intelligence, 40:63–118, 1989.
[41] B. K. Natarajan. On learning from exercises. In Annual Workshop on
Computational Learning Theory, 1989.
[45] G. Tesauro. Practical issues in temporal difference learning. Machine Learning, 8:257–277, 1992.
[46] G. Tesauro and G. Galperin. On-line policy improvement using Monte-Carlo
search. In Conference on Advances in Neural Information Processing, 1996.
[47] J. Tsitsiklis and B. Van Roy. Feature-based methods for large scale DP.
Machine Learning, 22:59–94, 1996.
[48] M. Veloso, J. Carbonell, A. Perez, D. Borrajo, E. Fink, and J. Blythe.
Integrating planning and learning: The PRODIGY architecture. Journal of
Experimental and Theoretical AI, 7(1):81–120, 1995.
[49] G. Wu, E. Chong, and R. Givan. Congestion control via online sampling. In
Infocom, 2001.
[50] X. Yan, P. Diaconis, P. Rusmevichientong, and B. Van Roy. Solitaire: Man
versus machine. In Proceedings of Neural Information Processing Systems,
2004.
[51] S. Yoon, A. Fern, and R. Givan. Inductive policy selection for first-order
MDPs. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2002.
[52] H. Younes. Extending PDDL to model stochastic decision processes. In Proceedings of the International Conference on Automated Planning and Scheduling Workshop on PDDL, 2003.
[53] T. Zimmerman and S. Kambhampati. Learning-assisted automated planning:
Looking back, taking stock, going forward. AI Magazine, 24(2):73–96, 2003.
Traditionally, information extraction (IE) systems treat separate potential extractions as independent. There are, however, cases when modeling the influences between different potential extractions could improve overall accuracy. In this chapter, we use the framework of relational Markov networks (RMNs) in order to model
several specific relationships between candidate extractions. Inference and learning using this graphical model allow for collective information extraction in a
way that exploits the mutual influence between possible extractions. Experiments
on learning to extract protein names from biomedical abstracts demonstrate the
advantage of this approach over existing IE methods.
19.1
Introduction
Understanding natural language presents many challenging problems that lend
themselves to statistical relational learning (SRL). Historically, both logical and
probabilistic methods have found wide application in natural language processing
(NLP). NLP inevitably involves reasoning about an arbitrary number of entities
(people, places, and things) that have an unbounded set of complex relationships
between them. Representing and reasoning about unbounded sets of entities and
relations has generally been considered a strength of predicate logic. However, NLP
also requires integrating uncertain evidence from a variety of sources in order to resolve numerous syntactic and semantic ambiguities. Effectively integrating multiple
sources of uncertain evidence has generally been considered a strength of Bayesian
probabilistic methods and graphical models. Consequently, NLP problems are particularly suited for SRL methods that combine the strengths of first-order predicate
logic and probabilistic graphical models. In this chapter, we review our recent work
[4] on using relational Markov networks (RMNs) [30] for information extraction,
the problem of identifying phrases in natural language text that refer to specific
types of entities [7]. We use the expressive power of RMNs to represent and reason
about several specific relationships between candidate entities and thereby collectively identify the appropriate set of phrases to extract. We present experiments
on learning to extract protein names from biomedical text that demonstrate the
advantage of this approach over existing information extraction methods.
The remainder of the chapter is organized as follows. In section 19.2, we review
the history of logical and probabilistic approaches to NLP, and discuss the unique
suitability of SRL for NLP. Section 19.3 introduces the problem of information
extraction, followed by section 19.4, where we summarize our work on collective information extraction using RMNs. In section 19.5, we examine challenging problems
for future research on SRL for NLP. In section 19.6, we present our conclusions.
19.2
holds between adjacent items. Many NLP tasks, such as POS tagging, phrase
chunking [24], and information extraction (e.g., named entity tagging), can be
viewed as sequence labeling problems in which each word is assigned one of a
small number of class labels. The label of each word typically depends on the labels
of adjacent words in the sentence and collective inference must be performed to
assign the overall most probable combination of labels to all of the words in the
sentence. Statistical sequence models such as hidden Markov models (HMMs) [22]
or conditional random fields (CRFs) [18] are used to model the data and some form
of the Viterbi dynamic programming algorithm [31] is used to efficiently perform the
collective classification. However, in order to develop systems that accurately and
robustly perform natural language analysis, we believe that more advanced SRL
methods are needed. In this chapter, we explore the application of an alternative
SRL method to the natural language task of information extraction. We introduce
the task in the following section and then present our recent SRL approach.
19.3
Information Extraction
Information extraction, locating references to specific types of items in natural
language documents, is an important task with many practical applications. Typical examples include identifying various named entities such as names of people, companies, and locations. In this chapter, we consider the task of identifying names of human proteins in abstracts of biomedical journal articles. Figure 19.1 shows part of a sample abstract highlighting the protein names to be
identified. This task is an important part of mining the scientific literature in
order to build structured databases of existing biological knowledge. In particular, by mining 753,459 abstracts on the human organism from the Medline
repository (http://www.ncbi.nlm.nih.gov/entrez/) we have extracted a database
of 6580 interactions among 3737 human proteins. The details of this database have
been published in the biological literature [23] and it is available on the web at
http://bioinformatics.icmb.utexas.edu/idserve.
Figure 19.1
19.4
it too is a protein name. The same name rpL22 occurs later in the abstract in
contexts which do not indicate so clearly the entity type; however, we can use the
fact that repetitions of the same name tend to have the same type inside the same
document.
Figure 19.2
The capitalization pattern of the name itself is another useful indicator; nevertheless it is not sufficient by itself, as similar patterns are also used for other types
of biological entities such as cell types or amino acids. Therefore, correlations between the labels of repeated phrases, or between acronyms and their long form can
provide additional useful information. Our intuition is that a method that could use
this kind of information would show an increase in performance, especially when
doing extraction from biomedical literature, where phenomena like repetitions and
acronyms are pervasive. This type of document-level knowledge can be captured
using relational Markov networks (RMNs), a version of undirected graphical models
which have already been successfully used to improve the classification of hyperlinked webpages [30].
The rest of this section is organized as follows. In sections 19.4.1 and 19.4.2 we
describe the input to our named entity extractor in terms of a set of candidate
entities and their features. Subsequent sections introduce the RMN framework for
entity recognition (representation, inference, and learning), ending with experimental results in section 19.4.8.
19.4.1
Candidate Entities
various heuristics that can significantly reduce the size of the candidate set; two of
these are listed below:
H1: In general, named entities have limited length. Therefore, one simple way of
creating the set of candidate phrases is to compute the maximum length of all
annotated entities in the training set, and then consider as candidates all word
sequences whose length is up to this maximum length. This is also the approach
followed in SRV [12].
H2: In the task of extracting protein names from Medline abstracts, we noticed
that, like most entity names, almost all proteins in our data are base noun
phrases (NPs) or parts of them. Therefore, such substrings are used to determine
candidate entities. To avoid missing options, we adopt a very broad definition of
base NP: a maximal contiguous sequence of tokens with their POS restricted to
nouns, gerund verbs, past participle verbs, adjectives, numbers, and dashes. The
complete set of POS tags is {JJ, VBN, VBG, POS, NN, NNS, NNP, NNPS, CD,
-} (using the treebank notation [20]). Also, the last word (the head) of a base
NP is constrained to be either a noun or a number. Candidate extractions then
consist of base NPs, together with all their contiguous subsequences headed by a
noun or number.
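Heuristic H1 can be sketched in a few lines. This is a minimal illustration, assuming a document is already tokenized; the function name `candidate_spans` and the `(start, end)` span convention (exclusive end) are hypothetical choices, not taken from the chapter.

```python
def candidate_spans(tokens, max_len):
    """Return (start, end) index pairs, end exclusive, for every
    contiguous word sequence of length 1..max_len (heuristic H1)."""
    spans = []
    for start in range(len(tokens)):
        # Cap the end index at the sentence boundary.
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            spans.append((start, end))
    return spans


tokens = ["the", "HDAC1", "enzyme"]
for start, end in candidate_spans(tokens, max_len=2):
    print(" ".join(tokens[start:end]))
```

In practice `max_len` would be set to the longest annotated entity in the training set, as H1 describes. H2 would further filter these spans using POS tags, which this sketch does not attempt.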
19.4.2
Entity Features
The set of features associated with each candidate is based on the feature templates
introduced in [9], used there for training a reranking algorithm on the extractions
returned by a maximum-entropy tagger. Many of these features use the concept
of word type, which allows a different form of token generalization than POS tags.
The short type of a word is created by replacing any maximal contiguous sequences
of capital letters with A, of lowercase letters with a, and of digits with 0. For
example, the word TGF-1 would be mapped to type A-0.
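The short-type mapping is easy to express with regular expressions. The sketch below is one possible reading of the definition above; the function name `short_type` is an assumption, and only ASCII letters and digits are handled.

```python
import re


def short_type(word):
    """Collapse each maximal run of capitals to 'A', of lowercase
    letters to 'a', and of digits to '0'; other characters are kept."""
    word = re.sub(r"[A-Z]+", "A", word)
    word = re.sub(r"[a-z]+", "a", word)
    return re.sub(r"[0-9]+", "0", word)


print(short_type("TGF-1"))   # the example word from the text
print(short_type("HDAC1"))
```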
Consequently, each token position i in a candidate extraction provides three types
of information: the word itself $w_i$, its POS tag $t_i$, and its short type $s_i$. The full
set of feature types is listed in table 19.1, where we consider a generic candidate
extraction as a sequence of n + 1 words $w_0 w_1 \ldots w_n$.
Each feature template instantiates numerous features. For example, the candidate
extraction HDAC1 enzyme has the headword HD=enzyme, the short type ST=A0_a,
the prefixes PF=A0 and PF=A0_a, and the suffixes SF=a and SF=A0_a. All other
features depend on the left or right context of the entity. Feature values that occur
fewer than three times in the training data are filtered out.
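The context-free features of the HDAC1 enzyme example can be reproduced with a short sketch. The helper names (`short_type`, `entity_features`) and the use of an underscore to join short types (as in the labels of figure 19.3) are assumptions made for illustration; the context-dependent templates (bigrams, trigrams, POS left/right) and the frequency filter are omitted.

```python
import re


def short_type(word):
    """Collapse runs of capitals, lowercase letters, and digits."""
    word = re.sub(r"[A-Z]+", "A", word)
    word = re.sub(r"[a-z]+", "a", word)
    return re.sub(r"[0-9]+", "0", word)


def entity_features(words):
    """Head, short-type, prefix, and suffix features of a candidate."""
    types = [short_type(w) for w in words]
    feats = {"HD=" + words[-1], "ST=" + "_".join(types)}
    for k in range(1, len(types) + 1):
        feats.add("PF=" + "_".join(types[:k]))   # first k short types
        feats.add("SF=" + "_".join(types[-k:]))  # last k short types
    return feats


print(sorted(entity_features(["HDAC1", "enzyme"])))
```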
19.4.3
Table 19.1 Feature templates
Description                     Feature template
Text / head                     $w_0 w_1 \ldots w_n$ / $w_n$
Short type                      $s_0 s_1 \ldots s_n$
Bigram left (4 bigrams)         $z_{-1} z_0$, where $z \in \{w, s\}$
Bigram right (4 bigrams)        $z_n z_{n+1}$, where $z \in \{w, s\}$
Trigram left (8 trigrams)       $z_{-2} z_{-1} z_0$, where $z \in \{w, s\}$
Trigram right (8 trigrams)      $z_n z_{n+1} z_{n+2}$, where $z \in \{w, s\}$
POS left                        $t_{-1}$
POS right                       $t_{n+1}$
Prefix (n+1 prefixes)           $s_0$, $s_0 s_1$, ..., $s_0 s_1 \ldots s_{n+1}$
Suffix (n+1 suffixes)           $s_n$, $s_{n-1} s_n$, ..., $s_0 s_1 \ldots s_{n+1}$
...
predefined set of Boolean attributes e.F (section 19.4.2), the same for all candidate
entities. One particular attribute is e.label, which is set to 1 if e is considered a valid
extraction, and 0 otherwise. In this document model, labels are the only hidden
variables, and the inference procedure will try to find a most probable assignment
of values to labels, given the current model parameters and the values of all other
variables.
Each document is associated with a factor graph [17], which is a bipartite graph
containing two types of nodes:
Variable nodes correspond directly to the labels of all candidate entities in the
document.
Potential nodes model the correlations between two or more entity attributes.
For each such correlation, a potential node is created that is linked to all variable
nodes involved. This is equivalent to creating a clique in the corresponding Markov
random field.
The types of correlations captured by factor graphs (see figure 19.4 for some
examples) are specified by matching clique templates against the entire set of
candidate entities d.E. A clique template is a procedure that finds all subsets of
entities satisfying a given constraint, after which, for each entity subset, it connects
through a potential node all the variable nodes corresponding to a selected set of
attributes. Formally, there is a set of clique templates C, with each template $c \in C$
specified by:
1. A matching operator $M_c$ for selecting subsets of entities, $M_c(E) \subseteq 2^E$.
2. A selected set of features $S_c = \langle X_c, Y_c \rangle$, the same for all subsets of entities returned
by the matching operator. $X_c$ denotes the observed features, while $Y_c$ refers to
the hidden labels.
3. A clique potential $\phi_c$ which gives the compatibility of each possible configuration
of values for the features in $S_c$, s.t. $\phi_c(s) \geq 0$, $\forall s \in S_c$.
$$P(d.Y \mid d.X) = \frac{1}{Z(d.X)} \prod_{c \in C} \prod_{G \in M_c(d.E)} \phi_c(G.X_c, G.Y_c) \qquad (19.2)$$
There are two problems that need to be addressed when working with RMNs:
1. Inference Usually, two types of quantities are needed from an RMN model:
The marginal distribution for a hidden variable, or for a subset of hidden
variables in the graphical model.
The most probable assignment of values to all hidden variables in the model.
2. Learning As the structure of the RMN model is already defined by its clique
templates, learning refers to finding the clique potentials that maximize the
likelihood over the training data. Inference is usually performed multiple times
during the learning algorithm, which means that an accurate, fast inference
procedure is doubly important.
The actual algorithms used for inference and learning will be described in
sections 19.4.6 and 19.4.7 respectively.
19.4.4
As described in the previous section, the role of local clique templates is to model
correlations between an entity's observed features (see table 19.1) and its label. For
each binary feature f we introduce a local template $LT_f$. Given a candidate entity
e, with the observed feature e.f = 1, the template $LT_f$ creates a potential node
linked to the variable node e.label. As an example, figure 19.3 shows the part of the
factor graph that is generated around the entity label for HDAC1 enzyme, with
potential nodes for the head feature (HD), prefix features (PF), and suffix features
(SF). Variable nodes are shown as empty circles and potential nodes
as black squares. The potential $\phi_f$ associated with all potential nodes created by
template $LT_f$ consists of a 1 × 2 table, as e.f is known to be 1, and e.label
has cardinality 2 (assuming only one entity type is to be extracted, we need only
two values for the label attribute).
Figure 19.3 Factor graph around the variable node e.label, with potential nodes for HD=enzyme, PF=A0, PF=A0_a, SF=a, and SF=A0_a.
19.4.5
In figure 19.4 we show the factor graphs created by these global templates, each of
which is explained in the following sections.
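Figure 19.4 Factor graphs for the global clique templates: (a) overlap factor graph, with potential OT linking u and v; (b) repeat template, with potential RT linking u, v, and the OR nodes over u1, ..., un and v1, ..., vm; (c) acronym template, with potential AT linking v and the OR node over u1, ..., un.
19.4.5.1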
The definition of a candidate extraction from section 19.4.1 leads to many overlapping entities. For example, glutathione S - transferase is a base NP, and it generates five
candidate extractions: glutathione, glutathione S, glutathione S - transferase, S - transferase,
and transferase. If glutathione S - transferase has label-value 1, the other four entities
should all have label-value 0, because they overlap with it.
This type of constraint is enforced by the overlap template by creating a potential
node for each pair of overlapping entities and connecting it to their label nodes,
as shown in figure 19.4(a). To avoid clutter, all entities in this and subsequent
factor graphs stand for their corresponding labels. The potential function $\phi_{OT}$ is
manually set so that at most one of the overlapping entities can have label-value 1,
as illustrated in table 19.2.
Table 19.2 Overlap potential
$\phi_{OT}$          e2.label = 0    e2.label = 1
e1.label = 0         1               1
e1.label = 1         1               0
Continuing with the previous example, because glutathione S and S - transferase are
two overlapping entities, the factor graph model will contain an overlap potential
node connected to the label nodes of these two entities.
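The overlap constraint can be written down directly as a function of two labels. The sketch below assumes token spans with an exclusive end index and assumes potential values of 1 for permitted configurations and 0 for the forbidden one (any positive constant would express the same hard constraint); the function names are hypothetical.

```python
def spans_overlap(a, b):
    """True if two (start, end) token spans, end exclusive, share a token."""
    return a[0] < b[1] and b[0] < a[1]


def overlap_potential(label1, label2):
    """Hard constraint: zero potential only when both overlapping
    entities are labeled 1, so at most one of them can be extracted."""
    return 0.0 if (label1 == 1 and label2 == 1) else 1.0


# Tokens: glutathione(0) S(1) -(2) transferase(3)
glutathione_s = (0, 2)      # "glutathione S"
s_transferase = (1, 4)      # "S - transferase"
if spans_overlap(glutathione_s, s_transferase):
    for y1 in (0, 1):
        for y2 in (0, 1):
            print(y1, y2, overlap_potential(y1, y2))
```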
19.4.5.2
We could specify the potential for the repeat template in a 2 × 2 table, this time
leaving the table entries to be learned, given that assigning the same label to
repetitions is not a hard constraint. However, we can do better by noting that
the vast majority of cases where a repeated protein name is not also tagged as a
protein happens when it is part of a larger phrase that is tagged. For example,
HDAC1 enzyme is a protein name, therefore HDAC1 is not tagged in this phrase,
even though it may have been tagged previously in the abstract where it was not
followed by enzyme. We need a potential that allows two entities with the same
text to have different labels if the entity with label-value 0 is inside another entity
with label-value 1. But a candidate entity may be inside more than one including
entity, and the number of including entities may vary from one candidate extraction
to another. Using the example from section 19.4.5.1, the candidate entity glutathione
is included in two other entities: glutathione S and glutathione S - transferase.
In order to instantiate potentials over a variable number of label nodes, we
introduce a logical OR clique template that matches a variable number of entities.
When this template matches a subset of entities $e_1, e_2, \ldots, e_n$, it will create an
auxiliary OR entity $e_{OR}$, with a single attribute $e_{OR}$.label. The potential function
$\phi_{OR}$ is manually set so that it assigns a nonzero potential only when $e_{OR}.label = e_1.label \lor e_2.label \lor \ldots \lor e_n.label$. The potential nodes are only created as needed, e.g.,
when the auxiliary OR entity is required by repeat and acronym clique templates.
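The logical OR potential over a variable number of labels can be sketched as follows. The value 1 for consistent configurations is an assumption (the text only requires it to be nonzero), and `or_potential` is a hypothetical name.

```python
def or_potential(or_label, member_labels):
    """Nonzero (here 1.0, by assumption) only when the auxiliary OR label
    equals the logical OR of the member entity labels; 0.0 otherwise."""
    return 1.0 if or_label == int(any(member_labels)) else 0.0


# glutathione is included in "glutathione S" and "glutathione S - transferase";
# suppose only the longer phrase is labeled 1.
including_labels = [0, 1]
print(or_potential(1, including_labels))  # consistent configuration
print(or_potential(0, including_labels))  # inconsistent configuration
```

Because the function accepts a list of any length, a single definition serves every instantiation of the template, which is the point of introducing the auxiliary OR entity.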
Figure 19.4(b) shows the factor graph for a sample instantiation of the repeat
template using the OR template. Here, u and v represent two same-text entities, $u_1, u_2, \ldots, u_n$
are all entities that include u, and $v_1, v_2, \ldots, v_m$ are entities that include v.
The potential function $\phi_{RT}$ can either be manually preset to prohibit unlikely label
configurations, or it can be learned to represent an appropriate soft constraint. In
our experiments, it was learned since this gave slightly better performance.
Following the previous example, suppose that the word glutathione occurs inside
two base NPs in the same document, glutathione S - transferase and glutathione antioxidant system. Then the first occurrence of glutathione will be associated with the entity
u, and correspondingly its including entities will be u1 = glutathione S and u2 =
glutathione S - transferase. Similarly, the second occurrence of glutathione will be associated with the entity v, with the corresponding including entities v1 = glutathione
antioxidant and v2 = glutathione antioxidant system.
19.4.5.3
One approach to the acronym template would be to use an extant algorithm for
identifying acronyms and their long forms in a document, and then define a potential
function that would favor label configurations in which both the acronym and its
definition have the same label. One such algorithm is described by Schwartz and
Hearst [27], achieving a precision of 96% at a recall rate of 82%. However, because
this algorithm would miss a significant number of acronyms, we have decided to
implement a softer version as follows: detect all situations in which a single word is
enclosed between parentheses, such that the word length is at least 2 and it begins
with a letter. Let v denote the corresponding entity. Let u1 , u2 , ..., un be all entities
that end exactly before the open parenthesis. If this is a situation in which v is an
acronym, then one of the entities ui is its corresponding long form. Consequently,
we use a logical OR template to introduce the auxiliary entity $u_{OR}$, and connect it
to the label node of v through an acronym potential $\phi_{AT}$, as illustrated in figure 19.4(c).
For example, consider the phrase the antioxidant superoxide dismutase - 1 ( SOD1 ).
SOD1 satisfies our criteria for acronyms, thus it will be associated with the entity v
in figure 19.4(c). The candidate long forms are u1 = antioxidant superoxide dismutase - 1, u2 = superoxide dismutase - 1, and u3 = dismutase - 1.
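The soft acronym matcher described above can be sketched as a scan over tokens. This is an illustrative reading, assuming tokenized text in which parentheses are separate tokens and assuming the candidate entities are given as `(start, end)` spans with exclusive end; the function name and the sample candidate list are hypothetical.

```python
def acronym_matches(tokens, candidate_spans):
    """For each single word enclosed in parentheses that has length >= 2
    and begins with a letter, return its index together with all candidate
    spans that end exactly before the open parenthesis."""
    matches = []
    for i in range(1, len(tokens) - 1):
        word = tokens[i]
        if (tokens[i - 1] == "(" and tokens[i + 1] == ")"
                and len(word) >= 2 and word[0].isalpha()):
            paren = i - 1  # index of "(" equals the exclusive end of a long form
            long_forms = [s for s in candidate_spans if s[1] == paren]
            matches.append((i, long_forms))
    return matches


tokens = "the antioxidant superoxide dismutase - 1 ( SOD1 )".split()
# Hypothetical candidate entities for this phrase.
candidates = [(1, 6), (2, 6), (3, 6), (1, 4), (7, 8)]
for idx, spans in acronym_matches(tokens, candidates):
    print(tokens[idx], [" ".join(tokens[s:e]) for s, e in spans])
```

The returned long-form spans play the role of u1, ..., un; a logical OR node over their labels would then be linked to the label of the acronym.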
19.4.6
In our setting, given the clique potentials, the inference step for the factor graph
associated with a document involves computing the most probable assignment of
values to the hidden labels of all candidate entities:
$$d.Y^* = \arg\max_{d.Y} P(d.Y \mid d.X) \qquad (19.3)$$
Following a maximum likelihood estimation, we shall use the log-linear representation of potentials:
$$\phi_c(G.X_c, G.Y_c) = \exp\{w_c f_c(G.X_c, G.Y_c)\} \qquad (19.4)$$
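To make the argmax of equation (19.3) and the log-linear form of equation (19.4) concrete, the sketch below scores every joint labeling of a tiny factor graph by the product of its potentials and returns the best one. Exhaustive enumeration is used here only as a stand-in that is feasible for a handful of variables; it is not the inference algorithm of the chapter. The weights, the choice of feature value $f = y$ (the feature fires when the label is 1), and all function names are assumptions.

```python
import itertools
import math


def log_linear(weight):
    """Local potential exp(w * f) with the assumed feature f(y) = y."""
    return lambda y: math.exp(weight * y)


def overlap(y1, y2):
    """Hard overlap constraint: both labels cannot be 1."""
    return 0.0 if (y1 == 1 and y2 == 1) else 1.0


def map_assignment(num_vars, factors):
    """Brute-force argmax over binary labelings of the product of
    potentials; factors is a list of (variable_indices, function)."""
    best, best_score = None, -1.0
    for labels in itertools.product((0, 1), repeat=num_vars):
        score = 1.0
        for indices, fn in factors:
            score *= fn(*(labels[i] for i in indices))
        if score > best_score:
            best, best_score = labels, score
    return best


# Two overlapping candidates with hypothetical local weights 1.0 and 2.0.
factors = [((0,), log_linear(1.0)), ((1,), log_linear(2.0)), ((0, 1), overlap)]
print(map_assignment(2, factors))
```

The normalization constant Z(d.X) is omitted because it does not affect which labeling attains the maximum.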
parameters w.
[Equation 19.5: L(w, D), written as sums over documents d ∈ D and label assignments d.Y of the features $f_c(d.X, d.Y)$]
The voted perceptron algorithm is detailed in table 19.3. At each step i in the
Table 19.3
Experimental Results
We have tested the RMN approach on two data sets that have been hand-tagged for
human protein names. The first data set is Yapex,¹ which consists of 200 Medline
abstracts. The second data set is Aimed,² which consists of 225 Medline abstracts
we previously annotated for evaluating systems that extract both human proteins
and their interactions [6].
1. URL:www.sics.se/humle/projects/prothalt/
2. URL:ftp.cs.utexas.edu/mooney/bio-data/
Yapex
Method      Precision   Recall   F-measure
LT-RMN      70.79       53.81    61.14
GLT-RMN     69.71       65.76    67.68
CRF         72.45       58.64    64.81

Aimed
Method      Precision   Recall   F-measure
LT-RMN      81.33       72.79    76.82
GLT-RMN     82.79       80.04    81.39
CRF         85.37       75.90    80.36
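The F-measure column is consistent with the balanced F-measure, the harmonic mean of precision and recall. The short check below assumes that reading (the chapter does not spell out the formula here) and recomputes the column from the reported precision and recall; small differences in the last digit are expected because the reported inputs are already rounded.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


results = {
    ("Yapex", "LT-RMN"): (70.79, 53.81),
    ("Yapex", "GLT-RMN"): (69.71, 65.76),
    ("Yapex", "CRF"): (72.45, 58.64),
    ("Aimed", "LT-RMN"): (81.33, 72.79),
    ("Aimed", "GLT-RMN"): (82.79, 80.04),
    ("Aimed", "CRF"): (85.37, 75.90),
}
for (data_set, method), (p, r) in results.items():
    print(data_set, method, round(f_measure(p, r), 2))
```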
19.5
a system would create massive collective inference problems and would require
efficient SRL methods that could scale to very large networks.
19.6
Conclusions
The area of natural language processing includes many problems that lend themselves to SRL methods. Most existing statistical methods in NLP, such as HMMs,
sequence CRFs, and probabilistic context-free grammars are actually restrictive
forms of SRL. More general SRL techniques have advantages over these existing
methods and hold the promise of improving results on a number of difficult NLP
problems. In this chapter, we have reviewed our research on applying SRL techniques to information extraction. By using RMNs to capture dependencies between
distinct candidate extractions in a document, we achieved improved results on identifying names of proteins in biomedical abstracts compared to a traditional CRF.
By using the ability of SRL to integrate disparate sources of evidence to perform
collective inference over complex relational data, robust NLP systems that accurately resolve many interacting ambiguities can hopefully be developed.
Acknowledgments
This research was partially supported by the National Science Foundation under
grants IIS-0325116 and IRI-9704943.
References
[1] J. Allen. Natural Language Understanding. Benjamin/Cummings, Menlo Park,
CA, 1987.
[2] E. Brill. Transformation-based error-driven learning and natural language
processing: A case study in part-of-speech tagging. Computational Linguistics,
21(4):543–565, 1995.
[3] R. Bunescu. Learning for collective information extraction. Technical Report
TR-05-02, Department of Computer Sciences, University of Texas at Austin,
2004.
[4] R. Bunescu and R. J. Mooney. Collective information extraction with relational
Markov networks. In Proceedings of the Annual Meeting of the Association for
Computational Linguistics, 2004.
[5] R. Bunescu and R. J. Mooney. Subsequence kernels for relation extraction.
In Proceedings of the Conference on Neural Information Processing Systems,
2005.
[6] R. Bunescu, R. Ge, R. Kate, E. Marcotte, R. J. Mooney, A. Kumar Ramani,
and Y. Wah Wong. Comparative experiments on learning information extractors for proteins and their interactions. Artificial Intelligence in Medicine (Special Issue on Summarization and Information Extraction from Medical Documents), 33(2):139–155, 2005.
[7] C. Cardie. Empirical methods in information extraction. AI Magazine, 18(4):65–79, 1997.
[8] Kenneth W. Church. A stochastic parts program and noun phrase parser
for unrestricted text. In Proceedings of the Conference on Applied Natural
Language Processing, 1988.
[9] M. Collins. Ranking algorithms for named-entity extraction: Boosting and the
voted perceptron. In Proceedings of the Annual Meeting of the Association for
Computational Linguistics, 2002.
[10] M. Craven and J. Kumlien. Using multiple levels of learning and diverse
evidence sources to uncover coordinately controlled genes. In Proceedings of the
International Conference on Intelligent Systems for Molecular Biology, 1999.
[11] J. Finkel, T. Grenager, and C. Manning. Incorporating non-local information
into information extraction systems by Gibbs sampling. In Proceedings of the
Annual Meeting of the Association for Computational Linguistics, 2005.
[12] D. Freitag. Information extraction from HTML: Application of a general
learning approach. In Proceedings of the National Conference on Artificial
Intelligence, 1998.
[13] Y. Freund and R. Schapire. Large margin classification using the perceptron
algorithm. Machine Learning, 37:277–296, 1999.
[14] J. Hirschberg. Every time I fire a linguist, my performance goes up, and other
myths of the statistical natural language processing revolution. Presented at
the National Conference on Artificial Intelligence, 1998.
[15] F. Jelinek. Statistical Methods for Speech Recognition. MIT Press, Cambridge,
MA, 1998.
[16] F. Jelinek. Continuous speech recognition by statistical methods. Proceedings
of the IEEE, 64(4):532–556, 1976.
[17] F. R. Kschischang, B. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498–519,
2001.
[18] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of
the International Conference on Machine Learning, 2001.
[19] C. Manning and H. Schütze. Foundations of Statistical Natural Language
Processing. MIT Press, Cambridge, MA, 1999.
[20] M. Marcus, B. Santorini, and M. Marcinkiewicz. Building a large annotated
corpus of English: The Penn treebank. Computational Linguistics, 19(2):313–330, 1993.
[21] A. McCallum. Mallet: A machine learning for language toolkit.
http://mallet.cs.umass.edu, 2002.
[22] L. Rabiner. A tutorial on hidden Markov models and selected applications in
speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[23] A. Ramani, R. Bunescu, R. J. Mooney, and E. Marcotte. Consolidating the
set of known human protein-protein interactions in preparation for large-scale
mapping of the human interactome. Genome Biology, 6(5):r40, 2005.
[24] L. Ramshaw and M. Marcus. Text chunking using transformation-based
learning. In Proceedings of the Third Workshop on Very Large Corpora, 1995.
[25] D. Roth and W. Yih. A linear programming formulation for global inference in
natural language tasks. In Proceedings of the Conference on Natural Language
Learning, 2004.
[26] R. Schank and C. Riesbeck. Inside Computer Understanding: Five Programs
plus Miniatures. Lawrence Erlbaum and Associates, Hillsdale, NJ, 1981.
[27] A. Schwartz and M. Hearst. A simple algorithm for identifying abbreviation
definitions in biomedical text. In Proceedings of the Eighth Pacific Symposium
on Biocomputing, 2003.
[28] C. Sutton and A. McCallum. Collective segmentation and labeling of distant
entities in information extraction. In ICML Workshop on Statistical Relational
Learning and Its Connections to Other Fields, 2004.
[29] C. Sutton, K. Rohanimanesh, and A. McCallum. Dynamic conditional random
fields: Factorized probabilistic models for labeling and segmenting sequence
data. In Proceedings of the International Conference on Machine Learning,
2004.
[30] B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for
relational data. In Proceedings of the Conference on Uncertainty in Artificial
Intelligence, 2002.
[31] A. Viterbi. Error bounds for convolutional codes and an asymptotically
optimum decoding algorithm. IEEE Transactions on Information Theory,
13(2):260–269, 1967.
[32] M. Wainwright, T. Jaakkola, and A. Willsky. Tree-based reparameterization
framework for approximate estimation on graphs with cycles. In Proceedings
of the Conference on Neural Information Processing Systems, 2001.
[33] T. Winograd. Understanding Natural Language. Academic Press, Orlando,
FL, 1972.
[34] W. A. Woods. Lunar rocks in natural English: Explorations in natural
language question answering. In Antonio Zampoli, editor, Linguistic Structures
Processing. Elsevier North-Holland, New York, 1977.
[35] D. Zelenko, C. Aone, and A. Richardella. Kernel methods for relation
extraction. Journal of Machine Learning Research, 3:1083–1106, 2003.
Natural language decisions often involve assigning values to sets of variables, representing low-level decisions and context-dependent disambiguation. In most cases
there are complex relationships among these variables representing dependencies
that range from simple statistical correlations to those that are constrained by
deeper structural, relational, and semantic properties of the text.
In this chapter we study a specific instantiation of this problem in the context
of identifying named entities and relations between them in free-form text. Given
a collection of discrete random variables representing outcomes of learned local
predictors for entities and relations, we seek an optimal global assignment to the
variables that respects multiple constraints, including constraints on the type of
arguments a relation can take, and the mutual activity of different relations.
We develop a linear programming formulation to address this global inference
problem and evaluate it in the context of simultaneously learning named entities and
relations. We show that global inference improves stand-alone learning; in addition,
our approach allows us to efficiently incorporate expressive domain and task-specific
constraints at decision time, resulting not only in significant improvements in
accuracy but also in more coherent inference.
20.1
Introduction
In a variety of AI problems there is a need to learn, represent, and reason with
respect to definitions over structured and relational data. Examples include learning
to identify properties of text fragments such as functional phrases and named
entities, identifying relations such as A is the assassin of B in text, learning to
classify molecules for mutagenicity from atom-bond data in drug design, learning
Global Inference for Entity and Relation Identification via a Linear Programming Formulation
Typically, efficient inference procedures in both frameworks rely on dynamic programming (e.g., Viterbi), which works well for sequential data. However, in many
important problems, the structure is more general, resulting in computationally intractable inference. Problems of these sorts have been studied in computer vision,
where inference is generally performed over low-level measurements rather than
over higher-level predictors [22, 3].
This work develops a novel inference with classifiers approach. Rather than being
restricted to sequential data, we study a fairly general setting. The problem is
defined in terms of a collection of discrete random variables representing binary
relations and their arguments; we seek an optimal assignment to the variables
in the presence of the constraints on the binary relations between variables and
the relation types. Following ideas that were developed recently in the context
of approximation algorithms [8], we model inference as an optimization problem,
and show how to cast it in a linear programming (LP) formulation. Using existing
numerical packages, which are able to solve very large LP problems in a very short
time (see footnote 1), inference can be done very quickly.
Our approach could be contrasted with other approaches to sequential inference or to general MRF approaches [21, 35]. The key difference is that in these
approaches, the model is learned globally, under the constraints imposed by the
domain. Our approach is designed to also address cases in which some of the local
classifiers are learned (or acquired otherwise) in other contexts and at other times,
or incorporated as background knowledge. That is, some components of the global
decision need not, or cannot, be trained in the context of the decision problem.
This way, our approach allows the incorporation of constraints into decisions in a
dynamic fashion and can therefore support task-specific inference. The significance
of this is clearly shown in our experimental results.
We develop our model in the context of natural language inference and evaluate
it here on the problem of simultaneously recognizing named entities and relations
between them.
For instance, in the sentence "J. V. Oswald was murdered at JFK after his
assassin, R. U. KFJ shot...", we want to identify the kill (KFJ, Oswald) relation. This task requires making several local decisions, such as identifying named
entities in the sentence, in order to support the relation identification. For example,
it may be useful to identify that Oswald and KFJ are people, and JFK is a location.
This, in turn, may help to identify that a kill action is described in the sentence. At
the same time, the relation kill constrains its arguments to be people (or at least,
not to be locations) and helps to enforce that Oswald and KFJ are likely to be
people, while JFK may not.
1. For example, CPLEX [11] is able to solve a linear programming problem of 13 million
variables within 5 minutes.
In our model, we first learn a collection of local predictors, e.g., entity and
relation identifiers. At decision time, given a sentence, we produce a global decision
that optimizes over the suggestions of the classifiers that are active in the sentence,
known constraints among them and, potentially, domain-specific or task-specific
constraints relevant to the current decision. Although a brute-force algorithm may
seem feasible for short sentences, as the number of entity variables grows, the
computation becomes intractable very quickly. Given $n$ entities in a sentence, there
are $O(n^2)$ possible binary relations between them. Assume that each variable (entity
or relation) can take $l$ labels (none is one of these labels). Thus, there are $l^{n^2}$
possible assignments, which is too large to explicitly enumerate even for a small $n$.
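To make the growth concrete, the following minimal sketch counts the joint assignments for a few sentence sizes. It assumes directed relations between every ordered pair of distinct entities, so the number of variables is n + n(n - 1) = n^2; the function name and the choice of four labels are illustrative only.

```python
def num_assignments(n_entities: int, n_labels: int) -> int:
    """Count joint labelings of n entities plus all directed relations between them."""
    n_relations = n_entities * (n_entities - 1)  # ordered pairs (i, j) with i != j
    n_variables = n_entities + n_relations       # equals n_entities ** 2
    return n_labels ** n_variables


# Illustrative setting: four labels per variable (an assumed value).
for n in (2, 3, 5, 8):
    print(n, num_assignments(n, n_labels=4))
```

Already at eight entities the count is 4^64, which is why explicit enumeration is not an option.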
When evaluated on simultaneous learning of named entities and relations, our
approach not only provides a significant improvement in the predictors' accuracy;
more importantly, it provides coherent solutions. While many statistical methods
make incoherent mistakes (i.e., inconsistency among predictions) that no human
ever makes, our approach, as we show, also significantly improves the quality of
the inference.
The rest of the chapter is organized as follows. Section 20.2 formally defines
our problem and section 20.3 describes the computational approach we propose.
Experimental results are given in section 20.5, including a case study that illustrates
how our inference procedure improves the performance. We introduce some common
inference methods used in various text problems as comparison in section 20.6,
followed by some discussions and conclusions in section 20.7.
20.2
[Figure 20.1: entity nodes E1, E2, E3 and relation nodes R12, R13, R23, R31, R32; each node carries observed properties (Spelling, POS, ...) and a Label that takes one of Label-1, Label-2, ..., Label-n.]
they take values (i.e., labels) that range over a set of entity types $L_E$. The value
assigned to $E_i \in \mathcal{E}$ is denoted $f_{E_i} \in L_E$.
Notice that determining the entity boundaries is also a difficult problem: the
segmentation (or phrase detection) problem [1, 25]. Here we assume it is solved and
given to us as input; thus we only concentrate on classification.
Figure 20.2
Example 20.1
The sentence in figure 20.2 has three entities: E1 = Dole, E2 = Elizabeth, and
E3 = Salisbury, N.C.
A relation is defined by the entities that are involved in it (its arguments). Note
that we only discuss binary relations.
Definition 20.2 Relations
A (binary) relation $R_{ij} = (E_i, E_j)$ represents the relation between $E_i$ and $E_j$, where
$E_i$ is the first argument and $E_j$ is the second. In addition, $R_{ij}$ can range over a set of
relation types $L_R$. We use $\mathcal{R} = \{R_{ij}\}_{\{1 \le i, j \le n;\ i \ne j\}}$ as the set of binary relations on the
entities $\mathcal{E}$ in a sentence. Two special functions $N^1$ and $N^2$ are used to indicate the
argument entities of a relation $R_{ij}$. Specifically, $E_i = N^1(R_{ij})$ and $E_j = N^2(R_{ij})$.
Note that in this definition, the relations are directed (e.g., there are both $R_{ij}$ and
$R_{ji}$ variables). This is because the arguments in a relation often take different roles
and have to be distinguished. Examples of this sort include work for, located in and
live in. If a relation variable $R_{ij}$ is predicted as a mutual relation (e.g., spouse of),
then the corresponding relation $R_{ji}$ should also be assigned the same label. This
additional constraint can be easily incorporated in our inference framework. Also
notice that we simplify the definition slightly by not considering self-relations (e.g.,
$R_{ii}$). This can be relaxed if this type of relation appears in the data.
Example 20.2
In the sentence given in figure 20.2, there are six relations between the entities: R12
= (Dole, Elizabeth), R21 = (Elizabeth, Dole), R13 = (Dole, Salisbury,
N.C.), R31 = (Salisbury, N.C., Dole), R23 = (Elizabeth, Salisbury, N.C.),
and R32 = (Salisbury, N.C., Elizabeth).
We define the types (i.e., classes) of relations and entities as follows.
Definition 20.3 Classes
We denote the set of predefined entity classes and relation classes as $L_E$ and $L_R$
respectively. $L_E$ has one special element, other ent, which represents any unlisted
entity class. Similarly, $L_R$ also has one special element, other rel, which means the
involved entities are irrelevant or the relation class is undefined.
When it is clear from the context, we use $E_i$ and $R_{ij}$ to refer to the entity and
relation, as well as their types (class labels). Note that each relation and entity
variable can take only one class according to definition 20.3. Although two entities
may stand in more than one relation, this seldom occurs in the data. Therefore,
we ignore this issue for now.
Example 20.3
Suppose $L_E$ = { other ent, person, location } and $L_R$ = { other rel, born in,
spouse of }. For the entities in figure 20.2, E1 and E2 belong to person and E3
belongs to location. In addition, relation R23 is born in, R12 and R21 are spouse of.
Other relations are other rel.
Given a sentence, we want to predict the labels of a set $\mathcal{V}$ which consists of two
types of variables: entities $\mathcal{E}$ and relations $\mathcal{R}$. That is, $\mathcal{V} = \mathcal{E} \cup \mathcal{R}$. However, the
class label of a single entity or relation depends not only on its local properties
but also on the properties of other entities and relations. The classification task
is somewhat difficult since the predictions of entity labels and relation labels are
mutually dependent. For instance, the class label of $E_1$ depends on the class label of
$R_{12}$ and the class label of $R_{12}$ also depends on the class label of $E_1$ and $E_2$. While
we can assume that all the data is annotated for training purposes, this cannot
be assumed at evaluation time. We may presume that some local properties, such
as the words or POS tags, are given, but none of the class labels for entities or
relations are.
To simplify the complexity of the interaction within the graph but still preserve
the characteristic of mutual dependency, we abstract this classification problem
in the following probabilistic framework. First, the classifiers are trained independently and used to estimate the probabilities of assigning different labels given the
observation (that is, the easily classified properties in it). Then, the output of the
classifiers is used as a conditional distribution for each entity and relation, given
the observation. This information, along with the constraints among the relations
and entities, is used to make global inference.
In the task of entity and relation recognition, there exist some constraints on the
labels of corresponding relation and entity variables. For instance, if the relation
is live in, then the first entity should be a person, and the second entity should
be a location. The correspondence between the relation and entity variables can be
represented by a bipartite graph. Each relation variable $R_{ij}$ is connected to its first
entity $E_i$, and second entity $E_j$. We define a set of constraints on the outcomes of
the variables in $\mathcal{V}$ as follows.
Definition 20.4 Constraints
A constraint is a function that maps a relation label and an entity label to either
0 or 1 (contradict or satisfy the constraint). Specifically, $C^1 : L_R \times L_E \to \{0, 1\}$
constrains values of the first argument of a relation. $C^2$ is defined similarly and
constrains values of the second argument.
Note that while we define the constraints here as Boolean functions, our formalism allows us to associate weights with constraints and to include statistical
constraints [32]. Also note that we can define a large number of constraints, such
as $C^R : L_R \times L_R \to \{0, 1\}$, which constrains the labels of two relation variables.
For example, we can define a set of constraints on a mutual relation spouse of
as {(spouse of, spouse of) = 1, (spouse of, $l_r$) = 0, and ($l_r$, spouse of) = 0 for any
$l_r \in L_R$, where $l_r \ne$ spouse of}. By enforcing these constraints on a pair of symmetric relation variables $R_{ij}$ and $R_{ji}$, the relation class spouse of will be assigned
to either both $R_{ij}$ and $R_{ji}$ or neither of them. [In fact, as will be clear in section 20.3,
the language used to describe constraints is very rich: linear (in)equalities over $\mathcal{V}$.]
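The Boolean constraints of definition 20.4 can be written down directly as small lookup functions. The sketch below is only illustrative: the label names, the `ARG_TYPES` table, and the function names `c1`, `c2`, and `c_r` are assumptions, and the person/person signature for spouse of is assumed rather than stated in the text.

```python
# Allowed (first-argument, second-argument) entity types per relation label.
# A relation that is absent from the table (here other_rel) accepts any arguments.
ARG_TYPES = {
    "born_in": ("person", "location"),
    "live_in": ("person", "location"),
    "spouse_of": ("person", "person"),  # assumed signature
}


def c1(rel_label: str, ent_label: str) -> int:
    """C^1: 1 if ent_label may fill the first argument of rel_label, else 0."""
    allowed = ARG_TYPES.get(rel_label)
    return int(allowed is None or allowed[0] == ent_label)


def c2(rel_label: str, ent_label: str) -> int:
    """C^2: 1 if ent_label may fill the second argument of rel_label, else 0."""
    allowed = ARG_TYPES.get(rel_label)
    return int(allowed is None or allowed[1] == ent_label)


def c_r(rel_ij: str, rel_ji: str) -> int:
    """C^R for the mutual relation spouse_of: both directions carry it, or neither."""
    return int((rel_ij == "spouse_of") == (rel_ji == "spouse_of"))


print(c1("live_in", "person"), c2("live_in", "person"), c_r("spouse_of", "other_rel"))
```

Keeping the constraints in a table rather than in the learned classifiers is what allows new ones to be added at decision time.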
We seek an inference algorithm that can produce a coherent labeling of entities
and relations in a given sentence. Furthermore, it optimizes an objective function
based on the conditional probabilities or other confidence scores estimated by the
entity and relation classifiers, subject to some natural constraints. Examples of
these constraints include whether specific entities can be the argument of specific
relations, whether two relations can occur together among a subset of entity
variables in a sentence, and any other information that might be available at the
inference time. For instance, suppose it is known that entities A and B represent the
same location; one may like to incorporate an additional constraint that prevents
an inference of the type: "C lives in A; C does not live in B."
We note that a large number of problems can be modeled this way. Examples
include problems such as chunking sentences [25], coreference resolution and sequencing problems in computational biology, and the recently popular problem of
semantic role labeling [5, 6]. In fact, each of the components of our problem here,
namely the separate task of recognizing named entities in sentences and the task of
recognizing semantic relations between phrases, can be modeled this way. However,
20.3
\[
C(f) = \sum_{u \in \mathcal{V}} c_u(f_u) + \sum_{R_{ij} \in \mathcal{R}} \left[ d^1(f_{R_{ij}}, f_{E_i}) + d^2(f_{R_{ij}}, f_{E_j}) \right] \tag{20.1}
\]
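As a rough illustration of the cost function (20.1), the sketch below evaluates it for every labeling of a two-entity sentence and keeps the cheapest one. Several things here are assumptions rather than the chapter's definitions: the assignment cost c_u is taken to be the negative log of a classifier probability, the constraint cost d is taken to be 0 when the argument type is allowed and infinite otherwise, and the probabilities and names are invented for the example. Exhaustive search is used only because the example is tiny.

```python
import itertools
import math

ENTITY_LABELS = ["person", "location"]
RELATION_LABELS = ["other_rel", "live_in"]

# Hypothetical classifier outputs for a sentence with two entities.
p_entity = {
    "E1": {"person": 0.45, "location": 0.55},
    "E2": {"person": 0.20, "location": 0.80},
}
p_relation = {
    ("E1", "E2"): {"other_rel": 0.30, "live_in": 0.70},
    ("E2", "E1"): {"other_rel": 0.90, "live_in": 0.10},
}
ARG_TYPES = {"live_in": ("person", "location")}


def d(rel_label, ent_label, position):
    """Constraint cost: 0 if the argument type is allowed, infinity otherwise."""
    allowed = ARG_TYPES.get(rel_label)
    if allowed is None or allowed[position] == ent_label:
        return 0.0
    return math.inf


def cost(f_ent, f_rel):
    """C(f): assignment costs of all variables plus both constraint costs per relation."""
    total = sum(-math.log(p_entity[e][lab]) for e, lab in f_ent.items())
    for (ei, ej), r in f_rel.items():
        total += -math.log(p_relation[(ei, ej)][r])
        total += d(r, f_ent[ei], 0) + d(r, f_ent[ej], 1)
    return total


def brute_force():
    """Enumerate every joint labeling and return the one with minimal cost."""
    ents, rels = list(p_entity), list(p_relation)
    best = None
    for e_labs in itertools.product(ENTITY_LABELS, repeat=len(ents)):
        for r_labs in itertools.product(RELATION_LABELS, repeat=len(rels)):
            f_ent, f_rel = dict(zip(ents, e_labs)), dict(zip(rels, r_labs))
            c = cost(f_ent, f_rel)
            if best is None or c < best[0]:
                best = (c, f_ent, f_rel)
    return best


best_cost, best_ent, best_rel = brute_force()
print(best_ent, best_rel)
```

In this toy setting the entity classifier alone would label E1 as a location, but the confident live in prediction pulls the joint optimum toward labeling E1 as a person.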
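\[
\begin{aligned}
\min\ & \sum_{E \in \mathcal{E}} \sum_{e \in L_E} c_E(e) \cdot x_{\{E,e\}} + \sum_{R \in \mathcal{R}} \sum_{r \in L_R} c_R(r) \cdot x_{\{R,r\}} \\
& + \sum_{\substack{E_i, E_j \in \mathcal{E} \\ E_i \ne E_j}} \Big[ \sum_{r \in L_R} \sum_{e_1 \in L_E} d^1(r, e_1) \cdot x_{\{R_{ij},r,E_i,e_1\}} + \sum_{r \in L_R} \sum_{e_2 \in L_E} d^2(r, e_2) \cdot x_{\{R_{ij},r,E_j,e_2\}} \Big]
\end{aligned}
\]
subject to:
\[
\begin{aligned}
\sum_{e \in L_E} x_{\{E,e\}} &= 1 && \forall E \in \mathcal{E} && (20.2) \\
\sum_{r \in L_R} x_{\{R,r\}} &= 1 && \forall R \in \mathcal{R} && (20.3) \\
x_{\{E,e\}} &= \sum_{r \in L_R} x_{\{R,r,E,e\}} && \forall E \in \mathcal{E},\ e \in L_E,\ \forall R \in \{R : E = N^1(R) \text{ or } E = N^2(R)\} && (20.4) \\
x_{\{R,r\}} &= \sum_{e \in L_E} x_{\{R,r,E,e\}} && \forall R \in \mathcal{R},\ r \in L_R,\ E = N^1(R) && (20.5) \\
x_{\{R,r\}} &= \sum_{e \in L_E} x_{\{R,r,E,e\}} && \forall R \in \mathcal{R},\ r \in L_R,\ E = N^2(R) && (20.6) \\
x_{\{E,e\}} &\in \{0, 1\} && \forall E \in \mathcal{E},\ e \in L_E && (20.7) \\
x_{\{R,r\}} &\in \{0, 1\} && \forall R \in \mathcal{R},\ r \in L_R && (20.8) \\
x_{\{R,r,E,e\}} &\in \{0, 1\} && \forall R \in \mathcal{R},\ r \in L_R,\ E \in \mathcal{E},\ e \in L_E && (20.9)
\end{aligned}
\]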
Equations (20.2) and (20.3) require that each entity or relation variable can
only be assigned one label. Equations (20.4), (20.5), and (20.6) assure that the
assignment to each entity or relation variable is consistent with the assignment
to its neighboring variables. Equations (20.7), (20.8), and (20.9) are the integral
constraints on these binary variables.
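One way to see what equations (20.2) through (20.6) demand is to encode a concrete labeling as 0/1 indicator variables and check each equality. The following sketch does this for a two-entity sentence; the helper names `encode` and `satisfies`, the label sets, and the example labeling are all hypothetical, and the joint indicator is simply taken as the product of the relation and entity indicators.

```python
def encode(f_ent, f_rel, entity_labels, relation_labels):
    """Turn a labeling into the indicator variables x_{E,e}, x_{R,r}, x_{R,r,E,e}."""
    x_e = {(E, e): int(f_ent[E] == e) for E in f_ent for e in entity_labels}
    x_r = {(R, r): int(f_rel[R] == r) for R in f_rel for r in relation_labels}
    x_re = {}
    for R in f_rel:
        for E in R:  # R is the pair (first argument, second argument)
            for r in relation_labels:
                for e in entity_labels:
                    x_re[(R, r, E, e)] = x_r[(R, r)] * x_e[(E, e)]
    return x_e, x_r, x_re


def satisfies(x_e, x_r, x_re, entities, relations, entity_labels, relation_labels):
    """Check the equality constraints (20.2) through (20.6)."""
    ok = all(sum(x_e[(E, e)] for e in entity_labels) == 1 for E in entities)       # (20.2)
    ok &= all(sum(x_r[(R, r)] for r in relation_labels) == 1 for R in relations)   # (20.3)
    for R in relations:
        for E in R:
            for e in entity_labels:                                                # (20.4)
                ok &= x_e[(E, e)] == sum(x_re[(R, r, E, e)] for r in relation_labels)
            for r in relation_labels:                                              # (20.5), (20.6)
                ok &= x_r[(R, r)] == sum(x_re[(R, r, E, e)] for e in entity_labels)
    return bool(ok)


entity_labels = ["person", "location"]
relation_labels = ["other_rel", "live_in"]
f_ent = {"E1": "person", "E2": "location"}
f_rel = {("E1", "E2"): "live_in", ("E2", "E1"): "other_rel"}

x_e, x_r, x_re = encode(f_ent, f_rel, entity_labels, relation_labels)
consistent = satisfies(x_e, x_r, x_re, list(f_ent), list(f_rel), entity_labels, relation_labels)

# Relabel E1 in x_e only: the joint variables no longer agree with it, so (20.4) fails.
x_e[("E1", "person")], x_e[("E1", "location")] = 0, 1
tampered = satisfies(x_e, x_r, x_re, list(f_ent), list(f_rel), entity_labels, relation_labels)
print(consistent, tampered)
```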
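20.4
\[
\begin{aligned}
x_{\{E,e\}} &\ge 0 && \forall E \in \mathcal{E},\ e \in L_E && (20.10) \\
x_{\{R,r\}} &\ge 0 && \forall R \in \mathcal{R},\ r \in L_R && (20.11) \\
x_{\{R,r,E,e\}} &\ge 0 && \forall R \in \mathcal{R},\ r \in L_R,\ E \in \mathcal{E},\ e \in L_E && (20.12)
\end{aligned}
\]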
When LPR finds a noninteger solution, it splits the problem on the noninteger
variable. For example, suppose variable $x_i$ is fractional in a noninteger solution to
the ILP problem $\min\{cx : x \in S, x \in \{0, 1\}^n\}$, where $S$ is the set of linear constraints. The
ILP problem can be split into two sub-LPR problems, $\min\{cx : x \in S \cap \{x_i = 0\}\}$
and $\min\{cx : x \in S \cap \{x_i = 1\}\}$. Since any feasible solution provides an upper bound
and any LPR solution generates a lower bound, the search tree can be effectively
cut.
One technique that is often combined with branch and bound
is the cutting plane method. When a noninteger solution is given by LPR, it adds a new
linear constraint that makes the noninteger point infeasible, while still keeping the
optimal integer solution in the feasible region. As a result, the feasible region is
closer to the ideal polyhedron, which is the convex hull of feasible integer solutions.
The most well-known cutting plane algorithm is Gomory's fractional cutting plane
method [41], for which it can be shown that only a finite number of additional
constraints are needed. Moreover, researchers have developed different cutting plane
algorithms for different types of ILP problems. One example is [40], which
focuses only on binary ILP problems.
In theory, a search-based strategy may need several steps to find the optimal
solution. However, LPR has generated integer solutions in all the (thousands
of) cases we have experimented with, even though the coefficient matrix in our
problem is not unimodular.
20.5
Experiments
20.5.1
Data Preparation
We annotated the named entities and relations in some sentences from the TREC
documents. In order to effectively observe the interaction between relations and
entities, we chose 1437 sentences (see footnote 4) that have at least one active relation. Among
those sentences, there are 5336 entities, and 19,048 pairs of entities (binary relations). Entity labels include 1685 persons, 1968 locations, 978 organizations, and
705 other ent. Relation labels include 406 located in, 394 work for, 451 orgBased in,
521 live in, 268 kill, and 17,007 other rel. Note that most pairs of entities have no
active relations at all. Therefore, relation other rel significantly outnumbers others.
Examples of each relation label and the constraints between a relation variable and
its two entity arguments are shown in table 20.1.
Table 20.1
Relation       Entity1   Entity2   Example
located in     loc       loc       (New York, US)
work for       per       org       (Bill Gates, Microsoft)
orgBased in    org       loc       (HP, Palo Alto)
live in        per       loc       (Bush, US)
kill           per       per       (Oswald, JFK)
20.5.2
Tested Approaches
4. The data used here is available by following the data link from
http://L2R.cs.uiuc.edu/cogcomp/
5. We collected names of famous places, people, and popular titles from other data sources
in advance.
Table 20.2
Feature    Explanation
icap
acap
incap
suffix
bigram
len
place (see footnote 5)
prof (see footnote 5)
name (see footnote 5)
Table 20.3
Pattern                        Example
arg1 , arg2                    San Jose, CA
arg1 , ... a ... arg2 prof     John Smith, a Starbucks manager ...
in/at arg1 in/at/, arg2        Officials in Perugia in Umbria province said ...
arg2 prof arg1                 CNN reporter David McKinley ...
arg1 ... native of ... arg2    Elizabeth Dole is a native of Salisbury, N.C.
arg1 ... based in/at arg2      ... a manager for Kmart based in Troy, Mich. said ...
Some features in category 3 are the number of words between arg1 and arg2,
whether arg1 and arg2 are the same word, or arg1 is the beginning of the
sentence and has words that consist of all capitalized characters, where arg1 and
arg2 represent the first and second argument entities respectively. Table 20.3
presents some patterns we use.
The learning algorithm used is a regularized variation of the Winnow update rule
incorporated in SNoW [29, 31, 4], a multiclass classifier that is specifically tailored
for large-scale learning tasks. SNoW learns a sparse network of linear functions, in
which the targets (entity classes or relation classes, in this case) are represented
as linear functions over a common feature space. While SNoW can be used as
a classifier and predicts using a winner-take-all mechanism over the activation
value of the target classes, we can also rely directly on the raw activation value
it outputs, which is the weighted linear sum of the active features, to estimate
the posteriors. It can be verified that the resulting values provide a good source
of probability estimation. We use softmax [2] over the raw activation values as
conditional probabilities. Specifically, suppose the number of classes is $n$, and the
raw activation value of class $i$ is $act_i$. The posterior estimation for class $i$ is derived
by the following equation
\[
p_i = \frac{e^{act_i}}{\sum_{1 \le j \le n} e^{act_j}}
\]
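The softmax conversion above can be sketched in a few lines. The activation values are made up for illustration, and subtracting the maximum before exponentiating is a standard numerical safeguard that is not part of the formula in the text; it leaves the result unchanged.

```python
import math


def softmax(activations):
    """Map raw activation values to a probability distribution over the classes."""
    shift = max(activations)  # avoids overflow in exp(); cancels in the ratio
    exps = [math.exp(a - shift) for a in activations]
    total = sum(exps)
    return [v / total for v in exps]


# Hypothetical raw activations for four classes.
probs = softmax([1.2, 0.3, -0.5, 2.0])
print(probs)
```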
In addition to the separate approach, we also test several pipeline models, which
we denote $E \to R$, $R \to E$ and $E \leftrightarrow R$. The $E \to R$ approach first trains the
basic entity classifier (E), which is identical to the entity classifier in the separate
approach. Its predictions on the two entity arguments of a relation are then used
conjunctively as additional features (e.g., person-person or person-location) in
learning the relation classifier (R). Similarly, $R \to E$ first trains the relation classifier
(R); its output is then used as additional features in the entity classifier (E).
For example, the additional feature could be "this entity is predicted as the first
argument of a work for relation." The $E \leftrightarrow R$ model is the combination of the above
two. It uses the entity classifier in the $R \to E$ model and the relation classifier in
the $E \to R$ model as its final classifiers.
Although the true labels of entities and relations are known during training, only
the predicted labels are available during evaluation on new data (and in testing).
Therefore, rather than training the second-stage pipeline classifiers on the available
true labels, we train them on the predictions of the previous stage classifiers. This
way, at test time the classifiers are being evaluated on data of the same type they
were trained on, making the second-stage classifier more tolerant to the mistakes (see footnote 6).
The need to train pipeline classifiers this way has been observed multiple times
in natural language processing (NLP) research, and we also have validated it in
our experiments. For example, when the relation classifier is trained using the true
entity labels, the performance is usually worse than when training it using the
predicted entity labels.
The last approach, omniscient, tests the conceptual upper bound of this entity-relation classification problem. It also trains the two classifiers separately. However,
it assumes that the entity classifier knows the correct relation labels, and similarly
the relation classifier knows the right entity labels as well. This additional information is then used as features in training and testing. Note that this assumption is
unrealistic. Nevertheless, it may give us a hint on how accurate the classifiers with
global inference can be. Finally, we apply the LP-based inference procedure to
the above five models, and observe how it improves the performance.
20.5.3
Results
6. In order to derive similar performance in testing, ideally the previous stage classifier
should be trained using a different corpus. We didn't take this approach because of data
scarcity.
Table 20.4
Approach | person: Rec Prec F1 | location: Rec Prec F1 | organization: Rec Prec F1
is that the omniscient classifiers, which know the correct entity or relation labels,
can still be improved by the inference procedure. This demonstrates the effectiveness of incorporating constraints, even when the learning algorithm may be able to
learn them from the data.
One of the more significant results in our experiments, we believe, is the improvement in the quality of the decisions. As mentioned in section 20.1, incorporating
constraints helps to avoid inconsistency in classification. It is interesting to investigate how often such mistakes happen without global inference, and see how effective
the global inference is.
For this purpose, we define the quality of the decision as follows. For a relation
variable and its two corresponding entity variables, if the labels of these variables are
predicted correctly and the relation is active (i.e., not other rel), then we count it
as a coherent prediction. Quality is then the number of coherent predictions divided
by the sum of coherent and incoherent predictions. When the inference procedure
is not applied, 5% to 25% of the predictions are incoherent. Therefore, the quality
is not always good. On the other hand, our global inference procedure takes the
natural constraints into account, so it never generates incoherent predictions. If the
relation classifier has the correct entity labels as features, a good learner should
learn the constraints as well. As a result, the quality of omniscient is almost as
good as omniscient with inference.
Another experiment we performed is the forced decision test, which boosts the F1
score of the kill relation to 86.2%. In this experiment, we assume that the system
knows which sentences have the kill relation at the decision time, but it does not
know which pair of entities have this relation. We force the system to determine
Table 20.5 Results of the relation classification in different approaches. Experiments are conducted using five-fold cross-validation. Numbers in boldface indicate
that the p-values are smaller than 0.1. Marker symbols indicate significance at the
95% and 99% levels, respectively. Significance tests were computed with a two-tailed
paired t-test.
Approach | located in: Rec Prec F1 | work for: Rec Prec F1 | orgBased in: Rec Prec F1
Separate 53.0 43.3 45.2 41.9 55.1 46.3 35.6 85.4 50.0
Separate w/ Inf 51.6 56.3 50.5 40.1 74.1 51.2 35.7 90.8 50.8
E  R 56.4 52.5 50.7
E  R w/ Inf 55.7 53.2 50.9
Omniscient 62.9 59.5 57.5 50.3 69.4 58.2 50.3 77.9 60.9
Omniscient w/ Inf 62.9 61.9 59.1 50.3 79.2 61.4 50.9 81.7 62.5
Approach | live in: Rec Prec F1 | kill: Rec Prec F1
which of the possible relations in a sentence (i.e., which pair of entities) has this
kill relation by adding the following linear inequality.
\[
\sum_{R \in \mathcal{R}} x_{\{R,\text{kill}\}} \ge 1
\]
This is equivalent to saying that at least one of the relation variables in the sentence
should be labeled as kill. Since this additional constraint applies only to
the sentences in which the kill relation is active, the inference results of other
sentences are not changed. Note that it is a realistic situation (e.g., in the context
of question answering) in that it adds an external constraint, not present at the time
of learning the classifiers, and it evaluates the ability of our inference algorithm to
cope with it. The results confirm our expectations.
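The effect of the forced decision constraint can be sketched on relation variables alone. The example below is a simplification under stated assumptions: entity variables and argument-type constraints are ignored, the cost of a label is taken to be the negative log of an invented classifier probability, and exhaustive enumeration stands in for the LP solver.

```python
import itertools
import math

LABELS = ["other_rel", "kill"]

# Hypothetical relation-classifier outputs for one sentence with three candidate pairs.
p = {
    "R12": {"other_rel": 0.60, "kill": 0.40},
    "R13": {"other_rel": 0.90, "kill": 0.10},
    "R23": {"other_rel": 0.70, "kill": 0.30},
}


def best_labeling(p_relation, force_kill):
    """Cheapest labeling, optionally subject to: at least one relation is kill."""
    names = list(p_relation)
    best_cost, best = math.inf, None
    for labs in itertools.product(LABELS, repeat=len(names)):
        if force_kill and labs.count("kill") < 1:  # sum over R of x_{R,kill} >= 1
            continue
        c = sum(-math.log(p_relation[n][lab]) for n, lab in zip(names, labs))
        if c < best_cost:
            best_cost, best = c, dict(zip(names, labs))
    return best


free = best_labeling(p, force_kill=False)
forced = best_labeling(p, force_kill=True)
print(free)
print(forced)
```

Without the constraint every pair is labeled other rel; with it, the pair whose kill probability is closest to winning is flipped, and the other pairs keep their labels.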
20.5.4
Case Study
Although tables 20.4 and 20.5 clearly demonstrate that the inference procedure
improves the performance, it is interesting to see how it corrects the mistakes by
examining a specific case. The following sentence is taken from a news article in
our corpus. The eight entities are in boldface, labeled E1 to E8.
At the proposal of the Serb Radical Party|E1 , the Assembly elected political
Branko Vojnic|E2 from Beli Manastir|E3 as its speaker, while Marko
Atlagic|E4 and Dr. Milan Ernjakovic|E5 , Krajina|E6 Serb Democratic
Party|E7 (SDS|E8 ) candidates, were elected as deputy speakers.
Table 20.6 shows the probability distribution estimated by the basic classifiers,
the predictions before and after the inference, along with the true labels. Table 20.7
provides this information for the relation variables. Because the values of most of
them are other rel, we only show a small set of them here.
Table 20.6 Example: Inference effect on entity predictions: the true labels, the
predictions before and after inference, and the probabilities estimated by the basic
classifiers.
      Label   before Inf.   after Inf.   other   person   loc.   org
E1    Org     Org           Org          0.21    0.13     0.06   0.60
E2    Per     Other         Other        0.46    0.16     0.33   0.05
E3    Loc     Loc           Loc          0.29    0.25     0.31   0.15
E4    Per     Other         Other        0.37    0.20     0.33   0.10
E5    Per     Loc           Per          0.10    0.31     0.36   0.23
E6    Loc     Loc           Loc          0.24    0.05     0.61   0.10
E7    Org     Per           Org          0.15    0.41     0.03   0.40
E8    Org     Org           Org          0.35    0.17     0.11   0.37
In this example, the inference procedure corrects two variables: E5 (Milan Ernjakovic) and E7 (Serb Democratic Party). If we examine the probability distribution
of these two entity variables in table 20.6, it is easy to see that the classifier has
difficulty deciding whether E5 is a person's name or location, and whether E7 is a
person or organization. The strong belief that there is a work for relation between
these two entities (see the row R57 in table 20.7) enables the inference procedure
to correct this mistake. In addition, several relation predictions are also corrected
from work for to other rel because they lack the support of the entity classifier.
Note that not every mistake can be rectified, as several work for relations are
misidentified as other rel. This may be due to the fact that the relation other rel
can take any types of entities as its arguments. In some rare cases, the inference
Table 20.7
Label
R23
R37
R47
R48
R51
R52
R56
R57
R58
R67
R68
kill
other rel
work for
work for
other rel
other rel
other rel
work for
work for
work for
work for
other rel
other rel
other rel
other rel
work for
other rel
other rel
work for
other rel
other rel
other rel
0.10
0.07
0.05
0.06
0.06
0.15
0.16
0.07
0.06
0.06
0.09
0.03
0.41
0.19
0.03
0.42
0.28
0.35
0.44
0.14
0.19
0.04
0.03
0.02
0.02
0.02
0.01
0.04
0.01
0.01
0.02
0.02
0.04
0.11
0.10
0.06
0.04
0.13
0.22
0.22
0.21
0.17
0.05
0.05
0.08
0.02
0.03
0.03
0.02
0.07
0.02
0.02
0.02
0.01
0.02
procedure may change a correct prediction to a wrong label. However, since this
seldom happens, the overall performance is still improved after inference.
One interesting thing to notice is the efficiency of this ILP inference in practice.
Using a Pentium III 800MHz machine, it takes less than 30 seconds to process all
the 1437 sentences (5336 entity variables and 19,048 relation variables in total).
20.6
Linear-chain structures are often used for sequence labeling problems, where the
task is to decide the label of each token. For this problem, HMMs [27], conditional
sequential models and other extensions [25], and conditional random fields [21] are
commonly used. While the first two methods learn the state transition between
a pair of consecutive tokens, conditional random fields relax the directionality
assumption and train the potential functions for the size-1 (i.e., a single token) and
size-2 (a pair of consecutive tokens) cliques. In both cases, the Viterbi algorithm is
usually used to find the most probable sequence assignment.
We describe the Viterbi algorithm in the linear-chain conditional random fields
setting as follows. Suppose we need to predict the labels of a sequence of tokens,
$t_0, t_1, \cdots, t_{n-1}$. Let $Y$ be the set of possible labels for each token, where $|Y| = m$.
A set of $m \times m$ matrices $\{M_i(x) \mid i = 0, \ldots, n - 1\}$ is defined over each pair of labels
$y', y \in Y$:
\[
M_i(y', y \mid x) = \exp\Big(\sum_j \lambda_j f_j(y', y, x, i)\Big),
\]
where $\lambda_j$ are the model parameters and $f_j$ are the features. By augmenting two
special nodes $y_{-1}$ and $y_n$ before and after the sequence with labels start and end
respectively, the sequence probability is
\[
p(y \mid x, \lambda) = \frac{1}{Z(x)} \prod_{i=0}^{n} M_i(y_{i-1}, y_i \mid x).
\]
$Z(x)$ is a normalization factor that can be computed from the $M_i$'s but is not
needed in evaluation. We only need to find the label sequence $y$ that maximizes
the product of the corresponding elements of these $n + 1$ matrices. The Viterbi
algorithm is the standard method that computes the most likely label sequence
given the observation. It grows the optimal label sequence incrementally by scanning
the matrices from position 0 to $n$. At step $i$, it records all the optimal sequences
ending at a label $y$, $\forall y \in Y$ (denoted by $y_i^*(y)$), and also the corresponding product
$P_i(y)$. The recursive function of this dynamic programming algorithm is
1. $P_0(y) = M_0(\text{start}, y \mid x)$ and $y_0^*(y) = y$.
2. for $1 \le i \le n$, $y_i^*(y) = y_{i-1}^*(\hat{y}).(y)$ and $P_i(y) = \max_{y' \in Y} P_{i-1}(y') M_i(y', y \mid x)$,
where $\hat{y} = \operatorname{argmax}_{y' \in Y} P_{i-1}(y') M_i(y', y \mid x)$ and "." is the concatenation operator.
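The recursion above can be sketched as follows. The data layout is an assumption made for the example: `M` is a list of dictionaries with `M[i][(y_prev, y)]` holding the matrix entry at position i, position 0 uses the special label start as predecessor, and the last matrix holds the transitions into end. The numbers are invented.

```python
def viterbi(M, labels):
    """Return the most likely label sequence and its unnormalized score."""
    n = len(M) - 1
    # Step 1: P_0(y) = M_0(start, y | x), and the path ending in y is [y].
    P = {y: M[0][("start", y)] for y in labels}
    path = {y: [y] for y in labels}
    # Step 2: extend the best prefix that ends in each label.
    for i in range(1, n):
        new_P, new_path = {}, {}
        for y in labels:
            y_hat = max(labels, key=lambda yp: P[yp] * M[i][(yp, y)])
            new_P[y] = P[y_hat] * M[i][(y_hat, y)]
            new_path[y] = path[y_hat] + [y]
        P, path = new_P, new_path
    # Final step: transition into the special end node.
    y_last = max(labels, key=lambda yp: P[yp] * M[n][(yp, "end")])
    return path[y_last], P[y_last] * M[n][(y_last, "end")]


# Hypothetical matrices for three tokens and two labels.
labels = ["A", "B"]
M = [
    {("start", "A"): 0.6, ("start", "B"): 0.4},
    {("A", "A"): 0.3, ("A", "B"): 0.7, ("B", "A"): 0.5, ("B", "B"): 0.5},
    {("A", "A"): 0.8, ("A", "B"): 0.2, ("B", "A"): 0.1, ("B", "B"): 0.9},
    {("A", "end"): 1.0, ("B", "end"): 1.0},
]
best_path, best_score = viterbi(M, labels)
print(best_path, best_score)
```

Each step keeps only one best prefix per label, so the work grows linearly in the number of tokens and quadratically in the number of labels.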
The solution that Viterbi outputs is in fact the shortest path in the graph
constructed as follows. Let $n$ be the number of tokens in the sequence, and $m$ be
the number of labels each token can take. The graph consists of $nm + 2$ nodes and
$(n - 1)m^2 + 2m$ edges. In addition to two special nodes start and end that denote
the start and end positions of the path, the label of each token is represented by a
node $v_{ij}$, where $0 \le i \le n - 1$, and $0 \le j \le m - 1$. If the path passes node $v_{ij}$, then
label $j$ is assigned to token $i$. For nodes that represent two adjacent tokens $v_{(i-1)j}$
and $v_{ij'}$, where $0 \le i \le n$, and $0 \le j, j' \le m - 1$, there is a directed edge $x_{i,jj'}$ from
$v_{(i-1)j}$ to $v_{ij'}$, with the cost $-\log(M_i(jj' \mid x))$.
Obviously, the path from start to end will pass exactly one node on position $i$. That
is, exactly one of the nodes $v_{i,j}$, $0 \le j \le m - 1$, will be picked. Figure 20.3 illustrates
the graph. Suppose that $y = y_0 y_1 \cdots y_{n-1}$ is the label sequence determined by the
path. Then
\[
\operatorname{argmin}_y \; -\sum_{i=0}^{n-1} \log M_i(y_{i-1} y_i \mid x) = \operatorname{argmax}_y \prod_{i=0}^{n-1} M_i(y_{i-1} y_i \mid x).
\]
Namely, the nodes in the shortest path are exactly the labels returned by the Viterbi
algorithm.
Figure 20.3 The graph that represents the labels of the tokens and the state
transition (also known as the trellis in hidden Markov models).
The Viterbi algorithm can still be used when the matrix is slightly modified to
incorporate simple constraints. For example, in the task of information extraction,
if the label of a word is the beginning of an entity (B), inside an entity (I), or
outside any entity (O), a token label O immediately followed by a label I is not a
valid labeling. The constraint can be incorporated by changing the corresponding
transitional probability or matrix entries to 0 [10, 20]. However, more general,
non-Markovian constraints cannot be resolved using the same trick.
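The matrix modification described above can be sketched as follows. The local scores are invented, the matrices carry no transition preference so that the effect of the constraint is easy to see, and the helper names are hypothetical. To keep the sketch self-contained the best sequence is found by exhaustive search; the masked matrices could equally be handed to the Viterbi algorithm unchanged.

```python
import itertools

LABELS = ["B", "I", "O"]


def build_matrices(local_scores):
    """M[i][(y_prev, y)] is the local score of y at token i for every predecessor."""
    return [{(yp, y): s[y] for yp in LABELS + ["start"] for y in LABELS}
            for s in local_scores]


def forbid(M, y_prev, y):
    """Set the entry for the transition y_prev -> y to 0 at every position."""
    for Mi in M:
        Mi[(y_prev, y)] = 0.0


def best_sequence(M):
    """Label sequence with the largest product of matrix entries (exhaustive search)."""
    def score(seq):
        prev, total = "start", 1.0
        for Mi, y in zip(M, seq):
            total *= Mi[(prev, y)]
            prev = y
        return total
    return max(itertools.product(LABELS, repeat=len(M)), key=score)


# Hypothetical per-token scores for a three-token sentence.
scores = [
    {"B": 0.2, "I": 0.1, "O": 0.7},
    {"B": 0.3, "I": 0.6, "O": 0.1},
    {"B": 0.1, "I": 0.2, "O": 0.7},
]
M = build_matrices(scores)
unconstrained = best_sequence(M)   # contains the invalid transition O -> I
forbid(M, "O", "I")
constrained = best_sequence(M)
print(unconstrained, constrained)
```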
Recently, Roth and Yih [32] proposed a different inference approach based on
ILP to replace the Viterbi algorithm. The basic idea is to use integer linear
programming to find the shortest path in the trellis (e.g., figure 20.3). Each edge
of the graph is associated with an indicator variable that represents whether the edge
is on the shortest path. The cost function can be written as a linear
function of these indicator variables. In this ILP, linear (in)equalities are added to
enforce that the values of the indicator variables represent a legitimate path. This
ILP can be solved simply by LP relaxation because the coefficient matrix is totally
unimodular. The main advantage of this new setting, however, is its ability to accommodate
more general constraints, which can be encoded either as linear (in)equalities or in
the cost function. Interested readers may see [32] for more details.
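One standard way to write such a shortest path ILP is sketched below. The notation is an assumption for illustration ($E$ is the edge set of the trellis, $c_e$ the cost of edge $e$, $x_e$ its indicator variable, and $\mathrm{in}(v)$, $\mathrm{out}(v)$ the edges entering and leaving node $v$); the exact formulation in [32] may differ.

```latex
\begin{aligned}
\min_{x}\quad & \sum_{e \in E} c_e \, x_e \\
\text{s.t.}\quad
& \sum_{e \in \mathrm{out}(\mathit{start})} x_e = 1, \qquad
  \sum_{e \in \mathrm{in}(\mathit{end})} x_e = 1, \\
& \sum_{e \in \mathrm{in}(v)} x_e = \sum_{e \in \mathrm{out}(v)} x_e
  \quad \text{for every node } v \notin \{\mathit{start}, \mathit{end}\}, \\
& x_e \in \{0, 1\} \quad \text{for every } e \in E .
\end{aligned}
```

The first two constraints force exactly one edge out of start and one into end, and the conservation constraints ensure that the selected edges form a connected path between them.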
20.6.1.2
A second efficient inference algorithm for linear sequence tasks that has been used
successfully for natural language and information extraction problems is constraint
satisfaction with classifiers (CSCL) [25]. This method was first proposed for shallow
parsing, that is, identifying atomic phrases (e.g., base noun phrases) in a given sentence.
In that case, two classifiers are first trained to predict whether a word opens
(O) a phrase or closes (C) a phrase. Since these two classifiers may generate
inconsistent predictions, the inference task has to decide which O-C pairs are indeed
the boundaries of a phrase.
We illustrate their approach with the following example. Suppose a sentence has
six tokens, $t_1, \cdots, t_6$, as indicated in figure 20.4. The classifiers have identified three
opens (O) and three closes (C) in this sentence (i.e., the open and close brackets).
Among the O-C pairs $(t_1, t_3)$, $(t_1, t_5)$, $(t_1, t_6)$, $(t_2, t_3)$, $(t_2, t_5)$, $(t_2, t_6)$, $(t_4, t_5)$, $(t_4, t_6)$,
the inference procedure needs to decide which of them are the predicted phrases,
based on the cost function. In addition, the chosen phrases should not overlap or
be embedded in one another. Let the predicate "this pair is selected as a phrase" be
represented by an indicator variable $x_i \in X$, where $|X| = 8$ in this case. They
associate a cost function $c : X \to \mathbb{R}$ with each variable (where the value $c(x_i)$ is
determined as a function of the corresponding O and C classifiers), and try to find a
solution that minimizes the overall cost, $\sum_{i=1}^{n} c(x_i) x_i$.
This problem can be reduced elegantly to a shortest path problem by the following
graph construction. Each open and close word is represented by an O node and a C
node. For each possible O-C pair, there is a direct link from the corresponding open
node O to the close node C. Finally, one source (s) node and one target (t) node
are added. Links are added from s to each O and from each C to t. The cost of
an O-C link is $-p_i$, where $p_i$ is the probability that this O-C pair represents a phrase,
as estimated by the O and C classifiers.
Because this inference is also carried out by finding the shortest path in a
graph, the ILP framework described in [32] is applicable here as well.
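The selection of nonoverlapping phrases can be sketched in Python as a dynamic program over token positions, which amounts to a shortest path computation in a directed acyclic graph. This is an illustration under assumptions rather than the CSCL implementation: the probabilities are made up, each candidate costs $-p_i$, and the sketch assumes that several disjoint phrases may be chained along one path.

```python
def select_phrases(n_tokens, candidates):
    """candidates: dict mapping (open, close) token indices (1-based, inclusive)
    to a probability p. Returns (cost, phrases) minimizing the sum of -p
    over a set of pairwise disjoint phrases."""
    # best[k] = (cost, phrases) of the cheapest selection using tokens 1..k only
    best = [(0.0, [])] * (n_tokens + 1)
    for k in range(1, n_tokens + 1):
        best[k] = best[k - 1]                  # token k lies outside every phrase
        for (o, c), p in candidates.items():
            if c == k:                         # a phrase closing at token k
                cost = best[o - 1][0] - p
                if cost < best[k][0]:
                    best[k] = (cost, best[o - 1][1] + [(o, c)])
    return best[n_tokens]

# The eight O-C pairs of the example, with hypothetical probabilities.
pairs = {(1, 3): 0.6, (1, 5): 0.3, (1, 6): 0.2, (2, 3): 0.5,
         (2, 5): 0.1, (2, 6): 0.1, (4, 5): 0.7, (4, 6): 0.4}
cost, phrases = select_phrases(6, pairs)
```

With these made-up numbers the selected phrases are $(t_1, t_3)$ and $(t_4, t_5)$, the disjoint combination with the largest total probability.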
20.6.1.3
Clause Identification
The two efficient approaches mentioned above can be generalized beyond
sequential structures to tree structures. Cubic-time dynamic programming algorithms are often
used for inference in various tree-structured problems, such as parsing [17] or clause
identification [38]. As an example, we discuss the inference approach proposed
by Carreras et al. [7] in the context of clause identification. Clause identification is a
partial parsing problem. Given a sentence, a clause is defined as a sequence of words
that contains a subject and a predicate [38]. In the following example sentence taken
from the Penn Treebank [23], each pair of corresponding parentheses represents a
clause. The task is thus to identify all the clauses in the sentence.
(The deregulation of railroads and trucking companies (that (began in 1980))
enabled (shippers to bargain for transportation).)
Although the problem looks similar to shallow parsing, the constraints between
the clauses are weaker: clauses may not overlap, but a clause can be embedded in
another. Formally speaking, let $w_i$ be the $i$th word in a sentence of $n$ words. A clause
can be defined as a pair of numbers $(s, t)$, where $1 \le s \le t \le n$, which represents the
word sequence $w_s, w_{s+1}, \ldots, w_t$.
Carreras et al. [7] proposed a dynamic programming algorithm to solve this inference problem. In this algorithm, two 2D matrices are maintained: best-split[s,t]
stores the optimal clause predictions in $w_s, w_{s+1}, \ldots, w_t$, and score[s,t] is the score
of the clause $(s, t)$. By filling the table recursively, the optimal clause prediction
can be found in $O(n^3)$ time.
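A table-filling recursion of this kind can be sketched in Python as follows. This is an illustrative reconstruction under assumptions, not the algorithm of [7] itself: it assumes that larger scores are better, that only positively scored candidates are worth selecting, and that the goal is the set of mutually nested or disjoint clauses with the largest total score. The name `best_clauses` and the dictionary-based tables are hypothetical.

```python
def best_clauses(n, score):
    """score: dict mapping (s, t), 1 <= s <= t <= n, to a clause score
    (pairs that are absent are not candidates).
    Returns (total, clauses): the best total score and a set of clauses
    in which any two clauses are either disjoint or nested."""
    best = {}                                   # plays the role of best-split[s,t]
    for length in range(1, n + 1):
        for s in range(1, n - length + 2):
            t = s + length - 1
            total, clauses = 0.0, []
            # best way to split w_s..w_t into two independent parts
            for k in range(s, t):
                left, right = best[(s, k)], best[(k + 1, t)]
                if left[0] + right[0] > total:
                    total = left[0] + right[0]
                    clauses = left[1] + right[1]
            # optionally wrap the whole span as a clause of its own
            sc = score.get((s, t))
            if sc is not None and sc > 0:
                total, clauses = total + sc, clauses + [(s, t)]
            best[(s, t)] = (total, clauses)
    return best[(1, n)]
```

The three nested loops over `length`, `s`, and `k` give the cubic number of table updates; the list concatenations used here to carry the selected clauses add overhead that a back-pointer table would avoid.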
As in the previous cases discussed in this section, this problem
can be represented as an ILP. Each candidate clause $(s, t)$ is represented
by an indicator variable $x_{s,t}$. The cost function is the sum of each score times
the corresponding indicator variable, namely $\sum_{(s,t)} \mathrm{score}(s, t) \cdot x_{s,t}$. Suppose clause
candidates $(s_1, t_1)$ and $(s_2, t_2)$ overlap. The nonoverlapping constraint can be
enforced by adding a linear inequality, $x_{s_1,t_1} + x_{s_2,t_2} \le 1$.
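To make the constraint generation concrete, the following Python sketch enumerates the pairs of candidates that need such an inequality. It assumes that "overlap" means the two spans share words without one containing the other, since embedded clauses are allowed; the function names are hypothetical.

```python
from itertools import combinations

def crosses(a, b):
    """True if spans a = (s1, t1) and b = (s2, t2) share words
    but neither span contains the other."""
    (s1, t1), (s2, t2) = sorted([a, b])
    return s1 < s2 <= t1 < t2

def nonoverlap_constraints(candidates):
    """One inequality x_a + x_b <= 1 per crossing pair, returned as (a, b) tuples."""
    return [(a, b) for a, b in combinations(sorted(candidates), 2) if crosses(a, b)]

constraints = nonoverlap_constraints([(1, 5), (2, 3), (3, 4), (4, 5)])
```

Each returned pair `(a, b)` stands for the inequality $x_a + x_b \le 1$; nested pairs such as $(1, 5)$ and $(2, 3)$ produce no constraint.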
20.6.2
As discussed above, exact polynomial time algorithms exist for specific constraint
structures; however, the inference problem typically becomes computationally intractable when additional constraints are introduced or more complex structures
are needed. A common computational approach to the inference problem in this
case is search. Following the definition in [33], search is used to find a legitimate
state transition path from the initial state to a goal state while trying to minimize the cost. The problem can be treated as consisting of four components: state
space, operators (the legitimate state transitions), goal-test (a function that examines whether a goal state has been reached), and path-cost-function (the cost function of
the whole path). Figure 20.5 depicts a generic search algorithm.
To solve the entity-relation problem described in this chapter, we can define the
state space as the set of all possible labels of the entities and relations (namely,
$L_E$ and $L_R$), plus "undecided." In the initial state, the values of all the variables
are "undecided." A legitimate operator changes an entity or relation variable from
"undecided" to one of the possible labels, subject to the constraints. The goal-test
evaluates whether every variable has been assigned a label, and the path-cost is the
sum of the assignment costs of the variables.
The main advantage of inference using search is its generality. The cost function
need not be linear. The constraints can also be fairly general: as long as the decision
on whether a state violates constraints can be evaluated efficiently, they can be used
to define the operators.
Algorithm 1
generic-search(problem, enqueue-func)
  nodes ← MakeQueue(MakeNode(init-state(problem)))
  while (nodes is not empty)
    node ← RemoveFront(nodes)
    if (goal-test(node)) then return node
    next ← Operators(node)
    nodes ← enqueue-func(problem, nodes, next)
  end
  return failure
end
Figure 20.5
The main disadvantage, however, is that there is no guarantee of optimality.
Despite this weakness, search has been shown empirically to be a successful approach in
some tasks. For instance, Moore [24] applied beam search to find the
best word alignment given a linear model learned using voted perceptron. Recently,
Daumé and Marcu [14] demonstrated an approximate large margin method for
learning structured output, where the key inference component is search.
In contrast, our ILP approach may or may not be able to replace this search
mechanism, depending on the specific cost function. Nevertheless, in several real-world problems, we observed that our ILP method is not necessarily slower than search,
while being guaranteed to find the optimal solution.
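The generic search of figure 20.5 and its lack of an optimality guarantee can be illustrated with a small Python sketch. Everything specific in it is hypothetical: the three variables, their labels and costs, the single constraint, and the beam-style enqueue function are invented for illustration and are not the chapter's experimental setup.

```python
def generic_search(init_state, operators, goal_test, enqueue):
    """Figure 20.5 as a function; `enqueue(nodes, successors)` fixes the strategy."""
    nodes = [init_state]
    while nodes:
        node = nodes.pop(0)
        if goal_test(node):
            return node
        nodes = enqueue(nodes, operators(node))
    return None                                   # failure

VARS = ["E1", "E2", "R12"]                        # two entities and one relation
COST = {                                          # hypothetical assignment costs
    "E1": {"person": 0.2, "location": 1.9},
    "E2": {"person": 1.2, "location": 0.4},
    "R12": {"kill": 0.3, "none": 1.4},
}

def consistent(assign):
    # hypothetical constraint: a kill relation holds between two persons
    if assign.get("R12") == "kill":
        return assign.get("E1") == "person" and assign.get("E2") == "person"
    return True

def operators(node):
    cost, assign = node
    var = VARS[len(assign)]                       # next undecided variable
    return [(cost + c, {**assign, var: label})
            for label, c in COST[var].items()
            if consistent({**assign, var: label})]

def goal_test(node):
    return len(node[1]) == len(VARS)              # every variable has a label

def beam(width):
    def enqueue(nodes, successors):
        # keep only the `width` cheapest partial assignments
        return sorted(nodes + successors, key=lambda nd: nd[0])[:width]
    return enqueue

greedy = generic_search((0.0, {}), operators, goal_test, beam(1))
wide = generic_search((0.0, {}), operators, goal_test, beam(10))
```

With these made-up costs, the width-1 beam commits early to labeling the second entity a location and ends with a total cost of about 2.0, whereas the wider beam finds the assignment with the kill relation at a cost of about 1.7.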
20.7
Conclusion
We presented a linear-programming based approach for global inference in cases
where decisions depend on the outcomes of several different but mutually dependent
classifiers. Even in the presence of a fairly general constraint structure, deviating
from the sequential nature typically studied, this approach can find the optimal
solution efficiently.
Contrary to general search schemes (e.g., beam search), which do not guarantee
optimality, the LP approach provides an efficient way of finding the optimal
solution. The key advantage of the LP formulation is its generality and flexibility; in
particular, it supports the ability to incorporate classifiers learned in other contexts,
externally supplied hints, and decision-time constraints, and to reason with all of these
for the best global prediction. In sharp contrast with the typically used pipeline
framework, our formulation does not blindly trust the results of some classifiers,
and therefore is able to overcome mistakes made by classifiers with the help of
constraints.
Our experiments have demonstrated these advantages by considering the interaction between entity and relation classifiers. In fact, more classifiers can be added
and used within the same framework. For example, if coreference resolution is available, it is possible to incorporate it in the form of constraints that force the labels of
the coreferred entities to be the same (while, of course, allowing the global solution
to reject the suggestion of these classifiers). Consequently, this may enhance the
performance of entity-relation recognition and, at the same time, correct possible
coreference resolution errors. Another example is to use chunking information for
better relation identification; suppose, for example, that we have available chunking
information that identifies Subj+Verb and Verb+Object phrases. Given a sentence
that has the verb "murder," we may conclude that the subject and object of this
verb are in a kill relation. Since the chunking information is used in the global inference procedure, this information will contribute to enhancing its performance and
robustness, relying on having more constraints and overcoming possible mistakes
by some of the classifiers. Moreover, in an interactive environment where a user can
supply new constraints (e.g., a question-answering situation), this framework is able
to make use of the new information and enhance the performance at decision time,
without retraining the classifiers. As we have shown, our formulation improves not
only the accuracy but also the coherence of the decisions.
We believe that it has the potential to be a powerful way of supporting natural
language inference.
Acknowledgments
Most of this research was done when Wen-tau Yih was at the University of Illinois
at Urbana-Champaign. This research was supported by NSF grants CAREER IIS-9984168 and ITR IIS-0085836, an ONR MURI award, and by the Advanced Research
and Development Activity (ARDA)'s Advanced Question Answering for Intelligence
(AQUAINT) program.
References
[1] S. Abney. Parsing by chunks. In R. Berwick, S. Abney, and C. Tenny, editors,
Principle-Based Parsing: Computation and Psycholinguistics, pages 257-278.
Kluwer, Dordrecht, Netherlands, 1991.
[2] C. Bishop. Neural Networks for Pattern Recognition. Oxford University Press,
Oxford, UK, 1995.
[3] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization
via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11):1222-1239, 2001.
[4] A. Carlson, C. Cumby, J. Rosen, and D. Roth. The SNoW learning architecture. Technical Report UIUCDCS-R-99-2101, University of Illinois at Urbana-Champaign Computer Science Department, May 1999.
[5] X. Carreras and L. Màrquez. Introduction to the CoNLL-2004 shared task:
Semantic role labeling. In Proceedings of the Conference on Natural Language
Learning, 2004.
[6] X. Carreras and L. Màrquez. Introduction to the CoNLL-2005 shared task:
Semantic role labeling. In Proceedings of the Conference on Natural Language
Learning, 2005.
[7] X. Carreras, L. Màrquez, V. Punyakanok, and D. Roth. Learning and inference
for clause identification. In Proceedings of the European Conference on Machine
Learning, 2002.
[8] C. Chekuri, S. Khanna, J. Naor, and L. Zosin. Approximation algorithms for
the metric labeling problem via a new linear programming formulation. In
Symposium on Discrete Algorithms, 2001.
[9] R. Chellappa and A. Jain. Markov Random Fields: Theory and Application.
Academic Press, April 1993.
[10] H. Chieu and H. Ng. A maximum entropy approach to information extraction
from semi-structured and free text. In Proceedings of the National Conference
on Artificial Intelligence, 2002.
[11] CPLEX. ILOG, Inc. http://www.ilog.com/products/cplex/, 2003.
[12] C. Cumby and D. Roth. Relational representations that facilitate learning.
In Proceedings of the International Conference on Principles of Knowledge
Representation and Reasoning, 2000.
[13] C. Cumby and D. Roth. On kernel methods for relational learning. In
Proceedings of the International Conference on Machine Learning, 2003.
[14] H. Daumé III and D. Marcu. Learning as search optimization: Approximate
large margin methods for structured prediction. In Proceedings of the International Conference on Machine Learning, 2005.
[15] T. Dietterich. Machine learning for sequential data: A review. In Structural,
Syntactic, and Statistical Pattern Recognition, pages 15-30. Springer-Verlag,
2002.
[16] Y. Even-Zohar and D. Roth. A classification approach to word prediction.
In Proceedings of the Annual Meeting of the North American Association of
Computational Linguistics, 2000.
[17] M. Johnson. PCFG models of linguistic tree representations. Computational
Linguistics, 24(4):613-632, 1998.
[18] R. Khardon, D. Roth, and L. G. Valiant. Relational learning for NLP using
linear threshold elements. In Proceedings of the International Joint Conference
on Artificial Intelligence, 1999.
Contributors
Pieter Abbeel
Computer Science Department
Stanford University
abbeel@cs.stanford.edu
Eyal Amir
Department of Computer Science
University of Illinois, Urbana-Champaign
eyal@cs.uiuc.edu
Rodrigo de Salvo Braz
Department of Computer Science
University of Illinois, Urbana-Champaign
braz@uiuc.edu
Razvan C. Bunescu
Department of Computer Sciences
University of Texas, Austin
razvan@cs.utexas.edu
Elizabeth Burnside
Department of Radiology
Department of Biostatistics and Medical Informatics
University of Wisconsin, Madison
es.burnside@hosp.wisc.edu
Vítor Santos Costa
COPPE/Sistemas
Universidade Federal do Rio de Janeiro, Brazil
vitor@cos.ufrj.br
James Cussens
Department of Computer Science &
York Centre for Complex Systems Analysis
University of York, UK
jc@cs.york.ac.uk
Jesse Davis
Department of Computer Science
University of Wisconsin, Madison
jdavis@cs.wisc.edu
Luc De Raedt
Institute for Computer Science, Machine Learning Lab
Albert-Ludwigs-Universität Freiburg, Germany
deraedt@informatik.uni-freiburg.de
Pedro Domingos
Department of Computer Science and Engineering
University of Washington
pedrod@cs.washington.edu
Inês Dutra
COPPE/Sistemas
Universidade Federal do Rio de Janeiro, Brazil
ines@cos.ufrj.br
Sašo Džeroski
Department of Knowledge Technologies
Jožef Stefan Institute, Slovenia
Saso.Dzeroski@ijs.si
Alan Fern
School of Electrical Engineering and Computer Science
Oregon State University
afern@eecs.orst.edu
Nir Friedman
School of Computer Science and Engineering
Hebrew University, Israel
nir@cs.huji.ac.il
Lise Getoor
Computer Science Department
University of Maryland, College Park
getoor@cs.umd.edu
Robert Givan
School of Electrical and Computer Engineering
Purdue University
givan@purdue.edu
David Heckerman
Microsoft Research Redmond
heckerma@microsoft.com
David Jensen
Computer Science Department
University of Massachusetts, Amherst
jensen@cs.umass.edu
Kristian Kersting
Institute for Computer Science, Machine Learning Lab
Albert-Ludwigs-Universität Freiburg, Germany
kersting@informatik.uni-freiburg.de
Daphne Koller
Computer Science Department
Stanford University
koller@cs.stanford.edu
Andrey Kolobov
Computer Science Division
University of California, Berkeley
karayaone@rambler.ru
Bhaskara Marthi
Computer Science Division
University of California, Berkeley
bhaskara@cs.berkeley.edu
Andrew McCallum
Department of Computer Science
University of Massachusetts, Amherst
mccallum@cs.umass.edu
Chris Meek
Microsoft Research Redmond
meek@microsoft.com
Brian Milch
Computer Science Division
University of California, Berkeley
milch@cs.berkeley.edu
Raymond J. Mooney
Department of Computer Sciences
University of Texas, Austin
mooney@cs.utexas.edu
Stephen Muggleton
Department of Computing,
Imperial College London, UK
shm@doc.ic.ac.uk
Jennifer Neville
Computer Science Department
University of Massachusetts, Amherst
jneville@cs.umass.edu
Daniel L. Ong
Computer Science Division
University of California, Berkeley
dlong@ocf.berkeley.edu
David Page
Department of Biostatistics and Medical Informatics
University of Wisconsin, Madison
page@biostat.wisc.edu
Niels Pahlavi
Department of Computing,
Imperial College London, UK
Avi Pfeffer
Division of Engineering and Applied Sciences
Harvard University
avi@eecs.harvard.edu
Alexandrin Popescul
Department of Computer and Information Science
University of Pennsylvania
popescul@cis.upenn.edu
Raghu Ramakrishnan
Department of Computer Science
University of Wisconsin, Madison
raghu@cs.wisc.edu
Matthew Richardson
Microsoft Research Redmond
mattri@microsoft.com
Dan Roth
Department of Computer Science
University of Illinois, Urbana-Champaign
danr@uiuc.edu
Stuart Russell
Computer Science Division
University of California, Berkeley
russell@cs.berkeley.edu
Jude Shavlik
Department of Computer Science
University of Wisconsin, Madison
shavlik@cs.wisc.edu
David Sontag
Department of Electrical Engineering and Computer Science
Massachusetts Institute of Technology
dsontag@csail.mit.edu
Charles Sutton
Department of Computer Science
University of Massachusetts, Amherst
casutton@cs.umass.edu
Ben Taskar
Department of Computer and Information Science
University of Pennsylvania
taskar@cis.upenn.edu
Lyle H. Ungar
Department of Computer and Information Science
University of Pennsylvania
ungar@cis.upenn.edu
Ming-Fai Wong
Computer Science Department
Stanford University
mingfai.wong@cs.stanford.edu
Wen-tau Yih
Machine Learning and Applied Statistics Group
Microsoft Research Redmond
scottyih@microsoft.com
SungWook Yoon
School of Electrical and Computer Engineering
Purdue University
sy@purdue.edu
Index