NLP Lec
Lecture Synopsis
Aims
This course introduces the fundamental techniques of natural language processing. It aims to explain the potential and
the main limitations of these techniques. Some current research issues are introduced and some current and potential
applications discussed and evaluated.
1. Introduction. Brief history of NLP research, current applications, components of NLP systems.
2. Finite-state techniques. Inflectional and derivational morphology, finite-state automata in NLP, finite-state
transducers.
3. Prediction and part-of-speech tagging. Corpora, simple N-grams, word prediction, stochastic tagging,
evaluating system performance.
4. Context-free grammars and parsing. Generative grammar, context-free grammars, parsing with context-free
grammars, weights and probabilities. Limitations of context-free grammars.
8. Distributional semantics. Representing lexical meaning with distributions. Similarity metrics. Clustering.
9. Discourse and dialogue. Anaphora resolution, discourse relations.
10. Language generation. Generation and regeneration. Components of a generation system. Generation of
referring expressions.
Objectives
• be able to discuss the current and likely future performance of several NLP applications;
• be able to describe briefly a fundamental technique for processing language for several subtasks, such as
morphological processing, parsing, word sense disambiguation etc.;
• understand how these techniques draw on and relate to other areas of computer science.
Overview
NLP is a large and multidisciplinary field, so this course can only provide a very general introduction. The idea is that
this is a ‘taster’ course that gives an idea of the different subfields and shows a few of the huge range of computational
techniques that are used. The first lecture is designed to give an overview including a very brief idea of the main
applications and the methodologies which have been employed. The history of NLP is briefly discussed as a way of
putting this into perspective. The next nine lectures describe some of the main subdisciplines in more detail. The
organisation is mainly based on increased ‘depth’ of processing, starting with relatively surface-oriented techniques
and progressing to considering meaning of sentences and meaning of utterances in context. Most lectures will start off
by considering the subarea as a whole and then go on to describe one or more sample algorithms which tackle particular
problems. The algorithms have been chosen because they are relatively straightforward to describe and because they
illustrate a specific technique which has been shown to be useful, but the idea is to exemplify an approach, not to
give a detailed survey (which would be impossible in the time available). Lectures 2-9 are primarily about analysing
language: lecture 10 discusses generation. Lecture 11 will introduce some issues in computational psycholinguistics
and explain some of the differences between application-oriented work and attempting to model the way that humans
understand and generate language. The final lecture is intended to give further context: it will include discussion of
one or more NLP systems. The material in Lecture 12 will not be directly examined. Slides for Lecture 12 will be
made available via the course webpage after the lecture.
There are various themes running throughout the lectures. One theme is the connection to linguistics and the tension
that sometimes exists between the predominant view in theoretical linguistics and the approaches adopted within NLP.
A somewhat related theme is the distinction between knowledge-based and probabilistic approaches. Evaluation will
be discussed in the context of the different algorithms.
Because NLP is such a large area, there are many topics that aren’t touched on at all in these lectures. Speech
recognition and speech synthesis are almost totally ignored. Information Retrieval is the topic of a separate course.
Feedback on the handout, lists of typos etc, would be greatly appreciated.
Recommended Reading
Recommended Book:
Jurafsky, Daniel and James Martin, Speech and Language Processing, Prentice-Hall, 2008 (second edition): referenced
as J&M throughout this handout. In most cases, the first edition is still suitable, but the second edition has a much
clearer description of the material covered in lecture 3 and some of the material in lecture 9 is only in the second
edition. The second edition doesn’t have anything on language generation (lecture 10): there is a chapter on this in the
first edition but it is not very useful for this course. Section references given in these notes are to the second edition.
Background:
These books are about linguistics rather than NLP/computational linguistics. They are not necessary to understand the
course, but should give readers an idea about some of the properties of human languages that make NLP interesting
and challenging, without being technical.
Pinker, S., The Language Instinct, Penguin, 1994.
This is a thought-provoking and sometimes controversial ‘popular’ introduction to linguistics.
Matthews, Peter, Linguistics: a very short introduction, OUP, 2003.
The title is accurate . . .
Background/reference:
The Internet Grammar of English, http://www.ucl.ac.uk/internet-grammar/home.htm
Syntactic concepts and terminology.
Study Guide
The handouts and lectures should contain enough information to enable students to adequately answer the exam
questions, but the handout is not intended to substitute for a textbook (or for thought). In most cases, J&M go into a
considerable amount of further detail: rather than put lots of suggestions for further reading in the handout, in general
I have assumed that students will look at J&M, and then follow up the references in there if they are interested. The
notes at the end of each lecture give details of the sections of J&M that are relevant and details of any discrepancies
with these notes.
Supervisors ought to familiarise themselves with the relevant parts of Jurafsky and Martin (see notes at the end of each
lecture). However, good students should find it quite easy to come up with questions that the supervisors (and the
lecturer) can’t answer! Language is like that . . .
Generally I’m taking a rather informal/example-based approach to concepts such as finite-state automata, context-free
grammars etc. The assumption is that students will have already covered this material in other contexts and that this
course will illustrate some NLP applications.
This course inevitably assumes some very basic linguistic knowledge, such as the distinction between the major parts
of speech. It introduces some linguistic concepts that won’t be familiar to all students: since I’ll have to go through
these quickly, reading the first few chapters of an introductory linguistics textbook may help students understand the
material. The idea is to introduce just enough linguistics to motivate the approaches used within NLP rather than
to teach the linguistics for its own sake. At the end of this handout, there are some mini-exercises to help students
understand the concepts: it would be very useful if these were attempted before the lectures as indicated. There are
also some suggested post-lecture exercises: answers to these are made available to supervisors only.
Exam questions won’t rely on students remembering the details of any specific linguistic phenomenon. As far as
possible, exam questions will be suitable for people who speak English as a second language. For instance, if a
question relied on knowledge of the ambiguity of a particular English word, a gloss of the relevant senses would be
given.
Of course, I’ll be happy to try and answer questions about the course or more general NLP questions, preferably by
email.
URLs
Nearly all the URLs given in these notes should be linked from:
http://www.cl.cam.ac.uk/~aac10/stuff.html
(apart from this one of course . . . ). If any links break, I will put corrected versions there, if available.
1 Lecture 1: Introduction to NLP
The aim of this lecture is to give students some idea of the objectives of NLP. The main subareas of NLP will be
introduced, especially those which will be discussed in more detail in the rest of the course. There will be a preliminary
discussion of the main problems involved in language processing by means of examples taken from NLP applications.
This lecture also introduces some methodological distinctions and puts the applications and methodology into some
historical context.
1. Morphology: the structure of words. For instance, unusually can be thought of as composed of a prefix un-, a
stem usual, and an affix -ly. composed is compose plus the inflectional affix -ed: a spelling rule means we end
up with composed rather than composeed. Morphology will be discussed in lecture 2.
2. Syntax: the way words are used to form phrases. e.g., it is part of English syntax that a determiner (a word such
as the) will come before a noun, and also that determiners are obligatory with certain singular nouns. Formal
and computational aspects of syntax will be discussed in lectures 3, 4 and 5.
3. Semantics. Compositional semantics is the construction of meaning (generally expressed as logic) based on
syntax. This is discussed in lecture 6. This is contrasted to lexical semantics, i.e., the meaning of individual
words which is the topic of lectures 7 and 8.
4. Pragmatics: meaning in context. This will come into lecture 9, although linguistics and NLP generally have
very different perspectives here.
Lecture 10 looks at language generation rather than language analysis, and lecture 11 covers some topics in
computational psycholinguistics.
• Is FD5 compatible with a 505G?
Assume the query is to be evaluated against a database containing product and order information, with relations such
as the following:
ORDER
Order number | Date ordered | Date shipped
While some tasks in NLP can be done adequately without having any sort of account of meaning, others require that
we can construct detailed representations which will reflect the underlying meaning rather than the superficial string.
In fact, in natural languages (as opposed to programming languages), ambiguity is ubiquitous, so exactly the same
string might mean different things. For instance in the query:
the user may or may not be asking about Sony disk drives. This particular ambiguity may be represented by different
bracketings:
and acquiring world knowledge.3 The term AI-complete is intended jokingly, but conveys what’s probably the most
important guiding principle in current NLP: we’re looking for applications which don’t require AI-complete solutions:
i.e., ones where we can either work with very limited domains or approximate full world knowledge by relatively
simple techniques.
• lexicographers’ tools
• information retrieval
• document classification (filtering, routing)
• document clustering
• information extraction
• sentiment classification
• question answering
• summarization
• text segmentation
• exam marking
• language teaching
• report generation (possibly multilingual)
• machine translation
• natural language interfaces to databases
• email understanding
• dialogue systems
Several of these applications are discussed briefly below. Roughly speaking, they are ordered according to the
complexity of the language technology required. The applications towards the top of the list can be seen simply as aids to
human users, while those at the bottom are perceived as agents in their own right. Perfect performance on any of these
applications would be AI-complete, but perfection isn’t necessary for utility: in many cases, useful versions of these
applications had been built by the late 70s. Commercial success has often been harder to achieve, however.
3 In this course, I will use domain to mean some circumscribed body of knowledge: for instance, information about laptop orders constitutes a
limited domain.
1.5 Sentiment classification
Politicians want to know what people think about them. Companies want to know what users think about their
products. Extracting this sort of information from the Web is a huge and lucrative business but much of the work is still
done by humans who have to read through the relevant documents and classify them by hand, although automation is
increasingly playing a role. The full problem involves finding all the references to an entity from some document set
(e.g., all newspaper articles appearing in September 2013), and then classifying them as positive, negative or neutral.
Customers want to see summaries of the data (e.g., to see whether popularity is going up or down), but may also want
to see actual examples (text snippets). Companies may want a fine-grained classification of aspects of their product
(e.g., laptop batteries, MP3 player screens).
The full problem involves retrieving relevant text, recognition of named entities (e.g., Sony 505G, Hillary Clinton,
2,4-dinitrotoluene) and of parts of the text that refer to them. But academic researchers have looked at a simpler version of
sentiment classification by starting from a set of documents which are already known to be opinions about a particular
topic or entity (e.g., reviews) and where the problem is just to work out whether the author is expressing positive or
negative opinions. This still turns out to be hard for computers, though generally easy for humans, especially if neutral
reviews are excluded from the data set (as is often done). Much of the work has been done on movie reviews. The
rating associated with each review is known (5 stars, 1 star or whatever), so there is an objective standard as to whether
the review is positive or negative. The research problem is to guess this automatically over the entire corpus.4
The most basic technique is to look at the words in the review in isolation from each other, and to classify the document
on the basis of whether those words generally indicate positive or negative reviews. This is a bag of words technique:
we model the document as an unordered collection of words (bag rather than set because there will be repetition). A
document with more positive words than negative ones should be a positive review. In principle, this could be done
by using human judgements of positive/negative words, but using machine learning techniques works better5 (humans
don’t consider many words that turn out to be useful indicators). However, the accuracy of the classification is only
around 80% (for a problem where there is a 50% chance success rate).6 One source of errors is negation (e.g., Ridley
Scott has never directed a bad film is a positive statement). Another problem is that the machine learning technique
may match the data too closely: e.g., if the machine learner is trained on reviews which include a lot of films from
before 2005, it may decide that Ridley is a strong positive indicator but then tend to misclassify reviews for ‘Kingdom
of Heaven’. More subtle problems arise from not tracking the contrasts in the discourse:
This film should be brilliant. It sounds like a great plot, the actors are first grade, and the supporting cast
is good as well, and Stallone is attempting to deliver a good performance. However, it can’t hold up.
Another example:
AN AMERICAN WEREWOLF IN PARIS is a failed attempt . . . Julie Delpy is far too good for this movie.
She imbues Serafine with spirit, spunk, and humanity. This isn’t necessarily a good thing, since it prevents
us from relaxing and enjoying AN AMERICAN WEREWOLF IN PARIS as a completely mindless,
campy entertainment experience. Delpy’s injection of class into an otherwise classless production raises
the specter of what this film could have been with a better script and a better cast . . . She was radiant,
charismatic, and effective . . .
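As a rough illustration of the bag of words technique, and of why negation and discourse contrasts defeat it, here is a minimal Python sketch. The word lists POSITIVE and NEGATIVE are hypothetical and hand-picked for this example (as noted above, indicators learned from data work better), and the function names are illustrative only.

```python
from collections import Counter

# Hypothetical hand-picked indicator words (an assumption for illustration);
# a real system would learn indicator weights from training data.
POSITIVE = {"good", "great", "brilliant", "radiant", "charismatic", "effective"}
NEGATIVE = {"bad", "failed", "mindless", "classless"}


def bag_of_words(text):
    """Return an unordered multiset of lower-cased words, ignoring word order."""
    words = (w.strip(".,!?;:'\"") for w in text.lower().split())
    return Counter(w for w in words if w)


def classify(text):
    """Classify by comparing counts of positive and negative indicator words."""
    bag = bag_of_words(text)
    positive = sum(n for w, n in bag.items() if w in POSITIVE)
    negative = sum(n for w, n in bag.items() if w in NEGATIVE)
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return "neutral"


# Negation is invisible to the model: this positive statement contains only 'bad'.
print(classify("Ridley Scott has never directed a bad film"))

# The contrast introduced by 'However' is also lost.
print(classify(
    "This film should be brilliant. It sounds like a great plot, the actors are "
    "first grade, and the supporting cast is good as well, and Stallone is "
    "attempting to deliver a good performance. However, it can't hold up."
))
```

With these word lists the first call is classified as negative and the second as positive, i.e., both are wrong, for the reasons discussed in the text.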
1.6 Information retrieval, information extraction and question answering
Information retrieval involves returning a set of documents in response to a user query: Internet search engines are a
form of IR. However, one change from classical IR is that Internet search now uses techniques that rank documents
according to how many links there are to them (e.g., Google’s PageRank) as well as the presence of search terms.
Information extraction involves trying to discover specific information from a set of documents. The information
required can be described as a template. For instance, for company joint ventures, the template might have slots for
the companies, the dates, the products, the amount of money involved. The slot fillers are generally strings.
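One way to picture such a template is as a record with one field per slot. The following is only a sketch: the class name, the field names and the example fillers are all hypothetical, chosen to mirror the joint venture example above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class JointVentureTemplate:
    """Hypothetical IE template: each slot is filled with strings taken from the text."""
    companies: List[str] = field(default_factory=list)
    dates: List[str] = field(default_factory=list)
    products: List[str] = field(default_factory=list)
    amount: Optional[str] = None  # left empty if the documents do not mention it


# Invented fillers, purely for illustration.
filled = JointVentureTemplate(
    companies=["Company A", "Company B"],
    dates=["September 2013"],
    products=["laptop batteries"],
)
print(filled)
```

Keeping the slot fillers as plain strings matches the description above; normalising them (e.g., to dates or amounts) would be a separate step.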
Question answering attempts to find a specific answer to a specific question from a set of documents, or at least a short
piece of text that contains the answer.
There have been question-answering systems on the Web since the 1990s, but most have used very basic techniques.
One common approach involved employing a large staff of people who search the web to find pages which are answers
to potential questions. The question-answering system performs very limited manipulation on the actual input to map
to a known question. The same basic technique is used in many online help systems. However, with enough resources,
impressive results are now possible: most famously an IBM research team created a QA system that beat human
champions on the quiz show Jeopardy! in 2011 (see Ferrucci et al, AI Magazine, 2010, for an overview).
There have been many advances in NLP since these systems were built: natural language interface systems have
become much easier to build, and somewhat easier to use, but they still haven’t become ubiquitous. Natural Language
interfaces to databases were commercially available in the late 1970s, but largely died out by the 1990s: porting to
new databases and especially to new domains requires very specialist skills and is essentially too expensive (automatic
porting was attempted but never successfully developed). Users generally preferred graphical interfaces when these
became available. Speech input would make natural language interfaces much more useful: unfortunately, speaker-
independent speech recognition still isn’t good enough for even 1970s scale NLP to work well. Techniques for dealing
with misrecognised data have proved hard to develop. In some ways, current commercially-deployed phone-based
spoken dialogue systems are using pre-SHRDLU technology.
But it must be recognized that the notion ‘probability of a sentence’ is an entirely useless one, under any
known interpretation of this term. (Chomsky 1969)
Certain linguistics journals would not even review theoretical linguistics papers which had a quantitative component.
But speech and NLP researchers wanted results:
Whenever I fire a linguist our system performance improves. (Fred Jelinek, said at a workshop in 1988
(probably), various forms of the quotation have been attested. He has said he never actually fired anyone.)
Speech recognition had demonstrated that simple statistical techniques worked, given enough training data. NLP
systems were built which required very limited hand-coded knowledge, apart from initial training material. Most
applications were much shallower than the earlier NLIDs, but the switch to statistical NLP coincided with a change
in US funding, which started to emphasise speech recognition and IE. There was also a general realization of the
importance of serious evaluation and of reporting results in a way that could be reproduced by other researchers. US
funding emphasised competitions with specific tasks and supplied test material, which encouraged this, although there
was a downside in that some of the techniques developed were very task-specific. It should be emphasised that there
had been computational work on corpora for many years (much of it by linguists): it became much easier to do corpus
work by the late 1980s as disk space became cheap and machine-readable text became ubiquitous. Despite the shift
in research emphasis to statistical approaches, most commercial systems remained primarily based on hand-coded
linguistic information.
More recently the symbolic/statistical split has become less pronounced, since most researchers are interested in both.7
There is considerable emphasis on machine learning in general, including machine learning for symbolic processing.
Linguistically-based NLP has made something of a comeback, with increasing availability of open source resources,
and the realisation that at least some of the classic statistical techniques seem to be reaching limits on performance,
especially because of difficulties of acquiring training data and in adapting to new types of text. However, modern
linguistically-based NLP approaches are making use of machine learning and statistical processing.
The dotcom boom and bust at the turn of the millennium considerably affected NLP in industry but interest increased
again more recently. The ubiquity of the Internet has completely changed the space of interesting NLP applications
since the early 1990s, and the vast amount of text available can potentially be exploited, especially for statistical
techniques.
• input preprocessing: speech recogniser or text preprocessor (non-trivial in languages like Chinese or for highly
structured text for any language) or gesture recogniser. Such systems might themselves be very complex, but I
won’t discuss them in this course — we’ll assume that the input to the main NLP component is segmented text.
• morphological analysis: this is relatively well-understood for the most common languages that NLP has
considered, but is complicated for many languages (e.g., Turkish, Basque).
• part of speech tagging: not an essential part of most deep processing systems, but sometimes used as a way of
cutting down parser search space.
• parsing: this includes syntax and compositional semantics, which are sometimes treated as separate components.
• disambiguation: this can be done as part of parsing, or (partially) left to a later phase.
• context module: this maintains information about the context, for anaphora resolution, for instance.
• discourse structuring: the part of language generation that’s concerned with deciding what meaning to convey.
• realization: converts meaning representations to strings. This may use the same grammar and lexicon8 as the
parser.
• morphological generation: as with morphological analysis, this is relatively straightforward for English.
7 At least, there are only a few researchers who avoid statistical techniques as a matter of principle and all statistical systems have a symbolic
component!
8 The term lexicon is generally used for the part of the NLP system that contains dictionary-like information — i.e. information about individual
words.
• output processing: text-to-speech, text formatter, etc. As with input processing, this may be complex, but for
now we’ll assume that we’re outputting simple text.
Application specific components: for NL interfaces, email answering and so on, we need an interface between the
semantic representation output by the parser (or accepted by the generator) and the underlying knowledge base. Other
types of application have different requirements.
It is also very important to distinguish between the knowledge sources and the programs that use them. For instance,
a morphological analyser has access to a lexicon and a set of morphological rules: the morphological generator might
share these knowledge sources. The lexicon for the morphology system may be the same as the lexicon for the parser
and generator.
Other things might be required in order to construct the standard components and knowledge sources:
• lexicon acquisition
• grammar acquisition
• acquisition of statistical information
For a component to be a true module, it obviously needs a well-defined set of interfaces. What’s less obvious is that it
needs its own evaluation strategy and test suites: developers need to be able to work somewhat independently.
In principle, at least, components are reusable in various ways: for instance, a parser could be used with multiple
grammars, the same grammar can be processed by different parsers and generators, a parser/grammar combination
could be used in MT or in a natural language interface. However, for a variety of reasons, it is not easy to reuse
components like this, and generally a lot of work is required for each new application, even if it’s based on an existing
grammar or the grammar is automatically acquired.
We can draw schematic diagrams for applications showing how the modules fit together.
                                  KB
                           ↗            ↘
   KB INTERFACE/CONTEXT MODULE        KB OUTPUT/DISCOURSE STRUCTURING
                ↑                                   ↓
             PARSING                           REALIZATION
                ↑                                   ↓
           MORPHOLOGY                     MORPHOLOGY GENERATION
                ↑                                   ↓
        INPUT PROCESSING                    OUTPUT PROCESSING
                ↑                                   ↓
           user input                            output
However, it is doubtful that this describes any real NLP system! For instance, the IBM Jeopardy playing system has
over 100 modules which produce multiple candidate answers — sophisticated probabilistic methods are used to rank
these. Nevertheless, the diagram should give some indication of how multiple components can be combined to build a
full application. In lectures 2–10, various algorithms will be discussed which could be parts of modules in this generic
architecture, although most are also useful in less elaborate contexts. Lecture 12 will discuss a few applications in
some more detail.
• Applications cannot be 100% perfect, because full real world knowledge is not possible.
• Applications that are less than 100% perfect can be useful (humans aren’t 100% perfect anyway).
• Applications that aid humans are much easier to construct than applications which replace humans. It is difficult
to make the limitations of systems which accept speech or language obvious to naive human users.
• NLP interfaces are nearly always competing with a non-language based approach.
• Currently nearly all applications either do relatively shallow processing on arbitrary input or deep processing on
narrow domains. MT can be domain-specific to varying extents: MT on arbitrary text still isn’t very good, but
can be useful.
• Limited domain systems require extensive and expensive expertise to port. Research that relies on extensive
hand-coding of knowledge for small domains is now generally regarded as a dead-end, though reusable hand-
coding is a different matter.
• The development of NLP has been driven as much by hardware and software advances, and societal and
infrastructure changes as by great new ideas. Improvements in NLP techniques are generally incremental rather than
revolutionary.
2 Lecture 2: Morphology and finite-state techniques
This lecture starts with a brief discussion of morphology, concentrating mainly on English morphology. The concept
of a lexicon in an NLP system is discussed with respect to morphological processing. Spelling rules are introduced
and the use of finite state transducers to implement spelling rules is explained. The lecture concludes with a brief
overview of some other uses of finite state techniques in NLP.
form (because un- can’t attach to a noun). Furthermore, although there is a prefix un- that can attach to verbs, it nearly
always denotes a reversal of a process (e.g., untie), whereas the un- that attaches to adjectives means ‘not’, which is
the meaning in the case of un- ion -ise -ed. Hence the internal structure of un- ion -ise -ed has to be (un- ((ion -ise)
-ed)).
In such rules, the mapping is always given from the ‘underlying’ form to the surface form. The mapping is shown to
the left of the slash and the context to the right, with the _ indicating the position in question. ε is used for the empty
string and ˆ for the affix boundary. This particular rule is read as saying that the empty string maps to ‘e’ in the context
where it is preceded by an s, x, or z and an affix boundary and followed by an s. For instance, this maps boxˆs to boxes.
This rule might look as though it is written in a context sensitive grammar formalism, but actually we’ll see in §2.7
that it corresponds to a finite state transducer. Because the rule is independent of the particular affix, it applies equally
to the plural form of nouns and the 3rd person singular present form of verbs. Other spelling rules in English include
consonant doubling (e.g., rat, ratted, though note, not *auditted) and y/ie conversion (party, parties).10
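To make the e-insertion rule concrete, here is a minimal Python sketch that maps an underlying form to a surface form. It uses a regular expression rather than a transducer, writes the affix boundary as an ASCII ^, and implements only this one rule (not consonant doubling or y/ie conversion); the function name to_surface is hypothetical.

```python
import re


def to_surface(underlying):
    """Apply e-insertion, then drop any remaining affix boundaries.

    The boundary '^' is realised as 'e' when it is preceded by s, x or z
    and followed by s; in every other context it is simply not realised.
    """
    with_e = re.sub(r"(?<=[sxz])\^(?=s)", "e", underlying)
    return with_e.replace("^", "")


print(to_surface("box^s"))   # boxes
print(to_surface("buzz^s"))  # buzzes
print(to_surface("cake^s"))  # cakes
```

Because the rule mentions only the spelling context and not the particular affix, the same function covers plural nouns and 3rd person singular verbs alike.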
incorrect. ? is generally used for a sentence which is questionable, or at least doesn’t have the intended interpretation. # is used for a pragmatically
anomalous sentence.
in lectures 4 and 5, we’ll need more detailed syntactic and semantic information. Morphological generation takes a
stem and some syntactic information and returns the correct form. For some applications, there is a requirement that
morphological processing is bidirectional: that is, can be used for analysis and generation. The finite state transducers
we will look at below have this property.
• stems with syntactic categories (plus more detailed information if derivational morphology is to be treated as
productive)
One approach to an affix lexicon is for it to consist of a pairing of affix and some encoding of the syntactic/semantic
effect of the affix.11 For instance, consider the following fragment of a suffix lexicon (we can assume there is a separate
lexicon for prefixes):
ed PAST_VERB
ed PSP_VERB
s PLURAL_NOUN
Here PAST_VERB, PSP_VERB and PLURAL_NOUN are abbreviations for some bundle of syntactic/semantic
information and form the interface between morphology and the syntax/semantics: I’ll discuss this briefly in §5.6.
A lexicon of irregular forms is also needed. One approach is for this to just be a triple consisting of inflected form,
‘affix information’ and stem, where ‘affix information’ corresponds to whatever encoding is used for the regular affix.
For instance:
Note that this information can be used for generation as well as analysis, as can the affix lexicon.
In most cases, English irregular forms are the same for all senses of a word. For instance, ran is the past of run
whether we are talking about athletes, politicians or noses. This argues for associating irregularity with particular
word forms rather than particular senses, especially since compounds also tend to follow the irregular spelling, even
non-productively formed ones (e.g., the plural of dormouse is dormice). However, there are exceptions: e.g., The
washing was hung/*hanged out to dry vs the murderer was hanged.
Morphological analysers also generally have access to a lexicon of regular stems. This is needed for high precision:
e.g. to avoid analysing corpus as corpu -s, we need to know that there isn’t a word corpu. There are also cases where
historically a word was derived, but where the base form is no longer found in the language: we can avoid analysing
unkempt as un- kempt, for instance, simply by not having kempt in the stem lexicon. Ideally this lexicon should have
syntactic information: for instance, feed could be fee -ed, but since fee is a noun rather than a verb, this isn’t a possible
analysis. However, in the approach I’ll assume, the morphological analyser is split into two stages. The first of these
only concerns morpheme forms and returns both fee -ed and feed given the input feed. A second stage which is
closely coupled to the syntactic analysis then rules out fee -ed because the affix and stem syntactic information are not
compatible (see §5.6 for one approach to this).
If morphology was purely concatenative, it would be very simple to write an algorithm to split off affixes. Spelling
rules complicate this somewhat: in fact, it’s still possible to do a reasonable job for English with ad hoc code, but a
cleaner and more general approach is to use finite state techniques.
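The purely concatenative case can be sketched in a few lines of Python. The suffix lexicon follows the fragment given earlier; the stem lexicon STEMS and the function name split_candidates are hypothetical, and no spelling rules are applied, which is exactly the limitation the text describes.

```python
# Suffix lexicon from the fragment above: suffix -> affix information.
SUFFIXES = {"ed": ["PAST_VERB", "PSP_VERB"], "s": ["PLURAL_NOUN"]}

# Hypothetical stem lexicon, just large enough for the examples below.
STEMS = {"corpus", "rat", "box", "walk"}


def split_candidates(word):
    """Return (stem, suffix) pairs licensed by the lexicons.

    A suffix of None means the word is itself a known stem. Only plain
    concatenation is considered: spelling rules are ignored.
    """
    analyses = []
    if word in STEMS:
        analyses.append((word, None))
    for suffix in SUFFIXES:
        if word.endswith(suffix) and word[: -len(suffix)] in STEMS:
            analyses.append((word[: -len(suffix)], suffix))
    return analyses


print(split_candidates("rats"))    # [('rat', 's')]
print(split_candidates("corpus"))  # [('corpus', None)]: no stem 'corpu'
print(split_candidates("boxes"))   # []: e-insertion is not handled
```

The last example fails because stripping -s leaves boxe, which is not a stem; undoing the spelling rule is what the finite state approach handles cleanly.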
11 J&M describe an alternative approach which is to make the syntactic information correspond to a level in a finite state transducer. However, at
2.6 Finite state automata for recognition
The approach to spelling rules that I’ll describe involves the use of finite state transducers (FSTs). Rather than jumping
straight into this, I’ll briefly consider the simpler finite state automata and how they can be used in a simple recogniser.
Suppose we want to recognise dates (just day and month pairs) written in the format day/month. The day and the
month may be expressed as one or two digits (e.g. 11/2, 1/12 etc). This format corresponds to the following simple
FSA, where each character corresponds to one transition:
[Figure: FSA for day/month dates, states 1 to 6, with transitions labelled digit.]
Accept states are shown with a double circle. This is a non-deterministic FSA: for instance, an input starting with the
digit 3 will move the FSA to both state 2 and state 3. This corresponds to a local ambiguity: i.e., one that will be
resolved by subsequent context. By convention, there must be no ‘left over’ characters when the system is in the final
state.
To make this a bit more interesting, suppose we want to recognise a comma-separated list of such dates. The FSA,
shown below, now has a cycle and can accept a sequence of indefinite length (note that this is iteration and not full
recursion, however).
[Figure: FSA for a comma-separated list of dates, states 1 to 6, with transitions labelled digit and a cycle.]
Both these FSAs will accept sequences which are not valid dates, such as 37/00. Conversely, if we use them to generate
(random) dates, we will get some invalid output. In general, a system which generates output which is invalid is said
to overgenerate. In fact, in many language applications, some amount of overgeneration can be tolerated, especially if
we are only concerned with analysis.
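The two recognisers can be sketched as a set-of-states simulation of a non-deterministic FSA. This is only an illustration: the figures are not reproduced here, so the transition table below is an assumed reconstruction (chosen so that a first digit leads to both state 2 and state 3, as described above), and the names TRANSITIONS and accepts are hypothetical.

```python
DIGITS = set("0123456789")

# Assumed transition table: (state, symbol class) -> set of next states.
# A first digit may be the only digit (go to 3) or the first of two (go to 2).
TRANSITIONS = {
    (1, "digit"): {2, 3},
    (2, "digit"): {3},
    (3, "/"): {4},
    (4, "digit"): {5, 6},
    (5, "digit"): {6},
    (6, ","): {1},  # the cycle that allows a comma-separated list
}
ACCEPT = {6}


def symbol_class(ch):
    """Group all digits into one class; other characters stand for themselves."""
    return "digit" if ch in DIGITS else ch


def accepts(text):
    """Track every state the FSA could be in; accept only if all input is used."""
    states = {1}
    for ch in text:
        cls = symbol_class(ch)
        states = set().union(*(TRANSITIONS.get((s, cls), set()) for s in states))
        if not states:
            return False
    return bool(states & ACCEPT)


print(accepts("11/2"), accepts("1/12,3/4"), accepts("37/00"), accepts("123/4"))
```

The third call illustrates overgeneration: 37/00 is accepted although it is not a valid date.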
complex transducer which is an accurate reflection of the spelling rule. They also use an explicit terminating character while I prefer to rely on the
‘use all the input’ convention, which results in simpler rules.
ε → e / {s, x, z} ˆ _ s
[Figure: FST for the e-insertion spelling rule. Transition labels: e:e, other : other, ε:ˆ, s:s, x:x, z:z, e:ˆ.]
Transducers map between two representations, so each transition corresponds to a pair of characters. As with the
spelling rule, we use the special character ‘ε’ to correspond to the empty character and ‘ˆ’ to correspond to an affix
boundary. The abbreviation ‘other : other’ means that any character not mentioned specifically in the FST maps to
itself. As with the FSA example, we assume that the FST only accepts an input if the end of the input corresponds
to an accept state (i.e., no ‘left-over’ characters are allowed).
For instance, with this FST, the surface form cakes would start from 1 and go through the transitions/states (c:c) 1,
(a:a) 1, (k:k) 1, (e:e) 1, (ε:ˆ) 2, (s:s) 3 (accept, underlying cakeˆs) and also (c:c) 1, (a:a) 1, (k:k) 1, (e:e) 1, (s:s) 4
(accept, underlying cakes). ‘d o g s’ maps to ‘d o g ˆ s’, ‘f o x e s’ maps to ‘f o x ˆ s’ and to ‘f o x e ˆ s’, and
‘b u z z e s’ maps to ‘b u z z ˆ s’ and ‘b u z z e ˆ s’. When the transducer is run in analysis mode, this means the system can
detect an affix boundary (and hence look up the stem and the affix in the appropriate lexicons). In generation mode, it
can construct the correct string. This FST is non-deterministic.
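The two directions can be illustrated with a small sketch. Note that this is not a transducer: it applies the e-insertion rule directly in generation mode and inverts it by trial in analysis mode, and it only proposes analyses that contain an affix boundary. The function names are hypothetical, and the ASCII character ^ stands in for the affix boundary.

```python
SIBILANT_ENDINGS = ("s", "x", "z")


def generate(underlying):
    """Map an underlying form such as 'fox^s' to its surface spelling."""
    if "^" not in underlying:
        return underlying
    stem, affix = underlying.split("^", 1)
    # e-insertion: an e appears between a stem ending in s, x or z and the affix s.
    if affix == "s" and stem.endswith(SIBILANT_ENDINGS):
        return stem + "e" + affix
    return stem + affix


def analyse(surface):
    """Return the candidate underlying forms stem^s that generate `surface`."""
    candidates = set()
    if surface.endswith("s"):
        # The stem is everything before the final s, or, if the rule
        # inserted an e, everything before the final es.
        for stem in (surface[:-1], surface[:-2]):
            if stem and generate(stem + "^s") == surface:
                candidates.add(stem + "^s")
    return sorted(candidates)


for word in ("cakes", "dogs", "foxes", "buzzes"):
    print(word, analyse(word))
```

As in the text, foxes and buzzes each receive two candidate analyses; a stem lexicon would then rule out foxe and buzze.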
Similar FSTs can be written for the other spelling rules for English (although to do consonant doubling correctly, in-
formation about stress and syllable boundaries is required and there are also differences between British and American
spelling conventions which complicate matters). Morphology systems are usually implemented so that there is one
FST per spelling rule and these operate in parallel.
One issue with this use of FSTs is that they do not allow for any internal structure of the word form. For instance, we
can produce a set of FSTs which will result in unionised being mapped into unˆionˆiseˆed, but as we’ve seen, the
affixes actually have to be applied in the right order and this isn’t modelled by the FSTs.
state grammar may be adequate. More complex grammars can be written as context free grammars (CFGs) and
compiled into finite state approximations.
• Partial grammars for named entity recognition (briefly discussed in §4.11).
• Dialogue models for spoken dialogue systems (SDS). SDS use dialogue models for a variety of purposes: in-
cluding controlling the way that the information acquired from the user is instantiated (e.g., the slots that are
filled in an underlying database) and limiting the vocabulary to achieve higher recognition rates. FSAs can be
used to record possible transitions between states in a simple dialogue. For instance, consider the problem of
obtaining a date expressed as a day and a month from a user. There are four possible states, corresponding to
the user input recognised so far:
1. No information. System prompts for month and day.
2. Month only is known. System prompts for day.
3. Day only is known. System prompts for month.
4. Month and day known.
The FSA is shown below. The loops that stay in a single state correspond to user responses that aren’t recognised
as containing the required information (mumble is the term generally used for an unrecognised input).
[Figure: dialogue FSA. States 1, 2 and 3 each have a mumble loop; the other transitions are labelled month, day, and day & month.]
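The four dialogue states listed above can be sketched as a transition table. This is a minimal illustration, not the original figure: the label day&month and the function name run_dialogue are hypothetical, and I assume that any input with no listed transition is treated like a mumble (the state does not change).

```python
# State 1: nothing known, 2: month known, 3: day known, 4: both known.
DIALOGUE = {
    1: {"mumble": 1, "month": 2, "day": 3, "day&month": 4},
    2: {"mumble": 2, "day": 4},
    3: {"mumble": 3, "month": 4},
}
PROMPTS = {1: "month and day", 2: "day", 3: "month"}


def run_dialogue(inputs, state=1):
    """Follow the recognised inputs until the date is complete (state 4)."""
    for recognised in inputs:
        if state == 4:
            break
        # Assumption: an unlisted input leaves the state unchanged.
        state = DIALOGUE[state].get(recognised, state)
    return state


final = run_dialogue(["mumble", "month", "mumble", "day"])
print(final, PROMPTS.get(final, "done"))
```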
[Figure: FSA with states 1 to 3 and transition weights 0.1, 0.1, 0.2, 0.5, 0.1, 0.3, 0.9, 0.8.]
3 Lecture 3: Prediction and part-of-speech tagging
This lecture introduces some simple statistical techniques and illustrates their use in NLP for prediction of words and
part-of-speech categories. It starts with a discussion of corpora, then introduces word prediction. Word prediction can
be seen as a way of (crudely) modelling some syntactic information (i.e., word order). Similar statistical techniques
can also be used to discover parts of speech for uses of words in a corpus. The lecture concludes with some discussion
of evaluation.
3.1 Corpora
A corpus (corpora is the plural) is simply a body of text that has been collected for some purpose. A balanced
corpus contains texts which represent different genres (newspapers, fiction, textbooks, parliamentary reports, cooking
recipes, scientific papers etc.): early examples were the Brown corpus (US English) and the Lancaster-Oslo-Bergen
(LOB) corpus (British English) which are each about 1 million words: the more recent British National Corpus (BNC)
contains approx 100 million words and includes 20 million words of spoken English. Corpora are important for
many types of linguistic research, although mainstream linguists have in the past tended to dismiss their use in favour
of reliance on intuitive judgements about whether or not an utterance is grammatical. A corpus can only (directly)
provide positive evidence about grammaticality. Many linguists are gradually coming round to their use. Corpora are
essential for most modern NLP research, though NLP researchers have often used newspaper text (particularly the Wall
Street Journal) rather than balanced corpora. Distributed corpora are often annotated in some way: the most important
type of annotation for NLP is part-of-speech tagging (POS tagging), which I’ll discuss further below. Corpora may
also be collected for a specific task. For instance, when implementing an email answering application, it is essential
to collect samples of representative emails. For interface applications in particular, collecting a corpus requires a
simulation of the actual application: generally this is done by a Wizard of Oz experiment, where a human pretends to
be a computer.
Corpora are needed in NLP for two reasons. Firstly, we have to evaluate algorithms on real language: corpora are
required for this purpose for any style of NLP. Secondly, corpora provide the data source for many machine-learning
approaches.
3.2 Prediction
The essential idea of prediction is that, given a sequence of words, we want to determine what’s most likely to come
next. There are a number of reasons to want to do this: the most important is as a form of language modelling for
automatic speech recognition. Speech recognisers cannot accurately determine a word from the sound signal for that
word alone, and they cannot reliably tell where each word starts and finishes.15 So the most probable word is chosen
on the basis of the language model, which predicts the most likely word, given the prior context. The language models
which are currently most effective work on the basis of n-grams (a type of Markov chain), where the sequence of the
prior n − 1 words is used to predict the next. Trigram models use the preceding 2 words, bigram models the preceding
word and unigram models use no context at all, but simply work on the basis of individual word probabilities. Bigrams
are discussed below, though I won’t go into details of exactly how they are used in speech recognition.
Word prediction is also useful in communication aids: i.e., systems for people who can’t speak because of some form
of disability. People who use text-to-speech systems to talk because of a non-linguistic disability usually have some
form of general motor impairment which also restricts their ability to type at normal rates (stroke, ALS, cerebral
palsy etc). Often they use alternative input devices, such as adapted keyboards, puffer switches, mouth sticks or
eye trackers. Generally such users can only construct text at a few words a minute, which is too slow for anything
like normal communication to be possible (normal speech is around 150 words per minute). As a partial aid, a word
prediction system is sometimes helpful: this gives a list of candidate words that changes as the initial letters are entered
by the user. The user chooses the desired word from a menu when it appears. The main difficulty with using statistical
prediction models in such applications is in finding enough data: to be useful, the model really has to be trained on an
individual speaker’s output, but of course very little of this is likely to be available. Training a conversational aid on
newspaper text can be worse than using a unigram model from the user’s own data.
15 In fact, although humans are better at doing this than speech recognisers, we also need context to recognise words, especially words like the
and a. If a recording is made of normal, fluently spoken, speech and the segments corresponding to the and a are presented to a subject in isolation,
it’s generally not possible to tell the difference.
Prediction is important in estimation of entropy, including estimations of the entropy of English. The notion of entropy
is important in language modelling because it gives a metric for the difficulty of the prediction problem. For instance,
speech recognition is vastly easier in situations where the speaker is only saying two easily distinguishable words (e.g.,
when a dialogue system prompts by saying answer ‘yes’ or ‘no’) than when the vocabulary is unlimited: measurements
of entropy can quantify this, but won’t be discussed further in this course.
Other applications for prediction include optical character recognition (OCR), spelling correction and text segmen-
tation for languages such as Chinese, which are conventionally written without explicit word boundaries. Some ap-
proaches to word sense disambiguation, to be discussed in lecture 7, can also be treated as a form of prediction.
3.3 Bigrams
A bigram model assigns a probability to a word based on the previous word alone: i.e., $P(w_n \mid w_{n-1})$ (the probability
of $w_n$ conditional on $w_{n-1}$) where $w_n$ is the nth word in some string. For application to communication aids, we
are simply concerned with predicting the next word: once the user has made their choice, the word can’t be changed.
However, for speech recognition and similar applications, we require the probability of some string of words $P(w_1^n)$
which is approximated by the product of the bigram probabilities:
$$P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})$$
We acquire these probabilities from a corpus. For example, suppose we have the following tiny corpus of utterances:
good morning
good afternoon
good afternoon
it is very good
it is good
I’ll use the symbol <s> to indicate the beginning of the sentence and </s> to indicate the end, so the corpus really looks
like:
<s> good morning </s> <s> good afternoon </s> <s> good afternoon </s> <s> it is very good </s> <s> it is good </s>
sequence          count   bigram probability
afternoon         2
afternoon </s>    2       1
it                2
it is             2       1
is                2
is very           1       .5
is good           1       .5
very              1
very good         1       1
</s>              5
</s> <s>          4       1
This yields a probability of 0.24 for the string ‘<s> good </s>’ and also for ‘<s> good afternoon </s>’.
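The estimation can be checked with a short sketch over the tiny corpus. The function names are hypothetical, and for simplicity I assume each utterance is counted separately, so the bigram that runs from one utterance's end marker to the next start marker is not counted.

```python
from collections import Counter

CORPUS = [
    "good morning",
    "good afternoon",
    "good afternoon",
    "it is very good",
    "it is good",
]


def tokens_of(utterance):
    """Wrap an utterance in sentence-boundary markers."""
    return ["<s>"] + utterance.split() + ["</s>"]


def bigram_model(utterances):
    """Estimate P(word | prev) as count(prev word) / count(prev)."""
    context_counts = Counter()
    bigram_counts = Counter()
    for utterance in utterances:
        toks = tokens_of(utterance)
        for prev, word in zip(toks, toks[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return {bg: n / context_counts[bg[0]] for bg, n in bigram_counts.items()}


def string_probability(model, utterance):
    """Product of the bigram probabilities; unseen bigrams contribute 0."""
    toks = tokens_of(utterance)
    p = 1.0
    for prev, word in zip(toks, toks[1:]):
        p *= model.get((prev, word), 0.0)
    return p


MODEL = bigram_model(CORPUS)
print(MODEL[("<s>", "good")], MODEL[("good", "</s>")])
print(string_probability(MODEL, "good"), string_probability(MODEL, "good afternoon"))
```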
For speech recognition, the n-gram approach is applied to maximise the likelihood of a sequence of words, hence
we’re looking to find the most likely sequence overall. Notice that we can regard bigrams as comprising a simple
deterministic weighted FSA. The Viterbi algorithm, a dynamic programming technique for efficiently applying
n-grams in speech recognition and other applications to find the highest probability sequence (or sequences), is usually
described in terms of an FSA.
The probability of ‘<s> very good </s>’ based on this corpus is 0, because the conditional probability of ‘very’ given ‘<s>’
is 0: we haven’t found any examples of this in the training data. In general, this is problematic because we will
never have enough data to ensure that we will see all possible events and so we don’t want to rule out unseen events
entirely. To allow for sparse data we have to use smoothing, which simply means that we make some assumption
about the ‘real’ probability of unseen or very infrequently seen events and distribute that probability appropriately. A
common approach is simply to add one to all counts: this is add-one smoothing which is not sound theoretically, but
is simple to implement. A better approach in the case of bigrams is to back off to the unigram probabilities: i.e., to
distribute the unseen probability mass so that it is proportional to the unigram probabilities. This sort of estimation is
extremely important to get good results from n-gram techniques, but I won’t discuss the details in this course.
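Add-one smoothing can be sketched as follows for the tiny corpus. This is an illustration under stated assumptions: the vocabulary is taken to be every token that can follow another (so it includes the end marker but not the start marker), and the names are hypothetical.

```python
from collections import Counter

CORPUS = [
    "good morning",
    "good afternoon",
    "good afternoon",
    "it is very good",
    "it is good",
]

context_counts = Counter()
bigram_counts = Counter()
for utterance in CORPUS:
    toks = ["<s>"] + utterance.split() + ["</s>"]
    for prev, word in zip(toks, toks[1:]):
        context_counts[prev] += 1
        bigram_counts[(prev, word)] += 1

# Assumed vocabulary: all tokens that occur as the second element of a bigram.
VOCAB = sorted({word for (_, word) in bigram_counts})


def add_one_probability(prev, word):
    """(count(prev word) + 1) / (count(prev) + vocabulary size)."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(VOCAB))


# The unseen bigram now has a small non-zero probability ...
print(add_one_probability("<s>", "very"))
# ... and the distribution over the vocabulary still sums to 1.
print(sum(add_one_probability("<s>", w) for w in VOCAB))
```

With seven vocabulary items and five occurrences of the start marker, the unseen bigram receives probability 1/12, while the seen bigram for good after the start marker drops from 0.6 to 4/12.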
The sentence they can fish has two readings: one (the most likely) about the ability to fish and the other about putting fish in cans. fish is ambiguous
between a singular noun, plural noun and a verb, while can is ambiguous between singular noun, verb (the ‘put in
cans’ use) and modal verb. However, they is unambiguously a pronoun. (I am ignoring some less likely possibilities,
such as proper names.) These distinctions can be indicated by POS tags:
they PNP
can VM0 VVB VVI NN1
fish NN1 NN2 VVB VVI
There are several standard tagsets used in corpora and in POS tagging experiments. The one I’m using for the examples
in this lecture is CLAWS 5 (C5) which is given in full in Figure 5.9 in J&M. The meaning of the tags above is:
A POS tagger resolves the lexical ambiguities to give the most likely set of tags for the sentence. In this case, the right
tagging is likely to be:
Note the tag for the full stop: punctuation is treated as unambiguous. POS tagging can be regarded as a form of very
basic word sense disambiguation.
The other syntactically possible reading is:
However, POS taggers (unlike full parsers) don’t attempt to produce globally coherent analyses. Thus a POS tagger
might return:
despite the fact that this doesn’t correspond to a possible reading of the sentence.
POS tagging is useful as a way of annotating a corpus because it makes it easier to extract some types of information
(for linguistic research or NLP experiments). It also acts as a basis for more complex forms of annotation. Named
entity recognisers (discussed in lecture 4) are generally run on POS-tagged data. POS taggers are sometimes run as
preprocessors to full parsing, since this can cut down the search space to be considered by the parser. They can also
be used as part of a method for dealing with words which are not in the parser’s lexicon (unknown words).
They used to can fish in those towns. But now few people fish in these areas.
sequence    count   bigram probability
PRP         1
PRP DT0     2       1
PUN         1
PUN CJC     1       1
TO0         1
TO0 VVI     1       1
VVB         1
VVB PRP     1       1
VVD         1
VVD TO0     1       1
VVI         1
VVI NN2     1       1
I have used the correct PUN CJC probability, allowing for the final PUN. We can also obtain a lexicon from the tagged
data:
The idea of stochastic POS tagging is that the tag can be assigned based on consideration of the lexical probability
(how likely it is that the word has that tag), plus the sequence of prior tags. For a bigram model, we only look at a
single previous tag. This is more complicated than the word prediction case because we have to take into account both
words and tags.
We wish to produce a sequence of tags which have the maximum probability given a sequence of words. I will follow
J&M’s notation: the hat, ˆ, means “estimate of”, so $\hat{t}_1^n$ means “estimate of the sequence of n tags”, and
$\operatorname{argmax}_x f(x)$ means “the x such that f(x) is maximized”. Hence:
$$\hat{t}_1^n = \underset{t_1^n}{\operatorname{argmax}}\; P(t_1^n \mid w_1^n)$$
We can’t estimate this directly (mini-exercise: explain why not). By Bayes theorem:
$$P(t_1^n \mid w_1^n) = \frac{P(w_1^n \mid t_1^n)\, P(t_1^n)}{P(w_1^n)}$$
Since we’re looking at assigning tags to a particular sequence of words, $P(w_1^n)$ is constant, so for a relative measure
of probability we can use:
$$\hat{t}_1^n = \underset{t_1^n}{\operatorname{argmax}}\; P(w_1^n \mid t_1^n)\, P(t_1^n)$$
We now have to estimate $P(t_1^n)$ and $P(w_1^n \mid t_1^n)$. If we make the bigram assumption, then the probability of a tag
depends on the previous tag, hence the tag sequence is estimated as a product of the probabilities:
$$P(t_1^n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})$$
We will also assume that the probability of the word is independent of the words and tags around it and depends only
on its own tag:
$$P(w_1^n \mid t_1^n) \approx \prod_{i=1}^{n} P(w_i \mid t_i)$$
These values can be estimated from the corpus frequencies. So our final equation for the HMM POS tagger using
bigrams is:
$$\hat{t}_1^n = \underset{t_1^n}{\operatorname{argmax}}\; \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$$
Note that we end up multiplying P (ti |ti−1 ) with P (wi |ti ) (the probability of the word given the tag) rather than
P (ti |wi ) (the probability of the tag given the word). For instance, if we’re trying to choose between the tags NN2
and VVB for fish in the sentence they fish, we calculate P (NN2|PNP), P (fish|NN2), P (VVB|PNP) and P (fish|VVB)
(assuming PNP is the only possible tag for they).
As the equation above indicates, in order to POS tag a sentence, we maximise the overall tag sequence probability
(again, this can be implemented efficiently using the Viterbi algorithm). So a tag which has high probability consid-
ering its individual bigram estimate will not be chosen if it does not form part of the highest probability path. For
example:
they PNP can VVB fish NN2
they PNP can VM0 fish VVI
The product of P (VVI|VM0) and P (fish|VVI) may be lower than that of P (NN2|VVB) and P (fish|NN2) but the
overall probability depends also on P (can|VVB) versus P (can|VM0) and the latter (modal) use has much higher
frequency in a balanced corpus.
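The maximisation can be sketched with a small Viterbi-style search. All the probabilities below are invented for illustration (they are not corpus estimates), and the names viterbi, TRANS and EMIT are hypothetical. The toy numbers are chosen so that the last step of the modal path scores lower than the last step of the noun path, yet the modal path still wins overall.

```python
TAGS = ["PNP", "VM0", "VVB", "VVI", "NN1", "NN2"]

# Invented transition probabilities P(tag | previous tag); missing pairs count as 0.
TRANS = {
    ("<s>", "PNP"): 1.0,
    ("PNP", "VM0"): 0.4,
    ("PNP", "VVB"): 0.5,
    ("VM0", "VVI"): 0.8,
    ("VVB", "NN2"): 0.3,
}

# Invented lexical probabilities P(word | tag); missing pairs count as 0.
EMIT = {
    ("they", "PNP"): 0.1,
    ("can", "VM0"): 0.2,
    ("can", "VVB"): 0.0005,
    ("fish", "NN2"): 0.001,
    ("fish", "VVI"): 0.0003,
}


def viterbi(words, tags, trans, emit, start="<s>"):
    """Return (probability, tag list) of the best path under the bigram model."""
    # best[tag] = (probability, path) of the best path so far ending in tag.
    best = {start: (1.0, [])}
    for word in words:
        new_best = {}
        for tag in tags:
            e = emit.get((word, tag), 0.0)
            for prev, (p, path) in best.items():
                score = p * trans.get((prev, tag), 0.0) * e
                if score > new_best.get(tag, (0.0, None))[0]:
                    new_best[tag] = (score, path + [tag])
        best = new_best
    if not best:
        return 0.0, []
    return max(best.values(), key=lambda item: item[0])


prob, path = viterbi("they can fish".split(), TAGS, TRANS, EMIT)
print(path, prob)
```

Here P(VVI|VM0) P(fish|VVI) is 0.00024, lower than P(NN2|VVB) P(fish|NN2) at 0.0003, but the much higher value assumed for P(can|VM0) makes the modal reading the highest probability path.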
In fact, POS taggers generally use trigrams rather than bigrams — the relevant equations are given in J&M, 5.5.4. As
with word prediction, backoff (to bigrams) and smoothing are crucial for reasonable performance because of sparse
data.
When a POS tagger sees a word which was not in its training data, we need some way of assigning possible tags to the
word. One approach is simply to use all possible open class tags, with probabilities based on the unigram probabilities
of those tags. Open class words are ones for which we can never give a complete list for a living language, since words
are always being added: i.e., verbs, nouns, adjectives and adverbs. The rest are considered closed class. A better
approach is to use a morphological analyser (without a lexicon) to restrict this set: e.g., words ending in -ed are likely
to be VVD (simple past) or VVN (past participle), but can’t be VVG (-ing form).
discriminating power. On the other hand, if we tagged I and we as PRP1, you as PRP2 and so on, the n-gram approach
would allow some discrimination. In general, predicting on the basis of classes means we have less of a sparse data
problem than when predicting on the basis of words, but we also lose discriminating power. There is also something
of a tradeoff between the utility of a set of tags and their usefulness in POS tagging. For instance, C5 assigns separate
tags for the different forms of be, which is redundant for many purposes, but helps make distinctions between other
tags in tagging models such as the one described here where the context is given by a tag sequence alone (i.e., rather
than considering words prior to the current one).
POS tagging exemplifies some general issues in NLP evaluation:
Training data and test data The assumption in NLP is always that a system should work on novel data, therefore
test data must be kept unseen.
For machine learning approaches, such as stochastic POS tagging, the usual technique is to split a data set into
90% training and 10% test data. Care needs to be taken that the test data is representative.
For an approach that relies on significant hand-coding, the test data should be literally unseen by the researchers.
Development cycles involve looking at some initial data, developing the algorithm, testing on unseen data,
revising the algorithm and testing on a new batch of data. The seen data is kept for regression testing.
Baselines Evaluation should be reported with respect to a baseline, which is normally what could be achieved with a
very basic approach, given the same training data. For instance, the baseline for POS tagging with training data
is to choose the most common tag for a particular word on the basis of the training data (and to simply choose
the most frequent tag of all for unseen words).
Ceiling It is often useful to try and compute some sort of ceiling for the performance of an application. This is usually
taken to be human performance on that task, where the ceiling is the percentage agreement found between two
annotators (interannotator agreement). For POS tagging, this has been reported as 96% (which makes existing
POS taggers look impressive since some perform at higher accuracy). However this raises lots of questions:
relatively untrained human annotators working independently often have quite low agreement, but trained an-
notators discussing results can achieve much higher performance (approaching 100% for POS tagging). Human
performance varies considerably between individuals. Fatigue can cause errors, even with very experienced
annotators. In any case, human performance may not be a realistic ceiling on relatively unnatural tasks, such as
POS tagging.
Error analysis The error rate on a particular problem will be distributed very unevenly. For instance, a POS tagger
will never confuse the tag PUN with the tag VVN (past participle), but might confuse VVN with AJ0 (adjective)
because there’s a systematic ambiguity for many forms (e.g., given). For a particular application, some errors
may be more important than others. For instance, if one is looking for relatively low frequency cases of de-
nominal verbs (that is verbs derived from nouns — e.g., canoe, tango, fork used as verbs), then POS tagging is
not directly useful in general, because a verbal use without a characteristic affix is likely to be mistagged. This
makes POS-tagging less useful for lexicographers, who are often specifically interested in finding examples of
unusual word uses. Similarly, in text categorisation, some errors are more important than others: e.g. treating
an incoming order for an expensive product as junk email is a much worse error than the converse.
Reproducibility If at all possible, evaluation should be done on a generally available corpus so that other researchers
can replicate the experiments.
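The baseline described above (most common tag per word, most frequent tag overall for unseen words) can be sketched as follows. The tagged data is a tiny invented sample, far too small for a real evaluation, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

# Invented (word, tag) pairs standing in for training and held-out test data.
TRAIN = [
    ("they", "PNP"), ("can", "VM0"), ("fish", "VVI"),
    ("they", "PNP"), ("fish", "VVB"),
    ("they", "PNP"), ("can", "NN1"), ("fish", "NN2"),
    ("can", "VM0"), ("fish", "NN2"),
]
TEST = [("they", "PNP"), ("can", "VM0"), ("fish", "VVI"), ("swim", "VVI")]


def train_baseline(tagged):
    """Return (most common tag per word, most frequent tag overall)."""
    per_word = defaultdict(Counter)
    overall = Counter()
    for word, tag in tagged:
        per_word[word][tag] += 1
        overall[tag] += 1
    lexicon = {word: tags.most_common(1)[0][0] for word, tags in per_word.items()}
    default = overall.most_common(1)[0][0]
    return lexicon, default


def accuracy(lexicon, default, tagged):
    """Proportion of test tokens whose baseline tag matches the gold tag."""
    correct = sum(1 for word, tag in tagged if lexicon.get(word, default) == tag)
    return correct / len(tagged)


LEXICON, DEFAULT = train_baseline(TRAIN)
print(LEXICON, DEFAULT, accuracy(LEXICON, DEFAULT, TEST))
```

On this sample the baseline gets they and can right, but mistags the verbal use of fish and the unseen word swim, which is the kind of uneven error distribution that error analysis is meant to expose.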
4 Lecture 4: Context-free grammars and parsing.
In this lecture, I’ll discuss syntax in a way which is much closer to the standard notions in formal linguistics than
POS-tagging is. To start with, I’ll briefly motivate the idea of a generative grammar in linguistics, review the notion
of a context-free grammar and then show a context-free grammar for a tiny fragment of English. We’ll then see how
context free grammars can be used to implement parsers, and discuss chart parsing, which allows efficient processing
of strings containing a high degree of ambiguity. Finally we’ll briefly touch on probabilistic context-free approaches.
can be bracketed
The phrase, big dog, is an example of a constituent (i.e. something that is enclosed in a pair of brackets): the big dog
is also a constituent, but the big is not. Constituent structure is generally justified by arguments about substitution
which I won’t go into here: J&M discuss this briefly, but see an introductory syntax book for a full discussion. In this
course, I will simply give bracketed structures and hope that the constituents make sense intuitively, rather than trying
to justify them.
Two grammars are said to be weakly-equivalent if they generate the same strings. Two grammars are strongly-
equivalent if they assign the same bracketings to all strings they generate.
In most, but not all, approaches, the internal structures are given labels. For instance, the big dog is a noun phrase
(abbreviated NP), slept, slept in the park and licked Sandy are verb phrases (VPs). The labels such as NP and VP
correspond to non-terminal symbols in a grammar. In this lecture, I’ll discuss the use of simple context-free grammars
for language description, moving onto a more expressive formalism in lecture 5.
The formal description of a CFG generally allows productions with an empty right-hand side (e.g., Det → ε). It is
convenient to exclude these however, since they complicate parsing algorithms, and a weakly-equivalent grammar can
always be constructed that disallows such empty productions.
A grammar in which all nonterminal daughters are the leftmost daughter in a rule (i.e., where all rules are of the form
X → Y a∗), is said to be left-associative. A grammar where all the nonterminals are rightmost is right-associative.
Such grammars are weakly-equivalent to regular grammars (i.e., grammars that can be implemented by FSAs), but
natural languages seem to require more expressive power than this (see §4.11).
S -> NP VP
VP -> VP PP
VP -> V
VP -> V NP
VP -> V VP
NP -> NP PP
PP -> P NP
;;; lexicon
V -> can
V -> fish
NP -> fish
NP -> rivers
NP -> pools
NP -> December
NP -> Scotland
NP -> it
NP -> they
P -> in
The rules with terminal symbols on the right hand side correspond to the lexicon. Here and below, comments are
preceded by ;;;
Here are some strings which this grammar generates, along with their bracketings:
they fish
(S (NP they) (VP (V fish)))
they can fish
(S (NP they) (VP (V can) (VP (V fish))))
;;; the modal verb ‘are able to’ reading
(S (NP they) (VP (V can) (NP fish)))
;;; the less plausible, put fish in cans, reading
they fish in rivers
(S (NP they) (VP (VP (V fish)) (PP (P in) (NP rivers))))
they fish in rivers in December
(S (NP they) (VP (VP (V fish)) (PP (P in) (NP (NP rivers) (PP (P in) (NP December))))))
;;; i.e. the implausible reading where the rivers are in December
;;; (cf rivers in Scotland)
(S (NP they) (VP (VP (VP (V fish)) (PP (P in) (NP rivers))) (PP (P in) (NP December))))
;;; i.e. the fishing is done in December
One important thing to notice about these examples is that there’s lots of potential for ambiguity. In the they can fish
example, this is due to lexical ambiguity (it arises from the dual lexical entries of can and fish), but the last example
demonstrates purely structural ambiguity. In this case, the ambiguity arises from the two possible attachments of the
prepositional phrase (PP) in December: it can attach to the NP (rivers) or to the VP. These attachments correspond
to different semantics, as indicated by the glosses. PP attachment ambiguities are a major headache in parsing, since
sequences of four or more PPs are common in real texts and the number of readings increases as the Catalan series,
which is exponential. Other phenomena have similar properties: for instance, compound nouns (e.g. long-stay car
park shuttle bus). Humans disambiguate such attachments as they hear a sentence, but they’re relying on the meaning
in context to do this, in a way we cannot currently emulate, except when the sentences are restricted to a very limited
domain.
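To give a feel for how quickly the Catalan series grows, here is a short sketch using the standard closed form C(n) = (2n choose n) / (n + 1). The exact correspondence between the index n and the number of PPs depends on the grammar, so the code only lists the series itself.

```python
from math import comb


def catalan(n):
    """n-th Catalan number, using the closed form (2n choose n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)


print([catalan(n) for n in range(1, 11)])
```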
Notice that fish could have been entered in the lexicon directly as a VP, but that this would cause problems if we were
doing inflectional morphology, because we want to say that suffixes like -ed apply to Vs. Making rivers etc NPs rather
than nouns is a simplification I’ve adopted here to keep the grammar smaller.
[Figure: parse tree for they can fish in December; the same structure is given as a labelled bracketing below.]
(S (NP they)
(VP (V can)
(VP (VP (V fish))
(PP (P in)
(NP December)))))
A chart is a collection of edges, usually implemented as a vector of edges, indexed by edge identifiers. In the simplest
version of chart parsing, each edge records a rule application and has the following structure:
mother category refers to the rule that has been applied to create the edge. daughters is a list of the edges that acted
as the daughters for this particular rule application: it is there purely for record keeping so that the output of parsing
can be a labelled bracketing.
For instance, the following edges would be among those found on the chart after a complete parse of they can fish
according to the grammar given above (id numbering is arbitrary):
The daughters for the terminal rule applications are simply the input word strings.
Note that local ambiguities correspond to situations where a particular span has more than one associated edge. We’ll
see below that we can pack structures so that we never have two edges with the same category and the same span, but
we’ll ignore this for the moment (see §4.8). Also, in this chart we’re only recording complete rule applications: this is
passive chart parsing. The more efficient active chart is discussed below, in §4.9.
Parse:
  Initialise the chart (i.e., clear previous results)
  For each word word in the input sentence, let from be the left vertex, to be the right vertex and daughters be (word)
    For each category category that is lexically associated with word
      Add new edge from, to, category, daughters
  Output results for all spanning edges
  (i.e., ones that cover the entire input and which have a mother corresponding to the root category)
Notice that this means that the grammar rules are indexed by their rightmost category, and that the edges in the chart
must be indexed by their to vertex (because we scan backward from the rightmost category). Consider:
The following diagram shows the chart edges as they are constructed in order (when there is a choice, taking rules in
a priority order according to the order they appear in the grammar):
The spanning edges are 11 and 8: the output routine to give bracketed parses simply outputs a left bracket, outputs
the category, recurses through each of the daughters and then outputs a right bracket. So, for instance, the output from
edge 11 is:
This chart parsing algorithm is complete: it returns all possible analyses, except in the case where it does not terminate
because there is a recursively applicable rule.
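The passive chart parser can be sketched in Python as follows. This is my own minimal reading of the description above, not a transcription of the original pseudo-code: rules are matched on their rightmost category and the chart is scanned backwards for the remaining daughters. Edge ids here are zero-based list positions, so they are one lower than the ids used in the text, and the function names are hypothetical.

```python
GRAMMAR = [
    ("S", ("NP", "VP")),
    ("VP", ("VP", "PP")),
    ("VP", ("V",)),
    ("VP", ("V", "NP")),
    ("VP", ("V", "VP")),
    ("NP", ("NP", "PP")),
    ("PP", ("P", "NP")),
]
LEXICON = {
    "can": ["V"], "fish": ["V", "NP"], "rivers": ["NP"], "pools": ["NP"],
    "December": ["NP"], "Scotland": ["NP"], "it": ["NP"], "they": ["NP"],
    "in": ["P"],
}


def parse(words, root="S"):
    chart = []  # each edge is (left, right, mother, daughters)

    def match_leftwards(categories, right):
        """Yield tuples of edge ids covering `categories` and ending at `right`."""
        if not categories:
            yield ()
            return
        for edge_id, (left, end, mother, _) in enumerate(list(chart)):
            if end == right and mother == categories[-1]:
                for ids in match_leftwards(categories[:-1], left):
                    yield ids + (edge_id,)

    def add_edge(left, right, mother, daughters):
        chart.append((left, right, mother, daughters))
        edge_id = len(chart) - 1
        # Try every rule whose rightmost daughter matches the new edge.
        for lhs, rhs in GRAMMAR:
            if rhs[-1] == mother:
                for ids in list(match_leftwards(rhs[:-1], left)):
                    new_left = chart[ids[0]][0] if ids else left
                    add_edge(new_left, right, lhs, ids + (edge_id,))

    def bracket(edge_id):
        """Labelled bracketing: daughters are words (str) or edge ids (int)."""
        _, _, mother, daughters = chart[edge_id]
        parts = [d if isinstance(d, str) else bracket(d) for d in daughters]
        return "(" + mother + " " + " ".join(parts) + ")"

    for i, word in enumerate(words):
        for category in LEXICON.get(word, []):
            add_edge(i, i + 1, category, (word,))
    return [
        bracket(i)
        for i, (left, right, mother, _) in enumerate(chart)
        if left == 0 and right == len(words) and mother == root
    ]


for result in parse("they can fish".split()):
    print(result)
```

For they can fish this builds eleven edges and prints the two bracketings given earlier; like the algorithm in the text, it does no packing.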
VP -> V NP
PP -> P NP
word = can
categories = V
Add new edge 1, 2, V, (can)
[Chart diagram over they can fish: edges 1 and 2]
Matching grammar rules are:
VP -> V
[Chart diagram over they can fish: edges 1 to 3]
S -> NP VP
VP -> V VP
word = fish
categories = V, NP
Add new edge 2, 3, V, (fish)
[Chart diagram over they can fish: edges 1 to 5]
Matching grammar rules are:
VP -> V
[Chart diagram over they can fish: edges 1 to 6]
S -> NP VP
VP -> V VP
No edges match NP
set of edge lists for V VP = {(2, 6)}
Add new edge 1, 3, VP, (2, 6)
[Chart diagram over they can fish: edges 1 to 7]
Matching grammar rules are:
S -> NP VP
VP -> V VP
set of edge lists for NP VP = {(1, 7)}
Add new edge 0, 3, S, (1, 7)
[Chart diagram over they can fish: edges 1 to 8]
No matching grammar rules for S
No edges matching V
Add new edge 2, 3, NP, (fish)
[Chart diagram over they can fish: edges 1 to 9]
Matching grammar rules are:
VP -> V NP
PP -> P NP
set of edge lists corresponding to V NP = {(2, 9)}
Add new edge 1, 3, VP, (2, 9)
[Chart diagram over they can fish: edges 1 to 10]
Matching grammar rules are:
S -> NP VP
VP -> V VP
[Chart diagram over they can fish: edges 1 to 11]
No matching grammar rules for S
No edges corresponding to V VP
No edges corresponding to P NP
No further words in input
Spanning edges are 8 and 11: Output results for 8
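The trace can be reproduced with a short program. What follows is a minimal sketch in Python rather than the pseudo-code of these notes: the rule list contains only the rules that appear in the trace, in an order I have assumed, and the names Edge, Chart, add_edge, sequences and bracket are made up for illustration.

```python
from dataclasses import dataclass

# Rules as (mother, daughters). Only the rules visible in the trace are listed;
# their order here is an assumption standing in for the grammar's priority order.
RULES = [("S", ("NP", "VP")), ("VP", ("V",)), ("VP", ("V", "NP")),
         ("VP", ("V", "VP")), ("PP", ("P", "NP"))]
LEXICON = {"they": ["NP"], "can": ["V"], "fish": ["V", "NP"]}

@dataclass
class Edge:
    id: int
    left: int
    right: int
    cat: str
    daughters: tuple  # edge ids, or a single word for a lexical edge

class Chart:
    def __init__(self):
        self.edges = []

    def sequences(self, cats, end):
        """Contiguous edge sequences matching cats and ending at vertex end."""
        if not cats:
            return [(end, ())]
        found = []
        for e in list(self.edges):
            if e.right == end and e.cat == cats[-1]:
                for start, ids in self.sequences(cats[:-1], e.left):
                    found.append((start, ids + (e.id,)))
        return found

    def add_edge(self, left, right, cat, daughters):
        edge = Edge(len(self.edges) + 1, left, right, cat, daughters)
        self.edges.append(edge)
        # Rules are indexed by their rightmost daughter; scan backward from it.
        for mother, dtrs in RULES:
            if dtrs[-1] != cat:
                continue
            for start, ids in self.sequences(dtrs[:-1], left):
                self.add_edge(start, right, mother, ids + (edge.id,))

    def bracket(self, edge_id):
        """Left bracket, category, each daughter in turn, right bracket."""
        e = self.edges[edge_id - 1]
        inner = " ".join(d if isinstance(d, str) else self.bracket(d)
                         for d in e.daughters)
        return f"({e.cat} {inner})"

def parse(words, root="S"):
    chart = Chart()
    for i, word in enumerate(words):
        for cat in LEXICON[word]:
            chart.add_edge(i, i + 1, cat, (word,))
    spanning = [e.id for e in chart.edges
                if e.left == 0 and e.right == len(words) and e.cat == root]
    return chart, spanning

chart, spanning = parse("they can fish".split())
for edge_id in spanning:
    print(edge_id, chart.bracket(edge_id))
```

Because edges are numbered in the order they are added, and each new edge is recursed on before the next rule is tried, the sketch builds the same eleven edges as the trace and finds spanning edges 8 and 11.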
4.8 Packing
The algorithm given above is exponential in the case where there are an exponential number of parses. The body
of the algorithm can be modified so that it runs in cubic time, though producing the output is still exponential. The
modification is simply to change the daughters value on an edge to be a set of lists of daughters and to make an equality
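check before adding an edge so we don’t add one that’s equivalent to an existing one. That is, if we are about to add
an edge and the chart already contains an edge with the same left vertex, right vertex and mother category, we simply
add the new daughters to the set of daughters on the existing edge.
There is no need to recurse with this edge, because we couldn’t get any new results: once we’ve found we can pack an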
edge, we always stop that part of the search. Thus packing saves computation and in fact leads to cubic time operation,
though I won’t go through the proof of this.
For the example above, everything proceeds as before up to edge 9:
id  left  right  mother  daughters
1   0     1      NP      {(they)}
2   1     2      V       {(can)}
3   1     2      VP      {(2)}
4   0     2      S       {(1 3)}
5   2     3      V       {(fish)}
6   2     3      VP      {(5)}
7   1     3      VP      {(2 6)}
8   0     3      S       {(1 7)}
9   2     3      NP      {(fish)}
However, rather than add edge 10, which would be:
10 1 3 VP (2 9)
we match this with edge 7, and simply add the new daughters to that.
7 1 3 VP {(2 6), (2 9)}
The algorithm then terminates. We only have one spanning edge (edge 8) but the display routine is more complex
because we have to consider the alternative sets of daughters for edge 7. (You should go through this to convince
yourself that the same results are obtained as before.) Although in this case, the amount of processing saved is small,
the effects are much more important with longer sentences (consider he believes they can fish, for instance).
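The packing modification can be sketched in a few lines of Python. This is an illustration under my own assumptions, not the implementation behind these notes: the rule list is restricted to the rules used in the they can fish example, and parse_packed and trees are hypothetical names. Each edge stores a list of alternative daughter tuples, and an edge with the same span and mother category as an existing edge is packed into it without further recursion.

```python
from itertools import product

RULES = [("S", ("NP", "VP")), ("VP", ("V",)), ("VP", ("V", "NP")),
         ("VP", ("V", "VP")), ("PP", ("P", "NP"))]
LEXICON = {"they": ["NP"], "can": ["V"], "fish": ["V", "NP"]}

def parse_packed(words):
    edges = []  # each edge is [left, right, category, list of daughter tuples]

    def sequences(cats, end):
        """Contiguous edge-id sequences matching cats and ending at vertex end."""
        if not cats:
            return [(end, ())]
        return [(start, ids + (i,))
                for i, (left, right, cat, _) in enumerate(edges, 1)
                if right == end and cat == cats[-1]
                for start, ids in sequences(cats[:-1], left)]

    def add_edge(left, right, cat, daughters):
        for old in edges:
            if old[:3] == [left, right, cat]:
                old[3].append(daughters)  # pack: record the daughters, do not recurse
                return
        edges.append([left, right, cat, [daughters]])
        new_id = len(edges)
        for mother, dtrs in RULES:
            if dtrs[-1] == cat:
                for start, ids in sequences(dtrs[:-1], left):
                    add_edge(start, right, mother, ids + (new_id,))

    for i, word in enumerate(words):
        for cat in LEXICON[word]:
            add_edge(i, i + 1, cat, (word,))
    return edges

def trees(edges, edge_id):
    """Unpack every bracketed tree encoded by a (possibly packed) edge."""
    _, _, cat, alternatives = edges[edge_id - 1]
    out = []
    for daughters in alternatives:
        parts = [[d] if isinstance(d, str) else trees(edges, d) for d in daughters]
        out += [f"({cat} {' '.join(combo)})" for combo in product(*parts)]
    return out

edges = parse_packed("they can fish".split())
print(len(edges))       # number of edges in the packed chart
print(edges[6])         # edge 7, with both sets of daughters
print(trees(edges, 8))  # both parses recovered from the single spanning edge
```

Unpacking is where the exponential cost reappears: trees multiplies out the alternatives on every edge it visits.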
4.10 Ordering the search space
In the pseudo-code above, the order of addition of edges to the chart was determined by the recursion. In general,
chart parsers make use of an agenda of edges, so that the next edges to be operated on are the ones that are first on the
agenda. Different parsing algorithms can be implemented by making this agenda a stack or a queue, for instance.
So far, we’ve considered bottom up parsing: an alternative is top down parsing, where the initial edges are given by
the rules whose mother corresponds to the start symbol.
Some efficiency improvements can be obtained by ordering the search space appropriately, though which version is
most efficient depends on properties of the individual grammar. However, the most important reason to use an explicit
agenda is when we are returning parses in some sort of priority order, corresponding to weights on different grammar
rules or lexical entries.
Weights can be manually assigned to rules and lexical entries in a manually constructed grammar. However, since the
beginning of the 1990s, a lot of work has been done on automatically acquiring probabilities from a corpus annotated
with syntactic trees (a treebank), either as part of a general process of automatic grammar acquisition, or as
automatically acquired additions to a manually constructed grammar. Probabilistic CFGs (PCFGs) can be defined quite
straightforwardly, if the assumption is made that the probabilities of rules and lexical entries are independent of one
another (of course this assumption is not correct, but the orderings given seem to work quite well in practice). The
importance of this is that we rarely want to return all parses in a real application, but instead we want to return those
which are top-ranked: i.e., the most likely parses. This is especially true when we consider that realistic grammars
can easily return many tens of thousands of parses for sentences of quite moderate length (20 words or so). If edges
are prioritised by probability, very low priority edges can be completely excluded from consideration if there is a
cut-off such that we can be reasonably certain that no edges with a lower priority than the cut-off will contribute to the
highest-ranked parse. Limiting the number of analyses under consideration is known as beam search (the analogy is
that we’re looking within a beam of light, corresponding to the highest probability edges). Beam search is linear rather
than exponential or cubic. Just as importantly, a good priority ordering from a parser reduces the amount of work that
has to be done to filter the results by whatever system is processing the parser’s output.
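To make the independence assumption concrete, the following Python sketch scores a parse tree as the product of the probabilities of the rules and lexical entries it uses, and ranks the two parses of they can fish. The probabilities are invented purely for illustration (they are not estimated from any treebank), and PROB and log_prob are hypothetical names.

```python
import math

# Made-up probabilities for illustration only; for each mother category the
# values listed here sum to 1.
PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("V",)): 0.3,
    ("VP", ("V", "NP")): 0.5,
    ("VP", ("V", "VP")): 0.2,
    ("NP", ("they",)): 0.6,
    ("NP", ("fish",)): 0.4,
    ("V", ("can",)): 0.7,
    ("V", ("fish",)): 0.3,
}

def log_prob(tree):
    """Sum of log probabilities of the rules in a tree (category, child, ...)."""
    cat, *children = tree
    labels = tuple(c if isinstance(c, str) else c[0] for c in children)
    total = math.log(PROB[(cat, labels)])
    for child in children:
        if not isinstance(child, str):
            total += log_prob(child)
    return total

transitive = ("S", ("NP", "they"), ("VP", ("V", "can"), ("NP", "fish")))
auxiliary = ("S", ("NP", "they"), ("VP", ("V", "can"), ("VP", ("V", "fish"))))

# Highest-probability parse first.
ranked = sorted([transitive, auxiliary], key=log_prob, reverse=True)
for tree in ranked:
    print(round(math.exp(log_prob(tree)), 5), tree)
```

A parser with an agenda would apply the same scores to edges as they are built, so that low-scoring edges can be dropped before they are expanded; the sketch only ranks complete trees.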
4.11 Why can’t we use FSAs to model the syntax of natural languages?
In this lecture, we started using CFGs. This raises the question of why we need this more expressive (and hence
computationally expensive) formalism, rather than modelling syntax with FSAs. One reason is that the syntax of
natural languages cannot be described by an FSA, even in principle, due to the presence of centre-embedding, i.e.
structures which map to:
A → αAβ
has a centre-embedded structure. However, humans have difficulty processing more than two levels of embedding:
If the recursion is finite (no matter how deep), then the strings of the language can be generated by an FSA. So it’s not
entirely clear whether formally an FSA might not suffice.
There’s a fairly extensive discussion of these issues in J&M, but there are two essential points for our purposes:
1. Grammars written using finite state techniques alone are very highly redundant, which makes them very difficult
to build and maintain.
2. Without internal structure, we can’t build up good semantic representations.
Hence the use of more powerful formalisms: in the next section, I’ll discuss the inadequacies of simple CFGs from a
similar perspective.
However, FSAs are very useful for partial grammars which don’t require full recursion. In particular, for information
extraction, we need to recognise named entities: e.g. Professor Smith, IBM, 101 Dalmatians, the White House, the
Alps and so on. Although NPs are in general recursive (the man who likes the dog which bites postmen), relative
clauses are not generally part of named entities. Also the internal structure of the names is unimportant for IE. Hence
FSAs can be used, with sequences such as ‘title surname’, ‘DT0 PNP’ etc.
CFGs can be automatically compiled into approximately equivalent FSAs by putting bounds on the recursion. This is
particularly important in speech recognition engines.
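As a small illustration of bounding the recursion, the sketch below takes a toy centre-embedding grammar, A → a A b | a b (my own example, not one from these notes), and expands it to a fixed depth. The result is an ordinary regular expression with no recursion left in it, so it describes a regular language; bounded_pattern is a hypothetical name.

```python
import re

def bounded_pattern(depth):
    """Regular expression for A -> a A b | a b with at most `depth` levels."""
    pattern = "ab"
    for _ in range(depth - 1):
        # Wrap one more optional level of embedding around the pattern so far.
        pattern = f"a(?:{pattern})?b"
    return re.compile(pattern)

two_levels = bounded_pattern(2)
print(bool(two_levels.fullmatch("aabb")))    # within the bound
print(bool(two_levels.fullmatch("aaabbb")))  # three levels: beyond the bound
```

The approximation accepts exactly the strings of the original grammar up to the chosen depth and rejects deeper embeddings, which is the sense in which the compiled FSA is only approximately equivalent.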
The subject and the verb must (usually) either both have singular morphology or both have plural morphology: i.e., they must agree. There was also
no account of case: this is only reflected in a few places in modern English, but *they can they is clearly ungrammatical (as opposed to they can
them, which is grammatical with the transitive verb use of can).
Notice that, in the third example, the verb were shows plural agreement.
Doing this in standard CFGs is possible, but extremely verbose, potentially leading to trillions of rules. Instead of
having simple atomic categories in the CFG, we want to allow for features on the categories, which can have values
indicating things like plurality. As the long-distance dependency examples should indicate, the features need to be
complex-valued. For instance,
* what kid did you say were making all that noise?
is not grammatical. The analysis needs to be able to represent the information that the gap corresponds to a plural
noun phrase.
In the next lecture, I will illustrate a simple constraint-based grammar formalism, using feature structures which
allows us to encode these phenomena.
5 Lecture 5: Parsing with constraint-based grammars
As discussed at the end of the last lecture, the simple CFG approach which we’ve looked at so far has some serious
deficiencies as a model of natural language. In this lecture, I’ll give an introduction to a more expressive formalism
which is widely used in NLP, again with the help of a sample grammar. As I outlined in the last lecture, instead of
simple atomic categories in the CFG, we will allow for features on the categories, which can have values indicating
things like plurality. To allow for long-distance dependency examples, the features need to be complex-valued. In
lecture 6, I will go on to sketch how we can use this approach to do compositional semantics.
The formalism we will look at is a type of constraint-based grammar: a feature structure grammar. A constraint-based
grammar describes a language using a set of independently stated constraints, without imposing any conditions on
processing or processing order. A CFG can be taken as an example of a constraint-based grammar, but usually the
term is reserved for richer formalisms. The simplest way to think of feature structures (FSs) is that we’re replacing the
atomic categories of a CFG with more complex data structures. I’ll first illustrate this idea intuitively, using a grammar
fragment like the one in lecture 4 but enforcing agreement. I’ll then go through the feature structure formalism in more
detail. This is followed by an example of a more complex grammar, which allows for subcategorization (I won’t show
how case and long-distance dependencies are dealt with).
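[Graph: a root node with an arc labelled CAT to a node with value NP and an arc labelled AGR to a node with value sg]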
In the graph below, the feature HEAD is complex-valued, and the value of AGR (i.e., the value of the path HEAD AGR)
is unspecified:
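[Graph: a root node with an arc labelled HEAD to a node which has an arc labelled CAT to a node with value NP and an arc labelled AGR to a node with no value]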
FSs are usually drawn as attribute-value matrices or AVMs. The AVMs corresponding to the two FSs above are as
follows:
[ CAT NP
  AGR sg ]

[ HEAD [ CAT NP
         AGR [ ] ] ]
Since FSs are graphs, rather than trees, a particular node may be accessed from the root by more than one path: this is
known as reentrancy. In AVMs, reentrancy is conventionally indicated by boxed integers, with node identity indicated
by integer identity. The actual integers used are arbitrary. This is illustrated with an abstract example using features F
and G below:
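               Graph                                          AVM
Non-reentrant  [root with an arc F to a node a and            [ F a
               an arc G to a different node a]                  G a ]
Reentrant      [root with arcs F and G leading to the         [ F [0] a
               same node a]                                     G [0] ]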
The FS equivalent shown below replaces the atomic categories with FSs, splitting up the categories so that the main
category and the agreement values are distinct. In the grammar below, I have used the arrow notation for rules as an
abbreviation: I will describe the actual FS encoding of rules shortly. The FS grammar just needs two rules. There is a
single rule corresponding to the S -> NP VP rule, which enforces identity of agreement values between the NP and
the VP by means of reentrancy (indicated by the tag [1]). The rule corresponding to VP -> V NP simply makes the
agreement values of the V and the VP the same but ignores the agreement value on the NP.19 The lexicon specifies
agreement values for it, they, like and likes, but leaves the agreement value for fish uninstantiated (i.e., underspecified).
Note that the grammar also has a root FS: a structure only counts as a valid parse if it is unifiable with the root.
Grammar rules
Subject-verb rule
[ CAT S       →  [ CAT NP    ,  [ CAT VP
  AGR [1] ]        AGR [1] ]      AGR [1] ]
Verb-object rule
[ CAT VP      →  [ CAT V     ,  [ CAT NP
  AGR [1] ]        AGR [1] ]      AGR [ ] ]
Lexicon:
;;; noun phrases
they   [ CAT NP
         AGR pl ]
fish   [ CAT NP
         AGR [ ] ]
it     [ CAT NP
         AGR sg ]
;;; verbs
like   [ CAT V
         AGR pl ]
likes  [ CAT V
         AGR sg ]
Root structure:
[ CAT S ]
19 Note that the reentrancy indicators are local to each rule: the [1] in the subject-verb rule is not the same structure as the [1] in the verb-object
rule.
Consider parsing they like it with this grammar. The lexical structures for like and it are unified with the corresponding
structure to the right hand side of the verb-object rule. Both unifications succeed, and the structure corresponding to
the mother of the rule is:
[ CAT VP
  AGR pl ]
The agreement value is pl because of the reentrancy with the agreement value of like. This structure can unify with
the rightmost daughter of the subject-verb rule. The structure for they is unified with the leftmost daughter. The
subject-verb rule says that both daughters have to have the same agreement value, which is true in this example. Rule
application therefore succeeds and since the result unifies with the root structure, there is a valid parse.
To see what is going on a bit more precisely, we need to show the rules as FSs. There are several ways of encoding this,
but for current purposes I will assume that rules have features MOTHER, DTR 1, DTR 2 . . . DTRN. So the verb-object
rule, which I informally wrote as:
[ CAT VP      →  [ CAT V     ,  [ CAT NP
  AGR [1] ]        AGR [1] ]      AGR [ ] ]
is actually:
[ MOTHER [ CAT VP
           AGR [1] ]
  DTR 1  [ CAT V
           AGR [1] ]
  DTR 2  [ CAT NP
           AGR [ ] ] ]
Thus the rules in the CFG correspond to FSs in this formalism and we can formalise rule application by unification.
For instance, a rule application in bottom-up parsing involves unifying each of the DTR slots in the rule with the
feature structures for the phrases already in the chart.
Consider parsing they like it again.
STEP1: parsing like it with the rule above.
Step 1a
The structure for like can be unified with the value of DTR 1 in the rule.
[ MOTHER [ CAT VP
           AGR [1] ]
  DTR 1  [ CAT V
           AGR [1] ]
  DTR 2  [ CAT NP
           AGR [ ] ] ]
⊓
[ DTR 1  [ CAT V
           AGR pl ] ]
Unification means all information is retained, so the result includes the agreement value from like:
[ MOTHER [ CAT VP
           AGR [1] pl ]
  DTR 1  [ CAT V
           AGR [1] ]
  DTR 2  [ CAT NP
           AGR [ ] ] ]
Step 1b
The structure for it is unified with the value for DTR 2 in the result of Step 1a:
[ MOTHER [ CAT VP
           AGR [1] pl ]
  DTR 1  [ CAT V
           AGR [1] ]
  DTR 2  [ CAT NP
           AGR [ ] ] ]
⊓
[ DTR 2  [ CAT NP
           AGR sg ] ]
=
[ MOTHER [ CAT VP
           AGR [1] pl ]
  DTR 1  [ CAT V
           AGR [1] ]
  DTR 2  [ CAT NP
           AGR sg ] ]
This gives:
[ MOTHER [ CAT S
           AGR [1] pl ]
  DTR 1  [ CAT NP
           AGR [1] ]
  DTR 2  [ CAT VP
           AGR [1] ] ]
Step 2b
The FS for they is:
[ CAT NP
  AGR pl ]
The unification of this with the value of DTR 1 from Step 2a succeeds but adds no new information:
[ MOTHER [ CAT S
           AGR [1] pl ]
  DTR 1  [ CAT NP
           AGR [1] ]
  DTR 2  [ CAT VP
           AGR [1] ] ]
Step 3:
Finally, the MOTHER of this structure unifies with the root structure, so this is a valid parse.
Note however, that if we had tried to parse it like it, a unification failure would have occurred at Step 2b, since the AGR
on the lexical entry for it has the value sg which clashes with the value pl.
I have described these unifications as occurring in a particular order, but it is very important to note that order is not
significant and that the same overall result would have been obtained if another order had been used. This means that
different parsing algorithms are guaranteed to give the same result. The one proviso is that with some FS grammars,
just like CFGs, some algorithms may terminate while others do not.
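The worked example can be replayed with a small graph-based unifier. This is only a sketch in Python under my own assumptions, not the standard algorithm referred to later in these notes: Node, deref, unify, get and parse_svo are hypothetical names, the features are written DTR1 and DTR2 because Python keyword arguments cannot contain spaces, and the unifier is destructive and does not undo its changes when it fails.

```python
class Node:
    """A feature structure node: an atomic value, or arcs to other nodes."""
    def __init__(self, value=None, **arcs):
        self.value = value
        self.arcs = dict(arcs)
        self.forward = None  # set once this node has been merged into another

def deref(node):
    while node.forward is not None:
        node = node.forward
    return node

def unify(a, b):
    """Destructively unify two nodes; return False on conflicting information."""
    a, b = deref(a), deref(b)
    if a is b:
        return True
    if a.value is not None and b.value is not None and a.value != b.value:
        return False
    if (a.value is not None and b.arcs) or (b.value is not None and a.arcs):
        return False
    b.forward = a  # from now on b is the same node as a (reentrancy is preserved)
    if a.value is None:
        a.value = b.value
    for feature, child in b.arcs.items():
        if feature in a.arcs:
            if not unify(a.arcs[feature], child):
                return False
        else:
            a.arcs[feature] = child
    return True

def get(node, *path):
    for feature in path:
        node = deref(node).arcs[feature]
    return deref(node)

def verb_object_rule():
    agr = Node()  # the tag [1]: one node shared by MOTHER and DTR1
    return Node(MOTHER=Node(CAT=Node("VP"), AGR=agr),
                DTR1=Node(CAT=Node("V"), AGR=agr),
                DTR2=Node(CAT=Node("NP"), AGR=Node()))

def subject_verb_rule():
    agr = Node()  # shared by MOTHER, DTR1 and DTR2
    return Node(MOTHER=Node(CAT=Node("S"), AGR=agr),
                DTR1=Node(CAT=Node("NP"), AGR=agr),
                DTR2=Node(CAT=Node("VP"), AGR=agr))

def np(agr):
    return Node(CAT=Node("NP"), AGR=Node(agr))

def verb(agr):
    return Node(CAT=Node("V"), AGR=Node(agr))

def parse_svo(subject, v, obj):
    """Steps 1a, 1b, 2a, 2b and 3 of the worked example, in that order."""
    vo, sv = verb_object_rule(), subject_verb_rule()
    return (unify(get(vo, "DTR1"), v)
            and unify(get(vo, "DTR2"), obj)
            and unify(get(sv, "DTR2"), get(vo, "MOTHER"))
            and unify(get(sv, "DTR1"), subject)
            and unify(get(sv, "MOTHER"), Node(CAT=Node("S"))))

print(parse_svo(np("pl"), verb("pl"), np("sg")))  # they like it: succeeds
print(parse_svo(np("sg"), verb("pl"), np("sg")))  # it like it: AGR clash at Step 2b
```

Reentrancy is represented by using one Node object in two places, so instantiating AGR on DTR1 is immediately visible on MOTHER.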
Connectedness and unique root A FS must have a unique root node: apart from the root node, all nodes have one or
more parent nodes.
Unique features Any node may have zero or more arcs leading out of it, but the label on each (that is, the feature)
must be unique.
No cycles No node may have an arc that points back to the root node or to a node that intervenes between it and the
root node. (Although some variants of FS formalisms allow cycles.)
Values A node which does not have any arcs leading out of it may have an associated atomic value.
Finiteness An FS must have a finite number of nodes.
Path values For every path P in FS1 there is a path P in FS2. If P has an atomic value t in FS1, then P also has value
t in FS2.
Path equivalences Every pair of paths P and Q which are reentrant in FS1 (i.e., which lead to the same node in the
graph) are also reentrant in FS2.
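Reading the two conditions above as the test for FS1 subsuming FS2, they can be checked directly if an FS is stored as a table of paths. The Python sketch below assumes a representation of my own choosing: a pair of a dictionary from paths to atomic values (None where the path has no atomic value) and a set of reentrant path pairs. The name subsumes is hypothetical, and the reentrancy sets are assumed to list every reentrant pair.

```python
def subsumes(fs1, fs2):
    """True if fs1 subsumes fs2, i.e. fs2 contains all the information in fs1."""
    paths1, reentrant1 = fs1
    paths2, reentrant2 = fs2
    for path, value in paths1.items():
        if path not in paths2:
            return False  # path values: every path in FS1 must be in FS2
        if value is not None and paths2[path] != value:
            return False  # ... and atomic values must be the same
    return reentrant1 <= reentrant2  # path equivalences

np_unspecified = ({("CAT",): "NP", ("AGR",): None}, set())
np_singular = ({("CAT",): "NP", ("AGR",): "sg"}, set())

non_reentrant = ({("F",): "a", ("G",): "a"}, set())
reentrant = ({("F",): "a", ("G",): "a"}, {frozenset({("F",), ("G",)})})

print(subsumes(np_unspecified, np_singular))  # the less specific FS subsumes
print(subsumes(np_singular, np_unspecified))
print(subsumes(non_reentrant, reentrant))     # reentrancy is extra information
print(subsumes(reentrant, non_reentrant))
```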
Unification corresponds to conjunction of information, and thus can be defined in terms of subsumption, which is a
relation of information containment. The unification of two FSs is defined to be the most general FS which contains
all the information in both of the FSs. Unification will fail if the two FSs contain conflicting information. As we saw
with the simple grammar above, this prevented it like it getting an analysis, because the AGR values conflicted.
Properties of unification The unification of two FSs, FS1 and FS2, is the most general FS which is subsumed by both
FS1 and FS2, if it exists.
the distinction between the lexical sign and the phrase (i.e., N vs NP and V vs VP). In the grammar below, the CAT
feature just encodes the major category (noun vs verb) and the phrasal distinction is encoded in terms of whether the
subcategorization requirements have been satisfied. The CAT and AGR features are now inside another feature, HEAD.
Signs have three features at the top-level: HEAD, OBJ and SUBJ.
Verb-object rule
[ HEAD [1]          [ HEAD [1]
  OBJ  filled   →     OBJ  [2]    ,  [2] [ OBJ filled ]
  SUBJ [3] ]          SUBJ [3] ]
Lexicon:
;;; noun phrases
they   [ HEAD [ CAT noun
                AGR pl ]
         OBJ  filled
         SUBJ filled ]
fish   [ HEAD [ CAT noun
                AGR [ ] ]
         OBJ  filled
         SUBJ filled ]
it     [ HEAD [ CAT noun
                AGR sg ]
         OBJ  filled
         SUBJ filled ]
;;; verbs
fish   [ HEAD [ CAT verb
                AGR pl ]
         OBJ  filled
         SUBJ [ HEAD [ CAT noun ] ] ]
can    [ HEAD [ CAT verb                  ;;; auxiliary verb
                AGR [ ] ]
         OBJ  [ HEAD [ CAT verb ] ]
         SUBJ [ HEAD [ CAT noun ] ] ]
can    [ HEAD [ CAT verb                  ;;; transitive verb
                AGR pl ]
         OBJ  [ HEAD [ CAT noun ]
                OBJ  filled ]
         SUBJ [ HEAD [ CAT noun ] ] ]
Root structure:
[ HEAD [ CAT verb ]
  OBJ  filled
  SUBJ filled ]
Briefly, HEAD contains information which is shared between the lexical entries and phrases of the same category:
e.g., nouns share this information with the noun phrase which dominates them in the tree, while verbs share head
information with verb phrases and sentences. So HEAD is used for agreement information and for category information
(i.e., noun, verb etc). In contrast, OBJ and SUBJ are about subcategorization: they contain information about what can
combine with this sign. For instance, an intransitive verb will have a SUBJ corresponding to its subject ‘slot’ and
a value of filled for its OBJ.20 You do not have to memorize the precise details of the feature structure architecture
described here for the exam (questions that assume knowledge of details will give an example). The point of giving
this more complicated grammar is that it starts to demonstrate the power of the feature structure framework, in a way
that the simple grammar using agreement does not.
The grammar has just two rules, one for combining a verb with its subject and another for combining a verb with its
object.
• The subject rule says that, when building the phrase, the SUBJ value of the second daughter is to be equated
(unified) with the whole structure of the first daughter (indicated by 2 ). The head of the mother is equated with
the head of the second daughter ( 1 ). The rule also stipulates that the AGR values of the two daughters have to
be unified and that the subject has to have a filled object slot.
• The verb-object rule says that, when building the phrase, the OBJ value of the first daughter is to be equated
(unified) with the whole structure of the second daughter (indicated by 2 ). The head of the mother is equated
with the head of the first daughter ( 1 ). The SUBJ of the mother is also equated with the SUBJ of the first
daughter ( 3 ): this ensures that any information about the subject that was specified on the lexical entry for the
verb is preserved. The OBJ value of the mother is stipulated as being filled: this means the mother can’t act as
the first daughter in another application of the rule, since filled won’t unify with a complex feature structure.
This is what we want in order to prevent an ordinary transitive verb taking two objects.
These rules are controlled by the lexical entries in the sense that it’s the lexical entries which determine the required
subject and object of a word.
As an example, consider analysing they fish. The verb entry for fish can be unified with the second daughter position
of the subject-verb rule, giving the following partially instantiated rule:
[ HEAD [1]
  OBJ  filled
  SUBJ filled ]
→
[2] [ HEAD [ CAT noun
             AGR [3] pl ]
      OBJ  filled
      SUBJ filled ]
,
[ HEAD [1] [ CAT verb
             AGR [3] ]
  OBJ  filled
  SUBJ [2] ]
The first daughter of this result can be unified with the structure for they, which in this case returns the same structure,
since it adds no new information. The result can be unified with the root structure, so this is a valid parse.
On the other hand, the lexical entry for the noun fish does not unify with the second daughter position of the subject-
verb rule. The entry for they does not unify with the first daughter position of the verb-object rule. Hence there is no
other parse.
The rules in this grammar are binary: i.e., they have exactly two daughters. The formalism allows for unary rules (one
daughter) and also for ternary rules (three daughters) quaternary rules and so on. Grammars can be defined using only
unary and binary rules which are weakly equivalent to grammars which use rules of higher arity: some approaches
avoid the use of rules with arity of more than 2.
speech as well as verbs: e.g., in Kim was happy to see her, happy subcategorises for the infinitival VP.
The need for copying is often discussed in terms of the destructive nature of the standard algorithm for unification
(which I won’t describe here), but this is perhaps a little misleading. Unification, however implemented, involves
sharing information between structures. Assume, for instance, that the FS representing the lexical entry of the noun
for fish is underspecified for number agreement. When we parse a sentence like:
the part of the FS in the result that corresponds to the original lexical entry will have its AGR value instantiated. This
means that the structure corresponding to a particular edge cannot be reused in another analysis, because it will contain
‘extra’ information. Consider, for instance, parsing:
is:
i.e., the fish (sg) is near the town. If we instantiate the AGR value in the FS for fish as sg while constructing this parse,
and then try to reuse that same FS for fish in the other parses, analysis will fail. Hence the need for copying, so we can
use a fresh structure each time. Copying is potentially extremely expensive, because realistic grammars involve FSs
with many hundreds of nodes.
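The problem is easy to reproduce. The Python sketch below uses a deliberately simplified destructive unifier over nested dictionaries (no reentrancy, with None standing for an unspecified value); unify_into and try_both are hypothetical names, and the two agreement requirements stand in for two analyses that need the noun fish to be singular and plural respectively.

```python
import copy

def unify_into(target, source):
    """Destructively add the information in source to target (nested dicts).
    None stands for an unspecified value. Returns False on a clash."""
    for feature, value in source.items():
        current = target.get(feature)
        if current is None:
            target[feature] = copy.deepcopy(value)
        elif value is None:
            continue
        elif isinstance(current, dict) and isinstance(value, dict):
            if not unify_into(current, value):
                return False
        elif current != value:
            return False
    return True

def try_both(entry, copying):
    """Use one lexical entry in a singular analysis and then in a plural one."""
    results = []
    for agr in ("sg", "pl"):
        fs = copy.deepcopy(entry) if copying else entry
        results.append(unify_into(fs, {"AGR": agr}))
    return results

print(try_both({"CAT": "noun", "AGR": None}, copying=False))  # [True, False]
print(try_both({"CAT": "noun", "AGR": None}, copying=True))   # [True, True]
```

Without copying, the sg left behind by the first analysis makes the second fail; with a fresh copy each time both succeed, at the cost of copying the whole structure.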
So, although unification is very near to linear in complexity, naive implementations of FS formalisms are very
inefficient. Furthermore, packing is not straightforward, because two structures are rarely identical in real grammars
(especially ones that encode semantics).
Reasonably efficient implementations of FS formalisms can nevertheless be developed. Copying can be greatly
reduced:
1. by doing an efficient pretest before unification, so that copies are only made when unification is likely to succeed
2. by sharing parts of FSs that aren’t changed
3. by taking advantage of locality principles in linguistic formalisms which limit the need to percolate information
through structures
Packing can also be implemented: the test to see if a new edge can be packed involves subsumption rather than equality.
As with CFGs, for real efficiency we need to control the search space so we only get the most likely analyses. Defining
probabilistic FS grammars in a way which is theoretically well-motivated is much more difficult than defining a PCFG.
Practically it seems to turn out that treating a FS grammar much as though it were a CFG works fairly well, but this is
an active research issue.
5.5 Templates
The lexicon outlined above has the potential to be very redundant. For instance, as well as the intransitive verb fish,
a full lexicon would have entries for sleep, snore and so on, which would be essentially identical. We avoid this
redundancy by associating names with particular feature structures and using those names in lexical entries. For
instance:
fish   INTRANS_VERB
sleep  INTRANS_VERB
snore  INTRANS_VERB
where the template is specified as:
INTRANS_VERB   [ HEAD [ CAT verb
                        AGR pl ]
                 OBJ  filled
                 SUBJ [ HEAD [ CAT noun ] ] ]
The lexical entry may have some specific information associated with it (e.g., semantic information, see next lecture)
which will be expressed as a FS: in this case, the template and the lexical feature structure are combined by unification.
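A template mechanism of this kind can be sketched as a table of named feature structures plus a lexicon that refers to the names. In the Python below, FSs are nested dictionaries without reentrancy, None stands for an unspecified value, the SEM values are hypothetical placeholders for entry-specific information, and unify and expand are names I have made up.

```python
def unify(fs1, fs2):
    """Non-destructive unification of nested dicts; None means unspecified.
    Returns the unified structure or raises ValueError on a clash."""
    if fs1 is None:
        return fs2
    if fs2 is None:
        return fs1
    if isinstance(fs1, dict) and isinstance(fs2, dict):
        return {f: unify(fs1.get(f), fs2.get(f)) for f in fs1.keys() | fs2.keys()}
    if fs1 == fs2:
        return fs1
    raise ValueError(f"cannot unify {fs1!r} with {fs2!r}")

TEMPLATES = {
    "INTRANS_VERB": {
        "HEAD": {"CAT": "verb", "AGR": "pl"},
        "OBJ": "filled",
        "SUBJ": {"HEAD": {"CAT": "noun"}},
    },
}

# Each entry names a template and may add information of its own.
LEXICON = {
    "fish": ("INTRANS_VERB", {"SEM": "fish"}),
    "sleep": ("INTRANS_VERB", {"SEM": "sleep"}),
    "snore": ("INTRANS_VERB", {}),
}

def expand(word):
    """Combine the named template with the entry-specific FS by unification."""
    template, specific = LEXICON[word]
    return unify(TEMPLATES[template], specific)

print(expand("sleep"))
```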
A stem for a noun is generally assumed to be uninstantiated for number (i.e., neutral between sg and pl). So the lexical
entry for the noun dog in our fragment would be the structure for the stem:
[ HEAD [ CAT noun
         AGR [ ] ]
  OBJ  filled
  SUBJ filled ]
One simple way of implementing inflectional morphology in FSs is simply to unify the contribution of the affix with
that of the stem. If we unify the FS corresponding to the stem for dog to the FS for PLURAL_NOUN, we get:
[ HEAD [ CAT noun
         AGR pl ]
  OBJ  filled
  SUBJ filled ]
This approach assumes that we also have a template SINGULAR_NOUN, where this is associated with a ‘null’ affix.
Notice how this is an implementation of the idea of a morphological paradigm, mentioned in §2.2.
In the case of an example such as feed incorrectly analysed as fee -ed, discussed in §2.5, the affix information will fail
to unify with the stem, ruling out that analysis.
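The same idea in code: a minimal Python sketch in which affix templates are unified with stems. FSs are nested dictionaries without reentrancy and None is an unspecified value. PLURAL_NOUN and SINGULAR_NOUN follow the text; the contents of ED_AFFIX (a past tense affix that requires a verbal stem) and the stem for fee are my own assumptions, and unify and inflect are hypothetical names.

```python
def unify(fs1, fs2):
    """Non-destructive unification of nested dicts; None means unspecified."""
    if fs1 is None:
        return fs2
    if fs2 is None:
        return fs1
    if isinstance(fs1, dict) and isinstance(fs2, dict):
        return {f: unify(fs1.get(f), fs2.get(f)) for f in fs1.keys() | fs2.keys()}
    if fs1 == fs2:
        return fs1
    raise ValueError(f"cannot unify {fs1!r} with {fs2!r}")

def inflect(stem, affix):
    """Unify the affix information with the stem; None if they are incompatible."""
    try:
        return unify(stem, affix)
    except ValueError:
        return None

NOUN_STEM = {"HEAD": {"CAT": "noun", "AGR": None}, "OBJ": "filled", "SUBJ": "filled"}
dog = NOUN_STEM
fee = NOUN_STEM  # assumed: fee is an ordinary noun stem

PLURAL_NOUN = {"HEAD": {"CAT": "noun", "AGR": "pl"}}
SINGULAR_NOUN = {"HEAD": {"CAT": "noun", "AGR": "sg"}}  # goes with the 'null' affix
ED_AFFIX = {"HEAD": {"CAT": "verb"}}  # assumed content for the -ed affix

print(inflect(dog, PLURAL_NOUN))  # the structure for dogs
print(inflect(fee, ED_AFFIX))     # None: fee -ed is ruled out
```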
Note that this simple approach is not, in general, adequate for derivational morphology. For instance, the affix -ize,
which combines with a noun to form a verb (e.g., lemmatize), cannot be represented simply by unification, because
it has to change a nominal form into a verbal one. This reflects the distinction between inflectional and derivational
morphology that we saw in §2.2: while inflectional morphology can be seen as simple addition of information,
derivational morphology converts feature structures into new structures.
6 Lecture 6: Compositional semantics
Compositional semantics is the study of how meaning is conveyed by the structure of a sentence (as opposed to lexical
semantics, which is primarily about word meaning, which we’ll discuss in lectures 7 and 8). In the previous two
lectures, we’ve looked at grammars primarily as a way to describe a language: i.e., to say which strings are part of the
language and which are not, or (equivalently) as devices that, in principle, could generate all the strings of a language.
However, what we usually want for language analysis is some idea of the meaning of a sentence. At its most basic,
this is ‘who did what to whom?’ but clearly there is much more information that is implied by the structure of most
sentences. The parse trees we saw in lecture 4 and the feature structures of lecture 5 have some of this information, but
it is implicit rather than explicit. In the simple examples covered by those grammars, syntax and semantics are very
closely related, but if we look at more complex examples, this is not the case.
Consider the following examples:
The meaning these two sentences convey is essentially the same (what differs is the emphasis) but the parse trees
are quite different.21 A possible logical representation would be chase′(k, r), assuming that k and r are constants
corresponding to Kitty and Rover and chase′ is the two place predicate corresponding to the verb chase.22 Note the
convention that a predicate corresponding to a lexeme is written using the stem of the lexeme followed by ′: chase′. A
logical meaning representation constructed for a sentence is called the logical form of the sentence. Here and in what
follows I am ignoring tense for simplicity although this is an important aspect of sentence meaning.
Another relatively straightforward example that shows a syntax/semantics mismatch is pleonastic pronouns: i.e.,
pronouns that do not refer to actual entities. See the examples below (with indicative logical forms).
(3) a It barked.
    b ∃x[bark′(x) ∧ PRON(x)]
(4) a It rained.
    b rain′
In it rains, the it does not refer to a real entity (it will not be resolved, see §9.8), so the semantics should not involve
any representation for the it.
More complex examples include verbs like seem: for instance Kim seems to sleep means much the same thing as it
seems that Kim sleeps (contrast this with the behaviour of believe). An indicative representation for both would be
seem′(sleep′(k)), but note that there is no straightforward way of representing this in a first order logic.
There are many more examples of this sort that make the syntax/semantics interface much more complex than it first
appears and demonstrate that we cannot simply read the compositional semantics off a parse tree or other syntactic
representation.
Grammars that produce explicit representations for compositional semantics are often referred to as deep grammars.
Deep grammars that do not overgenerate too much are said to be bidirectional. This means they can be used in the
realization step in a Natural Language Generation system to produce text from an input logical form (see lecture 10).
This generally requires somewhat different algorithms from parsing (although chart generation is a variant of parsing),
but this will not be discussed in this course.
In this lecture, I will start off by showing how simple logical representations can be produced from CFGs and also from
feature structure representations. I will outline how such structures can be used in inference and describe experiments
on robust textual entailment.
21 Very broadly speaking, the second sentence is more appropriate if the dog is the topic of the discourse: we’ll very briefly return to this in
system. Another option is: ∃x, y[chase′(x, y) ∧ Kitty′(x) ∧ Rover′(y)].
6.1 Compositional semantics using lambda calculus
The assumption behind compositional semantics is that the meaning of each whole phrase must relate to the meaning
of its parts. For instance, to supply a meaning for the phrase chased Rover, we need to combine a meaning for chased
with a meaning for Rover in some way.
To enforce a notion of compositionality, we can require that each syntactic rule in a grammar has a corresponding
semantic rule which shows how the meaning of the daughters is combined. In linguistics, this is usually done using
lambda calculus, following the work of Montague. The notion of lambda expression should be familiar from previous
courses (e.g., Computation Theory, Discrete Maths). Informally, lambda calculus gives us a logical notation to express
the argument requirements of predicates. For instance, we can represent the fact that a predicate like bark′ is ‘looking
for’ a single argument by:
λx[bark′(x)]
Syntactically, the lambda behaves like a quantifier in FOPC: the lambda variable x is said to be within the scope of
the lambda operator in the same way that a variable is syntactically within the scope of a quantifier. But lambda
expressions correspond to functions, not propositions. The lambda variable indicates a variable that will be bound by
function application. Applying a lambda expression to a term will yield a new term, with the lambda variable replaced
by the term. For instance, to build the semantics for the phrase Kitty barks we can apply the semantics for barks to the
semantics for Kitty:
λx[bark′(x)](k) = bark′(k)
Replacement of the lambda variable is known as lambda-conversion. If the lambda variable is repeated, both instances
are instantiated: λx[bark′(x) ∧ sleep′(x)] denotes the set of things that bark and sleep.
A partially instantiated transitive verb predicate has one uninstantiated variable, as does an intransitive verb: e.g.,
λx[chase′(x, r)], the set of things that chase Rover.
Lambdas can be nested: this lets us represent transitive verbs so that they apply to only one argument at once. For
instance: λx[λy[chase′(y, x)]] (which is often written λxλy[chase′(y, x)] where there can be no confusion).
Applying this to the constant r gives:
λx[λy[chase′(y, x)]](r) = λy[chase′(y, r)]
That is, applying the semantics of chase to the semantics of Rover gives us a lambda expression equivalent to the set
of things that chase Rover.
The following example illustrates that bracketing shows the order of application in the conventional way:
In other words, we work out the value of the bracketed expression first (the innermost bracketed expression if there is
more than one), and then apply the result, and so on until we’re finished.
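Lambda-conversion can be tried out directly in a language with first-class functions. In the Python sketch below, logical terms are built as strings so that the result of each application can be inspected; pred is a hypothetical helper, and the constants k and r are the ones used above. The lambdas are bound to names only to mirror the notation.

```python
def pred(name):
    """Build a predicate that returns a readable term such as 'bark(k)'."""
    return lambda *args: f"{name}({', '.join(args)})"

bark = pred("bark")
chase = pred("chase")

barks = lambda x: bark(x)                  # λx[bark′(x)]
chases = lambda x: lambda y: chase(y, x)   # λx[λy[chase′(y, x)]]

kitty_barks = barks("k")                   # λx[bark′(x)](k)
chases_rover = chases("r")                 # λy[chase′(y, r)], still a function
kitty_chases_rover = chases("r")("k")      # innermost application first

print(kitty_barks)
print(kitty_chases_rover)
```

The bracketing chases("r")("k") is evaluated inside out: the verb is applied to its object first, and the resulting function is then applied to the subject.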
A grammar fragment In this fragment, I’m using X′ to indicate the semantics of the constituent X (e.g. NP′ means
the semantics of the NP), where the semantics may correspond to a function: e.g., VP′(NP′) means the application of
the semantics of the VP to the semantics of the NP. The numbers are not really part of the CFG: they are just there
to identify different constituents.
S -> NP VP
    VP′(NP′)
VP -> Vditrans NP1 NP2
    (Vditrans′(NP1′))(NP2′)
VP -> Vtrans NP
    Vtrans′(NP′)
VP -> Vintrans
    Vintrans′
Vditrans -> gives
    λxλyλz[give′(z, y, x)]
Vtrans -> chases
    λxλy[chase′(y, x)]
Vintrans -> barks
    λz[bark′(z)]
Vintrans -> sleeps
    λw[sleep′(w)]
NP -> Kitty
    k
NP -> Lynx
    l
NP -> Rover
    r
The postlecture exercises ask you to work through some examples using this fragment.
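As a way of checking answers to such examples, the fragment can be sketched in a few lines of Python. The tuple encoding of formulae, the function names and the restriction to one-word NPs are all assumptions of this sketch; it hard-codes the rules rather than parsing with the CFG.

```python
# Lexical entries: word -> (category, semantics). Tuples stand in for
# logical formulae, e.g. ("chase", "k", "r") for chase′(k, r).
LEXICON = {
    "gives":  ("Vditrans", lambda x: lambda y: lambda z: ("give", z, y, x)),
    "chases": ("Vtrans",   lambda x: lambda y: ("chase", y, x)),
    "barks":  ("Vintrans", lambda z: ("bark", z)),
    "sleeps": ("Vintrans", lambda w: ("sleep", w)),
    "Kitty":  ("NP", "k"),
    "Lynx":   ("NP", "l"),
    "Rover":  ("NP", "r"),
}

def vp_semantics(words):
    """Semantics of a VP (a list of words), following the three VP rules."""
    category, verb = LEXICON[words[0]]
    args = [LEXICON[w][1] for w in words[1:]]
    if category == "Vintrans" and len(args) == 0:   # VP -> Vintrans
        return verb
    if category == "Vtrans" and len(args) == 1:     # VP -> Vtrans NP
        return verb(args[0])
    if category == "Vditrans" and len(args) == 2:   # VP -> Vditrans NP1 NP2
        return verb(args[0])(args[1])
    raise ValueError("no VP rule matches: " + " ".join(words))

def sentence_semantics(sentence):
    """S -> NP VP with semantics VP′(NP′); every NP here is a single word."""
    subject, *rest = sentence.split()
    category, np = LEXICON[subject]
    if category != "NP" or not rest:
        raise ValueError("expected NP followed by VP: " + sentence)
    return vp_semantics(rest)(np)

print(sentence_semantics("Kitty chases Rover"))      # ('chase', 'k', 'r')
print(sentence_semantics("Kitty gives Lynx Rover"))  # ('give', 'k', 'r', 'l')
```

Note how the ditransitive entry consumes its arguments in the order NP1, NP2, subject, which is why the variables appear reversed inside give′.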
This can be roughly glossed as: ‘there was a barking event, Rover did the barking, and the barking was loud’. The
events are said to be reified: literally ‘made into things’.
FOPC has the disadvantage that it forces quantifiers to be in a particular scopal relationship, and this information is
not (generally) overt in NL sentences.
is ambiguous between:
∀x[dog′(x) ⇒ ∃y[cat′(y) ∧ chase′(x, y)]]
and the less-likely, ‘one specific cat’ reading:
Many current NLP systems construct an underspecified representation which is neutral between these readings, if they
represent quantifier scope at all. There are several alternative formalisms for underspecification.
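The difference between the two scopings can be made concrete by evaluating both in a small model. In the sketch below the model (two dogs, two cats and the chase pairs) is invented, and the second formula, ∃y[cat′(y) ∧ ∀x[dog′(x) ⇒ chase′(x, y)]], is my rendering of the ‘one specific cat’ reading rather than something quoted from the notes.

```python
# A toy model (invented for illustration): each dog chases a different cat.
dogs = {"d1", "d2"}
cats = {"c1", "c2"}
chase = {("d1", "c1"), ("d2", "c2")}

# ∀x[dog′(x) ⇒ ∃y[cat′(y) ∧ chase′(x, y)]]
forall_exists = all(any((x, y) in chase for y in cats) for x in dogs)

# ∃y[cat′(y) ∧ ∀x[dog′(x) ⇒ chase′(x, y)]]  (assumed form of the second reading)
exists_forall = any(all((x, y) in chase for x in dogs) for y in cats)

print(forall_exists, exists_forall)   # True False
```

In this model the first reading holds and the second does not, so a representation that commits to one scoping commits to a truth value that the sentence itself leaves open.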
There is a range of natural language examples which FOPC cannot handle, however. FOPC only has two quantifiers
(roughly corresponding to English some and every), but it is possible to represent some other English quantifiers in
FOPC, though representing numbers, for instance, is cumbersome. However, the quantifier most cannot be represented
in a first order logic at all. The use of event variables does not help with the representation of adverbials such as perhaps
or maybe, and there are also problems with the representation of modal verbs, like can. In these cases, there are known
logics which can provide a suitable formalization (although the inference problem may not be tractable).
There are other cases where the correct formalization is quite unclear. Consider sentences involving bare plural
subjects, such as:
What quantifier could you use to capture the meaning of each of these examples? In some cases, researchers have ar-
gued that the correct interpretation must involve non-monotonicity or defaults: e.g., birds fly unless they are penguins,
ostriches, baby birds . . . But no fully satisfactory solution has been developed along these lines.
Another issue is that not all phrases have meanings which can be determined compositionally. For instance, puce
sphere is compositional: it refers to something which is both puce and a sphere. In contrast, red tape may be com-
positional (as in 11), but may also refer to bureaucracy (as in 12), in which case it is a non-compositional multiword
expression (MWE).
Clear cases of MWEs, like red tape, can be accounted for explicitly in the grammar (so red tape is treated as being
ambiguous between the compositional reading and the MWE), but the problem is that many phrases are somewhere in
between being an MWE and being fully compositional. Consider the phrase real pleasure (as in it was a real pleasure
to work with you): this isn’t non-compositional enough to be listed in most dictionaries, but is still a conventional
expression particularly appropriate in certain social contexts. It appears that something else is necessary besides
compositional semantics for such cases.
Finally, although there is an extensive linguistic literature on formal semantics which describes logical representations
for different constructions, the coverage is still very incomplete, even for well-studied languages such as English.
Furthermore, the analyses which are suggested are often incompatible with the goals of broad-coverage parsing, which
requires that ambiguity be avoided as much as possible. Overall, there are many unresolved puzzles in deciding on
logical representations, so even approaches which do use logical representations are actually using some form of
approximate meaning representation. Under many circumstances, it is better to see the logical representation as an
annotation that captures some of the meaning, rather than a complete representation.
This can be taken to be equivalent to the logical expression PRON(x) ∧ (like_v(x, y) ∧ fish_n(y)) by translating the
reentrancy between argument positions into variable equivalence.
The most important thing to notice is how the syntactic argument positions in the lexical entries are linked to their
semantic argument positions. This means, for instance, that for the transitive verb like, the syntactic subject will always
correspond to the first argument position, while the syntactic object will correspond to the second position. In the CFG
and lambda calculus approach, the linking effect is achieved by making the lambda expressions match up with each
of the syntactic rules, which was straightforward in the tiny grammar, but becomes more complex when we consider
passives, topicalisation and so on.
Verb-object rule (integers are reentrancy tags):
[ HEAD 1, OBJ filled, SUBJ 3, SEM [ PRED and, ARG1 4, ARG2 5 ] ]
    → [ HEAD 1, OBJ 2, SUBJ 3, SEM 4 ] , 2 [ OBJ filled, SEM 5 ]
Lexicon:
like ;;; transitive verb
[ HEAD [ CAT verb, AGR pl ],
  OBJ  [ HEAD [ CAT noun ], OBJ filled, SEM [ INDEX 2 ] ],
  SUBJ [ HEAD [ CAT noun ], SEM [ INDEX 1 ] ],
  SEM  [ PRED like_v, ARG1 1, ARG2 2 ] ]
fish ;;; noun phrase
[ HEAD [ CAT noun, AGR [ ] ],
  OBJ  filled,
  SUBJ filled,
  SEM  [ INDEX 1, PRED fish_n, ARG1 1 ] ]
they ;;; noun phrase
[ HEAD [ CAT noun, AGR pl ],
  OBJ  filled,
  SUBJ filled,
  SEM  [ INDEX 1, PRED pron, ARG1 1 ] ]
Notice the use of the ‘and’ predicate to relate different parts of the logical form. With very simple examples such as
those covered by this grammar, it might seem preferable to use an approach where the nouns are embedded in the semantics
for the verb, e.g., like_v(fish_n, fish_n) for fish like fish. But this sort of representation does not extend to more complex
sentences (e.g., sentences with quantifiers).
The semantics shown above can be taken to be equivalent to a form of predicate calculus without variables or quanti-
fiers: i.e. the ‘variables’ in the representation actually correspond to constants. As discussed in the previous section,
we need quantifiers in a full representation. These can be encoded in feature structures, either in a scoped or un-
derspecified representation, but the demonstration of this is beyond the scope of this course. The feature structure
representation of semantics is appropriate when syntax is expressed in feature structures, since the combined represen-
tation can express some facts about the syntax-semantics interface more elegantly than lambda calculus with CFGs.
But the issues about the underlying logic and its expressive power remain, of course.
It is also possible to produce semantic dependencies: Copestake (2009) describes an approach to semantic dependencies
which is interconvertible with an underspecified logical representation. A very simple example is shown below,
though note that some information encoded on the nodes is omitted. The leading underscores are notational equivalents
to the prime (′) used previously and the _v etc. notation is a broad indication of sense (so _bark_v cannot refer to tree bark,
for instance).
_some_q(x, _big_a(e1,x) ∧ _angry_a(e2,x) ∧ _dog_n(x), _bark_v(e3,x) ∧ _loud_a(e4,e3))
∃x [ _big_a(e1,x) ∧ _angry_a(e2,x) ∧ _dog_n(x) ∧ _bark_v(e3,x) ∧ _loud_a(e4,e3) ]
6.5 Inference
There are two distinct ways of thinking about inference in language processing. The first can be thought of as inference
on an explicit formally-represented knowledge base, while the second is essentially language-based. It is possible to
use theorem provers with either approach, although relatively few researchers currently do this: it is more common to
use shallower, more robust, techniques.
Inference on a knowledge base This approach assumes that there is an explicit underlying knowledge base, which
might be represented in FOPC or some other formal language. For instance, we might have a set of axioms such as:
C(k, b)
C(r, b)
H(r)
U (k)
∀x[C(k, x) ⇒ H(x)]
There is also a link between the natural language terms appropriate for the domain covered and the constants and
predicates in the knowledge base: e.g. chase might correspond to C, happy to H, unhappy to U , Bernard to b and so
on. The approaches to compositional semantics discussed above are one way to do this. Under these assumptions, a
natural language question can be converted to an expression using the relevant predicates and answered either by direct
match to the knowledge base or via inference using the meaning representation in the knowledge base. For instance Is
Bernard happy? corresponds to querying the knowledge base with H(b). As mentioned in Lecture 1, the properties
of such a knowledge base can be exploited to reduce ambiguity. This is, of course, a trivial example: the complexity
of the system depends on the complexity of the domain (and the way in which out-of-domain queries are handled) but
in all cases the interpretation is controlled by the formal model of the domain.
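To make the querying step concrete, here is a minimal sketch of the toy knowledge base with a hand-written forward-chaining step for its single rule. The tuple encoding and the function names are assumptions of the sketch; a real system would use a general theorem prover or database rather than a hard-coded rule.

```python
# Ground facts from the example axioms, encoded as tuples.
FACTS = {("C", "k", "b"), ("C", "r", "b"), ("H", "r"), ("U", "k")}

def apply_rule(facts):
    """∀x[C(k, x) ⇒ H(x)]: anything that k stands in relation C to is H."""
    return {("H", f[2]) for f in facts if f[0] == "C" and f[1] == "k"}

def closure(facts):
    """Forward-chain until no new facts can be derived."""
    known = set(facts)
    while True:
        new = apply_rule(known) - known
        if not new:
            return known
        known |= new

def ask(query, facts=FACTS):
    """Answer a ground query by direct match or via inference."""
    return query in closure(facts)

print(ask(("H", "b")))   # True: follows from C(k, b) and the rule
print(ask(("H", "k")))   # False: not derivable from the knowledge base
```

A False answer here only means that the query is not derivable; nothing in this knowledge base relates H and U, so it cannot conclude that Kitty is not happy.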
Under these assumptions, the valid inferences are given by the knowledge base: language is just seen as a way of
accessing that knowledge. This approach relies on the mapping being adequate for any way that the user might choose
to phrase a question (e.g., Is Kitty sad? should be interpreted in the same way as Is Kitty unhappy?) and acquiring
such mappings is an issue, although machine learning methods can be used, given the right sort of training data. But
the most important difficulty lies in the limitations of the knowledge base representation itself. This sort of system works
very well for cases where there are clear boundaries to the domain and the reasoning is tractable, which is the case
for most types of spoken dialogue system. It may even turn out to be possible to approach mathematical knowledge
in this way, or at least some areas of mathematics. However, although some researchers in the 1970s and 80s thought
that it would be possible to extend this type of approach to commonsense knowledge, few people would now seriously
advocate this. Many researchers would also question whether it is helpful to think of human reasoning as operating in
this way: i.e., as translating natural language into a symbolic mental representation, although in some ways this model
does approximate human behaviour: most importantly, it is clearly the case that humans exploit their knowledge of a
particular context in order to disambiguate a natural language utterance.
Language-based inference The alternative way of thinking about inference is purely in terms of language: e.g.,
deciding whether one natural language statement follows from another. For instance,
The difference from the approach above is that the inference is entirely expressed in natural language and its validity
is determined by human intuition, rather than validity in any particular logic. We may use logic as a way of helping us
model these inferences correctly, but the basic notion of correctness is the human judgement, not logical correctness.
If an explicit meaning representation is used, it is best seen as an annotation of the natural language that captures some
of the meaning, not as a complete replacement for the text. Such language-based reasoning is not tied to a particular
knowledge base or domain. With such an approach we can proceed without having a perfect meaning representation
for a natural language statement, as long as we are prepared for the possibility that we will make some incorrect
decisions. Since humans are not always right, this is acceptable in principle.
The Recognising Textual Entailment task, discussed below, is a clear example of this methodology, but implicitly
or explicitly this is the most common approach in current NLP in general. It can be seen as underlying a number
of areas of research, including question answering and semantic search, and is relevant to many others, including
summarisation. A major limitation, at least in the vast majority of the research so far, is that the inferences are made
out of context. There will eventually be a requirement to include some form of modelling of the entities referred to,
to represent anaphora for instance, although this is probably best thought of as an annotation of the natural language
text, rather than an entirely distinct domain model.
The distinction between these two approaches can be seen as fundamental in terms of the philosophy of meaning.
However, NLP research can and does combine the approaches. For example, a conversational agent will use an explicit
domain model for some types of interaction and rely on language-based inference for others. While logic and
compositional semantics are traditionally associated with the first approach, they can also be useful in the second.
Lexical meaning and meaning postulates Computational approaches to lexical meaning will be discussed in the
following two lectures, but since interesting natural language inference requires some way of relating different lexemes,
I will make some preliminary remarks here. Inference rules can be used to relate open class predicates: i.e.,
predicates that correspond to open class words. This is the classic way of representing lexical meaning in formal
semantics within linguistics. The standard example is:
For computational semantics, perhaps the best way of regarding meaning postulates is simply as one reasonable way
of linking compositionally constructed semantic representations to a specific domain (in the first approach outlined
above) or as relating lexemes to one another (as required by the second approach). In NLP, we’re normally concerned
with implication rather than definition:
Such relationships may be approximate, as long as they are sufficiently accurate to be useful. For instance we’ll see
an example below that requires that find and discover are related, which could be stated as follows:
This implies that all uses of find could be substituted by discover, which is not the case in general, although it may be
completely adequate for some domains and work well enough in general to be useful.
The task is to label the pairs as TRUE (when the entailment follows) or FALSE (when it doesn’t follow, rather than
when it’s known to be an untrue statement) in a way which matches human judgements. The example above was
labelled TRUE.
Examples of this sort can be dealt with using a logical form generated from a grammar with compositional semantics
combined with inference. To show this in detail would require a lot more discussion of semantics than we had space for
above, but I’ll give a sketch of the approach here. Assume that the T sentence has logical form T′ and the H sentence
has logical form H′. Then if T′ ⇒ H′ we conclude TRUE, and otherwise we conclude FALSE. For example, the
logical form for the T sentence above is approximately as shown below:
The verb has the additional argument, e, corresponding to the event as discussed in §6.2 above. So the semantics can
be glossed as: ‘there was a finding event, the entity found was a girl, the finding event was in Drummondville, and the
finding event happened earlier this month’. The real representation would use a combination of predicates instead of
earlier-this-month′, but that’s not important for the current discussion.
The hypothesis text is then:
Clearly A ∧ B ⇒ A, which licenses dropping earlier this month, so, assuming a meaning postulate relating find and
discover as described above, the inference T′ ⇒ H′ would go through.
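The check can be sketched by treating each logical form as a set of conjuncts. Everything specific below is an assumption made for illustration: the atoms are reconstructed from the gloss of the T sentence, the H atoms are a guess at a hypothesis using discover, the argument structure of the predicates is invented, and the existentially quantified variables are simply treated as shared constants.

```python
# Conjuncts of T′ and H′ as tuples (hypothetical reconstructions).
T = {
    ("find", "e", "x", "y"),
    ("girl", "y"),
    ("in", "e", "Drummondville"),
    ("earlier-this-month", "e"),
}
H = {
    ("discover", "e", "x", "y"),
    ("girl", "y"),
    ("in", "e", "Drummondville"),
}

# Meaning postulates as predicate-to-predicate implications: find′ ⇒ discover′.
POSTULATES = {"find": "discover"}

def saturate(atoms, postulates):
    """Add every atom obtainable by applying the postulates to the conjuncts."""
    derived = set(atoms)
    changed = True
    while changed:
        changed = False
        for atom in list(derived):
            target = postulates.get(atom[0])
            if target is not None:
                new_atom = (target,) + atom[1:]
                if new_atom not in derived:
                    derived.add(new_atom)
                    changed = True
    return derived

def entails(t, h, postulates):
    """A ∧ B ⇒ A: H follows if each of its conjuncts is a consequence of T."""
    return h <= saturate(t, postulates)

print(entails(T, H, POSTULATES))   # True
print(entails(T, H, {}))           # False: nothing links find and discover
```

Dropping earlier this month corresponds to the subset test, and the postulate supplies the one conjunct of H that is not literally present in T.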
An alternative technique is based on matching dependency structures. Instead of an explicit meaning postulate, a
similarity metric can be used to relate find and discover. The similarity could be extracted from a lexical resource
(next lecture), or from a corpus using distributional methods (lecture 8).
Alternatively, a more robust method can be used which does not require parsing of any type. The crudest technique is
to use a bag-of-words method, analogous to that discussed for sentiment detection in lecture 1: i.e., if there is a large
enough overlap between T and H, the entailment goes through. Note that whether such a method works well or not
crucially depends on the H texts: it would trivially be fooled by hypotheses like:
In fact, the RTE test set was constructed in such a way that word overlap works quite well.
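A bag-of-words entailment check takes only a few lines, which also makes its weakness easy to see. This is a sketch under stated assumptions: the sentences are invented (loosely modelled on the gloss given earlier), the threshold of 0.75 is arbitrary, and coverage of H’s word types by T is just one possible overlap measure.

```python
import re

def tokens(text):
    """Lower-cased word types in a text (order and counts are ignored)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_entails(t, h, threshold=0.75):
    """Label TRUE when enough of H's word types also occur in T."""
    h_words = tokens(h)
    if not h_words:
        return False
    coverage = len(h_words & tokens(t)) / len(h_words)
    return coverage >= threshold

# Invented example pair and two further invented hypotheses.
t = "A girl was found in Drummondville earlier this month."
h_plausible = "A girl was found in Drummondville."
h_negated = "A girl was not found in Drummondville."
h_unrelated = "Rover barks loudly."

print(overlap_entails(t, h_plausible))   # True
print(overlap_entails(t, h_negated))     # True, although it should be FALSE
print(overlap_entails(t, h_unrelated))   # False
```

The negated hypothesis shares six of its seven word types with T, so the method accepts it: exactly the kind of case that would fool word overlap if the test set contained it.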
Further examples (all discussed by Bos and Markert, 2005):
The postlecture exercises suggest that you try and work out how a logical approach might handle (or fail to handle)
these examples.
7 Lecture 7: Lexical semantics
Lexical semantics concerns word meaning. In the previous lecture, I briefly discussed the use of meaning postulates to
represent the meaning of words. Linguistically and philosophically, any such approach has clear problems. Take the
bachelor example: is the current Pope a bachelor? Technically presumably yes, but bachelor seems to imply someone
who could be married: it’s a strange word to apply to the Pope under current assumptions about celibacy. Meaning
postulates are also too unconstrained: I could construct a predicate ‘bachelor-weds-thurs’ to correspond to someone
who was unmarried on Wednesday and married on Thursday, but this isn’t going to correspond to a word in any natural
language. In any case, very few words are as simple to define as bachelor: consider how you might start to define
table, tomato or thought, for instance.
Rather than try and build complete and precise representations of word meaning therefore, computational linguists
work with partial or approximate representations. The find/discover implication in the last lecture is an approximate
representation, because expanding the rule to stipulate the exact conditions under which discover can be used instead
of find is impossible (at least in any practical system). In this lecture, I will discuss the classical lexical semantic
relations and then discuss polysemy and word sense disambiguation. In the following lecture, we will look at an
alternative approach to representing meaning.
1. What classes of words can be categorised by hyponymy? Some nouns, classically biological taxonomies, but
also human artifacts, professions etc work reasonably well. Abstract nouns, such as truth, don’t really work very
well (they are either not in hyponymic relationships at all, or very shallow ones). Some verbs can be treated as
being hyponyms of one another — e.g. murder is a hyponym of kill, but this is not nearly as clear as it is for
concrete nouns. Event-denoting nouns are similar to verbs in this respect. Hyponymy is essentially useless for
adjectives.
2. Do differences in quantisation and individuation matter? For instance, is chair a hyponym of furniture? is beer
a hyponym of drink? is coin a hyponym of money?
3. Is multiple inheritance allowed? Intuitively, multiple parents might be possible: e.g. coin might be metal (or
object?) and also money. Artifacts in general can often be described either in terms of their form or their
function.
4. What should the top of the hierarchy look like? The best answer seems to be to say that there is no single top
but that there are a series of hierarchies.
vegetables.
meronym of body); steering wheel is a meronym of car. Note the distinction between ‘part’ and ‘piece’: if I
attack a car with a chainsaw, I get pieces rather than parts!
Synonymy i.e., two words with the same meaning (or nearly the same meaning)
True synonyms are relatively uncommon: most cases of true synonymy are correlated with dialect differences
(e.g., eggplant / aubergine, boot / trunk). Often synonymy involves register distinctions, slang or jargons: e.g.,
policeman, cop, rozzer . . . Near-synonyms convey nuances of meaning: thin, slim, slender, skinny.
Antonymy i.e., opposite meaning
Antonymy is mostly discussed with respect to adjectives: e.g., big/little, though it’s only relevant for some
classes of adjectives.
7.3 WordNet
WordNet is the main resource for lexical semantics for English that is used in NLP — primarily because of its very
large coverage and the fact that it’s freely available. WordNets are under development for many other languages,
though so far none are as extensive as the original.
The primary organisation of WordNet is into synsets: synonym sets (near-synonyms). To illustrate this, the following
is part of what WordNet returns as an ‘overview’ of red:
wn red -over
Sense 6
big cat, cat
    => leopard, Panthera pardus
        => leopardess
        => panther
    => snow leopard, ounce, Panthera uncia
    => jaguar, panther, Panthera onca, Felis onca
    => lion, king of beasts, Panthera leo
        => lioness
        => lionet
    => tiger, Panthera tigris
        => Bengal tiger
        => tigress
    => liger
    => tiglon, tigon
    => cheetah, chetah, Acinonyx jubatus
    => saber-toothed tiger, sabertooth
        => Smiledon californicus
        => false saber-toothed tiger
Taxonomies have also been extracted from machine-readable dictionaries: Microsoft’s MindNet is the best known
example. There has been considerable work on extracting taxonomic relationships from corpora, including some
aimed at automatically extending WordNet.
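A taxonomy of this kind is easy to query once it is stored as child-to-parent links. The sketch below uses a hand-copied subset of the big cat listing; the nesting is my reading of that listing, each synset is reduced to a single name, and the single-parent dictionary deliberately rules out multiple inheritance. It does not use the WordNet software itself.

```python
# Child -> hypernym links (a hand-copied subset; nesting is an assumption).
HYPERNYM = {
    "leopard": "big cat",
    "leopardess": "leopard",
    "snow leopard": "big cat",
    "jaguar": "big cat",
    "lion": "big cat",
    "lioness": "lion",
    "lionet": "lion",
    "tiger": "big cat",
    "Bengal tiger": "tiger",
    "tigress": "tiger",
    "cheetah": "big cat",
}

def hypernym_chain(word):
    """All ancestors of a word, nearest first."""
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain

def is_hyponym_of(word, ancestor):
    """Hyponymy is transitive, so any ancestor in the chain counts."""
    return ancestor in hypernym_chain(word)

print(hypernym_chain("lioness"))            # ['lion', 'big cat']
print(is_hyponym_of("tigress", "big cat"))  # True
print(is_hyponym_of("lion", "tiger"))       # False
```

Supporting a word like coin with two parents would need a mapping from each word to a set of hypernyms and a graph traversal instead of a single chain.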
• Semantic classification: e.g., for selectional restrictions (e.g., the object of eat has to be something edible) and
for named entity recognition
• Shallow inference: ‘X murdered Y’ implies ‘X killed Y’ etc (as discussed in the previous lecture).
• Back-off to semantic classes in some statistical approaches (for instance, WordNet classes can be used in docu-
ment classification).
• Word-sense disambiguation
• Query expansion for information retrieval: if a search doesn’t return enough results, one option is to replace an
over-specific term with a hypernym
Synonymy or near-synonymy is relevant for some of these reasons and also for generation. (However dialect and reg-
ister haven’t been investigated much in NLP, so the possible relevance of different classes of synonym for customising
text hasn’t really been looked at.)
7.5 Polysemy
Polysemy refers to the state of a word having more than one sense: the standard example is bank (river bank) vs bank
(financial institution).
This is homonymy — the two senses are unrelated (not entirely true for bank, actually, but historical relatedness isn’t
important — it’s whether ordinary speakers of the language feel there’s a relationship). Homonymy is the most obvious
case of polysemy, but is relatively infrequent compared to uses which have different but related meanings, such as bank
(financial institution) vs bank (in a casino).
If polysemy were always homonymy, word senses would be discrete: two senses would be no more likely to share
characteristics than would morphologically unrelated words. But most senses are actually related. Regular or sys-
tematic polysemy (zero derivation, as mentioned in §2.2) concerns related but distinct usages of words, often with
associated syntactic effects. For instance, strawberry, cherry (fruit / plant), rabbit, turkey, halibut (meat / animal),
tango, waltz (dance (noun) / dance (verb)).
There are a lot of complicated issues in deciding whether a word is polysemous or simply general/vague. For instance,
teacher is intuitively general between male and female teachers rather than ambiguous, but giving good criteria as a
basis of this distinction is difficult. Dictionaries are not much help, since their decisions as to whether to split a sense
or to provide a general definition are very often contingent on external factors such as the size of the dictionary or the
intended audience, and even when these factors are relatively constant, lexicographers often make different decisions
about whether and how to split up senses.
WSD up to the early 1990s was mostly done by hand-constructed rules (still used in some MT systems). Dahlgren
investigated WSD in a fairly broad domain in the 1980s. Reasonably broad-coverage WSD generally depends on:
• frequency
• collocations
• selectional restrictions/preferences
What’s changed since the 1980s is that various statistical or machine-learning techniques have been used to avoid
hand-crafting rules.
• supervised learning. Requires a sense-tagged corpus, which is extremely time-consuming to construct system-
atically (examples are the Semcor and SENSEVAL corpora, but both are really too small). Often experiments
have been done with a small set of words which can be sense-tagged by the experimenter. Supervised learning
techniques do not carry over well from one corpus to another.
• unsupervised learning (see below)
• Machine readable dictionaries (MRDs). Disambiguating dictionary definitions according to the internal data in
dictionaries is necessary to build taxonomies from MRDs. MRDs have also been used as a source of selectional
preference and collocation information for general WSD (quite successfully).
Until recently, most of the statistical or machine-learning techniques have been evaluated on homonyms: these are
relatively easy to disambiguate. So 95% disambiguation in e.g., Yarowsky’s experiments sounds good (see below),
but doesn’t translate into high precision on all words when the target is WordNet senses (in SENSEVAL 2 the best system
was around 70%).
There have also been some attempts at automatic sense induction, where an attempt is made to determine the clusters
of usages in texts that correspond to senses. In principle, this is a very good idea, since the whole notion of a word
sense is fuzzy: word senses can be argued to be artifacts of dictionary publishing. However, so far sense induction
has not been much explored in monolingual contexts, though it could be considered as an inherent part of statistical
approaches to MT.
7.7 Collocations
Informally, a collocation is a group of two or more words that occur together more often than would be expected by
chance (there are other definitions — this is not really a precise notion). Collocations have always been the most useful
source of information for WSD, even in Dahlgren’s early experiments. For instance:
striped is a good indication that we’re talking about the fish (because it’s a particular sort of bass), similarly with
guitar and music. In both bass guitar and striped bass, we’ve arguably got a multiword expression (i.e., a conventional
phrase that might be listed in a dictionary), but the principle holds for any sort of collocation. The best collocates for
WSD tend to be syntactically related in the sentence to the word to be disambiguated, but many techniques simply use
a window of words.
The term collocation is sometimes restricted to the situation where there is a syntactic relationship between the words.
J&M (second edition) define collocation as a position-specific relationship (in contrast to bag-of-words, where position
is ignored) but this is not a standard definition.
7.8 Yarowsky’s unsupervised learning approach to WSD
Yarowsky (1995) describes a technique for unsupervised learning using collocates. A few seed collocates (possibly
position-specific) are chosen for each sense (manually or via an MRD), then these are used to accurately identify
distinct senses. The sentences in which the disambiguated senses occur can then be used to learn other discriminating
collocates automatically, producing a decision list. The process can then be iterated. The algorithm allows bad
collocates to be overridden. This works because of the general principle of ‘one sense per collocation’ (experimentally
demonstrated by Yarowsky — it’s not absolute, but there are very strong preferences).
In a bit more detail, using Yarowsky’s example of disambiguating plant (which is homonymous between factory vs
vegetation senses):
1. Identify all examples of the word to be disambiguated in the training corpus and store their contexts.
6. Apply the classifier to the unseen test data
The following schematic diagrams may help:
[Schematic diagrams of the training examples at four stages.
Initial state: every example is unlabelled (?).
Seeds: examples containing the seed collocates life and manufacturing are labelled A and B respectively; the rest are still ?.
Iterating: further collocates such as animal and company are learned and more examples are labelled A or B.
Final: all examples are labelled A or B.]
Yarowsky also demonstrated the principle of ‘one sense per discourse’. For instance, if plant is used in the botanical
sense in a particular text, then subsequent instances of plant in the same text will also tend to be used in the botanical
sense. Again, this is a very strong, but not absolute effect. This can be used as an additional refinement for the
algorithm above, assuming we have a way of detecting the boundaries between distinct texts in the corpus.
Decision list classifiers can be thought of as automatically trained case statements. The experimenter decides on the
classes of test (e.g., word next to word to be disambiguated; word within window 10). The system automatically
generates and orders the specific tests based on the training data.
Yarowsky argues that decision lists work better than many other statistical frameworks because no attempt is made to
combine probabilities. This would be complex, because the criteria are not independent of each other. More details of
this approach are in J&M (section 20.5).
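The ‘case statement’ view of a decision list can be sketched as follows. This is an illustration rather than Yarowsky’s implementation: the training examples are invented, only one class of test (collocate anywhere in the context) is used, and the score is a smoothed log ratio of counts with an arbitrary smoothing constant.

```python
import math
from collections import Counter, defaultdict

def train_decision_list(examples, smoothing=0.1):
    """examples: list of (sense, context_words) for a two-sense word.

    Returns tests (score, collocate, sense), strongest first, where score is
    the absolute log ratio of the smoothed collocate counts for the two senses.
    """
    senses = sorted({sense for sense, _ in examples})
    if len(senses) != 2:
        raise ValueError("this sketch handles exactly two senses")
    a, b = senses
    counts = defaultdict(Counter)            # collocate -> sense -> count
    for sense, words in examples:
        for w in set(words):
            counts[w][sense] += 1
    rules = []
    for w, c in counts.items():
        ratio = math.log((c[a] + smoothing) / (c[b] + smoothing))
        if ratio != 0:                        # uninformative collocates are dropped
            rules.append((abs(ratio), w, a if ratio > 0 else b))
    rules.sort(key=lambda r: (-r[0], r[1]))
    return rules

def classify(rules, words, default=None):
    """Use only the first (strongest) matching test; evidence is not combined."""
    context = set(words)
    for _score, collocate, sense in rules:
        if collocate in context:
            return sense
    return default

# Invented sense-tagged contexts for plant: A = vegetation, B = factory.
training = [
    ("A", ["plant", "life", "animal"]),
    ("A", ["plant", "life", "species"]),
    ("B", ["manufacturing", "plant", "company"]),
    ("B", ["manufacturing", "plant", "car"]),
]
rules = train_decision_list(training)
print(classify(rules, ["the", "company", "plant"]))    # B
print(classify(rules, ["plant", "life", "company"]))   # A
```

In the second call both life and company match, but life is the stronger test and is reached first, so the weaker conflicting evidence is never consulted.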
Yarowsky’s experiments were nearly all on homonyms: these principles probably don’t hold as well for sense exten-
sion.
8 Lecture 8: Distributional semantics
Copyright © Aurélie Herbelot and Ann Copestake, 2012–2013. The notes from this lecture are partly based on slides
written by Aurélie Herbelot.
Distributional semantics refers to a family of techniques for representing word (and phrase) meaning based on (lin-
guistic) contexts of use. Consider the following examples (from the BNC):
Even if you don’t know the word scrumpy, you can get a good idea of its meaning from contexts like this. Humans
typically learn word meanings from context rather than explicit definition: sometimes these meanings are perceptually
grounded (e.g., someone gives you a glass of scrumpy), sometimes not.
It is to a large extent an open question how word meanings are represented in the brain.25 Distributional semantics uses
linguistic context to represent meaning and this is likely to have some relationship to mental meaning representation.
It can only be a partial representation of lexical meaning (perceptual grounding is clearly important too, and more-or-
less explicit definition may also play a role) but, as we’ll see, distributions based on language alone are good enough
for some tasks. In these models, meaning is seen as a space, with dimensions corresponding to elements in the context
(features). Computational techniques generally use vectors to represent the space and the terms semantic space models
and vector space models are sometimes used instead of distributional semantics. 26 Schematically:
          feature_1   feature_2   ...   feature_n
word_1    f_1,1       f_2,1       ...   f_n,1
word_2    f_1,2       f_2,2       ...   f_n,2
...
word_m    f_1,m       f_2,m       ...   f_n,m
There are many different possible notions of features: co-occur with word_n in some window, co-occur with word_n as
a syntactic dependent, occur in paragraph_n, occur in document_n . . .
The main use of distributional models has been in measuring similarity between pairs of words: such measurements
can be exploited in a variety of ways to model language. Similarity measurements allow clustering of words. In the rest
of this lecture, we will first discuss some possible models illustrating the choices that must be made when designing
a distributional semantics system and go through a step-by-step example. We’ll then look at some real distributions
and then describe how distributions can be used in measuring similarity between words. We’ll briefly describe how
polysemy affects distributions and conclude with a discussion of the relationship between distributional models and
the classic lexical semantic relations of synonymy, antonymy and hyponymy.
8.1 Models
Distributions are vectors in a multidimensional semantic space, that is, objects with a magnitude (length) and a direc-
tion. The semantic space has dimensions which correspond to possible contexts. For our purposes, a distribution can
be seen as a point in that space (the vector being defined with respect to the origin of that space). For example, cat
might have the distribution [...dog 0.8, eat 0.7, joke 0.01, mansion 0.2, zebra 0.1...], where ‘dog’, ‘joke’ and so on are
the dimensions.
Context:
Different models adopt different notions of context:
• Word windows (unfiltered): n words on either side of the lexical item under consideration (unparsed text).
Example: n=2 (5 words window):
... the prime minister acknowledged that ...
25 In fact, it’s common to talk about concepts rather than word meanings when discussing mental representation, but while some researchers treat
some (or all) concepts as equivalent to word senses (or phrases), others think they are somehow distinct. We will return to this in lecture 11.
26 Vector space models in IR are directly related to distributional models. Some more complex techniques use tensors of different orders rather
than simply using vectors, but we won’t discuss them in this lecture.
• Word windows (filtered): n words on either side of the lexical item under consideration (unparsed text). Some
words are not considered part of the context (e.g. function words, some very frequent content words). The stop
list for function words is either constructed manually, or the corpus is POS-tagged and only certain POS tags
are considered part of the context.
Example: n=2 (5 words window); underlined words are in the stop list.
... the prime minister acknowledged that ...
• Lexeme windows: as above, but a morphological processor is applied first that converts the words to their stems.
• Dependencies: syntactic or semantic. The corpus is converted into a list of directed links between heads and
dependents. Context for a lexical item is the dependency structure it belongs to. The length of the dependency
path can vary according to the implementation (Padó and Lapata, 2007).
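The two word-window notions of context can be sketched in a few lines of Python. This is only an illustration: the function name window_context is hypothetical, the target word is assumed to be minister, and the stop list {the, that} is an assumed stand-in for a real list of function words. The sketch also assumes that stop words are dropped after the window is taken, so a filtered window may contain fewer than 2n words.

```python
def window_context(tokens, index, n, stop_list=frozenset()):
    """Return the context words within n positions of tokens[index].

    Words in stop_list are removed after the window is taken (an
    assumption; a model could instead widen the window to compensate).
    """
    left = tokens[max(0, index - n):index]
    right = tokens[index + 1:index + 1 + n]
    return [w for w in left + right if w not in stop_list]


tokens = "the prime minister acknowledged that".split()
target = tokens.index("minister")

# Unfiltered window, n=2: every word within two positions counts.
unfiltered = window_context(tokens, target, n=2)

# Filtered window with a hypothetical stop list of function words.
filtered = window_context(tokens, target, n=2, stop_list={"the", "that"})

print(unfiltered)
print(filtered)
```

A lexeme window would apply a morphological processor to the tokens before calling the same function.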
Context weighting:
Different models use different methods of weighting the context elements:
– Positive PMI (PPMI): as PMI but 0 if PMI < 0.
– Derivatives such as Mitchell and Lapata’s (2010) weighting function (PMI without the log).
Most work uses some form of characteristic model in order to give most weight to frequently co-occurring features, but
allowing for the overall frequency of the terms in the context. Note that PMI is one of the measures used for finding
collocations (see previous lecture): the distributional models can be seen as combining the collocations for words.
Semantic space:
Once the contexts and weights have been decided on, models also vary in which elements are included in the final
vectors: i.e., what the total semantic space consists of. The main options are as follows (with positive and negative
aspects indicated):
• Entire vocabulary.
– + All information included – even rare, but important contexts
– - Inefficient (100,000s dimensions). Noisy (e.g. 002.png|thumb|right|200px|graph_n)
• Top n words with highest frequencies.
– + More efficient (5000-10000 dimensions). Only ‘real’ words included.
– - May miss out on infrequent but relevant contexts.
• Singular Value Decomposition (LSA – Landauer and Dumais, 1997): the number of dimensions is reduced by
exploiting redundancies in the data. A new dimension might correspond to a generalisation over several of the
original dimensions (e.g. the dimensions for car and vehicle are collapsed into one).
– + Very efficient (200-500 dimensions). Captures generalisations in the data.
– - SVD matrices are not interpretable.
The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is
that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at
or repair.
Note that the single token of goes is counted twice, because it occurs with two different tokens of wrong.
Distribution for wrong (PPMI):
For instance, for the context possibly, f_{wc} is 2 (as in the raw frequency distribution table), f_c is the total count of
possibly which is also 2 (as in the overall frequency count table), f_w is 4 (again, as in the overall frequency count
table) and f_{total} is 20 (i.e., the sum of the frequencies in the overall frequency count table), so PMI is
log((f_{wc} * f_{total}) / (f_w * f_c)) = log((2 * 20) / (4 * 2)) = log(5).
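The arithmetic for possibly can be checked with a short sketch. The function names are illustrative, and the natural logarithm is an assumption (the notes do not fix the base; base 2 is also common and only rescales the values).

```python
import math


def pmi(f_wc, f_w, f_c, f_total):
    """Pointwise mutual information from raw counts:
    log( (f_wc * f_total) / (f_w * f_c) )."""
    return math.log((f_wc * f_total) / (f_w * f_c))


def ppmi(f_wc, f_w, f_c, f_total):
    """Positive PMI: as PMI, but 0 if PMI < 0.

    A context that never co-occurs with the word (f_wc == 0) is given
    weight 0 here, which avoids taking log(0).
    """
    if f_wc == 0:
        return 0.0
    return max(0.0, pmi(f_wc, f_w, f_c, f_total))


# Counts for the word 'wrong' with the context 'possibly' from the example.
weight = ppmi(f_wc=2, f_w=4, f_c=2, f_total=20)
print(weight, math.log(5))
```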
• For nouns: head verbs (+ any other argument of the verb), modifying adjectives, head prepositions (+ any other
argument of the preposition).
e.g. cat: chase_v+mouse_n, black_a, of_p+neighbour_n
• For verbs: arguments (NPs and PPs), adverbial modifiers.
e.g. eat: cat_n+mouse_n, in_p+kitchen_n, fast_a
• For adjectives: modified nouns; rest as for nouns (assuming intersective composition).
e.g. black: cat_n, chase_v+mouse_n
The model uses a semantic space of the top 100,000 contexts (because we wanted to include the rare terms) with a
variant of PMI (Bouma 2007) for weighting:
\[ \mathrm{pmi}_{wc} = \frac{\log\left( \frac{f_{wc} \cdot f_{total}}{f_w \cdot f_c} \right)}{-\log\left( \frac{f_{wc}}{f_{total}} \right)} \tag{26} \]
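Reading equation (26) as ordinary PMI divided by −log(f_{wc}/f_{total}), the weighting can be sketched as follows. The function name is illustrative, and the counts reuse the possibly example from §8.2 purely to show the effect of the normalisation; they are not taken from the Wikiwoods model.

```python
import math


def normalised_pmi(f_wc, f_w, f_c, f_total):
    """PMI divided by -log(f_wc / f_total), following equation (26).

    The result does not depend on the base of the logarithm, because
    the base cancels between numerator and denominator. The function
    assumes 0 < f_wc < f_total (otherwise a log or the division fails).
    """
    pmi = math.log((f_wc * f_total) / (f_w * f_c))
    return pmi / -math.log(f_wc / f_total)


# The 'possibly' example: log(5) / log(10).
example = normalised_pmi(f_wc=2, f_w=4, f_c=2, f_total=20)

# A word and context that only ever occur together reach the maximum of 1.
always_together = normalised_pmi(f_wc=2, f_w=2, f_c=2, f_total=20)

print(example, always_together)
```

The normalisation bounds the weight above by 1, which limits the tendency of plain PMI to give very high values to rare events.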
Corpus choice is another parameter that has to be considered in building models. Some research suggests that one
should use as much data as possible. Some commonly used corpora:
In general, more data does give better models but the domain has to be considered: for instance, huge corpora of
financial news won’t give models that work well with other text types. Furthermore, more data is not realistic from
a psycholinguistic point of view. We perhaps encounter 50,000 words a day (although nobody actually has good
estimates of this!) so the BNC, which is very small by the standards of current experiments, perhaps corresponds to 5
years’ exposure.
It is clear that sparse data is a problem for relatively rare words. For instance, consider the following distribution for
unicycle, as obtained from Wikiwoods:
0.12::fast_a    0.07::come_v
0.11::red_a     0.06::high_a
Note that humans exploit a lot more information from context and can get a good idea of word meanings from a small
number of examples.
Distributions are generally constructed without any form of sense disambiguation. The semantic space can be thought
of as consisting of subspaces for different senses, with homonyms (presumably) relatively distinct. For instance,
consider the following distribution for pot:
Finally, note that distributions contain many contexts which arise from multiword expressions of various types, and
these often have high weights. The distribution of pot contains several examples, as does the following distribution for
time:
To sum up, there is a wide range of choice in constructing distributional models. Manually examining the characteristic
contexts gives us a good idea of how sensible different weighting measures are, for instance, but we need to look at
how distributions are actually used to evaluate how well they model meaning.
8.4 Similarity
Calculating similarity in a distributional space is done by calculating the distance between the vectors.
• Law of cosines: c^2 = a^2 + b^2 − 2ab cos γ
Cosine similarity:
\[ \frac{\sum_k v_{1k} \, v_{2k}}{\sqrt{\sum_k v_{1k}^2} \, \sqrt{\sum_k v_{2k}^2}} \tag{27} \]
This measure calculates the angle between two vectors and is therefore length-independent. This is important, as
frequent words have longer vectors than less frequent ones.
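A minimal sketch of equation (27), using plain Python lists as dense vectors (real distributions are sparse, so an implementation would more likely use dictionaries or a sparse matrix library). The vectors below are made up to show length-independence.

```python
import math


def cosine(v1, v2):
    """Cosine of the angle between two equal-length vectors.

    Raises ZeroDivisionError if either vector is all zeros.
    """
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)


rare = [1.0, 2.0, 3.0]
frequent = [10.0, 20.0, 30.0]   # same direction, ten times longer

# Scaling a vector does not change the angle, so the cosine is still 1.
print(cosine(rare, frequent))
print(cosine([1.0, 0.0], [0.0, 1.0]))
```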
The following examples should give some idea of the scales of similarity found:
Note that perfect similarity gives a cosine of 1, but that even near-synonyms like gem and jewel have much lower
cosine similarity.
Words most similar to cat, as chosen from the 5000 most frequent nouns in Wikipedia.
This notion of similarity is very broad. It includes synonyms, near-synonyms, hyponyms, taxonomical siblings,
antonyms and so on. But it does correlate with a psychological reality. One of the favourite tests of the distri-
butional semantics community is the calculation of rank correlation between a distributional similarity system and
human judgments on the Miller & Charles (1991) test set shown below:
3.92 automobile-car 3.05 bird-cock 0.84 forest-graveyard
3.84 journey-voyage 2.97 bird-crane 0.55 monk-slave
3.84 gem-jewel 2.95 implement-tool 0.42 lad-wizard
3.76 boy-lad 2.82 brother-monk 0.42 coast-forest
3.7 coast-shore 1.68 crane-implement 0.13 cord-smile
3.61 asylum-madhouse 1.66 brother-lad 0.11 glass-magician
3.5 magician-wizard 1.16 car-journey 0.08 rooster-voyage
3.42 midday-noon 1.1 monk-oracle 0.08 noon-string
3.11 furnace-stove 0.89 food-rooster
3.08 food-fruit 0.87 coast-hill
The human similarity results can be replicated: the Miller & Charles experiment is a re-run of Rubenstein & Good-
enough (1965): the correlation coefficient between them is 0.97. A good distributional similarity system can have a
correlation of 0.8 or better with the human data (although there is a danger the reported results are unreasonably high,
because this data has been used in so many experiments).
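The rank correlation used in this evaluation can be sketched as Spearman's coefficient, i.e., the Pearson correlation of the two rankings. The human scores below are five of the Miller & Charles values listed above; the system scores are invented purely for illustration and do not come from any real model.

```python
import math


def ranks(values):
    """Rank values from 1 (smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# automobile-car, gem-jewel, food-fruit, coast-hill, noon-string
human = [3.92, 3.84, 3.08, 0.87, 0.08]
system = [0.52, 0.30, 0.35, 0.10, 0.02]   # hypothetical cosine scores

print(spearman(human, system))
```

Only the ordering of the pairs matters, so the different scales of the human ratings (0 to 4) and the cosines (0 to 1) are irrelevant.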
Another frequently used dataset is the TOEFL (Test of English as a Foreign Language) synonym test. For example:
Stem: levied
Choices: (a) imposed
(b) believed
(c) requested
(d) correlated
Solution: (a) imposed
Non-native English speakers (US college applicants) are reported to average around 65% on this test; the best
corpus-based results are 100% (Bullinaria and Levy, 2012). But note that the authors who got this result suggest the
test is not very reliable; one reason is probably that the data includes some extremely rare words.
Similarity measures can be applied as a type of backoff technique in a range of tasks. For instance, in sentiment
analysis (discussed in lecture 1), an initial bag of words acquired from the training data can be expanded by including
distributionally similar words.
0.23::and_c+interference_n   0.23::and_c+detective_n   0.22::dead_a   0.21::pron_rel+evade_v
0.23::arrive_v   0.22::look_v+way_n   0.22::pron_rel+stab_v
Synonyms and similarity: some further examples (all from Wikiwoods, processed as discussed above):
In general, true synonymy does not correspond to higher similarity scores than near-synonymy.
Antonyms have high similarity, as indicated by the examples below:
• cold/hot 0.29
• dead/alive 0.24
• large/small 0.68
• colonel/general 0.33
It is possible to automatically distinguish them from (near-)synonyms using corpus-based techniques, but this requires
additional heuristics. For instance, it has been observed that antonyms are frequently coordinated while synonyms are
not:
Similarly, it is possible to acquire hyponymy relationships from distributions, but this is much less effective than
looking for explicit taxonomic relationships in Wikipedia text.
9 Lecture 9: Discourse
The techniques we have seen in lectures 2–8 relate to the interpretation of words and individual sentences, but utter-
ances are always understood in a particular context. Context-dependent situations include:
4. Implicit relationships between events: Max fell. John pushed him — the second sentence is (usually) understood
as providing a causal explanation.
In the first part of this lecture, I give a brief overview of rhetorical relations which can be seen as structuring text
at a level above the sentence. I’ll then go on to talk about one particular case of context-dependent interpretation —
anaphor resolution.
This is yet another form of ambiguity: there are two different interpretations but there is no syntactic or semantic
ambiguity in the interpretation of the individual sentences. There seems to be an implicit relationship between the two
original sentences: a discourse relation or rhetorical relation. (I will use the terms interchangeably here, though differ-
ent theories use different terminology, and rhetorical relation tends to refer to a more surfacy concept than discourse
relation.) In 1 the link is a form of explanation, but 2 is an example of narration. Theories of discourse/rhetorical
relations reify link types such as Explanation and Narration. The relationship is made more explicit in 1 and 2 than it
was in the original sentence: because and and then are said to be cue phrases.
9.2 Coherence
Discourses have to have connectivity to be coherent:
Both of these sentences make perfect sense in isolation, but taken together they are incoherent. Adding context can
restore coherence:
Kim got into her car. Sandy likes apples, so Kim thought she’d go to the farm shop and see if she could
get some.
The second sentence can be interpreted as an explanation of the first. In many cases, this will also work if the context
is known, even if it isn’t expressed.
Language generation requires a way of implementing coherence. For example, consider a system that reports share
prices. This might generate:
In trading yesterday: Dell was up 4.2%, Safeway was down 3.2%, HP was up 3.1%.
Computer manufacturers gained in trading yesterday: Dell was up 4.2% and HP was up 3.1%. But retail
stocks suffered: Safeway was down 3.2%.
Here but indicates a Contrast. Not much actual information has been added (assuming we know what sort of company
Dell, HP and Safeway are), but the discourse is easier to follow.
Discourse coherence assumptions can affect interpretation:
If we interpret this as Explanation, then ‘he’ is most likely Bill. But if it is Justification (i.e., the speaker is providing
evidence to justify the first sentence), then ‘he’ is John.
It should be clear that it is potentially very hard to identify rhetorical relations. In fact, recent research that simply uses
cue phrases and punctuation is quite promising. This can be done by hand-coding a series of finite-state patterns, or
by supervised learning.
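As a rough sketch of the hand-coded option, the cue phrases mentioned in this lecture (because, and then, but) can be matched with regular expressions, which are finite-state patterns. The pattern list and function name are illustrative assumptions; a realistic system would need many more patterns, would use punctuation, and would have to deal with cue phrases that are ambiguous between relations.

```python
import re

# Ordered list of (pattern, relation); the first matching pattern wins.
CUE_PATTERNS = [
    (re.compile(r"\bbecause\b", re.IGNORECASE), "Explanation"),
    (re.compile(r"\band then\b", re.IGNORECASE), "Narration"),
    (re.compile(r"\bbut\b", re.IGNORECASE), "Contrast"),
]


def relation_from_cue(text):
    """Return the relation signalled by the first matching cue phrase,
    or None if the text contains no known cue phrase."""
    for pattern, relation in CUE_PATTERNS:
        if pattern.search(text):
            return relation
    return None


for text in [
    "Max fell because John pushed him.",
    "Max fell and then John pushed him.",
    "But retail stocks suffered: Safeway was down 3.2%.",
    "Max fell. John pushed him.",
]:
    print(relation_from_cue(text), "|", text)
```

The last example has no cue phrase, so the relation stays implicit and this approach has nothing to say about it.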
We get a binary branching tree structure for the discourse. In many relationships one phrase depends on
the other. In fact we can get rid of the subsidiary phrases and still have a reasonably coherent discourse.
Other relationships, such as Narration, give equal weight to both elements, so don’t give any clues for summarization.
Rather than trying to find rhetorical relations for arbitrary text, genre-specific cues can be exploited, for instance for
scientific texts. This allows more detailed summaries to be constructed.
9.5 Referring expressions
I’ll now move on to talking about another form of discourse structure, specifically the link between referring expres-
sions. The following example will be used to illustrate referring expressions and anaphora resolution:
Niall Ferguson is prolific, well-paid and a snappy dresser. Stephen Moss hated him — at least until he
spent an hour being charmed in the historian’s Oxford study. (quote taken from the Guardian)
Some terminology:
referent: a real world entity that some piece of text (or speech) refers to, e.g., the two people who are mentioned in
this quote.
referring expressions: bits of language used to perform reference by a speaker. In the paragraph above, Niall Ferguson,
him and the historian are all being used to refer to the same person (they corefer).
antecedent: the text initially evoking a referent. Niall Ferguson is the antecedent of him and the historian.
anaphora: the phenomenon of referring to an antecedent: him and the historian are anaphoric because they refer to a
previously introduced entity.
What about a snappy dresser? Traditionally, this would be described as predicative: that is, it is a property of some
entity (similar to adjectival behaviour) rather than being a referring expression itself.
Generally, entities are introduced in a discourse (technically, evoked) by indefinite noun phrases or proper names.
Demonstratives (e.g., this) and pronouns are generally anaphoric. Definite noun phrases are often anaphoric (as above),
but are also often used to bring a mutually known and uniquely identifiable entity into the current discourse, e.g.,
the president of the US.
Sometimes, pronouns appear before their referents are introduced by a proper name or definite description: this is
cataphora. E.g., at the start of a discourse:
Although she couldn’t see any dogs, Kim was sure she’d heard barking.
(28) A little girl is at the door — see what she wants, please?
(29) My dog has hurt his foot — he is in a lot of pain.
(30) * My dog has hurt his foot — it is in a lot of pain.
Complications include the gender neutral they (some dialects), use of they with everybody, group nouns, conjunctions
and discontinuous sets:
(31) Somebody’s at the door — see what they want, will you?
(32) I don’t know who the new teacher will be, but I’m sure they’ll make changes to the course.
(33) Everybody’s coming to the party, aren’t they?
(34) The team played really well, but now they are all very tired.
(35) Kim and Sandy are asleep: they are very tired.
(36) Kim is snoring and Sandy can’t keep her eyes open: they are both exhausted.
9.7 Reflexives
(37) John_i cut himself_i shaving. (himself = John, subscript notation used to indicate this)
(38) # John_i cut him_j shaving. (i ≠ j: a very odd sentence)
The informal and not fully adequate generalisation is that reflexive pronouns must be co-referential with a preceding
argument of the same verb (i.e., something it subcategorizes for), while non-reflexive pronouns cannot be. In
linguistics, the study of intra-sentential anaphora is known as binding theory:
9.9 Salience
There are a number of effects related to the structure of the discourse which cause particular pronoun antecedents to
be preferred, after all the hard constraints discussed above are taken into consideration.
Recency: More recent antecedents are preferred. Only relatively recently referred to entities are accessible.
(44) Kim has a big car. Sandy has a small one. Lee likes to drive it.
it preferentially refers to Sandy’s car, rather than Kim’s.
Grammatical role: Subjects > objects > everything else:
(45) Fred went to the Grafton Centre with Bill. He bought a CD.
he is more likely to be interpreted as Fred than as Bill.
Repeated mention: Entities that have been mentioned more frequently are preferred:
(46) Fred was getting bored. He decided to go shopping. Bill went to the Grafton Centre with Fred. He
bought a CD.
He = Fred (maybe) despite the general preference for subjects.
Parallelism: Entities which share the same role as the pronoun in the same sort of sentence are preferred:
(47) Bill went with Fred to the Grafton Centre. Kim went with him to Lion Yard.
him = Fred, because the parallel interpretation is preferred.
Coherence effects: The pronoun resolution may depend on the rhetorical/discourse relation that is inferred.
(48) Bill likes Fred. He has a great sense of humour.
He = Fred preferentially, possibly because the second sentence is interpreted as an explanation of the first, and
having a sense of humour is seen as a reason to like someone.
9.10 Lexical semantics and world knowledge effects
The made-up examples above were chosen so that the meaning of the utterance did not determine the way the pronoun
was resolved. In real examples, world knowledge may override salience effects. For instance (from Radio 5):
(49) Andrew Strauss again blamed the batting after England lost to Australia last night. They now lead the series
three-nil.
Here they has to refer to Australia, despite the general preference for subjects as antecedents. The analysis required
to work this out is actually non-trivial: you might like to try writing down some plausible meaning postulates which
would block the inference that they refers to England. (Note also the plural pronoun with singular antecedent, which
is normal for sports teams.)
Note, however, that violation of salience effects can easily lead to an odd discourse:
(50) The England football team won last night. Scotland lost. ? They have qualified for the World Cup with a
100% record.
Systems which output natural language discourses, such as summarization systems, have to keep track of anaphora to
avoid such problems.
Niall Ferguson is prolific, well-paid and a snappy dresser. Stephen Moss hated him — at least until he
spent an hour being charmed in the historian’s Oxford study.
Pronoun he, candidate antecedents: Niall Ferguson, a snappy dresser, Stephen Moss, him, an hour, the
historian, the historian’s Oxford study.
Notice that this simple approach leads to a snappy dresser being included as a candidate antecedent and that a
choice had to be made as to how to treat the possessive. I’ve included the possibility of cataphors, although these are
sufficiently rare that they are often excluded.
For each such pairing, we build a feature vector27 using features corresponding to some of the factors discussed in the
previous sections. For instance (using t/f rather than 1/0 for binary features for readability):
Sentence distance: discrete, { 0, 1, 2, . . . }. The number of sentences between pronoun and candidate.
Grammatical role: discrete, { subject, object, other }. The role of the potential antecedent.
Parallel: binary, t if the potential antecedent and the pronoun share the same grammatical role.
Linguistic form: discrete, { proper, definite, indefinite, pronoun }. This indicates something about the syntax of the
potential antecedent noun phrase.
pronoun  antecedent      cataphoric  num  gen  same  distance  role  parallel  form
him      Niall Ferguson  f           t    t    f     1         subj  f         prop
him      Stephen Moss    f           t    t    t     0         subj  f         prop
him      he              t           t    t    f     0         subj  f         pron
he       Niall Ferguson  f           t    t    f     1         subj  t         prop
he       Stephen Moss    f           t    t    f     0         subj  t         prop
he       him             f           t    t    f     0         obj   f         pron
Notice that with this set of features, we cannot model the “repeated mention” effect mentioned in §9.9. It would be
possible to model it with a classifier-based system, but it requires that we keep track of the coreferences that have
been assigned and thus that we maintain a model of the discourse as individual pronouns are resolved. I will return to
the issue of discourse models below. Coherence effects are very complex to model and world knowledge effects are
indefinitely difficult (AI-complete in the limit), so both of these are excluded from this simple feature set. Realistic
systems use many more features and values than shown here and can approximate some partial world knowledge via
classification of named entities, for instance.
To implement the classifier, we require some knowledge of syntactic structure, but not necessarily full parsing. We
could approximately determine noun phrases and grammatical role by means of a series of regular expressions over
POS-tagged data instead of using a full parser. Even if a full syntactic parser is available, it may be necessary to
augment it with special purpose rules to detect pleonastic pronouns.
The training data for this task is produced from a corpus which is marked up by humans with pairings between
pronouns and antecedent phrases. The classifier uses the marked-up pairings as positive examples (class TRUE), and
all other possible pairings between the pronoun and a candidate antecedent as negative examples (class FALSE). For
instance, if the pairings above were used as training data, we would have:
Note the prelecture exercise which suggests that you participate in an online experiment to collect training data. If you
do this, you will discover a number of complexities that I have ignored in this account.
In very general terms, a supervised classifier uses the training data to determine an appropriate mapping (i.e., hypoth-
esis in the terminology used in the Part 1B AI course) from feature vectors to classes. This mapping is then used when
classifying the test data. To make this more concrete, if we are using a probabilistic approach, we want to choose the
class c out of the set of classes C ({ TRUE, FALSE } here) which is most probable given a feature vector \vec{f}:
\[ \hat{c} = \operatorname{argmax}_{c \in C} P(c \mid \vec{f}) \]
(See §3.5 for the explanation of argmax and ĉ.) As with the POS tagging problem, for a realistic feature space, we will
be unable to model this directly. The Naive Bayes classifier rewrites this formula using Bayes' Theorem and then treats
the features as conditionally independent given the class (the independence assumption is the “naive” part). That is:
\[ P(c \mid \vec{f}) = \frac{P(\vec{f} \mid c) \, P(c)}{P(\vec{f})} \]
As with the models discussed in Lecture 3, we can ignore the denominator because it is constant across classes, hence:
\[ \hat{c} = \operatorname{argmax}_{c \in C} P(\vec{f} \mid c) \, P(c) \]
Treating the features as independent means taking the product of the probabilities of the individual features in
\vec{f} for the class:
\[ \hat{c} = \operatorname{argmax}_{c \in C} P(c) \prod_{i=1}^{n} P(f_i \mid c) \]
In practice, the Naive Bayes model is often found to perform well even with a set of features that are clearly not
independent.
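To make the formula concrete, here is a small sketch of a Naive Bayes classifier over discrete features, trained on the six pairings tabulated earlier. Several things are assumptions rather than part of these notes: the class labels are my reading of the passage (him = Niall Ferguson, he = Stephen Moss), add-one smoothing is used to avoid zero probabilities, and log probabilities are summed instead of multiplying probabilities. Six examples are of course far too few to learn anything useful.

```python
import math
from collections import Counter, defaultdict

FEATURES = ["cataphoric", "num", "gen", "same", "distance", "role", "parallel", "form"]

# (feature values, class) for each pairing; labels are an assumed annotation.
ROWS = [
    (("f", "t", "t", "f", 1, "subj", "f", "prop"), True),   # him / Niall Ferguson
    (("f", "t", "t", "t", 0, "subj", "f", "prop"), False),  # him / Stephen Moss
    (("t", "t", "t", "f", 0, "subj", "f", "pron"), False),  # him / he
    (("f", "t", "t", "f", 1, "subj", "t", "prop"), False),  # he / Niall Ferguson
    (("f", "t", "t", "f", 0, "subj", "t", "prop"), True),   # he / Stephen Moss
    (("f", "t", "t", "f", 0, "obj", "f", "pron"), False),   # he / him
]
TRAINING = [(dict(zip(FEATURES, values)), label) for values, label in ROWS]


def train(examples):
    """Count classes and (class, feature, value) co-occurrences."""
    class_counts = Counter(label for _, label in examples)
    value_counts = defaultdict(Counter)   # (label, feature) -> Counter of values
    seen_values = defaultdict(set)        # feature -> values seen in training
    for features, label in examples:
        for name, value in features.items():
            value_counts[(label, name)][value] += 1
            seen_values[name].add(value)
    return class_counts, value_counts, seen_values


def log_score(model, features, label):
    """log P(c) + sum_i log P(f_i | c), with add-one smoothing."""
    class_counts, value_counts, seen_values = model
    score = math.log(class_counts[label] / sum(class_counts.values()))
    for name, value in features.items():
        count = value_counts[(label, name)][value]
        score += math.log((count + 1) / (class_counts[label] + len(seen_values[name])))
    return score


def classify(model, features):
    """Return the class with the highest score: the argmax over C."""
    class_counts = model[0]
    return max(class_counts, key=lambda label: log_score(model, features, label))


model = train(TRAINING)
for features, label in TRAINING:
    print("annotated:", label, "predicted:", classify(model, features))
```

Each pairing is classified on its own, so nothing stops the classifier from accepting two antecedents for one pronoun, or none; a real system needs a policy for choosing among the candidates.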
There are fundamental limitations on performance caused by treating the problem as classification of individual
pronoun-antecedent pairs rather than as building a discourse model including all the coreferences. Inability to im-
plement ‘repeated mention’ is one such limitation, another is the inability to use information gained from one linkage
in resolving further pronouns. Consider yet another ‘team’ example:
(51) Sturt think they can perform better in Twenty20 cricket. It requires additional skills compared with older
forms of the limited over game.
A classifier which treats each pronoun entirely separately might well end up resolving the it at the start of the second
sentence to Sturt rather than the correct Twenty20 cricket. However, if we already know that they corefers with Sturt,
coreference with it will be dispreferred because number agreement does not match (recall from §9.6 that pronoun
agreement has to be consistent). This type of effect is especially relevant when general coreference resolution is
considered. One approach is to run a simple classifier initially to acquire probabilities of links and to use those results
as the input to a second system which clusters the entities to find an optimal solution. I will not discuss this further
here, however.
Sally met Andrew in town and took him to the new restaurant. He was impressed.
Our algorithm has successfully linked the coreferring expressions, but if we consider the evaluation approach of
comparing the individual links to the test material, it will be penalised. Of course it is trivial to take the transitive
closure of the links, but it is not easy to develop an evaluation metric that correctly allows for this and does not, for
example, unfairly reward algorithms that link all the pronouns together into one cluster. As a consequence of this sort
of issue, it has been difficult to develop agreed metrics for evaluation.
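Taking the transitive closure of the links can be sketched with a union-find structure. The scenario is an assumption made for illustration: the system links He to him while the annotation links He directly to Andrew. Mentions are represented as plain strings here, whereas a real system would identify them by position in the text.

```python
def coreference_clusters(links):
    """Group mentions into clusters given pairwise coreference links,
    i.e., take the transitive closure of the link relation."""
    parent = {}

    def find(mention):
        parent.setdefault(mention, mention)
        while parent[mention] != mention:
            parent[mention] = parent[parent[mention]]   # path halving
            mention = parent[mention]
        return mention

    for a, b in links:
        parent[find(a)] = find(b)

    clusters = {}
    for mention in parent:
        clusters.setdefault(find(mention), set()).add(mention)
    return list(clusters.values())


system_links = [("him", "Andrew"), ("He", "him")]
gold_links = [("him", "Andrew"), ("He", "Andrew")]

# Link-by-link comparison finds only one of the two gold links ...
print(set(system_links) & set(gold_links))
# ... although both sets of links describe the same cluster.
print(coreference_clusters(system_links))
print(coreference_clusters(gold_links))
```

Comparing clusters removes the penalty in this case, but as noted above, a cluster-based metric must also avoid rewarding a system that puts every pronoun in one cluster.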
9.13 Statistical classification in language processing
Many problems in natural language can be treated as classification problems: besides pronoun resolution, we have seen
sentiment classification and word sense disambiguation, which are straightforward examples of classification. POS-
tagging is also a form of classification, but there we take the tag sequence of highest probability rather than considering
each tag separately. As we have seen above, we actually need to consider relationships between coreferences to model
some discourse effects.
Pronoun resolution has a more complex feature set than the previous examples of classification that we’ve seen and
determination of some of the features requires considerable processing, which is itself error prone. A statistical clas-
sifier is somewhat robust to this, assuming that the training data features have been assigned by the same mechanism
as used in the test system. For example, if the grammatical role assignment is unreliable, the weight assigned to that
feature might be less than if it were perfect.
One serious disadvantage of supervised classification is reliance on training data, which is often expensive and difficult
to obtain and may not generalise across domains. Research on unsupervised methods is therefore popular.
There are no hard and fast rules for choosing which statistical approach to classification to use on a given task. Many
NLP researchers are only interested in classifiers as tools for investigating problems: they may either simply use the
same classifier that previous researchers have tried or experiment with a range of classifiers using a toolkit such as
WEKA.28
Performance considerations may involve speed as well as accuracy: if a lot of training data is available, then a classifier
with faster performance in the training phase may enable one to use more of the available data. The research issues
in developing a classifier-based algorithm for an NLP problem generally center around specification of the problem,
development of the labelling scheme and determination of the feature set to be used.
28 http://www.cs.waikato.ac.nz/ml/weka/ Ian H. Witten and Eibe Frank (2005) “Data Mining: Practical machine learning tools
10 Lecture 10: Generation
“Generation from what?!” (attributed to Yorick Wilks)
In the lectures so far, we have concentrated on analysis, but there is also work on generating text. Although Natural
Language Generation (NLG) is a recognised subfield of NLP, there are far fewer researchers working in this area than
on analysis. There are some recognised subproblems within NLG, some of which will be discussed below, but it is not
always easy to break it down into subareas and in some cases, systems which create text are not considered to be NLG
systems (e.g., when they are a component of an MT system). The main problem, as the quotation above indicates, is
that there is no commonly agreed starting point for NLG. The possible starting points include:
Generating from data usually requires the help of some domain experts to provide interpretations.
Such systems can be contrasted with regeneration systems, which start from text and produce reformulated text. Here
the options include:
• Constructing coherent sentences from partially ordered sets of words: e.g., in statistical MT (although this might
better be considered as a form of realization).
• Paraphrase.
• Summarization (single- or multi- document).
• Article construction from text fragments.
• Text simplification.
Content determination: deciding what information to convey. This often involves selecting information from a body
of possible pieces of information. e.g., weather reports — may be a mass of information, not all of which is
relevant. The content may vary according to the type of user (e.g., expert/non-expert). This step often involves
domain experts: i.e., people who know how to interpret the raw data.
Discourse structuring: the broad structure of the text or dialogue (Dale and Mellish refer to document structuring).
For instance, scientific articles may have an abstract, introduction, methods, results, comparison, conclusion:
rearranging the text makes it incoherent. In a dialogue system, there may be multiple messages to convey to the
user, and the order may be determined by factors such as urgency.
Aggregation: deciding how information may be split into sentence-sized chunks: i.e., finer-grained than document
structuring.
Referring expression generation: deciding when to use pronouns, how many modifiers to include and so on.
Lexical choice: deciding which lexical items to use to convey a given concept. This may be straightforward for many
applications (in a limited-domain system, there will be a preferred knowledge base to lexical item mapping,
for instance) but it may be useful to vary this.
Surface realization: mapping from a meaning representation for an individual sentence to a string (or speech output).
This is generally taken to include morphological generation.
Fluency ranking: this is not included by Dale and Mellish but is very important in modern approaches. A large
grammar will generate many strings for a given meaning representation. In fact, in many approaches, the
grammar will overgenerate and produce a mixture of grammatical and ungrammatical strings. A fluency ranking
component, which at its simplest is based on ngrams (compare the discussion of prediction in lecture 3), is used
to rank such outputs. In the extreme case, no grammar is used, and the fluency ranking is performed on a
partially ordered set of words (for instance in SMT).
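The extreme case of fluency ranking, with no grammar at all, can be sketched with a bigram model that scores every ordering of a bag of words. The three-sentence training corpus is invented for illustration, add-one smoothing is an assumption, and enumerating all permutations is only feasible for very short inputs.

```python
import math
from collections import Counter
from itertools import permutations


def train_bigrams(sentences):
    """Collect history counts and bigram counts, with sentence boundaries."""
    histories, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        histories.update(words[:-1])            # each word as a bigram history
        bigrams.update(zip(words, words[1:]))
    vocab = {w for s in sentences for w in s.split()} | {"</s>"}
    return histories, bigrams, len(vocab)


def log_prob(model, sentence):
    """Add-one smoothed bigram log probability of a sentence."""
    histories, bigrams, vocab_size = model
    words = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(prev, w)] + 1) / (histories[prev] + vocab_size))
        for prev, w in zip(words, words[1:])
    )


def rank_orderings(model, words):
    """Return all orderings of the words, most fluent first."""
    candidates = [" ".join(p) for p in permutations(words)]
    return sorted(candidates, key=lambda s: log_prob(model, s), reverse=True)


# Toy corpus, made up for this sketch.
corpus = ["india beat sri lanka", "india won the match", "sri lanka lost the match"]
model = train_bigrams(corpus)

ranked = rank_orderings(model, ["lanka", "beat", "india", "sri"])
print(ranked[0])
print(ranked[-1])
```

With this corpus the ordering seen in training, india beat sri lanka, is ranked first because it is the only ordering whose bigrams are all attested.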
When Dale and Mellish produced this classification, there was very little statistical work within NLG (work on SMT
had started but wasn’t very successful at that point) and NLG was essentially about limited domains. In recent years,
there have been statistical approaches to all these subtasks. Most NLG systems are still limited domain however,
although regeneration systems are usually not. Many of the approaches we have described in the previous lectures
can be adapted to be useful in generation. Deep grammars may be bidirectional, and therefore usable for surface
realization as well as for parsing. Distributions are relevant for lexical choice: e.g., in deciding whether one word is
substitutable for another. However, many practical NLG systems are based on extensive use of templates (i.e., fixed
text with slots which can be filled in) and this is satisfactory for many tasks, although it tends to lead to rather stilted
output.
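A template-based generator can be sketched in a few lines. The template wording, slot names and match data below are assumptions chosen for illustration, not taken from any particular system.

```python
# Hypothetical templates: fixed text with named slots.
RESULT_TEMPLATE = "{winner} beat {loser} by {margin} runs."
INNINGS_TEMPLATE = (
    "{batsman} made {runs} off {balls} balls with {fours} fours and {sixes}."
)


def describe_sixes(count):
    """Hand-written variation so that the slot filler reads naturally."""
    if count == 0:
        return "no sixes"
    if count == 1:
        return "a six"
    return f"{count} sixes"


def match_report(data):
    """Fill both templates from a dictionary of match data."""
    result = RESULT_TEMPLATE.format(
        winner=data["winner"], loser=data["loser"], margin=data["margin"]
    )
    innings = INNINGS_TEMPLATE.format(
        batsman=data["batsman"],
        runs=data["runs"],
        balls=data["balls"],
        fours=data["fours"],
        sixes=describe_sixes(data["sixes"]),
    )
    return f"{result} {innings}"


match = {
    "winner": "India", "loser": "Sri Lanka", "margin": 63,
    "batsman": "Tendulkar", "runs": 113, "balls": 102, "fours": 12, "sixes": 1,
}
report = match_report(match)
print(report)
```

Every report produced this way has the same shape, and each special case (such as a single six) needs its own hand-written rule, which is one reason template output tends to be stilted.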
The input Cricket produces quite rich summaries of the progress of a game, in the form of scorecards. The example
below is part of a scorecard from a one day international match.
The output Here is part of an actual human-written report (from Wisden) roughly corresponding to some of the data
shown:
The highlight of a meaningless match was a sublime innings from Tendulkar, . . . he drove with elan to
make 113 off just 102 balls with 12 fours and a six.
This requires additional knowledge compared to the scorecard: the previous results of the series which made the
result here meaningless and (perhaps) knowledge of Tendulkar’s play. Without such information, a possible automatic
summary would be:
India beat Sri Lanka by 63 runs. Tendulkar made 113 off 102 balls with 12 fours and a six. . . .
• Granularity: we need to be able to consider individual (minimal?) information chunks (sometimes called
‘factoids’).
• Abstraction: generalize over instances.
• Faithfulness to source versus closeness to natural language?
• Inferences over data (e.g., amalgamation of scores)?
• Formalism: this might depend on the chosen methodology for realization.
A simple approach is to use a combination of predicates and constants to express the factoids, though note we need to
be careful with abstraction levels:
name(team1/player4, Tendulkar)
balls-faced(team1/player4, 102)
...
Content selection There are thousands of factoids in each scorecard: we need to select the most important. For
instance, the overall match result, good scores by a batsman. In this case, it is clear that Tendulkar’s performance was
the highlight:
name(team1, India)
total(team1, 304)
name(team2, Sri Lanka)
result(win, team1, 63)
name(team1/player4, Tendulkar)
runs(team1/player4, 113)
balls-faced(team1/player4, 102)
fours(team1/player4, 12)
sixes(team1/player4, 1)
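One simple way to make this selection is with hand-written rules over the factoids. The sketch below is only an illustration: the `Factoid` type, the threshold of 50 runs for a "good score" and the extra low-scoring player are assumptions added for the example.

```python
from typing import NamedTuple


class Factoid(NamedTuple):
    predicate: str
    args: tuple


factoids = [
    Factoid("name", ("team1", "India")),
    Factoid("total", ("team1", 304)),
    Factoid("name", ("team2", "Sri Lanka")),
    Factoid("result", ("win", "team1", 63)),
    Factoid("name", ("team1/player4", "Tendulkar")),
    Factoid("runs", ("team1/player4", 113)),
    Factoid("balls-faced", ("team1/player4", 102)),
    Factoid("fours", ("team1/player4", 12)),
    Factoid("sixes", ("team1/player4", 1)),
    # Hypothetical low-scoring player, invented for this example.
    Factoid("name", ("team1/player9", "PlayerNine")),
    Factoid("runs", ("team1/player9", 3)),
    Factoid("balls-faced", ("team1/player9", 7)),
]

GOOD_SCORE = 50  # assumed threshold for a score worth reporting


def select(all_factoids):
    """Keep match-level factoids and everything about high-scoring players."""
    notable = {
        f.args[0]
        for f in all_factoids
        if f.predicate == "runs" and f.args[1] >= GOOD_SCORE
    }
    selected = []
    for f in all_factoids:
        entity = f.args[0]
        is_player = "/player" in str(entity)
        if not is_player or entity in notable:
            selected.append(f)
    return selected


selected = select(factoids)
for f in selected:
    print(f.predicate, f.args)
```

Rules like these are easy to write for one sport but have to be rewritten for each new domain.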
Discourse structure and (first stage) aggregation This requires that we distribute data into sections and decide on
overall ordering:
Sports reports often state the highlights and then describe key events in chronological order.
Predicate choice (lexical selection) If we were using a deep grammar for realization, we would need to construct
an input representation that corresponded to that used by the grammar.
Mapping rules from the initial scorecard predicates:
This gives:
Realistic systems would have multiple mapping rules. This process may require refinement of aggregation.
named(t1p4, ‘Tendulkar’), made_v(e,t1p4,r), card(r,‘113’), run(r), off_p(e,b), ball(b), card(b,‘102’), with_p(e,f),
card(f,‘12’), four_n(f), with_p(e,s), card(s,‘1’), six_n(s)
corresponds to:
Tendulkar made 113 runs off 102 balls with 12 fours with 1 six.
with_p(e,f), card(f,‘12’), four_n(f), with_p(e,s), card(s,‘1’), six_n(s)
into:
Also: ‘113 runs’ to ‘113’ because, in this context, it is obvious that 113 would refer to runs.
Tendulkar made 113 off 102 balls with 12 fours and one six.
Tendulkar made 113 with 12 fours and one six off 102 balls.
...
113 off 102 balls was made by Tendulkar with 12 fours and one six.
The highlight of a meaningless match was a sublime innings from Tendulkar (team1 player4), . . . and
this time he drove with elan to make 113 (team1 player4 R) off just 102 (team1 player4 B) balls with 12
(team1 player4 4s) fours and a (team1 player4 6s) six.
Content selection can now be treated as a classification problem: all possible factoids are derived from the data source
and each is classified as in or out, based on training data. The factoids can be categorized into classes and grouped. One
problem that Kelly found was that the straightforward technique returns ‘meaningless’ factoids, e.g. player names with
no additional information about their performance. This can be avoided, but there’s a danger of the approach becoming
task-specific.
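The classification view of content selection, and the problem of bare player names, can be illustrated with a deliberately crude sketch. It is not Kelly's method: the only feature is the predicate type, the training examples are invented, and the 0.5 threshold is an assumption.

```python
from collections import defaultdict

# Hypothetical training data: (predicate, was it mentioned in the human report?)
TRAINING = [
    ("result", True), ("result", True), ("result", True),
    ("runs", True), ("runs", True), ("runs", False),
    ("name", True), ("name", True), ("name", False),
    ("balls-faced", True), ("balls-faced", False), ("balls-faced", False),
]


def train(examples):
    """Estimate, per predicate, how often such factoids appear in reports."""
    seen = defaultdict(int)
    used = defaultdict(int)
    for predicate, mentioned in examples:
        seen[predicate] += 1
        used[predicate] += int(mentioned)
    return {p: used[p] / seen[p] for p in seen}


def classify(candidates, rates, threshold=0.5):
    """Label a factoid 'in' if its predicate is usually mentioned."""
    return [f for f in candidates if rates.get(f[0], 0.0) > threshold]


def drop_bare_names(selected):
    """Remove names of entities about which nothing else was selected."""
    described = {f[1] for f in selected if f[0] != "name"}
    return [f for f in selected if f[0] != "name" or f[1] in described]


# Factoids as (predicate, entity, value); player9 is invented for the example.
candidates = [
    ("result", "team1", 63),
    ("name", "team1", "India"),
    ("name", "team1/player4", "Tendulkar"),
    ("runs", "team1/player4", 113),
    ("name", "team1/player9", "PlayerNine"),
    ("balls-faced", "team1/player9", 7),
]

rates = train(TRAINING)
labelled_in = classify(candidates, rates)
final = drop_bare_names(labelled_in)
print(final)
```

Before the final filter, the name of player9 is selected even though nothing about that player's performance is, which is the kind of ‘meaningless’ factoid described above.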
Discourse structuring then involves generalising over reports to see where particular information types are presented.
Fluency ranking and predicate choice could straightforwardly use the collected texts. Kelly didn’t do these steps but
they have been done in other contexts (e.g., Wikipedia article generation).
10.4 Referring expressions
One subtopic in NLG which has been extensively investigated is generation of referring expressions: given some
information about an entity, how do we choose to refer to it? This has several aspects: we have to decide whether to
use ellipsis or coordination (as in the cricket example above). We also have to decide which grammatical category of
expression to use: pronouns, proper names or definite expressions for instance. It is necessary to make sure that if a
pronoun is used it will be correctly resolved: there are some bidirectional approaches to anaphora but an alternative is
to generate and test using a standard anaphora resolution algorithm as discussed in the previous lecture. Finally, if we
choose to use a full noun phrase we need to consider attribute selection: that is we need to include enough modifiers
to distinguish the expression from possible distractors in the discourse context: e.g., the dog, the big dog, the big dog
in the basket. This last aspect has received the most attention and will be described in more detail here.
Emiel Krahmer, Sebastiaan van Erk and André Verleg (2003), ‘Graph-based generation of referring expressions’, describe
a meta-algorithm for generating referring expressions: a way of thinking about and/or implementing a variety
of algorithms for generating referring expressions which had been discussed in the literature. In their approach, a
situation is described by predicates in a knowledge base (KB): these can be thought of as arcs on a graph, with nodes
corresponding to entities. An example from their paper is given below.
A description (e.g., the dog, the small brown dog) is then a graph with unlabelled nodes: it matches the KB graph if it
can be ‘placed over’ it (formally, this is subgraph isomorphism). A distinguishing graph is one that refers to only one
entity (i.e., it can only be placed over the KB graph in one way). If we have a description that can refer to entities other
than the one we want, the other entities are referred to as distractors. In general, there will be multiple distinguishing
graphs: we’re looking for the one with the lowest cost — what distinguishes algorithms is essentially their definition
of cost.
The algorithm starts from the node which we wish to describe and expands the graph by adding adjacent edges (so
we always have a connected subgraph). The cost function is given by a positive number associated with an edge. We
want the cheapest graph which has no distractors: we explore the search space so we never consider graphs which are
more expensive than the best one we have already. If we put an upper bound K on the number of edges in a distractor,
the complexity is n^K (i.e., technically it’s polynomial, since K is a constant, and probably it’s going to be less than
5). Various algorithms then correspond to the use of different weights.
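The search can be sketched under a strong simplifying assumption: every entity has only unary properties (no relations between entities), so a description is a set of properties rather than a graph, and matching is a subset test rather than subgraph isomorphism. The toy KB, entity names and function names are invented for illustration.

```python
# Hypothetical KB: entity -> set of unary properties.
KB = {
    "d1": {"dog", "small", "brown"},
    "d2": {"dog", "large", "brown"},
    "d3": {"dog", "small", "black"},
    "c1": {"cat", "small", "brown"},
}


def distractors(target, description, kb):
    """Other entities that the description also fits."""
    return {e for e, props in kb.items() if e != target and description <= props}


def cheapest_description(target, kb, cost):
    """Branch-and-bound search for the cheapest distinguishing property set.

    Returns (total_cost, properties), or None if nothing distinguishes target.
    """
    props = sorted(kb[target])
    best = None

    def expand(description, total, start):
        nonlocal best
        # Costs are positive, so a partial description that already costs
        # at least as much as the best solution cannot lead to a cheaper one.
        if best is not None and total >= best[0]:
            return
        if not distractors(target, description, kb):
            best = (total, description)
            return
        for i in range(start, len(props)):
            expand(description | {props[i]}, total + cost(props[i]), i + 1)

    expand(frozenset(), 0, 0)
    return best


def unit_cost(prop):
    """Every property costs 1, so the cheapest description is the shortest."""
    return 1


print(cheapest_description("d2", KB, unit_cost))
print(cheapest_description("d1", KB, unit_cost))
```

Passing a different `cost` function changes which description counts as best without changing the search itself, which is the sense in which different algorithms correspond to different weights.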
The full brevity algorithm is described in Dale (1992): in terms of the meta-algorithm, it’s equivalent to giving each
arc a weight of 1. It is guaranteed to produce the shortest possible expression (in terms of logical form rather than the
string). Dale (1992) also describes a greedy heuristic, which can be emulated by assuming that the edges are ordered
by discriminating power. This gives a smaller search space.
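A greedy selection of this kind can be sketched as follows, again assuming unary properties only. Here discriminating power is taken to be the number of remaining distractors a property rules out; the toy KB and all names are invented, and this is an illustration rather than a reconstruction of Dale's algorithm.

```python
# Hypothetical KB: entity -> set of unary properties.
KB = {
    "d1": {"dog", "small", "brown"},
    "d2": {"dog", "large", "brown"},
    "d3": {"dog", "small", "black"},
    "c1": {"cat", "small", "brown"},
}


def greedy_description(target, kb):
    """Repeatedly add the property that rules out the most distractors.

    Returns (properties in the order chosen, distractors still remaining).
    """
    remaining = {e for e in kb if e != target}
    available = set(kb[target])
    description = []
    while remaining and available:
        # Sorting makes the choice deterministic when several properties tie.
        best = max(
            sorted(available),
            key=lambda p: sum(1 for e in remaining if p not in kb[e]),
        )
        ruled_out = {e for e in remaining if best not in kb[e]}
        if not ruled_out:
            break  # no remaining property helps
        description.append(best)
        remaining -= ruled_out
        available.discard(best)
    return description, remaining


print(greedy_description("d2", KB))
print(greedy_description("d1", KB))
```

The greedy version never backtracks, so it explores far fewer candidate descriptions, at the price of no longer guaranteeing the shortest one in general.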
However, subsequent experiments suggest that these algorithms may not give similar results to humans. One issue is
that verbosity isn’t always a bad thing: we don’t want the user to have to make complex inferences to determine a
referent, and sometimes slightly longer phrases might be useful to reinforce a salient point. Occasionally, they even
just sound better. In dialogue contexts it may be advisable to be verbose because it sounds more polite or just because
it improves the user’s chance of understanding a speech synthesizer. So later work on referring expressions has relaxed
the requirement to minimize noun phrase modifiers.
There are also difficulties associated with the idea that the algorithm is run using KB predicates without knowledge
of syntax of the natural language expressions. For example, consider the alternative terms, earlier and before. In a
domain about diaries and meetings, these lexemes might be considered to map to the same KB predicate. However,
they differ in their syntax:
This matters, because when generating expressions where two entities are required, we need to be able to generate
good referring expressions for both entities. Hence the abstraction of considering referring expressions in terms of
KB predicates is not entirely satisfactory. Furthermore, referring expression generation is also needed in regeneration
contexts without a limited domain and hence where there is no KB. For instance, after simplifying a text by replacing
relative clauses by new sentences, it is often necessary to reformulate the referring expressions in the original text.
Taken together, these issues imply that corpus-based approaches to referring expression generation may be preferable.
11 Lecture 11: Computational psycholinguistics
No notes: copies of slides will be made available after the lecture.
AI-complete A half-joking term, applied to problems that would require a solution to the problem of representing the
world and acquiring world knowledge (lecture 1).
agreement The requirement for two phrases to have compatible values for grammatical features such as number and
gender. For instance, in English, dogs bark is grammatical but dog bark and dogs barks are not. See IGE.
(lecture 5)
ambiguity The same string (or sequence of sounds) meaning different things. Contrasted with vagueness.
anaphora The phenomenon of referring to something that was mentioned previously in a text. An anaphor is an
expression which does this, such as a pronoun (see §9.5).
Sandy is an argument but on Tuesday is an adjunct. Arguments are specified by the subcategorization of a verb
etc. Also see the IGE. (lecture 5)
aspect A term used to cover distinctions such as whether a verb suggests an event has been completed or not (as
opposed to tense, which refers to the time of an event). For instance, she was writing a book vs she wrote a
book.
backoff Usually used to refer to techniques for dealing with data sparseness in probabilistic systems: using a more
general classification rather than a more specific one. For instance, using unigram probabilities instead of
bigrams; using word classes instead of individual words (lecture 3).
bag of words Unordered collection of words in some text.
baseline In evaluation, the performance produced by a simple system against which the experimental technique is
compared (§3.6).
bidirectional Usable for both analysis and generation (lecture 2).
case Distinctions between nominals indicating their syntactic role in a sentence. In English, some pronouns show a
distinction: she is used for subjects, while her is used for objects (e.g., she likes her vs *her likes she).
Languages such as German and Latin mark case much more extensively.
ceiling In evaluation, the performance produced by a ‘perfect’ system (such as human annotation) against which the
experimental technique is compared (§3.6).
CFG context-free grammar.
chart parsing See §4.5.
Chomsky Noam Chomsky, professor at MIT. His work underlies most modern approaches to syntax in linguistics.
Not so hot on probability theory.
classifier A system which assigns classes to items, usually using a machine learning approach.
closed class Refers to parts of speech, such as conjunction, for which all the members could potentially be enumerated
(lecture 3).
coherence See §9.2
collocation See §7.7
complement For the purposes of this course, an argument other than the subject.
compositionality The idea that the meaning of a phrase is a function of the meaning of its parts. compositional
semantics is the study of how meaning can be built up by semantic rules which mirror syntactic structure
(lecture 6).
constituent A sequence of words which is considered as a unit in a particular grammar (lecture 4).
constraint-based grammar A formalism which describes a language using a set of independently stated constraints,
without imposing any conditions on processing or processing order (lecture 5).
context The situation in which an utterance occurs: includes prior utterances, the physical environment, background
knowledge of the speaker and hearer(s), etc etc. Nothing to do with context-free grammar.
corpus A body of text used in experiments (plural corpora). See §3.1.
cue phrases Phrases which indicate particular rhetorical relations.
denominal Something derived from a noun: e.g., the verb tango is a denominal verb.
dependency structure A syntactic or semantic representation that links words via relations. See §6.4
derivational morphology See §2.2
determiner See IGE or notes for prelecture exercises in lecture 3.
deverbal Something derived from a verb: e.g., the adjective surprised.
direct object See IGE. Contrast indirect object.
distributional semantics Representing word meaning by context of use (lecture 8).
error analysis In evaluation, working out what sort of errors are found for a given approach (§3.6).
expletive pronoun Another term for pleonastic pronoun: see §9.8.
feature Either: a labelled arc in a feature structure
Or: a characteristic property used in machine learning.
full-form lexicon A lexicon where all morphological variants are explicitly listed (lecture 2).
generation The process of constructing text (or speech) from some input representation (lecture 10).
generative grammar The family of approaches to linguistics where a natural language is treated as governed by rules
which can produce all and only the well-formed utterances. Lecture 4.
genre Type of text: e.g., newspaper, novel, textbook, lecture notes, scientific paper. Note the difference to domain
(which is about the type of knowledge): it’s possible to have texts in different genres discussing the same domain
(e.g., discussion of human genome in newspaper vs textbook vs paper).
gloss An explanation/translation of an obscure/foreign word or phrase.
grammar Formally, in the generative tradition, the set of rules and the lexicon. Lecture 4.
head In syntax, the most important element of a phrase.
hearer Anyone on the receiving end of an utterance (spoken, written or signed). §1.3.
Hidden Markov Model See §3.5
indirect object The beneficiary in verb phrases like give a present to Sandy or give Sandy a present. In this case the
indirect object is Sandy and the direct object is a present.
interannotator agreement The degree of agreement between the decisions of two or more humans with respect to
some categorisation (§3.6).
language model A term generally used in speech recognition, for a statistical model of a natural language (lecture 3).
lemmatization Finding the stem and affixes for words (lecture 2).
lexical ambiguity Ambiguity caused because of multiple senses for a word.
lexicon The part of an NLP system that contains information about individual words (lecture 1).
linking Relating syntax and semantics in lexical entries (§6.3).
local ambiguity Ambiguity that arises during analysis etc, but which will be resolved when the utterance is
completely processed.
logical form The semantic representation constructed for an utterance (lecture 6).
long-distance dependency See §4.12
meaning postulates Inference rules that capture some aspects of the meaning of a word.
meronymy The ‘part-of’ lexical semantic relation (§7.2).
modifier Something that further specifies a particular entity or event: e.g., big house, shout loudly.
multiword expression A conventional phrase that has something idiosyncratic about it and therefore might be listed
in a dictionary.
mumble input Any unrecognised input in a spoken dialogue system (lecture 2).
n-gram A sequence of n words (§3.2).
named entity recognition Recognition and categorisation of person names, names of places, dates etc (lecture 4).
NL Natural language.
NLG Natural language generation (lecture 10).
ontology In NLP and AI, a specification of the entities in a particular domain and (sometimes) the relationships
between them. Often hierarchically structured.
open class Opposite of closed class.
orthographic rules Same as spelling rules (§2.3)
overgenerate Of a grammar, to produce strings which are invalid, e.g., because they are not grammatical according
to human judgements.
packing See §4.8
passive chart parsing See §4.6
pleonastic Non-referring (esp. of pronouns): see §9.8
predicate In logic, something that takes zero or more arguments and returns a truth value. (Used in IGE for the verb
phrase following the subject in a sentence, but I don’t use that terminology.)
prefix An affix that precedes the stem.
probabilistic context free grammars (PCFGs) CFGs with probabilities associated with rules (lecture 4).
realization Construction of a string from a meaning representation for a sentence or a syntax tree (lecture 10).
referring expression See §9.5
relative clause See IGE.
A restrictive relative clause is one which limits the interpretation of a noun to a subset: e.g. the students who
sleep in lectures are obviously overworking refers to a subset of students. Contrast non-restrictive, which is a
form of parenthetical comment: e.g. the students, who sleep in lectures, are obviously overworking means all
(or nearly all) are sleeping.
selectional restrictions Constraints on the semantic classes of arguments to verbs etc (e.g., the subject of think is
restricted to being sentient). The term selectional preference is used for non-absolute restrictions.
summarization Producing a shorter piece of text (or speech) that captures the essential information in the original.
synonymy Having the same meaning (§7.2).
syntax See §1.2
taxonomy Traditionally, the scheme of classification of biological organisms. Extended in NLP to mean a hierarchical
classification of word senses. The term ontology is sometimes used in a rather similar way, but ontologies tend
to be classifications of domain-knowledge, without necessarily having a direct link to words, and may have a
richer structure than a taxonomy.
template In feature structure grammars, see §5.5
utterance A piece of speech or text (sentence or fragment) generated by a speaker in a particular context.
vagueness Of word meanings, contrasted with ambiguity : see §7.5.
verb See IGE or notes for prelecture exercises in lecture 3.
Exercises for NLP course, 2013
Notes on exercises
These exercises are organised by lecture. They are divided into two classes: prelecture and postlecture. The prelecture
exercises are intended to review the basic concepts that you’ll need to fully understand the lecture. Depending on your
background, you may find these trivial or you may need to read the notes, but in either case they shouldn’t take more
than a few minutes. The first one or two examples generally come with answers, other answers are at the end (where
appropriate).
Answers to the postlecture exercises are available to supervisors (where appropriate). These are mostly intended as
quick exercises to check understanding of the lecture, though some are more open-ended.
A Lecture 1
A.1 Postlecture exercises
Without looking at any film reviews beforehand, write down 10 words which you think would be good indications of a
positive review (when taken in isolation) and 10 words which you think would be negative. Then go through a review
of a film and see whether you find there are more of your positive words than the negative ones. Are there words in
the review which you think you should have added to your initial lists?
Have a look at http://www.cl.cam.ac.uk/~aac10/stuff.html for pointers to sentiment analysis data
used in experiments.
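Once the word lists are written, the counting itself is easy to automate. In the sketch below the two word lists and the review text are placeholders invented for illustration; substitute your own lists and a real review.

```python
import re

# Placeholder word lists (assumptions): replace these with your own ten words.
POSITIVE = {"brilliant", "superb", "moving", "excellent", "wonderful"}
NEGATIVE = {"dull", "boring", "awful", "terrible", "poor"}


def count_indicators(review):
    """Return (positive count, negative count) for a review text."""
    tokens = re.findall(r"[a-z']+", review.lower())
    positive = sum(1 for t in tokens if t in POSITIVE)
    negative = sum(1 for t in tokens if t in NEGATIVE)
    return positive, negative


# Invented example review.
review = "A brilliant, moving film with superb acting, although the ending is dull."
positive, negative = count_indicators(review)
print(positive, negative)
```

Counting words in isolation ignores negation and context (for instance, not brilliant still counts as positive), which is worth bearing in mind when judging how well your lists work.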
B Lecture 2
B.1 Prelecture exercises
1. Split the following words into morphological units, labelling each as stem, suffix or prefix. If there is any
ambiguity, give all possible splits.
(a) dries
answer: dry (stem), -s (suffix)
(b) cartwheel
answer: cart (stem), wheel (stem)
(c) carries
(d) running
(e) uncaring
(f) intruders
(g) bookshelves
(h) reattaches
(i) anticipated
2. List the simple past and past/passive participle forms of the following verbs:
(a) sing
Answer: simple past sang, participle sung
(b) carry
(c) sleep
(d) see
Note that the simple past is used by itself (e.g., Kim sang well) while the participle form is used with an auxiliary (e.g.,
Kim had sung well). The passive participle is always the same as the past participle in English (e.g., Kim began the
lecture early, Kim had begun the lecture early, The lecture was begun early).
C Lecture 3
C.1 Pre-lecture
Label each of the words in the following sentences with their part of speech, distinguishing between nouns, proper
nouns, verbs, adjectives, adverbs, determiners, prepositions, pronouns and others. (Traditional classifications often
distinguish between a large number of additional parts of speech, but the finer distinctions won’t be important here.)
There are notes on part of speech distinctions below, if you have problems.
1. The brown fox could jump quickly over the dog, Rover. Answer: The/Det brown/Adj fox/Noun could/Verb(modal)
jump/Verb quickly/Adverb over/Preposition the/Det dog/Noun, Rover/Proper noun.
2. The big cat chased the small dog into the barn.
3. Those barns have red roofs.
4. Dogs often bark loudly.
Notes on parts of speech. These notes are English-specific and are just intended to help with the lectures and the
exercises: see a linguistics textbook for definitions! Some categories have fuzzy boundaries, but none of the complicated
cases will be important for this course.
Noun prototypically, nouns refer to physical objects or substances: e.g., aardvark, chainsaw, rice. But they can also
be abstract (e.g. truth, beauty) or refer to events, states or processes (e.g., decision). If you can say the X and
have a sensible phrase, that’s a good indication that X is a noun.
Pronoun something that can stand in for a noun: e.g., him, his
Proper noun / Proper name a name of a person, place etc: e.g., Elizabeth, Paris
Verb Verbs refer to events, processes or states but since nouns and adjectives can do this as well, the distinction
between the categories is based on distribution, not semantics. For instance, nouns can occur with determiners
like the (e.g., the decision) whereas verbs can’t (e.g., * the decide). In English, verbs are often found with
auxiliaries (be, have or do) indicating tense and aspect, and sometimes occur with modals, like can, could etc.
Auxiliaries and modals are themselves generally treated as subclasses of verbs.
Adjective a word that modifies a noun: e.g., big, loud. Most adjectives can also occur after the verb be and a few
other verbs: e.g., the students are unhappy. Numbers are sometimes treated as a type of adjective by linguists
but generally given their own category in traditional grammars. Past participle forms of verbs can also often be
used as adjectives (e.g., worried in the very worried man). Sometimes it’s impossible to tell whether something
is a participle or an adjective (e.g., the man was worried).
Adverb a word that modifies a verb: e.g. quickly, probably.
Determiner these precede nouns e.g., the, every, this. It is not always clear whether a word is a determiner or some
type of adjective.
Preposition e.g., in, at, with
Nouns, proper nouns, verbs, adjectives and adverbs are the open classes: new words can occur in any of these
categories. Determiners, prepositions and pronouns are closed classes (as are auxiliary and modal verbs).
C.2 Post-lecture
Try out one or more of the following POS tagging sites:
http://alias-i.com/lingpipe/web/demos.html
http://www.lingsoft.fi/demos.html
http://ucrel.lancs.ac.uk/claws/trial.html
http://l2r.cs.uiuc.edu/~cogcomp/pos_demo.php
The Lingpipe tagger uses an HMM approach as described in the lecture, the others use different techniques. Lingsoft
give considerably more information than the POS tag: their system uses hand-written rules.
Find two short pieces of naturally occurring English text, one of which you think should be relatively easy to tag
correctly and one which you predict to be difficult. Look at the tagged output and estimate the percentage of correct
tags in each case, concentrating on the open-class words. You might like to get another student to look at the same
output and see if you agree on which tags are correct.
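The two estimates in this exercise (percentage of correct tags, and how far two people agree about which tags are correct) can be computed as in the sketch below. The tag names follow the pre-lecture answer for sentence 1; the system output with one wrong tag and the two students' judgements are invented for illustration.

```python
OPEN_CLASSES = {"Noun", "Proper noun", "Verb", "Adj", "Adverb"}


def tag_accuracy(system_tags, gold_tags, classes=None):
    """Proportion of correct tags, optionally restricted to some gold classes."""
    pairs = [
        (s, g)
        for s, g in zip(system_tags, gold_tags)
        if classes is None or g in classes
    ]
    correct = sum(1 for s, g in pairs if s == g)
    return correct / len(pairs)


def percentage_agreement(judgements_a, judgements_b):
    """Proportion of items on which two judges made the same decision."""
    same = sum(1 for a, b in zip(judgements_a, judgements_b) if a == b)
    return same / len(judgements_a)


# The brown fox could jump quickly over the dog, Rover.
gold = ["Det", "Adj", "Noun", "Verb(modal)", "Verb", "Adverb",
        "Preposition", "Det", "Noun", "Proper noun"]
# Hypothetical tagger output with one error (brown tagged as a noun).
system = ["Det", "Noun", "Noun", "Verb(modal)", "Verb", "Adverb",
          "Preposition", "Det", "Noun", "Proper noun"]

overall = tag_accuracy(system, gold)
open_only = tag_accuracy(system, gold, OPEN_CLASSES)

# Each student marks every system tag as correct (True) or not (False).
student_a = [s == g for s, g in zip(system, gold)]
student_b = list(student_a)
student_b[3] = False  # invented disagreement about the modal
agreement = percentage_agreement(student_a, student_b)

print(overall, open_only, agreement)
```

Raw percentage agreement is the simplest possible measure and does not correct for agreement that would be expected by chance.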
D Lecture 4
D.1 Pre-lecture
Put brackets round the noun phrases and the verb phrases in the following sentences (if there is ambiguity, give two
bracketings):
1. The cat with white fur chased the small dog into the barn.
Answer: ((The cat)np with (white fur)np )np chased (the small dog)np into (the barn)np
The cat with white fur (chased the small dog into the barn)vp
2. The big cat with black fur chased the dog which barked.
Note that noun phrases consist of the noun, the determiner (if present) and any modifiers of the noun (adjective,
prepositional phrase, relative clause). This means that noun phrases may be nested. Verb phrases include the verb
and any auxiliaries, plus the object and indirect object etc (in general, the complements of the verb) and any adverbial
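modifiers (a modifier is something that further specifies a particular entity or event: e.g., big house, shout loudly). The verb phrase does not include the subject.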
D.2 Post-lecture
Using the CFG given in the lecture notes (section 4.3):
1. show the edges generated when parsing they fish in rivers in December with the simple chart parser in 4.7
2. show the edges generated for this sentence if packing is used (as described in 4.9)
3. show the edges generated for they fish in rivers if an active chart parser is used (as in 4.10)
E Lecture 5
E.1 Pre-lecture
The distinction between intransitive, transitive and ditransitive verbs can be illustrated by examples such as:
sleep — intransitive. No object is (generally) possible: * Kim slept the evening.
adore — transitive. An object is obligatory: *Kim adored.
give —- ditransitive. These verbs have an object and an indirect object. Kim gave Sandy an apple (or Kim gave an
apple to Sandy).
List three verbs that are intransitive only, three which are simple transitive only, three which can be intransitive or
transitive and three which are ditransitives.
E.2 Post-lecture
1. Give the unification of the following feature structures:
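(Reentrancy tags are written #1, #2; [ ] is an unspecified value.)
(a) [CAT [ ], AGR pl] unified with [CAT VP, AGR [ ]]
(b) [MOTHER [CAT VP, AGR #1], DTR1 [CAT V, AGR #1], DTR2 [CAT NP, AGR [ ]]] unified with [DTR1 [CAT V, AGR sg]]
(c) [F #1, G #1] unified with [F [J a], G [J [ ], K b]]
(d) [F #1 a, G #1] unified with [G b]
(e) [F #1, G #1] unified with [F [J a], G [J b, K b]]
(f) [F [G #1], H #1] unified with [F #1, H #1]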
(g) [F #1, G #1, H #2, J #2] unified with [F #1, J #1]
(h) [F [G #1], H #1] unified with [F #2, H [J #2]]
2. Add case to the initial FS grammar in order to prevent sentences such as they can they from parsing.
3. Work through parses of the following strings for the second FS grammar, deciding whether they parse or not:
(a) fish fish
(b) they can fish
(c) it fish
(d) they can
(e) they fish it
4. Modify the second FS grammar to allow for verbs which take an indirect object as well as an object. Also add a
lexical entry for give (just do the variant which takes two noun phrases).
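For checking hand-worked answers, the basic idea of unification can be sketched in code. This is a deliberately limited illustration: feature structures are nested Python dictionaries, atomic values are strings, an empty dictionary stands for an unspecified value, and reentrancy (shared values) is not handled, so it does not cover the exercises above that use tags. All names are invented for the example.

```python
class UnificationFailure(Exception):
    """Raised when two feature structures carry incompatible information."""


def unify(fs1, fs2):
    """Unify two feature structures without reentrancy.

    Complex values are dicts, atomic values are strings, and {} carries
    no information, so it unifies with anything.
    """
    if isinstance(fs1, dict) and isinstance(fs2, dict):
        result = dict(fs1)
        for feature, value in fs2.items():
            if feature in result:
                result[feature] = unify(result[feature], value)
            else:
                result[feature] = value
        return result
    if fs1 == {}:
        return fs2
    if fs2 == {}:
        return fs1
    if fs1 == fs2:
        return fs1
    raise UnificationFailure(f"cannot unify {fs1!r} with {fs2!r}")


# A noun phrase unspecified for number, combined with plural agreement.
np_unspecified = {"CAT": "NP", "AGR": {}}
print(unify(np_unspecified, {"AGR": "pl"}))

# dogs (plural) against the singular agreement required by barks.
try:
    unify({"CAT": "NP", "AGR": "pl"}, {"AGR": "sg"})
    clash = False
except UnificationFailure:
    clash = True
print(clash)
```

Supporting reentrancy requires representing feature structures as graphs with shared nodes rather than as trees of dictionaries.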
F Lecture 6
F.1 Pre-lecture
A very simple form of semantic representation corresponds to making verbs one-, two- or three-place logical
predicates. Proper names are assumed to correspond to constants. The first argument should always correspond to the
subject of the active sentence, the second to the object (if there is one) and the third to the indirect object (i.e., the
beneficiary, if there is one). Give representations for the following examples:
F.2 Post-lecture
Using the sample grammar provided, produce a derivation for the semantics of:
• Kitty sleeps.
• Kitty gives Lynx Rover.
Extend the grammar so that Kitty gives Rover to Lynx gets exactly the same semantics as Kitty gives Lynx Rover. You
can assume that to is semantically empty in this use.
If you did the exercise associated with the previous lecture to add ditransitive verbs to the grammar, amend your
modified grammar so that it produces semantic representations.
Go through the RTE examples given in the lecture notes, and decide what would be required to handle these inferences
correctly.
G Lecture 7
G.1 Pre-lecture
Without looking at a dictionary, write down brief definitions for as many senses as you can think of for the following
words:
1. plant
2. shower
3. bass
If possible, compare your answers with another student’s and with a dictionary.
Using the BNC Simple search (or another suitable corpus search tool: i.e., one that doesn’t weight the results returned
in any way), go through at least 10 sentences that include the verb find and consider whether it could have been
replaced by discover. You might like to distinguish between cases where the example ‘sounds strange’ but where
meaning is preserved from cases where the meaning changes significantly.
G.2 Post-lecture
1. Give hypernyms and (if possible) hyponyms for the nominal senses of the following words:
(a) horse
(b) rice
(c) curtain
2. List some possible seeds for Yarowsky’s algorithm that would distinguish between the senses of shower and
bass that you gave in the prelecture exercise.
H Lecture 8
H.1 Pre-lecture
Without looking at a dictionary, write down brief definitions for as many senses as you can think of for the following
words:
1. give
2. run
If possible, compare your answers with another student’s and with a dictionary. How does this exercise compare with
the pre-lecture exercise for lecture 7?
H.2 Post-lecture
Using the BNC Simple search (or another suitable corpus search tool: i.e., one that doesn’t weight the results
returned in any way), find 10 or more sentential contexts for shower. For each of the different notions of context
described in the lecture, find the features which a distributional model might associate with shower. You may
want to use an online dependency parser: the Stanford dependency format is one of the most popular approaches
(see nlp.stanford.edu/software/stanford-dependencies.shtml, online demo at
http://nlp.stanford.edu:8080/corenlp/). If you use an online parser, note that the output is unlikely to be perfectly
accurate.
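For the simplest notion of context, a window of words either side of the target, the features can be collected as in the sketch below. The three sentences are invented stand-ins for corpus lines, the window size of 2 is an assumption, and no stop-word filtering or weighting is applied.

```python
import math
from collections import Counter


def window_features(sentences, target, window=2):
    """Count the words occurring within `window` tokens of the target."""
    features = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token == target:
                start = max(0, i - window)
                context = tokens[start:i] + tokens[i + 1:i + 1 + window]
                features.update(context)
    return features


def cosine(a, b):
    """Cosine similarity between two feature-count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Invented example sentences standing in for corpus search results.
sentences = [
    "she took a hot shower before breakfast",
    "a heavy shower of rain fell at noon",
    "he had a cold bath before breakfast",
]

shower = window_features(sentences, "shower")
bath = window_features(sentences, "bath")
similarity = cosine(shower, bath)
print(shower)
print(similarity)
```

With real corpus data the counts for frequent function words such as a dominate, which is one motivation for the weighting schemes used in practice.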
There are a number of online demonstrations of distributional similarity:
• http://swoogle.umbc.edu/SimService/
• http://www.linguatools.de/disco/wortsurfer.html
This is described in http://www.linguatools.de/disco/disco_en.html. The interface to the
demo seems to be German only, but should be obvious (you can choose ‘Englisch’ searches).
Search for the words that you wrote down definitions for in the prelecture exercises for this lecture and lecture 7. Do
the similarities make sense? Do they suggest any senses (usages) that you missed? Were any of these also missing
from the dictionaries you looked at?
I Lecture 9
I.1 Pre-lecture
There is an online experiment to collect training data for anaphor resolution at
http://anawiki.essex.ac.uk/phrasedetectives/. Spending a few minutes on this will give you an idea of the issues that arise in anaphora
resolution: there are a series of tasks which are intended to train new participants which take you through progressively
more complex cases. Note that you have to register but that you don’t have to give an email address unless you want
to be eligible for a prize.
I.2 Post-lecture
Take a few sentences of real text and work out the values you would obtain for the features discussed in the lecture.
See if you can identify some other easy-to-implement features that might help resolution.
Try out the Lingpipe coreference system at http://alias-i.com/lingpipe/web/demos.html
(d) intruders
intrude (stem) er (suffix) s (suffix)
Note that in- is not a real prefix here
(e) bookshelves
book (stem) shelf (stem) s (suffix)
(f) reattaches
re (prefix) attach (stem) s (suffix)
(g) anticipated
anticipate (stem) ed (suffix)
2. (a) carry
Answer: simple past carried, past participle carried
(b) sleep
Answer: simple past slept, past participle slept
(c) see
Answer: simple past saw, past participle seen
J.6 Lecture 6 (pre-lecture)
1. Kim sleeps
sleep(Kim)
2. Sandy adores Kim
adore(Sandy, Kim)
3. Kim is adored by Sandy
adore(Sandy, Kim)
4. Kim gave Rover to Sandy
give(Kim, Rover, Sandy)