The Phraseological View of Language
The Phraseological View of Language
Edited by
Thomas Herbst
Susen Faulhaber
Peter Uhrig
De Gruyter Mouton
ISBN 978-3-11-025688-8
e-ISBN 978-3-11-025701-4
grateful that Professor Elena Tognini Bonelli came to take part in the work-
shop and to receive this distinction on his behalf.
We owe a great deal of gratitude to Michael Stubbs and Stig Johansson
for providing very detailed surveys of Sinclair's outstanding contribution to
the development of the subject which forms section I of this volume. In
"A tribute to John McHardy Sinclair (14 June 1933 - 13 March 2007)",
Michael Stubbs provides a detailed outline of John Sinclair's academic
career and demonstrates how he has left his mark in various linguistic fields
which are, in part, intrinsically connected to his name. Of particular signifi-
cance in this respect is his focus on refining the notion of units of meaning,
the role of linguistics in language learning and discourse analysis and, of
course, his contributions in the field of lexicography and corpus linguistics
in the context of the Cobuild project. We are particularly grateful to Stig
Johansson, who, although he was not able to take part in the workshop for
reasons of his own health, wrote a tribute to John Sinclair for this occasion.
His article "Corpus, lexis, discourse: a tribute to John Sinclair" focuses in
particular on Sinclair's work in the context of the Bank of English and his
influence on the field of corpus linguistics.
The other contributions to this volume take up different issues which
have featured prominently in Sinclair's theoretical work. The articles in
section II focus on the concept of collocation and the notions of open-
choice and idiom principle. Thomas Herbst, in "Choosing sandy beaches -
collocations, probabemes and the idiom principle" discusses different types
of collocation in the light of Sinclair's concepts of single choice and ex-
tended unit of meaning, drawing parallels to recent research in cognitive
linguistics. The notions of open choice and idiom principle are also taken
up by Dirk Siepmann, who in "Sinclair revisited: beyond idiom and open
choice" suggests introducing a third principle, which he calls the principle
of creativity, and discusses its implications for translation teaching. Eugene
Mollet, Alison Wray, and Tess Fitzpatrick widen the notion of collocation
in their article "Accessing second-order collocation through lexical co-
occurrence networks". By second-order collocation they refer to word
combinations depending on the presence or absence of other items, mod-
elled in the form of networks. Thus they address questions such as to what
extent the presence of several second-order collocations can influence the
target item. A number of contributions focus on aspects of collocation in
foreign language teaching. In "From phraseology to pedagogy: challenges
and prospects" Sylviane Granger discusses the implications of the interde-
pendence of lexis and grammar and the idiom principle with respect to the
Susen Faulhaber
Thomas Herbst
Peter Uhrig
Preface
Susen Faulhaber, Thomas Herbst and Peter Uhrig
IV Computational aspects
1. Abstract
2. Introduction
Sinclair was a Scot, and proud of it. He was born in 1933 in Edinburgh,
attended school there, and then studied at the University of Edinburgh,
where he obtained a first class degree in English Language and Literature
(MA 1955). He was then briefly a research student at the University, before
being appointed to a Lectureship in the Department of English Language
and General Linguistics, where he worked with Michael Halliday. His work
in Edinburgh centred on the computer-assisted analysis of spoken English
and on the linguistic stylistics of literary texts.
In 1965, at the age of 31, he was appointed to the foundation Chair of
Modern English Language at the University of Birmingham, where he then
stayed for the whole of his formal university career. His inaugural lecture
was entitled "Indescribable English" and argued that "we use language
rather than just say things" and that "utterances do things rather than just
mean things" (Sinclair 1966, emphasis in original). His work in the 1970s
focussed on educational linguistics and discourse analysis, and in the 1980s
he took up again the corpus work and developed his enormously influential
approach to lexicography.
A Tribute to John McHardy Sinclair 3
It is obviously rather simplistic to pick out just one theme in all his work,
but "the search for units of meaning" (Sinclair 1996) might not be far off.
This is the title of one of his articles from 1996. In his first work in corpus
linguistics in the 1960s, he had asked: "(a) How can collocation be objec-
tively described?" and "(b) What is the relationship between the physical
evidence of collocation and the psychological sensation of meaning?" (Sin-
clair, Jones and Daley 2004: 3). In his work on discourse in the classroom,
he was looking for characteristic units of teacher-pupil dialogue. In his later
corpus-based work he was developing a sophisticated model of extended
lexical units: a theory of phraseology. And the basic method was to search
for patterning in long authentic texts.
Along with this went his impatience with the very small number of short
invented, artificial examples on which much linguistics from the 1960s to
the 1990s was based. The title of a lecture from 1990, which became the
title of his 2004 book, also expresses an essential theme in his work: "Trust
the Text" (Sinclair 2004). He argued consistently against the neglect and
devaluation of textual study, which affected high theory in both linguistic
and literary study from the 1960s onwards (see Hoover 2007).
4 Michael Stubbs
5. Language in education
One of his major contributions in the 1960s and 1970s was to language in
education and educational linguistics. In the 1970s, he was very active in
developing teacher-training in the Birmingham area. He regularly made the
point that knowledge about language is "sadly watered down and trivial-
ized" in much educational discussion (Sinclair 1971: 15), and he succeeded
in making English language a compulsory component of teacher-training in
BEd degrees in Colleges of Education in the West Midlands. Also in the
early 1970s, along with Jim Wight, he directed a project entitled Concept 7
to 9. This project produced innovative teaching materials. They consisted
of a large box full of communicative tasks and games. They were originally
designed for children of Afro-Caribbean origin, who spoke a variety of
English sometimes a long way from standard British English, but it turned
out that they were of much more general value in developing the communi-
cative competence of all children. The tasks focussed on "the aims of
communication in real or realistic situations" and "the language needs of
urban classrooms" (Sinclair 1973: 5). In the late 1970s, he directed a pro-
ject which developed ESP materials (English for Specific Purposes) for the
University of Malaya. The materials were published as four course books,
entitled Skills for Learning (Sinclair 1980).
In the early 1990s, he became Chair of the editorial board for the journal
Language Awareness, which started in 1992. One of his last projects is
PhraseBox. This is a project to develop a corpus linguistics programme for
schools, which Sinclair worked on from around 2000. It was commissioned
by Scottish CILT (Centre for Information on Language Teaching and Re-
search) and funded by the Scottish Executive, Learning and Teaching Scot-
land and Canan (the Gaelic College on Skye). The software gives children
in Scottish primary schools resources to develop their vocabulary and
grammar by providing them with real-time access to a 100-million-word
corpus. The project is described in one of Sinclair's more obscure publica-
tions, in West Word, a community newspaper for the western highlands in
Scotland (Sinclair 2006a).
In a word, Sinclair did not just write articles about language, but helped
to develop training materials for teachers and classroom materials for stu-
dents and pupils.
6. Discourse analysis
The analysis of literary texts was part of Sinclair's demand that linguis-
tics must be able to handle all kinds of authentic texts. He argued further
that, if linguists cannot handle the most prestigious texts in the culture, then
there is a major gap in linguistic theory. Conversely, of course, the analysis
of literary texts must have a systematic basis, and not be the mere swapping
of personal opinions. In an analysis of a poem by Robert Graves, he argued
that the role of linguistics is to expose "the public meaning" of texts in a
language (Sinclair 1968: 216). He similarly argued that "if literary com-
ment is to be more than exclusively personal testimony, it must be inter-
pretable with respect to objective analysis" (Sinclair 1971: 17). In all of this
work there is a consistent emphasis on long texts, authentic texts, including
literary texts, and on observable textual evidence of meaning.
Post-1990, Sinclair was mainly known for his work in corpus linguistics.
This work started in Edinburgh, in the 1960s, and was informally published
as the "OSTI Report" (UK Government Office for Scientific and Technical
Information, Sinclair, Jones and Daley 2004). This is a report on quantita-
tive research on computer-readable corpus data, carried out between 1963
and 1969, but not formally published until 2004.
The project was in touch with the work at Brown University: Francis
and Kucera's Computational Analysis of Present-Day American English,
based on their one-million-word corpus of written American English, had
appeared in 1967. But again, it is difficult to project oneself back to a pe-
riod in which there were no PCs, and in which the university mainframe
machine could only handle with difficulty Sinclair's corpus of 135,000
running words of spoken language.
Yet the report worked out many of the main ideas of modern corpus lin-
guistics in astonishing detail. This work in the 1960s formulated explicitly
several principles which are still central in corpus linguistics today. It put
forward a statistical theory of collocation in which collocations were inter-
preted as evidence of meaning. It asked: What kinds of lexical patterning
can be found in text? How can collocation be objectively described? What
size of span is relevant? How can collocational evidence be used to study
meaning? Some central principles which are explicitly formulated include:
The unit of lexis is unlikely to be the word in all cases. Units of meaning
Few of the very common words in the language "have a clear meaning
independent of the cotext". Nevertheless, "their frequency makes them
dominate all text" (Sinclair 1999: 158, 163).
Here is a fragment of output from some modern concordance software:
all the examples of the three words way - long - go co-occurring in a six-
million-word corpus.2 The concordance lines were generated by software
developed by Martin Warren and Chris Greaves (Cheng, Greaves and War-
ren 2006), in a project that Sinclair was involved in.
01 added that there was still a long way to go in overcoming Stalinist structure
02 ges in 1902 there was still a long way to go: A. M. Fairbairn warned Sir Alfre
03 e," he said. There is still a long way to go, however, to reach the 1991 high
04 handicapped. There is still a long way to go before the majority of teachers 1
05 on peas. But we still have a long way to go. If we imagine our blizzard rag
06 hem to Church-we still have a long way to go to reach our African Church stand
07 other we still have an awful long way to go. TRADES UNIONS AND THE EUROPEAN
08 real good [...] Gemma's got a long way to go before she gets to eighty You're
09 s demonstrates that we have a long way to go before we have true democracy in
10 ice, but I'm afraid we have a long way to go before we catch up to the Japanes
11 dications are that there is a long way to go before the Algerian problem is fi
12 f small atomic reactors has a long way to go before it becomes a commercial pr
13 seventy. I mean he's, he's a long way to go. [...] Could you cut me a slice of t
14 aeli occupation, Israel has a long way to go to convince anyone that it is ser
15 ichael Caine." But he's got a long way to go. "David Who? Never heard of him,"
16 teen. My turn to draw. A long long way to go though. Difficult. Well look wher
17 y the earth's gases will go a long way toward bolstering or destroying cosmic
18 nd and motion, fountains go a long way toward selling themselves in showrooms
19 ntion. Vaccinations also go a long way toward eliminating the spread of more v
20 and ready to act, would go a long way toward making the 'new world order' mor
21 s father owned, it might go a long way toward explaining why she was reluctant
22 these very simple cases go a long way towards explaining puzzling features of
23 duction in wastage would go a long way to easing the manpower problems. In gen
24 and guineas. That should go a long way to easing the strain on an amateur team
25 ll, if voters are loyal, go a long way to ensuring the election of one or even
26 s in partnership, we can go a long way. "You do believe that?" The expression 1
27 ius scarf, and a boy can go a long way with those things. You got a job yet? W
28 Conolly said. "We could go a long way on this. I didn't know Major Fitzroy wa
29 Of course we go back an awful long way don't we? Yes. Yeah. Are you going to t
30 I go to London. And we go any damn way I please, as long as I don't interfere
Data of this kind make visible the kinds of patterning which occur across
long texts, and provide observable evidence of the meaning of extended
lexical units. Several things are visible in the concordance lines. They show
that way "appears frequently in fixed sequences" (Sinclair 2004: 110), and
that the unit of meaning is rarely an individual word. Rather, words are co-
selected to form longer non-compositional units. Also, the three words way
- long - go tend to occur in still longer sequences, which are not literal, but
metaphorical. There are two main units, which have pragmatic meanings: the
first, a long way to go, is used in an abstract extended sense to simultane-
ously encourage hearers about progress in the past and to warn them of
efforts still required in the future; the second, go a long way (to/towards),
is also used in exclusively abstract senses.
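This kind of keyword-in-context (KWIC) display can be sketched in a few lines of Python. This is a minimal illustration, not the software developed by Warren and Greaves; the `kwic` function and the sample text are invented for the purpose:

```python
import re

def kwic(text, pattern, width=25):
    """Return keyword-in-context (KWIC) lines: each match of `pattern`
    is centred between fixed-width windows of left and right context."""
    lines = []
    for m in re.finditer(pattern, text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append(f"{left:>{width}} {m.group(0)} {right:<{width}}")
    return lines

text = ("There is still a long way to go before the majority of teachers "
        "accept it, but these simple cases go a long way towards explaining it.")
for line in kwic(text, r"long way"):
    print(line)
```

Real concordancers of course run over corpora of millions of words, but the principle is the same: aligning the node in a fixed column makes the repeated patterning around it visible at a glance.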
8. Publications
Sinclair's work was for a long time not as well known as it deserved to be.
This was partly his own fault. He often published in obscure places, not
always as obscure as community newspapers from the Scottish Highlands,
but nevertheless frequently in little known journals and book collections,
and it was only post-1990 or so that he began to collect his work into books
with leading publishers (Oxford University Press, Routledge, Benjamins).
He once told me that he had never published an article in a mainstream
refereed journal. I questioned this and cited some counter-examples, which
he argued were not genuine counter-examples, since he had not submitted
the articles: they had been commissioned. He was always very sceptical of
journals and their refereeing and gate-keeping processes, which he thought
were driven by fashion rather than by standards of empirical research.
He was also particularly proud of the fact that, when he was appointed
to his chair in Birmingham, he had no PhD and no formal publications. His
first publication was in 1965, the year when he took up his chair: it was an
article on stylistics entitled "When is a poem like a sunset?", which was
published in a literary journal (Sinclair 1965). It is a short experimental
study of the oral poetic tradition which he carried out with students. He got
them to read and memorize a ballad ("La Belle Dame Sans Merci" by
Keats) and then studied what changes they introduced into their versions
when they tried to remember the poem some time later.
His last book Linear Unit Grammar, co-authored with Anna Mauranen,
is typical Sinclair (Sinclair and Mauranen 2006). It is based on one of his
most fundamental principles: if a grammar cannot handle authentic raw
texts of any type whatsoever, then it is of limited value. The book points
out that traditional grammars work only on input sentences which have
been very considerably cleaned up (or simply invented). Sinclair and Mau-
ranen demonstrate that analysis of raw textual data is possible. On the one
hand, the proposals are so simple as to seem absolutely obvious: once
someone else has thought of them. On the other hand, they are so innova-
tive, that it will take some time before they can be properly evaluated. I will
not attempt this here, and just note that the book develops the view that
significant units of language in use are multiword chunks. But here, the
approach is via a detailed discussion of individual text fragments as op-
posed to repeated patterns across large text collections. Either way, it is a
significant break with mainstream linguistic approaches.
9. In summary
Notes
1 Some biographical details are from the English Department website at Bir-
mingham University and from obituaries in The Guardian (3 May 2007), The
Scotsman (10 May 2007) and Functions of Language 14 (2) (2007). Special is-
sues of two journals are devoted to papers on Sinclair's work: International
Journal of Corpus Linguistics 12 (2) (2007) and International Journal of Lexi-
cography 21 (3) (2008). I am grateful to Susan Hunston and Michaela Mahl-
berg for comments on a previous version of this paper.
2 The corpus consisted of Brown, LOB, Frown and FLOB plus BNC-baby: five
million words of written data and one million words of spoken data.
References
Corpora
BNC Baby The BNC Baby, version 2. 2005. Distributed by Oxford University
Computing Services on behalf of the BNC Consortium. URL:
http://www.natcorp.ox.ac.uk/.
BROWN A Standard Corpus of Present-Day Edited American English, for use
with Digital Computers (Brown). 1964, 1971, 1979. Compiled by W.
N. Francis and H. Kucera. Brown University. Providence, Rhode Is-
land.
FROWN The Freiburg-Brown Corpus ('Frown') (original version) compiled
by Christian Mair, Albert-Ludwigs-Universität Freiburg.
LOB The LOB Corpus, original version (1970-1978). Compiled by Geof-
frey Leech, Lancaster University, Stig Johansson, University of Oslo
(project leaders) and Knut Hofland, University of Bergen (head of
computing).
FLOB The Freiburg-LOB Corpus ('F-LOB') (original version) compiled by
Christian Mair, Albert-Ludwigs-Universität Freiburg.
Stig Johansson*
It is an honour to have been asked to give this speech for John Sinclair,
pioneer in corpus linguistics, original thinker and a source of inspiration
for countless numbers of language students.
The use of corpora, or collections of texts, has a venerable tradition in
language studies. Many important works have drawn systematically on
evidence from texts. To take just two examples, the great grammar by Otto
Jespersen was based on collections of several hundred thousand examples.
The famous Oxford English Dictionary could use several million examples
collected from English texts. There is no doubt that the data collections, or
rather the intelligent use of evidence from the collections, contributed
greatly to the success of these monumental works.
But these data collections had the drawback that the examples had been
collected in a more or less impressionistic manner, and there is no way of
knowing what had been missed. Working in this way, there is a danger that
the attention is drawn to oddities and irregularities and that what is most
typical is overlooked. Just as important, the examples were taken out of
their context.
When we talk about corpora these days, we think of collections of run-
ning text held in electronic form. Given such computer corpora, we can
study language in context, both what is typical and what is idiosyncratic.
This is where we have an edge on Jespersen and the original editors of the
Oxford English Dictionary. With the computational analysis tools which
are now available we can observe patterns that are beyond the capacity of
ordinary human observation.
The compilation and use of electronic corpora started some forty to fifty
years ago. At that time, corpora were small by today's standards, and they
were difficult to compile and use. There were also influential linguists who
* Sadly, Stig Johansson died in April 2010. The editors of this volume would
like to express their thanks to Professor Hilde Hasselgård for taking care of
the final version of his paper.
18 Stig Johansson
rejected corpora, notably Noam Chomsky and his followers. Those who
worked with corpora were a small select group. One of them was John
Sinclair.
In the course of the last few decades there has been an amazing devel-
opment, made possible by technological advances but also connected with
the foresight and ability of linguists like John Sinclair to see the possibili-
ties of using the new tools for the study of language. We now have vast
text collections, numbering several hundred million words, and analysis
tools that make it possible to use these large data sources. The number of
linguists working with computer corpora has grown from a select few to an
ever increasing number, so that Jan Svartvik, another corpus pioneer, could
say in the 1990s that "corpora are becoming mainstream" (Svartvik 1996).
We also have a new term for the study of language on the basis of com-
puter corpora: corpus linguistics. As far as I know, this term was first used
in the early 1980s by another pioneer, Jan Aarts from the University of
Nijmegen in Holland (see Aarts and Meijs 1984). Now it has become a
household word. A search on the Internet provides over a million hits.
Many people working with corpora probably associate the beginnings
of corpus linguistics with Randolph Quirk's Survey of English Usage, a
project which started in the late 1950s, but the corpus produced by Quirk
and his team was not computerised until much later. What really got the
development of computer corpora going was the Brown Corpus, compiled
in the early 1960s by W. Nelson Francis and Henry Kucera at Brown Uni-
versity in the United States. The Brown Corpus has been of tremendous
importance in setting a pattern for the compilation and use of computer
corpora. Not least, it was invaluable that the pioneers gave researchers
across the world access to this important data source, which has been used
for hundreds of language studies: in lexis, grammar, stylistics, etc.
Around this time John Sinclair was engaged in a corpus project in Brit-
ain. The reason why this is less known is probably that the corpus was not
made publicly available. We can read about the project in a book published
a couple of years ago: John M. Sinclair, Susan Jones and Robert Daley,
English Collocation Studies: The OSTI Report, edited by Ramesh Krishna-
murthy, including a new interview with John M. Sinclair, conducted by
Wolfgang Teubert. The book is significant both because it gives access to
the OSTI Report, which had been difficult to get hold of, and because of
the interview, which gives insight into the development of John Sinclair's
thinking.
Corpus, lexis, discourse: a tribute to John Sinclair 19
The OSTI Report was, as we can read on the title page, the final report
to the Office for Scientific and Technical Information (OSTI) on the Lexi-
cal Research Project C/LP/08 for the period January 1967 - September
1969, and it was dated January 1970, but the project had started in 1963.
There are two things which I find particularly significant in connection
with this project. In the first place, it included the compilation of a corpus
of conversation, probably the world's first electronic corpus of spoken
language compiled for linguistic studies. The corpus was fairly small,
about 135,000 words, but considering the difficulties of recording, tran-
scribing and computerising spoken material, this was quite an achievement.
In addition, some other material was used for the project, including the
Brown Corpus. The most significant aspect of the project was that the fo-
cus of the study was on lexis. We should remember that at this time lexis
was disregarded, or at least underestimated, by many - perhaps most -
linguists, who regarded the lexicon as a marginal part attached to grammar.
Schematically, we could represent it in this way:

[Figure: the Grammar, with the Lexicon attached as a residual list of lexical
items (no independent semantics)]
I will come back later to the notion of lexical item. Let's return to the ori-
gin of John Sinclair's thinking on lexis. We find it in the OSTI Report and
in a paper with the title "Beginning the study of lexis", published for a
collection of papers in memory of his mentor J. R. Firth (Bazell et al.
1966). Firth had stressed the importance of collocations, representing the
significant co-occurrence of words. But he did not have the means of ex-
ploring this beyond typical examples, such as dark night (Firth 1957: 197).
What is done in the OSTI Report is that systematic procedures are de-
vised for defining collocations in the corpus. Here we find notions such as
node, collocate and span, which have become familiar later:
A node is an item whose total pattern of co-occurrence with other words is
under examination; a collocate is any one of the items which appears with
the node within the specified span. (Sinclair, Jones and Daley [1970] 2004:
10)
In the interview with Wolfgang Teubert, John Sinclair reports that the op-
timal span was calculated to be four words before and four words after the
node, and he says that, when this was re-calculated some years ago based
on a much larger corpus, they came to almost the same result (Sinclair,
Jones and Daley 2004: xix).
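The node-collocate-span procedure defined in the quotation above can be sketched in a few lines of Python. This is a toy illustration under the calculated ±4-word span, with an invented tokenised sample; a real study would run over a full corpus:

```python
from collections import Counter

def collocates(tokens, node, span=4):
    """Count collocates of `node`: every token that occurs within
    `span` positions to the left or right of an occurrence of the node."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            # left and right windows around the node, clipped at text edges
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

tokens = "it was a dark night and the night was long".split()
print(collocates(tokens, "night").most_common(3))
```

The raw counts produced this way are the input to the statistical measures, such as mutual information, that separate significant collocations from chance co-occurrence.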
It was a problem that the corpus was rather small for a systematic study
of collocations. In the opening paragraph of the paper I just referred to,
John Sinclair says:
[... ] if one wishes to study the 'formal' aspects of vocabulary organization,
all sorts of problems He ahead, problems which are not likely to yield to
anything less imposing than a very large computer. (Sinclair 1966: 410)
Later in the paper we read that "it is likely that a very large computer will
be strained to the utmost to cope with the data" (Sinclair 1966: 428). There
was no way of knowing what technological developments lay ahead, and
that we would get small computers with an infinitely larger capacity than
the large computers at the time this was written.
John Sinclair says that he did very little work on corpora in the 1970s
(Sinclair, Jones and Daley 2004: xix), frustrated by the laboriousness of
using the corpus and by the poor analysis programs which were available.
But he and his team at Birmingham did ground-breaking work on dis-
course, leading to an important publication on the English used by teachers
and pupils (Sinclair and Coulthard 1975). As I have understood it, what
was foremost for John Sinclair was his concern with discourse and with
studying discourse on the basis of genuine data. We must "trust the text",
as he puts it in the title of a recent book (Sinclair 2004). This applies both
to the discourse analysis project and to his corpus work.
Around 1980 John Sinclair was ready to return to corpus work. We
were fortunate to have him as a guest lecturer at the University of Oslo in
February 1980, and a year later he attended a conference in Bergen, the
Here we find some verbs: teetering, teetered, poised, hovering and some
nouns denoting disasters: starvation, extinction, bankruptcy, collapse, dis-
aster, destruction. These are the words which most typically co-occur with
brink, identified by a measure of co-occurrence called mutual information.
The pattern could have been shown more clearly if I had given the lists for
left and right contexts separately, but there should be no need to do this for
the present purpose. The results agree very well with the findings presented
in John Sinclair's article, though he used a different corpus. He summa-
rises the results in this way (Sinclair 1999: 12):
[A] M/I prep the E of D
This is a lexical item. It is used about some actor (A) who is on (I), or is
moving towards (M), the edge (E) of something disastrous (D). It has an
invariable core, brink, and there are accompanying elements which con-
form to the formula. By using the item "the speaker or writer is drawing
attention to the time and risk factors, and wants to give an urgent warning"
(loc. cit.). There is a negative semantic prosody, reflecting the communica-
tive purpose of the item.
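Mutual information, the co-occurrence measure mentioned above, compares the observed frequency of a node-collocate pair with the frequency that would be expected if the two words were independent. A minimal sketch, with frequency counts invented purely for illustration:

```python
import math

def pmi(f_node, f_collocate, f_joint, n):
    """Pointwise mutual information: log2 of the observed joint frequency
    over the joint frequency expected under independence."""
    expected = f_node * f_collocate / n
    return math.log2(f_joint / expected)

# Invented counts: in a corpus of 1,000,000 words, 'brink' occurs 200
# times, 'teetering' 50 times, and the two co-occur 30 times.
print(round(pmi(200, 50, 30, 1_000_000), 2))
```

High scores pick out collocates such as teetering or extinction: words that are rare in the corpus overall but strongly attracted to the node.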
Let's take a second example, from a book called Reading Concordances
(Sinclair 2003: 141-151). This is a bit more complicated. How do we use
the sequence true feelings?
After examining material from his corpus, John Sinclair arrives at the
following analysis:
PROSODY       reluctance
COLLOCATION   hide       his
              reveal     their     true feelings
              express    your
Notes
1 This text was prepared for the ceremony in connection with the award of an
honorary doctorate to John Sinclair (Erlangen, November 2007). As the audi-
ence was expected to be mixed, it includes background information which is
well-known among corpus linguists. The text is virtually unchanged as it was
prepared early in 2007, before we heard the sad news that John had passed
away. The tense forms have not been changed.
2 See also Sinclair (1998).
References
Aarts, Jan and Willem Meijs
1984 Corpus Linguistics: Recent Developments in the Use of Computer
Corpora in English Language Research. Amsterdam: Rodopi.
Bazell, Charles Ernest, John Cunnison Catford, Michael Alexander Kirkwood
Halliday and Robert H. Robins (eds.)
1966 In Memory of J. R. Firth. London: Longman.
Corpus
... a language user has available to him or her a large number of semi-
preconstructed phrases that constitute single choices (Sinclair 1991: 110)
... patterns of co-selection among words, which are much stronger than any
description has yet allowed for, have a direct connection with meaning.
(Sinclair 2004: 133)
... a text is a unique deployment of meaningful units, and its particular
meaning is not adequately accounted for by any organized concatenation of
the fixed meanings of each unit. (Sinclair 2004: 134)
2. Collocations-compounds-concepts
Choosing sandy beaches - collocations, probabemes and the idiom principle 31

BNC        rain  rainfall  wind  storm  gale  smoker  drinking  tea  coffee  taste
heavy       254        21     5      5     1      47        43    0       0      0
strong        0         0   223      0     4       0         0   28      19      3
severe        2         0     1     15    10       0         0    0       0      0
light        47         1    61      0     0       1         0    2       0      2
slight        0         0     3      0     0       0         0    0       0      2
moderate      0         5     8      0     3       2        11    0       0      1
weak          0         0     0      0     0       0         0   16       3      0
NP + V + NP + NP            +   +   +   +   +
NP + V + NP + as NP         +   +   +   +   +
NP + V + of NP + as NP      +
etc.
Thus if one takes the definition of manage provided by the Cobuild English
Language Dictionary (1987)
(2) If you manage to do something, you succeed in doing it.
one could argue that the choice of a particular valency carrier such as man-
age or succeed entails a simultaneous choice of a particular valency pattern
(and obviously the fact that manage combines with a [to INF]-complement
and succeed with an [in V-ing]-complement must be stored in the mind
and is unpredictable for the foreign learner of the language).
If we consider heavy and strong or the valency patterns listed (and the
list of patterns and verbs could be expanded to result in an even more com-
plex picture), then we are confronted with important combinatorial proper-
ties. Methodologically, it is not easy to decide whether (or in which cases)
these should be described as restrictions or merely as preferences: while the
BNC does not show any occurrences of strong + storm/s or strong + rain-
fall/s (span ±5), for example, it does contain 5 instances of heavy wind/s
(and artificial teeth). Similarly, extremely rare occurrences of a verb in a
pattern in which it does not "normally" occur such as
(3) ... they regard management a very important ingredient within their strategy
<BNC:HJ5 6002>
cannot be regarded as a sufficient reason to see this as an established use
and to make this pattern part of the valency description of that verb. Never-
theless, acceptability judgments are highly problematic in this area, which
is why it may be preferable to speak of established uses, which however
means that frequency of occurrence is taken into account. In any case, the
observation that the collocational and colligational properties of words
display a high degree of idiosyncrasy is highly compatible with usage-
based models of cognitive linguistics (e.g. Tomasello 2003; Lieven forth-
coming; Behrens 2007; or Bybee 2007). 12
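Frequency counts of the kind reported in the table above can be obtained by sliding a window over a tokenized corpus. The following sketch (a toy corpus of invented sentences, not BNC data; function name is hypothetical) shows the span-based counting that underlies such tables:

```python
from collections import Counter

def cooccurrences(tokens, node, span=5):
    """Count words co-occurring with `node` within a window of ±span tokens."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(w.lower() for w in window)
    return counts

# Toy corpus standing in for the BNC (invented sentences, not corpus data).
tokens = ("heavy rain fell all night and heavy rain was forecast "
          "while a strong wind blew and strong winds persisted").split()

print(cooccurrences(tokens, "rain", span=5)["heavy"])   # 3
print(cooccurrences(tokens, "wind", span=5)["strong"])  # 2
```

Note that these are raw co-occurrence frequencies; as with the table above, they say nothing by themselves about collocational strength.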
Although cases such as heavy rain or heavy drinking represent an argu-
ment in favour of storage, it is not necessarily the case that we have to talk
about a single choice. In some cases it is quite plausible to assume, as
Hausmann (1985: 119, 2007: 218) does, that the base is chosen first and
then a second choice limited by the collocational options of that base takes
place. This choice, of course, can be relatively limited as in the case of
Hausmann's (1984: 401-402, 2007: 218) examples schütteres Haar and
confirmed bachelor/eingefleischter Junggeselle or, on a slightly different
line, physical attack, scientific experiment and full enquiry discussed by
Sinclair (2004: 21).
In other cases, however, accounting for a collocation in terms of a single
choice option may be more plausible. This applies to examples such as
white wine or red wine, also commented on by Sinclair (2004). These show
great similarity to compounds but they allow interpolation -
(4) white Franconian wine
- and predicative uses
(5) Wines can be red, white or rosé, and still or sparkling - where the natural
carbon dioxide from fermentation is trapped in the wine. <BNC: C9F 2172>
and also, of course, uses such as
(6) 'I always know when I'm in England,' said Morris Zapp, as Philip Swallow
went off, 'because when you go to a party, the first thing anyone says to you
is, "Red or white?"' <NW323>
In this respect, the case of sandy beaches seems to be quite different. One
could argue that the mere fact that beaches and sandy co-occur with a log-
likelihood value of 3089.83 ({beach/N} ±3) has to do with the fact that
beaches are often sandy and therefore tend to be described with the adjec-
tive sandy. In this respect, sandy beaches can be compared to the collocates
of winds, where the fact that the BNC contains 22 instances of westerly
wind/winds, 12 of south-westerly wind/winds and only 2 of southerly
wind/winds ({wind/N} -1) is a reflection of the facts of the world discussed
in the texts of the corpus rather than of the language.
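A log-likelihood score of the kind cited above is standardly computed from a 2x2 contingency table of observed and expected frequencies, G2 = 2 * sum of O*ln(O/E). The sketch below uses invented counts, not the actual BNC figures for sandy and beaches:

```python
import math

def log_likelihood(o11, o12, o21, o22):
    """Log-likelihood (G2) statistic for a 2x2 contingency table:
    o11 = node word + collocate, o12 = node without collocate,
    o21 = collocate without node, o22 = all remaining words."""
    n = o11 + o12 + o21 + o22
    expected = [
        (o11, (o11 + o12) * (o11 + o21) / n),
        (o12, (o11 + o12) * (o12 + o22) / n),
        (o21, (o21 + o22) * (o11 + o21) / n),
        (o22, (o21 + o22) * (o12 + o22) / n),
    ]
    return 2 * sum(o * math.log(o / e) for o, e in expected if o > 0)

# Perfectly independent distribution: G2 is zero.
print(round(log_likelihood(10, 90, 90, 810), 6))  # 0.0

# Strongly associated pair (invented counts, not BNC figures): large G2.
print(log_likelihood(120, 880, 1880, 997120) > 100)  # True
```

A high G2 value indicates that the pair co-occurs far more often than chance predicts, which, as the sandy beaches case shows, still leaves open whether the attraction is a fact of the language or of the world described.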
Sandy beaches could then be analysed as a free combination - the type
that Hausmann (1984: 399-400) refers to as a "Ko-Kreation" - and this
could be taken as an argument for not including sandy beaches in a diction-
ary - at least not as a significant combination, although perhaps, as in
Cobuild1 and LDOCE5,13 as an example of a rather typical use.
On the other hand, in German the meaning of sandy beaches is usually
expressed by a compound - Sandstrände. The question to be asked is
whether it is realistic to imagine that the same meaning or concept is real-
ized by a free combination in one language and by a compound in another,
in other words, whether the meaning of the German lexicalized compound
can be seen as the same as that of the collocation in English.
It seems that, again, the co-occurrence of particular lexical items cre-
ates a particular meaning, and as such they can be considered a single
choice.
However, the fact that high tide and the tide is high do not represent
identical concepts shows that we have to consider not only the individual
words that make up an "extended unit of meaning", to use Sinclair's (2004:
24) term, but that the meaning of the construction must also be taken into
account, where one could speculate that in the case of adjective-noun-
combinations there is a gradient from compounds to attributive collocations
(such as sandy beach, confirmed bachelor, high tide) to predicative uses
(the tide is high) as far as the concreteness and stability of the concepts
are concerned.
Even if there are no clear-cut criteria for the identification of concepts or
semantic units, it has become clear that they do not coincide with the classi-
fication of collocations on the basis of the criteria of semantic or statistical
significance. While semantically significant collocations such as guilty
conscience or white wine can be seen as representing concepts, this is not
necessarily the case in the same way with other such collocations like
heavy rain or heavy smoker, for instance. Similarly, the fact that a statisti-
cally significant collocation such as sandy beach can be seen as represent-
ing a concept does not necessarily mean that all frequent word combina-
tions represent concepts in this way. It must be doubted, for instance,
whether the fact that the most frequent collocate of the verbs buy and sell in
the BNC is house should be taken as evidence for claiming that "buying a
house" has concept status for native speakers of English.
This means that traditional distinctions such as the ones between differ-
ent types of collocation or between collocations and compounds are not
necessarily particularly helpful when it comes to identifying single choices
or semantic units in Sinclair's sense.
The case of sandy beaches shows how scope and perspective of linguistic
analysis influence its outcome. If one studies combinations of two words,
as Hausmann (1984) does, then weak tea and heavy storm will be classified
as semantically significant collocations since tea and storm do not combine
with semantically similar adjectives such as feeble (tea) or strong (storm).
On the other hand, one could argue that the uses of the adjectives sandy and
sandig follow the open-choice principle: sandy occurs with nouns such as
beach, heath, soil etc.; sandig with nouns such as Boden, but also with
Schuhe, Strümpfe etc. If one identifies two senses of sandig in German -
one meaning 'consisting of sand', one 'being covered by sand', then of
course all of the uses are perfectly "regular". Seen in this light, such com-
binations can be attributed to the principle of open choice, which Sinclair -
at least in (1991: 109) - believed to be necessary alongside the idiom prin-
ciple "in order to explain the way in which meaning arises from language
text". The open-choice principle can be said to operate whenever there are
no restrictions in the combinations of particular lexical items. However, as
demonstrated above, the fact that sandy beach is frequently used in English
(because sand beach seems to be restricted to technical language) where
Sandstrand is used in German means that - at least when we talk about
established language use - the choice is not an entirely open one.
In any case, open choice must not be seen as being identical with pre-
dictability. For example, verbs such as buy, sell, propose or object seem to
present good examples of the principle of open choice since there do not
seem to be any restrictions concerning the lexical items that can occur as
the second valency complement of the verbs. The fact that shares, house
and goods are the most frequent noun collocates of buy and of sell in the
BNC (span ±3) can be seen as a reflection of the facts of the world (or of
the world discussed in the texts of the BNC) but one would hardly regard
this as a sufficient reason to consider these as phraseological or conceptual
units. So there is open choice in the language, but how speakers know that
(and when) this is the case, is a slightly different matter. In a way this is a
slightly paradoxical situation in that speakers will only produce combina-
tions such as buy shares or buy a book (a) because they have positive evi-
dence that the respective meanings can be expressed in this way and/or (b)
because they lack evidence that this is not the case - facts that cognitive
grammar would account for in terms of entrenchment or pre-emption.22
This open choice paradox can be taken as evidence for the immense role of
storage even in cases where the meaning of an extended unit of meaning
can be analysed as being entirely compositional.
A further complication about deciding whether a particular combination
of words can be attributed to the open-choice or the idiom principle is due
to the unavoidable element of circularity caused by the fact that this deci-
sion is based on our analysis of the meanings of the component parts of the
combination, which in turn however is based on the combinations in which
these words occur.
3. Probabemes
tive of the fact whether these expressions take the form of one word or
several words. The term probabeme can then be used to refer to umts such
as six months, i.e. the (most) likely (or established) verbalizations of a par-
ticular meaning in a language. If we talk about the idiom principle, we talk
about what is established, usual in a speech community, and thus identify-
ing probabemes is part of the description of the idiom principle.
This aspect of idiomaticity was highlighted by Pawley and Syder (1983:
196), who point out that utterances of the type
(16) I desire you to become married to me
(17) Your marrying me is desired by me
"might not achieve the desired response". More recently, Adele Goldberg
(2006: 13) observes that "it's much more idiomatic to say
(18) I like lima beans
than it would be to say
(19) Lima beans please me."
These examples show that if we follow the frame-semantics approach out-
lined by Fillmore (1977: 16-18),25 which includes such factors as perspec-
tivization, choice of verb and choice of construction, the idea of a single
choice may apply to larger units than the ones discussed so far.
A further case in point concerning probabemes is presented by the
equivalent of combinations such as wir drei oder die drei, which in English
tends to be the three of us (BNC: 137) and the three of them (BNC: 189)
rather than we three (BNC: 22) or they three (BNC: 4).
The three of us is a very good example of how difficult it is to account
for the recurrent chunks in the language. What we have here is a kind of
construction which could not really be called item-specific since it can be
described in very general terms: the + numeral + of + personal pronoun.
Again, bilingual dictionaries seem more explicit than monolingual diction-
aries. Langenscheidt's Power Dictionary (1997) and Langenscheidt Collins
Großwörterbuch Englisch (2004) give the three of us under wir, but no
equivalent under ihr or die.
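A pattern of this generality can be retrieved from raw text with a simple search. The sketch below (a hypothetical regular expression covering only a few spelled-out numerals and object pronouns, applied to an invented sentence) illustrates the point:

```python
import re

# Hypothetical sketch of "the + numeral + of + personal pronoun",
# restricted to a handful of spelled-out numerals and object pronouns.
PATTERN = re.compile(
    r"\bthe (two|three|four|five|six) of (us|you|them)\b",
    re.IGNORECASE,
)

text = ("The three of us went swimming, and later "
        "the two of them joined the four of you.")

print(PATTERN.findall(text))  # [('three', 'us'), ('two', 'them'), ('four', 'you')]
```

The ease with which the pattern can be stated in general terms underlines the point made above: the construction is recurrent and chunk-like without being item-specific.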
4. Meaning-carrying units
of several words but not to treat them any differently from single words.26
Units of meaning in this sense include elements that traditionally could be
classified as single words (beach), compounds (lighthouse), collocations
(sandy beaches, weak tea, set the table) or units such as bear resemblance
to or be of duration. Basically, all items that could be regarded as con-
structions (or at least item-based constructions) in the sense of form-
meaning pairings should be included in such an approach.27
There may be relatively little point in attempting a classification of all
the types of meaning-carrying units to be identified in the language (or a
language) in that little seems to be gained by expanding the lists of phrase-
ological units identified so far (Granger and Paquot 2008: 43-44; or Gläser
1990), especially since many of the units observed do not neatly fall into
any category.
In the light of the range of phraseological units identified it may rather
be necessary to radically rethink commonly established principles and cate-
gories in syntactic analysis. Thus Sinclair (1991: 110-111) points out that
the "of in of course is not the preposition of that is found in grammar
books" and likewise Fillmore, Kay and O'Connor (1988: 538) ask whether
we have "the right to describe the the" in the the Xer the Yer construction
"as the definite article".28 Similarly, from a semantic point of view it seems
counterintuitive to analyse number and deal as heads of noun phrases in
cases such as
(21) a number of novels <NW72>
(22) a great deal of time <NW46>
where one could also argue for an analysis in terms of complex determiners
(Herbst and Schüller 2008: 73-74).29 A further case in point is presented by
examples such as
(23) The possibilities, I suppose, are almost endless <VDE>
where an analysis which takes suppose as the governing verb and the clause
as a valency complement of the verb is not entirely convincing (which is
why such cases are given a special status in the Valency Dictionary of
English, for instance). In fact, there are good arguments for treating I suppose
as a phraseological chunk that has the same function as an adverb such as
presumably and can occur in the same positions of the sentence as adver-
bials with the same pragmatic function. It has to be said, however, that the
precise nature of the construction is more complex: it is characterized by
the quasi-monovalent use of the verb (or a particular class of verbs com-
prising, for example, suppose, assume, know or explain) under particular
contextual or structural conditions, and at the same time one has to say that
the subject of the verb need not be a first person pronoun as in (23),
although statistically it very often is:
(24) Philip Larkin, one has to assume, was joking when he said that sexual
intercourse began in 1963. <VDE>
The precise identification and delimitation of such units raises a number of
problems, however, both at the levels of form and meaning. In the case of
example (1), for instance, one might ask whether the unit to be identified is
[of duration] or [be* of duration] or [be* of 'time span' duration],
where be* stands for different forms of the verb be and 'time span' indi-
cates a slot to be filled by expressions such as ten weeks or short. One
might also argue in favour of a compositional account and not see be as
part of the unit at all. At the semantic level, one would have to ask to what
extent we are justified in treating expressions such as [be* of 'time
span' duration] or [last 'time span'] as alternative expressions of the same
meaning or not. In his description of the idiom principle, Sinclair (1991:
111-112) himself points at the indeterminate nature of such "phrases" by
pointing out their "indeterminate extent", "internal lexical" and "syntactic
variation" etc.30 Nevertheless, the question of to what extent formal and
semantic criteria coincide in the delimitation of chunks deserves further
investigation, in particular with respect to the notions of predictability and
storage relevant in foreign language linguistics and cognitive linguistics.
There can be no doubt that in the last thirty or forty years, "the analysis of
language has developed out of all recognition", as John Sinclair wrote in
1991, and, indeed, the "availability of data" (Sinclair 1991: 1) has contrib-
uted enormously to this breakthrough. On the other hand, the use and the
interpretation of the data available is easier in some areas than in others.
For instance, we can now find out from corpora such facts as that
- there is a very similar number of occurrences of the verbs suppose
(11,493) and assume (10,956) and that 66 % of the uses of suppose
(7,606) are first person, but only 7 % (773) of those of assume;
- the verb agree (span ±1) shows 113 co-occurrences with entirely, 31
with fully, 27 with wholeheartedly (or whole-heartedly), 25 with
totally, 20 with completely and 9 with wholly;
- and that 88 % of the entirely agree cases are first person singular, but
only 52 % of the fully agree cases, compared with only 16 % of all
uses of agree.31
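Proportions of this kind can be approximated even without parsed data by checking whether the token immediately preceding the verb is a first-person pronoun. The following sketch (invented example sentences, and a deliberately crude heuristic that ignores real subjecthood) illustrates the idea:

```python
FIRST_PERSON = {"i", "we"}

def first_person_share(sentences, verb):
    """Proportion of occurrences of `verb` whose immediately preceding
    token is a first-person pronoun (a crude proxy for first-person
    subjecthood; a real study would use parsed or tagged data)."""
    total = hits = 0
    for sent in sentences:
        tokens = [t.lower() for t in sent.split()]
        for i, tok in enumerate(tokens):
            if tok == verb:
                total += 1
                if i > 0 and tokens[i - 1] in FIRST_PERSON:
                    hits += 1
    return hits / total if total else 0.0

# Invented example sentences, not BNC data.
sents = ["I suppose so", "She will assume control", "I suppose you are right",
         "We assume nothing", "They suppose otherwise"]
print(first_person_share(sents, "suppose"))  # 2 of 3 uses
print(first_person_share(sents, "assume"))   # 1 of 2 uses
```

On real corpus data such a heuristic would of course undercount cases like one has to assume in (24), which is precisely why the figures above rest on more careful analysis.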
With respect to the relevance of these data, one will probably have to say
that some of these findings can be explained in terms of general principles
of communication such as the fact that one tends to ask people questions of
the type
(25) Do you agree?
rather than
(26) Do you agree entirely?
Like the statistically significant co-occurrence of the verb buy with particu-
lar nouns such as house or the predominance of westerly winds in the Brit-
ish National Corpus, such corpus data can be seen as a reflection of certain
types of human behaviour or facts of the world described in the corpus
analysed. Although Sinclair (1991: 110) mentions "the recurrence of simi-
lar situations in human affairs" in his discussion of the idiom principle, co-
occurrences of this type may be of more relevance with respect to psycho-
linguistic phenomena concerning the availability of certain prefabricated
items to a speaker than to the analysis of a language as such. However, the
fact that I suppose is much more common than I assume makes it a proba-
beme which is relevant to foreign language teaching and foreign language
lexicography. This is equally true of the co-occurrence of entirely and
agree, where a comparison of the collocations of agree in the BNC with the
ICLE learner corpus shows a significant underuse of entirely by learners.
Thus, obviously, the insights to be gained from this kind of analysis have to
be filtered and evaluated with regard to particular purposes of research.
While the occurrence of certain combinations of words or the overall fre-
quency of particular words or combinations of words may be relevant to
some research questions, what is needed in foreign language teaching and
lexicography is information about the relative frequency of units expressing
the same meaning in the sense of the probabeme concept.32
Recognizing the idiom principle thus requires a considerable amount of
detailed and item-specific description, which is useful and necessary for
applied purposes and this needs to be given an appropriate place in linguis-
tic theory. At the same time, it is obvious that when we discuss the idiom
principle, we are not concerned with what is possible in a language but with
what is usual in a language - with de Saussure's (1916) parole, Coseriu's
(1973) Norm or what in British linguistics has been called use. In other
Notes
1 I would like to thank Susen Faulhaber, Eva Klein, David Heath, Michael
Klotz, Kevin Pike and Peter Uhrig for their valuable comments.
2 Compare Gries's (2008: 6) definition of a phraseologism as "the co-
occurrence of a form or a lemma of a lexical item and one or more additional
linguistic elements of various kinds which functions as one semantic unit in a
clause or sentence and whose frequency of co-occurrence is larger than ex-
pected on the basis of chance".
3 See, for instance, Altenberg (1998) and Johansson and Hofland (1989). See
also Biber (2009). Cf. also Mukherjee (2009: 101-116).
4 For the role of phraseology in linguistic theory see Gries (2008). For a com-
parison of traditional phraseology and the Sinclairean approach see Granger
and Paquot (2008: 28-29). For parallels with pattern grammar and construc-
tion grammar see Stubbs (2009: 31); compare also Gries (2008: esp. 12-15).
5 Cf. Croft and Cruse (2004: 225), who point out that "construction grammar
grew out of a concern to find a place for idiomatic expressions".
6 For a discussion of definitions of construction see Fischer and Stefanowitsch
(2006: 5-7). See also Goldberg (1995: 4).
7 For a detailed discussion of different concepts of collocation cf. Nesselhauf
(2005: 11-40) or Handl (2008). For a discussion of the frequency-based, the
semantically-based approach and a pragmatic approach to collocation see
Siepmann (2005: 411). See also Cowie (1981) or Schmid (2003). Handl
(2008: 54) suggests a multi-dimensional classification in terms of a semantic,
lexical and statistical dimension.
8 "Der Kollokator ist ein Wort, das beim Formulieren in Abhängigkeit von der
Basis gewählt wird und das folglich nicht ohne die Basis definiert, gelernt
und übersetzt werden kann" (Hausmann 2007: 218).
9 This table shows the number of occurrences of the adjectives listed with the
corresponding nouns (query adjective + {noun/N}). It must be stressed that
the figures given refer to absolute frequencies of occurrence and should in no
way be taken as a measure of collocational strength. Grey highlighting means
that the respective collocation is listed under the noun in the Oxford Colloca-
tions Dictionary (2002). Obviously, not all possible collocates of the nouns
have been included. Furthermore, one has to bear in mind that in some cases -
such as weak tea and light tea - the adjectives refer to different lexical units.
Cf. also Herbst (2010: 133-134).
10 For the criterion of "begrenzte Kombinationsfähigkeit" see Hausmann (1984:
396).
11 See also the pattern grammar approach taken by Hunston and Francis (2000).
The patterns in the table can be illustrated by sentences such as the following:
NP Vact NP NP: I wasn't really what you'd call a public school boy ... (VDE); NP
23 The picture is complicated somewhat by the fact that six months can occur as
a premodifier in noun phrases. Excluding uses of the type 4 to 6 months the
BNC yields the following figures: six months (3750), 6 months (187), six
month (135), 6 month (6), six-month (214), 6-month (14) versus half a year
(46), half year (171), half-year (81) and halfyear (1). However, it is worth
noting that in the BNC the verbs last and spend do not seem to co-occur with
half a year but that there are over 60 co-occurrences of the verb spend with
six months and 27 of the verb last (span ±5).
24 Example numbers added by me; running text in original.
25 Cf. also Fillmore (1976) and Fillmore and Atkins (1992).
26 This seems very much in line with the following statement by Firth (1968:
18): "Words must not be treated as if they had isolate meaning and occurred
and could be used in free distribution".
27 Cf. e.g. Goldberg (2006: 5) or Fillmore, Kay and O'Connor (1988: 534).
28 Compare also the list of complex prepositions given in the Comprehensive
Grammar of the English Language (1985: 9.10-11) including items such as
ahead of, instead of, subsequent to, according to or in line with, whose con-
stituents, however, are analyzed in traditional terms such as adverb, preposi-
tion etc.
29 On the other hand, within a lexically-oriented valency approach these of-
phrases could be seen as optional complements, which would have to be part
of the precise description of the corresponding units.
30 Note, however, the considerable amount of variation of idiomatic expressions
indicated in the Oxford Dictionary of Current Idiomatic English by Cowie,
Mackin and McCaig (1983). For the related problem of defining constructions
in construction grammar see Fischer and Stefanowitsch (2006: 4-12).
31 For a discussion of the collocates of agree on the basis of completion tests see
Greenbaum (1988: 118) and Herbst (1996).
32 For instance, it could be argued that specialised collocation dictionaries such
as the Oxford Collocations Dictionary would be even more useful to learners
if they provided some indication of relative frequency in cases where several
synonymous collocates are listed.
33 Compare also all the sun long, a grief ago and farmyards away discussed by
Leech (2008: 15-17).
34 The frequencies of door (27,713) and car (33,942) cannot account for this
difference. The analysis is based on the Erlangen treebank.info project (Uhrig
and Proisl 2011).
35 See Underwood, Schmitt and Galpin (2004: 167) for experimental "evidence
for the position that formulaic sequences are stored and processed holisti-
cally". Compare also the research carried out by Ellis, Frey and Jalkanen
(2009). See also Schmitt, Grandage and Adolphs (2004: 147), who come to
the conclusion that "corpus data on its own is a poor indication of whether
those clusters are actually stored in the mind".
36 The latest editions of learner's dictionaries such as the Longman Dictionary
of Contemporary English (LDOCE5), the Oxford Advanced Learner's Dic-
tionary (OALD8) and the Macmillan English Dictionary for Advanced
Learners (MEDAL2) make use of rather sophisticated ways of covering
multi-word units such as collocations; cf. Herbst and Mittmann (2008) and
Götz-Votteler and Herbst (2009), which can be seen as a direct reflection of
the developments described. Similarly, dictionaries such as the Longman
Language Activator (1993), the Oxford Learner's Thesaurus (2008) or the
thesaurus boxes of LDOCE5 list both single words as well as word combina-
tions under one lemma.
37 Compare the approach of collostructional analysis presented by Stefanowitsch
and Gries (2003).
References
Altenberg, Bengt
1998 On the phraseology of spoken English: The evidence of recurrent
word-combinations. In Phraseology: Theory, Analysis, and Applica-
tions, Anthony P. Cowie (ed.), 101-122. Oxford: Clarendon Press.
Behrens, Heike
2007 The acquisition of argument structure. In Valency: Theoretical, De-
scriptive and Cognitive Issues, Thomas Herbst and Katrin Götz-
Votteler (eds.), 193-214. Berlin/New York: Mouton de Gruyter.
Behrens, Heike
2009 Usage-based and emergentist approaches to language acquisition.
Linguistics 47 (2): 383-411.
Biber, Douglas
2009 A corpus-driven approach towards formulaic language in English:
Extending the construct of lexical bundle. In Anglistentag 2008
Tübingen: Proceedings, Christoph Reinfandt and Lars Eckstein (eds.),
367-377. Trier: Wissenschaftlicher Verlag Trier.
Bybee, Joan
2007 The emergent lexicon. In Frequency of Use and the Organization of
Language, Joan Bybee (ed.), 279-293. Oxford: Oxford University
Press.
Coseriu, Eugenio
1973 Probleme der strukturellen Semantik. Tübingen: Narr.
Gilquin, Gaëtanelle
2007 To err is not all: What corpus and elicitation can reveal about the use
of collocations by learners. In Collocation and Creativity, Zeitschrift
für Anglistik und Amerikanistik 55 (3): 273-291.
Gläser, Rosemarie
1990 Phraseologie der englischen Sprache. Leipzig: Enzyklopädie.
Götz-Votteler, Katrin and Thomas Herbst
2009 Innovation in advanced learner's dictionaries of English. Lexico-
graphica 25: 47-66.
Goldberg, Adele E.
1995 A Construction Grammar Approach to Argument Structure. Chi-
cago/London: Chicago University Press.
Goldberg, Adele E.
2006 Constructions at Work: The Nature of Generalizations in Language.
Oxford/New York: Oxford University Press.
Granger, Sylviane
1998 Prefabricated patterns in advanced EFL writing: Collocations and
formulae. In Phraseology: Theory, Analysis and Applications, An-
thony Paul Cowie (ed.), 145-160. Oxford: Oxford University Press.
Granger, Sylviane
2011 From phraseology to pedagogy: Challenges and prospects. This
volume.
Granger, Sylviane and Magali Paquot
2008 Disentangling the phraseological web. In Phraseology: An Interdis-
ciplinary Perspective, Sylviane Granger and Fanny Meunier (eds.),
37-49. Amsterdam/Philadelphia: Benjamins.
Greenbaum, Sidney
1988 Good English and the Grammarian. London/New York: Longman.
Gries, Stefan Th.
2008 Phraseology and linguistic theory. In Phraseology: An Interdiscipli-
nary Perspective, Sylviane Granger and Fanny Meunier (eds.), 3-35.
Amsterdam/Philadelphia: Benjamins.
Handl, Susanne
2008 Essential collocations for learners of English: The role of colloca-
tional direction and weight. In Phraseology in Foreign Language
Learning and Teaching, Fanny Meunier and Sylviane Granger (eds.),
43-66. Amsterdam/Philadelphia: Benjamins.
Hausmann, Franz Josef
1984 Wortschatzlernen ist Kollokationslernen. Praxis des neusprachlichen
Unterrichts 31: 395-406.
Schmid, Hans-Jörg
2011 English Morphology and Word Formation. Berlin: Schmidt. 2nd
revised and translated edition of Englische Morphologie und
Wortbildung 2005.
Schmitt, Norbert, Sarah Grandage and Svenja Adolphs
2004 Are corpus-derived recurrent clusters psychologically valid? In For-
mulaic Sequences, Norbert Schmitt (ed.), 127-151. Amster-
dam/Philadelphia: Benjamins.
Siepmann, Dirk
2005 Collocation, colligation and encoding dictionaries. Part I: Lexico-
logical aspects. International Journal of Lexicography 18: 409-443.
Siepmann, Dirk
2011 Sinclair revisited: Beyond idiom and open choice. This volume.
Sinclair, John McH.
1991 Corpus, Concordance, Collocation. Oxford: Oxford University
Press.
Sinclair, John McH.
2004 Trust the Text: Language, Corpus and Discourse. London/New
York: Routledge.
Stefanowitsch, Anatol
2005 New York, Dayton (Ohio) and the Raw Frequency Fallacy. Corpus
Linguistics and Linguistic Theory 1 (2): 295-301.
Stefanowitsch, Anatol
2008 Negative entrenchment: A usage-based approach to negative evi-
dence. Cognitive Linguistics 19 (3): 513-531.
Stefanowitsch, Anatol and Stefan Th. Gries
2003 Collostructions: Investigating the interaction between words and
constructions. International Journal of Corpus Linguistics 8 (2):
209-243.
Stubbs, Michael
2009 Technology and phraseology: With notes on the history of corpus
linguistics. In Exploring the Grammar-Lexis Interface, Ute Römer
and Rainer Schulze (eds.), 15-31. Amsterdam/Philadelphia: John
Benjamins.
Tognini-Bonelli, Elena
2002 Functionally complete units of meaning across English and Italian:
Towards a corpus-driven approach. In Lexis in Contrast: Corpus-
Based Approaches, Bengt Altenberg and Sylviane Granger (eds.), 73-
95. Amsterdam/Philadelphia: Benjamins.
Tomasello, Michael
2003 Constructing a Language: A Usage-based Theory of Language Ac-
quisition. Cambridge, MA/London: Harvard University Press.
Dictionaries
1. Introduction
In the present article I have set myself a triple goal. First, I would like to
suggest a new take on the principles of idiom and open choice. Second, I
wish to highlight the need to complement these principles by what I have
chosen to term "the principle of creativity". Third, I shall endeavour to
show how these three principles can be applied conjointly to the teaching of
translation.
In 1991 the late John Sinclair, who is renowned for his pioneering work in
the field of corpus-based lexicography, propounded an elegantly simple
theory. In Sinclair's view, the prime determinants of our language behav-
iour are the principles of idiom and open choice, and the principle of idiom
takes precedence over the principle of open choice: "The principle of idiom
is that a language user has available to him or her a large number of
semi-preconstructed phrases that constitute single choices, even though
they might appear to be analysable into segments" (Sinclair 1991: 110).
The principle in question finds its purest expression in what Sinclair
(1996) terms the lexical item. Here is a straightforward example:
a red suitcase, a nice house, to wait for the postman
to curb one's anger, a peremptory tone, a cracked wall
the lisping bay, a procession of slurring feet, the water chuckled, the body ebbs
mind your own damn business, I can't believe that, history never repeats
itself or I love you function as indices pointing towards concepts or patterns
which are familiar to the linguistic community. The link between the lin-
guistic sequence and its semantic extension or "reference" is not, as Frath
and Gledhill apparently assume, a direct one. The social "value" of a lin-
guistic sign is constituted not by what it refers to, but rather by the conven-
tional manner in which it is used. In the present instance we should speak
of self-reference rather than reference, for we are here concerned with cases
where one act of communication refers to another rather than to an object.
Using the terminology of Gestalt psychology, we might say that a set
phrase is a linguistic "figure" or "fore-ground" which refers to a situational
"ground" or "back-ground" (cf.Feilke 1994: 143-151).
The validity of the principle of idiom has now been proved beyond dispute.
In several of his own publications Sinclair clearly demonstrated that his
assumption was correct (cf. Sinclair 1991, 1996, 1998); his ideas have apparently been taken up by exponents of construction grammar as well as by
scholars subscribing to other schools of thought; and the author of this arti-
cle has added a small stone to this vast edifice by extending the principle of
idiom in two directions.
First, I have shown that Sinclair's principle applies not only to isolated
words, but also to syntagmas comprising several units. Thus, for instance,
the word group with this in mind collocates with syntagmas such as let us
turn to and let us consider (for further details, see Siepmann 2005: 100-
105).
Second, I have followed up on the idea that semantic features exert a
collocational attraction on each other - an idea which, in my opinion, is
implicit in the postulate that there are such things as collocational configurations. Thus, in configurations such as the work is arranged in eight sections or the second volume consists of five long chapters, the nouns which
occupy the subject position contain the semantic feature /text/.
Semantic features, like living creatures, may be attracted to each other
even when they are separated by considerable distances. In extreme cases
there may be dozens of words between the semic elements involved. Con-
venient examples are provided by structures extending over formal sentence or even paragraph boundaries, e.g. certainly [...] but; it often seems
that [...]. Not so; and you probably think that [...]. Not so. I have termed
Sinclair revisited: beyond idiom and open choice 63
Numbers are significant, of course, but what really counts is quality: only a
few inventions end up as big money-spinners. Dr Narin thinks he can spot
these, too. Patent applications often refer to other patents. By looking at the
number of times a particular patent is cited in subsequent applications, and
comparing this to the average number of citations for a patent in that industry, he gains a measure of its importance to the development of the field.
(The Economist 20.11.1992)
3. Commentary
At 9pm, with the lights, telly, dishwasher and washing machine on, Wattson
is registering a whopping £3,000. But what most shocks me is the power be-
ing drawn from the socket when my appliances are supposedly turned off.
(Times Online 15.7.2004)
It would be counter-intuitive to postulate a correlation between the use of
specific terms and a non-specific linguistic phenomenon like topic shifting.
Yet it is precisely this postulate that is borne out by our statistics. If we
look at the adjectives and adjectival phrases that crop up repeatedly in
pseudo-cleft sentences hinging on the copula to be, we find that some 70%
of these sentences contain a limited number of specific lexical items. This
can be seen from the following table:
It follows from this that even the use of certain syntactic constructions is, to
a certain extent, determined by the principle of idiom, even though writers
may have some leeway regarding the meanings that have to be expressed.
Does this mean that all linguistic behaviour boils down to the principle
of idiom? The answer is "no". Nonetheless, there is no escaping the fact
that language users have little room for manoeuvre in situations where they
have to arrange morphemes, individual words and syntagmas consisting of
several lexical items. In such situations the principle of open choice only
comes into play when language users display their incapacity to conform to
linguistic norms or happen to be motivated by the desire to flout such
Such analogical transfers underlie the phenomena which Hoey (2005) de-
scribes as "semantic associations". Hoey argues that a word combination
such as a two-hour drive is based on an associative pattern of the type
By combining the verb sigh with an inanimate noun (gust), Murdoch per-
sonifies the wind. But how can we classify a word combination like sighing
gusts? It is neither a co-creation ("a regularly formed, normal-sounding
combination"9) nor a collocation ("a manifestly current combination"). If
we adopt Hausmann's classification, we are therefore forced to conclude
68 Dirk Siepmann
It is interesting to compare the extract from The Red and the Green
(1965) with a passage from a much later novel entitled The Good Appren-
tice (1985).
Here we have exactly the same poetic word combination as in The Red and
the Green, but Amberni's translation is quite different from the rendering
suggested by Soulac. Instead of applying the creativity principle, Amberni
has gone to the opposite extreme by opting for an open choice which sa-
vours of affectation (en rafales de soupirs). This calls for a number of
comments. Although the pattern en NP de NP is quite common in contemporary French (e.g. en cascades de diamants), en NP de soupirs is an extremely rare pattern, and en rafales de NP is subject to severe selectional
restrictions. In French prose we often come across syntagms where the slot
following the preposition en is occupied by a noun denoting a sudden gush
of fluid or semi-liquid matter (e.g. en cascades d'eau claire, en torrents de
boue), but such syntagms sound distinctly odd as soon as we replace cascades or torrents by rafales. In meteorological contexts syntagms such as
en rafales de 45 nœuds or en rafales de pluies torrentielles sound perfectly
normal, but these patterns are only marginally acceptable when nouns like
nœuds, pluie or grêle are replaced by words denoting sounds or emotions
(soupirs, râles, haine, rage). Amberni's phrase is not un-French, but it is a
counter-creation and therefore sounds much less natural than Murdoch's
phrase.
Our third and final example provides even more convincing evidence of
the workings of the creativity principle. Here is a sentence which shows
how a vivid stylistic effect can be achieved by means of a syntactic trans-
formation:
Sarah Harrison, a slimly attractive, brown-eyed brunette in her late twenties
[= slim and attractive] (Dexter 2000: 29)
5.1. Polysemy
the English collocation have an interest (+ in) can mean either to be interested or to have a stake.14 Even Sinclair's (1996) prototypical example (the
postulated link between the phraseological combination with / to the naked
eye and the notions of difficulty and visibility) can pose problems for the
translator since the French expression à l'œil nu is not always used in the
same way as with / to the naked eye.15
Sense 2 (present in French but not in English) = strikes the eye:
[...] le contraste avec les Etats-Unis se voit à l'œil nu
Les « accords de Matignon » sont devenus fragiles. Les fêlures se voient à l'œil nu.
[...] presque à l'œil nu
Il est vrai que quiconque peut constater à l'œil nu que les dégâts sont conséquents [...] (= sans regarder de près)
Sans cette échappée lusitanienne, on voit bien, à l'œil nu, que dans « bijou » et « jouet », il y a « jou », comme dans « joujou » [...].
blanches (or des dents toutes blanches). This shows that translation
work requires a perfect mastery of both the source and the target language - a mastery that can only be attained with the aid of large corpora.16
2. Transposition is impossible in cases where the qualities expressed by
the adverb and the adjectives do not add up (cf. Ballard 1987: 189; Gallagher 2005: 16):
previously white areas - des zones jusqu'alors blanches
3. Other kinds of transposition might be envisaged (e.g. éblouissant de
blancheur).
4. The example cited by Chuquet and Paillard is atypical, for the colour
adjective white is generally combined with other adjectives (e.g. pure,
dead, bright, brilliant) or with nouns and adjectives designating white
substances (e.g. milk(y), cream(y), chalk(y)). Moreover, white often occurs in comparative expressions like as white as marble. It is this type
of word combination that ought to constitute the starting-point for any
systematic contrastive study of the combinatorial properties of white,
whiteness, blanc and blancheur. When we embark on this kind of study
we soon notice that the French almost invariably use constructions such
as d'une blancheur / d'un blanc + ADJECTIVE (absolu(e), fantomatique,
laiteux, laiteuse, etc.) or d'une blancheur de + NOUN / d'un blanc de
(craie, écume, porcelaine, etc.). The English, by contrast, use a variety
of expressions such as pure white, ghostly white, milky white or as
white as foam.
clusters of facts which are unrelated to the creatures which observe them. If
we compare the collocations of the French noun impression with their Eng-
lish equivalents, we find that avoir l'impression is rarely rendered by its
direct equivalent have the impression. One of the reasons for this is that the
word combination avoir l'impression frequently occurs in subjectivist constructions. Here is an example from Multiconcord (Groß, Müller and Wolff
1996):
Comment se peut-il qu'en l'espace d'une demi-heure, alors qu'on s'est
borné à déposer les bagages dans le vestibule, à préparer un peu de café, à
sortir le pain, le beurre et le miel du réfrigérateur, on ait une telle impression
de chaos?
How in the world did it happen that within half an hour - though all they
had done was to make some coffee, get out some rye crisp, butter, and
honey, and place their few pieces of baggage in the hall - chaos seemed al-
ready to have broken loose, [...].
The same kind of interlingual divergence can be observed when we exam-
ine the English translation equivalents of French noun phrases where im-
pression is followed by the preposition de and another noun:
cette impression de vertige disparut - this giddiness disappeared
6. Conclusion
Notes
de l'unité lexicale à la phrase, au paragraphe, au texte tout entier. Elle n'a ainsi pas d'existence 'en soi'" (Frath/Gledhill 2005a).
3 Michael Hoey describes these moulds as "primings" (Hoey 2005).
4 Herbst and Klotz (2003: 145-149) use the term probabeme to denote multiword units which speakers are likely to use to express standard thought configurations. Thus, according to Herbst and Klotz, a native speaker of English
might say grind a cigarette end into the ground, while a native speaker of
German would probably use the verb austreten to express the same idea. The
kinetic verb grind evokes a vivid image of a cigarette butt being crushed be-
neath a foot, while the less graphic verb austreten gives more weight to the
idea of extinction. One might postulate a gradient ranging from valencies to
probabemes via collocations (in the traditional sense of the term).
5 This example, like the one that follows, was suggested by John D. Gallagher.
6 The search was carried out on 11 March 2008.
7 His eyes widened with shock and her eyes sparkled with happiness have a
distinctly literary flavour, but an expression like out of his mind with grief
might occur in everyday conversation. This shows that the creativity principle
operates at every level.
8 The hypothesis that semantic features exert a powerful attraction on each
other offers a plausible explanation for word combinations like his eyes spar-
kled with joy / happiness /glee and he was light-headed / almost unconscious
with tiredness.
9 Cf. Hausmann (1984) and (2007).
10 Hausmann cites word combinations such as la route se rabougrit and le jour
est fissuré.
11 When we looked for la route se rabougrit and le jour est fissuré we were
unable to find any occurrences on the Internet or in our corpus.
12 It should, however, be noted that en bourrasque is more common.
13 We can say le cheval est encore capable d'élever son niveau or il faut élever
le niveau du débat.
14 For further examples see Siepmann (2006b).
15 According to Sinclair, the phraseological combination the naked eye consists
of a semantic prosody ("difficult"), a semantic preference ("see"), a colliga-
tion (preposition) and an invariable core (the collocation "the naked eye").
Our own findings indicate that the semantic prosody postulated by Sinclair is
not always present in English (cf. our counter-examples). In our opinion "de-
gree of difficulty" would be a more appropriate expression here.
16 Similar remarks apply to the word combination critically ill, which can be
rendered as dans un etat grave or gravement malade (cf. Chuquet and Paillard
1987: 18).
17 A particularly enlightening example can be found in a translation manual by
Mary Wood (Wood 1995: 106, 109 [note 24]).
18 Cf. the following example from a French daily: Si tel devait être le cas, Primakov ne ferait que retrouver, dans l'ordre intérieur, la fonction que lui avait
dévolue en son temps son véritable patron historique, Youri Andropov, chef
du KGB de 1967 à 1982, sur la scène diplomatico-stratégique du monde
arabe: organiser, canaliser et modérer, en attendant des jours meilleurs, les
bouffées néo-staliniennes en provenance de l'Orient compliqué. (Le Monde
5.11.1998: 17)
19 It should however be noted that en attendant des jours meilleurs is more
common than en attendant des jours heureux.
20 When we searched the Web we found fewer than ten examples in texts writ-
ten by native speakers of English.
21 We use automatic in the fullest sense of the word.
22 We have demonstrated this by means of a detailed analysis of the collocations
of Fr. impression and their English translation equivalents.
References
Ballard, Michel
1987 La traduction de l'anglais au français. Paris: Nathan.
Chevalier, Jean-Claude and Marie-France Delport
1995 Problemes linguistiques de la traduction: L'horlogerie de Saint
Jerome. Paris: L'Harmattan.
Chuquet, Hélène and Michel Paillard
1987 Approche linguistique des problèmes de traduction. Anglais-
Français. Paris: Ophrys.
Crowther, Jonathan, Sheila Dignen and Diana Lea (eds.)
2002 Oxford Collocations Dictionary for Students of English. Oxford:
Oxford University Press.
Feilke, Helmut
1994 Common sense-Kompetenz: Überlegungen zu einer Theorie "sympathischen" und "natürlichen" Meinens und Verstehens. Frankfurt a.
M.: Suhrkamp.
Francis, Gill, Susan Hunston and Elizabeth Manning (eds.)
1998 Collins Cobuild Grammar Patterns. 2: Nouns and Adjectives. Lon-
don: HarperCollins.
Frath, Pierre and Christopher Gledhill
2005a Qu'est-ce qu'une unité phraséologique? In La phraséologie dans
tous ses états: Actes du colloque "Phraséologie 2005" (Louvain,
13-15 octobre 2005), Catherine Bolly, Jean René Klein and Béatrice
Lamiroy (eds.). Louvain-la-Neuve: Peeters. [cf. www.res-per-
nomen.org/respernomen/pubs/lmg/SEM12-Phraseo-louvam.doc].
Stubbs, Michael
2002 Words and Phrases: Corpus Studies of Lexical Semantics. Oxford:
Blackwell.
Vinay, Jean-Paul and Jean Darbelnet
1958 Stylistique comparée du français et de l'anglais. Paris: Didier.
Sources used
Dexter, Colin
1991 The Third Inspector Morse Omnibus. London: Pan Books.
Dexter, Colin
2000 The Remorseful Day. London: Pan Books.
Green, Julian
1942 Memories of Happy Days. New York: Harper.
Murdoch, Iris
1965 The Red and the Green. New York: The Viking Press.
Murdoch, Iris
1967 Pâques sanglantes. Paris: Mercure de France.
Murdoch, Iris
1987 L'apprenti du bien. Paris: Gallimard.
Murdoch, Iris
2001 The Good Apprentice. London: Penguin.
Wood, Mary
1995 Thème anglais: Filière classique. Paris: Presses Universitaires de
France.
1. V + impression
2. impression + ADJ
DE / OF + N
NOT MUCH OF AN
S.O.'S IMPRESSION IS OF S.TH. (my chief impression was of a city of retirees [...])
THE (ADJ) IMPRESSION IS ONE OF (e.g. great size)
A (ADJ) / THE IMPRESSION IS THAT + CLAUSE
EVERY
OWN (my own impression from the literature is that [...])
THE SAME
SOME
CLINICAL (special.) (the clinical impression of hepatic involvement)
Accessing second-order collocation through lexical co-occurrence networks
Eugene Mollet, Alison Wray and Tess Fitzpatrick
1. Introduction2
Before the computer age, there were many questions about patterns in lan-
guage that could not be definitively answered. Dictionaries were the pri-
mary source of information about word meaning, and said little if anything
about the syntagmatic aspect of semantics - how words derive meaning
from their context. Now, with computational approaches to language de-
scription and analysis, we have an Aladdin's cave of valuable information,
and can pose questions, the answers to which derive from the sum of many
instances of a word in use.
However, there remain questions with elusive answers, and there is still
an onward journey for linguistic research that is dependent on new ad-
vances in computation. This paper offers one contribution to that journey,
by proposing a method by which new and useful questions can be posed of
text. It addresses a challenge that Sinclair identified: "What we still have
not found is a measure relating the importance of collocation events to each
other, but which is independent of chance" (Sinclair, Jones and Daley
[1970] 2004: xxii, emphasis added).
Specifically, we have applied an existing analytic method, network
modelling, to the challenge of finding out about patterns of lexical co-
occurrence. To date, networks have been used little, if at all, as a practical
means of extracting collocation, and for good reason: they are computa-
tionally heavy and render results that, even if different enough to be worth
using, remain broadly compatible with those obtained using the existing
statistical approaches such as T-score (see later). However, the very com-
plexity that makes networks a disadvantageous option for exploring basic
co-occurrence patterns also presents a valuable opportunity. For encoded in
a network is information that is not available using other methods, regard-
ing notable patterns of behaviour by one word within the context of another
word's co-occurrence patterns. It is this phenomenon, which we term 'sec-
ond-order collocation'3 that we propose offers very valuable opportunities
not only for investigating subtle aspects of the meanings of words in con-
text, but also for work in a range of applied domains, including critical dis-
course analysis, authorship and other stylistics studies, and second language
acquisition research.4
In the remainder of this paper we develop the case for using network mod-
els and exemplify the process in detail. In section 2, we review the existing
uses of lexical networks in linguistic analysis and consider the potential and
limitations of using them to examine first-order collocations. We also ex-
plain what a network is and how decisions are taken about the parameters
for its construction. In section 3 we describe the network model adopted
here and illustrate the model at work using a corpus consisting of just two
lines of text by Jane Austen. In section 4 we demonstrate what the model
can do, exploring the lexical item ORDER in the context of the lexical item
SOCIAL on the one hand and of the tag <[MATHEMATICAL]FORMULAE> on
the other. Finally, we suggest how second-order collocation information
might be used in linguistic analysis.
3. Method
In this section we describe the principles of the network algorithm used for
the illustration presented in section 4. We have selected an algorithm that
we think is generally effective, but we have constrained it here in ways that
make it easier to demonstrate. There is, in other words, broader scope for
parameter setting than is exemplified here. The generation of a network
proceeds in two stages. The first is mathematical. A computer program ap-
plies an algorithm to extract information from a text or corpus of texts and
to calculate relationships. As will become clear, the procedures combine to
calculate measures of weighted curvature (defined later), as an expression
of the multidimensional space in which a lexical item operates and by
which it is influenced. Second, the results of this analysis are filtered to
produce a graphic expression of selected information.
In our algorithm, two lexical items (types), A and B, are linked if B occurs
within a window of four content words either side of A. In the early days of
computational research into collocation, Sinclair identified a five-word
window as optimal (Krishnamurthy 2006: 596). Our window is smaller, but
wider in scope, because we have elected to capture only content words -
see below for the reason. As a result, many more content words will tend to
be linked in our ±4 window than would be the case in a standard five-word
window. We have also experimented with other types of window, including
the authorial sentence, and find some merit in them for certain kinds of
analysis. The parameters must be set with consideration of one's specific
research question. As noted earlier, simply linking words occurring in the
same window means that we have chosen not to encode directionality. An-
other algorithm could encode it, by, for instance, linking A and B only if B
followed A within the window.
The weight of the connection between A and B is encoded here accord-
ing to a measure of the distance between them. We have used the reciprocal
of the distance in content words (1/distance), but other options also exist,
including 1/distance². 'Distance' in both cases can be understood as the
number of content word steps between A and B. If B is adjacent to A, only
one step is required to get from A to B, so it scores 1/1 = 1. If B and A are
separated by one word, two steps are required, so the connection scores 1/2
= 0.5. If they are separated by two words, the score is 1/3 = 0.33, and so on.
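The windowing and weighting just described can be sketched in a few lines of code (our own illustration, not the authors' program; the function name and toy word list are ours):

```python
WINDOW = 4  # content words either side of the focus item

def linked_words(content_words, focus_index):
    """Yield (word, weight) for every content word inside the +/-4
    content-word window around the focus item. The weight is the
    reciprocal of the distance in steps: an adjacent word scores
    1/1 = 1, one intervening content word gives 1/2 = 0.5, and so on."""
    for i, word in enumerate(content_words):
        distance = abs(i - focus_index)
        if 0 < distance <= WINDOW:
            yield word, 1.0 / distance

# A toy window of content words with the focus item "sex" at index 4:
words = ["begin", "probably", "little", "bias", "sex",
         "bias", "build", "circumstance", "favour"]
links = {}
for word, weight in linked_words(words, focus_index=4):
    links[word] = links.get(word, 0.0) + weight
```

Here the two occurrences of "bias", each adjacent to the focus item, contribute 1 apiece, giving a combined link of 2.0, while "begin", four steps away, contributes 0.25.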
After calculating the score for each word pair occurrence within the ±4
window identified by the focus item, the program calculates the strength of
the overall relationship between each two-word pair, by adding up all the
scores calculated for that word pair. We also calculate the weighted dis-
tance, which is the total amount of weight of the connections to and from
each node. This measure indicates how important the node is in the context
we have created. When the network graph is drawn, there are mechanisms
for positioning the most important nodes at the centre. However, although
weight measures are very important, they combine information about fre-
quency and distance: frequency contaminates the values because a word
occurring frequently will have more connections, each contributing to the
weight score. That is, a total weight score of 2 could be the result of two
occurrences immediately adjacent, four with one word intervening, six with
two words intervening, etc. To alleviate this problem, we calculate the Fre-
quency Normalized Weighted Distance (FNWD). The effect of this calcula-
tion is to factor out the influence of frequency of occurrence (i.e. how
many tokens of a given type occur in the sampled windows) without affect-
ing the expression of frequency of co-occurrence (i.e. how often, when an
item occurs, it is in the vicinity of a given other item). The calculation of
FNWD entails squaring the values in each row cell, summing the squares
and taking the square root; each row is then divided by this norm, and the sum of the columns is calculated.
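On our reading of this description, each row of the symmetric weight table is divided by its Euclidean norm and the columns of the normalized table are then summed. The sketch below is our own reconstruction, not the authors' code; applied to the full 13 x 13 weight table of the Jane Austen illustration later in this section, it reproduces the FNWD values reported there (e.g. 5.55 for BIAS, 4.01 for LITTLE), which is what suggests this reading.

```python
import math

def fnwd(weights):
    """Frequency Normalized Weighted Distance, as we reconstruct it:
    square the values in each row, sum the squares and take the square
    root (the row norm); divide each row by its norm; then sum each
    column of the normalized table."""
    n = len(weights)
    norms = [math.sqrt(sum(v * v for v in row)) for row in weights]
    return [sum(weights[j][i] / norms[j] for j in range(n) if norms[j] > 0)
            for i in range(n)]

# A small symmetric weight table for three items:
table = [[0.0, 1.0, 0.5],
         [1.0, 0.0, 2.0],
         [0.5, 2.0, 0.0]]
scores = fnwd(table)
```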
Frequency normalized weighted distance provides a good indication of
the patterns of co-occurrence. Yet, among the long lists of words that co-
occur with a focus word, only a subset are of interest as collocates. In other
words, co-occurrence is a necessary, but not sufficient criterion for colloca-
tion. We want to know not only which words occur close to our focus word
but also which of them are there in some measure because of the focus
word, rather than because they just turn up rather indiscriminately. To ascertain this we need to understand the notion of curvature of a node in a
network. Curvature is the extent to which words that are each connected to
the focal item are also connected to each other. In other words, how often,
if word A is connected to word K and also to word M, do we find that K
and M are also connected?
Curvature provides an expression of, one might say, the 'promiscuity' of
a word in the created context. If a word tends to turn up in different lexical
company each time it occurs, the curvature value will be low. But if it co-occurs with the same words each time, then the curvature value will be high.
Accessing second-order collocation through lexical co-occurrence networks 95
It was noted earlier that we have elected to engage with just content words
in these illustrations. Stop-listing, the removal of certain words before the
networks are constructed, acts as a filter on certain kinds of co-occurrence
(and indeed collocation) that are not of interest. We have used a stop-list
that combines the Glasgow7 and New York University stop-words.8 The
effect of doing so is to apply a fairly brutal filter, comprising all function
words, all digits, numbers and isolated letter characters, as well as several
very high frequency words such as KNOW, LIKE - which tend to be semantically bleached - and a range of discourse-related adverbs, such as HOWEVER, ALTHOUGH. While some other words in the stop-list would be early
contenders for readmission for many kinds of analysis (e.g. FILL, MILL,
SIDE), for our present purposes their exclusion is not of major significance.
More generally, it will be clear that for many researchers, stop-listing all
but content words would be unhelpful. For instance, if one is trying to iden-
tify multiword lexical units, one would not want to exclude function words:
too many such units contain them - and indeed are differentiated from
other strings solely through them. In the same way, any analysis that re-
quires the distribution of proforms to be tracked would obviously not bene-
fit from their omission. Finally, we should note that we did not lemmatize
the data. Doing so conflates distributional information about particular
forms of a lemma, and we considered it better not to permit that to happen,
since "sometimes different forms of a lemma behave differently" (Stubbs
2001: 30).
We each begin, probably, with a little bias towards our own SEX; and upon that bias build every circumstance in favour of it
perhaps be a little soured by finding, like many others of his SEX, that through some unaccountable bias in favour of beauty
The distance between each content word and its neighbours is calculated.
As described in 3.1 above, distance is based on the number of intervening
content words. For example, in the first string, SEX is followed by BIAS
with no intervening content words. It scores 1/1 = 1. As it happens, BIAS
also precedes SEX with no intervening content word. This also scores 1/1 =
1, giving an interim total of 2 (figure 2c).
Continuing the SEX - BIAS calculation with the second string, in which they
also both occur, we see that the distance between them is 2 steps (SEX →
UNACCOUNTABLE → BIAS) (figure 2d). Accordingly this link scores 1/2 =
0.5. Thus, the final value for the link between SEX and BIAS is 2.5.
When this process has been repeated for all the possible combinations of
content words in the texts, we have a series of weight scores associated
with connections between pairs of words occurring in the selected windows
(table 1). Thus, at the intersection of LITTLE and BEAUTY lies a value of
0.14, deriving from their co-occurrence in the second string, at a distance of
seven content word steps (LITTLE → SOURED → FINDING → SEX → UNACCOUNTABLE → BIAS → FAVOUR → BEAUTY). The score is 1/7 = 0.14. Note
that this distance measure of seven content words is possible only because
the window has been defined as four content words either side of SEX. We
can liken it to defining the relationship between two people in a room who
do not know each other but both know our target academic.
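The worked example can be checked mechanically. The sketch below is our own, on the assumption that every pair of distinct content words within each nine-word window is scored; it reproduces the figures just cited and the cells of table 1.

```python
from itertools import combinations

# Content words of the two +/-4 windows around the focus item SEX.
string1 = ["begin", "probably", "little", "bias", "sex",
           "bias", "build", "circumstance", "favour"]
string2 = ["perhaps", "little", "soured", "finding", "sex",
           "unaccountable", "bias", "favour", "beauty"]

weights = {}
for window in (string1, string2):
    for i, j in combinations(range(len(window)), 2):
        a, b = window[i], window[j]
        if a != b:  # the diagonal of the weight table is zero
            pair = tuple(sorted((a, b)))
            # Reciprocal of the distance in content-word steps.
            weights[pair] = weights.get(pair, 0.0) + 1.0 / (j - i)
```

The two adjacent BIAS-SEX links in the first string (1 each) plus the two-step link in the second string (0.5) sum to 2.5, and LITTLE-BEAUTY at seven steps scores 1/7 ≈ 0.14, matching the values discussed above.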
Table 1. The weight scores for the content words in two strings from Jane Austen
Beauty Begin Bias Build Circ'e Favour Find'g Little P'haps Prob'ly Sex Soured Unac'ble
Beauty 0.00 0.00 0.50 0.00 0.00 1.00 0.20 0.14 0.12 0.00 0.25 0.17 0.33
Begin 0.00 0.00 0.53 0.17 0.14 0.12 0.00 0.50 0.00 1.00 0.25 0.00 0.00
Bias 0.50 0.53 0.00 1.33 0.75 1.53 0.33 1.53 0.17 0.75 2.50 0.25 1.00
Build 0.00 0.17 1.33 0.00 1.00 0.50 0.00 0.25 0.00 0.20 0.50 0.00 0.00
Circumstance 0.00 0.14 0.75 1.00 0.00 1.00 0.00 0.20 0.00 0.17 0.33 0.00 0.00
Favour 1.00 0.12 1.53 0.50 1.00 0.00 0.25 0.33 0.14 0.14 0.58 0.20 0.50
Finding 0.20 0.00 0.33 0.00 0.00 0.25 0.00 0.50 0.33 0.00 1.00 1.00 0.50
Little 0.14 0.50 1.53 0.25 0.20 0.33 0.50 0.00 1.00 1.00 0.83 1.00 0.25
Perhaps 0.12 0.00 0.17 0.00 0.00 0.14 0.33 1.00 0.00 0.00 0.25 0.50 0.20
Probably 0.00 1.00 0.75 0.20 0.17 0.14 0.00 1.00 0.00 0.00 0.33 0.00 0.00
Sex 0.25 0.25 2.50 0.50 0.33 0.58 1.00 0.83 0.25 0.33 0.00 0.50 1.00
Soured 0.17 0.00 0.25 0.00 0.00 0.20 1.00 1.00 0.50 0.00 0.50 0.00 0.33
Unaccountable 0.33 0.00 1.00 0.00 0.00 0.50 0.50 0.25 0.20 0.00 1.00 0.33 0.00
We can rank the words according to their total weight, by totalling the
rows. Note that as the table is a mirror image (as indicated by shaded and
unshaded areas) each cell entry occurs twice (signifying the connection A
to B and also the connection B to A). It is this fact that enables the fre-
quency normalized weighted distance calculation to be done later. The to-
tals from highest ranked down are: BIAS (11.17), SEX (8.32), LITTLE (7.53),
FAVOUR (6.29). The remaining words are in tied sets: FINDING and UNAC-
COUNTABLE (4.11), SOURED and BUILD (3.95), CIRCUMSTANCE and
PROBABLY (3.59)9 and BEAUTY, BEGIN and PERHAPS (2.71).
Next we apply the frequency normalizing procedure as described earlier.
The FNWD values are: BIAS (5.55), LITTLE (4.01), SEX (3.96), FAVOUR
(3.25), FINDING (2.03), SOURED (2.02), UNACCOUNTABLE (1.81), PROBA-
BLY (1.72), BUILD (1.66), CIRCUMSTANCE (1.55), PERHAPS (1.28), BEGIN
(1.24), BEAUTY (1.21). Note that these figures express the mutual relation-
ships between all of the content words. For example, in this rearranged
ranking SOURED is above UNACCOUNTABLE even though both words occur
only once (string 2) and the latter is closer to the focus word SEX than the
former. The reason is that the scores for SOURED and UNACCOUNTABLE are
determined in part by how their own neighbours behave: the 'space' de-
scribed by the network calculation is defined in terms of all the different
relationships in it, and the location of a given word in that space is deter-
mined by the nature of the entire space. SOURED is elevated relative to UN-
ACCOUNTABLE because it occurs closer to LITTLE, which itself occurs in
both strings and so has other connections in the 'space'. This is the crucial
principle underlying the weighted curvature measure.
Weighted curvature is the means by which we can establish how much
notice to take of co-occurrences - that is, do they signify that there is a true
STANCE and BEAUTY which are not joined to each other. The score is the
number of triangles as a proportion of the total possible triangles associated
with a given node. For instance, a triangle will be formed in the network
each time two words connecting to BIAS also co-connect (BIAS-
CIRCUMSTANCE-SEX; BIAS-BEAUTY-SEX). No triangle is formed if they do
not co-connect (BIAS-CIRCUMSTANCE-BEAUTY). As two of the three possible triangles with BIAS are found, it scores 2/3 = 67%.
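Taken in unweighted form, this triangle count is the local clustering coefficient familiar from network analysis; the minimal sketch below (our own illustration, not the authors' weighted algorithm) reproduces the 2/3 score for BIAS in the example just given:

```python
from itertools import combinations

def curvature(node, edges):
    """Share of the node's neighbour pairs that are themselves linked,
    i.e. completed triangles over possible triangles."""
    edge_set = {frozenset(e) for e in edges}
    neighbours = sorted({w for e in edges for w in e if node in e and w != node})
    pairs = list(combinations(neighbours, 2))
    if not pairs:
        return 0.0
    closed = sum(1 for p in pairs if frozenset(p) in edge_set)
    return closed / len(pairs)

# BIAS is linked to CIRCUMSTANCE, SEX and BEAUTY; CIRCUMSTANCE-SEX and
# BEAUTY-SEX are also linked, but CIRCUMSTANCE-BEAUTY is not, so two of
# the three possible triangles are closed.
edges = [("bias", "circumstance"), ("bias", "sex"), ("bias", "beauty"),
         ("circumstance", "sex"), ("beauty", "sex")]
```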
The full algorithm we use for weighted curvature is as follows. As
noted, it incorporates FNWD. It also allows for directionality. In non-
directional weighted calculations this aspect is simply ignored in the calcu-
lation, but having it there enables the use of a single algorithm for a range
of different specific analyses.
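The algorithm in question is Fagiolo's (2007) weighted, directed clustering coefficient (see note 10). A minimal NumPy sketch of that published formula follows; this is our rendering, not the authors' R implementation, and it assumes edge weights normalised to [0, 1]:

```python
import numpy as np

def weighted_directed_curvature(W):
    """Fagiolo's (2007) weighted, directed clustering coefficient:
    C~_i^D = [W^[1/3] + (W^T)^[1/3]]^3_ii / (2[d_i^tot(d_i^tot - 1) - 2 d_i^<->]).
    W is the weighted adjacency matrix (W[i, j] > 0 iff edge i -> j)."""
    A = (W > 0).astype(float)                  # binary adjacency matrix
    S = np.cbrt(W)                             # elementwise cube root, W^[1/3]
    # Half the diagonal of (S + S^T)^3 counts the weighted directed triangles.
    triangles = np.diagonal(np.linalg.matrix_power(S + S.T, 3)) / 2.0
    d_tot = (A + A.T).sum(axis=1)              # total (in + out) degree
    d_bil = np.diagonal(A @ A)                 # reciprocated (bilateral) edges
    possible = d_tot * (d_tot - 1) - 2 * d_bil
    return np.divide(triangles, possible,
                     out=np.zeros_like(triangles), where=possible > 0)
```

On a fully reciprocated complete graph with unit weights every node scores 1, as expected of a clustering coefficient; nodes that close no triangles score 0.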
counters that includes information about how often the same people turn up
together and how directly he engages with them.
Now we come to the visual representation of the relationships. Those
words which turn up in the vicinity of the focus word irrespective of the
context (i.e. the other words that co-occur with them are different each
time) have a low weighted curvature value. In a drawn network, such words
can be placed at the centre, close to the focus word (figure 4). The network
in figure 4 nicely illustrates that it is based on two strings, and that the
strings have four words in common. With larger texts, of course, the rela-
tionship back to the original will be obscured by the greater number of oc-
currences of words, and the patterns in the network, including the items
closest to the centre, will indicate distributions not so easily observed in the
original text(s).
Those with a primarily semantic focus may comment on how the find-
ings supplement the information in formal dictionary entries (Sinclair 2004:
133-136). That is, a full view of the meaning of an item is developed from
observing its behaviour in many different texts - by analyzing large cor-
pora. Meanwhile, for cognitive linguists, the meaning of an item, within
and outside a given context, is construed in terms of the individual's
knowledge. Knowledge is based on experience, which can include not only
previously seen texts but also the use of dictionaries. The cognitive view
aims to capture the process by which a reader turns observation into experi-
ence by relating it to previous knowledge.
The most significant contrast between the two approaches resides in the
inevitability of the cognitive view modelling each individual's knowledge
as unique, since it is dependent on his or her particular experiences. Where
corpus linguists would hope, by virtue of using a large enough collection of
material, to capture information about an item in 'the language', blurring
the differences arising from the subset of texts that any given individual has
encountered, the cognitive linguist must inevitably view the mental experi-
ence of a language as fundamentally discrete to each user, with the effec-
tiveness of communication entirely contingent on overlaps in experience
and, consequently, knowledge.
We find it helpful never to lose sight of the reader as interpreter of the
text. For one of the benefits of second-order collocation analysis may be the
opportunity to understand more about what a reader's 'intuition' is when
extracting meaning. In this way, we may be able to progress somewhat the
convergence of the two approaches, leading to a better understanding of the
relationship between patterns in the corpus and the role of the individual in
constructing and sustaining the platform of shared meaning that underpins
effective communication.
The curvature figures for SOCIAL (table 2) tell us that within the environ-
ment of SOCIAL, the item that itself co-occurs most often with other words
in SOCIAL's window is ECONOMIC. It is a 'well-connected' item. In terms of
the analogy, it represents a person who, when in the company of the De-
partment Chair, holds the greatest number of conversations of his own with
the others present. Recall that it does not tell us what that person does when
not in the Department Chair's company (this would entail a separate analy-
sis). However, it is not ECONOMIC that we are interested in, but ORDER, the
seventh most well-connected item according to the algorithm we have
used.13 Meanwhile, we see from table 3 that the most well-connected item
in the windows around <FORMULA> is EQUATION, while ORDER is the four-
teenth.
The next stage entails shifting the focus of our attention from the influ-
encing word to the second-order item, ORDER, our true target. We do this in
the context of the network graph that reflects the weighted curvature val-
ues. As noted earlier, an important aspect of creating network graphs from
the values is setting thresholds for what is visually represented, so as to en-
sure the graph can be easily read and interpreted. To this end, the values for
weighted curvature are set at a level that generates the desired amount of
detail - that is, so as to feature the desired number of nodes. Setting the
106 Eugene Mollet, Alison Wray and Tess Fitzpatrick
[Figure: the network graph for the environment of SOCIAL, drawn from the weighted curvature values. Legible node labels include ECONOMIC, POLITICAL, ORDER, CLASS, GENDER, COMMUNITY, INTERACTION, STRUCTURE, CULTURE, POWER, THEORY, BEHAVIOUR, SOCIETY and RESEARCH.]
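The thresholding step described above can be illustrated with a toy filter; the function name and the curvature scores below are invented for illustration:

```python
def visible_nodes(curvatures, threshold):
    """Keep only nodes whose weighted curvature meets the display
    threshold, so the drawn graph stays readable (hypothetical helper)."""
    return {w: c for w, c in curvatures.items() if c >= threshold}

# Invented scores; raising the threshold thins the graph out.
scores = {"ECONOMIC": 0.84, "POLITICAL": 0.61, "ORDER": 0.37, "LIKE": 0.08}
print(sorted(visible_nodes(scores, 0.3)))  # ['ECONOMIC', 'ORDER', 'POLITICAL']
```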
Next, the focus moves, so that ORDER is itself examined, within the same
graph. This is like turning to examine the interactions of the target aca-
demic in just those instances where one of the influencing other people is
present. What we do next is rather like asking the people present to gather
behaviour of other words that, like ORDER itself, are attractive collocates of
the respective first-order focus, SOCIAL or <FORMULA>.14
[Figure: the network graph re-centred on ORDER. Legible node labels include RESPONSIBILITY, CORPORATE, PROBLEM, SIGNIFICANT, APPROACH, PSYCHOLOGICAL, CAPITAL, INSTITUTIONS, PHYSICAL, CONTROL, CONCEPT, RESEARCH, DEVELOPMENT, CONDITIONS, STRUCTURE, PROCESS, STATE, CARE, THEORY, ROLE, POWER, GROUP, MEN, LANGUAGE and IDENTITY.]
5. Applications
To what uses, then, might this method be put in linguistic analysis? There is
relatively little impediment to learning the very simple procedures involved
in calculating weighted curvature values and generating network graphs.
The procedures involve pasting a short stretch of code into R, and import-
ing the desired corpus, which does not need to be stop-listed, and which can
be either tagged or not tagged. The instruction to analyze particular target
words is achieved by simply typing the word into the specified space in the
code ... and waiting. The main constraint on a PC is in not selecting items
that are too frequent, or working on a corpus that is too large. On more
powerful machines, more, of course, is possible (see Conclusion).
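For readers who want a feel for what such a procedure does, here is a minimal sketch of the data-collection step, in Python rather than R, omitting stop-listing, FNWD weighting and the curvature calculation itself; the function name and toy text are our own:

```python
import re
from collections import Counter
from itertools import combinations

def cooccurrence_pairs(text, focus, window=4):
    """Count how often pairs of words co-occur inside the +/- `window`
    environment of `focus`. A simplified sketch of data collection only."""
    words = re.findall(r"[a-z]+", text.lower())
    pairs = Counter()
    for i, w in enumerate(words):
        if w == focus:
            ctx = words[max(0, i - window): i] + words[i + 1: i + 1 + window]
            ctx = {c for c in ctx if c != focus}  # the focus word is not its own collocate
            pairs.update(frozenset(p) for p in combinations(sorted(ctx), 2))
    return pairs
```

These pair counts are the raw material from which the weighted adjacency matrix, and hence the curvature values, would be built.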
There are some linguists who will be contented simply to explore the
potential of the method to reveal interesting patterns. Others, however, will
have a clear research question in mind, and we list a small number here as
examples.
5.2. Stylistics
Research on style, including genre and authorship, might benefit from sec-
ond-order collocation, to capture aspects of how an effect is created
through language not so much through the selection of lexical items that are
themselves all that distinctive, but through the 'circle of friends' found to-
gether. For example, how might an advertising company manipulate public
perceptions of a product, company or political party by creating texts that
reveal no biases in their first-order collocations, but convey subtle positive
overtones through their second-order associations?
How does an author or playwright succeed in presenting characters dif-
ferently through descriptions of them, or through the words they use? Char-
acters in plays by Harold Pinter or Alan Bennett might be interesting to
track, by examining how certain target words are affected by other words in
one way when from the mouth of character A and another from the mouth
of character B.
Those who engage in authorship analysis might see potential to probe
deeper into the individual patterns seen in a person's writing, at a level
unlikely to be open to conscious control even when attempting to obscure
identity - for the curvature patterns may be interpreted as expressing the
composite knowledge of the writer about the meanings of words, with no
awareness that, at this level, the word's meaning and the author's personal
style are very closely entwined.
We are all familiar with the observation that one text's GUERILLA or TER-
RORIST is another text's FREEDOM FIGHTER, but just how subtle might the
secondary co-occurrence patterns of sets of words like this be? How, for
example, are the collocates of BOMB different, according to the description
used of the bombers? Would the description of a bomb as POWERFUL rather
than DEVASTATING be under the influence of the network of co-occurring
words that weave the subtle context leading to a 'preferred' interpretation?
How does political correctness impact on the wider network of colloca-
tions? What happens to the rest of the text when coffee is referred to as
'with milk/cream' rather than 'white'? That is, if we assume some general
network pattern for a set of words, what is the impact of removing one of
them? Can another one simply move into its place, or are all the other con-
nections too different for that to work? What more subtle levels of social
6. Conclusion
The method we have presented here promises, we believe, to offer new in-
sights into the nature of patterns in language, in the spirit of Sinclair's in-
terest in how collocation events relate to each other (Sinclair, Jones and
Daley 2004: xxii). We believe there are more layers to be uncovered in lan-
guage than current approaches in corpus linguistics easily reveal. They are
layers that the native speaker reader or hearer knows about and takes into
account when engaging with text. Each new foray into the exploration of
text patterns may bring us another step closer to modelling computationally
the essence of linguistic intuition. Second-order collocation patterns, as
described in this paper, may be central to such new developments for they
do not simply divide up the known world in new ways, but uncover infor-
mation of a different order. What second-order collocation measurements
reveal is something about a word's location in cognitive space - space that
is determined not only by the behaviour of the focus word, nor only of its
collocates, but also of its collocates' collocates. As the network image use-
fully reminds us, when everything is joined together, a movement in one
place has an impact on everything else.
Two final remarks should be made. The first regards the extent to which
information about secondary collocates is already available by other means.
In WordSmith Tools,16 it is possible to create a concordance for all occur-
rences of word A which have word B within a specified distance, such as
five words. Having created this 'sub-corpus', it is possible to explore the
collocations of word B. Where word B has more than one meaning, its col-
locates can be contrasted by selecting different contextualising words as A.
For example, one could explore the collocates of BANK (word B) in the
context of, separately, MONEY and RIVER as word A. However, this Word-
Smith procedure is much shallower than the network method. The collo-
cates of B are simply those that occur in the new sub-corpus of concor-
dances of A. They are computed locally and separately from the computa-
tions for A, whereas the network computes the information about B at the
same time as, and in relation to, A - and indeed all the other words that co-
occur. To put it another way, while the network method offers deeper views
into the total space of the collocates, the WordSmith approach maps only a sur-
face view and does not provide the analyst with the rich mathematical in-
formation that underpins the network relationships.
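The WordSmith-style procedure just described can be sketched as follows. This is a hypothetical Python illustration of the idea, not WordSmith Tools' own code, and the toy corpus is invented:

```python
from collections import Counter

def subcorpus_collocates(words, a, b, dist=5, window=4):
    """Keep only the occurrences of word B that have word A within
    `dist` words, then count B's neighbours in that 'sub-corpus'."""
    counts = Counter()
    for i, w in enumerate(words):
        if w == b and a in words[max(0, i - dist): i + dist + 1]:
            ctx = words[max(0, i - window): i] + words[i + 1: i + 1 + window]
            counts.update(c for c in ctx if c not in (a, b))
    return counts

corpus = "money in the bank vault near the river bank shore".split()
# Only the first 'bank' has 'money' within range; the river 'bank' is excluded.
print(subcorpus_collocates(corpus, "money", "bank"))
```

Note that each count here is computed locally, from B's concordance lines alone, which is precisely the shallowness contrasted above with the network method's global computation.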
A more promising approach to second-order collocation is developed in
Collier, Pacey and Renouf (1998), though the relationships identified there
are arguably closer to 'mutual collocations', because they capture informa-
tion about collocates shared between two lexical items rather than the inter-
relationships of all lexical items relative to a second-order focus. Useful
features of Collier et al.'s method might fruitfully be combined with those
of our own method in future research.
The second observation that should be made here regards the ambitious-
ness of a methodology that is computationally so demanding. Not only
have we laid out, as a baseline for network research, computations that take
some time to complete on a PC. We have ventured to imply in some places
that much larger-scale computations might be interesting to carry out. Al-
though the personal computer continues to grow in memory and in proces-
sor size, processing speeds have changed little in recent years. It may there-
fore seem pointless to lay out a research agenda too powerful to be under-
taken. However, new technological advances have shifted attention from
processing speed to overall processing potential through the creation of vast
parallel systems. High end computing operations line up thousands of pro-
cessors to operate together, sharing out time-costly jobs so they are com-
pleted in a fraction of the time. As a result, we have the opportunity, for the
first time, to pose a new level of questions about texts, knowing that there
is a means to answer them.
As Carter (2004: 6) points out in his introduction to Sinclair’s reissued
texts, “The landscapes of language study are changing before our eyes as a
result of the radically extended possibilities afforded by corpus and compu-
tational linguistics”. If language study is a landscape, then network anal-
yses offer us a chance to examine more than a two-dimensional map – ra-
ther, we can enter the terrain itself and explore the ways that words locate
themselves in multidimensional space.
Notes
the total number of connections to and from each node, as a share of the total
‘space’. Squaring preserves an underlying Euclidean geometry for nodes. This
feature, which we will not develop further here, provides some particularly
useful properties, notably that the inner product of rows yields the cosine of the
angle between those rows and the same inner product after centering (i.e.
subtracting the row mean from each element) and renormalization yields the
correlation coefficient (Pearson’s r) between the distributional patterns of the
nodes in question (Jackson 1924, Rodgers and Nicewander 1988 and, most
succinctly, Kundert 1980). This allows a more direct comparison between our
graph theoretical model and the more familiar vector models used in allied
fields such as Information Retrieval and also in earlier work on second-order
collocation, most notably Collier, Pacey and Renouf (1998).
7 http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words.
8 http://nlp.cs.nyu.edu/GMA_files/resources/english.stoplist.
9 The ties are a consequence of the text being too small for the scores to
separate further.
10 C̃_i^D refers to the weighted, directed clustering coefficient of node i. W is the
weighted adjacency matrix. This value (C̃_i^D) is the total number of weighted,
directed triangles centered on node i (t̃_i^D), divided by the total number of such
triangles possible (T_i^D). W^[1/3] refers to the matrix formed by raising all the
weights in W individually to the power 1/3. W^T refers to the transpose of the
matrix W. The notation [X]^3_ii refers to the ith diagonal entry of the third power
of the matrix X. The two remaining terms refer to the total (binary) degree of
node i (d_i^tot) and the number of bilateral edges incident to i (d_i^↔). For a more
detailed explanation of this formula see Fagiolo (2007). We implemented the
function in R from Prof. Fagiolo's own MATLAB code, and are grateful to Prof.
Fagiolo for providing it.
11 There is a discussion to have about whether we should imply that influence
can be laid at the door of just one co-occurring lexical item (though examples
such as Sinclair’s (2004) white in the context of wine suggest it sometimes
can). It is certainly true that an analysis of the type demonstrated here is lim-
ited by searching only for co-occurrences of the target and influencing items
within a narrow window of 4 content items either side. However, provided the
computational power is available (see Conclusion) analyses are not limited to
a small window size, nor in the potential to amalgamate the outcomes of
examining several different co-occurring items for their individual and joint
effects on the patterns observed. Even then, the analysis would miss out on
directly capturing aspects of meaning that are only implied, such as the
description of Governor Pyncheon in chapter 18 of Nathaniel Hawthorne's
The House of the Seven Gables, where the reader can only infer that he is in
fact dead. On the other hand, our analysis may have the capacity to help us
better understand just how the reader does work out that Pyncheon is dead –
how are the words interacting to permit that inference?
12 The British Academic Written English (BAWE) corpus was developed at the
Universities of Warwick, Reading and Oxford Brookes under the directorship
of Hilary Nesi and Sheena Gardner (formerly of the Centre for Applied Lin-
guistics/CELTE, Warwick), Paul Thompson (formerly of the Department of
Applied Linguistics, Reading) and Paul Wickens (Westminster Institute of
Education, Oxford Brookes), with funding from the ESRC (RES-000-23-
0800). The corpus is freely available via the Oxford Text Archive (resource
number 2539 http://ota.ahds.ac.uk/headers/2539.xml).
13 The algorithm can be modified according to the importance the research ques-
tion places on proximity. For instance, exploring the occurrence of preposi-
tions at the end of certain multiword expressions would place less importance
on the same preposition occurring later in the window. Conversely, research
into the stylistic effects of lexical repetition would perhaps widen the window
and score distant co-occurrences more evenly with close ones than our algo-
rithm has.
14 This re-orientation results in different items appearing most proximal. This is
a function of the underlying additional nodes, suppressed on the graph. The
true picture is the one obtained from the values that generate the visual repre-
sentation.
15 They largely failed.
16 We are grateful to Chris Butler for pointing this out.
References
Caldeira, Silvia M. G., Thierry C. Petit Lobao, R. F. S. Andrade, Alexis Neme and
J. G. V. Miranda
2006 The network of concepts in written texts. European Physical Journal
B 49: 523-529.
Carter, Ronald
2004 Introduction. In Trust the text, John McH. Sinclair and Ronald Carter
(eds.), 1-6. London/New York: Routledge.
Collier, Alex, Mike Pacey and Antoinette Renouf
1998 Refining the automatic identification of conceptual relations in large-
scale corpora. Proceedings of the Sixth Workshop on Very Large
Corpora. Association for Computational Linguistics.
http://www.aclweb.org/anthology-new/W/W98/W98-1109.pdf
Csardi, Gabor
2008 igraph 0.5.1. Url: http://cneurocvs.rmki.kfki.hu/igraph/index.html.
Evert, Stefan
2005 The Statistics of Word Cooccurrences: Word Pairs and Collo-
cations. Ph.D. thesis, Universität Stuttgart: Institut für maschinelle
Sprachverarbeitung. http://elib.uni-stuttgart.de/opus/volltexte/2005/
2371/pdf/Evert2005phd.pdf
Fagiolo, Giorgio
2007 Clustering in complex directed networks. Physical Review E 76:
026107.
Ferrer i Cancho, Ramon
2004 The Euclidean distance between syntactically linked words. Physical
Review E 70: 056135.
Ferrer i Cancho, Ramon
2005 The structure of syntactic dependency networks: Insights from recent
advances in network theory. In The Problems of Quantitative Lin-
guistics, Gabriel Altmann, Viktor Levicky and Valentma Perebyims
(eds.), 60-75. Chermvtsi:Ruta.
Ferrer i Cancho, Ramon
2007 Why do syntactic links not cross? Europhysics Letters 76 (6): 1228-
1234.
Ferrer i Cancho, Ramon, Andrea Capocci and Guido Caldarelli
2007 Spectral methods cluster words of the same class in a syntactic de-
pendency network. International Journal of Bifurcation and Chaos
17 (7): 2453-2463.
Ferrer i Cancho, Ramon and Richard V. Sole
2001 The small-world of human language. Proceedings of the Royal Soci-
ety of London Series B 268: 2261-2266.
Meara, Paul M.
2007 Simulating word associations in an L2: The effects of structural
complexity. Language Forum 33 (2): 13-31.
Meara, Paul M. and Ellen Schur
2002 Random word association networks: A baseline measure of lexical
complexity. In Unity and Diversity in Language Use, Kristyan
Spelman Miller and Paul Thompson (eds.), 169-182. London: Con-
tinuum.
Mehler, Alexander
2008 Large text networks as an object of corpus linguistic studies. In Cor-
pus Linguistics: An International Handbook, vol. 1, Anke Lüdeling
and Merja Kytö (eds.), 328-382. Berlin: Mouton de Gruyter.
Milroy, Lesley
1987 Language and Social Networks (2nd edition). Oxford: Blackwell.
Motter, Adilson E., Alessandro P. S. de Moura, Ying-Cheng Lai and Partha Das-
gupta
2002 Topology of the conceptual network of language. Physical Review E
65:065102.
Park, Young C. and Key-Sun Choi
1999 Automatic thesaurus construction using Bayesian networks. Informa-
tion Processing and Management 32 (5): 543-553.
R Development Core Team
2008 R: A language and environment for statistical computing. R Founda-
tion for Statistical Computing, Vienna, Austria. http://www.R-
project.org.
Rodgers, Joseph Lee and W. Alan Nicewander
1988 Thirteen ways to look at the correlation coefficient. The American
Statistician 42(1): 59-66.
Schur, Ellen
2007 Insights into the structure of L1 and L2 vocabulary networks: Intima-
tions of small worlds. In Modelling and Assessing Vocabulary
Knowledge, Helmut Daller and James Milton (eds.), 182-203. Cam-
bridge: Cambridge University Press.
Sigman, Mariano and Guillermo A. Cecchi
2002 Global organization of the Wordnet lexicon. Proceedings of the Na-
tional Academy of Sciences of the United States of America 99 (3):
1742-1747.
Sinclair, John McH.
2004 Trust the Text. London: Routledge.
Sinclair, John McH., Susan Jones and Robert Daley
2004 English Collocation Studies: The OSTI Report. London/New York:
Continuum. First published 1970.
Soares, Marcio Medeiros, Gilberto Corso and Liacir dos Santos Lucena
2005 The network of syllables in Portuguese. Physica A 355 (2-4): 678-
684.
Sole, Richard V., Bernat Corominas-Murtra, Sergi Valverde and Luc Steels
2005 Language Networks: Their Structure, Function and Evolution. Tech-
nical Report 05-12-042, Santa Fe Institute Working Paper.
http://www.santafe.edu/research/publications/workingpapers/05-12-
042.pdf
Steyvers, Mark and Joshua B. Tenenbaum
2005 The large-scale structure of semantic networks: Statistical analyses
and a model of semantic growth. Cognitive Science 29: 41-78.
Stubbs, Michael
2001 Words and Phrases. Oxford: Blackwell.
Vitevitch, Michael S.
2008 What can graph theory tell us about word learning and lexical re-
trieval? Journal of Speech, Language, and Hearing Research 51: 408-
422.
Wilks, Clarissa and Paul M. Meara
2002 Untangling word webs: Graph theory and the notion of density in
second language word association networks. Second Language Re-
search 18(4): 303-324.
Wilks, Clarissa and Paul M. Meara
2007 Implementing graph theory approaches to the exploration of density
and structure in L1 and L2 word association networks. In Modelling
and Assessing Vocabulary Knowledge, Helmut Daller, James Milton
and Jeanine Treffers-Daller (eds.), 167-182. Cambridge: Cambridge
University Press.
Wilks, Clarissa, Paul M. Meara and Brent Wolter
2005 A further note on simulating word association behaviour in a second
language. Second Language Research 21: 1-14.
Zhou, Shuigeng, Guobiao Hu, Zhongzhi Zhang and Jihong Guan
2008 An empirical study of Chinese language networks. Physica A 387:
3039-3047.
Zlatic, Vinko, Miran Bozicevic, Hrvoje Stefancic and Mladen Domazet
2006 Wikipedias: Collaborative web-based encyclopedias as complex
networks. Physical Review E 74: 016115.
From phraseology to pedagogy: challenges and prospects
Sylviane Granger
1. Introduction
dauntingly wide and has so far not been given the attention it deserves.
Lewis's (1993, 2000, [1997] 2002) Lexical Approach has admittedly
opened up exciting new avenues for pedagogical implementation and cre-
ated an upsurge of interest in lexical approaches to teaching. However, the
diverging interpretations of the very concept of lexical approach and the
very forceful pronouncements found in the literature are liable to create
confusion in the minds of teachers and materials designers and may even
end up being less - rather than more - efficient in learning terms. This
chapter is an effort to reconcile Sinclair's contextual approach and the reali-
ties of the teaching/learning environment. In section 2, I start by defining
the lexical approach and circumscribing its scope before highlighting what
I view as its major strengths and weaknesses (section 3). Section 4 de-
scribes three major challenges to the pedagogical implementation of the
lexical approach - methodology, terminology and selection - and focuses
more particularly on the potential contribution of learner corpora. Section 5
draws together the threads of the discussion and offers pointers for future
research.
204) does not consider the Lexical Approach as "a new all-embracing
method, but a set of principles based on a new understanding of language".
His list of 20 key principles (Lewis 1993: vi-vii) contains some fairly un-
controversial principles which most proponents of lexical approaches are
likely to adhere to, but also contains some more radical statements that are
implicitly or explicitly rejected by a number of them. In this connection, it
is interesting to note that Harwood feels the need to point out that his un-
derstanding of the term "lexical approach" is not exactly the same as
Lewis's although he does not explicitly say where the differences lie.
Among the principles on which there is no general consensus are
Lewis's pronouncements on grammar. In his 1993 book, he is extremely
critical of grammar, proposing "a greatly diminished role for what is usu-
ally understood by 'grammar teaching'" (Lewis 1993: 149), underlining
"the dubious value of grammar explanations" and advising teachers to treat
them "with some scepticism" (Lewis 1993: 184). In a later publication,
however, he expresses a more qualified view, saying: "The Lexical Ap-
proach suggests the content and role of grammar in language courses needs
to be radically revised but the Approach in no way denies the value of
grammar, nor its unique role in language" (Lewis 2002: 41). The revision
of the grammar syllabus involves introducing as lexical phrases a large
number of phenomena that used to be treated as part of sentence grammar:
"The Lexical Approach implies a decreased role for sentence grammar, at
least until post-intermediate levels. In contrast, it involves an increased role
for word grammar (collocation and cognates) and text grammar (supra-
sentential features)" (Lewis 1993: 3). Although Lewis (1993: 146) cites
phenomena like the passive or reported speech as items that "could unques-
tionably be deleted", he does not give a precise list of the candidates for
shifting and the contours of the reduced sentence grammar component re-
main very hazy. In spite of that, the principle has been taken over by sev-
eral proponents of the lexical approach. Porto (1998), for example, lists the
following phenomena as candidates for a shift from grammar to lexis: first,
second and third conditionals; the passive; reported speech; the -ing form;
the past participle; will, would and going to; irregular past tense forms; and
the concept of time which "may be most efficiently presented as lexis
rather than tense". No arguments are given that justify the selection of these
phenomena and it is interesting to note that other authors make quite differ-
ent selections. Lowe (2003), for example, suggests keeping core tenses in
his slimmed-down core grammar component.
126 Sylviane Granger
In this section I list what I view as the major strengths and weaknesses of
the lexical approach. By offering a non-partisan view of this approach, I
hope to contribute constructively to the debate surrounding it and help
counter-balance the somewhat over-optimistic and at times downright
dogmatic statements found in the literature.
3.1. Pros
The major advantage of the lexical approach is its close fit with contextual
models of language (Sinclair's model, construction grammar, pattern
grammar) that integrate the intertwining of lexis and grammar and give
phraseology a more central role in language than was previously the case.
The days when phraseology was viewed as a peripheral component of lan-
guage are dead and gone. Corpus-based studies have uncovered a "huge
area of syntagmatic prospection" (Sinclair 2004c), which contains a much
wider range of units than the highly fixed non-compositional units - id-
ioms, proverbs, phrasal verbs - that used to constitute the main focus of
attention. This wide view of phraseology includes "a large stock of recur-
rent word-combinations that are seldom completely fixed but can be de-
scribed as 'preferred' ways of saying things - more or less conventional-
ized building blocks that are used as convenient routines in language pro-
duction" (Altenberg 1998: 121-122). The relevance of this wide view of
phraseology for teaching is demonstrated by Nattinger and DeCarrico
(1992), who describe the essential functions of conventionalized lexical
phrases in discourse, both spoken and written, and suggest ways of incorpo-
rating them into teaching.
3.1.2. Fluency
3.1.3. Accuracy
In Sinclair's (2004a: 274) view, one of the positive outcomes of the contex-
tual approach to meaning is that it is likely to facilitate language learning:
"If a more accurate description eliminates most of the apparent ambiguities,
the language should be easier to learn because the relationship between
form and meaning will be more transparent". This idea is taken up by many
proponents of the lexical approach. Porto (1998), for example, states that
"[f]requency of occurrence and context association make lexical phrases
highly memorable for learners and easy to pick up". Lewis (2000: 133)
goes further and claims that memorability is enhanced by the length of the
phrase: "The larger the chunks are which learners originally acquire, the
easier the task of re-producing natural language later". The main argument
behind this assertion is that it is easier to deconstruct a chunk than to con-
struct it: "We have already seen that learners acquire most efficiently by
learning wholes which they later break into parts, for later novel re-
assembly, rather than by learning parts and then facing a completely new
task, building those parts into wholes" (Lewis 2002: 190). On this basis, he
gives the following advice to teachers: "don't break language down too far
in the false hope of simplifying; your efforts, even if successful in the short
term, are almost certainly counterproductive in terms of long-term acquisi-
tion" (Lewis 2000: 133). Strong assertions on the ease of learning afforded
by the lexical approach abound in the literature and many - though by far
not all - sound intuitively right. However, it is important to note that at this
stage they are more like professions of faith as validation studies are very
rare. Studies like those of Tremblay et al. (2008) and Ellis, Simpson-Vlach
and Maynard (2008) that demonstrate the effect of frequency of word se-
quences on ease of acquisition and production are still quite rare. At this
stage, therefore, ease of learning cannot entirely be taken for granted.
3.2. Cons
Within the lexical approach, "phrases acquired as wholes are the primary
resource by which the syntactic system is mastered" (Lewis 1993: 95). This
assertion, frequently found in the Lexical Approach literature, is based on
L1 acquisition studies which have demonstrated that children first acquire
chunks and then progressively analyze the underlying patterns and generalize
them into regular syntactic rules (Wray 2002). According to Nattinger and
DeCarrico (1992: 27), there is no reason to believe that L2 acquisition works
differently: "The research above concerns the language acquisition of children
in fairly natural learning situations. Because of infrequent studies of adult
learners in similar situations, the amount of prefabricated speech in adult ac-
quisition has never been determined. However, there is no reason to think that
adults would go about the task completely differently". Similarly, Lewis
130 Sylviane Granger
(1993: 25), while recognizing that the question is a contentious one, argues
that "it seems more reasonable to assume that the two processes are in some
ways similar than to assume that they are totally different". In fact, there are
very good reasons for doubting that L2 acquisition functions in the same way
as child acquisition in this respect. One major reason is that L2 learners do not
usually get the amount of exposure necessary for the "unpacking" process to
take place. In her overview of findings on formulaicity in SLA, Wray (2002:
148) notes that formulaic sequences do not seem to contribute to the mastery
of grammatical forms. While lexical phrases are likely to have some genera-
tive role in L2 learning, it would be a foolhardy gamble to rely primarily on
the generative power of lexical phrases. Pulverness (2007: 182-183) is right
to point to the "risk of the so-called 'phrasebook effect', whereby lexical
items accumulate in an arbitrary way, and learners are saddled with an
ever-expanding lexicon without the generative power of a coherent struc-
tural syllabus to provide a framework within which to make use of all the
lexis they are acquiring". Lowe (2003) insists on the crucial role played by
a process akin to "cobbling together", especially amongst L2 learners: "The
less expert we are, the more makeshift is our speech". The most sensible
course, as rightly pointed out by Wray (2002: 148), is to maintain "a balance
between formulaicity and creativity".
breadth rather than depth and in a context where teachers usually have a
very limited number of teaching hours at their disposal? Sinclair (2004a:
282) is well aware of "the risk of a combinatorial explosion, leading to an
unmanageable number of lexical items" and Harwood (2002: 142) expli-
citly warns against "learner overload", insisting that "implementing a lexi-
cal approach requires a delicate balancing act" between exploiting the rich-
ness of fine-grained corpus-derived descriptions and keeping the learning
load at a manageable level.
4.1. Methodology
suggestion that there are a few multi-word items which have in the past
been overlooked [my emphasis]" (Lewis 1993: 104). His later (2002)
statement is laden with ambiguity on this issue as he claims both that
"[implementing the Lexical Approach in your classes does not mean a
radical upheaval" and that "[implementation may involve a radical change
of mindset, and suggest many changes in classroom procedure" (Lewis
2002: 3). The ambiguity probably comes from the fact that Lewis wants to
leave the door open for both a strong and a weak implementation of the
lexical approach, though there is little doubt that the strong version has his
preference (2002: 12-16).
In my view, the most exciting methodological contribution of the lexical
approach, in both its weak and strong versions, is its promotion of language
awareness activities. Lewis's publications contain a wealth of innovative
types of exercises which aim to make learners aware of the existence of
chunks, viz. apply to lexical phrases the type of discovery learning advo-
cated by Johns (1986) and many others after him. Numerous studies have
reported success in implementing these methods in a variety of teaching
contexts and have further extended the battery of exercise types (cf. e.g.
Woolard 2000; Conzett 2000; Kavaliauskiene and Januleviciene 2001;
Hamilton 2001; Deveci 2004). However, here too one might speak of a
strong and a weak version. For Lewis, these methods are meant to replace
the previous teacher-led methodology: "The Lexical Approach totally re-
jects the Present-Practise-Produce paradigm advocated within the behav-
iourist learning model; it is replaced by the Observe-Hypothesise-
Experiment cyclical paradigm" (Lewis 1993: 6). For many, however, these
techniques are a complement to the battery of existing techniques. Diver-
gences are particularly strong as regards grammar. While Lewis considers
grammar to be primarily receptive (Lewis 1993: 149) and is extremely
critical of full-frontal grammar teaching, Willis (2003: 42) considers that
there is a place for explicit grammar instruction: "different aspects of the
grammar demand different learning processes and different instructional
strategies. The grammar of structure, for example, is very much rule gov-
erned and instruction can provide a lot of support for system building".
4.2. Terminology
Although phraseology has always been "a field bedevilled by the prolifera-
tion of terms and by conflicting uses of the same term" (Cowie 1998: 210),
the widening of the field spurred by Sinclair's corpus-driven approach has
only aggravated the problem. Developing a more consistent
terminology of multi-word units remains one of the major desiderata for the
future. To be maximally effective this terminology should cover the full
spectrum of multi-word units, from the most fixed to the loosest ones, and
follow a number of principles, among which the following four strike me as
especially important:
(a) Whatever the terminology used, list the criteria that have been used
to identify/select the multi-word units;
(b) Distinguish clearly between linguistic and distributional categories;
(c) Avoid using the same terms to refer to quite different types of unit;
(d) Choose the level of granularity that best fits the teaching objectives.
Principle (b) aims to avoid typologies that mix up terms and criteria per-
taining to the traditional approach to phraseology, viz. linguistic criteria of
semantic non-compositionality, syntactic fixedness and lexical restriction,
with the terminology used in the Sinclair-inspired distributional approach to
refer to quantitatively-defined units, i.e. units identified on the basis of
measures of recurrence and co-occurrence (for more details, see Granger
and Paquot 2008).
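To make the distributional notion concrete, here is a minimal sketch of one widely used co-occurrence score, pointwise mutual information (PMI). The toy corpus and word pair are invented purely for illustration; real distributional studies use large corpora and a range of further measures (t-score, log-likelihood, etc.):

```python
import math
from collections import Counter

def pmi(tokens, w1, w2):
    """Pointwise mutual information of the adjacent pair (w1, w2):
    log2(p(w1, w2) / (p(w1) * p(w2))). Values well above zero mean
    the words co-occur more often than their individual frequencies
    would predict -- a simple recurrence/co-occurrence measure."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    p_pair = bigrams[(w1, w2)] / (n - 1)
    return math.log2(p_pair / ((unigrams[w1] / n) * (unigrams[w2] / n)))

# Invented toy corpus: "strong tea" recurs, so its score is positive.
toy = ("she drank strong tea then more strong tea and "
       "he drank water then tea").split()
print(round(pmi(toy, "strong", "tea"), 2))
```

On this toy data the recurrent pair "strong tea" scores clearly above zero, i.e. it is identified on purely quantitative grounds, with no appeal to non-compositionality or fixedness.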
4.3. Selection
4.3.1. Criteria
"Pedagogically the main problem with phrases is that there are so many of
them." This statement by Willis (2003: 166) points to one of the biggest
challenges of the lexical approach, i.e. the selection of lexical phrases. The
criterion that occupies a clearly dominant position in the lexical approach is
corpus-based frequency. Corpora make it possible to identify "the common
uses of the common words" that a lexical syllabus should focus on (Sinclair
and Renouf 1988: 154). There is no denying that frequency is a crucial cri-
terion. Far too much teaching time is wasted on words and phrases that are
not even worth bringing to learners' attention for receptive purposes, let
alone for productive purposes. There is much to gain from teaching high-
frequency words such as the high frequency verbs see or give (Sinclair and
Renouf 1988: 151-153) in all their richness rather than focusing exclu-
sively on their primary meanings. However, it is important to bear in mind
that there is no such thing as generic frequency. Hugon (2008) reminds us
that frequency ranking varies as a function of the overall composition of the
corpus from which it is derived.
Figure 1. Criteria for the selection of lexical phrases: frequency, learnability,
teachability and learner variables
that the lexical phrases selected for teaching are extracted from texts that
learners have already processed for meaning, which ensures better contex-
tualization, increased relevance and hence higher motivation for learning
them.
but it is important for teachers and syllabus designers to have access to this
information (for further discussion of this issue, cf. Granger 2009).
5. Conclusion
References
Altenberg, Bengt
1998 On the phraseology of spoken English: The evidence of recurrent
word-combinations. In Phraseology: Theory, Analysis and Applica-
tions, Anthony P. Cowie (ed.), 101-122. Oxford: Oxford University
Press.
Barfield, Andy
2001 Review of M. Lewis (ed.), Teaching Collocation: Further Develop-
ments in the Lexical Approach. ELT Journal 55 (4): 413-415.
Conklin, Kathy and Norbert Schmitt
2008 Formulaic sequences: Are they processed more quickly than non-
formulaic language by native and nonnative speakers? Applied Lin-
guistics 29 (1): 72-89.
Conzett, Jane
2000 Integrating collocation into a reading and writing course. In Teach-
ing Collocation: Further Developments in the Lexical Approach,
Michael Lewis (ed.), 70-87. Boston: Thomson Heinle.
Cowie, Anthony Paul
1998 Phraseological dictionaries: Some East-West comparisons. In Phra-
seology: Theory, Analysis and Applications, Anthony P. Cowie (ed.),
209-228. Oxford: Oxford University Press.
Coxhead, Averil
2000 A new academic word list. TESOL Quarterly 34 (2): 213-238.
Coxhead, Averil
2008 Phraseology and English for academic purposes: Challenges and
opportunities. In Phraseology in Foreign Language Learning and
Teaching, Fanny Meunier and Sylviane Granger (eds.), 149-161.
Amsterdam/Philadelphia: Benjamins.
Cullen, Richard
2008 Teaching grammar as a liberating force. ELT Journal 62 (3): 221-
230.
Dechert, Hans-Wilhelm and Paul Lennon
1989 Collocational blends of advanced second language learners: A pre-
liminary analysis. In Contrastive Pragmatics, Wieslaw Olesky (ed.),
131-168. Amsterdam: Benjamins.
De Cock, Sylvie
2000 Repetitive phrasal chunkiness and advanced EFL speech and writing.
In Corpus Linguistics and Linguistic Theory, Christian Mair and
Marianne Hundt (eds.), 51-68. Amsterdam: Rodopi.
De Cock, Sylvie
2004 Preferred sequences of words in NS and NNS speech. Belgian Jour-
nal of English Language and Literatures (BELL): 225-246.
De Cock, Sylvie
2007 Routinized building blocks in native speaker and learner speech:
Clausal sequences in the spotlight. In Spoken Corpora in Applied
Linguistics, Mari C. Campoy and Maria J. Luzon (eds.), 217-233.
Bern: Peter Lang.
Deveci, Tanju
2004 Why and how to teach collocations. English Teaching Forum. April
2004: 16-20.
Ellis, Nick, Rita Simpson-Vlach and Carson Maynard
2008 Formulaic language in native and second-language speakers: Psy-
cholinguistics, corpus linguistics and TESOL. TESOL Quarterly 42
(3): 375-396.
Gabrielatos, Costas
2005a Collocations: Pedagogical implications and their treatment in peda-
gogical materials. SHARE 6/146. Available from http://www.
shareeducation.com.ar/past%20issues2/SHARE%20146.htm. First
published in 1994.
Gabrielatos, Costas
2005b Corpora and language teaching: Just a fling or wedding bells?
TESL-EJ 8 (4): 1-37.
Howarth, Peter
1999 Phraseological standards in EAP. In Academic Standards and Expec-
tations, H. Bool and P. Luford (eds.), 143-158. Nottingham: Not-
tingham University Press.
Hugon, Claire
2008 Towards a variationist approach to frequency in ELT. Paper pre-
sented at ICAME 29, Ascona, May 2008.
Hunston, Susan
2002 Corpora in Applied Linguistics. Cambridge: Cambridge University
Press.
Hyland, Ken
2008 Academic clusters: Text patterning in published and postgraduate
writing. International Journal of Applied Linguistics 18 (1): 41-62.
Johns, Tim
1986 Micro-concord: A language learner's research tool. System 14 (2):
151-162.
Kavaliauskiene, Galina and Violeta Januleviciene
2001 Using the lexical approach for the acquisition of ESP vocabulary.
The Internet TESL Journal VII (3), March 2001.
Krishnamurthy, Ramesh
2002 Learning and teaching through context: A data-driven approach.
Downloaded from http://www.developingteachers.com/articles
tchtraining/corpora3 ramesh.htm.
Lewis, Michael
1993 The Lexical Approach: The State of ELT and a Way Forward. Hove:
Language Teaching Publications.
Lewis, Michael (ed.)
2000 Teaching Collocation: Further Developments in the Lexical Ap-
proach. Boston: Thomson Heinle.
Lewis, Michael
2002 Implementing the Lexical Approach. Boston: Thomson Heinle. First
published in 1997.
Lowe, Charles
2003 Lexical approaches now: The role of syntax and grammar. IH Jour-
nal of Education and Development, http://www.ihworld.com/
ihjournal/charleslowe.asp.
Meehan, Paul
2003 Lexis - the new grammar? How new materials are finally challeng-
ing established course book conventions. http://www.
developingteachers.com/articlestchtraining/lexnewpfjaul.htm.
Milton, John
1999 Lexical thickets and electronic gateways: Making text accessible by
novice writers. In Writing: Texts, Processes and Practices, Christo-
pher N. Candlin and Ken Hyland (eds.), 221-243. London: Long-
man.
Myles, Florence
2002 Second Language Acquisition (SLA) research: Its significance for
learning and teaching. In The Guide to Good Practice for Learning
and Teaching in Languages, Linguistics and Area Studies. South-
ampton: LTSN Subject Centre for Languages, Linguistics and Area
Studies. Downloaded from http://www.llas.ac.uk/resources/
goodpractice.aspx?resourceid=421.
Nattinger, James and Jeanette S. DeCarrico
1992 Lexical Phrases and Language Teaching. Oxford: Oxford University
Press.
Nesselhauf, Nadja
2005 Collocations in a Learner Corpus. Amsterdam/Philadelphia: Benja-
mins.
Paquot, Magali
2007 Towards a productively-oriented academic word list. In Corpora and
ICT in Language Studies, Jacek Walinski, Krzysztof Kredens and
Stanislaw Gozdz-Roszkowski (eds.), 127-140. Frankfurt a. M.: Peter
Lang.
Paquot, Magali
2010 Academic Vocabulary in Learner Writing: From Extraction to
Analysis. London/New York: Continuum.
Porto, Melina
1998 Lexical phrases and language teaching. Forum 36: 3. Downloaded
from http://exchanges.state.gov/forum/vols/vol36/no3/index.htm.
Pulverness, Alan
2007 Review of McCarthy, Michael and Felicity O'Dell, English Colloca-
tions in Use. ELT Journal 61: 182-185.
Richards, Jack C.
1976 The role of vocabulary teaching. TESOL Quarterly 10 (1): 77-89.
Rodgers, Ted
2000 Methodology in the new millennium. English Language Teaching
Forum 38 (2): November 2000.
Schmitt, Norbert (ed.)
2004 Formulaic Sequences. Amsterdam/Philadelphia: Benjamins.
Sinclair, John McH.
1991 Corpus, Concordance, Collocation. Oxford: Oxford University
Press.
Woolard, George
2000 Collocation: Encouraging learner independence. In Teaching Collo-
cation: Further Developments in the Lexical Approach, Michael
Lewis (ed.), 28-46. Boston: Thomson Heinle.
Wray, Alison
2002 Formulaic Language and the Lexicon. Cambridge: Cambridge Uni-
versity Press.
Wray, Alison and Tess Fitzpatrick
2008 Why can't you just leave it alone? Deviations from memorized lan-
guage as a gauge of nativelike competence. In Phraseology in For-
eign Language Learning and Teaching, Fanny Meunier and Sylviane
Granger (eds.), 123-147. Amsterdam/Philadelphia: Benjamins.
Chunks and the effective learner
Dieter Götz
1. Towards "chunks"
Nowadays many linguists entertain the idea that language in use contains
quite a number of phraseological units, or ready-made, prefabricated units,
or chunks. We can observe various strands of thought, which, in hindsight,
might have contributed toward this idea.
When Austin wrote about speech acts (in the 1960s) he was one of the
first to point out that we do not simply talk when we talk but that by talking
we actually do something, like asking, confirming, naming, declaring war,
repenting, forgiving, etc. Accordingly, pieces of language were seen as
used in and linked to specific situations, as a means of coping with the
world around and also with people. Actual language behaviour came to be
viewed as a performance by an individual, under certain circumstances and
for certain purposes (reminiscent of Malinowski, also of Bühler and Jakob-
son).
Not only can it be asked what the function of an utterance is but also
what a certain part of the utterance is for. This is a functional grammatical
approach (e.g. by Czech scholars like Mathesius, Vachek and Firbas), taken
further by Halliday, with influence on the London School. Lyons linked
words to the world, by using "referring, reference" - meaning and things no
longer belong to different universes. The description of words developed
into a description of their use (Wittgenstein), or usage, which meant that
collocation (and the like) received attention (e.g. by Firth and later by Sin-
clair). The phenomenon of concomitant words led to introducing the prin-
ciple of idiomaticity (Sinclair), or in other terms, repeated speech (Coseriu).
This is the stage at which one could speculate that like situations may have
like (or similar) wordings, with strings of words strung together.
Empirically though, repetition (or recurrence, or related terms) is hard to
describe. Here is the beginning of an Economist article (5 May 2007), with
frequency counts for some stretches (according to the BNC).
148 Dieter Götz
A cagey game
AT THIS week's meeting in Sharm el-Sheikh, an Egyptian resort, a group
of representatives from Middle Eastern and other countries gathered to talk
about Iraq. But the buzz was not directly about that troubled country. In-
stead, eyes were on the interaction between two big players. Whether Con-
doleezza Rice, America's secretary of state, would converse with her coun-
terpart from Iran, and whether she would take the chance to raise the issue
of Iran's nuclear programme, were the big questions of the gathering.
cagey game 0
this week 4233
meeting in 1589
group of 7254
group of representatives 7
representatives from 455
gather* to 170
gather* to talk 2
troubled country 1
eyes on 634
interaction between 2285
big player* 23
take the chance 39
take* the chance to 8
raise* the issue of 157
big question 72
in the end 3120
face-to-face meeting 14
nuclear programme 88
To anyone speaking English, the frequency of this week, in the end, raise
the issue and others come as no surprise. But it would be futile to try and
define repeatedness numerically - perhaps like "from 75 occurrences per
million onwards". In their Longman Grammar of Spoken and Written Eng-
lish, Biber et al. (1999: 992) require 10 occurrences per one million words
for a given stretch to count as a "recurrent" lexical bundle. This is reason-
able, perhaps eminently so, but it has more to do with the genre of publish-
ing than with linguistics and need not affect a theory of language.
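Biber et al.'s cut-off is straightforward to operationalize: scale a raw count to occurrences per million words and compare it with the threshold. A minimal sketch using the BNC figures quoted above (the corpus size of roughly 100 million words is the BNC's usual headline figure, assumed here for illustration):

```python
# Normalized frequency: raw count scaled to occurrences per million words.
def per_million(raw_count, corpus_size):
    return raw_count * 1_000_000 / corpus_size

def is_recurrent(raw_count, corpus_size, threshold=10):
    """Biber et al.'s (1999) criterion: at least 10 occurrences
    per million words for a stretch to count as 'recurrent'."""
    return per_million(raw_count, corpus_size) >= threshold

BNC_SIZE = 100_000_000  # the BNC is roughly 100 million words (assumption)

# BNC counts for some of the stretches listed above
counts = {"this week": 4233, "in the end": 3120,
          "take the chance": 39, "cagey game": 0}
for phrase, n in counts.items():
    print(phrase, round(per_million(n, BNC_SIZE), 2),
          is_recurrent(n, BNC_SIZE))
```

On these figures only this week and in the end clear the bar, which illustrates the point made above: where exactly the bar sits is a practical decision, not a fact about language.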
Moreover, there may be some doubts about what makes such a stretch.
One might view in the end as an option offered by a larger unit (in the {end,
beginning, middle, meantime, afternoon ...}). Also note that, in a sense,
Chunks and the effective learner 149
raise the issue might not be a collocation at all. Raise goes with words that
signify 'matter of, case of, instance of, topic of, etc. If you use raise as in,
say, raise Harry's jealousy (i.e. the conversational topic that Harry is jeal-
ous), then the direct object expresses the semantic role associated with the
valency of raise (which of course results in some kind of collocation, and
rouse jealousy would be something different).
Repetition is relative: there is considerable variation in occurrence with
regard to register or subregisters. Chunks in sports commentaries for exam-
ple will not be the same as those in school essays, or poetry, or classroom
English, or in academic medical English.1
There appears to be no theoretical flaw in saying that different persons
may have different stores of repeated speech items. There must be room for
individual and subjective bits and pieces in "language". We should not
forget that corpora do not represent language: they can present language in
a reasonable way, namely, depending on the choice of genres, registers
(etc.) and the purposes of the corpus. But let us keep in mind: language in
use does show recurrent strings of words and these strings can be related to
extra-linguistic situations.
Now suppose a German learner would like to know what kind of food will
be served for the next meal, a situation, or an intention that is usually real-
ised as Was gibts zu essen? in German or as What are we going to have for
dinner/lunch (etc.)? in English. Note that a so-called literal translation of
the German expression will not yield an adequate expression in English.
That is to say: you cannot translate the German "chunk" unless you know
the English "chunk" (or situationally appropriate string). Although What
are we going to have for dinner/lunch would not be called an idiom, it can
hardly be spontaneously generated by a learner of English. Similarly, there
is, from the point of view of English, nothing idiomatic about Dogs wel-
come (in a farm holiday brochure), but that does not mean that you can
spontaneously generate Hunde willkommen in German or cam benvenuti in
Italian (an important issue in translation equivalence, cf. Tognini-Bonelli
and Manca 2004). This is why it is essential that in foreign language teach-
ing and learning learners should come to understand the importance of
chunks for fluent and idiomatic foreign language production - something
which has been recognized by applied linguists and didacticians for quite
some time. Firstly learners should be advised to spot "gaps" of the kind just
described, in their FL storage.
Here is what learners should do. When they hear or read parts of the
language to be learnt, they might want to split these parts into two groups:
into what they already know and into what they do not yet know. The for-
mer - what they already know - serves to confirm their prior knowledge.
But if what they come across is recognised as new, or different from
what has already been learnt, they will be alert to spot their personal lin-
guistic gaps. The new item will be focussed upon, as a possible piece for
expanding one's knowledge: once learners realise that strings of words
refer to situations, they will develop a linguistic alertness. That is to say
learners react to words like, say, crepuscular or units like The liver can
repair itself or You're afraid, aren't you? or even to Under certain circum-
stances there is nothing more agreeable than the hour dedicated to the
ceremony known as afternoon tea. They will not react to crepuscular by
saying to themselves "Oh, never mind!". Instead they will react by saying
"I couldn't have produced that myself!" They will react to You're afraid,
aren'tyou? by making a mental note: so that is what people say in a situa-
tion like the one at hand. In other words, what is required is a "monitoring
of the input". When reading a foreign language text this monitoring can
easily be applied: there is nearly always enough time for pondering over a
phrase. When engaged in listening, the situation at hand will trigger the
attention (provided of course that listening comprehension does not require
an undue amount of mental capacity). There is a fair chance that if you
connect a concrete utterance with a concrete situation, you might co-
remember communicative factors like register, social distance, jargon etc.
It is not often that you can use a single word to describe a situation. A word
like passport is likely to appear in the company of valid, invalid, or check,
issue, apply for, expire, etc. If a customs officer tells you Your passport is
going to expire soon and if you as a foreign speaker of English do not know
"the meaning of expire" you might make an intelligent guess at what he
means or you might look up expire later. You would then memorise "pass-
port + expire" both as a specific situation and as the words that cover it.
And if you were an efficient German learner of English, you would not
store it as expire = ablaufen, but probably as Pass + ablaufen > pass-
port + expire, linking both expressions to the same situation. The option of
4. Metalingual faculties
choose something else if you want to play safe. It is up to you to run risks
or not. While some people think that trusting native speakers is too risky,
learners who trust themselves and their own poor translation, run a much
greater risk. Metalinguistic inspection may of course be applied to any level
of linguistic description (from phonology to discourse analysis).
The function of repetition when acquiring language skills is more than ob-
vious. Clearly, one of the most important keys to listening comprehension
is repetition. Repetition equals redundancy and redundancy will raise the
degree of expectability. Learners cannot learn listening by listening, but
they can learn listening by detecting co-occurrent vocabulary. Fast reading
is another skill that needs chunk stores.
When writing, learners can choose to play safe and use only those
stretches which they know to be correct,2 and should they leave firm
ground they will at least know that they are doing just that. Advanced con-
versational skills is another point. Here, repetition facilitates quick compre-
hension (and quick comprehension is necessary) and it is also the basis for
producing prefabricated items as quickly as is normal. These items also
help learners to gain and compete for the speaker's role. Moreover, chunks
allow a non-native speaker to monitor their production and to know that
what they said was what they meant.
6. Exploiting chunks
The concept of chunks, together with the implications for language learn-
ing, has been around for quite a number of years, cf. e.g. Braine (1971),
Asher, Kusudo and de la Torre (1974), Götz (1976). Chunkiness, however,
was not really a popular idea in advanced generative grammar - but for
some time, and perhaps due to the idiomatic principle, it has no longer been
frowned upon (see e.g. Sylviane Granger, this volume).
on chunks. OALD8 s.v. watch, illustrates the pattern ~ where, what, etc...
by Hey, watch where you're going! This is a good example, but only under
very favourable circumstances. It can re-enforce the learner's knowledge -
in case he or she already knows the phrase. Learners who do not know it,
cannot decide what it really means, to what kind of situation it really refers.
Does it mean a) 'make sure you take a direction/road etc. that leads to
where you want to go', perhaps, or specifically, b) '... where you want to
go in life', or c) 'be inquisitive about the things around you!' or perhaps d)
'look where you set your foot, might be slippery, muddy, etc'. Learners
cannot know intuitively that d) is correct, and hence this example needs
some comment or a translation, e.g. Pass auf, wo du hintrittst! in German.
(Admittedly, the hey would be a kind of hint for those that know.) Exam-
ples of usage might be chosen and translated in such a way that they indi-
cate clearly what sort of situation they refer to - and can show how co-
selection (see e.g. Sinclair 1991) works.
In short, we are approaching the idea of a bridge dictionary - one of the
many ideas suggested by John Sinclair. In a bridge dictionary, foreign lan-
guage items are presented in the native learner's language. Using this kind
of metalanguage will ensure that a learner has no difficulty understanding
what is said even if it is fairly subtle.4
Incidentally, a COBUILD-style explanation is one that tries to depict a
situation, cf "1 If you watch someone or something, you look at them,
usually for a period of time and pay attention to what is happening"
(COBUILD4).
To my knowledge, various lexicographers (including myself) have tried
to find publishers for bridge dictionaries (such as English - German, Eng-
lish - Italian etc.), but they have tried in vain. However, a dictionary that
contains information like the following article (based here on an OALD
version) need not necessarily become a flop:
the road Sie sah, wie die Kinder über die Straße gingen 2 watch
(over) + N sich um etwas oder jemanden kümmern, indem man da-
rauf aufpasst: Could you watch (over) my clothes while I swim? Passt
du auf meine Kleider auf, während ich beim Schwimmen bin? 3
(Umgangssprache) auf das aufpassen, was man tut, etwas mit
Sorgfalt tun: Watch it! Pass auf! Watch yourself. Pass auf und ... (fall
nicht hin, sag nichts Falsches, lass dich nicht erwischen) You'd better
watch your language Überleg dir, wie du es formulierst Watch what
you say Pass auf, was du sagst!
watch + for + N (ausschauen und) warten, dass jemand kommt oder
dass etwas passiert: You'll have to watch for the right moment Du
musst den richtigen Zeitpunkt abpassen; watch + out (besonders im
Imperativ) aufpassen, weil Vorsicht nötig ist Watch out! There's a
car coming Achtung! ...; watch + out + for + N 1 konzentriert
zuschauen, hinschauen, damit einem nichts Wichtiges entgeht: The
staff were asked to watch out for forged banknotes Die Angestellten
mussten sorgfältig auf gefälschte Geldscheine achten 2 bei etwas
sehr vorsichtig sein: Watch out for the steps they're rather steep
Pass bei der Treppe auf ...
Concomitance of words is due to the fact that some situations are alike, or
viewed alike. It is imperative for a learner to be aware of this phenomenon,
and hence it should be an integral part of learner's dictionaries. Although
the recent editions of English learners' dictionaries such as the Longman
Dictionary of Contemporary English have made great progress in this di-
rection, Langenscheidt's Großwörterbuch Deutsch als Fremdsprache
(1993) and its derivatives can be seen as one of the first really systematic
treatments of collocations in this type of dictionary as it provides lists of
collocations for many headwords: the entry for e.g. Sturm contains a sec-
tion in < >, namely <ein S. kommt auf, bricht los, wütet, flaut ab, legt sich;
in einen S. geraten>. A list like this need not be representative or exhaus-
tive or meticulously structured - its main purpose is to demonstrate con-
comitant words (some of them certainly useful) and remind the learner of
concomitance.
A surface syntactic pattern, such as Noun + Verb + Noun is of course
much too general to make sense as a chunk. Usually, however, such a sur-
face pattern is in reality a kind of cover term for "chunkable" items, pro-
vided the pattern is filled semantically. In the case of e.g. the verb fly we
get several distinguishable semantic subpatterns of the syntactic pattern
Notes
References
Herbst, Thomas, David Heath, Ian F. Roe and Dieter Gotz (eds.)
2004 A Valency Dictionary of English: A Corpus-Based Analysis of the
Complementation Patterns of English Verbs, Nouns and Adjectives.
Berlin/New York: Mouton de Gruyter.
Heringer, Hans Jürgen
2009 Valenzchunks: Empirisch fundiertes Lernmaterial. München: Iudici-
um.
Selinker, Larry
1972 Interlanguage. IRAL 10 (2): 209-231.
Siepmann, Dirk
2005 Collocation, colligation and encoding dictionaries. Part I: Lexico-
logical aspects. International Journal of Lexicography 18: 409-443.
Sinclair, John McH.
1991 Corpus, Concordance, Collocation. Oxford: Oxford University
Press.
Tognini-Bonelli, Elena and Elena Manca
2004 Welcoming children, pets and guests: Towards functional equiva-
lence in the languages of 'Agriturismo' and 'Farmhouse Holidays'.
In Advances in Corpus Linguistics: Papers from the 23rd Interna-
tional Conference on English Language Research on Computerized
Corpora (ICAME 23), Göteborg 22-26 May 2002, Karin Aijmer
and Bengt Altenberg (eds.), 371-385. Amsterdam/New York: Ro-
dopi.
Dictionaries
Corpus
BNC The British National Corpus, version 3 (BNC XML Edition). 2007.
Distributed by Oxford University Computing Services on behalf of
the BNC Consortium, http://www.natcorp.ox.ac.uk/.
Exploring the phraseology of ESL and EFL
Nadja Nesselhauf
1. Introduction
John Sinclair was among the leading proponents of the centrality of phrase-
ology, or what he referred to as the "idiom principle", in language. He also
advocated that this aspect of language be investigated with a corpus-
approach. These convictions have been proved right over and over again by
what must by now be tens of thousands of corpus-based studies on the
phraseology of (L1) English (and other languages). The study of the phra-
seology of EFL varieties1 has also intensified over the past few years, al-
though only a relatively small proportion of this work is corpus-based.
What is rare to date, however, is studies, in particular corpus-based ones,
on the phraseology of ESL varieties. What is practically non-existent (in
any type of approach) is comparisons of the phraseology of ESL and EFL
varieties. Given the pervasiveness of the phenomenon in any variety and
the relatedness of the two types of variety, this is a gap that urgently needs
to be filled. The present paper is therefore going to explore phraseological
features in ESL and EFL varieties and to investigate to what degree and in
what respects the phraseology of the two types of variety is similar.2
The paper starts out by providing a brief overview of previous research
as well as an overview of the corpora and methodology used for the inves-
tigation. Then, three types of analyses into the phraseology of ESL and
EFL varieties will be presented. In Section 3.1, it will be investigated how
"competing collocations" (or collocations that share at least one lexical
element and are largely synonymous) are dealt with in the two types of
varieties. In 3.2, the treatment of internally variable collocations will be
considered. Finally, I am going to look at what have been referred to as
"new prepositional verbs" (Mukherjee 2007), i.e. verbs that are simple
verbs in L1 English but have become or are treated as if they were verb-
preposition collocations (or prepositional verbs) in ESL or EFL varieties
(Section 3.3).
Three types of corpora were needed for the present investigation: ESL cor-
pora, learner corpora, and, as a point of reference, LI corpora. As a corpus
representing several ESL varieties of English, I used the ICE-corpus (Inter-
national Corpus of English). The varieties included in the present study are
Indian English, Singaporean English, Kenyan English and Jamaican Eng-
lish (cf. table 1). It is important to note that the degree of institutionaliza-
tion of English varies in these four countries and, in particular, that Jamai-
can English occupies a special position among the four, in that it may also
with some justification be classified as an ESD (English as a second dia-
lect) variety rather than as an ESL variety.3 The composition of all the ICE-
corpora, with the exception of ICE-East Africa, is the same, with each sub-
corpus containing 1 million words in total, of which 60 % are spoken and
40 % written language. In the case of ICE-East Africa, only the Kenyan
part was included in the present investigation, which contains slightly less
than one million words and contains about 50 % spoken and 50 % written
language.
The starting point for the analysis of competing collocations was an obser-
vation I made in my analysis of collocations in German learner language
(e.g. Nesselhauf 2005). The learners tended to overuse the collocation play
a role, while they hardly used the largely synonymous and structurally
similar collocation play a part. In L1 English, on the other hand, both of
these competing collocations occur with similar frequencies. The use of the
two expressions was thus investigated in L1, L2 and learner varieties, to
find out whether the behaviour of the ESL varieties in any way resembled
the behaviour of the learner varieties and if so, to what degree and why.
The results of this investigation (with only the written parts of the cor-
pora considered) are provided in figure 1.
[Bar chart; absolute frequencies (PLAY+ROLE / PLAY+PART): ICE-Jam 28/12; ICE-Sing 48/4; ICE-Ind 55/3; ICE-Ken 67/3; ICLE-4L1 121/32.]
Figure 1. PLAY + ROLE / PART in the written parts of the BNC, ICE and ICLE
Here and elsewhere, the bars in the graphs indicate the relative frequencies
of the relevant expressions, but the absolute frequencies are also given on
each individual bar. The results confirm my earlier observations. In the
written part of the BNC the two expressions have almost the same frequen-
cies, and in ICLE-4L1, play a role is used in about 80 % and play a part in
about 20 % of the occurrences of either expression. In the ESL varieties,
the proportion of play a role is also consistently greater than that of play a
part. Except in the case of Jamaican English, the proportion of play a role
is even greater in the ESL varieties than in the learner varieties. So it seems
that overuse of play a role at least partly at the expense of play a part is a
feature of both types of varieties under investigation.
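The relative frequencies behind these observations follow directly from the absolute counts given on the bars of figure 1; a minimal sketch in Python (my own illustration, not part of the original study):

```python
# Proportions of the competing collocations PLAY + role / part,
# computed from the absolute frequencies given in figure 1.
counts = {
    "ICE-Jam": (28, 12),
    "ICE-Sing": (48, 4),
    "ICE-Ind": (55, 3),
    "ICE-Ken": (67, 3),
}

for corpus, (role, part) in counts.items():
    share = role / (role + part)  # proportion of "play a role"
    print(f"{corpus}: play a role = {share:.0%}")
```

On these counts, only ICE-Jam (70 %) falls below the roughly 80 % share of play a role observed in the learner data.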
To find out whether this result is restricted to this particular pair of
competing collocations or whether it reveals a more general tendency, an-
other group of competing collocations was investigated: take into consid-
eration, take into account and take account of. The results are displayed in
figure 2.
[Bar chart; absolute frequencies (take into consideration / take into account / take account of): ICE-Ken 11/15/2; ICE-Sing 9/9/0; ICLE-4L1 70/66/3.]
Figure 2. TAKE + consideration / account in the written parts of the BNC, ICE
and ICLE
In the BNC, a clear dominance of take into account can be observed, fol-
lowed by the expression take account of. In ICLE, take account of is hardly
present (with only three instances) and take into consideration is slightly
more frequent than take into account. In other words, the expression with
the lowest proportion in L1 English has the highest proportion in written
learner language (or at least in the type of learner language investigated
here). The ESL corpora resemble the learner corpus in that take account of
is also hardly present (with two instances in total, in the Kenyan corpus).
They also resemble the learner corpus in that the proportion of take into
ever, role in this sense is fairly frequent, whereas this sense of part is infre-
quent (in particular in relation to the overall frequency of this word). So, in
a sense, it can perhaps also be claimed that play a role is language-
internally more regular than its counterpart.
There is only one instance of HAVE no intention of -ing, plus one with
intention in the plural; none of the other instances is either premodified by a
negative or restrictive modifier or an instance of the other patterns preferred
in L1, and complementation tends to be infinitival rather than with of -ing.
[Bar chart; HAVE (no) INTENTION patterns in the BNC*, ICE and ICLE; BNC* 182 / 22 / 30; ICE 4,1 / 3,5 / 1,1; ICLE 2 / 1 / 5; on the BNC* sample cf. note 6.]
In L1 English, come into contact is the major variant, and come in contact
the minor one, with about 15 % of the occurrences in the BNC and about
25 % in ICE-GB (cf. figure 8). In contrast, in ICLE, both variants are used
equally often, and in the ESL corpora, come into contact is only slightly
more frequent than come in contact. The behaviour of the L2 varieties
therefore again lies in between the L1 and learner varieties, although it is
fairly close to the latter here, and an infrequent variant is again treated as an
equivalent variant in learner language. Due to the small numbers here and
above, further research is needed to confirm the results, but the fact that the
tendencies are the same throughout the analyses here might be an indication
that the observed tendencies are more general.
When the results concerning variable collocations are compared to those
obtained in the previous section on competing collocations, it appears that
these two groups are treated in what may be called opposite ways in both
EFL and ESL varieties: in the case of competing collocations, it seems that
the variants tend to be reduced, whereas in the case of variable collocations,
it seems that minor variants tend to be treated as equivalent or even major
ones.
The third type of phraseological units to be treated here are new preposi-
tional verbs (cf. Mukherjee 2007). In descriptions of individual ESL varie-
ties, certain new prepositional verbs are sometimes listed as features of
different varieties by different authors. Comprise of, for example, is cited as
a feature of Kenyan English by Skandera (2003) and as a feature of Indian
English by Mukherjee (2007). What is more, the phenomenon as such, i.e.
the use of simple L1 verbs with a preposition, also occurred in my
analysis of German learner language (Nesselhauf 2005). One non-L1
prepositional verb that occurred in my learner data was enter into, used in
cases where L1 English would simply have enter.
(1) ... we have all those friendly ... guys who usually use ... their cars to enter
into the city (ICLE-German)
(2) Probably they'll lose importance and English expressions will enter into
those languages much [more easily] (ICLE-German)
To find out whether this particular use also occurred in ESL varieties, I
consulted the ICE-corpora, and indeed, instances such as the following do
occur:
(3) The plasma membrane of a cell which forms its outer wall does not
normally allow external molecules to enter into the cell (ICE-Ind, w2b-027)
(4) ... the drugs are allowed to enter into the country. (ICE-EA, classlessonK)
In order to investigate whether parallel usage can also be found with re-
spect to other new or non-L1 prepositional verbs in EFL and ESL corpora, I
examined a number of such verbs that had either been cited as a feature of
one or even several ESL varieties or that had been found to occur in learner
language. Those verbs which the analysis has shown to occur in at least two
EFL and at least two ESL varieties are provided in table 2.
While both the procedure and the low numbers again do not allow definite
conclusions, a quantitative tendency that can be inferred from the table is
that Jamaican English seems to make much less use of new prepositional
verbs than the other varieties under investigation. It also appears that many
of the prepositional verbs are used more frequently in the ESL varieties
than in the EFL varieties (N.B. that ICLE is more than double the size of
the individual ICE-corpora, cf. Section 2.1). A possible reason for this is
that these prepositional verbs have actually become or are in the process of
becoming features of several ESL varieties, while in learner language they
probably tend to be created on the spot by individual learners.
Nevertheless, if the same prepositional verbs are created in the two
types of varieties, it does not seem unreasonable to assume that similar
processes must be at work, and that these processes must go beyond simple
L1 influence and instead be based on constellations found in L1 English.
An investigation of this question (cf. also Nesselhauf 2009) revealed four
factors that seem to play a part in the creation of new or non-L1 preposi-
tional verbs (cf. figure 9).
[Figure 9: factors in the creation of new/non-L1 prepositional verbs, among them "derivationally related noun exists and has the preposition in question" (e.g. request (N) for, answer (N) to, emphasis on); examples shown include demand (N) for and comprise of.]
In the case of demand for, for example, it seems to be the circumstance that
both the corresponding noun and the semantically related verb ask take the
preposition for that leads to its creation. In the case of answer to, three fac-
tors seem to be involved: the fact that the noun as well as several semanti-
cally related verbs (reply / respond to) take the preposition, and the fact that
the verb-preposition combination does exist, both in the sense of "have to
explain one's actions to sb." and in the collocation answer to a name. A
hypothesis would therefore be that if at least two factors coincide, the prob-
ability that a new or non-L1 prepositional verb is created in ESL and EFL
is high.
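The hypothesis can be stated almost mechanically: count how many of the four factors apply to a candidate verb and predict creation where at least two coincide. A toy sketch (the factor labels and the single-factor negative example are my own shorthand, not the author's):

```python
# Toy operationalization of the two-factor hypothesis for the creation
# of new/non-L1 prepositional verbs (factors as discussed for figure 9).
FACTORS = {
    "demand for": {"related noun takes the preposition",
                   "semantically related verb takes the preposition"},
    "answer to": {"related noun takes the preposition",
                  "semantically related verbs take the preposition",
                  "combination exists in another sense"},
    # hypothetical negative example with only a single factor:
    "some verb + prep": {"related noun takes the preposition"},
}

def creation_predicted(verb: str) -> bool:
    """Predict creation where at least two factors coincide."""
    return len(FACTORS[verb]) >= 2

print([v for v in FACTORS if creation_predicted(v)])  # ['demand for', 'answer to']
```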
4. Conclusion
Notes
1 The term "variety" is used in a broad sense here and includes the output of
(advanced) foreign learners with a certain L1 background, although such "va-
rieties" lack stability.
2 Parts of this paper are based on Nesselhauf (2009).
3 The classification depends on whether or not Jamaican Creole is considered
as a variety of English.
4 Thanks go to Christian Mair and the English Department of the University of
Freiburg for letting me use this preliminary version of ICE-Jamaica and to
Andrea Sand and Ute Römer for providing me with a stripped version of ICE-
GB.
5 The terms for such collocations vary widely; some examples are "support
verb constructions", "tight verb constructions" and "stretched verb construc-
tions".
6 The BNC is marked with an asterisk in this table, as only a random sample of
300 instances of HAVE + INTENTION in a span of 5 were considered (the
reason being the necessary manual disambiguation of instances, as not all in-
stances thrown up by this search constitute instances of the relevant colloca-
tion).
7 For learner language, it might be more appropriate to speak of "non-L1
prepositional verbs" than of "new prepositional verbs".
References
Ahulu, Samuel
1995 Variation in the use of complex verbs in international English. Eng-
lish Today 42: 28-34.
Channell, Joanna
1981 Applying semantic theory to vocabulary teaching. ELT Journal 35
(2): 115-122.
Crystal, David
2003 English as a Global Language. 2nd edition. Cambridge: Cambridge
University Press. First published in 1997.
Granger, Sylviane
1998 Prefabricated patterns in advanced EFL writing: Collocations and
formulae. In Phraseology: Theory, Analysis, and Applications, An-
thony P. Cowie (ed.), 145-160. Oxford: Clarendon.
Herbst, Thomas
1996 What are collocations: Sandy beaches or false teeth? English Studies
77 (4): 379-393.
Hoffmann, Sebastian, Marianne Hundt and Joybrato Mukherjee
2007 Indian English: An emerging epicentre? Insights from web-derived
corpora of South Asian Englishes. Paper presented at ICAME-18,
23-27 May 2007.
Howarth, Peter
1996 Phraseology in English Academic Writing: Some Implications for
Language Learning and Dictionary Making. Tübingen: Niemeyer.
Kaszubski, Przemysław
2000 Selected aspects of lexicon, phraseology and style in the writing of
Polish advanced learners of English: A contrastive, corpus-based ap-
proach, http://mam.amu.edu.pl/~przemka/research.html.
Mair, Christian
2007 Varieties of English around the world: Collocational and cultural
profiles. In Phraseology and Culture in English, Paul Skandera (ed.),
437-468. Berlin/New York: de Gruyter.
Mukherjee, Joybrato
2007 Structural nativisation in Indian English: Exploring the lexis-
grammar interface. In Rainbow of Linguistics, Niladri Sekhar Dash,
Probal Dasgupta and Pabitra Sarkar (eds.), 98-116. Calcutta: T. Me-
dia Publication.
Nesselhauf, Nadja
2003 Transfer at the locutional level: An investigation of German-
speaking and French-speaking learners of English. In English Core
Corpora
Christian Mair
1. Introduction
The transition from specificational clefts of the type All I did was to ask
(with the focussed element realized as a marked infinitival clause) to All I
did was ask (unmarked infinitive) represents a relatively little noticed in-
stance of ongoing syntactic change in present-day English. The phenome-
non has received some attention in a number of studies by Günter Rohden-
burg, who - based on the strength of an analysis of an electronic anthology
Writing the history of spoken standard English in the twentieth century 181
Table 1 and figure 1 summarize the relevant findings from the five corpora
of the Brown family - based on searches for possible combinations of do
and be in adjacent position (such as, for example, did was, does is, did is,
etc.).3
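A search for adjacent forms of DO and BE of this kind can be approximated with a simple regular expression; a sketch (my own illustration, not the retrieval software actually used for the corpora):

```python
import re

# Forms of DO immediately followed by forms of BE, as in "all I did was (to) ask".
# Deliberately simple: like any adjacency search, it misses instances with
# intervening material (e.g. "all I did to him was ...").
pattern = re.compile(r"\b(do|does|did|done)\s+(is|was|are|were)\b", re.IGNORECASE)

sample = "All I did was ask, and all she does is complain."
print(pattern.findall(sample))  # [('did', 'was'), ('does', 'is')]
```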
[Figure 1: bar chart of the proportions of -ing, bare-infinitive and to-infinitive complements in the five corpora of the Brown family.]
Covering sixty years of diachronic development, the British data show that
the only attested form in the 1930s was the All I did was to ask type, with
the to-infinitive, while by 1991 preferences had clearly been reversed to All
I did was ask, with the unmarked or bare infinitive, with the 1961 LOB data
representing a transitional stage. The thirty years' development in Ameri-
can English covered by Brown and Frown shows a parallel direction of the
tained from the "Brown family"). Table 2 gives the frequencies of the four
recurrent types of specificational clefts, i.e. those which could be consid-
ered conventional grammatical constructions, in the spoken corpus. "LLC"
indicates that examples are from the "old" (London-Lund Corpus, 1958-
1977) part of the DCPSE, whereas ICE-GB indicates origin in the "new"
(ICE-GB, 1990-1992) part.
Let us focus on the three constructions familiar from the written corpora
first, that is the two types of infinitival complements and the rare -ing-
complement. Here the most striking result is that the reversal of preferences
in British English spoken usage is virtually simultaneous with the one ob-
served in writing. Clearly, this is not what one would expect given the gen-
erally conservative nature of writing. As for -ing-complements, they are as
marginal in this small diachronic spoken corpus as in the written corpora of
the Brown family.
The exclusively spoken finite-clause complement (All I did was I
asked), on the other hand, is amply attested and apparently even on the rise
in terms of frequency. This raises an interesting question: Why is it that one
innovative structure, the bare infinitival complement (All I did was ask),
should show up in written styles so immediately and without restriction,
whereas the other, the finite-clause type, should be blocked from taking a
similar course? The reason is most likely that the finite-clause variant is not
a grammatically well-formed and structurally complete complex sentence
and therefore not felt to be fully acceptable in writing. That is, writers re-
frain from using it for essentially the same reasons that they shun left- and
right-dislocation structures or the use of copy pronouns (e.g. this man, I
know him; that was very rude, just leaving without a word). And just as
such dislocation structures are presumably very old, the finite type of speci-
ficational cleft, unlike the unmarked infinitive, may not really be an innova-
tion but an old and established structure which merely failed to register in
our written sources.
Modal verbs, both the nine central modals and related semi-auxiliaries and
periphrastic forms, have been shown to be subject to fairly drastic dia-
chronic developments in twentieth and twenty-first century written English.
The point has been made in several studies based on the Brown family (e.g.
Leech 2003; Smith 2003; Mair and Leech 2006; Leech et al. 2009). Other
studies, such as Krug (2000), have explored the bigger diachronic picture
since the Early Modern English period and show that such recent changes
are part of a more extended diachronic drift. Considering the central role of
modality in speech and writing, modals are thus a top priority for research
in the DCPSE.5
Table 3 shows the frequency of selected modal verbs and periphrastic
forms in the oldest (1958-1960) and most recent (1990-1992) portions of
the DCPSE. The restriction to the first three years, at the expense of the
intervening period from 1961 to 1977, was possible because modals are
sufficiently frequent. It was also desirable because in this way the extreme
points of the diachronic developments were highlighted. What is a potential
complication, though, is the fact that it is precisely the very earliest DCPSE
texts which contain the least amount of spontaneous conversation, so that a
genre bias might have been introduced into the comparison.
Log likelihood: a value of 3.84 or more equates with chi-square values of p < 0.05;
a value of 6.63 or more equates with chi-square values of p < 0.01. *HAVE to
1958-60 vs. 1990-92: significant at p < 0.05; **must, need to 1958-60 vs. 1990-92:
significant at p < 0.01.
CAPITALIZED forms represent all morphological variants.
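The log-likelihood statistic referred to in the table note can be computed from two observed frequencies and the two corpus sizes; a sketch of the standard calculation (the figures in the example are illustrative, not values from the tables):

```python
from math import log

def log_likelihood(a: int, b: int, c: int, d: int) -> float:
    """Log likelihood for a form occurring a times in a corpus of c words
    and b times in a corpus of d words; values of 3.84 or more correspond
    to p < 0.05, values of 6.63 or more to p < 0.01 (cf. the note above)."""
    e1 = c * (a + b) / (c + d)  # expected frequency in corpus 1
    e2 = d * (a + b) / (c + d)  # expected frequency in corpus 2
    ll = 0.0
    if a:
        ll += a * log(a / e1)
    if b:
        ll += b * log(b / e2)
    return 2 * ll

# Illustrative: 100 vs. 50 occurrences in two 500,000-word corpora
print(round(log_likelihood(100, 50, 500_000, 500_000), 2))  # 16.99
```

A value of 16.99 would thus count as significant at p < 0.01.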
Much of what emerges from these spoken data is familiar from the study of
contemporaneous written English: the dominant position of have to among
the present-day exponents of obligation and necessity, the decline of must,
the marginal status of need in auxiliary syntax and the phenomenal spread
of main-verb need to in modal functions. Note, for example, that in the
span of a little more than 30 years the normalized frequency of must drops
from around 10 instances per 10,000 words to a mere 5, thus leading to its
displacement as the most frequent exponent of obligation and necessity. By
the early 1990s this position has been clearly ceded to have to. Note further
that main-verb need to, which barely figured in the late 1950s data, has
firmly established itself 30 years later.
However, as table 4 shows, normalized frequencies (per 10,000 words
of running text in this case) and, more importantly, relative rank of the in-
vestigated forms still differ considerably across speech and writing.
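Normalized frequencies of this kind are simple arithmetic; a sketch with illustrative (hypothetical) counts, not figures taken from the tables:

```python
# Rescaling an absolute corpus frequency to instances per n words of
# running text, e.g. per 10,000 words (modals) or per million words.
def normalized(absolute_freq: int, corpus_size: int, per: int = 10_000) -> float:
    return absolute_freq * per / corpus_size

# Hypothetical: 500 tokens of must in a 500,000-word sample
print(normalized(500, 500_000))                 # 10.0 per 10,000 words
print(normalized(500, 500_000, per=1_000_000))  # 1000.0 per million words
```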
the present discussion is that, as will be shown, all and any changes ob-
served in genitive usage seem to be confined to writing (or writing-related
formal genres of speech such as broadcast news). This emerges in striking
clarity from a comparison of genitive usage in the Brown family and in the
DCPSE. For ease of comparison, DCPSE figures have been normalized as
"N per million words", with absolute frequencies given in brackets:6
The table shows that genitives in spoken language are consistently less
frequent than in writing in both periods compared, which is an expected
spin-off from the general fact that noun phrases in spontaneous speech tend
to be much shorter and less structurally complex than in writing. More in-
teresting, though, is the fact that while nothing happens diachronically in
speech (with the frequency of genitives hovering around the 2,000 in-
stances per million words mark), there are steep increases in the written
corpora, which in the thirty-year interval of observation even document the
emergence of a significant regional difference between American and Brit-
ish English.7 In other words, on the basis of the recent diachrony of the s-
genitive (and a number of related noun-phrase modification structures),8 we
can make the point that written language has had a separate and autono-
mous history from spoken English in the recent past. This history is appar-
ently a complex one as the observed development manifests itself to differ-
ent extents in the major regional varieties. How this partial diachronic
autonomy of writing can be modelled theoretically is a question which we
shall return to in the following section.
perspective. It was opened up only a very short while ago with the publica-
tion of the DCPSE and is currently still restricted to the study of one single
variety, British Standard English. As I hope to have shown in the present
contribution, it is definitely worth exploring.
Change which proceeds simultaneously in speech and writing is possi-
ble but rare. In the present study, it was exemplified by the spread of un-
marked infinitives at the expense of to-infinitives in specificational clefts.
The more common case by far is change which proceeds broadly along
parallel lines, but at differential speed in speech and writing. This was illus-
trated in the present study by some ongoing developments involving modal
expressions of obligation and necessity. The recent fate o? have got to in
British English, for example, shows very clearly that local British usage
may well persist in speech while it is levelled away in writing as a result of
the homogenizing influences exerted by globalized communication. Con-
versely, must, which decreases both in speech and writing, does so at a
slower rate in the latter.
The potential for autonomous developments in speech and writing was
shown by finite-clause clefts (the All I did was I asked type) and s-genitives
respectively. Of course, the fact that there are developments in speech
which do not make it into writing (and vice versa) does not mean that there
are two separate grammars for spoken and written English. Genuine struc-
tural changes, for example the grammaticalization of modal expressions,
usually arise in conversation and are eventually taken up in writing - very
soon, if the new form does not develop any sociolinguistic connotations of
informality and non-standardness, and with a time lag, if such connotations
emerge and the relevant forms are therefore made the object of prescriptive
concerns. What leads to autonomous grammatical developments is the dif-
ferent discourse uses to which a shared grammatical system may be put in
speech and writing.
Spoken language is time-bound and dialogic in a way that formal edited
writing cannot be. On the whole, spoken dialogue is, of course, as gram-
matical as any written text, but this does not mean that the grammatical-
structural integrity of any given utterance unit is safe to the same extent in
spontaneous speech as that of the typical written sentence. Structurally
complete grammatical units are the overwhelming norm in writing but
much more easily given up in the complex trade-offs between grammatical
correctness, information distribution and rhetorical-emotional effects which
characterize the online production of speech. This is witnessed by "disloca-
tion" patterns such as that kind of people, I really love them or - in the con-
text of the present study - the finite "echo clause" subtype of specifica-
tional clefts (All I did was I asked).9 This structure shows a sufficient de-
gree of conventionalisation to consider it a grammatical construction. How-
ever, it is not a grammatical construction which is likely to spread into writ-
ing because the subordinate part of the cleft construction is not properly
embedded syntactically.
Conversely, compression of information as it is achieved by expanding
noun heads by modifiers such as genitives, prepositional phrases or attribu-
tively used nouns is not a high priority in spontaneous speech. However, it
is a central functional determinant of language use in most written genres.
More than ever before in history, writers of English today are having to
cope with masses of information, which will give a tremendous boost to
almost any structurally economical compression device in the noun phrase,
as has been shown for the s-genitive in the present study.
Thus, even if spoken and written English share the same grammar, as
soon as we move to the discourse level and study language history as a
history of genres or as the history of changing traditions of speaking and
writing, it makes sense to write a separate history of the written language in
the Late Modern period. This history will document the linguistic coping
strategies which writers have been forced to develop to come to terms with
the increasing bureaucratization of our daily lives, the complexities intro-
duced by the omnipresence of science and technology in the everyday
sphere and the general "information explosion" brought about by the me-
dia.
Above and beyond all this, however, close attention to the spoken lan-
guage in diachronic linguistics is salutary for a more general reason. It
keeps challenging us to question and re-define our descriptive categories.
As was shown in the case of specificational clefts, the variable and its vari-
ants were easy to define in the analysis of the written language, and diffi-
culties of classification of individual corpus examples were rare. This was
entirely different in the spoken material, where we were constantly faced
with the task of deciding which of the many instances of discourse-
pragmatic focusing which contain chunks such as what X did was or all X
did was represented a token of the grammatical construction "specifica-
tional cleft sentence" whose history we had set about to study. Grammar
thus "emerges" in psychological time in spontaneous discourse long before
it develops as a structured system of choices in historical time.
Notes
1 That is the familiar array of the Brown Corpus (American English, 1961), its
British counterpart LOB (1961), their Freiburg updates (F-LOB, British Eng-
lish 1991; Frown, American English 1992) and - not completed until recently
- B-LOB ("before LOB"), a matching corpus illustrating early 1930s British
English. I am grateful to Geoff Leech, Lancaster, and Nick Smith, Salford, for
allowing me access to this latter corpus, which is not as yet publicly available.
2 Note that this example has the speaker correcting an unmarked infinitive into
an -ing form.
3 This admittedly unsophisticated strategy secures relatively high precision and
even higher recall, although of course a very small number of instances with
material intervening between be and do, such as All I did to him was criticise,
will be missed.
4 In particular, the following two issues are in need of clarification, on the basis
of much larger corpora than the Brown family: (1) Are there -ing-
complements without the preceding trigger (type All I did was asking), and
(2) are there unmarked infinitival complements following a preceding pro-
gressive (type All I was doing was ask)? The one instance found of the latter,
quoted as (3) above, shows instant self-correction by the speaker.
5 And this research was duly carried out by Barbara Klein in an MA thesis
(Klein 2007). The author wishes to thank Ms. Klein for her meticulous work
in one of the first DCPSE-based studies undertaken.
6 The DCPSE consists of matching components of London-Lund (1958-1977)
and ICE-GB (1990-1992) material, totalling ca. 855,000 words.
7 Judging from the B-LOB data, it also seems that the trend picked up speed in
the second half of the twentieth century in British English. Pending the
completion of a "pre-Brown" corpus of 1930s written American English, it is,
however, difficult to determine the precise significance of the B-LOB fin-
dings.
8 Chiefly, these are nouns used in attribute function, for which similarly drastic
increases have been noted in Biber (2003), for example. See also Biber (1988)
and (1989). Indeed, in terms of information density, a noun phrase such as
Clinton Administration disarmament initiative could be regarded as an even
more compressed textual variant of the Clinton Administration's disarmament
initiative, which in turn is a compressed form of the disarmament initiative of
the Clinton Administration. Raab-Fischer (1995) was the first to use corpus
analysis to prove that the increase in genitives went hand in hand with a de-
crease in of-phrases post-modifying nominal heads. Her data was the then a-
vailable untagged press sections of LOB and F-LOB. Analysis of the POS-
tagged complete versions of B-LOB, LOB and F-LOB shows that her provi-
sional claims have stood the test of time quite well. Of-phrases decrease from
References
Besch, Werner
2003 Schrifteinheit - Sprechvielfalt: Zur Diskussion um die nationalen
Varianten der deutschen Standardsprache. In Deutsche Sprache im
Wandel: Kleinere Schriften zur Sprachgeschichte, Werner Besch
(ed.), 295-308. Frankfurt: Lang.
Biber, Douglas
1988 Variation Across Speech and Writing. Cambridge: Cambridge Uni-
versity Press.
Biber, Douglas
2003 Compressed noun-phrase structures in newspaper discourse: The
competing demands of popularization vs. economy. In New Media
Language, Jean Aitchison and Diana M. Lewis (eds.), 169-181. Lon-
don: Routledge.
Biber, Douglas and Edward Finegan
1989 Drift and evolution of English style: A history of three genres. Lan-
guage 65: 487-517.
Biber, Douglas, Stig Johansson, Geoffrey Leech, Susan Conrad and Edward Fine-
gan
1999 Longman Grammar of Spoken and Written English. Harlow: Long-
man.
Hinrichs, Lars and Benedikt Szmrecsanyi
2007 Recent changes in the function and frequency of standard English
genitive constructions: A multivariate analysis of tagged corpora.
English Language and Linguistics 11: 437-474.
Hoffmann, Sebastian
2005 Grammaticalization and English Complex Prepositions: A Corpus-
Based Study. London: Routledge.
Hopper, Paul
1998 Emergent grammar. In The New Psychology of Language, Michael
Tomasello (ed.), 155-175. Mahwah, NJ: Lawrence Erlbaum.
Writing the history of spoken standard English in the twentieth century 193
Hopper, Paul
2001 Grammatical constructions and their discourse origins: Prototype or
family resemblance? In Applied Cognitive Linguistics I: Theory and
Language Acquisition, Martin Pütz and Susanne Niemeier (eds.),
109-129. Berlin: Mouton de Gruyter.
Hopper, Paul
2004 The openness of grammatical constructions. Chicago Linguistic
Society 40: 239-256.
Klein, Barbara
2007 Ongoing morpho-syntactic changes in spoken British English: A
study based on the DCPSE. Unpublished Master's thesis. University
of Freiburg.
Krug, Manfred
2000 Emerging English Modals: A Corpus-Based Study of Grammaticali-
zation. Berlin/New York: Mouton de Gruyter.
Leech, Geoffrey
2003 Modality on the move: The English modal auxiliaries 1961-1992. In
Modality in Contemporary English, Roberta Facchinetti, Manfred
Krug and Frank R. Palmer (eds.), 223-240. Berlin: Mouton de
Gruyter.
Leech, Geoffrey, Marianne Hundt, Christian Mair and Nicholas Smith
2009 Change in Contemporary English: A Grammatical Study. Cam-
bridge: Cambridge University Press.
Mair, Christian
2006 Inflected genitives are spreading in present-day English, but not
necessarily to inanimate nouns. In Corpora and the History of Eng-
lish: Festschrift for Manfred Markus on the Occasion of his 65th
Birthday, Christian Mair and Reinhard Heuberger (eds.), 243-256.
Heidelberg: Winter.
Mair, Christian
2007 British English/American English grammar: Convergence in writing,
divergence in speech. Anglia 125: 84-100.
Mair, Christian and Geoffrey Leech
2006 Current changes. In The Handbook of English Linguistics, Bas Aarts
and April McMahon (eds.), 318-342. Oxford: Blackwell.
Raab-Fischer, Roswitha
1995 Löst der Genitiv die of-Phrase ab? Eine korpusgestützte Studie zum
Sprachwandel im heutigen Englisch. Zeitschrift für Anglistik und
Amerikanistik 43: 123-132.
Rohdenburg, Günter
2000 The complexity principle as a factor determining grammatical varia-
tion and change in English. In Language Use, Language Acquisition
Corpora
Prefabs in spoken English
Brigitta Mittmann
1. Introduction
This article discusses the method of using parallel corpora from what are
arguably the two most important regional varieties of English - American
English and British English - for finding prefabricated or formulaic word
combinations typical of the spoken language. It sheds further light on the
nature, shape and characteristics of the most frequent word combinations
found in spoken English as well as the large extent to which the language
can be said to consist of recurrent elements. It thus provides strong evi-
dence supporting John Sinclair's idiom principle (1991: 110-115). The
article is in parts an English synopsis of research that was published in
German in monograph form in Mittmann (2004) and introduces several
new aspects of this research hitherto not published in English. This research
is highly relevant in connection with several issues discussed elsewhere in
this volume.
2.1. Material
The study is based upon two corpora which both aim to be representative of
natural every-day spoken English: for British English, the spoken demographic part of the British National Corpus (BNCSD) and for American
English, the Longman Spoken American Corpus (LSAC). Both corpora
contain recordings of people from all age groups, both sexes and from a
variety of social and regional backgrounds of the two countries. The size of
the two corpora is similar: the BNCSD contains about 3.9 million and the
LSAC about 4.9 million words of running text.
Despite some minor differences, these corpora are similar enough to be
compared as parallel corpora. When the research was carried out, they were among the largest corpora of spoken English available for this purpose.
Nonetheless, frequency-based and comparative studies of word combinations demand a certain minimum number of occurrences. This meant that it was necessary to concentrate upon the most frequent items, since otherwise the figures become less reliable.
2.2. Method
For determining the most frequent word combinations in the BNCSD and
the LSAC, a series of programs was used which had been specially written
for this purpose by Florian Klampfl. These programs are able to extract n-
grams or clusters (i.e. combinations of two, three, four, or more words)
from a text and count their frequency of occurrence. For example, the sen-
tence Can I have a look at this? contains the following trigrams: CAN I
HAVE, I HAVE A, HAVE A LOOK, A LOOK AT and LOOK AT THIS
(each with a raw frequency of one). The idea to use n-grams - and the term
cluster - was inspired by the WordSmith concordancing package (Scott
1999).
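The extraction step itself is easy to reproduce. The following Python sketch (an illustration under the assumption of pre-tokenized input, not Klampfl's original programs) counts n-grams in the way just described:

```python
from collections import Counter

def extract_ngrams(tokens, n):
    """Count all n-grams (clusters) of length n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# The example sentence from the text:
tokens = "can i have a look at this".split()
trigrams = extract_ngrams(tokens, 3)
# Yields CAN I HAVE, I HAVE A, HAVE A LOOK, A LOOK AT and LOOK AT THIS,
# each with a raw frequency of one.
```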
With the help of the χ²-test a list was created which sorted the clusters
from those with the most significant differences between the two corpora to
those with the greatest similarity. In addition to this, a threshold was intro-
duced to restrict the output to those items where the evidence was strongest.
This minimum frequency is best given in normalized figures (as parts per
million or ppm), as the LSAC is somewhat larger than the BNCSD. It was
at least 12.5 ppm in at least one of the corpora (this corresponds to around
49 occurrences in the BNCSD and about 61 occurrences in the LSAC).
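The computation behind such a ranking can be sketched as follows (a minimal re-implementation for illustration; the internals of the original programs are not documented here). For each cluster, a χ² statistic is computed from the 2×2 contingency table of occurrences and non-occurrences in the two corpora, alongside the ppm normalization used for the threshold:

```python
def ppm(freq, corpus_size):
    """Normalized frequency in parts per million."""
    return freq * 1_000_000 / corpus_size

def chi_square_2x2(freq_a, size_a, freq_b, size_b):
    """Chi-square statistic for one cluster occurring freq_a times in a
    corpus of size_a words and freq_b times in one of size_b words."""
    total = size_a + size_b
    hits = freq_a + freq_b
    # Expected cell counts under the null hypothesis of identical rates.
    exp_a, exp_b = size_a * hits / total, size_b * hits / total
    exp_na = size_a * (total - hits) / total
    exp_nb = size_b * (total - hits) / total
    return ((freq_a - exp_a) ** 2 / exp_a
            + (freq_b - exp_b) ** 2 / exp_b
            + ((size_a - freq_a) - exp_na) ** 2 / exp_na
            + ((size_b - freq_b) - exp_nb) ** 2 / exp_nb)

# The 12.5 ppm threshold corresponds to around 49 tokens in the BNCSD
# (ca. 3.9 million words) and around 61 in the LSAC (ca. 4.9 million words):
bncsd_min = 12.5 * 3_900_000 / 1_000_000   # 48.75
lsac_min = 12.5 * 4_900_000 / 1_000_000    # 61.25
```

Identical relative frequencies in the two corpora yield a statistic of zero; the larger the statistic, the more significant the difference, which is the ordering described above.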
(like that) (both esp. LSAC), or multi-word expletives like bloody hell, oh
dear (esp. BNCSD) or (oh) my gosh, oh boy, oh man, oh wow (esp. LSAC).
Apart from conversational routines, there was a wide range of other
types of word combinations to be found. Most of the material is unlikely to
fall under the heading of 'classical idioms', but nonetheless a substantial
part of it can be seen as idiomatic in the sense that their constituents are
either semantically or syntactically odd. Borrowing Fillmore, Kay and
O'Connor's (1988: 508) expression, one could say that they are mostly
"familiar pieces unfamiliarly arranged". Amongst these pre-assembled
linguistic building blocks, there are idioms such as be on about (sth), have
a go (both esp. British English), 2 but also expressions such as be like, a
reporting construction used predominantly by younger American speakers, as exemplified in the following stretch of conversation:
(1) <2159> Call him up and he gets kind of snippy with me on the phone. Well
he's sending in mixed messages anyway but uh I called him <nv_clears
throat> and he's snippy and he's like no I can't go. And I'm like fine that's all
I need to know. And so let him go. Tuesday he comes in and he's like
<mimicking>hi how are you?</mimicking> and everything and I'm just
like comes up and talks to me like twice and he's like <nv sigh> you don't
believe me do you? I'm like no. (LSAC 150701)
Further types of frequent word combinations include phrasal and preposi-
tional verbs - e.g. British get on with, go out with; American go ahead (and
...), figure out, work out ('exercise') - and certain quantifiers such as a bit
of (esp. BNCSD), a little of (esp. LSAC) or a load of (BNCSD). In the cor-
pora, there were also instances showing that individual words can have
quite different collocational properties in different varieties. For example, a
lot tends to co-occur much more often with quite and not in British English
than it does in American English.
A large number of clusters points to what one might call the 'fuzzy edges'
of traditional phraseology. On the borderline between phraseology and
syntax there are, for example, tag questions, 3 periphrastic constructions
such as the present perfect (which is used more frequently in British Eng-
lish), semi-modals such as have got (esp. BNCSD) and going to/gonna
(esp. LSAC) or valency phenomena such as the fact that in spoken American
2.5. Evaluation
Hausmann's example Das Haar ist nicht nur bei alten Menschen sondern auch bei relativ jungen Menschen bereits recht häufig schütter (literally: 'The hair not just of older people but also of relatively young ones is quite often thin already.') (1985: 127). However, this phenomenon tends to be rare in comparison with the large amount and wide variety of other material which can be collected.
The approach chosen is largely data-driven and casts the net wide with-
out either restricting or anticipating results. It proved very useful for what
was effectively a pilot study, as there had not been any systematic treatment
of such a wide variety of word combination differences between spoken
American and British English.
In 2006, John Algeo published a book on British word and grammar
patterns and the ways in which they differ from American English. His
focus and methods are different from the ones reported upon here and he
based his research upon other data (including different corpora). Nonethe-
less, there is some overlap and in these areas, his findings generally cor-
roborate those from Mittmann (2004).
Another recently finished study which has some connection with the
present research is the project of Anne Grimm (2008) in which she studied
differences between the speech of men and women, including amongst
other things the use of hedging expressions and expletives. This project is
also based upon the BNCSD and the LSAC and observes differences be-
tween the regional varieties. Again, the results from Mittmann (2004) are
generally confirmed, while Grimm differentiates more finely between the
statistics for different groups of speakers for the items that she focuses on.
However, in these - and other - works a number of theoretical issues
had to be left undiscussed and it is some of these points that will be ex-
plored in the next sections.
3. Theoretical implications
In comparing two corpora, the problem arises of what the basis for the comparison (or the tertium comparationis) should be. If one combination of
words occurs, for example, five times as frequently in one corpus as it does
in the other one, then this may be interesting, but it leaves open the question of how the speakers in the other corpus would express the same concept
or pragmatic function instead. Therefore it is highly relevant to look for
what one might call "synonymous" word combinations - and take this to
include combinations with the same pragmatic function. Sometimes such
groups of expressions with the same function or meaning can be found
relatively easily, as in figure 1 (taken from Mittmann 2005), which gives a
number of comment clauses which have the added advantage of having
similar or identical structures:
[Figure 1. Bar chart comparing the frequencies of the comment clauses I reckon, I should think, I expect, I suppose, I think, I believe, I figure and I guess in the LSAC and the BNCSD]
However, finding such neat groups can be difficult and similar surface
structures do not guarantee functional equivalence. For example, it has
been pointed out in the literature on British-American differences (Benson,
Benson and Ilson 1986: 20) that in a number of support verb constructions
such as take a bath vs. have a bath, American English tends to use take,
while British speakers typically use have. While there is no reason to doubt
this contrast between take and have in support verb constructions in gen-
eral, a very different situation obtains with respect to certain specific uses
in conversation. It is remarkable that while HAVE a look does indeed ap-
pear quite frequently in the BNCSD, this is not true of take a look in the
LSAC. Instead, expressions such as let's see or let me see appear to be used. Moreover, both let me see and let's see as well as let me have a
look and let's have a look are often used synonymously, as can be seen
from the following extract from a conversation between a childminder and
a child:
Prefabs in spoken English 203
(2) ... you've got a filthy nose. Let's have a look. (BNCSD, kb8)
Quite feasibly, a German speaker might use a very different construction
such as Zeig mal (her) - which consists of an imperative (zeig, 'show') +
pragmatic particle (mal) + adverb (her; here 'to me') - in similar situations.
Pragmatic equivalence is therefore context-dependent and comparisons
between varieties can be made at very different levels of generality. This is
also a problem for anyone studying what Herbst and Klotz (2003: 145-149)
have called probabemes, i.e. likely linguistic realizations of concepts. If
one opts for the level of pragmatic function, then very general functions
such as expressing indirectness are very difficult to take into account, as
they can be realised by such a variety of linguistic means, from modal
verbs and multi-word hedging expressions to the use of questions rather
than statements in English, or the use of certain pragmatic particles (e.g.
vielleicht) in German. Any statement about whether, for example, the
speakers of one group are more or less indirect than those of another will
have to take all those features into account. And while it appears to be true,
for example, that British speakers use certain types of modal verb more
frequently than their American counterparts (Mittmann 2004: 101-106),
there are a number of speakers in the LSAC who use many more hedging
expressions such as like (as in And that girl's going to be like so spoiled,
LSAC 130801, or it's like really important, LSAC 161902).
got is typical for spoken British English (see above), but does not normally occur together with no idea. Both the forms I've no idea and No idea are much more frequent than I've got no idea.
This also means that the external form of prefabs can be crucial, as there
may be small, but established differences between varieties. They can relate
to the use of words belonging to 'minor' word classes or to conventional
ellipses. For example, the use of articles can differ between the two varieties, as with get a hold of (something), an expression which is more typical
of the American corpus, versus get hold of (something), which is its British
counterpart. Sometimes, the meaning of a phrase depends crucially on the
presence of the article, as in the combinations the odd [+ noun] or a right [+
noun] (both esp. BNCSD). In this use, odd typically has the meaning 'occasional', as in We are now in a country lane, looking out on the odd passing car (BNCSD, ksv), whereas right is used as an intensifier for nouns denot-
ing disagreeable or bad situations, personal qualities or behaviour, as in
There was a right panic in our house (BNCSD, kcl). However, as seen
above with get (a) hold of (something), there does not have to be any such
change of meaning.
In other cases, interjections are a characteristic part of certain expres-
sions. For example, in both corpora in around 80 % of all cases my god is
preceded by oh. Again, there are a number of clusters containing interjec-
tions which are far more typical of one of the two varieties. Examples of this are well thank you, no I won't or yes you can. Often these are re-
sponses, which will be discussed again below.
A further formal characteristic of certain formulaic sequences is that
they are frequently elliptical, such as No idea, which appears on its own in
almost half the cases in the BNCSD, Doesn't matter (one third of cases
without subject), or Course you can (more than two thirds of cases). All of
them are often used as responses, as in the following examples:
(3a) <PS07D> Oh ah! Can I take one?</u>
<PS079> Course you can. I'll take two. (...) </u> (BNCSD, kbs)
(3b) <PS1 AD> (...) Can I use your phone?</u>
<PS1A9> Yeah, course <ptr t=KBCLC085> you can. <ptr t=KBCLC086>
</u> (BNCSD, kbc)
retrieved from memory. This means that even highly frequent utterances such
as No, it isn't or Yes, it is, which seem banal in that they are fully analyz-
able and can be constructed following the grammatical rules of the lan-
guage, can be regarded as prefabricated, which should put them at the cen-
tre of any theory of language. Authors such as Wray have argued persua-
sively in favour of seeing prefabricated word combinations (or, as she puts
it, formulaic sequences) as central to linguistic processing (2002: 261),
although using varieties of a language as support for this position appears
to be an approach which had not actually been put into practice before
Mittmann (2004).
One problem which has also been noted by other authors is that it is often
difficult to determine where the boundaries between chunks are. Many
chunks show what one might term 'crystallization', having a stable core
and more or less variable periphery. And while some chunks are compara-
tively easy to delimit, others are not. This is, for example, partly reflected
in Sinclair and Mauranen's distinction between O and M units (2006: 59).
The M units contain what is being talked about whereas the O units (e.g.
hedges, discourse markers and similar items) organize the discourse. The
latter tend to be particularly stable in their form.
On top of this, there are sometimes intriguing differences between the
varieties. For example, in the American corpus certain items, notably cer-
tain discourse markers such as you know or the negative particle not, can
interrupt verb phrases or noun phrases by squeezing in between their con-
stituents like a wedge. In examples (4.1) and (4.2) below, the wedge is
placed between the infinitive particle and the verb, in (4.3), it is between
the article and the premodifier, and in (4.4) between an adverbial and the
verb.
(4.1) <2396> Yeah, I don't like them either. <nv_laugh> No that's supposed to
be a good school. I'll just try to you know cheer along. Be supportive.
(LSAC, 155401)
(4.2) <2194> It's so much easier to not think, to have somebody else to do it for
you. (LSAC, 150102)
(4.3) <1510> Well and he was saying I was riding on the sidewalk which you can
do outside of the you know, downtown area. (LSAC, 125801)
(4.4) <2058> and uh, my brother, just you know cruises around on his ATV and
his snowmobile when it's snowmobile season (...) (LSAC, 144301)
In a similar manner, other items such as kind of or I think can function as
wedges in these positions. Apparently, there is a greater tendency in
American English to insert items just in front of the verb or between certain
other closely linked clause or phrase constituents. These places would ap-
pear to be where the speaker conventionally takes time for sentence plan-
ning and there may well be differences between the varieties - as indeed
there are between languages - in this respect. Anybody who has ever stud-
ied English films dubbed into German will probably agree that hesitation
phenomena (notably repetitions and pauses) are somewhat odd in compari-
son to non-scripted, everyday conversational German.
The wedges also affect the rhythmic patterning of sentences in the Ameri-
can corpus. Further research should investigate the links between stress
(and, thus, rhythm) and intonation, pauses, hesitation phenomena and
'chunks'. Sometimes, interesting rhythmical patterns seem to appear in
other portions of the material. For example, many of the responses which
are overwhelmingly found in the BNCSD have a stress pattern of two un-
stressed syllables followed by a stressed syllable (in other words, an anapaest), as in I don't mind; yes you can; course you can, etc. In addition to
this, there appear to be differences in the use of contracted forms. For sim-
ple modal verbs, for example, the BNCSD has more contractions involving
the negative particle (e.g. can't, couldn't), whereas there is a stronger ten-
dency towards using the full form not (e.g. cannot, could not) in the LSAC.
The same applies to the use of 'll versus will. However, since the study of
such contractions depends crucially on transcription conventions, further
research in this field would need to include the audio files.
5. Conclusion
The project described in this article has shown that American and British
spoken English differ markedly in the word combinations which they typi-
cally use. These word combinations span a wide range of types - from
various kinds of routine formulae to frequently recurring responses. A few
formulaic sequences are grammatically or semantically odd, but many more
are neither of those, although they typically have a special pragmatic or
discourse-related function. Nonetheless, the fact that they are typical of one variety of a language but not of another indicates that they are to some
extent formulaic.
Thus, the British-American differences reported on here provide further evidence that everyday language is to a great extent conventionalised. Idiomaticity (or formulaicity) pervades language. It consists largely of
recurring word combinations which are presumably stored in the speaker's
memory as entities. The comparison of parallel corpora offers compelling
evidence confirming Sinclair's idiom principle. In the words of Franz Josef
Hausmann, we can say that there is "total idiomaticity" (1993: 477).
Notes
References
Algeo, John
2006 British or American English? A Handbook of Word and Grammar
Patterns. Cambridge: Cambridge University Press.
Benson, Morton, Evelyn Benson and Robert Ilson
1986 Lexicographic Description of English. Amsterdam: John Benjamins.
Fillmore, Charles J., Paul Kay and Mary C. O'Connor
1988 Regularity and idiomaticity in grammatical constructions. Language
64 (3): 501-538.
Römer, Ute and Rainer Schulze (eds.)
2009 Exploring the Lexis-Grammar Interface. Amsterdam/Philadelphia:
Benjamins.
Scott, Mike
1999 WordSmith Tools, Version 3. Oxford: Oxford University Press.
Sinclair, John McH.
1991 Corpus, Concordance, Collocation. Oxford: Oxford University
Press.
Sinclair, John McH. and Anna Mauranen
2006 Linear Unit Grammar: Integrating Speech and Writing. Amster-
dam/Philadelphia: John Benjamins.
Wray, Alison
2002 Formulaic Language and the Lexicon. Cambridge: Cambridge Uni-
versity Press.
Corpora
Observations on the phraseology of academic writing
Ute Römer
1. Introduction
The past few years have seen an increasing interest in studies based on new
kinds of specialized corpora that capture an ever-growing range of text
types, especially from academic, political, business and medical discourse.
Now that more and larger collections of such specialized texts are becom-
ing available, many corpus researchers seem to switch from describing the
English language as a whole to the description of a number of different
language varieties and community discourses (see, for example, Biber
2006; Biber, Connor and Upton 2007; Bowker and Pearson 2002; Gavioli
2005; Hyland 2004; and the contributions in Connor and Upton 2004; and
in Römer and Schulze 2008).
This paper takes a neo-Firthian approach to academic writing and exam-
ines lexical-grammatical patterns in the discourse of linguistics. It is in
many ways a tribute to John Sinclair and his groundbreaking ideas on lan-
guage and corpus work. One of the things I learned from him is that, more
often than not, it makes sense to "go back" and see how early ideas on lan-
guage, its structure and use, relate to new developments in resources and
methodologies. So, in this paper, I go back to some concepts introduced
and/or used by John Sinclair and by John Rupert Firth, a core figure in
early British contextualism, who greatly influenced Sinclair's work. Con-
tinuing Sinclair's (1996: 75) "search for units of meaning" and using new-
generation corpus tools that enable us to explore corpora semi-
automatically (Collocate, Barlow 2004; ConcGram, Greaves 2005;
kfNgram, Fletcher 2002-2007), the aim of this paper is to uncover the phra-
seological profile of a particular sub-type of academic writing and to see
how meanings are created in a 3.5-million word corpus of linguistic book
reviews written in English, as compared to a larger corpus of a less special-
ized language.
After an explanation of the concept of "restricted language" and a dis-
cussion of ways in which meaningful units can be identified in corpora, the
212 Ute Römer
Continuing Sinclair's search for units of meaning, the question I would like
to address here is: How can we find meaningful units in a corpus? Or, more
specifically (given that BRILC contains a particularly evaluative type of
texts), how can we find units of evaluative meaning in a corpus? Evalua-
tion, seen as a central function of language and broadly defined (largely in
line with Thompson and Hunston 2000) as a term for expressions of what
stance we take towards a proposition, i.e. the expression of what a speaker
or writer thinks of what s/he talks or writes about, comes in many different
shapes, which implies that it is not easy to find it through the core means of
corpus analysis (doing concordance searches or word lists and keyword
lists). As Mauranen (2004: 209) notes, "[i]dentifying evaluation in corpora
is far from straightforward. ... Corpus methods are best suited for searching
items that are identifiable, therefore tracking down evaluative items poses a
methodological problem". On a similar note, Hunston (2004: 157) states
that "the group of lexical items that indicate evaluative meaning is large
and open", which makes a fully systematic and comprehensive account of
evaluation extremely difficult. In fact, the first analytical steps I carried out
in my search for units of evaluative meaning in BRILC (i.e. the examina-
tion of frequency word lists and keyword lists, see Römer 2008) did not
yield any interesting results which, at that point in the analysis, led me to
conclude that words are not the most useful units in the search for meaning
("the word is not enough", Römer 2008: 121) and that we need to move
from word to phrase level. So, instead of looking at single recurring words,
we need to examine frequent word combinations, also referred to as collo-
cations, chunks, formulaic expressions, n-grams, lexical bundles, phrase-
frames, or multi-word units. In Römer (2008), I have argued that the extrac-
tion of such word combinations or phrasal units from corpora, combined
with concordance analysis, can lead to very useful results and helps to high-
light a large number of meaningful units in BRILC.
In the present paper, however, I go beyond the methodology described
in the earlier study in which I only extracted contiguous word combinations
from BRILC (n-grams with a span of n=2 to n=7), using the software Col-
locate (Barlow 2004). I use two additional tools that enable the identifica-
tion of recurring contiguous and non-contiguous sequences of words in
texts: kfNgram (Fletcher 2002-2007) and ConcGram (Greaves 2005). Like
Collocate, kfNgram generates lists of n-grams of different lengths (i.e.
Observations on the phraseology of academic writing 215
it would be * to 101 10
it would be interesting to 44
it would be useful to 14
it would be nice to
it would be better to
it would be possible to
it would be helpful to
it would be fair to
it would be difficult to
it would be necessary to
it would be good to
it * be interesting to 58
it would be interesting to 44
it will be interesting to
it might be interesting to 6
it * be interesting to see 33 3
it would be interesting to see 23
it will be interesting to see 7
it might be interesting to see 3
Figure 1. Example p-frames in BRILC, together with numbers of tokens and
numbers of variants (kfNgram output)
Together with the types and the token numbers of the p-frames, kfNgram
also lists how many variants are found for each of the p-frames (e.g. 10 for
it would be * to). The p-frames in figure 1 exhibit systematic and controlled
variation. The first p-frame (it would be * to) shows that, of a large number
of possible words that could theoretically fill the blank, only a small set of
(mainly positively) evaluative adjectives actually occur. In p-frames two
and three, modal verbs are found in the variable slot; however, not all modal verbs occur, but only a subset of them (would, will, might).
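The way such frames can be derived from an n-gram list may be sketched as follows (a simplified illustration of the general idea, not kfNgram's actual algorithm): every position in an n-gram is replaced in turn by a wildcard, and n-grams sharing the resulting frame are grouped together.

```python
from collections import Counter, defaultdict

def pframes(tokens, n):
    """Group n-grams into phrase-frames: sets of n-grams identical except
    for one variable slot, marked '*'. Returns {frame: Counter(fillers)}."""
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    frames = defaultdict(Counter)
    for gram, count in ngrams.items():
        for slot in range(n):
            frame = gram[:slot] + ("*",) + gram[slot + 1:]
            frames[frame][gram[slot]] += count
    return frames

# A toy text containing two variants of the frame 'it would be * to':
text = ("it would be interesting to see "
        "it would be useful to note "
        "it would be interesting to see").split()
fillers = pframes(text, 5)[("it", "would", "be", "*", "to")]
# Token count of the p-frame = sum of filler counts (here 3);
# number of variants = number of distinct fillers (here 2).
```

This mirrors the kfNgram output shown in figure 1, where each p-frame is listed with its token count and its number of variants.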
ConcGram allows an even more flexible approach to uncovering re-
peated word combinations in that it automatically identifies word associa-
tion patterns (so-called "concgrams") in a text (see Cheng, Greaves and
Warren 2006). Concgrams cover constituency variation (AB, ACB) and
positional variation (AB, BA) and hence include phraseological items that
would be missed by Collocate or kfNgram searches but that are potentially
interesting in terms of constituting meaningful units. Figure 2 presents an
example of a BRILC-based concgram extraction, showing constituency
variation (e.g. it would be very interesting, it should also be interesting).
67 pon are backward anaphora, and it would be interesting to see how his theory can
68 spective of grammaticalisation it would be very interesting to have a survey of the
69 semantic transparency; again, it would be very interesting to see this pursued in
70 from a theoretical standpoint, it would be very interesting to expand this analysis
71 r future research, noting that it would be especially interesting to follow the
72 ift from OV to VO in English. It would be particularly interesting to see if this
73 ook as exciting as I had hoped it might be, although Part 4 was quite interesting,
74 d very elegantly in the paper, it would be interesting to discuss the
75 s in semantics. In my opinion, it would be interesting to see how this ontological
76 oun derivatives are discussed, it would be interesting at least to mention verbal
77 felt most positively" (p. 22). It should be noted that some interesting results
78 rs also prove a pumping lemma. It would be interesting to see further
79 pear in Linguist List reviews: it wouldn't be very interesting, I didn't make a
80 is given on this work, though it seems to be very interesting for the linguist's
81 and confined to the endnotes. It would also be interesting to set Hornstein's view
82 n of a book title was omitted. It would also be interesting to see if some of the
83 erative work on corpora. Maybe it would also be interesting to test the analyses in
84 second definition. Of course, it would also be interesting to find out that
85 ub-entries, for instance). So, it should also be interesting to find, among the
86 olved in dictionary-making and it should also be interesting to all dictionary
87 n a constituent and its copy. It might however be interesting to seek a connection
88 ity and their self-perception. It might prove to be interesting to compare the
89 iteria seem fairly reasonable. It would, however, be interesting to study the
90 is, rhetoric, semantics, etc. It would certainly be very interesting to see what
91 nages to carry out the action. It would most certainly be interesting to look at
92 CTIC THEORY" by Alison Henry). It seems to me that it would be interesting to
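A minimal sketch of the kind of search involved (illustrative only; ConcGram's actual implementation is considerably more sophisticated) looks for two words co-occurring within a fixed window, in either order and with intervening material allowed:

```python
def concgram_matches(tokens, word_a, word_b, window=5):
    """Return index pairs where word_a and word_b co-occur within `window`
    tokens of each other, in either order (positional variation) and with
    intervening words allowed (constituency variation)."""
    positions_a = [i for i, t in enumerate(tokens) if t == word_a]
    positions_b = [i for i, t in enumerate(tokens) if t == word_b]
    return [(i, j) for i in positions_a for j in positions_b
            if 0 < abs(i - j) <= window]

# Constituency variation: 'very' intervenes between 'be' and 'interesting'.
tokens = "it would be very interesting to see".split()
matches = concgram_matches(tokens, "be", "interesting")  # [(2, 4)]
```

A plain n-gram search would miss it would be very interesting, since the inserted very breaks the contiguous sequence; the windowed search above retrieves it.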
Before I turn to some of the high-frequency n-grams from my lists and their
use in BRILC, I would like to look at an item that came up in a discussion I
had about evaluation with John Sinclair (and that is also quite common in BRILC, though not as common as the other items that will be described
here). In an email to me, he wrote: "Re evaluation, I keep finding evalua-
tions in what look like "ordinary" sentences these days. ... I came across the
frame "the - - lies in - -" (Sinclair 2006, personal communication). I
think lies in is a fascinating item and I am very grateful to John Sinclair for
bringing it up. I examined lies in in my BRILC data and found that gap 1 in
the frame is filled by a noun or noun group with evaluative potential, e.g.
the main strength of the book in example (1). Gap 2 takes a proposition
about action, usually in the form of a deverbal noun (such as coverage),
which is pre-evaluated by the item from the first gap.
(1) The main strength of the book lies in its wide coverage of psycholinguistic
data and models...
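For readers who want to experiment, the frame can be approximated with a regular expression. This is an illustrative sketch only: the pattern is a rough stand-in for the "the - - lies in - -" frame, and the example sentences are invented, not drawn from BRILC.

```python
import re

# Rough approximation of the frame "the - - lies in - -":
# gap 1 = "the" plus up to six further words, gap 2 = up to six words.
FRAME = re.compile(
    r"\b(the\s+(?:\w+\s+){0,5}\w+)\s+lies\s+in\s+(\w+(?:\s+\w+){0,5})",
    re.IGNORECASE,
)

sentences = [
    "The main strength of the book lies in its wide coverage of data.",
    "The obvious defect of such an approach lies in the nature of polysemy.",
]
for s in sentences:
    m = FRAME.search(s)
    if m:
        print(m.group(1), "||", m.group(2))
```

A real study would of course still need the manual classification into positive, negative and neutral contexts that the counts below rest on.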
This is a neat pattern, but what type of evaluation does it mainly express?
An analysis of all instances of lies in in context shows that 16 out of 135
concordance lines (12 %) express negative evaluation; see examples in (2)
and (3). We find a number (27.8 %) of unclear cases with "neutral" nouns
like distinction or difference in gap 1 (see examples [4] and [5]), but most
of the instances of lies in (80, i.e. 60.2 %) exhibit positive evaluation, as
exemplified in (1) and (6). The BRILC concordance sample in figure 3
(with selected nouns/noun groups in gap 1 highlighted in bold) and the two
ConcGram displays of word association patterns in figure 4 serve to illus-
trate the dominance of positively evaluative contexts around lies in. This
means that a certain type of meaning (positive evaluation) is linked to the
218 Ute Römer
lies in pattern. In section 3.2 we will see if this is a generally valid pattern-
meaning combination or whether this combination is specific to the re-
stricted language under analysis.
(2) The obvious defect of such an approach lies in the nature of polysemy in
natural language.
(3) Probably, the only tangible limitation of the volume lies in some
typographical errors...
(4) The main difference lies in first person authority ...
(5) This distinction lies in the foregrounded nature of literary themes.
(6) The value of this account lies in the detail of its treatment of the varying
degrees and types of givenness and newness relevant to these constructions.
1 The strength, then, of The Korean Language, lies in its encyclopedic breadth of cover
2 The main strength of this book probably lies in the fact that it incorporates int
3 A particular strength of Jackson's book lies in its relevant biographical informa
4 synopsis, a major strength of this textbook lies in the integration of essential sema
5 ransformations. The strength of this chapter lies in the discussion where the authors
6 ASSESSMENT The main strength of this book lies in the personal testimonies and stor
7 SUMMARY The main strength of the book lies in its wide coverage of psycholingui
8 rgumentation is convincing, and its strength lies in that it concentrates on one langu
9 book. As a textbook, its main strength lies in the presentation of the details.
10 s the scope of the term. Hinkel's strength lies in the fact that she led her researc
Observations on the phraseology of academic writing 219
1 the preface that the value of the reader lies in bringing together work from VARIOU
2 and in my view the main value of the paper lies in the mono- and multi-factorial anal
3 that is terribly new in this book; its value lies rather in how it selects, organizes a
4 there is an intellectual value in exposing lies and deceptions, and here I think even
5 addressed. The value of his contribution lies in the realization of the power imbed
6 startling claim. The value of this account lies in the detail of its treatment of the
7 enge, a further added value of this chapter lies in the close link with Newerkla's ch
8 ins strong"(471). The value of this volume lies a) in its bringing together in one pi
9 syntacticians. The real value of this book lies in its treatment of the larger issues
10 al related fields of study. Its true value lies in its compact though penetrating dis
11 not an easy read. Despite this, its value lies in how it still manages to demonstrat
12 Ids true for the present volume. Its value lies in the fact that we can select from t
13 the compound prosodic word), and its value lies mainly in demonstrating how some rece
Let us now take a closer look at three items from the frequency-sorted n-
gram and p-frame lists: at the same time, it seems to me (it seems to *) and
on the other hand. In linguistic book review language as covered in
BRILC, at the same time mainly (in 56 % of the cases) triggers positive
evaluation, as exemplified in (7) and in the concordance sample in figure 5.
With only 5 % of all occurrences (e.g. number [8]), negative evaluation is
very rare. In the remaining 39 % of the concordance lines at the same time
is used in its temporal sense, meaning "simultaneously" (not "also"); see
example (9).
(7) Dan clearly highlights where they can be found and at the same time
provides a good literature support.
(8) At the same time, K's monograph suffers from various inadequacies ...
(9) At the same time, some new words have entered the field...
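The n-gram and p-frame lists these items were taken from can be derived with a few lines of code. The sketch below (invented toy sentence, in the spirit of tools like kfNgram) generates 4-grams and turns each one into p-frames by replacing one position at a time with a variable slot:

```python
from collections import Counter
from itertools import islice

def ngrams(tokens, n=4):
    # slide an n-token window over the tokenised text
    return zip(*(islice(tokens, i, None) for i in range(n)))

text = "it seems to me that it seems to us that".split()
frames = Counter()
for gram in ngrams(text):
    # a p-frame replaces exactly one position of the n-gram with *
    for slot in range(len(gram)):
        frame = gram[:slot] + ("*",) + gram[slot + 1:]
        frames[" ".join(frame)] += 1

print(frames["it seems to *"])
```

Sorting such a counter by frequency yields exactly the kind of frequency-sorted p-frame list discussed above (here "it seems to *" collects both "it seems to me" and "it seems to us").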
142 e animal world. At the same time, it includes a careful and honest discussion of wh
143 n October 1997. At the same time, it is a state-of-the-art panorama of the (sub-)fi
144 e at times, but at the same time it is an almost encyclopedic source of information
145 s corpus data). At the same time, it is clear that not every author has been using
146 ghout the book. At the same time, it is flexible enough in organisation to allow th
147 ian and Hebrew; at the same time it is never the case that, say, accomplishments sh
148 ard Macedonian. At the same time, it is notable that Mushin's results are consisten
149 by _his_, but at the same time it is the subject of the Japanese predicate phrase
150 anguage change. At the same time, it may equally be used by college teachers who wi
151 of the base. At the same time, it must be no larger than one syllable (as discus
152 taste, and, at the same time, it provides the research with steady foundations
153 osition itself. At the same time, it was cliticised to an immediately following ver
154 ertainly rigid. At the same time, King claims, we can easily account for such utter
155 a events, while at the same time leading to interesting questions about the often
156 ge history, but at the same time maintains an engaging and entertaining style throu
The next selected item, it seems to me, prepares the ground for predomi-
nantly negative evaluation (281 of 398 instances, i.e. 70.5 %), as exempli-
fied in (10) and the concordance sample in figure 6. Positive evaluation, as
shown in (11), is rare and accounts for only 4.9% of all cases. About
24.6 % of the BRILC sentences with it seems to me constitute neutral ob-
servations; see e.g. (12).
(10) Finally, it seems to me that the discussion of information structure was
sometimes quite insensitive to the differences between spoken and written
data.
(11) In general, it seems to me this book is a nice conclusion to the process
started in the Balancing Act...
It seems to me that it is a commonplace that truth outstrips epistemic
notions...
11 new. It seems to me, nevertheless, that there are some difficulties related to this
12 lem; it seems to me, rather, that precedence is always transitive; it is the particu
13 ail, it seems to me that a more explicit definition of word would be needed to handl
14 68). It seems to me that a high price has been paid in terms of numbers of categorie
15 ude, it seems to me that as for theoretical results, much more [...] should be said
16 ies. It seems to me that both fields would benefit from acting a little more like th
17 ath; it seems to me that Copper Island Aleut is not a good example of such process
18 per. It seems to me that in some cases this could lead M to certain misinterpretatio
19 ry). It seems to me that it would be interesting to examine such problems in a more
20 3c). It seems to me that M-S conjures up notions of abstract constructs that are not
21 Yet it seems to me that one can likewise make a strong case for claiming that espec
22 VIEW It seems to me that one of the central questions being analyzed in this book is
23 ary. It seems to me that some additional topics could have been incorporated into th
24 ded. It seems to me that such a term is used in more than one sense, having to do bo
25 lly, it seems to me to be a weakness of this approach that it will not easily handle
85 M., on the other hand, comes to the opposite conclusion on the same point. It would
86 on, on the other hand, concerns the marking of event sequences through lexical and s
87 ts, on the other hand, consider the legitimate explanations to be those that do not
88 on. On the other hand, context in CA is not a priori but something that emerges from
89 nd. On the other hand, corpus linguists who want to develop their own tailor-made so
90 t). On the other hand, C denies the existence of the notion of "subject" as a univer
91 un. On the other hand, C importantly neglects other hypotheses on the origin of pers
92 re, on the other hand, denote type shifted, generalized quantifier-like or ,>-type e
93 an, on the other hand, despite its importance in the United States, left very little
94 acy on the other hand develops more slowly, influenced by production ease, salience,
95 ns, on the other hand, do not contribute to the truth conditions of the utterance bu
96 R , on the other hand, do provide support for D, and 0 and L (2002), while finding f
97 en, on the other hand do seem to have such restrictions, lengthening such words and
98 B, on the other hand, does not view optimization of language very seriously. Instea
99 On the other hand, [...], articles are missing on subjects that FG did attend to
3.2. Corpus comparison: How "local" are these patterns and meanings?
The items we have just analyzed clearly show interesting patterns and pat-
tern-meaning relations. Their existence in BRILC alone, however, does not
say much about their status as "local" patterns, i.e. patterns that are charac-
teristic of linguistic book review language as a restricted language (in
Firth's sense). In order to find out how restricted-language-specific the
above-discussed phraseological items (lies in, at the same time, it seems to
me, on the other hand) are, I examined the same items and their patterns
and meanings in a larger reference corpus of written English, the 90-million
word written component of the British National Corpus (BNC written).
In a first step, I compared the frequencies of occurrence (normalized per
million words, pmw) of the four items in BRILC with those in
BNC written. As we can see in table 1, all units of evaluative meaning are
more frequent in BRILC than in BNC written, which may not be all that
surprising if we consider the highly evaluative type of texts included in
BRILC. Moving on from frequencies to functions, the next step then in-
volved an analysis of the meanings expressed by each of the phraseological
items in BNC written. For lies in I did not find a clear preference for one
type of evaluation (as in BRILC). Instead, there was a roughly equal distri-
bution of examples across the three categories "positive evaluation"
(34.5 %), "negative evaluation" (32.5 %) and "neutral/unclear" (33 %).
While negative evaluation was rather rare in the context of lies in in the
book review corpus, the item forms a pattern with nouns like problem and
difficulty in BNC written, as the concordance samples in figure 8 show.
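The per-million-word normalisation used for this frequency comparison is straightforward; the counts below are invented for illustration (they are not the actual BRILC or BNC figures):

```python
# Normalised frequency per million words (pmw): the measure that makes
# raw counts from corpora of very different sizes comparable.
def per_million_words(raw_count, corpus_size):
    return raw_count / corpus_size * 1_000_000

# e.g. an item occurring 135 times in a hypothetical 3.5-million-word
# corpus vs. 900 times in a 90-million-word corpus:
print(round(per_million_words(135, 3_500_000), 1))
print(round(per_million_words(900, 90_000_000), 1))
```

The item would thus be clearly more frequent in the smaller corpus despite the lower raw count, which is the kind of contrast table 1 reports.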
For at the same time we also find a lower share of positive contexts in the
BNC written than in the BRILC data. While authors of linguistic book
reviews use the item predominantly to introduce positive evaluation, this
meaning is (with 9 %) very rare in "general" written English (i.e. in a col-
lection of texts from a range of different text types). An opposite trend can
be observed with respect to it seems to me. Here, positive contexts are
much more frequent in BNC written than in BRILC, where negative
evaluation dominates (with 70.5 %; only 30 % of the BNC written exam-
ples express negative evaluation). Finally, with on the other hand positive
evaluation or a positive semantic prosody is (with 33 %) also much more
common in BNC written than in BRILC (see [16] and [17] for
BNC written examples). For book reviews, I found that on the other hand
mostly introduces negative evaluation and that only 8 % of the BRILC
4. Concluding thoughts
Referring back to the groundbreaking work of John Firth and John Sinclair,
this paper has stressed the importance of studying units of meaning in re-
stricted languages. It has tried to demonstrate how a return to Firthian and
Sinclairian concepts may enable us to better deal with the complex issue of
meaning creation in (academic) discourse and how corpus tools and meth-
ods can help identify meaningful units in academic writing or, more pre-
cisely, in the language of linguistic book reviews. We saw that the identifi-
cation of units of (evaluative) meaning in corpora is challenging but not a
hopeless case and that phraseological search engines like Collocate,
kfNgram and ConcGram can be used to automatically retrieve lists of
meaningful unit candidates for further manual analysis. It was found to be
important to complement concordance analyses by n-gram, p-frame and
concgram searches and to go back and forth between the different analytic
procedures, combining corpus guidance and researcher intuition in a maxi-
mally productive way. In the analysis of high-frequency items from the
meaningful unit candidate lists, it then became clear that a number of "in-
nocent" n-grams and p-frames have a clear evaluative potential and that
apparently "neutral" items have clear preferences for either positive or
negative evaluation.
The paper has also provided some valuable insights into the special na-
ture of book review language and highlighted a few patterns that are par-
ticularly common in this type of written discourse. One result of the study
was that it probably makes sense to "think local" more often because the
isolated patterns were shown to be actually very restricted-language-
Notes
References
Barlow, Michael
2004 Collocate 1.0: Locating Collocations and Terminology. Houston,
TX: Athelstan.
Barnbrook, Geoff
2002 Defining Language: A Local Grammar of Definition Sentences.
Amsterdam: John Benjamins.
Biber, Douglas
2006 University Language: A Corpus-based Study of Spoken and Written
Registers. Amsterdam: John Benjamins.
Biber, Douglas, Ulla Connor and Thomas A. Upton
2007 Discourse on the Move: Using Corpus Analysis to Describe Dis-
course Structure. Amsterdam: John Benjamins.
Corpora
1. Introduction
Ever since John Sinclair introduced his idiom principle and his notion of
collocation (see Sinclair 1991: 109-121), there has been an increasing inter-
est in the study of different aspects of phraseology. In this article we would
like to present a work-in-progress report on a project exploring the colloca-
tional behaviour of text samples.2 By this term we refer to the extent to
which a text relies on or uses collocations. A text that is classified as "col-
locationally strong" can therefore be defined as a text in which a substantial
number of statistical collocations can be found, a "collocationally weak"
text as a text which contains fewer statistical collocations and consists of
more free combinations (in the sense of Sinclair's open choice principle).
In order to determine the collocational behaviour of a text, a computer
program was designed to compare the co-occurrence of words within a
certain text with co-occurrence data from the British National Corpus
(BNC). For the analysis outlined here, eight different text samples repre-
senting different text types were compiled. This selection of samples was
chosen in order to test whether certain interrelations between different texts
(or text types) and their collocational behaviours can be found.
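The article does not reproduce the program's scoring routine, but given the Mutual Information score bands reported in section 4, a minimal sketch of the association measure involved might look as follows (toy counts, not actual BNC data):

```python
import math

# Pointwise Mutual Information for a word pair: how much more often
# the two words co-occur than expected by chance in a corpus of N words.
def mutual_information(f_pair, f_w1, f_w2, corpus_size):
    # MI = log2( (f(w1,w2) * N) / (f(w1) * f(w2)) )
    return math.log2(f_pair * corpus_size / (f_w1 * f_w2))

# A pair co-occurring far more often than chance gets a high score:
print(round(mutual_information(50, 1000, 2000, 100_000_000), 2))
```

Rating every word pair of a sample against reference-corpus counts in this way yields the score distributions that the charts below compare across text types.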
The following three hypotheses summarize three kinds of interrelation
that we expected to occur:
3. Text samples
Each sample contains 20,000 words of English text. The non-fictional cate-
gory is composed of newspaper texts from the Guardian and articles from
academic journals by British authors. The fictional category comprises
Elizabeth George,6 P. D. James, Ian McEwan, and Virginia Woolf. Addi-
tionally, a sample of EFL essays by German students7 was compiled. The
last sample is an automatic translation of a 19th century German novel
(Theodor Fontane's Effi Briest) by AltaVista Babelfish (now called Yahoo!
Babelfish8). This sample was included to double-check the results against
some unnatural and unidiomatic language.
The criteria for the selection of texts were
a) varying degrees of difficulty in order to test Hypothesis 1,
b) coverage of different text types in order to test Hypothesis 2,
c) different levels of idiomaticity in order to test Hypothesis 3.
A few words have to be said about criterion a): even though it is very
common to describe a certain article, story or novel as "difficult to read", it
is - from a linguistic point of view - hard to determine the linguistic fea-
tures that support this kind of subjective judgment. A quantitative analysis
of fictional literature, using some of the authors above, has shown that the
degree of syntactic complexity seems to correspond to the evaluation of
difficulty (Götz-Votteler 2008); the inclusion of Hypothesis 1 can be seen
as a complementation of that study.
4. Results
As can be seen, the graphs show different amounts of score 0, i.e. non-
occurrences in the BNC. The highest amount of score 0 can be found with
the Guardian texts and P. D. James: the fact that the Guardian texts produce
this result quite often is a consequence of the high frequency of proper
nouns that do not occur, and therefore do not collocate, in the
BNC. The P. D. James sample, on the other hand, can be characterized as
combining low-frequency lexical words with a sometimes highly formal
style, which creates word combinations that cannot be found in the BNC.10
The automatic translation of Fontane also produces many zeroes; in this
case this is also due to a substantial number of proper nouns, but addition-
ally to a large number of words which were not translated by the program
because they were assumed to be proper nouns (e.g. satisfy +
mitspielenden, thereby + entzücken). The EFL graph shows a very small amount of
score 0; this result is at first quite surprising and in a way contradicts Hy-
pothesis 3. It might however be a consequence of the fact that the students
are well trained in using frequent combinations and only use a small range
of vocabulary.
A second difference between the graphs in the chart above is their
length: the longest curves are the ones for the Guardian texts and the aca-
demic texts. This means that they represent more scores than the other
graphs, because more words were counted, implying that the Guardian and
Collocational behaviour of different types of text 235
the academic texts contain more lexical words, whereas fictional texts have
a larger number of function words.11 The length of the curves is therefore a
result of the lexical density of the text samples, and the chart shows that the
lexical density correlates with the text type.
This leads to the question whether this finding provides insight into the
collocational behaviour of the text samples. The answer, unfortunately, is
not really, as the shape of the graphs is quite similar. The only exception is
the automatic translation of Fontane, whose graph is a bit steeper
than the others. This can be seen more clearly in the following chart, which
again gives the number of hits in Mutual Information score bands.
[Chart: number of hits in Mutual Information score bands; legend: Guardian, P. D. James, Written Academic, EFL, Elizabeth George, Virginia Woolf, Ian McEwan, Fontane_Babelfish]
Again the curves are quite similar; even though the academic texts and the
samples from the novels by P. D. James and Ian McEwan produce a differ-
ent peak, the rise and fall of the graphs are nearly identical. The automatic
translation of Fontane, however, shows a steeper fall, which means that
there are fewer scores with a high Mutual Information value, as this text
consists of fewer specific collocations. But again, the deviance is so small
that this finding can at best be interpreted as a tendency, not as a clear-cut
difference between the texts.
To sum up, the second query also failed to display the distinction in col-
locational behaviour between the eight text samples that we had expected.
236 Peter Uhrig and Katrin Götz-Votteler
The last query was therefore narrowed down to selected word-classes
which have been linked to the degree of idiomaticity (Hausmann 2004,
Nesselhauf 2005). The following chart shows the Mutual Information value
related to percentages for the combinations noun-adjective and adjective-
noun; all other parameters stayed the same.
The text samples written by native speakers result in similar curves. The
automatic translation of Fontane, however, produces quite a large percent-
age of lower scores (between 3 and 5), which represent highly frequent
collocations. On the other hand, this text contains visibly fewer collocations
with higher scores, which means that we encounter fewer specific colloca-
tions here than in the other texts. The texts written by EFL students display
a similar behaviour, even though not to the same extent. Again we assume
that this finding is a result of the fact that the students are well trained in
highly frequent collocations, but use fewer low-frequency specific colloca-
tions.
[Chart: Mutual Information scores (in percentages) for noun-adjective and adjective-noun combinations; legend: Guardian, P. D. James, Written Academic, EFL, Elizabeth George, Virginia Woolf, Ian McEwan, Fontane_Babelfish]
We will now return to our hypotheses and evaluate them critically in the
light of our findings:
As mentioned above, the text samples were chosen to cover a certain range
of difficulty. The preceding discussion of the three queries showed that the
results do not support Hypothesis 1. We would even go a step further and
claim that the collocational behaviour of a text does not seem to contribute
to the perceived difficulty.
Neither did the data provide any evidence for Hypothesis 2: from our charts
no interrelation between the text type and the collocational strength of a
text was visible. However, the text samples did display a difference in lexi-
cal density, i.e. text types such as newspaper articles or academic writing
contain a larger number of lexical items, whereas a text type such as fiction
consists of a larger percentage of function words.12
For our third hypothesis the results proved to be the most promising ones.
The largely nonsensical text sample generated by an automatic translation
device showed differences in behaviour for the span -5 to +5. The same is
true of the texts written by EFL learners, even though much less obviously
than we would have assumed.
As the discussion of the three hypotheses reveals, our findings are far
less conclusive than expected. This is partly due to some technical and
methodological problems which shall be briefly outlined in the following:
A whole range of problems is associated with tokenisation, PoS-tagging,
and lemmatisation. Even though excellent software was made available for
the present study, there are still errors.13 These errors would only have been
a minor problem, had they been consistent, but over the past 15 years,
CLAWS has been improved, so the current version of the tagger does not
produce the same errors it consistently produced when the BNC was anno-
tated.14 In addition, multi-word units are problematic in two respects:
firstly, the tagger recognizes many of them as multi-word units while the
lemmatiser lemmatises every orthographic word, rendering mappings of the
two very difficult. Besides, they distort the results, even if function words
are excluded, as they always lead to really high association scores.15
The most serious problem, though, is related to the size of the reference
corpus, the BNC. Even if all proper names are ignored and only lemmatised
combinations of nouns and adjectives in a five-word span to either side are
taken into account, there are still up to 40% of combinations in the samples
which do not exist at all in the BNC. Up to 60% occur less than 5 times - a
limit below which sound statistical claims cannot be maintained. This is of
course partly due to the automatic procedure, which looks at words in a
five-word span and thus may try to score two words which are neither syn-
tactically nor semantically related in any way.16 It therefore seems as if the
BNC is still too small for this kind of research by an order of magnitude or
two. This problem may at least be partially solved by augmenting the BNC
dataset with data from larger corpora such as Google's Web 1T 5-gram or
by limiting the research to syntactically related combinations in parsed
corpora.
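The span-based pairing described above (and its tendency to relate words that have no syntactic connection, cf. note 16) can be sketched as follows; the simplified PoS tags and the toy input are assumptions for illustration:

```python
# Collect every adjective-noun combination within a five-word window,
# as the automatic procedure described in the text does. Note how this
# pairs words that are not syntactically related at all.
def span_pairs(tagged, span=5):
    pairs = []
    for i, (word, tag) in enumerate(tagged):
        if tag != "ADJ":
            continue
        window = tagged[max(0, i - span): i + span + 1]
        for other, other_tag in window:
            if other_tag == "NOUN":
                pairs.append((word, other))
    return pairs

tagged = [("carnivorous", "ADJ"), ("plant", "NOUN"), ("in", "PREP"),
          ("my", "DET"), ("office", "NOUN")]
print(span_pairs(tagged))
# pairs "carnivorous" with both "plant" and the unrelated "office"
```

Restricting the procedure to syntactically related pairs, e.g. via a parsed corpus, would remove spurious combinations like carnivorous + office at the cost of extra preprocessing.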
Despite (or perhaps even because of) the inconsistencies and inconclu-
sive results of the present study, some of the aspects presented above seem
to very much deserve further investigation: as we have seen that there are
slight differences between native and non-native usage, at least for noun-
adjective collocations, it might be interesting to see whether it is possible to
automatically determine the level of proficiency of learners by looking at the
collocational behaviour of their text production.17
Appendix
Notes
8 http://de.babelfish.yahoo.com/
9 According to Sinclair's definition (1991: 170), "only the lexical co-
occurrence of words" counts as collocation.
10 Cf. the following sentence: "Man is too addicted to this intoxicating mixture
of adolescent buccaneering and adult perfidy to relinquish it [spying] en-
tirely."
11 There was no calculation of co-occurrences across sentence boundaries; thus
sentence length may also be held responsible for this finding. However, an
analysis of mean sentence length did not confirm this assumption.
12 For the distribution of some types of function words in different types of texts
see Biber et al. (1999: ch. 2.4).
13 If we assume that the success rate of CLAWS in our study is roughly 97% (as
published in Leech and Smith 2000), we still get about 600 ambiguous or
wrongly tagged items per 20,000 word sample.
14 The word organic*, for instance, is tagged as plural in the BNC and as singu-
lar by the current version of CLAWS. Thus no combinations containing the
word organic* were found by our automatic procedure, which always queries
word/tag combinations. (Since the XML version of the BNC was not yet
available when the present study was started, the database is based on the
BNC World Edition.)
15 A case in point would be Prime Minister.
16 In "carnivorous plant in my office", carnivorous and office are found within a
5-word span. It is not surprising, though, that they do not occur within a 5-
word span in the BNC.
17 The software may also be used for a comparison of different samples from
"New Englishes" in order to find out whether these show similar results to
British usage or have a distinct collocational behaviour. (Thanks to Christian
Mair for suggesting this application of our methodology.) In addition, it is ca-
pable of identifying non-text, which means it could be used to find automati-
cally generated spam emails or web pages. So in the end this could spare us
the trouble of having to open emails which, on top of trying to sell dubious
drugs, contain a paragraph which serves to trick spam filters and reads like
the following excerpt: "Interview fired attorney david Iglesias by Shockwave
something."
References
Biber, Douglas
1988 Variation across Speech and Writing. Cambridge/New York/New
Rochelle/Melbourne/Sydney: Cambridge University Press.
Biber, Douglas, Stig Johansson, Geoffrey Leech, Susan Conrad and Edward Fine-
gan
1999 Longman Grammar of Spoken and Written English. Harlow: Pearson
Education Limited.
Ellis, Nick, Eric Frey and Isaac Jalkanen
2009 The psycholinguistic reality of collocation and semantic prosody (1):
Lexical access. In Exploring the Lexis-Grammar Interface: Studies
in Corpus Linguistics, Ute Römer and R. Schulze (eds.), 89-114.
Amsterdam: John Benjamins.
Evert, Stefan
2005 The Statistics of Word Cooccurrences: Word Pairs and Collocations.
Dissertation, Institut für maschinelle Sprachverarbeitung, University
of Stuttgart, URN urn:nbn:de:bsz:93-opus-23714.
Götz-Votteler, Katrin
2008 Aspekte der Informationsentwicklung im Erzähltext. Tübingen: Gunter Narr Verlag.
Granger, Sylviane
2011 From phraseology to pedagogy: Challenges and prospects. This
volume.
Hausmann, Franz Josef
2004 Was sind eigentlich Kollokationen? In Wortverbindungen mehr
oder weniger fest, Kathrin Steyer (ed.), 309-334. Berlin: Walter de
Gruyter.
Leech, Geoffrey, Roger Garside and Michael Bryant
1994 CLAWS4: The tagging of the British National Corpus. In Proceed-
ings of the 15th International Conference on Computational Linguis-
tics (COLING 94), 622-628. Kyoto, Japan.
Leech, Geoffrey and Nicholas Smith
2000 Manual to accompany The British National Corpus (Version 2) with
improved word-class tagging. Lancaster. Published online at
http://ucrel.lancs.ac.uk/bnc2/.
Nesselhauf, Nadja
2005 Collocations in a Learner Corpus. Amsterdam, Philadelphia: Ben-
jamins.
Rayson, Paul
2008 Wmatrix: A web-based corpus processing environment. Computing
Department, Lancaster University, http://ucrel.lancs.ac.uk/wmatrix/.
Corpus
Intended for learners rather than fluent speakers of English, the forms de-
clines, declining and declined are explicitly listed (instead of naming the
paradigm).
In a separate column (not shown in figure 1), the CCELD characterizes
reading 1 as an intransitive verb with the hypernym decrease, the cognate
diminish and the antonym increase. Reading 2 is characterized as V+O
OR V+to-INF, whereby V+O indicates a transitive verb. Reading 3 is
characterized as N UNCOUNT/COUNT:USU SING, i.e. as a noun which
is usually used in the singular. Chapter 3 of Sinclair (1991) provides a de-
244 Roland Hausser
tailed discussion of this entry to explain the form and purpose of entries in
the CCELD in general.
Next consider the corresponding Lancaster analysis:
Language is symbolic. A sign is what has been negotiated between sign us-
ers. The meaning of a sign is not my (non-symbolic) experience of it. Mean-
ings are not in the head, as Hilary Putnam4 never got tired of repeating. The
meaning of a sign is the way in which the members of a discourse commu-
nity are using it. It is what happens in the symbolic interactions between
people, not in their minds.
On the one hand, it is uncontroversial that language meanings should not be
treated as something personal left to the whim of individuals. On the other
hand, simply declaring meanings to be real external entities is an irrational
method for making them "objective". The real reason why the conven-
tionalized surface-meaning relations are shared by the speech community is
that otherwise communication would not work.
Even if we accept for the sake of the argument that language meanings
may be viewed (metaphorically) as something out there in the world, they
must also exist in the heads of the members of the language community.
How else could speaker-hearers use language surface forms and the associ-
ated meanings to communicate with each other?
That successful natural language interaction between cognitive agents is
a well-defined mechanism is shown by the attempt to communicate in a
foreign language environment. Even if the information we want to convey
is completely clear to us, we will not be understood by our hearers if we
fail to use their language adequately. Conversely, we will not be able to
understand our foreign communication partners who are using their lan-
guage in the accustomed manner unless we have learned their language.
Given that natural language communication is a real and objective pro-
cedure, it is a legitimate scientific goal to model this procedure as a theory
of how natural language communication works. Such a theory is not only of
academic interest but is also the foundation of free human-machine com-
munication in natural language. The practical implications of having ma-
chines which can freely communicate in natural language are enormous:
instead of having to program the machines we could simply talk with them.
Today, talking robots exist only in fiction, such as C-3PO in the Star Wars
movies (George Lucas 1977-2005) and Roy, Rachael, etc. in the movie
Blade Runner (Ridley Scott 1982). The first and so far the only effort to
model the mechanism of language communication as a computational lin-
guistic theory is Database Semantics (DBS).
Corpus linguistics, generative grammar and database semantics 247
According to this schema, the cognitive agent has a body out there in the
world5 with external interfaces for recognition and action. Recognition is
for transporting content from the external world into the agent's cognition,
action is for transporting content from the agent's cognition into the exter-
nal world.6
In this model, the agent's immediate reference7 with language to corre-
sponding objects in the agent's external environment is reconstructed as a
purely cognitive procedure. An example of immediate reference in the
hearer mode is following a request, based on (i) language recognition, (ii)
transfer of language content to the context level based on matching and (iii)
context action. An example in the speaker mode is reporting an observa-
tion, based on (i) context recognition, (ii) transfer of context content to the
language level based on matching and (iii) language production including
sign synthesis.8
From the viewpoint of building a talking robot, the language signs exist-
ing in the external reality between communicating agents are merely acous-
tic perturbations (speech) or doodles on paper (writing) which are com-
The computer may be used not only for the construction of dictionaries, e.g.
by using a machine-readable corpus for improving the structure of the lexi-
cal entries, but also for their use: instead of finding the entry for a word like
decline in the hardcopy of a dictionary using the alphabetical order of the
lemmata, the user may type the word on a computer containing an online
version of the dictionary - which then returns the corresponding entry on
its screen. Especially in the case of large dictionaries with several volumes
and extensive cross-referencing, the electronic version is considerably more
user-friendly to the computer-literate end-user than the corresponding hard-
copy.
Electronic lexical lookup is based on matching the unanalyzed surface
of the word in question with the lemma of the online entry, as shown in the
following schema:
There exist several techniques for matching a given surface form automati-
cally with the proper entry in an electronic lexicon.9
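The matching of an unanalyzed surface form against the lemmata of an online lexicon can be sketched as a plain mapping from surfaces to entries; the entries below are invented placeholders, not actual CCELD content:

```python
# Minimal sketch of electronic lexical lookup: the unanalyzed surface
# form is matched against the lemmata of an online lexicon.
# The entries are invented placeholders, not CCELD content.
lexicon = {
    "decline": {"pos": ["verb", "noun"], "senses": 3},
    "square":  {"pos": ["noun", "verb", "adj"], "senses": 4},
}

def lookup(surface):
    """Return the entry whose lemma matches the surface form, if any."""
    return lexicon.get(surface.lower())
```

Real systems refine this basic matching, e.g. by stripping inflectional endings before comparing the surface with the lemma.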
The method indicated in figure 4 is also used for the automatic word
form recognition in a computational model of natural language communica-
tion, e.g. Database Semantics. It is just that the format and the content of
the lexical descriptions are different.10 This is because the entries in a dic-
tionary are for human users who already have natural language understand-
ing, whereas the entries in an online lexicon are designed for building lan-
guage understanding in an artificial agent.
The basic concepts in the agent's head are provided by the external inter-
faces for recognition and action. Therefore, an artificial cognitive agent
must have a real body interacting with the surrounding real world. The
implementation of the concepts must be procedural because natural organ-
isms as well as computers require independence from any metalanguage.11
It follows that a truth-conditional or Tarskian semantics cannot be used.12
According to the procedural approach, a robot understands the concept
of shoe, for example, if it is able to select the shoes from a set of different
objects, and similarly for different colours, different kinds of locomotion
like walking, running, crawling, etc. The procedures are based on concept
types, defined as patterns with constants and restricted variables, and used
at the context level for classifying the raw input and output data.13
As an example, consider the following schema showing the perception
of an agent-external square (geometric shape) as a bitmap outline which is
classified by a corresponding concept type and instantiated as a concept
token at the context level:
The necessary properties,14 shared by the concept type and the correspond-
ing concept token, are represented by four attributes for edges and four
attributes for angles. Furthermore, all angle attributes have the same value,
namely the constant "90 degrees" in the type and the token. The edge at-
tributes also have the same value, though it is different for the type and the
token.
The accidental property of a square is the edge length, represented by
the variable a in the type. In the token, all occurrences of this variable have
been instantiated by a constant, here 2 cm. Because of its variable, the type
of the concept square is compatible with infinitely many corresponding
tokens, each with another edge length.
At the language level, the type is reused as the literal meaning of the
English surface form square, the French surface form carré and the Ger-
man surface form Quadrat, for example. The relation between these differ-
ent surface forms and their common meaning is provided by the different
conventions of these different languages. The relation between the meaning
at the language level and the contextual referent at the context level is
based on matching using the type-token relation.
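Under the assumptions of the preceding description (four edge attributes sharing one restricted variable, four angle attributes fixed at 90 degrees), the type-token relation can be sketched as follows; the attribute names and the matching procedure are illustrative, not the actual implementation:

```python
# Sketch of the square concept as a type (variable edge length) and a
# token (edge length instantiated), following the description of figure 5.
VAR = "a"  # restricted variable for the edge length

concept_type = {f"edge{i}": VAR for i in range(1, 5)}
concept_type.update({f"angle{i}": "90 degrees" for i in range(1, 5)})

def matches(ctype, ctoken):
    """A token matches the type if the constants agree and all
    occurrences of the variable are bound to one and the same constant."""
    binding = None
    for attr, val in ctype.items():
        if val == VAR:
            if binding is None:
                binding = ctoken[attr]
            elif ctoken[attr] != binding:
                return False
        elif ctoken[attr] != val:
            return False
    return True

# A token with all edge attributes instantiated by the constant "2 cm".
token = dict(concept_type, **{f"edge{i}": "2 cm" for i in range(1, 5)})
```

Because the variable may be bound to any edge length, the one type is compatible with infinitely many tokens, as the text observes.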
The representation of a concept type and a concept token in figure 5 is
of a preliminary holistic nature, intended for simple explanation.15 How
such concepts are exactly implemented as procedures and whether these
procedures are exactly the same in every agent is not important. All that is
required for successful communication is that they provide the same results
(relative to a suitable granularity) in all members of a language community.
6. Proplets
These proplets contain the same concept type square (illustrated in figure
5) as the value of their respective core attributes, i.e. noun, verb and adj
providing the part of speech. Different surface forms are specified as values
of the surface attribute and different morpho-syntactic properties16 are
specified as values of the category and semantics attributes. For example,
the verb forms are differentiated by the combinatorially relevant cat values
ns3' a' v, n-s3' a' v, n' a' v and a' be, whereby ns3' indicates a valency slot
(Herbst et al. 2004; Herbst and Schüller 2008) for a nominative 3rd person
singular noun, n-s3' for a nominative non-3rd person singular noun, n' for a
nominative of any person or number, and a' for a noun serving as an accu-
sative. They are further differentiated by the sem values pres, past/perf and
prog for tense and aspect.
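The attribute structure just described can be illustrated in a hypothetical Python rendering (not actual DBS notation); the cat and sem values are those quoted in the text:

```python
# Hypothetical rendering of two lexical proplets as attribute-value
# pairs: sur holds the surface form, the core attribute (here verb)
# holds the concept, cat the combinatorial category, sem tense/aspect.
declines = {"sur": "declines", "verb": "decline",
            "cat": ["ns3'", "a'", "v"], "sem": ["pres"]}
declined = {"sur": "declined", "verb": "decline",
            "cat": ["n'", "a'", "v"], "sem": ["past"]}

def valency_slots(proplet):
    """The unfilled valency positions are the primed cat values."""
    return [v for v in proplet["cat"] if v.endswith("'")]
```

On this rendering, the transitive and intransitive variants would differ simply in whether a' occurs among the cat values, mirroring the distinction drawn in the next paragraph.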
The intransitive and the transitive verb variants are distinguished by the
absence versus presence of the a' valency position in the respective cat val-
ues. The verbs and the noun are distinguished by their respective core at-
tributes verb and noun as well as by their cat and sem values. The possible
variations of the base form surface forms correspond to those in figure 6.
Compared to the CCELD (1987) dictionary entries for decline (cf. figure
1), the corresponding DBS proplets in figure 7 may seem rather meagre.
However, in contrast to dictionary entries, proplets are not intended for
being read by humans. Instead, proplets are a data structure designed for
processing by an artificial agent. The computational processing is of three
kinds: (i) the hearer mode, (ii) the think mode and (iii) the speaker mode.
Together, they model the cycle of natural language communication.17
In the hearer mode, the processing establishes (i) the semantic relations
of functor-argument and coordination structure between proplets (horizon-
tal relations) and (ii) the pragmatic relation of reference between the lan-
guage and the context level (vertical relations, cf. figure 3). In the think
mode, the processing is a selective activation of content in the agent's
memory (Word Bank) based on navigating along the semantic relations
between proplets and deriving new content by means of inferences.18 In the
speaker mode, the navigation is used as the conceptualization for language
production.
The proplets are order-free because the grammatical relations between them
are coded solely by attribute-value pairs (for example, [arg: Julia offer] in
the decline proplet and [fnc: decline] in the Julia proplet) - and not in terms
of dominance and precedence in a hierarchy. As a representation of content,
the language-dependent surface forms are omitted. Compared to figure 8,
the proplets are shown with additional cat and sem features.
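The order-free coding of grammatical relations can be sketched as follows; the attribute values follow the example in the text, while the set representation and the retrieval function are illustrative assumptions:

```python
# Sketch of an order-free proplet set for "Julia declined the offer":
# the grammatical relations are coded solely by attribute values
# ([arg: ...] in the verb proplet, [fnc: ...] in the noun proplets),
# so the order of the proplets in the set carries no information.
content = [
    {"noun": "offer", "fnc": "decline"},
    {"verb": "decline", "arg": ["Julia", "offer"]},
    {"noun": "Julia", "fnc": "decline"},
]

def arguments_of(proplets, functor):
    """Retrieve the arguments of a functor via its arg attribute."""
    for p in proplets:
        if p.get("verb") == functor:
            return p["arg"]
    return []
```

Shuffling or reversing the list leaves the retrieved relations unchanged, which is what makes such a representation suitable for storage in and retrieval from a database.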
Linguistically, the DBS derivation in figure 8 and the result in figure 9 are
traditional in that they are based on explicitly coding functor-argument (or
valency) structure20 as well as morpho-syntactic properties. Given that
other formal grammar systems, even within Chomsky's nativism, have
been showing an increasing tendency to incorporate traditional notions of
grammar, there arises the question of whether DBS is really different from
them. After all, Phrase Structure Grammar, Categorial Grammar, Depend-
ency Grammar and their many subschools21 have arrived at a curious state
of peaceful coexistence22 in which the choice between them is more a mat-
ter of local tradition and convenience than a deliberate research decision.
DBS is essentially different from the current mainstream grammars
mentioned above because DBS hearer mode derivations map the lexical
analysis of a language surface form directly into an order-free set of prop-
lets which is suitable (i) for storage in and retrieval from a database and
thus (ii) suitable for modelling the cycle of natural language communication.
These DBS requirements are incompatible with the other grammars for the
following reasons: (1) and (2) preclude the use of grammatically meaning-
ful tree structures and as a consequence of (3) there is no place for unifica-
tion. Behind the technical differences of method there is a more general
distinction: the current mainstream grammars are sign-oriented, whereas
DBS is agent-oriented.
For someone working in sign-oriented linguistics, the idea of an agent-
oriented approach may take some getting used to.24 However, an agent-
oriented approach is essential for a scientific understanding of natural lan-
guage, because the general structure of language is determined by its func-
tion25 and the function of natural language is communication.
Like any scientific theory, the DBS mechanism of natural language
communication must be verified. For this, the single most straightforward
method is implementing the theory computationally as a talking robot. This
method of verification is distinct from the repeatability of experiments in
the natural sciences and may serve as a unifying standard for the social
sciences.
Furthermore, once the overall structure of a talking robot (i.e., inter-
faces, components and functional flow, cf. figures 3 and 5, Hausser 2009a)
has been determined, partial solutions may be developed without the danger
of impeding the future construction of more complete systems.26 For exam-
ple, given that the procedural realization of recognition and action is still in
its infancy in robotics, DBS currently makes do with English words as
placeholders for core values. As an example, consider the following lexical
proplets, which are alike except for the values of their sur and noun attri-
butes:
These proplets represent a class of word forms with the same morpho-
syntactic properties. This class may be represented more abstractly as a
proplet pattern.27
By restricting the variable α to the core values used in figure 10, the repre-
sentation in figure 11 as a proplet pattern is equivalent to the explicit repre-
sentation of the proplet class in figure 10. Proplet patterns with restricted
variables are used for the base form lexicon of DBS, making it more trans-
parent and saving a considerable amount of space.
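A proplet pattern with a restricted variable might be sketched as follows; the restriction set and the matching procedure are illustrative assumptions, not the DBS implementation:

```python
# Sketch of a proplet pattern: the restricted variable stands for the
# whole class of proplets whose core value lies in the restriction set
# (core values as in the class described by the text's figure 10).
RESTRICTION = {"decline", "buy", "eat"}

pattern = {"verb": ("ALPHA", RESTRICTION), "cat": ["n'", "a'", "v"]}

def pattern_matches(pat, proplet):
    """Constants must agree; a variable only requires membership in
    its restriction set."""
    for attr, val in pat.items():
        if isinstance(val, tuple) and val[0] == "ALPHA":
            if proplet.get(attr) not in val[1]:
                return False
        elif proplet.get(attr) != val:
            return False
    return True
```

One pattern thus replaces an explicit list of proplets, which is what makes a base form lexicon built from patterns both more transparent and more compact.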
In concatenated (non-lexical) proplets, the (i) core meaning and (ii) the
compositional semantics (based on the coding of morpho-syntactic proper-
ties) are clearly separated. This becomes apparent when the core values of
any given content are replaced by suitably restricted variables, as shown by
the following variant of figure 9:
By restricting the variable α to the values decline, buy, eat, or any other
transitive verb, β to the values Julia, Susanne, John, Mary or any other
proper name, and γ to the values it, offer, proposal, invitation, etc., this
combinatorial pattern may be used to represent the compositional semantics
of a whole set of English sentences, including figure 9.
9. Collocation
At first glance, figure 12 may seem open to the objection that it does not
prevent meaningless or at least unlikely combinations like Susanne ate the
invitation, i.e. that it fails to handle collocation (which has been one of
Sinclair's main concerns). This would not be justified, however, because
the hearer mode of DBS is a recognition system taking time-linear se-
quences of unanalyzed surface forms as input and producing a content,
represented by an order-free set of proplets, as output. In short, in DBS the
collocations are in the language, not in the grammar.
The Generative Grammars of nativism, in contrast, generate tree struc-
tures of possible sentences by means of substitutions, starting with the S
node. Originally a description of syntactic wellformedness, Generative
Grammar was soon extended to include world knowledge governing lexical
selection. For example, according to Katz and Fodor (1963), the grammar
must characterize ball in the man hit the colorful ball as a round object
rather than a festive social event. In this sense, nativism treats collocations
as part of the Generative Grammar and Sinclair is correct in his frequent
protests against nativist linguists' modelling their own intuitions instead of
looking at "real" language.
In response, generative grammarians have turned to annotating corpora
by hand or statistically (treebanks) for the purpose of obtaining broader
data coverage. For example, the University of Edinburgh and various other
10. Context
also Green 1997). The values of this attribute are called constraints and
have the form of such definitions28 as
(a) "the use of the name John is legitimate only if the intended referent is
named John."
(b) "the complement of the verb regret is presupposed to be true. "
For a meaningful computational implementation this is sadly inadequate,
though for a self-declared "sign-based" approach it is probably the best it
can do.
Instead of cramming more and more phenomena of language use into
the Generative Grammar, Database Semantics clearly distinguishes be-
tween the agent-external real world and the agent-internal cognition. The
goal is to model the agent, not the external world.29 Whether the model is
successful or not can be verified, i.e. determined objectively, (i) by evaluat-
ing the artificial agent's behaviour in its interaction with its environment
and with other agents and (ii) by observing the agent's cognitive operations
directly via the service channel (cf. Hausser 2006: Sect. 1.4).
In the agent's cognition, DBS clearly separates the language and the
context component (cf. figure 3) and defines their interaction via a compu-
tationally viable matching procedure based on the data structure of proplets
(cf. Hausser 2006: Sect. 3.2). In addition, DBS implements three computa-
tional mechanisms of reference for the sign kinds symbol, indexical and
name.30 This is the basis for handling the HPSG context definition (a), cited
above, as part of a general theory of signs, whereas definition (b) is treated
as an inference by the agent.
For systematic reasons, DBS develops the context component first, in
concord with ontogeny and phylogeny (cf. Hausser 2006: Sect. 2.1). To
enable easy testing and upscaling, the context component is reconstructed
as an autonomous agent without language. The advantage of this strategy is
that practically all constructs of the context component can be reused when
the language component is added. The reuse, in turn, is crucial for ensuring
the functional compatibility between the two levels.
For example, the procedural definition of basic concepts, pointers and
markers provided by the external interfaces of the context component are
reused by the language component as the core meanings of symbols, in-
dexicals and names, respectively. The context component also provides for
the coding of content and its storage in the agent's memory, for inferencing
on the content and for the derivation of adequate actions, including lan-
guage production.
11. Conclusion
From the linguists' perspective, the learner is for an English learner's dic-
tionary what the artificial cognitive agent is for Database Semantics: each
raises the question of what language skills the learner/artificial agent should
have.
However, the learner already knows how to communicate in a natural
language. Therefore, the goal is to provide her or him with information on
how to speak English well, which requires the compilation of an easy-to-
use, accurate representation of contemporary English.
Database Semantics, in contrast, has to get the artificial agent to com-
municate with natural language in the first place. This requires the recon-
struction of what evolution has produced in millions of years as an abstract
theory which applies to natural and artificial agents alike.
In other words, Database Semantics must start from a much more basic
level than a learner's dictionary. For DBS, any given natural language re-
quires
- automatic word form recognition for the expressions to be analyzed,
- syntactic-semantic interpretation in the hearer mode, resulting in
- content which is stored in a database and
- selectively activated and processed in the think mode and
- appropriately realized in natural language in the speaker mode.
On the one hand, each of these requirements constitutes a sizeable research
and software project. On the other hand, the basic principles of how lan-
guage communication works are the same for different languages. Therefore,
once the software components for automatic word form recognition, syn-
tactic-semantic parsing, etc. have been developed in principle, they may be
applied to different languages with comparatively little effort.31
Because the theoretical framework of DBS is more comprehensive than
that of a learner's dictionary, DBS can provide answers to some basic ques-
tions. For example, DBS allows us to treat basic meanings in terms of recogni-
tion and action procedures, phenomena of language use with the help of an
Notes
1 This paper benefited from comments by Thomas Proisl, Besim Kabashi, Jo-
hannes Handl and Carsten Weber (CLUE, Erlangen), Haitao Liu (Communi-
cation Univ. of China, Beijing), Kiyong Lee (Korea Univ., Seoul) and Brian
MacWhinney (Carnegie Mellon Univ., Pittsburgh).
2 The UCREL CLAWS5 tag-set is available at http://ucrel.lancs.ac.uk/
claws5tags.html.
3 Cf. Hausser ([1999] 2001: 295-299).
4 Putnam attributes the same ontological status to the meanings of language as
Mathematical Realism attributes to mathematical truths: they are viewed as
existing eternally and independently of the human mind. In other words, ac-
cording to Putnam, language meanings exist no matter whether they have
been discovered by humans or not.
What may hold for mathematics is less convincing in the case of language.
First of all, there are many different natural languages with their own charac-
teristic meanings (concepts). Secondly, these meanings are constantly evolv-
ing. Thirdly, they have to be learned and using them is a skill. Treating lan-
guage meanings as pre-existing Platonic entities out there in the world to be
discovered by the members of the language communities is especially doubt-
ful in the case of new concepts such as transistor or ticket machine.
5 The importance of agents with a real body (instead of virtual agents) has been
emphasized by emergentism (MacWhinney 2008).
6 While language and non-language processing use the same interfaces for
recognition and action, figure 3 distinguishes channels dedicated to language
and to non-language interfaces for simplicity: sign recognition and sign syn-
thesis are connected to the language component; context recognition and con-
text action are connected to the context component.
7 Cf. Hausser (2001: 75-77); Hausser (2006: 27-29).
8 For a more extensive taxonomy see the 10 SLIM states of cognition in
Hausser (2001: 466-473).
9 See Aho and Ullman (1977: 336-341).
10 Apart from their formats, a dictionary and a system of automatic word form
recognition differ also in that the entries in a dictionary are for words (repre-
sented by their base form), whereas automatic word form recognition ana-
References
Putnam, Hilary
1975 The meaning of "meaning". In Mind, Language and Reality: Phi-
losophical Papers, vol. 2, Hilary Putnam (ed.), 215-271. Cambridge:
Cambridge University Press.
Sinclair, John McH. (ed.)
1987 Collins COBUILD English Language Dictionary. London/Glasgow:
Collins.
Sinclair, John McH.
1991 Corpus, Concordance, Collocation. Oxford: Oxford University
Press.
Sells, Peter
1985 Lectures on Contemporary Syntactic Theories: An Introduction to
GB Theory, GPSG and LFG. Stanford: CSLI.
Teubert, Wolfgang
2008 [Corpora-List] Bootcamp: 'Quantitative Corpus Linguistics with R',
re Louw's endorsement. http://mailman.uib.no/public/corpora/2008-
August/007089.html.
Teubert, Wolfgang and Ramesh Krishnamurthy (eds.)
2007 Corpus Linguistics. London: Routledge.
Wierzbicka, Anna
1991 Cross-Cultural Pragmatics: The Semantics of Human Interaction.
Berlin: Mouton de Gruyter.
Corpus
BNC The British National Corpus, version 3 (BNC XML Edition). 2007.
Distributed by Oxford University Computing Services on behalf of
the BNC Consortium, http://www.natcorp.ox.ac.uk/.
Chunk parsing in corpora
e.g. May vs. may or the German verb essen vs. the German proper noun
(name of the city) Essen. Furthermore, there are no punctuation marks,
including marks for the beginning and end of sentences, which again raises
many reading ambiguities. This means that dialogue systems have to follow
multiple different paths of interpretation. Therefore, search spaces tend to
become much bigger and we face time and memory limitations due to
combinatorial explosion. In practical systems, recognition errors have to be
taken into account and they should be identified as soon as possible.
Base chunks meeting the requirements of Proposition (2) are therefore
often the ultimate structures for spoken language systems as input to higher
analysis levels which have to assign semantic roles to the parts of chunked
input.
To summarize, reliable syntactic structures other than chunks are often
not available in speech processing.
Although at first glance the idea of chunk parsing promises to push natu-
ral language understanding towards realistic applications, things are not as
easy as they seem.
Named entities such as John Smith or The United Kingdom should be
identified as soon as possible in language processing. As a consequence,
named entity recognition has to be included in chunk parsing.
But merging subsequent nouns into the same chunk cannot be intro-
duced as a general rule. In fact, the famous example (1) already contains a
trap over which chunk parsers easily stumble:
- jumps could be taken as a noun (plural of jump) and merged into the
preceding noun chunk, which is prohibitive for semantic analysis.
- Some more examples of this kind are:
(2) The horse leaps with joy.
(3) This makes the horse leap with joy.
(4) Horse leaps are up to eight meters long.
- Similar considerations hold for named entities comprising a sequence
of nouns or terms such as
(5) System under construction, matter of concern, ...
- Chunk parsers usually do not wrap compound measurement expres-
sions like
What is even worse is that the constraints easing the process of verb phrase
chunking in English do not hold for German prefixed verbs which are sepa-
rable in the present tense. On the contrary, the space between verbal stem
and split prefix not only allows for constituents of arbitrary length, it
even has to include at least passive objects:
(11) Ich hebe Äpfel, welche vom Baum gefallen sind, niemals auf.
'I never pick up apples which have fallen from the tree.'
(12) Ich hebe Äpfel niemals auf, welche vom Baum gefallen sind.
(13) ??Ich hebe niemals auf Äpfel, welche vom Baum gefallen sind.
The facts concerning German separable composite verbs in present tense
compared with phrasal verbs in English can be extended without exception
to auxiliary and modal constructions, including past, perfect and future.
This, amongst other linguistic phenomena, drastically limits the coverage of
chunk parsing for German. However, even for German dialogue systems,
chunking of at least noun phrases is necessary at the beginning of process-
ing.
Our introduction suggests that a good start for chunk parsing would be to
commence with regular rules. Unfortunately, pure rule-based approaches
lack sufficient coverage. Therefore, to improve the performance of chunk
parsing, examples from corpora have to be included.
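A rule-based baseline of the kind discussed here might look as follows; the single NP rule and the tag names are invented for illustration, not a specific tagset:

```python
import re

# Minimal sketch of chunking by regular rules over POS tags: one rule
# groups determiner + adjectives + noun into an NP chunk and emits one
# IOB tag per token. The tag names (DET, ADJ, NOUN) are illustrative.
RULE = re.compile(r"\b(DET )?(ADJ )*NOUN\b")

def chunk(tagged):
    """tagged: list of (word, POS) pairs; returns one IOB tag per token."""
    tags = " ".join(t for _, t in tagged)
    iob = ["O"] * len(tagged)
    for m in RULE.finditer(tags):
        start = tags[:m.start()].count(" ")   # token index of match start
        length = m.group(0).count(" ") + 1    # number of tags matched
        iob[start] = "B-NP"
        for i in range(start + 1, start + length):
            iob[i] = "I-NP"
    return iob
```

Such a parser covers only the patterns its rules anticipate, which is precisely why examples from corpora have to supplement the rules.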
The task of selecting the function Fw(.) is far from trivial, and so is
the minimization task for the parameters; there are several approaches to
finding a satisfactory result. In general, the problem is given a geometric
interpretation: the vectors of evaluated features are taken as points in a
multidimensional space, each equipped with a tag, namely the assigned
IOB-tag. The task then is to:
- Identify clusters of points carrying identical tags;
- Express membership to or distance from clusters by appropriate func-
tions.
- Example: The Support Vector Machine (SVM) separates areas of differ-
ent tags with hyperplanes of maximum margin.
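The geometric interpretation can be sketched with a deliberately simple stand-in for the SVM, namely nearest-centroid classification; the feature vectors below are invented:

```python
# Sketch of the geometric view of IOB tagging: feature vectors cluster
# by tag, and a new point receives the tag of the nearest cluster
# centroid. This is a simple stand-in for an SVM's maximum-margin
# hyperplanes, chosen only to make the geometry concrete.
def centroid(points):
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dim))

def classify(training, point):
    """training: mapping from IOB tag to a list of feature vectors."""
    centroids = {tag: centroid(pts) for tag, pts in training.items()}
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(point, c))
    return min(centroids, key=lambda tag: sq_dist(centroids[tag]))
```

In a real system the feature vectors would encode properties such as the token's POS tag and those of its neighbours.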
Calculating is the true proficiency of computers, so as soon as appropriate
features and Fw(.) are selected, chunk parsing can be done efficiently as
required.
But how can the laborious work of preparing a training corpus be
facilitated? There are two main options:
1. An automatically annotated corpus is corrected manually. To begin
with, a chunk parser based on a few simple rules incorporating ba-
sic features, e.g. POS-tags, is used (cf. Proposition (2)). This initial
parser is called the "baseline".
2. Chunk structures are derived from a corpus already equipped with
higher level analyses, for example constituency trees (treebank). For
an example, cf. Tjong Kim Sang and Buchholz (2000).
At this point, it is worth pointing out that there is no formal and verifi-
able definition of correct chunking. Tagging a corpus to train a chunker
also means defining the chunking task itself.
In view of option (2), another approach to the tagging task in general, in-
cluding IOB-tagging, opens up. The method outlined in the following is
called Transformation-based Learning (cf. Ramshaw and Marcus 2005).
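The core idea of applying context-sensitive transformation rules to a baseline tagging can be sketched as follows; the rule format is simplified and the example rule in the test is invented, not one of Ramshaw and Marcus's learned rules:

```python
# Sketch of the transformation-based idea: start from a baseline
# tagging and apply an ordered list of context-sensitive rules, each
# rewriting a tag when the preceding tag matches. In actual
# transformation-based learning, the rules are not hand-written but
# selected greedily, each chosen to maximally reduce the error of the
# current tagging against the training corpus.
def apply_transformations(tags, rules):
    out = list(tags)
    for prev, old, new in rules:  # "change old to new when preceded by prev"
        for i in range(1, len(out)):
            if out[i] == old and out[i - 1] == prev:
                out[i] = new
    return out
```

For example, a rule rewriting a second chunk-initial tag into a chunk-internal one merges two adjacent chunks into one.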
3. Assessment of results
            Precision     Recall
German      below 84%     below 65%
English     below 90%     below 90%
4. Conclusion
The authors of the CoNLL-2004 Shared Task (Carreras and Màrquez 2004)
conclude:
... state-of-the-art systems working with full syntax still perform substan-
tially better, although far from a desired behavior for real-task applications.
Two questions remain open: which syntactic structures are needed as input
for the task, and what other sources of information are required to obtain a
real-world, accurate performance.
There are several chunk parsers which can be downloaded for free from the
World Wide Web. We present a small selection of those we tried with test
data.
On the one hand, there are natural language processing toolkits and
platforms such as NLTK (Natural Language ToolKit),2 or GATE (General
Architecture for Text Engineering)3 which contain part-of-speech taggers
and chunk parsers. In particular, NLTK offers building blocks for ambi-
tious readers, who want to develop chunkers or amend existing ones on
their own (in Python).
SCP is a Simple rule-based Chunk Parser by Philip Brooks (2003),
which is part of ProNTo, a collection of Prolog Natural language Tools. A
POS tagger and a chunker for English with special features for parsing a
"huge collection of documents"4 have been developed by Tsuruoka and
Tsujii (2005). A state-of-the-art pair of a tagger and a chunker, along with
parameter files trained for several languages, has been developed at the
University of Stuttgart.5 The "Stuttgart-Tübingen Tagset" (STTS) has be-
come quite popular in recent years for the analysis of German and other
languages.
280 Günther Görz and Günter Schellenberger
Notes
1 The authors are indebted to Martin Hacker for critical remarks on an earlier
draft of this paper.
2 http://nltk.sourceforge.net/index.php/Main_Page, accessed 31-10-2008.
3 http://gate.ac.uk/, accessed 31-10-2008.
4 available for download from http://www-tsujii.is.s.u-tokyo.ac.jp/tsuruoka/
chunkparser/, accessed 31-10-2008.
5 available for download from http://www.ims.uni-stuttgart.de/projekte/corplex/
TreeTagger/, accessed 31-10-2008.
References
Abney, Steven
1991 Parsing by chunks. In Principle-Based Parsing, Robert Berwick,
Steven Abney and Carol Tenny (eds.), 257-278. Dordrecht: Kluwer.
Bashyam, Vijayaraghavan and Ricky K. Taira
2007 Identifying anatomical phrases in clinical reports by shallow seman-
tic parsing methods. In Proceedings of the 2007 IEEE Symposium on
Computational Intelligence and Data Mining (CIDM 2007), Hono-
lulu, 210-214. Honolulu, Hawaii: IEEE.
Bird, Steven, Ewan Klein and Edward Loper
2006 Chunk Parsing. (Tutorial Draft) University of Pennsylvania.
Brooks, Philip
2003 SCP: A Simple Chunk Parser. University of Georgia. ProNTo
(Prolog Natural Language Tools), http://www.ai.uga.edu/mc/
ProNTo, accessed 31-10-2008.
Carreras, Xavier and Lluís Màrquez
2004 Introduction to the CoNLL-2004 Shared Task: Semantic Role Label-
ing. CoNLL-2004 Shared Task Web Page: http://www.lsi.upc.edu/
srlconll/st04/papers/intro.pdf, accessed 31-10-2008.
Ludwig, Bernd, Peter Reiss and Günther Görz
2006 CONALD: The configurable plan-based dialogue system. In Pro-
ceedings of the 2006 IAR Annual Meeting. German-French Institute
for Automation and Robotics, Nancy, November 2006. David Brie,
Keith Burnham, Steven X. Ding, Luc Dugard, Sylviane Gentil,
Ulrich Heid
1.1. Objectives
about the combinability of lexemes. In fact, not only the idiomatic nature
of collocations, but also of other idiomatic multiword expressions, is char-
acterized by a considerable number of morphological, syntactic, semantic
and pragmatic preferences, which contribute to the peculiarity of these
word combinations (section 2); these properties are observable in corpus
data. From the point of view of language learning, they should be learned
along with the word combination, very much the same way as the corre-
sponding properties of single words (cf. Heid 1998 and Heid and Gouws
2006 for a discussion of the lexicographic implications of this assumption).
From the viewpoint of NLP, they should be part of a lexicon.
If we accept this assumption, the task of computational linguistic data
extraction from corpora goes far beyond the identification of significant
word pairs. We will not only show which additional properties may play a
role for German noun+verb combinations (section 2), but also sketch a
computational architecture (section 3.3) that allows us to extract data from
corpora which illustrate these properties. Some such properties can also be
used as criteria for the identification of idiomatic or collocational
noun+verb combinations; others just provide the necessary knowledge one
needs when one wants to write a text and to insert the multiword expres-
sions in a morphologically and syntactically correct way into the surround-
ing sentence.
Our extraction architecture relies on more complex preprocessing than
most of the tool setups used in corpus linguistics: it presupposes a syntactic
analysis down to the level of grammatical dependencies. We motivate this
by comparison with a flat approach, namely the one implemented in the
Sketch Engine (Kilgarriff et al. 2004), for English and Czech (section 3.2).
The Firthian notion of collocation (cf. Firth 1957) is mainly oriented to-
wards lexical cooccurrence ("You shall know a word by the company it
keeps" (Firth 1957: 11)). British contextualism soon discovered co-
occurrence statistics as a device to identify word combinations which are
collocational in this sense. John Sinclair places himself in this tradition, in
Corpus, Concordance, Collocation (Sinclair 1991), emphasizing, however,
the idiomatic nature of the combinations by contrasting the idiom principle
and the open choice principle. The range of phenomena covered by his
approach as presented in Sinclair (1991: 116-118) includes both lexical
collocations and grammatical collocations (in the sense of Benson, Benson
German noun+verb collocations in the sentence context 285
and Ilson 1986): for example, pay attention and back to both figure in the
lists of relevant data he gives.
The lexicographically and didactically oriented approach advocated,
among others, by Hausmann (1979), Hausmann (2004), Mel'cuk et al.
(1984, 1988, 1992, 1999), Bahns (1996) is more oriented towards a syntac-
tic description of collocations: Hausmann distinguishes different syntactic
types, in terms of the category of the elements of the collocation:
noun+verb collocations, noun+adjective, adjective+adverb collocations,
etc. Moreover, Hausmann and Mel'cuk both emphasize the binary nature
of collocations, distinguishing between the base and the collocate. Haus-
mann (2004) summarizes earlier work by stating that bases are typically
autosemantic (i.e. have the same meaning within collocations as outside),
whereas collocates are synsemantic and receive a semantic interpretation
only within a given collocation. Even though this distinction is not easy to
operationalize, it can serve as a useful metaphor, also for the analysis of
longer multiword chunks, where several binary collocations are combined.
Computational linguistics and NLP have followed the contextualist
view, in so far as they have concentrated on the identification of colloca-
tions within textual corpora, designing different types of tools to assess the
collocation status of word pairs. Most simply, a sorting of word pairs by
their number of occurrences (observed frequency) has been used, on the
assumption that collocations are more frequent than non-collocational pairs
(cf. Krenn and Evert 2001). Alternatively, association measures are used to
sort word pairs by a statistical measure of the 'strength' of their association
(cf. Evert 2005); to date over 70 different formulae for measuring the asso-
ciation between words have been proposed.
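The frequency-based and association-based rankings just described can be sketched as follows. The counts are invented for illustration, and pointwise mutual information stands in here for the many association measures surveyed by Evert (2005):

```python
import math
from collections import Counter

def pmi(pair_freq, freq_a, freq_b, n_pairs):
    """Pointwise mutual information for a word pair.

    pair_freq: cooccurrence count of (a, b)
    freq_a, freq_b: marginal counts of a and b
    n_pairs: total number of pairs observed
    """
    p_ab = pair_freq / n_pairs
    p_a = freq_a / n_pairs
    p_b = freq_b / n_pairs
    return math.log2(p_ab / (p_a * p_b))

# Toy counts (invented): German verb+object pairs from a corpus.
pairs = Counter({("stellen", "Frage"): 50, ("stellen", "Antrag"): 20,
                 ("machen", "Frage"): 2, ("machen", "Gebrauch"): 30})
n = sum(pairs.values())
verbs, nouns = Counter(), Counter()
for (v, nn), f in pairs.items():
    verbs[v] += f
    nouns[nn] += f

# Rank candidates by association strength rather than raw frequency.
ranked = sorted(pairs, key=lambda p: pmi(pairs[p], verbs[p[0]], nouns[p[1]], n),
                reverse=True)
print(ranked[0])  # → ('machen', 'Gebrauch')
```

Note how the association-based ranking promotes the less frequent but more exclusive pair Gebrauch + machen over the merely frequent Frage + stellen.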
An important issue in the context of collocation identification from
texts is that of defining the kinds of word pairs to be counted and statisti-
cally analyzed: by which procedures can we extract the items to be
counted? Simple approaches operate on windows around a keyword, e.g.
by looking at items immediately preceding or following that item.
WordSmith Tools (Scott 2008) is a well-known piece of software which embeds
this kind of search as its 'collocation retrieval' function (fixed-distance
windows, left and right of a given keyword). Smadja (1993) combines the
statistical sorting of word pair data with a grammatical filter: he only ac-
cepts as collocation candidates those statistically relevant combinations
which belong to a particular syntactic model, e.g. combinations of an ad-
jective and a noun, or of a verb and a subsequent noun; in English, such a
sequence mostly implies that the noun is the direct object of the verb.
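A minimal version of the window-based retrieval described above can be sketched as follows; the token list and span size are illustrative and do not reproduce WordSmith's actual implementation:

```python
from collections import Counter

def window_pairs(tokens, keyword, span=4):
    """Count words within +/- span positions of each keyword occurrence,
    as in fixed-distance window retrieval."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            lo, hi = max(0, i - span), min(len(tokens), i + span + 1)
            for j in range(lo, hi):
                if j != i:  # skip the keyword itself
                    counts[tokens[j]] += 1
    return counts

toks = "he pays close attention to detail and she pays attention too".split()
cooc = window_pairs(toks, "attention", span=2)
print(cooc.most_common(2))  # 'pays' is the most frequent window partner
```

A grammatical filter in Smadja's (1993) spirit would then discard window partners whose part of speech does not fit a licensed pattern, e.g. keeping only verbs to the left of the noun.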
286 Ulrich Heid
For German and other languages with a more variable word order than
English, the extraction of pairs of grammatically related items is more de-
manding (see below, section 3.2). The guiding principle for collocation
extraction for such languages is to extract word pairs which all homogene-
ously belong to one syntactic type, in Hausmann's (1979) sense, e.g. verbs
and their object nouns. Proposals for this kind of extraction have been
made, among others, by Heid (1998), Krenn (2000), Ritz and Heid (2006).
This syntactic homogeneity has two advantages: on the one hand, it pro-
vides a classification of the word pairs extracted in terms of their gram-
matical categories, and on the other hand, it leads to samples of word pairs,
from which the significance of the association can be computed with re-
spect to a meaningful subset of the corpus (e.g. all verb+object pairs).
More recent linguistic work on multiword expressions has questioned
some of the restrictions inherent to the lexico-didactic approach. At the
same time, the late John Sinclair has suggested that quantitative and struc-
tural properties seem to jointly characterize most such expressions, cf.
Tognini-Bonelli, this volume. In this sense, a combination of the two main
lines of tradition can be seen as an appropriate basis for computational
linguistic data extraction work.
In terms of a modification of the lexico-didactic approach, Schafroth
(2003: 404-409) has noted that there are many idiomatized multiword ex-
pressions which cannot be readily accounted for in terms of the strictly
binary structure postulated by Hausmann (1979). Siepmann (2005) has
given more such examples. Some of them can be explained by means of
recursive combinations of binary collocations: e.g. scharfe Kritik üben
('criticize fiercely', Schafroth 2003: 408, 409) can be seen as a combina-
tion of Kritik üben ('criticize', lit. 'carry out criticism') and the typical
adjective+noun collocation for Kritik, namely scharfe Kritik (cf. Heid
1994: 231; Hausmann 2004; Heid 2005). But other cases are not so easy to
explain and the question of the 'size' of collocations is still under debate.
But the notion of collocation has not only been widened with respect to
the size of the chunks to be analyzed; researchers also found significant
word combinations which are of other syntactic patterns than those identi-
fied e.g. by Hausmann (1989). Examples of this are combinations of dis-
course particles in Dutch (e.g. maar even, 'a bit'), where the distinction
between an autosemantic base and a synsemantic collocate is hard to draw.
In conclusion, it seems that the different strands of collocation research,
in the tradition of both the early work of John Sinclair and of the lexico-
graphic and didactic approach, can jointly contribute to a better under-
Helbig's terms), and zur Sprache bringen ('mention', 'bring to the fore'),
as an example of a lexicalized support verb construction. In line with e.g.
Burger's (1998) discussion of the modifiability and fixedness of idioms,
Helbig's distinction can also be recast in terms of more vs. less idiomatiza-
tion.
(1) Fragen stellen ('[to] ask questions')
*zu Sprachen bringen
(2) eine Frage stellen / die Frage stellen / diese Frage stellen ('[to] ask a/the/
this question')
*zu einer Sprache bringen / *zu der/dieser Sprache bringen
(3) eine relevante Frage stellen ('[to] ask a relevant question')
*zur relevanten Sprache bringen
(4) eine Frage stellen, die enorm wichtig ist ('[to] ask a question which is
enormously important')
*zur Sprache bringen, die enorm wichtig ist
The examples in (1) to (4) show the morphosyntactically fixed nature of
zur Sprache bringen, which does not accept any of the operations which
are perfectly possible with Frage + stellen. A task for data extraction from
corpora is thus to test each collocation candidate for these properties (num-
ber, determination, modifiability of the noun) and to note the respective
number of occurrences of each option in the corpus.
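The testing step just described can be sketched as a count over annotated instances; the instance tuples and feature names below are invented for illustration and do not reflect the actual extraction tools:

```python
from collections import Counter, defaultdict

# Each observed instance of a candidate pair, with the features of the noun
# as found in context: number and determination (hypothetical annotations).
instances = [
    ("Frage", "stellen", {"number": "Sg", "det": "indef"}),
    ("Frage", "stellen", {"number": "Sg", "det": "def"}),
    ("Frage", "stellen", {"number": "Pl", "det": "null"}),
    ("Sprache", "bringen", {"number": "Sg", "det": "def"}),
    ("Sprache", "bringen", {"number": "Sg", "det": "def"}),
]

# profiles[(noun, verb)][feature] -> Counter over observed values
profiles = defaultdict(lambda: defaultdict(Counter))
for noun, verb, feats in instances:
    for feat, value in feats.items():
        profiles[(noun, verb)][feat][value] += 1

# A candidate realized with only one determination/number option in a
# sufficiently large corpus is a hint of morphosyntactic fixedness.
for pair, feats in profiles.items():
    print(pair, dict(feats["det"]))
```

On real data, the relative frequencies of the options (not just their presence) would be recorded, as the text suggests.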
(6) zum Ausdruck bringen ('[to] express', lit. 'bring to the expression'),
zum Ausdruck kommen ('[to] be expressed'),
zur Sprache bringen ('[to] mention'),
in Abrede stellen ('[to] deny'),
zu Protokoll geben ('[to] state').
The collocations in (6) all subcategorize for a sentential complement, while
the nouns Ausdruck, Protokoll, Sprache and Abrede (except in another
reading) as well as the verbs involved do not allow a complement clause.
Thus, even if the number of cases is relatively small, it seems that some
collocational multiwords require their own subcategorization description.
A related fact is nominal 'complementation' of the nominal element of
noun+verb collocations, i.e. the presence of genitive attributes. Many col-
locations contain relational nouns or other nouns which have a preference
for a genitive attribute. An example is the collocation im Mittelpunkt (von
X) stehen ('[to] be at the centre [of X]'). Other examples, extracted from
large German newspaper corpora are listed in (7), below. 1
(7) (a) in + Mittelpunkt + GEN + stellen/rücken
('[to] put into the centre of ...', Frankfurter Rundschau, 40354632)
... die die Sozialpolitik mehr in den Mittelpunkt des öffentlichen
Interesses stellen will ('which wants to put social politics more into
the centre of public interest');
(b) auf + Boden + GEN + stehen
('[to] be on the solid ground of ...', Frankfurter Rundschau, 440129)
[...] fragte er seinen schlaftrunkenen Kollegen, der mit einem Mal
wieder auf dem Boden der Realität stand ('[...] he asked his sleepy
colleague who, all of a sudden, was back to reality');
(c) sich auf + Niveau + GEN + bewegen
('[to] be at the level of ...', Stuttgarter Zeitung, 42459655)
[...] während sich der Umfang des Auslandsgeschäfts auf dem
Niveau des Vorjahres bewegte ('whereas the amount of foreign trade
was at the level of the previous year').
Some of the noun+verb collocations where the noun tends to have a geni-
tive (or a von-phrase) seem to have, in addition to this syntactic specificity,
also particular lexical preferences: for example, we find auf dem Boden der
Realität stehen, auf dem Boden der Verfassung stehen ('[to] be rooted in
the constitution'), auf dem Boden (seiner) Überzeugungen stehen ('[to] be
attached to one's convictions') more frequently than other combinations of
auf + Boden + stehen with genitives. Moreover, these combinations often
come with the adverb fest, such that the whole expression is similar to 'be
firmly rooted in ...'. The analysis of such combinations of collocations
requires very large corpora and ideally corpora of different genres: our data
only come from newspapers and administrative texts. A more detailed
analysis should show larger patterns of relatively fixed expressions, likely
specific to text types, genres etc. At the same time, it would show how the
syntactic property of the nouns involved (to take a genitive attribute) inter-
acts with lexical selection, and how collocational selection properties of
different lexemes interact to build larger idiomatic chunks of considerable
usage frequency.
On the other hand, there are collocations which hardly accept the inser-
tion of a genitive after the noun, and if it is inserted, the construction seems
rather to be a result of linguistic creativity than of typical usage. Examples
are given in (8): the collocations have no genitives in over 97 % of all ob-
served cases (the absolute frequency of the collocation in 240 M words is
given in parentheses), and our examples are the only ones with a genitive:
(8) (a) in Flammen aufgehen ('[to] go up in flames', Die Zeit, 39261518, 433):
LaButes Monolog ist ein Selbstrechtfertigungssystem von so
trockener Vernunft, dass es jederzeit in den Flammen des Wahnsinns
aufgehen könnte ('LaBute's monologue is a self-justification system
of such dry reason that it could go up, at any moment, in the flames
of madness');
(b) in die Irre führen ('[to] mislead', Frankfurter Allgemeine
Zeitung, 62122815, 344):
Ihr 'Requiem' versucht sich in unmittelbarer Emotionalität, von der
die Kunst ja immer wieder träumt, und die doch so oft in die Irre der
Banalität führt ('Her 'Requiem' makes an attempt at immediate
emotionality, which art tends to dream of every now and then, and
which nevertheless misleads quite often towards banality').
Similar strong preferences for the presence or absence of modifying ele-
ments in noun+verb collocations are also found with respect to adjectival
modification. This phenomenon has however mostly been analyzed in line
with the above mentioned morphosyntactic properties which depend on the
referential availability of the noun. Examples which require modifying
adjectives are listed in (9) and a few combinations which do not accept
them are given in (10).
(9) eine gute/brillante/schlechte/... Entwicklung nehmen
('[to] progress well/brilliantly/not to progress well')
eine gute/schlechte/traurige/... Figur abgeben
The table shows that a noun+verb collocation like Frage + stellen ('ask +
question') can appear under all three word order models. At the time of
writing this article, we are in the process of investigating in more detail
which collocation candidates readily appear in all three models and which
ones do not. This was prompted by work on word order constraints for
Natural Language Generation (cf. Cahill, Weller, Rohrer and Heid 2009)
and by the following observation: some collocations which are highly idio-
matized and morphosyntactically relatively fixed, such as Gebrauch ma-
chen ('make use') tend not to have their nominal element in the Vorfeld (cf.
Heid and Weller 2008). This is illustrated with the examples in (14), which
are contrasted with instances of Frage + stellen.
(14) (a) [...,] weil der Chef eine relevante Frage stellt.
('because the boss asks a relevant question', VL)
(b) [...,] weil er davon Gebrauch macht.
('because he makes use of it', VL)
(c) Eine relevante Frage stellt der Chef. (V2)
tools that keep track of the type of preference, both functions could be
achieved: a broad classification into [± idiomatic] and a detailed descrip-
tion.
This inventory of phenomena to be analyzed leads to rather complex
expectations for (semi-)automatic data extraction from corpora. Beyond
lexical selection, there is also a need to extract evidence for the properties
discussed above. And in the ideal case, careful corpus design and a detailed
classification of the corpus data used should allow for the variational
analysis suggested here.
There are three groups of properties of collocations which are used to ex-
tract collocational data from text corpora, at least for English. These are (i)
cooccurrence frequency and significance, (ii) linear order and adjacency
and (iii) morphosyntactic fixedness. All three properties as well as their
use for data extraction have been briefly mentioned above (in section 1.2).
Extraction procedures based exclusively on statistics (frequency, asso-
ciation measures) are at risk of identifying at least two types of noise: on
the one hand, word combinations that are typical of a certain text type, but
phraseologically uninteresting. An example is die Polizei teilt mit(, dass)
('the police informs [that ...]'): this word combination is particularly
frequent in German newspaper corpora (because newspapers report about
many events where the police has to intervene and then inform the public),
but it is not particularly idiomatic, since any noun denoting a human being
or an institution composed of human beings can be a subject of mitteilen.
Next to these semantically and lexically trivial word combinations, also
easy, as typically the subject of an English verb can be found to its left and
the object to its right. To identify English noun+verb combinations, part-
of-speech tagged corpora and (regular) models about the adjacency of
noun and verb phrases are mostly sufficient; it is possible to account for
non-standard word order types (e.g. passives, relative clauses) with rela-
tively simple rules.
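The adjacency-based extraction from part-of-speech tagged English text can be sketched as a pattern match over word/tag sequences; the tag pattern below is a simplified illustration, not the actual grammar used by any of the tools cited:

```python
import re

# Tagged tokens as word/TAG pairs; a minimal adjacency pattern for
# verb + (determiner) + (adjectives) + noun sequences in English.
tagged = "they/PRP pay/VBP close/JJ attention/NN to/TO the/DT talk/NN"

# Capture the verb and the noun; optional determiner and adjectives
# may intervene (illustrative Penn-style tags).
pattern = re.compile(r"(\S+)/VB[DPZG]?(?: \S+/DT)?(?: \S+/JJ)* (\S+)/NNS?")
print(pattern.findall(tagged))  # → [('pay', 'attention')]
```

Passives and relative clauses would need a few additional patterns of the same kind, as the text notes.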
For inflecting languages (like the Slavonic languages, or Latin), nomi-
nal inflection gives a fair picture of case and thus of the grammatical rela-
tion between a nominal and its governing verb. Even though many inflec-
tional forms of e.g. nouns in Czech are case-ambiguous in isolation, a large
percentage of noun groups (e.g. adjectives plus nouns) is unambiguous.
Thus, inflecting languages, allowing for flexible constituent order, still
lend themselves fairly well to a morphology-based approach to extracting
relational word pairs.
In fact, the collocation extraction within the lexicographic tool Sketch
Engine (Kilgarriff et al. 2004) is based on the above mentioned principles
for English and Czech: the extraction of verb+object pairs from English
texts relies on sequence patterns of items described in terms of parts-of-
speech and the tool for Czech on patterns of cooccurrence of certain mor-
phological forms, in arbitrary order, and within a window of up to five
words.
German is different from both English and Czech, for that matter. Due
to its variable constituent order (see table 1), sequence patterns and the
assumption of verb+noun adjacency do not provide acceptable results.
German has four cases and nominals inflect for case; but nominal inflec-
tion shows a great deal of syncretism, such that, for example, nominative and
accusative, or genitive and dative are formally identical in several inflec-
tion paradigms. Evert (2004) has extracted noun and noun phrase data from
the manually annotated Negra corpus (a subset of the newspaper Frank-
furter Rundschau) and he found that only approx. 21 % of all noun phrases
in that corpus are unambiguous for case, with roughly the same amount not
giving any case information (i.e. being fully four-way ambiguous, as is the
case with feminine plural nouns!), and 58 % being 2- or 3-way ambiguous
(cf. table 2, below).
phosyntactic features (case, number, etc., cf. third and fourth column of
figure 1). In the extraction work, we rely on these data, as they give hints
on the morphosyntactic properties of the collocations extracted: the mor-
phosyntactic features, as well as the form of the determiner, possible nega-
tion elements in the noun phrase or in the verb phrase, possible adverbs,
etc. are extracted along with the lemma and form of base and collocate.
This multiparametric extraction is modular: new features or context part-
ners can be added if this is necessary. Similarly, additional patterns, which
in this case cover the sentence as a whole, can be used to detect passives
and/or to identify the constituent order models involved. In this way, we
get data for the specific analyses discussed in section 2.
All extracted data for a given pair of base and collocate are stored in a
relational database, along with the sentence where these data have been
found. An example of a data set for a sentence is given in (22), below, for
sentence (23). For this example, the database contains information about
the noun and verb lemma (the verb here being a compound: geltend ma-
chen, 'put forward'), but also about the number, the kind of determiner
present in the NP (here: null, i.e. none), the presence of the passive (includ-
ing the lemma of the passive auxiliary, here werden), the sentence type
(verb-second), modifiers found in the sentence (adverbs and prepositional
phrases) and about the fact that the verb is embedded under a modal (for
details on the procedures, see Heid and Weller 2008).
(22) n_lemma | Grund
v_lemma | geltend machen
number | Pl
type_of_det | null
active/passive | passive
pass_auxiliary | werden
sent_type | v-2
modifiers | auch (ADV), PP:für:Errichtung, PP:für:Land
modal | können
preposition | null
chunk | Solche Gründe können auch für die Errichtung eines
gemeinsamen Patentamtes für die Länder geltend
gemacht werden
(23) Solche Gründe können auch für die Errichtung eines gemeinsamen
Patentamtes für die Länder geltend gemacht werden.
('Such reasons can also be put forward for the installation of a common
patent office for the Länder').
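A minimal version of such a relational store can be sketched with SQLite; the column names follow the field labels of example (22), but the schema is an assumption for illustration, not the authors' actual database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE instances (
    n_lemma TEXT, v_lemma TEXT, number TEXT, type_of_det TEXT,
    voice TEXT, pass_auxiliary TEXT, sent_type TEXT, chunk TEXT)""")

# One row per corpus sentence, cf. examples (22)/(23).
conn.execute("INSERT INTO instances VALUES (?,?,?,?,?,?,?,?)",
             ("Grund", "geltend machen", "Pl", "null", "passive", "werden",
              "v-2", "Solche Gründe können ... geltend gemacht werden"))

# Query: how often does each candidate pair occur in the passive?
rows = conn.execute("""SELECT n_lemma, v_lemma, COUNT(*) FROM instances
                       WHERE voice = 'passive'
                       GROUP BY n_lemma, v_lemma""").fetchall()
print(rows)  # → [('Grund', 'geltend machen', 1)]
```

Aggregating over such rows yields exactly the per-feature frequency profiles (determination, number, voice, sentence type) discussed in section 2.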
4. Conclusions
Notes
1 The examples are taken from ongoing work by Marion Weller (IMS Stuttgart)
on a corpus of German newspaper texts from 1992 to 1998, comprising mate-
rial from Stuttgarter Zeitung, Frankfurter Rundschau, Die Zeit and Frank-
furter Allgemeine Zeitung, a total of ca. 240 million words. These sources are
indicated by the title and the onset of the citation in the IMS version of the re-
spective corpora. The text of Frankfurter Rundschau (1993/94) has been pub-
lished by the European Corpus Initiative (ELSNET, Utrecht, The Nether-
lands) in its first multilingual corpus collection (ECI-MC1). The other news-
papers, as well as the juridical corpus cited below have been made available
to the author under specific contracts for research purposes.
2 A related observation has to do with verb-final contexts: there, the support
verb and the pertaining noun tend to be adjacent. Only few types of phrases
can be placed between the two elements, e.g. adverbs, pronominal adverbs or
prepositional phrases. However, in the data used for our preliminary investi-
gations, this criterion does not help much to distinguish idiomatic groups from
non-idiomatic ones.
References
Abney, Steven
1991 Parsing by chunks. In Principle-Based Parsing, Robert Berwick,
Steven Abney and Carol Tenny (eds.), 257-278. Dordrecht: Kluwer.
Bahns, Jens
1996 Kollokationen als lexikographisches Problem: Eine Analyse allge-
meiner und spezieller Lernerwörterbücher des Englischen. Lexico-
graphica Series Maior 74. Tübingen: Max Niemeyer.
Benson, Morton, Evelyn Benson and Robert Ilson
1986 The Lexicographic Description of English. Amsterdam/Philadelphia:
John Benjamins.
Burger, Harald
1998 Phraseologie: Eine Einführung am Beispiel des Deutschen. Berlin:
Erich Schmidt Verlag.
Cahill, Aoife, Marion Weller, Christian Rohrer and Ulrich Heid
2009 Using tri-lexical dependencies in LFG parse disambiguation. In
Proceedings of the LFG09 Conference, Miriam Butt and Tracy Hol-
loway King (eds.), 208-221. Stanford: CSLI Publications.
Evert, Stefan
2004 The statistical analysis of morphosyntactic distributions. In Proceed-
ings of the 4th International Conference on Language Resources and
Evaluation (LREC 2004), 1539-1542. Lisbon: ELRA.
Evert, Stefan
2005 The Statistics of Word Cooccurrences: Word Pairs and Collocations.
Stuttgart: University of Stuttgart and http://www.collocations.de/
phd.html.
Evert, Stefan
2009 Corpora and collocations. In Corpus Linguistics: An International
Handbook, Anke Lüdeling and Merja Kytö (eds.), 1212-1248. Ber-
lin/New York: Walter de Gruyter.
Heid, Ulrich
1998 Building a dictionary of German support verb constructions. In Pro-
ceedings of the 1st International Conference on Linguistic Resources
and Evaluation, Granada, May 1998, 69-73. Granada: ELRA.
Heid, Ulrich
2005 Corpusbasierte Gewinnung von Daten zur Interaktion von Lexik und
Grammatik: Kollokation - Distribution - Valenz. In Corpuslinguistik
in Lexik und Grammatik, Friedrich Lenz and Stefan Schierholz (eds.),
97-122. Tübingen: Stauffenburg.
Heid, Ulrich
2008 Computational phraseology: An overview. In Phraseology: An Inter-
disciplinary Perspective, Sylviane Granger and Fanny Meunier (eds.),
337-360. Amsterdam: John Benjamins.
Heid, Ulrich, Fabienne Fritzinger, Susanne Hauptmann, Julia Weidenkaff and Mari-
on Weller
2008 Providing corpus data for a dictionary for German juridical phraseol-
ogy. In Text Resources and Lexical Knowledge: Selected Papers from
the 9th Conference on Natural Language Processing, KONVENS
2008, Angelika Storrer, Alexander Geyken, Alexander Siebert and
Kay-Michael Würzner (eds.), 131-144. Berlin: Mouton de Gruyter.
Heid, Ulrich and Rufus H. Gouws
2006 A model for a multifunctional electronic dictionary of collocations. In
Proceedings of the XIIth Euralex International Congress, 979-989.
Alessandria: Edizioni dell'Orso.
Heid, Ulrich and Marion Weller
2008 Tools for collocation extraction: Preferences for active vs. passive. In
Proceedings of LREC-2008: Linguistic Resources and Evaluation
Conference, Marrakesh, Morocco. CD-ROM.
Helbig, Gerhard
1979 Probleme der Beschreibung von Funktionsverbgefügen im Deutschen.
Deutsch als Fremdsprache 16: 273-286.
Ivanova, Kremena, Ulrich Heid, Sabine Schulte im Walde, Adam Kilgarriff and Jan
Pomikalek
2008 Evaluating a German sketch grammar: A case study on noun phrase
case. In Proceedings of LREC-2008: Linguistic Resources and
Evaluation Conference, Marrakech, Morocco. CD-ROM.
Keil, Martina
1997 Wort für Wort: Repräsentation und Verarbeitung verbaler Phraseo-
logismen (Phraseolex). Tübingen: Niemeyer.
Kermes, Hannah
2003 Offline (and Online) Text Analysis for Computational Lexicography.
Dissertation, IMS, University of Stuttgart.
Schiehlen, Michael
2003 A cascaded finite-state parser for German. In Proceedings of the
Research Note Sessions of the 10th Conference of the European
Chapter of the Association for Computational Linguistics (EACL
2003), Budapest, April 2003, 133-166. Budapest: Association for
Computational Linguistics.
Scott, Mike
2008 WordSmith Tools, version 5, Liverpool: Lexical Analysis Software.
Seretan, Violeta
2008 Collocation Extraction Based on Syntactic Parsing. Dissertation No.
653, Dépt. de linguistique, Université de Genève, Genève.
Siepmann, Dirk
2005 Collocation, colligation and encoding dictionaries. Part I: Lexicologi-
cal aspects. International Journal of Lexicography 18 (4): 409-444.
Sinclair, John McH.
1991 Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Smadja, Frank
1993 Retrieving collocations from text. Computational Linguistics 19 (1):
143-177.
Storrer, Angelika
2006 Zum Status der nominalen Komponente in Nominalisierungsverbge-
fügen. In Grammatische Untersuchungen: Analysen und Reflexionen,
Eva Breindl, Lutz Gunkel and Bruno Strecker (eds.), 275-295. Tü-
bingen: Narr.
Corpora
Faulhaber, Susen, 31
Fazly, Afsaneh, 297, 299
Feilke, Helmut, 62
Fellbaum, Christiane, 287
Ferrer i Cancho, Ramon, 91
Ferret, Olivier, 91
Fillmore, Charles, 40, 41, 49, 199
Firth, John Rupert, 5, 6, 12, 19, 49, 147, 211, 212, 221, 223, 284
Fischer, Kerstin, 47, 49
Fitzpatrick, Tess, 136
Fletcher, William H., 211, 214
Fodor, Jerry Alan, 258
Francis, Gill, 8, 47, 67
Francis, W. Nelson, 7, 18
Frath, Pierre, 61, 62, 65, 78
Frey, Eric, 49, 229
Fritzinger, Fabienne, 303
Gabrielatos, Costas, 133, 140
Gallagher, John D., 60, 65, 73, 78
Galpin, Adam, 49, 229
Garside, Roger, 231, 243
Gavioli, Laura, 211
Gilquin, Gaëtanelle, 28, 127, 138
Girard, René, 264
Gledhill, Christopher, 61, 62, 65, 78
Gläser, Rosemarie, 41
Görz, Günther, 280
Götz, Dieter, 50, 153, 156, 239
Götz-Votteler, Katrin, 50, 156, 231
Goldberg, Adele E., 31, 40, 44, 47-49
Gouverneur, Céline, 133
Gouws, Rufus H., 284
Grandage, Sarah, 49
Granger, Sylviane, 28, 41, 46, 47, 127, 134, 138, 153, 161, 200, 230
Grant, Lynn, 136
Greaves, Chris, 10, 211, 214, 216
Green, Georgia, 67-69, 74, 260
Greenbaum, Sidney, 49
Gries, Stefan Th., 47, 50
Grimm, Anne, 201, 205
Grossmann, Francis, 283
Groß, Annette, 74, 76
Guan, Jihong, 91
Hamilton, Nick, 132
Han, ZhaoHong, 152
Handl, Susanne, 47, 262
Harris, Zellig S., 213
Harwood, Nigel, 124, 126, 131
Hasselgren, Angela, 130
Hausmann, Franz Josef, 28-30, 32-34, 37, 47, 60, 61, 65-68, 77, 78, 200, 208, 230, 236, 283, 285, 286, 291
Hausser, Roland, 247, 248, 256, 259, 260, 262-264
Heaton, John B., 161
Heid, Ulrich, 283, 284, 286-288, 293, 296, 299, 302, 304
Helbig, Gerhard, 287, 288
Herbst, Thomas, 28, 29, 31, 39, 41, 47, 49, 50, 78, 161, 203, 252
Heringer, Hans Jürgen, 156
Hinrichs, Lars, 187
Hockenmaier, Julia, 263
Hoey, Michael, 24, 66, 67, 78, 123, 128
Hoffmann, Sebastian, 160, 180
Hofland, Knut, 47
Hoover, David, 3
Hopper, Paul, 183
Howarth, Peter, 127, 161
Hu, Guobiao, 91
Huang, Wei, 263
Hugon, Claire, 134
Hundt, Marianne, 160
Hunston, Susan, 8, 14, 47, 67, 123, 212, 214, 224
Hyland, Ken, 137, 211
Ilson, Robert, 202, 285
Ivanova, Kremena, 301, 302
Jackson, Dunham, 116, 218
Author index 315
Neumann, Gerald, 287
Nicewander, W. Alan, 116
Nichols, Johanna, 251
Oliveira, Osvaldo N., 91
Overstreet, Maryann, 198
Pacey, Mike, 114-116
Paillard, Michel, 72, 73, 78
Paquot, Magali, 41, 46, 47, 127, 134, 138, 200
Park, Young C., 91
Partington, Alan, 66
Pawley, Andrew, 11, 40
Pearson, Jennifer, 116, 211
Platt, John, 160
Pollard, Carl, 259
Polzenhagen, Frank, 160
Porto, Melina, 125, 127, 129
Proisl, Thomas, 49, 262
Pulverness, Alan, 130, 139
Putnam, Hilary, 246, 248, 262
Quirk, Randolph, 18
Raab-Fischer, Roswitha, 187, 191
Ramshaw, Lance A., 276, 277
Rayson, Paul, 115, 231, 239
Reiss, Peter, 280
Renouf, Antoinette, 8, 114-116, 130, 134, 135, 138
Richards, Jack C., 128
Risau-Gusman, Sebastian, 91
Ritz, Julia, 286, 299
Rodgers, G. J., 91
Rodgers, Joseph Lee, 116
Römer, Ute, 174, 200, 211, 214
Rogers, Ted, 131
Rohdenburg, Günter, 180
Rohrer, Christian, 293
Rosenbach, Anette, 187
Sag, Ivan, 259
Salkoff, Morris, 77
Sand, Andrea, 35, 160, 161, 165, 174
Santorini, Beatrice, 263
Saussure, Ferdinand de, 43, 218
Schafroth, Elmar, 286
Schiehlen, Michael, 302, 303
Schilk, Marco, 160
Schmid, Hans-Jörg, 36, 47, 48
Schmied, Josef, 160
Schmitt, Norbert, 49, 127, 229
Schneider, Edgar W., 160
Schüller, Susen, 41, 252
Schur, Ellen, 91
Scott, Mike, 198, 246, 285
Selinker, Larry, 152
Sells, Peter, 263
Seretan, Violeta, 302
Shei, Chi-Chiang, 161
Siepmann, Dirk, 30, 44, 47, 61-63, 67, 78, 156, 286
Sigman, Mariano, 91
Simpson-Vlach, Rita, 127, 129
Sinclair, John McH., 1-14, 17-24, 27-29, 32, 37, 38, 41-43, 45, 59-63, 66, 67, 71, 77, 78, 87, 89, 93, 103, 109, 113, 115, 116, 123, 124, 126, 128, 130, 132, 134, 135, 138, 139, 147, 154, 156, 159, 179, 197, 206, 208, 211, 212, 214, 215, 217, 223, 224, 229, 240, 243, 245, 258, 284, 286
Siyanova, Anna, 127
Skandera, Paul, 160, 170
Smadja, Frank, 285
Smith, Nicholas, 185, 187, 240, 272
Soares, Marcio Medeiros, 91
Solé, Richard V., 90
Speares, Jennifer, 48
Steedman, Mark, 263
Steels, Luc, 90
Štefančić, Hrvoje, 91
Stefanowitsch, Anatol, 44, 47-50
Stein, Stephan, 65
Stevenson, Suzanne, 297, 299
Steyvers, Mark, 90
lexicon, 19, 36, 48, 130, 179, 248, 249, 257, 263, 284
lexis, 7, 8, 11, 18, 19, 21, 23, 24, 46, 123-126, 130, 131, 133, 140, 200
light verb, 174
meaning, 1, 3, 7, 8, 10-12, 21, 23, 27, 29-31, 33-39, 41-46, 48, 49, 64, 76, 87-89, 102, 103, 109, 111-114, 116, 123, 128, 130, 133, 134, 137, 138, 140, 147, 150-152, 156, 165, 172, 202, 204, 211, 212, 214, 216, 217, 219, 221-224, 245, 246, 248, 250, 251, 253, 257, 260-262, 264, 270, 274, 283, 285, 291
mental lexicon, 29, 36
metaphor, 65, 67, 68, 136, 259, 285
metaphorical, 10, 89
metonymy, 65
modifier, 167, 190, 244, 270, 304
motivation, 28, 127, 135, 137, 139, 269, 271
multi-word, 39, 45, 50, 61, 78, 124, 131-134, 199, 203, 214, 237
multilingual, 307
n-gram, 198, 214, 216, 217, 219, 223
Natural Language Processing (NLP), 269
negation, 291, 297, 304
network, 87, 88, 90-95, 98-102, 105-108, 110, 112-115, 123
New Englishes, 240
non-compositionality, 10, 127, 134, 259, 283, 297
noun, 22, 31, 37, 43, 47, 60, 62, 69, 73, 128, 155, 165, 167, 168, 187, 190, 191, 204, 217, 221, 234, 238, 251, 263, 271-273, 286, 287, 289-291, 300, 301, 303
object, 44, 149, 273, 274, 285-288, 295, 299-303, 306
objective, 7, 76, 246
obligatory, 59
OED, 35
omission, 96, 160
opacity, 283
opaque, 259, 283
open choice principle, 28, 31, 37, 38, 44, 59, 61, 64-66, 68, 229, 284
optional, 49, 271
paradigmatic, 45
paraphrase, 152
parole, 43, 61
parser, 259, 272, 273, 276, 277, 279, 280, 302, 303
parsing, 254, 261, 269-272, 274, 276, 278-280, 302, 303, 306
participant, 96, 224, 279
particle, 160, 200, 203, 206, 207, 286, 292
passive, 60, 125, 133, 274, 304, 305
patient, 278
pattern, 2, 9, 12, 13, 17, 18, 20-22, 27, 31, 32, 46, 47, 62, 63, 65-67, 69, 87-91, 94, 101-103, 110, 112, 113, 116, 123, 126, 129, 138, 154, 155, 165-167, 189, 201, 207, 211, 212, 215-217, 219, 221-224, 229, 249, 257, 258, 263, 264, 269, 286, 290, 292, 300, 302-304
performance, 147, 274, 277, 279
periphery, 28, 127, 206
phrasal, 8, 36, 123, 127, 137, 139, 160, 199, 214, 273, 274
phrasal verb, 36, 123, 127, 137, 160, 274
phrase, 6, 8, 11, 27, 28, 41, 42, 46, 49, 59, 60, 62, 64, 65, 68, 69, 74, 75, 78, 89, 125, 127, 129, 131-138, 150, 154, 180, 187, 188, 190, 191, 204, 206, 207, 214, 215, 219, 259, 273, 274, 278,
Subject index 323
tagging, 152, 237, 243, 244, 263, 274, 276-278
text, 1-3, 6, 7, 9-13, 17, 18, 20-24, 27, 28, 33, 35, 38, 45, 48, 49, 59, 61-63, 65, 66, 68, 70, 74, 79, 87-93, 96, 97, 100, 101, 103, 104, 109, 111-113, 115, 116, 123, 125, 128, 136, 138, 150, 162, 185, 186, 189, 191, 197, 198, 203, 211-214, 216, 218, 221, 222, 224, 229-232, 234-240, 244, 245, 269, 270, 274, 278, 284, 285, 287, 290, 295-302, 305, 306
theme, 1, 3, 180
token, 91, 94, 100, 104, 190, 215, 249, 250, 264, 269
transitive, 220, 243, 253, 258
translation, 39, 45, 59-61, 67-70, 73-79, 149, 153, 154, 222, 231-237, 263
type, 91, 93, 94, 102, 104, 215, 216, 249, 250
unit of meaning, 1, 3, 8, 11, 13, 36, 45, 211, 214, 223
usage, 31, 32, 147, 154, 170, 181, 184, 188, 189, 238, 240, 290
usage-based, 32
utterance, 2, 40, 46, 147, 150, 152, 189, 206, 221
valency, 28, 31, 32, 38, 41, 46, 49, 65, 149, 151, 156, 199, 252, 253, 255, 288, 291
variability, 166, 169, 173, 174, 215
variant, 8, 73, 165, 169, 170, 184, 186, 190, 191, 215, 253, 293, 294, 306
variation, 42, 49, 73, 149, 181, 215, 216, 296, 297, 306
variety, 4, 137, 159-161, 163, 165, 168-171, 173, 174, 181, 188, 189, 197-201, 203-208, 211
verb, 22, 23, 27, 31, 32, 37, 38, 40-44, 49, 60, 67, 68, 74, 76, 78, 89, 123, 134, 138, 155, 156, 159-161, 165, 170-174, 185, 186, 202, 203, 206, 207, 216, 243, 244, 251-253, 258, 260, 263, 270, 272-275, 278, 284-297, 299-304, 306, 307
word class, 204
word form, 123, 200, 230, 231, 233, 248, 249, 251, 254, 257, 259, 261-264, 269, 273, 303
written, 7, 14, 113, 123, 127, 161-165, 179-181, 183-191, 212, 220-223