
The Phraseological View of Language
A Tribute to John Sinclair

Edited by
Thomas Herbst
Susen Faulhaber
Peter Uhrig

De Gruyter Mouton
ISBN 978-3-11-025688-8
e-ISBN 978-3-11-025701-4

Library of Congress Cataloging-in-Publication Data

The phraseological view of language : a tribute to John Sinclair / edited by Thomas Herbst, Susen Faulhaber, Peter Uhrig.
p. cm.
Includes bibliographical references and index.
ISBN 978-3-11-025688-8 (alk. paper)
1. Discourse analysis. 2. Computational linguistics. 3. Linguistics — Methodology. I. Sinclair, John McHardy, 1933-2007. II. Herbst, Thomas, 1953- III. Faulhaber, Susen, 1978- IV. Uhrig, Peter.
P302.P48 2011
401'.41—dc23
2011036171

Bibliographic information published by the Deutsche Nationalbibliothek

The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.d-nb.de.

© 2011 Walter de Gruyter GmbH & Co. KG, Berlin/Boston


Printing: Hubert & Co. GmbH & Co. KG, Göttingen
Printed on acid-free paper
Printed in Germany
www.degruyter.com
Preface
This volume goes back to a workshop entitled Chunks in Corpus Linguistics and Cognitive Linguistics held at the Friedrich-Alexander-Universität Erlangen-Nürnberg in October 2007 on the occasion of the award of an honorary doctorate by the Philosophische Fakultät II to Professor John McHardy Sinclair.
With this honorary degree the Faculty wished to express its respect and
admiration for John Sinclair's outstanding scholarly achievement and the
enormous contribution he has made to linguistic research. This concerns
not only his role in the development of corpus linguistics and his innovative
approach towards the making and designing of dictionaries, which culmi-
nated in the Cobuild project, but also the fact that for him such applied
work was always connected with the analysis of language as such and thus
with theoretical insights about language. For instance, at a time when it was
commonly held that words have more than one meaning and that when
listening to a sentence listeners pick out the appropriate meanings of the
words on the basis of the context in which they are used, it took Sinclair to
point out that, taking the meanings listed in a common English dictionary, a
sentence such as The cat sat on the mat would potentially have more than
41 million interpretations from which listeners choose one as being correct
(Sinclair 2004: 138).2 It is observations such as these - and especially the
many demonstrations of the complex character of prefabricated chunks in
language - that make his books and articles so enlightening and engaging
to read. Unconventional thought and provocative ideas based on unparal-
leled empirical evidence - these are the qualities that made this faculty
regard John Sinclair as an outstanding individual and scholar. His approach
to linguistics encompassed all facets without any division between theory
and practical description and has provided an important impetus for lin-
guists worldwide, including many associated with the Erlangen Interdisci-
plinary Centre for Research on Lexicography, Valency and Collocation. On
a personal note, we are grateful to him for his decisive support of the
valency dictionary project.
That John Sinclair, who had planned to take part in the colloquium, died
in March 2007 came as a great shock; at least the news of this honorary
doctorate had reached him while he was still well. Nevertheless, we felt it
was appropriate to hold the workshop, in which a number of John's col-
leagues and friends participated, to discuss his ideas and research projects
that were inspired by and related to his work, and we are very honoured and
grateful that Professor Elena Tognini Bonelli came to take part in the workshop and to receive this distinction on his behalf.
We owe a great deal of gratitude to Michael Stubbs and Stig Johansson
for providing very detailed surveys of Sinclair's outstanding contribution to
the development of the subject which forms section I of this volume. In
"A tribute to John McHardy Sinclair (14 June 1933 - 13 March 2007)",
Michael Stubbs provides a detailed outline of John Sinclair's academic
career and demonstrates how he has left his mark in various linguistic fields
which are, in part, intrinsically connected to his name. Of particular signifi-
cance in this respect is his focus on refining the notion of units of meaning,
the role of linguistics in language learning and discourse analysis and, of
course, his contributions in the field of lexicography and corpus linguistics
in the context of the Cobuild project. We are particularly grateful to Stig
Johansson, who, although he was not able to take part in the workshop for
health reasons, wrote a tribute to John Sinclair for this occasion.
His article "Corpus, lexis, discourse: a tribute to John Sinclair" focuses in
particular on Sinclair's work in the context of the Bank of English and his
influence on the field of corpus linguistics.
The other contributions to this volume take up different issues which
have featured prominently in Sinclair's theoretical work. The articles in
section II focus on the concept of collocation and the notions of open-
choice and idiom principle. Thomas Herbst, in "Choosing sandy beaches -
collocations, probabemes and the idiom principle" discusses different types
of collocation in the light of Sinclair's concepts of single choice and ex-
tended unit of meaning, drawing parallels to recent research in cognitive
linguistics. The notions of open choice and idiom principle are also taken
up by Dirk Siepmann, who in "Sinclair revisited: beyond idiom and open
choice" suggests introducing a third principle, which he calls the principle
of creativity, and discusses its implications for translation teaching. Eugene
Mollet, Alison Wray, and Tess Fitzpatrick widen the notion of collocation
in their article "Accessing second-order collocation through lexical co-
occurrence networks". By second-order collocation they refer to word
combinations depending on the presence or absence of other items, mod-
elled in the form of networks. Thus they address questions such as to what
extent the presence of several second-order collocations can influence the
target item. A number of contributions focus on aspects of collocation in
foreign language teaching. In "From phraseology to pedagogy: challenges
and prospects" Sylviane Granger discusses the implications of the interde-
pendence of lexis and grammar and the idiom principle with respect to the
learning and teaching of languages. While bearing in mind that such an
approach needs to be reconciled with the realities of language classrooms,
she gives a detailed overview of the pros and cons of a lexical as opposed
to a structural grammar-based syllabus. In "Chunks and the effective
learner - a few remarks concerning foreign language teaching and lexicog-
raphy" Dieter Gotz illustrates approaches towards the treatment of chunks
in lexicography and discusses strategies foreign learners should develop to
expand their repertoire of prefabricated items. Nadja Nesselhauf s article
"Exploring the phraseology of ESL and EFL varieties" addresses questions
of foreign language use (EFL) and compares them with English as a native
language and English as a second language (ESL) with respect to chunks
and phraseological phenomena. With her view of learner English as a vari-
ety of English, Nesselhauf s contribution provides a link to section III,
which focuses on aspects of variation and change.
"Writing the history of spoken standard English in the twentieth cen-
tury" by Christian Mair takes a diachromc perspective and focuses on the
role of the spoken language - as opposed to that of the written language -
in language change. By comparing data from the Diachromc Corpus of
Present-Day Spoken English (DCPSE) with corpora from the "Brown fam-
ily", he shows that speech and writing can develop autonomously and that
changes which develop in one mode need not necessarily be taken over into
the other. In her article "Prefabs in spoken English", Bngitta Mittmann
looks at regional variation and presents results of a corpus-based study of
phraseological units in British and American English. By providing empiri-
cal evidence for regional differences, e.g. in the sense of different prefer-
ences for phraseological units with the same pragmatic function, she under-
lines the arbitrariness and conventionality of such chunks in language.
Ute Römer investigates variation with respect to genre in her article
"Observations on the phraseology of academic writing: local patterns -
local meanings?". Examining what she calls the "phraseological profile" in
this specific type of written English with the help of a corpus of linguistic
book reviews, she shows that certain chunks are often associated with posi-
tive or negative evaluative meaning even if the lexical items themselves are
rather neutral in this respect. The fact that this association in part differs
from the use of the same chunks in more general English is taken as evi-
dence for a "local grammar" and the genre-specificity of units of meaning
and thus their conventionality. Peter Uhrig and Katrin Götz-Votteler also
place a focus on genre-specific differences in their article "Collocational
behaviour of different types of text". They look at different samples of fie-
viii Susen Faulhaber, Thomas Herbst and Peter Uhrig

tional, non-fictional and learner texts using a computer program to deter-


mine different degrees of collocational strength and to explore to what ex-
tent it is possible to relate these to factors such as perceived difficulty, text
typeondiomaticity.
Section IV concentrates on computational aspects of phraseological re-
search. In "Corpus linguistics, generative grammar, and database seman-
tics", Roland Hausser compares the information given in entries of the
COBUILD English Language Dictionary to the data structures needed in
Database Semantics, which is aimed at enabling an artificial cognitive
agent to understand natural language. Günther Görz and Günter Schellen-
berger show in "Chunk parsing in corpora" that chunks also play an impor-
tant role in Natural Language Processing (NLP). They present a method for
chunk parsing and for evaluating the performance of chunk parsers. Finally,
Ulrich Heid's contribution "German noun+verb collocations in the sentence
context: morphosyntactic properties contributing to idiomaticity" combines
computational aspects of analysis with the theoretical insights gained in a
Sinclairean framework and shows that a thorough description of colloca-
tions goes beyond the level of lexical co-occurrence and has to include
morphological, syntactic, semantic and pragmatic properties. He proposes a
computational architecture for the extraction of such units from corpora
that makes use of syntactic dependency parsing.
We hope that the contributions in this volume give some idea of the
wide spectrum of areas in which research has been carried out that has been
either directly inspired by the work of Sinclair or addresses issues that
were also important to him. The spirit of this volume is to pay tribute to one
of the most outstanding linguists of the twentieth century, who (in German)
could rightly be called a first-class Querdenker, an independent thinker
who, in a rather down-to-earth way, raised questions that were not always
in line with the research fashion of the day but instead pointed towards
unconventional solutions which in the end seemed perfectly straightfor-
ward.

Susen Faulhaber
Thomas Herbst
Peter Uhrig
Notes

1 We would like to thank Barbara Gabel-Cunningham for her invaluable help in
preparing the manuscript, Kevin Pike for his linguistic advice and also Chris-
tian Hauf for his assistance in preparing the index and proofreading.
2 See Sinclair (2004: 138), The lexical item. In Trust the Text: Language, Cor-
pus and Discourse, John McH. Sinclair with Ronald Carter (eds.). Lon-
don/New York: Routledge. First published in Edda Weigand (ed.), Contrastive
Lexical Semantics: Current Issues in Linguistic Theory, Amster-
dam/Philadelphia: Benjamins, 1998.
Contents

I John McH. Sinclair and his contribution to linguistics

Preface
Susen Faulhaber, Thomas Herbst and Peter Uhrig

A tribute to John McHardy Sinclair (14 June 1933-13 March 2007) 1
Michael Stubbs

Corpus, lexis, discourse: a tribute to John Sinclair 17
Stig Johansson

II The concept of collocation: theoretical and pedagogical aspects

Choosing sandy beaches - collocations, probabemes and the idiom principle 27
Thomas Herbst

Sinclair revisited: beyond idiom and open choice 59
Dirk Siepmann

Accessing second-order collocation through lexical co-occurrence networks 87
Eugene Mollet, Alison Wray and Tess Fitzpatrick

From phraseology to pedagogy: challenges and prospects 123
Sylviane Granger

Chunks and the effective learner - a few remarks concerning foreign language teaching and lexicography 147
Dieter Götz

Exploring the phraseology of ESL and EFL varieties 159
Nadja Nesselhauf

III Variation and change

Writing the history of spoken standard English in the twentieth century 179
Christian Mair

Prefabs in spoken English 197
Brigitta Mittmann

Observations on the phraseology of academic writing: local patterns - local meanings? 211
Ute Römer

Collocational behaviour of different types of text 229
Peter Uhrig and Katrin Götz-Votteler

IV Computational aspects

Corpus linguistics, generative grammar and database semantics 243
Roland Hausser

Chunk parsing in corpora 269
Günther Görz and Günter Schellenberger

German noun+verb collocations in the sentence context: morphosyntactic properties contributing to idiomaticity 283
Ulrich Heid

Author index 313
Subject index 319
A tribute to John McHardy Sinclair (14 June 1933-13 March 2007)

Michael Stubbs

1. Abstract

This Laudatio was delivered at the University of Erlangen, Germany, on 25
October 2007, on the occasion of the posthumous award of an honorary
doctorate to John McHardy Sinclair. The Laudatio briefly summarizes
some facts about his career in Edinburgh and Birmingham, and then dis-
cusses the major contributions which he made to three related areas: lan-
guage in education, discourse analysis and corpus-assisted lexicography. A
major theme in his work throughout his whole career is signalled in the title
of one of his articles: "the search for units of meaning". In the 1960s, in his
early work on corpus analysis, he studied the relation between objectively
observable collocations and the psychological sensation of meaning. In the
1970s, in his work on classroom discourse, he studied the prototypical units
of teacher-pupil dialogue in school classrooms. And from the 1980s on-
wards, in his influential work on corpus lexicography, for which he is now
best known, he studied the kinds of patterning in long texts which are ob-
servable only with computational help. The Laudatio provides brief exam-
ples of the kind of innovative findings about long texts which his work has
made possible: his development of a "new view of language and the tech-
nology associated with it". Finally, it situates his work within a long tradi-
tion of British empiricism.

2. Introduction

John Sinclair is one of the most important figures in modern linguistics. It
will take a long time before the implications of his ideas, for both applied
linguistics and theoretical linguistics, are fully worked out, because many
of his ideas were so original and innovative. But some points are clear.
Most of us would be pleased if we could make one recognized contribution
to one area of linguistics. Sinclair made substantial contributions to three
areas: language in education, discourse analysis, and corpus linguistics and
lexicography. In the case of discourse analysis and corpus-assisted lexico-
graphy, he created the very areas which he then developed.
The three areas are closely related. In the 1960s, in Edinburgh, he began
his work on spoken language, based on the belief that spoken English
would provide evidence of "the common, frequently occurring patterns of
language" (Sinclair et al. [1970] 2004: 19). One central topic in his work on
language in education was classroom discourse: authentic audio-recorded
spoken language. The work on classroom language emphasized the need
for the analysis of long texts, as opposed to the short invented sentences
which were in vogue at the time, post-1965. In turn, this led to the con-
struction of large text collections, machine-readable corpora of hundreds of
millions of running words, and to studying patterning which is visible only
across very large text collections. This allowed the construction of the se-
ries of COBUILD dictionaries and grammars for learners of English. The
COBUILD dictionaries, which were produced from the late 1980s by using
such corpus data, were designed as pedagogic tools for advanced learners
of English. There were always close relations between his interest in spok-
en language, authentic language and language in the classroom, and be-
tween his theoretical and applied interests.

3. Education and career

Sinclair was a Scot, and proud of it. He was born in 1933 in Edinburgh,
attended school there, and then studied at the University of Edinburgh,
where he obtained a first class degree in English Language and Literature
(MA 1955). He was then briefly a research student at the University, before
being appointed to a Lectureship in the Department of English Language
and General Linguistics, where he worked with Michael Halliday. His work
in Edinburgh centred on the computer-assisted analysis of spoken English
and on the linguistic stylistics of literary texts.
In 1965, at the age of 31, he was appointed to the foundation Chair of
Modern English Language at the University of Birmingham, where he then
stayed for the whole of his formal university career. His inaugural lecture
was entitled "Indescribable English" and argues that "we use language
rather than just say things" and that "utterances do things rather than just
mean things" (Sinclair 1966, emphasis in original). His work in the 1970s
focussed on educational linguistics and discourse analysis, and in the 1980s
he took up again the corpus work and developed his enormously influential
approach to lexicography.
He took partial retirement in 1995, then formal retirement in 2000, but
remained extremely active in both teaching and research. With his second
wife Elena Tognini-Bonelli, he founded the Tuscan Word Centre, and at-
tracted large numbers of practising linguists to teach large numbers of stu-
dents from around the world.1 Throughout his career he introduced very
large numbers of young researchers and PhD students - who were spread
across many countries - to work in educational topics, discourse analysis
and corpus analysis. He travelled extensively in many countries, but never-
theless spent most of his career based in Birmingham. One reason, he once
told me, was that he had built up computing facilities there, and, until the
1990s, it was simply not possible to transfer such work to other geographi-
cal locations.
He died at his home in Florence in March 2007.

4. "The search for units of meaning-

It is obviously rather simplistic to pick out just one theme in all his work,
but "the search for units of meaning" (Sinclair 1996) might not be far off.
This is the title of one of his articles from 1996. In his first work in corpus
linguistics in the 1960s, he had asked: "(a) How can collocation be objec-
tively described?" and "(b) What is the relationship between the physical
evidence of collocation and the psychological sensation of meaning?" (Sin-
clair, Jones and Daley 2004: 3). In his work on discourse in the classroom,
he was looking for characteristic units of teacher-pupil dialogue. In his later
corpus-based work he was developing a sophisticated model of extended
lexical units: a theory of phraseology. And the basic method was to search
for patterning in long authentic texts.
Along with this went his impatience with the very small number of short
invented, artificial examples on which much linguistics from the 1960s to
the 1990s was based. The title of a lecture from 1990, which became the
title of his 2004 book, also expresses an essential theme in his work: "Trust
the Text" (Sinclair 2004). He argued consistently against the neglect and
devaluation of textual study, which affected high theory in both linguistic
and literary study from the 1960s onwards (see Hoover 2007).
5. Language in education

One of his major contributions in the 1960s and 1970s was to language in
education and educational linguistics. In the 1970s, he was very active in
developing teacher-training in the Birmingham area. He regularly made the
point that knowledge about language is "sadly watered down and trivial-
ized" in much educational discussion (Sinclair 1971: 15), and he succeeded
in making English language a compulsory component of teacher-training in
BEd degrees in Colleges of Education in the West Midlands. Also in the
early 1970s, along with Jim Wight, he directed a project entitled Concept 7
to 9. This project produced innovative teaching materials. They consisted
of a large box full of communicative tasks and games. They were originally
designed for children of Afro-Caribbean origin, who spoke a variety of
English sometimes a long way from standard British English, but it turned
out that they were of much more general value in developing the communi-
cative competence of all children. The tasks focussed on "the aims of
communication in real or realistic situations" and "the language needs of
urban classrooms" (Sinclair 1973: 5). In the late 1970s, he directed a pro-
ject which developed ESP materials (English for Specific Purposes) for the
University of Malaya. The materials were published as four course books,
entitled Skills for Learning (Sinclair 1980).
In the early 1990s, he became Chair of the editorial board for the journal
Language Awareness, which started in 1992. One of his last projects is
PhraseBox. This is a project to develop a corpus linguistics programme for
schools, which Sinclair worked on from around 2000. It was commissioned
by Scottish CILT (Centre for Information on Language Teaching and Re-
search) and funded by the Scottish Executive, Learning and Teaching Scot-
land and Canan (the Gaelic College on Skye). The software gives children
in Scottish primary schools resources to develop their vocabulary and
grammar by providing them with real-time access to a 100-million-word
corpus. The project is described in one of Sinclair's more obscure publica-
tions, in West Word, a community newspaper for the western highlands in
Scotland (Sinclair 2006a).
In a word, Sinclair did not just write articles about language, but helped
to develop training materials for teachers and classroom materials for stu-
dents and pupils.
6. Discourse analysis

His contribution to discourse analysis continued his early interests in both
spoken language and in language in education. The work started formally
in 1970 in a funded research project on classroom discourse. The work was
published in 1975 in one book written with the project's co-director, Mal-
colm Coulthard, Towards an Analysis of Discourse (Sinclair and Coulthard
1975), and then a second book in 1982 with David Brazil Teacher Talk
(Sinclair and Brazil 1982).
It was through this work that I got to know John Sinclair personally. I'd
been doing my PhD in Edinburgh on classroom discourse, and joined him
and his colleague Malcolm Coulthard on a second project on discourse
analysis, which studied doctor-patient consultations, telephone conversa-
tions, and trade union / management negotiations. My job was to make and
analyse audio-recordings in the local car industry. The work from this pro-
ject was never published as a book, but individual journal articles and book
chapters appeared.
It is difficult to remember how little was available on discourse analysis
at the time of these projects, and therefore how innovative Sinclair's ap-
proach was. J. R. Firth had long ago pointed out that "conversation is much
more of a roughly prescribed ritual than most people think" (Firth 1935:
66). But in the early 1970s, the published version of John Austin's How to
Do Things with Words was still quite recent (published only ten years be-
fore in 1962). John Searle's Speech Acts had been published only two or
three years before (1969). Paul Grice had given his lectures on Logic and
Conversation some five years before (in 1967), but they were formally
published only in 1975 and seem not to have been known to the project.
This Oxford (so-called) "ordinary language" work was, however, not based
on ordinary language at all, but on invented data: anathema to Sinclair's
approach.
Michael Halliday's work was known of course, and provided a general
functional background, but even Language as Social Semiotic was not pub-
lished until several years after the project (in 1978). Anthropological and
sociological work also provided general background: Dell Hymes on the
ethnography of speaking (available since the early 1960s), Erving Goffman
on "behaviour in public places" (from the 1960s); William Labov on narra-
tives and ritual insults (from the early 1970s). Harvey Sacks' lectures were
circulating in mimeo form; I arrived in Birmingham with a small collection,
and had heard him lecture in Edinburgh around 1972; but little had been
formally published. Otherwise, in the early 1970s, work on classroom dis-
course (by educationalists such as Douglas Barnes) provided insightful
observation, but little systematic linguistic description.
There was a general feeling that discourse should somehow be studied,
but there were few if any attempts to develop formal models of discourse
structure. The two Birmingham projects were the first of their kind, but just
a few years later, "discourse analysis" had become a clearly designated
area, with its own courses and textbooks. One of the first student introduc-
tions was by Malcolm Coulthard (1977). Sinclair was following the princi-
ple proposed by J. R. Firth in the 1930s, that conversation is "the key to a
better understanding of what language really is and how it works" (Firth
1935: 71), but Sinclair's work on discourse was some ten years ahead of
the avalanche of work which it helped to start.
The aspect of the Birmingham discourse model which everyone imme-
diately grasped was the stereotypical teacher-pupil exchange. In classic
structuralist manner, Sinclair proposed that classroom discourse is hierar-
chic: a classroom lesson consists of transactions which consist of ex-
changes which consist of moves which consist of acts. It was probably the
prototypical exchange structure which everyone immediately recognized:
an IRF sequence of initiation - response - feedback (Sinclair and Coulthard
1975: 64):
I Teacher: What is the name we give to those letters? Paul?
R Pupil: Vowels.
F Teacher: They're vowels, aren't they.
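The rank scale (lesson > transaction > exchange > move > act) and the IRF exchange lend themselves to a schematic encoding. The following Python fragment is only an illustrative sketch of the example quoted above, not a notation used by Sinclair and Coulthard:

    from dataclasses import dataclass

    @dataclass
    class Move:
        role: str      # "I" initiation, "R" response, "F" feedback
        speaker: str
        text: str

    # The exchange quoted from Sinclair and Coulthard (1975: 64):
    exchange = [
        Move("I", "Teacher", "What is the name we give to those letters? Paul?"),
        Move("R", "Pupil", "Vowels."),
        Move("F", "Teacher", "They're vowels, aren't they."),
    ]

    # A prototypical teaching exchange follows the I-R-F pattern:
    assert [m.role for m in exchange] == ["I", "R", "F"]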
Nowadays, the IRF model is widely taken for granted, though I suspect that
many people who use it no longer know where it comes from.
In addition, Sinclair never abandoned an interest in literature, and his
work on text and discourse analysis always included literary texts. In the
1970s, along with the novelist David Lodge, who was his colleague in the
English Department at Birmingham, he developed a course on stylistics.
The title of an early article was "The integration of language and literature
in the English curriculum" (Sinclair 1971). For the course, they selected
extracts of literary texts in which a specific linguistic feature was fore-
grounded: such as repetition, verbless sentences, complex noun phrases,
and the like. From one end, they taught grammar through literature, and
from the other end, they showed that grammatical analysis was necessary to
literary interpretation.
The analysis of literary texts was part of Sinclair's demand that linguis-
tics must be able to handle all kinds of authentic texts. He argued further
that, if linguists cannot handle the most prestigious texts in the culture, then
there is a major gap in linguistic theory. Conversely, of course, the analysis
of literary texts must have a systematic basis, and not be the mere swapping
of personal opinions. In an analysis of a poem by Robert Graves, he argued
that the role of linguistics is to expose "the public meaning" of texts in a
language (Sinclair 1968: 216). He similarly argued that "if literary com-
ment is to be more than exclusively personal testimony, it must be inter-
pretable with respect to objective analysis" (Sinclair 1971: 17). In all of this
work there is a consistent emphasis on long texts, authentic texts, including
literary texts, and on observable textual evidence of meaning.

7. Corpus linguistics and lexicography

7.1. The "OSTI" Report

Post-1990, Sinclair was mainly known for his work in corpus linguistics.
This work started in Edinburgh, in the 1960s, and was informally published
as the "OSTI Report" (UK Government Office for Scientific and Technical
Information, Sinclair, Jones and Daley 2004). This is a report on quantita-
tive research on computer-readable corpus data, carried out between 1963
and 1969, but not formally published until 2004.
The project was in touch with the work at Brown University: Francis
and Kucera's Computational Analysis of Present Day American English,
based on their one-million-word corpus of written American English, had
appeared in 1967. But again, it is difficult to project oneself back to a pe-
riod in which there were no PCs, and in which the university mainframe
machine could handle Sinclair's corpus of 135,000 running words of spoken
language only with difficulty.
Yet the report worked out many of the main ideas of modern corpus lin-
guistics in astonishing detail. This work in the 1960s formulated explicitly
several principles which are still central in corpus linguistics today. It put
forward a statistical theory of collocation in which collocations were inter-
preted as evidence of meaning. It asked: What kinds of lexical patterning
can be found in text? How can collocation be objectively described? What
size of span is relevant? How can collocational evidence be used to study
meaning? Some central principles which are explicitly formulated include:
The unit of lexis is unlikely to be the word in all cases. Units of meaning
can be defined via statistically defined units of lexis. Homonyms can be
automatically distinguished by their collocations. Collocations differ in
different text-types. Many words are frequent because they are used in fre-
quent phrases. One form of a lemma is regularly much more frequent than
the others (which throws doubt on the lemma as a linguistic unit).
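The span-based notion of collocation underlying these principles can be sketched in a few lines. The following Python fragment is not the OSTI software; it is a minimal, hypothetical illustration of counting the collocates of a node word within a fixed span, the raw material for any statistical description of collocation:

    from collections import Counter

    def collocates(tokens, node, span=4):
        # Count every word occurring within +/- `span` tokens of `node`.
        # Real systems add frequency thresholds and association measures
        # (t-score, mutual information); this sketch only counts.
        counts = Counter()
        for i, word in enumerate(tokens):
            if word == node:
                window = tokens[max(0, i - span):i] + tokens[i + 1:i + span + 1]
                counts.update(window)
        return counts

    # Invented toy data, for illustration only:
    tokens = "there is still a long way to go before we go a long way".split()
    print(collocates(tokens, "way", span=4).most_common(3))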
It proposed that there is a relation "between statistically defined units of
lexis and postulated units of meaning" (Sinclair, Jones and Daley 2004: 6).
As Sinclair puts it in the 2004 preface to the OSTI Report, we have a "very
strong hypothesis [that] for every distinct unit of meaning there is a full
phrasal expression ... which we call the canonical form". And he formulates
one of his main ambitious aims: a list of all the lexical items in the lan-
guage with their possible variants would be "the ultimate dictionary" (Sin-
clair, Jones and Daley 2004: xxiv). In a word, the OSTI Report makes sub-
stantial progress with a question which had never had a satisfactory answer:
How can the units of meaning of a language be objectively and formally
identified? It is important to emphasize that this tradition of corpus work
was concerned, from the beginning, with a theory of meaning.
The work then had to be shelved, because the machines were simply not
powerful enough in the 1970s to handle large quantities of data. It was
started again in the 1980s as the COBUILD project in corpus-assisted lexi-
cography.

7.2. The COBUILD project

In the 1980s, Sinclair became the Founding Editor in Chief of the
COBUILD series of language reference materials. He built up the Birming-
ham corpus, which came to be called the Bank of English, and along with a
powerful team of colleagues - many of whom have made important contri-
butions to corpus linguistics in their own right - the first COBUILD dic-
tionary was published in 1987: the first dictionary based entirely on corpus
data. The team for this and later dictionaries and grammars included Mona
Baker, Joanna Channell, Jem Clear, Gwynneth Fox, Gill Francis, Patrick
Hanks, Susan Hunston, Ramesh Krishnamurthy, Rosamund Moon, Antoin-
ette Renouf and others. The first dictionary, Collins COBUILD English
Language Dictionary (Sinclair 1987b) was followed by a whole series of
other dictionaries and grammars, plus associated teaching materials, includ-
ing Collins COBUILD English Grammar (Sinclair 1990) and Collins
COBUILD Grammar Patterns (Francis, Hunston and Manning 1996).
A recent account of the COBUILD project is provided by Moon (2007),
one of the senior lexicographers in COBUILD, who worked with the pro-
ject from the beginning. She analyses why the "new methodology and ap-
proach" of the project had such "a catalytic effect on lexicography" (176).
When the project started there simply was no "viable lexicographic theory"
(177), whereas lexicography is now part of mainstream linguistics. Yet, it
was in some ways just too innovative to be a total commercial success.
There was a clash between commercial priorities and academic rigour, and
the purist approach to examples turned out to be confusing and not entirely
right for learners. Only the most advanced learners and language profes-
sionals could handle the authentic examples. By 1995 other major British
dictionary publishers (Cambridge University Press, Longman and Oxford
University Press) had copied the ideas. This was imitation as the sincerest
form of flattery, but they subtly changed the attitude to modifying attested
corpus examples and made the dictionaries more user-friendly. It remains
true however that it is the COBUILD project which developed the lexico-
graphic theory. Many of the principles of corpus compilation and analysis
are set out in Looking Up (Sinclair 1987a). The title is of course a play on
words: you look things up in dictionaries, and dictionary making is looking
up, that is, improving with new data and methods.
Many of Sinclair's main ideas are formulated in what is now a modern
classic: Corpus Concordance Collocation (Sinclair 1991). The corpus, the
concordance and the collocations are chronologically and logically related.
First, you need a corpus: a machine-readable text collection, ideally as large
as possible. Second, you need concordance software in order to identify
patterns. Third, these patterns involve collocations: the regular co-selection
of words and other grammatical features.
We have had paper concordances since the Middle Ages. But modern
concordance software can search large corpora very fast, re-order the find-
ings, and help to identify variable extended units of meaning. It is difficult
to illustrate the power of this idea very briefly, because it depends on the
analysis of very large amounts of data. But a simple example is possible. In
both the OSTI Report and in a 1999 article Sinclair points out that way is "a
very unusual word", and that "the very frequent words need to be ... de-
scribed in their own terms", since they "play an important role in phraseol-
ogy" (Sinclair 1999: 157, 159). It doesn't make much sense to ask what the
individual word way means, since it all depends on the phraseology:
all the way to school, half way through, the other way round, by the way, a
possible way of checking,... etc
Few of the very common words in the language "have a clear meaning
independent of the cotext". Nevertheless, "their frequency makes them
dominate all text" (Sinclair 1999: 158, 163).
Here is a fragment of output from some modern concordance software:
all the examples of the three words way - long - go co-occurring in a six-
million-word corpus.2 The concordance lines were generated by software
developed by Martin Warren and Chris Greaves (Cheng, Greaves and War-
ren 2006), in a project that Sinclair was involved in.
[Thirty concordance lines, e.g. "... there was still a long way to go in overcoming ...", "... vaccinations also go a long way toward eliminating ..."]

Figure 1. Six-million-word corpus: all examples of way - long - go

Data of this kind make visible the kinds of patterning which occur across
long texts, and provide observable evidence of the meaning of extended
lexical units. Several things are visible in the concordance lines. They show
that way "appears frequently in fixed sequences" (Sinclair 2004: 110), and
that the unit of meaning is rarely an individual word. Rather, words are co-
selected to form longer non-compositional units. Also, the three words way
- long - go tend to occur in still longer sequences, which are not literal, but
metaphorical. There are two main units, which have pragmatic meanings:

(a) there BE still a long way to go before...
(b) (modal) go a long way to(wards) VERB-ing ...

(a) is used in an abstract extended sense to simultaneously encourage hear-
ers about progress in the past and to warn them of efforts still required in
the future; (b) is also used in exclusively abstract senses.
Now, the word way is just one point of origin for a collocation, and it is
shown here with just two collocates. Imagine doing this with say 20,000
different words and all their frequent collocates in a corpus of 500 million
words, and you have some small impression of the ambitious range of Sin-
clair's aim of creating an inventory of the units of meaning in English.
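A toy version of such a concordance search can be written in a few lines. The sketch below is not the software behind Figure 1; it merely illustrates, on invented data, how keyword-in-context lines for a node word and a set of required collocates might be pulled out of a text:

    import re

    def kwic(text, node, required, window=6, width=40):
        # Crude tokenization, then one line of left and right co-text for
        # every occurrence of `node` whose window contains all `required`
        # collocates (an illustrative sketch only).
        tokens = re.findall(r"[a-z']+", text.lower())
        lines = []
        for i, tok in enumerate(tokens):
            if tok == node:
                span = tokens[max(0, i - window):i + window + 1]
                if all(c in span for c in required):
                    left = " ".join(tokens[max(0, i - window):i])
                    right = " ".join(tokens[i + 1:i + window + 1])
                    lines.append(f"{left:>{width}}  {tok.upper()}  {right}")
        return lines

    sample = ("There is still a long way to go before we have true democracy. "
              "Vaccinations also go a long way toward eliminating disease.")
    print("\n".join(kwic(sample, "way", {"long", "go"})))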
In a series of papers from the 1990s onwards (Sinclair 1996, 1998,
2005), he put forward a detailed model of semantic units of a kind which
had not previously been described. In these articles, he argued consistently
that "the normal earner of meaning is the phrase" (Sinclair 2005), and that
the lack of a convincing theory of phraseology is due to two things: the
faulty assumption that the word is the primary unit of meaning, and the
misleading separation of lexis and grammar. The model is extremely pro-
ductive, and many further examples have been discovered by other re-
searchers. It's all to do with observable empirical evidence of meaning, and
what texts and corpora can tell us about meaning.
The overall finding of this work is that the phraseological tendency in
language use is much greater than previously suspected (except perhaps by
a few scholars such as Dwight Bolinger, Igor Mel'cuk and Andrew Paw-
ley), and its extent can be quantified.

8. Publications

Sinclair's work was for a long time not as well known as it deserved to be.
This was partly his own fault. He often published in obscure places, not
always as obscure as community newspapers from the Scottish Highlands,
but nevertheless frequently in little known journals and book collections,
and it was only post-1990 or so that he began to collect his work into books
with leading publishers (Oxford University Press, Routledge, Benjamins).
He once told me that he had never published an article in a mainstream
refereed journal. I questioned this and cited some counter-examples, which
he argued were not genuine counter-examples, since he had not submitted
the articles: they had been commissioned. He was always very sceptical of
journals and their refereeing and gate-keeping processes, which he thought
were driven by fashion rather than by standards of empirical research.
He was also particularly proud of the fact that, when he was appointed
to his chair in Birmingham, he had no PhD and no formal publications. His
first publication was in 1965, the year when he took up his chair: it was an
article on stylistics entitled "When is a poem like a sunset?", which was
published in a literary journal (Sinclair 1965). It is a short experimental
study of the oral poetic tradition which he carried out with students. He got
them to read and memorize a ballad ("La Belle Dame Sans Merci" by
Keats) and then studied what changes they introduced into their versions
when they tried to remember the poem some time later.
His last book Linear Unit Grammar, co-authored with Anna Mauranen,
is typical Sinclair (Sinclair and Mauranen 2006). It is based on one of his
most fundamental principles: if a grammar cannot handle authentic raw
texts of any type whatsoever, then it is of limited value. The book points
out that traditional grammars work only on input sentences which have
been very considerably cleaned up (or simply invented). Sinclair and Mau-
ranen demonstrate that analysis of raw textual data is possible. On the one
hand, the proposals are so simple as to seem absolutely obvious: once
someone else has thought of them. On the other hand, they are so innova-
tive that it will take some time before they can be properly evaluated. I will
not attempt this here, and just note that the book develops the view that
significant units of language in use are multiword chunks. But here, the
approach is via a detailed discussion of individual text fragments as op-
posed to repeated patterns across large text collections. Either way, it is a
significant break with mainstream linguistic approaches.

9. In summary

First, Sinclair's work belongs to a long tradition of British empiricism and
British and European text and corpus analysis, derived from his own teach-
ers and colleagues (especially J. R. Firth and Michael Halliday), but repre-
sented in a broader European tradition (for example by Otto Jespersen) and
in a much more restricted American tradition (for example by Charles
Fries). This work, based on the careful description of texts, is very different
from the largely American tradition of invented introspective data which
provided a short interruption to this empirical tradition. As Sinclair pointed
out in a characteristically ironic aside, "one does not study all of botany by
making artificial flowers" (Sinclair 1991: 6).
Second, the description of meaning has always been at the centre of the
British Firth-Halliday-Sinclair tradition of linguistics. It is in Sinclair's
work that one finds the most sustained attempt to develop an empirical
semantics. As he said in a plenary at the AAAL (American Association for
Applied Linguistics), "corpus research, properly focussed, can sharpen
perceptions of meaning" (Sinclair 2006b).
Third, he is one of the very few linguists whose work has changed the
way we perceive language. In the words of one of his best known observa-
tions: "The language looks rather different when you look at a lot of it at
once" (Sinclair 1991: 100).
Fourth, Sinclair is one of the very few linguists who have made substan-
tial discoveries. As Wilson (1998: 61) has argued: "The true and final test
of a scientific career is how well the following declarative sentence can be
completed: He (or she) discovered that ..." Sinclair's work is full of new
findings about English, things that people had previously simply not no-
ticed, despite thousands of years of textual study. But then they are only
observable with the help of the computer techniques which he helped to
invent, and which the rest of us can now use to make further discoveries.
These include both individual phraseological units, but also methods of
analysis - how to extract patterns from raw data - and principles: in par-
ticular the extent of phraseology in language use.
Sinclair's vision of linguistics was always long-term: "a new view of
language and the technology associated with it" (Sinclair 1991: 1). He de-
veloped some of his main ideas in the 1960s, and then waited till the tech-
nology - and everyone else's ideas - had caught up with him. As he re-
marked with some satisfaction:
Thirty years ago [in the 1960s] when this research started it was considered
impossible to process texts of several million words in length. Twenty years
ago [in the 1970s] it was considered marginally possible but lunatic. Ten
years ago [in the 1980s] it was considered quite possible but still lunatic.
Today [in the 1990s] it is very popular. (Sinclair 1991: 1)
John Sinclair's work has shown how to use empirical evidence to tackle the
deepest question in the philosophy of language: the nature of units of mean-
ing.
Like many other people, I owe a very large part of my own academic
development to John Sinclair's friendship and inspiring ideas. I knew him
for over thirty years: from 1973 when he appointed me to my first academic
job (on the second project in discourse analysis) at the University of Bir-
mingham. In October 2007 in Erlangen, he was due to receive his honorary
doctorate personally, and then take part in a round table discussion, where
he would have responded, courteously but firmly, to our papers; and shown
us when we had strayed from his own rigorous standards of empirical re-
search. I was so much looking forward to seeing him again in Erlangen, and
to continuing unfinished discussions with him. I will miss him greatly, as
will friends and colleagues in many places in the world. But I am very
grateful that I had the chance to know him.

Notes

1 Some biographical details are from the English Department website at Bir-
mingham University and from obituaries in The Guardian (3 May 2007), The
Scotsman (10 May 2007) and Functions of Language 14 (2) (2007). Special is-
sues of two journals are devoted to papers on Sinclair's work: International
Journal of Corpus Linguistics 12 (2) (2007) and International Journal of Lexi-
cography 21 (3) (2008). I am grateful to Susan Hunston and Michaela Mahl-
berg for comments on a previous version of this paper.
2 The corpus consisted of Brown, LOB, Frown and FLOB plus BNC-baby: five
million words of written data and one million words of spoken data.

References

Cheng, Winnie, Chris Greaves and Martin Warren
2006 From n-gram to skip-gram to congram. International Journal of
Corpus Linguistics 11 (4): 411-433.
Coulthard, R. Malcolm
1977 An Introduction to Discourse Analysis. London: Longman.
Firth, John Rupert
1935 The technique of semantics. Transactions of the Philological Society
34(1): 36-72.
Francis, Gill, Susan Hunston and Elizabeth Manning
1996 Collins COBUILD Grammar Patterns. 2 Vols. London: Harper-
Collins.
Hoover, David
2007 The end of the irrelevant text. DHQ: Digital Humanities Quarterly 1
(2). http://www.digitalhumanities.org/dhq/vol/001/2/index.html, ac-
cessed 3 Nov 2007.
Kucera, Henry and W. Nelson Francis
1967 Computational Analysis of Present Day American English. Provi-
dence: Brown University Press.
Moon, Rosamund
2007 Sinclair, lexicography and the COBUILD project. International
Journal of Corpus Linguistics 12(2): 159-181.
Sinclair, John McH.
1965 When is a poem like a sunset? A Review of English Literature 6 (2):
76-91.
Sinclair, John McH.
1966 Indescribable English. Inaugural lecture, University of Birmingham.
Abstract in Sinclair and Coulthard 1975: 151.
Sinclair, John McH.
1968 A technique of stylistic description. Language and Style 1: 215-242.
Sinclair, John McH.
1971 The integration of language and literature in the English curriculum.
Educational Review 23 (3). Page references to reprint in Literary
Text and Language Study, Ronald Carter and Deirdre Burton (eds.).
London: Arnold, 1982.
Sinclair, John McH.
1973 English for effect. Commonwealth Education Liaison Newsletter 3
(11): 5-7.
Sinclair, John McH. (ed.)
1980 Skills for Learning. Nelson: University of Malaya Press.
Sinclair, John McH. (ed.)
1987a Looking Up. London: Collins.
Sinclair, John McH. (ed.)
1987b Collins COBUILD English Language Dictionary. London: Harper-
Collins.
Sinclair, John McH. (ed.)
1990 Collins COBUILD English Grammar. London: HarperCollins.
Sinclair, John McH.
1991 Corpus Concordance Collocation. Oxford: Oxford University Press.
Sinclair, John McH.
1996 The search for units of meaning. Textus 9 (1): 75-106.
Sinclair, John McH.
1998 The lexical item. In Contrastive Lexical Semantics, Edda Weigand
(ed.), 1-24. Amsterdam: Benjamins.
Sinclair, John McH.
1999 A way with common words. In Out of Corpora, Hilde Hasselgard
and Signe Oksefjell (eds.), 157-179. Amsterdam: Rodopi.
Sinclair, John McH.
2004 Trust the Text. London: Routledge.
Sinclair, John McH.
2005 The phrase, the whole phrase and nothing but the phrase: Plenary.
Phraseology 2005, Louvain-la-Neuve, October 2005.
Sinclair, John McH.
2006a A language landscape. West Word. January 2006. http://road-to-the-
isles.org.uk/westword/jan2006.html, accessed 3 Nov 2007.
Sinclair, John McH.
2006b Small words make big meanings: Plenary. AAAL (American Asso-
ciation for Applied Linguistics).
Sinclair, John McH. and David Brazil
1982 Teacher Talk. Oxford: Oxford University Press.
Sinclair, John McH. and R. Malcolm Coulthard
1975 Towards an Analysis of Discourse. London: Oxford University
Press.
Sinclair, John McH. and Anna Mauranen
2006 Linear Unit Grammar. Amsterdam: Benjamins.
Sinclair, John McH., Susan Jones and Robert Daley
2004 English Collocation Studies: The OSTI Report. R. Krishnamurthy
(ed.). London: Continuum. Original mimeoed report 1970.
Wilson, Edward O.
1998 Consilience: The Unity of Knowledge. London: Abacus.

Corpora

BNC Baby The BNC Baby, version 2. 2005. Distributed by Oxford University
Computing Services on behalf of the BNC Consortium. URL:
http://www.natcorp.ox.ac.uk/.
BROWN A Standard Corpus of Present-Day Edited American English, for use
with Digital Computers (Brown). 1964, 1971, 1979. Compiled by W.
N. Francis and H. Kucera. Brown University. Providence, Rhode Is-
land.
FROWN The Freiburg-Brown Corpus ('Frown') (original version) compiled
by Christian Mair, Albert-Ludwigs-Universität Freiburg.
LOB The LOB Corpus, original version (1970-1978). Compiled by Geof-
frey Leech, Lancaster University, Stig Johansson, University of Oslo
(project leaders) and Knut Hofland, University of Bergen (head of
computing).
FLOB The Freiburg-LOB Corpus ('F-LOB') (original version) compiled by
Christian Mair, Albert-Ludwigs-Universität Freiburg.
Corpus, lexis, discourse: a tribute to John Sinclair

Stig Johansson*

It is an honour to have been asked to give this speech for John Sinclair,
pioneer in corpus linguistics, original thinker and a source of inspiration
for countless language students.
The use of corpora, or collections of texts, has a venerable tradition in
language studies. Many important works have drawn systematically on
evidence from texts. To take just two examples, the great grammar by Otto
Jespersen was based on collections of several hundred thousand examples.
The famous Oxford English Dictionary could use several million examples
collected from English texts. There is no doubt that the data collections, or
rather the intelligent use of evidence from the collections, contributed
greatly to the success of these monumental works.
But these data collections had the drawback that the examples had been
collected in a more or less impressionistic manner, and there is no way of
knowing what had been missed. Working in this way, there is a danger that
the attention is drawn to oddities and irregularities and that what is most
typical is overlooked. Just as important, the examples were taken out of
their context.
When we talk about corpora these days, we think of collections of run-
ning text held in electronic form. Given such computer corpora, we can
study language in context, both what is typical and what is idiosyncratic.
This is where we have an edge on Jespersen and the original editors of the
Oxford English Dictionary. With the computational analysis tools which
are now available we can observe patterns that are beyond the capacity of
ordinary human observation.
The compilation and use of electronic corpora started some forty to fifty
years ago. At that time, corpora were small by today's standards, and they
were difficult to compile and use. There were also influential linguists who

* Sadly, Stig Johansson died in April 2010. The editors of this volume would
like to express their thanks to Professor Hilde Hasselgård for taking care of
the final version of his paper.
rejected corpora, notably Noam Chomsky and his followers. Those who
worked with corpora were a small select group. One of them was John
Sinclair.
In the course of the last few decades there has been an amazing devel-
opment, made possible by technological advances but also connected with
the foresight and ability of linguists like John Sinclair to see the possibili-
ties of using the new tools for the study of language. We now have vast
text collections, numbering several hundred million words, and analysis
tools that make it possible to use these large data sources. The number of
linguists working with computer corpora has grown from a select few to an
ever increasing number, so that Jan Svartvik, another corpus pioneer, could
say in the 1990s that "corpora are becoming mainstream" (Svartvik 1996).
We also have a new term for the study of language on the basis of com-
puter corpora: corpus linguistics. As far as I know, this term was first used
in the early 1980s by another pioneer, Jan Aarts from the University of
Nijmegen in Holland (see Aarts and Meijs 1984). Now it has become a
household word. A search on the Internet provides over a million hits.
Many people working with corpora probably associate the beginnings
of corpus linguistics with Randolph Quirk's Survey of English Usage, a
project which started in the late 1950s, but the corpus produced by Quirk
and his team was not computerised until much later. What really got the
development of computer corpora going was the Brown Corpus, compiled
in the early 1960s by W. Nelson Francis and Henry Kucera at Brown Uni-
versity in the United States. The Brown Corpus has been of tremendous
importance in setting a pattern for the compilation and use of computer
corpora. Not least, it was invaluable that the pioneers gave researchers
across the world access to this important data source, which has been used
for hundreds of language studies: in lexis, grammar, stylistics, etc.
Around this time John Sinclair was engaged in a corpus project in Brit-
ain. The reason why this is less known is probably that the corpus was not
made publicly available. We can read about the project in a book published
a couple of years ago: John M. Sinclair, Susan Jones and Robert Daley,
English Collocation Studies: The OSTI Report, edited by Ramesh Krish-
namurthy, including a new interview with John M. Sinclair, conducted by
Wolfgang Teubert. The book is significant both because it gives access to
the OSTI Report, which had been difficult to get hold of, and because of
the interview, which gives insight into the development of John Sinclair's
thinking.
The OSTI Report was, as we can read on the title page, the final report
to the Office for Scientific and Technical Information (OSTI) on the Lexi-
cal Research Project C/LP/08 for the period January 1967 - September
1969, and it was dated January 1970, but the project had started in 1963.
There are two things which I find particularly significant in connection
with this project. In the first place, it included the compilation of a corpus
of conversation, probably the world's first electronic corpus of spoken
language compiled for linguistic studies. The corpus was fairly small,
about 135,000 words, but considering the difficulties of recording, tran-
scribing and computerising spoken material, this was quite an achievement.
In addition, some other material was used for the project, including the
Brown Corpus. The most significant aspect of the project was that the fo-
cus of the study was on lexis. We should remember that at this time lexis
was disregarded, or at least underestimated, by many - perhaps most -
linguists, who regarded the lexicon as a marginal part attached to grammar.
Schematically, we could represent it in this way:

[Diagram: a 'Lexicon' box attached, as a marginal part, to a 'Grammar' box]

Perhaps the most enduring contribution of John Sinclair's work is that he
has redefined lexis and placed it at the centre of the study of language. This
is how he views the relationship between lexis and grammar in a paper
published as Sinclair (1999: 8):2

[Diagram (Sinclair 1999: 8): 'Lexical items' at the centre, accompanied by 'residual grammar' and '(no independent semantics)']

I will come back later to the notion of lexical item. Let's return to the ori-
gin of John Sinclair's thinking on lexis. We find it in the OSTI Report and
in a paper with the title "Beginning the study of lexis", published for a
collection of papers in memory of his mentor J. R. Firth (Bazell et al.
1966). Firth had stressed the importance of collocations, representing the
significant co-occurrence of words. But he did not have the means of ex-
ploring this beyond typical examples, such as dark night (Firth 1957: 197).
What is done in the OSTI Report is that systematic procedures are de-
vised for defining collocations in the corpus. Here we find notions such as
node, collocate and span, which have become familiar later:
A node is an item whose total pattern of co-occurrence with other words is
under examination; a collocate is any one of the items which appears with
the node within the specified span. (Sinclair, Jones and Daley [1970] 2004:
10)
In the interview with Wolfgang Teubert, John Sinclair reports that the op-
timal span was calculated to be four words before and four words after the
node, and he says that, when this was re-calculated some years ago based
on a much larger corpus, they came to almost the same result (Sinclair,
Jones and Daley 2004: xix).
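To make the notions of node, collocate and span more concrete, a minimal sketch of such a procedure is given below; it assumes a simple list of word tokens and a span of four words on either side, and the function name, the toy sentence and all other details are illustrative rather than a reconstruction of the original OSTI software.

    from collections import Counter

    def collocates(tokens, node, span=4):
        # collect every word that occurs within `span` positions of the node
        counts = Counter()
        for i, token in enumerate(tokens):
            if token == node:
                window = tokens[max(0, i - span):i] + tokens[i + 1:i + span + 1]
                counts.update(window)
        return counts

    # toy illustration only
    text = "they teetered on the brink of disaster and pulled back from the brink of collapse"
    print(collocates(text.split(), "brink").most_common(3))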
It was a problem that the corpus was rather small for a systematic study
of collocations. In the opening paragraph of the paper I just referred to,
John Sinclair says:
[... ] if one wishes to study the 'formal' aspects of vocabulary organization,
all sorts of problems lie ahead, problems which are not likely to yield to
anything less imposing than a very large computer. (Sinclair 1966: 410)
Later in the paper we read that "it is likely that a very large computer will
be strained to the utmost to cope with the data" (Sinclair 1966: 428). There
was no way of knowing what technological developments lay ahead, and
that we would get small computers with an infinitely larger capacity than
the large computers at the time this was written.
John Sinclair says that he did very little work on corpora in the 1970s
(Sinclair, Jones and Daley 2004: xix), frustrated by the laboriousness of
using the corpus and by the poor analysis programs which were available.
But he and his team at Birmingham did ground-breaking work on dis-
course, leading to an important publication on the English used by teachers
and pupils (Sinclair and Coulthard 1975). As I have understood it, what
was foremost for John Sinclair was his concern with discourse and with
studying discourse on the basis of genuine data. We must "trust the text",
as he puts it in the title of a recent book (Sinclair 2004). This applies both
to the discourse analysis project and to his corpus work.
Around 1980 John Sinclair was ready to return to corpus work. We
were fortunate to have him as a guest lecturer at the University of Oslo in
February 1980, and a year later he attended a conference in Bergen, the
second conference of ICAME, the International Computer Archive of
Modern English, as it was called at the time. A spinoff from the conference
was the publication of a little book on Computer Corpora in English Lan-
guage Research (Johansson 1982). The opening contribution was a vision-
ary paper by John Sinclair called "Reflections on computer corpora in Eng-
lish language research" (Sinclair 1982). In just a few pages he outlines a
program for corpus studies: he draws attention to the new possibilities of
building large corpora using typesetting tapes and optical scanning; he
stresses that we need very large corpora to cope with lexis; and this is, I
believe, where he first introduces his idea of monitor corpora, large text
collections changing with the development of the language.
The 1980s represents the breakthrough of the use of corpora in lexical
studies. John Sinclair and his team in Birmingham started the building of a
large corpus and initiated the COBUILD project (Sinclair 1987) which led
to the first corpus-based dictionary: The Collins COBUILD English Lan-
guage Dictionary. There were a number of innovative features of this dic-
tionary: it was based on fresh evidence from the corpus; the selection of
words was corpus-based, and so was the selection and ordering of senses;
there were large numbers of examples drawn from the corpus; a great deal
of attention was given to collocations; definitions were written in a new
way which simultaneously defined the meaning of words and illustrated
their collocational patterns, etc. Later dictionaries have not followed suit in
all respects, but it is to the credit of the work of John Sinclair and his team
that English dictionaries these days cannot do without corpora.
Later John Sinclair developed his ideas in a steady stream of conference
papers, articles and books. I cannot comment on all of these, but would like
to give a couple of illustrations from his work. The first is from a paper
called "The computer, the corpus, and the theory of language" (Sinclair
1999), the source of the diagram shown above (p. 19). Consider the noun
brink, if we examine its collocations, we discover a consistent pattern. I
have made a collocation study based on the British National Corpus, which
contains a hundred million words:
No.  Word          Total no. in the whole BNC   As collocate   In no. of texts   Mutual information value
 1   teetering                             68             15                15                  8.734477
 2   teetered                              44              9                 9                  8.658970
 3   poised                               683             10                10                  6.022025
 4   starvation                           466              6                 5                  5.893509
 5   hovering                             416              5                 5                  5.824688
 6   extinction                           562              5                 5                  5.523871
 7   bankruptcy                           999              7                 7                  5.285090
 8   collapse                            2568             17                17                  5.228266
 9   disaster                            2837             13                12                  4.860382
10   destruction                         2360              5                 5                  4.088956

Here we find some verbs: teetering, teetered, poised, hovering and some
nouns denoting disasters: starvation, extinction, bankruptcy, collapse, dis-
aster, destruction. These are the words which most typically co-occur with
brink, identified by a measure of co-occurrence called mutual information.
The pattern could have been shown more clearly if I had given the lists for
left and right contexts separately, but there should be no need to do this for
the present purpose. The results agree very well with the findings presented
in John Sinclair's article, though he used a different corpus. He summa-
rises the results in this way (Sinclair 1999: 12):
[A] M/I prep the E of D
This is a lexical item. It is used about some actor (A) who is on (I), or is
moving towards (M), the edge (E) of something disastrous (D). It has an
invariable core, brink, and there are accompanying elements which con-
form to the formula. By using the item "the speaker or writer is drawing
attention to the time and risk factors, and wants to give an urgent warning"
(loc. cit.). There is a negative semantic prosody, reflecting the communica-
tive purpose of the item.
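For readers unfamiliar with the measure, mutual information relates the observed co-occurrence frequency of node and collocate to the frequency that would be expected if the two words were distributed independently. The sketch below gives only this basic formula; since real implementations normally also normalize for the size of the span, it is not claimed to reproduce the exact figures in the table above, and all variable names and the example frequencies are illustrative assumptions.

    import math

    def mutual_information(pair_freq, node_freq, coll_freq, corpus_size):
        # log2 of observed over expected co-occurrence frequency
        expected = node_freq * coll_freq / corpus_size
        return math.log2(pair_freq / expected)

    # purely hypothetical frequencies, for illustration only
    print(round(mutual_information(20, 1000, 500, 1_000_000), 2))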
Let's take a second example, from a book called Reading Concordances
(Sinclair 2003: 141-151). This is a bit more complicated. How do we use
the sequence true feelings?
After examining material from his corpus, John Sinclair arrives at the
following analysis:

              SEMANTICS       GRAMMAR        CORE

PROSODY       reluctance

PREFERENCE    expression      possession

COLLIGATION   verb            verb           poss. adj.

COLLOCATION   hide            his
              reveal          their          true feelings
              express         your

EXAMPLES      less open about showing their true feelings
              you'll be inclined to hide your true feelings

The communicative purpose is to express reluctance. There is a semantic
preference for some expression of reluctance plus a reference to the person
involved. Colligation defines the grammatical structure. And the colloca-
tions show how the forms may vary. The main claim is that text is made up
of lexical items of this kind, where there is not a strict separation between
form and meaning. Lexis rs more than it used to be. Grammar is seen as
residual, adjusting the text after the selection of the lexical items. The pre-
occupation with lexis does not mean that John Sinclair has neglected
grammar. He has also produced grammar books, most recently Linear Unit
Grammar, co-authored with Anna Mauranen, a new approach designed to
integrate speech and writing (Sinclair and Mauranen 2006).
It is time to sum up. I will start by quoting a passage from the introduc-
tion to a festschrift for John Sinclair:
The career of John McH. Sinclair, Professor of Modern English Language at
the University of Birmingham, has been characterised by an unending
stream of original ideas, sometimes carefully worked out in detail, some-
times casually tossed out in papers in obscure volumes, occasionally devel-
oped in large research teams, often passed to a thesis student in the course of
conversation. Indeed it is hard to imagine him writing or saying anything
derivative or dull, and, while the reader may on occasion be driven to disagree
with him, he or she is never tempted to ignore him. (Hoey 1993: v)
The author of these lines, Michael Hoey, professor of English language at
the University of Liverpool, is one of many former students of John Sin-
clair who now hold important academic posts. Another one is Michael
Stubbs, professor of English linguistics at the University of Trier.
I whole-heartedly agree with Michael Hoey. In my talk I have not been
able to do justice to the whole of John Sinclair's contribution to linguistics.
He was a multifaceted man. He was concerned both with linguistic theory
and its applications, above all in lexicography. The three words I selected
for the title of my talk - corpus, lexis, discourse - are keywords in John
Sinclair's work, as I see it. There is a remarkable consistency, all the way
back to his paper on "Beginning the study of lexis" (Sinclair 1966). By
consistency I do not mean stagnation. What is consistent is his way of
thinking - original, always developing, yet never letting go of the thought
that the proper concern of linguistics is to study how language is actually
used and how it functions in communication - through corpora, to lexis
and discourse.

Notes

1 This text was prepared for the ceremony in connection with the award of an
honorary doctorate to John Sinclair (Erlangen, November 2007). As the audi-
ence was expected to be mixed, it includes background information which is
well-known among corpus linguists. The text is virtually unchanged as it was
prepared early in 2007, before we heard the sad news that John had passed
away. The tense forms have not been changed.
2 See also Sinclair (1998).

References

Aarts, Jan and Willem Meijs
1984 Corpus Linguistics: Recent Developments in the Use of Computer
Corpora in English Language Research. Amsterdam: Rodopi.
Bazell, Charles Ernest, John Cunnison Catford, Michael Alexander Kirkwood
Halliday and Robert H. Robins (eds.)
1966 In Memory of J. R. Firth. London: Longman.
Firth, John Rupert
1957 Papers in Linguistics 1934-1951. London: Oxford University Press.
Hoey, Michael (ed.)
1993 Data, Description, Discourse: Papers on the English Language in
Honour of John McH. Sinclair on his Sixtieth Birthday. London:
HarperCollins.
Johansson, Stig (ed.)
1982 Computer Corpora in English Language Research. Bergen: Norwe-
gian Computing Centre for the Humanities.
Sinclair, John McH.
1966 Beginning the study of lexis. In In Memory of J. R. Firth, Charles
Ernest Bazell, John Cunnison Catford, Michael Alexander Kirkwood
Halliday and Robert H. Robins (eds.), 410-430. London: Longman.
Sinclair, John McH.
1982 Reflections on computer corpora in English language research. In
Computer Corpora in English Language Research, Stig Johansson
(ed.), 1-6. Bergen: Norwegian Computing Centre for the Humanities.
Sinclair, John McH. (ed.)
1987 Looking Up: An Account of the COBUILD Project in Lexical Com-
puting. London: Collins ELT.
Sinclair, John McH.
1998 The lexical item. In Contrastive Lexical Semantics, Edda Weigand
(ed.), 1-24. Amsterdam: John Benjamins.
Sinclair, John McH.
1999 The computer, the corpus and the theory of language. In Transiti
Linguistici e Culturali. Atti del XVIII Congresso Nazionale
dell'A.I.A. (Genova, 30 Settembre - 2 Ottobre 1996), Gabriele Azza-
ro and Margherita Ulrych (eds.), 1-15. Trieste: E.U.T.
Sinclair, John McH.
2003 Reading Concordances: An Introduction. London: Pearson Educa-
tion.
Sinclair, John McH.
2004 Trust the Text: Language, Corpus and Discourse. Edited with
Ronald Carter. London/New York: Routledge.
Sinclair, John McH. and R. M. Coulthard
1975 Towards an Analysis of Discourse: The English Used by Teachers
and Pupils. London: Oxford University Press.
Sinclair, John McH., Susan Jones and Robert Daley
2004 English Collocation Studies: The OSTI Report, Ramesh Krishna-
murthy (ed.), including a new interview with John M. Sinclair, con-
ducted by Wolfgang Teubert. London/New York: Continuum. First
published in 1970.
Sinclair, John McH. and Anna Mauranen
2006 Linear Unit Grammar. Amsterdam: John Benjamins.
Svartvik, Jan
1996 Corpora are becoming mainstream. In Using Corpora for Language
Research, Jenny Thomas and Mick Short (eds.), 3-13. London/New
York: Longman.

Corpus

BNC The British National Corpus. Distributed by Oxford University
Computing Services on behalf of the BNC Consortium.
http://www.natcorp.ox.ac.uk/.
Choosing sandy beaches - collocations, probabemes and the idiom principle
Thomas Herbst

... a language user has available to him or her a large number of semi-
preconstructed phrases that constitute single choices (Sinclair 1991: 110)
... patterns of co-selection among words, which are much stronger than any
description has yet allowed for, have a direct connection with meaning.
(Sinclair 2004: 133)
... a text is a unique deployment of meaningful units, and its particular
meaning is not adequately accounted for by any organized concatenation of
the fixed meanings of each unit. (Sinclair 2004: 134)

1. The idiom principle

In a volume compiled in honour of John Sinclair, there is no need to explain
or to defend these positions, which, after all, are central to his ap-
proach. In this article, I would like to raise a few questions concerning the
notions of single choice and meaningful units and relate them to approaches
addressing similar issues that have gained increasing popularity recently.
Single choice can be interpreted to mean that several words - or, as will
be argued below, a word and a particular grammatical construction - are
chosen simultaneously by a speaker to express a particular meaning.2 This
presumably implies that this meaning (or something resembling this mean-
ing) exists before it is being expressed by the speaker. In the case of the
following sentence (taken from a novel by David Lodge)
(1) The winter term at Rummidge was of ten weeks' duration ... <NW214>
one could argue that the choice of the word duration entails a choice of the
preposition of and the use of the verb be and that this single choice ex-
presses a particular meaning or perhaps a "conceptual unit". The same - or
a very similar meaning - could be expressed using the verb last
(la) The winter term at Rummidge lasted for ten weeks.
Be + of + duration can thus be seen as a "simultaneous choice of ...
words", which is a fundamental component of Sinclair's (1991: 110) idiom
principle. By formulating the idiom principle, Sinclair has given a theoretical
perspective to the insights into the extent of recurrent co-occurrences of
words in texts that has been brought to light by large-scale corpus analyses
which had become possible through access to large computerized corpora
such as the COBUILD corpus.3 This has opened up a new view on the
phraseological element in language by making it central to language de-
scription and no longer regarding it as peripheral, marginal or somehow
special, as was the case in generative grammar and, in a different way, also
in traditional phraseology.4 At the same time, the focus has shifted away
from true idioms, proverbs etc. to other types of phraseological chunks.
The distinction between the open-choice principle and the idiom princi-
ple has had considerable impact on recent linguistic research. One of the
reasons for this may be that it is the outcome of empirical corpus analysis
and thus has been arrived at on the basis of the analysis of language use,
which distinguishes it from the inductive approach taken, for example, by
Chomsky.
A second reason can be found in the fact that the notion of the idiom
principle is attractive to foreign language linguistics because it can serve to
explain what is wrong with particular instances of learner language or
translated text (Hausmann 1984; Granger 1998, this volume; Gilquin 2007;
Nesselhauf 2005; Herbst 1996 and 2007). The emphasis put on aspects of
language such as collocation or valency phenomena in this area goes hand
in hand with a focus on specific lexical properties in terms of the idiom
principle. This in turn provides an interesting parallel to some approaches
within the framework of cognitive linguistics, especially construction
grammar.
In a way, one could say that what all these approaches have in common
is that they take the phenomenon of what one could call irregularity rather
seriously by saying that idiosyncratic properties of lexical items, especially
the tendency to occur together with certain other linguistic units, is a central
phenomenon of language and not something that could be relegated to the
appendix of an otherwise neatly organized grammar or to the periphery of
linguistic theory.5 This is accompanied by the recognition of units larger
than the individual word. The parallels between different approaches even
show in the phrasing: thus Sinclair (1991: 110) speaks of "semi-
preconstructed phrases" and Hausmann (1984: 398) of "Halbfertigprodukte
der Sprache". Nevertheless, it must not be overlooked that the motivation
for the interest in such units is slightly different:
- corpus linguistics provides the ideal tool for identifying recurrent
combinations of words, in particular the occurrence of n-tuples;
- foreign language linguistics looks at chunks from the point of view of
production difficulty caused by unpredictability;
- construction grammar and related approaches are interested in chunks
in terms of form-meaning pairings,6 investigate the role chunks play in
the process of first language acquisition and look at the way one could
imagine chunks to be represented in the mental lexicon.
These differences in approach may also result in differences with respect to
the concepts developed within the various frameworks. Thus it is worth
noting that Franz Josef Hausmann (2004: 320), one of the leading colloca-
tion specialists of the foreign language linguistics orientation, recognizes
the value of the research on collocation in corpus linguistics but at the same
time speaks of "terminological war" and suggests that corpus linguists
should find a different term for what they commonly refer to as collocation.
Similarly, despite obvious points of contact between at least some conclu-
sions arrived at by Sinclair and some concepts of construction grammar,
"the main protagonists appear to see the similarity between the approaches
as merely superficial", as Stubbs (2009: 27) puts it. It may thus be worth-
while discussing central units identified in these different frameworks with
respect to criteria such as
- delimitation and size;
- semantic compositionality and semantic autonomy of the component
parts, and, related to that;
- predictability for the foreign learner.

2. Collocations - compounds - concepts

2.1. Unpredictability and single choice

The different uses of the term collocation referred to by Hausmann focus
on two types of combination - the sandy beaches and the false teeth type
(Herbst 1996). Sandy beaches is a typical example of the kind of colloca-
tion identified by corpus research because it is significant as a combination
on the basis of the frequency with which the two words co-occur in the
language, i.e. statistically significant. False teeth, on the other hand, repre-
sents the type of collocation that foreign language linguistics has focussed
on because the combination of the two items is unpredictable for a foreign
learner of the language, i.e. semantically significant. This distinction be-
tween a statistically-oriented and a significance-oriented approach to collo-
cation is also made for example by Nadja Nesselhauf (2005: 12), who dis-
tinguishes between a frequency-based and a phraseological approach, or
Dirk Siepmann (2005: 411), who uses the terms frequency-based and se-
mantically-based.7
Collocations such as false teeth have been called "encoding idioms" (a
term used, for example, by Makkai (1972) or by Croft and Cruse (2004:
250)), which refers to the fact that they can be easily interpreted but there is
no way of knowing that the established way of expressing this meaning is
false teeth and not artificial teeth (as opposed to artificial hip versus ?false
hip). Similarly, foreign learners of English have no way of predicting com-
binations such as lay the table or strong tea. Hausmann uses semantic con-
siderations to distinguish between the two components of such combina-
tions as the base, which is semantically autonomous ("semantisch
autonom", Hausmann 1984: 401), and the collocate, which cannot be de-
fined, learnt or translated without the base (Hausmann 2007: 218).8 This is
particularly relevant to foreign language teaching and lexicography - at
least in dictionaries which aim to be production dictionaries for foreign
learners. The distinction between Basis (base) and Kollokator (collocate)
introduced by Hausmann (1984: 401) forms the basis for an adequate lexi-
cographical treatment of such collocations: the foreign learner looking for
adjectives to qualify tea or gale will have to find weak and strong under tea
and light and strong under gale.
The unpredictability of such combinations arises from the fact that out
of a range of possible collocates for a base, only one or several - but not all
- can be regarded as established in language use. Thus, heavy applies to
rain, rainfall, storm, gale, smoker or drinking but not to tea, coffee or taste,
whereas strong can be used for wind, gale, tea, coffee and taste but not
storm, rain, rainfall, smoker or drinking etc. Interestingly, the semantic
contribution of these collocates can be characterized in terms of degree of
intensity in some way or another.9

BNC        rain   rainfall   wind   storm   gale   smoker   drinking   tea   coffee   taste

heavy       254         21      5       5      1       47         43     0        0       0
strong        0          0    223       0      4        0          0    28       19       3
severe        2          0      1      15     10        0          0     0        0       0
light        47          1     61       0      0        1          0     2        0       2
slight        0          0      3       0      0        0          0     0        0       2
moderate      0          5      8       0      3        2         11     0        0       1
weak          0          0      0       0      0        0          0    16        3       0

What this table illustrates, however, is the limited combinability of certain
adjectives (or adjective meanings) with certain nouns, which provides (at
least when we consider established usage) an interesting case of restrictions
on the operation of the open-choice principle.10 This sort of situation finds a
direct parallel in the area of valency, where, while certain generalizations
with respect to argument structure constructions are certainly possible
(Goldberg 1995, 2006), restrictions on the co-occurrence of particular
valency carriers with particular valency patterns will also have to be ac-
counted for (Herbst 2009; Herbst and Uhrig 2009; Faulhaber 2011):11

valency pattern               consider   judge   call   count   regard   think   see

NP + Vact + NP + NP               +         +      +       +                +
NP + Vact + NP + as NP            +         +              +        +                +
NP + Vact + of NP + as NP                                                    +

etc.

Thus if one takes the definition of manage provided by the Cobuild English
Language Dictionary (mi)
(2) If you manage to do something, you succeed in doing it.
one could argue that the choice of a particular valency carrier such as man-
age or succeed entails a simultaneous choice of a particular valency pattern
(and obviously the fact that manage combines with a [to INF]-complement
and succeed with an [in V-ing]-complement must be stored in the mind
and is unpredictable for the foreign learner of the language).
If we consider heavy and strong or the valency patterns listed (and the
list of patterns and verbs could be expanded to result in an even more com-
plex picture), then we are confronted with important combinatorial proper-
ties. Methodologically, it is not easy to decide whether (or in which cases)
these should be described as restrictions or merely as preferences: while the
BNC does not show any occurrences of strong + storm/s or strong + rainfall/s
(span ±5), for example, it does contain 5 instances of heavy wind/s
(and artificial teeth). Similarly, extremely rare occurrences of a verb in a
pattern in which it does not "normally" occur such as
(3) ... they regard management a very important ingredient within their strategy
<BNC:HJ5 6002>
cannot be regarded as a sufficient reason to see this as an established use
and to make this pattern part of the valency description of that verb. Never-
theless, acceptability judgments are highly problematic in this area, which
is why it may be preferable to speak of established uses, which however
means that frequency of occurrence is taken into account. In any case, the
observation that the collocational and colligational properties of words
display a high degree of idiosyncrasy is highly compatible with usage-
based models of cognitive linguistics (e.g. Tomasello 2003; Lieven forth-
coming; Behrens 2007; or Bybee 2007). 12
Although cases such as heavy rain or heavy drinking represent an argu-
ment in favour of storage, it is not necessarily the case that we have to talk
about a single choice. In some cases it is quite plausible to assume, as
Hausmann (1985: 119, 2007: 218) does, that the base is chosen first and
then a second choice limited by the collocational options of that base takes
place. This choice, of course, can be relatively limited as in the case of
Hausmann's (1984: 401-402, 2007: 218) examples schütteres Haar and
confirmed bachelor/eingefleischter Junggeselle or, on a slightly different
line, physical attack, scientific experiment and full enquiry discussed by
Sinclair (2004: 21).
In other cases, however, accounting for a collocation in terms of a single
choice option may be more plausible. This applies to examples such as
white wine or red wine, also commented on by Sinclair (2004). These show
great similarity to compounds but they allow interpolation -
(4) white Franconian wine
- and predicative uses
(5) Wines can be red, white or rosé, and still or sparkling - where the natural
carbon dioxide from fermentation is trapped in the wine. <BNC: C9F 2172>
and also, of course, uses such as
(6) 'I always know when I'm in England,' said Morris Zapp, as Philip Swallow
went off, 'because when you go to a party, the first thing anyone says to you
is, "Red or whiter" <NW323>
Compared with the heavy rain examples, white wine presents a much
stronger case for conceptualization - partly because the meaning associated
with the adjective is much more item-specific and more complex in its se-
mantics than mere intensification. To what extent white wine (which, as
everybody is aware, is not wine with milk) is a description of wine or a clas-
sification of wine is difficult to decide on linguistic grounds.
It thus seems that amongst semantically significant collocations one can
distinguish - at least prototypically - between
- collocations which like white wine represent a unified concept and
thus can be described in terms of a single choice and
- collocations such as heavy rain or strong wind where the relationship
of the two components can be seen as that of a base and a modifying
collocate.
In both cases, however, there is an element of storage involved.

2.2. Sandy beaches and high tide - concepts?

In this respect, the case of sandy beaches seems to be quite different. One
could argue that the mere fact that beaches and sandy co-occur with a log-
likelihood value of 3089.83 ({beach/N} ±3) has to do with the fact that
beaches are often sandy and therefore tend to be described with the adjec-
tive sandy. In this respect, sandy beaches can be compared to the collocates
of winds, where the fact that the BNC contains 22 instances of westerly
wind/winds, 12 of south-westerly wind/winds and only 2 of southerly
wind/winds ({wind/N} -1) is a reflection of the facts of the world discussed
in the texts of the corpus rather than of the language.
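Log-likelihood scores of the kind quoted here are usually calculated from a 2x2 contingency table of observed and expected frequencies for the word pair. The sketch below shows the standard calculation in this general form; the variable names are chosen for the illustration and are not taken from any particular concordancing tool.

    import math

    def log_likelihood(pair_freq, node_freq, coll_freq, corpus_size):
        # observed cell counts of the 2x2 contingency table
        observed = [
            pair_freq,
            node_freq - pair_freq,
            coll_freq - pair_freq,
            corpus_size - node_freq - coll_freq + pair_freq,
        ]
        # expected cell counts under the assumption of independence
        expected = [
            node_freq * coll_freq / corpus_size,
            node_freq * (corpus_size - coll_freq) / corpus_size,
            (corpus_size - node_freq) * coll_freq / corpus_size,
            (corpus_size - node_freq) * (corpus_size - coll_freq) / corpus_size,
        ]
        return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)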
Sandy beaches could then be analysed as a free combination - the type
that Hausmann (1984: 399-400) refers to as a "Ko-Kreation" - and this
could be taken as an argument for not including sandy beaches in a diction-
ary - at least not as a significant combination, although perhaps, like in
Cobuild1 and LDOCE5,13 as an example of a rather typical use.
On the other hand, in German the meaning of sandy beaches is usually
expressed by a compound - Sandstrände. The question to be asked is
whether it is realistic to imagine that the same meaning or concept is real-
ized by a free combination in one language and by a compound in another,
in other words, whether the meaning of the German lexicalized compound
can be seen as the same as that of the collocation in English.
Furthermore, sandy beaches tend to come as a relatively fixed chunk. Of
the 258 occurrences14 of sandy and beach ({beach/N} ±3) in the BNC,
there is only one predicative use
(7) The main beach is sandy and normally sheltered but an easterly storm will
bring up masses of seaweed which occasionally has to be carted away.
<BNC:EEX28>
and only 7 uses of the kind:
(8) They surround a 45-acre lake which is bordered with sandy white beaches,
seven swimming pools, children's playground and pool, and lots of shops
and restaurants. <BNC:G2V39H>
(9) Our day cruises visit sandy and colourful beaches ... <BNC: C2B 818>
This finds an interesting parallel in high tide and low tide. Although these
are listed in dictionaries, which points at compound status, they do not
carry stress on the first component and like sandy beach they also allow
predicative use:
(10) Before then it was served with a ford, and a ferry when the tide was high.
<BNC:C93 1668>
One point to be considered in this context is that the tide is high is not nec-
essarily synonymous with high tide.
(11) By the time they entered the harbour it was high tide and the launch, with
the Wheel riding her stern, lay almost level with the quay. <BNC: GW3 1491>
If you get high tide in the sense of the German Hochwasser at, say, 5 p.m.,
then the tide can presumably be called high between 3 and 7 or even be-
tween 2 and 8. In the same way it seems appropriate to state a difference in
meaning between the predicative and the attributive uses of sandy in (12)
and (13):
(12) The beaches on the North Coast of Cornwall are sandy.
(13) There are many sandy beaches on the North Coast of Cornwall.
Irrespective of whether in the light of these facts sandy beaches should be
analyzed as a collocation or a compound, I would argue that at least from
the foreign learner's point of view, the combination has to be accounted
for. For the foreign learner, it is by no means obvious why Sandbank,
Sandburg, Sandkasten, Sandsturm or Sandmann, Sandpapier and Sandstein
could be translated by equivalent English compounds with sand as their
first element, but not Sandstrand. From this point of view, the fact that
sandy beach is the equivalent of Sandstrand makes it an encoding idiom,
or, if you like, a Hausmann-type of collocation in the same way as weak tea
or guilty conscience. Even from the L1 perspective it seems rather idiosyn-
cratic that there are sandcastles but not sandy castles and that sandy banks
exist alongside (in the language, not geographically speaking) sandbanks
but with a different meaning.
To complicate matters further, there is the question of sand beach. Sand
beach is not normally listed in dictionaries (with the exception of the OED
where it is given under "general combinations"), three of four native speak-
ers consulted said it did not exist and one pointed out a technical geological
sense; the BNC contains 23 instances of sand beach(es) in comparison with
249 for sandy beach(es) ({beach/N} -1).15
In German, the situation seems to be the opposite - both Sandstrand and
sandige Strände can be found:
(14) Sylt verfügt über 38,3 km Sandstrand mit mehr als 13.000 Strandkörben16
(15) Wale benutzen ein Sonarsystem zur Orientierung und können verwirrt
werden, wenn sie in die Nähe sandiger Strände kommen. <DeReKo: HMP05>
The DeReKo corpus of the Institut für deutsche Sprache (W-öffentlich)
yields 5,247 instances of the different forms of Sandstrand (≈ 2.01 ipm) as
opposed to 30 of those of sandige Strände (≈ 0.01 ipm), which can be taken
as an indication of the fact that Sandstrand is the established way of refer-
ring to a sandy beach. The fact that a Google search produced several thou-
sand instances of the latter can perhaps be explained by assuming that
many of the texts in which these forms occur have been translated from or
modelled on English, and indeed they often seem to refer to non-German
beaches. Although this is by far outnumbered by millions of Sandstrände
(and the corresponding morphological forms) in Google, sandige Strände
seems to be acceptable in German and often as a synonym of Sandstrand -
something also suggested by the Duden Universalwörterbuch's (2001)
definition of Sandstrand as "sandiger Strand".
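The instances-per-million (ipm) values given here are simply raw frequencies normalized for corpus size; in the small sketch below the token count is a hypothetical placeholder and not the actual size of the DeReKo archive.

    def per_million(hits, corpus_tokens):
        # normalized frequency: instances per million words
        return hits / corpus_tokens * 1_000_000

    # with a hypothetical corpus of 2.6 billion tokens, 5,247 hits come out at about 2 ipm
    print(round(per_million(5247, 2_600_000_000), 2))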
Irrespective of the question of whether there is a difference in meaning
between the different uses of Sandstrand and sandiger Strand in German
and sand beach and sandy beach or whether such a difference in meaning is
always intended by the speakers or perceived in the interpretation of a
text, 17 it is probably fair to say that the vast majority of uses of Sandstrand
in German corresponds to the vast majority of uses of sandy beach in Eng-
lish. This, however, raises the question of whether the fact that we have a
compound in one language and a collocation in the other can or must be
taken to mean that we also have different representations at the cognitive
level.
It would be tempting to associate unified concepts of meaning with in-
dividual lexical items - either single words or compounds - and to see col-
locations such as sandy beaches as composite units, both at the formal and
semantic levels. On the one hand, this is underscored by the fact that in the
case of sandy and beaches, predicative uses or even superlative forms such
as sandiest beaches can be found.18 On the other hand, Schmid (2011: 132)
also says: 19
From a cognitive perspective we can say that compounds represent new
conceptual forms that are stored as integrated units in the mental lexicon, as
opposed to syntactic groups, which are put together during on-going proc-
essing of individual concepts in actual language use as the need arises.
While theoretically this criterion is without doubt the most important one, in
practice it is exceedingly difficult to implement, not least because there are
competing linguistic units that also consist of several words and are stored
as gestalts, namely fixed expressions and phraseological units. These in-
clude not just classical idioms such as to bite the dust ('to die') and to eat
humble pie ('to back down/give in'), which are almost certainly stored as
units in the mental lexicon, but also phrasal verbs such as to get up, to walk
out and many others.
It seems unlikely that in a situation in which speakers want to describe a
beach more closely, the process that happens in German when the com-
pound Sandstrand is chosen is much different from that taking place in
English when sandy beaches is chosen. In fact, it could be argued that col-
locations of this type are subject to the same or at least similar mechanisms
of entrenchment as compounds.20
There are obvious parallels between compounds and collocations with
respect to the criterion of unpredictability or idiomaticity: there does not
seem to be any great difference between compounds such as lighthouse or
Leuchtturm and collocations such as lay the table or set the table in this
respect, or, for that matter, if cross-linguistic evidence is legitimate in this
kind of argument, between Sandstrand and sandy beach.
These examples show that the identification of discrete units of meaning
is by no means unproblematic. It is relatively obvious that combinations
such as low tide and Niedrigwasser are units that express a particular con-
cept, but it could be argued that the meaning of the tide is high also pre-
sents a semantic unit or concept in the sense of an identifiable state of af-
fairs or situation. This concept can be expressed in a number of different
ways, but one could not argue that these were predictable in any way. Why
is the tide high or in but not up? Why is the tide going out but not decreasing?
It seems that, again, the co-occurrence of particular lexical items cre-
ates a particular meaning, and as such they can be considered a single
choice.
However, the fact that high tide and the tide is high do not represent
identical concepts shows that we have to consider not only the individual
words that make up an "extended unit of meaning", to use Sinclair's (2004:
24) term, but that the meaning of the construction must also be taken into
account, where one could speculate that in the case of adjective-noun-
combinations there is a gradient from compounds to attributive collocations
(such as sandy beach, confirmed bachelor, high tide) to predicative uses
(the tide is high) as far as the concreteness and stability of the concepts is
concerned.
Even if there are no clear-cut criteria for the identification of concepts or
semantic units, it has become clear that they do not coincide with the classi-
fication of collocations on the basis of the criteria of semantic or statistical
significance. While semantically significant collocations such as guilty
conscience or white wme can be seen as representing concepts, this is not
necessarily the case in the same way with other such collocations like
heavy rain or heavy smoker, for instance. Similarly, the fact that a statisti-
cally significant collocation such as sandy beach can be seen as represent-
ing a concept does not necessarily mean that all frequent word combina-
tions represent concepts in this way. It must be doubted, for instance,
whether the fact that the most frequent collocate of the verbs buy and sell in
the BNC is house should be taken as evidence for claiming that "buying a
house" has concept status for native speakers of English.
This means that traditional distinctions such as the ones between differ-
ent types of collocation or between collocations and compounds are not
necessarily particularly helpful when it comes to identifying single choices
or semantic units in Sinclair's sense.

2.3. The scope of unpredictability and the open choice paradox

The case of sandy beaches shows how scope and perspective of linguistic
analysis influence its outcome. If one studies combinations of two words,
as Hausmann (1984) does, then weak tea and heavy storm will be classified
as semantically significant collocations since tea and storm do not combine
with semantically similar adjectives such as feeble (tea) or strong (storm).
On the other hand, one could argue that the uses of the adjectives sandy and
sandig follow the open-choice principle: sandy occurs with nouns such as
beach, heath, soil etc.; sandig with nouns such as Boden, but also with
Schuhe, Strümpfe etc. If one identifies two senses of sandig in German -
one meaning 'consisting of sand', one 'being covered by sand', then of
course all of the uses are perfectly "regular". Seen in this light, such com-
binations can be attributed to the principle of open choice, which Sinclair -
at least in (1991: 109) - believed to be necessary alongside the idiom prin-
ciple "in order to explain the way in which meaning arises from language
text". The open-choice principle can be said to operate whenever there are
no restrictions in the combinations of particular lexical items. However, as
demonstrated above, the fact that sandy beach is frequently used in English
(because sand beach seems to be restricted to technical language) where
Sandstrand is used in German means that - at least when we talk about
established language use - the choice is not an entirely open one.
In any case, open choice must not be seen as being identical with pre-
dictability. For example, verbs such as buy, sell, propose or object seem to
present good examples of the principle of open choice since there do not
seem to be any restrictions concerning the lexical items that can occur as
the second valency complement of the verbs. The fact that shares, house
and goods are the most frequent noun collocates of buy and of sell in the
BNC (span ±3)21 can be seen as a reflection of the facts of the world (or of
the world discussed in the texts of the BNC) but one would hardly regard
this as a sufficient reason to consider these as phraseological or conceptual
units. So there is open choice in the language, but how speakers know that
(and when) this is the case, is a slightly different matter. In a way this is a
slightly paradoxical situation in that speakers will only produce combina-
tions such as buy shares or buy a book (a) because they have positive evi-
dence that the respective meanings can be expressed in this way and/or (b)
because they lack evidence that this is not the case - facts that cognitive
grammar would account for in terms of entrenchment or pre-emption.22
This open choice paradox can be taken as evidence for the immense role of
storage even in cases where the meaning of an extended unit of meaning
can be analysed as being entirely compositional.
A further complication about deciding whether a particular combination
of words can be attributed to the open-choice or the idiom principle is due
to the unavoidable element of circularity caused by the fact that this deci-
sion is based on our analysis of the meanings of the component parts of the
combination, which in turn however is based on the combinations in which
these words occur.

3. Probabemes

If we consider sandy beaches in this light, then it appears as a free combi-
nation because there is no element of semantic unpredictability involved.
However, if one investigates how a particular concept (or meaning) is ex-
pressed in a language, we can observe an element of unpredictability
caused by the fact that the concept of sandy beaches is expressed by a
combination of two words and not by a one-word compound in English,
which provides further evidence for the impact of the idiom principle.
It seems worthwhile to pursue this line of investigation further and ex-
tend the analysis beyond formally defined types of combination. So far, the
study of collocations and other phraseological units has concentrated on
analysing certain formally defined types of combination. It is to be ex-
pected that the scope of idiomaticity to be found in language will be in-
creased considerably if we take an onomasiological approach and study the
ways in which different concepts are realized in different languages.
Despite the problems concerned with questions such as synonymy or
near-synonymy and the representativeness of corpora, it will be argued here
that it may be rewarding to combine an onomasiological analysis of this
kind with statistical corpus analysis. For instance, if we take such words as
year, half and three-quarters, one would hardly doubt that there are equiva-
lents in German which express the same meanings as the English words -
namely Jahr, halb and dreiviertel. However, if we combine these meanings,
we find that there is a considerable amount of idiomatization involved:
in German, there is ein halbes Jahr and ein dreiviertel Jahr, which
could be seen as free combinations (Langenscheidt Collins Großwörterbuch
Englisch 2004 gives Dreivierteljahr as a compound, Duden Deutsches
Universalwörterbuch 2001 does not); in English, there is only half a year
but not three quarters of a year.
Furthermore, the BNC yields some 4000 instances of six months (or
6 months), but only 46 of half a year, which suggests that the usual way of
referring to a period of some 180 days in English is six months rather than
half a year* In bilingual dictionaries, this is indicated by the fact that in
the entry of Jahr, the unit halbes Jahr is included with an extra translation.
It is in this context that the concept of probabeme (Herbst and Klotz
2003) may turn out to be useful. If one combines the insights gained by
traditional phraseology and corpus linguistics into the importance of multi-
word units with an onomasiological approach, then we have to look for all
possible formal expressions of a particular meaning in a language irrespec-
tive of whether these expressions take the form of one word or several
words. The term probabeme can then be used to refer to units such
as six months, i.e. the (most) likely (or established) verbalizations of a par-
ticular meaning in a language. If we talk about the idiom principle, we talk
about what is established, usual in a speech community, and thus identify-
ing probabemes is part of the description of the idiom principle.
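In practical terms, identifying a probabeme amounts to comparing the corpus frequencies of competing verbalizations of the same meaning and singling out the established one. A minimal sketch of such a comparison is given below; the frequencies are of the order of magnitude reported above for the BNC, but the function name and the data structure are illustrative assumptions.

    def rank_verbalizations(frequencies):
        # order competing expressions of the same meaning by corpus frequency
        return sorted(frequencies, key=frequencies.get, reverse=True)

    # approximate figures for expressions referring to a period of about 180 days
    candidates = {"six months": 4000, "half a year": 46}
    print(rank_verbalizations(candidates))  # ['six months', 'half a year']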
This aspect of idiomaticity was highlighted by Pawley and Syder (1983:
196), who point out that utterances of the type
(16) I desire you to become married to me
(17) Your marrying me is desired by me
"might not achieve the desired response". More recently, Adele Goldberg
(2006: 13) observes that "it's much more idiomatic to say
(18) I like lima beans
than it would be to say
(19) Lima beans please me."
These examples show that if we follow the frame-semantics approach out-
lined by Fillmore (1977: 16-18),25 which includes such factors as perspec-
tivization, choice of verb and choice of construction, the idea of a single
choice may apply to larger units than the ones discussed so far.
A further case in point concerning probabemes is presented by the
equivalent of combinations such as wir drei oder die drei, which in English
tends to be the three of us (BNC: 137) and the three of them (BNC: 189)
rather than we three (BNC: 22) or they three (BNC: 4).
The three of us is a very good example of how difficult it is to account
for the recurrent chunks in the language. What we have here is a kind of
construction which could not really be called item-specific since it can be
described in very general terms: the + numeral + of + personal pronoun.
Again, bilingual dictionaries seem more explicit than monolingual diction-
aries. Langenscheidt's Power Dictionary (1997) and Langenscheidt Collins
Großwörterbuch Englisch (2004) give the three of us under wir, but no
equivalent under ihr or die.

4. Meaning-carrying units

The types of formal realizations to be considered in this context should not
be restricted in any way. The point of an onomasiological approach as sug-
gested here is not only to include "extended units of meaning" that consist
of several words but not to treat them any differently from single words.26
Units of meaning in this sense include elements that traditionally could be
classified as single words (beach), compounds (lighthouse), collocations
(sandy beaches, weak tea, set the table) or units such as bear resemblance
to or be of duration. Basically, all items that could be regarded as con-
structions (or at least item-based constructions) in the sense of form-
meaning pairings should be included in such an approach.27
There may be relatively little point in attempting a classification of all
the types of meaning-carrying units to be identified in the language (or a
language) in that little seems to be gained by expanding the lists of phrase-
ological units identified so far (Granger and Paquot 2008: 43-44; or Gläser
1990), especially since many of the units observed do not neatly fall into
any category.
In the light of the range of phraseological units identified it may rather
be necessary to radically rethink commonly established principles and cate-
gories in syntactic analysis. Thus Sinclair (1991: 110-111) points out that
the "o/in of course is not the preposition of that is found in grammar
books" and likewise Fillmore, Kay and O'Connor (1988: 538) ask whether
we have "the right to describe the the" in the the Xer the reconstruction
"as the definite article".28 Similarly, from a semantic point of view it seems
counterintuitive to analyse number and deal as heads of noun phrases in
cases such as
(21) a number of novels <NW72>
(22) a great deal of time <NW46>
where one could also argue for an analysis in terms of complex determiners
(Herbst and Schuller 2008: 73-74).29 A further case in point is presented by
examples such as
(23) The possibilities, I suppose, are almost endless <VDE>
where an analysis which takes suppose as the governing verb and the clause
as a valency complement of the verb is not entirely convincing (which is
why such cases are given a special status in the Valency Dtctionary ofEng-
Itsh, for instance). In fact, there are good arguments for treating / suppose
as a phraseological chunk that has the same function as an adverb such as
presumably and can occur in the same positions of the sentence as adver-
bials with the same pragmatic function. It has to be said, however, that the
precise nature of the construction is more complex: it is characterized by
the quasi-monovalent use of the verb (or a particular class of verbs com-
prising, for example, suppose, assume, know or explain) under particular
contextual or structural conditions, and at the same time one has to say that
the subject of the verb need not be a first person pronoun as in (23), al-
though statistically it very often is
(24) Philip Larkin, one has to assume, was joking when he said that sexual
intercourse began in 1962. <VDE>
The precise identification and delimitation of such units raises a number of
problems, however, both at the levels of form and meaning. In the case of
example (1), for instance, one might ask whether the unit to be identified is
[of duration] or [be® of duration] or [be® of 'time span' duration],
where be® stands for different forms of the verb be and 'time span' indi-
cates a slot to be filled by expressions such as ten weeks or short. One
might also argue in favour of a compositional account and not see be as
part of the unit at all. At the semantic level, one would have to ask to what
extent we are justified in treating expressions such as [be® of 'time
span' duration] or [last 'time span'] as alternative expressions of the same
meaning or not. In his description of the idiom principle, Sinclair (1991:
111-112) himself points at the indeterminate nature of such "phrases" by
pointing out their "indeterminate extent", "internal lexical" and "syntactic
variation" etc.30 Nevertheless, the question of to what extent formal and
semantic criteria coincide in the delimitation of chunks deserves further
investigation, in particular with respect to the notions of predictability and
storage relevant in foreign language linguistics and cognitive linguistics.

5. Creativity and open choice

There can be no doubt that in the last thirty or forty years, "the analysis of
language has developed out of all recognition", as John Sinclair wrote in
1991, and, indeed, the "availability of data" (Sinclair 1991: 1) has contrib-
uted enormously to this breakthrough. On the other hand, the use and the
interpretation of the data available is easier in some areas than in others.
For instance, we can now find out from corpora such facts as that
- there is a very similar number of occurrences of the verbs suppose
(11,493) and assume (10,956) and that 66 % of the uses of suppose
(7,606) are first person, but only 7 % (773) of those of assume;
- the verb agree (span ±1) shows 113 co-occurrences with entirely, 31
with fully, 27 with wholeheartedly (or whole-heartedly), 25 with to-
tally, 20 with completely and 9 with wholly;
- and that 88 % of the entirely agree-cases are first person singular, but
only 52 % of the fully agree-cases, compared with only 16 % of all
uses of agree.31
With respect to the relevance of these data, one will probably have to say
that some of these findings can be explained in terms of general principles
of communication such as the fact that one tends to ask people questions of
the type
(25) Do you agree?
rather than
(26) Do you agree entirely?
Like the statistically significant co-occurrence of the verb buy with particu-
lar nouns such as house or the predominance of westerly winds in the Brit-
ish National Corpus, such corpus data can be seen as a reflection of certain
types of human behaviour or facts of the world described in the corpus
analysed. Although Sinclair (1991: 110) mentions "the recurrence of simi-
lar situations in human affairs" in his discussion of the idiom principle, co-
occurrences of this type may be of more relevance with respect to psycho-
linguistic phenomena concerning the availability of certain prefabricated
items to a speaker than to the analysis of a language as such. However, the
fact that / suppose is much more common than / assume makes it a proba-
beme which is relevant to foreign language teaching and foreign language
lexicography. This is equally true of the co-occurrence of entirely and
agree, where a comparison of the collocations of agree in the BNC with the
ICLE learner corpus shows a significant underuse of entirely by learners.
Thus, obviously, the insights to be gained from this kind of analysis have to
be filtered and evaluated with regard to particular purposes of research.
While the occurrence of certain combinations of words or the overall fre-
quency of particular words or combinations of words may be relevant to
some research questions, what is needed in foreign language teaching and
lexicography is information about the relative frequency of units expressing
the same meaning in the sense of the probabeme concept.32
Recognizing the idiom principle thus requires a considerable amount of
detailed and item-specific description, which is useful and necessary for
applied purposes and this needs to be given an appropriate place in linguis-
tic theory. At the same time, it is obvious that when we discuss the idiom
principle, we are not concerned with what is possible in a language but with
what is usual in a language - with de Saussure's (1916) parole, Coseriu's
(1973) Norm or what in British linguistics has been called use. In other
words, we are not necessarily concerned with the creativity of speakers,
which no doubt exists and equally has to be accounted for in linguistic the-
ory. While it would be futile to discuss to what extent and in which sense
Goldberg's (2006: 100) equivalent of colourless green ideas
(27) She sneezed the foam off the cappuccino
or
(28) He pottered off pigwards
quoted from P.G. Wodehouse by Hockett (1958: 308) can be regarded as
creative uses of language,33 it is certainly true that established language use
provides the background to certain forms of creativity. Thus whereas (27)
can be seen as an atypical use of the verb sneeze but one which describes a
rather unusual situation for which no more established form of expression
would come to mind, (28) is a conscious and intended deviation from more
established ways of expressing the same meaning, which can perhaps also
be said of Ian McEwan's avoidance of shingle beach in the formulation
(29) ... Chesil Beach with its infinite shingle. <OCB4>
It may be debatable whether one should make a distinction between a prin-
ciple of open choice and a principle of creativity, as suggested by Siepmann
(this volume), but it is certainly true that for such purposes deviation from
established use cannot be measured purely in terms of frequency (Ste-
fanowitsch 2005). Thus one would hesitate to argue that using buy in com-
bination with door as in
(30) We bought er [pause] a solid door [pause] for the front. <BNC: KDM
11602>
is more creative than using it in combination with car
(31) 'Why did you buy a foreign car?' he said. <BNC: ANY 2910>
despite the fact that car is more than 50 times as frequent as a direct object
of buy (span ±3) in the BNC.34

6. Choices: from meaning to form

As far as the question of single choice is concerned, what I wanted to
demonstrate here can be summed up as follows:
1. Some but not all semantically significant collocations can be seen in
terms of a single choice. Collocations involving intensification such as
weak tea can readily be analysed as structured in terms of base and
collocate, often with a restricted set of possible collocates, which
could be an indication of storage. Other combinations such as white
wine or bear resemblance to, raise objections, have a swim seem to
represent unified semantic concepts and thus single choices in Sin-
clair's terms.
2. What appears as a free combination in the categories of descriptive
linguistics can be analysed as a single choice if it represents a concep-
tual unit. The decision not to use sand beach may in fact be a decision
in favour of sandy beach, which would then be a single choice. Such
an analysis seems more convincing for sandy beach than for, say, a
combination such as blue bus, but, of course, our arguments for claim-
ing that a particular combination represents a unified concept are
based on language use and thus there is a certain danger of circularity,
although Elena Tognini-Bonelli (2002) has provided a list of criteria to
identify what she calls functionally complete units of meaning - inter-
estingly in the context of translation, which is another area where
such phenomena feature prominently.
3. If we agree that concepts can be expressed by single words or by
chunks of words, then these units must be given equal status in a se-
mantic description as possible single choices. This can be supple-
mented by the factor of frequency as in the probabeme concept en-
compassing the preferred choice. In any case, recognizing the role of
multi-word units in the creation of meaning in language text shows
how misguided some of the structuralist work on word fields that only
comprise simple lexemes is (or was). The analysis of such items fur-
ther shows that the same amount of arbitrariness can be observed as
with traditional simple lexemes.
The identification of the idiom principle and the evidence provided for its
essential role in creating language text has thus opened up far-reaching
perspectives for further research. From a cognitive point of view it will be
important to see what sort of evidence can be found for storage and acces-
sibility of multi-word units and whether differences between different types
of multi-word units identified in traditional phraseology and corpus linguis-
tics can be shown to exist in this respect.35 Furthermore, the role of "ex-
tended units of meaning" in texts as demonstrated by Sinclair shows that
what is required now is a concentration on the paradigmatic dimension in
terms of an identification of the units that carry meaning - ranging from
morphemes or words to collocations and multi-word units or item-based
constructions - and the meanings or concepts expressed by these units. In
fact, the collocation and thesaurus boxes to be found in more recent edi-
tions of many modern English learners' dictionaries can be taken to apply
that sort of insight to lexicographical practice.36 A further consequence is
that at least for certain descriptive and theoretical purposes one should be
quite radical in overcoming traditional or established types of categories or
classification. If, for instance, Granger and Paquot (2008: 43) describe se-
quences such as depend on and interested in as grammatical collocations
but exclude other valency patterns such as avoid -ing-form because they do
not consider them "to be part of the phraseological spectrum", this obscures
the fact that valency patterns constitute single choices, too. In the spirit of a
lexical approach to valency or construction grammar, Sinclair's (1991: 110)
creed that "a language user has available to him or her a large number of
semi-preconstructed phrases that constitute single choices" can be applied
equally to both phenomena of lexical co-occurrence and the co-occurrence
of a word (or a group of words) and a particular construction. If we take
choice in terms of a choice to express a particular meaning, then language
consists of a rather complex system of choices (some of which may even
determine the meanings we tend to express in a particular situation of utter-
ance).37 Recognizing that some meanings can be expressed by single words,
by words linguists tend to call complex or by combinations that one can
refer to as collocations, multi-word units or item-based constructions means
understanding the generally idiomatic character of language. Thus drawing
a line between phraseology and lexis proper then seems as inadequate as
drawing a sharp line between grammar and lexis, of which John Sinclair
(2004: 164) wrote:
Recent research into the features of language corpora give us reason to be-
lieve that the fundamental distinction between grammar, on the one hand,
and lexis, on the other hand, is not as fundamental as it is usually held to be
and since it is a distinction that is made at the outset of the formal study of
language, then it colours and distorts the whole enterprise.
If we approach the choices speakers have for expressing particular mean-
ings from an onomasiological perspective, this is equally true of the distinc-
tion between lexis and phraseology.

Notes

1 I would like to thank Susen Faulhaber, Eva Klein, David Heath, Michael
Klotz, Kevin Pike and Peter Uhrig for their valuable comments.
2 Compare Gries's (2008: 6) definition of a phraseologism as "the co-
occurrence of a form or a lemma of a lexical item and one or more additional
linguistic elements of various kinds which functions as one semantic unit in a
clause or sentence and whose frequency of co-occurrence is larger than ex-
pected on the basis of chance".
3 See, for instance, Altenberg (1998) and Johansson and Hofland (1989). See
also Biber (2009). Cf. also Mukherjee (2009: 101-116).
4 For the role of phraseology in linguistic theory see Gries (2008). For a com-
parison of traditional phraseology and the Sinclairean approach see Granger
and Paquot (2008: 28-29). For parallels with pattern grammar and construc-
tion grammar see Stubbs (2009: 31); compare also Gries (2008: esp. 12-15).
5 Cf. Croft and Cruse (2004: 225), who point out that "construction grammar
grew out of a concern to find a place for idiomatic expressions".
6 For a discussion of definitions of construction see Fischer and Stefanowitsch
(2006: 5-7). See also Goldberg (1995: 4).
7 For a detailed discussion of different concepts of collocation cf. Nesselhauf
(2005: 11-40) or Handl (2008). For a discussion of the frequency-based, the
semantically-based approach and a pragmatic approach to collocation see
Siepmann (2005: 411). See also Cowie (1981) or Schmid (2003). Handl
(2008: 54) suggests a multi-dimensional classification in terms of a semantic,
lexical and statistical dimension.
8 "Der Kollokator ist em Wort, das beim Formuheren in Abhangigkeit von der
Basis gewahlt wird und das folglich mcht ohne die Basis defimert, gelernt
und iibersetzt werden kann" (Hausmann 2007: 218).
9 This table shows the number of occurrences of the adjectives listed with the
corresponding nouns (query adjective + {noun/N}). It must be stressed that
the figures given refer to absolute frequencies of occurrence and should in no
way be taken as a measure of collocational strength. Grey highlighting means
that the respective collocation is listed under the noun in the Oxford Colloca-
tions Dictionary (2002). Obviously, not all possible collocates of the nouns
have been included. Furthermore, one has to bear in mind that in some cases -
such as weak tea and light tea - the adjectives refer to different lexical units.
Cf. also Herbst (2010: 133-134).
10 For the criterion of "begrenzte Kombinationsfähigkeit" (limited combinability)
see Hausmann (1984: 396).
11 See also the pattern grammar approach taken by Hunston and Francis (2000).
The patterns in the table can be illustrated by sentences such as the following:
NP Vact NP NP: I wasn't really what you'd call a public school boy... (VDE); NP
Vact NP as NP: Many commentators have regarded a stable two-party system
as the foundation of the modern British political system (VDE); NP Vact of
NP as NP: One always thinks of George Orwell as a great polemicist (VDE).
12 See also Behrens (2009: 390) or Lieven, Behrens, Speares and Tomasello
(2003).
13 Cf., however, OALD8 under beach: "a sandy/pebble/shingle beach".
14 Excluded are three cases where sandy and beach co-occur but where sandy is
not in a direct relationship to beach.
15 The symbol @ is used to denote all morphological forms.
16 Source: syltinfo.de. (www.syltinfo.de/content/view/286/42/; August 2011).
17 Of course, the question of whether Sandstrand and sandiger Strand are inter-
changeable or synonymous is notoriously difficult to answer. There may cer-
tainly be cases where sandige Strände is chosen consciously and for a special
reason, such as example (15) perhaps. On the other hand, a large number of
sandige Strände from the Google text search seem to occur in travel itinerar-
ies or adverts for hotels or holiday cottages, where no semantic difference to
Sandstrände can be discerned (but then, as pointed out above, many of these
occurrences may have been influenced by texts written in English). Even if
they were not necessarily produced by native speakers of German, pragmati-
cally they would certainly be perceived as having the same meaning as Sand-
strände (and not evoke the negative connotations the German adjective
sandig seems to have in combinations such as sandige Schuhe, sandiges Haar
etc.).
18 "This is one of the sunniest, driest areas of the United Kingdom with some of
the sandiest beaches in the land." (www.campsites-uk.co.uk/details.php?id
=1741; March 2011)
19 This does not mean that such combinations should be classified as com-
pounds. For instance, it can be doubted that they have the same power of hy-
postatization. For a detailed discussion of "the conceptual effect of one single
word" see Schmid (2008: 28).
20 It seems that one can either argue that sandy beaches is stored in the mind as
a unit in a similar way as Sandstrand is or that when speakers feel they want
to specify or describe a beach more closely, the German lexicon provides a
compound whereas in English a compositional process has to take place. For
a discussion of factors resulting in entrenchment see Schmid (2008: 19-22).
Compare also Langacker (1987: 59).
21 Log-likelihood values for buy + shares (1708.2759), house (1212.157) and
goods (990.8394); for sell + shares (1706.8992), house (467.7887) and goods
(1933.0683) in the BNC.
22 Compare e.g. Tomasello (2003: 178-131), Goldberg (1995: 122-123, 2006:
94-96) and Stefanowitsch (2008).
23 The picture is complicated somewhat by the fact that six months can occur as
a premodifier in noun phrases. Excluding uses of the type 4 to 6 months the
BNC yields the following figures: six months (3750), 6 months (187), six
month (135), 6 month (6), six-month (214), 6-month (14) versus half a year
(46), half year (171), half-year (81) and halfyear (1). However, it is worth
noting that in the BNC the verbs last and spend do not seem to co-occur with
half a year but that there are over 60 co-occurrences of the verb spend with
six months and 27 of the verb last (span ±5).
24 Example numbers added by me; running text in original.
25 Cf. also Fillmore (1976) and Fillmore and Atkins (1992).
26 This seems very much in line with the following statement by Firth (1968:
18): "Words must not be treated as if they had isolate meaning and occurred
and could be used in free distribution".
27 Cf. e.g. Goldberg (2006: 5) or Fillmore, Kay and O'Connor (1988: 534).
28 Compare also the list of complex prepositions given in the Comprehensive
Grammar of the English Language (1985: 9.10-11) including items such as
ahead of, instead of, subsequent to, according to or in line with, whose con-
stituents, however, are analyzed in traditional terms such as adverb, preposi-
tion etc.
29 On the other hand, within a lexically-oriented valency approach these of-
phrases could be seen as optional complements, which would have to be part
of the precise description of the corresponding units.
30 Note, however, the considerable amount of variation of idiomatic expressions
indicated in the Oxford Dictionary of Current Idiomatic English by Cowie,
Mackin and McCaig (1983). For the related problem of defining constructions
in construction grammar see Fischer and Stefanowitsch (2006: 4-12).
31 For a discussion of the collocates of agree on the basis of completion tests see
Greenbaum (1988: 118) and Herbst (1996).
32 For instance, it could be argued that specialised collocation dictionaries such
as the Oxford Collocations Dictionary would be even more useful to learners
if they provided some indication of relative frequency in cases where several
synonymous collocates are listed.
33 Compare also all the sun long, a grief ago and farmyards away discussed by
Leech (2008: 15-17).
34 The frequencies of door (27,713) and car (33,942) cannot account for this
difference. The analysis is based on the Erlangen treebank.info project (Uhrig
and Proisl 2011).
35 See Underwood, Schmitt and Galpin (2004: 167) for experimental "evidence
for the position that formulaic sequences are stored and processed holisti-
cally". Compare also the research carried out by Ellis, Frey and Jalkanen
(2009). See also Schmitt, Grandage and Adolphs (2004: 147), who come to
the conclusion that "corpus data on its own is a poor indication of whether
those clusters are actually stored in the mind".
36 The latest editions of learner's dictionaries such as the Longman Dictionary
of Contemporary English (LDOCE5), the Oxford Advanced Learner's Dic-
tionary (OALD8) and the Macmillan English Dictionary for Advanced
Learners (MEDAL2) make use of rather sophisticated ways of covering
multi-word units such as collocations; cf. Herbst and Mittmann (2008) and
Götz-Votteler and Herbst (2009), which can be seen as a direct reflection of
the developments described. Similarly, dictionaries such as the Longman
Language Activator (1993), the Oxford Learner's Thesaurus (2008) or the
thesaurus boxes of LDOCE5 list both single words as well as word combina-
tions under one lemma.
37 Compare the approach of collostructional analysis presented by Stefanowitsch
and Gries (2003).

References

Altenberg, Bengt
1998 On the phraseology of spoken English: The evidence of recurrent
word-combinations. In Phraseology: Theory, Analysis, and Applica-
tions, Anthony P. Cowie (ed.), 101-122. Oxford: Clarendon Press.
Behrens, Heike
2007 The acquisition of argument structure. In Valency: Theoretical, De-
scriptive and Cognitive Issues, Thomas Herbst and Katrin Götz-
Votteler (eds.), 193-214. Berlin/New York: Mouton de Gruyter.
Behrens, Heike
2009 Usage-based and emergentist approaches to language acquisition.
Linguistics 47 (2): 383-411.
Biber, Douglas
2009 A corpus-driven approach towards formulaic language in English:
Extending the construct of lexical bundle. In Anglistentag 2008 Tü-
bingen: Proceedings, Christoph Reinfandt and Lars Eckstein (eds.),
367-377. Trier: Wissenschaftlicher Verlag Trier.
Bybee, Joan
2007 The emergent lexicon. In Frequency of Use and the Organization of
Language, Joan Bybee (ed.), 279-293. Oxford: Oxford University
Press.
Coseriu, Eugenio
1973 Probleme der strukturellen Semantik. Tübingen: Narr.
Cowie, Anthony Paul
1981 The treatment of collocations and idioms in learners' dictionaries.
Applied Linguistics 2: 223-235.
Croft, William and D. Alan Cruse
2004 Cognitive Linguistics. Cambridge: Cambridge University Press.
Ellis, Nick C., Eric Frey and Isaac Jalkanen
2009 The psycholinguistic reality of collocation and semantic prosody (1):
Lexical access. In Exploring the Lexis-Grammar Interface, Ute
Römer and Rainer Schulze (eds.), 89-114. Amsterdam/Philadelphia:
Benjamins.
Faulhaber, Susen
2011 Verb Valency Patterns: A Challenge for Semantics-Based Accounts.
Berlin/New York: De Gruyter Mouton.
Fillmore, Charles
1976 Frame semantics and the nature of language. In Origins and Evolu-
tion of Language and Speech: Annals of the New York Academy of
Sciences, Stevan R. Harnad, Horst D. Steklis and Jane Lancaster
(eds.), 20-32. New York: The New York Academy of Sciences.
Fillmore, Charles
1977 The case for case reopened. In Kasustheorie, Klassifikation, semanti-
sche Interpretation, Klaus Heger and János S. Petöfi (eds.), 3-26.
Hamburg: Buske.
Fillmore, Charles, and Beryl T. Atkins
1992 Toward a frame-based lexicon: The semantics of RISK and its
neighbors. In Frames, Fields, and Contrasts: New Essays in Seman-
tic and Lexical Organization, Adrienne Lehrer and Eva Feder Kittay
(eds.), 75-188. Hillsdale/Hove/London: Lawrence Erlbaum Associ-
ates.
Fillmore, Charles, Paul Kay and Catherine M. O'Connor
1988 Regularity and idiomaticity in grammatical constructions: The case
of let alone. Language 64: 501-538.
Firth, John Rupert
1968 Linguistic analysis as a study of meaning. In Selected Papers by J. R.
Firth 1952-59, Frank R. Palmer (ed.), 12-26. London/Harlow:
Longmans.
Fischer, Kerstin and Anatol Stefanowitsch
2006 Konstruktionsgrammatik: Ein Überblick. In Konstruktionsgramma-
tik: Von der Theorie zur Anwendung, Kerstin Fischer and Anatol
Stefanowitsch (eds.), 3-17. Tübingen: Stauffenburg.
Gilquin, Gaëtanelle
2007 To err is not all: What corpus and dictation can reveal about the use
of collocations by learners. In Collocation and Creativity, Zeitschrift
für Anglistik und Amerikanistik 55 (3): 273-291.
Gläser, Rosemarie
1990 Phraseologie der englischen Sprache. Leipzig: Enzyklopädie.
Götz-Votteler, Katrin and Thomas Herbst
2009 Innovation in advanced learner's dictionaries of English. Lexico-
graphica 25: 47-66.
Goldberg, Adele E.
1995 Constructions: A Construction Grammar Approach to Argument
Structure. Chicago/London: University of Chicago Press.
Goldberg, Adele E.
2006 Constructions at Work: The Nature of Generalizations in Language.
Oxford/New York: Oxford University Press.
Granger, Sylviane
1998 Prefabricated patterns in advanced EFL writing: Collocations and
formulae. In Phraseology: Theory, Analysis and Applications, An-
thony Paul Cowie (ed.), 145-160. Oxford: Oxford University Press.
Granger, Sylviane
2011 From phraseology to pedagogy: Challenges and prospects. This
volume.
Granger, Sylviane and Magali Paquot
2008 Disentangling the phraseological web. In Phraseology: An Interdis-
ciplinary Perspective, Sylviane Granger and Fanny Meunier (eds.),
27-49. Amsterdam/Philadelphia: Benjamins.
Greenbaum, Sidney
1988 Good English and the Grammarian. London/New York: Longman.
Gries, Stefan Th.
2008 Phraseology and linguistic theory. In Phraseology: An Interdiscipli-
nary Perspective, Sylviane Granger and Fanny Meunier (eds.), 3-35.
Amsterdam/Philadelphia: Benjamins.
Handl, Susanne
2008 Essential collocations for learners of English: The role of colloca-
tional direction and weight. In Phraseology in Foreign Language
Learning and Teaching, Fanny Meunier and Sylviane Granger (eds.),
43-66. Amsterdam/Philadelphia: Benjamins.
Hausmann, Franz Josef
1984 Wortschatzlernen ist Kollokationslernen. Praxis des neusprachlichen
Unterrichts 31: 395-406.
Hausmann, Franz Josef
1985 Kollokationen im deutschen Wörterbuch: Ein Beitrag zur Theorie
des lexikographischen Beispiels. In Lexikographie und Grammatik,
Henning Bergenholtz and Joachim Mugdan (eds.), 118-129. Tübin-
gen: Niemeyer.
Hausmann, Franz Josef
2004 Was sind eigentlich Kollokationen? In Wortverbindungen - mehr
oder weniger fest, Kathrin Steyer (ed.), 309-334. Berlin/New York:
Walter de Gruyter.
Hausmann, Franz Josef
2007 Die Kollokationen im Rahmen der Phraseologie: Systematische und
historische Darstellung. Zeitschrift für Anglistik und Amerikanistik
55 (3): 217-234.
Herbst, Thomas
1996 What are collocations: Sandy beaches or false teeth? English Studies
77 (4): 379-393.
Herbst, Thomas
2007 Filmsynchronisation als multimediale Translation. In Sprach(en)kon-
takt - Mehrsprachigkeit - Translation: Innsbrucker Ringvorlesungen
zur Translationswissenschaft V. 60 Jahre Innsbrucker Institut für
Translationswissenschaft, Lew N. Zybatow (ed.), 93-105. Frankfurt:
Lang.
Herbst, Thomas
2009 Valency: Item-specificity and idiom principle. In Exploring the
Grammar-Lexis Interface, Ute Römer and Rainer Schulze (eds.), 49-
68. Amsterdam/Philadelphia: John Benjamins.
Herbst, Thomas
2010 English Linguistics. Berlin/New York: De Gruyter Mouton.
Herbst, Thomas and Michael Klotz
2003 Lexikografie. Paderborn: Schöningh (UTB).
Herbst, Thomas and Brigitta Mittmann
2008 Collocation in English dictionaries at the beginning of the twenty-
first century. In Lexicographica 24: 103-119. Tübingen: Max Nie-
meyer Verlag.
Herbst, Thomas and Susen Schüller [now Faulhaber]
2008 An Introduction to Syntactic Analysis: A Valency Approach. Tübin-
gen: Narr.
Herbst, Thomas and Peter Uhrig
2009- Erlangen Valency Patternbank. Available online at: http://www.
patternbank.uni-erlangen.de.
Hockett, Charles
1958 A Course in Modern Linguistics. New York: Macmillan.
Hunston, Susan and Gill Francis
2000 Pattern Grammar: A Corpus-Driven Approach to the Lexical Gram-
mar of English. Amsterdam/Philadelphia: Benjamins.
Johansson, Stig and Knut Hofland
1989 Frequency Analysis of English Vocabulary and Grammar, Based on
the LOB Corpus. Vol. 2: Tag Combinations and Word Combinations.
Oxford: Clarendon Press.
Langacker, Ronald W.
1987 Foundations of Cognitive Grammar. Volume 1: Theoretical Prereq-
uisites. Stanford, CA: Stanford University Press.
Leech, Geoffrey
2008 Language in Literature. Harlow: Pearson Longman.
Lieven, Elena
forthc. First language learning from a usage-based approach. In Patterns and
Constructions, Thomas Herbst, Hans-Jörg Schmid and Susen Faul-
haber (eds.). Berlin/New York: de Gruyter Mouton.
Lieven, Elena, Heike Behrens, Jennifer Speares and Michael Tomasello
2003 Early syntactic creativity: A usage-based approach. Journal of Child
Language 30: 333-370.
Makkai, Adam
1972 Idiom Structure in English. The Hague/Paris: Mouton.
Mukherjee, Joybrato
2009 Anglistische Korpuslinguistik: Eine Einführung. Berlin: Schmidt.
Nesselhauf, Nadja
2005 Collocations in a Learner Corpus. Amsterdam: Benjamins.
Pawley, Andrew and Frances Hodgetts Syder
1983 Two puzzles for linguistic theory. In Language and Communication,
Jack C. Richards and Richard W. Schmidt (eds.), 191-226. London:
Longman.
Quirk, Randolph, Sidney Greenbaum, Geoffrey Leech and Jan Svartvik
1985 A Comprehensive Grammar of the English Language. Lon-
don/New York: Longman. [CGEL]
Saussure, Ferdinand de
1916 Cours de linguistique générale, Charles Bally and Albert Sechehaye
(eds.). Paris/Lausanne: Payot.
Schmid, Hans-Jörg
2003 Collocation: Hard to pin down, but bloody useful. Zeitschrift für
Anglistik und Amerikanistik 51 (3): 235-258.
Schmid, Hans-Jörg
2008 New words in the mind: Concept-formation and entrenchment of
neologisms. Anglia 126: 1-36.
Schmid, Hans-Jörg
2011 English Morphology and Word Formation. Berlin: Schmidt. 2nd
revised and translated edition of Englische Morphologie und Wort-
bildung 2005.
Schmitt, Norbert, Sarah Grandage and Svenja Adolphs
2004 Are corpus-derived recurrent clusters psychologically valid? In For-
mulaic Sequences, Norbert Schmitt (ed.), 127-151. Amster-
dam/Philadelphia: Benjamins.
Siepmann, Dirk
2005 Collocation, colligation and encoding dictionaries. Part I: Lexico-
logical aspects. International Journal of Lexicography 18: 409-443.
Siepmann, Dirk
2011 Sinclair revisited: Beyond idiom and open choice. This volume.
Sinclair, John McH.
1991 Corpus, Concordance, Collocation. Oxford: Oxford University
Press.
Sinclair, John McH.
2004 Trust the Text: Language, Corpus and Discourse. London/New
York: Routledge.
Stefanowitsch, Anatol
2005 New York, Dayton (Ohio) and the Raw Frequency Fallacy. Corpus
Linguistics and Linguistic Theory 1 (2): 295-301.
Stefanowitsch, Anatol
2008 Negative entrenchment: A usage-based approach to negative evi-
dence. Cognitive Linguistics 19 (3): 513-531.
Stefanowitsch, Anatol and Stefan Th. Gries
2003 Collostructions: Investigating the interaction between words and
constructions. International Journal of Corpus Linguistics 8 (2):
209-243.
Stubbs, Michael
2009 Technology and phraseology: With notes on the history of corpus
linguistics. In Exploring the Grammar-Lexis Interface, Ute Römer
and Rainer Schulze (eds.), 15-31. Amsterdam/Philadelphia: John
Benjamins.
Tognini-Bonelli, Elena
2002 Functionally complete units of meaning across English and Italian:
Towards a corpus-driven approach. In Lexis in Contrast: Corpus-
Based Approaches, Bengt Altenberg and Sylviane Granger (eds.), 73-
95. Amsterdam/Philadelphia: Benjamins.
Tomasello, Michael
2003 Constructing a Language: A Usage-based Theory of Language Ac-
quisition. Cambridge, MA/London: Harvard University Press.
Uhrig, Peter and Thomas Proisl
2011 The treebank.info project. Paper presented at ICAME 32, Oslo, 4
June 2011.
Underwood, Geoffrey, Norbert Schmitt and Adam Galpin
2004 The eyes have it: An eye-movement study into the processing of
formulaic sequences. In Formulaic Sequences, Norbert Schmitt (ed.),
153-172. Amsterdam/Philadelphia: Benjamins.

Dictionaries

A Valency Dictionary of English


2004 by Thomas Herbst, David Heath, Ian Roe and Dieter Götz. Ber-
lin/New York: Mouton de Gruyter. [VDE]
Collins COBUILD English Language Dictionary
1987 edited by John McH. Sinclair. London: Collins. [Cobuild1]
Duden Deutsches Universalwörterbuch
2001 edited by Dudenredaktion (Annette Klosa, Kathrin Kunkel-Razum,
Werner Scholze-Stubenrecht and Matthias Wermke), Mannheim:
Dudenverlag. 4th edition.
Langenscheidt Collins Großwörterbuch Englisch
2004 edited by Lorna Sinclair Knight and Vincent Docherty, Berlin et al.:
Langenscheidt. 5th edition.
Langenscheidt's Power Dictionary Englisch-Deutsch Deutsch-Englisch
1997 edited by Vincent J. Docherty, Berlin/München: Langenscheidt.
Longman Dictionary of Contemporary English
2009 edited by Michael Mayor. Harlow: Pearson Longman. 5th edition.
[LDOCE5]
Longman Language Activator
1993 edited by Delia Summers. Harlow: Longman.
Macmillan English Dictionary for Advanced Learners
2007 edited by Michael Rundell. Oxford: Macmillan. [MEDAL2]
Oxford Advanced Learner's Dictionary of Current English
2010 by A. S. Hornby, edited by Joanna Turnbull. Oxford: Oxford Univer-
sity Press. 8th edition. [OALD 8]
Oxford Collocations Dictionary for Students of English
2002 edited by Jonathan Crowther, Sheila Dignen and Diana Lea. Oxford:
Oxford University Press.
Oxford Dictionary of Current Idiomatic English. Vol 2: Phrase, Clause and Sen-
tence Idioms
1983 edited by Anthony Paul Cowie, Ronald Mackin and Isabel R.
McCaig. Oxford: Oxford University Press.
Oxford English Dictionary
1989 edited by John Simpson and E. S. C. Weiner. Oxford: Clarendon.
2nd edition. [OED2]
Oxford Learner's Thesaurus. A Dictionary of Synonyms
2008 edited by Diana Lea. Oxford: Oxford University Press.

Corpora and further sources used

BNC British National Corpus


DeReKo Das Deutsche Referenzkorpus DeReKo, http://www.ids-mannheim.
de/kl/projekte/korpora/, at the Institut für Deutsche Sprache, Mannheim
(using: COSMAS I/II (Corpus Search, Management and Analysis
System), http://www.ids-mannheim.de/cosmas2/, © 1991-2010 Insti-
tut für Deutsche Sprache, Mannheim.
ICLE International Corpus of Learner English, Version 1.1. 2002. Sylvia-
ne Granger, Estelle Dagneaux, Fanny Meunier, eds. Université ca-
tholique de Louvain: Centre for English Corpus Linguistics.
NW Nice Work. By David Lodge (1989). Harmondsworth: Penguin. First
published 1988.
OCB On Chesil Beach. By Ian McEwan. London: Cape.
VDE A Valency Dictionary of English (see bibliography).
Sinclair revisited: beyond idiom and open choice
Dirk Siepmann

1. Introduction

In the present article I have set myself a triple goal. First, I would like to
suggest a new take on the principles of idiom and open choice. Second, I
wish to highlight the need to complement these principles by what I have
chosen to term "the principle of creativity". Third, I shall endeavour to
show how these three principles can be applied conjointly to the teaching of
translation.

2. The principles of idiom and open choice

In 1991 the late John Sinclair, who is renowned for his pioneering work in
the field of corpus-based lexicography, propounded an elegantly simple
theory. In Sinclair's view, the prime determinants of our language behav-
iour are the principles of idiom and open choice, and the principle of idiom
takes precedence over the principle of open choice: "The principle of idiom
is that a language user has available to him or her a large number of
semi-preconstructed phrases that constitute single choices, even though
they might appear to be analysable into segments" (Sinclair 1991: 110).
The principle in question finds its purest expression in what Sinclair
(1996) terms the lexical item. Here is a straightforward example:

the text can be divided into three parts
the work is arranged in four sections
the book is organized into eight thematic sections
the volume consists of ten chapters

Sinclair's lexical item displays marked surface variations which conceal
one and the same semantic configuration. Although none of the words in
this configuration is obligatory, there is no doubt that all the lexical realiza-
tions of the configuration are semantically related.
Broadly speaking, we can distinguish four principal levels in this kind of
configuration:
the lexical level: preferential collocations (book + be organized + sections,
etc.)
the syntactic level: preferential colligations (subject + verb + preposition +
noun phrase, normally with a passive verb)
the semantic level: preferential semantic fields (nouns denoting publications
or parts of these publications, verbs describing the way in which something
is constructed)
the pragmatic level: the discursive function, the speaker's attitude (cf. Sin-
clair 1996, 1998; Stubbs 2002: 108-121)
Incidentally, Sinclair and Stubbs were no doubt the first to provide a sys-
tematic description of these collocational configurations, but long before
the studies of Sinclair and Stubbs appeared in print the linguistic phenom-
ena in question were familiar to translator trainers and specialists in foreign
language teaching. Specimens can be found in translation manuals pub-
lished in the 1970s and 1980s. Here is a representative example from a
book on German-English translation (Gallagher 1982: 47):
_
fed with empty promises
consoled with _
put off with _

A few years before the principle of idiom was enunciated by Sinclair,
Hausmann (1984/2007), working independently of his British colleague, set
up a typology of word combinations which is illustrated in the following
table:

co-creation: words with wide combinability which enter into relations with
each other in accordance with minimal semantic rules (e.g. a red suitcase, a
nice house, to wait for the postman)

collocation: words with limited combinability which enter into relations with
each other in accordance with refined semantic rules (e.g. to curb one's
anger, a peremptory tone, a cracked wall)

counter-creation: words with limited combinability which enter into relations
with lexical items beyond their normal combinatory profile in accordance
with minimal semantic rules (e.g. the lisping bay, a loose procession of
slurring feet, the water chuckled, the body ebbs)

Although Hausmann's typology and Sinclair's principles originated in dif-
ferent intellectual contexts,1 the degree of overlap between the conceptual
systems evolved by Hausmann and Sinclair is striking. Co-creations and
collocations can be explained by the principle of idiom, while counter-
creations can be accounted for by the principle of open choice. The only
difference between the two systems resides in the way Hausmann defines
the collocational phenomenon (Hausmann 2007). In our view, Hausmann's
definition is unduly restrictive. I have shown elsewhere (Siepmann 2006a,
2006b) that the collocational phenomenon is not limited to binary units,
that there is no clear dividing line between collocations on the one hand and
colligations and morpheme combinations on the other, and that Haus-
mann's hypothesis concerning a difference in semiotactic status between
the constituents of a collocation, though pedagogically very useful, is seri-
ously flawed.
This view is shared by Frath and Gledhill, who use the concepts "de-
nominator" (dénomination) and "interpretant" (interprétant) in order to
demonstrate that the phraseological unit is an "essentialist artefact selected
arbitrarily by linguistic tradition from a continuum of referential expres-
sions ranging from the lexical unit to the sentence, the paragraph, and even
the entire text" (Frath and Gledhill 2005a; my translation).2 Frath and
Gledhill (2005b) apply the term "denominator" to all multi-word units
which are more or less frozen:
A denominator (symbolised here by N) is a word or a string of words which
refers globally to elements of our experience which are lumped into a cate-
gory by the N. Ns are not usually created by the individual, they are given
to us by our community. They are what Merleau-Ponty (1945) called parole
instituée, i.e. institutionalised language. Whenever we get acquainted with
an N, we naturally suppose that it refers to an object (or O), even if we
know nothing about it. [my emphasis]
Thus, for instance, units such as strong tea, coffee grinder and psycho-
analysts all allegedly refer jointly and globally to an object defined more or
less arbitrarily by the linguistic community.
In our view the term object is inapposite in the above definition; it
would be preferable to speak of a concept. Linguistic sequences such as
mind your own damn business, I can't believe that, history never repeats
itself or I love you function as indices pointing towards concepts or patterns
which are familiar to the linguistic community. The link between the lin-
guistic sequence and its semantic extension or "reference" is not, as Frath
and Gledhill apparently assume, a direct one. The social "value" of a lin-
guistic sign is constituted not by what it refers to, but rather by the conven-
tional manner in which it is used. In the present instance we should speak
of self-reference rather than reference, for we are here concerned with cases
where one act of communication refers to another rather than to an object.
Using the terminology of Gestalt psychology, we might say that a set
phrase is a linguistic "figure" or "fore-ground" which refers to a situational
"ground" or "back-ground" (cf.Feilke 1994: 143-151).

3. Extending the principle of idiom

The validity of the principle of idiom has now been proved beyond dispute.
In several of his own publications Sinclair clearly demonstrated that his
assumption was correct (cf. Sinclair 1991, 1996, 1998); his ideas have ap-
parently been taken up by exponents of construction grammar as well as by
scholars subscribing to other schools of thought; and the author of this arti-
cle has added a small stone to this vast edifice by extending the principle of
idiom in two directions.
First, I have shown that Sinclair's principle applies not only to isolated
words, but also to syntagmas comprising several units. Thus, for instance,
the word group with this in mind collocates with syntagmas such as let us
turn to and let us consider (for further details, see Siepmann 2005: 100-
105).
Second, I have followed up on the idea that semantic features exert a
collocational attraction on each other - an idea which, in my opinion, is
implicit in the postulate that there are such things as collocational configu-
rations. Thus, in configurations such as the work is arranged in eight sec-
tions or the second volume consists of five long chapters, the nouns which
occupy the subject position contain the semantic feature /text/.
Semantic features, like living creatures, may be attracted to each other
even when they are separated by considerable distances. In extreme cases
there may be dozens of words between the semic elements involved. Con-
venient examples are provided by structures extending over formal sen-
tence or even paragraph boundaries, e.g. certainly [...] but, it often seems
that [...]. Not so and you probably think that [...]. Not so. I have termed
these lexical dependencies long-distance collocations (cf. Siepmann
2005b).
In order to take account of such phenomena, I have reformulated Sin-
clair's idiom principle as follows: "One of the main principles of the or-
ganisation of text is that the choice of one word or phraseological unit af-
fects the choice of other words or phraseological units, usually within a
maximum span of several paragraphs. Collocation is one of the patterns of
mutual choice, and idiom is another" (Siepmann 2005a: 102, based on Sin-
clair 1991: 173).
Pursuing this idea further, we might hazard the hypothesis that certain
syntactic phenomena which appear to be free choices are in fact determined
by the principle of idiom. A good example is provided by pseudo-cleft sen-
tences. Functional grammarians generally take the line that this type of
sentence marks a turning point in an argument. Thus, for instance, a
pseudo-cleft sentence is often used to mark a transition to a new topic, in-
troduce a comment or highlight a contrast. The rhematic element in the
pseudo-cleft sentence is thrown into sharp relief, and at the same time it is
presented as the topic which is to be dealt with in the text segment that
follows. Here are a few examples from journalistic texts:
1. Contrast
There are important debates to be had in this area. Marc even makes his tiny
contribution, saying seriously for a moment that he does not believe in the
values that dominate contemporary art, citing his dislike of its "surprise"
and "novelty". Yet that contribution stuck out so strangely that / wondered
for a second whether it had been slipped in by Christopher Hampton, the
translator, to make the original more topical for the British. What is certain
is that by importing a reactionary French diatribe against the legacy of mod-
ernism, the presenters of Reza's play have muddied the waters of debate,
almost certainly preventing a far more significant discussion from taking
place. (The Guardian 29.10.1996: 7)
So it is clear what the advantage of a merger would be for Warburg. What is
less obvious is what Morgan Stanley hopes to gam. (The Economist
10.12.1994)
2. Topic shifting
Dr Narin is also able to break patents down by nationality. He discovered,
for example, that in 1985 Japanese companies, for the first time, filed more
patents in America than American companies did. They have continued to
do so every year since.

Numbers are significant, of course, but what really counts is quality: only a
few inventions end up as big money-spinners. Dr Narin thinks he can spot
these, too. Patent applications often refer to other patents. By looking at the
number of times a particular patent is cited in subsequent applications, and
comparing this to the average number of citations for a patent in that indus-
try, he gains a measure of its importance to the development of the field.
(The Economist 20.11.1992)
3. Commentary
At 9pm, with the lights, telly, dishwasher and washing machine on, Wattson
is registering a whopping £3,000. But what most shocks me is the power be-
ing drawn from the socket when my appliances are supposedly turned off.
(Times Online 15.7.2004)
It would be counter-intuitive to postulate a correlation between the use of
specific terms and a non-specific linguistic phenomenon like topic shifting.
Yet it is precisely this postulate that is borne out by our statistics. If we
look at the adjectives and adjectival phrases that crop up repeatedly in
pseudo-cleft sentences hinging on the copula to be, we find that some 70%
of these sentences contain a limited number of specific lexical items. This
can be seen from the following table:

Speaker's attitude          what + ...
"what is in question"       is at stake/in question/involved
clarity                     is sure/certain/clear/obvious
necessity                   is important/counts
"what is striking"          is remarkable / exceptional / characteristic / evident / odd (etc.)
comparison                  is less sure / more important
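Purely by way of illustration, candidate what-pseudo-clefts built on the copula to be can be retrieved from running text with a crude pattern of the following kind. The regular expression and the example sentences are invented for this sketch and do not represent the retrieval procedure on which the figure of some 70% is based; on real corpus data a pattern of this sort will both over- and under-generate and would need manual filtering.

import re

# "what ... BE ... <sentence-final punctuation>" - a deliberately rough filter.
PSEUDO_CLEFT = re.compile(
    r"\bwhat\b[^.?!]{0,60}?\b(is|was|are|were)\b[^.?!]{0,120}?[.?!]",
    re.IGNORECASE)

text = ("What is certain is that the debate has been muddied. "
        "What is less obvious is what Morgan Stanley hopes to gain. "
        "Numbers are significant, but what really counts is quality.")

for match in PSEUDO_CLEFT.finditer(text):
    print(match.group(0).strip())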

It follows from this that even the use of certain syntactic constructions is, to
a certain extent, determined by the principle of idiom, even though writers
may have some leeway regarding the meanings that have to be expressed.
Does this mean that all linguistic behaviour boils down to the principle
of idiom? The answer is "no". Nonetheless, there is no escaping the fact
that language users have little room for manoeuvre in situations where they
have to arrange morphemes, individual words and syntagmas consisting of
several lexical items. In such situations the principle of open choice only
comes into play when language users display their incapacity to conform to
linguistic norms or happen to be motivated by the desire to flout such
norms by breaking valency, collocational or semantic moulds.3 The lan-
guage user's natural bent is to follow collocational norms (or "denomina-
tional" norms, to use the terminology of Frath and Gledhill). This is con-
firmed by the fact that many translators are inclined to "normalise" texts
which run counter to linguistic norms (cf. Chevalier and Delport 1995;
Kenny 2001; Gallagher 2007: 213-15).
When our attention is no longer absorbed by short and relatively rigid
word groups we enjoy a certain amount of freedom at the sentence, para-
graph and text levels, yet even here we have less room for manoeuvre than
is generally assumed (cf. Stein 2001). In our view the principle of open
choice normally comes into play whenever we have to link up several
valency patterns, collocations or probabemes.4 This means that open
choices are made primarily at the semantic level and the level of unspoken
or deverbalised thoughts (if such entities really exist). Once we have made
an open choice we generally find ourselves in the domain of prefabricated
language.

4. The principle of creativity

So far I have endeavoured to circumscribe the range of the principles of
idiom and open choice. Closer examination reveals the need to distinguish
two types of open choice: those which are completely abnormal in every
respect, and those which constitute a more or less deliberate deviation from
accepted norms that can be accounted for in terms of a set of semantic rela-
tions such as analogical transfer, lexical substitution, metaphor or meto-
nymy.
We can bring these phenomena into sharper focus by examining a
phrase from a novel by Colin Dexter:
[...] a paperback entitled The Blue Ticket, with a provocative picture of an
economically clad nymphet on the cover. (Dexter 1991: 201; [emphasis
added])5
In the present instance we have to do with a kind of analogical transfer.
Under normal circumstances, clad and its more modern synonym dressed
both collocate with the adverb scantily. Since economically is a partial
synonym of scanty, this is neither a collocation nor an open choice (or, to
use Hausmann's terminology, a counter-creation). It follows therefore that
the expression economically clad cannot be explained by either of the two
principles enunciated by Sinclair. It is, so to speak, a humorous extension
of the principle of idiom.
In view of the special characteristics of this and other examples which
are too numerous to cite (cf. for example Partington 1998), I feel it is ne-
cessary to postulate a third principle which stands in a complementary rela-
tionship to the two principles enunciated by Sinclair. I shall call this the
principle of creativity. Taking Hausmann's classification as my starting
point, I therefore suggest that co-occurrences should be divided into four
groups: co-creations, collocations, analogical creations and counter-
creations. Co-creations and collocations can be explained by the principle
of idiom, analogical creations by the principle of creativity, and counter-
creations by the principle of open choice. The distinctions I have drawn can
be justified on grounds of frequency and distribution (see the table below).

principle of idiom
- co-creation: e.g. beautifully clad (350 hits in Google Books6), real hate
  (365 hits in Google Books), two-hour drive
- collocation: figure-ground relation; denominator; e.g. scantily clad (799
  hits in Google Books), naked hate (207 hits in Google Books), scenic drive,
  dizzy with shock
principle of creativity
- analogical creation: figure with a text-specific ground (or a ground
  specific to a limited number of texts); interpretant; e.g. economically clad
  (24 hits in Google Books), bare hate (5 hits in Google Books), 27-hour
  meander by sledge, his eyes widened with shock
principle of open choice
- counter-creation: groundless figure; interpretant; e.g. thirstily clad (no
  hits in Google Books), loving hate (Shakespeare), sledge meander, arrive
  with shock
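Purely as an illustration of the frequency dimension of the table (and of nothing more: the distinctions themselves also rest on the semantic criteria of figure and ground discussed above, which no frequency count can supply), the Google Books figures cited there can be banded as in the following sketch; the thresholds are arbitrary assumptions chosen for the example.

def frequency_band(hits):
    # Crude banding by raw attestation; the cut-off points are arbitrary.
    if hits == 0:
        return "unattested"
    if hits < 50:
        return "rare"
    return "frequent"

google_books_hits = {          # figures as cited in the table above
    "beautifully clad": 350,   # co-creation
    "scantily clad": 799,      # collocation
    "economically clad": 24,   # analogical creation
    "thirstily clad": 0,       # counter-creation
}

for phrase, hits in google_books_hits.items():
    print(phrase, "->", hits, "hits:", frequency_band(hits))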

Such analogical transfers underlie the phenomena which Hoey (2005) de-
scribes as "semantic associations". Hoey argues that a word combination
such as a two-hour drive is based on an associative pattern of the type
number-time-journey by vehicle. It is this kind of pattern which generates
analogical creations such as a 27-hour meander by sledge.
Valency patterns can be extended in the same way. A characteristic ex-
ample is provided by English verbs and adjectives which are combined
with the preposition with and a noun denoting an emotion (e.g. anger or
grief). Thus, for instance, we can say she was dizzy with shock or he was
shaking with rage. Francis, Hunston and Manning (1998: 336) are correct
in claiming that only verbs and adjectives admit this construction, but they
restrict their attention to word combinations which are explainable by the
idiom principle and ignore analogical creations such as his eyes widened
with shock, her eyes sparkled with happiness and out of his mind with
grief.
One might object that all these word combinations can be explained by
the principle of idiom. This is partly true if one takes account of the fact
that the idiom principle extends to semantic features (Siepmann 2005)8 and
semantic associations (Hoey 2005), but one ought to bear in mind the fact
that Sinclair only takes account of the lexical surface.
What I would like to stress here is the fact that there are other types of
analogical creation which cannot be explained by the principle of idiom.
Good examples are provided by metaphors in general and synaesthetic
metaphors in particular. It is particularly enlightening to analyse examples
from Iris Murdoch's novels and compare Murdoch's sentences with their
French translations. Let us begin with a sentence from a novel entitled The
Red and the Green.

English original:
The wind blew the light rain against the windows in intermittent sighing
gusts that were like a soft ripple of waves. (Murdoch 1965: 247)
French translation:
Le vent poussait légèrement la pluie contre les fenêtres par bourrasques,
comme des soupirs intermittents : on aurait dit un doux clapotis de vagues.
(Murdoch 1967: 233, tr. Anne-Marie Soulac)

By combining the verb sigh with an inanimate noun (gust), Murdoch per-
sonifies the wind. But how can we classify a word combination like sighing
gusts? It is neither a co-creation ("a regularly formed, normal-sounding
combination"9) nor a collocation ("a manifestly current combination"). If
we adopt Hausmann's classification, we are therefore forced to conclude
that sighing gusts is a counter-creation, i.e. a rare or unique combination
which can be explained by the open-choice principle.
However, if we compare Hausmann's counter-creative examples10 with
sighing gusts, it immediately becomes apparent that the word-combinations
which Hausmann classes as counter-creations are much more daring and
much less natural than sighing gusts. Hausmann's expressions are very
rare, probably even unique.11 Not so with sighing gusts. Our intuition told
us that this word combination and its underlying semantic configuration are
not infrequent in literary texts, and this intuitive insight was confirmed by a
number of Google searches. In Google Books alone there were 675 hits for
sighing winds, 660 for sighing wind and 119 for sighing gusts. We may
therefore conclude that the word combination in question is a metaphor that
happens to have taken the fancy of certain authors. As a search on Google
Books readily shows, it occurs several times, for instance, in the works of
Fitzgerald and Murdoch.
There is sufficient evidence to prove that the distinction I have just
made is directly relevant to practical translation work. Soulac, the French
translator of Murdoch's The Red and the Green, has "normalised" the pas-
sage quoted above by applying the principle of idiom (Murdoch 1965:
247). Par bourrasques is a prepositional phrase of medium frequency12 - in
other words a collocation - and soupirs intermittents is a co-creation (in-
termittent + any noun denoting a discontinuous phenomenon).
What Soulac failed to notice is that there is a grey zone between the
principle of idiom and the principle of open choice - a zone in which the
creativity principle holds sway. By applying this principle, she might have
succeeded in producing a more satisfactory rendering of Murdoch's poetic
prose. After examining a number of well-written literary texts by native
speakers of French, I decided that it would be preferable to translate the
passage in question as follows:
Le vent qui soupirait en bouffées (or: en bourrasques) intermittentes rabat-
tait une pluie fine contre les fenêtres : on aurait dit un doux clapotis de va-
gues.
Le vent qui soupirait en bourrasques plaquait le crachin contre les fenêtres :
on aurait dit un doux clapotis de vagues.
Le vent qui soupirait en bourrasques plaquait de douces ondées de pluie fine
contre les fenêtres.
In all these sentences the verb soupirer is connected to vent in the same
way as sigh is linked to gust in the English sentence.
It is interesting to compare the extract from The Red and the Green
(1965) with a passage from a much later novel entitled The Good Appren-
tice (1985).

English original:
The wind, which tired him so by day, came at night in regular sighing gusts,
sounding like some great thing deeply and steadily breathing. (Murdoch
2001: 152)
French translation:
Le vent, qui le fatiguait tant le jour, venait la nuit en rafales de soupirs
obstinés, comme une grande chose respirant profondément et régulièrement.
(Murdoch 1987: 187, tr. Anny Amberni)

Here we have exactly the same poetic word combination as in The Red and
the Green, but Amberni's translation is quite different from the rendering
suggested by Soulac. Instead of applying the creativity principle, Amberni
has gone to the opposite extreme by opting for an open choice which sa-
vours of affectation (en rafales de soupirs). This calls for a number of
comments. Although the pattern en NP de NP is quite common in contem-
porary French (e.g. en cascades de diamants), en NP de soupirs is an ex-
tremely rare pattern, and en rafales de NP is subject to severe selectional
restrictions. In French prose we often come across syntagms where the slot
following the preposition en is occupied by a noun denoting a sudden gush
of fluid or semi-liquid matter (e.g. en cascades d'eau claire, en torrents de
boue), but such syntagms sound distinctly odd as soon as we replace cas-
cades or torrents by rafales. In meteorological contexts syntagms such as
en rafales de 45 nœuds or en rafales de pluies torrentielles sound perfectly
normal, but these patterns are only marginally acceptable when nouns like
nœuds, pluie or grêle are replaced by words denoting sounds or emotions
(soupirs, râles, haine, rage). Amberni's phrase is not un-French, but it is a
counter-creation and therefore sounds much less natural than Murdoch's
phrase.
Our third and final example provides even more convincing evidence of
the workings of the creativity principle. Here is a sentence which shows
how a vivid stylistic effect can be achieved by means of a syntactic trans-
formation:
Sarah Harrison, a slimly attractive, brown-eyed brunette in her late twenties
[= slim and attractive] (Dexter 2000: 29)
Dexter's adverb + adjective construction is based on a common or garden


co-creation consisting of a pair of coordinated adjectives (slim and attrac-
tive). Since the underlying semantic association is not modified in any way
when slim and attractive is transformed into slimly attractive, the latter
cannot be classed as a counter-creation. Nor can it be categorized as a col-
location, for it is neither normal-sounding nor particularly frequent. We
must therefore conclude that it is an analogical creation which can only be
explained by the creativity principle.

5. The implications for translation teaching

It remains to consider the relevance of the aforementioned principles to
translation teaching. Translation teachers should explain these principles to
their students and encourage them to replicate stylistic effects wherever
possible.
In many cases the creativity principle can be applied in both the source
and the target language. Thus sighing gusts can be rendered adequately by a personificatory expression containing the words vent, soupirer and bourrasques (see the aforementioned examples from Murdoch's novels).
Stylistic normalization should only be attempted whenever the applica-
tion of the creativity principle would violate target language norms. Slimly attractive is a good example of an English stylistic device which cannot be replicated in French. Since the French adjective mince cannot be converted into an adverb (*mincement), we have no choice but to normalize Dexter's expression by rendering it as mince et séduisante.
I believe that the translation of syntagms is amenable to a systematic
presentation, but I concede that the systematic treatment of such translation
problems may be hindered by obstacles such as polysemy, transpositional
irregularities, unpredictable stylistic and textual factors, differences in fre-
quency, and collocational gaps. Let us now consider each of these obstacles
in turn.

5.1. Polysemy

Since collocations are often polysemous, translators have to look carefully
at the contexts in which they occur in order to find out exactly what they
mean. Thus, for instance, the collocation élever + niveau can be used with reference to racehorses as well as debates and conversations.13 Similarly, the English collocation have an interest (+ in) can mean either to be interested or to have a stake.14 Even Sinclair's (1996) prototypical example (the postulated link between the phraseological combination with / to the naked eye and the notions of difficulty and visibility) can pose problems for the translator since the French expression à l'œil nu is not always used in the same way as with/to the naked eye.15

with/to the naked eye | à l'œil nu

Sense 1 (semantic equivalence between French and English)
Prototypical example (cf. Sinclair 1996): [...] just visible to the naked eye [...] | [...] à peine visible à l'œil nu [...]
Neutral examples: At night at their house they sat on the deck and watched the stars with the naked eye (there was no telescope). / Egypt may be the best spot on earth to see the stars with the naked eye [...] | Jérusalem est une ville d'où l'on peut encore voir à l'œil nu l'épaisse couverture d'étoiles.
Counter-examples: The brightest star in it is o Velorum (3.6). It is easily visible with the naked eye [...] / In astronomy, the naked-eye planets are the five planets of our solar system that can be discerned with the naked eye without much difficulty. | Pour les dissuader, chaque nouvelle coupure porte sept signes de sécurité parfaitement visibles à l'œil nu liés à la fabrication du papier et à l'impression.

Sense 2 (present in French but not in English; = strikes the eye) | [...] le contraste avec les États-Unis se voit à l'œil nu. / Les « accords de Matignon » sont devenus fragiles. Les fêlures se voient à l'œil nu. / [...] presque à l'œil nu. / Il est vrai que quiconque peut constater à l'œil nu que les dégâts sont conséquents [...] (= sans regarder de près). / Sans cette échappée lusitanienne, on voit bien, à l'œil nu, que dans bijou et jouet, il y a « jou », comme dans « joujou » [...].

5.2. Problems associated with the systematic presentation of transpositions

Translation manuals often contain transpositional rules. Chuquet and Paillard (1987: 18), for instance, claim that an English adjective modified
by an adverb ending in -ly often has to be rendered by means of a double
transposition (adverb + adjective - noun + adjective). To support their
assertion they cite the following example:
remarkably white (skin) - (teint) d'une blancheur frappante
This kind of transposition is more dependent on collocational and co-
creational constraints than appears at first sight. After examining all the
contexts in which collocations of the type remarkably white might occur, I
have reached the following conclusions:
1. It is not necessary to resort to a transposition in order to express the notion of intensification carried by a degree adverb. Remarkably white skin and remarkably white teeth can be rendered respectively as une peau très blanche (or une peau toute blanche) and des dents très blanches (or des dents toutes blanches). This shows that translation
work requires a perfect mastery of both the source and the target lan-
guage - a mastery that can only be attained with the aid of large cor-
pora.16
2. Transposition is impossible in cases where the qualities expressed by
the adverb and the adjectives do not add up (cf Ballard 1987: 189; Gal-
lagher 2005: 16):
previously white areas - des zones jusqu'alors blanches
3. Other kinds of transposition might be envisaged (e.g. éblouissant de blancheur).
4. The example cited by Chuquet and Paillard is atypical, for the colour adjective white is generally combined with other adjectives (e.g. pure, dead, bright, brilliant) or with nouns and adjectives designating white substances (e.g. milk(y), cream(y), chalk(y)). Moreover, white often occurs in comparative expressions like as white as marble. It is this type of word combination that ought to constitute the starting-point for any systematic contrastive study of the combinatorial properties of white, whiteness, blanc and blancheur. When we embark on this kind of study we soon notice that the French almost invariably use constructions such as d'une blancheur / d'un blanc + ADJECTIVE (absolu(e), fantomatique, laiteux, laiteuse, etc.) or d'une blancheur de + NOUN / d'un blanc de (craie, écume, porcelaine, etc.). The English, by contrast, use a variety of expressions such as pure white, ghostly white, milky white or as white as foam.

5.3. Unpredictable stylistic factors

I shall restrict my attention to two typically French phenomena: synonymic
variation and subjectivism. The Gallic preference for synonymic variation
may be illustrated by means of the English expression golden age and its
French equivalents. While an English-speaking author will have no qualms
about repeating golden age several times within the same paragraph, a
French author will avoid such flat-footed repetition by using a stylistic
variant (époque dorée - âge d'or). Any translator worth his salt will do
the same.
French subjectivism manifests itself in a marked tendency to set facts in relation to an active subject, whereas English tends to represent reality as clusters of facts which are unrelated to the creatures which observe them. If we compare the collocations of the French noun impression with their English equivalents, we find that avoir l'impression is rarely rendered by its direct equivalent have the impression. One of the reasons for this is that the word combination avoir l'impression frequently occurs in subjectivist constructions. Here is an example from Multiconcord (Groß, Mißler and Wolff 1996):
Comment se peut-il qu'en l'espace d'une demi-heure, alors qu'on s'est borné à déposer les bagages dans le vestibule, à préparer un peu de café, à sortir le pain, le beurre et le miel du réfrigérateur, on ait une telle impression de chaos ?
How in the world did it happen that within half an hour - though all they
had done was to make some coffee, get out some rye crisp, butter, and
honey, and place their few pieces of baggage in the hall - chaos seemed al-
ready to have broken loose, [...].
The same kind of interlingual divergence can be observed when we exam-
ine the English translation equivalents of French noun phrases where im-
pression is followed by the preposition de and another noun:
cette impression de vertige disparut - this giddiness disappeared

5.4. Unpredictable textual factors

Literal translation is often impossible for textual reasons. A good example
is provided by the word combination jours heureux. This is normally ren-
dered directly as happy days (cf Memories of Happy Days, the title of a
book by the Franco-American author Julian Green). However, if the pre-
ceding context implies a comparison, jours heureux has to be translated as
happier days.17
This translation shift is due to the fact that the French generally prefer
jours heureux to jours plus heureux. This is true even when jours heureux
is immediately preceded by a verb implying a transition from unhappiness
to happiness. Witness the following quotation from a blog:
Côté boulot enfin, Julie a reçu son contrat donc pas de problème, et moi je me retrouve dès mardi à Supméca Paris dans le cadre de mon stage de secours, en attendant des jours heureux.
(http://juheetjeremieapans.blogspot.com/2006_04_01_archive.html)

It is interesting to note that the word combination en attendant des jours heureux is used interchangeably with en attendant des jours meilleurs.18
Both word combinations are so common that they have virtually attained
the status of set phrases.19

5.5. Differences in frequency

A little consideration shows that equivalence problems may be posed by
the fact that a high-frequency word combination in one language may cor-
respond to a word combination with a much lower frequency in another
language. While word combinations such as illegal download and tele-
chargement illegal are equally common in English and French, the same
cannot be said of ambiance de plomb and leaden atmosphere. The expres-
sion ambiance de plomb can often be heard in French radio broadcasts, and
over 4,000 occurrences can be found on the Internet, but its direct English
equivalent, leaden atmosphere, is comparatively rare. The English prefer
expressions such as a brooding atmosphere, an oppressive atmosphere or
an atmosphere heavy with tension (and menace).

5.6. Collocational gaps

This brings us to the subject of collocational gaps, which occur wherever
the languages under consideration use different collocates although there is
an exact correspondence between the collocational bases. Consider, for
instance, the French economic term créneau and its English equivalents (gap in the market and market gap). Since créneau is frequently combined with prometteur but gap in the market (like market gap) rarely collocates with the adjective promising,20 English translators would be well advised to render un créneau prometteur by means of a word combination such as a profitable gap in the market or a potentially profitable gap in the market.
It will be evident from the foregoing that the automatic21 translation of
collocations and collocational configurations is fraught with often unpre-
dictable problems. Nonetheless, we can pave the way to a systematic treat-
ment of translation equivalences by studying collocational gaps and fre-
quency differences between various languages.
In order to make this perfectly clear, I shall round off my study with a
detailed analysis of the combinatorial properties of the French noun im-
pression. The table in the appendix juxtaposes the collocations of the
French noun impression recorded in the Robert des combinaisons de mots
and the collocations of the corresponding English noun listed in the Oxford
Collocations Dictionary for Students of English. The words in small capi-
tals are additional collocations which I added after a thorough corpus inves-
tigation.
A cursory examination of the verb-noun collocations shows that the
English collocation record + impression has no simple direct equivalent in
French. In order to fill this collocational gap, translators have to resort to a
kind of translation shift which Vinay and Darbelnet (1958) termed "modu-
lation":
she recorded her impressions (of the city) in her diary - elle a confié ses impressions à son journal / elle a livré (raconté) ses impressions dans son journal
If we draw up systematic lists of collocational equivalents and collocational gaps we can predict cases where translators have to resort to modulatory shifts, for there is clear evidence of a direct correlation between the meaning(s) of words and their ability to combine with other lexical items.
The systematic description of words' combinatorial properties reduces
the risks involved in a purely intuitive approach to translation teaching.
Translation into foreign languages can be dealt with in a more objective
manner if translation work proper is preceded by a systematic comparison
of articles from collocation dictionaries.
This can be effectively demonstrated by examining the adjectives that
combine with the noun impression. The Dictionnaire des combinaisons de mots lumps together adjectives like désagréable and navrant on the one hand and défavorable on the other. However, a French-English comparison shows that in the present instance we have to do with two distinct categories: impression défavorable can be classified under the sense we might label "opinion", while word combinations like impression navrante, impression épouvantable and impression horrible belong to the sense we might label "feeling". Since noun-adjective collocations belonging to the second category are specific to French, they cannot be rendered directly into English. The following example from Multiconcord (Groß, Mißler & Wolff 1996) is not above criticism, but it illustrates the kind of modulation technique which has to be used in such cases:
les maigrichons me donnent toujours l'impression désagréable de ne pas être à la hauteur - thin men always make me feel inadequate somehow

6. Conclusion

I would like to conclude by recapitulating the main points discussed in this
article:
- the extension of the idiom principle to semantic features and word
groups
- the need to establish a new principle which I have termed the creativ-
ity principle
- the need to operationalize the practice of translation on the basis of
objectively verifiable principles, concepts and research findings: (a) the
principles of idiom, creativity and open choice, (b) the results obtained
by a systematic study of the collocational equivalences between source
and target languages.
In order to teach translation more effectively, we need new translation
manuals in which information has been organized in accordance with these
principles. The books currently available on the market generally contain a
mere hotchpotch of ad-hoc observations which fail to take account of recent
advances in lexicography. In order to improve translation books we need to
draw a clear distinction between two distinct categories of lexical items:
those to which the creativity principle applies, and those which need to be
dealt with in accordance with the idiom principle. In order to resolve the
difficulties posed by words belonging to the latter category, we require
translation-oriented lexicogrammatical reference works (cf Salkoff 1999)
containing in-depth analyses of typical transpositional problems (cf. our
discussion of the colour adjective white). However, there is no need to treat
modulations in minute detail since most of these translation shifts can be
predicted with the aid of existing collocation dictionaries.22 These reference
works, however, need to be expanded and reorganized.

Notes

1 Hausmann's typology has its origins in traditional phraseology, which seeks
to define and categorize different types of word combinations; Sinclair's prin-
ciples, by contrast, are firmly rooted in British contextualism, which consid-
ers the concept of collocation primarily as a heuristic tool for constructing a
new language theory.
2 "C'est un artefact essentialiste, une entité arbitrairement sélectionnée par la tradition linguistique dans un continuum d'expressions référentielles qui vont de l'unité lexicale à la phrase, au paragraphe, au texte tout entier. Elle n'a ainsi pas d'existence 'en soi'" (Frath/Gledhill 2005a).
3 Michael Hoey describes these moulds as "primings" (Hoey 2005).
4 Herbst and Klotz (2003: 145-149) use the term probabeme to denote multi-
word units which speakers are likely to use to express standard thought con-
figurations. Thus, according to Herbst and Klotz, a native speaker of English
might say grind a cigarette end into the ground, while a native speaker of
German would probably use the verb austreten to express the same idea. The
kinetic verb grind evokes a vivid image of a cigarette butt being crushed be-
neath a foot, while the less graphic verb austreten gives more weight to the
idea of extinction. One might postulate a gradient ranging from valencies to
probabemes via collocations (in the traditional sense of the term).
5 This example, like the one that follows, was suggested by John D. Gallagher.
6 The search was carried out on 11 March 2008.
7 His eyes widened with shock and her eyes sparkled with happiness have a
distinctly literary flavour, but an expression like out of his mind with grief
might occur in everyday conversation. This shows that the creativity principle
operates at every level.
8 The hypothesis that semantic features exert a powerful attraction on each
other offers a plausible explanation for word combinations like his eyes spar-
kled with joy / happiness /glee and he was light-headed / almost unconscious
with tiredness.
9 Cf.Hausmann (1984) and (2007).
10 Hausmann cites word combinations such as la route se rabougrit and le jour est fissuré.
11 When we looked for la route se rabougrit and le jour est fissuré we were unable to find any occurrences on the Internet or in our corpus.
12 It should, however, be noted that en bourrasque is more common.
13 We can say le cheval est encore capable d'élever son niveau or il faut élever le niveau du débat.
14 For further examples see Siepmann (2006b).
15 According to Sinclair, the phraseological combination the naked eye consists
of a semantic prosody ("difficult"), a semantic preference ("see"), a colliga-
tion (preposition) and an invariable core (the collocation "the naked eye").
Our own findings indicate that the semantic prosody postulated by Sinclair is
not always present in English (cf. our counter-examples). In our opinion "de-
gree of difficulty" would be a more appropriate expression here.
16 Similar remarks apply to the word combination critically ill, which can be
rendered as dans un etat grave or gravement malade (cf. Chuquet and Paillard
1987: 18).
17 A particularly enlightening example can be found in a translation manual by
Mary Wood (Wood 1995: 106, 109 [note 24]).

18 Cf. the following example from a French daily: Si tel devait être le cas, Primakov ne ferait que retrouver, dans l'ordre intérieur, la fonction que lui avait dévolue en son temps son véritable patron historique, Youri Andropov, chef du KGB de 1967 à 1982, sur la scène diplomatico-stratégique du monde arabe : organiser, canaliser et modérer, en attendant des jours meilleurs, les bouffées néo-staliniennes en provenance de l'Orient compliqué. (Le Monde 5.11.1998: 17)
19 It should however be noted that en attendant des jours meilleurs is more
common than en attendant des jours heureux.
20 When we searched the Web we found fewer than ten examples in texts writ-
ten by native speakers of English.
21 We use automatic in the fullest sense of the word.
22 We have demonstrated this by means of a detailed analysis of the collocations
of Fr. impression and their English translation equivalents.

References

Ballard, Michel
1987 La traduction de l'anglais au français. Paris: Nathan.
Chevalier, Jean-Claude and Marie-France Delport
1995 Problemes linguistiques de la traduction: L'horlogerie de Saint
Jerome. Paris: L'Harmattan.
Chuquet,Helene and Michel Paillard
1987 Approche linguistique des problèmes de traduction. Anglais-Français. Paris: Ophrys.
Crowther, Jonathan, Sheila Dignen and Diana Lea (eds.)
2002 Oxford Collocations Dictionary for Students of English. Oxford:
Oxford University Press.
Feilke, Helmut
1994 Common sense-Kompetenz: Überlegungen zu einer Theorie "sympathischen" und "natürlichen" Meinens und Verstehens. Frankfurt a. M.: Suhrkamp.
Francis, Gill, Susan Hunston and Elizabeth Manning (eds.)
1998 Collins Cobuild Grammar Patterns. 2: Nouns and Adjectives. Lon-
don: HarperCollins.
Frath, Pierre and Christopher Gledhill
2005a Qu'est-ce qu'une unité phraséologique? In La phraséologie dans tous ses états: Actes du colloque "Phraséologie 2005" (Louvain, 13-15 Octobre 2005), Catherine Bolly, Jean René Klein and Béatrice Lamiroy (eds.). Louvain-la-Neuve: Peeters. [cf. www.res-per-
nomen.org/respernomen/pubs/lmg/SEM12-Phraseo-louvam.doc].

Frath, Pierre and Christopher Gledhill
2005b Free-range clusters or frozen chunks? Reference as a defining crite-
rion for linguistic units. In RANAM (Recherches Anglaises et Nord-
Americaines) 38 [cf www.res-per-nomen.org/respernomen/pubs/
lmg/SEM09-Chunks-new3.doc].
Gallagher, John D.
1982 German-English Translation: Texts on Politics and Economics.
Munich: Oldenbourg.
Gallagher, John D.
2005 Stilistik und Übersetzungsdidaktik. In Linguistische und didaktisch-
psychologische Grundlagen der Translation, Bogdan Kovtyk (ed.),
15-36. Berlin: Logos.
Gallagher, John D.
2007 Traduction littéraire et études sur corpus. In Les corpus en linguistique et en traductologie, Michel Ballard and Carmen Pineira-Tresmontant (eds.), 199-230. Arras: Artois Presses Université.
Groß, Annette, Bettina Mißler and Dieter Wolff
1996 MULTICONCORD: Ein Multilinguales Konkordanz-Programm. In
Kommunikation und Lernen mit alten und neuen Medien: Beitrage
zum Rahmenthema "Schlagwort Kommunikationsgesellschaft" der
26. Jahrestagung der Gesellschaft fur Angew andte Linguistik, Bernd
Riischoff and Ulnch Schmitz (eds.), 4 9 ^ 3 . Frankfurt: Peter Lang.
Guillemin-Flescher, Jacqueline
2003 Théoriser la traduction. Revue française de linguistique appliquée VIII (2): 7-18.
Hausmann, Franz Josef
2007 Apprendre le vocabulaire, c'est apprendre les collocations. In Franz
Josef Hausmann: Collocations, phraseologie, lexicographie: Etudes
1977-2007 et Bibliographie, Elke Haag, (ed.), 49-61. Aachen: Sha-
ker. First published in 1984 as Wortschatzlernen ist Kollokationsler-
nen: Zum Lehren und Lernen franzosischer Wortverbmdungen. Pra-
xis des neusprachlichen Unterrichts 31: 395^06.
Herbst, Thomas and Michael Klotz
2003 Lexikografie: Eine Einführung. Paderborn: Schöningh.
Hoey, Michael
2005 Lexical Priming: A New Theory of Words and Language. London:
Routledge.
Kenny, Dorothy
2001 Lexis and Creativity in Translation: A Corpus-Based Study. Man-
chester: St Jerome.

Le Fur, Dominique (ed.)
2007 Dictionnaire des combinaisons de mots: Les synonymes en contexte. Paris: Le Robert.
Partington, Alan
1998 Patterns and Meanings: Using Corpora for English Language Re-
search and Teaching. Amsterdam/Philadelphia: Benjamins.
Salkoff, Morris
1999 A French-English Grammar: A Contrastive Grammar on Translational Principles. Amsterdam: Benjamins.
Siepmann,Dirk
2002 Eigenschaften und Formen lexikalischer Kollokationen: Wider ein zu enges Verständnis. Zeitschrift für französische Sprache und Literatur 3: 240-263.
Siepmann,Dirk
2005a Discourse Markers across Languages: A Contrastive Study of Sec-
ond-Level Discourse Markers in Native and Non-Native Text with
Implications for General and Pedagogic Lexicography. Abing-
don/New York: Routledge.
Siepmann,Dirk
2005b Collocation, colligation and encoding dictionaries. Part I: Lexico-
logical aspects. International Journal of Lexicography 18 (4): 409-
444.
Siepmann,Dirk
2006a Collocation, colligation and encoding dictionaries. Part II: Lexico-
graphical aspects. International Journal of Lexicography 19 (1): 1-
39.
Siepmann,Dirk
2006b Collocations et dictionnaires d'apprentissage onomasiologiques: questions aux théoriciens et pistes pour l'avenir. Langue française 150: 99-117.
Sinclair, John McH.
1991 Corpus, Concordance, Collocation. Oxford: Blackwell.
Sinclair, John McH.
1996 The search for units of meaning. Textus IX: 75-106.
Sinclair, John McH.
1998 The lexical item. In Contrastive Lexical Semantics, Edda Weigand
(ed.), 1-24. Amsterdam: Benjamins.
Stein, Stephan
2001 Formelhafte Texte: Musterhaftigkeit an der Schnittstelle zwischen Phraseologie und Textlinguistik. In Phraseologie und Phraseodidaktik, Martine Lorenz-Bourjot and Heinz-Helmut Lüger (eds.), 21-40. Vienne: Praesens.

Stubbs, Michael
2002 Words and Phrases: Corpus Studies of Lexical Semantics. Oxford:
Blackwell.
Vinay, Jean-Paul and Jean Darbelnet
1958 Stylistique comparée du français et de l'anglais. Paris: Didier.

Sources used

Dexter, Colin
1991 The Third Inspector Morse Omnibus. London: Pan Books.
Dexter, Colin
2000 The Remorseful Day. London: Pan Books.
Green, Julian
1942 Memories of Happy Days. New York: Harper.
Murdoch, Iris
1965 The Red and the Green. New York: The Viking Press.
Murdoch, Iris
1967 Pâques sanglantes. Paris: Mercure de France.
Murdoch, Iris
1987 L'apprenti du bien. Paris: Gallimard.
Murdoch, Iris
2001 The Good Apprentice. London: Penguin.
Wood, Mary
1995 Thème anglais: Filière classique. Paris: Presses Universitaires de
France.

Appendix: Collocations of impression

Collocations printed in small capitals have not been recorded in Le Fur (2007) and Crowther, Dignen & Lea (2002). For some collocations, addi-
tional information has been provided in the form of examples or explana-
tions. Question marks indicate that the dictionary in question offers no
equivalent for a particular sense. Arrows provide cross-references to other
types of equivalents.

1. V + impression

Le Robert des combinaisons de mots | Oxford Collocations Dictionary for Students of English
avoir, éprouver, ressentir, retirer | form, gain, get, have, obtain, receive, be under
causer, créer, produire, provoquer, susciter, dégager, donner, faire, laisser, procurer | convey, create, give (sb), leave sb with, provide (sb with), MAKE (he really made an impression on me)
? | maintain
confirmer, conforter, corroborer | confirm, BEAR OUT
aggraver, accentuer, accroître, ajouter à | heighten, reinforce, strengthen, INTENSIFY
atténuer, tempérer |
corriger, rectifier | correct
contredire, démentir | BELIE
dissiper, effacer, gommer | ?
céder à, se fier à, garder, conserver, rester sur | -> BE LEFT WITH
ÉVITER DE DONNER, ~ se défaire de (on a du mal à se défaire de cette impression) | ~ avoid (It was difficult to avoid the impression that he was assisting them for selfish reasons)
? | record (She recorded her impressions of the city in her diary)
communiquer, confier, décrire, échanger, exprimer, livrer, raconter | EXCHANGE
entendre, RECUEILLIR (recueillir les impressions de M. Mitterrand après le voyage) | ?
RAPPORTER DE (je rapporte une impression favorable de cet entretien), ~ REPARTIR AVEC | COME AWAY WITH (I came away with a favourable impression of that meeting)
ÉMOUSSER (l'habitude avait émoussé chez lui cette impression d'aventure qu'il ressentait dix ans auparavant) | DULL

2. impression + ADJ

Le Robert des combinaisons de mots | Oxford Collocations Dictionary for Students of English
personnelle | personal
SUBJECTIVE | subjective
dominante | DOMINANT, main, overriding, overwhelming
générale, D'ENSEMBLE, GLOBALE | general (1), overall
NOUVELLE (Qui sait ce que seront ses pensées, lorsque de nouvelles impressions naîtront en elle !) | ?
DERNIÈRE(S) | final
répandue (rare) | general (2), widespread, COMMON
première | early, first, immediate, initial, INSTANT
DEUXIÈME (rare) | SECOND
confuse, diffuse, vague | vague, CONFUSED, CERTAIN
JUSTE | accurate, right
fugace, fugitive | fleeting, BRIEF
légère (j'ai une légère impression de déjà-vu) | superficial (cf. also I have a slight / vague feeling of déjà-vu)
contrastée (toujours [...] à la fois juvénile et maternelle, Pauline Arnoult laissait une impression contrastée de jeune femme fatale et de maturité [...]) | ?
contraire (X, qui est certainement un très haut magistrat, ne nous a pas donné une impression contraire) | opposite, CONTRARY (we apologise if the contrary impression was conveyed)
bonne, excellente, positive | excellent, favourable, good (better, best), great, right
défavorable, désagréable, DÉPLAISANTE, DOULOUREUSE, fâcheuse, frustrante, mauvaise, navrante, négative, pénible, piètre, catastrophique, déplorable, désastreuse, détestable, épouvantable, horrible, TERRIBLE, SALE, TRISTE (j'ai la triste impression d'une occasion manquée) | bad, poor, unfavourable, negative, NASTY
erronée, fausse, illusoire, trompeuse | distorted, erroneous, false, mistaken, misleading, spurious, wrong, MERETRICIOUS, CONTRADICTORY, DAMAGING, UNFAIR, UNFORTUNATE
énorme, grosse, profonde, vive, forte, grande, intense, nette, PUISSANTE | BIG, DEFINITE, REAL, considerable, deep, powerful, profound, strong, tremendous, ENORMOUS, MASSIVE, SERIOUS, SIGNIFICANT; clear, vivid, UNMISTAKEABLE; distinct, firm, strong; DISTINCT; FORCEFUL; KEEN
étonnante, frappante, incroyable, saisissante, étrange | EERIE -> feeling
ambiguë, bizarre, curieuse, drôle de, étrange, singulière, troublante, trouble, indéfinissable, indescriptible, inexplicable | DANGEROUS, UNCOMFORTABLE, UNNERVING
indélébile, inoubliable, mémorable, TENACE | abiding, indelible, lasting, ENDURING, LINGERING, PERMANENT
agréable, délectable, délicieuse, douce, enivrante, grisante, voluptueuse, apaisante, rassurante, HEUREUSE, SALUTAIRE | -> feeling
convaincante | convincing
PEU DE (rare), MOINDRE (Chirac s'est bien gardé de laisser transparaître dans son discours la moindre impression de frilosité ou de doute) | LITTLE (this made very little impression on me) (très fréquent), NOT THE SLIGHTEST

COLLOCATIONS « GRAMMATICALES » / COLLIGATIONS:
DE; DE + INF; QUELLE + IMPRESSION; QUELQUE (Son discours ne laissa pas de faire quelque impression sur moi); QUELQUES; PRON + PROPRE(S) | OF + N; OF + V-ING (he gives every impression of soldiering on); OF S.O. AS S.TH. (the abiding impression is of Alan Bates as a wonderfully clenched Hench); FROM (my abiding impression from the Matrix Churchill documents was that [...]); NOT MUCH OF AN; S.O.'S IMPRESSION IS OF S.TH. (my chief impression was of a city of retirees [...]); THE (ADJ) IMPRESSION IS ONE OF (e.g. great size); A (ADJ)/THE IMPRESSION IS THAT + CLAUSE; EVERY; OWN (my own impression from the literature is that [...]); THE SAME; SOME; CLINICAL (special.) (the clinical impression of hepatic involvement)
Accessing second-order collocation through lexical co-occurrence networks

Eugene Mollet, Alison Wray and Tess Fitzpatrick

1. Introduction2

Before the computer age, there were many questions about patterns in lan-
guage that could not be definitively answered. Dictionaries were the pri-
mary source of information about word meaning, and said little if anything
about the syntagmatic aspect of semantics - how words derive meaning
from their context. Now, with computational approaches to language de-
scription and analysis, we have an Aladdin's cave of valuable information,
and can pose questions, the answers to which derive from the sum of many
instances of a word in use.
However, there remain questions with elusive answers, and there is still
an onward journey for linguistic research that is dependent on new ad-
vances in computation. This paper offers one contribution to that journey,
by proposing a method by which new and useful questions can be posed of
text. It addresses a challenge that Sinclair identified: "What we still have
not found is a measure relating the importance of collocation events to each
other, but which is independent of chance" (Sinclair, Jones and Daley
[1970] 2004: xxii, emphasis added).
Specifically, we have applied an existing analytic method, network
modelling, to the challenge of finding out about patterns of lexical co-
occurrence. To date, networks have been used little, if at all, as a practical
means of extracting collocation, and for good reason: they are computa-
tionally heavy and render results that, even if different enough to be worth
using, remain broadly compatible with those obtained using the existing
statistical approaches such as T-score (see later). However, the very com-
plexity that makes networks a disadvantageous option for exploring basic
co-occurrence patterns also presents a valuable opportunity. For encoded in
a network is information that is not available using other methods, regard-
ing notable patterns of behaviour by one word within the context of another
word's co-occurrence patterns. It is this phenomenon, which we term 'sec-
ond-order collocation'3 that we propose offers very valuable opportunities
not only for investigating subtle aspects of the meanings of words in con-
text, but also for work in a range of applied domains, including critical dis-
course analysis, authorship and other stylistics studies, and second language
acquisition research.4

1.1. Conceptualising 'second-order collocation' analysis

The information available through second-order collocation analysis can be
illustrated with an analogy that reflects, appropriately, social networking -
for the network modelling we will apply derives from that sphere of en-
quiry (compare Milroy 1987). Let us take as our 'corpus' a large academic
conference lasting several days. Our 'texts' are the many different events
that take place: plenary and parallel presentations, meals, coffee breaks,
visits to the book stall and bathroom, walks back to the hotel, and so on.
Our target 'word' is a particular academic attending the conference, and our
analysis aims to ascertain his interests and friendships, by examining whom
he spends his time with. Of course, he will be found in the proximity of
many more people than he knows and will probably talk to many people of
no personal or professional significance to him. But in the course of the
conference it should be possible to establish, on the basis of the emerging
patterns, whom he most likes to spend time with, and which other people
must share his interests even if he never speaks to them, because they turn
up in the same rooms at the same time, to hear the same papers.
Were the academic in question to be asked what sort of impression such
an analysis would give of his profile of professional acquaintances and in-
terests, he would no doubt first of all say that no single conference could
capture everything about him - as no corpus can capture everything about a
word. Next, he might observe that his behaviour at a conference depended
on more than just what papers were on offer and whom he spotted in the
crowd: there are other dynamics, ruled by factors less directly or entirely
beyond his control. For example, if he had his talented post-doctoral re-
searcher with him, he might make a point of introducing her to people he
thought might have a job coming up in their department - such people he
might otherwise not prioritize speaking to. Meanwhile, as a result of current
politics in his department, he might need to be careful about being seen to
fraternize with anyone his Department Chair viewed as 'the enemy' - so a
lot would depend on where his Department Chair was at the time. Further-
more, if he were very candid, he might admit that the amount of time he
spends with a certain female acquaintance was determined in large measure
by whether or not his wife's brother was at the conference.
We see that the pattern of strong 'collocations' associated with this aca-
demic will depend not only on who is there, whom he knows, and what lo-
cal 'semantic' contexts he finds himself in, but also on how his own inter-
actions are affected by the presence of others and their own acquaintances,
allies and enemies. In short, we cannot isolate this one academic's relation-
ships from the deep and broad context of all the other relationships at the
conference.
In corpus analysis, although some measures, such as Mutual Informa-
tion, take into account not only how much word A is attracted to word B
but also the reverse - something that has a clear parallel in our academic's
conference experience - overall, analyses of word collocation are not all
that sophisticated in their ability to reflect the dynamic of secondary rela-
tionships - how one word's behaviour is influenced by the collocational
behaviour of other words. Sinclair's (2004: 134) discussion of 'reversal' is
highly relevant here:
Situations frequently arise in texts where the precise meaning of a word or
phrase is determined more by the verbal environment than the parameters of
a lexical entry. Instead of expecting to understand a segment of text by ac-
cumulating the meanings of each successive meaningful unit, here is the re-
verse; where a number of units taken together create a meaning, and this
meaning takes precedence over the 'dictionary meanings' of whatever
words are chosen.
Sinclair goes on to say that coherence in text is achieved naturally if there
is an obvious relationship between the meaning of the individual item and
that added by the environment. But if there is not, the reader must work
harder, variously inferring that a rare meaning of the item is intended, or a
metaphorical or ironic one.
Capturing second-order patterns is computationally extremely greedy.
Nevertheless, our method does it by economizing in ways relevant to the
analysis. To understand how, it is useful to return to the analogy. Were we
to have unlimited time and resources, we could track every single person at
the conference, rather than just the one, to gain the perfect picture of all
interactions. However, this computationally expensive procedure would
render much more information than we need. Our interest in the Depart-
ment Chair, for instance, extends only to those instances when he is located
close to our target academic, since we want to know how our target's be-
haviour in relation to some third party is influenced by his presence.

For this reason, in our approach to networking we constrain the model
to provide information only about situations in which the target item and
the influencing item co-occur. For instance, if we were interested in the co-
occurrence patterns associated with the word SIGNIFICANT, we might wish
to establish how its collocation with words such as HIGHLY is affected by
the presence, in the same window, of the word STATISTICALLY on the one
hand and SOCIALLY on the other. We would conduct one network analysis
to capture information about the interrelationships of all the collocates oc-
curring with SIGNIFICANT when STATISTICALLY is also present, and another
for when SOCIALLY is present. By comparing the two, we would be able to
ascertain how the presence of the 'influencing' items affects the collocation
behaviour of the primary item. Although this does require computational
power because the entire 'space' of all the interrelationships is calculated,
the computational costs are reduced relative to computing all of the interre-
lationships in the entire text or corpus, by extracting information only from
the contexts where the primary and secondary items co-occur.

1.2. Outline of the paper

In the remainder of this paper we develop the case for using network mod-
els and exemplify the process in detail. In section 2, we review the existing
uses of lexical networks in linguistic analysis and consider the potential and
limitations of using them to examine first-order collocations. We also ex-
plain what a network is and how decisions are taken about the parameters
for its construction. In section 3 we describe the network model adopted
here and illustrate the model at work using a corpus consisting of just two
lines of text by Jane Austen. In section 4 we demonstrate what the model
can do, exploring the lexical item ORDER in the context of the lexical item
SOCIAL on the one hand and of the tag <[MATHEMATICAL]FORMULAE> on
the other. Finally, we suggest how second-order collocation information
might be used in linguistic analysis.

2. Modelling language using lexical networks

Network models have been employed for a surprisingly diverse variety of


linguistic data, but seldom to extract information about collocations di-
rectly. Two useful overviews of network modelling in language study are
those of Sole, Murtra, Valverde and Steels (2005) and Steyvers and
Tenenbaum (2005). Meara and colleagues use network models to interpret
word association experiments (e.g. Meara 2007; Meara and Schur 2002;
Schur 2007; Wilks and Meara 2002, 2007; Wilks, Meara and Wolter 2005).
Others have used them to express thesaural relations (Holanda, Pisa, Ki-
nouchi, Martinez and Ruiz 2004; Kinouchi, Martinez, Lima, Lourenco and
Risau-Gusman 2002; Motter, de Moura, Lai and Dasgupta 2002; Sigman
and Cecchi 2002; Zlatic, Bozicevic, Stefancic and Domazet 2006), phono-
logical neighbourhoods (Vitevitch 2008), syntactic dependencies in Chi-
nese (Liu 2008) and in English (Ferrer i Cancho 2004, 2005, 2007; Ferrer i
Cancho, Capocci and Caldarelli 2007; Ferrer i Cancho, Sole and Kohler 2004), lemma, type and token co-occurrence (Antiqueira, Nunes, Oliveira and Costa 2007; Caldeira, Petit Lobão, Andrade, Neme and Miranda 2006;
Masucci and Rodgers 2006), syllables in Portuguese (Soares, Corso and
Lucena 2005) and Chinese characters (Li and Zhou 2007; Zhou, Hu, Zhang
and Guan 2008). Network studies of collocation include Ferrer i Cancho
and Sole (2001), Magnusson and Vanharanta (2003) and Bordag (2003).
However, the tendency has been to construct networks of collocations pre-
viously extracted rather than using the network model as the basis for the
extraction,5 something which fails to encode the additional layers of infor-
mation that we exploit in our procedure. Ferret's (2002) approach does ex-
tract collocations on the basis of the network, using them to sort text ex-
tracts by topic. Park and Choi (1999) experiment with thesaurus-building
using a collocation map constructed from probabilities between all items on
the map. These approaches nevertheless differ from our method, because
we are able to interpret relationships between two collocates relative to a
third.
At its simplest, a network consists of a collection of nodes connected by
lines. Depending on the purpose of the model, the nodes may represent,
inter alia, words in an individual's receptive or productive vocabulary,
sounds, graphemes, concepts, morphemes, or something else. For example,
in Schur's (2007) word association research, the nodes are a finite set of
stimulus words, joined to indicate which other stimulus word or words the
subject selected as plausible associative partners. Mapping lexical knowl-
edge in this way offers two different sorts of opportunity. Semantic net-
work models focus on what is similar across individuals' knowledge, so
that one can talk about the associative properties of sets of words in a lan-
guage. Such research may seek to explain typical patterns of interference
between words or concepts in the same semantic field, such as in terms of
competition during spreading activation across the network (e.g. Abdel
Rahman and Melinger 2007: 604-605). In contrast, word association studies like Schur's are typically used to seek differences between individuals'
knowledge networks.
The detail of how a network is constructed depends upon decisions
about what should serve as a node and the parameters that should apply for
connecting nodes. For a given set of nodes, the more connections there are
between them, the denser the network will be (figure 1). One of the major
challenges in network modelling is selecting parameters that reveal the
most useful amount of information. Much as in the more standard ap-
proaches to studying collocation in corpora, decisions must be made about
the length of string under scrutiny and about frequency. In both approaches,
thresholds are applied, to thin out the representation until it is manageable.

Figure 1. (a) a sparsely connected network; (b) a densely connected network

In network models, connection strengths can be expressed through weight-


ing, on the basis of, for instance, frequency. In our analyses, weightings are
determined on the basis of distance from the primary focus word, as de-
scribed later. Deciding whether or not to include weighting is contingent on
one's specific aims in an analysis. The same applies to the question of di-
rectionality: should the model encode information about whether (and, con-
sequentially, how often) a given word precedes or follows the reference
word? In the case of text analysis, the decision may depend on the analyst's
views about what the language under analysis permits in terms of semantic
relationships based on order. Encoding directionality could, for example,
lead to different profiles for pairs of lexical items in asymmetrical relation-
ships of attraction in opposite directions, such as DAMSEL -> DISTRESS and
HIGH <- DUDGEON. Our analyses do not encode directionality, but there
certainly is scope to do so in this method.

3. Method

In this section we describe the principles of the network algorithm used for
the illustration presented in section 4. We have selected an algorithm that
we think is generally effective, but we have constrained it here in ways that
make it easier to demonstrate. There is, in other words, broader scope for
parameter setting than is exemplified here. The generation of a network
proceeds in two stages. The first is mathematical. A computer program ap-
plies an algorithm to extract information from a text or corpus of texts and
to calculate relationships. As will become clear, the procedures combine to
calculate measures of weighted curvature (defined later), as an expression
of the multidimensional space in which a lexical item operates and by
which it is influenced. Second, the results of this analysis are filtered to
produce a graphic expression of selected information.

3.1. The connection of nodes

In our algorithm, two lexical items (types), A and B, are linked if B occurs
within a window of four content words either side of A. In the early days of
computational research into collocation, Sinclair identified a five-word
window as optimal (Krishnamurthy 2006: 596). Our window is smaller, but
wider in scope, because we have elected to capture only content words -
see below for the reason. As a result, many more content words will tend to
be linked in our ±4 window than would be the case in a standard five-word
window. We have also experimented with other types of window, including
the authorial sentence, and find some merit in them for certain kinds of
analysis. The parameters must be set with consideration of one's specific
research question. As noted earlier, simply linking words occurring in the
same window means that we have chosen not to encode directionality. An-
other algorithm could encode it, by, for instance, linking A and B only if B
followed A within the window.
The weight of the connection between A and B is encoded here accord-
ing to a measure of the distance between them. We have used the reciprocal
of the distance in content words 1/distance, but other options also exist,
including Instance 2 ). 'Distance' in both cases can be understood as the
number of content word steps between A and B. If B is adjacent to A, only
one step is required to get from A to B, so it scores 1/1 = 1. If B and A are
separated by one word, two steps are required, so the connection scores 1/2
= 0.5. If they are separated by two words, the score is 1/3= 0.33, and so on.
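To make the weighting scheme concrete, the following short Python sketch accumulates the 1/distance scores for every pair of content-word types in a single window. It is purely illustrative - a minimal reconstruction of the principle just described rather than the program used for the analyses reported here - and the function name and toy window are invented for the example.

    from itertools import combinations
    from collections import defaultdict

    def pair_weights(window):
        """Accumulate undirected 1/distance link weights for every pair of
        content-word types in one window (distance = content-word steps)."""
        weights = defaultdict(float)
        for i, j in combinations(range(len(window)), 2):
            if window[i] == window[j]:
                continue                                  # no self-links between tokens of one type
            pair = tuple(sorted((window[i], window[j])))  # undirected link
            weights[pair] += 1.0 / (j - i)                # reciprocal of the distance
        return weights

    # toy window of content words (B occurs twice, so B's links accumulate)
    print(dict(pair_weights(["A", "B", "C", "B", "D"])))
    # {('A','B'): 1.33, ('A','C'): 0.5, ('A','D'): 0.25,
    #  ('B','C'): 2.0, ('B','D'): 1.33, ('C','D'): 0.5}  (values rounded)

Because the scores are summed over all co-occurrences, a pair that repeatedly meets at close range accumulates a high weight, which is the behaviour exploited in the worked example in section 3.5.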

3.2. Calculating node relationships

After calculating the score for each word pair occurrence within the ±4
window identified by the focus item, the program calculates the strength of
the overall relationship between each two-word pair, by adding up all the
scores calculated for that word pair. We also calculate the weighted dis-
tance, which is the total amount of weight of the connections to and from
each node. This measure indicates how important the node is in the context
we have created. When the network graph is drawn, there are mechanisms
for positioning the most important nodes at the centre. However, although
weight measures are very important, they combine information about fre-
quency and distance: frequency contaminates the values because a word
occurring frequently will have more connections, each contributing to the
weight score. That is, a total weight score of 2 could be the result of two
occurrences immediately adjacent, four with one word intervening, six with
two words intervening, etc. To alleviate this problem, we calculate the Fre-
quency Normalized Weighted Distance (FNWD). The effect of this calcula-
tion is to factor out the influence of frequency of occurrence (i.e. how
many tokens of a given type occur in the sampled windows) without affect-
ing the expression of frequency of co-occurrence (i.e. how often, when an
item occurs, it is in the vicinity of a given other item). The calculation of
FNWD entails squaring the values in each row cell, summing the squares
and taking the square root. Then the sum of the columns is calculated.6
Frequency normalized weighted distance provides a good indication of
the patterns of co-occurrence. Yet, among the long lists of words that co-
occur with a focus word, only a subset are of interest as collocates. In other
words, co-occurrence is a necessary, but not sufficient criterion for colloca-
tion. We want to know not only which words occur close to our focus word
but also which of them are there in some measure because of the focus
word, rather than because they just turn up rather indiscriminately. To as-
certain this we need to understand the notion of curvature of a node in a
network. Curvature is the extent to which words that are each connected to
the focal item are also connected to each other. In other words, how often,
if word A is connected to word K and also to word M, do we find that K
and M are also connected?
Curvature provides an expression of, one might say, the 'promiscuity' of
a word in the created context. If a word tends to turn up in different lexical
company each time it occurs, the curvature value will be low. But if it co-
occurs with the same words each time, then the curvature value will be
high. Thus, we can use curvature as a measure of how 'choosy' a given


word is about its company, in the context (i.e. network) being studied. Cur-
vature is not influenced by frequency of occurrence, and, rather, provides
information about the extent to which frequency translates into 'more of the
same co-occurrences' or 'different co-occurrences'. The procedure for cal-
culating curvature is demonstrated below.
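Ahead of that demonstration, the bare idea can be expressed as a local clustering measure. The sketch below is purely illustrative and unweighted (the function name and the toy set of links are invented); the weighted curvature used in the analyses additionally feeds the link weights into the calculation rather than merely counting connections.

    from itertools import combinations

    def curvature(node, edges):
        """Unweighted curvature of `node`: the proportion of possible links
        between its neighbours that are actually present in `edges`
        (a set of frozensets, i.e. undirected links)."""
        neighbours = {w for e in edges if node in e for w in e if w != node}
        if len(neighbours) < 2:
            return 0.0
        pairs = list(combinations(sorted(neighbours), 2))
        linked = sum(1 for a, b in pairs if frozenset((a, b)) in edges)
        return linked / len(pairs)

    # A is linked to K, M and Z; of its neighbours, only K and M are linked to each other
    edges = {frozenset(e) for e in [("A", "K"), ("A", "M"), ("A", "Z"), ("K", "M")]}
    print(round(curvature("A", edges), 2))   # 1 of 3 neighbour pairs linked -> 0.33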

3.3. The visual representation

Finally, the visual representation of the network is generated by drawing
links between all word pair connections that reach a pre-established thresh-
old. Even figure 1b, above, hints at how soon a representation would be-
come difficult to read, and a threshold is applied as a means of filtering, so
that only the most prominent information is visible. Thresholds can be
based on a simple frequency weighting - one might only show connections
when the words have co-occurred at least, say, three times - or on another
measure, such as weighted curvature, as below. We find it effective to ad-
just the threshold to produce the first 80 candidates by rank, since this
quantity can be viewed with ease in a network representation (see section
4). It should be borne in mind that filtering a diagram does not suppress
information in the calculations, only in the visual presentation.
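In implementation terms such a filter is no more than a rank-and-slice over the link scores; the snippet below is a hypothetical illustration of that step (the pairs and values shown are invented), not part of the toolchain used here.

    # scores: any ranking measure over word-pair links (frequency weighting,
    # weighted curvature, ...); the pairs and values are invented
    scores = {("NAKED", "EYE"): 5.2, ("EYE", "VISIBLE"): 3.1, ("EYE", "CHART"): 0.4}

    TOP_N = 80   # the cut-off adopted for the network drawings
    visible_links = sorted(scores, key=scores.get, reverse=True)[:TOP_N]
    print(visible_links)   # only these links are passed to the graph-drawing stage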

3.4. The use of stop-listing

It was noted earlier that we have elected to engage with just content words
in these illustrations. Stop-listing, the removal of certain words before the
networks are constructed, acts as a filter on certain kinds of co-occurrence
(and indeed collocation) that are not of interest. We have used a stop-list
that combines the Glasgow7 and New York University stop-words.8 The
effect of doing so is to apply a fairly brutal filter, comprising all function
words, all digits, numbers and isolated letter characters, as well as several
very high frequency words such as KNOW, LIKE - which tend to be
bleached carriers - and a range of discourse-related adverbs, such as HOW-
EVER, ALTHOUGH. While some other words in the stop-list would be early
contenders for readmission for many kinds of analysis (e.g. FILL, MILL,
SIDE), for our present purposes their exclusion is not of major significance.
More generally, it will be clear that for many researchers, stop-listing all
but content words would be unhelpful. For instance, if one is trying to iden-
tify multiword lexical units, one would not want to exclude function words:
too many such units contain them - and indeed are differentiated from
other strings solely through them. In the same way, any analysis that re-
quires the distribution of proforms to be tracked would obviously not bene-
fit from their omission. Finally, we should note that we did not lemmatize
the data. Doing so conflates distributional information about particular
forms of a lemma, and we considered it better not to permit that to happen,
since "sometimes different forms of a lemma behave differently" (Stubbs
2001:30).

3.5. Simple worked example

To illustrate the procedures described so far in the simplest possible way,
we will use just two concordance lines from the works of Jane Austen, gen-
erated for the lexical item SEX (figure 2a). The size of this window is de-
termined by the presence of exactly four content words either side of SEX,
though this is easier to see when the stop-listed words are removed (figure
2b). Although SEX is the focus word - it lies in the middle of the window -
it will soon become clear that this does not give it priority treatment in the
procedure, other than that it defines the window under examination. The
relationships between all the other items in that window are also calculated.
To re-apply the conference analogy, this is equivalent to examining all of
the interactions between all of the participants in a conversation that our
target academic is part of. It is the centrality of our target that determines
the scope of the analysis, but within that defined domain, we obtain a com-
plete profile of everyone's interactions.

We each begin, probably, with a little bias towards our own SEX; and upon that bias build every circumstance in favour of it

perhaps be a little soured by finding, like many others of his SEX, that through some unaccountable bias in favour of beauty

Figure 2a. The Jane Austen text

Figure 2b. The Jane Austen text after stop-listing



The distance between each content word and its neighbours is calculated.
As described in 3.1 above, distance is based on the number of intervening
content words. For example, in the first string, SEX is followed by BIAS
with no intervening content words. It scores 1/1 = 1. As it happens, BIAS
also precedes SEX with no intervening content word. This also scores 1/1 =
1, giving an interim total of 2 (figure 2c).

Figure 2c. Assigning weights to links in string 1

Continuing the SEX - BIAS calculation with the second string, in which they
also both occur, we see that the distance between them is 2 steps (SEX -»
UNACCOUNTABLE - BIAS) (figure 2d). Accordingly this link scores 1/2=
0.5. Thus, the final value for the link between SEX and BIAS is 2.5.

Figure 2d. Assigning weight to a link in string 2

When this process has been repeated for all the possible combinations of
content words in the texts, we have a series of weight scores associated
with connections between pairs of words occurring in the selected windows
(table 1). Thus, at the intersection of LITTLE and BEAUTY lies a value of
0.14, deriving from their co-occurrence in the second string, at a distance of
seven content word steps (LITTLE - SOURED - FINDING - SEX - UNACCOUNTABLE - BIAS - FAVOUR - BEAUTY). The score is 1/7 = 0.14. Note
that this distance measure of seven content words is possible only because
the window has been defined as four content words either side of SEX. We
can liken it to defining the relationship between two people in a room who
do not know each other but both know our target academic.
Table 1. The weight scores for the content words in two strings from Jane Austen

Beauty Begin Bias Build Circ'e Favour Find'g Little P'haps Prob'ly Sex Soured Unac'ble
Beauty 0.00 0.00 0.50 0.00 0.00 1.00 0.20 0.14 0.12 0.00 0.25 0.17 0.33
Begin 0.00 0.00 0.53 0.17 0.14 0.12 0.00 0.50 0.00 1.00 0.25 0.00 0.00
Bias 0.50 0.53 0.00 1.33 0.75 1.53 0.33 1.53 0.17 0.75 2.50 0.25 1.00
Build 0.00 0.17 1.33 0.00 1.00 0.50 0.00 0.25 0.00 0.20 0.50 0.00 0.00
Circumstance 0.00 0.14 0.75 1.00 0.00 1.00 0.00 0.20 0.00 0.17 0.33 0.00 0.00
Favour 1.00 0.12 1.53 0.50 1.00 0.00 0.25 0.33 0.14 0.14 0.58 0.20 0.50
Finding 0.20 0.00 0.33 0.00 0.00 0.25 0.00 0.50 0.33 0.00 1.00 1.00 0.50
Little 0.14 0.50 1.53 0.25 0.20 0.33 0.50 0.00 1.00 1.00 0.83 1.00 0.25
Perhaps 0.12 0.00 0.17 0.00 0.00 0.14 0.33 1.00 0.00 0.00 0.25 0.50 0.20
Probably 0.00 1.00 0.75 0.20 0.17 0.14 0.00 1.00 0.00 0.00 0.33 0.00 0.00
Sex 0.25 0.25 2.50 0.50 0.33 0.58 1.00 0.83 0.25 0.33 0.00 0.50 1.00
Soured 0.17 0.00 0.25 0.00 0.00 0.20 1.00 1.00 0.50 0.00 0.50 0.00 0.33
Unaccountable 0.33 0.00 1.00 0.00 0.00 0.50 0.50 0.25 0.20 0.00 1.00 0.33 0.00

We can rank the words according to their total weight, by totalling the
rows. Note that as the table is a mirror image (as indicated by shaded-
unshaded areas) each cell entry occurs twice (signifying the connection A
to B and also the connection B to A). It is this fact that enables the fre-
quency normalized weighted distance calculation to be done later. The to-
tals from highest ranked down are: BIAS (11.17), SEX (8.32), LITTLE (7.53),
FAVOUR (6.29). The remaining words are in tied sets: FINDING and UNAC-
COUNTABLE (4.11), SOURED and BUILD (3.95), CIRCUMSTANCE and
PROBABLY (3.59) and BEAUTY, BEGIN and PERHAPS (2.71).9
Next we apply the frequency normalizing procedure as described earlier.
The FNWD values are: BIAS (5.55), LITTLE (4.01), SEX (3.96), FAVOUR
(3.25), FINDING (2.03), SOURED (2.02), UNACCOUNTABLE (1.81), PROBA-
BLY (1.72), BUILD (1.66), CIRCUMSTANCE (1.55), PERHAPS (1.28), BEGIN
(1.24), BEAUTY (1.21). Note that these figures express the mutual relation-
ships between all of the content words. For example, in this rearranged
ranking SOURED is above UNACCOUNTABLE even though both words occur
only once (string 2) and the latter is closer to the focus word SEX than the
former. The reason is that the scores for SOURED and UNACCOUNTABLE are
determined in part by how their own netghbours behave: the 'space' de-
scribed by the network calculation is defined in terms of all the different
relationships in it, and the location of a given word in that space is deter-
mined by the nature of the entire space, SOURED is elevated relative to UN-
ACCOUNTABLE because it occurs closer to LITTLE, which itself occurs in
both strtngs and so has other connections in the 'space'. Thts is the crucial
principle underlying the weighted curvature measure.
Weighted curvature is the means by which we can establish how much
notice to take of co-occurrences - that is, do they signify that there is a true
semantic association between the words, or is one of the words simply
promiscuous, i.e. very likely to turn up in many different contexts? The
curvature calculation is referred to as 'weighted' because it uses as input
the FNWD values, so it takes account of the distance between items and
how often they occur. However, we shall demonstrate the principle in a
more simplified way, ignoring both, and just posing the question 'are these
two items connected or not?'
In a simple network derived from the Austen samples earlier, containing
just the four words BIAS, SEX, CIRCUMSTANCE and BEAUTY, we find that
BIAS and SEX are both joined to all three of the others, but CIRCUMSTANCE
and BEAUTY are not joined to each other (figure 3). This comes about be-
cause words occurring only in one string are not connected in the network
to words that occur only in the other string. In terms of the conference
analogy, a network of relationships would not directly connect two indi-
viduals, one of whom stood in the same coffee queue as our target aca-
demic on Tuesday and the other of whom was sitting at his lunch table on
Thursday. (Recall that the analysis excludes consideration of what these
two individuals do when our target academic is not present, since every-
thing in a given analysis is centred on the experience of the target. If we
wanted information about their independent relationship, we would need to
make one of them our target.)

Figure 3. Exemplifying the basis of curvature

In figure 3 CIRCUMSTANCE and BEAUTY score 1.00 because all (i.e. both) of
the words that occur with them are also joined to each other. However,
BIAS and SEX score less than 1.00 because they co-occur with CIRCUM-
STANCE and BEAUTY which are not joined to each other. The score is the
number of triangles as a proportion of the total possible triangles associated
with a given node. For instance, a triangle will be formed in the network
each time two words connecting to BIAS also co-connect (BIAS-
CIRCUMSTANCE-SEX; BIAS-BEAUTY-SEX). No triangle is formed if they do
not co-connect (BIAS-CIRCUMSTANCE-BEAUTY). As two of the three possi-
ble triangles with BIAS are found, it scores 2/3 = 67%.
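The unweighted version of this principle takes only a few lines of R. The sketch below (ours, for illustration) encodes the five links of figure 3 in a binary adjacency matrix and reports, for each node, the proportion of possible triangles that are actually closed:

```r
words <- c("bias", "sex", "circumstance", "beauty")
A <- matrix(0, 4, 4, dimnames = list(words, words))
links <- rbind(c("bias", "sex"), c("bias", "circumstance"), c("bias", "beauty"),
               c("sex", "circumstance"), c("sex", "beauty"))
A[links] <- 1; A[links[, c(2, 1)]] <- 1   # undirected: fill both halves

curvature <- function(A, w) {
  nb <- names(which(A[w, ] == 1))         # the neighbours of w
  k  <- length(nb)
  if (k < 2) return(NA)
  sum(A[nb, nb]) / 2 / choose(k, 2)       # closed triangles / possible triangles
}

sapply(words, curvature, A = A)
# bias and sex score 2/3 = 0.67; circumstance and beauty score 1.00
```

The weighted version replaces these 0/1 entries with the FNWD-based weights, as set out in the full algorithm given next (see also note 10).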
The full algorithm we use for weighted curvature10 is as follows. As
noted, it incorporates FNWD. It also allows for directionality. In non-
directional weighted calculations this aspect is simply ignored in the calcu-
lation, but having it there enables the use of a single algorithm for a range
of different specific analyses.
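In the notation glossed in note 10, the definition we follow, Fagiolo's (2007) weighted, directed clustering coefficient, can be written as

$$\tilde{C}_i^{D} \;=\; \frac{\tilde{t}_i^{D}}{T_i^{D}} \;=\; \frac{\bigl[\,W^{[1/3]} + (W^{T})^{[1/3]}\bigr]^{3}_{ii}}{2\,\bigl[\,d_i^{tot}\,(d_i^{tot}-1) - 2\,d_i^{\leftrightarrow}\bigr]}$$

where $W^{[1/3]}$ is the matrix obtained by raising each weight to the power 1/3, $d_i^{tot}$ is the total (binary) degree of node $i$, and $d_i^{\leftrightarrow}$ is the number of bilateral edges incident to $i$.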

Weighted curvature gives us a measure of the relative 'fussiness' versus
'promiscuity' of the co-occurring words. The lower the weighted curvature,
the more 'fussy' (i.e. 'particular') words are in their location in the vicinity
of the focus word. Of course, in our very small example, it is hardly appro-
priate to judge the promiscuity of words, just as we should not want to
judge how gregarious or shy our academic was on the basis of only whom
he walked back to his hotel with. There is nothing all that promiscuous
about a word only occurring once in a text of 45 tokens and having a score
of 1.00 relative to everything in its window. Nor is there anything particu-
larly fussy about a word occurring twice in non-identical strings and
thereby being found alongside words that do not themselves co-occur.
However, it should be evident that, in much larger samples, more examples
of a focus word are processed and more and more potential collocates are
added, so that a finer-grained picture gradually develops about how the
words behave relative to each other. In addition, as noted, curvature values
are calculated using FNWD and are thus distance-weighted and sensitive to
frequency co-occurrence. As a result, the values reflect the entire network
space, so that they encode a considerable amount of detail about the con-
nections to, from and beyond the node. This is the equivalent of building up
a complex picture of our academic's interactions from many different en-
counters that includes information about how often the same people turn up
together and how directly he engages with them.
Now we come to the visual representation of the relationships. Those
words which turn up in the vicinity of the focus word irrespective of the
context (i.e. the other words that co-occur with them are different each
time) have a low weighted curvature value. In a drawn network, such words
can be placed at the centre, close to the focus word (figure 4). The network
in figure 4 nicely illustrates that it is based on two strings, and that the
strings have four words in common. With larger texts, of course, the rela-
tionship back to the original will be obscured by the greater number of oc-
currences of words, and the patterns in the network, including the items
closest to the centre, will indicate distributions not so easily observed in the
original text(s).

Figure 4. The network for the Austen strings (figure annotation: words which are close to the focus in both contexts have a lower curvature value)

Network models, as described so far, could be used simply to explore the
straightforward patterns of co-occurrence customarily examined by corpus
linguists - what we shall refer to as 'first-order collocations'. The informa-
tion generated is not identical to that provided by any other measure - see
Evert (2005: 162) for a detailed comparison of measures, with some sur-
prising results, demonstrating that "collocation extraction is not a solved
problem" (Evert 2005: 165) - but the items this method identifies as the
strongest collocates are certainly compatible with those from other ap-
proaches. In a trial comparison of the outputs from FNWD and weighted
curvature with those from six other methods (T-score, Mutual Information
[MI], Cubic association index, Log-likelihood, Sensitivity and Salience), 38
of the total 66 types produced by the two network calculations were also
produced by one or more of the statistical approaches. In particular, there
was an overlap of three-fifths between the outputs of frequency normalized
weighted distance and T-score. Overall, the network methods identify plau-
sible collocates, and could therefore usefully supplement existing statistical
methods, though they would probably not be viewed as suitable replace-
ments.
Overall, the network model remains a rather inefficient way to explore
first-order collocation, because in order to calculate the co-occurrence of a
focus word with its various neighbours, the network holds in the com-
puter's memory all the co-occurrence information of all of the neighbours,
not simply in relation to the focal word but in relation to all the other
words in the neighbourhood too. However, when we explore second-order
relationships, this large amount of information becomes a valuable re-
source, and it is this opportunity that we demonstrate now. More specifi-
cally, we exploit the weighted curvature calculation to establish a sensitive
measure of the activity of all items in the vicinity of not just the target item
but also an item that may influence its behaviour.

4. Case study: the extraction of second-order collocations from a corpus

4.1. Relationships and meanings

We used our analogy earlier to illustrate the nature of second-order colloca-
tions: they are a way of understanding how the presence of one item affects
the 'behaviour', thence meaning, of another. But how are we to ground our
interpretation of the effect of an influencing item? It must be recognized
that comparison is inherent: we aim to evaluate the effect of the presence
versus the absence of a given influence or, arguably more plausibly, the
presence of one influence rather than one or more others. But how the dif-
ference between two compared patterns is interpreted depends on our re-
search focus.
Those with a primarily semantic focus may comment on how the find-
ings supplement the information in formal dictionary entries (Sinclair 2004:
133-136). That is, a full view of the meaning of an item is developed from
observing its behaviour in many different texts - by analyzing large cor-
pora. Meanwhile, for cognitive linguists, the meaning of an item, within
and outside a given context, is construed in terms of the individual's
knowledge. Knowledge is based on experience, which can include not only
previously seen texts but also the use of dictionaries. The cognitive view
aims to capture the process by which a reader turns observation into experi-
ence by relating it to previous knowledge.
The most significant contrast between the two approaches resides in the
inevitability of the cognitive view modelling each individual's knowledge
as unique, since it is dependent on his or her particular experiences. Where
corpus linguists would hope, by virtue of using a large enough collection of
material, to capture information about an item in 'the language', blurring
the differences arising from the subset of texts that any given individual has
encountered, the cognitive linguist must inevitably view the mental experi-
ence of a language as fundamentally discrete to each user, with the effec-
tiveness of communication entirely contingent on overlaps in experience
and, consequently, knowledge.
We find it helpful never to lose sight of the reader as interpreter of the
text. For one of the benefits of second-order collocation analysis may be the
opportunity to understand more about what a reader's 'intuition' is when
extracting meaning. In this way, we may be able to progress somewhat the
convergence of the two approaches, leading to a better understanding of the
relationship between patterns in the corpus and the role of the individual in
constructing and sustaining the platform of shared meaning that underpins
effective communication.

4.2. Understanding ORDER

We turn now to an illustration of how the weighted curvature calculation,
applied to a larger corpus, can help examine the effect of two potential in-
fluencing collocates of a target item. Even here, in the interests of clarity,
we have elected to demonstrate a rather less subtle case than our approach
might ultimately be most useful for. We will show how the lexical item
ORDER collocates differently when in the context of (a) the lexical item SO-
CIAL and (b) any mathematical formula (collectively tagged in the corpus
as <FORMULAE>). ORDER in the sense of 'social order' and ORDER in its
technical mathematical senses - which include 'a partially ordered set' -
would be fairly easily distinguished by the reader: but how? The reader
would pick up clues from the topic matter and the surrounding text and use
them to infer the likely intended sense of ORDER. Just how subtle would the
inferences need to be, though, if the text was, say, a technical mathematical
treatment of questionnaire data relating to perceptions of social order?
Second-order collocations entail exploring the interaction of the target
item and its potential influences.11 The process occurs in two stages. We
home in first on the behaviour of the influencing word, and look for occur-
rences of the target word within the same window - this being the equiva-
lent of figuring out where the Department Chair is and then homing in on
our target academic's behaviour.
Our corpus for this demonstration is the recently-published British Aca-
demic Written English (BAWE) corpus.12 The BAWE corpus comprises
2,897 student assignments from three universities across the three years of
undergraduate education and Master's level. Assignments cover 35 differ-
ent disciplines and 13 different genre families. BAWE contains 6,506,995
tokens and approximately 86,670 types.
As noted above, the two contextualizing items selected for this analysis
are SOCIAL and the tag <FORMULA>. This tag replaces mathematical or
symbolic formulae in the BAWE corpus. Searching for the tag enables us to
examine text containing a range of different specific formulae, all of which
might be viewed as a clear signal of mathematical discourse. The
neighbourhoods for these two items have the advantage of being very
nearly the same size, which assists in the illustration of the method. Apply-
ing a ±4 window and a threshold of more than three co-occurrences, there
are 2,835 co-occurrence candidates for SOCIAL and 2,854 for <FORMULA>.
In both cases ORDER is one of the items. Since we are using FNWD only as
a means of calculating weighted curvature, we need not review the FNWD
results in their own right.
Table 2 presents the values of the top twenty-four collocates of SOCIAL
according to weighted curvature. Table 3 presents the matching information
forthe collocates of <FORMULA>.
Table 2. Top 24 collocates of SOCIAL in BAWE, according to weighted curvature (×1000)

economic political class cultural people life order world
1.326 1.471 1.621 1.684 1.705 1.729 1.765 1.782

society structure women change behaviour groups relations important
1.855 1.862 1.876 1.877 1.930 1.961 2.030 2.036

context status interaction group research terms time individual
2.065 2.066 2.074 2.078 2.088 2.095 2.184 2.187

Table 3. Top 24 collocates of <FORMULA> in BAWE, according to weighted curvature (×1000)

equation values form calculated function set number case
0.917 1.122 1.127 1.194 1.217 1.218 1.262 1.292

defined shown time results constant order result theorem
1.297 1.317 1.363 1.396 1.410 1.410 1.464 1.494

equal model formula rate point factor figure unit
1.495 1.546 1.575 1.596 1.606 1.689 1.691 1.705

The curvature figures for SOCIAL (table 2) tell us that within the environ-
ment of SOCIAL, the item that itself co-occurs most often with other words
in SOCIAL'S window is ECONOMIC. It is a 'well-connected' item. In terms of
the analogy, it represents a person who, when in the company of the De-
partment Chair, holds the greatest number of conversations of his own with
the others present. Recall that it does not tell us what that person does when
not in the Department Chair's company (this would entail a separate analy-
sis). However, it is not ECONOMIC that we are interested in, but ORDER, the
seventh most well-connected item according to the algorithm we have
used.13 Meanwhile, we see from table 3 that the most well-connected item
in the windows around <FORMULA> is EQUATION, while ORDER is the four-
teenth.
The next stage entails shifting the focus of our attention from the influ-
encing word to the second-order item, ORDER, our true target. We do this in
the context of the network graph that reflects the weighted curvature val-
ues. As noted earlier, an important aspect of creating network graphs from
the values is setting thresholds for what is visually represented, so as to en-
sure the graph can be easily read and interpreted. To this end, the values for
weighted curvature are set at a level that generates the desired amount of
detail - that is, so as to feature the desired number of nodes. Setting the
values is an empirical and non-trivial matter, but rather than explaining it
here, we simply apply the principle. Selecting 80 as a practical number of
nodes to have in the visual representation, the weighted curvature values
are progressively adjusted until the desired number of candidates is gener-
ated. It should be borne in mind that this is simply a device for keeping the
diagrammatic aspects clear, and is not a real constraint of the model.
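One way to realize this node-count threshold, assuming (as tables 2 and 3 suggest) that smaller weighted curvature values rank higher, is simply to take the cut-off that admits the desired number of candidates; the helper below is ours, not part of the reported pipeline:

```r
# Keep the n best-connected candidates by lowering the weighted-curvature
# cut-off until n nodes pass it; ties at the cut-off can let in an extra item,
# as with the 81 neighbours reported below for ORDER2 (<FORMULA>1).
select_nodes <- function(curv, n = 80) {
  cutoff <- sort(curv)[n]          # the nth smallest weighted curvature value
  names(curv)[curv <= cutoff]
}
```

Here curv would be the named vector of weighted curvature values for the co-occurrence candidates of the focus word.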
When the networks are first generated, they centre around their focus
word, here, SOCIAL in one case and <FORMULA> in the other. We will ex-
emplify the next steps using the SOCIAL network. Figure 6 represents the 80
or so words most attracted to SOCIAL in the BAWE corpus, where 'at-
tracted' is defined in terms of proximity to it within the ±4 window and
frequency of appearance, as described earlier.

Figure 6. Collocates of SOCIAL according to weighted curvature

Next, the focus moves, so that ORDER is itself examined, within the same
graph. This is like turning to examine the interactions of the target aca-
demic in just those instances where one of the influencing other people is
present. What we do next is rather like asking the people present to gather
around the target academic at a distance representing their relationship with
him. The procedures are carried out in R (R Development Core Team 2008)
using igraph (Csardi 2008) to build the images. R permits us to manipulate
the graph to sort the items according to their relative attraction to ORDER,
even though the graph was constructed on the basis of the behaviour of SO-
CIAL. An easy way to conceptualize the process is to think of the nodes
(lexical items) as balloons tied together with string. If we drag the balloon
representing ORDER down to the bottom right hand corner, it pulls with it
all the balloons that are directly connected to it (figure 7). In this way, we
separate out the collocates of SOCIAL into those that are and are not also
collocates of ORDER. We refer to the collocates of ORDER as second-order
collocates of the focus ORDER with respect to SOCIAL, notated as ORDER2
(SOCIAL1).

Figure 7. A network representation of collocations for SOCIAL (first-order) and ORDER (second-order)
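The mechanics of this step can be sketched in R with igraph. This is our illustration of the principle, not the code behind the figures, and it assumes a weighted co-occurrence matrix W built as in the earlier sketch but from the BAWE windows around the influencing word (the FNWD and curvature thresholding steps are omitted):

```r
library(igraph)

# First-order network around the influencing word (here SOCIAL)
g <- graph.adjacency(W, mode = "undirected", weighted = TRUE)
V(g)$name <- rownames(W)

# Second-order step: separate the collocates of SOCIAL into those that are
# and are not also linked to ORDER (the 'dragged balloon' and the rest);
# "order" is assumed to be one of the row names of W
order_links <- names(which(W["order", ] > 0))
others      <- setdiff(rownames(W), c("order", order_links))

# Shade the two groups so the separation is visible in the drawn network
V(g)$color <- ifelse(V(g)$name %in% order_links, "grey50", "grey85")
plot(g, layout = layout.fruchterman.reingold(g))
```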
The same procedure is then carried out for ORDER2 (<FORMULA>1) (figure 8).

Figure 8. A network representation of collocations for <FORMULA> (first-order) and ORDER (second-order)

Note that it is possible to impose different thresholds for the second-order
analysis from those applied for the first-order one, since all the information
is mathematically encoded, and the thresholds only filter the visual repre-
sentation. This means we can, theoretically at least (i.e. if there are enough
connections), home in on ORDER and generate a new network with 80 items
if we wish. However, here we shall work at the same resolution as hitherto.
Figures 9 and 10 re-manipulate the image to locate ORDER at the centre of
the respective networks, enabling us to visualize the relative co-occurrence
behaviour of other words that, like ORDER itself, are attractive collocates of
the respective first-order focus, SOCIAL or <FORMULA>.14

Figure 9. Network for ORDER2 (SOCIAL1) in BAWE

Of the 80 neighbours of ORDER2 (SOCIAL1) and 81 of ORDER2 (<FORMULA>1) (as there is a tie for final place), seven are common to both:
CHANGE, LAW, FORM, GROUP, TERMS, PROCESS and TIME. The remainder
can be viewed as likely to contribute to the process by which a reader as-
signs appropriate meaning to ORDER in its context of use.
Exploring an item's behaviour in a range of different influencing con-
texts offers the opportunity to explore the space it inhabits in text more
generally, and how sectors of the space are determined by the words occur-
ring with it. The trick, perhaps, is to adopt Sinclair's (2004) stance in ask-
ing to what extent the semantic definition of a given word has a solid ver-
sus a porous boundary.
Figure 10. Network for ORDER2 (<FORMULA>1) in BAWE

5. Applications

To what uses, then, might this method be put in linguistic analysis? There is
relatively little impediment to learning the very simple procedures involved
in calculating weighted curvature values and generating network graphs.
The procedures involve pasting a short stretch of code into R, and import-
ing the desired corpus, which does not need to be stop-listed, and which can
be either tagged or not tagged. The instruction to analyze particular target
words is achieved by simply typing the word into the specified space in the
code ... and waiting. The main constraint on a PC is in not selecting items
that are too frequent, or working on a corpus that is too large. On more
powerful machines, more, of course, is possible (see Conclusion).
There are some linguists who will be contented simply to explore the
potential of the method to reveal interesting patterns. Others, however, will
have a clear research question in mind, and we list a small number here as
examples.

5.1. Shades of meaning

The ORDER illustration demonstrates how second-order relationships can
assist in separating out different senses of the same word. Of course, with
such clear examples, other methods are also possible, such as pre-sorting
one's texts. But homography, polysemy and even subtler shades of mean-
ing seem to be on a continuum that defies easy definition in dictionaries.
This may be because, to varying degrees, words suck up the finer aspects of
meaning from the words around them - and indeed not just the words
around them but the words around the words around them, and so on. Our
method may be able to assist in demonstrating the nature and extent of dif-
ferences in meaning in the same form. To give one example, WELL seems
to be a word with a subtle shade of meaning in the context of mental health.
It seems to mean something similar to IN REMISSION in the context of can-
cer. Stuart, in the extract below, is speaking about the failure of three psy-
chiatrists to identify him as having bipolar syndrome. The context was a
television experiment in which five people with mental health problems
and five people without were tested and observed over several days, to see
if the psychiatrists could achieve correct identifications and diagnoses.15
Stuart: It just shows how well my recovery's gone and how far I've come.
'Cos I think maybe a year ago or so they would have picked up on
it because I was displaying a lot of symptoms. So for me it's just I
think a testament to the fact that I am well. I'm well. I'm not ill at
this moment in time. (BBC 2008)
This meaning of WELL is not radically different from that in 'are you well?'
and 'when you get well', because the contrast with 'not ill' is consistent.
However, at a fine-grained level there is a nuance of special meaning, so
how is that being determined? Is it because WELL contrasts with the state of
having a particular illness rather than any illness? Do we sense something
about the co-occurrence relationships - perhaps that WELL is, unusually
here, something that you are rather than something that you are not? This
word, we think, merits attention, and may benefit from the opportunities
afforded by second-order collocate analyses.
5.2. Stylistics

Research on style, including genre and authorship, might benefit from sec-
ond-order collocation, to capture aspects of how an effect is created
through language not so much through the selection of lexical items that are
themselves all that distinctive, but through the 'circle of friends' found to-
gether. For example, how might an advertising company manipulate public
perceptions of a product, company or political party by creating texts that
reveal no biases in their first-order collocations, but convey subtle positive
overtones through their second-order associations?
How does an author or playwright succeed in presenting characters dif-
ferently through descriptions of them, or through the words they use? Char-
acters in plays by Harold Pinter or Alan Bennett might be interesting to
track, by examining how certain target words are affected by other words in
one way when from the mouth of character A and another from the mouth
of character B.
Those who engage in authorship analysis might see potential to probe
deeper into the individual patterns seen in a person's writing, at a level
unlikely to be open to conscious control even when attempting to obscure
identity - for the curvature patterns may be interpreted as expressing the
composite knowledge of the writer about the meanings of words, with no
awareness that, at this level, the word's meaning and the author's personal
style are very closely entwined.

5.3. Critical Discourse Analysis

We are all familiar with the observation that one text's GUERILLA or TER-
RORIST is another text's FREEDOM FIGHTER, but just how subtle might the
secondary co-occurrence patterns of sets of words like this be? How, for
example, are the collocates of BOMB different, according to the description
used of the bombers? Would the description of a bomb as POWERFUL rather
than DEVASTATING be under the influence of the network of co-occurring
words that weave the subtle context leading to a 'preferred' interpretation?
How does political correctness impact on the wider network of colloca-
tions? What happens to the rest of the text when coffee is referred to as
'with milk/cream' rather than 'white'? That is, if we assume some general
network pattern for a set of words, what is the impact of removing one of
them? Can another one simply move into its place, or are all the other con-
nections too different for that to work? What more subtle levels of social
discrimination might be encoded in the differences between the second-


order collocations of DISABILITY or IMPAIRMENT in relation to LEARNING,
VISUAL, HEARING and MOBILITY?

5.4. Linguistic competence

The spoken or written expression of some highly proficient users of a for-
eign language consistently demonstrates their non-nativeness in very subtle
ways that cannot be laid at the door of error. What is it, exactly, that they
have not mastered? Our approach models the relationships between words
in terms of networks of associations that are subtle to the point of intangi-
bility, and that require readers or hearers to interpret meaning on the basis
of the interaction between the local collocation activity and past experi-
ences of other such activity. Glimpsing this additional dimension of infor-
mation, hidden in most analyses, offers scope to explore the nature and
provenance of linguistic intuition, and to ascertain what might be missing
in the comprehension and production of outwardly fully competent non-
native language users.

6. Conclusion

The method we have presented here promises, we believe, to offer new in-
sights into the nature of patterns in language, in the spirit of Sinclair's in-
terest in how collocation events relate to each other (Sinclair, Jones and
Daley 2004: xxii). We believe there are more layers to be uncovered in lan-
guage than current approaches in corpus linguistics easily reveal. They are
layers that the native speaker reader or hearer knows about and takes into
account when engaging with text. Each new foray into the exploration of
text patterns may bring us another step closer to modelling computationally
the essence of linguistic intuition. Second-order collocation patterns, as
described in this paper, may be central to such new developments for they
do not simply divide up the known world in new ways, but uncover infor-
mation of a different order. What second-order collocation measurements
reveal is something about a word's location in cognitive space - space that
is determined not only by the behaviour of the focus word, nor only of its
collocates, but also of its collocates' collocates. As the network image use-
fully reminds us, when everything is joined together, a movement in one
place has an impact on everything else.
Two final remarks should be made. The first regards the extent to which
information about secondary collocates is already available by other means.
In WordSmith Tools,16 it is possible to create a concordance for all occur-
rences of word A which have word B within a specified distance, such as
five words. Having created this 'sub-corpus', it is possible to explore the
collocations of word B. Where word B has more than one meaning, its col-
locates can be contrasted by selecting different contextualising words as A.
For example, one could explore the collocates of BANK (word B) in the
context of, separately, MONEY and RIVER as word A. However, this Word-
Smith procedure is much shallower than the network method. The collo-
cates of B are simply those that occur in the new sub-corpus of concor-
dances of A. They are computed locally and separately from the computa-
tions for A, whereas the network computes the information about B at the
same time as, and in relation to, A - and indeed all the other words that co-
occur. To put it another way, while the network method offers deeper views
into the total space of the collocates, the WordSmith approach maps only a sur-
face view and does not provide the analyst with the rich mathematical in-
formation that underpins the network relationships.
A more promising approach to second-order collocation is developed in
Collier, Pacey and Renouf (1998), though the relationships identified there
are arguably closer to 'mutual collocations', because they capture informa-
tion about collocates shared between two lexical items rather than the inter-
relationships of all lexical items relative to a second-order focus. Useful
features of Collier et al.'s method might fruitfully be combined with those
of our own method in future research.
The second observation that should be made here regards the ambitious-
ness of a methodology that is computationally so demanding. Not only
have we laid out, as a baseline for network research, computations that take
some time to complete on a PC. We have ventured to imply in some places
that much larger-scale computations might be interesting to carry out. Al-
though the personal computer continues to grow in memory and in proces-
sor size, processing speeds have changed little in recent years. It may there-
fore seem pointless to lay out a research agenda too powerful to be under-
taken. However, new technological advances have shifted attention from
processing speed to overall processing potential through the creation of vast
parallel systems. High end computing operations line up thousands of pro-
cessors to operate together, sharing out time-costly jobs so they are com-
pleted in a fraction of the time. As a result, we have the opportunity, for the
first time, to pose a new level of questions about texts, knowing that there
is a means to answer them.
As Carter (2004: 6) points out in his introduction to Sinclair’s reissued
texts, “The landscapes of language study are changing before our eyes as a
result of the radically extended possibilities afforded by corpus and compu-
tational linguistics”. If language study is a landscape, then network anal-
yses offer us a chance to examine more than a two-dimensional map – ra-
ther, we can enter the terrain itself and explore the ways that words locate
themselves in multidimensional space.

Notes

1 Corresponding author: wraya@cf.ac.uk.


2 Acknowledgements: We are grateful for extremely helpful comments on ear-
lier versions of this paper from Gordon Tucker, Mike Stubbs, Chris Butler
and Paul Rayson. The research reported in this paper relates to a project entit-
led “Developing new analytic techniques for profiling language phenotypes in
genetic research”, funded by the Arts and Humanities Research Council, grant
number AH/E001874.
3 A paper by Collier, Pacey and Renouf (1998) also addressed ‘second-order
collocation’ and there are some very useful points of contact between our
work and theirs, though, as pointed out later (in the Conclusion), their ap-
proach may reveal, rather, ‘mutual collocation’.
4 The term ‘collocation’ customarily means ‘co-occurrence at a level greater
than chance’. Our calculations are not stochastic in the way that T-score, MI
score and so on are, and provide a different, and, we feel, usefully comprehen-
sive, picture of how words go together. As there are frequency thresholds for a
word’s inclusion in the network calculations, and thresholds based on other
features of the relationship are imposed before inclusion in the graphic repre-
sentations, we will use ‘collocation’ to refer to the relationships that display,
by those means, an association that seems to imply a semantic association.
5 The potential for network representations of language is extensively evaluated
by Mehler (2008) but with a rather different emphasis than ours.
6 It is worth a brief explanation of why this works. The table is symmetrical on
the diagonal, since, with a row entry and a column entry for each word, there
are two intersections of each pair, carrying identical values because the
connections are non-directional. Although an individual score in a cell does not
differentiate frequency from weight, there is a precise value for the total weight
assigned in the network (essentially equivalent to the sum of the weights for
each window x the number of windows examined). As a result, the frequency
neutralising calculation, carried out across the whole table, takes into account
the total number of connections to and from each node, as a share of the total
‘space’. Squaring preserves an underlying Euclidean geometry for nodes. This
feature, which we will not develop further here, provides some particularly
useful properties, notably that the inner product of rows yields the cosine of the
angle between those rows and the same inner product after centering (i.e.
subtracting the row mean from each element) and renormalization yields the
correlation coefficient (Pearson’s r) between the distributional patterns of the
nodes in question (Jackson 1924, Rodgers and Nicewander 1988 and, most
succinctly, Kundert 1980). This allows a more direct comparison between our
graph theoretical model and the more familiar vector models used in allied
fields such as Information Retrieval and also in earlier work on second-order
collocation, most notably Collier, Pacey and Renouf (1998).
7 http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words.
8 http://nlp.cs.nyu.edu/GMA_files/resources/english.stoplist.
9 The ties are a consequence of the text being too small for the scores to separa-
te further.
10 C~_i^D refers to the weighted, directed clustering coefficient of node i. W is the
weighted adjacency matrix. This value (C~_i^D) is the total number of weighted,
directed triangles centered on node i (t~_i^D), divided by the total number of such
triangles possible (T_i^D). W^[1/3] refers to the matrix formed by raising all the
weights in W individually to the power 1/3. W^T refers to the transpose of the
matrix W. The notation [X]^3_ii refers to the ith diagonal entry of the third power
of the matrix X. The two remaining terms refer to the total (binary) degree of
node i (d_i^tot) and the number of bilateral edges incident to i (d_i^<->). For a
more detailed explanation of this formula see Fagiolo (2007). We implemented the
function in R from Prof. Fagiolo's own MATLAB code, and are grateful to Prof.
Fagiolo for providing it.
11 There is a discussion to have about whether we should imply that influence
can be laid at the door of just one co-occurring lexical item (though examples
such as Sinclair’s (2004) white in the context of wine suggest it sometimes
can). It is certainly true that an analysis of the type demonstrated here is lim-
ited by searching only for co-occurrences of the target and influencing items
within a narrow window of 4 content items either side. However, provided the
computational power is available (see Conclusion) analyses are not limited to
a small window size, nor in the potential to amalgamate the outcomes of
examining several different co-occurring items for their individual and joint
effects on the patterns observed. Even then, the analysis would miss out on
directly capturing aspects of meaning that are only implied, such as the
description of Governor Pyncheon in chapter 18 of Nathaniel Hawthorne’s
The House of the Seven Gables, where the reader can only infer that he is in
fact dead. On the other hand, our analysis may have the capacity to help us
better understand just how the reader does work out that Pyncheon is dead –
how are the words interacting to permit that inference?
12 The British Academic Written English (BAWE) corpus was developed at the
Universities of Warwick, Reading and Oxford Brookes under the directorship
of Hilary Nesi and Sheena Gardner (formerly of the Centre for Applied Lin-
guistics/CELTE, Warwick), Paul Thompson (formerly of the Department of
Applied Linguistics, Reading) and Paul Wickens (Westminster Institute of
Education, Oxford Brookes), with funding from the ESRC (RES-000-23-
0800). The corpus is freely available via the Oxford Text Archive (resource
number 2539 http://ota.ahds.ac.uk/headers/2539.xml).
13 The algorithm can be modified according to the importance the research ques-
tion places on proximity. For instance, exploring the occurrence of preposi-
tions at the end of certain multiword expressions would place less importance
on the same preposition occurring later in the window. Conversely, research
into the stylistic effects of lexical repetition would perhaps widen the window
and score distant co-occurrences more evenly with close ones than our algo-
rithm has.
14 This re-orientation results in different items appearing most proximal. This is
a function of the underlying additional nodes, suppressed on the graph. The
true picture is the one obtained from the values that generate the visual repre-
sentation.
15 They largely failed.
16 We are grateful to Chris Butler for pointing this out.

References

Abdel Rahman, Rasha and Alissa Melinger
2007 When bees hamper the production of honey: Lexical interference
from associates in speech production. Journal of Experimental Psy-
chology: Learning, Memory and Cognition 33 (3): 604-614.
Antiqueira, Lucas, Maria das Graças V. Nunes, Osvaldo N. Oliveira and Luciano
da F. Costa
2007 Strong correlations between text quality and complex networks fea-
tures. Physica A 373: 811-820.
BBC Television
2008 Horizon: How mad are you? Part 2, first broadcast 18th Nov 2008,
on BBC 2.
Bordag, Stefan
2003 Sentence co-occurrences as small-world graphs: A solution to auto-
matic lexical disambiguation. Conference on Intelligent Text Pro-
cessing and Computational Linguistics 2003: 329-332.
Caldeira, Silvia M. G., Thierry C. Petit Lobao, R. F. S. Andrade, Alexis Neme and
J. G. V. Miranda
2006 The network of concepts in written texts. European Physical Journal
B 49: 523-529.
Carter, Ronald
2004 Introduction. In Trust the text, John McH. Sinclair and Ronald Carter
(eds.), 1-6. London/New York: Routledge.
Collier, Alex, Mike Pacey and Antoinette Renouf
1998 Refining the automatic identification of conceptual relations in large-
scale corpora. Proceedings of the Sixth Workshop on Very Large
Corpora. Association for Computational Linguistics.
http://www.aclweb.org/anthology-new/W/W98/W98-1109.pdf
Csardi, Gabor
2008 igraph 0.5.1. Url: http://cneurocvs.rmki.kfki.hu/igraph/index.html.
Evert, Stefan
2005 The Statistics of Word Cooccurrences: Word Pairs and Word Collo-
cations. Ph.D. thesis, Universität Stuttgart: Institut für maschinelle
Sprachverarbeitung. http://elib.uni-stuttgart.de/opus/volltexte/2005/
2371/pdf/Evert2005phd.pdf
Fagiolo, Giorgio
2007 Clustering in complex directed networks. Physical Review E 76:
026107.
Ferrer i Cancho, Ramon
2004 The Euclidean distance between syntactically linked words. Physical
Review E 70: 056135.
Ferrer i Cancho, Ramon
2005 The structure of syntactic dependency networks: Insights from recent
advances in network theory. In The Problems of Quantitative Lin-
guistics, Gabriel Altmann, Viktor Levicky and Valentina Perebyinis
(eds.), 60-75. Chernivtsi: Ruta.
Ferrer i Cancho, Ramon
2007 Why do syntactic links not cross? Europhysics Letters 76 (6): 1228-
1234.
Ferrer i Cancho, Ramon, Andrea Capocci and Guido Caldarelli
2007 Spectral methods cluster words of the same class in a syntactic de-
pendency network. International Journal of Bifurcation and Chaos
17 (7): 2453-2463.
Ferrer i Cancho, Ramon and Richard V. Sole
2001 The small-world of human language. Proceedings of the Royal Soci-
ety of London Series B 268: 2261-2266.
Ferrer i Cancho, Ramon, Richard V. Sole and Reinhard Kohler
2004 Patterns in syntactic dependency networks. Physical Review E 69:
051915.
Ferret, Olivier
2002 Using collocations for topic segmentation and link detection. Pro-
ceedings of COLING 2002: The 19th International Conference on
Computational Linguistics, Taipei, Taiwan.
http://www.aclweb.org/anthology/C/C02/C02-1033.pdf.
Firth, John Rupert
1957 A synopsis of linguistic theory: 1930-1955. In Studies in Linguistic
Analysis, special volume of the Philological Society, John Rupert
Firth et al., 1-32. Oxford: Blackwell.
Holanda, Adriano de Jesus, Ivan Torres Pisa, Osame Kinouchi, Alexandre Souto
Martinez and Evandro Eduardo Seron Ruiz
2004 Thesaurus as a complex network. Physica A 344: 530-536.
Jackson, Dunham
1924 The trigonometry of correlation. The American Mathematical
Monthly 31 (6): 275-280.
Kinouchi, Osame, Alexandre Souto Martinez, Gilson Francisco Lima, G. M.
Lourenço and Sebastian Risau-Gusman
2002 Deterministic walks in random networks: An application to thesaurus
graphs. Physica A 315: 665-680.
Krishnamurthy, Ramesh
2006 Collocations. In Encyclopedia of Language and Linguistics, vol. 2 (2nd
edition), K. Brown (ed.), 596-600. Oxford: Elsevier.
Kundert, K. R.
1980 Correlation - A vector approach. The Two-Year College Mathematics
Journal 11 (1): 52.
Li, Jianyu and Jie Zhou
2007 Chinese character structure analysis based on complex networks.
Physica A 380: 629-638.
Liu, Haitao
2008 The complexity of Chinese dependency syntactic networks. Physica
A 387: 3048-3058.
Magnusson, Camilla and Hannu Vanharanta
2003 Visualizing sequences of texts using collocational networks. In Ma-
chine Learning and Data Mining in Pattern Recognition: Papers
from the 3rd International MLDM Conference, Leipzig, Petra Perner
and Azriel Rosenfeld (eds.), 276-283. Berlin: Springer Verlag.
Masucci, A. P. and G. J. Rodgers
2006 Network properties of written human language. Physical Review E
74: 026102.
Meara, Paul M.
2007 Simulating word associations in an L2: The effects of structural
complexity. Language Forum 33 (2): 13-31.
Meara, Paul M. and Ellen Schur
2002 Random word association networks: A baseline measure of lexical
complexity. In Unity and Diversity in Language Use, Kristyan
Spelman Miller and Paul Thompson (eds.), 169-182. London: Con-
tinuum.
Mehler, Alexander
2008 Large text networks as an object of corpus linguistic studies. In Cor-
pus Linguistics: An International Handbook, vol. 1, Anke Lüdeling
and Merja Kytö (eds.), 328-382. Berlin: Mouton de Gruyter.
Milroy, Lesley
1987 Language and Social Networks (2nd edition). Oxford: Blackwell.
Motter, Adilson E., Alessandro P. S. de Moura, Ying-Cheng Lai and Partha Das-
gupta
2002 Topology of the conceptual network of language. Physical Review E
65:065102.
Park, Young C. and Key-Sun Choi
1999 Automatic thesaurus construction using Bayesian networks. Informa-
tion Processing and Management 32 (5): 543-553.
R Development Core Team
2008 R: A language and environment for statistical computing. R Founda-
tion for Statistical Computing, Vienna, Austria. http://www.R-
project.org.
Rodgers, Joseph Lee and W. Alan Nicewander
1988 Thirteen ways to look at the correlation coefficient. The American
Statistician 42(1): 59-66.
Schur, Ellen
2007 Insights into the structure of L1 and L2 vocabulary networks: Intima-
tions of small worlds. In Modelling and Assessing Vocabulary
Knowledge, Helmut Daller and James Milton (eds.), 182-203. Cam-
bridge: Cambridge University Press.
Sigman, Mariano and Guillermo A. Cecchi
2002 Global organization of the Wordnet lexicon. Proceedings of the Na-
tional Academy of Sciences of the United States of America 99 (3):
1742-1747.
Sinclair, John McH.
2004 Trust the Text. London: Routledge.
Sinclair, John McH., Susan Jones and Robert Daley
2004 English Collocation Studies: The OSTI Report. London/New York:
Continuum. First published 1970.
Soares, Marcio Medeiros, Gilberto Corso and Liacir dos Santos Lucena
2005 The network of syllables in Portuguese. Physica A 355 (2-4): 678-
684.
Sole, Richard V., Bernat Corominas Murtra, Sergi Valverde and Luc Steels
2005 Language Networks: Their Structure, Function and Evolution. Tech-
nical Report 05-12-042, Santa Fe Institute Working Paper.
http://www.santafe.edu/research/publications/workingpapers/05-12-
042.pdf
Steyvers, Mark and Joshua B. Tenenbaum
2005 The large-scale structure of semantic networks: Statistical analyses
and a model of semantic growth. Cognitive Science 29: 41-78.
Stubbs, Michael
2001 Words and Phrases. Oxford: Blackwell.
Vitevitch, Michael S.
2008 What can graph theory tell us about word learning and lexical re-
trieval? Journal of Speech Language Hearing Research 51: 408-
422.
Wilks, Clarissa and Paul M. Meara
2002 Untangling word webs: Graph theory and the notion of density in
second language word association networks. Second Language Re-
search 18(4): 303-324.
Wilks, Clarissa and Paul M. Meara
2007 Implementing graph theory approaches to the exploration of density
and structure in L1 and L2 word association networks. In Modelling
and Assessing Vocabulary Knowledge, Helmut Daller, James Milton
and Jeanine Treffers-Daller (eds.), 167-182. Cambridge: Cambridge
University Press.
Wilks, Clarissa, Paul M. Meara and Brent Wolter
2005 A further note on simulating word association behaviour in a second
language. Second Language Research 21: 1-14.
Zhou, Shuigeng, Guobiao Hu, Zhongzhi Zhang and Jihong Guan
2008 An empirical study of Chinese language networks. Physica A 387:
3039-3047.
Zlatic, Vinko, Miran Bozicevic, Hrvoje Stefancic and Mladen Domazet
2006 Wikipedias: Collaborative web-based encyclopedias as complex
networks. Physical Review E 74: 016115.
From phraseology to pedagogy: challenges and prospects

Sylviane Granger

1. Introduction

John Sinclair's model of language is firmly centred on a contextual ap-
proach to meaning capable of accounting for "the contextually sensitive
relationships that are contracted in actual text" (Sinclair 2004a: 278). This
approach stands in sharp contrast to semantic models that view meaning as
inherent in words, a view which Sinclair considers incapable of doing jus-
tice to the richness of meaning: "The subtlety and flexibility of meaning
that is so characteristic of its everyday use is regularised and sanitised to
make the words stable nodes in a network that is far removed from their
textual origins" (Sinclair 2004a: 278). By contrast, in Sinclair's contextual
approach to meaning, "A word may have many potential meanings, but its
actual meaning in any authentic written or spoken text is determined by its
context: its collocations, structural patterns and pragmatic functions"
(Krishnamurthy 2002).
The two major corollaries of Sinclair's contextual approach - the inter-
dependence of lexis and grammar and the idiom principle - constitute a
major challenge for linguistic theory, language description and all language
applications, but in no field is the challenge more acute than in foreign lan-
guage learning and teaching. Both Sinclair's contextual approach and
Hoey's (2005) theory of "lexical priming", which builds upon it, see each
word form as having its own phraseology, viz. its preferred collocations,
colligations, semantic prosody, syntactic positioning, etc. This fine-grained
approach is extremely useful for explaining learners' difficulties, as learn-
ers - even advanced ones - are susceptible to getting things wrong at any of
these levels. However, it also lays a heavy burden on teachers who usually
have limited time to teach a syllabus where vocabulary is measured in
breadth as well as (and in some cases, more than) depth of knowledge. The
gap between the fine-grained corpus-driven analysis of words (cf e.g. the
analysis of the phrasal verb set in in Sinclair 1991: 70-75 or the verb main-
tain in Hunston 2002: 139-140) and the reality of the classroom is
dauntingly wide and has so far not been given the attention it deserves.
Lewis's (1993, 2000, [1997] 2002) Lexical Approach has admittedly
opened up exciting new avenues for pedagogical implementation and cre-
ated an upsurge of interest in lexical approaches to teaching. However, the
diverging interpretations of the very concept of lexical approach and the
very forceful pronouncements found in the literature are liable to create
confusion in the minds of teachers and materials designers and may even
end up being less - rather than more - efficient in learning terms. This
chapter is an effort to reconcile Sinclair's contextual approach and the reali-
ties of the teaching/learning environment. In section 2, I start by defining
the lexical approach and circumscribing its scope before highlighting what
I view as its major strengths and weaknesses (section 3). Section 4 de-
scribes three major challenges to the pedagogical implementation of the
lexical approach - methodology, terminology and selection - and focuses
more particularly on the potential contribution of learner corpora. Section 5
draws together the threads of the discussion and offers pointers for future
research.

2. Lexical approach to teaching: definition and scope

Although Lewis's Lexical Approach shares many fundamental principles
with communicative approaches to language teaching, it differs from them
in several respects. In Lewis's (1993: iv) own words, "[t]he most important
difference is the increased understanding of the nature of lexis in naturally
occurring language, and its potential contribution to language pedagogy".
His approach aims to mark a clear departure from structural grammar-based
syllabuses and shift the focus to lexis, a component which, unlike the tradi-
tional "vocabulary", gives pride of place to multi-word prefabricated
chunks. Like Sinclair, Lewis defends the decoupling of grammar and lexis,
a position which he expresses in what has become one of the most quoted
statements in applied lexical studies and which he himself presents as his
"refrain", viz. that "language consists of grammaticalised lexis, not lexical-
ised grammar" (Lewis 1993: 89).
The preceding brief outline might give the impression that the Lexical
Approach is a well-defined unified approach. Nothing could be further
from the truth. A survey of the literature shows that the term covers mark-
edly diverging realities and confirms Harwood's (2002: 139) own conclu-
sion that "'[L]exical approach' is a term bandied about by many, but, I sus-
pect, understood by few". This is probably due to the fact that Lewis (2002:
204) does not consider the Lexical Approach as "a new all-embracing
method, but a set of principles based on a new understanding of language".
His list of 20 key principles (Lewis 1993: vi-vii) contains some fairly un-
controversial principles which most proponents of lexical approaches are
likely to adhere to, but also contains some more radical statements that are
implicitly or explicitly rejected by a number of them. In this connection, it
is interesting to note that Harwood feels the need to point out that his un-
derstanding of the term "lexical approach" is not exactly the same as
Lewis's although he does not explicitly say where the differences lie.
Among the principles on which there is no general consensus are
Lewis's pronouncements on grammar. In his 1993 book, he is extremely
critical of grammar, proposing "a greatly diminished role for what is usu-
ally understood by 'grammar teaching'" (Lewis 1993: 149), underlining
"the dubious value of grammar explanations" and advising teachers to treat
them "with some scepticism" (Lewis 1993: 184). In a later publication,
however, he expresses a more qualified view, saying: "The Lexical Ap-
proach suggests the content and role of grammar in language courses needs
to be radically revised but the Approach in no way denies the value of
grammar, nor its unique role in language" (Lewis 2002: 41). The revision
of the grammar syllabus involves introducing as lexical phrases a large
number of phenomena that used to be treated as part of sentence grammar:
"The Lexical Approach implies a decreased role for sentence grammar, at
least until post-intermediate levels. In contrast, it involves an increased role
for word grammar (collocation and cognates) and text grammar (supra-
sentential features)" (Lewis 1993: 3). Although Lewis (1993: 146) cites
phenomena like the passive or reported speech as items that "could unques-
tionably be deleted", he does not give a precise list of the candidates for
shifting and the contours of the reduced sentence grammar component re-
main very hazy. In spite of that, the principle has been taken over by sev-
eral proponents of the lexical approach. Porto (1998), for example, lists the
following phenomena as candidates for a shift from grammar to lexis: first,
second and third conditionals; the passive; reported speech; the -ing form;
the past participle; will, would and going to; irregular past tense forms; and
the concept of time which "may be most efficiently presented as lexis
rather than tense". No arguments are given that justify the selection of these
phenomena and it is interesting to note that other authors make quite differ-
ent selections. Lowe (2003), for example, suggests keeping core tenses in
his slimmed-down core grammar component.
Other ELT specialists, however, have a much more moderate stance on
the grammar/lexis issue. Woolard (2000: 45), for example, does not advo-
cate a major shift from sentence grammar to word grammar but presents the
two as complementary: "A word grammar approach complements the tradi-
tional approach to grammar by directing the students' attention to the syn-
tactic constraints on the use of lexis. ... Both approaches, then, are essential
components of grammatical competence." Similarly, Harwood (2002: 148)
warns against "an iconoclastic call to abandon all grammar activities" and
simply calls "for the teaching of lexis to come higher up the agenda".
There is clearly much to be gained from a more lexical approach to
grammar. To take but one example, the strict separation of grammar and
lexis has resulted in a highly limited presentation of modality, with a near-
exclusive focus on modal auxiliaries at the expense of equally, if not more,
common lexical expressions of modality such as it is possible that, there's
a chance that, it may be necessary. However, shifts from grammar to lexis
cannot be made simply as a matter of principle. Like Lowe (2003), I be-
lieve that there is still a place for sentence grammar, a grammar core, which
provides a useful organizing principle for learners.
Although there are several versions of the lexical approach, Lewis's
model is undoubtedly the one that has been most extensively described. As
it is also the most influential, it is this model that we will take as reference
point to describe the pros and cons of a lexical approach to teaching.

3. Lexical approach: pros and cons

In this section I list what I view as the major strengths and weaknesses of
the lexical approach. By offering a non-partisan view of this approach, I
hope to contribute constructively to the debate surrounding it and help
counter-balance the somewhat over-optimistic and at times downright
dogmatic statements found in the literature.

3.1. Pros

3.1.1. Wide phraseological approach

The major advantage of the lexical approach is its close fit with contextual
models of language (Sinclair's model, construction grammar, pattern
grammar) that integrate the intertwining of lexis and grammar and give
phraseology a more central role in language than was previously the case.
The days when phraseology was viewed as a peripheral component of lan-
guage are dead and gone. Corpus-based studies have uncovered a "huge
area of syntagmatic prospection" (Sinclair 2004c), which contains a much
wider range of units than the highly fixed non-compositional units - id-
ioms, proverbs, phrasal verbs - that used to constitute the main focus of
attention. This wide view of phraseology includes "a large stock of recur-
rent word-combinations that are seldom completely fixed but can be de-
scribed as 'preferred' ways of saying things - more or less conventional-
ized building blocks that are used as convenient routines in language pro-
duction" (Altenberg 1998: 121-122). The relevance of this wide view of
phraseology for teaching is demonstrated by Nattinger and DeCarrico
(1992), who describe the essential functions of conventionalized lexical
phrases in discourse, both spoken and written, and suggest ways of incorpo-
rating them into teaching.

3.1.2. Fluency

Incorporating a wide view of phraseology into teaching comes down to giv-
ing fluency a higher priority in teaching. As pointed out by Nattinger and
DeCarrico (1992: 32), "[i]t is our ability to use lexical phrases that helps us
speak with fluency". According to Porto (1998), mastery of lexical phrases
is likely to boost motivation as it allows learners to express themselves in
the absence of rich linguistic resources: "Lexical phrases prove highly mo-
tivating by developing fluency at very early stages and thus promote a
sense of achievement". Although the impact of lexical phrases on fluency is
mostly related to speech, several authors also highlight their role in promot-
ing fluency in writing (Howarth 1999; Gilquin, Granger and Paquot 2007;
Coxhead 2008). This said, the incorporation of lexical phrases into teaching
will only be fully justified once we have a better grasp of their role in for-
eign/second language acquisition and production. Several recent psycholin-
guistic studies have begun to lift the veil on this complex issue (Schmitt
2004; Siyanova and Schmitt 2008; Conklin and Schmitt 2008) and shown
its relevance for teaching (Ellis, Simpson-Vlach and Maynard 2008).
3.1.3. Accuracy

Recognition of the difficulty in mastering the contextually appropriate use
of words goes back a long way. One of the eight assumptions of lexical
competence in Richards's seminal 1976 article deals explicitly with this
type of knowledge: "Knowing a word means knowing the degree of prob-
ability of encountering that word in speech or print. For many words we
also 'know' the sort of words most likely to be found associated with the
word" (Richards 1976: 79). Collocation is explicitly mentioned, with ex-
amples of adjectives commonly used with nouns like fruit (ripe, green,
sweet, bitter) or meat (tender, tough). Hoey (2005: 8) explains this type of
knowledge in terms of the process of lexical priming:
We can only account for collocation if we assume that every word is men-
tally primed for collocational use. As a word is acquired through encounters
with it in speech and writing, it becomes cumulatively loaded with the con-
texts and co-texts in which it is encountered, and our knowledge of it in-
cludes the fact that it co-occurs with certain other words in certain kinds of
context.
This process occurs naturally and gradually for native speakers but the
situation is quite different for non-native speakers who lack the necessary
exposure for words to be successfully primed. The difficulties encountered
by learners were highlighted by proponents of both contrastive analysis and
error analysis and numerous examples given of both interlingual and intra-
lingual errors as well as mixtures of the two, a phenomenon that Dechert
and Lennon (1989) refer to as "blends". Far from being a difficulty limited
to lower level learners, collocations have been proved to be a major diffi-
culty at advanced levels (Nesselhauf 2005). One of the advantages of the
lexical approach is that by attaching more importance to word selection, it
is likely to improve this aspect of learners' lexical accuracy more than pre-
vious non-lexical approaches (cf. Conzett 2000).

3.1.4. Ease of learning

In Sinclair's (2004a: 274) view, one of the positive outcomes of the contex-
tual approach to meaning is that it is likely to facilitate language learning:
"If a more accurate description eliminates most of the apparent ambiguities,
the language should be easier to learn because the relationship between
form and meaning will be more transparent". This idea is taken up by many
proponents of the lexical approach. Porto (1998), for example, states that
"[frequency of occurrence and context association make lexical phrases
highly memorable for learners and easy to pick up". Lewis (2000: 133)
goes further and claims that memorability is enhanced by the length of the
phrase: "The larger the chunks are which learners originally acquire, the
easier the task of re-producing natural language later". The main argument
behind this assertion is that it is easier to deconstruct a chunk than to con-
struct it: "We have already seen that learners acquire most efficiently by
learning wholes which they later break into parts, for later novel re-
assembly, rather than by learning parts and then facing a completely new
task, building those parts into wholes" (Lewis 2002: 190). On this basis, he
gives the following advice to teachers: "don't break language down too far
in the false hope of simplifying; your efforts, even if successful in the short
term, are almost certainly counterproductive in terms of long-term acquisi-
tion" (Lewis 2000: 133). Strong assertions on the ease of learning afforded
by the lexical approach abound in the literature and many - though by far
not all - sound intuitively right. However, it is important to note that at this
stage they are more like professions of faith as validation studies are very
rare. Studies like those of Tremblay et al. (2008) and Ellis, Simpson-Vlach
and Maynard (2008) that demonstrate the effect of frequency of word se-
quences on ease of acquisition and production are still quite rare. At this
stage, therefore, ease of learning cannot entirely be taken for granted.

3.2. Cons

3.2.1. Generative power

Within the lexical approach, "phrases acquired as wholes are the primary
resource by which the syntactic system is mastered" (Lewis 1993: 95). This
assertion, frequently found in the Lexical Approach literature, is based on
L1 acquisition studies which have demonstrated that children first acquire
chunks and then progressively analyze the underlying patterns and generalize
them into regular syntactic rules (Wray 2002). According to Nattinger and
DeCarrico (1992: 27), there is no reason to believe that L2 acquisition works
differently: "The research above concerns the language acquisition of children
in fairly natural learning situations. Because of infrequent studies of adult
learners in similar situations, the amount of prefabricated speech in adult ac-
quisition has never been determined. However, there is no reason to think that
adults would go about the task completely differently". Similarly, Lewis
(1993: 25), while recognizing that the question is a contentious one, argues
that "it seems more reasonable to assume that the two processes are in some
ways similar than to assume that they are totally different". In fact, there are
very good reasons for doubting that L2 acquisition functions in the same way
as child acquisition in this respect. One major reason is that L2 learners do not
usually get the amount of exposure necessary for the "unpacking" process to
take place. In her overview of findings on formulaicity in SLA, Wray (2002:
148) notes that formulaic sequences do not seem to contribute to the mastery
of grammatical forms. While lexical phrases are likely to have some genera-
tive role in L2 learning, it would be a foolhardy gamble to rely primarily on
the generative power of lexical phrases. Pulverness (2007: 182-183) is right
to point to the "risk of the so-called 'phrasebook effect', whereby lexical
items accumulate in an arbitrary way, and learners are saddled with an
ever-expanding lexicon without the generative power of a coherent struc-
tural syllabus to provide a framework within which to make use of all the
lexis they are acquiring". Lowe (2003) insists on the crucial role played by
a process akin to "cobbling together", especially amongst L2 learners: "The
less expert we are, the more makeshift is our speech". The most sensible
course, as rightly pointed out by Wray (2002: 148), is to maintain "a balance
between formulaicity and creativity".

3.2.2. Depth vs. breadth

One of the distinctive characteristics of the lexical approach is its focus on
depth rather than breadth of vocabulary knowledge. Sinclair and Renouf's
(1988: 155) "lexical syllabus", which was developed alongside the Cobuild
dictionary and can be considered as the precursor of the Lexical Approach,
"does not encourage the piecemeal acquisition of a large vocabulary, espe-
cially initially. Instead, it concentrates on making full use of the words that
the learner already has, at any particular stage". For Woolard (2000: 31),
"learning more vocabulary is not just learning new words, it is often learn-
ing familiar words in new combinations". In her highly influential article
on lexical teddy-bears, Hasselgren (1994) deplores the way that the most
frequent words in the language are learnt early in just one or two primary
meanings and subsequently neglected in the foreign language curriculum,
leaving learners largely unaware of their numerous (semi-) prefabricated
uses. Fleshing out these common words is thus a very welcome develop-
ment. But how deep can one afford to be in view of the fact that the re-
quirements of teaching programmes are often formulated in terms of
breadth rather than depth and in a context where teachers usually have a
very limited number of teaching hours at their disposal? Sinclair (2004a:
282) is well aware of "the risk of a combinatorial explosion, leading to an
unmanageable number of lexical items" and Harwood (2002: 142) expli-
citly warns against "learner overload", insisting that "implementing a lexi-
cal approach requires a delicate balancing act" between exploiting the rich-
ness of fine-grained corpus-derived descriptions and keeping the learning
load at a manageable level.

4. Implementation of the lexical approach

Successful implementation of the lexical approach requires that progress be
made on the following three fronts: (1) clear description of effective class-
room methodology; (2) design of a pedagogically-oriented terminology of
multi-word units; and (3) consideration of a range of criteria beside fre-
quency when it comes to selecting lexical phrases for teaching. I will tackle
each of these challenges in turn and describe the role that learner corpora
can play in addressing them.

4.1. Methodology

In his review of Lewis's (2000) Teaching Collocation volume, Barfield
(2001: 415) concludes that "the picture that Lewis presents is of an exciting
pedagogic challenge". What makes the challenge particularly tough is that
Lewis introduces a wide range of activities that can help teachers imple-
ment the lexical approach and describes activities that teachers are strongly
advised not to use, but "we are never presented with a comprehensive syl-
labus based around a lexical approach that Lewis does approve of" (Har-
wood 2002: 148). For Rogers (2000) "[l]exical phraseology is an approach
in search of a methodology". Teachers are left with many unanswered
questions regarding the operationalization of the approach, such as the fol-
lowing ones formulated by Rogers (2000): "Is massive memorization pos-
sible or recommended? Is prolonged immersion in an L2 environment the
only answer?" More generally, is the change advocated by Lewis a radical
or a moderate one? In this connection Lewis's writings are far from clear.
His (1993) book clearly points to the necessity of a radical change: "It is
difficult to grasp immediately the enormity of the changes implied by the
perception of lexis as central to language. It is much more radical than any
suggestion that there are a few multi-word items which have in the past
been overlooked [my emphasis]" (Lewis 1993: 104). His later (2002)
statement is laden with ambiguity on this issue as he claims both that
"[implementing the Lexical Approach in your classes does not mean a
radical upheaval" and that "[implementation may involve a radical change
of mindset, and suggest many changes in classroom procedure" (Lewis
2002: 3). The ambiguity probably comes from the fact that Lewis wants to
leave the door open for both a strong and a weak implementation of the
lexical approach, though there is little doubt that the strong version has his
preference (2002: 12-16).
In my view, the most exciting methodological contribution of the lexical
approach, in both its weak and strong versions, is its promotion of language
awareness activities. Lewis's publications contain a wealth of innovative
types of exercises which aim to make learners aware of the existence of
chunks, viz. apply to lexical phrases the type of discovery learning advo-
cated by Johns (1986) and many others after him. Numerous studies have
reported success in implementing these methods in a variety of teaching
contexts and have further extended the battery of exercise types (cf. e.g.
Woolard 2000; Conzett 2000; Kavaliauskienė and Janulevičienė 2001;
Hamilton 2001; Deveci 2004). However, here too one might speak of a
strong and a weak version. For Lewis, these methods are meant to replace
the previous teacher-led methodology: "The Lexical Approach totally re-
jects the Present-Practise-Produce paradigm advocated within the behav-
iourist learning model; it is replaced by the Observe-Hypothesise-
Experiment cyclical paradigm" (Lewis 1993: 6). For many, however, these
techniques are a complement to the battery of existing techniques. Diver-
gences are particularly strong as regards grammar. While Lewis considers
grammar to be primarily receptive (Lewis 1993: 149) and is extremely
critical of full-frontal grammar teaching, Willis (2003: 42) considers that
there is a place for explicit grammar instruction: "different aspects of the
grammar demand different learning processes and different instructional
strategies. The grammar of structure, for example, is very much rule gov-
erned and instruction can provide a lot of support for system building".

4.2. Terminology

Although phraseology has always been "a field bedevilled by the prolifera-
tion of terms and by conflicting uses of the same term" (Cowie 1998: 210),
the widening of the field spurred by Sinclair's corpus-driven approach has
further compounded the situation: "The recent interest in lexis in language
teaching has exposed an embarrassingly broad range of categories which,
while incontrovertibly linguistic entities, have no names" (Sinclair 2004a:
273). Although Sinclair is of the opinion that "[w]e need a new way of talk-
ing about lexical choices, rather than a terminology" (Sinclair 2004a: 285),
many foreign language learning specialists have felt the need for "a worka-
ble framework for classifying them" (Meehan 2003). Not all ELT special-
ists agree though. Some argue that there is no need to break down the all-
embracing notion of "lexical phrase" as defined, for example, by Nattinger
and DeCarrico (1992: 1): "multi-word lexical phenomena which are con-
ventionalized form/function composites that occur more frequently and
have more idiomatically determined meaning than the language that is put
together each time". Kavahauskiene and Janulevieiene (2001), for example,
explicitly state that "[i]t is unimportant if students do not know which cate-
gory a lexical item belongs to". Many, however, explicitly or implicitly
recognize the need for a terminology but in the absence of an established
typology are reduced to inventing their own and the categories used are
often more confusing than helpful. This inconsistency is also found in dic-
tionaries. Gabrielatos ([1994] 2005a) notes that the lexical phrase in the
vicinity is presented as an idiom, an expression and a collocation in differ-
ent dictionaries.
Due to the long-time dominance of grammar in language teaching, the
metalanguage used in textbooks is largely grammatical. Most learners are
exposed at one time or other in their curriculum to terms referring to word
categories (noun, adverb, preposition, adverb) and subcategories (countable
vs. uncountable noun), tenses (simple present, simple past), voice (active
vs. passive voice) and a great many others. For lexis, the repertoire is much
more limited and, unlike that used for grammar, differs widely from one
textbook to another (cf. Gouverneur 2008). Now that lexis has come to oc-
cupy a more dominant position in teaching, it would be very helpful to both
teachers and learners to have access to a sound pedagogically-oriented ter-
minology of multi-word units. I fully agree with Lewis (2000: 129) that we
need to "think about the kind of terminology which will be helpful for
learners. Introducing unnecessary jargon into the classroom is intimidating
and unhelpful to learners, but the careful introduction and regular use of a
few well-chosen terms can be helpful and save a lot of time over the length
of a course for both teacher and learner". The successive typologies he put
forward in his three major publications (1993, 2000, 2002) bear testimony
to the difficulty of the task and the establishment of a helpful pedagogical
terminology of multi-word units remains one of the major desiderata for the
future. To be maximally effective this terminology should cover the full
spectrum of multi-word units, from the most fixed to the loosest ones, and
follow a number of principles, among which the following four strike me as
especially important:
(a) Whatever the terminology used, list the criteria that have been used
to identify/select the multi-word units;
(b) Distinguish clearly between linguistic and distributional categories;
(c) Avoid using the same terms to refer to quite different types of unit;
(d) Choose the level of granularity that best fits the teaching objectives.
Principle (b) aims to avoid typologies that mix up terms and criteria per-
taining to the traditional approach to phraseology, viz. linguistic criteria of
semantic non-compositionality, syntactic fixedness and lexical restriction,
with the terminology used in the Sinclair-inspired distributional approach to
refer to quantitatively-defined units, i.e. units identified on the basis of
measures of recurrence and co-occurrence (for more details, see Granger
and Paquot 2008).

4.3. Selection

4.3.1. Criteria

"Pedagogically the main problem with phrases is that there are so many of
them." This statement by Willis (2003: 166) points to one of the biggest
challenges of the lexical approach, i.e. the selection of lexical phrases. The
criterion that occupies a clearly dominant position in the lexical approach is
corpus-based frequency. Corpora make it possible to identify "the common
uses of the common words" that a lexical syllabus should focus on (Sinclair
and Renouf 1988: 154). There is no denying that frequency is a crucial cri-
terion. Far too much teaching time is wasted on words and phrases that are
not even worth bringing to learners' attention for receptive purposes, let
alone for productive purposes. There is much to gain from teaching high-
frequency words such as the high frequency verbs see or give (Sinclair and
Renouf 1988: 151-153) in all their richness rather than focusing exclu-
sively on their primary meanings. However, it is important to bear in mind
that there is no such thing as generic frequency. Hugon (2008) reminds us
that frequency ranking varies in function of the overall composition of the
corpus from which it is derived.
Proponents of the lexical approach make no claim that "frequency of
occurrence is the only relevant factor" (Sinclair and Renouf 1988: 148).
Sinclair (2004a: 275) mentions "different criteria such as complexity and
familiarity" but clearly presents them as secondary to frequency, i.e. as a
way of arranging "initial frequency-based listings" (Sinclair 2004a: 275).
The interplay between frequency and the many other factors that should be
heeded in vocabulary selection is hardly ever tackled. As represented in
figure 1, frequency needs to be counter-balanced by at least three other fac-
tors: learner variables, learnability and teachability.

Figure 1. Criteria for the selection of lexical phrases (frequency, learnability, teachability, learner variables)

Second language acquisition research has uncovered a wide range of vari-
ables that prove to have a strong influence on language learning. Among
these are age, social distance, aptitude, motivation, learning style (analytic
vs. holistic), L1 and the linguistic distance between L1 and L2, proficiency
level, amount of L2 exposure and learning needs, in particular learners'
targeted accuracy level. These variables are largely disregarded in studies
of the learner phrasicon: "research has tended to assume that the 'learner'
label overrides all others, so that individuals who would easily be acknowl-
edged as different in aspects of their L1 behaviour and, indeed, different in
all other respects in their L2 learning, suddenly become a homogeneous
group when it comes to formulaicity" (Wray 2002: 144). To be maximally
effective, the lexical approach needs to be fine-tuned in function of these
variables.
The issue of learnability should also be brought into the equation. Myles
(2002) concludes her survey of SLA research by pointing out that "[t]here
is still a huge gap - not surprisingly, given the limits of our knowledge -
between the complementary agendas of understanding the psycholinguistic
processes involved in the construction of L2 linguistic systems, and under-
standing what makes for effective classroom teaching" (Myles 2002). This
is particularly true of phraseology, probably due to the fact that research on
the processing and storage of multiword units by L2 learners is in its early
stages. The issue of memorization is particularly crucial. Although several
studies have demonstrated the effectiveness of memorization (Wray and
Fitzpatrick 2008: 124-125), more research is needed to establish "the part
memory plays in second-language learning, and whether (and under what
conditions) memorised language becomes analysed language" (Thornbury
1998: 13).
Finally, there is the issue of teachability. Lowe (2003) wonders what
implications "the massively increased lexical load" has for teaching meth-
odology: "If there is all this extra stuff to learn as fixed expression, are we
implying that repetition and rote-learning may regain a place in the pan-
theon of teachers' tools?" One key to the problem, suggested by Lowe him-
self, resides in making a sharp difference between teaching for productive
vs. receptive purposes. Another important factor to consider is the type of
multiword unit. Grant and Bauer (2004) highlight the differences between
figurative and non-figurative phrases and demonstrate that the two catego-
ries call for different teaching methods. While figurative idioms like he's a
small fish in a big pond can be unpicked by drawing learners' attention to
the underlying metaphor, different methods need to be used for non-
figurative idioms like red herring or he's not swinging the lead (cf. Grant
and Bauer 2004) and other categories not mentioned by the authors, notably
collocations and lexical bundles. More generally, in the absence of a well-
defined methodological framework which does justice to both the creative
and holistic aspects of language, it is probably wise to integrate the lexical
approach progressively via "mini-action programmes" (Lewis 2000: 153),
i.e. local experiments integrated into the teachers' preferred and/or imposed
teaching curriculum. In this connection, a resource that may contribute to a
smooth and efficient integration of the lexical approach is Willis's (2003:
163) "pedagogic corpus", i.e. a corpus made up of the texts used in the
classroom to support teaching. The main advantage of this type of corpus is
that the lexical phrases selected for teaching are extracted from texts that
learners have already processed for meaning, which ensures better contex-
tualization, increased relevance and hence higher motivation for learning
them.

4.3.2. The role of learner corpora

According to Barfield (2001: 415) one of the weaknesses of Lewis's (2000)
book on teaching collocation is that "the voices of typical language learners
are largely omitted". He argues convincingly that "it is perhaps only by
including more detail about actual stages of learner collocational develop-
ment that deeper questions of how collocations are learnt and mis-learnt
may be better answered in the future". In order to decide what to include in
a lexical syllabus, it is essential to have a precise picture of learner difficul-
ties. In this respect, learner corpora are a particularly useful resource as
they give access to the full range of lexical items - both single words and
multiwords - produced by learners in relatively uncontrolled circumstances
(typically, argumentative essays for writing and informal interviews for
speech). Learner corpora are an excellent basis for identifying among the
myriad of phrases that can be taught those that present the greatest chal-
lenge to learners. They have the advantage of displaying a higher degree of
representativeness than experimental data sets, which tend to be restricted
to a very limited number of learners. In addition, the electronic format of
the data allows for automated methods of analysis that were hitherto impos-
sible to apply. Using the appropriate computer tools, it is possible to iden-
tify multiword units in corpora, count and sort them in various ways, ana-
lyze them and compare the results with those obtained from corpora repre-
senting other learner groups and/or expert speakers or writers. Several re-
cent learner-corpus-based studies have shed new light on different catego-
ries of multiword units: idiomatic expressions (Wiktorsson 2003), colloca-
tions (Nesselhauf 2005), phrasal verbs (Waibel 2008), lexical bundles (Mil-
ton 1999; Hyland 2008). Most studies deal with learners' use of multiword
units in writing. De Cock (2000, 2004, 2007) is one of the few scholars to
have investigated learner speech.
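To give a more concrete idea of what such automated methods involve, here is a minimal Python sketch - an illustration only, not the procedure of any of the studies cited. It assumes two plain-text files (the names learner_corpus.txt and native_corpus.txt are invented for the example), treats contiguous word trigrams as a crude stand-in for multiword units, and uses an arbitrary threefold difference in relative frequency to flag candidates for over- and underuse.

# Toy comparison of trigram frequencies in a learner corpus and a native
# reference corpus; file names and thresholds are illustrative assumptions.
from collections import Counter
import re

def trigram_counts(path):
    """Return trigram counts and corpus size (in words) for a plain-text file."""
    with open(path, encoding="utf-8") as f:
        words = re.findall(r"[a-z']+", f.read().lower())
    return Counter(zip(words, words[1:], words[2:])), len(words)

def per_million(counts, size, ngram):
    """Relative frequency of an n-gram, per million words."""
    return counts[ngram] / size * 1_000_000 if size else 0.0

learner_counts, learner_size = trigram_counts("learner_corpus.txt")
native_counts, native_size = trigram_counts("native_corpus.txt")

# Only consider trigrams that are reasonably frequent in at least one corpus.
candidates = {t for t in learner_counts if learner_counts[t] >= 5}
candidates |= {t for t in native_counts if native_counts[t] >= 5}

for trigram in sorted(candidates):
    lf = per_million(learner_counts, learner_size, trigram)
    nf = per_million(native_counts, native_size, trigram)
    if nf and lf >= 3 * nf:      # far more frequent in learner writing
        print("overuse :", " ".join(trigram), f"{lf:.1f} vs {nf:.1f}")
    elif lf and nf >= 3 * lf:    # far less frequent in learner writing
        print("underuse:", " ".join(trigram), f"{lf:.1f} vs {nf:.1f}")

The studies cited naturally rely on more refined extraction and statistical measures, but the basic logic - count, normalise per million words, compare across corpora - is the one sketched here.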
One highly conventionalized variety of language that has proved to be
very difficult to master by EFL/ESL students is English for Academic Pur-
poses (EAP). EAP-specific phraseology is characterized by word combina-
tions such as the aim of this study, the extent to which, it has been sug-
gested, it is likely that, that are essentially semantically and syntactically
compositional. EAP would therefore be a good field within which to dem-
onstrate the usefulness of the Sinclairian contextual approach to meaning
and the benefits that can be gained from the conjoined use of native and
learner corpus data. In the following lines I give a brief outline of some of
the results of a large-scale investigation of EAP vocabulary based on the
systematic comparison of EAP words in the International Corpus of
Learner English (Granger, Dagneaux, Meunier and Paquot 2009) and a
comparable corpus of native academic writing.
The selection of EAP words is strongly inspired by Sinclair and Re-
nouf's idea of focusing on "the common uses of the common words".
Unlike Coxhead (2000), we have included in our EAP list highly frequent
words like the verbs describe, report and suggest, which despite belonging
to the top 2,000 words in English (the so-called General Service List), fill
important roles in EAP and therefore deserve to be brought to students'
attention (for details on the criteria used to establish the list, see Paquot
2007). Following Willis's (2003: 161) suggestion that "[o]ne way of help-
ing learners with phrases - polywords, frames and patterns - is to organize
them into meaningful groups", we have grouped the EAP words into twelve
rhetorical or organisational functions that are particularly prominent in aca-
demic writing, such as contrasting, adding information, etc. A detailed
analysis of EAP words in learner and native corpora enabled us to uncover
many differences in terms of frequency of use, meaning, lexico-
grammatical patterning, collocational preferences and syntactic positioning
(for more details, see Paquot 2010; Gilquin, Granger and Paquot 2007).
While this comparison brought to light a series of downright errors - for-
mal (in the contrary, let us state an example, a conclusion can be drawn
up), semantic (on the contrary in the meaning of on the other hand), collo-
cational (we have performed a survey) - it is mainly the instances of over-
and underuse that have potential from a teaching perspective. For example,
the overuse of the adjective important is a prompt for lexical expansion
exercises aimed to raise learners' awareness of other adjectives (major,
crucial, significant, etc.) that can be used instead and highlight their fre-
quent collocates. On the other hand, instances of significant underuse point
to words and phrases that teachers might want to add to the lexical syllabus
or that need to be consolidated. This underuse affects many patterns involv-
ing verbs typically used in EAP texts, such as occur, note, suggest, require
or assume (Granger and Paquot 2009). As learner needs are prime, it is
clear that not all sequences that are underused need to be included in the
syllabus nor should all words and phrases that are overused be stigmatized,
but it is important for teachers and syllabus designers to have access to this
information (for further discussion of this issue, cf. Granger 2009).

5. Conclusion

John Sinclair's view of language as being essentially lexical and consisting
of phrasal units rather than single words is a major challenge for linguistic
theory and an equally great challenge for all language applications, in par-
ticular language learning and teaching. From the early Cobuild days, Sin-
clair was aware that the changes he advocated "were likely to have a pro-
found effect on the teaching and learning of languages, because the new
descriptions would represent language in a different way" (Sinclair 2004b:
9). The numerous books and articles inspired by Sinclair's ideas and their
pedagogical applications spurred by Lewis's Lexical Approach confirm this
prediction, but also demonstrate the difficulty of pedagogical implementa-
tion. The literature abounds in extreme statements, with reactions ranging
from overoptimistic to overpessimistic. In the absence of hard evidence that
radical changes to the syllabus - notably as regards the depletion of sen-
tence grammar - lead to better learning, it is probably wise to adopt a weak
version of the lexical approach (Pulverness 2007). Timmis (2008) goes one
step further and claims that the lexical approach is dead and that one should
rather speak of a "lexical dimension": "we need to talk about the principled
application of a lexical dimension to teaching, a dimension which can be
applied, to differing degrees, in any teaching context". In this connection, it
is interesting to note that Sinclair himself did not call for a pedagogical
revolution: "All the work suggested here can be merged with traditional
models of language and language-teaching; the emphasis is different, and
gradually we can expect the interests of students to shift into new areas, but
there is nothing revolutionary in my proposals as offered here" (Sinclair
2004a: 297).
As an English language teacher of over 30 years' standing, I have noted
the growing lexicalization of teaching materials and personally experienced
the motivational boost it gives to learners. The step from motivation to suc-
cess is but a small one and this development is thus clearly a very positive
one. Clearly however, much research is still needed if we are to assess how
far we should go along that route. First, it is essential to provide the teach-
ing community with a clear description of the different categories of multi-
word units and a pedagogically-oriented terminology to refer to them. Sec-
ond, more research needs to be carried out into the processing and storage
of multiword units and language specialists - in particular materials de-
signers - need to heed the conclusions drawn from psycholinguistic ex-
periments. Finally, as regards integration into teaching practice, I would
advocate "localizing" the lexical approach, i.e. implementing it in small-
scale classroom studies and adapting it in function of learner needs and the
overall teaching context. At all stages, it is advisable to keep in mind Gab-
rielatos's (2005b) apt observation that "English language teaching is vul-
nerable to pendulum swings, and has a propensity for the marketing and
uncritical acceptance of 'miracle methods'". With respect to the lexical ap-
proach, the pendulum appears to be swinging back to a more balanced posi-
tion. In a recent article, Cullen (2008) revisits Widdowson's (1990) notion
of grammar as a "liberating force" and shows how it can be integrated into
ELT practice. This view of grammar "as a construct for the mediation of
meaning" (Widdowson 1990: 95) is perfectly compatible with the lexical
approach. Cullen's article and other recent ones like Timmis (2008) are a
clear indication that lexis and grammar are slowly beginning to fall into
place for the mutual benefit of both teachers and learners.

References

Altenberg, Bengt
1998 On the phraseology of spoken English: The evidence of recurrent
word-combinations. In Phraseology: Theory, Analysis and Applica-
tions, Anthony P. Cowie (ed.), 101-122. Oxford: Oxford University
Press.
Barfield, Andy
2001 Review of M. Lewis (ed.), Teaching Collocation: Further Develop-
ments in the Lexical Approach. ELT Journal 55 (4): 413-415.
Conklin, Kathy and Norbert Schmitt
2008 Formulaic sequences: Are they processed more quickly than non-
formulaic language by native and nonnative speakers? Applied Lin-
guistics 29 (1): 72-89.
Conzett, Jane
2000 Integrating collocation into a reading and writing course. In Teach-
ing Collocation: Further Developments in the Lexical Approach,
Michael Lewis (ed.), 70-87. Boston: Thomson Heinle.
Cowie, Anthony Paul
1998 Phraseological dictionaries: Some East-West comparisons. In Phra-
seology: Theory, Analysis and Applications, Anthony P. Cowie (ed.),
209-228. Oxford: Oxford University Press.
Coxhead, Averil
2000 A new academic word list. TESOL Quarterly 34 (2): 213-238.
Coxhead, Averil
2008 Phraseology and English for academic purposes: Challenges and
opportunities. In Phraseology in Foreign Language Learning and
Teaching, Fanny Meunier and Sylviane Granger (eds.), 149-161.
Amsterdam/Philadelphia: Benjamins.
Cullen, Richard
2008 Teaching grammar as a liberating force. ELT Journal 62 (3): 221-
230.
Dechert, Hans-Wilhelm and Paul Lennon
1989 Collocational blends of advanced second language learners: A pre-
liminary analysis. In Contrastive Pragmatics, Wieslaw Olesky (ed.),
131-168. Amsterdam: Benjamins.
De Cock, Sylvie
2000 Repetitive phrasal chunkiness and advanced EFL speech and writing.
In Corpus Linguistics and Linguistic Theory, Christian Mair and
Marianne Hundt (eds.), 51-68. Amsterdam: Rodopi.
De Cock, Sylvie
2004 Preferred sequences of words in NS and NNS speech. Belgian Jour-
nal of English Language and Literatures (BELL): 225-246.
De Cock, Sylvie
2007 Routinized building blocks in native speaker and learner speech:
Clausal sequences in the spotlight. In Spoken Corpora in Applied
Linguistics, Mari C. Campoy and Maria J. Luzon (eds.), 217-233.
Bern: Peter Lang.
Deveci, Tanju
2004 Why and how to teach collocations. English Teaching Forum. April
2004: 16-20.
Ellis, Nick, Rita Simpson-Vlach and Carson Maynard
2008 Formulaic language in native and second-language speakers: Psy-
cholinguistics, corpus linguistics and TESOL. TESOL Quarterly 42
(3): 375-396.
Gabrielatos, Costas
2005a Collocations: Pedagogical implications and their treatment in peda-
gogical materials. SHARE 6/146. Available from http://www.
shareeducation.com.ar/past%20issues2/SHARE%20146.htm. First
published in 1994.
Gabrielatos, Costas
2005b Corpora and language teaching: Just a fling or wedding bells?
TESL-EJ 8 (4): 1-37.
Gilquin, Gaëtanelle, Sylviane Granger and Magali Paquot
2007 Learner corpora: The missing link in EAP pedagogy. In Corpus-
based EAP Pedagogy, Paul Thompson (ed.), Special issue of Journal
of English for Academic Purposes 6 (4): 319-335.
Gouverneur, Céline
2008 Phraseology in foreign language learning and teaching: Thirst for
efficient metalanguage. Paper presented at the FLaRN conference,
University of Nottingham (UK), 19-20 June 2008.
Granger, Sylviane
2009 The contribution of learner corpora to second language acquisition
and foreign language teaching: A critical evaluation. In Corpora and
Language Teaching, Karin Aijmer (ed.), 13-32. Amster-
dam/Philadelphia: Benjamins.
Granger, Sylviane, Estelle Dagneaux, Fanny Meunier and Magali Paquot
2009 The International Corpus of Learner English. Version 2. Handbook
and CD-ROM. Louvain-la-Neuve: Presses Universitaires de Lou-
vain.
Granger, Sylviane and Magali Paquot
2008 Disentangling the phraseological web. In Phraseology: An Interdis-
ciplinary Perspective, Sylviane Granger and Fanny Meunier (eds.),
27-49. Amsterdam/Philadelphia: Benjamins.
Granger, Sylviane and Magali Paquot
2009 Lexical verbs in academic discourse: A corpus-driven study of
learner use. In At the Interface of Corpus and Discourse: Analysing
Academic Discourses, Maggie Charles, Susan Hunston and Diane
Pecorari (eds.), 193-214. London/New York: Continuum.
Grant, Lynn and Laurie Bauer
2004 Criteria for re-defining idioms: Are we barking up the wrong tree?
Applied Linguistics 25(1): 38-61.
Hamilton, Nick
2001 Weaving some lexical threads. IH Journal of Education and Devel-
opment 10: 13-15.
Harwood, Nigel
2002 Taking a lexical approach to teaching: Principles and problems. In-
ternational Journal of Applied Linguistics 12 (2): 139-155.
Hasselgren, Angela
1994 Lexical teddy bears and advanced learners: A study into the ways
Norwegian students cope with English vocabulary. International
Journal of Applied Linguistics 4: 237-260.
Hoey, Michael
2005 Lexical Priming: A New Theory of Words and Language. Lon-
don/New York: Routledge.
Howarth, Peter
1999 Phraseological standards in EAP. In Academic Standards and Expec-
tations, H. Bool and P. Luford (eds.), 143-158. Nottingham: Not-
tingham University Press.
Hugon, Claire
2008 Towards a variationist approach to frequency in ELT. Paper pre-
sented at ICAME 29, Ascona, May 2008.
Hunston, Susan
2002 Corpora in Applied Linguistics. Cambridge: Cambridge University
Press.
Hyland, Ken
2008 Academic clusters: Text patterning in published and postgraduate
writing. International Journal of Applied Linguistics 18 (1): 41-62.
Johns, Tim
1986 Micro-concord: A language learner's research tool. System 14 (2):
151-162.
Kavaliauskienė, Galina and Violeta Janulevičienė
2001 Using the lexical approach for the acquisition of ESP vocabulary.
The Internet TESL Journal VII (3), March 2001.
Krishnamurthy, Ramesh
2002 Learning and teaching through context: A data-driven approach.
Downloaded from http://www.developmgteachers.com/articles
tchtrammg/corpora3 ramesh.htm.
Lewis, Michael
1993 The Lexical Approach: The State of ELT and a Way Forward. Hove:
Language Teaching Publications.
Lewis, Michael (ed.)
2000 Teaching Collocation: Further Developments in the Lexical Ap-
proach. Boston: Thomson Heinle.
Lewis, Michael
2002 Implementing the Lexical Approach. Boston: Thomson Heinle. First
published in 1997.
Lowe, Charles
2003 Lexical approaches now: The role of syntax and grammar. IH Jour-
nal of Education and Development, http://www.ihworld.com/
ihjournal/charleslowe.asp.
Meehan, Paul
2003 Lexis - the new grammar? How new materials are finally challeng-
ing established course book conventions. http://www.
developmgteachers.com/articlestchtrammg/lexnewpfjaul.htm.
Milton, John
1999 Lexical thickets and electronic gateways: Making text accessible by
novice writers. In Writing: Texts, Processes and Practices, Christo-
pher N. Candlin and Ken Hyland (eds.), 221-243. London: Long-
man.
Myles, Florence
2002 Second Language Acquisition (SLA) research: Its significance for
learning and teaching. In The Guide to Good Practice for Learning
and Teaching in Languages, Linguistics and Area Studies. South-
ampton: LTSN Subject Centre for Languages, Linguistics and Area
Studies. Downloaded from http://www.llas.ac.uk/resources/
goodpractice.aspx?resourceid=421.
Nattinger, James and Jeanette S. DeCarrico
1992 Lexical Phrases and Language Teaching. Oxford: Oxford University
Press.
Nesselhauf, Nadja
2005 Collocations in a Learner Corpus. Amsterdam/Philadelphia: Benja-
mins.
Paquot, Magali
2007 Towards a productively-oriented academic word list. In Corpora and
ICT in Language Studies, Jacek Walinski, Krzysztof Kredens and
Stanislaw Gozdz-Roszkowski (eds.), 127-140. Frankfurt a. M.: Peter
Lang.
Paquot, Magali
2010 Academic Vocabulary in Learner Writing: From Extraction to
Analysis. London/New York: Continuum.
Porto, Melina
1998 Lexical phrases and language teaching. Forum 36: 3. Downloaded
from http://exchanges.state.gov/forum/vols/vol36/no3/mdex.htm.
Pulverness, Alan
2007 Review of McCarthy, Michael and Felicity O'Dell, English Colloca-
tions in Use. ELT Journal 61: 182-185.
Richards, Jack C.
1976 The role of vocabulary teaching. TESOL Quarterly 10 (1): 77-89.
Rogers, Ted
2000 Methodology in the new millennium. English Language Teaching
Forum 38 (2): November 2000.
Schmitt, Norbert (ed.)
2004 Formulaic Sequences. Amsterdam/Philadelphia: Benjamins.
Sinclair, John McH.
1991 Corpus, Concordance, Collocation. Oxford: Oxford University
Press.
Sinclair, John McH.
2004a New evidence, new priorities, new attitudes. In How to Use Corpora
in Language Teaching, John McH. Sinclair (ed.), 271-299. Amster-
dam/Philadelphia: Benjamins.
Sinclair, John McH.
2004b Introduction. In How to Use Corpora in Language Teaching, John
McH. Sinclair (ed.), 1-10. Amsterdam/Philadelphia: Benjamins.
Sinclair, John McH.
2004c Trust the Text: Language, Corpus and Discourse. London:
Routledge.
Sinclair, John McH. and Antoinette Renouf
1988 A lexical syllabus for language learning. In Vocabulary and Lan-
guage Teaching, Ronald Carter and Michael McCarthy (eds.), 140-158.
London: Longman.
Siyanova, Anna and Norbert Schmitt
2008 L2 learner production and processing of collocation: A multi-study
perspective. The Canadian Modern Language Review 64 (3): 429-
458.
Thornbury, Scott
1998 The lexical approach: A journey without maps. Modern English
Teacher 1 (4): 7-13.
Timmis, Ivor
2008 The lexical approach is dead: Long live the lexical dimension. Mod-
ern English Teacher 17 (3): 5-9.
Tremblay, Antoine, R. Harald Baayen, Bruce Derwing and Gary Libben
2008 Frequency and the Processing of Multiword Strings: A behavioural
and ERP Study. Paper presented at the conference of the Canadian
Linguistics Association, Université de la Colombie-Britannique, 31
May - 2 June 2008. http://ocs.sfu.ca/fedcan/mdex.php/cla/acl-cla
2008/paper/viewFile/219/152.
Waibel, Birgit
2008 Phrasal Verbs: German and Italian Learners of English Compared.
Saarbrücken: VDM.
Widdowson, Henry G.
1990 Aspects of Language Teaching. Oxford: Oxford University Press.
Wiktorsson, Maria
2003 Learning Idiomaticity: A Corpus-Based Study of Idiomatic Expres-
sions in Learners' Written Production. Stockholm: Almqvist and
Wiksell.
Willis, Dave
2003 Rules, Patterns and Words: Grammar and Lexis in English Lan-
guage Teaching Cambridge: Cambridge University Press.
Woolard, George
2000 Collocation: Encouraging learner independence. In Teaching Collo-
cation: Further Developments in the Lexical Approach, Michael
Lewis (ed.), 28-46. Boston: Thomson Heinle.
Wray, Alison
2002 Formulaic Language and the Lexicon. Cambridge: Cambridge Uni-
versity Press.
Wray, Alison and Tess Fitzpatrick
2008 Why can't you just leave it alone? Deviations from memorized lan-
guage as a gauge of nativelike competence. In Phraseology in For-
eign Language Learning and Teaching, Fanny Meunier and Sylviane
Granger (eds.), 123-147. Amsterdam/Philadelphia: Benjamins.
Chunks and the effective learner

Dieter Götz

1. Towards "chunks-

Nowadays many linguists entertain the idea that language in use contains
quite a number of phraseological units, or ready-made, prefabricated units,
or chunks. We can observe various strands of thought, which, in hindsight,
might have contributed toward this idea.
When Austin wrote about speech acts (in the 1960s) he was one of the
first to point out that we do not simply talk when we talk but that by talking
we actually do something, like asking, confirming, naming, declaring war,
repenting, forgiving, etc. Accordingly, pieces of language were seen as
used in and linked to specific situations, as a means of coping with the
world around and also with people. Actual language behaviour came to be
viewed as a performance by an individual, under certain circumstances and
for certain purposes (reminiscent of Malinowski, also of Bühler and Jakob-
son).
Not only can it be asked what the function of an utterance is but also
what a certain part of the utterance is for. This is a functional grammatical
approach (e.g. by Czech scholars like Mathesius, Vachek and Firbas), taken
further by Halliday, with influence on the London School. Lyons linked
words to the world, by using "referring, reference" - meaning and things no
longer belong to different universes. The description of words developed
into a description of their use (Wittgenstein), or usage, which meant that
collocation (and the like) received attention (e.g. by Firth and later by Sin-
clair). The phenomenon of concomitant words led to introducing the prin-
ciple of idiomaticity (Sinclair), or in other terms, repeated speech (Coseriu).
This is the stage at which one could speculate that like situations may have
like (or similar) wordings, with strings of words strung together.
Empirically though, repetition (or recurrence, or related terms) is hard to
describe. Here is the beginning of an Economist article (5 May 2007), with
frequency counts for some stretches (according to the BNC).
A cagey game
AT THIS week's meeting in Sharm el-Sheikh, an Egyptian resort, a group
of representatives from Middle Eastern and other countries gathered to talk
about Iraq. But the buzz was not directly about that troubled country. In-
stead, eyes were on the interaction between two big players. Whether Con-
doleezza Rice, America's secretary of state, would converse with her coun-
terpart from Iran, and whether she would take the chance to raise the issue
of Iran's nuclear programme, were the big questions of the gathering.
cagey game 0
this week 4233
meeting in 1589
group of 7254
group of representatives 7
representatives from 455
gather* to 170
gather* to talk 2
troubled country 1
eyes on 634
interaction between 2285
big player* 23
take the chance 39
take* the chance to 8
raise* the issue of 157
big question 72
in the end 3120
face-to-face meeting 14
nuclear programme 88

To anyone speaking English, the frequency of this week, in the end, raise
the issue and others come as no surprise. But it would be futile to try and
define repeatedness numerically - perhaps like "from 75 occurrences per
million onwards". In their Longman Grammar of Spoken and Written Eng-
lish, Biber et al. (1999: 992) require 10 occurrences per one million words
for a given stretch to count as a "recurrent" lexical bundle. This is reason-
able, perhaps eminently so, but it has more to do with the genre of publish-
ing than with linguistics and need not affect a theory of language.
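The arithmetic behind such a criterion is easy to make explicit. The following lines - a worked illustration only, with the size of the BNC rounded to 100 million words - convert the raw counts from the list above into rates per million and check them against the 10-per-million threshold that Biber et al. (1999) use:

# Raw BNC counts taken from the list above; the corpus size is an approximation.
BNC_SIZE_IN_MILLION_WORDS = 100
THRESHOLD_PER_MILLION = 10  # Biber et al.'s criterion for "recurrent" bundles

raw_counts = {
    "this week": 4233,
    "raise* the issue of": 157,
    "big question": 72,
    "take the chance": 39,
    "cagey game": 0,
}

for phrase, count in raw_counts.items():
    rate = count / BNC_SIZE_IN_MILLION_WORDS  # occurrences per million words
    verdict = "recurrent" if rate >= THRESHOLD_PER_MILLION else "not recurrent"
    print(f"{phrase}: {rate:.2f} per million -> {verdict}")

On that reckoning this week easily qualifies, while perfectly familiar strings such as raise the issue and big question fall well below the line - which is precisely the sort of outcome that makes a purely numerical definition of repeatedness look arbitrary.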
Moreover, there may be some doubts about what makes such a stretch.
One might view in the end as an option offered by a larger unit (in the {end,
beginning, middle, meantime, afternoon...}). Also note that, in a sense,
raise the issue might not be a collocation at all. Raise goes with words that
signify 'matter of, case of, instance of, topic of, etc. If you use raise as in,
say, raise Harry's jealousy (i.e. the conversational topic that Harry is jeal-
ous), then the direct object expresses the semantic role associated with the
valency of raise (which of course results in some kind of collocation, and
rouse jealousy would be something different).
Repetition is relative: there is considerable variation in occurrence with
regard to register or subregisters. Chunks in sports commentaries for exam-
ple will not be the same as those in school essays, or poetry, or classroom
English, or in academic medical English.1
There appears to be no theoretical flaw in saying that different persons
may have different stores of repeated speech items. There must be room for
individual and subjective bits and pieces in "language". We should not
forget that corpora do not represent language: they can present language in
a reasonable way, namely, depending on the choice of genres, registers
(etc.) and the purposes of the corpus. But let us keep in mind: language in
use does show recurrent strings of words and these strings can be related to
extra-linguistic situations.

2. Chunks and the learner

Now suppose a German learner would like to know what kind of food will
be served for the next meal, a situation, or an intention that is usually real-
ised as Was gibts zu essen? in German or as What are we going to have for
dinner/lunch (etc.)? in English. Note that a so-called literal translation of
the German expression will not yield an adequate expression in English.
That is to say: you cannot translate the German "chunk" unless you know
the English "chunk" (or situationally appropriate string). Although What
are we going to have for dinner/lunch would not be called an idiom, it can
hardly be spontaneously generated by a learner of English. Similarly, there
is, from the point of view of English, nothing idiomatic about Dogs wel-
come (in a farm holiday brochure), but that does not mean that you can
spontaneously generate Hunde willkommen in German or cani benvenuti in
Italian (an important issue in translation equivalence, cf. Tognini-Bonelli
and Manca 2004). This is why it is essential that in foreign language teach-
ing and learning learners should come to understand the importance of
chunks for fluent and idiomatic foreign language production - something
which has been recognized by applied linguists and didacticians for quite
some time. Firstly learners should be advised to spot "gaps" of the kind just
described, in their FL storage.
Here is what learners should do. When they hear or read parts of the
language to be learnt, they might want to split these parts into two groups:
into what they already know and into what they do not yet know. The for-
mer - what they already know - serves to confirm their prior knowledge.
But if what they come across is recognised as new, or different from
what has already been learnt, they will be alert to spot their personal lin-
guistic gaps. The new item will be focussed upon, as a possible piece for
expanding one's knowledge: once learners realise that strings of words
refer to situations, they will develop a linguistic alertness. That is to say
learners react to words like, say, crepuscular or units like The liver can
repair itself or You're afraid, aren't you? or even to Under certain circum-
stances there is nothing more agreeable than the hour dedicated to the
ceremony known as afternoon tea. They will not react to crepuscular by
saying to themselves "Oh, never mind!". Instead they will react by saying
"I couldn't have produced that myself!" They will react to You're afraid,
aren't you? by making a mental note: so that is what people say in a situa-
tion like the one at hand. In other words, what is required is a "monitoring
of the input". When reading a foreign language text this monitoring can
easily be applied: there is nearly always enough time for pondering over a
phrase. When engaged in listening, the situation at hand will trigger the
attention (provided of course that listening comprehension does not require
an undue amount of mental capacity). There is a fair chance that if you
connect a concrete utterance with a concrete situation, you might co-
remember communicative factors like register, social distance, jargon etc.

3. Further evidence: words and circumstances

It is not often that you can use a single word to describe a situation. A word
like passport is likely to appear in the company of valid, invalid, or check,
issue, apply for, expire, etc. If a customs officer tells you Your passport is
going to expire soon and if you as a foreign speaker of English do not know
"the meaning of expire" you might make an intelligent guess at what he
means or you might look up expire later. You would then memorise "pass-
port + expire" both as a specific situation and as the words that cover it.
And if you were an efficient German learner of English, you would not
store it as expire = ablaufen, but probably as Pass + ablaufen > pass-
port + expire, linking both expressions to the same situation. The option of
storing it simply as ablaufen = expire is not chosen, since this would or
could mean that ablaufen can always be translated or rendered by expire
(an example of the commonest strategic mistake in language learning). In
other words, what you assemble in your memory is something like
(Pass ablaufen) > situation < (expire passport),
where Pass ablaufen would be optional since, as a German, you know it
and could produce it anyway. Imagine now that you want to produce some-
thing that might be phrased in German as Ihr Pass muss erneuert werden
(i.e., literally, renewed). If passport > renew is available, the learner will
have no difficulty. If this particular collocation is not at hand though, you
might check whether there is something similar to Pass > erneuern, hit
upon Pass > ablaufen, classify Your passport > expire as contextually syn-
onymous (not strictly synonymous, of course) for your purposes and then
render Ihr Pass muss erneuert werden as Your passport will expire. We
then get
(Pass > ablaufen) > Situation < expire > passport
(Pass > erneuern) > Situation < expire > passport,
and similarly
(Pass > verlängern) > Situation < expire > passport
Clearly, the concept of collocation is present here. But for a learner
these "collocations" are primarily such chains of words, more or less adja-
cent words, that attract attention, regardless of further linguistic subclassifi-
cation. It is unlikely that learners spend much energy on deciding whether
e.g. raise an objection is primarily a collocation, a subcollocation, a
valency , or whatever, as long as it comes in handy. I would conclude this
from my own personal experience as a learner of Italian, where I noted
down the following chunks, taken from an Easy Reader, with the meanings
I (as a learner!) attribute to them, in brackets without paying any attention
to their linguistic "status": nel mese Febbraio ('in February'), quell' anno
('in that year'), il preside entrando ('the President, while coming in'), fino
al mezzanotte ('until midnight'), Lei si deve calmare ('Do calm down'),
raccontare tutto con ordine ('tell everything just as/in the order it happened
(in good order)'), era cosa conosciuta ('everybody knew'), sentire un botto
('hear a shot'). What is common to them all is that I think they might be
useful to me personally when I want to express myself in Italian and when I
feel that I would not be able to produce them. (If I had time to learn ten
such items a day, they would add up to some 10,500 in three years, a for-
midable collection.)

4. Metalingual faculties

Awareness of chunks - or the role of the idiom principle in the operation of
language - is important for developing metalingual faculties. "Metalingual"
shall here refer to the way in which foreign language learners process the
pieces of the target language. An example: when inexperienced German
learners want to store foreign language items, they link them to items in
their native language, e.g. choose to wählen, carry to tragen, afraid to
ängstlich, extend to verlängern etc. That is, when they produce an utterance
in English, what they produce is a roughly translated equivalent of German,
hopefully, and they use their kind of English as a metalanguage for Ger-
man.
Effective metalinguistic competence, however, is important for enlarg-
ing your general linguistic competence. I have over the years asked many
groups of students to give me a list of sentences which illustrate the differ-
ent senses of carry. The result was always the same. A group of 10-20
students would produce three or four different types of meaning. Nearly all
of them come up with carry as in She carried a basket, some with carry as
in the train carried passengers and perhaps one or two with Flies can carry
diseases. Usually, carry a gun, carry the roof, carry oneself like a ... and
others, do not appear. Most likely, this is due to the students tagging Ger-
man words with English ones (tragen = carry). What learners need is that
they store several kinds of carry: one that is tied up with e.g. bags, babies,
another that goes with train and people, a third with disease, a fourth with
e.g. pillars and roof. Fossilization in learners (see Selinker 1972; Han and
Selinker 2005) may of course be a partly psychological phenomenon - but
learners who do not develop a complex awareness of situation, recurrence
and chunks, will never become achievers.
Any piece of language that is situationally correct and that learners have
stored in their memory and that they can retrieve, is one that allows
metalinguistic inspection. So if you know that poach collocates with egg,
tomato, fish then you might be able to paraphrase poach as 'cook in hot
water in such a way that the shape of the food is preserved' - which would
be very close to a native speaker's intuition. Or if you know that repair
collocates with damage and words that signify it (such as leak), you will
perhaps use the word mend with trousers (which is the usual collocation)
since you have not yet met My trousers were repaired. Your analysis may
not be quite correct (clothes etc. does collocate with repair, though rarely),
but considerations like these will make you suspicious and lead you to
choose something else if you want to play safe. It is up to you to run risks
or not. While some people think that trusting native speakers is too risky,
learners who trust themselves and their own poor translation, run a much
greater risk. Metalinguistic inspection may of course be applied to any level
of linguistic description (from phonology to discourse analysis).

5. Skills and chunks

The function of repetition when acquiring language skills is more than ob-
vious. Clearly, one of the most important keys to listening comprehension
is repetition. Repetition equals redundancy and redundancy will raise the
degree of expectability. Learners cannot learn listening by listening, but
they can learn listening by detecting co-occurrent vocabulary. Fast reading
is another skill that needs chunk stores.
When writing, learners can choose to play safe and use only those
stretches which they know to be correct,2 and should they leave firm
ground they will at least know that they are doing just that. Advanced con-
versational skills is another point. Here, repetition facilitates quick compre-
hension (and quick comprehension is necessary) and it is also the basis for
producing prefabricated items as quickly as is normal. These items also
help learners to gain and compete for the speaker's role. Moreover, chunks
allow a non-native speaker to monitor their production and to know that
what they said was what they meant.

6. Exploiting chunks

The concept of chunks, together with the implications for language learn-
ing, has been around for quite a number of years, cf. e.g. Braine (1971),
Asher, Kusudo and de la Torre (1974), Götz (1976). Chunkiness, however,
was not really a popular idea in advanced generative grammar - but for
some time, and perhaps due to the idiomatic principle, it has no longer been
frowned upon (see e.g. Sylviane Granger, this volume).

6.1. Bridge dictionaries

One particularly important field in this respect is, of course, lexicography.3


Surprisingly, even some modern dictionary-makers might need to catch up
on chunks. OALD8 s.v. watch, illustrates the pattern ~ where, what, etc...
by Hey, watch where you're going! This is a good example, but only under
very favourable circumstances. It can re-enforce the learner's knowledge -
in case he or she already knows the phrase. Learners who do not know it,
cannot decide what it really means, to what kind of situation it really refers.
Does it mean a) 'make sure you take a direction/road etc. that leads to
where you want to go', perhaps, or specifically, b) '... where you want to
go in life', or c) 'be inquisitive about the things around you!' or perhaps d)
'look where you set your foot, might be slippery, muddy, etc'. Learners
cannot know intuitively that d) is correct, and hence this example needs
some comment or a translation, e.g. Pass auf, wo du hintrittst! in German.
(Admittedly, the hey would be a kind of hint for those that know.) Exam-
ples of usage might be chosen and translated in such a way that they indi-
cate clearly what sort of situation they refer to - and can show how co-
selection (see e.g. Sinclair 1991) works.
In short, we are approaching the idea of a bridge dictionary - one of the
many ideas suggested by John Sinclair. In a bridge dictionary, foreign lan-
guage items are presented in the learner's native language. Using this kind
of metalanguage will ensure that a learner has no difficulty understanding
what is said even if it is fairly subtle.4
Incidentally, a COBUILD-style explanation is one that tries to depict a
situation, cf. "1 If you watch someone or something, you look at them,
usually for a period of time and pay attention to what is happening"
(COBUILD4).
To my knowledge, various lexicographers (including myself) have tried
to find publishers for bridge dictionaries (such as English - German, Eng-
lish - Italian etc.), but they have tried in vain. However, a dictionary that
contains information like the following article (based here on an OALD
version) need not necessarily become a flop:

watch [...] Verb


1 mit Aufmerksamkeit schauen, zuschauen, beobachten: watch + N
watch television/watch football fernsehen, Fußball schauen, gucken:
Watch (me) carefully Schau gut zu, Pass gut auf, wie ich es mache;
"Would you like to play?" - "No, I'll just watch" ... Nein, ich kucke
bloß zu; watch + to + Verb He watched to see what would happen
Er schaute hin, um mitzukriegen, was passieren würde; watch + wh-
She watched where they went Sie schaute, wohin sie gingen; watch +
N + -ing She watched the children playing Sie schaute den Kindern
beim Spielen zu; watch + Infinitiv She watched the children cross
the road Sie sah, wie die Kinder über die Straße gingen 2 watch
(over) + N sich um etwas oder jemanden kümmern, indem man da-
rauf aufpasst: Could you watch (over) my clothes while I swim Passt
du auf meine Kleider auf, während ich beim Schwimmen bin? 3
(Umgangssprache) auf das aufpassen, was man tut, etwas mit
Sorgfalt tun: Watch it! Pass auf! Watch yourself. Pass auf und ... (fall
nicht hin, sag nichts Falsches, lass dich nicht erwischen) You'd better
watch your language überleg dir, wie du es formulierst Watch what
you say Pass auf, was du sagst!
watch + for + N (ausschauen und) warten, dass jemand kommt oder
dass etwas passiert: You'll have to watch for the right moment Du
musst den richtigen Zeitpunkt abpassen; watch + out (besonders im
Imperativ) aufpassen, weil Vorsicht nötig ist Watch out! There's a
car coming Achtung! ...; watch + out + for + N 1 konzentriert
zuschauen, hinschauen, damit einem nichts Wichtiges entgeht: The
staff were asked to watch out for forged banknotes Die Angestellten
mussten sorgfältig auf gefälschte Geldscheine achten 2 bei etwas
sehr vorsichtig sein: Watch out for the steps they're rather steep
Pass bei der Treppe auf ...

6.2. Collocations and patterns

Concomitance of words is due to the fact that some situations are alike, or
viewed alike. It is imperative for a learner to be aware of this phenomenon,
and hence it should be an integral part of learners' dictionaries. Although
the recent editions of English learners' dictionaries such as the Longman
Dictionary of Contemporary English have made great progress in this di-
rection, Langenscheidt's Großwörterbuch Deutsch als Fremdsprache
(1993) and its derivatives can be seen as one of the first really systematic
treatments of collocations in this type of dictionary as it provides lists of
collocations for many headwords: the entry for e.g. Sturm contains a sec-
tion in < >, namely <ein S. kommt auf, bricht los, wütet, flaut ab, legt sich;
in einen S. geraten>. A list like this need not be representative or exhaus-
tive or meticulously structured - its main purpose is to demonstrate con-
comitant words (some of them certainly useful) and remind the learner of
concomitance.
A surface syntactic pattern, such as Noun + Verb + Noun is of course
much too general to make sense as a chunk. Usually, however, such a sur-
face pattern is in reality a kind of cover term for "chunkable" items, pro-
vided the pattern is filled semantically. In the case of e.g. the verb fly we
get several distinguishable semantic subpatterns of the syntactic pattern
Noun + Verb + Noun, depending on the actants' reference, such as pilot +
fly + plane, passenger + fly + airline, plane + fly + distance/direction,
pilot + fly + [...] (+ direction); and others (see Götz-Votteler 2007).
Coverage of this kind of pattern must probably be restricted to specialized
dictionaries such as the Valency Dictionary of English (2004) or textbooks
such as Heringer's (2009) collection of "valency chunks".5 Such collections
of chunks are of course not lists of items to be learnt by heart. They serve
as a range of offers from which you can choose if need be and, more impor-
tantly, they can serve as evidence of how language works and offer effec-
tive ways of learning a foreign language.
In any case, dealing with collocations and item-specific constructions,
and thus doing justice to Sinclair's idiom principle, will remain one of the
great challenges in the future - in language teaching and lexicography.6

Notes

1 For the distribution of chunks across registers see view.byu.edu.


2 See de Cock (2000).
3 See e.g. Siepmann (2005).
4 Lexicographers of English often underrate the difficulties that learners have in
understanding explanations. According to LDOCE, the meaning of charge a
price is 'to ask someone for a particular amount of money for something you
are selling': Does ask mean 'put a question' or 'beg' or 'request'? Who is
you? Can services be sold?
5 It is very attractive to assume that "valency chunks" might be useful material
for learners. Heringer (2009) is a collection of such syntactic-semantic chunks
in German, some thirty chunks for about eighty verbs each. Here is, in a
simplified form, a selection of chunks containing antworten: Da hab ich ge-
antwortet; Was würden Sie antworten, wenn; antwortete er ... er wisse nicht
...; was soll man darauf antworten; zögert kurz und antwortet dann; antwor-
tet er auf die Frage warum; auf einen Brief geantwortet. The chunks them-
selves were determined by co-occurrence analyses (cf. Belica 2001-2006).
6 I would like to thank Tony Hornby and the editors of this volume for their
comments on an earlier draft of this article.

References

Asher, James J., Jo Anne Kusudo and Rita de la Torre


1974 Learning a second language through commands: The second field
test. Modern Language Journal 58 (1/2): 24-32.
Belica, Cyril
2001-06 Kookkurrenzdatenbank CCDB. Eine korpuslinguistische Denk- und
Experimentierplattform für die Erforschung und theoretische Be-
gründung von systemisch-strukturellen Eigenschaften von Kohäsi-
onsrelationen zwischen den Konstituenten des Sprachgebrauchs. In-
stitut für Deutsche Sprache, Mannheim.
Braine, Martin D. S.
1971 On two types of models of the internalization of grammars. In The
Ontogenesis of Grammar: A Theoretical Symposium, D. I. Slobin
(ed.), 153-186. New York: Academic Press.
De Cock, Sylvie
2000 Repetitive phrasal chunkiness and advanced EFL speech and writing.
In Corpus Linguistics and Linguistic Theory, Christian Mair and
Marianne Hundt (eds.), 51-68. Amsterdam/Atlanta: Rodopi.
Biber, Douglas, Stig Johansson, Geoffrey Leech, Susan Conrad and Edward
Finegan
1999 Longman Grammar of Spoken and Written English. Edinburgh:
Longman.
Götz, Dieter
1976 Textbezogenes Lernen: Aspekte des Fremdsprachenerwerbs fortge-
schrittener Lernender. DNS 75: 471-478.
Götz, Dieter, Günther Haensch and Hans Wellmann (eds.)
2010 Großwörterbuch Deutsch als Fremdsprache. Berlin/München: Lan-
genscheidt.
Götz-Votteler, Katrin
2007 Describing semantic valency. In Valency: Theoretical, Descriptive
and Cognitive Issues, Thomas Herbst and Katrin Götz-Votteler
(eds.), 37-50. Berlin/New York: Mouton de Gruyter.
Granger, Sylviane
2011 From phraseology to pedagogy: Challenges and prospects. This
volume.
Han, ZhaoHong and Larry Selinker
2005 Fossilization in L2 Learners. In Handbook of Research in Second
Language Teaching and Learning, Eli Hinkel (ed.), 455-470. Mah-
wah, NJ: Erlbaum.
Herbst, Thomas, David Heath, Ian F. Roe and Dieter Götz (eds.)
2004 A Valency Dictionary of English: A Corpus-Based Analysis of the
Complementation Patterns of English Verbs, Nouns and Adjectives.
Berlin/New York: Mouton de Gruyter.
Heringer, Hans Jürgen
2009 Valenzchunks: Empirisch fundiertes Lernmaterial. München: Iudici-
um.
Selinker, Larry
1972 Interlanguage. IRAL 10 (2): 209-231.
Siepmann, Dirk
2005 Collocation, colligation and encoding dictionaries. Part I: Lexico-
logical aspects. International Journal of Lexicography 18: 409-443.
Sinclair, John McH.
1991 Corpus, Concordance, Collocation. Oxford: Oxford University
Press.
Tognini-Bonelli, Elena and Elena Manca
2004 Welcoming children, pets and guests: Towards functional equiva-
lence in the languages of 'Agriturismo' and 'Farmhouse Holidays'.
In Advances in Corpus Linguistics: Papers from the 23rd Interna-
tional Conference on English Language Research on Computerized
Corpora (ICAME 23), Göteborg 22-26 May 2002, Karin Aijmer
and Bengt Altenberg (eds.), 371-385. Amsterdam/New York: Ro-
dopi.

Dictionaries

Longman Dictionary of Contemporary English


2009 edited by Michael Mayor. Harlow: Pearson Longman. 5th edition.
[LDOCE5]
Oxford Advanced Learner's Dictionary of Current English
2010 by A. S. Hornby, edited by Sally Wehmeier. Oxford: Oxford Univer-
sity Press. 8th edition. [OALD8]

Corpus

BNC The British National Corpus, version 3 (BNC XML Edition). 2007.
Distributed by Oxford University Computing Services on behalf of
the BNC Consortium, http://www.natcorp.ox.ac.uk/.
Exploring the phraseology of ESL and EFL varieties

Nadja Nesselhauf

1. Introduction

John Sinclair was among the leading proponents of the centrality of phrase-
ology, or what he referred to as the "idiom principle", in language. He also
advocated that this aspect of language be investigated with a corpus-
approach. These convictions have been proved right over and over again by
what must by now be tens of thousands of corpus-based studies on the
phraseology of (LI) English (and other languages). The study of the phra-
seology of EFL varieties1 has also intensified over the past few years, al-
though only a relatively small proportion of this work is corpus-based.
What is rare to date, however, is studies, in particular corpus-based ones,
on the phraseology of ESL varieties. What is practically non-existent (in
any type of approach) is comparisons of the phraseology of ESL and EFL
varieties. Given the pervasiveness of the phenomenon in any variety and
the relatedness of the two types of variety, this is a gap that urgently needs
to be filled. The present paper is therefore going to explore phraseological
features in ESL and EFL varieties and to investigate to what degree and in
what respects the phraseology of the two types of variety is similar.2
The paper starts out by providing a brief overview of previous research
as well as an overview of the corpora and methodology used for the inves-
tigation. Then, three types of analyses into the phraseology of ESL and
EFL varieties will be presented. In Section 3.1, it will be investigated how
"competing collocations" (or collocations that share at least one lexical
element and are largely synonymous) are dealt with in the two types of
varieties. In 3.2, the treatment of internally variable collocations will be
considered. Finally, I am going to look at what have been referred to as
"new prepositional verbs" (Mukherjee 2007), i.e. verbs that are simple
verbs in L1 English but have become or are treated as if they were verb-
preposition collocations (or prepositional verbs) in ESL or EFL varieties
(Section 3.3).

2. Previous research and methodology

2.1. Previous research

Studies on phraseological features of any kind in ESL varieties are rare so
far. Usually, a number of such features is included in studies of individual
varieties, but only very few investigations actually focus on phraseology.
Of those that do, many have a cultural impetus and are restricted to the
examination of collocates of culturally loaded terms (e.g. Schmied 2004;
Wolf and Polzenhagen 2007). Only a few studies are not restricted in this
way, as for example Skandera (2003), who looks at idioms in Kenyan Eng-
lish, or Schilk (2006), who investigates collocations in Indian English.
Comparisons of phraseological features across ESL varieties are rarer still
and mostly consist of surveys of existing studies on individual varieties
(which in turn are mostly based on anecdotal evidence; e.g. Crystal [1997]
2003; Ahulu 1995; Piatt, Weber and Lian 1984). The number and scope of
systematic, corpus-based studies on phraseological features across several
ESL varieties is highly restricted to date (Schneider 2004; Mair 2007; and
the papers by Hoffmann, Hundt and Mukherjee 2007 and Sand 2007).
What is more, existing comparative studies tend to focus on differences
between different ESL varieties rather than on common points. In spite of
this, a few phraseological features or tendencies have been reported as oc-
curring in several varieties: Schneider (2004), for example, finds that the
omission of particles in phrasal verbs occurs in East African English, In-
dian English, Philippine English and Singapore English. Piatt, Weber and
Lian (1984) report that collocating the verbs open and close with electrical
switches or equipment as in open the radio or close the hght is common in
six different ESL varieties of English (Hawaiian English, East African Eng-
lish, Hong Kong, Malaysian, Philippine and Singaporean English), and
Sand (2007) finds occurrences of in the light of, discuss about and with
regards to in four different ESL varieties (Singaporean, East African, Ja-
maican and Indian English).
The investigation of phraseological phenomena in learner language has
a longer tradition than in ESL varieties, but is also usually limited to one
specific learner variety and often restricted to the investigation of a small
number of phraseological items by elicitation. One of the few exceptions is
Kaszubski (2000), who investigates a great number of collocations in the
language of learners with different L1 backgrounds. A great number of
phraseological deviations from L1 English is also listed in the Longman
Dictionary of Common Errors (Turton and Heaton [1987] 1996), which is
based on analyses of a huge learner corpus (the Longman Learners' Cor-
pus), which contains writings from foreign learners with a wide range of
different L1 backgrounds. As in the case of phraseological ESL studies, a
number of common points can also be inferred from a close reading of the
existing investigations. For example, collocations involving high-frequency
verbs appear to be a source of deviation across different L1 backgrounds
(Kaszubski 2000; Shei 1999; Nesselhauf 2003, 2005). Various studies also
reveal that learners are often unaware of collocational restrictions in L1
English and at the same time unaware of the full combinatory potential of
words they know (cf. e.g. Herbst 1996; Howarth 1996; Channell 1981;
Granger 1998).
Comparisons of ESL and EFL varieties, finally, are rare in general, with
Williams (1987) and Sand (2005) being two notable exceptions. It seems
that an important reason for this neglect is that the two fields of ESL vari-
ety research and research into foreign learner production (or "interlan-
guage") have remained separate to a large degree so far - a situation that
clearly needs to be remedied.

2.2. Corpora and methodology

Three types of corpora were needed for the present investigation: ESL cor-
pora, learner corpora, and, as a point of reference, L1 corpora. As a corpus
representing several ESL varieties of English, I used the ICE-corpus (Inter-
national Corpus of English). The varieties included in the present study are
Indian English, Singaporean English, Kenyan English and Jamaican Eng-
lish (cf. table 1). It is important to note that the degree of institutionaliza-
tion of English varies in these four countries and, in particular, that Jamai-
can English occupies a special position among the four, in that it may also
with some justification be classified as an ESD (English as a second dia-
lect) variety rather than as an ESL variety.3 The composition of all the ICE-
corpora, with the exception of ICE-East Africa, is the same, with each sub-
corpus containing 1 million words in total, of which 60 % are spoken and
40 % written language. In the case of ICE-East Africa, only the Kenyan
part was included in the present investigation, which contains slightly less
than one million words and about 50 % of spoken and written language
each.

Table 1. ICE-subcorpora used in the analyses4


Corpus: Number of words:
ICE-India 1.14 million
ICE-Singapore 1.11 million
ICE-Jamaica (version 11 June 07) 1.02 million
ICE-East Africa (Kenya only) 0.96 million

As a corpus representing foreign learner language of learners with different
first language backgrounds, I used ICLE (International Corpus of Learner
English). ICLE contains 2.5 million words of learner language from learn-
ers with the following 11 L1 backgrounds: Bulgarian, Czech, Dutch, Fin-
nish, French, German, Italian, Polish, Russian, Spanish and Swedish. The
text type predominantly represented in the corpus is argumentative essays.
An important limitation of the study presented here therefore is the differ-
ent composition of the ESL and the EFL corpora, in particular the fact that
the ESL corpora contain both written and spoken language of various text
types, while the learner corpus is restricted to one kind of written language.
For better comparison, therefore, in some analyses only the written parts of
the corpora will be considered. The difference in size of the ESL and EFL
corpora is less problematic. When comparable size seemed desirable, I only
used a one-million word (precisely: 0.98 million) subcorpus of ICLE, in
which only writings by learners with the L1s Finnish, French, German and
Polish (i.e. one from each language family) were included. This corpus will
be referred to as ICLE-4L1 in what follows. A further limitation of the
study is the size of the corpora in general, which is small for many types of
phraseological features. However, for a first exploration of the topic the
corpora were considered to be adequate.
As a point of comparison, I chose British English as represented in ICE-
GB and in the BNC (British National Corpus). ICE-GB has the same size
and composition as the other ICE-corpora, and the BNC contains 100 mil-
lion words of a great range of text types, of which 10 % are spoken and
90 % written texts. For better comparison, whenever frequencies from the
whole of the BNC are provided, I also provide the frequencies for a ficti-
tious corpus based on the BNC labelled "BNC"-ICE-comp., which is of the
same size and general composition as the ICE-corpora.
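The exact procedure behind the "BNC"-ICE-comp. figures is not spelled out here, so the following is only a minimal Python sketch of one plausible way to derive such comparable values, assuming the corpus sizes given above (roughly 10 million spoken and 90 million written words in the BNC, and a 1-million-word corpus with 60 % spoken and 40 % written material). The function name and the example counts are invented for illustration.

def ice_comparable_count(count_spoken, count_written,
                         bnc_spoken_size=10_000_000,    # ca. 10 % of 100 million words
                         bnc_written_size=90_000_000):  # ca. 90 % of 100 million words
    """Estimate the count expected in a 1-million-word corpus
    made up of 60 % spoken and 40 % written material."""
    per_million_spoken = count_spoken / bnc_spoken_size * 1_000_000
    per_million_written = count_written / bnc_written_size * 1_000_000
    return 0.6 * per_million_spoken + 0.4 * per_million_written

# Invented counts, purely for illustration:
print(round(ice_comparable_count(count_spoken=40, count_written=260), 1))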

3. Analyses and results

3.1. Competing collocations

The starting point for the analysis of competing collocations was an obser-
vation I made in my analysis of collocations in German learner language
(e.g. Nesselhauf 2005). The learners tended to overuse the collocation play
a role, while they hardly used the largely synonymous and structurally
similar collocation play a part. In L1 English, on the other hand, both of
these competing collocations occur with similar frequencies. The use of the
two expressions was thus investigated in L1, L2 and learner varieties, to
find out whether the behaviour of the ESL varieties in any way resembled
the behaviour of the learner varieties and if so, to what degree and why.
The results of this investigation (with only the written parts of the cor-
pora considered) are provided in figure 1.

[bar chart; absolute frequencies of PLAY + ROLE vs. PLAY + PART: BNC 3349/3236, ICE-Jam 28/12, ICE-Sing 48/4, ICE-Ind 55/3, ICE-Ken 67/3, ICLE-4L1 121/32]
Figure 1. PLAY + ROLE / PART in the written parts of the BNC, ICE and ICLE

Here and elsewhere, the bars in the graphs indicate the relative frequencies
of the relevant expressions, but the absolute frequencies are also given on
each individual bar. The results confirm my earlier observations. In the
written part of the BNC the two expressions have almost the same frequen-
cies, and in ICLE-4L1, play a role is used in about 80 % and play a part in
about 20 % of the occurrences of either expression. In the ESL varieties,
the proportion of play a role is also consistently greater than that of play a
part. Except in the case of Jamaican English, the proportion of play a role
is even greater in the ESL varieties than in the learner varieties. So it seems
that overuse of play a role at least partly at the expense of play a part is a
feature of both types of varieties under investigation.
To find out whether this result is restricted to this particular pair of
competing collocations or whether it reveals a more general tendency, an-
other group of competing collocations was investigated: take into consid-
eration, take into account and take account of. The results are displayed in
figure 2.

[bar chart; absolute frequencies of TAKE + into consideration / TAKE + into account / TAKE + account of: BNC [?]/2617/1526, ICE-Ken 11/15/2, ICE-Jam 7/9/0, ICE-Ind 8/9/0, ICE-Sing 9/9/0, ICLE-4L1 70/66/3]
Figure 2. TAKE + consideration / account in the written parts of the BNC, ICE and ICLE

In the BNC, a clear dominance of take into account can be observed, fol-
lowed by the expression take account of. In ICLE, take account of is hardly
present (with only three instances) and take into consideration is slightly
more frequent than take into account. In other words, the expression with
the lowest proportion in L1 English has the highest proportion in written
learner language (or at least in the type of learner language investigated
here). The ESL corpora resemble the learner corpus in that take account of
is also hardly present (with two instances in total, in the Kenyan corpus).
They also resemble the learner corpus in that the proportion of take into
consideration is considerably larger than in the L1 corpus. This means that
for both groups of competing collocations investigated, the behaviour of
the ESL varieties resembles the behaviour of the learner varieties much
more closely than the one of the L1 variety.
What seems to be happening in both cases discussed here (i.e. play a
role / part and take + consideration / account) resembles what Sand has
described for the morphosyntax of ESL varieties, namely that there is a
tendency for functionally equivalent variants to be "whittled down to a
smaller number of choices or a single preferred variant" (2005: 211). This
tendency therefore also seems to hold for the area of phraseology and
seems to operate not only in ESL but also in EFL varieties.
A question arising from this observation is which phraseological vari-
ants tend to be preferred if the language provides a choice of several.
Closer analysis reveals that overall frequency is not an important factor: in
the whole of the BNC, the proportion of play a role (vs. play a part) is
50.8 %, the proportion of take into consideration (vs. its two competitors)
is 6.3 %; in the spoken part of the BNC, the proportions are 54.4 % for play
a role and 13.0 % for take into consideration. To investigate whether the
degree of formality of the expressions might be a factor in the observed
preference patterns, I calculated the written versus spoken ratios of the
relevant expressions, with the following results: play a role: 3.0:1; play a
part: 3.3:1; take into consideration: 0.5:1, take into account: 1.3:1 and take
account of: 2.7:1. In the first case, therefore, the preferred expression is
very slightly more informal than the dispreferred one; in the second, it is
the clearly most informal expression that is preferred in both ESL and EFL
varieties. Degree of formality may therefore have some influence on the
preference pattern but does not seem to be the decisive factor either.
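Both measures used in this paragraph are simple to compute once the counts are available. The Python sketch below shows the two calculations; since the written and spoken components of the BNC differ greatly in size, the written-versus-spoken ratio is assumed here to be based on frequencies normalised per million words, and all counts in the example are invented.

def proportion(count_a, count_b):
    """Share of expression A among all occurrences of A and B."""
    return count_a / (count_a + count_b)

def written_spoken_ratio(count_written, count_spoken,
                         written_size=90_000_000, spoken_size=10_000_000):
    """Ratio of per-million-word frequencies in the written vs. the spoken BNC."""
    freq_written = count_written / written_size * 1_000_000
    freq_spoken = count_spoken / spoken_size * 1_000_000
    return freq_written / freq_spoken

# Invented counts, purely for illustration:
print(f"{proportion(55, 45):.1%}")                  # proportion of variant A
print(f"{written_spoken_ratio(2700, 100):.1f}:1")   # written : spoken ratio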
What seems to have a greater impact is the degree of language-internal
regularity of the expressions involved. If the meanings of the nouns consid-
eration and account are considered outside the collocations investigated
here, it turns out that while consideration keeps one of its regular uses
(namely "careful thought about") in the collocation, account does not. In
addition (and probably related to this first point), take into consideration is
related to the verb consider as are many other verbs in English: the mean-
ings are roughly synonymous, and the noun in the collocation and the re-
spective verb are derivationally related.5 In the case of play a role ^play
a part the difference in the degrees of what I have called language-internal
regularity is more subtle, as both part and role keep one of their regular
uses ("function or involvement in a situation") in the collocations. How-
ever, role in this sense is fairly frequent, whereas this sense of part is infre-
quent (in particular in relation to the overall frequency of this word). So, in
a sense, it can perhaps also be claimed that play a role is language-
internally more regular than its counterpart.

3.2. Internally variable collocations

Analyses of learner language have revealed that foreign learners tend to
treat collocations as if they were more variable than they actually are (see
Section 2.1). In my analysis of German learner language, I found, for ex-
ample, that the collocation have an intention is used with greater variability
in learner than in native speaker writing. A quick glance at the concordance
line of the collocation in ICE-GB already reveals the low degree of vari-
ability of the collocation in L1 English, with the predominant realization
being HAVE no intention of -ing (cf. figure 3).

Figure 3. Concordance of HAVE + INTENTION (span 5) in ICE-GB
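The concordances in figures 3 and 4 rest on a simple query: the lemma HAVE with intention or intentions within a span of five words. The following Python sketch illustrates what such a span-based search might look like on plain tokenised text; it is not the retrieval procedure actually used for ICE-GB, the BNC or ICLE (which have their own query tools), the window is simplified to a rightward span, and the word-form lists and the example sentence are assumptions made for illustration.

import re

HAVE_FORMS = {"have", "has", "had", "having"}
NOUN_FORMS = {"intention", "intentions"}

def find_have_intention(text, span=5):
    """Return simple concordance-like strings for forms of HAVE
    followed by intention(s) within `span` tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok in HAVE_FORMS and any(t in NOUN_FORMS for t in tokens[i + 1 : i + 1 + span]):
            hits.append(" ".join(tokens[max(0, i - 5) : i + 1 + span]))
    return hits

# Invented example sentence, purely for illustration:
print(find_have_intention("She said she had no intention of resigning."))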

An examination of the collocation in the BNC confirms this and reveals
that of the other realizations of the collocation, most are either instances
including a different negative or restrictive premodifier (such as not have
any intention of -ing or have little intention of -ing) or instances of the
patterns HAVE every intention of -ing and have good/bad intentions.
A comparison of the concordance lines from ICE-GB with the ones
from ICLE (figure 4) indicates - despite the low overall numbers - that
learners' use of this collocation is fairly different.

Figure 4. Concordance of HAVE + INTENTION (span 5) in ICLE

There is only one instance of HAVE no intention of -ing, plus one with
intention in the plural; none of the other instances is either premodified by a
negative or restrictive modifier or an instance of the other patterns preferred
in L1, and complementation tends to be infinitival rather than with of -ing.

[bar chart; categories: HAVE no INTENTION, (not) HAVE (any/little/no + adj.) intention, HAVE (every/the etc.) INTENTION; absolute frequencies: BNC* 182/22/30, "BNC"*-ICE-comp. 4.1/3.5/1.1, ICE-GB 6/0/2, ICE-L2 13/[?]/8, ICLE 2/1/5]
Figure 5. Noun premodification of HAVE + INTENTION in BNC, ICE and ICLE

Figure 6. Noun pluralization of HAVE + INTENTION in BNC, ICE and ICLE

Figures 5 to 7 give the results for premodification, pluralization of inten-


tion, and complementation, respectively, for L1, learner and L2 varieties.
Figure 5 reveals that almost 80 % of the occurrences in the BNC are in-
stances of HAVE no INTENTION,6 whereas in ICLE, almost 80 % of the
admittedly fairly few instances are realized differently. In the ICE-corpora,
which had to be considered together here due to the low numbers involved,
the collocation is realized as HAVE no INTENTION in slightly more than
50 % of the cases.
The picture for pluralization of the noun intention looks quite similar,
with around 5 % of pluralization in the L1 corpora, about 20 % in the L2
corpora, and almost 40 % in ICLE (cf. figure 6).

Figure 7. Complementation of HAVE + INTENTION in BNC, ICE and ICLE



With respect to complementation, figure 7 reveals that in the L1 corpora,
almost all instances of the collocation are complemented with of -ing (with
only about 3 % of infinitival complementation in the BNC). In the L2 cor-
pora, infinitival complementation occurs in about 20 % of the cases and in
ICLE in about 70 %. Therefore, with respect to all aspects of variability of
this collocation, what are infrequent variants in L1 English either seem to
be the predominant variants in learner English or at least variants that are
used with about the same frequency as the major (L1) variant. Furthermore,
the behaviour of the ESL varieties with respect to the variability of this
collocation consistently lies in-between the behaviour of the L1 and the
learner varieties.
A further example of an internally variable collocation is come in / into
contact.

Figure 8. COME in / into contact in BNC, ICE and ICLE

In L1 English, come into contact is the major variant, and come in contact
the minor one, with about 15 % of the occurrences in the BNC and about
25 % in ICE-GB (cf. figure 8). In contrast, in ICLE, both variants are used
equally often, and in the ESL corpora, come into contact is only slightly
more frequent than come in contact. The behaviour of the L2 varieties
therefore again lies in-between the L1 and learner varieties, although it is
fairly close to the latter here, and an infrequent variant is again treated as an
equivalent variant in learner language. Due to the small numbers here and
above, further research is needed to confirm the results, but the fact that the
tendencies are the same throughout the analyses here might be an indication
that the observed tendencies are more general.
When the results concerning variable collocations are compared to those
obtained in the previous section on competing collocations, it appears that
these two groups are treated in what may be called opposite ways in both
EFL and ESL varieties: in the case of competing collocations, it seems that
the variants tend to be reduced, whereas in the case of variable collocations,
it seems that minor variants tend to be treated as equivalent or even major
ones.

3.3. New prepositional verbs

The third type of phraseological units to be treated here are new preposi-
tional verbs (cf. Mukherjee 2007). In descriptions of individual ESL varie-
ties, certain new prepositional verbs are sometimes listed as features of
different varieties by different authors. Comprise of, for example, is cited as
a feature of Kenyan English by Skandera (2003) and as a feature of Indian
English by Mukherjee (2007). What is more, the phenomenon as such, i.e.
the use of simple L1 verbs used with a preposition, also occurred in my
analysis of German learner language (Nesselhauf 2005). One non-L1
prepositional verb that occurred in my learner data was enter into, used in
cases where L1 English would simply have enter.7
(1) ... we have all those friendly ... guys who usually use ... their cars to enter
into the city (ICLE-German)
(2) Probably they'll lose importance and English expressions will enter into
those languages much [more easily] (ICLE-German)
To find out whether this particular use also occurred in ESL varieties, I
consulted the ICE-corpora, and indeed, instances such as the following do
occur:
(3) The plasma membrane of a cell which forms its outer wall does not
normally allow external molecules to enter into the cell (ICE-Ind, w2b-027)
(4) ... the drugs are allowed to enter into the country. (ICE-EA, classlessonK)
In order to investigate whether parallel usage can also be found with re-
spect to other new or non-L1 prepositional verbs in EFL and ESL corpora, I
examined a number of such verbs that had either been cited as a feature of
one or even several ESL varieties or that had been found to occur in learner
language. Those verbs which the analysis has shown to occur in at least two
EFL and at least two ESL varieties are provided in table 2.

Table 2. New prepositional verbs in ICE and in learner varieties

                       ICE-GB  ICE-Jam  ICE-Sing  ICE-Ind  ICE-Ken  ICLE  Dict. of C. Errors
answer (V) to             -       -        -         1        2       3        yes
approach (V) to           -       -        -         2        1       1        yes
comprise of               1       2        2         8       13       1        yes
demand (V) for            -       -        -         1        2       3        -
discuss about             -       -        9        14        7      10        yes
emphasiz/se on            -       -        3         -        4       2        yes
enter into + [place]      -       -        1        16        3       7        yes
invite for                -       -        2         3        2       8        -
request (V) for           -       -        3         6       11       -        yes

While both the procedure and the low numbers again do not allow definite
conclusions, a quantitative tendency that can be inferred from the table is
that Jamaican English seems to make much less use of new prepositional
verbs than the other varieties under investigation. It also appears that many
of the prepositional verbs are used more frequently in the ESL varieties
than in the EFL varieties (N.B. that ICLE is more than double the size of
the individual ICE-corpora, cf. Section 2.1). A possible reason for this is
that these prepositional verbs have actually become or are in the process of
becoming features of several ESL varieties, while in learner language they
probably tend to be created on the spot by individual learners.
Nevertheless, if the same prepositional verbs are created in the two
types of varieties, it does not seem unreasonable to assume that similar
processes must be at work, and that these processes must go beyond simple
L1 influence and instead be based on constellations found in L1 English.
An investigation of this question (cf. also Nesselhauf 2009) revealed four
factors that seem to play a part in the creation of new or non-L1 preposi-
tional verbs (cf. figure 9).
[diagram: four factors converging on "new prepositional verb": (1) a derivationally related noun exists and takes the preposition in question, e.g. request (N) for, answer (N) to, emphasis on; (2) the verb + preposition collocation exists, with a different meaning, in a different construction etc., e.g. be comprised of, enter into + agreement, discussion etc.; (3) similar verbs take the preposition, or the independent meaning of the preposition is adequate, e.g. demand for: ask for, discuss about: talk / speak about, about = 'on the subject of'; (4) the verb has the meaning features 'movement' + 'direction', which are broken up, e.g. enter + into, approach + to]
Figure 9. Factors influencing the creation of new or non-L1 prepositional verbs

One important factor seems to be that a noun that is derivationally related
to the verb in question exists and takes the preposition in question (such as
a request for, emphasis on etc.). A further factor appears to be that the rele-
vant verb + preposition collocation exists, albeit in a different construction,
with a different meaning, or the like (such as in the case of be comprised of,
enter into + agreement, discussion etc.). A third possible factor seems to be
that verbs with similar meanings take the preposition in question or (or and)
that the preposition that is added appears adequate for the meaning that is
expressed by the verb (as in the case of about, which can mean "on the
subject of"; verbs with similar meanings are, for example, ask for, which
may influence the use of demand for etc.). A further possible factor, which
however affects fewer verbs, is that a verb has the features of meaning
"movement" and "direction" (as in enter or approach), which then appear
to be broken up and assigned to individual forms (as in enter + into, ap-
proach + to).
It appears that in most if not all cases of non-L1 prepositional verbs at
least two factors are at work (cf. figure 10).

[diagram: factors converging on individual new or non-L1 prepositional verbs, e.g. demand for (demand (N) for; ask for), answer to (answer (N) to; reply / respond to; answer to in the collocation answer to a name and in the sense 'have to explain one's actions to sb.'), comprise of (be comprised of; consist of), enter into (enter into + N; enter + into)]
Figure 10. The creation of individual new or non-L1 prepositional verbs

In the case of demand for, for example, it seems to be the circumstance that
both the corresponding noun and the semantically related verb ask take the
preposition for that leads to its creation. In the case of answer to, three fac-
tors seem to be involved: the fact that the noun as well as several semanti-
cally related verbs (reply / respond to) take the preposition, and the fact that
the verb-preposition combination does exist, both in the sense of "have to
explain one's actions to sb." and in the collocation answer to a name. A
hypothesis would therefore be that if at least two factors coincide, the prob-
ability that a new or non-L1 prepositional verb is created in ESL and EFL
is high.

4. Conclusion

The investigation presented in this paper has demonstrated that certain
phraseological characteristics are shared by ESL and EFL varieties (and, by
implication, are also shared by several ESL and EFL varieties, respec-
tively). These are the redistribution of competing collocations, the greater
internal variability of collocations than in L1 English, and the use of new
prepositional verbs. In all three cases it is both the phenomenon as well as
several individual instantiations of the phenomenon that occur across varie-
ties and across variety types.
The behaviour of the ESL varieties in relation to L1 and learner varieties
seems to depend on the type of phraseological phenomenon. In some cases,
the behaviour of the (or many) ESL varieties seems to lie in-between the
behaviour of the L1 and the learner varieties (as in the case of most aspects
of the internal variability of collocations). In other cases, the behaviour of
the (or many) ESL varieties seems to be even further removed from the L1
than the behaviour of the EFL varieties (for example in the case of the new
prepositional verbs).
While these results need to be confirmed by further analyses and com-
plemented by the investigation of further phraseological phenomena, the
present exploratory investigation has also revealed that an analysis along
the lines sketched here can lead to important insights both into the nature of
ESL and EFL varieties and into the workings of phraseology in language
environments where normative influence is less strong than in an ENL
(English as a native language) context.

Notes

1 The term "variety" is used in a broad sense here and includes the output of
(advanced) foreign learners with a certain L1 background, although such "va-
rieties" lack stability.
2 Parts of this paper are based on Nesselhauf (2009).
3 The classification depends on whether or not Jamaican Creole is considered
as a variety of English.
4 Thanks go to Christian Mair and the English Department of the University of
Freiburg for letting me use this preliminary version of ICE-Jamaica and to
Andrea Sand and Ute Römer for providing me with a stripped version of ICE-
GB.
5 The terms for such collocations vary widely; some examples are "support
verb constructions", "light verb constructions" and "stretched verb construc-
tions".
6 The BNC is marked with an asterisk in this table, as only a random sample of
300 instances of HAVE + INTENTION in a span of 5 were considered (the
reason being the necessary manual disambiguation of instances, as not all in-
stances thrown up by this search constitute instances of the relevant colloca-
tion).
7 For learner language, it might be more appropriate to speak of "non-L1
prepositional verbs" than of "new prepositional verbs".

References

Ahulu, Samuel
1995 Variation in the use of complex verbs in international English. Eng-
lish Today 42: 28-34.
Channell, Joanna
1981 Applying semantic theory to vocabulary teaching. ELT Journal 35
(2): 115-122.
Crystal, David
2003 English as a Global Language. 2nd edition. Cambridge: Cambridge
University Press. First published in 1997.
Granger, Sylviane
1998 Prefabricated patterns in advanced EFL writing: Collocations and
formulae. In Phraseology: Theory, Analysis, and Applications, An-
thony P. Cowie (ed.), 145-160. Oxford: Clarendon.
Herbst, Thomas
1996 What are collocations: Sandy beaches or false teeth? English Studies
77 (4): 379-393.
Hoffmann, Sebastian, Marianne Hundt and Joybrato Mukherjee
2007 Indian English: An emerging epicentre? Insights from web-derived
corpora of South Asian Englishes. Paper presented at ICAME-18,
23-27 May 2007.
Howarth, Peter
1996 Phraseology in English Academic Writing: Some Implications for
Language Learning and Dictionary Making. Tübingen: Niemeyer.
Kaszubski, Przemyslaw
2000 Selected aspects of lexicon, phraseology and style in the writing of
Polish advanced learners of English: A contrastive, corpus-based ap-
proach, http://mam.amu.edu.pl/~przemka/research.html.
Mair, Christian
2007 Varieties of English around the world: Collocational and cultural
profiles. In Phraseology and Culture in English, Paul Skandera (ed.),
437-468. Berlin/New York: de Gruyter.
Mukherjee, Joybrato
2007 Structural nativisation in Indian English: Exploring the lexis-
grammar interface. In Rainbow of Linguistics, Niladri Sekhar Dash,
Probal Dasgupta and Pabitra Sarkar (eds.), 98-116. Calcutta: T. Me-
dia Publication.
Nesselhauf, Nadja
2003 Transfer at the locutional level: An investigation of German-
speaking and French-speaking learners of English. In English Core
Linguistics: Essays in Honour of D. J. Allerton, Cornelia Tschichold
(ed.), 269-286. Bern: Lang.
Nesselhauf, Nadja
2005 Collocations in a Learner Corpus. Amsterdam/Philadelphia: Benja-
mins.
Nesselhauf, Nadja
2009 Co-selection phenomena across New Englishes: Parallels (and dif-
ferences) to foreign learner varieties. English World-Wide 30 (1): 1-
26.
Platt, John, Heidi Weber and Ho Mian Lian
1984 The New Englishes. London/Boston, MA/Melbourne: Routledge.
Sand, Andrea
2005 Angloversals? Shared Morpho-Syntactic Features in Contact Varie-
ties of English. Unpublished monograph. University of Freiburg.
Sand, Andrea
2007 Patterns and language contact: Multiword units in the New Eng-
lishes. Paper presented at ICAME-18, 23-27 May 2007.
Schilk, Marco
2006 Collocations in Indian English: A corpus-based sample analysis.
Anglia 124 (2): 276-316.
Schmied, Josef
2004 Cultural discourse in the corpus of East African English and beyond:
Possibilities and problems of lexical and collocational research in a
one-million-word corpus. World Englishes 23 (2): 251-260.
Schneider, Edgar W.
2004 How to trace structural nativization: Particle verbs in world Eng-
lishes. World Englishes 23 (2): 227-249.
Shei, Chi-Chiang
1999 A brief survey of English verb-noun collocation.
http://www.dai.ed.ac.uk/homes/shei/survey.html.
Skandera, Paul
2003 Drawing a Map of Africa: Idiom in Kenyan English. Tübingen: Narr.
Turton, Nigel D. and John B. Heaton
1996 Longman Dictionary of Common Errors. 2nd edition. Harlow: Long-
man. First published in 1987.
Williams, Jessica
1987 Non-native varieties of English: A special case of language acquisi-
tion. English World-Wide 8 (2): 161-199.
Wolf, Hans-Georg and Frank Polzenhagen


2007 Fixed expressions as manifestations of cultural conceptualizations:
Examples from African varieties of English. In Phraseology and
Culture in English, Paul Skandera (ed.), 399-435. Berlin/New York:
Mouton de Gruyter.

Corpora

BNC The British National Corpus. Distributed by Oxford University


Computing Services on behalf of the BNC Consortium.
http://www.natcorp.ox.ac.uk/.
ICLE International Corpus of Learner English, Version 1.1. 2002.
Sylviane Granger, Estelle Dagneaux, Fanny Meunier, eds. Uni-
versité catholique de Louvain: Centre for English Corpus Lin-
guistics.
ICE-GB The International Corpus of English: The British Component.
1998. Gerald Nelson, London, Survey of English Usage, Univer-
sity College.
ICE-East Africa The International Corpus of English: The East-African Corpus.
2002. Gerald Nelson, Hong Kong.
ICE-India The International Corpus of English: The ICE-India Corpus.
2002. Gerald Nelson, Hong Kong.
ICE-Singapore The International Corpus of English: The Singapore Corpus.
2002. Gerald Nelson, Hong Kong.
Writing the history of spoken standard English in the twentieth century

Christian Mair

1. Introduction

In contrast to most other contributions to this volume, the present essay is
not primarily a tribute to John Sinclair, the pioneer in the compilation of
large digital corpora and research on the role of collocations and chunks in
the lexicon and grammar, but to John Sinclair, pioneer in the study of re-
search on spoken English, the author of A Course in Spoken English
Grammar (1972) and (with Malcolm Coulthard) of Towards an Analysis of
Discourse (1975). Collocations, colligation, "chunks" will play their due
role in the following analyses, but the main thrust of the argument is to
foster awareness of the paramount importance of speech in the analysis of
language change, corpus-based or otherwise.
Like linguists of many other persuasions, corpus linguists are quick to
point out the primacy of speech over writing in theory but equally quick to
ignore this postulate in their practical analytical work, for example by re-
ducing "live" spoken data to rather poorly annotated orthographic transcrip-
tions in most digital corpora of spoken language. Owing in part to the early
work by John Sinclair, we now know a lot about how the grammar of spo-
ken English differs from that of written English synchronically (cf., e.g.,
the comprehensive survey documentation provided in Biber et al. 1999).
The present contribution will shift the perspective to diachrony, asking to
what extent ongoing grammatical changes in spoken and written English
run "in sync" and to what extent change in the two media follows divergent
paths of development.
John Sinclair's point that there should be "rather more emphasis on in-
formal spoken English than you commonly find in grammars" (1972: 1)
will thus be heeded in the present study, which is based on the author's
more than fifteen years of experience in the corpus-based real-time study of
ongoing changes in written English and aims to integrate his first insights
gleaned from the corpus-based real-time study of spoken English into the
picture thus obtained. As we shall see, a comparison of results derived from
standard reference corpora of written English such as the "Brown family"
and the recently released Diachronic Corpus of Present-Day Spoken Eng-
lish (DCPSE) yields at least the four following possible constellations:
(1) changes exclusive to the spoken language and not attested (as yet?) in
writing,
(2) changes going on simultaneously in speech and writing,
(3) changes going on in the spoken language spreading to writing, with a time-
lag,
(4) changes exclusive to the written language.
Naturally, these constellations are not always attested in their "pure" forms,
and mixed types are possible - for example, if only some written genres
show a time lag in the spread of an innovation. Judging from historical
precedent (for example complex prepositions such as notwithstanding, with
regard to or regarding - see Hoffmann 2005), a fifth type - changes origi-
nating in the written language spreading to the spoken medium, with a time
lag - should not be ruled out, but will not play a major role in the following
remarks.
Constellation (1), innovation restricted to the spoken language, and con-
stellation (2), simultaneous advance of an innovation in spoken and written
data, will be illustrated in section 2 below on the basis of particular types of
specificational cleft sentence - most prominently associated with chunks
such as All X did was (e.g. ask / to ask / X asked a question) or What X did
was (ask / to ask / X asked a question), but also allowing for numerous
variations on this theme of the kind illustrated in the actual corpus exam-
ples to be discussed below. Section (3), on "modality on the move" (Leech
2003), will mainly exemplify constellation (3), time-lags between broadly
parallel developments in speech and writing, while the final section (4),
which deals with an aspect of noun-phrase grammar, will show the poten-
tial for autonomous diachronic development in the written medium.

2. Specificational clefts in twentieth century English

The transition from specificational clefts of the type All I did was to ask
(with the focussed element realized as a marked infinitival clause) to All I
did was ask (unmarked infinitive) represents a relatively little noticed in-
stance of ongoing syntactic change in present-day English. The phenome-
non has received some attention in a number of studies by Günter Rohden-
burg, who - based on the strength of an analysis of an electronic anthology
Writing the history of spoken standard English in the twentieth century 181

of newspaper articles from the Times of London (The Changing Times


1785-1992) - notes that the "evidence ... suggests that we are dealing here
with another case of the marked infinitive being ousted by the bare infini-
tive in the second half of the 20th century" (2000: 31). His main concern,
however, is not the historical development of the construction but structural
determinants of synchronic variation. For example, he notes that the un-
marked infinitive is less likely to be used if (a) material intervenes between
the forms of do and be (e.g. All I did on that occasion was (to) help), (b) be
is used in the past rather than in the present (All I did was (to) help) and (c)
if the form of be is complex (e.g. All I can do will be to help). Traugott
(2008), on the other hand, is an in-depth analysis of the long-term history
of the construction since Early Modern English, whose major focus is nei-
ther on very recent and ongoing developments nor on the charting of statis-
tical trends.
However, attention to the recent past and to statistical shifts is called for
in this case, as both written and spoken evidence suggests that the twentieth
century was a particularly dramatic phase in this particular change - to the
extent that we can observe a clear and consistent reversal of usage prefer-
ences across genres and varieties during this period. This is the situation
which emerges from an analysis of trends in written usage in twentieth
century British and American English, based on the "Brown family" of
one-million word reference corpora.
As is shown in examples (1) and (2), the corpora illustrate variation be-
tween marked and unmarked infinitives in specificational cleft sentences:
(1) Then you have four sections to your speech. Decide then what you want to
say in each - and the best way of saying it - and then rehearse it over and
over again. But don't memorize it word for word. All you need do is to
remember the four names - and the order in which they come. (LOB F)
(2) You haven't written much - apart from letters - since you left school. Don't
worry. Editors don't want school compositions or essays. All you need do is
tell it like it is, write as though you were talking to your neighbour over the
garden fence. (F-LOB E)
As the following example (from the spoken material of the DCPSE) shows,
-ing-complements are occasionally encountered alongside the more typical
infinitival ones (presumably triggered by a preceding -ing-form in the first
part of the cleft construction):2
(3) Well it's very very unlikely uhm because the other thing I'm doing is try is
trying to pass a driving test <DCPSE:DI-C07/ICE-GB:S1A-097
#0134:1:A>

Table 1 and figure 1 summarize the relevant findings from the five corpora
of the Brown family - based on searches for possible combinations of do
and be in adjacent position (such as, for example, did was, does is, did is,
etc.).3

Table 1. Specificational clefts in five corpora

                  British English          American English
                  B-LOB   LOB   F-LOB      Brown   Frown
To-infinitive     16      10    5          9       3
Bare infinitive   0       5     14         11      17
-ing              0       0     2          1       2

[Figure: stacked bar chart showing the proportions of to-infinitive, bare infinitive and -ing
complements in specificational clefts in five corpora of 20th century written English]

Figure 1. Specificational clefts in five corpora
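
The retrieval step behind table 1 and figure 1 can be illustrated in a few lines of Python. This is a purely hypothetical sketch rather than the software actually used for the present study: it assumes an untagged plain-text corpus file (the file name is invented) and simply counts adjacent do-form + be-form pairs, so its hits still need to be checked manually for genuine specificational clefts, and - as note 3 points out - a small number of clefts with intervening material will be missed.

import re
from collections import Counter

# Count adjacent DO-form + BE-form pairs such as "did was" or "does is"
# in a plain, untagged text file (illustrative only).
DO_FORMS = r"(?:do|does|did|done|doing)"
BE_FORMS = r"(?:is|was|are|were|be|been|being)"
PATTERN = re.compile(rf"\b({DO_FORMS})\s+({BE_FORMS})\b", re.IGNORECASE)

def count_do_be_pairs(path):
    """Return a Counter of adjacent do/be combinations found in the file."""
    with open(path, encoding="utf-8") as fh:
        text = fh.read()
    return Counter(" ".join(match.groups()).lower()
                   for match in PATTERN.finditer(text))

# Example call (the file name is made up):
# print(count_do_be_pairs("lob_sample.txt").most_common(10))

Such a crude adjacency search trades a little recall for simplicity, which is precisely the trade-off discussed in note 3.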

Covering sixty years of diachronic development, the British data show that
the only attested form in the 1930s was the All I did was to ask type, with
the to-infinitive, while by 1991 preferences had clearly been reversed to All
I did was ask, with the unmarked or bare infinitive, with the 1961 LOB data
representing a transitional stage. The thirty years' development in Ameri-
can English covered by Brown and Frown shows a parallel direction of the

change, with American English leading British English; -ing-complements
are not attested in sufficient numbers to allow any statistical conclusions.4
In the analysis of the spoken-language data from the DCPSE, we find
the three constructional types already discussed (to-infinitive, unmarked
infinitive, and, more rarely, -ing), but also the following, additional one - a
finite clause "echoing" the structure of the preceding part of the cleft sen-
tence:
(4) What they're doing is they're working on the < > Pascal tiling which they'll
have to < > uhm do at Cambridge because < > from Agnieszka's point of
view it was so difficult despite the fact that she's <> really good
(<DCPSE:DI-B01/ICE-GB:S1A-005#0141:1:B>)
Such structures are common enough to consider them as instances of a
conventionalized grammatical construction, albeit one that is restricted to
the spoken language. As always in corpus analyses of spontaneous speech,
however, it is difficult to set the limits of what one usefully includes in
one's counts. Thus, while the following example is clearly very interesting
in terms of Paul Hopper's (1998) notion of emerging grammar (and similar
constructions are in fact discussed in Hopper 2001 and 2004), it was not
counted in the present analysis, as too much material intervenes between
the two parts of an arguable cleft-sentence-like focus construction:
(5) what I like doing is uhm < > with the Pakistani children and the Indian
children the infants when their tooth falls out in school and they cry < >
and if they've got enough English I explain to them that in England < > you
put it under the pillow (<DCPSE:DI-B28/LLC:S-04-03 #0592:1:A>)
"Orderly" notions of grammatical structure inspired by written English fail
here, leading to an analysis of the passage as an anacoluthon, with what I
like doing is... representing a false start which is not taken up again. Alter-
natively, we could assume contamination of what I like doing with the
Pakistani children and the Indian children the infants when their tooth falls
out in school and they cry... and if they 've got enough English I explain to
them that in England you put it under the pillow. Seen in its discourse con-
text and from the point of view of the speaker, however, this is an act of
focussing, functionally equivalent to a cleft sentence What I like doing is
[to] explain .... In other words, something which is partly ungrammatical
at the syntactic level turns out to be a very successful instance of attention-
getting and competent floor-holding in discourse-analytical terms.
But let us return to the realm of "grammar proper" even in our analysis
of the spoken data (if only to ensure comparability with the findings ob-

tained from the "Brown family"). Table 2 gives the frequencies of the four
recurrent types of specificational clefts, i.e. those which could be consid-
ered conventional grammatical constructions, in the spoken corpus. "LLC"
indicates that examples are from the "old" (London-Lund Corpus, 1958-
1977) part of the DCPSE, whereas ICE-GB indicates origin in the "new"
(ICE-GB, 1990-1992) part.

Table 2. Four types of specificational clefts in the DCPSE

         to-infinitive   Unmarked infinitive   -ing   finite "echo" clause
LLC      24              9                     1      11
ICE-GB   18              31                    0      6

(Chi square to- vs. bare infinitive: p=0.0030)

Let us focus on the three constructions familiar from the written corpora
first, that is the two types of infinitival complements and the rare -ing-
complement. Here the most striking result is that the reversal of preferences
in British English spoken usage is virtually simultaneous with the one ob-
served in writing. Clearly, this is not what one would expect given the gen-
erally conservative nature of writing. As for -ing-complements, they are as
marginal in this small diachronic spoken corpus as in the written corpora of
the Brown family.
The exclusively spoken finite-clause complement (All I did was I
asked), on the other hand, is amply attested and apparently even on the rise
in terms of frequency. This raises an interesting question: Why is it that one
innovative structure, the bare infinitival complement (All I did was ask),
should show up in written styles so immediately and without restriction,
whereas the other, the finite-clause type, should be blocked from taking a
similar course? The reason is most likely that the finite-clause variant is not
a grammatically well-formed and structurally complete complex sentence
and therefore not felt to be fully acceptable in writing. That is, writers re-
frain from using it for essentially the same reasons that they shun left- and
right-dislocation structures or the use of copy pronouns (e.g. this man, I
know him; that was very rude, just leaving without a word). And just as
such dislocation structures are presumably very old, the finite type of speci-
ficational cleft, unlike the unmarked infinitive, may not really be an innova-
tion but an old and established structure which merely failed to register in
our written sources.

Further corpus-based research on specificational clefts should proceed


in two directions. On the basis of much larger corpora of (mostly written)
English, it should be possible to determine the history and current status of
the -ing-complement, which is not attested in sufficient numbers either in
the Brown family or in the DCPSE. Possibly small but specialized corpora
of speech-like genres (informal letters, material written by persons with
little formal education, Old Bailey proceedings, etc.) are needed, on the
other hand, to establish the potentially quite long history of the finite-clause
type.

3. "Modality on the move" (Leech 2003)

Modal verbs, both the nine central modals and related semi-auxiliaries and
periphrastic forms, have been shown to be subject to fairly drastic dia-
chronic developments in twentieth and twenty-first century written English.
The point has been made in several studies based on the Brown family (e.g.
Leech 2003; Smith 2003; Mair and Leech 2006; Leech et al. 2009). Other
studies, such as Krug (2000), have explored the bigger diachronic picture
since the Early Modern English period and show that such recent changes
are part of a more extended diachronic drift. Considering the central role of
modality in speech and writing, modals are thus a top priority for research
in the DCPSE.5
Table 3 shows the frequency of selected modal verbs and periphrastic
forms in the oldest (1958-1960) and most recent (1990-1992) portions of
the DCPSE. The restriction to the first three years, at the expense of the
intervening period from 1961 to 1977, was possible because modals are
sufficiently frequent. It was also desirable because in this way the extreme
points of the diachronic developments were highlighted. What is a potential
complication, though, is the fact that it is precisely the very earliest DCPSE
texts which contain the least amount of spontaneous conversation, so that a
genre bias might have been introduced into the comparison.

Table 3. Real-time evidence from spoken English - frequencies of selected
         modals and semi-modals of obligation and necessity in the DCPSE
         (Klein 2007)

DCPSE            1958-1960             1990-1992             Log likelihood   Diff (%)
                 total    n/10,000     total    n/10,000
must             38       10.21        195      4.63         16.61**          -54.67
(HAVE) got to    24       6.45         185      4.39         2.84             -31.90
HAVE to          34       9.31         555      13.17        4.97*            +44.21
need (aux.)      0        0.00         1        0.02         0.17
NEED to          1        0.27         116      2.75         13.15**          +924.80
Total            97       26.06        1052     24.97        0.16             -4.19

Log likelihood: a value of 3.84 or more equates with chi-square values of p < 0.05;
a value of 6.63 or more equates with chi-square values of p < 0.01. *HAVE to
1958-60 vs. 1990-92: significant at p < 0.05; **must, need to 1958-60 vs. 1990-92:
significant at p < 0.01.
CAPITALIZED forms represent all morphological variants.
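
The log-likelihood values and normalized frequencies in table 3 can be recomputed from raw frequencies and (sub)corpus sizes alone. The following sketch is merely illustrative and is not the procedure used in Klein (2007); in particular, the word totals for the two DCPSE portions (roughly 37,200 and 421,300 words) are back-calculated here from the normalized frequencies in the table and are therefore only approximations.

from math import log

def log_likelihood(freq1, words1, freq2, words2):
    """Log-likelihood (G2) for one item's frequency in two (sub)corpora."""
    expected1 = words1 * (freq1 + freq2) / (words1 + words2)
    expected2 = words2 * (freq1 + freq2) / (words1 + words2)
    g2 = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:
            g2 += observed * log(observed / expected)
    return 2 * g2

def per_10k(freq, words):
    """Normalized frequency per 10,000 words of running text."""
    return freq / words * 10_000

# must: 38 tokens in ca. 37,200 words (1958-60) vs. 195 tokens in ca. 421,300 words (1990-92)
print(round(log_likelihood(38, 37_200, 195, 421_300), 2))              # roughly 16.6, cf. table 3
print(round(per_10k(38, 37_200), 2), round(per_10k(195, 421_300), 2))  # roughly 10.2 and 4.6

A value above 3.84 (or 6.63) then corresponds to significance at p < 0.05 (or p < 0.01), as stated in the note to table 3.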

Much of what emerges from these spoken data is familiar from the study of
contemporaneous written English: the dominant position of have to among
the present-day exponents of obligation and necessity, the decline of must,
the marginal status of need in auxiliary syntax and the phenomenal spread
of main-verb need to in modal functions. Note, for example, that in the
span of a little more than 30 years the normalized frequency of must drops
from around 10 instances per 10,000 words to a mere 5, thus leading to its
displacement as the most frequent exponent of obligation and necessity. By
the early 1990s this position has been clearly ceded to have to. Note further
that main-verb need to, which barely figured in the late 1950s data, has
firmly established itself 30 years later.
However, as table 4 shows, normalized frequencies (per 10,000 words
of running text in this case) and, more importantly, relative rank of the in-
vestigated forms still differ considerably across speech and writing.

Table 4. Modals and semi-modals of obligation and necessity in their order of
         precedence in speech and writing (Smith 2003: 248; Klein 2007)

Rank    LOB                     F-LOB                   DCPSE 1958-60           DCPSE 1990-92
        n/10,000                n/10,000                n/10,000                n/10,000
1       must            11.41   HAVE to          8.17   must            10.21   HAVE to         13.17
2       HAVE to          7.53   must             8.07   HAVE to          9.13   must             4.63
3       (HAVE) got to    4.11   NEED to          1.96   (HAVE) got to    6.45   (HAVE) got to    4.39
4       NEED to          0.54   (HAVE) got to    0.27   NEED to          0.27   NEED to          2.75
Total                   23.59                   18.47                   26.06                   24.94

The decline of must is less pronounced in writing than in speech - as would


be expected for changes originating in the spoken language. By contrast,
the drop in the frequency of (have) got to is sharper in writing than in
speech. Growing reluctance to use this form in writing may be due to two
factors. First, it has an informal stylistic flavour, and secondly it is one of
the very few clear syntactic Briticisms. Its near elimination from written
British English may thus be a sign of a trend towards greater homogenisa-
tion of formal and written language use in an age of globalization - an
analysis which is consistent with the oft-proved sociolinguistic dichotomy
of Schreibeinheit vs. Sprechvielfalt (Besch 2003; Mair 2007), roughly to be
translated as "unity in writing" vs. "diversity in speech".

4. Autonomous change in writing: information compression in the noun phrase

Numerous corpus-based studies (e.g. Raab-Fischer 1995; Hinrichs and


Szmrecsanyi 2007) have provided overwhelming evidence to show that the
absolute frequency of s-genitives has increased in the recent past in written
English corpora. What remains controversial is the question whether the
observed statistical increase is due to more common occurrence in the tradi-
tional range of uses (the point of view defended in Mair 2006), or whether
it is partly the result of an additional trend towards greater use of the s-
genitive with inanimate nouns (cf. Rosenbach 2002: 128-176). The details
of this controversy need not preoccupy us here; the major point relevant to

the present discussion is that, as will be shown, all and any changes ob-
served in genitive usage seem to be confined to writing (or writing-related
formal genres of speech such as broadcast news). This emerges in striking
clarity from a comparison of genitive usage in the Brown family and in the
DCPSE. For ease of comparison, DCPSE figures have been normalized as
"N per million words", with absolute frequencies given in brackets:6

Table 5. S-genitives in selected spoken and written corpora

                                              "1930s"   "1960s"      "1990s"
DCPSE (spoken British English)                n.a.      2037 (861)   1786 (775)
B-LOB, LOB & F-LOB (written British English)  4625      4962         6194
Brown & Frown (written American English)      n.a.      5063         7145

The table shows that genitives in spoken language are consistently less
frequent than in writing in both periods compared, which is an expected
spin-off from the general fact that noun phrases in spontaneous speech tend
to be much shorter and less structurally complex than in writing. More in-
teresting, though, is the fact that while nothing happens diachronically in
speech (with the frequency of genitives hovering around the 2,000 in-
stances per million word mark), there are steep increases in the written
corpora, which in the thirty-year interval of observation even document the
emergence of a significant regional difference between American and Brit-
ish English.7 In other words, on the basis of the recent diachrony of the s-
genitive (and a number of related noun-phrase modification structures),8 we
can make the point that written language has had a separate and autono-
mous history from spoken English in the recent past. This history is appar-
ently a complex one as the observed development manifests itself to differ-
ent extents in the major regional varieties. How this partial diachronic
autonomy of writing can be modelled theoretically is a question which we
shall return to in the following section.

5. Conclusion: theoretical issues

Even in the case of English, a language endowed with a fantastic corpus-


linguistic working environment, the real-time corpus-based study of on-
going grammatical change in the spoken language is a fascinatingly novel

perspective. It was opened up only a very short while ago with the publica-
tion of the DCPSE and is currently still restricted to the study of one single
variety, British Standard English. As I hope to have shown in the present
contribution, it is definitely worth exploring.
Change which proceeds simultaneously in speech and writing is possi-
ble but rare. In the present study, it was exemplified by the spread of un-
marked infinitives at the expense of to-infinitives in specificational clefts.
The more common case by far is change which proceeds broadly along
parallel lines, but at differential speed in speech and writing. This was illus-
trated in the present study by some ongoing developments involving modal
expressions of obligation and necessity. The recent fate of have got to in
British English, for example, shows very clearly that local British usage
may well persist in speech while it is levelled away in writing as a result of
the homogenizing influences exerted by globalized communication. Con-
versely, must, which decreases both in speech and writing, does so at a
slower rate in the latter.
The potential for autonomous developments in speech and writing was
shown by finite-clause clefts (the All I did was I asked type) and s-genitives
respectively. Of course, the fact that there are developments in speech
which do not make it into writing (and vice versa) does not mean that there
are two separate grammars for spoken and written English. Genuine struc-
tural changes, for example the grammaticalization of modal expressions,
usually arise in conversation and are eventually taken up in writing - very
soon, if the new form does not develop any sociolinguistic connotations of
informality and non-standardness, and with a time lag, if such connotations
emerge and the relevant forms are therefore made the object of prescriptive
concerns. What leads to autonomous grammatical developments is the dif-
ferent discourse uses to which a shared grammatical system may be put in
speech and writing.
Spoken language is time-bound and dialogic in a way that formal edited
writing cannot be. On the whole, spoken dialogue is, of course, as gram-
matical as any written text, but this does not mean that the grammatical-
structural integrity of any given utterance unit is safe to the same extent in
spontaneous speech as that of the typical written sentence. Structurally
complete grammatical units are the overwhelming norm in writing but
much more easily given up in the complex trade-offs between grammatical
correctness, information distribution and rhetorical-emotional effects which
characterize the online production of speech. This is witnessed by "disloca-
tion" patterns such as that kind of people, I really love them or - in the con-

text of the present study - the finite "echo clause" subtype of specifica-
tional clefts (All I did was I asked).9 This structure shows a sufficient de-
gree of conventionalisation to consider it a grammatical construction. How-
ever, it is not a grammatical construction which is likely to spread into writ-
ing because the subordinate part of the cleft construction is not properly
embedded syntactically.
Conversely, compression of information as it is achieved by expanding
noun heads by modifiers such as genitives, prepositional phrases or attribu-
tively used nouns is not a high priority in spontaneous speech. However, it
is a central functional determinant of language use in most written genres.
More than ever before in history, writers of English today are having to
cope with masses of information, which will give a tremendous boost to
almost any structurally economical compression device in the noun phrase,
as has been shown for the s-gemtive in the present study.
Thus, even if spoken and written English share the same grammar, as
soon as we move to the discourse level and study language history as a
history of genres or as the history of changing traditions of speaking and
writing, it makes sense to write a separate history of the written language in
the Late Modern period. This history will document the linguistic coping
strategies which writers have been forced to develop to come to terms with
the increasing bureaucratization of our daily lives, the complexities intro-
duced by the omnipresence of science and technology in the everyday
sphere and the general "information explosion" brought about by the me-
dia.
Above and beyond all this, however, close attention to the spoken lan-
guage in diachromc linguistics is salutary for a more general reason. It
keeps challenging us to question and re-define our descriptive categories.
As was shown in the case of specificational clefts, the variable and its vari-
ants were easy to define in the analysis of the written language, and diffi-
culties of classification of individual corpus examples were rare. This was
entirely different in the spoken material, where we were constantly faced
with the task of deciding which of the many instances of discourse-
pragmatic focusing which contain chunks such as what X did was or all X
did was represented a token of the grammatical construction "specifica-
tional cleft sentence" whose history we had set about to study. Grammar
thus "emerges" in psychological time in spontaneous discourse long before
it develops as a structured system of choices in historical time.

Notes

1 That is the familiar array of the Brown Corpus (American English, 1961), its
British counterpart LOB (1961), their Freiburg updates (F-LOB, British Eng-
lish 1991; Frown, American English 1992) and - not completed until recently
- B-LOB ("before LOB"), a matching corpus illustrating early 1930s British
English. I am grateful to Geoff Leech, Lancaster, and Nick Smith, Salford, for
allowing me access to this latter corpus, which is not as yet publicly available.
2 Note that this example has the speaker correcting an unmarked infinitive into
an -ing form.
3 This admittedly unsophisticated strategy secures relatively high precision and
even higher recall, although of course a very small number of instances with
material intervening between be and do, such as All I did to him was to
criticise, will be missed.
4 In particular, the following two issues are in need of clarification, on the basis
of much larger corpora than the Brown family: (1) Are there -ing-
complements without the preceding trigger (type All I did was asking), and
(2) are there unmarked infinitival complements following a preceding pro-
gressive (type All I was doing was ask)? The one instance found of the latter,
quoted as (3) above, shows instant self-correction by the speaker.
5 And this research was duly carried out by Barbara Klein in an MA thesis
(Klein 2007). The author wishes to thank Ms. Klein for her meticulous work
in one of the first DCPSE-based studies undertaken.
6 The DCPSE consists of matching components of London-Lund (1958-1977)
and ICE-GB (1990-1992) material, totalling ca. 855,000 words.
7 Judging from the B-LOB data, it also seems that the trend picked up speed in
the second half of the twentieth century in British English. Pending the
completion of a "pre-Brown" corpus of 1930s written American English, it is,
however, difficult to determine the precise significance of the B-LOB fin-
dings.
8 Chiefly, these are nouns used in attribute function, for which similarly drastic
increases have been noted in Biber (2003), for example. See also Biber (1988)
and (1989). Indeed, in terms of information density, a noun phrase such as
Clinton Administration disarmament initiative could be regarded as an even
more compressed textual variant of the Clinton Administration's disarmament
initiative, which in turn is a compressed form of the disarmament initiative of
the Clinton Administration. Raab-Fischer (1995) was the first to use corpus
analysis to prove that the increase in genitives went hand in hand with a de-
crease in of-phrases post-modifying nominal heads. Her data was the then a-
vailable untagged press sections of LOB and F-LOB. Analysis of the POS-
tagged complete versions of B-LOB, LOB and F-LOB shows that her provi-
sional claims have stood the test of time quite well. Of-phrases decrease from

31,254 (B-LOB) through 28,134 (LOB) to 27,115 (F-LOB). Like genitives,


noun+noun sequences, or more precisely: noun+common noun (= tag se-
quence N* NN*) sequences, increase - from 17,023 in B-LOB through
21,393 in LOB to 25,774 in F-LOB.
9 Here, additional evidence is provided by the "emergent" structures briefly
illustrated in example (5) above, which were excluded from consideration as
they would have distorted the statistical comparison between speech and writ-
ing.

References

Besch, Werner
2003 Schrifteinheit - Sprechvielfalt: Zur Diskussion um die nationalen
Varianten der deutschen Standardsprache. In Deutsche Sprache im
Wandel: Kleinere Schriften zur Sprachgeschichte, Werner Besch
(ed.), 295-308. Frankfurt: Lang.
Biber, Douglas
1988 Variation Across Speech and Writing. Cambridge: Cambridge Uni-
versity Press.
Biber, Douglas
2003 Compressed noun-phrase structures in newspaper discourse: The
competing demands of popularization vs. economy. In New Media
Language, Jean Aitchison and Diana M. Lewis (eds.), 169-181. Lon-
don: Routledge.
Biber, Douglas and Edward Finegan
1989 Drift and evolution of English style: A history of three genres. Lan-
guage 65: 487-517.
Biber, Douglas, Stig Johansson, Geoffrey Leech, Susan Conrad and Edward Fine-
gan
1999 Longman Grammar of Spoken and Written English. Harlow: Long-
man.
Hinrichs, Lars and Benedikt Szmrecsanyi
2007 Recent changes in the function and frequency of standard English
genitive constructions: A multivariate analysis of tagged corpora.
English Language and Linguistics 11: 437-474.
Hoffmann, Sebastian
2005 Grammaticalization and English Complex Prepositions: A Corpus-
Based Study. London: Routledge.
Hopper, Paul
1998 Emergent grammar. In The New Psychology of Language, Michael
Tomasello (ed.), 155-175. Mahwah, NJ: Lawrence Erlbaum.

Hopper, Paul
2001 Grammatical constructions and their discourse origins: Prototype or
family resemblance? In Applied Cognitive Linguistics I: Theory and
Language Acquisition, Martin Ptitz and Susanne Niemeier (eds.),
109-129. Berlin: Mouton de Gruyter.
Hopper, Paul
2004 The openness of grammatical constructions. Chicago Linguistic
Society 40: 239-256.
Klein, Barbara
2007 Ongoing morpho-syntactic changes in spoken British English: A
study based on the DCPSE. Unpublished Master's thesis. University
of Freiburg.
Krug, Manfred
2000 Emerging English Modals: A Corpus-Based Study of Grammaticali-
zation. Berlin/New York: Mouton de Gruyter.
Leech, Geoffrey
2003 Modality on the move: The English modal auxiliaries 1961-1992. In
Modality in Contemporary English, Roberta Facchinetti, Manfred
Krug and Frank R. Palmer (eds.), 223-240. Berlin: Mouton de
Gruyter.
Leech, Geoffrey, Marianne Hundt, Christian Mair and Nicholas Smith
2009 Change in Contemporary English: A Grammatical Study. Cam-
bridge: Cambridge University Press.
Mair, Christian
2006 Inflected genitives are spreading in present-day English, but not
necessarily to inanimate nouns. In Corpora and the History of Eng-
lish: Festschrift for Manfred Markus on the Occasion of his 65th
Birthday, Christian Mair and Reinhard Heuberger (eds.), 243-256.
Heidelberg: Winter.
Mair, Christian
2007 British English/American English grammar: Convergence in writing,
divergence in speech. Anglia 125: 84-100.
Mair, Christian and Geoffrey Leech
2006 Current changes. In The Handbook of English Linguistics, Bas Aarts
and April McMahon (eds.), 318-342. Oxford: Blackwell.
Raab-Fischer, Roswitha
1995 Löst der Genitiv die of-Phrase ab? Eine korpusgestützte Studie zum
Sprachwandel im heutigen Englisch. Zeitschrift für Anglistik und
Amerikanistik 43: 123-132.
Rohdenburg, Günter
2000 The complexity principle as a factor determining grammatical varia-
tion and change in English. In Language Use, Language Acquisition

and Language History: (Mostly) Empirical Studies in Honour of


Rüdiger Zimmermann, Ingo Plag and Klaus Peter Schneider (eds.),
25-44. Trier: WVT.
Rosenbach, Anette
2002 Genitive Variation in English: Conceptual Factors in Synchronic
and Diachronic Studies. Berlin: Mouton de Gruyter.
Sinclair, John McH.
1972 A Course in Spoken English Grammar. Oxford: Oxford University
Press.
Sinclair, John McH. and Richard M. Coulthard
1975 Towards an Analysis of Discourse: The English Used by Teachers
and Pupils. Oxford: Oxford University Press.
Smith, Nicholas
2003 Changes in the modals and semi-modals of strong obligation and
epistemic necessity in recent British English. In Modality in Con-
temporary English, Roberta Facchinetti, Manfred Krug and Frank R.
Palmer (eds.), 241-266. Berlin: Mouton de Gruyter.
Traugott, Elizabeth
2008 'All that he endeavoured to prove was ...': On the emergence of
grammatical constructions in dialogic contexts. In Language in Flux:
Dialogue Coordination, Language Variation, Change and Evolution,
Robin Cooper and Ruth Kempson (eds.), 143-177. London: Kings
College Publications.

Corpora

BLOB The BLOB-1931 Corpus (previously called the Lancaster-1931 or


B[efore]-LOB Corpus). 2006. Compiled by Geoffrey Leech and Paul
Rayson, University of Lancaster.
BROWN A Standard Corpus of Present-Day Edited American English, for use
with Digital Computers (Brown). 1964, 1971, 1979. Compiled by W.
N. Francis and H. Kucera. Brown University. Providence, Rhode Is-
land.
DCPSE The Diachronic Corpus of Present-Day Spoken English. 2006. Com-
piled by Bas Aarts, Survey of English Usage, University College
London.
FROWN The Freiburg-Brown Corpus ('Frown'), original version. Compiled
by Christian Mair, Albert-Ludwigs-Universität Freiburg.
LOB The LOB Corpus, original version. 1970-1978. Compiled by Geof-
frey Leech, Lancaster University, Stig Johansson, University of Oslo
(project leaders) and Knut Hofland, University of Bergen (head of
computing).

FLOB The Freiburg-LOB Corpus ('F-LOB'), original version. 1996. Com-


piled by Christian Mair, Albert-Ludwigs-Universität Freiburg.
Prefabs in spoken English

Brigitta Mittmann

1. Introduction

This article discusses the method of using parallel corpora from what are
arguably the two most important regional varieties of English - American
English and British English - for finding prefabricated or formulaic word
combinations typical of the spoken language. It sheds further light on the
nature, shape and characteristics of the most frequent word combinations
found in spoken English as well as the large extent to which the language
can be said to consist of recurrent elements. It thus provides strong evi-
dence supporting John Sinclair's idiom principle (1991: 110-115). The
article is in parts an English synopsis of research that was published in
German in monograph form in Mittmann (2004) and introduces several
new aspects of this research hitherto not published in English. This research
is highly relevant in connection with several issues discussed elsewhere in
this volume.

2. A study of British-American prefab differences

2.1. Material

The study is based upon two corpora which both aim to be representative of
natural every-day spoken English: for British English, the spoken demo-
graphic part of the British National Corpus (BNCSD) and for American
English, the Longman Spoken American Corpus (LSAC). Both corpora
contain recordings of people from all age groups, both sexes and from a
variety of social and regional backgrounds of the two countries. The size of
the two corpora is similar: the BNCSD contains about 3.9 million and the
LSAC about 4.9 million words of running text.
Despite some minor differences, these corpora are similar enough to be
compared as parallel corpora. When the research was carried out, they were
among the largest corpora of spoken English available for this purpose.
Nonetheless, frequency-based and comparative studies of word combina-

tions demand a certain minimum of occurrence. This meant that it was ne-
cessary to concentrate upon the most frequent items since otherwise the
figures become less reliable.

2.2. Method

For determining the most frequent word combinations in the BNCSD and
the LSAC, a series of programs was used which had been specially written
for this purpose by Florian Klampfl. These programs are able to extract n-
grams or clusters (i.e. combinations of two, three, four, or more words)
from a text and count their frequency of occurrence. For example, the sen-
tence Can I have a look at this? contains the following trigrams: CAN I
HAVE - I HAVE A - HAVE A LOOK - A LOOK AT and LOOK AT THIS
(each with a raw frequency of one). The idea to use n-grams - and the term
cluster - was inspired by the WordSmith concordancing package (Scott
1999).
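
Since the original programs are not publicly documented, the extraction step can only be sketched here; the following illustrative Python code reproduces the trigram example just given and makes no claim to be Klampfl's implementation.

from collections import Counter

def extract_ngrams(tokens, n):
    """Return a Counter of all n-grams (clusters) of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# The example sentence from above, tokenized very crudely:
tokens = "can i have a look at this".split()
for ngram, freq in extract_ngrams(tokens, 3).items():
    print(" ".join(ngram).upper(), freq)
# prints CAN I HAVE, I HAVE A, HAVE A LOOK, A LOOK AT and LOOK AT THIS,
# each with a raw frequency of one

Applied to a whole corpus rather than a single sentence, the same counting procedure yields the cluster frequency lists referred to below.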
With the help of the χ²-test a list was created which sorted the clusters
from those with the most significant differences between the two corpora to
those with the greatest similarity. In addition to this, a threshold was intro-
duced to restrict the output to those items where the evidence was strongest.
This minimum frequency is best given in normalized figures (as parts per
million or ppm), as the LSAC is somewhat larger than the BNCSD. It was
at least 12.5 ppm in at least one of the corpora (this corresponds to around
49 occurrences in the BNCSD and about 61 occurrences in the LSAC).
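
Both the significance test and the frequency threshold are easy to make explicit. The sketch below is again merely illustrative (the cluster counts are invented, and the original implementation may have differed in detail): it computes the χ²-value for a single cluster from a 2x2 table of cluster tokens versus remaining words in the two corpora and applies the 12.5 ppm threshold.

def chi_square(freq_a, size_a, freq_b, size_b):
    """Pearson chi-square for a 2x2 table: cluster tokens vs. other words in two corpora."""
    observed = [[freq_a, size_a - freq_a], [freq_b, size_b - freq_b]]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2

def ppm(freq, size):
    """Normalized frequency in parts per million."""
    return freq / size * 1_000_000

BNCSD_SIZE, LSAC_SIZE = 3_900_000, 4_900_000   # approximate corpus sizes
freq_bncsd, freq_lsac = 150, 60                # invented counts for one imaginary cluster
if max(ppm(freq_bncsd, BNCSD_SIZE), ppm(freq_lsac, LSAC_SIZE)) >= 12.5:
    print(round(chi_square(freq_bncsd, BNCSD_SIZE, freq_lsac, LSAC_SIZE), 2))

Sorting all clusters by this value, from the largest difference downwards, yields the kind of list described above.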

2.3. Types of clusters/prefabs found

The cluster lists contained ample evidence of word combination differences


between the two varieties. A part of them had previously been recorded
elsewhere as being more typical of either American or British English, but
a considerable number appeared to be new.
Most notably, the material contained many conversational routines or
parts of them. They are of very different kinds, ranging from greeting for-
mulas, such as how are you? or how are you doing? (esp. LSAC) to multi-
functional expressions like here you are (esp. BNCSD) or here you go (esp.
LSAC), complex discourse markers such as mind you or the trouble is (both
esp. BNCSD), hedges like kind of (esp. LSAC) and sort of (esp. BNCSD),
general extenders (cf. Overstreet 1999) such as and stuff (like that) and shit

(like that) (both esp. LSAC), or multi-word expletives like bloody hell, oh
dear (esp. BNCSD) or (oh) my gosh, oh boy, oh man, oh wow (esp. LSAC).
Apart from conversational routines, there was a wide range of other
types of word combinations to be found. Most of the material is unlikely to
fall under the heading of 'classical idioms', but nonetheless a substantial
part of it can be seen as idiomatic in the sense that their constituents are
either semantically or syntactically odd. Borrowing Fillmore, Kay and
O'Connor's (1988: 508) expression, one could say that they are mostly
"familiar pieces unfamiliarly arranged". Amongst these pre-assembled
linguistic building blocks, there are idioms such as be on about (sth), have
a go (both esp. British English),2 but also expressions such as be like, a
reporting construction used predominantly by younger American speakers
as it is exemplified in the following stretch of conversation:
(1) <2159> Call him up and he gets kind of snippy with me on the phone. Well
he's sending in mixed messages anyway but uh I called him <nv_clears
throat> and he's snippy and he's like no I can't go. And I'm like fine that's all
I need to know. And so let him go. Tuesday he comes in and he's like
<mimicking>hi how are you? </mimicking> and everything and I'm just
like comes up and talks to me like twice and he's like <nv sigh> you don't
believe me do you? I'm like no. (LSAC 150701)
Further types of frequent word combinations include phrasal and preposi-
tional verbs - e.g. British get on with, go out with; American go ahead (and
...), figure out, work out ('exercise') - and certain quantifiers such as a bit
of (esp. BNCSD), a little of (esp. LSAC) or a load of (BNCSD). In the cor-
pora, there were also instances showing that individual words can have
quite different collocational properties in different varieties. For example, a
lot tends to co-occur much more often with quite and not in British English
than it does in American English.

2.4. The'fuzzy edges'of phraseology

A large number of clusters points to what one might call the 'fuzzy edges'
of traditional phraseology. On the borderline between phraseology and
syntax there are, for example, tag questions,3 periphrastic constructions
such as the present perfect (which is used more frequently in British Eng-
lish), semi-modals such as have got (esp. BNCSD) and going to/gonna
(esp. LSAC) or valency phenomena such as the fact that in spoken Ameri-

can English feel/look/seem take complements introduced by like much more


often than they do in British English.
Recent research in several linguistic paradigms, most notably perhaps in
the context of construction grammar, has tended to bridge the traditional
divide between grammar and lexis (see e.g. Römer and Schulze 2009) and
also emphasizes the continuum between phraseology and word formation
(e.g. Granger and Paquot 2008). Here, again, the comparison of BNCSD
and LSAC provided a number of relevant combinations such as adjectival
ones consisting of participle and particle (e.g. fed up with, screwed up in a
whole load of screwed up ones, cf. BNCSD file ked), complex prepositions
(apart from, in terms of), complex conjunctions (as if, as though), complex
adverbs (of course, as well, at all), or complex pronouns (you guys,
y'all/you all).
A type of prefabricated expression that is typically ignored in studies of
formulaic word combinations is that of time adverbials - or parts of time
adverbials such as British at the moment, in a minute or American right
now, at this point, (every once) in a while, the whole time. In the BNCSD
the time of day is usually given with the help of expressions such as half
past, quarter to, (quarter) of an hour, while the LSAC contains more com-
binations such as six thirty. The reason why these sequences are mostly
ignored by researchers is likely to be that in many cases they appear to have
been generated according to simple semantic and syntactic rules. A similar
problem exists with respect to frequent responses. As will be discussed
below, however, all of these combinations are of great significance in that
they are typical ways of expressing the relevant concepts and are preferred
over other expressions which might have been used instead.

2.5. Evaluation

In sum, studying phraseological differences between varieties of spoken


English with the help of clusters has proved very successful. It covered
most of the rather scattered and often unsystematic descriptions of British-
American word combination differences previously found in the literature,
but brought a large number of new phenomena to light which had hitherto
mainly - if at all - been mentioned in dictionaries. It goes without saying
that certain kinds of word combinations are not caught in the net of this
procedure. One example of this is collocations of the type identified by
Hausmann (1985, 1989), which may be more than just a few words apart;
i.e. combinations of lexical words such as schütteres Haar, 'thin hair', as in

Hausmann's example Das Haar ist nicht nur bei alten Menschen sondern
auch bei relativ jungen Menschen bereits recht häufig schütter (Literally:
'The hair not just of older people but also of relatively young ones is quite
often thin already.') (1985: 127). However, this phenomenon tends to be
rare in comparison with the large amount and wide variety
of other material which can be collected.
The approach chosen is largely data-driven and casts the net wide with-
out either restricting or anticipating results. It proved very useful for what
was effectively a pilot study, as there had not been any systematic treatment
of such a wide variety of word combination differences between spoken
American and British English.
In 2006, John Algeo published a book on British word and grammar
patterns and the ways in which they differ from American English. His
focus and methods are different from the ones reported upon here and he
based his research upon other data (including different corpora). Nonethe-
less, there is some overlap and in these areas, his findings generally cor-
roborate those from Mittmann (2004).
Another recently finished study which has some connection with the
present research is the project of Anne Grimm (2008) in which she studied
differences between the speech of men and women, including amongst
other things the use of hedging expressions and expletives. This project is
also based upon the BNCSD and the LSAC and observes differences be-
tween the regional varieties. Again, the results from Mittmann (2004) are
generally confirmed, while Grimm differentiates more finely between the
statistics for different groups of speakers for the items that she focuses on.
However, in these - and other - works a number of theoretical issues
had to be left undiscussed and it is some of these points that will be ex-
plored in the next sections.

3. Theoretical implications

3.1. The role of pragmatic equivalence

In comparing two corpora, the problem arises what the basis for the com-
parison (or the tertium comparationis) should be. If one combination of
words occurs, for example, five times as frequently in one corpus as it does
in the other one, then this may be interesting, but it leaves open the ques-
tion how the speakers in the other corpus would express the same concept
or pragmatic function instead. Therefore it is highly relevant to look for

what one might call "synonymous" word combinations - and take this to
include combinations with the same pragmatic function. Sometimes such
groups of expressions with the same function or meaning can be found
relatively easily, as in figure 1 (taken from Mittmann 2005), which gives a
number of comment clauses which have the added advantage of having
similar or identical structures:

[Figure 1: bar chart comparing the relative shares of the comment clauses I reckon, I should
think, I expect, I suppose, I think, I believe, I figure and I guess in the LSAC and the BNCSD]

Figure 1.

However, finding such neat groups can be difficult and similar surface
structures do not guarantee functional equivalence. For example, it has
been pointed out in the literature on British-American differences (Benson,
Benson and Ilson 1986: 20) that in a number of support verb constructions
such as take a bath vs. have a bath, American English tends to use take,
while British speakers typically use have. While there is no reason to doubt
this contrast between take and have in support verb constructions in gen-
eral, a very different situation obtains with respect to certain specific uses
in conversation. It is remarkable that while HAVE a look does indeed ap-
pear quite frequently in the BNCSD, this is not true of take a look in the
LSAC. Instead, expressions such as let's see or let me see appear to be
used. Moreover, both let me see and let's see as well as let me have a
look and let's have a look are often used synonymously, as can be seen
from the following extract from a conversation between a childminder and
a child:

(2) ... you've got a filthy nose. Let's have a look. (BNCSD, kb8)
Quite feasibly, a German speaker might use a very different construction
such as Zeig mal (her) - which consists of an imperative (zeig, 'show') +
pragmatic particle (mal) + adverb (her; here 'to me') - in similar situations.
Pragmatic equivalence is therefore context-dependent and comparisons
between varieties can be made at very different levels of generality. This is
also a problem for anyone studying what Herbst and Klotz (2003: 145-149)
have called probabemes, i.e. likely linguistic realizations of concepts. If
one opts for the level of pragmatic function, then very general functions
such as expressing indirectness are very difficult to take into account, as
they can be realised by such a variety of linguistic means, from modal
verbs and multi-word hedging expressions to the use of questions rather
than statements in English, or the use of certain pragmatic particles (e.g.
vielleicht) in German. Any statement about whether, for example, the
speakers of one group are more or less indirect than those of another will
have to take all those features into account. And while it appears to be true,
for example, that British speakers use certain types of modal verb more
frequently than their American counterparts (Mittmann 2004: 101-106),
there are a number of speakers in the LSAC who use many more hedging
expressions such as like (as in And that girl's going to be like so spoiled,
LSAC 130801, or it's like really important, LSAC 161902).

3.2. Variety differences can be used to identify prefabs

For a number of word combinations, it was the comparison of two parallel


corpora from regional varieties which was vital in drawing attention to their
fixedness or what Alison Wray (2002: 5) would call their formulaicity. This
is particularly interesting in those cases where the cluster frequencies show
different tendencies to the frequencies of certain single words. For exam-
ple, the verb WANT (in all its inflectional forms) occurs more frequently in
the American corpus, while the question Do you want...? can be found
more frequently in the British one. Vice versa, the words oh, well and sorry
can be found more often in the British corpus while the clusters oh my
gosh, oh man, well thank you and I'm sorry are more typical for the Ameri-
can texts.
On the other hand, certain clusters show what may be called a micro-
grammar in that certain grammatical phenomena which are otherwise typi-
cal for a variety do not apply to them. For example, the semi-modal have

got is typical for spoken British English (see above), but does not normally
occur together with no idea. Both the forms I've no idea and No idea are
much more frequent than I've got no idea.
This also means that the external form of prefabs can be crucial, as there
may be small, but established differences between varieties. They can relate
to the use of words belonging to 'minor' word classes or to conventional
ellipses. For example, the use of articles can differ between the two varie-
ties, as with get a hold of (something), an expression which is more typical
of the American corpus, versus get hold of (something), which is its British
counterpart. Sometimes, the meaning of a phrase depends crucially on the
presence of the article, as in the combinations the odd [+ noun] or a right [+
noun] (both esp. BNCSD). In this use, odd typically has the meaning 'occa-
sional', as in We are now in a country lane, looking out on the odd passing
car (BNCSD, ksv), whereas right is used as an intensifier for nouns denot-
ing disagreeable or bad situations, personal qualities or behaviour, as in
There was a right panic in our house (BNCSD, kcl). However, as seen
above with get (a) hold of (something), there does not have to be any such
change of meaning.
In other cases, interjections are a characteristic part of certain expres-
sions. For example, in both corpora in around 80 % of all cases my god is
preceded by oh. Again, there are a number of clusters containing interjec-
tions which are far more typical of one of the two varieties. Examples for
this are well thank you, no I won't or yes you can. Often these are re-
sponses, which will be discussed again below.
A further formal characteristic of certain formulaic sequences is that
they are frequently elliptical, such as No idea, which appears on its own in
almost half the cases in the BNCSD, Doesn't matter (one third of cases
without subject), or Course you can (more than two thirds of cases). All of
them are often used as responses, as in the following examples:
(3a) <PS07D> Oh ah! Can I take one?</u>
<PS079> Course you can. I'll take two. (...) </u> (BNCSD, kbs)
(3b) <PS1AD> (...) Can I use your phone?</u>
<PS1A9> Yeah, course <ptr t=KBCLC085>you can.<ptr t=KBCLC086>
</u> (BNCSD, kbc)

3.3. Variety differences show the extent to which language is prefabricated

The fixedness of some word combinations may seem debatable, as they


simply appear to be put together according to syntactic and semantic rules.
Amongst these, there are recurrent responses such as No, it isn't or Yes, it
is. It may seem at first sight that these are ordinary, unspectacular se-
quences, but in fact the difference between the corpora is highly significant
here. These expressions are highly context-dependent and grammatically
elliptical. A number of responses contain interjections:
no, I won't; no, I don't; no, it's not; no it isn't; yes it is; yes/yeah, you can;
yes/yeah I know; yes, please; oh alright; oh I see (esp. BNCSD)
oh okay (esp. LSAC)
However, there are also many combinations without interjections which are
typically used in response to another speaker's turn. The following clusters
appear directly after a speaker change in more than 50 % of all cases:
I don't mind; never mind; it's up to you; it's alright/all right; that's
alright/all right; that's/that is right; that'll do/that will do; that's it; that's
not bad; that would be ...; that's a good ...; don't be silly.
In some cases such as Course you can or Never mind, these responses tend
to be conventionally elliptic, while in others such as That's it they are syn-
tactically well-formed but highly context-dependent, with an anaphoric
reference item such as that.
There are also responding questions such as the following ones:
Why's that?; Why not? (esp. BNCSD)
How come?; Like what? (esp. LSAC)
Again, they are either elliptical or contain anaphoric reference items.
It is notable that many of the responses mentioned above appear much
less frequently in the American corpus than they do in the British one. Pre-
sumably, Americans tend to respond differently, for example using single
words such as Sure. In her above-mentioned recent detailed empirical study
of similarities and differences between the language of women and men,
Grimm (2008: 301) found that the American speakers generally used more
minimal responses than their British counterparts, which would seem to
confirm this hypothesis.
Arguably, if speakers of different varieties typically use different word
combinations for verbalizing the same concepts, there is an indication that
the expressions they use are - at least to some extent - formulaic and re-

trieved from memory. This means that even highly frequent utterances such
as No, it isn't or Yes, it is, which seem banal in that they are fully analyz-
able and can be constructed following the grammatical rules of the lan-
guage, can be regarded as prefabricated, which should put them at the cen-
tre of any theory of language. Authors such as Wray have argued persua-
sively in favour of seeing prefabricated word combinations (or, as she puts
it, formulaic sequences) as central to linguistic processing (2002: 261),
although using varieties of a language as support for this position appears
to be an approach which had not actually been put into practice before
Mittmann (2004).

4. Scope for further research

As a consequence of the richness and great variety of the material found in


the clusters, some potentially interesting findings had to be left unexplored.
These issues might be relevant for further investigations into the interplay
between prefabs and grammatical, semantic and pragmatic rules in speech
production, which is why they will be outlined briefly in this section.

4.1. Chunk boundaries and'wedges'

One problem which has also been noted by other authors is that it is often
difficult to determine where the boundaries between chunks are. Many
chunks show what one might term 'crystallization', having a stable core
and more or less variable periphery. And while some chunks are compara-
tively easy to delimit, others are not. This is, for example, partly reflected
in Sinclair and Mauranen's distinction between O and M units (2006: 59).
The M units contain what is being talked about whereas the O units (e.g.
hedges, discourse markers and similar items) organize the discourse. The
latter tend to be particularly stable in their form.
On top of this, there are sometimes intriguing differences between the
varieties. For example, in the American corpus certain items, notably cer-
tain discourse markers such as you know or the negative particle not, can
interrupt verb phrases or noun phrases by squeezing in between their con-
stituents like a wedge. In examples (4.1) and (4.2) below, the wedge is
placed between the infinitive particle and the verb, in (4.3), it is between
the article and the premodifier, and in (4.4) between an adverbial and the
verb.

(4.1) <2396> Yeah, I don't like them either. <nv_laugh> No that's supposed to
be a good school. I'll just try to you know cheer along. Be supportive.
(LSAC, 155401)
(4.2) <2194> It's so much easier to not think, to have somebody else to do it for
you. (LSAC, 150102)
(4.3) <1510> Well and he was saying I was riding on the sidewalk which you can
do outside of the you know, downtown area. (LSAC, 125801)
(4.4) <2058> and uh, my brother, just you know cruises around on his ATV and
his snowmobile when it's snowmobile season (...) (LSAC, 144301)
In a similar manner, other items such as kind of or I think can function as
wedges in these positions. Apparently, there is a greater tendency in
American English to insert items just in front of the verb or between certain
other closely linked clause or phrase constituents. These places would ap-
pear to be where the speaker conventionally takes time for sentence plan-
ning and there may well be differences between the varieties - as indeed
there are between languages - in this respect. Anybody who has ever stud-
ied English films dubbed into German will probably agree that hesitation
phenomena (notably repetitions and pauses) are somewhat odd in compari-
son to non-scripted, everyday conversational German.

4.2. Differences in rhythm

The wedges also affect the rhythmic patterning of sentences in the Ameri-
can corpus. Further research should investigate the links between stress
(and, thus, rhythm) and intonation, pauses, hesitation phenomena and
'chunks'. Sometimes, interesting rhythmical patterns seem to appear in
other portions of the material. For example, many of the responses which
are overwhelmingly found in the BNCSD have a stress pattern of two un-
stressed syllables followed by a stressed syllable (in other words, an ana-
paest), as in I don't mind; yes you can; course you can, etc. In addition to
this, there appear to be differences in the use of contracted forms. For sim-
ple modal verbs, for example, the BNCSD has more contractions involving
the negative particle (e.g. can't, couldn't), whereas there is a stronger ten-
dency towards using the full form not (e.g. cannot, could not) in the LSAC.
The same applies to the use of 'll versus will. However, since the study of
such contractions depends crucially on transcription conventions, further
research in this field would need to include the audio files.
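
By way of illustration, the kind of count that such a comparison rests on can be sketched in a few lines of Python. The snippet below simply tallies contracted negatives against their full-form counterparts in a list of transcribed utterances; the utterances and the form pairs are invented, and it is only a rough stand-in for the corpus queries actually used, which would in addition have to respect the transcription conventions just mentioned.

    from collections import Counter

    # Hypothetical pairs of contracted and full negative forms.
    PAIRS = [("can't", "cannot"), ("couldn't", "could not"), ("won't", "will not")]

    def count_forms(utterances):
        # Tally each contracted and full form across the transcribed utterances
        # (a rough substring count; a real study would query the tagged corpus).
        text = " ".join(u.lower() for u in utterances)
        counts = Counter()
        for contracted, full in PAIRS:
            counts[contracted] = text.count(contracted)
            counts[full] = text.count(full)
        return counts

    sample = ["I can't do that now.", "No, you cannot.", "She couldn't say."]
    print(count_forms(sample))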
5. Conclusion

The project described in this article has shown that American and British
spoken English differ markedly in the word combinations which they typi-
cally use. These word combinations span a wide range of types - from
various kinds of routine formulae to frequently recurring responses. A few
formulaic sequences are grammatically or semantically odd, but many more
are neither of those, although they typically have a special pragmatic or
discourse-related function. Nonetheless, the fact that they are typical of one
variety of a language but not of another indicates that they are to some
extent formulaic.
Thus, the British-American differences reported on here provide further
proof for the fact that everyday language is to a great extent conventional-
ised. Idiomaticity (or formulaicity) pervades language. It consists largely of
recurring word combinations which are presumably stored in the speaker's
memory as entities. The comparison of parallel corpora offers compelling
evidence confirming Sinclair's idiom principle. In the words of Franz Josef
Hausmann, we can say that there is "total idiomaticity" (1993: 477).

Notes

1 The author is grateful to S. Faulhaber and K. Pike for their comments on an
earlier version of this article.
2 There is of course an overlap with some of the routine formulae such as the
pragmatic marker mind you - classified by Moon (1998: 80-81) as an "ill-
formed" fixed expression.
3 Tag questions show fossilization in that there are invariant forms such as
innit? in some varieties.

References

Algeo, John
2006 British or American English? A Handbook of Word and Grammar
Patterns. Cambridge: Cambridge University Press.
Benson, Morton, Evelyn Benson and Robert Ilson
1986 Lexicographic Description of English. Amsterdam: John Benjamins.
Fillmore, Charles J., Paul Kay and Mary C. O'Connor
1988 Regularity and idiomaticity in grammatical constructions. Language
64 (3): 501-538.
Granger, Sylviane and Magali Paquot
2008 Disentangling the phraseological web. In Phraseology: An Interdis-
ciplinary Perspective, Sylviane Granger and Fanny Meunier (eds.),
27-49. Amsterdam/Philadelphia: Benjamins.
Grimm, Anne
2008 "Männersprache" "Frauensprache"? Eine korpusgestützte em-
pirische Analyse des Sprachgebrauchs britischer und amerikani-
scher Frauen und Männer hinsichtlich Geschlechtsspezifika. Ham-
burg: Kovač.
Hausmann, Franz Josef
1985 Kollokationen im deutschen Wörterbuch: Ein Beitrag zur Theorie
des lexikographischen Beispiels. In Lexikographie und Grammatik
(Lexicographica Series Maior 3), Henning Bergenholtz and Joachim
Mugdan (eds.), 118-129. Tübingen: Niemeyer.
Hausmann, Franz Josef
1989 Le dictionnaire de collocations. In Wörterbücher, Dictionaries, Dic-
tionnaires, vol. 1, Franz Josef Hausmann, Oskar Reichmann, Herbert
Ernst Wiegand and Ladislav Zgusta (eds.), 1010-1019. Berlin/New
York: Walter de Gruyter.
Hausmann, Franz Josef
1993 Ist der deutsche Wortschatz lernbar? Oder: Wortschatz ist Chaos.
DaF 5: 471-485.
Herbst, Thomas and Michael Klotz
2003 Lexikographie. Paderborn: Schöningh.
Mittmann, Brigitta
2004 Mehrwort-Cluster in der englischen Alltagskonversation: Unter-
schiede zwischen britischem und amerikanischem gesprochenen
Englisch als Indikatoren für den präfabrizierten Charakter der Spra-
che. Tübingen: Gunter Narr.
Mittmann, Brigitta
2005 'I almost kind of thought well that must be more of like British Eng-
lish or something': Prefabs in amerikanischer und britischer Konver-
sation. In Linguistische Dimensionen des Fremdsprachenunterrichts,
Thomas Herbst (ed.): 125-134. Würzburg: Königshausen und Neu-
mann.
Moon, Rosamund
1998 Fixed Expressions and Idioms in English: A Corpus-Based Ap-
proach. Oxford: Oxford University Press.
Overstreet, Maryann
1999 Whales, Candlelight, and Stuff Like That: General Extenders in
English Discourse. New York/Oxford: Oxford University Press.
Römer, Ute and Rainer Schulze (eds.)
2009 Exploring the Lexis-Grammar Interface. Amsterdam/Philadelphia:
Benjamins.
Scott, Mike
1999 WordSmith Tools, version 3, Oxford: Oxford University Press.
Sinclair, John McH.
1991 Corpus, Concordance, Collocation. Oxford: Oxford University
Press.
Sinclair, John McH. and Anna Mauranen
2006 Linear Unit Grammar: Integrating Speech and Writing. Amster-
dam/Philadelphia: John Benjamins.
Wray, Alison
2002 Formulaic Language and the Lexicon. Cambridge: Cambridge Uni-
versity Press.

Corpora

BNC The British National Corpus. Distributed by Oxford University
Computing Services on behalf of the BNC Consortium.
http://www.natcorp.ox.ac.uk/.
LSAC The Longman Spoken American Corpus is copyright of Pearson
Education Limited and was created by Longman Dictionaries. It is
available for academic research purposes, and for further details see
http://www.longman-elt.com/dictionaries/.
Observations on the phraseology of academic writing

Ute Römer

1. Introduction

The past few years have seen an increasing interest in studies based on new
kinds of specialized corpora that capture an ever-growing range of text
types, especially from academic, political, business and medical discourse.
Now that more and larger collections of such specialized texts are becom-
ing available, many corpus researchers seem to switch from describing the
English language as a whole to the description of a number of different
language varieties and community discourses (see, for example, Biber
2006; Biber, Connor and Upton 2007; Bowker and Pearson 2002; Gavioli
2005; Hyland 2004; and the contributions in Connor and Upton 2004; and
in Römer and Schulze 2008).
This paper takes a neo-Firthian approach to academic writing and exam-
ines lexical-grammatical patterns in the discourse of linguistics. It is in
many ways a tribute to John Sinclair and his groundbreaking ideas on lan-
guage and corpus work. One of the things I learned from him is that, more
often than not, it makes sense to "go back" and see how early ideas on lan-
guage, its structure and use, relate to new developments in resources and
methodologies. So, in this paper, I go back to some concepts introduced
and/or used by John Sinclair and by John Rupert Firth, a core figure in
early British contextualism, who greatly influenced Sinclair's work. Con-
tinuing Sinclair's (1996: 75) "search for units of meaning" and using new-
generation corpus tools that enable us to explore corpora semi-
automatically (Collocate, Barlow 2004; ConcGram, Greaves 2005;
kfNgram, Fletcher 2002-2007), the aim of this paper is to uncover the phra-
seological profile of a particular sub-type of academic writing and to see
how meanings are created in a 3.5-million word corpus of linguistic book
reviews written in English, as compared to a larger corpus of a less special-
ized language.
After an explanation of the concept of "restricted language" and a dis-
cussion of ways in which meaningful units can be identified in corpora, the
paper will focus on a selection of common phraseological items in linguis-
tic book review language, and investigate how specific (or how "local")
these items are for the type of language under analysis and whether the
identified local patterns are connected to local, text-type specific meanings.
It will conclude with a few thoughts on "local grammars" and recommen-
dations for future research in phraseology and academic discourse.

2. Taking a neo-Firthian approach to academic writing

The context of the analysis reported on in this paper is a large-scale corpus
study of academic discourse. Central aims of the study are to investigate
how meanings (in particular evaluative meanings) are created in academic
writing in the discipline of linguistics, and to develop a local lexical gram-
mar of book review language. The approach taken in the larger-scale study
and described in the present paper is neo-Firthian in that it picks up some
central notions developed and used by Firth (and his pupil Sinclair) and
uses new software tools and techniques which lend themselves to investi-
gating these notions but which Firth did not have at his disposal. The no-
tions discussed here are "restricted language" (e.g. Firth [1956] 1968a),
"collocation" (e.g. Firth [1957] 1968c; Sinclair 1991), "unit of mean-
ing"/"meamng-shift unit" (Sinclair 1996, 2007 personal communication),
"lexical grammar" (e.g. Sinclair 2004) and "local grammar" (e.g. Hunston
and Sinclair 2000).

2.1. The discourse of linguistics as a "restricted language"

In the following, I will report on an analysis of a subset of the written Eng-
lish discourse among linguists regarded as a global community of practice.
This type of discourse, the discourse of linguistics, is only one of the many
types of specialized discourses that are analyzed by researchers in corpus
linguistics and EAP (English for Academic Purposes). In Firthian terms, all
these specialized discourses constitute "restricted languages".
As Leon (2007: 5) notes, "restricted languages ... became a touchstone
for Firth's descriptive linguistics and raised crucial issues for early socio-
linguistics and empiricist approaches in language sciences". Firth himself
states that "descriptive linguistics is at its best when dealing with such [re-
stricted] languages" (Firth 1968a: 105-106), mainly because the focus on
limited systems makes the description of language more manageable. A
restricted language can be defined as the language of a particular domain
(e.g. science, politics or meteorology) or genre that serves "a circumscribed
field of experience or action and can be said to have its own grammar and
dictionary" (Firth [1956] 1968b: 87). That means that we are dealing with a
subset of the language, with "a well defined limited type or form of a major
language, let us say English" (Firth 1968a: 98). A restricted language thus
has a specialized grammar and vocabulary, "a micro-grammar and a micro-
glossary" (Firth 1968a: 106, emphasis in original). An alternative concept
to restricted language would be that of sublanguage. Sublanguage is a term
used by Harris (1968) and Lehrberger (1982) to refer to "subsets of sen-
tences of a language" (Harris 1968: 152) or languages that deal with "lim-
ited subject matter" and show a "high frequency of certain constructions"
(Lehrberger 1982: 102). The concept of sublanguage also occurs in modern
corpus-linguistic studies, for example in a study on the language of diction-
ary definitions by Barnbrook who considers the concept "an extremely
powerful approach to the practical analysis of texts which show a restricted
use of linguistic features or have special organisational properties"
(Barnbrook 2002: 94). I will now turn to looking at the language of aca-
demic book reviews (a language of a particular domain with its own lexical
microgrammar) and at some typical constructions in this sublanguage or
restricted language.
The restricted language I am dealing with here is captured in a 3.5-
million word corpus of 1,500 academic book reviews published in Linguist
List issues from 1993 to 2005: the Book Reviews in Linguistics Corpus
(henceforth BRILC). The language covered in BRILC constitutes part of
the discourse of linguistics (in an English-speaking world). BRILC mirrors
how the global linguistic research community discusses and assesses publi-
cations in the field. For a corpus of its type, BRILC is comparatively large,
at least by today's standards, and serves well to represent the currently
common practice in linguistic review writing. However, the corpus can of
course not claim to be representative of review writing in general, and cer-
tainly not of academic discourse in its entirety, but it helps to provide in-
sights into the language of one particular discourse community: the com-
munity of a large group of linguists worldwide.
2.2. The identification of meaningful units in a corpus of linguistic book reviews

Continuing Sinclair's search for units of meaning, the question I would like
to address here is: How can we find meaningful units in a corpus? Or, more
specifically (given that BRILC contains a particularly evaluative type of
texts), how can we find units of evaluative meaning in a corpus? Evalua-
tion, seen as a central function of language and broadly defined (largely in
line with Thompson and Hunston 2000) as a term for expressions of what
stance we take towards a proposition, i.e. the expression of what a speaker
or writer thinks of what s/he talks or writes about, comes in many different
shapes, which implies that it is not easy to find it through the core means of
corpus analysis (doing concordance searches or word lists and keyword
lists). As Mauranen (2004: 209) notes, "[I]dentifying evaluation in corpora
is far from straightforward. ... Corpus methods are best suited for searching
items that are identifiable, therefore tracking down evaluative items poses a
methodological problem". On a similar note, Hunston (2004: 157) states
that "the group of lexical items that indicate evaluative meaning is large
and open", which makes a fully systematic and comprehensive account of
evaluation extremely difficult. In fact, the first analytical steps I carried out
in my search for units of evaluative meaning in BRILC (i.e. the examina-
tion of frequency word lists and keyword lists, see Römer 2008) did not
yield any interesting results which, at that point in the analysis, led me to
conclude that words are not the most useful units in the search for meaning
("the word is not enough", Romer 2008: 121) and that we need to move
from word to phrase level. So, instead of looking at single recurring words,
we need to examine frequent word combinations, also referred to as collo-
cations, chunks, formulaic expressions, n-grams, lexical bundles, phrase-
frames, or multi-word units. In Römer (2008), I have argued that the extrac-
tion of such word combinations or phrasal units from corpora, combined
with concordance analysis, can lead to very useful results and helps to high-
light a large number of meaningful units in BRILC.
In the present paper, however, I go beyond the methodology described
in the earlier study in which I only extracted contiguous word combinations
from BRILC (n-grams with a span of n=2 to n=7), using the software Col-
locate (Barlow 2004). I use two additional tools that enable the identifica-
tion of recurring contiguous and non-contiguous sequences of words in
texts: kfNgram (Fletcher 2002-2007) and ConcGram (Greaves 2005). Like
Collocate, kfNgram generates lists of n-grams of different lengths (i.e.
combinations of n words) from a corpus, e.g. 3-grams like as well as or the
book is. In addition to that, the program creates lists of so-called "phrase-
frames" (short "p-frames"). P-frames are sets of n-grams which are identi-
cal except for one word, e.g. at the end of, at the beginning of and at the
turn of would all be part of the p-frame at the * of. P-frames hence provide
insights into pattern variability and help us see to what extent Sinclair's
Idiom Principle (Sinclair 1987, 1991, 1996) is at work, i.e. how fixed lan-
guage units are or how much they allow for variation. Examples of p-
frames in BRILC, based on 5-gram and 6-gram searches, are displayed in
figure 1.

it would be * to 101 10
it would be interesting to 44
it would be useful to 14
it would be nice to
it would be better to
it would be possible to
it would be helpful to
it would be fair to
it would be difficult to
it would be necessary to
it would be good to
it * be interesting to 58
it would be interesting to 44
it will be interesting to
it might be interesting to 6

it * be interesting to see 33 3
it would be interesting to see 23
it will be interesting to see 7
it might be interesting to see 3
Figure 1. Example p-frames in BRILC, together with numbers of tokens and
numbers of variants (kfNgram output)

Together with the types and the token numbers of the p-frames, kfNgram
also lists how many variants are found for each of the p-frames (e.g. 10 for
it would be * to). The p-frames in figure 1 exhibit systematic and controlled
variation. The first p-frame (it would be * to) shows that, of a large number
of possible words that could theoretically fill the blank, only a small set of
(mainly positively) evaluative adjectives actually occur. In p-frames two
and three, modal verbs are found in the variable slot; however not all modal
verbs but only a subset of them (would, will, might).
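
To make the mechanics behind such lists concrete, a minimal sketch in Python of n-gram counting and p-frame grouping follows. It is not kfNgram itself but a simplified re-implementation of the underlying idea, and the toy input string and the frame singled out at the end are invented for illustration.

    from collections import Counter, defaultdict

    def ngrams(tokens, n):
        # All contiguous n-grams of a token list.
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def pframes(ngram_counts):
        # Group n-grams that are identical except for one position:
        # each n-gram contributes one frame per slot, with '*' in that slot.
        frames = defaultdict(Counter)
        for gram, freq in ngram_counts.items():
            for i in range(len(gram)):
                frame = gram[:i] + ('*',) + gram[i + 1:]
                frames[frame][gram] += freq
        return frames

    # Toy input standing in for a corpus of review sentences.
    text = "it would be interesting to see how it would be useful to see this"
    counts = Counter(ngrams(text.split(), 5))
    variants = pframes(counts)[('it', 'would', 'be', '*', 'to')]
    print(sum(variants.values()), len(variants))   # 2 tokens, 2 variants

The two figures printed at the end correspond to the token and variant counts given for each p-frame in figure 1.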
ConcGram allows an even more flexible approach to uncovering re-
peated word combinations in that it automatically identifies word associa-
tion patterns (so-called "concgrams") in a text (see Cheng, Greaves and
Warren 2006). Concgrams cover constituency variation (AB, ACB) and
positional variation (AB, BA) and hence include phraseological items that
would be missed by Collocate or kfNgram searches but that are potentially
interesting in terms of constituting meaningful units. Figure 2 presents an
example of a BRILC-based concgram extraction, showing constituency
variation (e.g. it would be very interesting, it should also be interesting).
67 pon are backward anaphora, and it would be interesting to see how his theory can
68 spective of grammaticalisation it would be very interesting to have a survey of the
69 semantic transparency; again, it would be very interesting to see this pursued in
70 from a theoretical standpoint, it would be very interesting to expand this analysis
71 r future research, noting that it would be especially interesting to follow the
72 ift from OV to VO in English. It would be particularly interesting to see if this
73 ook as exciting as I had hoped it might be, although Part 4 was quite interesting,
74 d very elegantly in the paper, it would be interesting to discuss the
75 s in semantics. In my opinion, it would be interesting to see how this ontological
76 oun derivatives are discussed, it would be interesting at least to mention verbal
77 felt most positively" (p. 22). It should be noted that some interesting results
78 rs also prove a pumping lemma. It would be interesting to see further
79 pear in Linguist List reviews: it wouldn't be very interesting, I didn't make a
80 is given on this work, though it seems to be very interesting for the linguist's
81 and confined to the endnotes. It would also be interesting to set Hornstein's view
82 n of a book title was omitted. It would also be interesting to see if some of the
83 erative work on corpora. Maybe it would also be interesting to test the analyses in
84 second definition. Of course, it would also be interesting to find out that
85 ub-entries, for instance). So, it should also be interesting to find, among the
86 olved in dictionary-making and it should also be interesting to all dictionary
87 n a constituent and its copy. It might however be interesting to seek a connection
88 ity and their self-perception. It might prove to be interesting to compare the
89 iteria seem fairly reasonable. It would, however, be interesting to study the
90 is, rhetoric, semantics, etc. It would certainly be very interesting to see what
91 nages to carry out the action. It would most certainly be interesting to look at
92 CTIC THEORY" by Alison Henry). It seems to me that it would be interesting to

Figure 2. Word association pattern (concgram) of the items it + be + interesting in BRILC (ConcGram output; sample)

All three tools (Collocate, kfNgram and ConcGram) can be referred to as


"phraseological search engines" as they facilitate the exploration of the
phraseological profile of texts or text collections.
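
The added flexibility of a concgram search can be illustrated along the same lines: the sketch below merely collects lines in which two items co-occur within a given window, in either order and with arbitrary intervening material, thus covering constituency variation (AB, ACB) and positional variation (AB, BA). It approximates the principle rather than ConcGram's actual algorithm, and the sample sentences are made up.

    import re

    def concgram_lines(sentences, word_a, word_b, window=5):
        # Return sentences in which word_a and word_b co-occur within
        # `window` tokens of each other, in either order.
        hits = []
        for sent in sentences:
            tokens = re.findall(r"[a-z']+", sent.lower())
            pos_a = [i for i, t in enumerate(tokens) if t == word_a]
            pos_b = [i for i, t in enumerate(tokens) if t == word_b]
            if any(abs(i - j) <= window for i in pos_a for j in pos_b):
                hits.append(sent)
        return hits

    sample = [
        "It would be very interesting to see this pursued further.",
        "It should also be interesting to all dictionary users.",
        "The chapter is well organised.",
    ]
    print(concgram_lines(sample, "it", "interesting"))   # first two sentences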
The extraction of n-grams (of different spans), p-frames and concgrams
was complemented by manual filtering of the output lists and extensive
concordancing of candidate phraseological items. These semi-automatic
BRILC explorations resulted in a database of currently a little over 800
items (i.e. types) of evaluative meaning. Some of these items are inherently
evaluative (e.g. it is not clear, wonderful, or a lack of), while others appear
"neutral" in isolation but introduce or frame evaluation (e.g. at the same
time or on the one hand). This type of implicit or "hidden" evaluation is
much more pervasive than we would expect and will be focused on in the
remainder of the paper. In the next section, we will look at items that pre-
pare the ground for evaluation to take place and examine their use in lin-
guistic book reviews. The items that will be discussed are all frequent in
BRILC and appeared at the top of the n-gram and p-frame lists.

3. Uncovering the phraseological profile of linguistic book reviews

3.1. Central patterns and their meanings

Before I turn to some of the high-frequency n-grams from my lists and their
use in BRILC, I would like to look at an item that came up in a discussion I
had about evaluation with John Sinclair (and that is also quite common in
BRILC, however not as common as the other items that will be described
here). In an email to me, he wrote: "Re evaluation, I keep finding evalua-
tions in what look like "ordinary" sentences these days. ... I came across the
frame "the - - lies in - -"" (Sinclair 2006, personal communication). I
think lies in is a fascinating item and I am very grateful to John Sinclair for
bringing it up. I examined lies in in my BRILC data and found that gap 1 in
the frame is filled by a noun or noun group with evaluative potential, e.g.
the main strength of the book in example (1). Gap 2 takes a proposition
about action, usually in the form of a deverbal noun (such as coverage),
which is pre-evaluated by the item from the first gap.
(1) The main strength of the book lies in its wide coverage of psycholinguistic
data and models...
This is a neat pattern, but what type of evaluation does it mainly express?
An analysis of all instances of lies in in context shows that 16 out of 135
concordance lines (12 %) express negative evaluation; see examples in (2)
and (3). We find a number (27.8 %) of unclear cases with "neutral" nouns
like distinction or difference in gap 1 (see examples [4] and [5]), but most
of the instances of lies in (80, i.e. 60.2 %) exhibit positive evaluation, as
exemplified in (1) and (6). The BRILC concordance sample in figure 3
(with selected nouns/noun groups in gap 1 highlighted in bold) and the two
ConcGram displays of word association patterns in figure 4 serve to illus-
trate the dominance of positively evaluative contexts around lies in. This
means that a certain type of meaning (positive evaluation) is linked to the
lies in pattern. In section 3.2 we will see if this is a generally valid pattern-
meaning combination or whether this combination is specific to the re-
stricted language under analysis.
(2) The obvious defect of such an approach lies in the nature of polysemy in
natural language.
(3) Probably, the only tangible limitation of the volume lies in some
typographical errors...
(4) The main difference lies in first person authority ...
(5) This distinction lies in the foregrounded nature of literary themes.
(6) The value of this account lies in the detail of its treatment of the varying
degrees and types of givenness and newness relevant to these constructions.

89 outstanding contribution made by Saussure lies in his theory of general linguisti
90 tions. Kennedy concludes that a solution lies in maintaining a purely syntactic
91 (to y ) ' (J.K's ex. (8a)). The solution lies in the exploitation of Generalized
92 mplex). Evidence for the above statement lies in the following linguistic facts:
93 of word geography is a task that still lies in future'' (p. 405). In ''Diachro
94 the scope of the term. Hinkel's strength lies in the fact that she led her resea
95 entation is convincing, and its strength lies in that it concentrates on one Ian
96 n book. As a textbook, its main strength lies in the presentation of the details
97 curious aspect of this agreement system lies in the fourth available agreement
100 ish linguistic history through its texts lies in part with the wealth of textual
101 selective loss. One explanation for that lies in the hypothesis that identificat
102 of the strengths of "Language in Theory" lies in offering an opportunity for dis
103 facing the author of a work such as this lies in where to set the limits of scho
covert. Suranyi's explanation for this lies in the nature of the features at
105 he importance of providing this training lies in the fact that simultaneous inte
106 eculiar trait of the hymn as a text type lies in "the degree of 'openness' of te
107 iscernible stress. Aguaruna's uniqueness lies in the following two properties ha
108 s true for the present volume. Its value lies in the fact that we can select fro
109 related fields of study. Its true value lies in its compact though penetrating

Figure 3. BRILC concordance sample of lies in, displaying predominantly positive evaluation

1 The strength, then, of The Korean Language, lies in its encyclopedic breadth of cover
2 The main strength of this book probably lies in the fact that it incorporates int
3 A particular strength of Jackson's book lies in its relevant biographical informa
4 synopsis, a major strength of this textbook lies in the integration of essential sema
5 ransformations. The strength of this chapter lies in the discussion where the authors
6 ASSESSMENT The main strength of this book lies in the personal testimonies and stor
7 SUMMARY The main strength of the book lies in its wide coverage of psycholingui
8 rgumentation is convincing, and its strength lies in that it concentrates on one langu
9 book. As a textbook, its main strength lies in the presentation of the details.
10 s the scope of the term. Hinkel's strength lies in the fact that she led her researc
1 the preface that the value of the reader lies in bringing together work from VARIOIJ
2 and in my view the main value of the paper lies in the mono- and multi-factorial anal
3that is terribly new in this book; its value lies rather in how it selects, organizes a
4 there is an intellectual value in exposing lies and deceptions, and here I think even
5 addressed. The value of his contribution lies in the realization of the power imbed
6 startling claim. The value of this account lies in the detail of its treatment of the
7 enge, a further added value of this chapter lies in the close link with Newerkla's ch
8 ins strong"(471). The value of this volume lies a) in its bringing together in one pi
9 syntacticians. The real value of this book lies in its treatment of the larger issues
10 al related fields of study. Its true value lies in its compact though penetrating dis
11 not an easy read. Despite this, its value lies in how it still manages to demonstrat
12 Ids true for the present volume. Its value lies in the fact that we can select from t
13 the compound prosodic word), and its value lies mainly in demonstrating how some rece

Figure 4. Word association patterns (concgrams) of the items lies + in + strength and lies + in + value in BRILC (ConcGram output; sample)

Let us now take a closer look at three items from the frequency-sorted n-
gram and p-frame lists: at the same time, it seems to me (it seems to *) and
on the other hand. In linguistic book review language as covered in
BRILC, at the same time mainly (in 56 % of the cases) triggers positive
evaluation, as exemplified in (7) and in the concordance sample in figure 5.
With only 5 % of all occurrences (e.g. number [8]), negative evaluation is
very rare. In the remaining 39 % of the concordance lines at the same time
is used in its temporal sense, meaning "simultaneously" (not "also"); see
example (9).
(7) Dan clearly highlights where they can be found and at the same time
provides a good literature support.
(8) At the same time, K's monograph suffers from various inadequacies ...
(9) At the same time, some new words have entered the field...

142 e animal world. At the same time, it includes a careful and honest discussion of wh
143 n October 1997. At the same time, it is a state-of-the-art panorama of the (sub-)fi
144 e at times, but at the same time it is an almost encyclopedic source of information
145 s corpus data). At the same time, it is clear that not every author has been using
146 ghout the book. At the same time, it is flexible enough in organisation to allow th
147 ian and Hebrew; at the same time it is never the case that, say, accomplishments sh
148 ard Macedonian. At the same time, it is notable that Mushin's results are consisten
149 by _his_, but at the same time it is the subject of the Japanese predicate phrase
150 anguage change. At the same time, it may equally be used by college teachers who wi
151 of the base. At the same time, it must be no larger than one syllable (as discus
152 taste, and, at the same time, it provides the research with steady foundations
153 osition itself. At the same time, it was cliticised to an immediately following ver
154 ertainly rigid. At the same time, King claims, we can easily account for such utter
155 a events, while at the same time leading to interesting questions about the often
156 ge history, but at the same time maintains an engaging and entertaining style throu

Figure 5. BRILC concordance sample of at the same time, displaying predominantly positive evaluation

The next selected item, it seems to me, prepares the ground for predomi-
nantly negative evaluation (281 of 398 instances, i.e. 70.5 %), as exempli-
fied in (10) and the concordance sample in figure 6. Positive evaluation, as
shown in (11), is rare and accounts for only 4.9% of all cases. About
24.6 % of the BRILC sentences with it seems to me constitute neutral ob-
servations, see e.g. (12).
(10) Finally, it seems to me that the discussion of information structure was
sometimes quite insensitive to the differences between spoken and written
data.
(11) In general, it seems to me this book is a nice conclusion to the process
started in the Balancing Act...
(12) // seems to me that it is a commonplace that truth outstrips epistemic
notions...

11 new. It seems to me, nevertheless, that there are some difficulties related to this
12 lem; it seems to me, rather, that precedence is always transitive; it is the particu
13 ail, it seems to me that a more explicit definition of word would be needed to handl
14 68). It seems to me that a high price has been paid in terms of numbers of categorie
15 ude, it seems to me that as for theoretical results, much more [...] should be said
16 ies. It seems to me that both fields would benefit from acting a little more like th
17 ath; it seems to me that Copper Island Aleut is not a good example of such process
18 per. It seems to me that in some cases this could lead M to certain misinterpretatio
19 ry). It seems to me that it would be interesting to examine such problems in a more
20 3c). It seems to me that M-S conjures up notions of abstract constructs that are not
21 Yet It seems to me that one can likewise make a strong case for claiming that espec
22 VIEW It seems to me that one of the central questions being analyzed in this book is
23 ary. It seems to me that some additional topics could have been incorporated into th
24 ded. It seems to me that such a term is used in more than one sense, having to do bo
25 lly, it seems to me to be a weakness of this approach that it will not easily handle

Figure 6. BRILC concordance sample of it seems to me, displaying predominantly negative evaluation

Finally, if we look at on the other hand, positive evaluation follows the 4-
gram in only 8 % of the 567 BRILC examples, as in (13). Negative evalua-
tions (54 %) and neutral observations (38 %) are considerably more fre-
quent. This is illustrated in figure 7 and in examples (14) and (15) below.
(13) Other chapters, on the other hand, provide impressively comprehensive
coverage of the topics...
(14) but on the other hand, it is obvious that the book under review fails in
various regards to take into account major developments in research into
Indian English over the last 25 years.
(15) Prepositional clauses, on the other hand, do not allow stranding.
85 M., on the other hand, comes to the opposite conclusion on the same point. It would
86 on, on the other hand, concerns the marking of event sequences through lexical and s
87 ts, on the other hand, consider the legitimate explanations to be those that do not
88 on. On the other hand, context in CA is not a priori but something that emerges from
89 nd. On the other hand, corpus linguists who want to develop their own tailor-made so
90 t). On the other hand, C denies the existence of the notion of "subject" as a univer
91 un. On the other hand, C importantly neglects other hypotheses on the origin of pers
92 re, on the other hand, denote type shifted, generalized quantifier-like or ,>-type e
93 an, on the other hand, despite its importance in the United States, left very little
94 acy on the other hand develops more slowly, influenced by production ease, salience,
95 ns, on the other hand, do not contribute to the truth conditions of the utterance bu
96 R , on the other hand, do provide support for D, and 0 and L (2002), while finding f
97 en, on the other hand do seem to have such restrictions, lengthening such words and
98 B, on the other hand, does not view optimization of language very seriously. Instea
99 On the other hand, [...], articles are missing on subjects that FG did attend to

Figure 7. BRILC concordance sample of on the other hand, displaying examples of negative evaluation and neutral observation

3.2. Corpus comparison: How "local" are these patterns and meanings?

The items we have just analyzed clearly show interesting patterns and pat-
tern-meaning relations. Their existence in BRILC alone, however, does not
say much about their status as "local" patterns, i.e. patterns that are charac-
teristic of linguistic book review language as a restricted language (in
Firth's sense). In order to find out how restricted-language-specific the
above-discussed phraseological items (lies in, at the same time, it seems to
me, on the other hand) are, I examined the same items and their patterns
and meanings in a larger reference corpus of written English, the 90-million
word written component of the British National Corpus (BNC_written).
In a first step, I compared the frequencies of occurrence (normalized per
million words, pmw) of the four items in BRILC with those in
BNC_written. As we can see in table 1, all units of evaluative meaning are
more frequent in BRILC than in BNC_written, which may not be all that
surprising if we consider the highly evaluative type of texts included in
BRILC. Moving on from frequencies to functions, the next step then in-
volved an analysis of the meanings expressed by each of the phraseological
items in BNC_written. For lies in I did not find a clear preference for one
type of evaluation (as in BRILC). Instead, there was a roughly equal distri-
bution of examples across the three categories "positive evaluation"
(34.5 %), "negative evaluation" (32.5 %) and "neutral/unclear" (33 %).
While negative evaluation was rather rare in the context of lies in in the
book review corpus, the item forms a pattern with nouns like problem and
difficulty in BNC_written, as the concordance samples in figure 8 show.
Table 1. Frequencies of phraseological items in BRILC and BNC_written

                        BRILC        BNC_written
lies in                 38 pmw       19 pmw
on the other hand       162 pmw      57 pmw
at the same time        100 pmw      73 pmw
it seems to me          19 pmw       5 pmw
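
For readers unfamiliar with the unit, a pmw value is simply the raw number of hits divided by the corpus size in words and multiplied by one million, as in the following sketch; the raw count used there is a rough back-calculation from the figures above rather than a number taken from the original study.

    def per_million(raw_count, corpus_size):
        # Normalized frequency per million words (pmw).
        return raw_count / corpus_size * 1_000_000

    # The 135 concordance lines of "lies in" discussed in section 3.1, in a corpus
    # of roughly 3.5 million words, work out at about 39 hits per million words,
    # i.e. of the same order as the (rounded) 38 pmw given in table 1.
    print(round(per_million(135, 3_500_000)))   # 39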
360 s without saying; my difficulty lies in knowing how defensible they are in the fo
361 s. The ohief oause of difficulty lies in the fact that confessions are typically o
362 olved. The practical difficulty lies in deciding how to value the external effects
363 on CD-ROM. The real difficulty lies in the fact that CD-ROM can only process one
364 n Fig. 8.5. A second difficulty lies in the uncertainty in our knowledge of the to
365 , on the contrary the difficulty lies in obtaining sufficient evidence to identify
366 ists the cause of the difficulty lies in an institution, central planning, which c
367 opinion. Part of the difficulty lies in the developments which have taken place in
368 d spellcheckers. The difficulty lies in building real quality into the products. D
369 urn, is apparent. The difficulty lies in providing an adequate theoretical framewor
370 experimentally. The difficulty lies in heating the fuel to temperatures of about
371 y extraordinary. The difficulty lies in finding an acceptable implied limitation.
372 ilm. At present, the difficulty lies in understanding how this relates smdash if a
373.. mess of things. The difficulty lies in convincing yourself of that! If all is we
10 07 e root of the innovation problem lies in a dilemma : Sbguo Curriculum innovation re
1008 rations. sequo The main problem lies in the amount of translation the software wil
1009 p to my Martin D-16. My problem lies in the fact that I can sbquo t get the same o
1010 way from the house. One problem lies in the fact that the space is considerably „i
1011 asonably good; the only problem lies in reaching the RAM upgrade s lots, which are
1012 rly where the particular problem lies in the case of this ruler. Books about Mary t
1013 ould argue that the real problem lies in the fact that shares had been overvalued f
1014 rpose, whereas the real problem lies in the adjustment of the model 's control lin
1015 equo. Perhaps the Met 's problem lies in the present state of museum affairs, where
1016 If Radiohead 's singular problem lies in the sheer obviousness of their line of att
1017 taining prose. Here the problem lies in the generality of the terms squot descript
1018 believes the root of the problem lies in a fault with the child 's immune cells in
1019 FAO) 1985). Part of the problem lies in the fact that much of this produce is expo
1020 umour? sequo Part of the problem lies in his opening statement : sbguo Eighty seven

Figure 8. BNC_written concordance samples of lies in, displaying patterns of negative evaluation

For at the same time we also find a lower share of positive contexts in the
BNC_written than in the BRILC data. While authors of linguistic book
reviews use the item predominantly to introduce positive evaluation, this
meaning is (with 9 %) very rare in "general" written English (i.e. in a col-
lection of texts from a range of different text types). An opposite trend can
be observed with respect to it seems to me. Here, positive contexts are
much more frequent in BNC_written than in BRILC, where negative
evaluation dominates (with 70.5 %; only 30 % of the BNC_written exam-
ples express negative evaluation). Finally, with on the other hand positive
evaluation or a positive semantic prosody is (with 33 %) also much more
common in BNC_written than in BRILC (see [16] and [17] for
BNC_written examples). For book reviews, I found that on the other hand
mostly introduces negative evaluation and that only 8 % of the BRILC
concordance lines express positive evaluation. These findings indicate that
the examined patterns and their meanings are indeed quite "local", i.e. spe-
cific to the language of linguistic book reviews. Not only do we find certain
phraseological items or patterns to occur with diverging frequencies across
text types and to be typical of a particular kind of restricted language, we
also observe that the same items express different meanings in different
types of language.
(16) Jennie on the other hand was thrilled when the girls announced wedding
... <BNC_written: B34 914>
(17) On the other hand, he at last gains well-deserved riches and a life of
comfort. <BNC_written: ADM 2192>

4. Concluding thoughts

Referring back to the groundbreaking work of John Firth and John Sinclair,
this paper has stressed the importance of studying units of meaning in re-
stricted languages. It has tried to demonstrate how a return to Firthian and
Sinclairian concepts may enable us to better deal with the complex issue of
meaning creation in (academic) discourse and how corpus tools and meth-
ods can help identify meaningful units in academic writing or, more pre-
cisely, in the language of linguistic book reviews. We saw that the identifi-
cation of units of (evaluative) meaning in corpora is challenging but not a
hopeless case and that phraseological search-engines like Collocate,
kfNgram and ConcGram can be used to automatically retrieve lists of
meaningful unit candidates for further manual analysis. It was found to be
important to complement concordance analyses by n-gram, p-frame and
concgram searches and to go back and forth between the different analytic
procedures, combining corpus guidance and researcher intuition in a maxi-
mally productive way. In the analysis of high-frequency items from the
meaningful unit candidate lists, it then became clear that a number of "in-
nocent" n-grams and p-frames have a clear evaluative potential and that
apparently "neutral" items have clear preferences for either positive or
negative evaluation.
The paper has also provided some valuable insights into the special na-
ture of book review language and highlighted a few patterns that are par-
ticularly common in this type of written discourse. One result of the study
was that it probably makes sense to "think local" more often because the
isolated patterns were shown to be actually very restricted-language-
specific. In a comparison of BRILC data with data retrieved from a refer-
ence corpus of written English (the written component of the British Na-
tional Corpus), we found that not only the patterns but also the identified
meanings for each of the patterns (and their distributions) are local. I would
suggest that these local patterns be captured in a "local lexical grammar"
which "is simply a logical extension of the concept of pattern grammar"
(Hunston 1999) in that it, being text-type specific, covers the patterns that
are most typical of the text type (or restricted language) under analysis and
links these patterns with the most central meanings expressed in the spe-
cialized discourse. I think that a considerable amount of research on disci-
plinary phraseology still needs to be done, and see the development of local
lexical grammars based on restricted languages as an important future task
for the corpus linguist. These text-type specific grammars will help us get a
better understanding of how meanings are created in particular discourses
and come closer to capturing the full coverage of Sinclair's (1987) idiom
principle.

Notes

1 I would like to thank the participants at the symposium on "Chunks in Corpus
Linguistics and Cognitive Linguistics: In Honour of John Sinclair", 25-27
October 2007, at the University of Erlangen-Nuremberg for stimulating ques-
tions and suggestions after my presentation.

References

Barlow, Michael
2004 Collocate 1.0: Locating Collocations and Terminology. Houston,
TX: Athelstan.
Barnbrook, Geoff
2002 Defining Language: A Local Grammar of Definition Sentences.
Amsterdam: John Benjamins.
Biber, Douglas
2006 University Language: A Corpus-based Study of Spoken and Written
Registers. Amsterdam: John Benjamins.
Biber, Douglas, Ulla Connor and Thomas A. Upton
2007 Discourse on the Move: Using Corpus Analysis to Describe Dis-
course Structure. Amsterdam: John Benjamins.
Bowker, Lynne and Jennifer Pearson
2002 Working with Specialized Language: A Practical Guide to Using
Corpora. New York/London: Routledge.
Cheng, Winnie, Chris Greaves and Martin Warren
2006 From n-gram to skipgram to concgram. IJCL 11 (4): 411-433.
Connor, Ulla and Thomas A. Upton (eds.)
2004 Discourse in the Professions: Perspectives from Corpus Linguistics.
Amsterdam: John Benjamins.
Firth, John R.
1968a Descriptive linguistics and the study of English. In: Selected Papers
of J. R. Firth 1952-59, Frank Robert Palmer (ed.), 96-113. Bloom-
ington: Indiana University Press. First published in 1956.
Firth, John R.
1968b Linguistics and translation. In: Selected Papers of J. R. Firth 1952-
59, Frank Robert Palmer (ed.), 84-95. Bloomington: Indiana Univer-
sity Press. First published in 1956.
Firth, John R.
1968c A synopsis of linguistic theory. In: Selected Papers of J. R. Firth
1952-59, Frank Robert Palmer (ed.), 168-205. Bloomington: Indi-
ana University Press. First published in 1957.
Fletcher, William H.
2002-07 kfNgram. Annapolis, MD: United States Naval Academy.
Gavioli, Laura
2005 Exploring Corpora for ESP Learning. Amsterdam: John Benjamins.
Greaves, Chris
2005 ConcGram Concordancer with ConcGram Analysis. Hong Kong:
Hong Kong University of Science and Technology.
Harris, Zellig S.
1968 Mathematical Structures of Language. New York: Interscience Pub-
lishers.
Hunston, Susan
1999 Local Grammars: The Future of Corpus-driven Grammar? Paper
presented at the 32nd BAAL Annual Meeting, September 1999,
University of Edinburgh.
Hunston, Susan
2004 Counting the uncountable: Problems of identifying evaluation in a
text and in a corpus. In: Corpora and Discourse, Alan Partington,
John Morley and Louann Haarman (eds.), 157-188. Bern: Peter
Lang.
Hunston, Susan and John McH. Sinclair
2000 A local grammar of evaluation. In: Evaluation in Text: Authorial
Stance and the Construction of Discourse, Susan Hunston and Geoff
Thompson (eds.), 74-101. Oxford: Oxford University Press.
Hyland, Ken
2004 Disciplinary Discourses: Social Interactions in Academic Writing.
Ann Arbor, MI: University of Michigan Press.
Lehrberger, John
1982 Automatic translation and the concept of sublanguage. In: Sublan-
guage: Studies of Language in Restricted Semantic Domains, Rich-
ard Kittredge and John Lehrberger (eds.), 81-106. Berlin: Walter de
Gruyter.
Léon, Jacqueline
2007 From linguistic events and restricted languages to registers. Firthian
legacy and Corpus Linguistics. Henry Sweet Society Bulletin 49: 5-25.
Mauranen, Anna
2004 Where next? A summary of the round table discussion. In: Academic
Discourse: New Insights into Evaluation, Gabriella Del Lungo Ca-
miciotti and Elena Tognini Bonelli (eds.), 203-215. Bern: Peter
Lang.
Römer, Ute
2008 Identification impossible? A corpus approach to realisations of
evaluative meaning in academic writing. Functions of Language 15
(1): 115-130.
Römer, Ute and Rainer Schulze (eds.)
2008 Patterns, Meaningful Units and Specialized Discourses (special issue
of International Journal of Corpus Linguistics). Amsterdam: John
Benjamins.
Sinclair, John McH.
1987 The Nature of the evidence. In: Looking Up: An Account of the
COBUILD Project in Lexical Computing, John McH. Sinclair (ed.),
150-159. London: HarperCollins.
Sinclair, John McH.
1991 Corpus, Concordance, Collocation. Oxford: Oxford University
Press.
Sinclair, John McH.
1996 The search for units of meaning. Texto IX (1): 75-106.
Sinclair, John McH.
2004 Trust the Text: Language, Corpus and Discourse. London:
Routledge.
Thompson, Geoff and Susan Hunston
2000 Evaluation: An introduction. In: Evaluation in Text: Authorial
Stance and the Construction of Discourse, Susan Hunston and Geoff
Thompson (eds.), 1-27. Oxford: Oxford University Press.

Corpora

BNC The British National Corpus. Distributed by Oxford University
Computing Services on behalf of the BNC Consortium. URL:
http://www.natcorp.ox.ac.uk/.
BRILC Book Reviews in Linguistics Corpus. Compiled by the author of this
paper.
Collocational behaviour of different types of text
Peter Uhrig and Katrin Götz-Votteler1

1. Introduction

Ever since John Sinclair introduced his idiom principle and his notion of
collocation (see Sinclair 1991: 109-121), there has been an increasing inter-
est in the study of different aspects of phraseology. In this article we would
like to present a work-in-progress report on a project exploring the colloca-
tional behaviour of text samples.2 By this term we refer to the extent to
which a text relies on or uses collocations. A text that is classified as "col-
locationally strong" can therefore be defined as a text in which a substantial
number of statistical collocations can be found, a "collocationally weak"
text as a text which contains fewer statistical collocations and consists of
more free combinations (in the sense of Sinclair's open choice principle).
In order to determine the collocational behaviour of a text, a computer
program was designed to compare the co-occurrence of words within a
certain text with co-occurrence data from the British National Corpus
(BNC). For the analysis outlined here, eight different text samples repre-
senting different text types were compiled. This selection of samples was
chosen in order to test whether certain interrelations between different texts
(or text types) and their collocational behaviours can be found.
The following three hypotheses summarize three kinds of interrelation
that we expected to occur:

Hypothesis 1: There is an interrelation between the collocational behaviour
of a text and its perceived difficulty.

Research on collocation claims that collocations are stored in the mind as
prefabricated items (Underwood, Schmitt, and Galpin 2004: 167; Ellis,
Frey, and Jalkanen 2009). It is therefore to be expected that the collocation-
ally stronger a text, the easier it should be to process, as the text follows
expected linguistic patterns. If, on the other hand, the reader is presented
with a text that consists of a considerable number of new combinations, no
already established linguistic knowledge can be used for processing the
text. In order to test Hypothesis 1, we selected texts which represent a range
of difficulty; those texts which we ranked as more difficult should therefore
consist of more new combinations and be collocationally weaker.

Hypothesis 2: There is an interrelation between the collocational behaviour
of a text and the text type.

Different text types might show different collocational behaviour. Fictional
texts, for example, are generally assumed to be linguistically quite creative,
i.e. these texts might to a higher degree rely on unusual word combinations
in order to create certain effects on the reader and should therefore be col-
locationally weaker than newspaper articles or academic writing, which use
a larger number of standardized expressions.3

Hypothesis 3: There is an interrelation between the collocational behaviour
of a text and its idiomaticity.

Knowing the collocators a word is associated with is one of the crucial
steps towards advanced foreign language proficiency (Hausmann 2004,
Granger this volume). Texts produced by learners of English are therefore
likely to be collocationally weaker than texts written by native speakers.

2. Software and text processing

In order to evaluate word co-occurrences in a given text to find out whether
they are strong collocations, some sort of reference is needed, which, for
our purposes, is the British National Corpus (BNC). The first version of the
computer program that was designed to test our hypotheses queried
SARA4 at runtime, which resulted in very long response times. It only
compared bigram frequencies, so the insights to be gained from it were
rather limited (see below). The current version makes use of a pre-
computed database of all co-occurrences in the BNC for every span from 1
to 5, both for word forms and for lemmata. It offers user-definable span,
allows the use of lemmatisation and can ignore function words.
The software uses tagged and lemmatised text as input. To ensure com-
patibility with the database based on the BNC, the same tools that were
used to annotate the BNC were applied to the text samples. All input texts
were thus PoS-tagged with CLAWS (see Leech, Garside, and Bryant 1994)
and lemmatised with LEMMINGS via WMatrix (Rayson 2008).5
Our software computes all co-occurrence frequencies and association
measures, and the resulting lists are imported into Microsoft Excel where
graphs are plotted. An example list is given in the appendix (table 1).
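
To give a rough idea of what such a computation involves, the sketch below counts co-occurrences within a symmetric span and derives a mutual information score as the base-2 logarithm of the observed over the expected co-occurrence frequency. It is a simplified stand-in rather than the program described here: the stop list is a toy one (the actual analysis relies on the CLAWS annotation), the expected-frequency estimate is only one common approximation, and the figures in the final line are entirely hypothetical.

    import math
    from collections import Counter

    STOP = {"the", "of", "to", "a", "in", "and", "on", "it", "she", "he"}   # toy stop list

    def cooccurrence_counts(tokens, span=5, ignore_function_words=True):
        # Count how often two items occur within `span` tokens of each other.
        if ignore_function_words:
            tokens = [t for t in tokens if t not in STOP]
        pairs = Counter()
        for i, w in enumerate(tokens):
            for v in tokens[i + 1:i + 1 + span]:
                pairs[tuple(sorted((w, v)))] += 1
        return pairs

    def mi_score(observed, freq_a, freq_b, corpus_size, window=10):
        # MI = log2(observed / expected); the expected co-occurrence frequency
        # is estimated from the individual frequencies and the window size.
        expected = freq_a * freq_b * window / corpus_size
        return math.log2(observed / expected)

    # Entirely hypothetical reference-corpus figures for some word pair:
    print(round(mi_score(observed=25, freq_a=100, freq_b=5000,
                         corpus_size=100_000_000), 2))   # 5.64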

3. Text samples

Each sample contains 20,000 words of English text. The non-fictional cate-
gory is composed of newspaper texts from the Guardian and articles from
academic journals by British authors. The fictional category comprises
Elizabeth George,6 P. D. James, Ian McEwan, and Virginia Woolf. Addi-
tionally, a sample of EFL essays by German students7 was compiled. The
last sample is an automatic translation of a 19th century German novel
(Theodor Fontane's Effi Briest) by AltaVista Babelfish (now called Yahoo!
Babelfish8). This sample was included to double-check the results against
some unnatural and unidiomatic language.
The criteria for the selection of texts were
a) varying degrees of difficulty in order to test Hypothesis 1,
b) coverage of different text types in order to test Hypothesis 2,
c) different levels of idiomaticity in order to test Hypothesis 3.
A few words have to be said about criterion a): even though it is very
common to describe a certain article, story or novel as "difficult to read", it
is - from a linguistic point of view - hard to determine the linguistic fea-
tures that support this kind of subjective judgment. A quantitative analysis
of fictional literature, using some of the authors above, has shown that the
degree of syntactic complexity seems to correspond to the evaluation of
difficulty (Götz-Votteler 2008); the inclusion of Hypothesis 1 can be seen
as a complementation of that study.

4. Results

The first run of the software provided calculations based on non-
lemmatised word forms, including grammatical words.9 Figure 1 is a plot of
all mutual information scores.
Figure 1. Word forms, span 1, function words included

As mutual information boosts low-frequency highly specific combinations
(as opposed to MB for instance; see Evert 2005: 243), the two top dots
("Helter Skelter" and "Ehud Olmert") are of no particular relevance. What
is represented as zero in the diagram are those combinations for which no
association score could be computed because one or both of the items do
not occur in the BNC or their co-occurrence frequency in the BNC is zero
(such as "primitive hut"). The curves in figure 1 are strikingly similar and
do not permit any conclusions that there are differences between text types.
The only curve that deviates slightly is the automatic translation of Fontane's Effi Briest, which is quite strong in the negative numbers, indicating
that there are many pairs of adjacent words which are much less common
than expected from their individual frequencies ("anti-collocations").
Similar results were found using different association measures, which
is not surprising as the absolute frequency of co-occurrence did not vary
much across text types.
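
The low-frequency bias of MI can be illustrated with a small worked example; the corpus size and the counts of the first pair are invented, while the second pair uses the organic + food frequencies from table 1 in the appendix.

```python
# Toy illustration of MI's low-frequency bias (N is an assumed corpus size):
# a pair of two rare words seen only once outscores a well-established
# collocation such as organic + food (counts as in table 1 of the appendix).
import math

def mi(f1, f2, f12, N=100_000_000):
    return math.log2(f12 * N / (f1 * f2))

print(mi(50, 40, 1))         # rare pair seen once, e.g. a proper name: ~15.6
print(mi(18674, 2112, 54))   # organic + food: ~7.1 (table 1 reports 7.063; the
                             # exact value depends on corpus size and span handling)
```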

Figure 2. Word forms, span 1, function words included

Figure 2 gives the number of hits in mutual information score bands of
width 1 (apart from <0). It thus illustrates the high number of anti-
collocations in the automatic translation even better. A detailed analysis of
these combinations revealed for instance their + the, of + to, in + on or she
+ the, which shows that the first run of the software primarily measured
combinations of function words rather than lexical collocations. This can be
attributed to the short span (only 1) and to the large number of function
words that created too much "noise".
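
The banding underlying figure 2 (and figures 4 and 5 below) can be sketched as follows, with all negative scores collapsed into a single band:

```python
# Sketch of the score banding behind figures 2 and 4: MI scores are counted
# in bands of width 1, with all negative scores collapsed into one "<0" band.
from collections import Counter

def band(score):
    return "<0" if score < 0 else int(score)   # e.g. 3.7 falls into band 3

def hits_per_band(scores):
    return Counter(band(s) for s in scores)

print(hits_per_band([-2.3, 0.4, 3.7, 3.9, 5.1]))
# e.g. Counter({3: 2, '<0': 1, 0: 1, 5: 1})
```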
In order to solve these problems, a new query with different settings was
executed: instead of word forms, lemmata were chosen as the basis of
analysis, function words were excluded, and the span was extended to -5 to +5. The following chart shows the distribution of Mutual Information
scores:

Figure 3. Lemmata, span 5, function words ignored

As can be seen, the graphs show different amounts of score 0, i.e. non-
occurrences in the BNC. The highest amount of score 0 can be found with
the Guardian texts and P. D. James: the fact that the Guardian texts produce this result quite often is a consequence of the high frequency of proper nouns, which do not exist and therefore do not collocate in the
BNC. The P. D. James sample, on the other hand, can be characterized as
combining low-frequency lexical words with a sometimes highly formal
style, which creates word combinations that cannot be found in the BNC.10
The automatic translation of Fontane also produces many zeroes; in this
case this is also due to a substantial number of proper nouns, but addition-
ally to a large number of words which were not translated by the program
because they were assumed to be proper nouns (e.g. satisfy + mitspielenden, thereby + entzücken). The EFL graph shows a very small amount of
score 0; this result is at first quite surprising and in a way contradicts Hy-
pothesis 3. It might however be a consequence of the fact that the students
are well trained in using frequent combinations and only use a small range
of vocabulary.
A second difference between the graphs in the chart above is their
length: the longest curves are the ones for the Guardian texts and the aca-
demic texts. This means that they represent more scores than the other
graphs, because more words were counted, implying that the Guardian and
the academic texts contain more lexical words, whereas fictional texts have
a larger number of function words.11 The length of the curves is therefore a
result of the lexical density of the text samples, and the chart shows that the
lexical density correlates with the text type.
This leads to the question whether this finding provides insight into the
collocational behaviour of the text samples. The answer, unfortunately, is
not really, as the shape of the graphs is quite similar. The only exception is
the automatic translation of Fontane, whose graph is a bit steeper
than the others. This can be seen more clearly in the following chart which
again gives the number of hits in Mutual Information score bands.


Figure 4. Lemmata, span 5, function words ignored

Again the curves are quite similar; even though the academic texts and the
samples from the novels by P. D. James and Ian McEwan produce a different peak, the rise and fall of the graphs are nearly identical. The automatic
translation of Fontane, however, shows a steeper fall, which means that
there are fewer scores with a high Mutual Information value, as this text contains fewer specific collocations. But again, the deviation is so small
that this finding can at best be interpreted as a tendency, not as a clear-cut
difference between the texts.
To sum up, the second query did not display the expected distinction in collocational behaviour between the eight text samples either.

The last query was therefore narrowed down to selected word-classes
which have been linked to the degree of idiomaticity (Hausmann 2004,
Nesselhauf 2005). The following chart shows the distribution of Mutual Information values as percentages for the combinations noun-adjective and adjective-noun; all other parameters stayed the same.
The text samples written by native speakers result in similar curves. The
automatic translation of Fontane, however, produces quite a large percent-
age of lower scores (between 3 and 5), which represent highly frequent
collocations. On the other hand, this text contains visibly fewer collocations
with higher scores, which means that we encounter fewer specific colloca-
tions here than in the other texts. The texts written by EFL students display
a similar behaviour, even though not to the same extent. Again we assume
that this finding is a result of the fact that the students are well trained in
highly frequent collocations, but use fewer low-frequency specific colloca-
tions.


Figure 5. Lemmata, span 5, only noun-adjective/adjective-noun combinations

5. Discussion and evaluation

We will now return to our hypotheses and evaluate them critically in the
light of our findings:

Hypothesis 1: There is an interrelation between the collocational behaviour of a text and its perceived difficulty.

As mentioned above, the text samples were chosen to cover a certain range
of difficulty. The preceding discussion of the three queries showed that the
results do not support Hypothesis 1. We would even go a step further and
claim that the collocational behaviour of a text does not seem to contribute
to the perceived difficulty.

Hypothesis 2: There is an interrelation between the collocational behaviour of a text and the text type.

Neither did the data provide any evidence for Hypothesis 2: from our charts
no interrelation between the text type and the collocational strength of a
text was visible. However, the text samples did display a difference in lexi-
cal density, i.e. text types such as newspaper articles or academic writing
contain a larger number of lexical items, whereas a text type such as fiction
consists of a larger percentage of function words.12

Hypothesis 3: There is an interrelation between the collocational behaviour of a text and its idiomaticity.

For our third hypothesis the results proved to be the most promising ones.
The largely nonsensical text sample generated by an automatic translation
device showed differences in behaviour for the span -5 to +5. The same is
true of the texts written by EFL learners, even though much less obviously
than we would have assumed.
As the discussion of the three hypotheses reveals, our findings are far
less conclusive than expected. This is partly due to some technical and
methodological problems which shall be briefly outlined in the following:
A whole range of problems is associated with tokenisation, PoS-tagging,
and lemmatisation. Even though excellent software was made available for
the present study, there are still errors.13 These errors would only have been
a minor problem, had they been consistent, but over the past 15 years,
CLAWS has been improved, so the current version of the tagger does not
produce the same errors it consistently produced when the BNC was anno-
tated.14 In addition, multi-word units are problematic in two respects:
firstly, the tagger recognizes many of them as multi-word units while the
lemmatiser lemmatises every orthographic word, rendering mappings of the
two very difficult. Besides, they distort the results, even if function words are excluded, as they always lead to very high association scores.15
The most serious problem, though, is related to the size of the reference
corpus, the BNC. Even if all proper names are ignored and only lemmatised
combinations of nouns and adjectives in a five-word span to either side are taken into account, up to 40% of the combinations in the samples still do not occur in the BNC at all. Up to 60% occur fewer than 5 times - a
limit below which sound statistical claims cannot be maintained. This is of
course partly due to the automatic procedure, which looks at words in a
five-word span and thus may try to score two words which are neither syntactically nor semantically related in any way.16 It therefore seems as if the
BNC is still too small for this kind of research by an order of magnitude or
two. This problem may at least be partially solved by augmenting the BNC
dataset with data from larger corpora such as Google's Web 1T 5-gram or
by limiting the research to syntactically related combinations in parsed
corpora.
Despite (or perhaps even because of) the inconsistencies and inconclusive results of the present study, some of the aspects presented above very much deserve further investigation: since we have seen that there are slight differences between native and non-native usage, at least for noun-adjective collocations, it might be interesting to see whether it is possible to determine the level of proficiency of learners automatically by looking at the collocational behaviour of their text production.17

Appendix

Table 1. Output of the database query tool

Lemma 1  F1  Lemma 2  F2  F1,2  MI  Log-like  Log-log  Z-Score  MI3  T-Score

supermarket_SUBST 1040  accuse_VERB 2629  0  0  0  0  0  0  0
supermarket_SUBST 1040  organic_ADJ 2112  1  5.474  5.635  0  6.517  5.474  0.978
supermarket_SUBST 1040  food_SUBST 18674  10  5.652  58.843  18.774  21.975  12.295  3.099
supermarket_SUBST 1040  pressure_SUBST 11790  0  0  0  0  0  0  0
accuse_VERB 2629  organic_ADJ 2112  1  4.136  3.848  0  3.955  4.136  0.943
accuse_VERB 2629  food_SUBST 18674  5  3.314  13.983  7.694  6.342  7.958  2.011
accuse_VERB 2629  pressure_SUBST 11790  0  0  0  0  0  0  0
organic_ADJ 2112  food_SUBST 18674  54  7.063  423.033  40.644  84.324  18.572  7.293
organic_ADJ 2112  pressure_SUBST 11790  1  1.971  1.243  0  1.475  1.971  0.745
organic_ADJ 2112  ease_VERB 2357  0  0  0  0  0  0  0
organic_ADJ 2112  standard_SUBST 15079  6  4.201  23.613  10.86  9.934  9.371  2.316
food_SUBST 18674  pressure_SUBST 11790  17  2.914  39.22  11.912  9.819  11.089  3.576
food_SUBST 18674  ease_VERB 2357  2  2.149  2.862  2.149  2.307  4.149  1.095
food_SUBST 18674  standard_SUBST 15079  60  4.379  250.367  25.864  33.631  16.192  7.374
food_SUBST 18674  say_VERB 317539  205  1.755  211.398  13.477  18.51  17.114  10.076
pressure_SUBST 11790  ease_VERB 2357  60  7.72  524.5  45.599  111.926  19.533  7.709
pressure_SUBST 11790  standard_SUBST 15079  21  3.528  64.392  15.494  14.212  12.312  4.185
pressure_SUBST 11790  say_VERB 317539  150  1.968  186.978  14.224  18.03  16.425  9.116
pressure_SUBST 11790  expert_SUBST 7099  4  2.222  6.039  4.444  3.394  6.222  1.571
ease_VERB 2357  standard_SUBST 15079  2  2.458  3.544  2.458  2.711  4.458  1.157
ease_VERB 2357  say_VERB 317539  28  1.869  32.051  8.984  7.344  11.484  3.843
ease_VERB 2357  expert_SUBST 7099  1  2.545  1.871  0  2.001  2.545  0.829


Notes

1 The order of authors is arbitrary.


2 This project is carried out by Thomas Herbst, Peter Uhrig, and Katrin Götz-Votteler at the University of Erlangen-Nürnberg. It was supported with a grant by the Sonderfonds für wissenschaftliche Arbeiten an der Universität Erlangen-Nürnberg.
3 For a characterization of varying linguistic behaviour of different text types
see also Biber (1988) and Biber et al. (1999).
4 SGML Aware Retrieval Application; the software shipped with the original
version of the BNC.
5 Thanks to Paul Rayson of Lancaster University, who kindly allowed us to use
Wmatrix for this research project.
6 Elizabeth George is our only American author. This did not have any effect
on the results, despite our British reference corpus.
7 The students attended a course preparing them for an exam roughly on level
CI of the Common European Framework of Reference.

8 http://de.babelfish.yahoo.com/
9 According to Sinclair's definition (1991: 170), "only the lexical co-
occurrence of words" counts as collocation.
10 Cf. the following sentence: "Man is too addicted to this intoxicating mixture
of adolescent buccaneering and adult perfidy to relinquish it [spying] en-
tirely."
11 There was no calculation of co-occurrences across sentence boundaries; thus
sentence length may also be held responsible for this finding. However, an
analysis of mean sentence length did not confirm this assumption.
12 For the distribution of some types of function words in different types of texts
see Biber et al. (1999: ch. 2.4).
13 If we assume that the success rate of CLAWS in our study is roughly 97% (as
published in Leech and Smith 2000), we still get about 600 ambiguous or
wrongly tagged items per 20,000 word sample.
14 The word organic*, for instance, is tagged as plural in the BNC and as singu-
lar by the current version of CLAWS. Thus no combinations containing the
word organic* were found by our automatic procedure, which always queries
word/tag combinations. (Since the XML version of the BNC was not yet
available when the present study was started, the database is based on the
BNC World Edition.)
15 A case in point would be Prime Minister.
16 In "carnivorous plant in my office", carnivorous and office are found within a
5-word span. It is not surprising, though, that they do not occur within a 5-
word span in the BNC.
17 The software may also be used for a comparison of different samples from
"New Englishes" in order to find out whether these show similar results to
British usage or have a distinct collocational behaviour. (Thanks to Christian Mair for suggesting this application of our methodology.) In addition, it is ca-
pable of identifying non-text, which means it could be used to find automati-
cally generated spam emails or web pages. So in the end this could spare us
the trouble of having to open emails which, on top of trying to sell dubious
drugs, contain a paragraph which serves to trick spam filters and reads like
the following excerpt: "Interview fired attorney david Iglesias by Shockwave
something."

References

Biber, Douglas
1988 Variation across Speech and Writing. Cambridge/New York/New
Rochelle/Melbourne/Sydney: Cambridge University Press.
Biber, Douglas, Stig Johansson, Geoffrey Leech, Susan Conrad and Edward Fine-
gan
1999 Longman Grammar of Spoken and Written English. Harlow: Pearson
Education Limited.
Ellis, Nick, Eric Frey and Isaac Jalkanen
2009 The psycholinguistic reality of collocation and semantic prosody (1): Lexical access. In Exploring the Lexis-Grammar Interface: Studies in Corpus Linguistics, Ute Römer and R. Schulze (eds.), 89-114.
Amsterdam: John Benjamins.
Evert, Stefan
2005 The Statistics of Word Cooccurrences: Word Pairs and Collocations.
Dissertation, Institut für maschinelle Sprachverarbeitung, University
of Stuttgart, URN urn:nbn:de:bsz:93-opus-23714.
Götz-Votteler, Katrin
2008 Aspekte der Informationsentwicklung im Erzähltext. Tübingen: Gunter Narr Verlag.
Granger, Sylviane
2011 From phraseology to pedagogy: Challenges and prospects. This
volume.
Hausmann, Franz Josef
2004 Was sind eigentlich Kollokationen? In Wortverbindungen - mehr oder weniger fest, Kathrin Steyer (ed.), 309-334. Berlin: Walter de
Gruyter.
Leech, Geoffrey, Roger Garside and Michael Bryant
1994 CLAWS4: The tagging of the British National Corpus. In Proceed-
ings of the 15th International Conference on Computational Linguis-
tics (COLING 94), 622-628. Kyoto, Japan.
Leech, Geoffrey and Nicholas Smith
2000 Manual to accompany The British National Corpus (Version 2) with
improved word-class tagging. Lancaster. Published online at
http://ucrel.lancs.ac.uk/bnc2/.
Nesselhauf, Nadja
2005 Collocations in a Learner Corpus. Amsterdam, Philadelphia: Ben-
jamins.
Rayson, Paul
2008 Wmatrix: A web-based corpus processing environment. Computing
Department, Lancaster University, http://ucrel.lancs.ac.uk/wmatrix/.

Sinclair, John McH.
1991 Corpus, Concordance, Collocation. Oxford: Oxford University
Press.
Underwood, Geoffrey, Norbert Schmitt and Adam Galpin
2004 The eyes have it: An eye-movement study into the processing of
formulaic sequences. In Formulaic Sequences, Norbert Schmitt (ed.),
153-172. Amsterdam/Philadelphia: Benjamins.

Corpus

BNC The British National Corpus. Distributed by Oxford University Computing Services on behalf of the BNC Consortium. http://www.natcorp.ox.ac.uk/.
Corpus linguistics, generative grammar and database semantics

Roland Hausser

1. Learner's dictionary and statistical tagging

In British Corpus Linguistics (CL), two schools may be distinguished: one
is associated with the University of Birmingham and its mentor John Sin-
clair, the other with the University of Lancaster and its mentors Roger Gar-
side and Geoffrey Leech. The Birmingham approach has been character-
ized as "CL-as-theory" and "doing language", the Lancaster approach as
"CL-as-method" and "doing computing" (Kirk 1998).
The difference between the two approaches is apparent in their respec-
tive analyses of a word. Take for example the word decline, analyzed in the
Collins COBUILD English Language Dictionary (CCELD, Sinclair 1987)
as a lexical entry with several readings:

decline /di'klain/, declines, declining, de-
clined. 1 If something declines, it becomes less in
quantity, importance, or strength, [examples]
2 If you decline something or decline to do some-
thing, you politely refuse to accept it or do it; a fairly
formal word, [examples]
3 Decline is the condition or process of becoming
less in quantity, importance, or quality, [examples]

Figure 1. Entry of decline in Collins COBUILD ELD 1987 (excerpt)

Intended for learners rather than fluent speakers of English, the forms de-
clines, declining and declined are explicitly listed (instead of naming the
paradigm).
In a separate column (not shown in figure 1), the CCELD characterizes reading 1 as an intransitive verb with the hypernym decrease, the cognate diminish and the antonym increase. Reading 2 is characterized as V+O OR V+to-INF, whereby V+O indicates a transitive verb. Reading 3 is characterized as N UNCOUNT/COUNT:USU SING, i.e. as a noun which is usually used in the singular. Chapter 3 of Sinclair (1991) provides a de-
tailed discussion of this entry to explain the form and purpose of entries in
the CCELD in general.
Next consider the corresponding Lancaster analysis:

3682 decline NN1               1 declined-to-comment NN1
451 decline VVI                249 declines VVZ
381 decline NN1-VVB            26 declines VVZ-NN2
121 decline VVB-NN1            22 declines NN2
38 decline VVB                 7 declines NN2-VVZ
1 decline-and-fall AJ0-NN1     446 declining AJ0
1 decline/withdraw VVB         284 declining VVG-AJ0
800 declined VVN               234 declining AJ0-VVG
610 declined VVD               138 declining VVG
401 declined VVD-VVN           1 declining-cost AJ0
206 declined VVN-VVD           1 declining-in AJ0

Figure 2. Forms of decline as analyzed in the BNC 2007 XML Edition

To evaluate the tagging, we have to look up the definitions of the relevant tag-set2 in order to see which classifications are successful. For example, declining is assigned four different tags (ambiguity), which are defined as follows:

446 declining AJ0       adjective (unmarked) (e.g. GOOD, OLD)
284 declining VVG-AJ0   -ing form of lexical verb and adjective (unmarked)
234 declining AJ0-VVG   adjective (unmarked) and -ing form of lexical verb
138 declining VVG       -ing form of lexical verb (e.g. TAKING, LIVING)

From a linguistic point of view, it would be better to classify declining unambiguously as the progressive form of the verb and leave the standard
uses of the progressive as a predicate, a modifier, or a noun to the rules of
syntax.
Critical remarks on the accuracy3 and usefulness of statistical tagging
aside, the Birmingham and the Lancaster approaches share the same meth-
odological issues of corpus linguistics, namely sampling representativeness,
size, format (and all their many sets of choices) as well as the basic tech-
niques such as the use of frequency lists, the generation of concordances,
the analysis of collocations and the question of tagging and other kinds of
in-text annotation. And both raise the question of whether their computa-
tional analysis of machine-readable texts is just a methodology (extending
the tool box) or a linguistic theory.

This question is addressed by Teubert and Krishnamurthy (2007: 1) as follows:
corpus linguistics is not a branch of linguistics, but the route into linguis-
tics
corpus linguistics is not a distinct paradigm in linguistics but a methodol-
ogy
corpus linguistics is not a linguistic theory but rather a methodology
corpus linguistics is not quite a revolt against an authoritarian ideology, it
is nonetheless an argument for greater reliance on evidence
corpus linguistics is not purely observational or descriptive in its goals, but
also has theoretical implications
corpus linguistics is a practice, rather than a theory
corpus linguistics is the study of language based on evidence from large
collections of computer-readable texts and aided by electronic tools
corpus linguistics is a newly emerging empirical framework that combines
a firm commitment to rigorous statistical methods with a linguistically
sophisticated perspective on language structure and use

corpus linguistics is a vital and innovative area of research


Regarding the Birmingham "CL-as-theory" and "doing language" ap-
proach, Sinclair is quite adamant about the authority of real data over ex-
amples invented by linguists in the Chomskyan tradition - which is a meth-
odological issue. But when it comes to writing lexical entries, Sinclair is
pragmatic, with readability for the learner as his topmost priority. For ex-
ample, in his introduction to the CCELD (1987: xix) Sinclair writes:
Within each paragraph the different senses are grouped together as well as
the word allows. Although the frequency of a sense is taken into account,
the most important matter within a paragraph is the movement from one
sense to another, giving as clear as possible a picture.

2. The place of lexical meanings

The aim of corpus linguists and dictionary builders is to provide an accu-
rate description of "the language" at a certain point in time or in a certain
time interval. It seems to follow naturally from this perspective that a lan-
guage is viewed as an object "out there in the world". As Teubert (2008)
puts it:

Language is symbolic. A sign is what has been negotiated between sign us-
ers. The meaning of a sign is not my (non-symbolic) experience of it. Mean-
ings are not in the head as Hilary Putnam4 never got tired of repeating. The
meaning of a sign is the way in which the members of a discourse commu-
nity are using it. It is what happens in the symbolic interactions between
people, not in their minds.
On the one hand, it is uncontroversial that language meanings should not be
treated as something personal left to the whim of individuals. On the other
hand, simply declaring meanings to be real external entities is an irrational
method for making them "objective". The real reason why the conven-
tionalized surface-meaning relations are shared by the speech community is
that otherwise communication would not work.
Even if we accept for the sake of the argument that language meanings
may be viewed (metaphorically) as something out there in the world, they
must also exist in the heads of the members of the language community.
How else could speaker-hearers use language surface forms and the associ-
ated meanings to communicate with each other?
That successful natural language interaction between cognitive agents is
a well-defined mechanism is shown by the attempt to communicate in a
foreign language environment. Even if the information we want to convey
is completely clear to us, we will not be understood by our hearers if we
fail to use their language adequately. Conversely, we will not be able to
understand our foreign communication partners who are using their lan-
guage in the accustomed manner unless we have learned their language.
Given that natural language communication is a real and objective pro-
cedure, it is a legitimate scientific goal to model this procedure as a theory
of how natural language communication works. Such a theory is not only of
academic interest but is also the foundation of free human-machine com-
munication in natural language. The practical implications of having ma-
chines which can freely communicate in natural language are enormous:
instead of having to program the machines we could simply talk with them.

3. Basic structure of a cognitive agent with language

Today, talking robots exist only in fiction, such as C-3PO in the Star Wars
movies (George Lucas 1977-2005) and Roy, Rachael, etc. in the movie
Blade Runner (Ridley Scott 1982). The first and so far the only effort to
model the mechanism of language communication as a computational lin-
guistic theory is Database Semantics (DBS).

DBS is developed at a level of abstraction which applies to natural
agents (humans) and artificial agents (talking robots) alike. In its simplest
form, the interfaces, components and functional flow of a talking agent may
be characterized schematically as follows:

Figure 3. Structuring central cognition in agents with language (borrowed from Hausser 2006: 26)

According to this schema, the cognitive agent has a body out there in the
world5 with external interfaces for recognition and action. Recognition is
for transporting content from the external world into the agent's cognition,
action is for transporting content from the agent's cognition into the exter-
nal world.6
In this model, the agent's immediate reference7 with language to corre-
sponding objects in the agent's external environment is reconstructed as a
purely cognitive procedure. An example of immediate reference in the
hearer mode is following a request, based on (i) language recognition, (ii)
transfer of language content to the context level based on matching and (iii)
context action. An example in the speaker mode is reporting an observa-
tion, based on (i) context recognition, (ii) transfer of context content to the
language level based on matching and (iii) language production including
sign synthesis.8
From the viewpoint of building a talking robot, the language signs exist-
ing in the external reality between communicating agents are merely acous-
tic perturbations (speech) or doodles on paper (writing) which are com-
pletely without any grammatical properties or meaning (cf. Hausser 2006:
Sect. 2.2; Hausser 2009b). The latter arise via the agent's wordform recog-
nition, based on matching the shapes of the external surface forms with
corresponding keys in a lexicon stored in the agent's memory.
This lexicon must be acquired by each member of the language commu-
nity. The learning procedure is self-correcting because using a surface form
with the wrong conventional meaning leads to communication problems. If
there is anything like Teubert's and Putnam's notion of language (a posi-
tion known as linguistic externalism), it is a reification of the intuitions of
members of the associated language community, manifested as signs pro-
duced by speakers (or writers) in a certain interval of time. These manifes-
tations may then be selected, documented and interpreted by corpus lin-
guists.

4. Automatic word form recognition

The computer may be used not only for the construction of dictionaries, e.g.
by using a machine-readable corpus for improving the structure of the lexi-
cal entries, but also for their use: instead of finding the entry for a word like
decline in the hardcopy of a dictionary using the alphabetical order of the
lemmata, the user may type the word on a computer containing an online
version of the dictionary - which then returns the corresponding entry on
its screen. Especially in the case of large dictionaries with several volumes
and extensive cross-referencing, the electronic version is considerably more
user-friendly to the computer-literate end-user than the corresponding hard-
copy.
Electronic lexical lookup is based on matching the unanalyzed surface
of the word in question with the lemma of the online entry, as shown in the
following schema:

Figure 4. Matching an unanalyzed surface form onto a key

There exist several techniques for matching a given surface form automati-
cally with the proper entry in an electronic lexicon.9
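
A minimal sketch of such a lookup is given below; the toy lexicon and its entry are illustrative assumptions and do not reproduce CCELD content.

```python
# Minimal sketch of electronic lexical lookup: the unanalyzed surface form is
# the key into a lexicon indexed by all word forms of each entry. The entry
# content is a toy illustration, not taken from the CCELD.
LEXICON = {}

def add_entry(lemma, other_forms, entry):
    for form in [lemma] + other_forms:
        LEXICON[form] = entry

add_entry("decline", ["declines", "declining", "declined"],
          {"lemma": "decline",
           "readings": ["verb (intransitive)", "verb (V+O or V+to-INF)", "noun"]})

def lookup(surface):
    return LEXICON.get(surface.lower())   # None if the form is not in the lexicon

print(lookup("declined")["lemma"])        # -> decline
```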

The method indicated in figure 4 is also used for the automatic word
form recognition in a computational model of natural language communica-
tion, e.g. Database Semantics. It is just that the format and the content of
the lexical descriptions are different.10 This is because the entries in a dic-
tionary are for human users who already have natural language understand-
ing, whereas the entries in an online lexicon are designed for building lan-
guage understanding in an artificial agent.

5. Concept types and concept tokens

The basic concepts in the agent's head are provided by the external inter-
faces for recognition and action. Therefore, an artificial cognitive agent
must have a real body interacting with the surrounding real world. The
implementation of the concepts must be procedural because natural organ-
isms as well as computers require independence from any metalanguage.11
It follows that a truth-conditional or Tarskian semantics cannot be used.12
According to the procedural approach, a robot understands the concept
of shoe, for example, if it is able to select the shoes from a set of different
objects, and similarly for different colours, different kinds of locomotion
like walking, running, crawling, etc. The procedures are based on concept
types, defined as patterns with constants and restricted variables, and used
at the context level for classifying the raw input and output data.13
As an example, consider the following schema showing the perception
of an agent-external square (geometric shape) as a bitmap outline which is
classified by a corresponding concept type and instantiated as a concept
token at the context level:

Figure 5. Concept types at the context and language level

The necessary properties,14 shared by the concept type and the corresponding concept token, are represented by four attributes for edges and four attributes for angles. Furthermore, all angle attributes have the same value,
namely the constant "90 degrees" in the type and the token. The edge at-
tributes also have the same value, though it is different for the type and the
token.
The accidental property of a square is the edge length, represented by
the variable a in the type. In the token, all occurrences of this variable have
been instantiated by a constant, here 2 cm. Because of its variable, the type
of the concept square is compatible with infinitely many corresponding
tokens, each with another edge length.
At the language level, the type is reused as the literal meaning of the
English surface form square, the French surface form carré and the Ger-
man surface form Quadrat, for example. The relation between these differ-
ent surface forms and their common meaning is provided by the different
conventions of these different languages. The relation between the meaning
at the language level and the contextual referent at the context level is
based on matching using the type-token relation.
The representation of a concept type and a concept token in figure 5 is
of a preliminary holistic nature, intended for simple explanation.15 How
such concepts are exactly implemented as procedures and whether these
procedures are exactly the same in every agent is not important. All that is
required for successful communication is that they provide the same results
(relative to a suitable granularity) in all members of a language community.
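
As an illustration only (not Hausser's implementation), the type-token relation for square can be rendered as a pattern with a restricted variable for the edge length that is matched against candidate tokens:

```python
# Toy rendering of the type-token relation for the concept "square": the type
# fixes four 90-degree angles and four equal edges of unspecified length; a
# token instantiates the edge length with a constant (e.g. 2 cm).
SQUARE_TYPE = {"angles": [90, 90, 90, 90], "edges": ["a", "a", "a", "a"]}

def matches_square_type(token):
    """True if the token instantiates the square type."""
    if token["angles"] != SQUARE_TYPE["angles"]:
        return False
    edges = token["edges"]
    return len(edges) == 4 and len(set(edges)) == 1 and edges[0] > 0

print(matches_square_type({"angles": [90] * 4, "edges": [2, 2, 2, 2]}))  # True
print(matches_square_type({"angles": [90] * 4, "edges": [2, 3, 2, 3]}))  # False
```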

6. Proplets

Defining a basic meaning like square as a procedure for recognition and
action is only the first step to make an artificial agent understand. Leaving
aside questions of whether or not there is a small set of "semantic primi-
tives" (Wierzbicka 1991: 6-8) from which all other meanings can be built,
and of whether or not all natural languages code content in the same way
(Nichols 1992), let us turn to the form of lexical entries in DBS.
Starting from a basic meaning, the lexical entries add morpho-syntactic
properties such as part of speech, tense in verbs, number in nouns, etc.,
needed for grammaticalized aspects of meaning, syntactic agreement, or
both. These properties are coded (i) in a way suitable for computational
interpretation and (ii) as a data structure fulfilling the following require-
ments:
First, the lexical entries of DBS are designed to provide for an easy
computational method to code the semantic relations of functor-argument
and coordination structure between word forms. Second, they support a
computationally straightforward matching procedure, needed (i) for the
application of rules to their input and (ii) for the interaction between the
language and the context level inside the cognitive agent. Third, they code
the semantic relations in complex expressions in an order-free manner, so
that they can be stored in a database in accordance with the needs of stor-
age in the hearer mode and of retrieval in the speaker mode.
The format satisfying these linguistic and computational requirements is that of flat (non-recursive) feature structures called proplets. As an ex-
ample consider the lexical analysis of the English word surface form square
as a noun (as in "Anna drew a square"), as a verb (as in "Lorenz squared
his account") and as an adjective (as in "Jacob has a square napkin").
Figure 6. Lexical analysis of square in DBS

These proplets contain the same concept type square (illustrated in figure
5) as the value of their respective core attributes, i.e. noun, verb and adj
providing the part of speech. Different surface forms are specified as values
of the surface attribute and different morpho-syntactic properties16 are
specified as values of the category and semantics attributes. For example,
the verb forms are differentiated by the combinatorially relevant cat values
ns3' a' v, n-s3' a' v, n' a' v and a' be, whereby ns3' indicates a valency slot
(Herbst et al. 2004; Herbst and Schüller 2008) for a nominative 3rd person
singular noun, n-s3' for a nominative non-3rd person singular noun, n' for a
nominative of any person or number, and a' for a noun serving as an accu-
sative. They are further differentiated by the sem values pres, past/perf and
prog for tense and aspect.
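
A minimal sketch of proplets as flat records is given below; the verb's cat and sem values follow the text, while the noun and adjective values are placeholder assumptions.

```python
# Proplets as flat, non-recursive attribute-value records (sketch, not
# Hausser's actual notation). The verb's cat and sem values follow the text
# (valency slots n', a' plus v, and the sem value "past"); the noun and
# adjective values are placeholders.
square_noun = {"sur": "square",  "noun": "square", "cat": ["sn"],            "sem": ["sg"]}
square_verb = {"sur": "squared", "verb": "square", "cat": ["n'", "a'", "v"], "sem": ["past"]}
square_adj  = {"sur": "square",  "adj":  "square", "cat": ["adn"],           "sem": ["pos"]}

# Flatness means that no proplet contains another proplet as a value, which is
# what makes proplets storable and retrievable as ordinary database records.
```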

This method of characterizing variations in lexical meaning by inserting
the same concept as a core value into different proplet structures applies
also to the word decline:

Figure 7. Lexical analysis of decline in DBS

The intransitive and the transitive verb variants are distinguished by the
absence versus presence of the a' valency position in the respective cat val-
ues. The verbs and the noun are distinguished by their respective core at-
tributes verb and noun as well as by their cat and sem values. The possible
variations of the base form surface forms correspond to those in figure 6.

7. Grammatical analysis in the hearer mode of DBS

Compared to the CCELD (1987) dictionary entries for decline (cf. figure
1), the corresponding DBS proplets in figure 7 may seem rather meagre.
However, in contrast to dictionary entries, proplets are not intended for
being read by humans. Instead, proplets are a data structure designed for
processing by an artificial agent. The computational processing is of three
kinds, (i) the hearer mode, (ii) the think mode and (iii) the speaker mode.
Together, they model the cycle of natural language communication.17
In the hearer mode, the processing establishes (i) the semantic relations
of functor-argument and coordination structure between proplets (horizon-
tal relations) and (ii) the pragmatic relation of reference between the lan-
guage and the context level (vertical relations, cf. figure 3). In the think
mode, the processing is a selective activation of content in the agent's
memory (Word Bank) based on navigating along the semantic relations
between proplets and deriving new content by means of inferences.18 In the
speaker mode, the navigation is used as the conceptualization for language
production.

Establishing semantic relations in the hearer mode is based solely on (i) the time-linear order of the surface word forms and (ii) a lexical lookup
provided by automatic word form recognition. As an example, consider the
syntactic-semantic parsing of "Julia declined the offer.", based on the DBS
algorithm of LA-grammar.

Figure 8. Time-linear derivation establishing semantic relations

The analysis is surface compositional in that each word form is analyzed as a lexical proplet (cf. lexical lookup, here using simplified proplets). The
derivation is time-linear, as shown by the stair-like addition of a lexical
proplet in each new line. Each line represents a derivation step, based on a
rule application. The semantic relations are established by no more and no
less than copying values, as indicated by diagonal arrows.19

The result of this derivation is a representation of content as an order-
free set of proplets. Given that the written representation of an order-free
set requires some order, though arbitrary, the following example uses the
alphabetical order of the core values:

Figure 9. Content of "Julia declined the offer."

The proplets are order-free because the grammatical relations between them
are coded solely by attribute-value pairs (for example, [arg: Julia offer] in
the decline proplet and [fnc: decline] in the Julia proplet) - and not in terms
of dominance and precedence in a hierarchy. As a representation of content,
the language-dependent surface forms are omitted. Compared to figure 8,
the proplets are shown with additional cat and sem features.
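
The order-free coding can be sketched as follows: in a simplified rendering of figure 9, the grammatical relations of "Julia declined the offer." are fully specified by the arg and fnc values, so the storage order of the proplets is irrelevant.

```python
# Sketch of the order-free content of "Julia declined the offer." (simplified
# proplets, following figure 9): relations are coded only by arg/fnc values,
# not by the position of the proplets in the collection.
content = [
    {"noun": "Julia",   "fnc": "decline"},
    {"verb": "decline", "arg": ["Julia", "offer"], "sem": ["past"]},
    {"noun": "offer",   "fnc": "decline"},
]

def arguments_of(proplets, verb):
    """Retrieve the arguments of a verb irrespective of storage order."""
    for proplet in proplets:
        if proplet.get("verb") == verb:
            return proplet["arg"]

print(arguments_of(content, "decline"))            # ['Julia', 'offer']
print(arguments_of(reversed(content), "decline"))  # same result in any order
```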

8. Abstract coding of semantic relations

Linguistically, the DBS derivation in figure 8 and the result in figure 9 are
traditional in that they are based on explicitly coding functor-argument (or
valency) structure20 as well as morpho-syntactic properties. Given that
other formal grammar systems, even within Chomsky's nativism, have
been showing an increasing tendency to incorporate traditional notions of
grammar, there arises the question of whether DBS is really different from
them. After all, Phrase Structure Grammar, Categorial Grammar, Dependency Grammar and their many subschools21 have arrived at a curious state of peaceful coexistence22 in which the choice between them is more a mat-
ter of local tradition and convenience than a deliberate research decision.
DBS is essentially different from the current main stream grammars
mentioned above because DBS hearer mode derivations map the lexical
analysis of a language surface form directly into an order-free set of prop-
lets which is suitable (i) for storage in and retrieval from a database and
thus (ii) suitable for modelling the cycle of natural language communica-
tion.23 This would be impossible without satisfying the following require-
ments:

Requirements for modeling the cycle of communication:


1. The derivation order must be strictly time-linear.
2. The coding of semantic relations may not impose any order on the
items of content.
3. The items of content (proplets) must be defined as flat, non-recursive
structures.

These DBS requirements are incompatible with the other grammars for the
following reasons: (1) and (2) preclude the use of grammatically meaning-
ful tree structures and as a consequence of (3) there is no place for unifica-
tion. Behind the technical differences of method there is a more general
distinction: the current main stream grammars are sign-oriented, whereas DBS is agent-oriented.
For someone working in sign-oriented linguistics, the idea of an agent-
oriented approach may take some getting used to.24 However, an agent-
oriented approach is essential for a scientific understanding of natural lan-
guage, because the general structure of language is determined by its func-
tion25 and the function of natural language is communication.
Like any scientific theory, the DBS mechanism of natural language
communication must be verified. For this, the single most straightforward
method is implementing the theory computationally as a talking robot. This
method of verification is distinct from the repeatability of experiments in
the natural sciences and may serve as a unifying standard for the social
sciences.
Furthermore, once the overall structure of a talking robot (i.e., inter-
faces, components and functional flow, cf. figures 3 and 5, Hausser 2009a)
has been determined, partial solutions may be developed without the danger
of impeding the future construction of more complete systems.26 For exam-
ple, given that the procedural realization of recognition and action is still in
its infancy in robotics, DBS currently makes do with English words as
placeholders for core values. As an example, consider the following lexical
proplets, which are alike except for the values of their sur and noun attri-
butes:

Figure 10. Different core values in the same proplet structure

These proplets represent a class of word forms with the same morpho-syntactic properties. This class may be represented more abstractly as a proplet pattern.27

Figure 11. Representing a class of word forms as a proplet pattern

By restricting the variable a to the core values used in figure 10, the repre-
sentation in figure 11 as a proplet pattern is equivalent to the explicit repre-
sentation of the proplets class in figure 10. Proplet patterns with restricted
variables are used for the base form lexicon of DBS, making it more trans-
parent and saving a considerable amount of space.
In concatenated (non-lexical) proplets, the (i) core meaning and (ii) the
compositional semantics (based on the coding of morpho-syntactic proper-
ties) are clearly separated. This becomes apparent when the core values of
any given content are replaced by suitably restricted variables, as shown by
the following variant of figure 9:

Figure 12. Compositional semantics as a set of proplet patterns

By restricting the variable α to the values decline, buy, eat, or any other transitive verb, β to the values Julia, Susanne, John, Mary or any other proper name, and γ to the values it, offer, proposal, invitation, etc., this
combinatorial pattern may be used to represent the compositional semantics
of a whole set of English sentences, including figure 9.
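
A sketch of such pattern matching, with the restricted variables rendered as sets of admissible core values (the sets are those named above, used purely for illustration):

```python
# Sketch of matching contents against the proplet pattern of figure 12:
# restricted variables are rendered as sets of admissible core values.
ALPHA = {"decline", "buy", "eat"}                   # transitive verbs
BETA  = {"Julia", "Susanne", "John", "Mary"}        # proper names
GAMMA = {"it", "offer", "proposal", "invitation"}   # object nouns

def instantiates_pattern(verb_proplet):
    """True if a transitive-verb proplet instantiates the pattern."""
    if verb_proplet.get("verb") not in ALPHA:
        return False
    subj, obj = verb_proplet["arg"]
    return subj in BETA and obj in GAMMA

print(instantiates_pattern({"verb": "decline", "arg": ["Julia", "offer"]}))      # True
print(instantiates_pattern({"verb": "eat", "arg": ["Susanne", "invitation"]}))   # True (cf. section 9)
```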

9. Collocation

At first glance, figure 12 may seem open to the objection that it does not
prevent meaningless or at least unlikely combinations like Susanne ate the
invitation, i.e. that it fails to handle collocation (which has been one of
Sinclair's main concerns). This would not be justified, however, because
the hearer mode of DBS is a recognition system taking time-linear se-
quences of unanalyzed surface forms as input and producing a content,
represented by an order-free set of proplets, as output. In short, in DBS the
collocations are in the language, not in the grammar.
The Generative Grammars of nativism, in contrast, generate tree struc-
tures of possible sentences by means of substitutions, starting with the S
node. Originally a description of syntactic wellformedness, Generative
Grammar was soon extended to include world knowledge governing lexical
selection. For example, according to Katz and Fodor (1963), the grammar
must characterize ball in the man hit the colorful ball as a round object
rather than a festive social event. In this sense, nativism treats collocations
as part of the Generative Grammar and Sinclair is correct in his frequent
protests against nativist linguists' modelling their own intuitions instead of
looking at "real" language.
In response, generative grammarians have turned to annotating corpora
by hand or statistically (treebanks) for the purpose of obtaining broader
data coverage. For example, the University of Edinburgh and various other
universities are known to have syntactically parsed versions of the BNC.


The parsers used are the RASP, the Minipar, the Charniak and the IMS
parser. Unfortunately, the resulting analyses are not freely available. Yet
even if one of them succeeded in achieving complete data coverage (accord-
ing to some still to be determined standard of wider acceptance) there re-
mains the fact that constituent-structure-based Generative Grammars and
their tree structures were never intended to model communication and are
accordingly unsuitable for it.
In DBS, the understanding of collocations by natural and artificial
agents is based on interpreting (i) the core values and (ii) the functor-
argument and coordination structure of the compositional semantics (as in
figure 8) - plus the embedding into the appropriate context of use and the
associated inferencing. This is no different from the understanding of newly
coined phrases (syntactic-semantic neologisms), which are as much a fact
of life as are collocations.
Speakers or writers utilize the productivity of natural language in word
formation and compositional semantics to constantly coin new phrases.
Examples range from politics (calling the US stimulus package "the largest
generational theft bill on record") via journalism (creative use of navigate
in "President Obama has to navigate varying advice on Afghanistan") to
advertising (contrived alliteration in "Doubly Choc Chip, Bursting with
choc chips in a crunchy chocolaty biscuit base").
Another matter is idioms, such as a drop in the bucket or a blessing in disguise. As frozen non-literal uses, they are either interpretable by the same inferencing as spontaneous non-literal uses (e.g. metaphor, cf. Hausser 2006: 75-78) or they must be learned. For example, an ax(e) to grind may be viewed as similarly opaque (non-compositional or non-Fregean) in syntax-semantics as cupboard is in morphology. Just as cupboard must be equated with kitchen cabinet in the agent's cognition, an axe to grind (attributed to Benjamin Franklin) must be equated with expressing a serious complaint.

10. Context

The attempt of Generative Grammar to describe the tacit knowledge of the
speaker-hearer without the explicit reconstruction of a cognitive agent has
led not only to incorporating lexical selection into the grammar, but also the
context of use. Pollard and Sag (1994), for example, propose a treatment of
context in HPSG which consists in adding an attribute to lexical entries (see
also Green 1997). The values of this attribute are called constraints and
have the form of such definitions28 as
(a) "the use of the name John is legitimate only if the intended referent is
named John."
(b) "the complement of the verb regret is presupposed to be true. "
For a meaningful computational implementation this is sadly inadequate,
though for a self-declared "sign-based" approach it is probably the best it
can do.
Instead of cramming more and more phenomena of language use into
the Generative Grammar, Database Semantics clearly distinguishes be-
tween the agent-external real world and the agent-internal cognition. The
goal is to model the agent, not the external world.29 Whether the model is
successful or not can be verified, i.e. determined objectively, (i) by evaluat-
ing the artificial agent's behaviour in its interaction with its environment
and with other agents and (ii) by observing the agent's cognitive operations directly via the service channel (cf. Hausser 2006: Sect. 1.4).
In the agent's cognition, DBS clearly separates the language and the
context component (cf. figure 3) and defines their interaction via a compu-
tationally viable matching procedure based on the data structure of proplets
(cf. Hausser 2006: Sect. 3.2). In addition, DBS implements three computa-
tional mechanisms of reference for the sign kinds symbol, indexical and name.30 This is the basis for handling the HPSG context definition (a), cited
above, as part of a general theory of signs, whereas definition (b) is treated
as an inference by the agent.
For systematic reasons, DBS develops the context component first, in
concord with ontogeny and phylogeny (cf. Hausser 2006: Sect. 2.1). To
enable easy testing and upscaling, the context component is reconstructed
as an autonomous agent without language. The advantage of this strategy is
that practically all constructs of the context component can be reused when
the language component is added. The reuse, in turn, is crucial for ensuring
the functional compatibility between the two levels.
For example, the procedural definition of basic concepts, pointers and
markers provided by the external interfaces of the context component are
reused by the language component as the core meanings of symbols, in-
dexicals and names, respectively. The context component also provides for
the coding of content and its storage in the agent's memory, for inferencing
on the content and for the derivation of adequate actions, including lan-
guage production.

In human-machine communication, the context component is essential
for reconstructing two of the most basic forms of natural language interac-
tion. One is telling the artificial cognitive agent what to do, which involves
contextual action. The other is the artificial cognitive agent's telling what it
has perceived, which involves contextual recognition.

11. Conclusion

From the linguists' perspective, the learner is for an English learner's dic-
tionary what the artificial cognitive agent is for Database Semantics: each
raises the question of what language skills the learner/artificial agent should
have.
However, the learner already knows how to communicate in a natural
language. Therefore, the goal is to provide her or him with information on how to speak English well, which requires the compilation of an easy-to-use, accurate representation of contemporary English.
Database Semantics, in contrast, has to get the artificial agent to com-
municate with natural language in the first place. This requires the recon-
struction of what evolution has produced in millions of years as an abstract
theory which applies to natural and artificial agents alike.
In other words, Database Semantics must start from a much more basic
level than a learner's dictionary. For DBS, any given natural language re-
quires
- automatic word form recognition for the expressions to be analyzed,
- syntactic-semantic interpretation in the hearer mode, resulting in
- content which is stored in a database and
- selectively activated and processed in the think mode and
- appropriately realized in natural language in the speaker mode.
On the one hand, each of these requirements constitutes a sizeable research
and software project. On the other hand, the basic principles of how lan-
guage communication works is the same for different languages. Therefore,
once the software components for automatic word form recognition, syn-
tactic-semantic parsing, etc. have been developed in principle, they may be
applied to different languages with comparatively little effort.31
Because the theoretical framework of DBS is more comprehensive than
that of a learner's dictionary, DBS can provide answers to some basic ques-
tions. For example, DBS allows us to treat basic meanings in terms of recogni-
tion and action procedures, phenomena of language use with the help of an
explicitly defined context component and collocations produced in the
speaker mode in terms of what the agent was exposed to in the hearer
mode. Conversely, a learner's dictionary as a representation of a language
is much more comprehensive than current DBS and thus provides a high
standard of what DBS must accomplish eventually.

Notes

1 This paper benefited from comments by Thomas Proisl, Besim Kabashi, Jo-
hannes Handl and Carsten Weber (CLUE, Erlangen), Haitao Liu (Communi-
cation Univ. of China, Beijing), Kiyong Lee (Korea Univ., Seoul) and Brian MacWhinney (Carnegie Mellon Univ., Pittsburgh).
2 The UCREL CLAWS5 tag-set is available at http://ucrel.lancs.ac.uk/
claws5tags.html.
3 Cf. Hausser ([1999] 2001: 295-299).
4 Putnam attributes the same ontological status to the meanings of language as
Mathematical Realism attributes to mathematical truths: they are viewed as
existing eternally and independently of the human mind. In other words, ac-
cording to Putnam, language meanings exist no matter whether they have
been discovered by humans or not.
What may hold for mathematics is less convincing in the case of language.
First of all, there are many different natural languages with their own charac-
teristic meanings (concepts). Secondly, these meanings are constantly evolv-
ing. Thirdly, they have to be learned and using them is a skill. Treating lan-
guage meanings as pre-existing Platonic entities out there in the world to be
discovered by the members of the language communities is especially doubt-
ful in the case of new concepts such as transistor or ticket machine.
5 The importance of agents with a real body (instead of virtual agents) has been
emphasized by emergentism (MacWhinney 2008).
6 While language and non-language processing use the same interfaces for
recognition and action, figure 3 distinguishes channels dedicated to language
and to non-language interfaces for simplicity: sign recognition and sign syn-
thesis are connected to the language component; context recognition and con-
text action are connected to the context component.
7 Cf. Hausser (2001: 75-77); Hausser (2006: 27-29).
8 For a more extensive taxonomy see the 10 SLIM states of cognition in
Hausser (2001: 466-473).
9 See Aho and Ullman (1977: 336-341).
10 Apart from their formats, a dictionary and a system of automatic word form recognition differ also in that the entries in a dictionary are for words (represented by their base form), whereas automatic word form recognition ana-
lyzes inflectional, derivational and compositional word forms on the basis of
a lexicon for allomorphs or morphemes (cf. Hausser 2001: 241-257). Statisti-
cal tagging also classifies word forms, but uses transitional likelihoods rather
than a compositional analysis based on a lexical analysis of the word form
parts.
11 Cf. Hausser (2001: 82-83).
12 Cf. Hausser (2001: 375-387).
13 For a more detailed discussion of the basic mechanisms of recognition and
action see Hausser (2001: 53-61) and Hausser (2006: 54-59).
14 Necessary as opposed to accidental (kata sumbebekos), as used in the phi-
losophical tradition of Aristotle.
15 For a declarative specification of memory-based pattern recognition see
Hausser (2005).
16 For simplicity, proplets for the genitive singular and plural forms of the noun
and any comparative and superlative forms of the adjective are omitted. Also,
the attributes nc (next conjunct) and pc (previous conjunct) for the coordina-
tion of nouns, verbs and adjectives have been left out. For a detailed explana-
tion of the lexical analysis in Database Semantics see Hausser (2006: 51-54,
209-216).
17 For a concise description of this cycle see Hausser (2009a).
18 Cf. Hausser (2006: 71-74).
19 For more detailed explanations, especially the function word absorptions in
line 3 and 4, see Hausser (2006: 87-90) and Hausser (2009a, 2009b).
20 In addition, the DBS method is well-suited for handling extrapropositional
functor-argument structure (subclauses) and intra- and extrapropositional co-
ordination including gapping, as shown in Hausser (2006: 103-160).
21 Known by acronyms such as TG (with its different manifestations ST, EST,
REST and GB), LFG, GPSG, HPSG, CG, CCG, CUG, FUG, UCG, etc.
22 This state is being justified by a whole industry of translating between the
different grammar systems and proposing conjectures of equivalence. An
early, pre-statistical instance is Sells (1985), who highlights the common core
of GB, GPSG and LFG. More recent examples are Andersen et al. (2008),
who propose a treebank based on Dependency Grammar for the BNC and Liu
and Huang (2006) for Chinese; Hockenmaier and Steedman (2007) describe
CCGbank as a translation of the Penn Treebank (Marcus, Santorini and Mar-
cinkiewicz 1993) into a corpus of Combinatory Categorial Grammar deriva-
tions.
23 A formal difference is that LA-grammar is the first and so far the only algo-
rithm with a complexity hierarchy which is orthogonal to the Chomsky hier-
archy (Hausser 1992).
24 Also, there seems to be an irrational fear of creating artificial beings resem-
bling humans. Such homunculi, which occur in the earliest of mythologies,
are widely regarded as violating the taboo of doppelganger similarity (Girard
1972). Another matter is the potential for misuse - which is a possibility in
any basic science with practical ramifications. Misuse of DBS (in some ad-
vanced future state) must be curtailed by developing responsible guidelines
for clearly defined laws to protect privacy and intellectual property while
maintaining academic liberty, access to information and freedom of dis-
course.
25 This is in concord with Darwin's theory of evolution in which anatomy, for
example, will be structured according to functions associated with use.
26 The recent history of linguistics contains numerous examples of naively treat-
ing morphological as well as semantic phenomena in the syntax, pragmatic
phenomena in the semantics, etc. These are serious mistakes, some of which
have derailed scientific progress for decades.
27 MacWhinney (2005) describes "feature-based patterns" arising from "item-
based patterns", which resembles our abstraction of proplet patterns from
classes of corresponding proplets.
28 These definitions are reminiscent of Montague's (1974) meaning postulates
for constraining a model structure of possible worlds, defined purely in terms
of set theory. Supposed to represent spatio-temporal stages of the actual world
plus counterfactual worlds with unicorns, etc., a realistic definition or pro-
gramming of such a model structure is practically impossible. Therefore, it is
always defined "in principle" only. Cf. Hausser (2001: 392-395).
29 This is in contrast to the assumptions of truth-conditional semantics, includ-
ing Montague Grammar, Situation Semantics, Discourse Semantics, or any
other metalanguage-based approach. Cf. Hausser (2001: 371-426), Hausser
(2006: 25-26).
30 Cf. Hausser (2001: 103-107), Hausser (2006: 29-34). The type-token relation
between corresponding concepts at the language and the context level illus-
trated in 5.1 happens to be the reference mechanism of symbols.
31 For example, given (i) an on-line dictionary of a new language to be handled
and (ii) a properly trained computational linguist, an initial system of auto-
matic word form recognition can be completed in less than six months. It will
provide accurate, highly detailed analyses of about 90% of the word form
types in a corpus.

References

Aho, Alfred Vaino and Jeffrey David Ullman


1977 Principles of Compiler Design. Reading, MA: Addison-Wesley.
Andersen, Øistein, Julien Nioche, Edward John Briscoe and John A. Carroll
2008 The BNC parsed with RASP4UIMA. In Proceedings of the Sixth
Language Resources and Evaluation Conference (LREC'08),
Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mari-
ani, Jan Odijk, Stelios Piperidis and Daniel Tapias (eds.), 865-860.
Marrakech, Morocco: European Language Resources Association
(ELRA).
Girard, René
1972 La violence et le sacré. Paris: Bernard Grasset.
Green, Georgia
1997 The structure of CONTEXT: The representation of pragmatic restric-
tions in HPSG. Proceedings of the 5th Annual Meeting of the Formal
Linguistics Society of the Midwest, James Yoon (ed.), 215-232.
Studies in the Linguistic Sciences.
Hausser, Roland
1992 Complexity in left-associative grammar. Theoretical Computer Sci-
ence 106 (2): 283-308.
Hausser, Roland
2001 Foundations of Computational Linguistics: Human-Computer Com-
munication in Natural Language. Berlin/Heidelberg/New York:
Springer. 2nd edition. First published in 1999.
Hausser, Roland
2005 Memory-based pattern completion in database semantics. Language
and Information 9 (1): 69-92.
Hausser, Roland
2006 A Computational Model of Natural Language Communication: In-
terpretation, Inference and Production in Database Semantics. Ber-
lm/Heidelberg/New York: Springer.
Hausser, Roland
2009a Modeling natural language communication in database semantics. In
Proceedings of the APCCM 2009, Markus Kirchberg and Sebastian
Link (eds.), 17-26. Australian Computer Society Inc., CRPIT, Vol.
96. Wellington, New Zealand: ACS.
Hausser, Roland
2009b From word form surfaces to communication. In Information Model-
ling and Knowledge Bases XXI, Hannu Kangassalo, Yasushi Kiyoki
and Tatjana Welzer (eds.), 37-58. Amsterdam: IOS Press Ohmsha.
Herbst, Thomas, David Heath, Ian Roe and Dieter Goetz
2004 A Valency Dictionary of English: A Corpus-Based Analysis of the
Complementation Patterns of English Verbs, Nouns and Adjectives.
Berlin: Mouton de Gruyter.
Herbst, Thomas and Susen Schüller
2008 Introduction to Syntactic Analysis: A Valency Approach. Tübingen:
Gunter Narr.
Hockenmaier, Julia and Mark Steedman
2007 CCGbank: A corpus of CCG derivations and dependency structures
extracted from the Penn Treebank. Computational Linguistics 33 (3):
355-396.
Katz, Jerrold Jacob and Jerry Alan Fodor
1963 The structure of a semantic theory. Language 39: 170-210.
Kirk, John M.
1998 Review of T. McEnery and A. Wilson 1996 and of G. Barnbrook
1996, Computational Linguistics 24 (2): 333-335.
Liu, Haitao and Wei Huang
2006 A Chinese dependency syntax for treebanking. In Proceedings of the
20th Pacific Asia Conference on Language, Information, Computa-
tion, 126-133. Beijing: Tsinghua University Press.
MacWhinney, Brian James
2005 Item-based constructions and the logical problem. Association for
Computational Linguistics (ACL), 46-54. Morristown, NJ: Associa-
tion for Computational Linguistics (ACL).
MacWhinney, Brian James
2008 How mental models encode embodied linguistic perspective. In
Embodiment, Ego-Space and Action, Roberta L. Klatzky, Marlene
Behrmann and Brian James MacWhinney (eds.), 360-410. New
York: Psychology Press.
Marcus, Mitchell P., Beatrice Santorini and Mary Ann Marcinkiewicz
1993 Building a large annotated corpus of English: The Penn Treebank.
Computational Linguistics 19: 313-330.
Montague, Richard
1974 Formal Philosophy. New Haven: Yale University Press.
Nichols, Johanna
1992 Linguistic Diversity in Space and Time. Chicago: University of Chi-
cago Press.
Pollard, Carl and Ivan Sag
1994 Head-Driven Phrase Structure Grammar. Stanford: CSLI.
Putnam, Hilary
1975 The meaning of "meaning". In Mind, Language and Reality: Phi-
losophical Papers, vol. 2, Hilary Putnam (ed.), 215-271. Cambridge:
Cambridge University Press.
Sinclair, John McH. (ed.)
1987 Collins COBUILD English Language Dictionary. London/Glasgow:
Collins.
Sinclair, John McH.
1991 Corpus, Concordance, Collocation. Oxford: Oxford University
Press.
Sells, Peter
1985 Lectures on Contemporary Syntactic Theories: An Introduction to
GB Theory, GPSG and LFG. Stanford: CSLI.
Teubert, Wolfgang
2008 [Corpora-List] Bootcamp: 'Quantitative Corpus Linguistics with R' -
re Louw's endorsement, http://mailman.uib.no/public/corpora/2008-
August/007089.html.
Teubert, Wolfgang and Krishnamurthy, Ramesh (eds.)
2007 Corpus Linguistics. London: Routledge.
Wierzbicka, Anna
1991 Cross-Cultural Pragmatics: The Semantics of Human Interaction.
Berlin: Mouton de Gruyter.

Corpus

BNC The British National Corpus, version 3 (BNC XML Edition). 2007.
Distributed by Oxford University Computing Services on behalf of
the BNC Consortium, http://www.natcorp.ox.ac.uk/.
Chunk parsing in corpora

Günther Görz and Günter Schellenberger

1. What on earth is a chunk?

1.1. Basic features of chunks

A fundamental analytical task in Natural Language Processing (NLP) is the


segmentation and labeling of texts. In a first step, texts are broken up into
sentences as sequences of word forms (tokens).
Chunking in general means to assign a partial structure to a sentence.
Tagging assigns to the tokens labels which represent word specific and
word form specific information such as the word category and morphologi-
cal features. Chunk parsing regards sequences of tokens and tries to iden-
tify structural relations within and between segments.
Chunk parsing as conceived by Abney (1991) originated from a psycho-
linguistic motivation (1991: 257):
I begin with an intuition: when I read a sentence, I read it a chunk at a time.
For example, the previous sentence breaks up something like this:
(1) [I begin] [with an intuition]: [when I read] [a sentence], [I read it] [a
chunk] [at a time]
These chunks correspond in some way to prosodic patterns. It appears, for
instance, that the strongest stresses in the sentence fall one to a chunk, and
pauses are most likely to fall between chunks. Chunks also represent a
grammatical watershed of sorts. The typical chunk consists of a single con-
tent word surrounded by a constellation of function words, matching a fixed
template ...There is psychological evidence for the existence of chunks...
In the context of corpus linguistics, chunk parsing is regarded as an effi-
cient and robust approach to parsing at the cost of not trying to deal with all
of language. Hence, a good coverage on given corpora can be achieved,
also in the presence of errors as it is the case with (transcribed) speech.
Although the problem is defined on the semantic-pragmatic level, it can be
captured automatically only by syntactic means - notice the analogy to the
problem of collocations. Chunks are understood as "non-overlapping re-
gions of text, usually consisting of a head word (such as a noun) and the
adjacent modifiers and function words (such as adjectives or determiners)"
(Bird, Klein and Loper 2006). Technically, there are two main motivations
for chunking: to locate information - for information retrieval - or to ig-
nore information, e.g. to find evidence for linguistic generalizations in lexi-
cographic and grammatical research.
The grouping of adjacent words into a single chunk, i.e. a subsequence,
should be faithful regarding the meaning of the original sentence.
The sentence
(1) The quick brown fox jumps over the lazy dog.
will be represented as: ((FOX) (JUMPS) over (DOG)) where (FOX) and
(DOG) represent the noun chunks (the quick brown fox) with head fox and
(the lazy dog) with head dog, respectively. In a similar manner, (JUMPS) is
a one-word verb chunk headed by jumps.
The results of chunk parsing are shorter and easier to handle, e.g. for
computer aided natural language tools. Briefly, chunk parsing follows a
divide-and-conquer principle, as illustrated in the following commutative
diagram:

[diagram not reproduced]

In this diagram, LF(x,y,...) represents some logical form, e.g. a predicate
like jumps over(who, over-whom), which can be extracted much more eas-
ily from the compact sequence on the lower left side of the diagram.
Chunking serves as a normalization step; the meaning of the original sen-
tence can be derived from that of the chunked result by substituting the
chunks (bottom line) with their origin. As the chunked sequence is much
shorter than the original sentence, the implementation of the meaning
function m(.) is easier.
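To make this concrete, the following sketch shows how such a base chunking of the fox sentence could be obtained with the NLTK toolkit listed in the appendix below. The POS tags are supplied by hand to keep the example self-contained, and the grammar and the chunk labels NC and VC are our own illustration, not the authors' setup.

```python
# A minimal base-chunking sketch with NLTK's regular-expression chunker.
# POS tags are hard-coded; the grammar and labels (NC, VC) are illustrative.
import nltk

tagged = [("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]

grammar = r"""
  NC: {<DT>?<JJ>*<NN>}   # noun chunk: optional determiner, adjectives, noun
  VC: {<VB.*>}           # one-word verb chunk
"""
chunker = nltk.RegexpParser(grammar)
print(chunker.parse(tagged))
# -> (S (NC The/DT quick/JJ brown/JJ fox/NN) (VC jumps/VBZ) over/IN
#       (NC the/DT lazy/JJ dog/NN))
```

The printed tree corresponds to the bracketing ((FOX) (JUMPS) over (DOG)) discussed above, with over left outside any chunk.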

1.2. Chunk parsing and full parsing

In view of examples such as (1) above it is intuitive to think of chunk pars-


ing as an intermediate step towards full parsing The type and head of a
chunk are taken as non-terminal symbol and literal substitute, respectively,
in a recursive process. After the first stage of base segmentation, adjacent
chunks and isolated words can be wrapped up to chunks of a higher level
finally reaching a full constituent tree.
The advantages of introducing an additional layer of processing are
based on the assumption:
Proposition 1 Chunks have a much more simple internal structure than the
sequence of chunks inside higher level constructions, including sentences.
In order to support the intentions mentioned above, some constraints are
imposed on chunks:
Proposition 2 Chunks...
1. never cross constituent boundaries;
2. form true compact substrings of the original sentence;
3. are implemented based on elementary features of the word string, like
the POS tags or word types, avoiding deep lexical or structural param-
eterization incorporated in their implementation;
4. are not recursive.
Rule (3) is required to allow for fast and reliable construction of chunking
structures based on their simple nature. A closer inspection of rule (4),
which gives a formal definition of simplicity, shows some consequences:
- Chunks do not contain other chunks;
- Recursive rules like the following are not allowed (NC: Noun Chunk):
NC → Det NC
- Recursive rule systems, e.g. (where ADVC: Adverbial Chunk)
ADVC → Adj ADVC?
NC → Det ADVC? N
can be 'flattened' to regular structures like
NC → Det Adj? Adj? Adj? N
('?' indicates that the preceding element is optional)

1.3. Use of chunks in spoken language processing

Our work in the area of speech processing systems, including dialogue


control, gave us additional motivation for chunk parsing. In the special case
of (transcribed) speech corpora, usually we don't have many well-formed
sentences and grammatically perfect constituents. A lot of additional ambi-
guities come up, because the usual differences in spelling cannot be found,
e.g. May vs. may or the German verb essen vs. the German proper noun
(name of the city) Essen. Furthermore, there are no punctuation marks,
including marks for the beginning and end of sentences, which again raises
many reading ambiguities. This means that dialogue systems have to follow
multiple different paths of interpretation. Therefore, search spaces tend to
become much bigger and we are facing time and memory limitations due to
combinatorial explosion. In practical systems, recognition errors have to be
taken into account and they should be identified as soon as possible.
Base chunks meeting the requirements of Proposition (2) are therefore
often the ultimate structures for spoken language systems as input to higher
analysis levels which have to assign semantic roles to the parts of chunked
input.
To summarize, reliable syntactic structures other than chunks are often
not available in speech processing.

1.4. Limitations and problems of chunking in English

Although on a first glance the idea of chunk parsing promises to push natu-
ral language understanding towards realistic applications, things are not as
easy as they seem.
Named entities such as John Smith or The United Kingdom should be
identified as soon as possible in language processing. As a consequence,
named entity recognition has to be included in chunk parsing.
But merging subsequent nouns into the same chunk cannot be intro-
duced as a general rule. In fact, the famous example (1) already contains a
trap making it easy for chunk parsers to stumble over:
- jumps could be taken as a noun (plural of jump) and merged into the
preceding noun chunk which is prohibitive for semantic analysis.
- Some more examples of this kind are:
(2) The horse leaps with joy.
(3) This makes the horse leap with joy.
(4) Horse leaps are up to eight meters long.
- Similar considerations hold for named entities comprising a sequence
of nouns or terms such as
(5) System under construction, matter of concern, ...
- Chunk Parsers usually do not wrap compound measurement expres-
sions like
(6) twenty meters and ten centimetres
These examples reveal a fundamental problem of implementing rule (1) in
Proposition (2): this rule expresses a semantic constraint - how can it be
implemented consistently with the other rules? How should a chunk parser
respect boundaries of structures which are to be built later based upon its
own results?
Problems which go even deeper show up on closer inspection of verb
phrases as jumps over ... or leap with ... in the examples (1) and (2)-(4), resp.
In example (1), it might not be helpful to merge the preposition "over"
with the chunk (DOG) into a prepositional phrase. Here, the preposition
qualifies the verb, not the object. More precisely, "over" qualifies the ob-
ject's role: the dog is not a location, not a temporal unit etc., but the af-
fected entity of the predicate.
Again, whether or not a preposition following a word forms a phrasal
expression with that verb and should be merged into a chunk cannot be
decided by following the rules for chunks above. This holds, of course, for
verbal expressions spanning several words (e.g. "put up with" or "add up
to"), but also for expressions like "as far as I know". Missing or wrong
chunking could in such cases lead later processing steps into a trap again.
At least, in English, prepositions or other material that considerably change
the reading of a verb directly follow that verb or are kept close to it.

1.5. Limitations of chunking in German

The definition of chunks as subsequences of the original sentence provides


a serious coverage limitation for German with its relatively frequent dis-
continuous structures.
The burden of ambiguities between verbs and nouns in German is
somehow eased by capitalization rules, which of course do not help in the
case of speech processing. In addition, German sentences may start with a
verb; furthermore it is not common to combine the words which make up
geographical place names with dashes. So, instead of
(7) Stratford-upon-Avon
there exist
(8) Weil der Stadt
(9) Neustadt an der Waldnaab
(10) Neuhausen auf den Fildern
What is even worse is that the constraints easing the process of verb phrase
chunking in English do not hold for German prefixed verbs which are sepa-
rable in present tense. On the contrary, the space in between verbal stem
and split prefix does not only allow for constituents of arbitrary length, it
even has to include at least passive objects:
(11) Ich hebe Äpfel, welche vom Baum gefallen sind, niemals auf.
'I never pick up apples fallen down from the tree.'
(12) Ich hebe Äpfel niemals auf, welche vom Baum gefallen sind.
(13) ??Ich hebe niemals auf Äpfel, welche vom Baum gefallen sind.
The facts concerning German separable composite verbs in present tense
compared with phrasal verbs in English can be extended without exceptions
to auxiliary and modal constructions, including past, perfect and future.
This, amongst other linguistic phenomena, drastically limits the coverage of
chunk parsing for German. However, even for German dialogue systems,
chunking of at least noun phrases is necessary at the beginning of process-
ing.

2. Using corpora: from chunking to meaning

Our introduction suggests that a good start for chunk parsing would be to
commence with regular rules. Unfortunately, pure rule-based approaches
lack sufficient coverage. Therefore, to amend the performance of chunk
parsing, examples from corpora have to be included.

2.1. IOB tagging

A general and approved approach to solve the problems as introduced in


the preceding section is to incorporate annotated samples which are com-
pared with text to be analysed. To make this approach applicable to base
chunking, the task of chunk parsing is first transformed into a problem of
tagging and prototyping as follows:
- Given a sequence of words w_1 w_2 ... w_n,
assign to it a corresponding sequence of tags t_1 t_2 ... t_n, with
t_i ∈ {B, I, O}: B marks a token that begins a chunk, I a token inside a
chunk, and O a token outside any chunk.
- Optionally, a syntactic classification can be attached to B-tags: B-NP,


B-VP, etc.
- Alternatively, this subclassification is done separately.
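As a minimal illustration, the chunk tree of section 1.1 can be converted into this IOB encoding with NLTK's tree2conlltags utility; the NC/VC grammar is again our illustrative choice rather than part of the task definition.

```python
# Sketch: deriving the IOB encoding of section 2.1 from a chunk tree.
from nltk import RegexpParser
from nltk.chunk import tree2conlltags

tagged = [("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]
tree = RegexpParser("NC: {<DT>?<JJ>*<NN>}\nVC: {<VB.*>}").parse(tagged)

for word, pos, iob in tree2conlltags(tree):
    print(word, pos, iob)   # e.g. The DT B-NC / quick JJ I-NC / ... / over IN O
```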

2.2. Solutions by means of statistical inference

One class of solutions of the IOB-tagging problem is to introduce numeri-
cal functions based on numerical features of the w_i:
- For each word w_i in w_1, w_2, ..., w_n,
assign a vector v_i = (v_i,1, v_i,2, ..., v_i,m) of features, i.e. numerical
functions.
- Examples of definitions for v_i,j:
- v_i,1 = 1, if w_i is a noun, 0 otherwise
- v_i,2 = 1, if w_i is a verb, 0 otherwise
- v_i,30 = 1, if w_i = 'Company', 0 otherwise
- POS-tag of w_i
- Prefix or suffix
- With the exception of the beginning and end of a sentence or a se-
quence, the features v_i,j are independent from i
- Each feature is a function of a window w_i-l, ..., w_i, ..., w_i+l of fixed
length around w_i
- t_i is expressed as a function of the feature vector assigned to w_i, param-
eterized by a parameter vector α independent of i:
t_i = F_α(v_i)
The parameters α are calculated so as to generate optimum results given a
corpus equipped with IOB-tags which in turn are assumed to be correct:
Proposition 3
- Given a tagged corpus R = (w_0,t_0), (w_1,t_1), ..., (w_N,t_N), called the train-
ing corpus,
- find α to minimize the sum over i = 0, ..., N of ||F_α(v_i) - t_i||
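The following sketch shows what such a feature function might look like in practice; the concrete feature set and the window of one word to either side are illustrative assumptions, not the features used in the systems cited below.

```python
# Sketch of window-based feature extraction for IOB tagging (features are illustrative).
def features(tagged_sentence, i):
    """Feature vector v_i (as a dict) for position i of a POS-tagged sentence."""
    word, pos = tagged_sentence[i]
    prev_pos = tagged_sentence[i - 1][1] if i > 0 else "BOS"                         # sentence start
    next_pos = tagged_sentence[i + 1][1] if i < len(tagged_sentence) - 1 else "EOS"  # sentence end
    return {
        "is_noun": pos.startswith("NN"),     # v_i,1
        "is_verb": pos.startswith("VB"),     # v_i,2
        "word=Company": word == "Company",   # lexical feature, cf. v_i,30
        "pos": pos,                          # POS tag of w_i
        "prev_pos": prev_pos,                # POS tag of w_{i-1}
        "next_pos": next_pos,                # POS tag of w_{i+1}
        "suffix3": word[-3:],                # suffix feature
    }
```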

2.3. Preparing an annotated corpus

The task of selecting the function F_α(.) is far from being trivial and so is
the minimization task for the parameters; there are several approaches to
finding a satisfactory result. In general, the problem is given a geometric
interpretation: the vectors of evaluated features are taken as points in a
multidimensional space, each equipped with a tag, namely the assigned
IOB-tag. The task then is to:
- Identify clusters of points carrying identical tag;
- Express membership to or distance from clusters by appropriate func-
tions.
- Example: The Support Vector Machine SVM separates areas of differ-
ent tags with hyperplanes of maximum coverage.
Calculating is the true proficiency of computers, so as soon as appropriate
features and F_α(.) are selected, chunk parsing can be done efficiently as
required.
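A minimal sketch of this learning step, assuming scikit-learn as the library (the text names SVMs, not a particular implementation) and the features() function sketched above:

```python
# Sketch: learning the mapping F_alpha from feature vectors to IOB tags with a linear SVM.
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def train_iob_tagger(training_corpus):
    """training_corpus: list of (tagged_sentence, iob_tags) pairs, as in Proposition 3."""
    X, y = [], []
    for tagged_sentence, iob_tags in training_corpus:
        for i, tag in enumerate(iob_tags):
            X.append(features(tagged_sentence, i))   # features() as sketched above
            y.append(tag)
    vectorizer = DictVectorizer()      # turns feature dicts into numeric vectors v_i
    classifier = LinearSVC()           # separating hyperplanes between the tag classes
    classifier.fit(vectorizer.fit_transform(X), y)
    return vectorizer, classifier
```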
But how can the laborious work of preparing a training corpus R be
facilitated? There are two main options:
1. An automatically annotated corpus is corrected manually. For the be-
ginning, a chunk parser, based on a few simple rules incorporating ba-
sic features, e.g. POS-tags, is used (cf. Proposition (2)). This initial
parser is called the "baseline".
2. Chunk structures are derived from a corpus already equipped with
higher level analyses, for example constituency trees (treebank). For
an example, cf. Tjong Kim Sang and Buchholz (2000).
At this place, it is worthwhile to point out that there is no formal and verifi-
able definition of correct chunking. Tagging a corpus to train a chunker
also means to define the chunking task itself.

2.4. Transformation-based training

In view of option (2), another approach to the tagging task in general, in-
cluding IOB-tagging, opens up. The method outlined in the following is
called Transformation-based Learning (cf. Ramshaw and Marcus 2005).
1. Start with a baseline tagger.
Example: Use POS-tags of words alone to define the mapping into
IOB-space.
2. Identify the set of words in the input w_1, w_2, ... which have been
mistagged.
3. Add one or more rules to correct as many errors as possible.
4. Retag the corpus and restart at step 2 until no or not enough errors
remain.
5. Given a sequence of words u_1, u_2, ... outside the corpus to be tagged,
do baseline tagging, then apply the rules found in the steps above.
Example: "Adjectives ... that are currently tagged I but that are followed by
words tagged O have their tags changed to O" (Ramshaw and Marcus
2005: 91).
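The loop in steps 1 to 4 can be sketched as follows; baseline_tag(), candidate_rules() and rule.apply() are hypothetical stand-ins for a baseline chunker and for Ramshaw/Marcus-style rule templates such as the one quoted above.

```python
# Toy sketch of the transformation-based learning loop (steps 1-4 above).
def tbl_train(corpus, baseline_tag, candidate_rules, max_rules=50):
    """corpus: list of (sentence, gold_tags); returns the ordered list of learned rules."""
    def count_errors(tag_lists):
        return sum(t != g for tags, (_, gold) in zip(tag_lists, corpus)
                   for t, g in zip(tags, gold))

    current = [baseline_tag(sentence) for sentence, _ in corpus]   # step 1: baseline tagging
    learned = []
    for _ in range(max_rules):
        errors = count_errors(current)
        best_rule, best_result = None, None
        for rule in candidate_rules(current, corpus):              # steps 2-3: try correction rules
            retagged = [rule.apply(sentence, tags)
                        for (sentence, _), tags in zip(corpus, current)]
            if count_errors(retagged) < errors:
                errors = count_errors(retagged)
                best_rule, best_result = rule, retagged
        if best_rule is None:                                      # step 4: stop when no rule helps
            break
        learned.append(best_rule)
        current = best_result
    return learned
```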

3. Assessment of results

3.1. Measuring the performance of chunk parsers

A manually tagged corpus (see R in Proposition (3)) is passed to a


chunker for automatic tagging.
The performance is measured in terms of Precision and Recall. Infor-
mally, precision is the number of segments correctly labeled by the chun-
ker, divided by the total number of segments found by the chunker; i.e. the
sum of true positives and false positives, which are items incorrectly la-
beled as belonging to the class. Recall is defined in this context as the num-
ber of true positives divided by the total number of elements that actually
belong to the class; (i.e. the sum of true positives and false negatives, which
are items which were not labeled as belonging to that class, but should have
been).
The formal definitions of precision and recall are as follows:
- True Positives (TP): segments that are identified and correctly labelled
by the chunker;
- False Positives (FP): segments that are labelled by the chunker, but
not in R;
- False Negatives (FN): segments that are labelled in R, but not by the
chunker;
- True Negatives (TN): segments that are labelled neither in R, nor by
the chunker;
- Precision: percentage of selected items that were correctly labelled:
Precision = TP / (TP + FP)
- Recall: percentage of segments that were detected by the chunker
Recall = TP / (TP + FN)
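Expressed as code, with chunk segments represented here as (start, end, label) triples (our representation, not one prescribed by the shared task):

```python
# Precision and recall over chunk segments, following the definitions above.
def precision_recall(gold, predicted):
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)           # true positives
    fp = len(predicted - gold)           # false positives
    fn = len(gold - predicted)           # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```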
The results of the "CoNLL-2000 Shared Task in Chunking" (Tjong Kim
Sang and Buchholz 2000) are still representative for the state of the art:
precision and recall just below 94 % for pure IOB-tagging have been
achieved. Bashyam and Taira (2007) report on lower results when training
a special domain chunker for anatomical phrases.

3.2. How far can statistical chunking reach?

At the Conference on Natural Language Learning in 2004, the CoNLL-


2004 Shared Task of Semantic Role labeling (SRL) had been introduced.
SRL can be understood as the minimum requirement for automated seman-
tic analysis of free input text and addresses questions such as the following
(cf. Carreras and Marquez 2004):
- Who is the agent addressed by the verb of a sentence?
- Who or what is the patient or instrument?
- What adjuncts specify location, manner or cause belonging to the
verb?
The particular challenge was to restrict machinery involved in solving the
task to levels below chunk parsing (i.e. words and POS-tags) and more
basic chunk parsing applications: pure segmentation, i.e. IOB-tagging
without further labeling of segments, or named entity recognition, i.e. iden-
tification of noun chunks representing names of persons, organizations etc.
together with a label indicating the type of the entity: person, organization,
location and other.
The results achieved for language-dependent named entity recognition
are:

           Precision    Recall
German     below 84%    below 65%
English    below 90%    below 90%
As a consequence of the mentioned constraints, the results of the CoNLL-
2004 competition can be taken as a realistic orientation mark as far as the
applicability of chunk parsing is concerned in the sense of a realistic use of
automatic language analysis for whatever specific application.
So, the overall precision of participants in SRL hardly exceeds 75 %; for
correct identification of the agent of a sentence, precision reaches 94 %.

4. Conclusion

The authors of the CoNLL-2004 Shared Task (Carreras and Marquez 2004)
conclude:
... state-of-the-art systems working with full syntax still perform substan-
tially better, although far from a desired behavior for real-task applications.
Two questions remain open: which syntactic structures are needed as input
for the task, and what other sources of information are required to obtain a
real-world, accurate performance.

Appendix: Some publicly available chunk parsers

There are several chunk parsers which can be downloaded for free from the
World Wide Web. We present a small selection of those we tried with test
data.
On the one hand, there are natural language processing toolkits and
platforms such as NLTK (Natural Language ToolKit),2 or GATE (General
Architecture for Text Engineering)3 which contain part-of-speech taggers
and chunk parsers. In particular, NLTK offers building blocks for ambi-
tious readers, who want to develop chunkers or amend existing ones on
their own (in Python).
SCP is a Simple rule-based Chunk Parser by Philip Brooks (2003),
which is part of ProNTo, a collection of Prolog Natural language Tools. A
POS tagger and a chunker for English with special features for parsing a
"huge collection of documents"4 have been developed by Tsuruoka and
Tsujii (2005). A state-of-the-art pair of a tagger and a chunker, along with
parameter files trained for several languages, has been developed at the
University of Stuttgart.5 The "Stuttgart-Tübingen Tagset" STTS has be-
come quite popular in recent years for the analysis of German and other
languages.
Finally, chart parsers running in bottom-up mode and equipped with an
appropriate chunk grammar, can be used for chunk parsing as well. This
technique has been used in our dialogue system CONALD (Ludwig, Reiss
and Gorz 2006).

Notes

1 The authors are indebted to Martin Hacker for critical remarks on an earlier
draft of this paper.
2 http://nltk.sourceforge.net/index.php/Main_Page, accessed 31-10-2008.
3 http://gate.ac.uk/, accessed 31-10-2008.
4 available for download from http://www-tsujii.is.s.u-tokyo.ac.jp/tsuruoka/
chunkparser/, accessed 31-10-2008.
5 available for download from http://www.ims.uni-stuttgart.de/projekte/corplex/
TreeTagger/, accessed 31-10-2008.

References

Abney, Steven
1991 Parsing by chunks. In Principle-Based Parsing, Robert Berwick,
Steven Abney and Carol Tenny (eds.), 257-278. Dordrecht: Kluwer.
Bashyam, Vijayaraghavan and Ricky K. Taira
2007 Identifying anatomical phrases in clinical reports by shallow seman-
tic parsing methods. In Proceedings of the 2007 IEEE Symposium on
Computational Intelligence and Data Mining (CIDM 2007), Hono-
lulu, 210-214. Honolulu, Hawaii: IEEE.
Bird, Steven, Ewan Klein and Edward Loper
2006 Chunk Parsing. (Tutorial Draft) University of Pennsylvania.
Brooks, Philip
2003 SCP: A Simple Chunk Parser. University of Georgia. ProNTo
(Prolog Natural Language Tools), http://www.ai.uga.edu/mc/
ProNTo, accessed 31-10-2008.
Carreras, Xavier and Lluis Marquez
2004 Introduction to the CoNLL-2004 Shared Task: Semantic Role Label-
ing. CoNLL-2004 Shared Task Web Page: http://www.lsi.upc.edu/
srlconll/st04/papers/intro.pdf, accessed 31-10-2008.
Ludwig, Bernd, Peter Reiss and Gunther Gorz
2006 CONALD: The configurable plan-based dialogue system. In Pro-
ceedings of the 2006 IAR Annual Meeting. German-French Institute
for Automation and Robotics, Nancy, November 2006. David Brie,
Keith Burnham, Steven X. Ding, Luc Dugard, Sylviane Gentil,
Gerard Gissinger, Michel Hassenforder, Ernest Hirsch, Bernard
Keith, Thomas Leibfried, Francis Lepage, Dirk Söffker and Heinz
Wörn (eds.), A2.15-18. Nancy: IAR.
Ramshaw, Lance A. and Mitchell P. Marcus
2005 Text chunking using transformation-based learning. In Proceedings
of the Third Workshop on Very Large Corpora, David Yarowsky and
Kenneth Church (eds.), 82-94. Cambridge, MA: MIT, Association
for Computational Linguistics.
Tjong Kim Sang, Erik F. and Sabine Buchholz
2000 Introduction to the CoNLL-2000 shared task: Chunking. Proceed-
ings of the Fourth Conference on Computational Natural Language
Learning, CoNLL-2000, Lisbon, Claire Cardie, Walter Daelemans,
Claire Nedellec and Erik Tjong Kim Sang (eds.), 127-132. New
Brunswick, NJ: ACL.
Tsuruoka, Yoshimasa and Jun'ichi Tsujii
2005 Chunk parsing revisited. Proceedings of the 9th International Work-
shop on Parsing Technology (IWPT 2005), Vancouver, British Co-
lumbia: Association for Computational Linguistics, Harry Bunt (ed.),
133-140. New Brunswick, NJ: ACL.
German noun+verb collocations in the sentence context: properties contributing to idiomaticity

Ulrich Heid

1. Idiomaticity, multiwords, collocations

1.1. Objectives

In linguistic phraseology and in computational linguistics, many different


definitions of idiomatic expressions and collocations have been given. The
traditional view involves a difference between compositional word combi-
nations and non-compositional or semi-compositional ones, classifying the
latter ones as idiomatic expressions (cf. e.g. Burger 1998). In work on col-
locations, it has been observed that the degree of opacity (i.e. non-
compositionality) differs between its two elements (Hausmann 1979,
Hausmann 2004) and within the range of word pairs commonly denoted by
the term "collocation": Grossmann and Tutin (2003: 8) distinguish regular,
transparent and opaque collocations. For them, collocations are semi-
compositional and located between full idioms and word combinations
only governed by semantic selection criteria; the three subtypes show the
boundary between collocations where the meaning of the collocate is
opaque for decoding and collocations where only lexical selection is idio-
syncratic (i.e. they pose an encoding problem).
In many studies in Natural Language Processing (henceforth: NLP),
emphasis has so far mainly been on identifying idiomatic expressions and
less on classifying them. Thus the rather general term 'multiword expres-
sion' has been used, which denotes a wide variety of phenomena, ranging
from multiword function words (e.g. in spite of) over collocations and idi-
oms in the traditional phraseological sense (e.g. eat humble pie, give a
talk), to multiword names (e.g. Rio de Janeiro, New York; for a more de-
tailed list, see Heid 2008: 340).
In the following, we will start from the main trends in research about
collocations (section 1.2), and we argue that knowledge about collocations
used in lexicography and in NLP should be more than just knowledge
about the combinability of lexemes. In fact, not only the idiomatic nature
of collocations, but also of other idiomatic multiword expressions, is char-
acterized by a considerable number of morphological, syntactic, semantic
and pragmatic preferences, which contribute to the peculiarity of these
word combinations (section 2); these properties are observable in corpus
data. From the point of view of language learning, they should be learned
along with the word combination, very much the same way as the corre-
sponding properties of single words (cf. Heid 1998 and Heid and Gouws
2006 for a discussion of the lexicographic implications of this assumption).
From the viewpoint of NLP, they should be part of a lexicon.
If we accept this assumption, the task of computational linguistic data
extraction from corpora goes far beyond the identification of significant
word pairs. We will not only show which additional properties may play a
role for German noun+verb combinations (section 2), but also sketch a
computational architecture (section 3.3) that allows us to extract data from
corpora which illustrate these properties. Some such properties can also be
used as criteria for the identification of idiomatic or collocational
noun+verb combinations, others just provide the necessary knowledge one
needs when one wants to write a text and to insert the multiword expres-
sions in a morphologically and syntactically correct way into the surround-
ing sentence.
Our extraction architecture relies on more complex preprocessing than
most of the tool setups used in corpus linguistics: it presupposes a syntactic
analysis down to the level of grammatical dependencies. We motivate this
by comparison with a flat approach, namely the one implemented in the
Sketch Engine (Kilgarnff et al. 2004), for English and Czech (section 3.2).

1.2. Collocations in linguistics and NLP

The Firthian notion of collocation (cf. Firth 1957) is mainly oriented to-
wards lexical cooccurrence ("You shall know a word by the company it
keeps" (Firth 1957: 11)). British contextualism has soon discovered co-
occurrence statistics as a device to identify word combinations which are
collocational in this sense. John Sinclair places himself in this tradition, in
Corpus, Concordance, Collocation (Sinclair 1991), emphasizing however
the idiomatic nature of the combinations by contrasting the idiom principle
and the open choice principle. The range of phenomena covered by his
approach as presented in Sinclair (1991: 116-118) includes both lexical
collocations and grammatical collocations (in the sense of Benson, Benson
and Ilson 1986): for example, pay attention and back to both figure in the
among others, by Hausmann (1979), Hausmann (2004), Mel'cuk et al.
(1984, 1988, 1992, 1999), Bahns (1996) is more oriented towards a syntac-
tic description of collocations: Hausmann distinguishes different syntactic
types, in terms of the category of the elements of the collocation:
noun+verb collocations, noun+adjective, adjective+adverb collocations,
etc. Moreover, Hausmann and Mel'cuk both emphasize the binary nature
of collocations, distinguishing between the base and the collocate. Haus-
mann (2004) summarizes earlier work by stating that bases are typically
autosemantic (i.e. have the same meaning within collocations as outside),
whereas collocates are synsemantic and receive a semantic interpretation
only within a given collocation. Even though this distinction is not easy to
operationalize, it can serve as a useful metaphor, also for the analysis of
longer multiword chunks, where several binary collocations are combined.
Computational linguistics and NLP have followed the contextualist
view, in so far as they have concentrated on the identification of colloca-
tions within textual corpora, designing different types of tools to assess the
collocation status of word pairs. Most simply, a sorting of word pairs by
their number of occurrence (observed frequency) has been used on the
assumption that collocations are more frequent than non-collocational pairs
(cf. Krenn and Evert 2001). Alternatively, association measures are used to
sort word pairs by a statistical measure of the 'strength' of their association
(cf. Evert 2005); to date over 70 different formulae for measuring the asso-
ciation between words have been proposed.
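As an illustration, two of the most widely used measures can be computed from the usual 2x2 contingency table of a word pair; the formulae are the standard ones surveyed in the literature cited above, while the function names and the plain frequency arguments are our simplification.

```python
# Sketch: two common association measures, computed from a 2x2 contingency table
# (f = pair frequency, f1 and f2 = marginal frequencies of the two words,
#  N = number of pairs in the sample).
import math

def pmi(f, f1, f2, N):
    """Pointwise mutual information of a word pair."""
    return math.log2(f * N / (f1 * f2))

def log_likelihood(f, f1, f2, N):
    """Log-likelihood ratio (G2) over the 2x2 contingency table."""
    observed = (f, f1 - f, f2 - f, N - f1 - f2 + f)
    expected = (f1 * f2 / N, f1 * (N - f2) / N,
                (N - f1) * f2 / N, (N - f1) * (N - f2) / N)
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
```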
An important issue in the context of collocation identification from
texts is that of defining the kinds of word pairs to be counted and statisti-
cally analyzed: by which procedures can we extract the items to be
counted? Simple approaches operate on windows around a keyword, e.g.
by looking at items immediately preceding or following that item. Word-
smith Tools (Scott 2008) is a well-known piece of software which embeds
this kind of search as its 'collocation retrieval' function (fixed distance
windows, left and right of a given keyword). Smadja (1993) combines the
statistical sorting of word pair data with a grammatical filter: he only ac-
cepts as collocation candidates those statistically relevant combinations
which belong to a particular syntactic model, e.g. combinations of an ad-
jective and a noun, or of a verb and a subsequent noun; in English, such a
sequence mostly implies that the noun is the direct object of the verb.
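A window-based extractor with such a POS filter can be sketched as follows; the window size and the tag patterns are illustrative choices and do not reproduce the tools cited above.

```python
# Sketch of window-based candidate extraction with a simple POS filter.
from collections import Counter

def candidate_pairs(tagged_sentences, window=3, patterns=(("JJ", "NN"), ("VB", "NN"))):
    """Count co-occurring word pairs whose coarse POS tags match one of the patterns."""
    counts = Counter()
    for sentence in tagged_sentences:                    # sentence: list of (word, POS) pairs
        for i, (w1, p1) in enumerate(sentence):
            for j in range(i + 1, min(i + 1 + window, len(sentence))):
                w2, p2 = sentence[j]
                if (p1[:2], p2[:2]) in patterns:         # e.g. adjective followed by a noun
                    counts[(w1, w2)] += 1
    return counts
```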
For German and other languages with a more variable word order than
English, the extraction of pairs of grammatically related items is more de-
manding (see below, section 3.2). The guiding principle for collocation
extraction for such languages is to extract word pairs which all homogene-
ously belong to one syntactic type, in Hausmann's (1979) sense, e.g. verbs
and their object nouns. Proposals for this kind of extraction have been
made, among others, by Heid (1998), Krenn (2000), Ritz and Heid (2006).
This syntactic homogeneity has two advantages: on the one hand, it pro-
vides a classification of the word pairs extracted in terms of their gram-
matical categories, and on the other hand, it leads to samples of word pairs,
from where the significance of the association can be computed with re-
spect to a meaningful subset of the corpus (e.g. all verb+object pairs).
More recent linguistic work on multiword expressions has questioned
some of the restrictions inherent to the lexico-didactic approach. At the
same time, the late John Sinclair has suggested that quantitative and struc-
tural properties seem to jointly characterize most such expressions, cf
Togmm-Bonelli, this volume. In this sense, a combination of the two main
lines of tradition can be seen as an appropriate basis for computational
linguistic data extraction work.
In terms of a modification of the lexico-didactic approach, Schafroth
(2003: 404-409) has noted that there are many idiomatized multiword ex-
pressions which cannot be readily accounted for in terms of the strictly
binary structure postulated by Hausmann (1979). Siepmann (2005) has
given more such examples. Some of them can be explained by means of
recursive combinations of binary collocations: e.g. scharfe Kritik üben
('criticize fiercely', Schafroth 2003: 408, 409) can be seen as a combina-
tion of Kritik üben ('criticize', lit. 'carry out criticism') and the typical
adjective+noun collocation for Kritik, namely scharfe Kritik (cf. Heid
1994: 231; Hausmann 2004; Heid 2005). But other cases are not so easy to
explain and the question of the 'size' of collocations is still under debate.
But the notion of collocation has not only been widened with respect to
the size of the chunks to be analyzed; researchers also found significant
word combinations which are of other syntactic patterns than those identi-
fied e.g. by Hausmann (1989). Examples of this are combinations of dis-
course particles in Dutch (e.g. maar even, 'a bit'), where the distinction
between an autosemantic base and a synsemantic collocate is hard to draw.
In conclusion, it seems that the different strands of collocation research,
in the tradition of both the early work of John Sinclair and of the lexico-
graphic and didactic approach, can jointly contribute to a better under-
standing of the properties of idiomatized multiword expressions. This is
especially the case for computational linguistic work on such multiwords,
i.e. on their extraction from large corpora and on their detailed linguistic
description, for formal grammars, text understanding or high quality in-
formation extraction. For the purpose of this article (and for the work un-
derlying it), we take Sinclair's idiom principle as a theoretical starting
point and we are interested in identifying linguistic properties of multiword
items which contribute to their idiomaticity.

2. Linguistic properties of collocations:


German verb + object pairs as a case in point

Noun+verb collocations and verbal idioms have been discussed in some


detail in the literature (cf. e.g. Fellbaum, Kramer and Neumann 2006). As
they have a number of interesting linguistic properties, we nevertheless use
them again to illustrate the importance of a detailed and comprehensive
description of those context-related properties which govern the insertion
of the collocations into the sentence. At the same time, these properties can
be seen as an expectation horizon for data extraction from corpora.

2.1. Determination and modification of the noun

The literature on German noun+verb collocations contains discussions of


aspects of morphosyntactic fixedness. Helbig (1979) already mentions a
number of morphosyntactic properties of German 'Funktionsverbgefüge'
(support verb constructions, svc), which he uses to distinguish what he
calls 'lexicalized' as opposed to 'non-lexicalized' support verb construc-
tions. These properties include the morphosyntactic correlates of referen-
tially fully available nouns (and thus noun phrases) in the case of non-
lexicalized support verb constructions, and, vice versa, morphosyntactic
restrictions in lexicalized ones. Examples of such properties are the use of
articles or the possibility to pluralize the noun of a support verb construc-
tion, to modify it with an adjective, a relative clause, etc., or to make refer-
ence to it with a personal or interrogative pronoun (cf. Heid 1994: 234 for
a summary).
Examples of these properties are given in (1) to (4), where we use the
collocation Frage + stellen ('ask + question') as an example of a combina-
tion where the noun is referentially available (a non-lexicalized svc, in
Helbig's terms), and zur Sprache bringen ('mention', 'bring to the fore'),
as an example of a lexicalized support verb construction. In line with e.g.
Burger's (1998) discussion of the modifiability and fixedness of idioms,
Helbig's distinction can also be recast in terms of more vs. less idiomatiza-
tion.
(1) Fragen stellen ('[to] ask questions')
*zu Sprachen bringen
(2) eine Frage stellen / die Frage stellen / diese Frage stellen ('[to] ask a/the/
this question')
*zu einer Sprache bringen / *zu der/dieser Sprache bringen
(3) eine relevante Frage stellen ('[to] ask a relevant question')
*zur relevanten Sprache bringen
(4) eine Frage stellen, die enorm wichtig ist ('[to] ask a question which is
enormously important')
*zur Sprache bringen, die enorm wichtig ist
The examples in (1) to (4) show the morphosyntactically fixed nature of
zur Sprache bringen, which does not accept any of the operations which
are perfectly possible with Frage + stellen. A task for data extraction from
corpora is thus to test each collocation candidate for these properties (num-
ber, determination, modifiability of the noun) and to note the respective
number of occurrences of each option in the corpus.

2.2. Syntactic subcategorization

In Burger's work on idioms, examples are given of verbal idiomatic ex-


pressions which have their own syntactic subcategorization behaviour,
different from that of any of their components: for example, (einen) Bären
aufbinden ('[to] pull someone's leg') takes a subject and an indirect object,
and it can have a dass-clause which expresses the contents of the statement
made (cf. also Keil 1997).
(5) Der Kollege hat dir einen Bären aufgebunden.
Er hat mir den Bären aufgebunden, dass ...
Like verbal idioms, verb+object collocations can also have their own
valency, but this property has long not been recognized. Krenn and Erbach
(1994) state that the subcategorization of the noun is taken over in the sup-
port verb construction. This is true in many cases, but not all: Lapshinova
and Heid (2007) show examples of collocations which have their own sub-
categorization properties (cf. (6), below).
(6) zum Ausdruck bringen ('[to] express', lit. 'bring to the expression'),
zum Ausdruck kommen ('[to] be expressed'),
zurSprache bringen ('[to] mention'),
in Abrede stellen ('[to] deny'),
zu Protokoll geben ('[to] state').
The collocations in (6) all subcategorize for a sentential complement, while
the nouns Ausdruck, Protokoll, Sprache and Abrede (except in another
reading) as well as the verbs involved do not allow a complement clause.
Thus, even if the number of cases is relatively small, it seems that some
collocational multiwords require their own subcategonzation description.
A related fact is nominal 'complementation' of the nominal element of
noun+verb collocations, i.e. the presence of genitive attributes. Many col-
locations contain relational nouns or other nouns which have a preference
for a genitive attribute. An example is the collocation im Mittelpunkt (von
X) stehen ('[to] be at the centre [of X]'). Other examples, extracted from
large German newspaper corpora are listed in (7), below. 1
(7) (a) in + Mittelpunkt + GEN + stellen/rücken
('[to] put into the centre of ...', Frankfurter Rundschau 40354632)
[...] die die Sozialpolitik mehr in den Mittelpunkt des öffentlichen
Interesses stellen will ('which wants to put social politics more into
the centre of public interest');
(b) auf + Boden + GEN + stehen
('[to] be on the solid ground of ...', Frankfurter Rundschau 440129)
[...] fragte er seinen schlaftrunkenen Kollegen, der mit einem Mal
wieder auf dem Boden der Realität stand ('[...] he asked his sleepy
colleague who, all of a sudden, was back to reality');
(c) sich auf + Niveau + GEN + bewegen
('[to] be at the level of ...', Stuttgarter Zeitung 42459658)
[...] während sich der Umfang des Auslandsgeschäfts auf dem
Niveau des Vorjahres bewegte ('whereas the amount of foreign trade
was at the level of the previous year').
Some of the noun+verb collocations where the noun tends to have a geni-
tive (or a von-phrase) seem to have, in addition to this syntactic specificity,
also particular lexical preferences: for example, we find auf dem Boden der
Realität stehen, auf dem Boden der Verfassung stehen ('[to] be rooted in
the constitution'), auf dem Boden (seiner) Überzeugungen stehen ('[to] be
attached to one's convictions') more frequently than other combinations of
auf+ Boden + stehen with genitives. Moreover, these combinations often
come with the adverb fest, such that the whole expression is similar to 'be
firmly rooted in ...'. The analysis of such combinations of collocations
requires very large corpora and ideally corpora of different genres: our data
only come from newspapers and administrative texts. A more detailed
analysis should show larger patterns of relatively fixed expressions, likely
specific to text types, genres etc. At the same time, it would show how the
syntactic property of the nouns involved (to take a genitive attribute) inter-
acts with lexical selection, and how collocational selection properties of
different lexemes interact to build larger idiomatic chunks of considerable
usage frequency.
On the other hand, there are collocations which hardly accept the inser-
tion of a genitive after the noun, and if it is inserted, the construction seems
rather to be a result of linguistic creativity than of typical usage. Examples
are given in (8): the collocations have no genitives in over 97 % of all ob-
served cases (the absolute frequency of the collocation in 240 M words is
given in parentheses), and our examples are the only ones with a genitive:
(8) (a) in Flammen aufgehen ('[to] go up in flames', Die Zeit 39261518, 433):
LaButes Monolog ist ein Selbstrechtfertigungssystem von so
trockener Vernunft, dass es jederzeit in den Flammen des Wahnsinns
aufgehen könnte ('LaBute's monologue is a self-justification system
of such dry reason that it could go up, at any moment, in the flames
of madness');
(b) in die Irre führen ('[to] mislead', Frankfurter Allgemeine
Zeitung 62122815, 344):
Ihr 'Requiem' versucht sich in unmittelbarer Emotionalität, von der
die Kunst ja immer wieder träumt, und die doch so oft in die Irre der
Banalität führt ('Her 'Requiem' makes an attempt at immediate
emotionality, which art tends to dream of every now and then, and
which nevertheless misleads quite often towards banality').
Similar strong preferences for the presence or absence of modifying ele-
ments in noun+verb collocations are also found with respect to adjectival
modification. This phenomenon has however mostly been analyzed in line
with the above mentioned morphosyntactic properties which depend on the
referential availability of the noun. Examples which require modifying
adjectives are listed in (9) and a few combinations which do not accept
them are given in (10).
(9) eine gute/brillante/schlechte/... Entwicklung nehmen
('[to] progress well/brilliantly/not to progress well')
eine gute/schlechte/traurige/... Figur abgeben
('[to] cut a good/bad/poor figure')
im frühen/letzten/entscheidenden/... Stadium sein
('[to] be in an early/the last/the decisive stage')
aus guten/kleinen/geordneten/... Verhältnissen stammen
('[to] be of .../humble/... origin')
(10) Gebrauch machen ('[to] make use')
Platz nehmen ('[to] take a seat')
Stellung beziehen ('[to] position oneself')
Schule machen ('[to] find adherents')
The examples in (10) can only be modified with adverbs (eindeutig Stel-
lung beziehen) and are, at least in our corpora, not used with adjectives.
With many other collocations, both options are available (cf. Storrer 2006
on examples such as brieflich in Kontakt stehen vs. in brieflichem Kontakt
stehen, both meaning 'be in (postal) correspondence with sb.').
The examples above seem to suggest that it is necessary to keep track of
the syntactic valency behaviour of nouns in noun+verb collocations in
more detail than this is often done in the literature and in lexicography.
Moreover, the valency behaviour seems to be one of the factors contribut-
ing to the development of larger collocational clusters.

2.3. Negation and coordination in noun+verb combinations

Preferences for negation, as well as for coordinated nouns are a strong


indicator of idiomatization, and noun+verb combinations with strong pref-
erences of this kind tend to be full idioms. But also collocations where the
nominal part can be seen as 'auto-semantic' in Hausmann's sense, show
such preferences.
Many collocations show up both in a positive and in a negated form.
For German data, the difference between verb negation (nicht) and NP
negation (kein) needs to be accounted for in addition. For a lexicographic
description, it is necessary to indicate preferences in this respect. Similarly,
it makes sense to indicate which collocations show a marked tendency
towards negation; for example, the proportion of negated vs. non-negated
instances in a big corpus could be indicated. Typical examples of combina-
tions which are idiomatic and most often found in the negated form are
given in (11) below, in order of decreasing frequency in our corpus.
(11) keinen Hehl machen aus ('[to] make no secret of ...')
[einer Sache] keinen Abbruch tun ('not to spoil sth.')
keine Grenze(n) kennen ('[to] know no bounds')
kein Ende nehmen (cf. 'there is no end to ...')
kein Wort verlieren ('not to waste any words on ...')
A similar sign of idiomatization is the presence of coordinated noun
phrases in the multiwords; the items in (12) are typical examples of this
phenomenon; they are not perceived as correct if one of the conjuncts is
missing. On the other hand, coordinated NPs in 'normal' collocations (e.g.
pay attention, ask + question, etc.) are rather rare or a matter of creative
use.
(12) in Fleisch und Blut übergehen ('[to] become second nature [to s.o.]')
in Lohn und Brot stehen/bringen ('[to] be employed by [s.o.]')
hinter Schloss und Riegel sitzen/bringen/...
('[to] be kept under lock and key')
in Sack und Asche gehen ('[to] repent in sackcloth and ashes')

2.4. Preferences with respect to word order

The properties discussed so far mainly have to do with a collocation's form


(cf. 2.1, 2.3) or with its syntactic embedding in a sentence (2.2). Many
collocations also seem to have preferences with respect to word order: this
property does not affect their form directly, but it is part of the linguistic
knowledge governing their correct use in a sentence. It is likely that these
preferences have to do with the status of the nominal element of the collo-
cations and with the nature of some of the sequential positions in German
sentences.
German has three different word order patterns. They are defined with
respect to the model of topological fields. It distinguishes two areas where
verbal elements (finite or non-finite verbs, verb complexes, or verb parti-
cles) or conjunctions can be placed: 'linke Satzklammer' and 'rechte
Satzklammer', LK and RK, respectively, in table 1, below. Furthermore, it
identifies three areas for other types of constituents, ' Vorfeld' (VF, in table
1), 'Mittelfeld' (MF) and 'Nachfeld' (right of RK, left out from table 1, for
sake of simplification).
These areas can be filled in different ways, depending on the place of the finite verb. It can be in RK, as happens in subclauses; this model is the 'verb-final' model (VL, for 'verb-letzt', in table 1); it accounts for approx. 25 % of all occurrences of finite verbs in our corpora. Alternatively, the finite verb is in LK, either with a full constituent in Vorfeld ('verb-second' model, V2 in table 1) or with an expletive or no constituent at all in Vorfeld ('verb-first' model, used in interrogatives and conditionals, V1). V2 sentences are by far the most frequent ones (approx. 65 %), the verb-first model accounting for at most approx. 10 % of our data. These facts are summarized using the example sentence in (13) and its word order variants in table 1.
(13) Die Frage (kann/wird) (in Darmstadt) (dann/wohl) gestellt (werden).
this question (can/will) (in Darmstadt) (then/maybe) asked (be)
'This question will/could (then/maybe) be asked (in Darmstadt)'.

Table 1. German verb placement models


     VF          LK    MF              RK
V1   (Es)        wird  die Frage dann  gestellt
                 Kann  die Frage       gestellt werden?
V2   Die Frage   wird  dann            gestellt
     Die Frage   kann  in Darmstadt    gestellt werden
VL               weil  die Frage dann  gestellt wird
                 dass  die Frage dann  gestellt werden kann

The table shows that a noun+verb collocation like Frage + stellen ('ask + question') can appear under all three word order models. At the time of writing this article, we are in the process of investigating in more detail which collocation candidates readily appear in all three models and which ones do not. This was prompted by work on word order constraints for Natural Language Generation (cf. Cahill, Weller, Rohrer and Heid 2009) and by the following observation: some collocations which are highly idiomatized and morphosyntactically relatively fixed, such as Gebrauch machen ('make use'), tend not to have their nominal element in Vorfeld (cf. Heid and Weller 2008). This is illustrated with the examples in (14), which are contrasted with instances of Frage + stellen.
(14) (a) [...,] weil der Chef eine relevante Frage stellt.
('because the boss asks a relevant question', VL)
(b) [...,] weil er davon Gebrauch macht.
('because he makes use of it', VL)
(c) Eine relevante Frage stellt der Chef. (V2)
(d) *Gebrauch macht er davon. (V1)


In fact, what is marked with an asterisk in (14d) may not be ungrammatical, but it is at least dispreferred. There are examples of NP or PP fronting (i.e. the NP or PP in Vorfeld) under contrastive stress, as in the (invented) example (15):
(15) Das Medikament ist seither im Haus. Gebrauch hat davon noch niemand gemacht.
('We have had this medicine at home since then. Nobody has so far made use of it.')
It should be noted that the example in (15) has an auxiliary in Vorfeld position (instead of Gebrauch macht davon niemand, in the simple present), which seems, according to our preliminary data, to be a factor which makes this particular word order somewhat more likely. Another variant, which is also more likely under contrastive stress, is partial VP fronting, i.e. a case like the one in (16), where the NP or PP and a non-finite verb form of the collocation find themselves to the left of the finite auxiliary.
(16) Zur Kenntnis genommen wurde das Häuflein Bamberger Demonstranten kaum.
('The small group of demonstrators from Bamberg was barely noticed.')
A comparison between idiomatized verb+pp collocations and non-
idiomatic combinations of the same syntactic form (verb + prepositional
phrase) seems to indicate that the collocations hardly accept the fronting of
their prepositional phrase, whereas this is normal for trivial combinations.
In (17) we reproduce data from a preliminary test (carried out by Marion Weller, IMS Stuttgart, unpublished), where idiomatized collocations are listed along with the total frequency with which they were found in the corpus ('tot' in [17]), as well as the absolute observed frequencies of pp fronting with simple tense forms ('pp-frt'), pp fronting with complex tense forms (i.e. the auxiliary being in VF, 'pp-aux') and partial VP fronting ('vp-frt').
(17) zu Verfügung stehen     tot: 9825  pp-frt: 41    pp-aux: 0    vp-frt: 1
     zu Verfügung stellen    tot: 7209  pp-frt: 0     pp-aux: 2    vp-frt: 0
     in Mittelpunkt stehen   tot: 5172  pp-frt: 1613  pp-aux: 171  vp-frt: 0
     in Anspruch nehmen      tot: 4984  pp-frt: 2     pp-aux: 2    vp-frt: 2
     um Leben kommen         tot: 4691  pp-frt: 3     pp-aux: 1    vp-frt: 0
     in Kraft treten         tot: 3886  pp-frt: 9     pp-aux: 0    vp-frt: 17
     in Frage stellen        tot: 3884  pp-frt: 1     pp-aux: 0    vp-frt: 9
     in Vordergrund stehen   tot: 3173  pp-frt: 511   pp-aux: 61   vp-frt: 0
     zu Ende gehen           tot: 3087  pp-frt: 5     pp-aux: 0    vp-frt: 0
     in Frage kommen         tot: 3080  pp-frt: 106   pp-aux: 3    vp-frt: 3
     zu Kenntnis nehmen      tot: 2835  pp-frt: 0     pp-aux: 0    vp-frt: 3
In (18), we reproduce a similar table, with top-frequency combinations which do allow pp fronting. This table contains almost exclusively combinations of verbs and prepositional phrases which either do not semantically belong together (as the PP is an adjunct, cf. in Monat + steigen, 'rise + in [the] month [of]'), or which are not to be considered as idiomatized collocations (cf. mit [dem] Bau beginnen, '[to] start the building work').
(18) an Stelle treten           tot: 1566  pp-frt: 341  pp-aux: 248  vp-frt: 0
     in Monat steigen           tot: 1050  pp-frt: 150  pp-aux: 190  vp-frt: 0
     mit Bau beginnen           tot: 864   pp-frt: 2    pp-aux: 130  vp-frt: 2
     auf Seiten haben           tot: 568   pp-frt: 0    pp-aux: 59   vp-frt: 0
     in Quartal steigen         tot: 545   pp-frt: 69   pp-aux: 95   vp-frt: 0
     in Halbjahr steigen        tot: 512   pp-frt: 73   pp-aux: 88   vp-frt: 0
     nach Ansicht haben         tot: 470   pp-frt: 0    pp-aux: 138  vp-frt: 0
     in Zusammenhang hinweisen  tot: 465   pp-frt: 109  pp-aux: 78   vp-frt: 0
     an Ende haben              tot: 455   pp-frt: 0    pp-aux: 124  vp-frt: 0
     in Zeit haben              tot: 438   pp-frt: 0    pp-aux: 81   vp-frt: 0
     in Preis enthalten         tot: 167   pp-frt: 0    pp-aux: 65   vp-frt: 48
In (17), im Mittelpunkt stehen and im Vordergrund stehen stand out as having a considerable number of pp fronting cases. These two collocations have a very strong preference for a genitive attribute (see example [7], above), which makes the fronted pp longer and 'more contentful'. Table (18) contains im Preis enthalten (sein) ('[to] be included in the price'), which has particularly, and unexpectedly, high counts of partial VP fronting. This is due to text-type specificities: the expression is typically used when offers from hotels or travel agencies are described, and the respective sentences almost invariably come as im Preis (von X Euro) enthalten sind Übernachtung, Frühstück, Abendessen und Hoteltransfer ('the price [of X Euro] includes the hotel room, breakfast, dinner and bus transfer to the hotel').
A preliminary conclusion which can be drawn from the data in (17) and (18) is that idiomatic verb+object and verb+pp combinations tend to show restrictions with respect to the position of their nominal or prepositional phrases in Vorfeld, much more so than non-idiomatic combinations of the same structure.2 The Vorfeld test thus seems to be usable, at least to some extent and for items of medium or high frequency, to distinguish idiomatic verb+object and verb+pp combinations from non-idiomatic ones.
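As an illustration of how such a 'Vorfeld test' could be operationalized, the following minimal Python sketch computes the proportion of fronted occurrences from counts like those in (17) and (18) and flags items whose fronting rate falls below an assumed threshold. The data structure, the threshold value and the decision rule are our own illustrative assumptions, not part of the extraction tools used in this study.

```python
# Minimal sketch of a 'Vorfeld test' over counts like those in (17)/(18).
# Counts are copied from tables (17)/(18); the threshold is an assumption.

counts = {
    # item: (total, pp-frt, pp-aux, vp-frt)
    "zu Verfügung stehen":   (9825, 41, 0, 1),
    "im Mittelpunkt stehen": (5172, 1613, 171, 0),
    "an Stelle treten":      (1566, 341, 248, 0),
}

THRESHOLD = 0.01  # assumed cut-off: below 1 % fronting -> idiom-like

def fronting_rate(total, pp_frt, pp_aux, vp_frt):
    """Proportion of occurrences showing any kind of fronting."""
    return (pp_frt + pp_aux + vp_frt) / total

for item, (tot, pp_frt, pp_aux, vp_frt) in counts.items():
    rate = fronting_rate(tot, pp_frt, pp_aux, vp_frt)
    verdict = "fronting dispreferred (idiom-like)" if rate < THRESHOLD else "fronting possible"
    print(f"{item:25s} {rate:7.3%}  {verdict}")
```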

2.5. Variation by text types, regions and registers

In addition to the formal properties discussed so far, several layers of variation need to be taken into account. Obviously, there is variation related to text types and domains. An analysis of legal journals from the field of intellectual property law and trademark legislation shows that this domain not only has its own specialized phraseology (cf. Heid et al. 2008), but also shows preferences with respect to the use of certain collocations from the general language.
Regional variation is also very clearly observable; in a preliminary study, we analyzed German newspaper texts from Germany, Switzerland, Austria and South Tyrol, using part of the corpus material gathered by the Institut für Deutsche Sprache (Mannheim) and the universities of Tübingen and Stuttgart in the framework of the DeReKo project (Deutsches Referenzkorpus), as well as texts from the South Tyrolean newspaper Dolomiten (made available to us under a specific contract). Obviously, a corpus containing different Swiss newspapers (and different amounts of material from the individual newspapers) is nothing more than an opportunistic 'archive'-like collection, and only partial generalizations about regional differences in collocational behaviour can be derived from that material; but it nevertheless provides data which indicate a few tendencies.
A simple comparison of relative frequencies of collocations in the news texts from different regions suggests that there are considerable differences between, say, Swiss and German texts. These differences do not (only) have to do with regional objects, institutions or procedures (cf. e.g. Kanton, kantonal, ein Nein in die Urne legen ['[to] vote no']). Nor are they exclusively due to specific regional lexemes which are synonyms of items used in Germany, such as CH Entscheid for DE Entscheidung ('decision'), or South Tyrolean Erhebungen aufnehmen for DE Ermittlungen aufnehmen ('[to] take up investigations'). There are also regional differences in collocate selection: Swiss texts, for example, have a much higher proportion of the collocation tiefer Preis ('low price'), which is not completely absent from texts from Germany, but very rare there (2 occurrences in 100 million words from the tageszeitung); German texts from Germany tend to have niedriger Preis instead, which itself exists in the Swiss data, but is less frequent than tiefer Preis.
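The frequency comparison itself is straightforward; the sketch below normalizes raw counts to occurrences per million words per regional subcorpus. Apart from the figure of 2 occurrences in 100 million words quoted above, the corpus sizes and counts are invented for illustration.

```python
# Sketch: relative frequency (per million words) of a collocation in
# different regional subcorpora. Except for the DE figure quoted in the
# text, sizes and counts are invented.

subcorpora = {
    # region: (corpus size in tokens, occurrences of "tiefer Preis")
    "DE (tageszeitung)": (100_000_000, 2),
    "CH (invented)":     (60_000_000, 240),
}

for region, (size, count) in subcorpora.items():
    per_million = count / size * 1_000_000
    print(f"{region}: {per_million:.3f} per million words")
```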
Moreover, first analyses of the above-mentioned data suggest that there may also be some regional differences with respect to the idiomatic use of determiners in the noun phrase of noun+verb collocations. Concerning the collocation Geschäft + machen ('[to] make [a good] bargain'), for example, our data provide evidence for a preference, in texts from Germany, for an indefinite plural: Geschäfte machen, gute Geschäfte machen; this form is the most frequent one in Switzerland as well, but, contrary to the data from Germany, Swiss news texts also have a high proportion of indefinite singulars: ein Geschäft machen, ein gutes Geschäft machen.
More analyses are certainly necessary to get a real picture of such differences. But it would not be all that surprising to find the same types of variation in collocation use as are found in the use of single-word items. Collocations not only show preferences at the morphosyntactic and syntactic level; these preferences may also be subject to regional and text-type-related variation.

2.6. Intermediate summary

In this second part of the paper, a few linguistic properties of noun+verb collocations have been briefly discussed: determination and modification, subcategorization, preferences for negation or coordination, compatibility with the three German word order models, and regional variation.
All of these are preferential in character, not categorical. All of them play a role in the insertion of collocations into sentences and texts, but not all are relevant as indicators of the idiomatic status of the combinations, even though the observable preferences are likely due to idiomatization. In fact, restrictions with respect to determination and modification and preferences for negation or coordination have been used as indicators of non-compositionality in NLP research (cf. Fazly and Stevenson 2006). We have the impression that the compatibility with NP or PP fronting is also a good criterion to separate collocations from non-idiomatic, fully compositional combinations. Moreover, as the description of collocations should cover both lexical selection and morphosyntactic preferences, both would likely need to be analyzed for variation.
The idea underlying e.g. Fazly and Stevenson's (2006) work is that idiomaticity and fixedness are correlated: the more fixed the morphosyntax of a multiword, the more likely it is to be idiomatic. Fazly and Stevenson (2006) analyze fixedness phenomena and build a sort of "fixedness vector" by adding points for all pieces of evidence that speak for fixedness. Beyond a certain threshold, they classify the respective multiword as idiomatic. This procedure does not keep track of the individual properties of the multiwords, but it leads to a relatively adequate classification. With extraction tools that keep track of the type of preference, both functions could be achieved: a broad classification into [± idiomatic] and a detailed description.
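The general idea can be sketched as follows. The hand-set binary features, their equal weighting and the threshold are purely illustrative; Fazly and Stevenson (2006) derive their fixedness evidence from corpus statistics rather than from manually set flags.

```python
# Sketch of the 'fixedness vector' idea: add up pieces of evidence for
# morphosyntactic fixedness and classify a multiword as idiomatic once the
# score exceeds a threshold. Features, weights and threshold are illustrative.

from dataclasses import dataclass

@dataclass
class FixednessEvidence:
    fixed_determiner: bool     # e.g. strong preference for a null or definite article
    fixed_number: bool         # e.g. singular only
    no_modification: bool      # adjectival modification dispreferred
    negation_preference: bool  # mostly occurs negated
    no_fronting: bool          # nominal element avoided in Vorfeld

def fixedness_score(ev: FixednessEvidence) -> int:
    return sum([ev.fixed_determiner, ev.fixed_number, ev.no_modification,
                ev.negation_preference, ev.no_fronting])

THRESHOLD = 3  # assumed

evidence = FixednessEvidence(True, True, True, False, True)
label = "idiomatic" if fixedness_score(evidence) >= THRESHOLD else "non-idiomatic"
print(label)  # -> idiomatic
```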
This inventory of phenomena to be analyzed leads to rather complex
expectations for (semi-)automatic data extraction from corpora. Beyond
lexical selection, there is also a need to extract evidence for the properties
discussed above. And in the ideal case, careful corpus design and a detailed
classification of the corpus data used should allow for the variational
analysis suggested here.

3. Procedures for extracting collocation candidates from texts

In this section, a brief overview of the main approaches to the extraction of


collocation candidates from text corpora will be given. More details can be
found in Evert (2005) and Evert (2009). In this article, we will mainly ana-
lyze the existing approaches and contrast them with the expectation hori-
zon presented in section 2. This will be done by comparing English and
German. At the end of this section, we discuss the architecture used in our
own work.

3.1. Properties of collocations used for their extraction

There are three groups of properties of collocations which are used to extract collocational data from text corpora, at least for English. These are (i) cooccurrence frequency and significance, (ii) linear order and adjacency, and (iii) morphosyntactic fixedness. All three properties, as well as their use for data extraction, have been briefly mentioned above (in section 1.2).
Extraction procedures based exclusively on statistics (frequency, association measures) are at risk of identifying at least two types of noise. On the one hand, they pick up word combinations that are typical of a certain text type, but phraseologically uninteresting. An example is die Polizei teilt mit(, dass) ('the police informs [that ...]'): this word combination is particularly frequent in German newspaper corpora (because newspapers report about many events where the police has to intervene, and to inform the public thereafter), but it is not particularly idiomatic, since any noun denoting a human being or an institution composed of human beings can be a subject of mitteilen. On the other hand, next to these semantically and lexically trivial word combinations, pairs of words could be extracted which are not even in a grammatical relationship.
Thus, association measures have been supplemented with filters based on a modelling of grammatical relations. Here, language-specific problems arise (see below, section 3.2): the effort that needs to be spent in order to extract correct relational pairs from large corpora with a satisfactory recall differs from one language to the other. As mentioned above, both orderings of statistical and symbolic procedures have been used (cf. section 1.2: Smadja [1993] vs. e.g. Krenn [2000]).
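As an illustration of the purely statistical component, the sketch below computes the log-likelihood ratio, one of the standard association measures surveyed by Evert (2005, 2009), from a 2x2 contingency table of co-occurrence counts; the counts are invented.

```python
# Sketch: log-likelihood ratio (G2) for a (verb, noun) pair, computed from a
# 2x2 contingency table of co-occurrence counts. The counts are invented;
# see Evert (2005, 2009) for the full inventory of association measures.

import math

def log_likelihood(o11, o12, o21, o22):
    """G2 statistic for a 2x2 contingency table of observed frequencies."""
    n = o11 + o12 + o21 + o22
    row1, row2 = o11 + o12, o21 + o22
    col1, col2 = o11 + o21, o12 + o22
    observed = [o11, o12, o21, o22]
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

# (stellen, Frage) vs. all other verb+object pairs (invented counts)
print(round(log_likelihood(1200, 8800, 3800, 986200), 1))
```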
The third type of property is morphosyntactic fixedness, i.e. the properties discussed above in the second section. Fazly and Stevenson (2006) take morphosyntactic preferences or formal fixedness as an indicator of idiomaticity. Ritz (2006) and Ritz and Heid (2006) used a similar procedure on a homogeneous set of verb+pp combinations extracted from German prenominal participle constructions (e.g. der eingereichte Antrag - Antrag + einreichen, 'the submitted proposal, submit + proposal'), and they evaluated which percentage of the extracted noun+verb combination types could be classified as idiomatized collocations on the basis of being morphosyntactically restricted or fixed; in fact, only 35 % of the combinations were accepted as being idiomatic or collocational in manual evaluation (2 evaluators).
Obviously, most practical approaches to collocation candidate extraction use two or all three of the above types as extraction criteria, as none of them, taken in isolation, performs well enough.

3.2. Extracting grammatical relationships

As mentioned before, there are several types of noun+verb collocations:


verb+object and verb+subject collocations and verb+pp collocations. In
addition to direct objects, German also has indirect objects, and we believe
that these also can be part of collocational combinations with verbs; exam-
ples are given in (19). These have been extracted from a corpus of legal
journals, but could also be found in other text types.
(19) dem Zweifel unterliegen ('[to] be doubtful'),
den Anforderungen genügen ('[to] satisfy [the] requirements'),
einem Antrag stattgeben ('[to] accept a proposal')
A task within collocation extraction is thus to identify the different syntactic subtypes of noun+verb collocations. For English, this task is relatively easy, as typically the subject of an English verb can be found to its left and the object to its right. To identify English noun+verb combinations, part-of-speech tagged corpora and (regular) models of the adjacency of noun and verb phrases are mostly sufficient; it is possible to account for non-standard word order types (e.g. passives, relative clauses) with relatively simple rules.
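A minimal sketch of such an adjacency-based strategy is given below: it scans POS-tagged text for a verb and takes the first noun head within a small window to its right as an object candidate. The Penn-style tags, the window size and the toy sentence are our own assumptions; real systems add further rules for passives, relative clauses, etc.

```python
# Sketch: POS-pattern/adjacency-based extraction of English verb+object
# candidates. Tags, window size and the toy sentence are illustrative only.

tagged = [("she", "PRP"), ("asked", "VBD"), ("a", "DT"),
          ("rather", "RB"), ("awkward", "JJ"), ("question", "NN")]

WINDOW = 4  # look at most 4 tokens to the right of the verb

def verb_object_candidates(sentence):
    pairs = []
    for i, (word, tag) in enumerate(sentence):
        if tag.startswith("VB"):
            for j in range(i + 1, min(i + 1 + WINDOW, len(sentence))):
                next_word, next_tag = sentence[j]
                if next_tag.startswith("NN"):   # first noun head to the right
                    pairs.append((word, next_word))
                    break
    return pairs

print(verb_object_candidates(tagged))  # -> [('asked', 'question')]
```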
For inflecting languages (like the Slavonic languages, or Latin), nominal inflection gives a fair picture of case and thus of the grammatical relation between a nominal and its governing verb. Even though many inflectional forms of e.g. nouns in Czech are case-ambiguous in isolation, a large percentage of noun groups (e.g. adjectives plus nouns) is unambiguous. Thus, inflecting languages, while allowing for flexible constituent order, still lend themselves fairly well to a morphology-based approach to the extraction of relational word pairs.
In fact, the collocation extraction within the lexicographic tool Sketch Engine (Kilgarriff et al. 2004) is based on the above-mentioned principles for English and Czech: the extraction of verb+object pairs from English texts relies on sequence patterns of items described in terms of parts of speech, and the tool for Czech on patterns of cooccurrence of certain morphological forms, in arbitrary order and within a window of up to five words.
German is different from both English and Czech, for that matter. Due to its variable constituent order (see table 1), sequence patterns and the assumption of verb+noun adjacency do not provide acceptable results. German has four cases and nominals inflect for case; but nominal inflection shows a great deal of syncretism, such that, for example, nominative and accusative, or genitive and dative, are formally identical in several inflection paradigms. Evert (2004) extracted noun and noun phrase data from the manually annotated Negra corpus (a subset of the newspaper Frankfurter Rundschau) and found that only approx. 21 % of all noun phrases in that corpus are unambiguous for case, with roughly the same amount not giving any case information (i.e. being fully four-way ambiguous, as is the case with feminine plural nouns!), and 58 % being 2- or 3-way ambiguous (cf. table 2, below).
Table 2. Case syncretism: Evert (2004: 1540) on Negra

Nouns             unambiguous   2/3 alternatives   no information
forms alone            7 %           3^^%               5^%
NP + agreement        21 %           58 %               21 %

We have tested different ways to improve verb+object precision in the extraction of data from German text corpora. One option is to use full syntactic analysis. This is what we do in the work underlying the data presented in this paper. Another option is partial syntactic analysis, in the sense of chunking (cf. Abney 1991). This involves the recognition of the pre-head part of nominal, adjectival and prepositional phrases, and of the head; it does not account for post-head modification, pp attachment and overall sentence structure. Such an approach is useful when the chunks are annotated with hypotheses about case (cf. Kermes 2003), from which grammatical relations can then be derived.
Yet other approaches rely on an approximative modelling of case and grammatical relations by means of data on case endings. As mentioned, these are too syncretistic to be used in isolation; they need at least to be combined with knowledge about a noun's grammatical gender. In our experiments (Ivanova et al. 2008), this knowledge was inferred from derivational affixes (e.g. 'all nouns in -heit/-keit ['-ity'] are feminine'). The experiments were carried out on the Sketch Engine, with different versions of a sketch grammar for German; the different versions are used to show and to compare the impact of the use of different amounts of linguistic knowledge.
The different versions of the grammar contain the types of information
listed in (20), below.
(20) Data available to the different German sketch grammar versions:
(a) case guessed from inflection forms in affix sequences:
dem (dative-sg) kleinen (dative-sg) Hans ([nom|akk|dat]-sg);
(b) like (a), plus gender guessed from derivational affixes (-heit: fem);
(c) inflection-based guessing as in (a), plus adjacency of np and verb under the verb-final constituent order model;
(d) like (c), plus gender guessing (as in [b]);
(e) inflection-based guessing plus explicit sentence structure models for sentences which contain only and exactly a subject and object, an indirect object and/or a prepositional object (sentence patterns);
(f) like (e), plus gender guessing (as in [b]).
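The affix-based gender guessing used in versions (b), (d) and (f) of (20) can be illustrated with the following minimal sketch; the suffix lists are deliberately tiny and do not reflect the actual sketch grammar versions evaluated in Ivanova et al. (2008).

```python
# Sketch: guessing grammatical gender from derivational suffixes, as in the
# '-heit/-keit -> feminine' heuristic mentioned in the text. The suffix
# lists are reduced to a few safe cases and are illustrative only.

FEMININE_SUFFIXES = ("heit", "keit", "ung", "schaft")
NEUTER_SUFFIXES = ("chen", "lein")

def guess_gender(noun: str):
    lower = noun.lower()
    if lower.endswith(FEMININE_SUFFIXES):
        return "fem"
    if lower.endswith(NEUTER_SUFFIXES):
        return "neut"
    return None  # no safe guess possible

for noun in ["Freiheit", "Möglichkeit", "Mädchen", "Antrag"]:
    print(noun, guess_gender(noun))
```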
The results of an evaluation against a small set of sentences manually annotated for the case of noun phrases showed that condition (20f), i.e. the most complex one, gave the best results, for precision as well as for recall; the other conditions with restrictive patterns also provided better results than the more 'sloppy' extractors (Ivanova et al. 2008). This seems to indicate that the extraction of verb+object pairs from German data is harder than, e.g., from English or Czech data. Our conclusion is that, instead of the approximative modelling summarized in (20), full syntactic analysis seems to be the most appropriate preprocessing of corpora for collocation extraction from German texts.
A similar conclusion is reached by Seretan (2008), who evaluated pattern-based methods against parsing-based collocation extraction for both English and French. She finds that in particular the recall of the extraction procedures can be increased if parsed data are used instead of only POS-tagged material. In a mini-experiment on the comparison of a chunker (Kermes 2003) with parsing-based extraction, we found a similar discrepancy: on the top 250 collocation candidates identified by each method (top 250 by frequency and significance), on one and the same corpus, almost no differences in precision could be found between the chunking-based extractor and the parsing-based one. But the parsed data provided almost twice as much material, i.e. a massively higher recall (cf. Heid et al. 2008). It seems to be easier for a linguist to write sufficiently restrictive extraction rules (i.e. to control precision) than to have a clear view of the cases missed by the extractor (i.e. to avoid losses in recall).

3.3. An architecture for the extraction and classification of German noun+verb collocations

On the basis of the results of the experiments described above, it seems natural to use syntactically preprocessed text corpora for collocation candidate extraction. For the work described in this paper (e.g. the data discussed above, in section 2), we thus used Schiehlen's (2003) dependency parser. It produces a tabular representation of dependency structures as its output and it has acceptable coverage; furthermore, it annotates all cases of local syntactic ambiguity and of non-attachment in the analysis result: if necessary, one can thus skip those parsing results which seem to be ambiguous (e.g. with respect to case) or which may not have been assigned enough structure (i.e. where most items are directly attached to the top node of the dependency tree).
Figure 1 contains a tree representation of the sentence in (21), as well as the tabular output produced by the parser.
(21) die zweite Studie lieferte ähnliche Ergebnisse
('the second study provided similar results')

[Tree representation of (21) not reproduced here.]

0  Die         ART    d         |               2   SPEC
1  zweite      ADJA   2.        |               2   ADJ
2  Studie      NN     Studie    Nom:F:Sg        3   NP:nom
3  lieferte    VVFIN  liefern   3:Sg:Past:Ind*  -1  TOP
4  ähnliche    ADJA   ähnlich   |               5   ADJ
5  Ergebnisse  NN     Ergebnis  Akk:N:Pl        3   NP:akk
6  .           $.     .         |               -1  TOP

Figure 1. Dependency structure output used for collocation extraction: tree representation and internal format of the parser by Schiehlen (2003) (from: Fritzinger et al. 2009)

To extract collocation candidates from the parsed output, we use pattern


matching at the level of grammatical functions: for example, we extract
combinations of main verbs and nouns, where the nouns are in a direct
object relation to the respective main verbs. Such data can be read off the
parsing output. As the parser marks ambiguities, we could also just work
with non-ambiguous sentences (so far, the quality achieved by using both
unambiguous and ambiguous data is however satisfactory).
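A minimal Python sketch of this pattern matching over the tabular format of figure 1 is given below; the sample data repeat the analysis of sentence (21), but the code itself is only an illustration of the general idea, not the extraction tool actually used.

```python
# Sketch: extracting verb+object pairs (nouns in an NP:akk relation to a main
# verb) from the tabular dependency format shown in figure 1. Illustrative
# only; the real extraction handles many more features and ambiguity marks.

SAMPLE = """\
0 Die ART d | 2 SPEC
1 zweite ADJA 2. | 2 ADJ
2 Studie NN Studie Nom:F:Sg 3 NP:nom
3 lieferte VVFIN liefern 3:Sg:Past:Ind -1 TOP
4 ähnliche ADJA ähnlich | 5 ADJ
5 Ergebnisse NN Ergebnis Akk:N:Pl 3 NP:akk
6 . $. . | -1 TOP"""

def verb_object_pairs(table: str):
    rows = [line.split() for line in table.splitlines()]
    # token id -> (lemma, part of speech, id of head token, grammatical function)
    tokens = {int(r[0]): (r[3], r[2], int(r[5]), r[6]) for r in rows}
    pairs = []
    for lemma, tag, head, func in tokens.values():
        if tag == "NN" and func == "NP:akk" and head in tokens:
            head_lemma, head_tag, _, _ = tokens[head]
            if head_tag.startswith("VV"):       # full (main) verb
                pairs.append((head_lemma, lemma))
    return pairs

print(verb_object_pairs(SAMPLE))  # -> [('liefern', 'Ergebnis')]
```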
The parsing output not only contains the word form found in the sentence analyzed (second column of figure 1), but also its lemma and its morphosyntactic features (case, number, etc., cf. third and fourth column of figure 1). In the extraction work, we rely on these data, as they give hints on the morphosyntactic properties of the collocations extracted: the morphosyntactic features, as well as the form of the determiner, possible negation elements in the noun phrase or in the verb phrase, possible adverbs, etc. are extracted along with the lemma and form of base and collocate. This multiparametric extraction is modular: new features or context partners can be added if this is necessary. Similarly, additional patterns, which in this case cover the sentence as a whole, can be used to detect passives and/or to identify the constituent order models involved. In this way, we get data for the specific analyses discussed in section 2.
All extracted data for a given pair of base and collocate are stored in a relational database, along with the sentence where these data have been found. An example of a data set for a sentence is given in (22), below, for sentence (23). For this example, the database contains information about the noun and verb lemma (the verb here being a compound: geltend machen, 'put forward'), but also about the number, the kind of determiner present in the NP (here: null, i.e. none), the presence of the passive (including the lemma of the passive auxiliary, here werden), the sentence type (verb-second), modifiers found in the sentence (adverbs and prepositional phrases) and about the fact that the verb is embedded under a modal (for details on the procedures, see Heid and Weller 2008).
(22) n_lemma        | Grund
     v_lemma        | geltend machen
     number         | Pl
     type_of_det    | null
     active/passive | passive
     pass.auxiliary | werden
     sent_type      | v-2
     modifiers      | auch (ADV), PP:für:Errichtung, PP:für:Land
     modal          | können
     preposition    | null
     chunk          | Solche Gründe können auch für die Errichtung eines gemeinsamen Patentamtes für die Länder geltend gemacht werden
(23) Solche Gründe können auch für die Errichtung eines gemeinsamen Patentamtes für die Länder geltend gemacht werden.
('Such reasons can also be put forward for the installation of a common patent office for the Länder.')
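A minimal sketch of such a relational storage, using SQLite and the field names of (22) (with '/' and '.' replaced by '_' so that they are valid column names), is given below; it illustrates the general design, not the database layout actually used.

```python
# Sketch: storing one per-sentence record (cf. (22)/(23)) in a relational
# table and reading off a cumulative view analogous to figure 2, below.
# Illustrative only; column names are adapted from (22).

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE collocation_instances (
        n_lemma TEXT, v_lemma TEXT, number TEXT, type_of_det TEXT,
        active_passive TEXT, pass_auxiliary TEXT, sent_type TEXT,
        modifiers TEXT, modal TEXT, preposition TEXT, chunk TEXT
    )
""")

conn.execute(
    "INSERT INTO collocation_instances VALUES (?,?,?,?,?,?,?,?,?,?,?)",
    ("Grund", "geltend machen", "Pl", None, "passive", "werden", "v-2",
     "auch (ADV), PP:für:Errichtung, PP:für:Land", "können", None,
     "Solche Gründe können ... geltend gemacht werden"),
)

# cumulative view over all stored instances of a base/collocate pair
query = """
    SELECT COUNT(*), n_lemma, v_lemma, type_of_det, number, active_passive
    FROM collocation_instances
    GROUP BY n_lemma, v_lemma, type_of_det, number, active_passive
    ORDER BY COUNT(*) DESC
"""
for row in conn.execute(query):
    print(row)
```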

When it comes to interpreting the data for a given collocation, we extract


all individual records, sentence by sentence, for a given pair of base and
collocate from the database. We sum up over the features and, if necessary,
combine these observed frequencies with data from association measures
(to identify those lemma combinations which are significant) and with a
calculus of the relative proportions of individual feature values (e.g. the
relationship between singular and plural). For the latter, we use a calculus
proposed by Evert (2004). Such quantitative analyses provide a lower
bound of a confidence interval for the percentage of cases that display a
certain feature, e.g. a preference for singular null articles. An example is
given in figure 2: it shows absolute frequencies of different parameter dis-
tributions for the collocations Rechnung ausstellen ('[to] make out a bill')
and Rechnung tragen ('[to] take into account'), in data from the Acquis
Communautaire corpus of the European Joint Research Centre (JRC,
Ispra). Rechnung tragen clearly prefers a null article in the singular, but it
allows both active and passive.

f     n_lemma   v_lemma     type_of_det  number  active/passive
5     Rechnung  ausstellen  def          Sg      passive
4     Rechnung  ausstellen  indef        Sg      active
4     Rechnung  ausstellen  def          Sg      active
1     Rechnung  ausstellen  def          Pl      active
1387  Rechnung  tragen      null         Sg      active
262   Rechnung  tragen      null         Sg      passive
136   Rechnung  tragen      null         Sg      passive
1     Rechnung  tragen      dem          Sg      active
1     Rechnung  tragen      poss         Sg      active
1     Rechnung  tragen      def          Sg      passive

Figure 2. Sample cumulative database entry for Rechnung ausstellen and Rechnung tragen in the collocations database
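For illustration, such a lower bound can be computed with a standard confidence interval for a binomial proportion (here: the Wilson score interval); this is a generic sketch and not necessarily the exact calculus proposed by Evert (2004).

```python
# Sketch: lower limit of a binomial confidence interval (Wilson score) for
# the proportion of instances showing a feature, e.g. a singular null
# article. Generic illustration; see Evert (2004) for the calculus used.

import math

def wilson_lower_bound(k: int, n: int, z: float = 1.96) -> float:
    """Lower limit of the Wilson score interval for k 'successes' out of n."""
    if n == 0:
        return 0.0
    p = k / n
    denominator = 1 + z**2 / n
    centre = p + z**2 / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (centre - margin) / denominator

# Rechnung tragen in figure 2: 1387 + 262 + 136 = 1785 of 1788 instances
# have a null article
print(round(wilson_lower_bound(1785, 1788), 3))  # -> 0.995
```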

From the database, we can read off relevant combinations of morphosyn-


tactic features and combine these data for manual inspection.

4. Conclusions

In this paper, we have discussed current approaches to the extraction of collocation candidates from text corpora. Emphasis was on the linguistic properties which need to be described in detail, in addition to mere knowledge about lexical combinatorics. We discussed the differences between configurational languages (e.g. English), inflecting languages (e.g. Czech) and German with respect to the devices necessary to extract grammatical relations between base and collocate (e.g. verb+object pairs), and we presented a parsing-based architecture for German which allows us to extract such relational pairs.
We feed all extraction results into a database, so as to be able to investigate the behaviour of collocational items in a multiparametric way. Even though this work is still at its beginning, the database can already be used as a tool for research: with its help, we were able to analyze the preferences of collocations with respect to the three German word order models, discussed in section 2.4.
In the future, work on large corpora from different sources (e.g. regional variants, different text types, different degrees of formality, different domains, etc.) and a thorough awareness of these metadata should allow us also to undertake investigations into the variation potential of the German language with respect to collocations. Furthermore, we expect to be able to analyze in detail the interplay between idiomatic fixedness (at the morphosyntactic level) and grammatical constraints: if a noun in a collocation like zu + Schluss + gelangen ('[to] arrive at + conclusion') subcategorizes for a dass-clause, the presence of this clausal complement will enforce the definiteness of the noun, i.e. its definite article, and, in the particular case, even a non-fused preposition+article group: zu dem Schluss kommen, dass ... (and not: zum Schluss kommen, dass ...); by looking at other cases of the same parameter constellation (singular, preposition, sentence complement after the collocation), we hope to be able to more closely inspect such cases of the interplay between grammar and collocation, i.e. between open choice and idiomaticity.

Notes

1 The examples are taken from ongoing work by Marion Weller (IMS Stuttgart) on a corpus of German newspaper texts from 1992 to 1998, comprising material from Stuttgarter Zeitung, Frankfurter Rundschau, Die Zeit and Frankfurter Allgemeine Zeitung, a total of ca. 240 million words. These sources are indicated by the title and the onset of the citation in the IMS version of the respective corpora. The text of Frankfurter Rundschau (1993/94) has been published by the European Corpus Initiative (ELSNET, Utrecht, The Netherlands) in its first multilingual corpus collection (ECI-MC1). The other newspapers, as well as the juridical corpus cited below, have been made available to the author under specific contracts for research purposes.
2 A related observation has to do with verb-final contexts: there, the support verb and the pertaining noun tend to be adjacent. Only a few types of phrases can be placed between the two elements, e.g. adverbs, pronominal adverbs or prepositional phrases. However, in the data used for our preliminary investigations, this criterion does not help much to distinguish idiomatic groups from non-idiomatic ones.

References

Abney, Steven
1991 Parsing by chunks. In Principle-Based Parsing, Robert Berwick, Steven Abney and Carol Tenny (eds.), 257-278. Dordrecht: Kluwer.
Bahns, Jens
1996 Kollokationen als lexikographisches Problem: Eine Analyse allgemeiner und spezieller Lernerwörterbücher des Englischen. Lexicographica Series Maior 74. Tübingen: Max Niemeyer.
Benson, Morton, Evelyn Benson and Robert Ilson
1986 The Lexicographic Description of English. Amsterdam/Philadelphia: John Benjamins.
Burger, Harald
1998 Phraseologie: Eine Einführung am Beispiel des Deutschen. Berlin: Erich Schmidt Verlag.
Cahill, Aoife, Marion Weller, Christian Rohrer and Ulrich Heid
2009 Using tri-lexical dependencies in LFG parse disambiguation. In Proceedings of the LFG09 Conference, Miriam Butt and Tracy Holloway King (eds.), 208-221. Stanford: CSLI Publications.
Evert, Stefan
2004 The statistical analysis of morphosyntactic distributions. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004), 1539-1542. Lisbon: ELRA.
Evert, Stefan
2005 The Statistics of Word Cooccurrences: Word Pairs and Collocations. Stuttgart: University of Stuttgart and http://www.collocations.de/phd.html.
Evert, Stefan
2009 Corpora and collocations. In Corpus Linguistics: An International Handbook, Anke Lüdeling and Merja Kytö (eds.), 1212-1248. Berlin/New York: Walter de Gruyter.
Fazly, Afsaneh and Suzanne Stevenson
2006 Automatically constructing a lexicon of verb phrase idiomatic combinations. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL-2006), 337-344. Trento, Italy, April 2006. http://www.cs.toronto.edu/~suzanne/papers/FazlyStevenson2006.pdf.
Fellbaum, Christiane, Undine Kramer and Gerald Neumann
2006 Corpusbasierte lexikographische Erfassung und linguistische Analyse deutscher Idiome. In Phraseology in Motion I: Methoden und Kritik, Annelies Buhofer and Harald Burger (eds.), 43-56. Basel: Baltmannsweiler.
Firth, John Rupert
1957 Modes of Meaning. In Papers in Linguistics 1934-51, John Rupert Firth (ed.), 190-215. Oxford: Oxford University Press.
Fritzinger, Fabienne, Ulrich Heid and Nadine Siegmund
2009 Automatic extraction of the phraseology of a legal subdomain. In Proceedings of the XVII European Symposium on Languages for Specific Purposes, Århus, Denmark. http://www.ims.uni-stuttgart.de/~fritzife/pub.html.
Grossmann, Francis and Agnès Tutin
2003 Quelques pistes pour le traitement des collocations. In Les collocations: analyse et traitement, Francis Grossmann and Agnès Tutin (eds.), 5-21. Amsterdam: De Werelt.
Hausmann, Franz Josef
1979 Un dictionnaire des collocations est-il possible? Travaux de linguistique et de littérature XVII (1): 187-195.
Hausmann, Franz Josef
1989 Le dictionnaire de collocations. In Wörterbücher, Dictionaries, Dictionnaires: Ein internationales Handbuch, Franz Josef Hausmann, Oskar Reichmann, Herbert-Ernst Wiegand and Ladislav Zgusta (eds.), 1010-1019. Berlin: De Gruyter.
Hausmann, Franz Josef
2004 Was sind eigentlich Kollokationen? In Wortverbindungen - mehr oder weniger fest, Institut für Deutsche Sprache Jahrbuch 2003, Kathrin Steyer (ed.), 309-334. Berlin: De Gruyter.
Heid, Ulrich
1994 On ways words work together: Topics in lexical combinatorics. In Proceedings of the VIth Euralex International Congress, Willy Martin et al. (eds.), 226-257. Amsterdam: Euralex.
Heid, Ulrich
1998 Building a dictionary of German support verb constructions. In Proceedings of the 1st International Conference on Linguistic Resources and Evaluation, Granada, May 1998, 69-73. Granada: ELRA.
Heid, Ulrich
2005 Corpusbasierte Gewinnung von Daten zur Interaktion von Lexik und Grammatik: Kollokation - Distribution - Valenz. In Corpuslinguistik in Lexik und Grammatik, Friedrich Lenz and Stefan Schierholz (eds.), 97-122. Tübingen: Stauffenburg.
Heid, Ulrich
2008 Computational phraseology: An overview. In Phraseology: An Interdisciplinary Perspective, Sylviane Granger and Fanny Meunier (eds.), 337-360. Amsterdam: John Benjamins.
Heid, Ulrich, Fabienne Fritzinger, Susanne Hauptmann, Julia Weidenkaff and Marion Weller
2008 Providing corpus data for a dictionary for German juridical phraseology. In Text Resources and Lexical Knowledge: Selected Papers from the 9th Conference on Natural Language Processing, KONVENS 2008, Angelika Storrer, Alexander Geyken, Alexander Siebert and Kay-Michael Würzner (eds.), 131-144. Berlin: Mouton de Gruyter.
Heid, Ulrich and Rufus H. Gouws
2006 A model for a multifunctional electronic dictionary of collocations. In Proceedings of the XIIth Euralex International Congress, 979-989. Alessandria: Edizioni dell'Orso.
Heid, Ulrich and Marion Weller
2008 Tools for collocation extraction: Preferences for active vs. passive. In Proceedings of LREC-2008: Linguistic Resources and Evaluation Conference, Marrakesh, Morocco. CD-ROM.
Helbig, Gerhard
1979 Probleme der Beschreibung von Funktionsverbgefügen im Deutschen. Deutsch als Fremdsprache 16: 273-286.
Ivanova, Kremena, Ulrich Heid, Sabine Schulte im Walde, Adam Kilgarriff and Jan Pomikálek
2008 Evaluating a German sketch grammar: A case study on noun phrase case. In Proceedings of LREC-2008: Linguistic Resources and Evaluation Conference, Marrakech, Morocco. CD-ROM.
Keil, Martina
1997 Wort für Wort: Repräsentation und Verarbeitung verbaler Phraseologismen (Phraseolex). Tübingen: Niemeyer.
Kermes, Hannah
2003 Offline (and Online) Text Analysis for Computational Lexicography. Dissertation, IMS, University of Stuttgart.
Kilgarriff, Adam, Pavel Rychlý, Pavel Smrž and David Tugwell
2004 The Sketch Engine. In Proceedings of the XIth EURALEX International Congress, G. Williams and S. Vessier (eds.), 105-116. Lorient: Université de Bretagne Sud.
Krenn, Brigitte
2000 The Usual Suspects: Data-Oriented Models for the Identification and Representation of Lexical Collocations. Saarbrücken: DFKI, Universität des Saarlandes.
Krenn, Brigitte and Gregor Erbach
1994 Idioms and Support Verb Constructions. In German in Head Driven Phrase Structure Grammar, John Nerbonne, Klaus Netter and Carl Pollard (eds.), 297-340. Stanford, CA: CSLI Publications.
Krenn, Brigitte and Stefan Evert
2001 Can we do better than frequency? A case study on extracting PP-verb collocations. In Proceedings of the ACL Workshop on Collocations, 39-46. Toulouse: Association for Computational Linguistics.
Lapshinova, Ekaterina and Ulrich Heid
2007 Syntactic subcategorization of noun+verb multiwords: Description, classification and extraction from text corpora. In Proceedings of the 26th International Conference on Lexis and Grammar, Bonifacio, 2-6 October 2007.
Mel'čuk, Igor A., Nadia Arbatchewsky-Jumarie, Léo Elnitsky, Lidija Iordanskaja and Adèle Lessard
1984-99 Dictionnaire explicatif et combinatoire du français contemporain: Recherches lexico-sémantiques I-IV. Montréal: Presses Universitaires de Montréal.
Ritz, Julia
2006 Collocation extraction: Needs, feeds and results of an extraction system for German. In Proceedings of the Workshop on Multiword Expressions in a Multilingual Context, 11th Conference of the European Chapter of the Association for Computational Linguistics, Trento, Italy, April 2006, 41-48. Trento: Association for Computational Linguistics.
Ritz, Julia and Ulrich Heid
2006 Extraction tools for collocations and their morphosyntactic specificities. In Proceedings of the Linguistic Resources and Evaluation Conference, LREC 2006, Genova, Italia, 2006. CD-ROM.
Schafroth, Elmar
2003 Kollokationen im GWDS. In Untersuchungen zur kommerziellen Lexikographie der deutschen Gegenwartssprache I. "Duden: Das große Wörterbuch der deutschen Sprache in zehn Bänden", Herbert Ernst Wiegand (ed.), 397-412. Tübingen: Niemeyer.
Schiehlen, Michael
2003 A cascaded finite-state parser for German. In Proceedings of the Research Note Sessions of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2003), Budapest, April 2003, 133-166. Budapest: Association for Computational Linguistics.
Scott, Mike
2008 WordSmith Tools, version 5. Liverpool: Lexical Analysis Software.
Seretan, Violeta
2008 Collocation Extraction Based on Syntactic Parsing. Dissertation No. 653, Dept. de linguistique, Université de Genève, Genève.
Siepmann, Dirk
2005 Collocation, colligation and encoding dictionaries. Part I: Lexicological Aspects. International Journal of Lexicography 18 (4): 409-444.
Sinclair, John McH.
1991 Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Smadja, Frank
1993 Retrieving collocations from text. Computational Linguistics 19 (1): 143-177.
Storrer, Angelika
2006 Zum Status der nominalen Komponente in Nominalisierungsverbgefügen. In Grammatische Untersuchungen: Analysen und Reflexionen, Eva Breindl, Lutz Gunkel and Bruno Strecker (eds.), 275-295. Tübingen: Narr.

Corpora

JRC Acquis  Acquis Communautaire, described in the following paper: Steinberger, Ralf, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž Erjavec, Dan Tufiş, Daniel Varga (2006). The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. Proceedings of the 5th International Conference on Language Resources and Evaluation. Genoa, Italy, 24-26 May 2006. The corpus is available at http://langtech.jrc.it/JRC-Acquis.html.

See note 1 for further sources.


Author index
Aarts,Jan,18 Calde 1 ra,S 1 lv 1 aM.G.,91
Abdel Rahman, Rasha, 92 Capocci, Andrea, 91
Abney, Steven, 269, 301 Carreras,Xav ie r,278,279
Adolphs,Svenja,49 Carter, Ronald, 115
Aho, Alfred Vamo, 262 Cecchi,GuillennoA.,91
Ahulu, Samuel, 160 Channell, Joanna, 8, 161
Algeo, John, 201 Cheng, Winnie, 10, 216
Altenberg,Bengt,47,127 ChevaHer, Jean-Claude, 65
Andersen, 0rvm, 263 Choi, Key-Sun, 91
Andrade,R.F.S.,91 Chuquet,Helene,72,73,78
Antiqueira, Lucas, 91 ColHer, Alex, 114-116
Asher,JamesJ.,153 Conklm,Kathy,127
Atkins, Beryl T., 49 Connor, Ulla, 41, 49, 199,211
Conzett, Jane, 128, 132
Bahns, Jens, 285 Corso,Grlberto,91
Ballard, Michel, 73 Cosenu,Eugemo,43,147
Barfield, Andy, 131, 137 Coulthard,R.Malcom,5,6,20,179
Barlow, Michael, 211, 214 Cowie, Anthony Paul, 47, 49, 132
Barnbrook, Geoff, 213 Coxhead,Averil,127,138
Bashyam,Vijayaraghavan,278 Croft, William, 30, 47
Bauer, Laurie, 136 Cruse, D.Alan, 30, 47
Behrens,H ei ke,32,48 Crystal, David, 160
Belica, Cyril, 156 Csardi,Gabor,107
Benson, Evelyn, 202, 284 Cullen, Richard, 140
Benson, Morton, 202, 284
Besch, Werner, 187 Dagneaux,Estelle,138
Biber, Douglas, 47, 148, 179, 191, Daley, Robert, 3, 7, 8, 18, 20, 87,
211,239,240 113
Bird, Steven, 270 Darbelnet,Jean,76
Bordag, Stefan, 91 Dasgupta,Partha,91
Bowker,Lynne,211 De Cock, Sylvie, 137
Bozicevic,Miran,91 de la Torre, Rita, 153
Braine, Martin Dan Isaac, 153 DeCarrico,JeanetteS.,127,129,133
Brazil, David, 5 Dechert,Hans-Wilhelm,128
Brooks, Philip, 279 Delport, Mane-France, 65
Bryant, Michael, 231 Deveci,Tanju,132
Buchholz, Sabine, 276, 278 Domazet,Mladen,91
Burger, Harald, 283, 288
Bybee,Joan,32 Ellis, Nick, 49, 127, 129, 229
Erbach,Gregor,288
Cahill,Aiofe,293 Evert, Stefan, 101, 232, 285, 298,
Caldarelli,Guido,91 300,301,305

Faulhaber,Susen,31 Grossmann,Franc1S,283
Fazly,Afsaneh,297,299 GroB, Annette, 74, 76
Feilke, Helmut, 62 Guan,Jihong,91
FeUbaum, Christians 287
Ferrer iCancho, Ramon, 91 Hamilton, Nick, 132
Ferret, Olivier, 91 Han,ZhaoHong,152
Fillmore, Charles, 40, 41, 49, 199 Handl,Susanne,47,262
Firth, John Rupert, 5, 6, 12, 19, 49, Harris, ZelhgS., 213
147,211,212,221,223,284 Harwood, Nigel, 124, 126, 131
Fischer, Kerstm, 47, 49 Hasselgren, Angela, 130
Fuzpatnck,Tess,136 Hausmann, Franz Josef, 28-30, 32-
Fletcher, William H., 211, 214 34,37,47,60,61,65-68,77,78,
Fodor, Jerry Alan, 258 200,208,230,236,283,285,
Francis, Gill, 8, 47, 67 286,291
Francis, W.Nelson, 7, 18 Hausser, Roland, 247, 248, 256, 259,
Frath,Pierre,61,62,65,78 260,262-264
Frey,Enc,49,229 Heaton, John B., 161
Fritzinger,Fabienne,303 Heid,Ulrich, 283, 284, 286-288,
293,296,299,302,304
Gabnelatos,Costas,133,140 Helbig, Gerhard, 287, 288
Gallagher, John D., 60, 65, 73, 78 Herbst,Thomas,28,29,31,39,41,
Galpm,Adam,49,229 47,49,50,78,161,203,252
Garside, Roger, 231, 243 Heringer, Hans Jurgen, 156
Gavioli, Laura, 211 Hinrichs, Lars, 187
Gilquin,Gaetanelle,28,127,138 Hockemnaier, Julia, 263
Girard, Rene, 264 Hoey, Michael, 24, 66, 67, 78, 123,
Gledhill, Christopher, 61, 62, 65, 78 128
Glaser,Rosemarie,41 Hoffmann, Sebastian, 160, 180
G6rz,Gunther,280 Hofland,Knut,47
Gotz, Dieter, 50, 153, 156, 239 Hoover, David, 3
G6tz-Votteler,Katrm,50,156,231 Hopper, Paul, 183
Goldberg, AdeleE., 31, 40, 44,47- Howarth, Peter, 127, 161
49 Hu,Guobiao,91
Gouverneur, Celine, 133 Huang, Wei, 263
Gouws,RufusH.,284 Hugon,Cla1re,134
Grandage, Sarah, 49 Hundt, Marianne, 160
Granger,Sylvrane,28,41,46,47, Hunston,Susan,8,14,47,67,123,
127,134,138,153,161,200,230 212,214,224
Grant, Lynn, 136 Hyland, Ken, 137,211
Greaves, Chris, 10,211,214,216
Green, Georgia, 67-69, 74, 260 Ilson, Robert, 202, 285
Greenbaum, Sidney, 49 Ivanova,Kremena,301,302
Gries, Stefan Th., 47, 50
Grimm, Anne, 201, 205 Jackson, Dunham, 116,218

Jalkanen, Isaac, 49, 229 Loper, Edward, 270


Janulevieiene,Violeta,132,133 Lourenco,G.M.,91
Johansson, Stig, 17, 21, 47 Lowe, Charles, 125, 126, 130, 136
Johns, Tim, 132 Ludw lg ,Bernd,280
Jones, Susan, 3, 7, 8, 18, 20, 87, 113 Leon, Jaquelme, 212

Kaszubski,Przemyslaw,160 MacWhinney, Brian James, 262, 264


Katz,Jerrold Jacob, 258 Magnusson, Camilla, 91
Kavaliauskiene,Galina,132,133 Mair, Christian, 160, 174, 185, 187,
Kay, Paul, 41, 49, 199 240
Keil, Martina, 288 Makkai,Adam,30
Kenny, Dorothy, 65 Manca, Elena, 149
Kermes, Hannah, 301, 302 Manning, Elizabeth, 8, 67
Kilgarriff, Adam, 284, 300 Marcmkiewicz, Mary Ann, 263
Kinouchi,Osame,91 Marcus, Mitchell P., 263, 276, 277
Kirk, John M., 243 Marquez,Llms,278,279
Klein, Barbara, 186, 187, 191 Martinez, Alexandre Souto, 91
Klein, Ewan, 270 Masucci,A.P., 91
Klotz, Michael, 39, 78, 203 Mauranen,Anna,12,23,206,214
K6hler,Remhard,91 Maynard, Carson, 127, 129
Kramer, Undine, 287 Meara,PaulM.,91
Krenn,Bn gl tte,285,286,288,299 Meehan, Paul, 133
Knshnamurthy,Ramesh,8,18,93, Mehler, Alexander, 115
123,245 Melmger,Atissa,92
Krug, Manfred, 185 Meunier, Fanny, 138
Kundert,K.R.,116 Milroy, Lesley, 88
Kusudo,JoAnne,153 Milton, John, 137
Kucera,Henry,7,18 Miranda, J. G. V., 91
Mittmann,Bngitta,50,197,201,
Lai,Ying-Cheng,91 202,203,206
Langacker, Ronald W., 48 MiBler,Bettma,74,76
Lapshmova,Ekaterma,288 Montague, Richard, 264
Leech, Geoffrey, 49, 180, 185, 231, Moon, Rosamund, 8, 9, 208
240,243 Motter,AdilsonE.,91
Lehrberger, John, 213 Moura,AlessandroP.S.de,91
Lennon, Paul, 128 Mukherjee,Joybrato,47,159,160,
Lewis, Michael, 124-126, 129, 131- 170
133,136,137,139 Murtra,BernatCorommas,90
Li,Jianyu,91 Myles, Florence, 136
Lian,HoMian,160
L ie ven,Elena,32,48 Nattinger, James, 127, 129, 133
Lima,GilsonFranzisco,40,91 Neme, Alexis, 91
Lm,Haitao,91,262,263 Nesselhauf,Nadja,28,30,47,128,
Lobao, Thierry C, 91 137,161,163,170,171,174,236

Neumann, Gerald, 287 Sand, Andrea, 35, 160, 161, 165, 174
Nicewander, W.Alan, 116 Santorini, Beatrice, 263
Nichols, Johanna, 251 Saussure, Ferdinand de, 43, 218
Schafroth,Elmar,286
01iveira,OsvaldoN.,91 Schiehlen, Michael, 302, 303
Overstreet,Maryann,198 Schilk, Marco, 160
Schm 1 d,Hans-J6rg,36,47,48
Pacey, Mike, 114-116 Schmied, Josef, 160
Paillard,Michel,72,73,78 Sch mi tt,Norbert,49,127,229
Paquot,Magali,41,46,47,127,134, Schneider, Edgar W., 160
138,200 Schuller,Susen,41,252
Park, Young C, 91 Schur, Ellen, 91
Partington, Alan, 66 Scott, Mike, 198, 246, 285
Pawley, Andrew, 11,40 Selinker, Larry, 152
Pearson, Jennifer, 116,211 Sells, Peter, 263
Piatt, John, 160 Seretan,Violeta,302
Pollard, Carl, 259 Shei,Chi-Chiang,161
Polzenhagen, Frank, 160 S 1 epmann,D 1 rk,30,44,47,61-63,
Porto, Melma, 125, 127, 129 67,78,156,286
Proisl, Thomas, 49, 262 Sigman, Mariano, 91
Pulverness, Alan, 130, 139 Simpson-Vlach, Rita, 127, 129
Putnam, Hilary, 246, 248, 262 Sinclair, John McH., 1-14, 17-24,
27-29,32,37,38,41-43,45,59-
Quirk, Randolph, 18 63,66,67,71,77,78,87,89,93,
103,109,113,115,116,123,
Raab-F1scher,Rosw1tha,187,191 124, 126, 128, 130, 132, 134,
Ramshaw,LanceA.,276,277 135, 138, 139, 147, 154, 156,
Rayson, Paul, 115,231,239 159,179,197,206,208,211,
Reiss, Peter, 280 212,214,215,217,223,224,
Renouf, Antoinette, 8, 114-116, 130, 229,240,243,245,258,284,286
134,135,138 Siyanova, Anna, 127
Richards, Jack C, 128 Skandera, Paul, 160, 170
Risau-Gusman, Sebastian, 91 Smadja, Frank, 285
Ritz, Julia, 286, 299 Smith, Nicholas, 185, 187, 240, 272
Rodgers,G.J.,91 Scares, MarcioMedeiros, 91
Rodgers, Joseph Lee, 116 Sole, Richard V., 90
Romer.Ute, 174,200,211,214 Speares, Jennifer, 48
Rogers, Ted, 131 Steedman, Mark, 263
Rohdenburg,Gunter,180 Steels, Luc, 90
Rohrer, Christian, 293 Stefanci6,Hrvoje,91
Rosenbach,Anette,187 Stefanowitsch,Anatol, 44, 47-50
Stein, Stephan, 65
Sag, Ivan, 259 Stevenson, Suzanne, 297, 299
Salkoff, Morris, 77 Steyvers,Mark,90

Starrer, Angelika, 291 Valverde,Ser gl ,90


Stubbs,Michael,24,29,47,60,96, Vanharanta,Hannu,91
115 Vinay, Jean-Paul, 76
Svartvrk,Jan,18 Vitevitch, MichaelS., 91
Syder, Frances Hodgetts, 40
Szmrecsanyi,Benedikt,187 Waibel,Birgit,137
Warren, Martin, 10, 216
Taira, Ricky K., 278 Weber, Heidi, 160, 262
Tenenbaum, Joshua B., 91 Weller, Marion, 293, 294, 304, 306
Teubert, Wolfgang, 18, 20, 245, 248 Widdowson, Henry G., 140
Thompson, Geoff, 117,214 Wierzbicka, Anna, 251
Thornbury, Scott, 136 Wiktorsson, Maria, 137
Timmis, Ivor, 139, 140 Wilks, Clarissa, 91
Tjong Kim Sang, Erik F., 276, 278 Williams, Jessica, 161
Togmm-Bonelh,Elena,3,45,149, Willis, Dave, 132, 134, 136, 138
286 Wilson, Edward O., 13
Tomasello, Michael, 32, 48 Wolf, Hans-Georg, 160
Traugott, Ehzabeth, 181 Wolff, Dieter, 74, 76
Tremblay,Antome,129 Wolter, Brent, 91
Tsujii,Jun'ichi,279 Woolard, George, 126, 130, 132
Tsuruoka,Yosh im asan,279 Wray, Alison, 129, 135, 136, 203,
Turton,N lg elD.,161 206
Tutin, Agnes, 283
Zhang, Zhongzhi, 91
Uhng,Peter,31,47,49 Zhou,Jie,91
Ullman, Jeffrey David, 262 Zhou,Shuigeng,91
Underwood, Geoffrey, 49, 229 Zlatic,Vmko,91
Upton, Thomas A., 211
Subject index
active, 3, 4, 73, 133, 304, 305 cluster, 50, 74, 198-200, 203-206,
adjective, 30, 31, 33, 37, 47, 48, 64, 276,291
67,70,72,73,75-77,128,138, co-occurrence, 20, 22, 28, 31, 36, 42,
215,236,238,244,251,263, 43,46,47,49,66,87,90,91,94,
270,285-287,290,291,300 95,97,98,100-102,104,108,
adjunct, 278, 295 111,112,115-117,134,156,
adverb, 41, 49, 65, 70, 72, 73, 95, 229-232,240
133,200,203,285,290,291, co-selection, 9, 27, 154
304,307 Cobuild,31,130,139
adverbial, 41, 200, 206 cognitive, 28, 32, 35, 36, 38, 42, 45,
affected, 273 103,113,246,247,249,251,
agent, 278, 279 259-261
algorithm, 93, 100, 105, 117,254, coherence, 89
263 colligation, 78, 179
ambigmty, 132, 240, 244, 300, 303 collocate, 11,20,22,30,33,37,38,
annotation, 179, 237, 244, 274, 276, 44,47,49,62,65,75,90,91,94,
300-302 100,102-105,107,109,111-114,
argument, 31, 251, 253, 255, 259, 138,152,160,234,283,285,
263 286,296,304-306
argument structure, 31, 263 collocation, 1,3,7,9, 11, 19-21,23,
aspect, 252 28-30,32-37,39,41,43-45,47,
association measure, 231, 232, 285, 49,50,60,61,63,65-68,70,72,
298,299,305 74-79,82,85,87-95,101-104,
107,108,112-116,123,125,
base, 30, 32, 33, 44, 219, 253, 257, 128, 133, 136, 137, 147, 149,
259,262,271,274,285,286, 151, 152, 155, 156, 159, 160,
304-306 163-166, 168, 169, 172-174, 179,
200,212,214,229,230,232,
case, 300-304 233,235,236,238,240,244,
category, 41, 45, 46, 61, 76, 77, 133, 258,259,262,270,283-300,
190,221,231,252,269,285,286 302-306
Chinese, 91, 263 collocator,230
choice, 27, 29, 31-33, 37, 38, 40, 42, constructional, 50
44-46,59,63,65,69,77,133, competence, 4, 113, 126, 128, 152
165,190,306 complement, 31, 38, 41, 49, 59, 126,
chunk, 12, 28, 29, 34, 40-42, 45, 132,181,183-185,191,200,
124, 129, 132, 147, 149, 151- 223,260,289,306
156,179,180,190,206,207, complementation, 167-169,231,289
214,269-274,276-280,285,286, complex conjunction, 200
290,301,304 complex preposition, 49, 180,200
chunking, 270-274, 276, 278, 301, compound, 29, 32-37, 39, 41, 48,
302 219,272,304

computational linguistics, 115,283 discourse, 1-3, 5, 6, 13, 20, 24, 88,


concordance, 9, 10, 96, 114, 166, 95, 104, 127, 153, 183, 189, 190,
214,217-223,244 198,206,208,211-213,223,
constituent, 49, 61, 199, 206, 207, 246,264,286
216,259,271,274,292,300, discourse analysis, 1-3, 5, 6, 13, 20,
301,304 88, 153
construction, 2, 27-29, 31, 37, 40, Dutch, 162, 286
41,45-47,49,62,64,67,70,73,
74,90,126,136,156,172,174, elicitation, 160
181,183,184,190,199,200, emergence, 188
202,203,213,218,248,256, emergentism,262
271,272,274,287,288,290,299 encoding idiom, 30, 34
construction grammar, 28, 29, 46, English, 2, 4, 6-8, 11, 13, 14, 17, 18,
47,49,62,126,200 20,21,23,24,30,31,33-41,46,
context, 17, 22, 42, 69, 70, 72, 74, 48-50,60,67-76,78,79,83,84,
87,89,90,94,99,101,103,109, 91,104,117,137-139,148-150,
111,112,114,116,123,128, 152, 154-156, 159-166, 169-171,
129,183,203,205,217,221, 173,174,179-191,197-204,207,
222,247,249-251,253,259-262, 208,211-213,216,220-222,224,
264,287,304 230,231,243,250,251,256,
contrasts, 73, 128, 294 258,261,272-274,278,279,
core, 22, 78, 125, 126, 206, 211, 284-286,298-300,302,306
214,252,253,255-257,259, error, 113, 128, 138,218,237,269,
260,263 272,277
corpus-based, 3, 21, 59, 134, 137, extended units of meaning, 9, 40, 45
159, 160, 179, 185, 187, 188 extraction, 91, 102, 214, 216, 284,
corpus-driven, 123, 132 286-288,297-306
culture, 7, 160
figurative, 136
database, 216, 230, 239, 240, 251, fixedness, 134, 203, 205, 287, 288,
255,261,304-306 297-299,306
declarative, 13, 263 foreign language, 28-30, 42, 43, 60,
dependency, 72, 87, 103, 203, 205, 76,113,123,130,133,149,150,
255,278,302 152,154,156,230,246
descriptive, 45, 46, 190, 212, 245 foreign language teaching, 30, 43,
determiner, 41, 270, 296, 304 60, 149
diachronic, 180, 182, 184, 185, 188, formula, 22, 49, 90, 103-106, 108-
190 110,130,197,200,204,205,
dialect, 161 208,214,285
dictionary, 2, 8, 9, 21, 30, 33-35, 39, frame, 40, 138, 214-217, 219, 223
40,46,49,50,76,77,82,89, free combination, 33, 39, 45, 229
103,111,130,133,153-156, French, 63, 67-76, 79, 162, 250, 302
200,213,216,243,245,248, frequency, 8-11, 29, 30, 32, 37, 38,
249,253,261,262,264 43-45,47,49,66,68,70,75,92,

94,95,98,100,102,106,110, idiom principle, 27, 28, 38-40, 42,


115,116,129-131,134,135, 43,45,59-64,66-68,77,123,
138, 147, 148, 161-166, 169, 147,152,156,159,197,208,
184-188,197-200,203,204,206, 224,229,284,287
213,214,217,219-223,230-232, idiomatic expression, 47, 49, 137,
234,236,244,245,258,273, 283,288
285,290,293-298,302,305 idiomaticity,36,39,40,208,230,
function, 24, 41, 47, 60, 62, 95, 117, 231,236,237,287,297,299,306
123, 127, 130, 133, 134, 136, idiomatization,39,288,291,292,
138, 140, 147, 153, 165, 186, 297
191,201,203,207,208,214, idiosyncratic, 17, 28, 35, 283
221,230,232-235,237,238, imperative, 203
240,256,263,264,269,270, instrument, 278
275,276,283,285,298 intransitive, 243, 253
functor, 251, 253, 255, 259, 263 intuition, 68, 103, 113, 152,223,269
Italian, 149, 151, 154, 162
generalization, 31, 270, 296 item-specific, 33, 40, 43, 156
generative grammar, 28, 153, 258
German, 33-36, 38, 39, 48, 60, 78, Japanese, 63, 219
149-152, 154, 156, 162, 163,
166,170,197,203,207,231, language acquisition, 29, 88, 127,
250,272-274,278,279,284, 129, 130, 135
286,287,289,291-293,296-302, Latin, 300
306 learner, 2, 9, 28, 30, 31, 34, 43, 46,
grammar, 2, 4, 6, 8, 11, 12, 17-19, 49,50,59,123,124,126-140,
23,28,38,41,46,47,123-126, 149-156, 160-164, 166, 168-171,
132, 133, 139, 140, 179, 180, 174,230,237,238,243,245,261
183,189,190,200,201,212, learning, 110, 113, 123, 128-130,
213,224,254-256,258,259, 132, 133, 135, 136, 139, 149,
263,280,287,301,306 151,153,156,248,284
grammatical function, 303 lemma, 8, 47, 50, 91, 96, 216, 248,
grammatical relation, 255, 299-301, 303-305
306 lexeme, 45, 284, 290, 296
grammaticatization,189,216 lexical bundle, 136, 137, 148,214
lexical grammar, 212, 224
head, 41, 190, 191, 270, 301 lexical item, 8, 19, 22, 23, 28, 35, 36,
headword, 155 38,47,59,60,64,76,77,90,92,
93,96,103,107,112,114,116,
idiom, 27, 28, 36, 38-40, 42, 43, 45, 130,131,133,137,214,237
59,62,63,65,67,77,123,127, lexical unit, 3, 10, 47, 61, 96
133, 136, 149, 152, 156, 159, legalization, 139
160,197,199,208,224,229, lexicography, 1-2, 7-9, 24, 30, 43,
259,283,284,287,288,291 59,77,153,156,283,291
lexicon, 19, 36, 48, 130, 179, 248, 249, 257, 263, 284
lexis, 7, 8, 11, 18, 19, 21, 23, 24, 46, 123-126, 130, 131, 133, 140, 200
light verb, 174

meaning, 1, 3, 7, 8, 10-12, 21, 23, 27, 29-31, 33-39, 41-46, 48, 49, 64, 76, 87-89, 102, 103, 109, 111-114, 116, 123, 128, 130, 133, 134, 137, 138, 140, 147, 150-152, 156, 165, 172, 202, 204, 211, 212, 214, 216, 217, 219, 221-224, 245, 246, 248, 250, 251, 253, 257, 260-262, 264, 270, 274, 283, 285, 291
mental lexicon, 29, 36
metaphor, 65, 67, 68, 136, 259, 285
metaphorical, 10, 89
metonymy, 65
modifier, 167, 190, 244, 270, 304
motivation, 28, 127, 135, 137, 139, 269, 271
multi-word, 39, 45, 50, 61, 78, 124, 131-134, 199, 203, 214, 237
multilingual, 307

n-gram, 198, 214, 216, 217, 219, 223
Natural Language Processing (NLP), 269
negation, 291, 297, 304
network, 87, 88, 90-95, 98-102, 105-108, 110, 112-115, 123
New Englishes, 240
non-compositionality, 10, 127, 134, 259, 283, 297
noun, 22, 31, 37, 43, 47, 60, 62, 69, 73, 128, 155, 165, 167, 168, 187, 190, 191, 204, 217, 221, 234, 238, 251, 263, 271-273, 286, 287, 289-291, 300, 301, 303

object, 44, 149, 273, 274, 285-288, 295, 299-303, 306
objective, 7, 76, 246
obligatory, 59
OED, 35
omission, 96, 160
opacity, 283
opaque, 259, 283
open choice principle, 28, 31, 37, 38, 44, 59, 61, 64-66, 68, 229, 284
optional, 49, 271

paradigmatic, 45
paraphrase, 152
parole, 43, 61
parser, 259, 272, 273, 276, 277, 279, 280, 302, 303
parsing, 254, 261, 269-272, 274, 276, 278-280, 302, 303, 306
participant, 96, 224, 279
particle, 160, 200, 203, 206, 207, 286, 292
passive, 60, 125, 133, 274, 304, 305
patient, 278
pattern, 2, 9, 12, 13, 17, 18, 20-22, 27, 31, 32, 46, 47, 62, 63, 65-67, 69, 87-91, 94, 101-103, 110, 112, 113, 116, 123, 126, 129, 138, 154, 155, 165-167, 189, 201, 207, 211, 212, 215-217, 219, 221-224, 229, 249, 257, 258, 263, 264, 269, 286, 290, 292, 300, 302-304
performance, 147, 274, 277, 279
periphery, 28, 127, 206
phrasal, 8, 36, 123, 127, 137, 139, 160, 199, 214, 273, 274
phrasal verb, 36, 123, 127, 137, 160, 274
phrase, 6, 8, 11, 27, 28, 41, 42, 46, 49, 59, 60, 62, 64, 65, 68, 69, 74, 75, 78, 89, 125, 127, 129, 131-138, 150, 154, 180, 187, 188, 190, 191, 204, 206, 207, 214, 215, 219, 259, 273, 274, 278, 287, 289, 292, 294-296, 300-302, 304, 307
phraseological, 11, 13, 28, 30, 36, 38, 39, 41, 46, 61, 63, 71, 78, 126, 147, 159, 160, 162, 165, 170, 173, 174, 200, 211, 212, 216, 217, 221-223, 283
phraseological unit, 13, 36, 39, 41, 61, 63, 147, 170
phraseologism, 47
phraseology, 3, 9, 11, 13, 28, 39, 45-47, 77, 123, 127, 131, 132, 134, 136, 137, 159, 160, 165, 174, 199, 200, 212, 224, 229, 283, 296
pragmatic, 10, 41, 47, 60, 123, 190, 201, 203, 206, 208, 245, 253, 264, 269, 284
predicate, 219, 244, 270, 273
predicative, 32, 34, 36, 37
prefab, 197, 198, 203, 204, 206
preference, 23, 31, 78, 138, 165, 181, 182, 184, 221, 223, 284, 289-292, 295-299, 305, 306
premodification, 167, 168
preposition, 27, 41, 49, 60, 67, 69, 74, 78, 117, 133, 159, 170, 172, 173, 273, 304, 306
prepositional verb, 159, 170-174, 199
prescriptive, 189
probabeme, 39, 40, 43, 45, 65, 78, 203
processing, 36, 114, 136, 139, 206, 229, 230, 253, 262, 271-274, 279
productivity, 259
pronoun, 40, 42, 184, 200, 287
proplet, 251-258, 260, 263, 264
proposition, 214, 217
prosody, 78
proverb, 28, 127

quantitative, 7, 171, 231, 286, 305

recurrence, 43, 134, 147, 152
regularity, 165
routine formula, 208
rule, 60, 72, 129, 132, 200, 205, 206, 244, 251, 254, 271-274, 276, 277, 279, 300, 302
Russian, 162

salience, 102
schema, 247-249
selection, 21, 23, 112, 124, 128, 134, 135, 138, 258, 259, 283, 290, 296-298
semantic prosody, 22, 78, 123, 222
semantic role, 149, 272
semantics, 12, 19, 33, 40, 87, 216, 249, 252, 257-259, 264
sense, 10, 21, 34-36, 38, 76, 82, 103, 104, 111, 152, 166, 173, 219, 245
significance, 30, 37, 88, 95, 191, 200, 286, 298, 302
speech act, 147
spoken, 2, 5, 7, 14, 19, 113, 123, 127, 161, 162, 165, 179-181, 183, 184, 186-190, 197, 199-201, 204, 208, 220, 271, 272
statistics, 7, 37, 39, 47, 64, 87, 102, 181, 183, 187, 192, 201, 229, 238, 243-245, 263, 275, 278, 284, 285, 298, 299
storage, 32, 33, 38, 42, 45, 136, 139, 150, 251, 255, 260
structural, 42, 123, 124, 130, 181, 189, 269, 271, 286
structure, 6, 10, 23, 62, 132, 183, 184, 188, 190, 192, 202, 211, 220, 245, 246, 248, 251-253, 255-260, 264, 269, 271-273, 276, 279, 286, 295, 301-303
style, 2, 6, 12, 18, 69, 70, 73, 88, 112, 117, 135, 154, 187, 219, 234
subject, 42, 60, 62, 73, 204, 288, 298-300, 302
synchronic, 181
syntagmatic, 87, 127
tagging, 152, 237, 243, 244, 263, 274, 276-278
text, 1-3, 6, 7, 9-13, 17, 18, 20-24, 27, 28, 33, 35, 38, 45, 48, 49, 59, 61-63, 65, 66, 68, 70, 74, 79, 87-93, 96, 97, 100, 101, 103, 104, 109, 111-113, 115, 116, 123, 125, 128, 136, 138, 150, 162, 185, 186, 189, 191, 197, 198, 203, 211-214, 216, 218, 221, 222, 224, 229-232, 234-240, 244, 245, 269, 270, 274, 278, 284, 285, 287, 290, 295-302, 305, 306
theme, 1, 3, 180
token, 91, 94, 100, 104, 190, 215, 249, 250, 264, 269
transitive, 220, 243, 253, 258
translation, 39, 45, 59-61, 67-70, 73-79, 149, 153, 154, 222, 231-237, 263
type, 91, 93, 94, 102, 104, 215, 216, 249, 250

unit of meaning, 1, 3, 8, 11, 13, 36, 45, 211, 214, 223
usage, 31, 32, 147, 154, 170, 181, 184, 188, 189, 238, 240, 290
usage-based, 32
utterance, 2, 40, 46, 147, 150, 152, 189, 206, 221

valency, 28, 31, 32, 38, 41, 46, 49, 65, 149, 151, 156, 199, 252, 253, 255, 288, 291
variability, 166, 169, 173, 174, 215
variant, 8, 73, 165, 169, 170, 184, 186, 190, 191, 215, 253, 293, 294, 306
variation, 42, 49, 73, 149, 181, 215, 216, 296, 297, 306
variety, 4, 137, 159-161, 163, 165, 168-171, 173, 174, 181, 188, 189, 197-201, 203-208, 211
verb, 22, 23, 27, 31, 32, 37, 38, 40-44, 49, 60, 67, 68, 74, 76, 78, 89, 123, 134, 138, 155, 156, 159-161, 165, 170-174, 185, 186, 202, 203, 206, 207, 216, 243, 244, 251-253, 258, 260, 263, 270, 272-275, 278, 284-297, 299-304, 306, 307

word class, 204
word form, 123, 200, 230, 231, 233, 248, 249, 251, 254, 257, 259, 261-264, 269, 273, 303
written, 7, 14, 113, 123, 127, 161-165, 179-181, 183-191, 212, 220-223