Summary of Chapter 4: Foundational Quantitative Concepts in Corpus Linguistics
Corpora in Applied Linguistics, Susan Hunston
1. Introduction
● Focus: Introduces statistical methods foundational to corpus linguistics and emphasizes
frequency-based analysis techniques.
● Purpose: Explains how these methods provide insights into the general linguistic
features of corpora.
● Structure:
1. Techniques for analyzing words and phrases:
▪ Frequency and normalization (Section 4.3).
▪ Keyword analysis (Section 4.4).
▪ Collocation measures (Section 4.5).
▪ Lexical bundles (Section 4.6).
2. Techniques for analyzing categories:
▪ Multidimensional analysis (MDA, Section 4.7).
▪ Semantic annotation (Section 4.8).
● Key Point: These methods help researchers explore and compare linguistic patterns
across various text types and genres, complementing more focused studies like
concordance line analysis.
2. Frequency and Normalization
2.1 Frequency
● Definition: Measures how often a word or lemma occurs in a corpus.
o Example: The word "disappearance" occurs 632 times in the British National
Corpus (BNC).
● Issue: Raw frequency is uninformative without context. Researchers need comparative
frameworks to interpret these numbers.
2.2 Normalization
● Purpose: Accounts for corpus size to make word frequencies comparable.
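● Formula: Normalized frequency = (raw frequency ÷ corpus size) × basis of normalization.
o Example:
▪ Word frequency: 350.
▪ Corpus size: 15,000 tokens.
▪ Basis: 1,000 tokens.
▪ Normalized frequency = (350 ÷ 15,000) × 1,000 ≈ 23.3 occurrences per 1,000 tokens.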
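The normalization step can be sketched in a few lines of Python. This is a minimal illustration, not a tool from the chapter; the function name `normalized_frequency` is an assumption made for the example.

```python
def normalized_frequency(raw_count: int, corpus_size: int, basis: int = 1_000) -> float:
    """Return the number of occurrences per `basis` tokens."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count / corpus_size * basis


# Figures from the worked example above: 350 hits in a 15,000-token corpus.
rate = normalized_frequency(350, 15_000, basis=1_000)
print(f"{rate:.1f} per 1,000 tokens")
```

Changing `basis` to 1,000,000 gives the per-million rate, which is usually more convenient for large corpora.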
2.3 Comparisons
1. Within-Corpus Comparison:
o Examines relative word frequencies in a single corpus.
o Example:
▪ "Appearance" occurs 5,310 times in the BNC compared to
"disappearance" (632 times). The frequency gap is explained by semantic
diversity: "appearance" has multiple meanings, while "disappearance" is
more specific.
2. Between-Corpus Comparison:
o Compares frequencies of words across different corpora.
o Example:
▪ In the Global Environmental Change (GEC) corpus:
▪ "Disappearance" occurs 34 times out of 80 total instances of
"appearance" and "disappearance," forming 42.5% of the total.
▪ In the BNC, "disappearance" forms only 10.6% of the total,
reflecting different topical emphases.
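The two percentages above can be checked with a short calculation. The sketch below uses only the counts quoted in these notes; the helper name `share` is an assumption made for the example.

```python
def share(count: int, total: int) -> float:
    """Percentage of `total` accounted for by `count`."""
    return 100 * count / total


# BNC: "disappearance" 632 and "appearance" 5,310 (counts from the notes).
bnc_share = share(632, 632 + 5_310)
# GEC: 34 of the 80 combined instances (counts from the notes).
gec_share = share(34, 80)

print(f"BNC: {bnc_share:.1f}%  GEC: {gec_share:.1f}%")
```

Comparing shares rather than raw counts sidesteps the size difference between the two corpora.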
3. Keywords
3.1 Definition
● Keywords are words that occur significantly more often in one corpus than in a reference corpus.
● Purpose: Identify themes, stylistic elements, or topical focus.
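These notes do not say which statistic the cited studies used to identify keywords. As a sketch under that caveat, the block below uses log-likelihood, one commonly used keyness measure, in its two-corpus form. The function name and all counts are hypothetical, chosen only for illustration.

```python
import math


def log_likelihood_keyness(freq_study: int, size_study: int,
                           freq_ref: int, size_ref: int) -> float:
    """Log-likelihood keyness of one word: study corpus vs. reference corpus."""
    total_freq = freq_study + freq_ref
    total_size = size_study + size_ref
    # Expected counts if the word were equally frequent in both corpora.
    expected_study = size_study * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    ll = 0.0
    for observed, expected in ((freq_study, expected_study),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


# Hypothetical counts: 120 hits in a 25,000-token study corpus,
# 300 hits in a 1,000,000-token reference corpus.
score = log_likelihood_keyness(120, 25_000, 300, 1_000_000)
print(f"keyness (LL) = {score:.1f}")
```

A score above about 3.84 corresponds to p < 0.05 with one degree of freedom. The statistic alone does not show direction, so the normalized frequencies still need to be compared to see which corpus overuses the word.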
3.2 Examples
1. Shakespeare Analysis (Scott & Tribble, 2006):
o Keywords in Romeo and Juliet: "love," "death," "banished," "night," "poison."
o Insights:
▪ Thematic terms (e.g., "love") reflect plot content and dialogues.
▪ Stylistic terms (e.g., "night") concentrate in specific scenes or speeches.
2. Travel Writing Analysis (Gerbig, 2010):
o 19th-century keywords: "desert," "reindeer," "tent" (explorative and exotic
narratives).
o 21st-century keywords: "visa," "taxi," "guesthouse" (practical and relatable
themes).
3. Political Manifestos (Rayson, 2008):
o Labour Party keywords: "reform," "new."
o Liberal Democrats: "freedom," "entitled."
o Findings: Labour emphasizes societal change, while Liberal Democrats focus on
individual rights.
3.3 Considerations in Keyword Studies
● Reference Corpus Selection: Researchers can compare:
1. Specialized corpora against general corpora (e.g., Romeo and Juliet vs. all
Shakespeare plays).
2. Sub-corpora within the same dataset (e.g., character-specific speech in Romeo
and Juliet).
3.4 Limitations
● Focus on differences may exaggerate contrasts.
● Stereotypes: Rayson et al. (1997) highlighted gender-associated keywords, which
inadvertently reinforced stereotypes about male and female speech patterns.
4. Measuring Collocation
4.1 Definition
● Collocation: Statistically significant co-occurrence of words within a specified span
(e.g., ±4 words).
● Purpose: Explores contextual patterns and thematic significance.
4.2 Metrics
1. Log-Likelihood:
o Measures how significant a co-occurrence is.
o Example: "Species" + "of" is highly significant due to recurring phrases like
"species of bird."
2. Mutual Information (MI):
o Highlights strong, rare pairings.
o Example: "Mutability of species" has high MI because "mutability" co-occurs
almost exclusively with "species."
3. T-Score:
o Balances strength with evidence, prioritizing frequent combinations.
o Example: "Species" + "new" reflects common academic usage in evolutionary
contexts.
4.3 Examples
● In The Rough Guide to Evolution:
o "Species" collocates with "of" (176 times), "new" (41 times), and "mutability" (6
times).
o Findings illustrate phraseologies:
▪ "Species of [noun]" (e.g., "species of bird").
▪ "[Adjective] species" (e.g., "new species").
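The three measures in 4.2 can be compared on the co-occurrence counts above. The sketch below uses textbook formulas: MI = log2(O/E), t-score = (O - E)/sqrt(O), and log-likelihood over a 2x2 contingency table. Only the co-occurrence counts (176, 41, 6) come from the notes. The corpus size and individual word frequencies are hypothetical, and real tools may adjust the expected value for span size.

```python
import math

N = 100_000          # hypothetical corpus size in tokens
F_NODE = 600         # hypothetical frequency of "species"
# collocate: (hypothetical collocate frequency, co-occurrences from the notes)
COLLOCATES = {"of": (3_500, 176), "new": (250, 41), "mutability": (6, 6)}


def expected(f_node: int, f_coll: int, n: int) -> float:
    """Co-occurrences expected if the two words were independent."""
    return f_node * f_coll / n


def mutual_information(o: int, e: float) -> float:
    return math.log2(o / e)


def t_score(o: int, e: float) -> float:
    return (o - e) / math.sqrt(o)


def log_likelihood(o11: int, f_node: int, f_coll: int, n: int) -> float:
    """Log-likelihood over the 2x2 table of node/collocate presence."""
    observed = [o11, f_node - o11, f_coll - o11, n - f_node - f_coll + o11]
    rows, cols = (f_node, n - f_node), (f_coll, n - f_coll)
    expected_cells = [r * c / n for r in rows for c in cols]
    return 2 * sum(o * math.log(o / e)
                   for o, e in zip(observed, expected_cells) if o > 0)


results = {}
for word, (f_coll, o) in COLLOCATES.items():
    e = expected(F_NODE, f_coll, N)
    results[word] = {
        "MI": mutual_information(o, e),
        "t": t_score(o, e),
        "LL": log_likelihood(o, F_NODE, f_coll, N),
    }
    print(word, {k: round(v, 2) for k, v in results[word].items()})
```

Under these assumed frequencies, MI ranks the rare pairing with "mutability" highest, while the t-score and log-likelihood favour the frequent pairing with "of". This matches the contrast described in 4.2.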
5. Lexical Bundles
5.1 Definition
● Multi-word units recurring frequently in a corpus (e.g., "on the other hand").
● Automatically identified based on thresholds for frequency and dispersion.
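Automatic identification can be sketched as counting n-grams and keeping those that pass both thresholds. The function name, the toy texts, and the threshold values below are assumptions made for illustration. Published studies typically set the frequency threshold per million words and require a minimum number of texts.

```python
from collections import Counter, defaultdict


def lexical_bundles(texts, n=4, min_freq=3, min_texts=2):
    """Return n-grams meeting a frequency and a dispersion threshold.

    texts: list of token lists, one per text.
    min_freq: minimum total occurrences across all texts.
    min_texts: minimum number of different texts containing the n-gram.
    """
    freq = Counter()
    spread = defaultdict(set)  # n-gram -> ids of the texts it occurs in
    for text_id, tokens in enumerate(texts):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            freq[gram] += 1
            spread[gram].add(text_id)
    return {
        " ".join(gram): count
        for gram, count in freq.items()
        if count >= min_freq and len(spread[gram]) >= min_texts
    }


# Toy corpus of three very short "texts" (hypothetical data).
texts = [
    "on the other hand the results differ".split(),
    "on the other hand we found that on the other hand".split(),
    "the results differ on the one hand".split(),
]
bundles = lexical_bundles(texts, n=4, min_freq=3, min_texts=2)
print(bundles)
```

The dispersion check stops a sequence that is repeated many times in a single text from counting as a bundle.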
5.2 Examples
● Academic Writing (Global Environmental Change Corpus):
o Subject-specific bundles: "Impacts of climate change" (536 occurrences).
o General-purpose bundles: "As well as the" (381 occurrences).
5.3 Functions (Biber, 2006b)
1. Stance Expressions: Convey attitudes or likelihood (e.g., "It is important to").
2. Discourse Organizers: Structure arguments (e.g., "On the other hand").
3. Referential Expressions: Specify attributes or relationships (e.g., "At the end of the
year").
5.4 Applications
● Lexical bundles reflect disciplinary norms:
o Sciences: Frequent bundles related to methodology (e.g., "in the case of").
o Humanities: Focus on framing arguments (e.g., "on the basis of").
6. Multidimensional Analysis (MDA)
6.1 Definition
● Statistical technique comparing linguistic feature distributions across sub-corpora.
● Introduced by Biber (1988).
6.2 Process
1. Tagging: Annotate corpora for linguistic features (e.g., pronouns, tenses).
2. Factor Analysis: Group co-occurring features into factors.
3. Interpretation: Assign dimensions based on feature patterns.
6.3 Dimensions (Biber, 1988)
1. Involved vs. Informational Production:
o Positive features: First-person pronouns, contractions.
o Negative features: Nouns, prepositions.
2. Narrative vs. Non-Narrative Concerns:
o Positive features: Past tense, public verbs.
o Negative features: Present tense, attributive adjectives.
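Once a dimension's positive and negative features are known, each text can be given a dimension score. A common way to do this, not spelled out in these notes, is to standardize each feature's normalized rate across texts, then sum the z-scores of the positive features and subtract those of the negative features. The sketch below does this for Dimension 1. The feature rates are hypothetical, and the factor analysis step itself is not shown.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize a list of rates to mean 0 and standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def dimension_scores(rates, positive, negative):
    """rates: {feature: [rate in text 0, rate in text 1, ...]}."""
    z = {feature: z_scores(values) for feature, values in rates.items()}
    n_texts = len(next(iter(rates.values())))
    return [
        sum(z[f][i] for f in positive) - sum(z[f][i] for f in negative)
        for i in range(n_texts)
    ]


# Hypothetical rates per 1,000 tokens for three texts:
# a conversation, a novel extract, and an academic article.
rates = {
    "first_person_pronouns": [60, 25, 5],
    "contractions": [40, 15, 1],
    "nouns": [140, 190, 300],
    "prepositions": [80, 100, 140],
}
scores = dimension_scores(
    rates,
    positive=["first_person_pronouns", "contractions"],
    negative=["nouns", "prepositions"],
)
for label, score in zip(["conversation", "novel", "article"], scores):
    print(f"{label}: {score:+.2f}")
```

With these assumed rates, the conversation falls toward the involved end of the dimension and the article toward the informational end.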
7. Applications and Challenges
7.1 Applications
● Thematic analysis: Keywords and collocations identify topics and stylistic patterns (e.g.,
political discourse).
● Genre-specific insights: Lexical bundles reveal academic writing conventions.
7.2 Challenges
● Statistical measures like MI may highlight rare, unrepresentative collocations.
● Overemphasis on differences risks perpetuating stereotypes (e.g., gender-specific
keywords).