EXPLORING CORPORA
BBI5411
TASK 1
Aim: to create awareness of the available corpora in the world!
TASK: Complete the following questions. Search and find the answers.
Go to http://martinweisser.org/corpora_site/CBLLinks.html
1. Go to CBL Links -> corpora section.
a. What does it mean by first generation and second generation corpora?
First generation corpora are almost exclusively written, and generally modelled on the Brown
corpus. Their size is usually 1 million words. These corpora were typically smaller in size and often
compiled manually or with limited automation.
Second generation corpora, also called modern corpora, represent a more advanced and
sophisticated stage of corpus linguistics development. Second generation corpora are larger and
more diverse, such as the BNC and COCA.
b. What are the types of corpora listed in this corpora section?
pre-electronic corpora, first-generation corpora, and second-generation corpora
c. List the names of corpora in the first generation and the second generation.
First-generation corpora: The Brown Corpus, FLOB, LOB Corpus, B-BROWN, AmE06 Corpus,
Frown Corpus, BE06 Corpus, BLOB-1931 Corpus, Kolhapur Corpus of Indian English, Wellington
Corpus of Written New Zealand English, Australian Corpus of English (ACE), Corpus of English-
Canadian Writing, and London-Lund Corpus (LLC).
Second-generation (Mega) corpora: British National Corpus (BNC), Corpus of Contemporary
American English (COCA), Corpus of Historical American English (COHA), TIME Corpus, and Corpus
del Español.
d. What are the available multimodal corpora mentioned by this website? What kinds of texts
are included in these corpora?
British National Corpus (BNC)
Corpus of Contemporary American English (COCA)
International Corpus of English (ICE)
Santa Barbara Corpus of Spoken American English
Spoken Dutch Corpus
Dr Afida
These corpora include different types of texts, such as spoken and written language, fiction and
non-fiction, and different genres.
e. What are the different types of data compiled in specialized corpora?
specific dialects, genres, registers
f. Give 3 examples of parallel corpora and the languages that they cover
Europarl Corpus - contains parallel texts in 21 European languages, including English, French,
German, Italian, Spanish, and Polish.
Hansard Corpus - contains parallel texts in English and Welsh.
Canadian Hansard Corpus - contains parallel texts in English and French.
g. Give 3 examples of Chinese corpora, and mention the word count and the texts included.
Academia Sinica Balanced Corpus (Chinese): a 5-million-word corpus of Chinese at the Academia
Sinica (Taiwan); Traditional script/Big5-encoded, free online access.
Lancaster Corpus of Mandarin Chinese (LCMC): a 1-million-word corpus (including punctuation)
of written Mandarin; texts from around 1991.
UCLA Chinese Corpus (UCLACC): 1 million words (including punctuation); texts from 2000-2005; can be used
alongside LCMC to track language change over a decade and to examine the potential influence of the Web on (written)
Chinese.
2. Search for the BNC corpus website.
Find the answers to the following questions by exploring the particular website.
a. What is the BNC corpus?
The British National Corpus (BNC) is a 100 million word collection of samples of written and
spoken language from a wide range of sources, designed to represent a wide cross-section of
British English from the later part of the 20th century, both spoken and written.
b. Which year did the work on BNC begin and complete?
Work on building the corpus began in 1991, and was completed in 1994.
c. Which type(s) of corpus does the BNC belong to?
General Corpora, Monolingual Corpora, and Synchronic Corpora.
d. What are the main uses of the BNC?
The main uses of the BNC are for linguistic research, language teaching, and lexicography.
e. What is the BNC Simple Search?
The BNC Simple Search is a tool that provides a simple way to query a corpus and retrieve
usage examples of a word or phrase in BNC text.
f. How was the BNC corpus created?
Making the BNC was a joint effort of a large number of participants; organisations and
individuals. It comprised two main stages: the planning (design stage) and the execution
(creation stage). The BNC project started with a careful planning stage where the design
principles for the corpus were drawn up. These established a number of selection criteria which
were then used for identifying suitable texts to be included in the corpus. In addition to the
selection criteria for the written and spoken components, a large number of classification
features were identified for the texts in the corpus.
g. What are the selection criteria for BNC’s written texts?
Texts were selected for inclusion in the corpus according to three independent selection criteria:
domain, time, and medium. Target proportions were defined for each of these criteria, as listed
below.
Domain
The domain of a text indicates the kind of writing it contains.
75% of the written texts were to be chosen from informative writings: of which roughly
equal quantities should be chosen from the fields of applied sciences, arts, belief & thought,
commerce & finance, leisure, natural & pure science, social science, world affairs.
25% of the written texts were to be imaginative, that is, literary and creative works.
Medium
The medium of a text indicates the kind of publication in which it occurs. The classification
used is quite broad.
60% of written texts were to be books
25% were to be periodicals (newspapers etc.)
between 5 and 10% should come from other kinds of miscellaneous published material
(brochures, advertising leaflets, etc)
between 5 and 10% should come from unpublished written material such as personal letters
and diaries, essays and memoranda, etc
a small amount (less than 5%) should come from material written to be spoken (for example,
political speeches, play texts, broadcast scripts, etc.)
Time
The time criterion refers to the date of publication of a text. Being a synchronic corpus, the
BNC should contain texts from roughly the same period. The intention was that no text
should date back further than 1975. This condition was relaxed for imaginative works only, a
few of which date back to 1964, because of their continued popularity and consequent
effect on the language.
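The domain proportions above can be turned into word targets with simple arithmetic. The sketch below is only an illustration: it assumes the eight informative fields receive exactly equal shares (the source says "roughly equal"), and HYPOTHETICAL_TOTAL is an invented budget, not the actual size of the written BNC.

```python
# Target proportions for the written BNC, taken from the domain criterion above.
INFORMATIVE_SHARE = 0.75
IMAGINATIVE_SHARE = 0.25
INFORMATIVE_FIELDS = [
    "applied sciences", "arts", "belief & thought", "commerce & finance",
    "leisure", "natural & pure science", "social science", "world affairs",
]

def domain_targets(total_words):
    """Split a word budget across domains according to the stated proportions."""
    # Assumption: the informative share is divided into exactly equal parts.
    per_field = total_words * INFORMATIVE_SHARE / len(INFORMATIVE_FIELDS)
    targets = {field: per_field for field in INFORMATIVE_FIELDS}
    targets["imaginative"] = total_words * IMAGINATIVE_SHARE
    return targets

# HYPOTHETICAL_TOTAL is an illustrative figure, not the size of the written BNC.
HYPOTHETICAL_TOTAL = 1_000_000
targets = domain_targets(HYPOTHETICAL_TOTAL)
for domain, words in targets.items():
    print(f"{domain:<25}{words:>12,.0f}")
```

With equal shares, each informative field receives 75% / 8 = 9.375% of the budget, and the imaginative domain receives the remaining 25%.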
h. What are the classification features for BNC’s written texts?
In addition to the selection criteria, a large number of classification features were identified for the
texts in the corpus. No fixed proportions were specified for these features, although the intention
was to make sure that there should be an appropriate level of variation within each criterion. The
classification criteria include such things as:
Sample size (number of words) and extent (start and end points)
Topic or subject of the text
Author's name, age, gender, region of origin, and domicile
Target age group and gender
"Level" of writing (a subjective measure of reading difficulty) : the more literary or technical
a text, the "higher" its level.
Information was added when available which means that the amount of information added to each
text varies.
i. How is the result of a search term displayed in the BNC?
The result of a search term in the BNC is displayed in a web-based client program that
allows users to search and retrieve lexical, grammatical, and textual data from the corpus. The
search results can be restricted by text category and word-class, and users can choose different
search result display options.
j. What are the two parts in the creation of the spoken BNC and what do they contain?
The spoken BNC corpus was created in two parts, a demographic part, containing
transcriptions of spontaneous natural conversations made by members of the public and
a context-governed part, containing transcriptions of recordings made at specific types of
meeting and event.
k. What is the difference between the demographic part and the context-governed part of the
spoken corpus?
The demographic part of the spoken BNC corpus contains spontaneous conversations recorded by
people from different regions, ages, and social backgrounds, while the context-governed part
contains recordings made in specific situations, such as job interviews, sports commentaries,
and meetings.
l. What are the 3 ways in making the text machine-readable and electronic?
The three ways of making the text machine-readable and electronic are scanning,
keyboarding, and reuse of existing electronic texts.
m. Why is hand editing the texts required in making electronic texts?
Hand editing is required in making electronic texts to correct scanning errors and insert
textual markup.
n. Why does mark-up or encoding the texts need to be standardized?
Mark-up or encoding the texts needs to be standardized to ensure that the texts are
machine-readable and can be used by different software programs.
o. What is the meaning of tagging and grammatical tag?
Tagging in the context of language and text analysis refers to the process of assigning labels
or tags to individual words or tokens in a piece of text to identify their grammatical or semantic
properties. Grammatical tagging, often referred to as part-of-speech tagging (POS tagging), is the
specific process of labeling each word or token in a sentence with its corresponding grammatical
category or part of speech.
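As a rough illustration of what tagging produces, the sketch below labels each token by looking it up in a small word list. The lexicon, the simplified tag labels and the function name tag_tokens are all hypothetical; this is not the tagger or the tagset used for the BNC, which relied on automatic tagging software rather than a fixed word list.

```python
# Toy part-of-speech tagger: looks each token up in a small hand-made lexicon.
# The lexicon and tag labels are hypothetical, not the tagset used for the BNC.
TOY_LEXICON = {
    "the": "DET",
    "a": "DET",
    "corpus": "NOUN",
    "linguist": "NOUN",
    "texts": "NOUN",
    "contains": "VERB",
    "tags": "VERB",
    "large": "ADJ",
}

def tag_tokens(sentence, lexicon=TOY_LEXICON, default="UNK"):
    """Return (token, tag) pairs; unknown tokens get the default tag."""
    tokens = sentence.lower().split()
    return [(tok, lexicon.get(tok, default)) for tok in tokens]

tagged = tag_tokens("The corpus contains large texts")
for token, tag in tagged:
    # Print in the common word_TAG style.
    print(f"{token}_{tag}")
```

A real tagger also has to resolve ambiguous words (for example, "tags" as noun or verb) from context, which a plain lookup cannot do.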
p. What is the final stage in creating the BNC?
The last stage in creating the corpus was to add detailed descriptive information to each
text, in the form of a header, and to validate the SGML structure of the whole. Some hand
editing was necessary to correct small SGML errors, but no more than 5-10% of the texts had to
be altered.
3. Go to http://corpus.byu.edu/
Find the answers to the following questions by exploring the particular website.
a. Who is the creator of this website?
The creator of the website is Mark Davies, a professor of linguistics at Brigham Young
University.
b. How many corpora are there in this website? List them all.
News on the Web (NOW) corpus
iWeb: The Intelligent Web-based Corpus
Global Web-based English (GloWbE)
Wikipedia corpus
Coronavirus Corpus
Corpus of Contemporary American English (COCA)
Corpus of Historical American English (COHA)
The TV Corpus
The Movie Corpus
Corpus of American Soap Operas
Hansard corpus
Early English Books Online
Corpus of US Supreme Court Opinions
TIME Magazine Corpus
British National Corpus (BNC)
Strathy Corpus (Canada)
CORE Corpus
American English
British English
c. What is the time range for the NOW corpus?
The NOW corpus covers the period from 2010 to the present.
d. What kind of searches can be done using COCA?
There are six main ways to search the corpus:
First, you can search for phrases and strings. And because the corpus is optimized for speed,
searches for substrings (*ism, un*able) and phrases are very fast, e.g.: got VERB-ed, BUY *
ADJ NOUN, "gorgeous" NOUN -- and even high frequency phrases like: from ADJ to
ADJ, phrasal verbs, or NOUN NOUN.
Second, you can browse a frequency list of the top 60,000 words in the corpus, including
searches by word form, part of speech, ranges in the 60,000 word list, and even by meaning
or pronunciation. This should be particularly useful for language learners and teachers.
Third, you can browse through the Academic Vocabulary List (AVL) (Gardner and Davies,
2013), and then see detailed entries for each of the 3,000 words. This is a great option for
those who are interested mainly in academic English.
Fourth, you can search by individual word, and see collocates, topics, clusters, websites,
concordance lines, and related words for each of these words. Note that some of these
searches are unique to COCA and iWeb.
Fifth, you can input entire texts and then use data from COCA to get detailed information on
the words and phrases in the text.
And finally, you can find random words and also browse through randomly-selected "Words
of the Day", and then save new words and come back and review them later.
You might pay special attention to the comparisons between genres and years and virtual
corpora, which allow you to create personalized collections of texts related to a particular
area of interest.
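The substring searches mentioned above (*ism, un*able) treat * as a wildcard for any run of characters. A minimal sketch of that idea, using Python's standard fnmatch module on an invented word list, might look like this. The function name wildcard_search and the vocabulary are assumptions for illustration, and this is not how the COCA interface is implemented.

```python
from fnmatch import fnmatchcase

def wildcard_search(words, pattern):
    """Return the words matching a pattern in which * stands for any characters."""
    return [w for w in words if fnmatchcase(w.lower(), pattern.lower())]

# Invented vocabulary for illustration only.
vocabulary = ["realism", "tourism", "unbelievable", "unable",
              "table", "prism", "understand", "unstable"]

ism_words = wildcard_search(vocabulary, "*ism")
un_able_words = wildcard_search(vocabulary, "un*able")
print(ism_words)       # words ending in -ism
print(un_able_words)   # words starting with un- and ending in -able
```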
e. Which corpora would you be interested to analyse? What would you be analysing for?
The British National Corpus (BNC). This corpus contains over 100 million words of written and spoken
British English from the late 20th century. It is a valuable resource for studying the use of language in
different contexts and genres.
4. Search for the Freiburg – LOB Corpus of British English corpus.
a. Who was the creator of FLOB and from which institution is the person from?
Created by Christian Mair from the Albert-Ludwigs-Universität Freiburg.
b. In which year did work on FLOB begin?
1991
c. What was the reason behind its creation?
The reason behind its creation was to provide an updated counterpart to the original Brown and LOB
corpora, which both sample texts published in 1961. The FLOB corpus was designed to represent the British
English of the early 1990s.
d. What is the word count for FLOB and how many text samples does it contain?
FLOB contains 500 texts of around 2000 words each, giving a total of around one million
words.
e. What is the language variety of FLOB?
The language variety of FLOB is British English.
f. Which corpus is the counterpart of FLOB for American English?
The Freiburg-Brown Corpus of American English (Frown).
5. Search for Pakistan National Corpus of English (PNCE)
a. Who created this corpus and which institution is the person from?
It was created by the Corpus Research Centre at Air University, which is the first corpus
centre of Pakistan.
b. In which year was the corpus created?
2017
c. What is the word count of this corpus?
The word count of this corpus is approximately 7,586,110 words.
d. What text categories is the corpus composed of?
Non-Fiction
Newspapers and Magazines
Dissertations and Research Articles
Legal and Official Language
6. Search for The London-Lund Corpus of Spoken English.
a. Who created this corpus and in which year was it created?
Researchers at the University of London and Lund University created this corpus.
This corpus was created in the early 1990s.
b. What were the 2 projects that contributed to the making of the corpus?
The Survey of English Usage (SEU) at University College London and the Survey of Spoken English (SSE) at Lund University.
c. What are the categories of spoken genres included in the corpus?
Informal conversations, interviews, radio broadcasts, debates, monologues, telephone
conversations, public speeches, private dialogues.
7. Go to
https://uclouvain.be/en/research-institutes/ilc/cecl/learner-corpora-around-the-world.html
a. How many learner corpora are listed in this website?
There are 201 learner corpora.
b. What is the meaning of learner corpora?
Learner corpora are electronic collections of language data produced by L2 learners,
that is, second or foreign-language learners.
c. State the number of written corpora and spoken corpora. For the written corpora, what
are the range of text types collected? For the spoken corpora, what are the range of
text types collected?
Written corpora: 67; spoken corpora: 152
For the written corpora, narrative, exams, assignments, exam essay, argumentative essays,
literary essays, letters, diaries, picture descriptions, book reviews, short reviews, short
dialogues, student essays, academic research writing, in-class assignments, ESP papers,
exam scripts, job application cover letter, short answers, written tasks, informal learning
contexts, dissertation, term paper, analysis, TOEFL English essays, business
correspondence, reports.
For spoken corpora, discussion, interviews, reading of texts, sentences, spontaneous oral
language, national oral English test, speeches, spoken English test, talks in classroom,
interaction.
d. How many specialised corpora are shown in the list and state what they are. Specialised
refers to a professional field or discipline e.g. business, engineering etc. or specific
dialects, genres, registers.
There are 33 specialised corpora shown in the list. They are The Aachen Corpus of
Academic Writing (ACAW), The Advanced Learner English Corpus (ALEC), The BATMAT
Corpus, The British Academic Written English (BAWE) corpus, Canadian job cover letter
corpus, The Chinese/English Political Interpreting Corpus (CEPIC), The Chinese Academic
Written English corpus (CAWE), The Corpus Archive of Learner English in Sabah/Sarawak
(CALES), The Corpus of Multilingual Opinion Essays by College Students (MOECS), Corpus of
Written Spanish, L2 and Heritage Speakers (COWS-L2H), The EFL Teacher Corpus (ETC), The
ETS Corpus of Non-Native Written English, The Europarl corpus of Native Non-native and
Translated Texts (ENNTT) , The GICLE corpus (German component of ICLE), The Indianapolis
Business Learner Corpus (IBLC), The International Teaching Assistants corpus (ITAcorp),
Lancaster Corpus of Academic Written English (LANCAWE), The Lang-8 Learner Corpora, The
Learner Corpus of Engineering Abstracts (LCEA), The Learner Corpus of English for Business
Communication, Multilingual Corpus of Second Language Speech (MuSSeL), The Polish
Learner Corpus PoLKo, The Québec learner corpus , The Russian Learner Translator Corpus
(RusLTC), The Undergraduate Learner Translator Corpus (ULTC), The UPF Learner
Translation Corpus, The Varieties of English for Specific Purposes dAtabase learner corpus
(VESPA), The deL1L2IM corpus, MISTiC (Multiple Italian Student TranslatIon Corpus), The
Tartu Learner Corpus of Spanish as a L3+, The MeLLANGE Learner Translator Corpus (LTC),
The University of Toronto Romance Phonetics Database (RPD).
e. State ALL the target languages of these corpora.
Arabic, Chinese, Croatian, Czech, Dutch, English, Italian, Spanish, Cantonese,
Putonghua, Portuguese, German, Icelandic, Slovene, Latvian, Russian, French,
Norwegian, Polish, Estonian, Finnish, Swedish, Gaelic, Hungarian, Korean,
Lithuanian, Persian, Catalan, Romanian.
8. Search for the ARCHER corpus website.
a. What does the word ARCHER stand for?
ARCHER stands for A Representative Corpus of Historical English Registers.
b. What type of corpus does ARCHER belong to?
ARCHER is a multi-genre corpus of British and American English.
c. In which year did work on this corpus begin?
It was first constructed in the 1990s.
d. What time range does the ARCHER corpus cover?
The ARCHER corpus covers the period from 1600 to 1999 for the compilation of its
content.
e. Who created this corpus?
Douglas Biber and Edward Finegan.
f. In which institution is the corpus based?
The ARCHER corpus is based at the University of Manchester, as indicated by the email
contact provided in the text: archer@manchester.ac.uk.
g. What is the objective in creating ARCHER?
The objective in creating ARCHER is to develop a multi-genre corpus of British and
American English that covers the period from 1600 to 1999. This corpus is intended for
linguistic research and analysis, allowing researchers, scholars, and students to study the
evolution and usage of the English language over this historical time span.
h. How many versions of ARCHER are there and state the name of the creators, their
affiliation, year of completion and number of words for each version.
Five versions
ARCHER 1: This version was completed in 1993. The number of words in this version is
not specified in the search results.
ARCHER 2: This version was completed in 2004-2005. The number of words in this
version is not specified in the search results.
ARCHER 3.1: This version was completed in the summer of 2006. The total number of
words in this version is 1,253,557.
ARCHER 3.2: This version was completed in 2013. The total number of words in this
version is not specified in the search results.
ARCHER 3.3: This version is currently in preparation.
9. Go to http://rcpce.engl.polyu.edu.hk/index.html and search for the Hong Kong Engineering
Corpus.
Based on the information on the website, answer the following questions.
a. When did this project begin?
2006.
b. Who are the people behind it?
Members and associates of the RCPCE are involved in solid and internationally
acclaimed work in (critical) discourse analysis, discourse intonation, intercultural
communication studies, language assessment, lexical studies, systemic functional
grammar, and so on related to professional contexts. They include Prof. Eric Friginal and Dr
William Feng.
c. What is the current word count for this corpus?
The Hong Kong Engineering Corpus currently contains 9,224,384 words.
d. What is the specific purpose of compilation?
The Hong Kong Engineering Corpus (HKEC) is compiled for users to learn more about the
language of the engineering industry, in particular the study of the patterns of use of
specific words and phrases. It is an educational and research resource that is publicly
available via the website of the Research Centre for Professional Communication in
English (RCPCE) to benefit engineering professionals, academics, and students locally
and internationally. The HKEC corpus searches display only short segments of texts (12
words either side of the word(s) being studied), and therefore the complete texts
contained in the HKEC are not available to the user to either read in their entirety or to
download.
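The display described above, with a fixed number of words shown on either side of the search word, is a keyword-in-context (KWIC) concordance. A minimal sketch of the idea follows. The function name kwic, the punctuation handling and the sample sentence are assumptions made for illustration; this is not the HKEC's own search software.

```python
def kwic(text, keyword, window=12):
    """Return (left, match, right) context tuples for each hit of keyword."""
    tokens = text.split()
    target = keyword.lower()
    hits = []
    for i, tok in enumerate(tokens):
        # Strip surrounding punctuation so "load," still matches "load".
        if tok.strip(".,;:!?\"'()").lower() == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((" ".join(left), tok, " ".join(right)))
    return hits

# Invented sample sentence; a smaller window is used to keep the output short.
sample = ("The contractor shall test the load capacity of each beam "
          "before the load is applied to the structure.")
lines = kwic(sample, "load", window=3)
for left, match, right in lines:
    print(f"{left:>30} [{match}] {right}")
```

Leaving window at its default of 12 mirrors the 12-word span described above. Because only these short spans are returned, the full texts cannot be reconstructed from search results.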
e. What is the context of the corpus?
The context of the Hong Kong Engineering Corpus (HKEC) is to serve as a resource for
studying the language of the engineering industry. It is designed to help users,
particularly engineering professionals, academics, and students, understand the patterns
of usage of specific words and phrases within the context of engineering-related texts.
The corpus contains text segments from various sources related to engineering, and it
allows users to search for specific words or phrases and view short contextual snippets
(typically 12 words on either side of the word or phrase being studied). The goal is to
provide insight into how language is used in the field of engineering, facilitating research
and education in this domain. Users can access HKEC via the website of the Research
Centre for Professional Communication in English (RCPCE) to benefit their understanding
of engineering language.
f. What are the types of texts used in the corpus?
The types of texts used in the Hong Kong Engineering Corpus (HKEC) are typically related
to the field of engineering. These texts can include a variety of documents, such as
research papers, technical reports, academic articles, manuals, project documentation,
and other written materials that are relevant to engineering and its various subfields.
The corpus is designed to represent the language and terminology commonly used in
the engineering industry, making it a valuable resource for studying the language of
engineering and understanding how specific words and phrases are used in this context.
g. State the range of subject matter/topics covered in this corpus.
The Hong Kong Engineering Corpus (HKEC) covers a wide range of topics related to the
engineering sector. Here are some of the text types and their corresponding codes
included in the HKEC:
About Us (AU): 647,013 words
Abstracts (A): 94,671 words
Agreements (AG): 127,895 words
Circular Letters (CL): 143,313 words
Code of Practice (C): 997,228 words
Conference Proceedings (CP): 196,498 words
Consultation Papers (CSP): 111,494 words
Fact Sheets (FS): 26,059 words
Frequently Asked Questions (FAQ): 55,726 words
Guides (G): 783,805 words
Handbooks (HB): 67,284 words
Letters to Editor (LE): 3,492 words
Manuals (M): 296,299 words
Media Releases (MR): 1,566,742 words
Notes (N): 156,255 words
Ordinances (O): 139,176 words
Plans (PL): 4,173 words
Position Documents (PD): 75,660 words
Publicity Material (PM): 599,407 words
Product Descriptions (PRD): 611,549 words
Project Summaries (PS): 115,829 words
Q & A (QA): 27,703 words
Reports (R): 979,170 words
Review Papers (RP): 106,506 words
Speeches (SP): 2,822 words
Standards (S): 136,024 words
Technical Papers (TP): 65,731 words
Tender Notices (TN): 4,242 words
Transaction Discussions (TRD): 7,149 words
Transaction Notes (TRN): 79,058 words
Transaction Proceedings (TRP): 1,055,248 words
These text types cover a broad spectrum of engineering topics, providing a
comprehensive resource for studying the language used in the engineering industry.
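The word counts listed above can be checked and summarised with a few lines of code. The sketch below sums the per-type figures, compares the sum with the total of 9,224,384 words quoted in the answer to question c, and reports the largest text type. The figures are copied from the list above. Any difference between the sum and the quoted total may simply reflect figures recorded at different times, which is an assumption rather than something the website states.

```python
# Word counts per text type, copied from the list above (code: words).
HKEC_COUNTS = {
    "AU": 647_013, "A": 94_671, "AG": 127_895, "CL": 143_313,
    "C": 997_228, "CP": 196_498, "CSP": 111_494, "FS": 26_059,
    "FAQ": 55_726, "G": 783_805, "HB": 67_284, "LE": 3_492,
    "M": 296_299, "MR": 1_566_742, "N": 156_255, "O": 139_176,
    "PL": 4_173, "PD": 75_660, "PM": 599_407, "PRD": 611_549,
    "PS": 115_829, "QA": 27_703, "R": 979_170, "RP": 106_506,
    "SP": 2_822, "S": 136_024, "TP": 65_731, "TN": 4_242,
    "TRD": 7_149, "TRN": 79_058, "TRP": 1_055_248,
}
STATED_TOTAL = 9_224_384  # total word count quoted for the HKEC in question c

total = sum(HKEC_COUNTS.values())
largest = max(HKEC_COUNTS.items(), key=lambda item: item[1])

print(f"Sum of listed text types: {total:,}")
print(f"Difference from stated total: {total - STATED_TOTAL:,}")
print(f"Largest text type: {largest[0]} ({largest[1] / total:.1%} of listed words)")
```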
h. State the variety(s) of English that are included in this corpus.
The Hong Kong Engineering Corpus (HKEC) includes texts collected from the engineering
sector of Hong Kong. The variety of English included in this corpus is primarily Hong
Kong English, which is the English language as it is used in Hong Kong. The corpus is a
reflection of the English language as it is used in the professional engineering context in
Hong Kong.
10. Search for TCSE: Ted Corpus Search Engine - yohasebe.com.
a. What is TCSE?
TCSE stands for TED Corpus Search Engine.
b. What is the type of texts used as corpus in this website?
The texts used as the corpus on this website are transcripts of TED Talks.
c. What is the objective of this website?
The objective of this website is to provide a search engine that specializes in exploring transcripts of
TED Talks for educational and scientific purposes.
d. Who is the creator of this corpus website?
The creator of this corpus website is Yoichiro Hasebe at Doshisha University, Kyoto, Japan.
e. What are the various languages included in the TED talk corpus?
The TED talk corpus includes Arabic, Bulgarian, Burmese, Chinese (simplified and traditional), Croatian,
Czech, Dutch, French, German, Greek, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese,
Korean, Kurdish, Northern Kurdish, Persian, Polish, Portuguese, Brazilian Portuguese, Romanian,
Russian, Serbian, Slovak, Spanish, Swedish, Thai, Turkish, Ukrainian, and Vietnamese.
f. Type this phrase “let me put it this way” in the search box. How many hits in how
many talks did the result show?
There are 137,637 hits in 5 talks.
g. Why do you think the speaker used this phrase in his talk?
The speaker might have used this phrase to rephrase or clarify a point in a different way that is
easier to understand for the audience. It is a common phrase used in English to introduce a different
way of explaining something.
11. Search for https://clic.bham.ac.uk/.
a) What is the meaning of CLiC and what is this website about?
CLiC stands for Corpus Linguistics in Cheshire, and it is a website that provides a web app for the
analysis of literary texts. The website is part of the CLiC Dickens project, which demonstrates how
computer-assisted methods can be used to study literary texts through corpus stylistics.
b) When did this project begin?
The CLiC Dickens project started at the University of Nottingham in 2013.
c) Who is responsible for spearheading this endeavor?
Michaela Mahlberg, a professor of corpus stylistics at the University of Birmingham.
d) What is the present total word count for this corpus?
There are 16,732,352 words.
e) What is the distinct objective for creating this compilation?
It aims to make the CLiC website as easy to use and as accessible as possible for all users, and the
site will meet the recommended government standard for web accessibility (WCAG 2.1 AA).
f) What contextual background is associated with the corpus?
The CLiC Dickens project is funded by the Arts and Humanities Research Council, grant reference
AH/P504634/1. Project team: Prof. Michaela Mahlberg, Prof. Peter Stockwell, Viola Wiegand
The CLiC web app is part-funded by GLARE. GLARE is a project funded by the European Commission
within Marie Sklodowska-Curie Actions (reference number EU 749521). Project team: Prof. Michaela
Mahlberg, Dr. Anna Cermakova.
g) What kinds of textual materials are incorporated within this corpus?
The corpus incorporates the DNov, 19C, ChiLit, ArTs and AAW text collections.
h) Can you delineate the breadth of subject matter/topics addressed within this corpus?
The breadth of subject matter/topics addressed within the CLiC corpus is limited to literary texts,
specifically novels written by Charles Dickens and other 19th-century authors. The corpus is designed
to support the analysis of fictional speech and literary body language in literary texts. Therefore, the
subject matter and topics addressed within the corpus are limited to the literary aspects of the texts,
such as character development, plot, themes, and literary devices. The corpus is not intended to
cover a wide range of topics or subject matter outside of the literary context.
i) Specify the English language varieties encompassed within this corpus.
The CLiC corpora cover different kinds of narrative fiction. They focus on the distinction between
narrative and fictional speech, and the stylistic choices made by these authors, particularly their
approach to sentence patterns, diction and punctuation.
j) State the functions given to use this tool.
The CLiC functions can be divided into two groups:
The ‘Concordance’ and ‘Subsets’ tabs both display text (patterns) from the selected books in
context. This is where you can analyse the use of particular words and phrases.
The ‘Clusters’ and ‘Keywords’ tabs both show lists of frequent patterns (without context),
but they differ in their applications. The Clusters tab lists frequent words and word sequences
(‘clusters’) in a single corpus (or several corpora if you have selected more than one). In the
Keywords tab, you can compare the frequency of words and clusters in one corpus with another.
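The difference between clusters and keywords can be sketched in a few lines. The code below counts word sequences within one text and then compares word frequencies between a target text and a reference text. The function names, the invented sample texts and the simple smoothed frequency ratio are all assumptions for illustration; this is not CLiC's own implementation or its keyness statistic.

```python
from collections import Counter

def clusters(tokens, n=2):
    """Count n-word sequences ('clusters') in a list of tokens."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)

def keyness_ratio(target_tokens, reference_tokens):
    """Relative frequency in the target divided by that in the reference.

    Add-one smoothing avoids division by zero for words absent from the
    reference. This simple ratio is an assumption for illustration only.
    """
    target, reference = Counter(target_tokens), Counter(reference_tokens)
    ratios = {}
    for word, count in target.items():
        target_rate = count / len(target_tokens)
        reference_rate = (reference[word] + 1) / (len(reference_tokens) + 1)
        ratios[word] = target_rate / reference_rate
    return ratios

# Invented sample texts standing in for two corpora.
target_text = "his hands in his pockets and his hands on his head".split()
reference_text = "the cat sat on the mat and the dog sat in the hall".split()

top_clusters = clusters(target_text, n=2).most_common(3)
ratios = keyness_ratio(target_text, reference_text)
top_keyword = max(ratios, key=ratios.get)
print(top_clusters)
print(top_keyword)
```

Clusters need only one corpus, whereas keywords always need a second corpus to compare against, which matches the distinction between the two tabs described above.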