Exploring Corpora Task 1 - 2023

The document provides information about the British National Corpus (BNC). It discusses what the BNC corpus is, when work on it began and was completed, the types of corpus it belongs to, its main uses, how it was created, and its selection criteria for written texts. It also describes the two parts that make up the spoken component of the BNC and how they differ, as well as the process used to make the text machine-readable and electronically searchable.

Uploaded by

gujianing1110

EXPLORING CORPORA

BBI5411

TASK 1

Aim: to create awareness of the available corpora in the world!

TASK: Complete the following questions. Search and find the answers.

Go to http://martinweisser.org/corpora_site/CBLLinks.html

1. Go to CBL Links -> corpora section.

a. What does it mean by first generation and second generation corpora?

First generation corpora are almost exclusively written, and generally modelled on the Brown
corpus. Their size is usually 1 million words. These corpora were typically smaller in size and often
compiled manually or with limited automation.

Second generation corpora, also called modern corpora, represent a more advanced and
sophisticated stage of corpus linguistics development. Second generation corpora are larger and
more diverse, such as the BNC and COCA.

b. What are the types of corpora listed in this corpora section?

pre-electronic corpora, first-generation corpora, and second-generation corpora

c. List the names of corpora in the first generation and the second generation.

First-generation corpora: The Brown Corpus, FLOB, LOB Corpus, B-BROWN, AmE06 Corpus,
Frown Corpus, BE06 Corpus, BLOB-1931 Corpus, Kolhapur Corpus of Indian English, Wellington
Corpus of Written New Zealand English, Australian Corpus of English (ACE), Corpus of
English-Canadian Writing, and London-Lund Corpus (LLC).

Second-generation (Mega) corpora: British National Corpus (BNC), Corpus of Contemporary
American English (COCA), Corpus of Historical American English (COHA), TIME Corpus, and Corpus
del Español.

d. What are the available multimodal corpora mentioned by this website? What kinds of texts
are included in these corpora?

British National Corpus (BNC)

Corpus of Contemporary American English (COCA)

International Corpus of English (ICE)

Santa Barbara Corpus of Spoken American English

Spoken Dutch Corpus

Dr Afida
These corpora include different types of texts, such as spoken and written language, fiction and
non-fiction, and different genres.

e. What are the different types of data compiled in specialized corpora?

specific dialects, genres, registers

f. Give 3 examples of parallel corpora and the languages that they cover

Europarl Corpus - contains parallel texts in 21 European languages, including English, French,
German, Italian, Spanish, and Polish.

Hansard Corpus - contains parallel texts in English and Welsh.

Canadian Hansard Corpus - contains parallel texts in English and French.

g. Give 3 examples of Chinese corpora , mention the word count and the texts included.

Academia Sinica Balanced Corpus (Chinese): a 5-million-word corpus of Chinese at the Academia
Sinica (Taiwan); Traditional script, Big5-encoded; free online access.

Lancaster Corpus of Mandarin Chinese (LCMC): a 1-million-word corpus (including punctuation)
that can be used to compare spoken with written Mandarin; texts from around 2004-2007.

UCLA Chinese Corpus (UCLACC): 1 million words (including punctuation); texts from 2000-2005; can
be used alongside LCMC to track language change over a decade and to examine the potential
influence of the Web on (written) Chinese.

2. Search for the BNC corpus website.

Find the answers to the following questions by exploring the particular website.
a. What is the BNC corpus?
The British National Corpus (BNC) is a 100 million word collection of samples of written and
spoken language from a wide range of sources, designed to represent a wide cross-section of
British English from the later part of the 20th century, both spoken and written.
b. Which year did the work on BNC begin and complete?
Work on building the corpus began in 1991, and was completed in 1994.
c. Which type(s) of corpus does the BNC belong to?
General Corpora, Monolingual Corpora, and Synchronic Corpora.
d. What are the main uses of the BNC?
The main uses of the BNC are for linguistic research, language teaching, and lexicography.
e. What is the BNC Simple Search?
The BNC Simple Search is a tool that provides a simple way to query a corpus and retrieve
usage examples of a word or phrase in BNC text.
f. How was the BNC corpus created?
Making the BNC was a joint effort of a large number of participants, both organisations and
individuals. It comprised two main stages: the planning (design stage) and the execution
(creation stage). The BNC project started with a careful planning stage where the design
principles for the corpus were drawn up. These established a number of selection criteria which
were then used for identifying suitable texts to be included in the corpus. In addition to the
selection criteria for the written and spoken components, a large number of classification
features were identified for the texts in the corpus.
g. What are the selection criteria for BNC’s written texts?

Texts were selected for inclusion in the corpus according to three independent selection criteria:
domain, time, and medium. Target proportions were defined for each of these criteria, as listed
below.

Domain

The domain of a text indicates the kind of writing it contains.

 75% of the written texts were to be chosen from informative writings: of which roughly
equal quantities should be chosen from the fields of applied sciences, arts, belief & thought,
commerce & finance, leisure, natural & pure science, social science, world affairs.
 25% of the written texts were to be imaginative, that is, literary and creative works.

Medium

The medium of a text indicates the kind of publication in which it occurs. The classification
used is quite broad.

 60% of written texts were to be books

 25% were to be periodicals (newspapers etc.)
 between 5 and 10% should come from other kinds of miscellaneous published material
(brochures, advertising leaflets, etc)
 between 5 and 10% should come from unpublished written material such as personal letters
and diaries, essays and memoranda, etc
 a small amount (less than 5%) should come from material written to be spoken (for example,
political speeches, play texts, broadcast scripts, etc.)

Time

The time criterion refers to the date of publication of a text. Being a synchronic corpus, the
BNC should contain texts from roughly the same period. The intention was that no text
should date back further than 1975. This condition was relaxed for imaginative works only, a
few of which date back to 1964, because of their continued popularity and consequent
effect on the language.
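
The target proportions for the medium criterion can be checked mechanically against a candidate sample. The sketch below is only an illustration: the function names, the tolerance value, and the sample word counts are invented for this example and are not part of the BNC documentation. The ranges are taken from the list above, with "less than 5%" treated as a range from 0 to 5%.

```python
# Target proportions for the medium criterion, as (low, high) fractions.
# Figures come from the list above; single targets use low == high.
MEDIUM_TARGETS = {
    "book": (0.60, 0.60),
    "periodical": (0.25, 0.25),
    "misc_published": (0.05, 0.10),
    "misc_unpublished": (0.05, 0.10),
    "to_be_spoken": (0.00, 0.05),
}


def medium_shares(word_counts):
    """Return each medium's share of the total word count."""
    total = sum(word_counts.values())
    return {medium: count / total for medium, count in word_counts.items()}


def deviations(word_counts, tolerance=0.02):
    """Return media whose share misses the target range by more than tolerance."""
    shares = medium_shares(word_counts)
    off_target = {}
    for medium, (low, high) in MEDIUM_TARGETS.items():
        share = shares.get(medium, 0.0)
        if share < low - tolerance or share > high + tolerance:
            off_target[medium] = round(share, 3)
    return off_target


# Hypothetical sample (word counts per medium), invented for illustration.
sample = {
    "book": 590_000,
    "periodical": 300_000,
    "misc_published": 60_000,
    "misc_unpublished": 40_000,
    "to_be_spoken": 10_000,
}
print(deviations(sample))  # {'periodical': 0.3}
```

In this invented sample only the periodicals share (30%) falls outside its target, so a compiler would know to add fewer periodical texts or more texts from other media.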

h. What are the classification features for BNC’s written texts?

In addition to the selection criteria, a large number of classification features were identified for the
texts in the corpus. No fixed proportions were specified for these features, although the intention
was to make sure that there should be an appropriate level of variation within each criterion. The
classification criteria include such things as:

• Sample size (number of words) and extent (start and end points)
• Topic or subject of the text
• Author's name, age, gender, region of origin, and domicile
• Target age group and gender
• "Level" of writing (a subjective measure of reading difficulty): the more literary or technical
a text, the "higher" its level.

Information was added when available which means that the amount of information added to each
text varies.

i. How is the result of a search term displayed in the BNC?

The result of a search term in the BNC is displayed in a web-based client program that
allows users to search and retrieve lexical, grammatical, and textual data from the corpus. The
search results can be restricted by text category and word-class, and users can choose different
search result display options.
j. What are the two parts in the creation of the spoken BNC and what do they contain?
The spoken BNC corpus was created in two parts, a demographic part, containing
transcriptions of spontaneous natural conversations made by members of the public and
a context-governed part, containing transcriptions of recordings made at specific types of
meeting and event.
k. What is the difference between the demographic part and the context-governed part of the
spoken corpus?
The demographic part of the spoken BNC corpus contains recordings of people from
different regions, ages, and social backgrounds, while the context-governed part contains
recordings of people in specific situations, such as job interviews, sports commentaries,
meetings and casual conversations.
l. What are the 3 ways in making the text machine-readable and electronic?
The three ways of making the text machine-readable and electronic are scanning,
keyboarding, and reuse of existing electronic texts.
m. Why is hand editing the texts required in making electronic texts?
Hand editing is required in making electronic texts to correct scanning errors and insert
textual markup.
n. Why does mark-up or encoding the texts need to be standardized?
Mark-up or encoding the texts needs to be standardized to ensure that the texts are
machine-readable and can be used by different software programs.
o. What is the meaning of tagging and grammatical tag?
Tagging in the context of language and text analysis refers to the process of assigning labels
or tags to individual words or tokens in a piece of text to identify their grammatical or semantic
properties. Grammatical tagging, often referred to as part-of-speech tagging (POS tagging), is the
specific process of labeling each word or token in a sentence with its corresponding grammatical
category or part of speech.
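
As a rough illustration of what grammatical tagging produces, the sketch below pairs each token with a label from a small lookup table. It is a toy under stated assumptions: the lexicon and the tag names (DET, NOUN, VERB, ADJ, UNK) are hypothetical simplifications and are not the tagset or the tagging method used for the BNC, where tagging is done automatically with context taken into account.

```python
# A toy lexicon-based tagger. Real taggers use context and statistics;
# this lookup table and its tag names are hypothetical simplifications.
TOY_LEXICON = {
    "the": "DET",
    "a": "DET",
    "corpus": "NOUN",
    "linguist": "NOUN",
    "searches": "VERB",
    "large": "ADJ",
}


def tag_tokens(sentence, default_tag="UNK"):
    """Split on whitespace and pair each token with a tag from the lexicon."""
    tagged = []
    for token in sentence.split():
        # Ignore case and surrounding punctuation when looking the word up.
        word = token.strip(".,;:!?").lower()
        tagged.append((token, TOY_LEXICON.get(word, default_tag)))
    return tagged


for token, tag in tag_tokens("The linguist searches a large corpus."):
    print(f"{token}\t{tag}")
```

A word such as "searches" can be a noun or a verb, which is why a plain lookup like this one is not enough in practice and real taggers decide from the surrounding words.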
p. What is the final stage in creating the BNC?
The last stage in creating the corpus was to add detailed descriptive information to each
text, in the form of a header, and to validate the SGML structure of the whole. Some hand
editing was necessary to correct small SGML errors, but no more than 5-10% of the texts had to
be altered.

3. Go to http://corpus.byu.edu/
Find the answers to the following questions by exploring the particular website.

a. Who is the creator of this website?

The creator of the website is Mark Davies, a professor of linguistics at Brigham Young
University.

b. How many corpora are there in this website? List them all.
News on the Web (NOW) corpus
iWeb: The Intelligent Web-based Corpus
Global Web-based English (GloWbE)
Wikipedia corpus
Coronavirus Corpus
Corpus of Contemporary American English (COCA)
Corpus of Historical American English (COHA)
The TV Corpus
The Movie Corpus
Corpus of American Soap Operas
Hansard corpus
Early English Books Online
Corpus of US Supreme Court Opinions
TIME Magazine Corpus
British National Corpus (BNC)
Strathy Corpus (Canada)
CORE Corpus
American English
British English

c. What is the time range for the NOW corpus?

The NOW corpus covers the period from 2010 to the present.

d. What kind of searches can be done using COCA?

There are six main ways to search the corpus:
First, you can search for phrases and strings. And because the corpus is optimized for speed,
searches for substrings (*ism, un*able) and phrases are very fast, e.g.: got VERB-ed, BUY *
ADJ NOUN, "gorgeous" NOUN -- and even high frequency phrases like: from ADJ to
ADJ, phrasal verbs, or NOUN NOUN.
Second, you can browse a frequency list of the top 60,000 words in the corpus, including
searches by word form, part of speech, ranges in the 60,000 word list, and even by meaning
or pronunciation. This should be particularly useful for language learners and teachers.
Third, you can browse through the Academic Vocabulary List (AVL) (Gardner and Davies,
2013), and then see detailed entries for each of the 3,000 words. This is a great option for
those who are interested mainly in academic English.
Fourth, you can search by individual word, and see collocates, topics, clusters, websites,
concordance lines, and related words for each of these words. Note that some of these
searches are unique to COCA and iWeb.
Fifth, you can input entire texts and then use data from COCA to get detailed information on
the words and phrases in the text.
And finally, you can find random words and also browse through randomly-selected "Words
of the Day", and then save new words and come back and review them later.
You might pay special attention to the comparisons between genres and years and virtual
corpora, which allow you to create personalized collections of texts related to a particular
area of interest.
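
The substring searches mentioned in the first point (*ism, un*able) can be imitated on a small word list. This is a minimal sketch, assuming that * stands for any run of characters; the function name and the word list are invented here, and it says nothing about how COCA itself implements its searches.

```python
from fnmatch import fnmatchcase


def wildcard_search(pattern, words):
    """Return the words matching a pattern in which * stands for any characters."""
    return [w for w in words if fnmatchcase(w.lower(), pattern.lower())]


# Hypothetical word list, invented for illustration.
word_list = ["realism", "unbelievable", "tourism", "stable", "unable", "prism"]

print(wildcard_search("*ism", word_list))     # ['realism', 'tourism', 'prism']
print(wildcard_search("un*able", word_list))  # ['unbelievable', 'unable']
```

Note that `fnmatchcase` also gives `?` and `[...]` a special meaning, so this sketch is only suitable for patterns that use `*` alone.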

e. Which corpora would you be interested to analyse? What would you be analysing for?

The British National Corpus (BNC). This corpus contains over 100 million words of written and spoken
British English from the late 20th century. It is a valuable resource for studying the use of language in
different contexts and genres.

4. Search for the Freiburg – LOB Corpus of British English corpus.

a. Who was the creator of FLOB and from which institution is the person from?

Created by Christian Mair from the Albert-Ludwigs-Universität Freiburg.

b. In which year did work on FLOB begin?

1991
c. What was the reason behind its creation?

The reason behind its creation was to provide an updated version of the original Brown and LOB
corpora, which both sample texts published in 1961. The FLOB corpus was designed to represent the
British English of the late 1980s and early 1990s.

d. What is the word count for FLOB and how many text samples does it contain?
FLOB contains 500 texts of around 2000 words each, giving a total of around one million
words.

e. What is the language variety of FLOB?

The language variety of FLOB is British English.
f. Which corpus is the counterpart of FLOB for American English?
The Freiburg-Brown Corpus of American English (Frown).

5. Search for Pakistan National Corpus of English (PNCE)

a. Who created this corpus and which institution is the person from?

It was created by the Corpus Research Centre at Air University, which is the first corpus
centre of Pakistan.
b. In which year was the corpus created?
2017
c. What is the word count of this corpus?
The word count of this corpus is approximately 7,586,110 words.
d. What text categories is the corpus composed of?

Non-Fiction
Newspapers and Magazines
Dissertations and Research Articles
Legal and Official Language

6. Search for The London-Lund Corpus of Spoken English.

a. Who created this corpus and in which year was it created?

Researchers at the University of London and Lund University created this corpus.
This corpus was created in the early 1990s.
b. What were the 2 projects that contributed to the making of the corpus?
The London-Lund Corpus project and The Linköping-Lund Corpus project.
c. What are the categories of spoken genres included in the corpus?
Informal conversations, interviews, radio broadcasts, debates, monologues, telephone
conversations, public speeches, private dialogues.

7. Go to
https://uclouvain.be/en/research-institutes/ilc/cecl/learner-corpora-around-the-world.html

a. How many learner corpora are listed in this website?

There are 201 learner corpora.

b. What is the meaning of learner corpora?

Learner corpora are electronic collections of language data produced by L2 learners,
that is, second or foreign-language learners.

c. State the number of written corpora and spoken corpora. For the written corpora, what
are the range of text types collected? For the spoken corpora, what are the range of
text types collected?
Written corpora: 67; spoken corpora: 152.
For the written corpora, the text types include narratives, exams, assignments, exam essays,
argumentative essays, literary essays, letters, diaries, picture descriptions, book reviews,
short reviews, short dialogues, student essays, academic research writing, in-class assignments,
ESP papers, exam scripts, job application cover letters, short answers, written tasks, informal
learning contexts, dissertations, term papers, analyses, TOEFL English essays, business
correspondence, and reports.

For the spoken corpora, the text types include discussions, interviews, reading of texts,
sentences, spontaneous oral language, national oral English tests, speeches, spoken English
tests, talks in the classroom, and interaction.

d. How many specialised corpora are shown in the list and state what they are. Specialised
refers to a professional field or discipline e.g. business, engineering etc. or specific
dialects, genres, registers.
There are 33 specialised corpora shown in the list. They are The Aachen Corpus of
Academic Writing (ACAW), The Advanced Learner English Corpus (ALEC), The BATMAT
Corpus, The British Academic Written English (BAWE) corpus, Canadian job cover letter
corpus, The Chinese/English Political Interpreting Corpus (CEPIC), The Chinese Academic
Written English corpus (CAWE), The Corpus Archive of Learner English in Sabah/Sarawak
(CALES), The Corpus of Multilingual Opinion Essays by College Students (MOECS), Corpus of
Written Spanish, L2 and Heritage Speakers (COWS-L2H), The EFL Teacher Corpus (ETC), The
ETS Corpus of Non-Native Written English, The Europarl corpus of Native Non-native and
Translated Texts (ENNTT), The GICLE corpus (German component of ICLE), The Indianapolis
Business Learner Corpus (IBLC), The International Teaching Assistants corpus (ITAcorp),
Lancaster Corpus of Academic Written English (LANCAWE), The Lang-8 Learner Corpora, The
Learner Corpus of Engineering Abstracts (LCEA), The Learner Corpus of English for Business
Communication, Multilingual Corpus of Second Language Speech (MuSSeL), The Polish
Learner Corpus PoLKo, The Québec learner corpus, The Russian Learner Translator Corpus
(RusLTC), The Undergraduate Learner Translator Corpus (ULTC), The UPF Learner
Translation Corpus, The Varieties of English for Specific Purposes dAtabase learner corpus
(VESPA), The deL1L2IM corpus, MISTiC (Multiple Italian Student TranslatIon Corpus), The
Tartu Learner Corpus of Spanish as a L3+, The MeLLANGE Learner Translator Corpus (LTC),
The University of Toronto Romance Phonetics Database (RPD).

e. State ALL the target languages of these corpora.

Arabic, Chinese, Croatian, Czech, Dutch, English, Italian, Spanish, Cantonese,
Putonghua, Portuguese, German, Icelandic, Slovene, Latvian, Russian, French,
Norwegian, Polish, Estonian, Finnish, Swedish, Gaelic, Hungarian, Korean,
Lithuanian, Persian, Catalan, Romanian.

8. Search for the ARCHER corpus website.

a. What does the word ARCHER stand for?

ARCHER stands for A Representative Corpus of Historical English Registers. It is a corpus of
British and American English.

b. What type of corpus does ARCHER belong to?

ARCHER is a multi-genre corpus of British and American English.

c. In which year did work on this corpus begin?

It was first constructed in the 1990s.

d. In which time range did the corpus cover for the compilation of ARCHER?

The ARCHER corpus covers the period from 1600 to 1999 for the compilation of its
content.

e. Who created this corpus?

Douglas Biber and Edward Finegan.

f. In which institution is the corpus based?

The ARCHER corpus is based at the University of Manchester, as indicated by the email
contact provided in the text: archer@manchester.ac.uk.

g. What is the objective in creating ARCHER?

The objective in creating ARCHER is to develop a multi-genre corpus of British and
American English that covers the period from 1600 to 1999. This corpus is intended for
linguistic research and analysis, allowing researchers, scholars, and students to study the
evolution and usage of the English language over this historical time span.

h. How many versions of ARCHER are there and state the name of the creators, their
affiliation, year of completion and number of words for each version.
Five versions
ARCHER 1: This version was completed in 1993. The number of words in this version is
not specified in the search results.
ARCHER 2: This version was completed in 2004-2005. The number of words in this
version is not specified in the search results.
ARCHER 3.1: This version was completed in the summer of 2006. The total number of
words in this version is 1,253,557.
ARCHER 3.2: This version was completed in 2013. The total number of words in this
version is not specified in the search results.
ARCHER 3.3: This version is currently in preparation.

9. Go to http://rcpce.engl.polyu.edu.hk/index.html and search for the Hong Kong Engineering

Corpus.

Based on the information on the website, answer the following questions.

a. When did this project begin?

2006.

b. Who are the people behind it?

Members and associates of the RCPCE, such as Prof. Eric Friginal and Dr William Feng, are
involved in solid and internationally acclaimed work in (critical) discourse analysis,
discourse intonation, intercultural communication studies, language assessment, lexical
studies, systemic functional grammar, and other areas related to professional contexts.

c. What is the current word count for this corpus?
The Hong Kong Engineering Corpus currently contains 9,224,384 words.

d. What is the specific purpose of compilation?

The Hong Kong Engineering Corpus (HKEC) is compiled for users to learn more about the
language of the engineering industry, in particular the study of the patterns of use of
specific words and phrases. It is an educational and research resource that is publicly
available via the website of the Research Centre for Professional Communication in
English (RCPCE) to benefit engineering professionals, academics, and students locally
and internationally. The HKEC corpus searches display only short segments of texts (12
words either side of the word(s) being studied), and therefore the complete texts
contained in the HKEC are not available to the user to either read in their entirety or to
download.
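
The display described above, a search word with a fixed number of words on either side, is a keyword-in-context (KWIC) concordance. The following is a minimal sketch of that idea only: the function name and the sample sentence are invented, and nothing here reflects how the HKEC search tool is actually built. The default window of 12 follows the figure given in the text, and the demo uses a window of 3 to keep the output short.

```python
def kwic(tokens, keyword, window=12):
    """Return (left, node, right) context tuples for each occurrence of keyword."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = tokens[max(0, i - window):i]   # up to `window` words before
            right = tokens[i + 1:i + 1 + window]  # up to `window` words after
            lines.append((" ".join(left), token, " ".join(right)))
    return lines


# Hypothetical sentence, invented for illustration.
text = "The contractor shall inspect the load before the crane lifts the load again"

for left, node, right in kwic(text.split(), "load", window=3):
    print(f"{left:>20} | {node} | {right}")
```

Because only the window around each hit is returned, a reader of the output can study usage patterns without ever seeing the complete source text.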

e. What is the context of the corpus?

The context of the Hong Kong Engineering Corpus (HKEC) is to serve as a resource for
studying the language of the engineering industry. It is designed to help users,
particularly engineering professionals, academics, and students, understand the patterns
of usage of specific words and phrases within the context of engineering-related texts.
The corpus contains text segments from various sources related to engineering, and it
allows users to search for specific words or phrases and view short contextual snippets
(typically 12 words on either side of the word or phrase being studied). The goal is to
provide insight into how language is used in the field of engineering, facilitating research
and education in this domain. Users can access HKEC via the website of the Research
Centre for Professional Communication in English (RCPCE) to benefit their understanding
of engineering language.

f. What are the types of texts used in the corpus?

The types of texts used in the Hong Kong Engineering Corpus (HKEC) are typically related
to the field of engineering. These texts can include a variety of documents, such as
research papers, technical reports, academic articles, manuals, project documentation,
and other written materials that are relevant to engineering and its various subfields.
The corpus is designed to represent the language and terminology commonly used in
the engineering industry, making it a valuable resource for studying the language of
engineering and understanding how specific words and phrases are used in this context.

g. State the range of subject matter/topics covered in this corpus.

The Hong Kong Engineering Corpus (HKEC) covers a wide range of topics related to the
engineering sector. Here are some of the text types and their corresponding codes
included in the HKEC:

About Us (AU): 647,013 words

Abstracts (A): 94,671 words
Agreements (AG): 127,895 words
Circular Letters (CL): 143,313 words
Code of Practice (C): 997,228 words
Conference Proceedings (CP): 196,498 words
Consultation Papers (CSP): 111,494 words
Fact Sheets (FS): 26,059 words
Frequently Asked Questions (FAQ): 55,726 words
Guides (G): 783,805 words
Handbooks (HB): 67,284 words
Letters to Editor (LE): 3,492 words
Manuals (M): 296,299 words
Media Releases (MR): 1,566,742 words
Notes (N): 156,255 words
Ordinances (O): 139,176 words
Plans (PL): 4,173 words
Position Documents (PD): 75,660 words
Publicity Material (PM): 599,407 words
Product Descriptions (PRD): 611,549 words
Project Summaries (PS): 115,829 words
Q & A (QA): 27,703 words
Reports (R): 979,170 words
Review Papers (RP): 106,506 words
Speeches (SP): 2,822 words
Standards (S): 136,024 words
Technical Papers (TP): 65,731 words
Tender Notices (TN): 4,242 words
Transaction Discussions (TRD): 7,149 words
Transaction Notes (TRN): 79,058 words
Transaction Proceedings (TRP): 1,055,248 words

These text types cover a broad spectrum of engineering topics, providing a
comprehensive resource for studying the language used in the engineering industry.

h. State the variety(s) of English that are included in this corpus.

The Hong Kong Engineering Corpus (HKEC) includes texts collected from the engineering
sector of Hong Kong. The variety of English included in this corpus is primarily Hong
Kong English, which is the English language as it is used in Hong Kong. The corpus is a
reflection of the English language as it is used in the professional engineering context in
Hong Kong.

10. Search for TCSE: Ted Corpus Search Engine - yohasebe.com.

a. What is TCSE?
TCSE stands for TED Corpus Search Engine.
b. What is the type of texts used as corpus in this website?
The type of texts used as corpus in this website are transcripts of TED Talks.
c. What is the objective of this website?
The objective of this website is to provide a search engine that specializes in exploring transcripts of
TED Talks for educational and scientific purposes.
d. Who is the creator of this corpus website?
The creator of this corpus website is Yoichiro Hasebe at Doshisha University, Kyoto, Japan.
e. What are the various languages included in the TED talk corpus?
The TED talk corpus includes Arabic, Bulgarian, Burmese, Chinese (simplified and traditional), Croatian,
Czech, Dutch, French, German, Greek, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese,
Korean, Kurdish, Northern Kurdish, Persian, Polish, Portuguese, Portuguese (Brazilian), Romanian,
Russian, Serbian, Slovak, Spanish, Swedish, Thai, Turkish, Ukrainian, and Vietnamese.

f. Type this phrase “let me put it this way” in the search box. How many hits in how
many talks did the result show?
There are 137,637 hits in 5 talks.
g. Why do you think the speaker used this phrase in his talk?
The speaker might have used this phrase to rephrase or clarify a point in a different way that is
easier to understand for the audience. It is a common phrase used in English to introduce a different
way of explaining something.

11. Search for https://clic.bham.ac.uk/.

a) What is the meaning of CLiC and what is this website about?

CLiC stands for Corpus Linguistics in Context, and it is a website that provides a web app for the
analysis of literary texts. The website is part of the CLiC Dickens project, which demonstrates how
computer-assisted methods can be used to study literary texts through corpus stylistics.

b) When did this project begin?

The CLiC Dickens project started at the University of Nottingham in 2013.

c) Who is responsible for spearheading this endeavor?

Michaela Mahlberg, a professor of corpus stylistics at the University of Birmingham.
d) What is the present total word count for this corpus?
There are 16,732,352 words.
e) What is the distinct objective for creating this compilation?
It aims to make the CLiC website as easy to use and as accessible as possible for all users. And the
site will meet the recommended government standard for web accessibility (WCAG 2.1 AA).

f) What contextual background is associated with the corpus?

The CLiC Dickens project is funded by the Arts and Humanities Research Council, grant reference
AH/P504634/1. Project team: Prof. Michaela Mahlberg, Prof. Peter Stockwell, Viola Wiegand
The CLiC web app is part-funded by GLARE. GLARE is a project funded by the European Commission
within Marie Sklodowska-Curie Actions (reference number EU 749521). Project team: Prof. Michaela
Mahlberg, Dr. Anna Cermakova.

g) What kinds of textual materials are incorporated within this corpus?

The corpora included are DNov, 19C, ChiLit, ArTs and AAW.

h) Can you delineate the breadth of subject matter/topics addressed within this corpus?
The breadth of subject matter/topics addressed within the CLiC corpus is limited to literary texts,
specifically novels written by Charles Dickens and other 19th-century authors. The corpus is designed
to support the analysis of fictional speech and literary body language in literary texts. Therefore, the
subject matter and topics addressed within the corpus are limited to the literary aspects of the texts,
such as character development, plot, themes, and literary devices. The corpus is not intended to
cover a wide range of topics or subject matter outside of the literary context.

i) Specify the English language varieties encompassed within this corpus.
The CLiC corpora cover different kinds of narrative fiction. They focus on the distinction between
narrative and fictional speech, and the stylistic choices made by these authors, particularly their
approach to sentence patterns, diction and punctuation.

j) State the functions given to use this tool.

The CLiC functions can be divided into two groups:
The ‘Concordance’ and ‘Subsets’ tabs both display text (patterns) from the selected books in
context. This is where you can analyse the use of particular words and phrases.
The ‘Clusters’ and ‘Keywords’ tabs both show lists of frequent patterns (without context),
but they differ in their applications. The Clusters tab lists frequent words and word sequences
(‘clusters’) in a single corpus (or several corpora if you have selected more than one). In the
Keywords tab, you can compare the frequency of words and clusters in one corpus with another.
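
The two list-based functions can be sketched in a few lines. This is an illustration under stated assumptions, not the CLiC implementation: all function names and sample sentences are invented, and the keyword step uses a plain frequency ratio with a hypothetical threshold in place of the statistical measure a real keywords tool would apply.

```python
from collections import Counter


def clusters(tokens, n=2):
    """Count every run of n consecutive tokens (an n-word 'cluster')."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


def relative_frequency(tokens):
    """Map each word to its share of all tokens in the corpus."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}


def overused_words(target, reference, min_ratio=2.0):
    """Words at least min_ratio times more frequent in target than in reference.

    Words that never occur in the reference corpus are skipped in this sketch.
    """
    target_freq = relative_frequency(target)
    reference_freq = relative_frequency(reference)
    return sorted(
        word
        for word, freq in target_freq.items()
        if word in reference_freq and freq / reference_freq[word] >= min_ratio
    )


# Hypothetical mini-corpora, invented for illustration.
target = "his hands in his pockets and his hands in his lap".split()
reference = "she put her hands on the table and left his room in silence".split()

print(clusters(target)["his hands"])      # 2
print(overused_words(target, reference))  # ['hands', 'his', 'in']
```

The first function looks inside one corpus, as the Clusters tab does, while the second needs two corpora, which mirrors the comparison made in the Keywords tab.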

Dr Afida

