A collection of texts, spoken and/or written, which has been designed and compiled based on a set of clearly defined criteria.

CORPUS [13c: from Latin corpus body. The plural is usually corpora]. A collection of texts, especially if complete and self-contained: the corpus of Anglo-Saxon verse. Plural also corpuses. In linguistics and lexicography, a body of texts, utterances, or other specimens considered more or less representative of a language, and usually stored as an electronic database. Currently, computer corpora may store many millions of running words, whose features can be analysed by means of tagging (the addition of identifying and classifying tags to words and other formations) and the use of concordancing programs. Corpus linguistics studies data in any such corpus.

A collection of linguistic data, either written texts or a transcription of recorded speech, which can be used as a starting-point of linguistic description or as a means of verifying hypotheses about a language.

A collection of naturally occurring language text, chosen to characterize a state or variety of a language.

Monitor corpus - attempts to be a representative cross-section of the spoken and/or written language to be studied (e.g. The Bank of English (COBUILD) and the British National Corpus), and by its very nature it has to be very large (The Bank of English is about 400 million words of written and spoken texts and continues to grow). Monitor corpora have to be continually updated with 'new' texts, and 'old' texts must be discarded, if they are to be truly representative.


Sample corpus - does not pretend to be representative of the whole spoken and/or written forms of the language to be investigated. Sample corpora are much more common and are the norm in most corpus-based studies (e.g. International Corpus of English and the Hong Kong Corpus of Spoken English at PolyU).

Definition of corpus linguistics

Corpus linguistics is simply the study of language through corpus-based research, but it differs from traditional linguistics in its insistence on the systematic study of authentic examples of language in use.

Text linguistics vs corpus linguistics

Illustration vs evidence

Introspection and informant testing vs observation of text

(i.e. corpus = evidence)

Language cannot be invented; it can only be captured.

Examples of English language corpora

The Bank of English - written and spoken English (used extensively by researchers and for the COBUILD series of English language books)

The BNC - written and spoken British English (used extensively by researchers and for the Oxford University Press, Chambers and Longman publishing houses)

CANCODE (Cambridge and Nottingham Corpus of Discourse in English) - spoken British English (used extensively by researchers and Cambridge University Press)

ICE (International Corpus of English) - international varieties of spoken and written English (most of the corpus is not yet available)

Brown University Corpus & LOB (Lancaster-Oslo-Bergen) Corpus - parallel corpora of written texts (but now rather outdated)

London-Lund Corpus (Survey of English Usage) - spoken British English (used very extensively by researchers, but it is now quite old)

Santa Barbara Corpus - spoken American English (most of the corpus is not yet available)

Hong Kong Corpus of Spoken English (still being compiled; 1 million of the target 1.5 million words have been collected so far)

ICAME (International Computer Archive of Modern English) - a centre which aims to coordinate and facilitate the sharing of computer-based corpora.

Examples of corpus linguistic studies

Three main types of corpus-based linguistic study:

1. Lexical: e.g. word use, idioms, irregular plurals

eye vs eyes


2. Syntactic: sentence level features, e.g. use of prepositions, verb forms, pronouns, agreement

who vs whom

look vs watch

3. Discourse: the structure of text e.g. cohesion above the sentence level



Test the hypothesis of co-selection

Study the verbal environment of a word or phrase and examine the nature and extent of multi-word choices in the clause.

Inspect co-text

co-selection (intricate pattern of word choice in the clause)

the verbal environment of a word or phrase: the nature and extent of multi-word choices

the hypothesis of co-selection: tends to undermine the notion of word-meaning

Form-meaning distinction

Sinclair (in Wichmann et al., 1997) claims that meaning has an important effect on structure. If a word has two meanings, it is possible to predict that it also has at least two structures (Sinclair, 1997: 35-36), and this prediction can only be checked by studying examples of language in use.

some words (e.g. combat) exist as noun and verb: noun = more concrete, narrower meaning; verb = more figurative, vaguer meaning

transitive and intransitive verbs (e.g. manage)

verbs referring to physical senses, e.g. see, feel, hear, smell (referring to the present time: typically preceded by the modal can or can't rather than being in the simple present tense; when used with non-physical meanings and in other meanings: simple present tense used)


Corpus-based language studies enable the researcher to identify and describe various realisations of the productivity of language - what Sinclair (1997: 37-38) terms 'permissible variety'. For example, a search for the first noun X in the structure a(n) X of Y allows the researcher to uncover the productive opportunities in language.
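A search for the a(n) X of Y structure can be sketched with a regular expression. This is a minimal illustration, not the procedure Sinclair used: the sample sentences are invented, the function name x_of_y is hypothetical, and the pattern treats any single word after a(n) as X, so it does not check that X is really a noun. A tagged corpus would be needed for that.

```python
import re
from collections import Counter

# Invented sample text; a real study would read corpus files instead.
sample = (
    "She bought a cup of tea and a piece of cake. "
    "He told a pack of lies about an act of kindness. "
    "We shared a piece of advice and a cup of coffee."
)

# "a" or "an", then one word (X), then "of", then one word (Y).
pattern = re.compile(r"\ban?\s+(\w+)\s+of\s+(\w+)", re.IGNORECASE)

def x_of_y(text):
    """Return lower-cased (X, Y) pairs for every 'a(n) X of Y' match."""
    return [(x.lower(), y.lower()) for x, y in pattern.findall(text)]

pairs = x_of_y(sample)

# Count how often each word fills the X slot.
x_counts = Counter(x for x, _ in pairs)
for word, count in x_counts.most_common():
    print(word, count)
```

Sorting the X slot by frequency shows which words are most productive in the frame; listing the Y words found with each X shows the range of combinations.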

Schematic knowledge (see Aston, in Wichmann, 1997)

To enable the researcher to become aware of the many kinds of regularities in discourse and the extent to which language users can operate with fixed or semi-fixed associations

Based on a distinction between syntagmatic associations at the levels of meanings (informational/rhetorical structure) and of forms (collocation/colligation/semantic/pragmatic/prosodic), and paradigmatic associations between situation and meaning (genre and register conventions), between meaning and form (conventional speech act and referring procedures), and between situation and form (routine formulae, technical terminology).


Multiple texts, multiple contexts:

Recurrent patternings among multiple texts (e.g. newspaper articles):

Similar topics: recurrent semantic fields with similar lexicogrammatical content

Similar types: regularities in information and rhetorical structures (Aston, in Wichmann, 1997: 58-60)


Look for regularities (patterning): collocation, colligation, connotation, discourse structuring.

e.g. any

will and would

irregular verbs etc.
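One simple way to look for collocational patterning is to count the words that occur within a few tokens of a node word. The sketch below is an assumption-laden illustration: the sample sentences are invented, the helper names tokenize and collocates are hypothetical, and the window ignores sentence boundaries. Raw counts like these are only a starting point, since a real study would normally also compare them with each word's overall frequency.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into word tokens (apostrophes kept)."""
    return re.findall(r"[a-z']+", text.lower())

def collocates(tokens, node, span=2):
    """Count the words found within `span` tokens either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            counts.update(left + right)
    return counts

# Invented sample text; a real study would use corpus data.
sample = (
    "I would like to go. She will go tomorrow. "
    "I would like some tea. They will be here soon."
)
tokens = tokenize(sample)

print(collocates(tokens, "would").most_common(3))
print(collocates(tokens, "will").most_common(3))
```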

Exploring texts through the concordancer

An example of this kind of study for discourse features is detailed below:

Study a list of words (occurring at least four times) taken from an article and make hypotheses about the text-topic.

Study a list of words (occurring three and two times) taken from the same article

Use the Mini-Concordancer to search for the pattern of a particular word/stem to find all the forms relating to the stem.

Guess the answer to some questions (comprehension).

Remove the concordance and the frequency lists and write a text modelled on the one used for the analyses.

Calculate the frequency of the words used in your text and compare the frequencies with those of the original text.

(Choose 15 words from your text that occur only once and create a cloze test.)
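The frequency-list and stem-search steps above can be approximated in a few lines of code. This is a rough sketch and not the Mini-Concordancer itself: the article text is invented, the function names are hypothetical, and the stem search simply matches any token that begins with the given letters.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def frequency_list(tokens, min_count=1):
    """Return (word, count) pairs with count >= min_count, most frequent first."""
    counts = Counter(tokens)
    return [(w, c) for w, c in counts.most_common() if c >= min_count]

def concordance(tokens, stem, width=3):
    """Return one keyword-in-context line per token that starts with `stem`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.startswith(stem):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left:>25}  [{tok}]  {right}")
    return lines

# Invented article text standing in for the one under study.
article = (
    "The corpus was compiled in 1990. Researchers compile corpora to study "
    "language. A corpus of spoken language is hard to compile, and compiling "
    "a corpus of written language is easier. The corpus grows every year."
)
tokens = tokenize(article)

# Words occurring at least four times, then all forms of the stem "compil".
print(frequency_list(tokens, min_count=4))
for line in concordance(tokens, "compil"):
    print(line)
```

Running frequency_list on the learner's own text as well gives the two lists needed for the frequency comparison step.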

Some fundamental precepts in corpus-based research

Present real examples only

Know your intuition

Inspect context (co-text)

Identify form-meaning distinction

Highlight productivity

Some implications of corpus-based study

The operations of text and context retrieval rarely provide simply what the user was expecting; they turn up unexpected or unthought-of usage. This has three spin-offs:

replace or complement the researcher's intuition

suggest curiosities which may be motivating for both researchers and learners

transform the study of language from an environment which has been 'evidence scarce' to one which is 'evidence abundant'.


Why use a corpus

The computer 'gives us the ability to comprehend, and to account for, the contents of such corpora in a way which was not dreamed of in the pre-computational era of corpus'

Linguistics: to study linguistic competence or performance as revealed in naturally occurring data. Most applications will require or lead to the creation of annotated text.

Language teaching/learning: language for specific purposes (e.g. use newspaper corpora, corpora of scientific texts); to prepare vocabulary lists based on high-frequency lexical items; to prepare CLOZE tests; to answer ad hoc learner questions ('What's the difference between few and a few?'); to discover facts about language ...
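As one illustration of the teaching uses listed above, a cloze test can be generated by blanking out words that occur only once in a text. The sketch below is a simplified assumption of how this might be done: the function name make_cloze, the gap format and the random selection of words are all choices made for the example, and a teacher would normally pick the gapped words by hand.

```python
import random
import re
from collections import Counter

def make_cloze(text, n_gaps=3, seed=0):
    """Blank out up to n_gaps words that occur only once in the text.

    Returns the gapped text and the answer key in order of appearance.
    """
    words = re.findall(r"[A-Za-z]+", text)
    counts = Counter(w.lower() for w in words)
    # Candidate words: those whose lower-cased form occurs exactly once.
    once = [w for w in words if counts[w.lower()] == 1]
    rng = random.Random(seed)  # fixed seed so the same test can be rebuilt
    chosen = set(rng.sample(once, min(n_gaps, len(once))))
    answers = []

    def blank(match):
        word = match.group(0)
        if word in chosen:
            answers.append(word)
            return f"({len(answers)}) ____"
        return word

    gapped = re.sub(r"[A-Za-z]+", blank, text)
    return gapped, answers

# Invented passage; any short text can be substituted.
passage = "The cat sat on the mat while the dog slept near the door."
gapped, answers = make_cloze(passage, n_gaps=3)
print(gapped)
print(answers)
```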

Different corpora

native speaker vs. learner

monolingual vs. multilingual

original vs. translations

synchronic vs. diachronic

plain vs. annotated

Constructing a corpus

There is no consensus in the community as to the procedures to be followed in corpus design (balanced, opportunistic, statistically sophisticated and defiantly naive approaches all struggle with each other for acceptance). The purposes of corpora can also differ widely, e.g. as a basis for a dictionary; to create a word frequency list; to study some linguistic phenomenon; to study the language of a particular author or time period; to train a Natural Language Processing system; as a teaching resource for non-native speakers; to study language acquisition, etc.

Finding and cleaning up text

There is a vast amount of English text available over the Internet for scholarly research. You can easily search the Web for the texts you want and clean them up to build a corpus of your own. Much of what is on the Web is, of course, written text. However, you can also find transcripts of speeches.

Scanning is another method of collecting electronic texts, but you need access to a scanner and must learn how to use OCR (optical character recognition) software. It is still much faster and more reliable than typing.

Your electronic corpora should be ASCII text because most software for text analysis works on ASCII text. After cleaning up the texts, save the files as "Text Only" in Word. In the resulting *.txt files, you lose all formatting such as bold and italics.
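For texts taken from the Web, the clean-up can also be scripted rather than done by hand in a word processor. The following is a minimal sketch under several assumptions: the function name clean_to_ascii and the replacement table are illustrative, the tag-stripping pattern is deliberately crude, and any character with no ASCII equivalent is simply dropped, which may not be acceptable for every corpus.

```python
import re
import unicodedata

# Common typographic characters mapped to plain ASCII equivalents.
REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
    "\u2026": "...",                # ellipsis
    "\u00a0": " ",                  # non-breaking space
}

def clean_to_ascii(text):
    """Strip HTML-style tags, normalise punctuation and return ASCII-only text."""
    # Replace anything that looks like a tag with a space.
    text = re.sub(r"<[^>]+>", " ", text)
    for char, plain in REPLACEMENTS.items():
        text = text.replace(char, plain)
    # Decompose accented letters into base letter plus accent,
    # then drop every character that is not ASCII.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Collapse runs of spaces and tabs left behind by the steps above.
    return re.sub(r"[ \t]+", " ", text).strip()

# Invented sample of Web text with tags and non-ASCII characters.
raw = "<p>The caf\u00e9 \u2018corpus\u2019 \u2013 a <b>na\u00efve</b> sample\u2026</p>"
print(clean_to_ascii(raw))
```

The cleaned string can then be written to a *.txt file with Python's built-in open(), using encoding="ascii".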