Language, computers & statistics: not just for geeks. Corpus linguistics in the A level classroom

Glossary of corpus linguistics terminology used here



colligation

The regular and statistically meaningful co-occurrence of a word not with other specific words, but with grammatical categories or patterns, such as word class or tense.

  • The lexical item “naked eye” colligates with the definite article “the” and prepositions (“with” or “to” usually).

  • The noun “consequence” colligates with a form of the verb “BE” + subordinate clause


collocation

The regular co-occurrence of two (or more) words that is more frequent than chance would predict (= statistically meaningful). Collocation can be calculated using slightly different statistical measures (see “Dice coefficient”, “log-likelihood” and “mutual information”), but all of these measures take the following into account when working out the strength of a collocational pattern:

  1. Total number of tokens in the corpus

  2. Number of tokens (occurrences) of the node word

  3. Number of tokens (occurrences) of the collocate

  4. Number of tokens (occurrences) where node and collocate appear together within a specified range (see “span”)

The observed number of co-occurrences in the corpus is compared to the expected number of co-occurrences, i.e. the number expected by chance given 1, 2, and 3. If the observed number of co-occurrences of node and collocate is larger than what can be ascribed to chance, then the collocate is a statistically significant collocate of the node.

  • “grass” collocates with “green”

  • “letter” collocates with “read” and “write”

  • In GCSE and A level specifications, the noun “student” collocates with the noun “assessment”.
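The comparison of observed and expected co-occurrences described above can be sketched in a few lines of Python. This is a minimal illustration, not the exact formula used by any particular concordancing tool: it assumes a common simplified way of estimating the expected count (node frequency × collocate frequency × span size ÷ corpus size) and a mutual information score of log2(observed ÷ expected). All counts are invented for the example, and the function names are hypothetical.

```python
import math


def expected_cooccurrences(corpus_size, node_freq, collocate_freq, span_size=1):
    """Co-occurrences expected by chance, assuming node and collocate
    are distributed independently of each other (simplified estimate)."""
    return node_freq * collocate_freq * span_size / corpus_size


def mutual_information(observed, expected):
    """Simplified mutual information score: log2 of observed over expected."""
    return math.log2(observed / expected)


# Invented counts, for illustration only
corpus_size = 1_000_000   # 1. total number of tokens in the corpus
node_freq = 500           # 2. occurrences of the node word
collocate_freq = 2_000    # 3. occurrences of the collocate
observed = 40             # 4. co-occurrences within the span
span_size = 8             # assumed span: 4 words left + 4 words right

expected = expected_cooccurrences(corpus_size, node_freq, collocate_freq, span_size)
mi = mutual_information(observed, expected)

print(f"expected by chance: {expected}")
print(f"observed: {observed}")
print(f"MI score: {mi:.2f}")
```

A score above zero means the pair co-occurs more often than chance would predict; how large the score must be to count as a meaningful collocation depends on the measure and the threshold chosen.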


concordance

A display of every occurrence of the node (the word under investigation) in a corpus, with the node placed in the middle of each line and a specified number of words of context shown to its left and right (see also “node” and “KWIC”).
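As a rough illustration of how such a display is built, the sketch below lines up each occurrence of a node word with a fixed number of words either side. It assumes the text has already been split into tokens on whitespace; the function name `kwic` and the sample sentence are made up for the example, and real concordancing tools handle punctuation, case and sorting in more sophisticated ways.

```python
def kwic(tokens, node, span=4):
    """Return (left context, node, right context) for each occurrence of node."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == node.lower():
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append((left, token, right))
    return lines


# Invented sample text, split on whitespace
tokens = "the grass is green and the green grass grows".split()

rows = kwic(tokens, "grass", span=2)
for left, node, right in rows:
    # Right-align the left context so the node words line up in a column
    print(f"{left:>20}  {node}  {right}")
```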



concordancing tool

A software tool that analyses corpora and allows researchers to produce concordances for the word(s) they are interested in. Many concordancing tools also allow you to calculate frequency lists.

Examples include AntConc (free to download from Laurence Anthony’s website), WordSmith Tools (not free) and LancsBox (Lancaster University’s corpus tools, free), as well as online interfaces such as CQPweb (Lancaster University) and Sketch Engine.

corpus (plural: corpora)

A large collection of texts that has been converted into an electronic format (e.g. plain text for AntConc) so it can be analysed by a concordancing tool or other software. A corpus can contain written texts as well as (transcribed) spoken texts; it can be general (a wide range of written & spoken data to represent “typical or general” forms of a language); it can be monolingual or bilingual; it can be based on specific discourses, e.g. academic English or newspaper reports only.
A specific corpus is usually built to investigate a particular register or topic, while general corpora are used to look at more general aspects of language usage. General corpora are also used to calculate keyword lists and keyness (see “keyword” and “keyness”).

General corpora: British National Corpus (BNC), Corpus of Contemporary American English (COCA), Brown Corpus (one of the earliest general corpora of American English, free, small).

