Language, computers & statistics: not just for geeks. Corpus linguistics in the A level classroom

Glossary of corpus linguistics terminology used here

Specialist corpora: Brexit corpus on Sketch Engine; JSI Timestamped corpora on Sketch Engine (news articles)


CQPWeb

An online interface developed at Lancaster University (Prof. Andrew Hardie) which hosts several corpora, including the BNC XL 2007. CQP stands for Corpus Query Processor.

Available at:

Critical Discourse Analysis (CDA)

Critical discourse analysis is a methodology that enables a vigorous assessment of what is meant when language is used to describe and explain. It is a form of discourse analysis (close, detailed linguistic analysis of text[s]) influenced by Critical Theory. It is frequently combined with corpus linguistics to allow for larger amounts of texts to be analysed.

CDA is particularly interested in ideology, power & representation (of minority groups usually) and some researchers are very politically engaged in their work (e.g. socialist, feminist, anti-racist etc.). Norman Fairclough’s well-known Language and Power is a key CDA text.

Dice coefficient and LogDice

A statistical measure used in corpus linguistics to calculate collocation strength. The Dice coefficient takes both significance and effect size (see “significance” and “effect size” below) into consideration.
Sketch Engine uses an adapted version of the Dice coefficient called LogDice.

CQPWeb allows you to choose the Dice coefficient as one method of calculating collocation strength.
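To make the two measures concrete, here is a minimal sketch in Python. It assumes the usual definitions: Dice is twice the number of times the two words occur together, divided by the sum of their individual frequencies, and LogDice is 14 plus the base-2 logarithm of Dice. The function names and the counts in the usage note are invented for illustration and are not taken from any corpus tool.

```python
import math


def dice(f_xy, f_x, f_y):
    """Dice coefficient: 2 * co-occurrences / (frequency of x + frequency of y)."""
    return 2 * f_xy / (f_x + f_y)


def log_dice(f_xy, f_x, f_y):
    """LogDice: 14 + log2(Dice). The theoretical maximum is 14."""
    return 14 + math.log2(dice(f_xy, f_x, f_y))
```

With invented counts where two words each occur 100 times and co-occur 50 times, `dice(50, 100, 100)` gives 0.5 and `log_dice(50, 100, 100)` gives 13.0. Because the score depends only on these three frequencies, it does not change with the overall size of the corpus.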


dispersion

A graphic representation (usually some form of bar chart) of how a word or token is spread across the corpus (see “distribution” below).

See exercise 2d.


distribution

The way a word / token is spread across the corpus: does it appear in only a few texts in a concentrated manner, or is it spread more evenly across all / many of the texts that make up the corpus? (See “dispersion” above.)

e.g. both ‘man’ and ‘woman’ are spread across almost all the texts that make up the BNC. This is in sharp contrast with ‘raping’ and ‘menstruating’ which do not appear in every text of the corpus and have a tendency to be clustered in certain texts (more so with the latter). See exercise 2d.
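The idea of a word being spread evenly or clustered can be sketched in a few lines of Python. This is only an illustration: it assumes each text is a plain string split on spaces, and the function names `distribution` and `text_range` are made up here rather than taken from CQPWeb or Sketch Engine.

```python
def distribution(word, texts):
    """Count how often `word` occurs in each text (case-insensitive, split on whitespace)."""
    target = word.lower()
    return [text.lower().split().count(target) for text in texts]


def text_range(word, texts):
    """Number of texts in which `word` occurs at least once."""
    return sum(1 for count in distribution(word, texts) if count > 0)
```

For three toy texts, `distribution("man", ["the man saw the woman", "a man and a man", "no one here"])` returns `[1, 2, 0]`, and `text_range` for the same input returns 2. A word with a high total frequency but a low range is clustered in a few texts.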

effect size

A statistical measure that expresses how large a phenomenon is.
In corpus linguistics, measures of effect size tend to give a high collocational strength to collocations in which one or both of the words are rare. This may be useful for some research, but in other research it may skew the findings.

See exercises 1d and 1e.

focus corpus

When calculating keyness or keywords, this is how the corpus you are investigating is referred to (so as to distinguish it from the ‘reference corpus’). (See “keywords” and “reference corpus” below).

See exercise 2b.


frequency

How often a word occurs in the corpus; usually expressed as a number per million tokens.


frequency list

A list of the words in the corpus ranked by frequency, with the most frequent words first.
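As a rough sketch of how a frequency list and a per-million figure can be produced, the Python below counts the tokens in a list and normalises each raw count to a rate per million tokens. It assumes the corpus has already been split into tokens; the function name `frequency_list` is hypothetical.

```python
from collections import Counter


def frequency_list(tokens):
    """Return (word, raw count, frequency per million) rows, most frequent first."""
    counts = Counter(tokens)
    total = len(tokens)
    return [
        (word, count, count * 1_000_000 / total)
        for word, count in counts.most_common()
    ]
```

For the eight toy tokens in `"the cat sat on the mat the end".split()`, the first row is `("the", 3, 375000.0)`. Normalising per million is what makes frequencies comparable between corpora of different sizes.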



keyness

Keyness is the quality a word or phrase has of being "key" in the context of a particular corpus or discourse (see “keyword” below). This can be expressed as a score, which is calculated by taking the frequency per million of the word in the focus corpus plus 1 or 100 and dividing this by the frequency per million of the word in the reference corpus plus 1 or 100.

See exercise 2b.
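Formula for Keyness score:

$$\text{keyness score} = \frac{fpm_{focus} + N}{fpm_{ref} + N}$$

where fpm = frequency per million (fpm_focus in the focus corpus, fpm_ref in the reference corpus) and N is a number, usually 1 or 100, which evens out the calculation’s result (for example, it avoids dividing by zero when the word does not occur in the reference corpus).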
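The keyness score described above can be worked through in Python. This is a minimal sketch of that calculation only; the function name `keyness` and the figures in the usage note are invented for illustration.

```python
def keyness(fpm_focus, fpm_ref, n=1):
    """Keyness score: (fpm in focus corpus + N) / (fpm in reference corpus + N)."""
    return (fpm_focus + n) / (fpm_ref + n)
```

For a word with 99 occurrences per million in the focus corpus and 9 per million in the reference corpus, `keyness(99, 9)` gives 10.0 with N = 1, whereas `keyness(99, 9, n=100)` gives roughly 1.83. A larger N reduces the influence of low-frequency words on the ranking.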


keyword

A word which occurs in a corpus more often than we would expect it to occur by chance alone. Keywords are calculated by comparing the word frequencies in a text against their expected frequencies in a much larger corpus, which acts as a reference for general language use.

See exercise 2b.

KWIC (key word in context)

The concordance display of a node (usually highlighted in a different colour and centred) showing the words to the left and right as specified. Allows researchers to get a sense of how the word is used, its meanings etc.
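A KWIC display can be imitated in a few lines of Python. The sketch below assumes the text is already a list of tokens and returns each hit as a (left context, node, right context) triple; the function name `kwic` and its parameters are hypothetical and do not correspond to any particular concordancer.

```python
def kwic(tokens, node, left=4, right=4):
    """Return (left context, node, right context) for every occurrence of `node`."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == node.lower():
            left_context = " ".join(tokens[max(0, i - left):i])
            right_context = " ".join(tokens[i + 1:i + 1 + right])
            lines.append((left_context, token, right_context))
    return lines
```

For example, `kwic("the old man saw the young man leave".split(), "man", left=2, right=2)` returns `[("the old", "man", "saw the"), ("the young", "man", "leave")]`.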



lemma

The so-called canonical form of the word (like the dictionary entry). If you ask a concordancer to look for “word”, it will usually look only for that particular form, ignoring plurals and other forms of the word. If you search for a lemma, you will get all possible forms of the “same” word.

E.g. the lemma for a verb is its infinitive form and when searching a corpus for a verb lemma form, you will get all uses of the verb, present tense forms, past tense forms etc.


A statistical measure used to calculate collocation. This measures only effect size. (see “effect size” above)

See exercises 1d and 1e.


node

The word or lexical item that is under investigation.

E.g. the tokens “man” and “woman” in exercise 1.

reference corpus

This is an extra corpus that is required to calculate keywords for the corpus you are interested in. Usually, a reference corpus is a general corpus which aims to represent typical language use. If you are investigating a general corpus for keywords, usually your reference corpus will be another general corpus, preferably one that is bigger than your focus corpus (this makes the keyword calculation more reliable).

See exercise 2b.

seed word

A word chosen (as part of a series of words / phrases) to build a corpus. When building a corpus from scratch like this, it is important to think carefully about seed words, as your choices will determine what ends up in your corpus. It is easy to introduce bias into a corpus by selecting only particular seed words!

For example, if you want to investigate attitudes towards LGBT+ people & issues in the news, you could build a corpus using “gay”, “lesbian”, “bisexual”, “trans”, “transgender”, and “LGBT” as seed words.


significance

In statistics (and research that employs statistical measurements), researchers set a threshold for the probability of something occurring by chance. If the probability that a finding arose by chance alone (usually called the p-value) falls below this threshold, it suggests that the finding is more than just accidental.
In corpus linguistics, there are different ways of calculating significance (and there’s some controversy in this area). Some of the original approaches, such as t-score, are no longer used because they are not suitable for language: language (unlike, say, heart disease in a general population) is not quite random. It is rare for contemporary corpus linguists to use only measures of significance. However, when combined with effect size (see “effect size” above), significance can be helpful.


Sketch Engine

An online interface developed by Lexical Computing Ltd. It provides access to a large number of corpora in different languages, as well as a range of corpus functions. It allows you to load your own corpus for investigation; Sketch Engine will tag your corpus automatically so that you can investigate word sketches. It has a 30-day free trial, as well as a free version (with a limited range of corpora available).

Available from:


span

The window of words around the node within which a researcher looks for possible collocates.

In English, it is common to have a span of 4 places to the left (4L) and 4 to the right (4R). 5L and 5R is another common span used for English. But you can tailor your span to suit your research: e.g. if you are interested in particular attributive adjectives being used with a noun, you may wish to specify a span of 1L (one to the left of the node).
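The effect of choosing a span can be sketched in Python. The function below collects every word that falls within the chosen number of places to the left and right of the node and counts it; it assumes a pre-tokenised text, and the name `collocates` and its parameters are invented for this illustration.

```python
from collections import Counter


def collocates(tokens, node, left=4, right=4):
    """Count the words found within `left` and `right` places of each occurrence of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            window = tokens[max(0, i - left):i] + tokens[i + 1:i + 1 + right]
            counts.update(window)
    return counts
```

With a 1L span, `collocates("a tall man met a kind man".split(), "man", left=1, right=0)` counts only the word immediately before each occurrence of the node, here "tall" and "kind" once each. Widening the span to 4L and 4R would also bring in words such as "met" and "a".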


token

A word / lexeme in the corpus, counted every time it occurs, including repetitions of the same word. This is different from “word” in corpus linguistics, as “word” refers to the same lexical item and is counted only once to give a word count. The word count in a corpus is smaller than the token count.

For example, the JSI timestamped English corpus 2014-19 has over 44 billion tokens, but just over 38 billion words.

word sketch

A function on Sketch Engine (indeed, it is behind the name) that shows the collocations and colligations of the word(s) you are interested in.

See exercise 2c.

Zipf’s Law

A statistical pattern found in many social science phenomena, which also applies to word frequencies in language. Basically, when you calculate a frequency list, Zipf’s law states that the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.

See exercise 2a.
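One way to check a frequency list against Zipf’s law is to compare each observed count with the count the law predicts, namely the top frequency divided by the word’s rank. The Python below is a minimal sketch of that comparison; the function name `zipf_table` is hypothetical and the counts in the usage note are invented, not drawn from a real corpus.

```python
def zipf_table(freq_list):
    """Given (word, count) pairs sorted most frequent first,
    return (rank, word, observed count, count predicted by Zipf's law)."""
    top_count = freq_list[0][1]
    return [
        (rank, word, count, top_count / rank)
        for rank, (word, count) in enumerate(freq_list, start=1)
    ]
```

For the invented list `[("the", 6000), ("of", 3100), ("and", 1900)]`, the predicted counts are 6000.0, 3000.0 and 2000.0, so the observed figures of 3100 and 1900 are close to what the law leads us to expect.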
