# Exploring some of the statistics used in calculating collocations


1d Collocations of ‘man’ with Dice Coefficient procedure: Click on the pull-down menu where it says “statistic” and change it to “Dice coefficient”, then click on the “Go!” button next to “submit changed parameters”. CQPWeb will now calculate the collocations for ‘man’ using a different statistical measure. Make a note of the top 10 collocates for ‘man’ according to this procedure. What do you notice?

1e Collocations of ‘woman’: Repeat the same steps as above for the noun ‘woman’. To go back to the main query menu, open the drop-down menu next to the “Go!” button and select “new query”. Make a note of the frequency per million for ‘woman’. How does this compare to the frequency of ‘man’? Now calculate the collocations for ‘woman’ using the standard calculation (the default for CQPWeb is log-likelihood). Make a note of the top 10 collocates for ‘woman’ using this calculation. Change the setting for statistic to “Dice coefficient” and re-calculate the collocates for ‘woman’ with this measure. Make a note of the top 10 collocates for ‘woman’. How does this compare to the previous measurement (with log-likelihood)?

Question: Looking back at both sets of collocations (the ones calculated with log-likelihood and with Dice coefficient), can this exercise tell us anything about gender in early 21st century Britain?
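To see why the two statistics can produce different top-10 lists, it may help to compute both by hand. The sketch below uses the common textbook formulas for the Dice coefficient and the log-likelihood (G2) statistic on a 2x2 contingency table. It is not taken from CQPWeb, whose exact calculation (for example, how it counts the collocation window) may differ. The function names and all the counts are invented for illustration only.

```python
import math

def dice(cooccurrences, node_freq, collocate_freq):
    """Dice coefficient: 2 * joint frequency / (node frequency + collocate frequency)."""
    return 2 * cooccurrences / (node_freq + collocate_freq)

def log_likelihood(cooccurrences, node_freq, collocate_freq, corpus_size):
    """Log-likelihood (G2) over the 2x2 table of observed vs expected frequencies."""
    observed = [
        [cooccurrences, node_freq - cooccurrences],
        [collocate_freq - cooccurrences,
         corpus_size - node_freq - collocate_freq + cooccurrences],
    ]
    row_totals = [node_freq, corpus_size - node_freq]
    col_totals = [collocate_freq, corpus_size - collocate_freq]
    total = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / corpus_size
            if observed[i][j] > 0:  # an empty cell contributes nothing
                total += observed[i][j] * math.log(observed[i][j] / expected)
    return 2 * total

# Invented counts (NOT real BNC figures): one node word and two candidate collocates.
CORPUS_SIZE = 100_000_000
NODE_FREQ = 50_000
candidates = {
    "very frequent word": {"freq": 6_000_000, "with_node": 8_000},
    "rare word": {"freq": 500, "with_node": 100},
}

for name, c in candidates.items():
    d = dice(c["with_node"], NODE_FREQ, c["freq"])
    ll = log_likelihood(c["with_node"], NODE_FREQ, c["freq"], CORPUS_SIZE)
    print(f"{name}: Dice = {d:.5f}, log-likelihood = {ll:.1f}")
```

With these invented counts, the very frequent word scores higher on log-likelihood, while the rare word scores higher on Dice. Dice rewards pairs whose co-occurrences make up a large share of both words' total frequency. Log-likelihood measures how much evidence there is that the pair co-occurs more often than chance, so it tends to favour pairs with many co-occurrences.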

1f Delving into some detail – collocations of ‘woman’: ‘menstruating’ and ‘raping’

Both ‘menstruating’ and ‘raping’ were among the top 10 collocates for ‘woman’. ‘menstruating’ was a collocate when using the standard log-likelihood calculation for collocation. Start a new query for ‘menstruating’.

1. How many times does this word appear in the BNC?
2. If you read the concordance lines (KWIC) for ‘menstruating’, you can get a sense of how this word is used. What do you notice (e.g. syntactical patterns / grammatical usage)?
3. On the left of the concordance, you can see a code (consisting of two or three capital letters) that identifies the text in which the word occurs. One particular text, ACL, accounts for almost half of all the occurrences of ‘menstruating’ in the BNC. What sort of text is this (you can find out by hovering over the link or clicking on it)?
4. Calculate collocations for ‘menstruating’ using the standard function (log-likelihood). What do you find are the collocates for this word?
5. Change the statistic to “Dice coefficient” and re-calculate the collocations for ‘menstruating’. What do you find?
6. Start a new query for ‘menstruating’ and, once you have the concordance back up, select “dispersion” from the “choose action” pull-down menu. This will give you a graphic representation of where the node ‘menstruating’ appears in the corpus.

‘raping’, on the other hand, was only a collocate for ‘woman’ when calculated using the Dice coefficient.

7. Start a new query for ‘raping’. How many times does this word occur in the BNC? In how many texts?
8. Calculate collocations for ‘raping’ using the standard log-likelihood calculation. What are the 8th to 12th ranked collocates for ‘raping’?
9. Now return to the concordance lines (KWIC) for ‘raping’ by selecting “Back to query result” in the “Choose action” drop-down menu at the top right / bottom right. Skim read the concordance lines. What do you notice about the way the collocates ‘man’, ‘men’, ‘woman’ and ‘women’ behave with the node ‘raping’?
10. Lastly, make a dispersion plot for the node ‘raping’. How does this compare with the one you made earlier for ‘menstruating’? Why is this new plot not that different, in that the node ‘raping’ also seems to cluster in specific areas of the corpus only?
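The questions about how many texts a word occurs in, and about clustering in the dispersion plot, can also be explored with a few lines of code. The following is a minimal sketch, assuming you have copied down the text code and corpus position of each concordance hit. The function names, text codes ("T1", "T2", "T3") and positions are all hypothetical, and the bar it draws is only a rough stand-in for the plot CQPWeb produces.

```python
from collections import Counter

def text_share(hit_text_ids):
    """Summarise how the hits are spread across texts (one text code per hit)."""
    counts = Counter(hit_text_ids)
    top_text, top_count = counts.most_common(1)[0]
    return {
        "hits": len(hit_text_ids),
        "texts": len(counts),
        "top_text": top_text,
        "top_share": top_count / len(hit_text_ids),
    }

def dispersion_bar(positions, corpus_size, width=40):
    """Draw a one-line plot: '|' marks a slice of the corpus containing a hit."""
    cells = ["-"] * width
    for pos in positions:
        cells[min(pos * width // corpus_size, width - 1)] = "|"
    return "".join(cells)

# Invented data: six hits, half of them from a single hypothetical text "T1".
hit_texts = ["T1", "T1", "T1", "T2", "T3", "T3"]
hit_positions = [120, 130, 145, 600, 910, 915]  # token offsets in a 1,000-token toy corpus

print(text_share(hit_texts))
print(dispersion_bar(hit_positions, corpus_size=1000, width=20))
```

If a single text holds a large share of the hits, its marks bunch together in the bar. That is worth keeping in mind when a collocate seems to say something general about the language but comes mostly from one source.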