Lancaster University, Edge Hill University
The corpus-based approach to linguistics and language education has gained prominence over the past four decades, particularly since the mid-1980s. This is because corpus analysis can be illuminating ‘in virtually all branches of linguistics or language learning’ (Leech 1997: 9; cf. also Biber, Conrad and Reppen 1998: 11). One of the strengths of corpus data lies in its empirical nature, which pools together the intuitions of a great number of speakers and makes linguistic analysis more objective (McEnery and Wilson 2001: 103). Unsurprisingly, corpora have been used extensively in nearly all branches of linguistics including, for example, lexicographic and lexical studies, grammatical studies, language variation studies, contrastive and translation studies, diachronic studies, semantics, pragmatics, stylistics, sociolinguistics, discourse analysis, forensic linguistics, and language pedagogy. Corpora have passed into general usage in linguistics in spite of the fact that they still occasionally attract hostile criticism (e.g. Widdowson 1990, 2000).1
The early 1990s saw an increasing interest in applying the findings of corpus-based research to language pedagogy. The upsurge of interest is evidenced by the eight well-received biennial international conferences on Teaching and Language Corpora (TaLC) held in Lancaster, Oxford, Graz, Bertinoro, Granada, Paris, and Lisbon. This is also apparent when one looks at the published literature. In addition to a large number of journal articles, at least twenty-five authored or edited volumes have recently been produced on the topic of teaching and language corpora: Wichmann et al (1997), Partington (1998), Bernardini (2000), Burnard and McEnery (2000), Kettemann and Marko (2002, 2006), Aston (2001), Ghadessy, Henry, and Roseberry (2001), Hunston (2002), Granger et al (2002), Connor and Upton (2002), Tan (2002), Sinclair (2003, 2004), Aston et al (2004), Mishan (2005), Nesselhauf (2005), Römer (2005), Braun, Kohn and Mukherjee (2006), Gavioli (2006), Scott and Tribble (2006), Hidalgo, Quereda and Santana (2007), O’Keeffe, McCarthy and Carter (2007), Aijmer (2009), and Campoy, Gea-Valor and Belles-Fortuno (2010). These works cover a wide range of issues related to using corpora in language pedagogy, e.g. corpus-based language descriptions, corpus analysis in the classroom, and learner corpus research (cf. Keck 2004).
In the opening chapter of Teaching and Language Corpora (Wichmann et al 1997), Leech (1997) observed that a convergence between teaching and language corpora was apparent. That convergence has three focuses, as noted by Leech: the indirect use of corpora in teaching (reference publishing, materials development, and language testing), the direct use of corpora in teaching (teaching about, teaching to exploit, and exploiting to teach), and further teaching-oriented corpus development (LSP corpora, L1 developmental corpora and L2 learner corpora).
In the remainder of this chapter, we will explore the potential uses of corpora in language pedagogy in terms of Leech’s three focuses of convergence. The chapter concludes by discussing the debate over the relevance of authenticity and frequency of corpora in language education as well as the future of corpus-based language pedagogy.
Indirect use of corpora
The use of corpora in language teaching and learning has been more indirect than direct. This is perhaps because the direct use of corpora in language pedagogy is restricted by a number of factors including, for example, the level and experience of learners, time constraints, curricular requirements, the knowledge and skills required of teachers for corpus analysis and pedagogical mediation, and access to resources such as computers and appropriate software tools and corpora, or a combination of these (see the concluding section for further discussion). This section explores how corpora have impacted on language pedagogy indirectly.
Corpora can be said to have revolutionized reference publishing (at least for English), be it a dictionary or a reference grammar, in such a way that dictionaries published since the 1990s have typically used corpus data in one way or another so that ‘even people who have never heard of a corpus are using the product of corpus-based investigation’ (Hunston 2002: 96).
Corpora are useful in several ways for lexicographers. The greatest advantage of using corpora in lexicography lies in their machine-readable nature, which allows dictionary makers to extract all authentic, typical examples of the usage of a lexical item from a large body of text in a few seconds. The second advantage of the corpus-based approach, which is not readily available when using citation slips, is the frequency information and quantification of collocation which a corpus can readily provide (see the section ‘Syllabus design and materials development’ for further discussion of collocation). Some dictionaries, e.g. Cobuild 1995 and Longman 1995, include such frequency information. Frequency data plays an even more important role in the so-called frequency dictionaries, which define core vocabulary to help learners of different modern languages, e.g. Davies (2005) for Spanish, Jones and Tschirner (2005) for German, Davies and de Oliveira Preto-Bay (2007) for Portuguese, Lonsdale and Bras (2009) for French, and Xiao, Rayson and McEnery (2009) for Chinese. Information of this sort is particularly useful for materials writers and language learners alike.
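The kind of frequency information described above can be illustrated with a minimal sketch. It assumes a corpus held as a list of plain-text strings and uses a deliberately crude tokenizer; the function names `tokenize` and `frequency_list` are hypothetical and are not taken from any of the dictionaries cited. Real frequency dictionaries rely on far larger corpora, lemmatization and dispersion measures, none of which are attempted here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a crude tokenizer)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def frequency_list(texts, top_n=None):
    """Return (word, raw count, frequency per million tokens) rows, most frequent first."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    total = sum(counts.values())
    # Sort by descending count, then alphabetically so that ties are stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    if top_n is not None:
        ranked = ranked[:top_n]
    return [(word, count, count * 1_000_000 / total) for word, count in ranked]


if __name__ == "__main__":
    toy_corpus = ["The cat sat on the mat.", "The dog sat."]
    for word, count, per_million in frequency_list(toy_corpus, top_n=3):
        print(f"{word:<6}{count:>4}{per_million:>14.1f}")
```

Normalizing to a rate per million tokens is one common way of making counts comparable across corpora of different sizes; with a toy corpus of nine tokens the normalized figures are of course meaningless except as an illustration of the arithmetic.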
A further benefit of using corpora is related to corpus markup and annotation. Many available corpora (e.g. the British National Corpus, BNC) are encoded with textual (e.g. register, genre and domain) and sociolinguistic (e.g. user gender and age) metadata which allows lexicographers to give a more accurate description of the usage of a lexical item. Corpus annotations such as part-of-speech tagging and word sense disambiguation also enable a more sensible grouping of words which are polysemous and homographs. Furthermore, a monitor corpus, which is constantly updated, allows lexicographers to track subtle change in the meaning and usage of a lexical item so as to keep their dictionaries up-to-date.
Last but not least, corpus evidence can complement or refute the intuitions of individual lexicographers, which are not always reliable (cf. Sinclair 1991: 112; Atkins and Levin 1995; Murison-Bowie 1996: 184) so that dictionary entries are more accurate. The above observations are in line with Hunston (2002: 96), who summarizes the changes brought about by corpora to dictionaries and other reference books in terms of five ‘emphases’: an emphasis on frequency, an emphasis on collocation and phraseology, an emphasis on variation, an emphasis on lexis in grammar, and an emphasis on authenticity.
It has been noted that non-corpus-based grammars can contain biases while corpora can help to improve grammatical descriptions (McEnery and Xiao 2005). The Longman Grammar of Spoken and Written English (Biber et al 1999) can be considered as a new milestone in reference publishing following Quirk et al’s (1985) Comprehensive Grammar of the English Language. Based entirely on the 40-million-word Longman Spoken and Written English Corpus, the book gives ‘a thorough description of English grammar, which is illustrated throughout with real corpus examples, and which gives equal attention to the ways speakers and writers actually use these linguistic resources’ (Biber et al 1999: 45). The new corpus-based grammar is unique in many different ways, for example, by taking account of register variations and exploring the differences between written and spoken grammars.
While lexical information forms, to some extent, an integral part of the grammatical description in Biber et al (1999), it is the Collins COBUILD series (Sinclair 1990, 1992; Francis et al 1997, 1998) that focuses on lexis in grammatical descriptions (the so-called ‘pattern grammar’, Hunston and Francis 2000). In fact, Sinclair et al (1990) flatly reject the distinction between lexis and grammar. While pattern grammars focusing on the connection between pattern and meaning challenge the traditional distinction between lexis and grammar, they are undoubtedly useful in language learning as they provide ‘a resource for vocabulary building in which the word is treated as part of a phrase rather than in isolation’ (Hunston 2002: 106).
For language pedagogy the most important developments in lexicography relate to the learner dictionary. Yet corpus-based learner dictionaries have quite a short history. It was only in 1987 that the Collins Cobuild English Language Dictionary (Sinclair 1987) was published as the first ‘fully corpus-based’ dictionary. The impact of this corpus-based dictionary was nevertheless such that most other publishers in the ELT market followed Collins’ lead. By 1995, the new editions of major learner dictionaries such as the Longman Dictionary of Contemporary English (3rd edition), the Oxford Advanced Learner’s Dictionary (5th edition, Hornby and Crowther 1999), and a newcomer, the Cambridge International Dictionary of English (Procter 1999) all claimed to be based on corpus evidence in one way or another.
One of the important features of corpus-based learner dictionaries is their inclusion of quantitative data extracted from a corpus. Another important feature, which is also related to frequency information, is that such dictionaries typically select the vocabulary used from a controlled set when defining the entry for a word. Producing definitions in an L2 that language learners can understand is a problem; language learners may not have a very well developed L2 vocabulary. This makes it necessary and desirable for dictionary makers to limit the vocabulary they use when defining words in a dictionary. Nowadays, most learner dictionary makers prepare a list of defining words, usually ranging from 2,000 to 2,500 words, based on the frequency information extracted from corpora as well as on the lexicographers’ experience of defining words.
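As a rough illustration of how a controlled defining vocabulary might be enforced, the sketch below flags the words in a draft definition that fall outside a given list. The function name and the tiny word list are hypothetical; the sketch assumes exact matching of lowercased word forms and does no lemmatization, so an inflected form such as ‘keeps’ would be flagged unless it is listed explicitly. It is not a description of any publisher’s actual workflow.

```python
import re


def words_outside_defining_vocabulary(definition, defining_vocabulary):
    """Return the words in a definition that are not in the controlled list,
    in order of first appearance and without duplicates."""
    tokens = re.findall(r"[a-z]+", definition.lower())
    flagged = []
    for token in tokens:
        if token not in defining_vocabulary and token not in flagged:
            flagged.append(token)
    return flagged


if __name__ == "__main__":
    # A toy stand-in for a real defining vocabulary of 2,000 to 2,500 words.
    vocabulary = {"a", "small", "animal", "that", "people", "keep", "as", "pet"}
    draft = "A small domesticated animal that people keep as a pet."
    print(words_outside_defining_vocabulary(draft, vocabulary))
```

A lexicographer would then either rephrase the definition or, where the flagged word is unavoidable, mark it as falling outside the defining vocabulary.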
As noted earlier, an important use of corpus data for lexicography is in the area of example selection so that nowadays most dictionaries of English use corpora as the source of their examples. In the case of learner dictionaries, however, there was a tradition of using examples invented by lexicographers, rather than authentic materials, in dictionary production, because they believed that foreign language learners have difficulty understanding authentic materials and therefore have to be presented with simple, rewritten examples in which the use of a given word is highlighted to show its syntactic and semantic properties. It was corpus-based learner dictionary work which challenged this received wisdom. The Cobuild (Collins Birmingham University International Language Database) project broke with tradition and used authentic data extracted from corpora to produce illustrative examples for a learner dictionary. The use of authentic examples in learner dictionaries is an area where corpus-based learner dictionaries have innovated.
Syllabus design and materials development
While corpora have been used extensively to provide more accurate descriptions of language use, a number of scholars have also used corpus data directly to look critically at existing TEFL (Teaching English as a Foreign Language) syllabuses and teaching materials. Mindt (1996), for example, finds that the use of grammatical structures in textbooks for teaching English differs considerably from the use of these structures in L1 English. He observes that one common failure of English textbooks is that they teach ‘a kind of school English which does not seem to exist outside the foreign language classroom’ (Mindt 1996: 232). As such, learners often find it difficult to communicate successfully with native speakers. A simple yet important role of corpora in language education is to provide more realistic examples of language usage that reflect the complexities and nuances of natural language.
In addition, however, corpora may provide data, especially frequency data, which may further alter what is taught. For example, on the basis of a comparison of the frequencies of modal verbs, future time expressions and conditional clauses in native English corpora and their grading in textbooks used widely in Germany, Mindt (ibid) concludes that one problem with non-corpus-based syllabuses is that the order in which those items are taught in syllabuses ‘very often does not correspond to what one might reasonably expect from corpus data of spoken and written English’, arguing that teaching syllabuses should be based on empirical evidence rather than tradition and intuition, with frequency of usage as a guide to priority for teaching (Mindt 1996: 245-246). While frequency is certainly not the only determinant of what to teach and in what order (see the concluding section for further discussion), it can indeed help to make learning more effective. For example, McCarthy, McCarten and Sandiford’s (2005-2006) innovative Touchstone book series, which is based on the Cambridge International Corpus, aims to present the vocabulary, grammar, and functions students encounter most often in real life.
Hunston (2002: 189) echoes Mindt, suggesting that ‘the experience of using corpora should lead to rather different views of syllabus design.’ The type of syllabus she discusses extensively is a ‘lexical syllabus’, originally proposed by Sinclair and Renouf (1988) and outlined fully by Willis (1990) and embodied in Willis, Willis and Davids’ (1988-1989) three-part Collins COBUILD English Course. According to Sinclair and Renouf (1988: 148), a lexical syllabus would focus on ‘(a) the commonest word forms in a language; (b) the central patterns of usage; (c) the combinations which they usually form.’
While the term may occasionally be misinterpreted to indicate a syllabus consisting solely of vocabulary items, a lexical syllabus actually covers ‘all aspects of language, differing from a conventional syllabus only in that the central concept of organization is lexis’ (Hunston 2002: 189). Sinclair (2000: 191) would say that the grammar covered in a lexical syllabus is ‘lexical grammar’, not ‘lexico-grammar’, which attempts to ‘build a grammar and lexis on an equal basis.’ Indeed, as Murison-Bowie (1996: 185) observes, ‘in using corpora in a teaching context, it is frequently difficult to distinguish what is a lexical investigation and what is a syntactic one. One leads to the other, and this can be used to advantage in a teaching/learning context.’ Sinclair and his colleagues’ proposal for a lexical syllabus is echoed by Lewis (1993, 1997a, 1997b, 2000), who provides strong support for the lexical approach to language teaching.
A focus of the lexical approach to language pedagogy is teaching collocations (i.e. habitual co-occurrences of lexical items) and the related concept of prefabricated units. There is a consensus that collocational knowledge is important for developing L1/L2 language skills (e.g. Bahns 1993; Zhang 1993; Cowie 1994; Herbst 1996; Kita and Ogata 1997; Partington 1998; Hoey 2000, 2004; Shei and Pain 2000; Sripicharn 2000; Altenberg and Granger 2001; McEnery and Wilson 2001; McAlpine and Myles 2003; Nesselhauf 2003). Hoey (2004), for example, posits that ‘learning a lexical item entails learning what it occurs with and what grammar it tends to have.’ Cowie (1994: 3168) argues that ‘native-like proficiency of a language depends crucially on knowledge of a stock of prefabricated units.’ Aston (1995) also notes that the use of prefabs can speed language processing in both comprehension and production, thus creating native-like fluency.
A powerful reason for the employment of collocations, as Partington (1998: 20) suggests, ‘lies in the way it facilitates communication processing on the part of hearer’, because ‘language consisting of a relatively high number of fixed phrases is generally more predictable than that which is not’ while ‘in real time language decoding, hearers need all the help they can get.’ As such, competence in a language undoubtedly involves collocational knowledge (cf. Herbst 1996: 389).
Collocational knowledge indicates which lexical items co-occur frequently with others and how they combine within a sentence. Such knowledge is evidently more important than individual words themselves (cf. Kita and Ogata 1997: 230) and is needed for effective sentence generation (cf. Smadja and McKeown 1990). Zhang (1993), for example, finds that more proficient L2 writers use significantly more collocations, more accurately and in more variety than less proficient learners. Collocational error is a common type of error for learners (cf. McAlpine and Myles 2003: 75). Gui and Yang (2002: 48) observe, on the basis of the one-million-word Chinese Learner English Corpus, that collocation error is one of the major error types for Chinese learners of English. Altenberg and Granger (2001) and Nesselhauf (2003) find that even advanced learners of English have considerable difficulties with collocation. One possible explanation is that learners are deficient in ‘automation of collocations’ (Kjellmer 1991). ‘As a result, learners need detailed information about common collocational patterns and idioms; fixed and semi-fixed lexical expressions and different degrees of variability; relative frequency and currency of particular patterns; and formality level’ (McAlpine and Myles 2003: 75).
Corpora are useful in this respect, not only because collocations can only reliably be measured quantitatively, but also because the KWIC (key word in context) view of corpus data exposes learners to a great deal of authentic data in a structured way. Our view is in line with Kennedy (2003), who discusses the relationship between corpus data and the nature of language learning, focusing on the teaching of collocations. The author argues that second or foreign language learning is a process of learning ‘explicit knowledge’ with awareness, which requires a great deal of exposure to language data.
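The KWIC view and the quantitative treatment of collocation mentioned above can be sketched in a few lines. The code below assumes a plain-text corpus and a crude tokenizer, and the names `kwic` and `collocates` are hypothetical. The collocate counts are raw co-occurrence frequencies within a window around the node word; concordancers typically go further and compute association measures such as mutual information or log-likelihood, which are not attempted here. The window also ignores sentence boundaries.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def kwic(tokens, node, width=4):
    """Return (left context, node, right context) for every occurrence of node."""
    lines = []
    for i, token in enumerate(tokens):
        if token == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


def collocates(tokens, node, span=2):
    """Count the words occurring within span tokens to either side of node."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts


if __name__ == "__main__":
    text = "She made a strong case. He drinks strong tea. They like strong tea too."
    tokens = tokenize(text)
    for left, node, right in kwic(tokens, "strong", width=2):
        # Right-align the left context so that the node words line up vertically.
        print(f"{left:>12}  {node}  {right}")
    print(collocates(tokens, "strong", span=1).most_common(3))
```

Aligning the node word in a central column is what allows learners to scan the contexts on either side for recurrent patterns.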
In addition to the lexical focus, corpus-based teaching materials try to demonstrate how the target language is actually used in different contexts, as exemplified in Biber et al’s (2002) Longman Student Grammar of Spoken and Written English, which pays special attention to how English is used differently in various spoken and written registers.
Another emerging area of language pedagogy which has started to use the corpus-based approach is language testing. Alderson (1996) envisaged the following possible uses of corpora in this area: test construction, compilation and selection, test presentation, response capture, test scoring, and calculation and delivery of results. He concludes that ‘[t]he potential advantages of basing our tests on real language data, of making data-based judgments about candidates’ abilities, knowledge and performance are clear enough. A crucial question is whether the possible advantages are born out in practice’ (Alderson 1996: 258-259). The concern raised in Alderson’s conclusion appears to have been addressed satisfactorily now so that nowadays computer-based tests are recognized as being comparable to paper-based tests (e.g. computer-based versus paper-based TOEFL tests).
A number of corpus-based studies of language testing have been reported. For example, Coniam (1997) demonstrated how to use word frequency data extracted from corpora to generate cloze tests automatically. Kaszubski and Wojnowska (2003) presented a corpus-driven computer program, TestBuilder, for building sentence-based ELT exercises. The program can process raw corpora of plain texts or corpora annotated with part-of-speech information, using another linked computer program that assigns the part-of-speech category to each word in the corpus automatically in real time. The annotated data is used in turn as input for test material selection. Indeed, corpora have recently been used by major providers of test services for a number of purposes:
as an archive of examination scripts;
to develop test materials;
to optimize test procedures;
to improve the quality of test marking;
to validate tests; and
to standardize tests.
For example, the University of Cambridge Local Examinations Syndicate (UCLES) is active in both corpus development (e.g. Cambridge Learner Corpus, Cambridge Corpus of Spoken English, Business English Text Corpus, and Corpus of Young Learners English Speaking Tests) and the analysis of native English corpora and learner corpora. At UCLES, native English corpora such as the British National Corpus (BNC) are used ‘to investigate collocations, authentic stems and appropriate distractors which enable item writers to base their examination tasks on real texts’ (Ball 2001: 7);2 the corpus-based approach is used to explore ‘the distinguishing features in the writing performance of EFL/ESL learners or users taking the Cambridge English examinations’ and how to incorporate these into ‘a single scale of bands, that is, a common scale, describing different levels of L2 writing proficiency’ (Hawkey 2001: 9); corpora are also used for the purpose of speaking assessment (Ball and Wilson 2002; Taylor 2003) and to develop domain-specific (e.g. business English) wordlists for use in test materials (Ball 2002; Horner and Strutt 2004).
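To make the idea of frequency-driven test generation more concrete, the following sketch blanks out words in a passage on the basis of corpus frequency. It is only a loose illustration inspired by the general idea reported for Coniam (1997), not a reconstruction of his procedure or of TestBuilder: the function name `make_cloze`, the parameters `max_freq` and `min_gap`, and the toy frequency table are all assumptions introduced here. Words missing from the frequency table are treated as having a frequency of zero and are therefore eligible for deletion.

```python
import re


def make_cloze(passage, frequencies, max_freq=5, min_gap=3):
    """Replace words whose corpus frequency is at most max_freq with numbered
    gaps, leaving at least min_gap intact words between two gaps.
    Returns the gapped passage and the list of deleted words (the answer key)."""
    output, answers = [], []
    words_since_gap = min_gap  # allows a gap at the first eligible word
    # Alternate between runs of letters (words) and everything else.
    for match in re.finditer(r"([A-Za-z]+)|([^A-Za-z]+)", passage):
        word = match.group(1)
        if word is None:
            output.append(match.group(2))  # punctuation and spacing are kept
            continue
        frequency = frequencies.get(word.lower(), 0)  # unknown words count as rare
        if frequency <= max_freq and words_since_gap >= min_gap:
            answers.append(word)
            output.append(f"({len(answers)}) ____")
            words_since_gap = 0
        else:
            output.append(word)
            words_since_gap += 1
    return "".join(output), answers


if __name__ == "__main__":
    # Toy frequency table; a real one would be derived from a reference corpus.
    toy_frequencies = {"the": 1000, "on": 500, "sat": 40, "cat": 30,
                       "mat": 3, "quietly": 2}
    gapped, key = make_cloze("The cat sat quietly on the mat.",
                             toy_frequencies, max_freq=5, min_gap=2)
    print(gapped)
    print(key)
```

Raising `max_freq` would target more common vocabulary, which is one way in which the difficulty of a test might in principle be tuned to the level of the candidates.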
For learners to benefit from the use of corpora, language teachers must first of all be equipped with a sound knowledge of the corpus-based approach. It is unsurprising then to discover that corpora have been used in training language teachers (e.g. Allan 1999, 2002; Conrad 1999; Seidlhofer 2000, 2002; O’Keeffe and Farr 2003). Allan (1999), for example, demonstrates how to use corpus data to raise the language awareness of English teachers in Hong Kong secondary schools. Conrad (1999) presents a corpus-based study of linking adverbials (e.g. therefore and in other words), on the basis of which she suggests that it is important for a language teacher to do more than use classroom concordancing and lexical or lexico-grammatical analyses if language teaching is to take full advantage of the corpus-based approach. Conrad’s concern with teacher education is echoed by O’Keeffe and Farr (2003), who argue that corpus linguistics should be included in initial language teacher education so as to enhance teachers’ research skills and language awareness.
Direct use of corpora
While indirect uses such as syllabus design and materials development are closely associated with what to teach, corpora have also provided valuable insights into how to teach. Of Leech’s (1997) three focuses, direct uses of corpora include ‘teaching about’, ‘teaching to exploit’, and ‘exploiting to teach’, with the latter two relating to how to use. Given a number of restricting factors as noted in the previous section, direct uses have so far been confined largely to learning at more advanced levels, for example, in tertiary education, whereas in general English language teaching (not to mention other foreign languages), especially in secondary education (see Braun 2007 for a rare example of an empirical study of using corpora in secondary education), the direct use of corpora is ‘still conspicuously absent’ (Kaltenböck and Mehlmauer-Larcher 2005).
‘Teaching about’ means teaching corpus linguistics as an academic subject like other sub-disciplines of linguistics such as syntax and pragmatics. Corpus linguistics has now found its way into the curricula for linguistics and language related degree programmes at both postgraduate and undergraduate levels in many universities around the world. ‘Teaching to exploit’ means providing students with ‘hands-on’ know-how, as emphasized in McEnery, Xiao and Tono (2006), so that they can exploit corpora for their own purposes. Once the student has acquired the necessary knowledge and techniques of corpus-based language study, the learning activity may become student centred. ‘Exploiting to teach’ means using a corpus-based approach to teaching language and linguistics courses (e.g. sociolinguistics and discourse analysis), which would otherwise be taught using non-corpus-based methods.
If the focuses of ‘teaching about’ and ‘exploiting to teach’ are viewed as being associated typically with students of linguistics and language programmes, ‘teaching to exploit’ relates to students of all subjects which involve language study and learning, who are expected to benefit from the so-called data-driven learning (DDL) or ‘discovery learning’.
The issue of how to use corpora in the language classroom has been discussed extensively in the literature. With the corpus-based approach to language pedagogy, the traditional ‘three P’s’ (Presentation – Practice – Production) approach to teaching may not be entirely suitable. Instead, the more exploratory approach of ‘three I’s’ (Illustration – Interaction – Induction) may be more appropriate, where ‘illustration’ means looking at real data, ‘interaction’ means discussing and sharing opinions and observations, and ‘induction’ means making one’s own rule for a particular feature, which ‘will be refined and honed as more and more data is encountered’ (see Carter and McCarthy 1995: 155). This progressive induction approach is what Murison-Bowie (1996: 191) would call the interlanguage approach: namely, partial and incomplete generalizations are drawn from limited data as a stage on the way towards a fully satisfactory rule. While the ‘three I’s’ approach was originally proposed by Carter and McCarthy (1995) to teach spoken grammar, it may also apply to language education as a whole, in our view.
It is clear that the exploratory teaching approach focusing on ‘three I’s’ is in line with Johns’ (1991) concept of ‘data-driven learning (DDL)’. Johns was perhaps among the first to realize the potential of corpora for language learners (e.g. Higgins and Johns 1984). In his opinion, ‘research is too serious to be left to the researchers’ (Johns 1991: 2). As such, he argues that the language learner should be encouraged to become ‘a research worker whose learning needs to be driven by access to linguistic data’ (ibid). Johns’ web-based Kibbitzer (www.eisu2.bham.ac.uk/johnstf/timeap3.htm) gives some very good examples of data-driven learning.
Data-driven learning can be either teacher-directed or learner-led (i.e. discovery learning) to suit the needs of learners at different levels, but it is basically learner-centred. This autonomous learning process ‘gives the student the realistic expectation of breaking new ground as a “researcher”, doing something which is a unique and individual contribution’ (Leech 1997: 10). It is important to note, however, that the key to successful data-driven learning, even if it is student-centred, is the appropriate level of teacher guidance or pedagogical mediation depending on the learners’ age, experience, and proficiency level, because ‘a corpus is not a simple object, and it is just as easy to derive nonsensical conclusions from the evidence as insightful ones’ (Sinclair 2004: 2). In this sense, it is even more important for language teachers to be equipped with the necessary training in corpus analysis.
Johns (1991) identifies three stages of inductive reasoning with corpora in the DDL approach: observation (of concordanced evidence), classification (of salient features) and generalization (of rules). The three stages roughly correspond to Carter and McCarthy’s (1995) ‘three I’s’. The DDL approach is fundamentally different from the ‘three P’s’ approach in that the former involves bottom-up induction whereas the latter involves top-down deduction. The direct use of corpora and concordancing in the language classroom has been discussed extensively in the literature (e.g. Tribble 1991, 1997a, 1997b, 2000, 2003; Tribble and Jones 1990, 1997; Flowerdew 1993; Karpati 1995; Kettemann 1995, 1996; Wichmann 1995; Woolls 1998; Aston 2001; Osborne 2001; Braun 2007), covering a wide range of issues including, for example, underlying theories, methods and techniques, and problems and solutions.