What corpora can offer in language teaching and learning

Teaching-oriented corpora

Teaching-oriented corpora are particularly useful in teaching languages for specific purposes (LSP corpora) and in research on L1 (developmental corpora) and L2 (learner corpora) language acquisition. Such corpora can be used directly or indirectly in language pedagogy as discussed in the previous sections.

Languages for specific purposes and professional communication

In addition to teaching English as a second or foreign language in general, a great deal of attention has been paid to domain-specific language use and professional communication (e.g. English for specific purposes and English for academic purposes). For example, Thurstun and Candlin (1997, 1998) explore the use of concordancing in teaching writing and vocabulary in academic English. Hyland (1999) compares metadiscourse features in two specific genres, introductory course books and research articles, on the basis of a corpus consisting of extracts from 21 university textbooks for different disciplines and a similar corpus of research articles.
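As a rough illustration of what concordancing involves, the following minimal Python sketch prints keyword-in-context (KWIC) lines for a search word. The function name `kwic`, the `width` parameter and the sample sentences are invented for illustration; they are not taken from the studies cited above, and dedicated concordancers offer far more than this.

```python
import re


def kwic(text, keyword, width=30):
    """Return keyword-in-context lines for every occurrence of `keyword`."""
    lines = []
    # Match the keyword as a whole word, ignoring case.
    pattern = r"\b" + re.escape(keyword) + r"\b"
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines


# Invented sample of academic-style sentences (not real corpus data).
sample = (
    "The results suggest that the effect is robust. "
    "These findings suggest a need for further study. "
    "We suggest that teachers use concordances in class."
)

for line in kwic(sample, "suggest"):
    print(line)
```

Aligning the search word in a central column is what allows learners to scan the left and right contexts for recurrent patterns, such as the words that typically follow a reporting verb in academic writing.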

Likewise, Upton and Connor (2001) undertake a ‘move analysis’ in business English using a business learner corpus. The authors approach the cultural aspect of professional communication by comparing the ‘politeness strategies’ used by learners from different cultural backgrounds. Thompson and Tribble (2001) examine citation practices in academic text. Koester (2002) argues, on the basis of an analysis of the performance of speech acts in workshop conversations, for a discourse approach to teaching communicative functions in spoken English. Yang and Allison (2003) study the organizational structure in research articles in applied linguistics. Carter and McCarthy (2004) explore, on the basis of the Cambridge and Nottingham Corpus of Discourse in English (CANCODE), a range of social contexts in which creative uses of language are manifested. Hinkel (2004) compares the use of tense, aspect and the passive in L1 and L2 academic texts.

Xiao (2003) reviews a number of case studies using specialized multilingual corpora to teach domain-specific translation. Parallel concordancing is not only useful in translation teaching; it can also aid so-called ‘reciprocal learning’ (Johns 1997), where two language learners from different L1 backgrounds are paired to help each other learn the other’s language. Studies such as these demonstrate that LSP corpora are particularly useful in teaching language for specific purposes and professional communication.

Learner corpora and interlanguage analysis

The creation and use of learner corpora in language pedagogy and interlanguage research has been welcomed as one of the most exciting recent developments in corpus-based language studies. If native speaker corpora of the target language provide a top-down approach to using corpora in language pedagogy, learner corpora provide a bottom-up approach to language teaching (Osborne 2002).

A learner corpus, as opposed to a ‘developmental corpus’ composed of data produced by children acquiring their mother tongue (L1), comprises written or spoken data produced by language learners who are acquiring a second or foreign language. Data of this type has been particularly useful in language pedagogy and second language acquisition (SLA) research, as demonstrated by the fruitful learner corpus studies published over the past decade (see Pravec 2002; Keck 2004; and Myles 2005 for recent reviews). SLA research is primarily concerned with ‘the mental representations and developmental processes which shape and constrain second language (L2) productions’ (Myles 2005: 374).

Language acquisition occurs in the mind of the learner, which cannot be observed directly and must be studied from a psychological perspective. Nevertheless, if learner performance data is shaped and constrained by such a mental process, it at least provides indirect, observable, and empirical evidence for the language acquisition process. Note that using product as evidence for process is not necessarily less reliable; sometimes it is the only practical way of finding out about process. Stubbs (2001) draws a parallel between corpora in corpus linguistics and rocks in geology, ‘which both assume a relation between process and product. By and large, the processes are invisible, and must be inferred from the products.’ Like geologists who study rocks because they are interested in geological processes to which they do not have direct access, SLA researchers can analyze learner performance data to infer the inaccessible mental process of second language acquisition.

Learner corpora can also serve as an empirical basis for testing hypotheses generated using the psycholinguistic approach, and they allow findings previously based on limited data from a small number of informants to be generalized. Additionally, learner corpora have widened the scope of SLA research so that, for example, interlanguage research nowadays treats learner performance data in its own right rather than as decontextualised errors in traditional error analysis (cf. Granger 1998: 6).

At the pre-conference workshop on learner corpora affiliated to the second International Symposium of Corpus Linguistics held at the University of Lancaster, the workshop organizers Tono and Meunier observed that learner corpora are no longer in their infancy but are going through their nominal teenage years – they are full of promise but not yet fully developed.

In language pedagogy, the implications of learner corpora have been explored for curriculum design, materials development and teaching methodology (cf. Keck 2004: 99). The interface between L1 and L2 materials has been explored. Meunier (2002), for example, argues that frequency information obtained from native speaker corpora alone is not sufficient to inform curriculum and materials design. Rather, ‘it is important to strike a balance between frequency, difficulty and pedagogical relevance. That is exactly where learner corpus research comes into play to help weigh the importance of each of these’ (Meunier 2002: 123). Meunier also advocates the use of learner data in the classroom, suggesting that exercises such as comparing learner and native speaker data and analyzing errors in learner language will help students to notice gaps between their interlanguage and the language they are learning.

Interlanguage studies based on learner corpora which have been undertaken so far focus on what Granger (2002) calls ‘Contrastive Interlanguage Analysis (CIA)’, which compares learner data and native speaker data, or language produced by learners from different L1 backgrounds. The first type of comparison typically aims to evaluate the level of under- or overuse of particular linguistic features in learner language while the second type aims to uncover L1 interference or transfer. Corpus data produced by learners from different L1 backgrounds can also be compared against one another with the aim of uncovering common features of the second language acquisition process by discarding L1-specific peculiarities. In addition to CIA, learner corpora have also been used to investigate the order of acquisition of particular morphemes. Readers can refer to Granger et al. (2002) for recent work in the use of learner corpora, and read Granger (2003) for a more general discussion of the applications of learner corpora such as the International Corpus of Learner English (ICLE).
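The first type of comparison can be sketched in a few lines of Python. The sketch below normalizes the frequency of an item per 1,000 tokens in a learner sample and a native speaker sample, labels the difference as overuse or underuse, and attaches a log-likelihood (G2) score. The choice of log-likelihood is an assumption made here for illustration, since the studies cited above are not tied to any one statistic; the function names and the two toy samples are likewise invented and are far too small to support real conclusions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def log_likelihood(a, b, n1, n2):
    """G2 score for an item seen a times in n1 tokens and b times in n2 tokens."""
    total = a + b
    if total == 0:
        return 0.0
    expected1 = n1 * total / (n1 + n2)  # expected count in the first sample
    expected2 = n2 * total / (n1 + n2)  # expected count in the second sample
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / expected1)
    if b > 0:
        g2 += b * math.log(b / expected2)
    return 2 * g2


def compare(learner_text, native_text, items):
    """Compare normalized frequencies of `items` (assumes both texts are non-empty)."""
    learner, native = tokenize(learner_text), tokenize(native_text)
    learner_counts, native_counts = Counter(learner), Counter(native)
    n1, n2 = len(learner), len(native)
    results = []
    for item in items:
        a, b = learner_counts[item], native_counts[item]
        learner_rate = 1000 * a / n1  # occurrences per 1,000 tokens
        native_rate = 1000 * b / n2
        if learner_rate > native_rate:
            direction = "overuse"
        elif learner_rate < native_rate:
            direction = "underuse"
        else:
            direction = "similar"
        results.append({
            "item": item,
            "learner_per_1000": learner_rate,
            "native_per_1000": native_rate,
            "direction": direction,
            "g2": log_likelihood(a, b, n1, n2),
        })
    return results


# Invented toy samples, not real learner or native speaker data.
learner_text = ("I think that it is very very important because "
                "I think people very like it")
native_text = ("it is arguably important because people tend to "
               "like it and value it highly")

results = compare(learner_text, native_text, ["very", "it"])
for row in results:
    print(row)
```

Normalizing before comparing matters because learner and native speaker corpora are rarely the same size; raw counts alone would exaggerate the features of whichever corpus is larger.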

In addition to SLA research, learner corpora can also be used directly in classroom teaching. For example, Seidlhofer (2002) and Mukherjee and Rohrbach (2006) demonstrate how a ‘local learner corpus’ containing students’ own writings can be used directly for learning, by addressing students’ questions about their own or their classmates’ writings, or by analyzing and correcting errors in such familiar writings.

Before we close the discussion of using corpora in language pedagogy, it is appropriate to address some objections to the use of corpora in language learning and teaching. While frequency and authenticity are often considered two of the most important advantages of using corpora, they are also the locus of criticism from language pedagogy researchers. For example, Cook (1998: 61) argues that corpus data impoverishes language learning by giving undue prominence to what is simply frequent at the expense of rarer but more effective or salient expressions. Widdowson (1990, 2000) argues that corpus data is authentic only in a very limited sense in that it is de-contextualized (i.e. traces of texts rather than discourse) and must be re-contextualized in language teaching. On the other hand, it can be argued that:

on the contrary, using corpus data not only increases the chances of learners being confronted with relatively infrequent instances of language use, but also of their being able to see in what way such uses are atypical, in what contexts they do appear, and how they fit in with the pattern of more prototypical uses. (Osborne 2001: 486)

This view is echoed by Goethals (2003: 424), who argues that ‘frequency ranking will be a parameter for sequencing and grading learning materials’ because ‘frequency is a measure of probability of usefulness’ and ‘high-frequency words constitute a core vocabulary that is useful above the incidental choice of text of one teacher or textbook author.’ Hunston (2002: 194-195) observes that ‘items which are important though infrequent seem to be those that echo texts which have a high cultural value,’ though in many cases ‘cultural salience is not clearly at odds with frequency.’

While frequency information is readily available from corpora, no corpus linguist has ever argued that the most frequent is most important. On the contrary, Kennedy (1998: 290) argues that frequency ‘should be only one of the criteria used to influence instruction’ and that ‘the facts about language and language use which emerge from corpus analyses should never be allowed to become a burden for pedagogy’. As such, raw frequency data is often adjusted for use in a syllabus, as reported in Renouf (1987: 168).

It would be inappropriate, therefore, for language teachers, syllabus designers, and materials writers to ignore ‘compelling frequency evidence already available’, as pointed out by Leech (1997: 16), who argues that:

Whatever the imperfections of the simple equation ‘most frequent’ = ‘most important to learn’, it is difficult to deny that frequency information becoming available from corpora has an important empirical input to language learning materials.

Kaltenböck and Mehlmauer-Larcher (2005: 78) downplay the role of frequency in language learning, arguing that ‘what is frequent in language will be picked up by learners automatically, precisely because it is frequent, and therefore does not have to be consciously learned.’ This is not true, however. Determiners such as ‘a’ and ‘the’ are certainly very frequent in English, yet they are difficult for Chinese learners of English because their mother tongue does not have such grammatical morphemes and does not maintain a count-mass noun distinction.

Clearly, frequency is not ‘automatically pedagogically useful’ (Kaltenböck and Mehlmauer-Larcher 2005: 78); decisions relating to teaching must also take account of overall teaching objectives, learners’ concrete situations, cognitive salience, learnability, generative value, and of course teachers’ intuitions (cf. Kaltenböck and Mehlmauer-Larcher 2005: 78). However, frequency can at least help syllabus designers, materials writers and teachers alike to make better-informed and more carefully motivated decisions (cf. Gavioli and Aston 2001: 239).

If we leave objections to frequency data to one side, Widdowson (1990, 2000) also questions the use of authentic texts in language teaching. In his opinion, authenticity of language in the classroom is ‘an illusion’ (1990: 44) because even though corpus data may be authentic in one sense, its authenticity of purpose is destroyed by its use with an unintended audience of language learners (see Murison-Bowie 1996: 189). Widdowson (2003: 93) makes a distinction between ‘genuineness’ and ‘authenticity’, which are claimed to be the features of text as a product and discourse as a process respectively: corpora are genuine in that they comprise attested language use, but they are not authentic for language teaching because they have been deprived of their contexts (as opposed to co-texts).

We will not engage in the debate here, but would like to draw readers’ attention to Stubbs’ (2001) metaphor of product versus process as cited in the section ‘Learner corpora and interlanguage analysis’. The implication of Widdowson’s argument is that only language produced for imaginary situations in the classroom is ‘authentic’. Even if we do follow Widdowson’s genuineness-authenticity distinction, it is not clear why such imaginary situations are authentic because authenticity, as opposed to genuineness, would mean real communicative context. Situations conjured up for classroom teaching obviously do not take place in real communicative contexts; how, then, can they be authentic, if we choose to keep this distinction? When students learn and practise a shopping ‘discourse’, they are by no means actually shopping! Furthermore, as argued by Fox (1987), invented examples often do not reflect nuances of usage. That is perhaps why, as Mindt (1996: 232) observes, students who have been taught ‘school English’ cannot readily cope with English used by native speakers in real life. As such, Wichmann (1997: xvi) argues that in language teaching, ‘the preference for “authentic” texts requires both learners and teachers to cope with language which the textbooks do not predict.’

The discussions in the previous sections suggest that corpora appear to have played a more important role in helping to decide what to teach (indirect uses) than how to teach (direct uses). While indirect uses of corpora seem to be well established, direct uses of corpora in teaching are largely confined to advanced levels such as higher education. Corpus-based learning activities are nearly absent from general TEFL classes at lower levels such as secondary education. Of the various causes for this absence mentioned earlier, perhaps the most important are access to appropriate corpus resources and the necessary training of teachers, which we view as priority future tasks for corpus linguists if corpora are to be popularized in more general language teaching contexts.

While there is a wide range of existing corpora that are publicly available (see Xiao 2008 for a recent survey), the majority of those resources have been developed ‘as tools for linguistic research and not with pedagogical goals in mind’ (Braun 2007). As Cook (1998: 57) suggests, ‘the leap from linguistics to pedagogy is […] far from straightforward.’ To bridge the gap between corpora and language pedagogy, the first step would involve creating corpora that are pedagogically motivated, in both design and content, to meet pedagogical needs and curricular requirements so that corpus-based learning activities become an integral part, rather than an additional option, of the overall language curriculum. Such pedagogically motivated corpora ‘should not only be more coherent than traditional corpora; they should, as far as possible, also be complementary to school curricula, to facilitate both the contextualisation process and the practical problems of integration’ (Braun 2007: 310). The design of such corpus-based learning activities must also take account of learners’ age, experience and level as well as the integration of the activities into the overall curriculum.

Given the situation of learners (e.g. their age, level of language competence, level of expert knowledge, and attitude towards learning autonomy) in general language education in relation to advanced learners in tertiary education, even such pedagogically motivated corpus-based learning activities must be mediated by teachers. This in turn raises the issue of the current state of teachers’ knowledge and skills of corpus analysis and pedagogical mediation, which is another practical problem that has prevented direct use of corpora in language pedagogy. As Kaltenböck and Mehlmauer-Larcher (2005: 81) argue, ‘mediation by the teacher is a necessary prerequisite for successful application of computer corpora in language teaching and should therefore be given sufficient attention in teacher education courses’ (cf. also O’Keeffe and Farr 2003). However, as the integration of corpus studies in language teacher training is only a quite recent phenomenon (cf. Chambers 2007), ‘it will therefore at least take more time, and perhaps a new generation of teachers, for corpora to find their way into the language classroom’ (Braun 2007: 308).

In conclusion, if these two tasks are accomplished, it is our view that corpora will not only revolutionize the teaching of subjects such as grammar in the 21st century, as Conrad (2000) has predicted, but will also fundamentally change the ways we approach language education, including both what is taught and how it is taught. As Gavioli and Aston (2001) argue, corpora should not only be viewed as resources which help teachers to decide what to teach; they should also be viewed as resources from which learners may learn directly.

1. In this chapter, we will not be concerned with the debate over the use of corpus data in linguistic analysis and language education. Readers interested in the pros and cons of using corpus data should refer to Sinclair (1991), Widdowson (1991, 2000), de Beaugrande (2001) and Stubbs (2001). While Widdowson, Sinclair and de Beaugrande characterize two extreme attitudes towards corpora, there are many milder (positive or negative) reactions to corpus data between the two extremes. Readers can refer to Nelson (2000) for a good review.

2. ‘Stem’ is a technical term in language testing which refers to ‘the top part of a multiple-choice item, usually a statement or question’ (Fulcher and Davidson 2007: 53). As a collection of attested language data, a corpus is a good resource for test writers as it can provide abundant authentic stems.

