A test, questionnaire, observation sheet, coding system for think aloud data, etc. may be reliable without being valid. An instrument is 'reliable' simply if it consistently gives the same results every time the same cases are remeasured. But the measurement could still be consistently 'wrong' (biased). Two markers may agree on the scores they give a student for fluency in an oral test, but they may both have misunderstood the instructions on how fluency is to be scored. Checking on reliability is basically detecting the amount of random error in one's measurement of variables; checking on validity is trying to detect constant error.
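The contrast between random and constant error can be made concrete with a toy sketch in Python. All the figures are invented, and the 'true' scores are an assumption made purely for illustration (in real measurement they are never known directly). One marker is perfectly consistent but always two points too generous; the other is right on average but inconsistent between occasions.

```python
# Invented 'true' fluency scores (0-10) for five students. In practice these
# are unknown; they are assumed here only to make the two kinds of error visible.
true_scores = [4, 6, 5, 8, 7]

# Marker A: reliable but biased. The same constant error (+2) on both occasions.
biased_first = [t + 2 for t in true_scores]
biased_second = [t + 2 for t in true_scores]

# Marker B: unreliable. Errors change sign between the two occasions.
noisy_first = [5, 5, 6, 7, 8]
noisy_second = [3, 7, 4, 9, 6]


def mean_abs_diff(a, b):
    """Average size of the difference between two sets of scores."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def mean_diff(a, b):
    """Average signed difference (a minus b): picks up constant error."""
    return sum(x - y for x, y in zip(a, b)) / len(a)


# Reliability check: how far apart are the two occasions?
retest_gap_biased = mean_abs_diff(biased_first, biased_second)
retest_gap_noisy = mean_abs_diff(noisy_first, noisy_second)

# Validity check: how far, on average, is each marker from the assumed truth?
constant_error_biased = mean_diff(biased_first, true_scores)
constant_error_noisy = mean_diff(noisy_first + noisy_second, true_scores * 2)

print(retest_gap_biased, retest_gap_noisy)
print(constant_error_biased, constant_error_noisy)
```

Marker A passes the reliability check and fails the validity one; marker B does the reverse.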
An instrument or other data gathering and measuring procedure is valid only if it measures what it is supposed to measure, but one can only ask about that if it is in the first place reasonably reliable. There is no point checking up on what variable an instrument measures or records unless it is consistently measuring something. So validity checks should in theory follow reliability checks, and be carried out only for those instruments that do have satisfactory reliability. In practice, however, as we see below, one type of validity, content validity, is often established before reliability is checked.
It should be noted that, in practice, there can be as much argument in a research project about what the appropriate variables are that should be quantified, and how to define them properly, as about the precise validity issues of whether this or that instrument measures the variables correctly.
The notion of validity can also be applied more widely to entire research projects. It then includes consideration of all the things that make for good and bad sampling of cases (external validity), and control of unwanted variables that may interfere with results, design, and so forth (internal validity), as well as validity of the measuring instruments themselves. However, here we focus just on the validity of the measuring and recording instruments - fundamental both to research and to pedagogical uses of such means of quantifying data, and a topic often more written about in the literature on testing than in that on research methods.
Traditionally four 'types of validity' of instruments are often recognised: really four ways of checking if an instrument does measure what it is supposed to.
Some add further types, especially these two, which I shall briefly describe but then leave aside:
'Face validity'. This is really only concerned with how the instrument appears to subjects and users - i.e. whether to the ordinary person it looks as if it measures what it is supposed to - not the essential matter of what it really measures. Nevertheless in practical research it may be necessary to ensure face validity as well as 'actual' validity. Especially in pedagogical contexts teachers and learners will expect a grammar test to 'look like a grammar test' to them.
'Consequential validity'. This refers to actual and potential outcomes of use of an instrument, rather than the instrument itself. It is a controversial concept, in that it is arguable where and when the social and political ramifications of test use should be addressed. For example, an instrument might be said to have good consequential validity if it prompted beneficial backwash/washback: i.e. if teachers started teaching students specifically to prepare them for the test, and this actually had a beneficial effect on teaching. However, consequences of use of an instrument may often be seen as not really due to the instrument itself but to how people use it. Still, it may need to be considered where some obvious consequences occur. This will tend to be for tests used pedagogically rather than for research.
Returning now to the traditional core ideas about validity... In validity discussion it is necessary to be clear about where one is talking about the variable a particular instrument measures, and where one is talking about the targeted variable - the 'thing' or conceptual variable one is trying to measure, since invalidity arises precisely where these two variables are not the same. A term often used to distinguish the latter (the targeted variable) is 'construct' (or sometimes 'trait'). So, for example, many applied linguists have a mental conception of a variable they call 'communicative competence': that can be termed a construct which they define in various ways (e.g. Canale and Swain in AL1, Bachman, etc.) and then set out to measure with this or that instrument (each of which is usually made up of several sub-instruments). Argument ensues not only about the definition of communicative competence in the first place (i.e. theoretical discussion about how to define the construct) but also about whether this or that instrument measures it validly or not.
The four key aspects of validity one can check on are: content or descriptive validity, construct validity, concurrent validity and predictive validity (the last two also being referred to collectively as criterion validity... but nothing to do with 'criterion referencing' which we met earlier). Construct validity is misleadingly named, since, as just said in the last paragraph, the notion of a 'construct' is present in all validity work, not just that where one is assessing 'construct validity'. Note also that in practice not all instruments can be checked for all four types of validity: it depends on the nature of the instrument and other factors - criterion validity is particularly restricted to instances where a criterion instrument exists (see below).
An analogy may help introduce these types of validity. Suppose you are into home cookery, and want to buy an electronic weighing scale to measure your ingredients properly. You go to some shop where there are several on display. You see one you like: how do you evaluate it before buying it?
1) You might just examine it and see if it seems well made. Is it made of poor quality, easy to break plastic? Does it have a big enough bowl for the amount of flour you might want to weigh? You might turn it on and see if the display showing the weight seems likely to be misread or not, and so on.
2) You might try it by putting something on it, like a pound coin from your pocket, and seeing what weight it records. You take the coin off, put it on again, and see whether it comes up with exactly the same weight again, as you would expect.
3) You might try it by putting your coin also on the best (most expensive) scales in the shop and seeing if the scales you have your eye on comes up with the same weight.
4) You might try a few experiments, like seeing if two pound coins are recorded on your scales as weighing twice the amount of one coin, as you might expect, or putting a 500gm bag of sugar on it to see what the scales show.
One of those is a check on reliability not validity: which one?
The other three are in effect checks on construct, concurrent and content validity: which is which do you think?
You may say that when buying weighing scales you might actually be more concerned with other things than the above, e.g.
5) You might judge it by whether it looks cool and the colour would match the other accessories in your kitchen
6) You might consider if it seems too complicated and technical for your everyday cookery needs and choose the scales most like one you have used before for ease
7) You might be on a budget and be concerned mainly with price
8) You might choose one 'Because my friend Lesley has one like it and says it always works well'.
Some of these too have counterparts in measurement for research, and might be regarded as matters of validity in a looser sense. It is often a good idea, for example, to use tests, elicitation procedures, etc. that you and the subjects are familiar with (6) and do not introduce complications that may distract them. If the procedure or whatever has to be unfamiliar then of course you give them ample practice first on items that you don't score. Ease of use of an instrument should not be allowed to lead to invalidity, however, as it may if one simply does not bother to measure things because they are too 'difficult' to do. For instance, an all-round measure of language proficiency should measure oral ability and extensive reading ability, but of course these are difficult and time consuming to include compared with grammar and intensive reading ability.
Price (7) also has a counterpart in research, where the cost is usually in time. It may not be practically possible to measure your subjects' English proficiency directly with a recognised test like FCE that you administer yourself, which is most likely to be valid, so you have to rely on something less - such as a quick cloze test, or the school record of English exam scores last May....
History of prior use of an instrument (8) is interesting, as it is often relied on by researchers. E.g. 'If Gardner and his students used this questionnaire in published research in 1979, 1985 and 1991, it's good enough for me'. This may be OK, but often if you look closely you may find that the validity (and reliability) was not actually checked all that thoroughly in the first place, and what might be a duff instrument has become accepted simply through the number of times it has been used! Also, even if it was properly validated in the first place, if you are going to use it in a different country/culture from where it has been used before, or at a different proficiency level, or with non-natives instead of native speakers, and so on, it probably needs revalidating.
For a fuller account of validity in general see my book ch21-23. That focusses more on the sources of invalidity, and hence what precautions one can take when designing and using one's instruments in the first place. The account here focusses more on aspects where statistics can be used to help check on an instrument later (e.g. in a pilot study).
Content validation involves hardly any statistics at all.
Criterion validation procedures mostly use the same "agreement" statistical measures as the basic ones seen in reliability study - especially correlation.
Construct validation may also use these, but can involve in the end all kinds of research methods and any of the full panoply of statistical methods of analysis thereof. Hence it is not possible here to attempt a systematic overview of all statistical measures used in validity study. Instead I offer a few typical examples, to give something of the flavour of the statistics that may commonly be met.
1. Content or descriptive validity, and item analysis again
This is a bit like (1) in the Introduction. Many instruments involve stimulus texts, words, items etc. which are supposed to represent some variable and are somehow used to elicit or categorise responses from subjects. For example, a set of agree/disagree items such as 'When I speak in the English class I feel self-conscious' is used collectively to measure the variable 'English class anxiety'; or a set of twenty pairs of opposites and twenty pairs of synonyms is used to represent values of an IV 'type of meaning relation' in a psycholinguistic experiment to see if similarity of meaning is identified faster than oppositeness of meaning; or a single questionnaire question is used to measure the age of subjects.
The main practical way to check on the validity of the content of an instrument is to get some 'experts' to judge whether the content of items, questions etc. reflects the intended variable or not. In the case of linguistic elements like opposites and synonyms, of course, there are books of these that one could refer to, as there is a fairly finite and definable population of these in any language. Or one asks a linguist. Anyway, linguistic units, structures etc. form definable populations which people can have in mind as they judge a particular set of items for an instrument. But the instrument has to have a declared definition of the target construct before anyone can judge these things: if an expert is judging a grammar test, he/she needs to know clearly how the construct 'grammar' is delimited for the purpose of this research (e.g. does it include or exclude word formation, linkage between sentences...?).
Would a test of achievement or a test of proficiency be easier to assess the content validity of? Why?
Who are the obvious experts you have access to, whom you could ask to check the validity of your instruments?
In the case of class anxiety it is not so straightforward. There is not so obvious a population of 'ways in which class anxiety shows itself' that one can have in mind as one judges an instrument. Rather you have to rely on people's mental conception of what anxiety in the English class is, and their judgment of what specific statements would reflect possessing or not possessing it. An appropriate expert here might be a teacher, though one might feel that learners similar to those one intends to measure later with the instrument might also be able to suggest whether your items really are related to class anxiety as they experience it, in their cultural context etc.
People similar to the intended subjects can sometimes also be brought in when making an instrument in the first place, incidentally: if you were devising a set of items to reflect English class anxiety from scratch, rather than using an existing set, you could elicit statements from some learners about feelings they have during English classes and select from those a set to try out as an instrument, or get learners to comment on your items in a pilot study.
The only problem with using people who resemble future subjects, rather than 'experts', as judges of validity is that if some complex theoretical concepts are involved in the definition of the targeted variable, you cannot expect people who resemble subjects to judge directly if the items relate to it, though they may still offer comments that help you judge if the items are in fact relevant. Even what might seem like the everyday concept of 'anxiety' has behind it a (mainly psychological) literature which distinguishes, for example, 'trait anxiety' (which is a permanent feature of a person's personality) from 'situational anxiety', and so on, which one cannot expect subjects to take into consideration. Other variables may be even more arcane from the point of view of ordinary subjects, and even of some applied linguists (e.g. field dependence, integrative orientation, metacognitive strategic competence, etc.).
Where relevant, the content of an instrument should pay due attention to/be compatible with any relevant theory - whether linguistic, psycholinguistic or whatever. If your instrument is a test designed to measure whether people have acquired the INFL constituent in English or not, then you and the judges need a grasp of the relevant parts of Chomskian grammar to see if indeed the items relate to all and only aspects of grammar that belong in INFL. If you are using a set of categories to classify strategies you have identified in think aloud data, then in establishing a suitable set of categories to use you need to be aware of the relevant discussion in the literature of how to define strategies, what classifications have been used before, etc. This is where the expert judgment comes in.
Strictly there is a difference between what one should be judging as suitable content for an instrument with absolute value as against one with relative interpretation. Take a test of knowledge of English phrasal verbs. To check its content validity first of course one has to be clear how 'phrasal verb' is being defined - in the narrow linguist's sense, distinct from prepositional verbs like look at - or in the broader teacher's sense, which includes both? Then, if the test is supposed to record the absolute (criterion referenced) construct of how many phrasal verbs some subjects know, one needs to check that the test includes a random sample from the entire set of phrasal verbs, as defined. If, however, the test is intended only to measure a relative construct - who knows more phrasal verbs than who (relative/norm-referenced measure) - it is sufficient to check that all the items really are about phrasal verbs, with no other types included, and that a reasonable range of them has been included: the sort of considerations discussed under reliability and item analysis for norm-referenced tests would then be involved as well, and they would ensure that the verbs included are in fact of moderate difficulty for the targeted subjects (excluding ones that are too easy or too difficult). Often for research purposes the relative measure is most suitable: if one needs to test subjects' reading ability for a study, it is usually in order to distinguish 'good' from 'bad' readers purely in the sense of 'relatively better' and 'relatively worse' ones, so there would be no point in choosing a text for the test which was too easy or too hard for them.
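The two ways of choosing items just described can be sketched in Python. Everything here is hypothetical: the ten-verb 'population', the facility values (proportion of pilot subjects answering correctly) and the cut-offs of 0.3 and 0.8 for 'moderate difficulty' are all invented for illustration, not recommended values.

```python
import random

# Hypothetical population of phrasal verbs, as delimited by whatever
# definition of 'phrasal verb' has been declared for the study.
population = [
    "give up", "look after", "put off", "turn down", "run into",
    "take over", "set up", "bring about", "carry on", "break down",
]

# Absolute (criterion-referenced) test: a random sample from the whole
# defined population. A fixed seed makes the draw repeatable.
rng = random.Random(0)
absolute_items = rng.sample(population, 5)

# Relative (norm-referenced) test: keep only items of moderate difficulty
# for the targeted subjects, judged from invented pilot facility values.
facility = {
    "give up": 0.95, "look after": 0.70, "put off": 0.55, "turn down": 0.45,
    "run into": 0.35, "take over": 0.60, "set up": 0.90, "bring about": 0.10,
    "carry on": 0.65, "break down": 0.50,
}
LOW, HIGH = 0.3, 0.8  # assumed cut-offs for 'moderate' difficulty
relative_items = [verb for verb in population if LOW <= facility[verb] <= HIGH]

print(absolute_items)
print(relative_items)
```

Note that the first selection ignores difficulty altogether, while the second deliberately discards the easiest and hardest verbs, so it could not support a claim about how many phrasal verbs someone knows.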
PhD students don't always get anyone other than themselves to check their instruments for content. At least the supervisor should be involved as a second expert, especially as you don't even have to pilot the instrument to do this! And if piloting an instrument, always get as many open comments as possible from the subjects used about their experiences using it, and what they thought it was measuring, as these may give valuable hints as to whether some misunderstanding has occurred which might affect validity. In particular you may discover that they are using test-taking or task-performing strategies which mean that your instrument is measuring their successful use of these strategies as much as their possession of whatever ability the test is testing (see McDonough, Strategy and Skill in Learning a Foreign Language ch6). Or it may emerge that they are using knowledge that you had not intended to be measured by this instrument: e.g. imagination in a written task which you intend merely to measure overall text organisation, or real world knowledge in test items you had intended to test just vocabulary knowledge.
Can you think of some examples of instruments where some variable other than the intended linguistic one would often get measured?
There have been some more systematic attempts by researchers to investigate the validity of various types of test by think aloud research. This compares the strategies test-takers use when taking a test with the strategies they use when doing the 'same thing' otherwise. E.g. do learners doing a conventional reading comprehension test where they read a passage and answer multiple choice questions read the passage using the same strategies that they use when reading a passage in a non-test situation? You can guess the answer (see McDonough ch6 again). If the notion of the 'content' of an instrument is allowed to include the skills used to tackle the items as well as what is in the items, as it well might, this is all loosely part of content validation.
Few statistics are involved in any of this. To demonstrate content validity of an instrument you are using you can of course record the % of items in a test or inventory that all judges agreed represented the targeted variable. Or, for any single item or for an instrument as a whole, one can record the % of judges who agreed it was valid. However, more usually one uses the information at once to improve the instrument before using it for real (and reports on this in one's write-up). Typically any item, or instrument as a whole, that at least one judge objected to would be revised. In other words, this often involves a form of 'item analysis'.
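The simple percentages mentioned here can be sketched in Python as follows. The four items and three judges are invented; True means a judge accepted the item as reflecting the targeted variable.

```python
# Hypothetical verdicts: one list per item, one entry per judge.
judgments = {
    "item1": [True, True, True],
    "item2": [True, False, True],
    "item3": [True, True, True],
    "item4": [False, False, True],
}

# Percentage of items that all judges accepted.
unanimous = [item for item, votes in judgments.items() if all(votes)]
pct_items_unanimous = 100 * len(unanimous) / len(judgments)

# Percentage of judges accepting each single item.
pct_judges_per_item = {
    item: 100 * sum(votes) / len(votes) for item, votes in judgments.items()
}

# Items that at least one judge objected to: candidates for revision.
to_revise = [item for item, votes in judgments.items() if not all(votes)]

print(pct_items_unanimous)
print(pct_judges_per_item)
print(to_revise)
```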
How does this differ from the sort of item analysis that often goes hand in hand with the assessment of internal reliability of an instrument?
2. Concurrent validity
Checking up on concurrent validity means checking one's instrument by comparing it with another one that you know is (more) valid (the 'criterion' measure or instrument). An obvious limitation is that you cannot use this validation method on any instrument where no known 'better' instrument exists to be compared with! In much language research, especially perhaps psycholinguistics, one is measuring things that have not been measured before. And if a better instrument does exist, in research work the suggestion will always be there: why not use the known 'better' instrument in the first place, instead of this other one? Maybe the reason will be cost - compare (3) and (7) in the Introduction.
Furthermore, there may be disagreement as to which is the instrument one can safely assume is more valid. It is often assumed that 'direct' measures (more naturalistic methods of measurement) are more valid than 'indirect' ones (e.g. artificial tests). More elaborate extensions of this concept of validity form the multimethod part of multitrait-multimethod (MTMM) studies.
To demonstrate concurrent validity one has to try out one's instrument, and the criterion one, on the same people (typical of one's target subjects) and show that there is high agreement in the results. This involves piloting, similar to that for reliability checking. The distinction between relative (norm referenced) and absolute (criterion referenced) measures applies, though it is the measures of agreement appropriate to the former that one mostly sees used.
2.1 Concurrent validity: variables on interval scales
Fischer (1984) wanted to quantify the communicative quality of the writing of learners of French. He developed an instrument, to be used primarily by language teachers, based on combining scores for three separate variables which he considered prime indicators of the target construct. Measurer ratings of 'pertinence' and 'clarity' (each on a scale 0-6) are straightforwardly added, and a count of mean number of errors per clause, as a measure of 'structural accuracy', is added as a negative score - i.e. in effect deducted. In other words, as described, he works with a formula for communicative quality of the following form:
communicative quality = pertinence + clarity - (mean number of errors per clause)
One way to evaluate this would be to consider its content validity:
Does your expert judgment tell you that these three things are the key ones in any theory-based definition of written communicative competence?
However, Fischer was concerned with demonstrating concurrent validity. In order to do this he had it applied to 18 typical texts written by learners as a communicative task (job applications). He also got 53 native speakers of French to rate these texts globally on a scale 0-24 running from 'no message gets over' to 'coherent and contains content holding reader's attention', these glosses of scale points being presented in French, of course. The average (strictly, median) native speaker rating was chosen as the criterion measure, assumed valid, for each text. Since teachers and researchers cannot always use native speaker ratings directly to measure communicative quality of writing, he wanted to show that his complex instrument, based on measuring three things and combining them, was as good as a NS rating.
Fischer then examined concurrent validity statistically (and some aspects of construct validity - see later). Although you could make out a case for both the composite instrument itself, and the criterion instrument it is being compared with, having some 'absolute' (criterion-referenced) interpretation, in fact Fischer used the Spearman rho correlation - a 'relative' measure of agreement which treats the scores as rank ordered.
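To show what 'treats the scores as rank ordered' means in practice, here is a minimal Python sketch of Spearman rho, computed as the ordinary (Pearson) correlation of the two sets of ranks, with tied scores given the mean of the ranks they occupy. The six pairs of scores are invented and are not Fischer's data; the function names are my own.

```python
from math import sqrt


def average_ranks(values):
    """Rank from 1 (lowest); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman rho: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented scores for six texts.
instrument = [7.5, 3.0, 9.0, 5.5, 6.0, 2.0]   # composite instrument
criterion = [18, 8, 18, 12, 15, 5]            # criterion rating (one tie)

rho = spearman_rho(instrument, criterion)
print(round(rho, 3))
```

Because only the ranks enter the calculation, any rescaling of either set of scores that preserves their order leaves rho unchanged.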
Correlations were calculated not just between the combined communicative measure and the native speaker ratings, the key pair for concurrent validation, but also, as is common practice in such instances, between all possible pairings of these and the separate measures of the three component variables combined in Fischer's instrument. Thus a whole set of correlation coefficients emerges, which in validity studies, as elsewhere, are usually displayed in a two-dimensional table called a 'correlation matrix', like the one illustrated below (adapted from Fischer). Of course if the instrument being used had not consisted of several subinstruments, there would have just been one correlation to examine.
The columns are understood as labelled the same as the rows of the same number, so the first figure in the top line (.241) is the correlation between the pertinence ratings of the texts and the clarity ratings. This is in fact a correlation internally between components of the instrument being validated, not directly throwing light on the concurrent validity of the instrument as a whole. In such tables the diagonal is usually left empty because the correlation of a measure with itself is +1, though sometimes this space is used to display the standard deviation of scores on each variable. And normally only one triangle is displayed, not the full square of correlations, because the other triangle would merely repeat the same information over again (SPSS however gives you everything twice).
From this we see that the highest correlation, close to +1, is in fact between the overall communicative instrument and the native speaker criterion. This is strong evidence for concurrent validity of the former (.7 or more is often regarded as 'good' in concurrent validity discussions). But remember that, since Spearman rho was used, this only extends to showing that Fischer's instrument puts cases in almost exactly the same "order of merit" as native speaker ratings. It says nothing about whether any absolute levels of communicative competence ascribed are the same. To examine the latter, the score scales would have to be equated (e.g. both expressed as %), and calculations of, e.g., the average absolute difference would then show whether the instrument is also concurrently valid in this absolute sense.
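A check of agreement in the absolute sense might look like the following Python sketch. It assumes two hypothetical instruments whose scales have fixed end points (0-20 and 0-24), so that each score can be expressed as a percentage of its range; the scores themselves are invented.

```python
def to_percent(score, lowest, highest):
    """Express a score as a percentage of the range of its scale."""
    return 100 * (score - lowest) / (highest - lowest)


# Invented scores for five cases on two scales with known end points.
instrument_scores = [10, 14, 6, 18, 12]   # hypothetical 0-20 scale
criterion_scores = [8, 18, 10, 20, 16]    # hypothetical 0-24 scale

inst_pct = [to_percent(s, 0, 20) for s in instrument_scores]
crit_pct = [to_percent(s, 0, 24) for s in criterion_scores]

# Average size of disagreement, ignoring direction.
mean_abs_difference = sum(
    abs(a - b) for a, b in zip(inst_pct, crit_pct)
) / len(inst_pct)

# Average signed disagreement: shows any general over- or under-scoring.
mean_signed_difference = sum(
    a - b for a, b in zip(inst_pct, crit_pct)
) / len(inst_pct)

print(round(mean_abs_difference, 2), round(mean_signed_difference, 2))
```

The two summary figures answer different questions: the first shows how far apart the two measures typically are, the second whether the instrument generally scores higher or lower than the criterion.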
Among the components of the communicative instrument, which rating correlates least well with the criterion measure?
If one tried to reduce Fischer's measure to a % score, what problem would one face?
2.2 Concurrent validity: variables on category scales
For an example of concurrent validation involving category scales, compare Black and Dockrell's data (1984 p115, but note it is not interpreted by those authors exactly as here). This can be seen as in effect a small scale check of the concurrent validity of teacher judgment of 23 learners' mastery of the dative plural in German, using a 12 item criterion-referenced multiple choice test as criterion. Beware the two different uses of the word 'criterion' here. A score of 75% or better was chosen to indicate mastery on the test, and the teacher judgment was a direct categorisation of cases as masters and non-masters of this aspect of German, based on his general experience of the learners in class and past grades. (Note that 'master' here means one who has acquired the relevant aspect of language, not a teacher.) Again the natural way of examining and quantifying the agreement is by a method already seen for reliability: a contingency table display and the measure of agreement suitable to data with absolute value, namely % agreement (or, better, Cohen's kappa coefficient):
What is the percentage agreement here?
What would Cohen's kappa be for this data?
We can see that the agreement is quite modest, not strong evidence for the concurrent validity of the teacher judgment in absolute terms, on the assumption that the test is valid. We can gain useful further information by examining the figures on the table for cases disagreed on. In fact the disagreement lies exclusively in the teacher overestimating the absolute competence of the learners and judging thirteen of them to know the German dative plural who did not. When a categorisation of the 'pass-fail' or 'yes-no' type like this is analysed in this way against a criterion categorisation, it may be of some importance whether 'false positives' or the reverse predominantly occur if pedagogical decisions are made on the basis of a partially valid instrument. In the present instance clearly the teacher judgment errs in favour of the former rather than the latter.
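For anyone wanting to see how such figures are obtained from a 2x2 contingency table, here is a Python sketch. The counts are invented for a different, hypothetical group of 40 learners; they are deliberately not Black and Dockrell's figures. 'False positive' is used in the sense above: the instrument being validated says 'master' where the criterion says 'non-master'.

```python
# Invented 2x2 counts. First label: instrument being validated;
# second label: criterion measure.
both_yes = 20    # master / master
false_pos = 6    # master / non-master
false_neg = 4    # non-master / master
both_no = 10     # non-master / non-master

n = both_yes + false_pos + false_neg + both_no

# Observed proportion of cases on which the two measures agree.
p_observed = (both_yes + both_no) / n
percent_agreement = 100 * p_observed

# Proportion of agreement expected by chance, from the marginal totals.
p_chance_yes = ((both_yes + false_pos) / n) * ((both_yes + false_neg) / n)
p_chance_no = ((false_neg + both_no) / n) * ((false_pos + both_no) / n)
p_expected = p_chance_yes + p_chance_no

# Cohen's kappa: agreement achieved beyond chance, as a proportion of
# the maximum possible agreement beyond chance.
kappa = (p_observed - p_expected) / (1 - p_expected)

# How the disagreements divide up.
share_false_pos = false_pos / (false_pos + false_neg)

print(percent_agreement, round(kappa, 3), share_false_pos)
```

Kappa comes out well below the raw proportion of agreement because a good deal of agreement would be expected by chance alone with these marginal totals.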
Do you think it is reasonable to regard 75% correct answers on a test as evidence of 'mastery' or having acquired something?
Do you think it is possible to argue that the teacher judgment should be considered the criterion measure (i.e. the one with assumed validity), and that we should assess the validity of the test against that judgment rather than the other way round?
Suppose we sit on the fence and say we aren't sure which of the two measures can really be assumed to be valid. Does the above analysis of the agreement between them then tell us anything about the validity of either measure?
How does the above analysis differ from one of reliability?
The above approach to measuring concurrent (and predictive) validity for categorisations seems most simple and appropriate for pairs of variables recorded in the same categories, and with some absolute value, though other, correlation-type, measures are also seen used - e.g. phi (Tindal et al. 1985), tetrachoric r (Arrasmith et al. 1984).
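Of the correlation-type measures, phi is simple enough to sketch in Python (tetrachoric r requires a more elaborate estimation and is not attempted here). The sketch uses the standard formula for a 2x2 table and invented counts, not data from the studies cited.

```python
from math import sqrt

# Invented 2x2 counts. First label: instrument being validated;
# second label: criterion measure.
a = 20   # yes / yes
b = 6    # yes / no
c = 4    # no / yes
d = 10   # no / no

# Phi: cross-product difference divided by the square root of the
# product of the four marginal totals.
phi = (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

print(round(phi, 3))
```

Like other correlations, phi reflects relative association between the two categorisations rather than agreement in the absolute sense.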
3. Predictive validity
This one was not illustrated in the Introduction, as it does not really apply to weighing machines, and indeed not to many instruments used in language research. In fact one can only check predictive validity for an instrument that claims to measure a variable that by definition has some forward-looking element to it, or that can be assumed to predict later performance on some other variable. Tests of language learning aptitude, reading readiness and the like are the most obvious examples. They can then be checked by seeing if people who do well on them also do well at a later time on whatever the instrument claimed to predict (i.e. actually learning a language, reading etc.). The latter, outcome, measure serves as a criterion to assess the former, predicting, measure just as the known more valid measure serves as criterion to assess the questionable one in concurrent validation.
3.1 Predictive validity: variables on interval scales
Predictive validity is often quantified by a correlation measure of some sort, as the commonly used instruments that it applies to are tests yielding interval scores. Thus Curtin et al. (1983) report testing out Pimsleur's foreign language learning aptitude test at University of Illinois High School. Correlations between the aptitude test scores and grades obtained later at the end of the first year of learning a foreign language were not very impressive:
Though nowhere near +1, at least the correlations are all positive - there were no languages for which higher scores on the aptitude tests foretold lower learning achievement. Those who did relatively well on the aptitude test were most likely to be later relatively successful at learning Latin - by the means of instruction used in the above institution, of course. As usual these correlations tell us nothing about any absolute levels of achievement in the languages concerned. Further information might be gleaned from an examination of the four scatterplots for the aptitude and specific language achievement scores. For instance it might pay to see which cases were "spoiling" the lower correlations and whether they had anything else in common.
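As a supplement to eyeballing a scatterplot, one rough way of finding the cases that are "spoiling" a correlation is to fit a straight line and look for the largest residuals. The Python sketch below does this for invented aptitude scores and grades for eight learners (not Curtin et al.'s data); treating the largest residual as the first case to inspect is my own suggestion, not a procedure from that study.

```python
from math import sqrt

# Invented aptitude test scores and later end-of-year grades.
aptitude = [52, 60, 45, 70, 66, 58, 49, 75]
grade = [61, 68, 55, 74, 40, 65, 58, 80]

n = len(aptitude)
mean_x, mean_y = sum(aptitude) / n, sum(grade) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(aptitude, grade))
sxx = sum((x - mean_x) ** 2 for x in aptitude)
syy = sum((y - mean_y) ** 2 for y in grade)

# Pearson correlation between predictor and later outcome.
r = sxy / sqrt(sxx * syy)

# Least-squares line predicting grade from aptitude, and each learner's
# residual (actual grade minus predicted grade).
slope = sxy / sxx
intercept = mean_y - slope * mean_x
residuals = [y - (intercept + slope * x) for x, y in zip(aptitude, grade)]

# Position in the list of the learner furthest from the line.
worst_case = max(range(n), key=lambda i: abs(residuals[i]))

print(round(r, 3), worst_case, round(residuals[worst_case], 1))
```

Having identified such a learner, one would go back to the records to see whether anything else distinguishes him or her from the rest.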
3.2 Predictive validity: variables on category scales
A criterion-referenced 'mastery' test which purports to predict, from attainment of a preset accept or reject level on the test, success or failure on an ensuing course, would best be analysed like the Black and Dockrell example discussed above, together with a 'false positive analysis' etc. In practice the problem may be that you will get followup information only on those who 'passed' the predictive measure, as only those were allowed to go on to learn, so you will not be able to complete the contingency table. An example where all the figures are available is afforded by Moller (1975), who examined the predictive validity of the Davies English test, administered before embarkation to foreign students coming to the UK, as an indicator of adequacy of their English for postgraduate courses in a variety of subjects. (The Davies test was popular in the days before the development of IELTS). The followup criterion here was UK supervisor rating. Statistics as in 2.2 apply.
As can be seen, the percentage agreement is 86%, so reasonable predictive validity is achieved. But the division of the mismatches is such that it seems fortunate that those that the test rejected were nevertheless still allowed to go!
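The percentage agreement and the division of mismatches referred to here can be worked out from any completed two-by-two contingency table. The following is a minimal sketch only: the counts are invented for illustration and are not Moller's figures, and the labels 'accept', 'reject', 'adequate' and 'inadequate' are hypothetical names for the test decision and the followup criterion.

```python
# Hypothetical 2x2 contingency table (invented counts, NOT Moller's data).
# Keys: (decision on the predictive test, later criterion outcome) -> number of students.
table = {
    ("accept", "adequate"): 40,
    ("accept", "inadequate"): 4,    # false positives: passed the test, later judged inadequate
    ("reject", "adequate"): 6,      # false negatives: failed the test, later judged adequate
    ("reject", "inadequate"): 10,
}


def agreement_summary(counts):
    """Return percentage agreement plus the two kinds of mismatch."""
    total = sum(counts.values())
    matches = counts[("accept", "adequate")] + counts[("reject", "inadequate")]
    return {
        "percent_agreement": 100 * matches / total,
        "false_positives": counts[("accept", "inadequate")],
        "false_negatives": counts[("reject", "adequate")],
    }


summary = agreement_summary(table)
print(summary)
```

Reporting the two kinds of mismatch separately matters because the same percentage agreement can hide very different proportions of wrongly accepted and wrongly rejected cases.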
4. Construct validity
Construct validation again requires the instrument in question to be used, maybe in a pilot trial. Essentially it involves examining relationships between what are clearly measures of different variables, one of which one is interested in validating, to see if the known/assumed relationships (whether positive or negative or neither) are found. Thus it contrasts with concurrent validation where one is concerned with comparing what are regarded as different means of measuring 'the same' variable (and of course only positive relationships are expected). Having said that, there are subtle differences in the ways some writers use the term 'construct validity' which I will not consider here.
Construct validation can take on many forms, but we will treat it as of two types - that where one looks at the internal structure of a complex instrument such as Fischer's above (which only applies to instruments made by measuring several variables and combining the scores, of course) and that where one simply looks at the relationships between an instrument to be validated and other measures of other variables whose validity is not in question. The last is applicable everywhere and can involve almost any statistics. It essentially is exactly the same as pursuing substantive classical research, except that the assumptions are altered. In normal research you assume the measuring instruments are valid and use them to measure relevant variables so you can find out about the relationship between one variable and another (which is not known for sure in advance, though you may have a hypothesis about it of course). In construct validity work you assume all the measuring instruments except one are valid measures of their intended variables and know from previous research or common sense what the relationship between one variable and another is. You use this to assess the validity of the instrument in question. If the assumed relationship between variables is found, that suggests the instrument has construct validity. This was illustrated by (4) in the Introduction. Normally you use weighing scales with the assumption that the scales are correct, but where you are uncertain about what things weigh. To check scales you can reverse this, by weighing things with a known weight or known difference in weight to check if the scales are correct.
4.1 Construct validity within a complex instrument
We can illuminate the examination of construct validity within a complex instrument a bit more by examining the Fischer example study again (see 2.1). Many instruments used in language research are not complex in this way however. Columns 2-4 of the matrix in the above section show the internal analysis in correlation terms of Fischer's instrument. Examining these is effectively one way of checking 'construct validity' since the three component measures were chosen, with some theoretical backing, as supposedly quantifying different complementary variables which make up a meaningful single construct. We therefore expect them not to agree much with each other but to agree with the overall measure derived from them. Column 4 shows the correlations of each component variable with Fischer's overall measure. Unlike all the other correlations, which are "free" to arise or not depending on empirical fact, there is bound to be some degree of correlation here because of the a priori connection between any overall measure and its component variables. To confirm validity we expect pretty high correlations, which we get for two out of the three components in fact: we can see again that the pertinence rating correlates less well with the overall measure than the other two component variables do, and the clarity rating the best. Thus the clarity rating would be the best choice as a single indicator for the instrument as a whole, if you so wished. Generally you would hope for reasonably high correlations between component variables and a combined measure made from them, otherwise it might seem that the theoretical framework that led you to think of them as components of "one" variable was wrong.
Columns 2 and 3 enable you to examine the correlations just between the three grass-roots variables themselves. In a composite instrument such as this you would not expect high correlations here, otherwise it might be argued that there is no real point in combining three separate measures - one would do. In fact two of the three are quite low, one being very slightly negative. That means those who do well on the pertinence rating score, if anything, relatively low on the structural accuracy measure. Indeed correlation-type coefficients can extend all the way down to -1, if relatively high values on one variable are scored by cases who score low ones on the other and vice versa. The low correlations on the whole support the logic on which the measure was established, based on the theory that the three variables are more or less independent contributors to the one composite construct. If more contributory variables had been measured, the statistical technique of 'factor analysis' could have been used to sort out which were the essential ones, quantifying definitely distinct components of the composite measure.
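An internal correlation matrix of this kind can be produced along the following lines. This is a sketch under stated assumptions: the scores are invented (they are not Fischer's data), the overall measure is assumed to be a simple sum of the three components, and Spearman rho is computed as the Pearson correlation of average ranks.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = mean_rank
        pos = end + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rho as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Invented component scores for eight cases (hypothetical, for illustration only).
measures = {
    "pertinence": [3, 1, 4, 2, 3, 1, 4, 2],
    "clarity": [1, 2, 3, 4, 4, 1, 2, 3],
    "accuracy": [2, 4, 1, 3, 1, 3, 2, 4],
}
# Assumption: the overall measure is the plain sum of the three components.
measures["overall"] = [sum(case) for case in zip(*measures.values())]

matrix = {(a, b): spearman(measures[a], measures[b]) for a in measures for b in measures}

for a in measures:
    print(a.ljust(12), " ".join(f"{matrix[(a, b)]:6.2f}" for b in measures))
```

In such a matrix you would look for lowish values among the three components and higher values in the 'overall' row, bearing in mind that the latter are partly guaranteed by the way the overall score is built.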
4.2 Construct validity of an instrument in relation to other, different, variables
This type of construct validation is really the most fundamental and most widely available type of all, though often overlooked by researchers who limit themselves to pursuing the 'easier' content validation. It can in the end involve any of the standard research designs. A simple example of its use is where, as part of a study of the acquisition of INFL in English, say, a researcher includes learners of several proficiency levels, and some adult native speakers of English. All are tested with a test designed to show mastery of this aspect of English grammar. The hypotheses naturally are that higher proficiency levels will show greater mastery and NS the highest of all. However, from the point of view of substantive research these hypotheses are of the type sometimes called 'straw men'. It seems somehow 'obvious' that native speakers will do well and that learners will do less well, and that learners of higher proficiency will do better than those with lower proficiency. Such hypotheses really cover things we know already and it hardly seems worth doing a study just to show them to be true. Good hypotheses to follow up on for a research project are not usually ones that everyone already has proved several times over, but something more on the edge of knowledge. So one would hope this researcher has some other hypotheses or research questions as well - ones which assume a bit more and focus on something less well known (like perhaps ones about which specific manifestations of INFL are acquired first, or whether they all emerge together). But if a research project does include some 'straw man' hypotheses, this does have a value as a form of construct validation. Precisely because the relationships are already known, the new instrument to measure mastery of INFL can be checked for validity by seeing if it yields the assumed relationships. This employs the 'argumentation in reverse' that is typical of construct validation. 
In other words, when the data is gathered, do we find an increase in mastery at higher proficiency levels, with near perfect scores for native speakers? If we do, that suggests our INFL test is valid. If not, we would wonder why and check the test (though if this was all done in the main study, not a pilot, it is too late to change it!).
What statistics would probably be involved in that example?
To further characterise what goes on in construct validation I shall sketch some other approaches that could have been used in the Fischer example above, though they were not. Typically they would have concentrated on external relationships of Fischer's overall instrument with measures of what are assumed to be clearly separate variables, neither components of Fischer's nor criterion measures of much the "same" construct. The criterion measure - the native speaker rating - could also be pursued in the same way. For either instrument this could be done either again via correlations, or via research involving comparisons of groups or conditions similarly.
4.2.1 Construct validity: correlational design
The correlational approach would again generate a matrix of correlation coefficients, this time between Fischer's instrument and several other different variables. Here you would look to see if the correlations obtained accorded with what theory would predict the relationship to be. For instance written communicative competence would not be predicted to correlate particularly with explicit metalinguistic knowledge of linguistic terminology, or with intelligence, so you would expect near zero coefficients. This is sometimes called 'discriminant validation'. On the other hand you would assume a positive relationship with oral communicative competence. Checking on this sort of relationship is sometimes called 'convergent validation'.
In a different realm, Fasold (1984 p120f) reports the convergent approach being used to validate census reports on native language in Quebec. You would a priori feel able to assume that the proportions of adults reporting themselves as mother tongue French or English speakers would match closely the proportions of children independently recorded as enrolling in Catholic and Protestant schools, and indeed this turns out to be so. So the trustworthiness of the census is supported. The discriminant (or divergent) approach is an essential component in the MTMM method of validation (my book 21.2.3).
More elaborate studies along these lines use factor analysis which can be seen as being able to analyse a whole set of correlations together rather than pairwise and identify which groups of variables are clubbing together to place cases in a similar order with similar spacing of scores - i.e. mutually correlate highly. In particular, 'confirmatory factor analysis' is used where you have a prediction to test about what the correlations should be, as you will have in a proper construct validation exercise.
4.2.2 Construct validity: non-experimental independent groups design
The group comparative approach was illustrated by the INFL example in 4.2. Fischer's instrument could similarly be checked by trying it with native speakers as well as learners. This would be permitted since it was not specifically norm-referenced to learners only. Something similar was done for the TOEFL (Angoff and Sharon 1971). On instruments such as these, if it emerged that a group of learners did better than a group of native speakers, clearly this would be counter to assumptions and you would question the validity of the measure. Here the EV - the categorisation of cases as 'native speaker' or 'learner' - is the "other" variable that the measure in question (used as DV) is checked for its relationship with.
Additionally, for instruments made up of many items, Rasch analysis (see reliability) can be used to show whether items are answered correctly more often by successively higher levels or ages, as they should be.
4.2.3 Construct validity: experimental designs (independent groups and repeated measures)
Instead of using existing characteristics of cases, such as being a native speaker or not, often it is suitable to 'make' an EV in a fully experimental investigation of construct validity. So you could use Fischer's instrument on two or more groups of cases after they are exposed to different conditions or treatments which you feel confident on a priori grounds will affect their written communicative competence. E.g. you teach one group communicative writing but not the other. If Fischer's instrument duly records the assumed difference, then that supports its validity. The term 'treatment validity' is sometimes used for this specific approach.
The same 'treatment' approach is widely used in validating criterion-referenced achievement tests in pedagogical contexts (Black and Dockrell 1984 p92). You administer a questionably valid test of, say, apologising appropriately in English, to a group who have been taught the content on which the test is supposedly based, and a group who have not. The natural assumption here is of course that the former group should score higher than the latter, if the test is valid. A variant of this approach would rather involve administering the test twice to the same group, before and after instruction, to see if the assumed improvement registers or not. Not dissimilar also, is the following procedure, often done quite informally. In a multi-item test or inventory you intersperse 'control' items on which you have a priori assumptions of what the responses should be. For example, if cases are rating the 'idiomaticity' of some phrases you may include some that are clearly non-idiomatic as a check on whether the cases are using the same definition as you.
4.2.4 Conclusion on construct validation statistics
At this point it is worth drawing attention to some differences between the use of correlation coefficients to quantify validity and their use in connection with reliability. First, we have just seen that near zero and negative values may well arise in validity work: these would be most unexpected in a reliability study (or indeed concurrent or predictive validity study). Second, correlation matrices are most often seen in a validity study. In reliability work where you have, say, a set of three or more repetitions of a test, you would probably try to quantify the overall agreement between them rather than examine the agreement between each possible pair of occasions when the test was redone. Finally, in the study of validity you are studying the relationships between different variables and so often comparing instruments yielding scores on different scales of the same general type. For instance, Fischer's instrument has a maximum score of 12, the criterion rating one of 24. That again does not arise in reliability checking as by definition the same measure on the same scale is repeated in some way. This causes no problem for correlation-type coefficients, incidentally, since they only quantify agreement in a relative sense and are largely unaffected by differences in score level on the two variables compared.
Further, something I have generally glossed over above but which must be considered when interpreting the correlation coefficients is the number of cases involved. To oversimplify, what constitutes a correlation that is big enough in a positive or negative direction to be taken as marking any convincing relationship (versus no marked relationship at all) depends on the number of cases. The threshold values can be found in significance tables in basic statistics texts. The threshold value for Spearman rho for 18 cases is in fact .399, so we see that two of the correlations obtained for Fischer's instrument are not high enough to indicate any real relationship at all (using the customary .05 level of significance).
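The dependence of the threshold on the number of cases can be illustrated with a rough calculation. This sketch assumes the usual t approximation for a correlation coefficient, and a critical t of 1.746 (16 degrees of freedom, .05 level, one-tailed) copied from standard t tables. That the tabulated .399 is a one-tailed value is an inference from the arithmetic, not something stated above, and exact Spearman tables remain preferable for small samples.

```python
import math


def critical_r(n_cases, t_critical):
    """Smallest correlation reaching significance, via the t approximation.

    Rearranges t = r * sqrt((n - 2) / (1 - r**2)) to solve for r.
    """
    df = n_cases - 2
    return t_critical / math.sqrt(t_critical ** 2 + df)


# Assumption: critical t for df = 16, one-tailed .05, taken from standard tables.
T_CRIT_DF16_ONE_TAILED_05 = 1.746

threshold = critical_r(18, T_CRIT_DF16_ONE_TAILED_05)
print(round(threshold, 3))  # close to the tabulated .399 for 18 cases
```

Feeding in a larger number of cases, with the matching critical t, gives a lower threshold, which is why the same coefficient can be convincing in a large sample and unconvincing in a small one.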
In a similar way, the key statistical information that emerges from the group comparison and experimental methods is an indication of probability, signified by 'p', and perhaps other measures of amount of relationship like eta squared (rather than a correlation coefficient). These can be arrived at by any number of statistical means appropriate to the particular design used. As is customary, if p is less than 0.05, researchers would generally assume that a difference in DV scores worth paying attention to has been demonstrated between the groups or conditions. Technically the difference is said to be 'significant'. See further a text such as Langley (1979) or Rowntree (1981) for a simple account of the reasoning behind significance tests and p values and their interpretation. In the present application, then, a p near zero would indicate a relationship between the EV and the variable whose quantification is in question. You would have to further look at the scores to see if the relationship was in the assumed direction - e.g. that native speakers did do significantly better rather than worse than learners on Fischer's measure, and so support its validity.
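As an illustration of how a p value and eta squared might be obtained for a two-group comparison, here is a minimal sketch. The scores are invented, and an exact permutation test is used simply because it needs no statistical tables: it is one of many possible means, not a method prescribed here. The p value is one-sided, in the assumed direction (native speakers higher).

```python
from itertools import combinations


def mean(xs):
    return sum(xs) / len(xs)


def exact_permutation_p(group_a, group_b):
    """One-sided p: share of all regroupings in which mean(b) - mean(a)
    is at least as large as the difference actually observed."""
    pooled = group_a + group_b
    n_b = len(group_b)
    n_a = len(pooled) - n_b
    observed = mean(group_b) - mean(group_a)
    total = sum(pooled)
    as_extreme = 0
    n_splits = 0
    for idx in combinations(range(len(pooled)), n_b):
        b_sum = sum(pooled[i] for i in idx)
        diff = b_sum / n_b - (total - b_sum) / n_a
        n_splits += 1
        if diff >= observed - 1e-12:  # small tolerance for floating point
            as_extreme += 1
    return as_extreme / n_splits


def eta_squared(*groups):
    """Proportion of total score variation accounted for by group membership."""
    all_scores = [x for g in groups for x in g]
    grand = mean(all_scores)
    ss_total = sum((x - grand) ** 2 for x in all_scores)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    return ss_between / ss_total


# Invented scores on the instrument being validated (hypothetical data).
learners = [5, 6, 7, 6, 8]
native_speakers = [10, 11, 9, 12, 11]

p = exact_permutation_p(learners, native_speakers)
eta2 = eta_squared(learners, native_speakers)
print(f"p = {p:.4f}, eta squared = {eta2:.3f}")
print("direction as assumed:", mean(native_speakers) > mean(learners))
```

The final line reflects the point made above: a small p is not enough on its own, since the difference must also lie in the assumed direction.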
All these approaches involving validation against measures of clearly different variables share the characteristics of being like normal research designs, but done "back to front". Indeed any of the many designs of investigations that are used in research can be used in this back-to-front way as validity checks. To reiterate: what I mean by "back to front" is as follows. Normally in language research you assume that the variables that are of interest can be quantified validly and seek through correlation study or experiment etc. to find out something about relationships between them. In the present situation you rather have to assume you know the relationship between the variables, and that measures of variables other than the one whose validity is in question are valid, in order to find out about the validity of a problem instrument.
A maths analogy is valuable. We know the formula that the circumference of a circle is equal to pi times the diameter (C=πd). So regularly when we know the diameter of a circle we use the formula to calculate the circumference (pi of course is always approximately 3.1416). However, we can also use the formula back to front - i.e. if we happen to know the circumference we can divide by pi to calculate the diameter. What we cannot do is find anything out if we are sure neither of the circumference nor the diameter, or indeed if we are not sure that the relationship expressed in the formula is quite right either! Similarly in the present instance we can conclude nothing about validity of the quantification of one of the variables in the above pieces of research unless we are sure of all the other elements - which can be a big assumption.
For instance in the apology example above (4.2.3), we have to assume the teaching really is effective: otherwise a contrary result could be interpreted as evidence of faulty teaching/learning rather than a faulty measure of competence in apologising. In the case of Fischer's instrument we would have to be sure of our theory that predicted no relationship between written communicative competence and knowledge of terminology, and so forth. The reality of course is that usually you are not sure of several aspects of a piece of research. Hence acts of faith come into play in construct validation, just as they tend to in the matter of the assumption of validity of the criterion instrument in concurrent validation.
5. Conclusion on validity in general
Checking validity should not be neglected just because reliability is usually easier, and more mechanical, to check.
The best thing to do of course, both with reliability and validity, is to be fully aware in advance of the kinds of factors that undermine them and make sure they don't get into the instruments, or the way they are administered and scored, in the first place! Quality assurance beats quality control (See e.g. my book).
Often in research one is using/piloting for the first time instruments that one has created especially to quantify some very specific variables of interest, in order to test a new hypothesis. Hence you may be heavily reliant on content validation as there are no other measures of the same thing of known validity to use for concurrent validation, and you don't feel able to assume any relationships with other variables that can be used as a basis for construct validation. One protection is to devise several different ways of measuring 'the same thing', none of which have very well established validity, and average the results: this may yield better validity than a single measure, and is implicit in the notion of 'triangulation' associated especially with non-classical research (i.e. the idea that you use several different ways of investigating the same thing, both qualitative and quantitative perhaps, in order to illuminate it most effectively).
Can you think of more than one way of gathering data and quantifying the frequency with which subjects use different reading strategies?
Can you think of more than one way to measure if some learners have mastered the third singular -s on verbs in English or not?
Portfolio assessment in teaching contexts as a form of triangulation?
If you do use in this way several methods of measuring the 'same thing' (or at least of what you intend to be the same construct) of course it looks a bit like concurrent validation, but is actually quite different. You might for example measure reading ability both with a cloze test (filling gaps in a text), and a multiple choice test of comprehension following reading a passage, neither of which you are sure of the validity of. You can even compare the results (e.g. with correlation) to see how far they agree, in addition to combining the scores to get 'the best measure'. But this is not concurrent validation unless you are able to make the assumption that one of the instruments is definitely valid. If the two measures agree, it is encouraging but does not actually confirm their validity if you know the validity of neither in any other way. There could be a constant error shared by both, just as in reliability analysis. And if there is a big difference in the results, you don't know if it is because one is valid and the other not, or because both are invalid in different ways, or what. Gathering qualitative data in the form of comments from the subjects afterwards may help decide. But there is always the risk that you create a set of scores with less validity by combining the results of a highly valid instrument with those of an invalid one.
Probably the best way to decide on what multiple ways of measuring the same thing to use is to choose ones which appear to potentially involve rather different sources of invalidity, in the hope they balance out. For instance, to get a balanced overall measure of vocabulary knowledge you would combine tests with both written and oral and picture stimuli, so that it is not invalidated by being partly a reading test, which could be the case if you used only written stimuli. You need an awareness of sources of invalidity (see my book esp. ch 21-23 and accounts of validity generally).
Investigation of validity (and indeed of reliability) can be a research topic in itself rather than, as we have pursued it here, just something you do on the side of your 'real' research, to check that the instruments are accurate. A particular kind of validity checking done as a research project in itself is the 'multitrait multimethod' study (MTMM). This sets out to systematically measure more than one construct ('trait') each by more than one type of instrument ('method') in a parallel way. One then gets to see if the different instruments produce more drastically different results than the different supposed constructs (Bachman; for a clear example of its use see Gardner, Lalonde et al. ‘The role of attitude and motivation’ Language Learning 35).
When combining the results of different instruments, you need a bit of statistical knowhow. We consider here interval scores with relative not absolute value. Just adding the scores from three different reading tests will not be satisfactory if they are on different scales, as the weight or contribution of each instrument will not be equal. Even if they are all on the same interval scale, the three sets of figures may have different standard deviations, which has an effect. If they are all at least interval, one method is to turn all the scores into standard scores (also known as z scores), which puts them all on scales of the same length with the same mean and SD. In SPSS this can be done by going to Analyze...Descriptive statistics...Descriptives, and marking the box for SPSS to Save standardized values as variables. SPSS creates columns with the standard score equivalents of the columns you had before. In fact standard score (or z score) equivalents of a set of scores = (the original score - the mean) / the SD. This reduces any interval scale to a scale with scores of mean 0 and SD 1, ranging approximately between -3 and +3. You can then safely add or average the three sets of scores to produce a composite scale. For non-interval scales or mixtures of scale types other techniques are needed (On combining scores see further my book chs 15-18).
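The standard score formula just given can be applied outside SPSS as well. The following is a minimal sketch with invented scores on three hypothetical reading tests marked on different scales. It uses the sample SD (n - 1 denominator); `statistics.pstdev` could be substituted if the population SD is wanted.

```python
from statistics import mean, stdev


def z_scores(scores):
    """Standard scores: (original score - mean) / SD."""
    m, sd = mean(scores), stdev(scores)
    return [(x - m) / sd for x in scores]


# Invented scores for five cases on three tests with different maximum marks.
tests = {
    "reading_test_a": [12, 15, 9, 20, 14],   # hypothetical, marked out of 25
    "reading_test_b": [30, 34, 22, 38, 31],  # hypothetical, marked out of 40
    "reading_test_c": [5, 6, 3, 8, 6],       # hypothetical, marked out of 10
}

standardised = {name: z_scores(scores) for name, scores in tests.items()}

# Composite score per case: the average of that case's three z scores,
# so each test contributes with equal weight.
composite = [mean(case) for case in zip(*standardised.values())]

for i, score in enumerate(composite, start=1):
    print(f"case {i}: composite z = {score:+.2f}")
```

Averaging rather than adding the z scores keeps the composite on roughly the same scale as its parts. The ordering of cases is the same either way.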
A more sophisticated way of combining interval scores would be to use factor analysis to pick out the most marked information shared between the scores for the three instruments and use scores for each subject on that 'first common factor' as the overall measure (see Factor Analysis). Obviously there is a good chance that the information shared by three partly invalid instruments designed to measure the same thing will be more valid than the information that differentiates them.
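The idea of scoring each subject on what the instruments share can be sketched as follows. Note the substitution: this computes the first principal component of the correlation matrix by power iteration, which is a simpler relative of common factor analysis and not the same thing. The data and test names are invented, and the sketch assumes all three tests correlate positively.

```python
from statistics import mean, stdev


def z_scores(scores):
    m, sd = mean(scores), stdev(scores)
    return [(x - m) / sd for x in scores]


def correlation(x, y):
    """Pearson correlation computed from z scores (sample SD)."""
    zx, zy = z_scores(x), z_scores(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)


def first_component_weights(corr, iterations=200):
    """Power iteration: dominant eigenvector of the correlation matrix,
    scaled to unit length."""
    k = len(corr)
    w = [1.0] * k
    for _ in range(iterations):
        w = [sum(corr[i][j] * w[j] for j in range(k)) for i in range(k)]
        norm = sum(v * v for v in w) ** 0.5
        w = [v / norm for v in w]
    return w


# Invented scores for five cases on three hypothetical reading tests.
scores = [
    [12, 15, 9, 20, 14],
    [30, 34, 22, 38, 31],
    [5, 6, 3, 8, 6],
]

corr = [[correlation(a, b) for b in scores] for a in scores]
weights = first_component_weights(corr)

# Each case's score on the first component: weighted sum of its z scores.
standardised = [z_scores(s) for s in scores]
component_scores = [
    sum(w * z[i] for w, z in zip(weights, standardised))
    for i in range(len(scores[0]))
]
print([round(w, 3) for w in weights])
print([round(c, 2) for c in component_scores])
```

With positively correlated tests the weights all come out positive, so the component score behaves like a weighted average in which tests sharing more with the others count for more. A dedicated statistics package would be needed for a true factor analysis.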