Scoring Sentences Developmentally: An Analog
of Developmental Sentence Scoring
A thesis submitted to the faculty of
Brigham Young University
in partial fulfillment of the requirements for the degree of
Master of Science
Ron W. Channell, Chair
Department of Communication Disorders
Brigham Young University
Copyright © 2012 Amy Seal
All Rights Reserved
Scoring Sentences Developmentally: An Analog
of Developmental Sentence Scoring
Department of Audiology and Speech-Language Pathology, BYU
Master of Science
A variety of tools have been developed to assist in the quantification and analysis of naturalistic language samples. In recent years, computer technology has been employed in language sample analysis. This study compares a new automated index, Scoring Sentences Developmentally (SSD), to two existing measures. Eighty samples from three corpora were manually analyzed using DSS and MLU and then processed by the automated software. Results show all three indices to be highly correlated, with correlations ranging from .62 to .98. The high correlations among scores support further investigation of the psychometric characteristics of the SSD software to determine its clinical validity and reliability. Results of this study suggest that SSD has the potential to complement other analysis procedures in assessing the language development of young children.
Keywords: sentence scoring, Developmental Sentence Scoring, MLU, SSD
A project of this magnitude is certainly not undertaken or completed by a single individual. Although my name is listed as the sole author, this work is the result of the efforts of many. I would like to thank my parents, Gregory and Suzanne Seal, for their love and support. Their constant encouragement kept me going when stress, fatigue, or procrastination threatened to halt my progress. I want to thank my fellow students for their friendship. I couldn’t have done it without them. They know who they are!
The late nights, marathon study sessions, and equally lengthy study breaks made this journey so much more enjoyable! And finally, to Dr. Ron Channell, I want to express my sincere thanks for the endless hours he spent answering my questions, talking me through details, and editing my fledgling drafts. His sense of humor and positive outlook kept me motivated and smiling (most of the time) through the months of research and writing. Without his guidance, skill, and patience, this work never would have moved beyond page one.
A variety of useful tools and indices for language sample analysis have been developed to assist in the quantification of natural, spontaneous language. The ability to quantify language provides a basis for collecting normative data and making developmental comparisons (Bennett-Kastor, 1988; Miller, 1991). Quantified descriptions of language can be useful in providing baseline information prior to developing appropriate intervention goals (Klee & Paul, 1981; Klee, 1985). Normative data are also valuable for measuring progress during intervention and comparing treatment outcomes (Hughes, Fey, & Long, 1992; Lee, 1974). Existing quantification measures range from frequency count procedures such as Mean Length of Utterance (MLU; Brown, 1973), to scored indices of grammatical complexity such as Developmental Sentence Scoring (DSS; Lee, 1974) and the Index of Productive Syntax (IPSyn; Scarborough, 1990).
For more than 30 years, MLU has been used as a measure of grammatical development. The correlation between MLU and the acquisition of grammatical morphemes has been verified (de Villiers & de Villiers, 1973; Klee & Fitzgerald, 1985; Rondal, Ghiotto, Bredart, & Bachelet, 1986). However, the validity of MLU beyond the age of two or three (Bennett-Kastor, 1988; Klee, Schaffer, May, Membrino, & Mougey, 1989; Rondal et al., 1986) and its sensitivity to syntactic development (Klee et al., 1989) have been called into question. Despite these criticisms, MLU maintains widespread clinical use (Kemp & Klee, 1997; Muma, Pierce, & Muma, 1983).
DSS is the most commonly recognized formal procedure for grammatical language sample analysis. Although the DSS procedure is more than 20 years old, it continues to be recognized as a valid, reliable tool for obtaining information about grammatical development (Hughes et al., 1992). Reportedly DSS is the tool most frequently employed by clinicians practicing language sample analysis (Hux, Morris-Friehe, & Sanger, 1993; Kemp & Klee, 1997). While DSS enjoys clinical popularity, the procedure is not without its limitations. The reliability of DSS scores using only the recommended 50-utterance sample has proven to be problematic (Johnson & Tomblin, 1975). In addition, DSS does not account for incomplete utterances and emerging forms in the scoring procedure.
Automated versions of DSS have been developed to facilitate more efficient grammatical analysis. As with most language sample analysis tools, DSS is time-consuming and requires clinician skill and training (Hux et al., 1993; Kemp & Klee, 1997). In order to decrease these time and resource demands, programs such as Computerized Language Analysis (CLAN; MacWhinney, 1991) and Computerized Profiling (CP; Long & Fey, 1993) were developed to perform automated DSS analysis. However, the accuracy of these programs is variable at best. Both CLAN and CP display low accuracy rates in certain grammatical categories (Boyce, 1995) and are unable to detect subtle nuances of DSS scoring such as correctness of use (e.g., pronoun gender agreement). In addition, there are elements of DSS that do not lend themselves to automation at all, including attempt marks and sentence points. The absence of these DSS features raises the question as to whether the analyses performed by existing programs can truly be termed DSS. In order to obtain a complete and accurate DSS analysis, the clinician must make corrections and additions to the generated data. Since DSS output from CLAN and CP requires manual correction, both programs can be classified as only “semi-automated” (Baker-Van den Goorbergh, 1994).
Current views maintain that fully automated programs (i.e. programs which do not require clinician assistance beyond the initial input of the transcript) are not yet practical (Baker-Van den Goorbergh, 1994; Long & Fey, 1995). However, this position is based on the practice of designing computer software to execute existing manual analysis procedures. The ability of computers to precisely replicate tools created for manual use is presently limited. Fully automated programs permit the user to input an uncoded transcript, and the software codes each utterance and computes the results (Long, 1991). Such software is well within the scope of current technology. To achieve acceptable levels of accuracy and efficiency, however, fully automated programs must represent independent indices designed specifically for automated analysis.
Clearly there is a need for an automated index that carries out the same function as DSS. The index should serve as more than a simple imitation of manual methods. Rather, such a program should accomplish the same goals as DSS but constitute a new, distinct instrument. Modifications to the prescribed procedures of manual DSS can be made to accommodate the constraints of automation, while maintaining the integral components of grammatical analysis. As with all independent measures, automated indices must be psychometrically evaluated to establish compliance with standards of acceptable clinical testing (American Psychological Association, 1985; Worthen, White, Fan, & Sudweeks, 1999). In addition, separate normative data must be collected for the index, independent of data compiled in the original DSS literature.
An analog of DSS grew out of initial attempts to refine existing versions of automated DSS. Recognizing that some elements of DSS couldn’t be automated (e.g. sentence points, attempt marks) and other elements were functionally unnecessary (e.g. using only complete utterances), Channell (2000) developed a new measure based on the principles of DSS but with modifications to the original procedure. The result is an independent index called Scoring Sentences Developmentally (SSD).
The present study examines how well SSD correlates with manual DSS and MLU. The analog was assessed to determine its ability to obtain a detailed, quantified, and scored evaluation of grammatical structures comparable to results obtained with manual DSS and MLU procedures. Such a comparison provides information regarding the effectiveness and value of the analog. The correlational analysis of this study represents only the first step in developing and evaluating a fully automated index of grammatical complexity. Future research is necessary to investigate the psychometric validity and reliability of the index and to establish an independent compilation of normative data.
Review of Literature
Standards for Evaluating Assessment Instruments
The use of norm-referenced and standardized tests is widespread in educational, psychological, and clinical settings. Criteria have been established to evaluate psychometric measures used in assessment procedures (American Psychological Association, 1985). Validity and reliability have been identified as the primary standards that must be met in all clinical tests before operational use. Validity refers to the appropriateness and usefulness of inferences drawn from a test. Construct validity focuses on the ability of the test to measure the characteristic of interest. Content validity demonstrates the degree to which individual items or components of the test represent the domain of content. Criterion-related validity refers to the relationship between test scores and some predetermined external criterion. Reliability is defined as the extent to which the test is free from errors of measurement. Four types of reliability are generally considered, including test-retest, parallel form, internal consistency, and interrater reliability (Worthen et al., 1999).
Psychometric standards of testing have been applied to tests assessing language disorders. McCauley and Swisher (1984a) asserted the importance of using appropriate norm-referenced tests to separate disordered from non-disordered language. Thirty norm-referenced language and articulation tests designed for use with preschool children were evaluated on the basis of 10 psychometric criteria. Results of the examination indicated that fewer than 20% of the reviewed tests met 5 of the 10 criteria and 50% of the tests met two or fewer criteria. Criteria requiring empirical evidence of validity and reliability were met least often, indicating that these tests failed to demonstrate many of the psychometric characteristics required of well-designed norm-referenced tests.
A companion article by McCauley and Swisher (1984b) acknowledged the flaws and misuses of norm-referenced tests while still asserting the value and necessity of such tests when used properly. Using a hypothetical client, the authors addressed four common errors associated with norm-referenced testing and provided guidelines to avoid potential problems. Although McCauley and Swisher maintained their support of norm-referenced testing, they conceded that the tendency for norm-referenced tests to provide incomplete or misleading information requires greater reliance on the use of language sample analysis and development of criterion-referenced tests.
Muma (1998) contended that McCauley and Swisher (1984a) misrepresented his views regarding the usefulness of psychometric testing in the problem – no problem issue. Muma reaffirmed the role of norm-referenced tests in identifying language disorders but criticized the heavy reliance on psychometric normative testing for overall language assessment. Citing construct validity as the crucial standard for any test, Muma stated that many tests widely used in clinical practice lack this type of validity. Further, Muma questioned the practice of using norm-referenced testing in which “contrived activities are imposed on an individual in a priori procedures” (p. 179) rather than allowing for descriptions of spontaneous intentional language within a natural context. Muma advocated the use of descriptive procedures, such as language sampling, to overcome this issue. Psychometric standards have traditionally not been applied to language sampling procedures since few procedures are norm-referenced and sample collection techniques are not standardized. Muma noted, however, that descriptive assessment is “well grounded on philosophical view and theoretical perspectives thereby having construct validity” (pp. 177-178), thus lending strong psychometric support to language sample analysis.
Language Sample Analysis
Language production in its many manifestations is the most seriously impaired process among children with language disorders (Miller, 1991). The clinical value of language sampling in the assessment of child language has long been established (Bloom & Lahey, 1978; Gallagher, 1983; Hux et al., 1993; Klee, 1985; Lee, 1974). The primary purposes of language sample analysis are to characterize the nature of a child’s linguistic system, both individually and in relation to same-age peers, and to develop and evaluate appropriate goals for intervention (Klee & Paul, 1981). A variety of analysis procedures and instruments have been developed. Menyuk (1964) broadly classified these approaches as descriptions of sentence length, examinations of sentence structure complexity, and proportions of usage of different sentence structures at various age levels. Miller (1981) differentiated procedures on the basis of whether they quantify structural and semantic development to evaluate developmental status of a child or identify structural or semantic problems within a child’s system.
Prevalence of Language Sampling
Muma, Pierce, and Muma (1983) surveyed the philosophical orientation and the assessment and intervention procedures advocated by speech-language pathology training programs. Open-response surveys were completed by 76 training programs recognized by the American Speech and Hearing Association. Of the 76 respondents, 71 reported using language sampling and analysis techniques. Thirty-seven respondents specifically mentioned the use of DSS. Results indicated that language sampling procedures were most frequently used with young children. Muma et al. concluded that the practices reported by speech-language pathology training programs reflect a recognition of the importance of language-based assessment and intervention.
Hux et al. (1993) examined the language sampling practices of school-based speech-language pathologists across nine states. The study included responses to 51 questions addressing the background, attitudes, and sampling and analysis procedures of 239 speech-language pathologists. Although time constraints, lack of skills, and diminished resources are common difficulties associated with language sampling, results of the survey revealed that respondents routinely used language sampling practices in assessment and treatment of school-aged children. The majority of respondents (60%) obtained samples of 51 to 100 utterances in length. Fifty-one percent of respondents reported collecting samples during one setting only. Respondents also showed a clear preference for non-standardized procedures of analysis. Respondents indicating a preference for standardized procedures identified DSS as the only method used with regularity. The majority of respondents judged language sampling as a reliable and useful means of distinguishing between students with normal and disordered language. Hux et al. reported that although 82% of respondents indicated language sampling was not mandated by local or state agencies, speech-language pathologists regularly implemented such practices as part of assessment. Hux et al. cited the infrequency of language sampling for adolescent, culturally diverse, or mildly impaired populations, and the tendency of clinicians to rely on self-designed methods rather than standardized procedures with proven validity and reliability, as areas of concern.
Kemp and Klee (1997) followed up with a similar survey to assess the generalizability of the Hux et al. (1993) findings and to judge the extent to which changes in the workplace had impacted clinical use of language sampling. Kemp and Klee surveyed 253 speech-language pathologists employed in preschool positions across 45 states regarding language sampling practices. Eighty-five percent of respondents reported using language sample analysis in the assessment of language impairment in preschool children. Of clinicians using language sample analysis, 92% reported using it for diagnosis, 44% for screening, 77% for intervention, and 64% for post-intervention assessment. Clinicians not using language sampling reported lack of time (86%), lack of computer resources (40%), lack of training and expertise (16% each), and financial constraints (15%) as reasons for not using analysis procedures. Almost half of the respondents preferred collecting samples based on the number of utterances rather than length of time. Nearly half of the respondents also indicated a preference for non-standardized procedures of analysis. Of the standardized procedures noted, DSS (35%) and Lahey’s (1988) Content/Form/Use (29%) were most often cited. Only 8% reported using a computer program for language sample analysis. Kemp and Klee observed that most clinicians endorsed language sample analysis as important in the assessment process but found that the time, effort, and skills required often make the practice difficult. Kemp and Klee concluded that clinical practice must find ways to accommodate the demands placed on clinicians by developing assistive technology to aid in the transcription and analysis of language samples.
Simple Count Analyses
Type/Token Ratio. Simple frequency counts have been used to quantify semantic aspects of language such as lexical diversity (Miller, 1981; Richards, 1986). Templin (1957) studied 480 children and devised the Type/Token Ratio (TTR) as a means of weighing the number of different words produced in a 50-utterance sample against the total number of words produced. Templin found a ratio of 1:2 (.50) to be consistent across age, sex, and socio-economic status. Miller (1981) viewed TTR as a valuable clinical tool for baseline assessment due to its consistency. Traditionally, a low TTR has been used as a warning for possible restrictions on the range of vocabulary used by a child in his or her syntactic repertoire (Fletcher, 1985). Richards (1987) argued, however, that TTR reveals more about the number of tokens in the sample than the actual range of vocabulary usage. He suggested that without adequate sample sizes and established norms, the clinical use of TTR is unreliable. In addition, Bennett-Kastor (1988) noted that TTR is sensitive to context constraints and should not be used as the sole measure.
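The TTR computation itself is a simple ratio of distinct words (types) to total words (tokens). The sketch below illustrates the calculation only; the whitespace tokenizer is a simplifying assumption standing in for the transcript-specific conventions a clinician would apply.

```python
def type_token_ratio(utterances):
    """Compute the Type/Token Ratio over a list of utterance strings.

    Types = distinct words, tokens = total words. A naive
    whitespace tokenizer is assumed; real transcripts require
    conventions for contractions, fillers, and unintelligible words.
    """
    words = [w.lower() for u in utterances for w in u.split()]
    if not words:
        return 0.0
    return len(set(words)) / len(words)

sample = ["the dog ran", "the dog barked at the cat"]
# 6 types ("the", "dog", "ran", "barked", "at", "cat") over 9 tokens
print(round(type_token_ratio(sample), 2))  # 0.67
```

A value near Templin's .50 would be expected for a typical 50-utterance sample; markedly lower values flag possible vocabulary restriction, subject to the sample-size caveats Richards raised.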
Mean Length of Utterance. The use of MLU as a measure of syntactic complexity in child language is a long-standing practice. Brown (1973) popularized the use of MLU based on morpheme count as a simple index of grammatical development. He asserted that as a child’s grammar increases in complexity through the acquisition of additional morphemes and structures, there is a corresponding increase in utterance length. Brown identified 14 categories of grammatical morphemes and established a set of guidelines for counting the number of morphemes in each utterance. Brown described five stages of development defined by intervals on the continuum of MLU scores, contending that specific aspects of syntactic development correlate with the number of morphemes used. Brown found that MLU was strongly correlated with chronological age and proposed that it was predictive of the acquisition of morphemes assigned to each stage of development.
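Once each utterance has been segmented into morphemes per Brown's counting guidelines, the MLU calculation reduces to total morphemes divided by number of utterances. The sketch below assumes pre-segmented input (morphemes separated by spaces, bound morphemes marked with a leading hyphen); that segmentation step, not the arithmetic, is where Brown's 14 categories and counting rules actually apply.

```python
def mean_length_of_utterance(segmented_utterances):
    """MLU = total morphemes / number of utterances.

    Each utterance is assumed to be pre-segmented according to
    Brown's (1973) guidelines, with morphemes separated by spaces,
    e.g. "dog -s run -ing" for "dogs running" (4 morphemes).
    """
    if not segmented_utterances:
        return 0.0
    total_morphemes = sum(len(u.split()) for u in segmented_utterances)
    return total_morphemes / len(segmented_utterances)

sample = [
    "dog -s run -ing",   # 4 morphemes
    "I want cookie",     # 3 morphemes
    "he go -ed home",    # 4 morphemes
]
print(round(mean_length_of_utterance(sample), 2))  # (4 + 3 + 4) / 3 = 3.67
```

An MLU of 3.67 would fall past Brown's Stage II boundary, which is relevant to the validity criticisms discussed below: the count says nothing about which structures produced the added length.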
Subsequent studies substantiated the high positive correlation between chronological age and MLU (de Villiers & de Villiers, 1973; Miller & Chapman, 1981; Miller, 1991). The correlation between MLU and the acquisition of grammatical morphemes has also been verified (de Villiers & de Villiers, 1973; Klee & Fitzgerald, 1985; Rondal et al., 1986). However, several limitations and problems with MLU have also been identified. Chabon, Kent-Udolf, and Egolf (1982) found that MLU scores were unreliable for children beyond Brown’s Stage V of development. Other findings challenge the validity of MLU beyond Stage II, at values of approximately 2.0 to 3.0 (Bennett-Kastor, 1988; Klee et al., 1989; Rondal et al., 1986).
Perhaps even more significant is the question of whether or not MLU is a valid measure of syntactic complexity at all. Klee and Fitzgerald (1985) examined the MLU scores and grammatical complexity of language samples obtained from 18 children. Although the acquisition of grammatical morphemes did correlate with increases in MLU, changes in syntactic structure and diversity were not reflected. Klee and Fitzgerald concluded that MLU is not a good indicator of grammatical development in terms of syntactic construction. Perhaps MLU is not a sensitive measure of any linguistic construct other than utterance length itself (Klee et al., 1989). Miller (1991) also acknowledged that older children could increase the complexity of the system without increasing utterance length.
Language Assessment, Remediation, and Screening Procedure (LARSP)
Crystal, Fletcher, and Garman (1989) developed a qualitative procedure for grammatical analysis called LARSP. The descriptive framework of LARSP is based on seven stages of grammatical acquisition through which children pass. A 30-minute language sample is collected and analyzed on the word, phrase, clause, and sentence level. The frequency count of various structures at each level is tallied on a profile chart. A pattern of syntax is established by comparing several samples in order to establish an expected pattern (Crystal, 1982). Klee and Paul (1981) noted that LARSP yields an age score by giving some indication of acceptable variation around a general developmental stage. However, the measure has not been standardized and provides only raw data without conventions for summarization and interpretation.
Index of Productive Syntax (IPSyn)
The Index of Productive Syntax (IPSyn) was developed by Scarborough (1990) as an easily obtained summary scale of grammatical complexity to be used for the study of individual differences in language acquisition. A primary goal of the index is to provide numerical scores suitable for statistical analysis and standardization. IPSyn measures the emergence of syntactic and morphological structures in productive language. Scarborough developed IPSyn using 75 samples obtained longitudinally from 15 children. The first 100 successive, intelligible utterances in each sample were coded for 56 grammatical forms to develop the IPSyn score sheet. Data from the score sheet were used to derive a final IPSyn score. A comparison of mean IPSyn and MLU values at each age revealed that IPSyn is a reliable age-sensitive summary of grammatical complexity. Scarborough cautioned, however, that the index does not provide detailed diagnostic information about a child’s mastery of specific structures and rules. Scarborough concluded that IPSyn is most suitable as a tool for comparing or matching subjects in research groups.
IPSyn has been applied in a variety of uses by researchers. In a comparative study involving autistic, Down syndrome, and normal children, Tager-Flusberg and Calkins (1990) used IPSyn to investigate whether imitation is more advanced than spontaneous language. IPSyn was used to evaluate the grammatical content of the imitative and spontaneous corpora. An additional study of autistic and Down syndrome children used IPSyn as one of the comparative measures of language acquisition and development (Tager-Flusberg et al., 1990). Scarborough, Rescorla, Tager-Flusberg, Fowler, and Sudhalter (1991) examined the relationship between utterance length and grammatical complexity in normal and language-disordered children. IPSyn was used as the measure of syntactic and morphological proficiency and correlated to MLU scores for each group. Scarborough et al. found excellent agreement between IPSyn and MLU scores for children from 2 to 4 years old.