Marco Baroni


bound verbal form -main, is associated with any semantic content.

However, several experimental studies have shown that speakers do treat some of

these semantically opaque forms as morphologically complex. For example, the following

studies have presented evidence from a variety of experimental tasks that speakers are

aware of the morphological structure of words that are (partially or completely)


semantically opaque: Emmorey (1989), Bentin and Feldman (1990), Feldman and Stotko

(unpublished -- quoted in Stolz and Feldman 1995), Roelofs and Baayen (1997), Baroni

(2001) and Baayen, Schreuder and Burani (2001).

Now, if it turned out that there is a convergence between the parses assigned by the

distribution-driven learner and the speakers’ intuitions about semantically opaque forms,

then this would constitute a stronger form of evidence in favor of the hypothesis that

speakers used distributional cues to assign morphological structure to words.

For example, if it turned out that both the automated learner and the speakers treated

recitation as morphologically complex (re+citation), but remain as monomorphemic, then it

would be reasonable to conclude that speakers are sensitive to distributional cues similar to

the ones implemented in the automated learner, since they could not have assigned a

morphological structure to recitation on the basis of its meaning (and, also, it is unlikely

that they could have used syntactic or phonological cues to distinguish recitation from remain).


Indeed, the results of the second survey I conducted show that, even when only

semantically opaque words are considered, there is a significant correlation between the

parses assigned by the learner and speakers’ intuitions. Thus, this study provides strong

support for the claim that humans use distributional cues in morpheme discovery.

Notice that this type of evidence in favor of distribution-driven learning is not

available in other domains. For example, even if it has been shown that distributional cues

can be very effective for segmenting utterances into words (Brent and Cartwright 1996),

there is no clear equivalent to semantically opaque morphemes in the domain of syntactic units.


In particular, idiomatic phrases such as kick the bucket are not the equivalent of

semantically opaque morphologically complex forms such as permit. First, idiomatic

phrases also have a literal, semantically transparent meaning, and it is unlikely that speakers

are not aware of this meaning. Second, words occurring in idioms also occur in nonidiomatic

sentences. This is not the case for a bound stem like -mit, which occurs only in

opaque forms.

5.1 Constructing the semantically opaque word list

Following a standard practice in morphological processing studies (see, for example,

Marslen-Wilson, Tyler, Waksler and Older 1994), I first conducted a survey in which three

judges were asked to rate a set of forms from the DDPL output for semantic transparency,

and then I selected forms that received a low average semantic transparency rating to

construct the survey corpus.

The DDPL output contains a total of 3,651 forms beginning with strings

corresponding to one of the prefixes postulated by the model. Of these, 382 are actually

treated by the model as prefixed. Clearly, it was not feasible to ask the semantic

transparency judges to assign a rating to all 3,651 forms. Thus, the corpus presented to the

judges was constructed in the following way.

First, I made a preliminary division of the 382 words treated as prefixed by DDPL

into two categories: words that I judged to be obviously prefixed (productively formed,

semantically transparent), and words that may or may not be prefixed (this preliminary list

included a wide range of types, from obviously non-prefixed words such as adage to only

slightly lexicalized forms such as inhumane). The first list was composed of 101 words,

the second list of 181 words. I randomly selected 10 words from the first list, and I kept all

the 181 words from the second list.

From the list of the remaining 3,269 words treated as simple by DDPL, I then

randomly selected 10 more words that were obviously prefixed and completely transparent,

and 200 words that may or may not be prefixed.
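The selection step described above can be sketched with the standard library; the word lists below are placeholders standing in for the actual DDPL output, not real data.

```python
import random

random.seed(42)  # fixed seed only so this sketch is reproducible

# Placeholder word lists; the real lists came from the DDPL output.
obviously_prefixed = [f"transparent_{i}" for i in range(101)]
ambiguous = [f"ambiguous_{i}" for i in range(3269)]

controls = random.sample(obviously_prefixed, 10)   # transparent controls
candidates = random.sample(ambiguous, 200)         # may-or-may-not-be-prefixed

print(len(controls), len(candidates))
```

`random.sample` draws without replacement, which matches the requirement that each selected word appear only once in the survey corpus.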


The corpus presented to the three judges was composed of the 20 completely

transparent words and 381 “ambiguous” words selected in this way. The 20 completely

transparent words served both as a control and, more importantly, they were added in order

to try to minimize the risk that judges would assign high ratings to some semantically

opaque forms merely in order to make use of the whole range of the rating scale.

The judges were two graduate students and one postdoctoral fellow in the UCLA

Linguistics Department, and were selected because of their strong background in

morphology and morphological processing. I selected expert judges because I wanted to

make sure that they would understand the task, and in particular that they would understand

the distinction between rating forms on the basis of semantic transparency vs.

morphological complexity.

Judges were asked to rate the words in the corpus on a scale from 1 to 5, assigning

1 to completely opaque words and 5 to completely transparent words.

A series of correlation analyses showed that the judges’ ratings were highly

correlated (both Pearson and Spearman correlation coefficients in all pairwise comparisons

were higher than .7). Thus, I computed the average cross-judge rating for each word in the corpus.
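This agreement check and the averaging step can be sketched in pure Python; the ratings below are invented for illustration (the real survey covered 401 words), and the correlation functions are re-implemented here rather than taken from a statistics package.

```python
from statistics import mean, pstdev

def pearson(x, y):
    """Pearson correlation of two equal-length rating vectors."""
    mx, my = mean(x), mean(y)
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (pstdev(x) * pstdev(y))

def ranks(x):
    """Average-rank transform (ties share their mean rank)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1  # ranks are 1-based
        i = j + 1
    return r

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Invented ratings (1 = opaque, 5 = transparent) from three judges
# for five words.
judges = {
    "J1": [1, 2, 5, 1, 4],
    "J2": [1, 1, 5, 2, 5],
    "J3": [2, 1, 4, 1, 5],
}

# Pairwise agreement check, then the per-word cross-judge average.
names = sorted(judges)
pairwise = {(a, b): (pearson(judges[a], judges[b]), spearman(judges[a], judges[b]))
            for a in names for b in names if a < b}
average_rating = [mean(vals) for vals in zip(*(judges[n] for n in names))]
```

Only once the pairwise coefficients come out high does averaging across judges make sense as a summary of their shared judgment.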


As expected, the 20 transparent words received very high ratings (the mean rating

for this set of words was 4.89). Of the remaining forms, 97 out of the 181 words treated as

prefixed by DDPL received an average rating lower than 2.5; 183 out of the 200 words

treated as simple by DDPL received an average rating lower than 2.5. Notice the

asymmetry between the two sets: just a little more than half of the complex-for-DDPL

words that were pre-selected as potentially opaque are indeed semantically opaque, whereas

90% of the simple-for-DDPL words that were pre-selected as potentially opaque are indeed

semantically opaque. This suggests that, although DDPL did not have access to semantic

information, the model did show a preference for treating semantically opaque words as

simple. This is good from the point of view of a general assessment of the DDPL

performance, but it made it harder to design the survey presented here.
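The asymmetry can be checked directly from the counts reported above:

```python
# Counts reported above: words whose average judge rating fell below
# the 2.5 semantic-opacity threshold, per pre-selected set.
complex_opaque, complex_total = 97, 181    # complex-for-DDPL set
simple_opaque, simple_total = 183, 200     # simple-for-DDPL set

complex_share = complex_opaque / complex_total   # a little more than half
simple_share = simple_opaque / simple_total      # about 90%

print(f"complex-for-DDPL opaque share: {complex_share:.2f}")
print(f"simple-for-DDPL opaque share: {simple_share:.2f}")
```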

The corpus for the second survey was thus composed of the 97 complex-for-DDPL

forms that had a semantic rating lower than 2.5, and 97 randomly selected words from the

183 simple-for-DDPL words with a semantic rating lower than 2.5. I decided not to add a

control set of semantically transparent forms, as I wanted to maximize the participants’

sensitivity to differences in morphological status among opaque words. If some

semantically transparent words had been inserted, speakers would have probably reserved

the high values of the rating scale for such forms, “squeezing” the ratings of semantically

opaque words within a narrow range at the bottom of the scale.

The average semantic rating across the complex-for-DDPL forms in this list was

1.54; the average rating across the simple-for-DDPL forms in this list was 1.21. One of the

judges was also asked to rate the 194 forms in the corpus by assigning ratings on a 5 point

scale on the sole basis of the degree of semantic transparency of the potential prefix of each

form. The average prefix transparency rating across forms treated as complex by DDPL

was 1.86; the average prefix transparency rating across forms treated as simple by DDPL

was 1.46. Thus, while there is a noticeable and slightly worrisome difference in the degree

of prefix transparency between the two sets, it seems safe to state not only that the forms in

both sets are semantically opaque when considered as wholes, but also that the potential

prefixes occurring in them tend to be opaque.

The complex-for-DDPL set contained 17 distinct prefixes, the simple-for-DDPL set

contained 16 distinct prefixes. The two sets shared 14 prefixes. The average length of the

words in the complex-for-DDPL set was 9 characters, the average length of their potential

stems was 6.4 characters. The average length of the words in the simple-for-DDPL set was

8.6 characters, the average length of their potential stems was 6.3 characters. The average

frequency of the words in the complex-for-DDPL set was 3.6, the average frequency of

their potential stems was 263.4. The average frequency of words in the simple set was

21.1, the average frequency of their potential stems was 102.9. One word in the complex


set had a potential stem that did not occur as an independent string in the corpus; 77 words

in the simple set had potential stems that did not occur as independent strings.

5.2 Methodology and data collection

The same methodology and data collection procedure described in section 4.2.1 above was

followed in the second survey.

A group of eight English native speakers, all graduate or undergraduate students or

post-doctoral fellows in linguistics, took part in the survey. None of them had participated

in the previous survey.

5.3 Results and discussion

Pairwise Pearson and Spearman correlation coefficients were computed for the ratings of

all pairs of participants. The patterns of three participants were poorly correlated with those

of the other participants and with each other (for each of these three participants, the

correlation coefficient between her/his ratings and those of a majority of other speakers was

lower than .4). Thus, their data were discarded.

As the ratings of the remaining participants were highly correlated (all pairwise

Pearson and Spearman coefficients were higher than .5), the per-word average rating value

across them was computed, and the resulting variable was compared to the parses assigned

by DDPL to the same words in a one-way ANOVA in which the average ratings were

grouped on the basis of the DDPL parses (simple vs. complex). The results of the ANOVA

indicate that, in this case as well, the difference between ratings assigned by native

speakers to words treated as simple vs. complex by DDPL is highly significant (F(1,192) =

49.2, p < .0001).
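The one-way ANOVA computation itself can be sketched in a few lines. The ratings below are invented for illustration; the real comparison involved 194 words and yielded F(1,192) = 49.2.

```python
def one_way_anova_F(*groups):
    """F statistic for a one-way ANOVA: between-group mean square
    divided by within-group mean square."""
    all_vals = [v for g in groups for v in g]
    grand_mean = sum(all_vals) / len(all_vals)
    group_means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_vals) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Invented per-word average ratings, grouped by DDPL parse.
complex_for_ddpl = [4.2, 3.6, 3.8, 4.0, 3.4]
simple_for_ddpl = [2.6, 3.0, 2.4, 3.2, 2.8]

F = one_way_anova_F(complex_for_ddpl, simple_for_ddpl)
print(F)
```

With only two groups, the F statistic is equivalent to the square of a two-sample t statistic, so the ANOVA here is simply a test of the difference between the two group means.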

If the participants in the survey had mostly relied on semantic cues when assigning

ratings to the words in the list, they should have assigned uniformly low ratings to all

words. However, this was not the case: as shown by the correlation between the average

ratings and DDPL parses, in general speakers assigned higher ratings to words that DDPL

treated as complex, lower ratings to words that DDPL treated as simple. The average mean

rating across words that were complex for DDPL was 3.78; the average mean rating across

words that were simple for DDPL was 2.81.

The most plausible explanation for this asymmetry is that the way in which

speakers represent potentially complex words is affected by distributional factors such as

the ones implemented in DDPL.8 In turn, a plausible hypothesis about why such

distributional factors have an effect on speakers’ morphological intuitions is that speakers

relied on distributional cues during morpheme discovery.

On the other hand, adult speakers are obviously also sensitive to semantic cues

when rating words for morphological complexity. As all the words in the survey corpus

were semantically opaque, it is not surprising that the results of this second survey are less

clear-cut than those of the previous survey (as shown by the fact that this time there is less

of a difference between the average mean ratings assigned to DDPL simple and complex words).


I suspect that semantics influenced the results both directly and indirectly. First, the

morphological representations of adult speakers are almost certainly affected by the

semantic structure of words. Thus, while speakers seem to distinguish words that are

complex on purely distributional grounds from simple words, it is likely that such words

occupy a middle ground, in terms of morphological complexity, between simple words and

semantically transparent words (see Seidenberg and Gonnerman 2000 for similar

considerations). Indeed, if no correlation between DDPL and the speakers had emerged,

we could not have been sure that the negative result was due to the fact that speakers do not


rely on distributional cues such as the ones employed by DDPL during morpheme

discovery. The negative result could have instead been due to the fact that, once speakers

acquire sufficient evidence about the semantic properties associated with morphemes, they

revise their morphological representation of forms, and they change (from complex to

simple) the representation of those forms that were originally treated as complex on

distributional grounds, but whose complex representation is not supported by semantic cues.


Moreover, as a consequence of the fact that the distinction between semantically

opaque but complex forms and simple forms is probably not as clear-cut as the distinction

between complex, semantically transparent words and simple words, the participants in the second

survey had to provide ratings based on more subtle judgments, requiring more

sophisticated metalinguistic introspection skills. Thus, as this was a harder task, it is likely

that the participants in the second survey had more difficulty with it than the participants in

the first survey, and that the less marked difference between sets is in part due to “noise” in

the ratings.

However, beyond these considerations, what is truly important from our point of

view is that there is a high correlation between DDPL parses and the speakers’ ratings of

semantically opaque words. Thus, the survey results provide support for the hypothesis

that humans are sensitive to distributional cues to morphological constituency such as the

ones used by DDPL.


The results of the simulation reported above provide support for the general hypothesis that

distributional information of the kind encoded in the DDPL model can in principle be

helpful in morpheme discovery. Moreover, the convergence between the DDPL parses and

speakers’ ratings of a set of semantically opaque words provides some preliminary support

for the hypothesis that humans rely on distributional cues such as the ones employed by the

automated learner when assigning morphological parses to words. A plausible explanation

of this finding is that speakers are sensitive to such cues because they employed them in

order to assign morphological parses during morpheme discovery.

Moreover, these results are also potentially relevant to the theory of morphological

processing, in that they could provide the basis for a (partial) explanation of the fact that, as

various psycholinguistic studies have shown, speakers treat some semantically opaque

words as morphologically complex: They do so because, during morpheme discovery, they

used distributional schemes to search for the morphemes of their language, and these

schemes led them to analyze some words as morphologically complex even in the absence of

semantic cues supporting the complex analysis.

Clearly, while I believe that the results presented here are encouraging, many

questions are still open, and much more research has to be done before we can reach safe

conclusions about the nature and role of distributional evidence in morpheme discovery.

The DDPL model could be improved and extended in various ways. Obviously, the

model should be extended to suffixation and other types of affixation. Furthermore,

algorithms in which the distributional information used by DDPL is integrated with other

types of information (such as syntactic category information) could be developed. Also,

alternative lexicon generation algorithms, exploring a larger (or, better, more

morphologically sensible) area of the hypothesis space, should be investigated.

The reviewers pointed out recent work by Jennifer Hay (see, e.g., Hay 2000)

suggesting that what matters in morphological processing is not the absolute frequency of

derived forms, but the relative frequency of derived forms and their bases. In short, if a

potentially complex form is more frequent than its potential base, the form is more likely to

be parsed as a whole, whereas, if the base is more frequent than the complex form, then the

complex form is more likely to be decomposed. In this setting, the absolute frequency


heuristic used by DDPL can be seen as an approximation of a more realistic

relative-frequency-based heuristic. In future research, it will be extremely interesting to test a

revised version of the model that takes relative frequency effects into account.9
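As a sketch, the relative-frequency idea amounts to a simple comparison; the function and the corpus frequencies below are hypothetical illustrations, not part of DDPL or of Hay's model.

```python
from math import log

def decomposition_bias(derived_freq, base_freq):
    """Toy version of the relative-frequency heuristic: positive when
    the base is more frequent than the derived form (favoring a
    decomposed parse), negative when the whole form is more frequent
    (favoring whole-word access)."""
    return log(base_freq / derived_freq)

# Hypothetical corpus frequencies.
print(decomposition_bias(derived_freq=3, base_freq=263) > 0)    # decomposed parse likely
print(decomposition_bias(derived_freq=120, base_freq=15) > 0)   # whole-word access likely
```

A graded score like this (rather than a hard comparison) would also fit naturally with the continuous-complexity variant of DDPL discussed below.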

Finally, a reviewer also suggested that it would be interesting to develop a version of

DDPL that returns values on a continuous complexity scale, rather than binary complex vs.

simple parses. This would allow a more direct comparison with human ratings, and it

would correspond, perhaps, to a more realistic model of human morphological processing

(see, e.g., Seidenberg and Gonnerman 2000 and Baroni 2001 for arguments in favor of

gradient morphological representations).

From the point of view of testing the model, we should first of all test DDPL in

simulations with other English corpora, both in orthographic and phonetic transcriptions.

Furthermore, DDPL should be tested using input corpora from other languages.

In terms of collecting empirical evidence, we should first of all collect data from

more speakers, possibly re-designing the survey task in order to make it feasible for

speakers with no linguistics background. Furthermore, it would be interesting to collect

data using other methods (for example, using a morphological priming paradigm), to make

sure that the results we obtained are task-independent. Obviously, it would also be

important to collect developmental data from children, to have a more concrete idea of when

and how human learners perform morpheme discovery.

Last but not least, a more sophisticated analysis of the empirical results obtained

should try to assess whether all the cues exploited by DDPL are relevant in predicting the

response patterns of the speakers, and/or what is their relative importance as predictors.

While all these lines of research should be pursued in the near future, and I am sure

that readers will raise other important issues that were not dealt with here, I believe that the

current results are already shedding some (weak) light on the role of distributional cues in

the domain of morpheme discovery.



1 I would like to thank Donca Steriade, Adam Albright, Lynne Bernstein, Harald Baayen, Amy Schafer, Kie

Zuraw, the reviewers for the Yearbook of Morphology and, especially, Ed Stabler, Carson Schütze and

Bruce Hayes for help and advice. Of course, none of them is responsible for any of the claims I make. A

more detailed discussion of several of the issues discussed here can be found in Baroni (2000b), which is

downloadable from

2 In this study, I illustrate my points through examples presented in orthographic transcription. The same

points could also have been illustrated by the same examples (or similar ones) presented in phonetic

transcription. A preliminary experiment with a corpus of phonetically transcribed words suggests that,

because of the different distributional properties of specific morphemes in spoken and written language, the

morpheme-discovery algorithm presented here performs in a similar but slightly different way when

presented with orthographic vs. phonetically transcribed input. See Baroni (2000b:4.6) for discussion.

3 See also Brent (1993) and Brent, Murthy and Lundberg (1995). The model of Brent and colleagues

represents, as far as I know, the first attempt to apply the Minimum Description Length principle to the

problem of morpheme discovery.

4 In Baroni (2000b), I motivate and defend the assumptions about morpheme discovery that went into the

design of the algorithm described here, i.e., that it makes sense to model morpheme discovery as a separate

task from utterance segmentation, that it makes sense to model prefix discovery as a separate subtask within

morpheme discovery, and that it makes sense to consider an approach to the task in which only binary

(prefix+stem) parses of words are evaluated.

5 Baroni (2000b) discusses further how the lexicon + encoding criterion reflects morphological heuristics

(including, among other things, a discussion of how the heuristics interact and of how such interaction

ensures that the “frequent substrings are likely to be morphemes” heuristic is interpreted in terms of type

rather than token frequency).

6 All else being equal, in the data compression scheme proposed here longer substrings are more likely to

constitute independent lexical entries than shorter substrings. For example, at the same frequency of

occurrence in the input corpus, a substring like dis- is more likely to be treated as an independent entry than

a substring like a-. Again, I would like to claim that this also makes sense from the point of view of

morpheme discovery.
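A toy cost model illustrates why longer substrings win at equal frequency; the constants here are illustrative stand-ins, not DDPL's actual encoding scheme.

```python
def entry_savings(substring, freq, pointer_cost=2.0, entry_overhead=1.0):
    """Toy description-length gain (in 'characters') from storing a
    substring as a lexical entry instead of spelling it out at every
    occurrence."""
    inline_cost = freq * len(substring)  # no entry: spell out each time
    with_entry = len(substring) + entry_overhead + freq * pointer_cost
    return inline_cost - with_entry

# At the same frequency, the longer candidate pays for its entry sooner.
print(entry_savings("dis", freq=50))   # positive: worth a lexical entry
print(entry_savings("a", freq=50))     # negative: cheaper spelled out
```

The per-occurrence saving is the difference between the substring's length and the pointer cost, so a one-character string can never pay for its entry under this scheme, whereas dis- gains with every occurrence.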

7 To keep things simple, I presented here an example in which stems occur elsewhere in the corpus as

independent words -- i.e., they are free stems. However, the same pattern takes place even if the relevant

stems never occur in independent words, but are the product of the parse of other prefixed forms -- i.e., they

are bound stems.

8 See Baroni (2000b:4.5.3) for a post-hoc analysis that seems to rule out the possibility that the

asymmetry can be explained by phonological cues.

9 Interestingly, the earlier model presented in Baroni 2000a did take relative frequency effects into account,

at least to a certain extent, by assigning shorter indices to more frequent lexical entries, thus making the

likelihood that a form will be parsed as complex dependent not only on the frequency of the form itself, but

also on the frequency of its potential base.


References

Baayen, Harald. 1994. "Productivity in language production". Language and Cognitive

Processes 9, 447-469.

Baayen, Harald; and Lieber, Rochelle. 1991. "Productivity and English derivation: A

corpus-based study". Linguistics 29, 801-843.

Baayen, Harald; Schreuder, Robert; and Burani, Cristina. 1997. "Parsing and semantic

opacity". In Morphology and the mental lexicon, E. Assink and D. Sandra (eds),

Dordrecht: Kluwer. In press.

Baroni, Marco. 2000a. "Using distributional information to discover morphemes: A

distribution-driven prefix learner". Paper presented at the LSA Meeting, Chicago.

Baroni, Marco. 2000b. Distributional cues in morpheme discovery: A computational

model and empirical evidence. UCLA dissertation.

Baroni, Marco. 2001. "The representation of prefixed forms in the Italian lexicon:

Evidence from the distribution of intervocalic [s] and [z] in northern Italian".

Yearbook of Morphology 1999, 121-152.

Bentin, Shlomo; and Feldman, Laurie. 1990. "The contribution of morphological and

semantic relatedness to repetition priming at short and long lags: Evidence from

Hebrew". Quarterly Journal of Experimental Psychology 42A, 693-711.

Brent, Michael. 1993. "Minimal generative explanations: A middle ground between

neurons and triggers". Proceedings of the 15th Annual Conference of the

Cognitive Science Society, 28-36.

Brent, Michael; and Cartwright, Timothy. 1996. "Distributional regularity and

phonotactic constraints are useful for segmentation". Cognition 61, 93-125.

Brent, Michael; Murthy, Sreerama; and Lundberg, Andrew. 1995. "Discovering

morphemic suffixes: A case study in minimum description length induction".

Presented at the Fifth International Workshop on AI and Statistics.

Emmorey, Karen. 1989. "Auditory morphological priming in the lexicon". Language and

Cognitive Processes 4, 73-92.

Feldman, Laurie (ed). 1995. Morphological aspects of language processing. Hillsdale: Erlbaum.


Goldsmith, John. 2001. "Unsupervised learning of the morphology of a natural

language". Computational Linguistics 27, 153-198.

Harris, Zellig. 1955. "From phoneme to morpheme". Language 31, 190-222.

Hay, Jennifer. 2000. Causes and consequences of word structure. Northwestern

University dissertation.

Kucera, Henry; and Francis, Nelson. 1967. Computational analysis of present-day

American English. Providence: Brown University Press.

Marchand, Hans. 1969. The categories and types of present-day English word-formation:

A synchronic-diachronic approach. Munich: Beck.

Marslen-Wilson, William; Tyler, Lorraine; Waksler, Rachelle; and Older, Lianne. 1994.

"Morphology and meaning in the English mental lexicon". Psychological Review

101, 3-33.

Maxwell, Michael (ed). 2002. Morphological and phonological learning: Proceedings of

the sixth ACL-SIGPHON meeting. Philadelphia: ACL.

Quirk, Randolph; Greenbaum, Sidney; Leech, Geoffrey; and Svartvik, Jan. 1985. A

comprehensive grammar of the English language. London: Longman.

Redington, Martin; and Chater, Nick. 1998. "Connectionist and statistical approaches to

language acquisition: A distributional perspective". Language and Cognitive

Processes 13, 129-191.

Rissanen, Jorma. 1978. "Modeling by shortest data description". Automatica 14, 456-471.

Roelofs, Ardi; and Baayen, Harald. 2001. "Morphology by itself in planning the

production of spoken words". Psychonomic bulletin and review. In press.

Schreuder, Robert; and Baayen, Harald. 1994. "Prefix stripping re-revisited". Journal of

Memory and Language 33, 357-375.

Schreuder, Robert; and Baayen, Harald. 1995. "Modeling morphological processing". In

Feldman (1995), 131-154.

Seidenberg, Mark; and Gonnerman, Laura. 2000. "Explaining derivational morphology as

the convergence of codes". Trends in Cognitive Sciences 4, 353-361.

Seitz, Philip; Bernstein, Lynne; Auer, Edward; and MacEachern, Margaret. 1998. The

PHLEX Database. Los Angeles: House Ear Institute.

Smith, Philip. 1988. "How to conduct experiments with morphologically complex

words". Linguistics 26, 699-714.

Snover, Matthew; and Brent, Michael. 2001. "A Bayesian model for morpheme and

paradigm identification". Proceedings of ACL 39, 482-490.

Stolz, Jennifer; and Feldman, Laurie. 1995. "The role of orthographic and semantic

transparency of the base morpheme in morphological processing". In Feldman

(1995), 109-129.

Wurm, Lee. 1997. "Auditory processing of prefixed English words is both continuous

and decompositional". Journal of Memory and Language 37, 438-461.