Marco Baroni


Distribution-driven morpheme discovery:

A computational/experimental study


SSLMIT (University of Bologna)

Corso della Repubblica 136

47100 Forlì (FC), Italy

October 2002


In an early stage of morphological acquisition, children must discover which

strings correspond to affixes of their language, and which of the words

containing those strings are actually affixed. For example, a child acquiring

English must be able to discover that the word-initial string re- is a prefix, but

also that the word remake is prefixed, whereas the word retail, probably, is

not, even though it begins with re-. In this study, I present a computational

model of how the task of morpheme (in particular, prefix) discovery could be

performed on the basis of distributional cues (cues based on the co-occurrence

patterns, frequency and length of words and their substrings in the input). The

results of a simulation using this model with an input corpus of English words

show that distributional evidence could in principle be very helpful to learners

having to perform the task of morpheme discovery. Moreover, I show that the

morphological parses assigned by the distribution-driven model to a set of

potentially prefixed but semantically opaque words are correlated with

morphological complexity ratings assigned to the same words by native

English speakers. I argue that this convergence between the model and the

speakers, in a domain in which speakers cannot rely on semantic cues,

constitutes evidence that humans do rely on distributional cues similar to the

ones exploited by my model, when assigning morphological structure to words.

1. Introduction

During the process of language acquisition, learners must discover which strings constitute

the affixes of their language and which words of the language can be decomposed into

affixes and other components. These are prerequisites to morphological acquisition.

Ultimately, a learner acquiring a language must discover the syntactic and semantic

properties associated with each affix of the language, in order to be able to produce and

understand new words. For example, a learner acquiring English must discover that re- is a

prefix that attaches to verbs to create other verbs with an iterative meaning.

However, in order to learn the morphological properties of an affix, learners must

first of all notice the existence of that affix. Moreover, in order to discover the linguistic


properties associated with the affix, the learner must inspect the semantic, syntactic and

morphological characteristics of a set of words containing that affix.

For example, in order to discover the properties of the prefix re-, English learners

must first of all, of course, notice that the string re- is a prefix. Moreover, the learners must

collect and analyze a number of words containing the prefix re- (redo, rename, remake...),2

in order to extract the correct generalizations about this prefix.

However, not all the words containing a string identical to an affix actually contain

that affix. In order to discover the correct generalizations about the properties of the affix,

the learners must have a preliminary idea of which of the words containing a string identical

to the affix are actually affixed. If an English learner tried to decide what is the meaning and

function of re- on the basis of, say, redo, retail and really, the learner would probably come

up with the wrong generalizations about the prefix or, more likely, she would not notice

any generalization at all and she would conclude that re- is not a prefix.

Of course, if the string corresponding to an affix mostly occurs in words which do

indeed contain the affix, the learner is probably going to extract the correct generalizations

even if there are a few pseudo-affixed words (i.e., words containing the string

corresponding to the affix without actually being synchronically analyzable as affixed).

However, this is not always the case. For example, Schreuder and Baayen (1994) have

shown that, for several common English and Dutch prefixes, the number of pseudo-prefixed

words is higher than the number of truly prefixed words (at least in terms of token frequency).


Thus, it would not be safe, for a learner, to assume a priori that any word

containing a string identical to an affix does indeed contain the affix from a morphological

point of view. Consequently, the learner must decide which of the words containing a

potential affix are truly morphologically complex, and which are pseudo-affixed, i.e., the

learner must assign preliminary morphological parses to the words she hears.

While I presented the task of discovering that a certain string is an affix (or more

generally a morpheme) and the task of assigning parses to words as separate aspects of

morpheme discovery, the two tasks are obviously closely related. A learner is not likely to

hear affixes in isolation. Thus, the task of discovering the affixes will typically involve

assigning morphological parses to words. A string is an affix of the language if at least one

of the words containing the string in the language is parsed as morphologically complex,

and the string constitutes one of the morphological components in the parse.

Morpheme discovery is a difficult task. Not only does the learner have to consider

many possible segmentations of each potentially complex word she hears, but she also does not

know a priori which meanings and/or syntactic functions are expressed by morphemes in

her language, and consequently she cannot know a priori whether a word is

morphologically complex or not. Furthermore, the learner does not know which types of

morphemes (prefixes, suffixes, circumfixes, infixes, autosegments, templates...) are

present in the language. Thus, even if the learner had some reason to expect a certain word

to be morphologically complex, she still would have to determine whether the word should

be divided into a prefix and a stem, or into a stem and a suffix, or into consonantal and

vocalic templates, or into other morpheme combinations.

It is probable that learners follow a number of different morpheme discovery

strategies, looking for phonological, syntactic and semantic cues. Moreover, distributional

evidence, i.e., evidence based on the frequency and co-occurrence patterns of words and

their substrings, provides potentially useful cues that learners can exploit. While each of

these approaches can help the learner in the morpheme discovery task, none of them is

likely to be sufficient by itself.

The primary goal of this study is to contribute to a better understanding of how

language learners perform morpheme discovery. In particular, the study provides evidence

in favor of the hypothesis that distributional cues play a significant role in this process.

Some recent studies have provided new support for the idea that distributional

information plays an important role in language learning (see Redington and Chater 1998


for a review of both the classic objections to distributional approaches and recent

distribution-driven learning models). Thus, another goal of the present study is to provide

further support for the general claim that language learners make crucial use of

distributional cues.

As I will shortly discuss in the conclusion, another goal of this study is to provide a

(partial) explanation for an interesting datum emerging from experimental studies of

morphological processing and representation, i.e., that speakers can represent words as

morphologically complex even if they lack semantic compositionality (i.e., the meaning of

the whole word is not the product of the meanings of the component morphemes). One can

argue that this phenomenon (complex representation/treatment of semantically opaque

words) is at least in part a by-product of distribution-driven morpheme discovery, and the

empirical evidence presented here provides some support for this hypothesis.

In order to assess the potential role of distributional cues in morpheme discovery, I

designed an automated learner, which performs a simplified version of this task on the sole

basis of the distributional evidence it can extract from a corpus of untagged words. The

most obvious simplification in the task performed by this computational model is that it

only looks for prefixes and stems, and not also for other kinds of morphemes.

The strategy followed by the automated learner in its search for prefixes and stems

is based on a simple fact about the distributional nature of morphemes: morphemes are

independent linguistic units, and as such, they occur in a number of different words where

they combine with other morphemes. The nonrandom distribution of morphemes makes

them detectable, in many cases, by statistical methods.
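One simple statistic of this kind can be sketched as follows (my own illustration, not part of the model presented below): count, over the word types of a toy corpus, how many distinct types begin with each word-initial string. A genuine affix like re- surfaces with a much higher type frequency than an arbitrary string.

```python
# Illustrative sketch only: tally, for every word-initial substring, the
# number of distinct word types that begin with it.
from collections import Counter

def initial_substring_type_freqs(words, max_len=4):
    counts = Counter()
    for w in set(words):  # type frequency: each word type counts once
        for i in range(1, min(max_len, len(w)) + 1):
            counts[w[:i]] += 1
    return counts

words = ["redo", "remake", "rename", "retail", "really", "do", "make", "name"]
freqs = initial_substring_type_freqs(words)
print(freqs["re"])   # 5 (redo, remake, rename, retail, really)
print(freqs["red"])  # 1 (redo)
```

Note that the count for re- includes pseudo-prefixed words such as retail and really: type frequency alone cannot separate true from spurious affixation, which is why further cues are needed.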

Given an input corpus of English words, the automated learner, equipped with a

small number of simple distributional heuristics, is able to discover a large set of actual

English prefixes, finding very few “false positives” (strings which are not English prefixes

but are treated by the learner as such). Moreover, the morphological parses (prefix + stem

vs. monomorphemic) assigned by the learner to the words in the input corpus are correlated

with intuitions of native English speakers about the morphological structure of the same words.


Thus, the computational simulation presented here demonstrates first of all that a

limited number of simple distributional heuristics can help a morpheme discoverer a great

deal, i.e., that there is in principle a large amount of evidence about morphological

constituency that children could extract from simple distributional cues.

Moreover, I show that the morphological parses assigned by the distribution-driven

model to a set of potentially prefixed but semantically opaque words are correlated with

morphological complexity ratings assigned to the same words by native English speakers. I

argue that this convergence between the model and the speakers, in a domain in which

speakers could not have relied on semantic cues, constitutes evidence that humans do

indeed rely on distributional cues similar to the ones exploited by my model, when

assigning morphological structure to words (see section 5 below for a full discussion of

this argument).

Notice that in the current project I am modeling morpheme discovery as a purely

distribution-driven task because I am interested in trying to determine how much and what

kind of information a learner could in principle extract from distributional evidence alone. I

am not trying to argue that this is the only kind of evidence used by human learners. It is

plausible that learners would use distributional cues at the earliest stages of morpheme

discovery, since distributional information can be straightforwardly extracted from the data,

and it can be exploited prior to any linguistic analysis. More sophisticated linguistic

information can later be used to refine the coarse guesses on morphological structure made

on the basis of distributional cues.

The remainder of this study is organized as follows: In section 2, I briefly review

some related work. In section 3, I present and discuss the computational model I am

proposing. In section 4, I present the results of a simulation in which this model was

tested, and I compare the morphological parses assigned by the model to morphological


complexity ratings assigned by humans. In section 5, I compare the performance of the

model to that of native speakers in parsing semantically opaque (but potentially complex)

words. Finally, in the conclusion I briefly discuss future directions that this project could take.

2. Related work

After the pioneering work of Harris in the fifties (see, e.g., Harris 1955) and until very

recently, modeling morpheme discovery has been a relatively unpopular research domain.

However, in the last few years there has been a resurgence of interest in the topic, and

several supervised and unsupervised algorithms performing morpheme discovery or related

tasks have been recently proposed: See, for example, most of the papers collected in

Maxwell 2002, and the references quoted there.

A recent approach that is closely related to the one I propose below is that of

Goldsmith 2001.3 Goldsmith’s model, like mine, is based on the idea that morphological

segmentation can be re-phrased as a data compression problem. However, Goldsmith’s

model differs from the one presented here in several respects.

The most obvious (and probably least interesting) difference is that Goldsmith’s

model looks for suffixation patterns, whereas my model focuses on prefixation.

More importantly, Goldsmith’s model is based on an information-theoretically

rigorous model of data compression constructed using the Minimum Description Length

(MDL) principle of Rissanen 1978. The model proposed below, on the other hand, is only

loosely inspired by the MDL idea, and the data compression scheme it assumes is not valid

from an information-theoretic point of view (see Baroni 2000b:3.3.7 on why I decided to

abandon the more rigorous MDL-based approach I adopted in Baroni 2000a).

Furthermore, Goldsmith’s model generates a single analysis using alternative

heuristic strategies, and then uses the MDL criterion to refine that analysis. On the other

hand, the model presented here uses heuristics to generate a set of alternative analyses, and

then applies the maximal data compression criterion to choose the best of these alternatives.

From a strictly linguistic point of view, Goldsmith’s model has two desirable

properties that are missing in the current model, i.e. it can fully decompose words

containing multiple affixes, and it groups stems and affixes into primitive forms of

paradigms called signatures.

Last but not least, Goldsmith’s main interest seems to lie in the possibility of

extracting morphological analysis tools from unlabeled corpora using an automated

procedure, whereas I developed the model I describe below because I am interested in

testing some hypotheses about the role of distributional learning during human

morphological acquisition. Thus, the focus here is less on the technical aspects of the

model, and more on how the outputs it produces compare to human intuitions about

morphological structure.

Another model that is closely related to mine is the utterance segmentation method

proposed in Brent and Cartwright (1996). Indeed, my algorithm can be seen as an

adaptation of Brent and Cartwright’s lexicon selection and generation methods to the

morpheme discovery problem. Thus, my algorithm takes words, rather than unsegmented

utterances, as input, and it returns maximally binary segmentations.
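The notion of a maximally binary segmentation can be made concrete with a short sketch (my own illustration, not Brent and Cartwright's or my model's actual code): each word is either left whole or split at exactly one point into a candidate prefix and a candidate stem.

```python
# Sketch of the space of maximally binary parses for a single word: the
# monomorphemic parse plus every one-point prefix + stem split.
def binary_segmentations(word):
    yield (word,)                    # monomorphemic parse
    for i in range(1, len(word)):
        yield (word[:i], word[i:])   # prefix + stem parse

print(list(binary_segmentations("redo")))
# [('redo',), ('r', 'edo'), ('re', 'do'), ('red', 'o')]
```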

Moreover, my model is biased so that it favors parses in which one element has

affix-like distributional properties, and the other element has stem-like properties.

Finally, Brent and Cartwright’s model, like the one proposed by Goldsmith, is

based on an information-theoretically sound data compression scheme, whereas the model I

propose below is only justified by the fact that it captures intuitively plausible morpheme-segmentation strategies.


3. The model

In order to assess the effectiveness of distributional heuristics in morpheme discovery, I

designed and implemented a learning model which performs a particular aspect of this task

--prefix discovery-- on the sole basis of distributional evidence.

The algorithm presented here takes a corpus of untagged orthographically or

phonetically transcribed words as its input and outputs a lexicon composed of a list of

prefixes and stems. Moreover, the algorithm assigns morphological parses (prefix + stem

or monomorphemic parses) to all the word types in the input corpus.4 The algorithm relies

entirely on the distributional information that can be extracted from the input. From here

on, I will refer to the algorithm presented here with the acronym DDPL, which stands for

Distribution-Driven Prefix Learner.

DDPL is based on a “generation and selection” strategy: a large number of lexica

compatible with the input data are generated, a certain measure is computed for each

lexicon, and the lexicon with the lowest value of this measure is selected. The formula used

to compute this measure constitutes the conceptual core of the algorithm, and it is based on

the idea that the best morphological analysis of the input is also the one allowing maximal

data compression of the input, given certain assumptions about how the data compression

process should work.
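In outline, the generation-and-selection loop looks like the following sketch. The candidate generator and the cost function here are toy placeholders of my own, not DDPL's actual routines; the point is only the shape of the search.

```python
# Schematic "generation and selection": enumerate candidate lexica, score
# each with a cost measure, keep the one with the lowest value.
def select_lexicon(corpus, candidate_lexica, cost):
    best, best_cost = None, float("inf")
    for lexicon in candidate_lexica(corpus):
        c = cost(lexicon, corpus)
        if c < best_cost:
            best, best_cost = lexicon, c
    return best

def candidates(_corpus):
    yield ["redo", "remake", "do", "make"]  # no decomposition
    yield ["re", "do", "make"]              # re- split off as a prefix

def char_cost(lexicon, _corpus):
    return sum(len(unit) for unit in lexicon)  # toy measure: lexicon length

print(select_lexicon(["redo", "remake", "do", "make"], candidates, char_cost))
# ['re', 'do', 'make']
```

Even with this crude character-count measure, the decomposed lexicon wins, since re- is shared by several words; DDPL's actual measure, discussed below, is a compression-based refinement of the same idea.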

As we will see, the data compression scheme from which the DDPL lexicon

selection formula is derived favors lexical analyses respecting the following three

principles, which in turn can be interpreted as plausible morpheme discovering strategies:

• Substrings occurring in a high number of different words are likely to be morphemes;


• Substrings which tend to co-occur with other potential morphemes are more likely

to be morphemes;

• All else being equal, low frequency words are more likely to be morphologically

complex than high frequency words.

The first two principles should be fairly intuitive: Morphemes -- especially affixes -- tend to

occur in a number of different words. Thus, they will tend to have a high type frequency.

Moreover, real morphemes are not simply substrings that occur in a high number of

random words, but rather substrings that can co-occur with other morphemes to form

complex words. This explains the second principle.

The third principle is perhaps less intuitive. Consider, however, the following. At

one extreme, if a morphologically complex word is very frequent, the word is likely to

have its own lexical entry, distinct from the entries of its component parts (at the very least,

for reasons of ease of lexical access). However, once a word has an independent lexical

entry, the word can acquire its own semantic features and thus it is likely to lose, over the

course of time, its connection with its component parts. In other words, high frequency

words are less likely to be morphologically complex because, even if they were complex

from an etymological point of view, they will tend to acquire a lexicalized meaning due to

heavy usage.

At the other extreme, productively formed complex words must be hapax legomena

(words with a token frequency of 1), or in any event have a very low token frequency.

Indeed, Baayen has shown in several studies (see for example Baayen 1994, Baayen and

Lieber 1991) that the number of hapax legomena containing a certain morpheme is a good

indicator of the productivity of the morpheme. If a morpheme is productive, then the

morpheme is often used to create new forms, and new forms, being new, are likely to have

a very low frequency.
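The hapax-based measure can be computed directly; the following is a straightforward rendering of the estimate as described in the text (not code from Baayen's studies): the productivity of a morpheme is the number of hapax legomena containing it divided by the total token count of words containing it.

```python
# Baayen-style productivity estimate as described above: P = n1 / N, where
# n1 is the number of hapax legomena containing the morpheme and N is the
# total token count of words containing it.
def productivity(token_freqs):
    n1 = sum(1 for f in token_freqs if f == 1)  # hapax legomena
    N = sum(token_freqs)
    return n1 / N if N else 0.0

print(productivity([1, 1, 1, 2, 5]))  # 3 hapaxes over 10 tokens -> 0.3
```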

Thus, all else being equal, it would make sense for a learner to be more willing to

guess that a word is complex if the word has a low token frequency than if the word has a

high token frequency.
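Taken together, the three principles suggest a rough decision rule. The sketch below is my own illustration of how the cues combine (the thresholds and helper names are invented for the example; this is not the DDPL formula developed in the next section):

```python
# Illustrative combination of the three cues: guess that a word is complex
# when (i) its initial string has high type frequency, (ii) the remainder
# also occurs as an independent word, and (iii) the word itself has low
# token frequency. Thresholds are arbitrary, for illustration only.
def guess_complex(word, prefix, token_freq, vocab, prefix_type_freq,
                  min_type_freq=3, max_token_freq=2):
    if not word.startswith(prefix):
        return False
    stem = word[len(prefix):]
    return (prefix_type_freq.get(prefix, 0) >= min_type_freq  # principle 1
            and stem in vocab                                  # principle 2
            and token_freq.get(word, 0) <= max_token_freq)     # principle 3

token_freq = {"redo": 1, "retail": 90, "do": 50, "tail": 10, "make": 40}
vocab = set(token_freq)
prefix_type_freq = {"re": 5}
print(guess_complex("redo", "re", token_freq, vocab, prefix_type_freq))    # True
print(guess_complex("retail", "re", token_freq, vocab, prefix_type_freq))  # False
```

Note how the frequency cue does the work for retail: its stem-like remainder tail does occur as a word, but the high token frequency of the whole form counts against a complex parse.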


While in the following sections I will concentrate on how morpheme discovery can

be rephrased as a data compression problem, the reader should keep in mind that I am by

no means assuming that morpheme discovery is indeed a data compression problem

(indeed, as I discuss in Baroni (2000b), the data compression method proposed here, when

seen as an actual data compression algorithm, is far from optimal, and it is not truly

implementable). Rather, the interest of the data compression approach lies in the fact that it

allows us to derive an explicit formula that, as I will show, favors the same type of lexical

analysis that would be favored by distribution-based morpheme discovery strategies such

as the ones I just discussed.

3.1 Data compression and morphological analysis: the shortest lexicon + encoding criterion

The criterion used by DDPL to select the best lexicon is based on the idea that the lexicon

generated by the most plausible morphological analysis is also the best lexicon for purposes

of data compression, given certain restrictions on how the compression procedure must

work. The rationale behind this intuition is the following: Since morphemes are

syntagmatically independent units, which occur in different words and combine with each

other, a lexicon containing morphemes is going to be “shorter” (in the literal sense that it

can be represented using a small number of characters) than a lexicon containing random

substrings, or a lexicon in which no word is decomposed. The advantage of reducing the

problem of morpheme discovery to a matter of (constrained) data compression is the

following: There are no straightforward ways to decide which one, among a set of possible

lexica, is the best one from the point of view of morphology, but it is relatively simple to

estimate which lexicon allows maximal data compression.

Given that the connection between data compression and morphological analysis is

not very intuitive, I will illustrate it with a set of simple examples.

Let us suppose that we are given a list of words, and our goal is to find a compact

format to store information from which the very same list can be reconstructed. In

particular, we take a “lexicon and encoding” approach to this task. We construct a compact

lexicon from which all the input words (plus, possibly, others) can be reconstructed. We

associate an index to each lexical unit, and then we rewrite the corpus as a sequence of

these indices. I will refer to the rewriting of the corpus as a sequence of indices with the

term encoding.
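The lexicon + encoding cost can be computed as in the following toy sketch (my own, for illustration), assuming, as the text does below, that every lexical index is exactly one character long: the total is the number of characters needed to list the lexical units, plus one character per index in the rewritten corpus.

```python
# Toy lexicon + encoding cost, with one-character indices: total =
# characters to list the lexicon's units + one character per unit index
# used to rewrite the corpus.
def lexicon_encoding_cost(corpus_tokens, parse):
    # parse maps each word type to its tuple of lexical units
    # (the whole word, or a prefix + stem pair)
    lexicon = {u for w in set(corpus_tokens) for u in parse[w]}
    lexicon_cost = sum(len(u) for u in lexicon)
    encoding_cost = sum(len(parse[w]) for w in corpus_tokens)
    return lexicon_cost + encoding_cost

corpus = ["redo", "remake", "rename", "repaint"]
whole = {w: (w,) for w in corpus}
split = {"redo": ("re", "do"), "remake": ("re", "make"),
         "rename": ("re", "name"), "repaint": ("re", "paint")}
print(lexicon_encoding_cost(corpus, whole))  # 23 + 4 = 27
print(lexicon_encoding_cost(corpus, split))  # 17 + 8 = 25: decomposition wins
```

Splitting off re- pays a price in the encoding (two indices per word instead of one), but saves more in the lexicon, because re- is listed once instead of being repeated inside four separate entries.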

As long as some lexical units occur very frequently in the input corpus (and/or

many units occur relatively frequently), and the lexical indices are, on average, shorter than

the units they represent, the lexicon + encoding strategy will allow us to represent the

corpus in a shorter format than the original one. In order to make sure that the second

requirement is satisfied (i.e., lexical indices are on average shorter than the input words

they represent), I assume here that all indices are exactly one character long (I will revise
