Distribution-driven morpheme discovery:
A computational/experimental study
SSLMIT (University of Bologna)
Corso della Repubblica 136
47100 Forlì (FC), Italy
In an early stage of morphological acquisition, children must discover which
strings correspond to affixes of their language, and which of the words
containing those strings are actually affixed. For example, a child acquiring
English must be able to discover that the word-initial string re- is a prefix, but
also that the word remake is prefixed, whereas the word retail probably is not, even though it begins with re-. In this study, I present a computational
model of how the task of morpheme (in particular, prefix) discovery could be
performed on the basis of distributional cues (cues based on the co-occurrence
patterns, frequency and length of words and their substrings in the input). The
results of a simulation using this model with an input corpus of English words
show that distributional evidence could in principle be very helpful to learners
having to perform the task of morpheme discovery. Moreover, I show that the
morphological parses assigned by the distribution-driven model to a set of
potentially prefixed but semantically opaque words are correlated with
morphological complexity ratings assigned to the same words by native
English speakers. I argue that this convergence between the model and the
speakers, in a domain in which speakers cannot rely on semantic cues,
constitutes evidence that humans do rely on distributional cues similar to the
ones exploited by my model, when assigning morphological structure to words.

1. INTRODUCTION

In an early stage of morphological acquisition, learners must discover
the affixes of their language and which words of the language can be decomposed into
affixes and other components. These are prerequisites to morphological acquisition.
Ultimately, a learner acquiring a language must discover the syntactic and semantic
properties associated with each affix of the language, in order to be able to produce and
understand new words. For example, a learner acquiring English must discover that re- is a
prefix that attaches to verbs to create other verbs with an iterative meaning.
However, in order to learn the morphological properties of an affix, learners must
first of all notice the existence of that affix. Moreover, in order to discover the linguistic
properties associated with the affix, the learner must inspect the semantic, syntactic and
phonological properties of a number of words containing the affix.
For example, in order to discover the properties of the prefix re-, English learners
must first of all, of course, notice that the string re- is a prefix. Moreover, the learners must
collect and analyze a number of words containing the prefix re- (redo, rename, remake...),
in order to extract the correct generalizations about this prefix.
However, not all the words containing a string identical to an affix actually contain
that affix. In order to discover the correct generalizations about the properties of the affix,
the learners must have a preliminary idea of which of the words containing a string identical
to the affix are actually affixed. If an English learner tried to decide what is the meaning and
function of re- on the basis of, say, redo, retail and really, the learner would probably come
up with the wrong generalizations about the prefix or, more likely, she would not notice
any generalization at all and she would conclude that re- is not a prefix.
Of course, if the string corresponding to an affix mostly occurs in words which do
indeed contain the affix, the learner is probably going to extract the correct generalizations
even if there are a few pseudo-affixed words (i.e., words containing the string
corresponding to the affix without actually being synchronically analyzable as affixed).
However, this is not always the case. For example, Schreuder and Baayen (1994) have
shown that, for several common English and Dutch prefixes, the number of pseudo-prefixed
words is higher than the number of truly prefixed words (at least in terms of token
frequency).
Thus, it would not be safe, for a learner, to assume a priori that any word
containing a string identical to an affix does indeed contain the affix from a morphological
point of view. Consequently, the learner must decide which of the words containing a
potential affix are truly morphologically complex, and which are pseudo-affixed, i.e., the
learner must assign preliminary morphological parses to the words she hears.
While I presented the task of discovering that a certain string is an affix (or more
generally a morpheme) and the task of assigning parses to words as separate aspects of
morpheme discovery, the two tasks are obviously closely related. A learner is not likely to
hear affixes in isolation. Thus, the task of discovering the affixes will typically involve
assigning morphological parses to words. A string is an affix of the language if at least one
of the words containing the string in the language is parsed as morphologically complex,
and the string constitutes one of the morphological components in the parse.
Morpheme discovery is a difficult task. Not only does the learner have to consider
many possible segmentations of each potentially complex word she hears, but she does not
know in advance which strings are affixes of her language, and consequently she cannot a priori know whether a word is
morphologically complex or not. Furthermore, the learner does not know which types of
morphemes (prefixes, suffixes, circumfixes, infixes, autosegments, templates...) are
present in the language. Thus, even if the learner had some reason to expect a certain word
to be morphologically complex, she still would have to determine whether the word should
be divided into a prefix and a stem, or into a stem and a suffix, or into consonantal and
vocalic templates, or into other morpheme combinations.
It is probable that learners follow a number of different morpheme discovery
strategies, looking for phonological, syntactic and semantic cues. Moreover, distributional
evidence, i.e., evidence based on the frequency and co-occurrence patterns of words and
their substrings, provides potentially useful cues that learners can exploit. While each of
these approaches can help the learner in the morpheme discovery task, none of them is
likely to be sufficient by itself.
The primary goal of this study is to contribute to a better understanding of how
language learners perform morpheme discovery. In particular, the study provides evidence
in favor of the hypothesis that distributional cues play a significant role in this process.
Some recent studies have provided new support for the idea that distributional
information plays an important role in language learning (see Redington and Chater 1998
for a review of both the classic objections to distributional approaches and recent
responses to those objections). Such studies provide further support for the general claim that language learners make crucial use of distributional information.
As I will shortly discuss in the conclusion, another goal of this study is to provide a
(partial) explanation for an interesting datum emerging from experimental studies of
morphological processing and representation, i.e., that speakers can represent words as
morphologically complex even if they lack semantic compositionality (i.e., the meaning of
the whole word is not the product of the meanings of the component morphemes). One can
argue that this phenomenon (complex representation/treatment of semantically opaque
words) is at least in part a by-product of distribution-driven morpheme discovery, and the
empirical evidence presented here provides some support for this hypothesis.
In order to assess the potential role of distributional cues in morpheme discovery, I
designed an automated learner, which performs a simplified version of this task on the sole
basis of the distributional evidence it can extract from a corpus of untagged words. The
most obvious simplification in the task performed by this computational model is that it
only looks for prefixes and stems, and not also for other kinds of morphemes.
The strategy followed by the automated learner in its search for prefixes and stems
is based on a simple fact about the distributional nature of morphemes: morphemes are
independent linguistic units, and as such, they occur in a number of different words where
they combine with other morphemes. The nonrandom distribution of morphemes makes
them detectable, in many cases, by statistical methods.
Given an input corpus of English words, the automated learner, equipped with a
small number of simple distributional heuristics, is able to discover a large set of actual
English prefixes, finding very few “false positives” (strings which are not English prefixes
but are treated by the learner as such). Moreover, the morphological parses (prefix + stem
vs. monomorphemic) assigned by the learner to the words in the input corpus are correlated
with intuitions of native English speakers about the morphological structure of the same
words. These results show that a limited number of simple distributional heuristics can help a morpheme discoverer a great
deal, i.e., that there is in principle a large amount of evidence about morphological
constituency that children could extract from simple distributional cues.
Moreover, I show that the morphological parses assigned by the distribution-driven
model to a set of potentially prefixed but semantically opaque words are correlated with
morphological complexity ratings assigned to the same words by native English speakers. I
argue that this convergence between the model and the speakers, in a domain in which
speakers could not have relied on semantic cues, constitutes evidence that humans do
indeed rely on distributional cues similar to the ones exploited by my model, when
assigning morphological structure to words (see section 5 below for a full discussion of
this point).
Notice that in the current project I am modeling morpheme discovery as a purely
distribution-driven task because I am interested in trying to determine how much and what
kind of information a learner could in principle extract from distributional evidence alone. I
am not trying to argue that this is the only kind of evidence used by human learners. It is
plausible that learners would use distributional cues at the earliest stages of morpheme
discovery, since distributional information can be straightforwardly extracted from the data,
and it can be exploited prior to any linguistic analysis. More sophisticated linguistic
information can later be used to refine the coarse guesses on morphological structure made
on the basis of distributional cues.
The remainder of this study is organized as follows: In section 2, I briefly review
some related work. In section 3, I present and discuss the computational model I am
proposing. In section 4, I present the results of a simulation in which this model was
tested, and I compare the morphological parses assigned by the model to morphological
complexity ratings assigned by humans. In section 5, I compare the performance of the
model and the intuitions of native speakers on a set of potentially prefixed but semantically opaque
words. Finally, in the conclusion I briefly discuss future directions that this project could take.
2. RELATED WORK
Until recently, modeling morpheme discovery has been a relatively unpopular research domain.
However, in the last few years there has been a resurgence of interest in the topic, and
several supervised and unsupervised algorithms performing morpheme discovery or related
tasks have been recently proposed: See, for example, most of the papers collected in
Maxwell 2002, and the references quoted there.
A recent approach that is closely related to the one I propose below is that of
Goldsmith 2001. Goldsmith’s model, like mine, is based on the idea that morphological
segmentation can be re-phrased as a data compression problem. However, Goldsmith’s
model differs from the one presented here in several respects.
The most obvious (and probably least interesting) difference is that Goldsmith’s
model looks for suffixation patterns, whereas my model focuses on prefixation.
More importantly, Goldsmith’s model is based on an information-theoretically
rigorous model of data compression constructed using the Minimum Description Length
(MDL) principle of Rissanen 1978. The model proposed below, on the other hand, is only
loosely inspired by the MDL idea, and the data compression scheme it assumes is not valid
from an information-theoretic point of view (see Baroni 2000b:3.3.7 on why I decided to
abandon the more rigorous MDL-based approach I adopted in Baroni 2000a).
Furthermore, Goldsmith’s model generates a single analysis using alternative
heuristic strategies, and then uses the MDL criterion to refine that analysis. On the other
hand, the model presented here uses heuristics to generate a set of alternative analyses, and
then applies the maximal data compression criterion to choose the best of these alternatives.
From a strictly linguistic point of view, Goldsmith’s model has two desirable
properties that are missing in the current model, i.e., it can fully decompose words
containing multiple affixes, and it groups stems and affixes into primitive forms of
paradigms called signatures.
Last but not least, Goldsmith’s main interest seems to lie in the possibility of
extracting morphological analysis tools from unlabeled corpora using an automated
procedure, whereas I developed the model I describe below because I am interested in
testing some hypotheses about the role of distributional learning during human
morphological acquisition. Thus, the focus here is less on the technical aspects of the
model, and more on how the outputs it produces compare to human intuitions about
morphological structure.
Another model that is closely related to mine is the utterance segmentation method
proposed in Brent and Cartwright (1996). Indeed, my algorithm can be seen as an
adaptation of Brent and Cartwright’s lexicon selection and generation methods to the
morpheme discovery problem. Thus, my algorithm takes words, rather than unsegmented
utterances, as input, and it returns maximally binary segmentations.
Moreover, my model is biased so that it favors parses in which one element has
affix-like distributional properties, and the other element has stem-like properties.
Finally, Brent and Cartwright’s model, like the one proposed by Goldsmith, is
based on an information-theoretically sound data compression scheme, whereas the model I
propose below is only justified by the fact that it captures intuitively plausible morpheme-segmentation strategies.

3. DDPL: AN AUTOMATED DISTRIBUTION-DRIVEN PREFIX LEARNER

In order to assess the potential role of distributional cues in morpheme discovery, I
designed and implemented a learning model which performs a particular aspect of this task
--prefix discovery-- on the sole basis of distributional evidence.
The algorithm presented here takes a corpus of untagged orthographically or
phonetically transcribed words as its input and outputs a lexicon composed of a list of
prefixes and stems. Moreover, the algorithm assigns morphological parses (prefix + stem
or monomorphemic parses) to all the word types in the input corpus. The algorithm relies
entirely on the distributional information that can be extracted from the input. From here
on, I will refer to the algorithm presented here with the acronym DDPL, which stands for
Distribution-Driven Prefix Learner.
DDPL is based on a “generation and selection” strategy: a large number of lexica
compatible with the input data are generated, a certain measure is computed for each
lexicon, and the lexicon with the lowest value of this measure is selected. The formula used
to compute this measure constitutes the conceptual core of the algorithm, and it is based on
the idea that the best morphological analysis of the input is also the one allowing maximal
data compression of the input, given certain assumptions about how the data compression
process should work.
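The generation-and-selection loop can be sketched in a few lines of Python. This is a toy illustration rather than DDPL’s actual procedure: the cost function below is only a crude stand-in for the compression measure discussed in the following paragraphs, candidate lexica are generated simply as subsets of hypothesized prefixes, and every word beginning with a selected prefix is split (DDPL itself decides parses word by word).

```python
from itertools import combinations

def lexicon_cost(corpus, prefix_set):
    """Toy cost: characters needed to list each lexical unit once,
    plus one character per unit index in the re-encoded corpus."""
    lexicon, encoding_len = set(), 0
    for word in corpus:
        # split off the longest hypothesized prefix, if any applies
        pref = next((p for p in sorted(prefix_set, key=len, reverse=True)
                     if word.startswith(p) and len(word) > len(p)), None)
        units = [pref, word[len(pref):]] if pref else [word]
        lexicon.update(units)
        encoding_len += len(units)
    return sum(len(u) for u in lexicon) + encoding_len

def select_prefix_set(corpus, candidate_prefixes):
    """Generate alternative analyses (all subsets of the candidate
    prefixes) and select the one allowing maximal compression."""
    return min((set(c)
                for r in range(len(candidate_prefixes) + 1)
                for c in combinations(candidate_prefixes, r)),
               key=lambda s: lexicon_cost(corpus, s))

corpus = ["redo", "remake", "rename", "do", "make", "name", "retail"]
print(select_prefix_set(corpus, ["re"]))  # {'re'}: splitting compresses best
```

Note that the cost minimization here also splits retail, a pseudo-prefixed word; whether such words should be split is precisely the parsing question discussed above.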
As we will see, the data compression scheme from which the DDPL lexicon
selection formula is derived favors lexical analyses respecting the following three
principles, which in turn can be interpreted as plausible morpheme discovering strategies:
• Substrings occurring in a high number of different words are likely to be morphemes;
• Substrings which tend to co-occur with other potential morphemes are more likely
to be morphemes;
• All else being equal, low frequency words are more likely to be morphologically
complex than high frequency words.
The first two principles should be fairly intuitive: Morphemes -- especially affixes -- tend to
occur in a number of different words. Thus, they will tend to have a high type frequency.
Moreover, real morphemes are not simply substrings that occur in a high number of
random words, but rather substrings that can co-occur with other morphemes to form
complex words. This explains the second principle.
The third principle is perhaps less intuitive. Consider, however, the following. At
one extreme, if a morphologically complex word is very frequent, the word is likely to
have its own lexical entry, distinct from the entries of its component parts (at the very least,
for reasons of ease of lexical access). However, once a word has an independent lexical
entry, the word can acquire its own semantic features and thus it is likely to lose, over the
course of time, its connection with its component parts. In other words, high frequency
words are less likely to be morphologically complex because, even if they were complex
from an etymological point of view, they will tend to acquire a lexicalized meaning due to their independent storage in the lexicon.
At the other extreme, productively formed complex words must be hapax legomena
(words with a token frequency of 1), or in any event have a very low token frequency.
Indeed, Baayen has shown in several studies (see for example Baayen 1994, Baayen and
Lieber 1991) that the number of hapax legomena containing a certain morpheme is a good
indicator of the productivity of the morpheme. If a morpheme is productive, then the
morpheme is often used to create new forms, and new forms, being new, are likely to have
a very low frequency.
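This hapax-based intuition can be made concrete with a short sketch. The function below is a simplified variant of Baayen-style productivity measures (hapax types containing a prefix divided by the total token count of words containing it); the frequency counts in the example are invented for illustration.

```python
def productivity(token_freqs, prefix):
    """Rough Baayen-style productivity score for a prefix:
    hapax legomena (token frequency 1) among the words containing
    the prefix, divided by the total tokens containing it."""
    freqs = [f for w, f in token_freqs.items()
             if w.startswith(prefix) and len(w) > len(prefix)]
    hapaxes = sum(1 for f in freqs if f == 1)
    return hapaxes / sum(freqs) if freqs else 0.0

# invented counts: two hapaxes out of 54 re- tokens
counts = {"redo": 1, "rename": 1, "remake": 2, "retail": 50}
print(productivity(counts, "re"))  # 2/54, about 0.037
```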
Thus, all else being equal, it would make sense for a learner to be more willing to
guess that a word is complex if the word has a low token frequency than if the word has a
high token frequency.
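The three principles can be folded into a single toy score, as in the hypothetical function below; the equal weighting and the specific scoring choices are mine, for illustration only, and do not correspond to DDPL’s actual selection formula.

```python
def prefix_evidence(corpus_freqs, candidate):
    """Toy evidence score for a candidate prefix, combining:
    (1) type frequency: number of words containing the candidate;
    (2) co-occurrence: how many of the resulting stems are also
        free-standing words in the corpus;
    (3) a preference for low-frequency carrier words (1/frequency)."""
    words = set(corpus_freqs)
    carriers = [w for w in words
                if w.startswith(candidate) and len(w) > len(candidate)]
    type_freq = len(carriers)                                      # principle 1
    free_stems = sum(1 for w in carriers
                     if w[len(candidate):] in words)               # principle 2
    low_freq_bonus = sum(1.0 / corpus_freqs[w] for w in carriers)  # principle 3
    return type_freq + free_stems + low_freq_bonus

freqs = {"redo": 1, "remake": 1, "do": 3, "make": 2, "retail": 40}
# "re" outscores the arbitrary substring "r": its stems recur as free words
print(prefix_evidence(freqs, "re") > prefix_evidence(freqs, "r"))  # True
```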
While in the following sections I will concentrate on how morpheme discovery can
be modeled as a data compression problem, I am by
no means assuming that morpheme discovery is indeed a data compression problem
(indeed, as I discuss in Baroni (2000b), the data compression method proposed here, when
seen as an actual data compression algorithm, is far from optimal, and it is not truly
implementable). Rather, the interest of the data compression approach lies in the fact that it
allows us to derive an explicit formula that, as I will show, favors the same type of lexical
analysis that would be favored by distribution-based morpheme discovery strategies such
as the ones I just discussed.
The criterion used by DDPL to select the best lexicon is based on the idea that the lexicon
generated by the most plausible morphological analysis is also the best lexicon for purposes
of data compression, given certain restrictions on how the compression procedure must
work. The rationale behind this intuition is the following: Since morphemes are
syntagmatically independent units, which occur in different words and combine with each
other, a lexicon containing morphemes is going to be “shorter” (in the literal sense that it
can be represented using a small number of characters) than a lexicon containing random
substrings, or a lexicon in which no word is decomposed. The advantage of reducing the
problem of morpheme discovery to a matter of (constrained) data compression is the
following: There are no straightforward ways to decide which one, among a set of possible
lexica, is the best one from the point of view of morphology, but it is relatively simple to
estimate which lexicon allows maximal data compression.
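The arithmetic behind this intuition can be checked directly, under the simplifying assumption (adopted below) that every lexical index is exactly one character long:

```python
def compressed_size(corpus_tokens):
    """Lexicon + encoding size, assuming one-character indices:
    characters to list each distinct unit once in the lexicon,
    plus one index character per token in the encoding."""
    lexicon = set(corpus_tokens)
    return sum(len(u) for u in lexicon) + len(corpus_tokens)

tokens = ["make", "remake", "make", "remake", "make"]
raw = sum(len(t) for t in tokens)  # 24 characters unencoded
whole = compressed_size(tokens)    # lexicon {make, remake}: 10 + 5 = 15
split = compressed_size(
    ["make", "re", "make", "make", "re", "make", "make"])
# lexicon {make, re}: 6 + 7 = 13 -- decomposing remake saves characters
print(raw, whole, split)  # 24 15 13
```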
Given that the connection between data compression and morphological analysis is
not very intuitive, I will illustrate it with a set of simple examples.
Let us suppose that we are given a list of words, and our goal is to find a compact
format to store information from which the very same list can be reconstructed. In
particular, we take a “lexicon and encoding” approach to this task. We construct a compact
lexicon containing each distinct unit occurring in the corpus,
associate an index with each lexical unit, and then we rewrite the corpus as a sequence of
these indices. I will refer to the rewriting of the corpus as a sequence of indices with the
term “encoding”.
As long as some lexical units occur very frequently in the input corpus (and/or
many units occur relatively frequently), and the lexical indices are, on average, shorter than
the units they represent, the lexicon + encoding strategy will allow us to represent the
corpus in a shorter format than the original one. In order to make sure that the second
requirement is satisfied (i.e., lexical indices are on average shorter than the input words
they represent), I assume here that all indices are exactly one character long (I will revise