REDUCTION OF LEXICAL AMBIGUITY
IGM, University of Marne-la-Vallée - CNRS
The resolution of lexical ambiguity is a prerequisite for many automatic procedures on written texts, even the simplest ones. However, it is not an easily automatable task. We will examine, through concrete examples, the issues faced during the elaboration of lexical disambiguators. In order to estimate the potential of the various approaches, we will consider how disambiguating written texts before processing them improves the relevant applications. In this study we will take into account both linguistic and computational problems and show how they are connected1.
1. Lexical tagging of texts
Written text cannot undergo linguistic processing without the system having access to linguistic information about words. In order to make such information quickly and conveniently available, computer programs usually attach it to the words of the text themselves in the form of lexical tags. The lexical tag for a word, therefore, gathers all the information available about it and useful for the task to be performed, ranging from the very form occurring in the text to grammatical, morphological, syntactic and semantic data, according to the nature of the task. A basic step consists in segmenting the text, identifying minimal units and annotating them with tags. This task is called lexical analysis, lexical tagging or annotation.
The technical means of attaching lexical information to words can be classified into two types, depending on whether the information comes from an electronic dictionary or is deduced from information present in the text.
Dictionary-based tagging is simple: the program looks up the words in a dictionary that associates tags with all the words in the language. This approach was widely put to the test in the 1990s and yields the most reliable results, insofar as the dictionary conforms to actual usage of the language and is comprehensive enough. For inflected languages, like most European languages, inflected-form dictionaries are used. The number of entries in inflected-form dictionaries is larger than in conventional dictionaries, in which verbs are present only in the infinitive. Highly inflected languages, e.g. Polish, have several million inflected forms. Even so, there exist dictionaries reasonably close to exhaustiveness that can be compressed into files on the order of 1 MB, making it possible to tag thousands of words per second. The Intex system contains efficient tools for dictionary compression and text tagging (M. Silberztein 1994). In this article, we will use the usual Intex conventions for lexical tags: thus, in French, <actif,A:fp> represents the adjective actif in the feminine plural, i.e. actives.
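As a minimal sketch of this lookup step, the following toy program associates Intex-style tags with inflected forms. The in-memory dictionary and its two entries are illustrative assumptions; real systems such as Intex use large compressed inflected-form dictionaries, not Python dictionaries.

```python
# Toy inflected-form dictionary (illustrative entries, Intex-style tags).
# 'actives' is genuinely ambiguous: adjective (fem. pl.) or verb (tu actives).
INFLECTED_DICT = {
    "actives": ["<actif,A:fp>", "<activer,V:P2s>"],
    "les": ["<le,DET:mp>", "<le,DET:fp>", "<le,PRO:mp>"],
}

def tag_word(word):
    """Look the inflected form up; unknown words get a placeholder tag."""
    return INFLECTED_DICT.get(word.lower(), ["<?>"])

print(tag_word("actives"))  # both tags: the lookup keeps all analyses
```

Note that the lookup returns every tag recorded for the form: dictionary-based tagging makes ambiguity explicit rather than resolving it, a point developed in section 2.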
Approaches to tagging without a dictionary were implemented in numerous systems during the 1980s and 1990s. Such systems exploit information present in the text, such as word endings and contexts. For example, many French words ending in -ives are adjectives in the feminine plural. This rule correctly assigns the tag <A:fp> to the word actives in the sentence:
Les entreprises les plus actives ont gagné des parts de marché
Many words preceded by certains des are nouns in the masculine plural. This rule, when applied to the sentence:
Certains des films primés ne seront à l'affiche qu'en automne
correctly assigns the tag <N:mp> to films. Such rules are learned automatically by training on frequency counts over a sample of tagged or untagged texts. Several types of machine learning are known. The principle itself is an approximation, since words with the same ending may have entirely different properties, for example the adjective moindre and the verb joindre. However, it is the only solution for tagging unknown words automatically.
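The two rules above (a suffix rule and a context rule) can be sketched as follows. Here the rules are hand-written for illustration; in the systems discussed, such rules are induced automatically from corpus frequencies.

```python
# Sketch of dictionary-free tagging heuristics (hand-written toy rules;
# real systems learn such rules from tagged or untagged corpora).
def guess_tag(word, previous_two):
    # Context rule: words preceded by 'certains des' are usually
    # masculine plural nouns.
    if previous_two == ("certains", "des"):
        return "<N:mp>"
    # Suffix rule: many words ending in -ives are feminine plural adjectives.
    if word.endswith("ives"):
        return "<A:fp>"
    return "<?>"

print(guess_tag("actives", ("les", "plus")))    # suffix rule fires
print(guess_tag("films", ("certains", "des")))  # context rule fires
```

As the article notes, such rules are only approximations: a suffix rule for -oindre would treat moindre and joindre alike, although one is an adjective and the other a verb.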
2. Lexical ambiguity
Residual difficulties with lexical tagging result from the massive density of lexical ambiguity in all natural languages. There is an instance of lexical ambiguity between two distinct linguistic elements when they are spelled exactly the same way. Some instances of lexical ambiguity come from imperfections in writing systems of languages. For example, in the two sentences:
Les opinions divergent sur ce point
On place sur le trajet du faisceau lumineux un dispositif divergent
the two forms divergent, a verb and an adjective, which are pronounced differently, are identical only because of oddities of French spelling. The abundance of this type of ambiguity varies with the writing system. Theoretically, it would not exist in a language with a sufficiently informative writing system.
However, lexical ambiguity covers a much more extensive reality than mere curiosities and sporadic contingencies. We will examine several examples of ambiguous words. In the following sentences, the two occurrences of forme are a noun and a verb, although they do not present any phonetic difference:
On ne forme pas un spécialiste en quelques semaines
Le contrat prend une forme de nature à satisfaire les deux parties
A human reader does not perceive the slightest ambiguity, but the necessity of assigning different tags to a verb and a noun, in a step which precedes, for example, the recognition of the syntactic structure of sentences, is obvious. The two terms of an instance of lexical ambiguity may have an etymological relation, like forme, or not, like peignait:
(1) Luc peignait les cheveux d'Anne
Luc peignait ses volets en vert
They can even be two elements of the same paradigm:
Je sais que vous plaisantiez
J'aime que vous plaisantiez
In this example, the two occurrences of plaisantiez differ in inflectional features: respectively indicative imperfect and subjunctive present. Lexical ambiguity often involves two words with a close etymological relation, the same part of speech, but two senses:
(2) Luc a marqué une croix sur cette page
(3) Ce livre m'a profondément marqué
Syntactic properties depend on senses: in the case of verbs, the number of essential complements, the corresponding prepositions, the syntactic transformations applicable, the selectional restrictions on subjects and essential complements... may differ. Now, such syntactic information, which is required for recognizing sentence structures, must be included in lexical tags for several classes of applications. We will come back to this topic in section 3.
Another common type of lexical ambiguity involves compound words (M. Salkoff 1999):
(4) La table ronde sur la politique de santé est annulée
(5) La table ronde est trop petite, il faudra prendre l'autre
In (4), table ronde is a compound noun and, as such, a minimal unit for linguistic processing; the tag for table ronde will therefore have to contain the relevant syntactic information, e.g. the form and type of complements of this unit. In (5), table and ronde are distinct units that make up a free noun phrase. The tradition in natural language processing is persistently reluctant to admit the idea of representing a compound by a single tag (e.g. J. Cloeren 1999, p. 45). The computational linguistics community quite as persistently underestimates the quantity of compound words in texts, though recent studies (e.g. J. Senellart 1999) estimate that more than half of running text is made up of compound words. In addition, the technical content of a text, including most technical terms, is to be found more in this part than in the other. Lexical tagging should therefore deal with units of greater complexity than simple words. Tagging only simple words amounts to restricting all further computer processing to a superficial part of language.
Natural languages also provide examples of other types of ambiguity besides lexical ambiguity (e.g. M. Gross, in this volume), but since we are not interested in them in this article, we will most of the time just refer to ambiguity for short.
The existence of lexical ambiguity has technical consequences. The function of an electronic dictionary is to ensure that relevant tags are available for all words; in the case of multiple senses, therefore, a mere dictionary lookup yields lexical information about all senses, and not only the contextually appropriate information. At this stage, ambiguity is explicitly represented through tag lists; it must be resolved in order for the lexical tagging process to be complete, i.e. for each word to be annotated only with the right tag or tags. Indeed, several tags may perfectly well be contextually appropriate. This is the case in sentence (1), in which Luc can have either made a painting of Anne's hair or combed it. It is an example where ambiguity can be represented at the level of lexical tagging but cannot be resolved (genuine ambiguity).
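The representation of ambiguity as tag lists, and its partial resolution by context, can be sketched as follows. The sentence, the tag lists and the single disambiguation constraint are toy assumptions for illustration, not rules from an actual disambiguator.

```python
# Sketch: after dictionary lookup, each token carries a list of tags;
# disambiguation removes contextually inappropriate tags from the lists.
# Toy data and toy constraint, for illustration only.
sentence = ["on", "ne", "forme", "pas"]
tags = {
    "on": ["<on,PRO:ms>"],
    "ne": ["<ne,ADV>"],
    "forme": ["<forme,N:fs>", "<former,V:P1s>", "<former,V:P3s>"],
    "pas": ["<pas,ADV>", "<pas,N:ms>"],
}

# Initial (ambiguous) tagging: one tag list per token.
analysis = [list(tags[w]) for w in sentence]

# A toy constraint: right after the negation particle 'ne', keep only
# verb readings.
for i, w in enumerate(sentence):
    if i > 0 and sentence[i - 1] == "ne":
        analysis[i] = [t for t in analysis[i] if ",V:" in t]

print(analysis[2])  # the noun reading of 'forme' has been discarded
```

Note that the verb readings of forme remain multiple after filtering: a disambiguator reduces tag lists, and need not reduce every list to a single tag, since several tags may be genuinely appropriate.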
The case of tagging without a dictionary is different: it is possible to produce several tags in order to represent several analyses, but it is impossible to ensure that the correct tagging of each word is present among the tags produced. In addition, recognizing compound words without a dictionary is an almost insoluble problem, since most of them obey all the usual rules of morphological agreement and superficial syntax, like table ronde, prendre en compte, au cours de, and therefore are impossible to detect without complete lists. In our view, this limitation completely disqualifies the exclusive use of statistical taggers for useful tagging. From now on we will restrict our scope to dictionary-based tagging.
Electronic dictionaries may be more or less comprehensive, in the sense that they describe a larger or smaller number of words, or of word senses. When the number of senses described increases, the degree of ambiguity also grows, which can appear as an obstacle to computer processing in the case of rare senses, like oublie 'sort of cake', an old-fashioned word and a homograph of forms of oublier 'forget'. For some applications, this is an argument for limiting the extension of the lexical description, but the fundamental work of describing linguistic data cannot be restricted by considerations specific to a particular type of application: the description of rare words and rare senses is required for other applications. Restricting lexical description in order to prevent ambiguity from becoming visible would be equivalent to stopping the development of air traffic in order to avoid building airports.
Lexical information found in electronic dictionaries and attached to words may be more or less detailed, according to its nature and extension. For instance, tags reduced to an indication of grammatical category: verb, noun, adjective... carry little information. The following sentence is an example of such minimalist tagging, with classical abbreviations:
(6) Il<PRO> pense<V> arriver<V> à<PREP> l'<DET> heure<N>
The informative content of tags is related to the size of the tagset: parts of speech make up a set of about 15 elements. If we take more information into account, this number can only grow. This quantitative parameter is called the granularity of the description: as it grows, each tag describes fewer words in a more informative way, and the descriptive model gets more fine-grained.
The granularity of a description grows when more information is included. The most popular tagsets for English include essentially, in addition to part of speech, inflectional features: tense, number, person... The British National Corpus basic tagset totals 57 lexical tags (G. Leech et al. 1994), the Penn Treebank tagset, 36 (M. Marcus et al. 1993), the Lancaster-Oslo-Bergen Corpus tagset, 120 (S. Johansson et al. 1986; Garside et al. 1987), and the Brown Corpus tagset, 71 (B. Greene & G. Rubin 1971, N. Francis & H. Kucera 1982). In Romance languages, which are more inflected than English, the variations of the same features generate 80 to 120 tags2. Let us insert inflectional features into sentence (6):
(7) Il<PRO:3ms> pense<V:P3s> arriver<V:W> à<PREP> l'<DET:fs> heure<N:fs>
We can still include into the tags the canonical form (or lemma or base form) of each inflected word, e.g. the infinitive in the case of verbs:
(8) Il<il,PRO:3ms> pense<penser,V:P3s> arriver<arriver,V:W> à l'heure<à l'heure,ADV>
When canonical forms are included in all tags, the number of tags increases dramatically: the class of forms to which a given tag can be assigned usually reduces to a single word. Common tagsets do not include canonical forms of words beyond a few grammatical words, and are often called wordclass tagsets. It is natural to avoid this term in a framework where wordclasses are very small. Since we consider that canonical forms are an important piece of information, we will prefer the term lexical tagsets.
To limit the number of tags without any loss of information, canonical forms can be abbreviated, taking advantage of the redundancy between the inflected form and the canonical form: with this convention, pensait<penser,V:I3s> is replaced with pensait<3er,V:I3s>, where 3er means: drop the final 3 characters of the inflected form and append er. The number of distinct tags then reaches an order of magnitude of 1,000 in French. The morphological dictionaries of the RELEX network of laboratories, which are integrated in Intex, use tagsets of at least this level of granularity.
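The abbreviation convention described above can be sketched as a small function: compute the common prefix of the inflected and canonical forms, then encode the canonical form as the number of final characters to drop plus the characters to append.

```python
# Sketch of the canonical-form abbreviation convention discussed above:
# 'pensait' + 'penser' share the prefix 'pens', so the canonical form is
# encoded as '3er' (drop the last 3 characters, append 'er').
def abbreviate(inflected, canonical):
    # Length of the longest common prefix of the two forms.
    i = 0
    while i < min(len(inflected), len(canonical)) and inflected[i] == canonical[i]:
        i += 1
    return f"{len(inflected) - i}{canonical[i:]}"

print(abbreviate("pensait", "penser"))  # '3er'
```

Since the inflected form is present in the text anyway, this encoding loses no information, while drastically reducing the number of distinct tags: all verbs whose imperfect in -ait is built on an -er infinitive share the tag <3er,V:I3s>.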
In examples (6)-(8), we included into tags grammatical, morphological and inflectional information only. Even so, the granularity of the lexical description varied greatly. The representation of ambiguity depends on the system of tags. The word pense was not represented as ambiguous in (6): its only tag was <V>.
The same word becomes ambiguous as soon as we take inflectional features into account, since it can be the first or third person of the indicative present, the same persons of the subjunctive present, or the second person of the imperative: pense<V:P1s:P3s:S1s:S3s:Y2s>.
Thus, ambiguity grows automatically with the granularity of the tagset used to represent it. However, statistics on texts show that this growth is moderate as long as information is restricted to the grammatical, morphological and inflectional level. We computed that the average lexical ambiguity of a sample of French texts is 1.63 tags per word with a tagset of the type of (6), and 1.99 with one of the type of (8). The growth of ambiguity is limited (22%) as compared with the increase in informative content brought about by the substitution of tagsets.
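The measure used in the figures above, the average number of tags per word, can be sketched directly. The toy tag lists below are illustrative assumptions, not the sample on which the 1.63 and 1.99 figures were computed.

```python
# Sketch of the ambiguity measure discussed above: the average number of
# tags per word over a tagged text (toy data, for illustration only).
def average_ambiguity(tagged_text):
    """tagged_text: list of (word, list-of-tags) pairs."""
    total_tags = sum(len(tags) for _, tags in tagged_text)
    return total_tags / len(tagged_text)

toy = [
    ("la", ["<le,DET:fs>", "<le,PRO:fs>", "<la,N:ms>"]),       # det./pron./note
    ("table", ["<table,N:fs>", "<tabler,V:P1s>", "<tabler,V:P3s>"]),
    ("est", ["<être,V:P3s>", "<est,N:ms>"]),                   # verb / 'east'
]
print(average_ambiguity(toy))
```

Since this value depends directly on the granularity of the tagset, it only makes sense to compare such averages (or reduction rates computed from them) between systems that use the same tagset, a point taken up at the end of this section.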
From the point of view of applications, the adequacy of a tagset for the particular application must first be assessed. The simplest procedure on written text is the detection of lexical errors, i.e. of words that do not belong to the vocabulary, like pourparler (for pourparlers): this task does not require any tags. Some more ambitious applications require tagging the text, but can be considered acceptable even if they do not produce exhaustive results. For example, users of spelling checkers for non-lexical errors are usually aware of the difficulty of the task, view such systems as an aid to re-reading text and do not trust them to point out every error. The same holds for text searching and text indexing. The texts selected, in the first case, and the index items chosen, in the second, can include undesired elements and miss desired ones without making such systems unusable, since the user's purpose is only approximately defined and output is processed manually. For this category of applications, the information in tags of the type of (8) probably allows for obtaining more interesting results than the present state of the art.
More ambitious applications require even more lexical information: speech synthesis from unrestricted written text, translation, and other applications involve thorough syntactic parsing, i.e. recognition of the structures of sentences and phrases: clauses, predicates, complements. This recognition cannot be automated without specific lexical information. Thus, surveying the applications which tagging is useful for leads us to examine the relations between tagging and syntactic parsing. It is a common prejudice to consider a priori that these two procedures belong to distinct, independent areas. Let us instead examine which kinds of information required by syntactic parsing must be included in lexical tags.
In the case of verbs and other predicative forms, like marquer, faire un calcul, être d'accord, être débiteur, prendre en compte, the number of essential complements and the corresponding prepositions are indispensable for identifying the predicative element(s) of the sentence and their arguments: subjects and complements, if any. In addition, the sentence can be the result of the application of one or several syntactic operations: passive, inversions, reductions, omissions, pronominalizations... Now, specialists of syntax know that not all transformations are applicable to all predicative forms, and this information is essential for recognizing structures. Selectional restrictions, i.e. the information of the set of nouns that can occur in a sentence as subjects or complements, depend on each predicative form: it is fundamental to know them, even incompletely, to discriminate hypotheses during computer parsing, e.g. in (2) or (3).
The formalization of such properties is an extensive piece of work, since the whole lexicon is concerned. The Lexicon-Grammar theory (M. Gross 1994a, 1994b) is the natural framework for implementing such a study.
Let us examine a simple example of a piece of linguistic information required for syntactic parsing: the pre-nominal and post-nominal position of adjectives in French. This information is lexical, in the sense that it depends on each adjective and does not follow from any general rule. The adjective sympathique may be pre- or post-nominal; the adjective politique is exclusively post-nominal:
une sympathique proposition
une proposition sympathique
* une politique décision
une décision politique
Both words adulte and analphabète are ambiguous between an adjective and a noun:
Nous proposons un cours pour adultes analphabètes
In this sentence, the combination of the words adultes analphabètes with two parts of speech each generates four analyses; the fact that the adjective adulte is exclusively post-nominal correctly rules out the analysis adultes<A:mp> analphabètes<N:mp>. The effect of this type of rule on ambiguity resolution in French (M. Garrigues 1997) and in Portuguese (P. Carvalho, this volume) is quantitatively important.
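The filtering effect described above can be sketched as follows: enumerate the four candidate analyses of the two-word sequence and discard those in which an exclusively post-nominal adjective precedes a noun. The lexical property set and function names are illustrative assumptions.

```python
# Sketch of the adjective-position filter: 'adultes analphabètes' has
# 2 x 2 candidate analyses; the lexical fact that the adjective 'adulte'
# is exclusively post-nominal rules one of them out.
from itertools import product

POST_NOMINAL_ONLY = {"adulte"}  # a lexical property, listed per adjective

def analyses(word1, word2):
    """Enumerate candidate taggings of a two-word sequence and keep only
    those compatible with the adjective-position constraint."""
    kept = []
    for tag1, tag2 in product(["N:mp", "A:mp"], repeat=2):
        # An exclusively post-nominal adjective cannot precede a noun:
        # this rules out adultes<A:mp> analphabètes<N:mp>.
        if tag1.startswith("A") and tag2.startswith("N") and word1 in POST_NOMINAL_ONLY:
            continue
        kept.append((f"{word1}<{tag1}>", f"{word2}<{tag2}>"))
    return kept

for a in analyses("adulte", "analphabète"):
    print(a)  # three analyses survive out of four
```

This is a purely lexical constraint: it reduces ambiguity before any full syntactic parse, which is why rules of this type have a quantitatively important disambiguation effect.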
Tags with the kind of linguistic information exemplified above could legitimately be called 'syntactic wordclass tags' or 'syntactic lexical tags'. Such information is indeed of a syntactic level, and is far more precise than the 'morphosyntactic' or 'wordclass' information included in common tagsets (e.g. N. Francis & H. Kucera 1982, R. Garside et al. 1987, M. Marcus et al. 1993, G. Leech et al. 1994). Such tagsets do include more information than bare part of speech; e.g. sometimes they mark a distinction between common nouns and proper nouns, or between auxiliaries and non-auxiliary verbs; but they do not include the classification of verbs into intransitive, direct transitive and indirect transitive, still less the argument structure of verbs. Even intermediate tagsets, designed to ensure comparability between tagsets, do not anticipate the possibility of considering this classification or this structure (G. Leech & A. Wilson 1999). Moreover, as we already mentioned, common tagsets do not systematically include information about canonical forms.
When information exploitable in syntactic parsing is included, the informative content and granularity of tags grow dramatically, since making a formal description of these properties implies separating the senses of verbs, as in (2) and (3), adjectives, and other predicative elements. At this stage, the granularity of the tagset goes far beyond present standards. The properties that we mentioned are difficult to handle in frequency-based systems. For instance, the pre- or post-nominal position of French adjectives has no correlation with orthographic marks like suffixes; it can depend on the sense of a given form, and often varies freely for a given sense, which makes it difficult to obtain this information through automatic generalization from examples in a sample of texts.
A consequence of an increase in granularity of the description is an increase in ambiguity: each word with several senses associated with several syntactic behaviours can a priori be represented by respective tags, and this kind of ambiguity combines with inflectional ambiguity that we exemplified first. There are no estimates of the average number of tags per word yet, but an order of magnitude of 10 is plausible. However, increase in ambiguity is not a defect of the model, but a reflection and a consequence of the complexity of the problem. A fine-grained tagset is a heavy tool from a technical point of view, but for the same reason why bulldozers are heavy: teaspoons are not appropriate tools for making embankments.
As far as the degree of lexical ambiguity is concerned in general, the only technical means of measuring this quantity is to count the number of tags per word, and this value depends on the granularity of the tagset. Therefore, comparing error rates or reduction rates, i.e. the proportion of ambiguity resolved by the respective programs, across systems with different tagsets, or comparing other numerical results of such systems, is in general meaningless.
4. Delimitation of the problem and objectives
We examined several examples of lexical ambiguity and observed that the phenomenon affects any text, and almost any sentence, even the simplest. In addition, depending on the granularity of the descriptive model, the degree of ambiguity and the number of tags may increase dramatically, and indeed must increase, given the complexity of the most interesting applications that justify the study of the problem.
Now, can all lexical ambiguity in a text be resolved, and at which cost?
The answer is clear. For some sentences, resolving all lexical ambiguity would imply recognizing the entire syntactic structure. In the following sentences, pêcher is respectively a verb and a noun:
(9) Il avait l'impression de pêcher dans un aquarium
(10) La photo montre un exemple de pêcher dans un terrain siliceux
Immediate context is analogous in the two sentences, on the left: <DET:s> <N:s> <de,PREP> as well as on the right: <dans,PREP> <un,DET:s> <N:s>. We need to know the properties of the nouns impression and exemple in order to resolve the ambiguity. The complement of impression is an infinitival clause:
(11) Luc a l'impression de rêver
The nominal form of this complement:
Luc a l'impression d'un rêve
may occur in the singular with an empty determiner:
Luc a l'impression de rêve dont il a déjà parlé
but only if the complement is a predicative noun, not a concrete noun:
(12) * Luc a l'impression d'arbre dont il a déjà parlé
The noun exemple does not accept an infinitival complement:
(13) * Ce texte est un exemple de décrire nos résultats
The nominal complement of exemple behaves according to different restrictions on the use of determiners:
? Ce texte est un exemple d'une description scientifique
Ce texte est un exemple de description scientifique