(23) Certains supporters sont même violents
and the pronoun entry for:
(24) Certains sont même violents
The other solution consists in representing both cases with the same tag, and describing the syntactic relation between (23) and (24) as a deletion of the noun (M. Gross 1977, pp. 28-30). Some representation of this relation of deletion is in any case indispensable to thorough parsing, for another reason: the complete subject of (24) must be restored, no matter which tag is assigned to the word. The second solution also applies to constructions like Certains de ces supporters sont même violents. If it is adopted, the dictionary represents certains as ambiguous between a determiner <certains,DET:mp> and one or several adjectives <certain,A:mp>. Both solutions are applicable to other determiners: aucun, deux, trois... The two solutions are nearly equivalent: the difference between them does not appear as a real divergence in linguistic analysis, but as a formal detail.
However, linguistic concepts cannot be handled by a computer system without formalization, and once underlying linguistic analyses are formalized, a mere formal detail suffices to make them distinct, even if they are equivalent variants. Since the representation of the words at stake in the dictionary depends on the solution chosen, both the ambiguity reduction grammars and the parser must comply with it. Consider now the following grammatical constraint:
(25) A verb cannot occur immediately after a determiner
Constraint (25) may be correct if certains is analysed as a pronoun in (24), and indeed it is in keeping with the first solution, but it is incorrect if certains is analysed as determiner. In the first solution, certains is represented as more ambiguous than in the second. If the second solution is preferred, the ambiguity between <certains,DET:mp> and <certains,PRO:mp> disappears, as well as the necessity of (25); therefore, the second solution simplifies both the dictionary and the ambiguity reduction grammar.
It is interesting to observe the symptoms that can occur in the case of a discrepancy between the dictionary and the ambiguity reduction grammar.
If the dictionary conforms to the first solution and the grammar to the second, the ambiguity between <DET> and <PRO> arises and is not resolved by the grammar. This is a clear case of artificial ambiguity, since it is an artifact caused by an inconsistency between two components of the linguistic data.
If, on the contrary, the dictionary conforms to the second solution and the grammar to the first, rule (25) applies and discards the reading in which certains is a determiner, keeping only the wrong readings in which certains is an adjective. Thus, the right reading is rejected.
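The two kinds of discrepancy can be reproduced on sentence (24) with a minimal sketch. The toy lexicons, the tag strings for sont, même and violents, and the helper names (analyses, constraint_25, surviving_tags) are assumptions made for illustration; only constraint (25) is implemented, so the adjectival reading of certains is never resolved here.

```python
from itertools import product

# Hypothetical toy lexicons: tags for "certains" under each solution.
DICT_FIRST = {"certains": ["DET:mp", "PRO:mp", "A:mp"]}   # first solution
DICT_SECOND = {"certains": ["DET:mp", "A:mp"]}            # second solution
COMMON = {"sont": ["V:3p"], "même": ["ADV"], "violents": ["A:mp"]}
WORDS = ["certains", "sont", "même", "violents"]          # sentence (24)

def analyses(words, lexicon):
    """Enumerate every tag sequence allowed by dictionary lookup."""
    return list(product(*(lexicon[w] for w in words)))

def constraint_25(analysis):
    """Constraint (25): a verb cannot occur immediately after a determiner."""
    return not any(a.startswith("DET") and b.startswith("V")
                   for a, b in zip(analysis, analysis[1:]))

def surviving_tags(dictionary, grammar_has_25):
    """Tags of "certains" left after filtering with the chosen grammar."""
    result = analyses(WORDS, {**COMMON, **dictionary})
    if grammar_has_25:
        result = [a for a in result if constraint_25(a)]
    return [a[0] for a in result]

# Dictionary follows the first solution, grammar the second (no rule 25):
print(surviving_tags(DICT_FIRST, grammar_has_25=False))
# Dictionary follows the second solution, grammar the first (rule 25):
print(surviving_tags(DICT_SECOND, grammar_has_25=True))
```

In the first call the DET/PRO ambiguity survives untouched; in the second, the determiner reading, which is the right one under the second solution, is discarded.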
The origin of the problem lies in the presence of two formal representations of the same grammatical construct. Another prototypical example of the same situation is the possibility of representing the syntax of a construction either as free or as frozen. The syntax of a construction is said to be free if it is to be treated as a combination of linguistic units, like good wine, and frozen if it makes up a compound word, as white wine. A form may be ambiguous between a free construction and a compound word, like table ronde in (4)-(5). The ambiguity is of a lexical nature, since the tags to be assigned to the form are completely different depending on the interpretation7. This kind of ambiguity can in some cases be resolved locally, which is probably the case of en fait in:
En fait, il pleut
where ambiguity is resolved in favour of the compound. Here we are interested in the problems that arise when one can hesitate between two representations, a free one and a frozen one, for the same construction. The syntax of majuscule (J. Senellart, 1999) is an example of this situation8. All occurrences of majuscule correspond to the following forms:
- sequences of the type N majuscule, where N can only be one of the nouns lettre, alphabet, caractère, initiale, or the name of a letter of the alphabet, a, b, etc.:
Un seul (caractère + * nom de personne) majuscule figure dans le texte
These forms cannot be related with sentences of the type N est majuscule:
* Ce caractère est majuscule
The noun lettre can always occur in a paraphrase of these forms: lettre initiale majuscule, alphabet de lettres minuscules, etc.
- the noun majuscule is always equivalent to lettre majuscule;
- the form N être en majuscules is paraphrasable by N être écrit en lettres majuscules.
The syntax of these expressions is so limited that it is best represented by a local grammar (M. Gross 1997). The set of possible values of the noun N in the forms N majuscule is small: all these forms make up a small family of compound nouns, in which the noun majuscule is a reduced form of the compound lettre majuscule. Once tags are devised to represent all these compound words, an additional adjectival tag is no longer required for majuscule.
A second solution, closer to grammatical and lexicographical traditions, would be to consider that the noun majuscule is a simple noun, and the forms N majuscule free combinations of nouns with an independent adjective <majuscule,A>. The limited syntax of the combination should therefore be represented in the data of the parser, namely the selectional restrictions on N, the impossibility of using the "adjective" in a predicative or attributive form, and the possibility of deleting the noun lettre.
The difference between the two solutions lies in how the data about the semi-frozen syntax of a family of expressions are dealt with in dictionaries and grammars. These data have to be formally described in any case, but they can appear either in the dictionary (local grammars are a part of the dictionary), or in the grammar of the syntax of the language. In the first solution, the form majuscule is not represented as ambiguous, since the tag depends on the noun on the left. In the second solution, the form is systematically considered ambiguous between noun and adjective, and this ambiguity has to be resolved, which complicates the process without any easily identifiable benefit.
The consequences of a discrepancy between the models underlying the dictionary and the ambiguity reduction grammars depend on the type of inconsistency involved. If the dictionary represents the forms as frozen, and the ambiguity reduction grammar considers them free, resolving the ambiguity between noun and adjective in favour of the noun has no undesirable effect, since the word is not tagged as an adjective during lexical analysis. If, on the contrary, the dictionary represents the forms as free and the grammar considers them frozen, the ambiguity between noun and adjective introduced by lexical analysis is not resolved: the disagreement between the two sets of linguistic data, therefore, gives rise to artificial ambiguity.
We can mention a third example: the gender of adjectives in Romance languages. Some adjectives vary in gender, like lent/lente, others do not, like fixe. The distinction between masculine and feminine, therefore, appears as relevant grammatical information in the case of lent, and as a source of artificial lexical ambiguity in the case of fixe, since it leads to handling two tags <fixe,A:ms> and <fixe,A:fs>. The same holds for various determiners, pronouns and nouns.
The inflectional system of Romance languages, in this respect, can be represented in two superficially different ways. The traditional solution considers that gender is a relevant category for all adjectives, and represents fixe as inflectionally ambiguous9. In the second solution, gender is relevant for some adjectives and irrelevant for others10. The choice between the two solutions depends on the three data sets involved: the dictionary, that represents fixe with one adjectival tag or two; the ambiguity reduction grammar, which has to resolve the inflectional ambiguity of fixe or not; and the grammar that formalizes the syntax of the language11. If fixe is represented as ambiguous, the formal representation of all adjectives is more homogeneous, which simplifies the description of noun-adjective agreement. Choosing a solution implies considering the situation as a whole. Needless to say, the coexistence of both models in the same process is bound to cause a major disaster, though they differ only in a mere formal detail.
The three examples (the relation between determiners and pronouns, the frozen syntax of majuscule, and the gender of adjectives) provide evidence of interdependencies between elements of linguistic description. The construction of electronic dictionaries, of ambiguity reduction grammars, and even of a tagset requires that the same linguistic analyses underlie the whole process, including parsing. Problems related to possible inconsistencies or incompatibilities between elements of linguistic description do not invalidate the general architecture that we adopted, namely an organization of data and computer processing into distinct, compatible components. Such practical difficulties are well known to researchers with an actual descriptive activity in the domain. They only reflect the complexity of real-size linguistic description. The visible outcome of a discrepancy between elements of formal linguistic description can be a decrease in recall or in precision, or the appearance of artificial ambiguity.
The influence of the linguistic analyses underlying formal description is obviously fundamental. Let us point out two stakes depending on the choice of these analyses:
- the simplicity of the global process; choosing globally simple descriptive solutions is an important objective, due to the intrinsic complexity of languages when the lexicon is taken into account;
- the degree of lexical ambiguity represented, as measured by the number of tags per word; this number depends on the tags assignable to words in accordance with the linguistic analyses chosen.
In particular, as far as grammatical words are concerned, these two aspects, the simplicity of formal description and the degree of ambiguity of grammatical words, are connected. In the example of the relation between determiner and pronoun by noun deletion (23)-(24), the simpler solution is characterized by a smaller number of tags, and therefore by less ambiguity, since the problem is treated as syntactic. This example is typical of a frequent situation with grammatical words. When coordination between sets of linguistic data is maximal, the apparent degree of ambiguity decreases, since linguistic elements are represented only once.
However, this effect does not suffice to completely remove lexical ambiguity from all grammatical words, nor to make tag assignment useless for grammatical words. Some grammatical words show ambiguity without any synchronic connection with syntactic relations of the type of (23)-(24). For example, s' may be a conjunction or pronoun:
(26) Le public entre s'il n'y a pas de panneaux d'information
Le vernis s'est endommagé sous l'effet des intempéries
Assigning at least two distinct tags <si,CONJ> and <se,PRO:3> to the form s' has a practical utility: establishing a communication between distinct grammatical constraints. For instance, in (26), a grammatical constraint can safely rule out <se,PRO:3> on the grounds of the presence of il, and another constraint can use the other tag, <si,CONJ>, as a mark of clause boundary. Stating the two grammatical constraints separately is convenient, since there is no connection between them, and only the tag for s' makes it possible for the second to use the result of the first. Tags for grammatical words, thus, are a convenient instrument for efficient ambiguity reduction. In addition, several grammatical words are ambiguous with lexical words: the pronoun and determiner certains is a homograph of plural forms of an adjective, the conjunction or of a noun, the preposition entre of forms of the verb entrer...
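The way two independently stated constraints communicate through the tag of s' can be sketched on the beginning of (26). The first constraint is the one described above. The second is a hypothetical stand-in for "a constraint that uses <si,CONJ> as a mark of clause boundary": it is assumed here to forbid a preposition immediately before that boundary. The tag strings and function names are illustrative.

```python
from itertools import product

# Toy lookup result for the start of (26); the tag inventory is illustrative.
SENTENCE = [
    ("public", ["N:ms"]),
    ("entre", ["PREP", "V:3s"]),
    ("s'", ["si,CONJ", "se,PRO:3"]),
    ("il", ["PRO:3ms"]),
]

def bigrams(analysis):
    """Adjacent tag pairs of one analysis."""
    return list(zip(analysis, analysis[1:]))

def no_se_before_il(analysis):
    """Rule out <se,PRO:3> when the next tag is the subject pronoun il."""
    return ("se,PRO:3", "PRO:3ms") not in bigrams(analysis)

def no_prep_before_boundary(analysis):
    """Hypothetical constraint: no preposition right before a clause boundary."""
    return ("PREP", "si,CONJ") not in bigrams(analysis)

def filter_analyses(sentence, constraints):
    """Keep the analyses that satisfy every constraint in the list."""
    candidates = product(*(tags for _, tags in sentence))
    return [a for a in candidates if all(c(a) for c in constraints)]

both = filter_analyses(SENTENCE, [no_se_before_il, no_prep_before_boundary])
print(both)
```

Neither constraint mentions the other. Applied alone, each leaves several analyses; applied together, they leave a single one, in which entre is a verb and s' a conjunction.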
Constructing ambiguity reduction grammars is an important activity because it contributes to the laborious, progressive elaboration of formal grammars of languages and testing solutions on texts. In the long term, this could even become the most interesting aspect of ambiguity reduction grammars.
8. Necessity of formalisms for ambiguity reduction
Ambiguity reduction involves:
- describing grammatical constraints during the construction of the system,
- automatically selecting or filtering the readings in conformity with the constraints, when the system is applied to texts.
The description of constraints and the selection of readings cannot be implemented without a description formalism that defines conventions for formalizing and interpreting grammatical constraints.
Linguistic description in general can be more or less formal. Lack of formalism can make description inaccurate, incomplete, incoherent or inapplicable. A high level of formalism, however, does not ensure by itself the adequacy of content: on the contrary, if it means a proliferation of formal details, it becomes an obstacle to description. The issue of formalization or encoding, therefore, conditions the success of the enterprise.
The issue of automating the selection of readings, though it is more computer-related, is not independent of the interpretation of the formal description, since the criteria of selection are the grammatical constraints. One of the functions of a formalism for ambiguity reduction is to provide a basis for implementing a selection module checking the constraints described.
We will discuss and exemplify the reasons why a formalism, or formalisms, are required for reducing ambiguity.
In the absence of a specific formalism, it is impossible to define what would constitute a correct grammatical constraint, since the interpretation of constraints will remain vague. Consider the following grammatical constraint:
(27) Immediately after an <N:ms>, an <A> in the singular is in the masculine
(27) may be correct or incorrect... because natural languages are so ambiguous. If it is interpreted in the sense that it removes all analyses in which a tag <N:ms> occurs immediately before a tag <A:fs>, it can have some relevancy; it correctly resolves, for example, gender ambiguity in the following sentence:
On voit apparaître un fichier supplémentaire
If the interpretation is that (27) applies after any word that can be an <N:ms>, the rule is wrong, since it erroneously discards the analysis of sincère in the feminine in the following sentence:
Voilà une déclaration que l'opinion en général juge sincère
because of the ambiguity of juge between verb and noun. The problem here stems from the fact that the grammatical constraint is not specified accurately. Since the exact nature of the grammatical constraint is unclear, it is impossible to tell whether it is valid.
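The two readings of (27) can be made explicit in a small sketch restricted to the pair juge sincère. The tag sets and the function names tag_level and word_level are assumptions for illustration, and the adjective is taken to have only singular readings.

```python
from itertools import product

# Toy lookup result for "... juge sincère" (tags are illustrative).
SENTENCE = [("juge", ["N:ms", "V:3s"]), ("sincère", ["A:ms", "A:fs"])]

def all_analyses(sentence):
    """Every tag sequence allowed by dictionary lookup."""
    return set(product(*(tags for _, tags in sentence)))

def is_bad_pair(left, right):
    """<N:ms> immediately followed by a feminine singular <A>."""
    return left == "N:ms" and right == "A:fs"

def tag_level(sentence):
    """Interpretation 1: reject analyses containing the bad tag pair."""
    return {a for a in all_analyses(sentence)
            if not any(is_bad_pair(x, y) for x, y in zip(a, a[1:]))}

def word_level(sentence):
    """Interpretation 2: after any word that *can* be <N:ms>, drop <A:fs>."""
    pruned = [sentence[0]]
    for (_, prev_tags), (word, tags) in zip(sentence, sentence[1:]):
        if "N:ms" in prev_tags:
            tags = [t for t in tags if t != "A:fs"]
        pruned.append((word, tags))
    return all_analyses(pruned)

RIGHT_READING = ("V:3s", "A:fs")  # juge as verb, sincère in the feminine
print(RIGHT_READING in tag_level(SENTENCE))   # True
print(RIGHT_READING in word_level(SENTENCE))  # False
```

The same informal statement thus yields one filter that preserves the right reading and another that discards it, depending only on the interpretation convention.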
The main function of an ambiguity reduction formalism, then, is very simple: ensuring that the interpretation of constraints described is clear and precise, so that it is possible to define what a correct constraint is (i.e. a constraint which is in accordance with the linguistic analyses underlying the project), and that the operation of filtering according to correct constraints can be automated.
It is logical and legitimate to try to design the simplest possible formalism. For instance, a minimal formalism, almost an absence of formalism, would consist in stating grammatical constraints in the form of mere sets of grammatical sequences, exactly as if they were local grammars describing families of forms. Adjective-noun agreement could be written as:
(28) <N:ms> <A:ms> + <N:mp> <A:mp> + <N:fs> <A:fs> + <N:fp> <A:fp>
and compound tenses of verbs with auxiliary avoir would appear as:
etc. Unfortunately, the idea is too simple. Descriptions of forms do not explicitly define grammatical constraints because they do not explicitly rule out any analyses. In this respect, (28) is even less explicit than (27). In order to interpret a description of forms like (28) as a grammatical constraint unambiguously, we would need conventions for its interpretation. Several conventions would be possible; they would lead to distinct interpretations, i.e., to distinct grammatical constraints. Different interpretation conventions would have advantages and drawbacks for description writers; in order to be used, they would have to be implemented by different computer programs. They would therefore be distinct formalisms, unavoidably more complex than the initial idea.
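To illustrate how the choice of convention changes the constraint, the sketch below applies two hypothetical conventions to the description (28). Neither convention is claimed to be that of an existing system; the names convention_a and convention_b and the candidate tags for juge sincère are assumptions.

```python
from itertools import product

# The four sequences listed in (28).
LISTED = {("N:ms", "A:ms"), ("N:mp", "A:mp"), ("N:fs", "A:fs"), ("N:fp", "A:fp")}

def pairs(analysis):
    """Adjacent tag pairs of one analysis."""
    return list(zip(analysis, analysis[1:]))

def convention_a(analysis):
    """Hypothetical convention A: every adjacent <N><A> pair must be listed."""
    return all(p in LISTED for p in pairs(analysis)
               if p[0].startswith("N:") and p[1].startswith("A:"))

def convention_b(analysis):
    """Hypothetical convention B: at least one listed sequence must occur."""
    return any(p in LISTED for p in pairs(analysis))

# Candidate analyses of "juge sincère" (noun or verb, masculine or feminine).
candidates = list(product(["N:ms", "V:3s"], ["A:ms", "A:fs"]))
kept_a = [a for a in candidates if convention_a(a)]
kept_b = [a for a in candidates if convention_b(a)]
print(len(kept_a), len(kept_b))
```

Under convention A only the disagreeing noun-adjective analysis is rejected; under convention B every analysis without a listed sequence is rejected, including those in which juge is a verb. The same description of forms therefore defines two different grammatical constraints.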
The preceding example stresses the fact that explicitly specifying the effects of the application of a grammar to a text is the responsibility of a formalism for ambiguity reduction. As a matter of fact, a formalism cannot be used satisfactorily without such a specification, which predicts which types of analyses are actually rejected when one applies grammatical constraints stated with the formalism. A specification serves as a common reference point for grammar writers, who describe constraints, and for program writers, who implement and maintain the system. The usefulness of a specification becomes obvious in the case of an error in one of these two parts of the system, either in the linguistic data or in the program. When an error is discovered, the specification is used in order to tell which part it stems from: for instance, it may happen that the procedure of filtering performed by the program disagrees with the specification. As long as the error is not corrected, the specification is maintained as a common reference point, and the other part is not disturbed. In the absence of a specification, both parts of the system are disturbed: the system does not behave in conformity with some implicit reference point, but its future behaviour is unknown until the error is corrected. Software engineers systematically resort to specifications for the construction of complex computer systems: they are a way of organizing communication between teams and between complex parts of a system. Obviously, a specification does not resolve all problems: it can change, which disturbs the whole system, but this happens less often than programming mistakes or errors in linguistic description. Thus, an ambiguity reduction formalism must ensure that grammatical constraints stated with it are accurately and unambiguously defined.
The main quality of a formalism is probably its simplicity, but we saw that the problem involved is not so simple that one could manage without any formalism at all. The convenience that a disambiguation formalism offers to writers of grammatical constraints is the main basis for defining criteria of quality:
- simplicity and readability,
- predictability of the effect of applying a grammar to a text,
- possibility of organizing a grammar into small components relatively independent of one another,
- possibility of automating the specified operations efficiently.
We will mention three other features that, according to our experimentations in constructing and using ambiguity reduction grammars (É. Laporte & A. Monceaux 1999), are relevant to the quality of formalisms.
- Punctuation tags and variable tags, i.e. tags that represent categories of words, like parts of speech, e.g. <N>, are indispensable for encoding the description of the context of many constraints. All examples of constraints stated above, for instance, include tags of this type, except constraint (17).
- A formalism may have more or less expressive power, in the sense that it provides more or less technical means of stating grammatical constraints of such or such kind. The notion of expressive power is exemplified in section 9.
- In the case of overlap between the application zones of one or several grammatical constraints in a text12, the effect of applying the constraints depends on the specification associated with the formalism. Two modes of operation are imaginable and have already been tested in actual systems: the application of a constraint to a given zone may depend, or not, on the possible presence of an overlap with another application zone. The first mode is that of systems that, in the case of overlap between two application zones, only check the constraint in one of the two zones, e.g. the left one. With this convention, two grammatical constraints can yield a correct result when they are applied separately, but be incompatible when two application zones happen to overlap. It is even theoretically possible that, when a constraint is added to an existing ambiguity reduction grammar, the rate of ambiguity reduction decreases, since the new constraint blocks some instances of application of constraints already present in the grammar. With this mode of operation, the result of applying a grammatical constraint does not depend only on the context explicitly described in it, but also on the contexts of other constraints that can have some overlap with it. The presence of overlaps extends the context that determines the result of application, and de facto changes the interpretation of the grammatical constraint as a function of interactions between neighbouring application zones.
In the second mode of operation, the formalism itself ensures that the result of applying each constraint to each application zone is independent of the rest of the text, on the left as well as on the right. This does not preclude the apparent "co-operation" between rules exemplified by sentence (26): the grammatical constraint that uses <si,CONJ> as a mark of clause boundary does not apply to analyses with <se,PRO:3>, and, therefore, does not remove them; they are independently removed by the other constraint, due to the presence of the verb on the left.
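The difference between the two modes can be sketched by looking only at which application zones get checked. Zones are assumed to be already computed and are given as hypothetical (start, end, rule) triples over token positions, with end exclusive; the rule names R1 and R2 are placeholders.

```python
def overlaps(z1, z2):
    """Two zones (start, end, rule), end exclusive, share at least one token."""
    return z1[0] < z2[1] and z2[0] < z1[1]

def leftmost_only(zones):
    """First mode: among overlapping zones, check only the leftmost one."""
    kept = []
    for zone in sorted(zones):
        if not any(overlaps(zone, k) for k in kept):
            kept.append(zone)
    return kept

def independent(zones):
    """Second mode: every zone is checked, whatever its neighbours."""
    return sorted(zones)

old_grammar = [(2, 4, "R1")]                # zones found by the existing rule
new_grammar = old_grammar + [(1, 3, "R2")]  # a rule is added, its zone overlaps

print(leftmost_only(old_grammar))
print(leftmost_only(new_grammar))
print(independent(new_grammar))
```

In the first mode, adding R2 prevents R1 from being checked on its zone, which is how an added constraint can lower the rate of ambiguity reduction. In the second mode, both zones are checked.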
9. A typology of ambiguity reduction formalisms
We have mentioned that an ambiguity reduction formalism simultaneously defines conventions for the description of grammatical constraints and for the result of the application of the constraints to analyses generated by dictionary lookup. These two aspects are closely connected.
Approaches to filtering, for example, can be positive or negative, depending on whether you describe analyses to be kept or analyses to be discarded. This option has consequences not only on the description of constraints but also on the specification of the filtering. Both options are theoretically possible, and have been implemented in actual systems. From the point of view of application, they do not provide the same advantages.
In the light of several experiments carried out at IGM in recent years, we will present a typology of ambiguity reduction formalisms and comment upon their respective potential.
We will classify ambiguity reduction formalisms into two types on the basis of the following criterion of distinction. We will consider the elementary automatic operation by which a given analysis of a given sentence is rejected or preserved as an application of a given rule described in the formalism. Such an elementary decision depends on the conditions encoded in the rule. Conceptually, the whole process of filtering can be viewed as a repetition of instances of this elementary decision mechanism. Depending on the formalism used, that operation can take into account conditions about: