Taxonomy of xml schema Languages using Formal Language Theory

Download 358.82 Kb.
Date conversion13.05.2016
Size358.82 Kb.
  1   2   3

Taxonomy of XML Schema Languages using Formal Language Theory

On the basis of regular tree languages, we present a formal framework for XML schema languages. This framework helps to describe, compare, and implement such schema languages. Our main results are as follows: (1) four classes of tree languages, namely "local", "single-type", "restrained competition" and "regular"; (2) document validation algorithms for these classes; and (3) classification and comparison of schema languages: DTD, XML-Schema, DSD, XDuce, RELAX Core, and TREX.

Makoto Murata, IBM Tokyo Research Lab./ International University of Japan

Dongwon Lee, UCLA / CSD

Murali Mani, UCLA / CSD

1. Introduction

XML is a meta language for creating markup languages. To represent a particular type of information, we create an XML-based language by designing an inventory of names for elements and attributes. These names are then utilized by application programs dedicated to this type of information.

A schema is a description of such an inventory: a schema specifies permissible names for elements and attributes, and further specifies permissible structures and values of elements and attributes. Advantages of creating a schema are as follows: (1) the schema precisely describes permissible XML documents, (2) computer programs can determine whether or not a given XML document is permitted by the schema, and (3) we can use the schema for creating application programs (by generating code skeletons, for example). Thus, schemas play a very important role in development of XML-based applications.

Several languages for writing schemas, which we call schema languages, have been proposed in the past. Some languages (e.g., DTD) are concerned with XML documents in general; that is, they handle elements and attributes. Other languages are concerned with particular type of information which may be represented by XML; primary constructs for such information are not elements or attributes, but rather constructs specific to that type of information. RDF Schema of W3C is an example of such a schema language. Since primary constructs for RDF information are resources, properties, and statements, RDF Schema is concerned with resources, properties, and statements rather than elements and attributes. In this paper, we limit our concern to schema languages for XML documents in general (i.e., elements and attributes); specifically, we consider DTD [BPS00], XML-Schema [TBMM00], DSD [KMS00], RELAX Core [Mur00b], and TREX [Cla01]. 1 Although schema languages dedicated to particular types of information (e.g., RDF, XTM, and ER) are useful for particular applications, they are outside the scope of this paper.

We believe that providing a formal framework is crucial in understanding various aspects of schema languages and facilitating efficient implementations of schema languages. We use regular tree grammar theory ([Tak75] and [CDG+97]) to formally capture schemas, schema languages, and document validation. Regular tree grammars have recently been used by many researchers for representing schemas or queries for XML and have become the mainstream in this area (see OASIS [OA01] and Vianu [VV01]); in particular, XML Query [CCF+01] of W3C is based on tree grammars.

Our contributions are as follows:

1. We define four subclasses of regular tree grammars and their corresponding languages to describe schemas precisely;

2. We show algorithms for validation of documents against schemas for these subclasses and consider the characteristics of these algorithms (e.g., the tree model .vs. the event model); and

3. Based on regular tree grammars and these validation algorithms, we present a detailed analysis and comparison of a few XML schema proposals and type systems; an XML schema proposal A is more expressive than another proposal B if the subclass captured by A properly includes that captured by B.
The remainder of this paper is organized as follows. In Section 2, we consider related works such as other survey papers on XML schema languages. In Section 3, we first introduce regular tree languages and grammars, and then introduce restricted classes. In Section 4, we introduce validation algorithms for the four classes, and consider their characteristics. In Section 5, on the basis of these observations, we evaluate different XML schema language proposals. Finally, concluding remarks and thoughts on future research directions are discussed in Section 6.

2. Related Work

More than ten schema languages for XML have appeared recently, and [Jel00] and [LC00] attempt to compare and classify such XML schema proposals from various perspectives. However, their approaches are by and large not mathematical so that the precise description and comparison among schema language proposals are not straightforward. On the other hand, this paper first establishes a formal framework based on regular tree grammars, and then compares schema language proposals.

Since Kilho Shin advocated use of tree automata for structured documents in 1992, many researchers have used regular tree grammars or tree automata for XML (see OASIS [OA01] and Vianu [VV01]). However, to the best of our knowledge, no papers have used regular tree grammars to classify and compare schema language proposals. Furthermore, we introduce subclasses of regular tree grammars, and present a collection of validation algorithms dedicated to these subclasses.

XML-Schema Formal Description (formerly called MSL [BFRW01]) is a mathematical model of XML-Schema. However, it is tailored for XML-Schema and is thus unable to capture other schema languages. Meanwhile, our framework is not tailored for a particular schema language. As a result, all schema languages can be captured, although fine details of each schema language are not.

3. Tree Grammars

In this section, as a mechanism for describing permissible trees, we study tree grammars. We begin with a class of tree grammars called "regular", and then introduce three restricted classes called "local", "single-type", and "restrained-competition".

Some readers might wonder why we do not use context-free (string) grammars. Context-free (string) grammars [HU79] represent sets of strings. Successful parsing of strings against such grammars provides derivation trees. This scenario is appropriate for programming languages and natural languages, where programs and natural language text are strings rather than trees. On the other hand, start tags and end tags in an XML document collectively represent a tree. Since traditional context-free (string) grammars are originally designed to describe permissible strings, they are inappropriate for describing permissible trees.

3.1 Regular Tree Grammars and Languages

We borrow the definitions of regular tree languages and tree automata in [CDG+97], but allow trees with "infinite arity"; that is, we allow a node to have any number of subordinate nodes, and allow the right-hand side of a production rule to have a regular expression over non-terminals.

Definition 1. (Regular Tree Grammar) A regular tree grammar (RTG) is a 4-tuple G = (N, T, S, P), where:

N is a finite set of non-terminals,

T is a finite set of terminals,

S is a set of start symbols, where S is a subset of N.

P is a finite set of production rules of the form X ® a r, where X Î N, a Î T, and r is a regular expression over N; X is the left-hand side, a r is the right-hand side, and r is the content model of this production rule.

Example 1. The following grammar G1 = (N, T, S, P) is a regular tree grammar. The left-hand side, right-hand side, and content model of the first production rule is Doc, Doc (Para1, Para2*), and (Para1, Para2*), respectively.

N = {Doc, Para1, Para2, Pcdata}

T = {doc, para, pcdata}

S = {Doc}

P = {Doc ® doc (Para1, Para2*), Para1 ® para (Pcdata),

Para2 ®para (Pcdata), Pcdata ® pcdata ε}
We represent every text value by the node pcdata for convenience.

Without loss of generality, we can assume that no two production rules have the same non-terminal in the left-hand side and the same terminal in the right-hand side at the same time. If a regular tree grammar contains such production rules, we only have to merge them into a single production rule. We also assume that every non-terminal is either a start symbol or occurs in the content model of some production rule (in other words, no non-terminals are useless).

We have to define how a regular tree grammar generates a set of trees over terminals. We first define interpretations.

Definition 2. (Interpretation) An interpretation I of a tree t against a regular tree grammar G is a mapping from each node e in t to a non-terminal, denoted I(e), such that:

I(eroot) is a start symbol where eroot is the root of t, and

for each node e and its subordinates e0 , e1 , ..., ei ,there exists a production rule X ® a r such that

I(e) is X,

the terminal (label) of e is a, and

I(e0) I(e1) ... I(ei) matches r.

Now, we are ready to define generation of trees from regular tree grammars, and regular tree languages.

Definition 3. (Generation) A tree t is generated by a regular tree grammar G if there is an interpretation of t against G.

Example 2. An instance tree generated by G1 , and its interpretation against G1 are shown in Figure 1.
Figure 1.

An instance tree generated by G1, and its interpretation against G1. We use unique labels to represent the nodes in the instance tree.

Definition 4. (Regular Tree Language) A regular tree language is the set of trees generated by a regular tree grammar.

3.2 Local Tree Grammars and Languages

We first define competition of non-terminals, which makes validation difficult. Then, we introduce a restricted class called ``local'' by prohibiting competition of non-terminals [Tak75]. This class roughly corresponds to DTD.

Definition 5. (Competing Non-Terminals) Two different non-terminals A and B are said competing with each other if

one production rule has A in the left-hand side,

another production rule has B in the left-hand side, and

these two production rules share the same terminal in the right-hand side.

Example 3. Consider a regular tree grammar G3 = (N, T, S, P), where:
N = {Book, Author1, Son, Article, Author2, Daughter}

T = {book, author, son, daughter}

S = {Book, Article}

P = {Book ® book (Author1), Author1 ® author (Son), Son ® son ε,

Article ® article (Author2), Author2 ® author (Daughter),

Daughter ®daughter ε}

Author1 and Author2 compete with each other, since the production rule for Author1 and that for Author2 share the terminal author in the right-hand side. There are no other competing non-terminal pairs in this grammar.

Definition 6. (Local Tree Grammar and Language) A local tree grammar (LTG) is a regular tree grammar without competing non-terminals. A tree language is a local tree language if it is generated by a local tree grammar.

Example 4. G3 in Example 3 is not local, since Author1 and Author2 compete with each other.

Example 5. The following grammar G5 = (N, T, S, P) is a local tree grammar:

N = {Book, Author1, Son, Pcdata}

T = {book, author, son, pcdata}

S = {Book}

P = {Book ® book (Author1), Author1 ® author (Son),

Son ® son Pcdata, Pcdata ® pcdata ε}
Observe that local tree grammars and extended context-free (string) grammars (ECFG) look similar. However, the former describes sets of trees, while the latter describes sets of strings. The parse tree set of an ECFG is a local tree language.

3.3 Single-Type Tree Grammars and Languages

Next, we introduce a less restricted class called ``single-type'' by prohibiting competition of non-terminals within a single content model. This class roughly corresponds to XML-Schema.

Definition 7. (Single-Type Tree Grammar and Language) A a single-type tree grammar is a regular tree grammar such that

for each production rule, non-terminals in its content model do not compete with each other, and

start symbols do not compete with each other.

A tree language is a single-type tree language if it is generated by a single-type tree grammar.

Example 6. The grammar G1 shown in Example 1 is not single-type. Observe that non-terminals Para1 and Para2 compete, and they occur in the content model of the production rule for Doc.

Example 7. Consider G3 in Example 3. This grammar is a single-type tree grammar since no production rules have more than one non-terminal in their content models and thus there cannot be any competing non-terminals in the content models.

3.4 Restrained-Competition Tree Grammars and Languages

Now, we introduce an even less restricted class called ``restrained-competition''. The key idea is to use content models for restraining competition of non-terminals.

Definition 8. (Restraining) A content model r restrains competition of two competing non-terminals A and B if, for any sequences U, V, W of non-terminals, r do not generate both U A V and U B W.

Definition 9. (Restrained-Competition Tree Grammar and Language) A restrained-competition tree grammar is a regular tree grammar such that

for each production rule, its content model restrains competition of non-terminals occurring in the content model, and

start symbols do not compete with each other.

A tree language is a restrained-competition tree language if it is generated by a restrained-competition tree grammar.

Example 8. The grammar G1 is a restrained-competition tree grammar. Non-terminals Para1 and Para2 compete with each other, and they both occur in the content model of the production rule for Doc. However, this content model (Para1, Para2*) restrains the competition between Para1 and Para2, since Para1 may occur only as the first non-terminal and Para2 may occur only as the non-first non-terminal.

Example 9. The following grammar G9is not a restrained-competition tree grammar. Observe that the content model (Para1*, Para2*) does not restrain the competition of non-terminals Para1 and Para2. For example, suppose that U = V = W = ε. Then, both U Para1 V and U Para2 W match the content model.

N = {Doc, Para1, Para2, Pcdata}

T = {doc, para, pcdata}

S = {Doc}

P = {Doc ® doc (Para1*, Para2*), Para1 ® para (Pcdata),

Para2 ® para (Pcdata), Pcdata ® pcdata ε}

3.5 Summary of Examples

Each of the example grammars demonstrates one of the classes of tree grammars. The following table shows to which class each example belongs.
























  1   2   3

The database is protected by copyright © 2016
send message

    Main page