NOTE: This is really not a HTML version of the paper; rather, this
is merely a HTML-lized XML version of the paper so that it can be
viewed on browser that do not support XML format yet.
<?xml version="1.0" encoding="UTF-8"?>
<!-- edited with XML Spy v3.5 NT (http://www.xmlspy.com) by Dongwon Lee
(UCLA) -->
<!DOCTYPE paper SYSTEM "extremepaper.xml.dtd" [
<!ENTITY instance-tree SYSTEM "instance-tree.eps" NDATA eps>
<!ENTITY set SYSTEM "set.eps" NDATA eps>
<!ENTITY interpretation SYSTEM "interpretation.eps" NDATA eps>
]>
<paper secnumbers="1">
<front>
<title>Taxonomy of XML Schema Languages using Formal
Language Theory</title>
<author contact="1">
<fname>Makoto</fname>
<surname>Murata</surname>
<address>
<affil>IBM Tokyo Research Lab./ International University of Japan</affil>
<email>mmurata@trl.ibm.co.jp</email>
</address>
<bio>
<para>MURATA Makoto (FAMILY Given) is a researcher working on document processing.</para>
</bio>
</author>
<author contact="0">
<fname>Dongwon</fname>
<surname>Lee</surname>
<address>
<affil>UCLA / CSD</affil>
<email>dongwon@cs.ucla.edu</email>
</address>
<bio>
<para>Dongwon Lee is a Ph.D candidate majoring in database systems with a focus on XML.</para>
</bio>
</author>
<author contact="0">
<fname>Murali</fname>
<surname>Mani</surname>
<address>
<affil>UCLA / CSD</affil>
<email>mani@cs.ucla.edu</email>
</address>
<bio>
<para>Murali Mani is a Ph.D candidate majoring in database systems with a focus on XML. He is partially supported by the NSF grants - 0086116, 0085773, 9817773.</para>
</bio>
</author>
<keywords>
<keyword>Schema</keyword>
<keyword>XML</keyword>
<keyword>Tree Automaton</keyword>
<keyword>Regular Tree Grammar</keyword>
<keyword>Validation</keyword>
<keyword>Formal Language</keyword>
<keyword>Expressive Power</keyword>
<keyword>Taxonomy</keyword>
</keywords>
<abstract>
<para>On the basis of regular tree languages, we present a formal framework for XML schema languages. This framework helps to describe, compare, and implement such schema languages. Our main results are as follows: (1) four classes of tree languages, namely "local", "single-type", "restrained competition" and "regular"; (2) document validation algorithms for these classes; and (3) classification and comparison of schema languages: DTD, XML-Schema, DSD, XDuce, RELAX Core, and TREX.</para>
</abstract>
</front>
<body>
<!-- section 1-->
<section id="sec-introduction">
<title>Introduction</title> <para>XML is a
meta language for creating markup languages. To represent a
particular type of information, we create an XML-based language by
designing an inventory of names for elements and attributes. These
names are then utilized by application programs dedicated to this type
of information.</para>
<para>A <highlight style="ital">schema</highlight> is a description of such an inventory: a schema specifies permissible names for elements and attributes, and further specifies permissible structures and values of elements and attributes. Advantages of creating a schema are as follows: (1) the schema precisely describes permissible XML documents, (2) computer programs can determine whether or not a given XML document is permitted by the schema, and (3) we can use the schema for creating application programs (by generating code skeletons, for example). Thus, schemas play a very important role in development of XML-based applications.</para>
<para>Several languages for writing schemas, which we call <highlight
style="ital">schema languages</highlight>, have been proposed in the
past. Some languages (e.g., DTD) are concerned with XML documents in
general; that is, they handle elements and attributes. Other
languages are concerned with particular type of information which
<highlight style="ital">may</highlight> be represented by XML; primary
constructs for such information are not elements or attributes, but
rather constructs specific to that type of information. RDF Schema of
W3C is an example of such a schema language. Since primary constructs
for RDF information are resources, properties, and statements, RDF
Schema is concerned with resources, properties, and statements rather
than elements and attributes. In this paper, we limit our concern to
schema languages for XML documents in general (i.e., elements and
attributes); specifically, we consider DTD <bibref
refloc="WWW-BPS-00"/>, XML-Schema <bibref refloc="WWW-XSD-1-00"/>, DSD
<bibref refloc="WWW-DSD-00"/>, RELAX Core <bibref
refloc="Web-RELAX-00"/>, and TREX <bibref refloc="WWW-TREX-01"/>.
<ftnote id="fn1"><para>
Throughout the paper, we differentiate two terms -- XML schema(s) and
XML-Schema. The former refers to a general term for schema in XML model,
while the latter refers to one particular kind of XML schema language
proposed by W3C <bibref refloc="WWW-XSD-1-00"/>.</para></ftnote>
Although schema languages dedicated to particular types of
information (e.g., RDF, XTM, and ER) are useful for particular
applications, they are outside the scope of this paper.</para>
<para>We believe that providing a formal framework is crucial in
understanding various aspects of schema languages and facilitating
efficient implementations of schema languages. We use regular tree
grammar theory (<bibref refloc="IC-Takahashi-75"/> and <bibref
refloc="BOOK-TATA-97"/>) to formally capture schemas, schema
languages, and document validation. Regular tree grammars have
recently been used by many researchers for representing schemas or
queries for XML and have become the mainstream in this area (see OASIS
<bibref refloc="OASIS"/> and Vianu <bibref refloc="PODS-VV-01"/>); in
particular, XML Query <bibref refloc="WWW-XML-Query-01"/> of W3C is
based on tree grammars.</para>
<para>Our contributions are as follows:
<seqlist>
<li>
<para>We define four subclasses of regular tree grammars
and their corresponding languages to describe schemas
precisely;</para>
</li>
<li>
<para>We show algorithms for validation of documents against
schemas for these subclasses and consider the characteristics of these
algorithms (e.g., the tree model .vs. the event model); and</para>
</li>
<li>
<para>Based on regular tree grammars and these validation
algorithms, we present a detailed analysis and comparison of a few
XML schema proposals and type systems; an XML schema proposal
<highlight style="ital">A</highlight> is more expressive than another
proposal <highlight style="ital">B</highlight> if the subclass
captured by <highlight style="ital">A</highlight> properly includes
that captured by <highlight style="ital">B</highlight>.</para>
</li>
</seqlist>
</para>
<para>The remainder of this paper is organized as follows. In Section
<xref type="number" refloc="sec-related-work"/>, we consider related
works such as other survey papers on XML schema languages. In Section
<xref type="number" refloc="sec-tree-regular"/>, we first introduce
regular tree languages and grammars, and then introduce restricted
classes. In Section <xref type="number" refloc="sec-op"/>, we
introduce validation algorithms for the four classes, and consider
their characteristics. In Section <xref type="number"
refloc="sec-evaluation"/>, on the basis of these observations, we
evaluate different XML schema language proposals. Finally, concluding
remarks and thoughts on future research directions are discussed in
Section <xref type="number" refloc="sec-conclusion"/>. </para>
</section>
<!-- section 2-->
<section id="sec-related-work">
<title>Related Work</title>
<para>More than ten schema languages
for XML have appeared recently, and <bibref refloc="WWW-Jell-00"/> and
<bibref refloc="SIGMOD-REC-LC-00"/> attempt to compare and classify
such XML schema proposals from various perspectives. However, their
approaches are by and large not mathematical so that the precise
description and comparison among schema language proposals are not
straightforward. On the other hand, this paper first establishes a
formal framework based on regular tree grammars, and then compares
schema language proposals.</para>
<para>Since Kilho Shin advocated use of tree automata for structured
documents in 1992, many researchers have used regular tree grammars or
tree automata for XML (see OASIS <bibref refloc="OASIS"/> and Vianu
<bibref refloc="PODS-VV-01"/>). However, to the best of our
knowledge, no papers have used regular tree grammars to classify and
compare schema language proposals. Furthermore, we introduce
subclasses of regular tree grammars, and present a collection of
validation algorithms dedicated to these subclasses.</para>
<para>XML-Schema Formal Description (formerly called MSL <bibref
refloc="WWW-BFRW-01"/>) is a mathematical model of XML-Schema.
However, it is tailored for XML-Schema and is thus unable to capture
other schema languages. Meanwhile, our framework is not tailored for
a particular schema language. As a result, all schema languages can
be captured, although fine details of each schema language are
not.</para>
</section>
<!-- section 3-->
<section id="sec-tree-regular">
<title>Tree Grammars</title>
<para>In this section, as a mechanism for describing permissible trees, we study tree grammars. We begin with a class of tree grammars called "regular", and then introduce three restricted classes called "local", "single-type", and "restrained-competition".</para>
<para>Some readers might wonder why we do not use context-free (string) grammars. Context-free (string) grammars <bibref refloc="BOOK-HU-79"/> represent sets of <highlight style="ital">strings</highlight>. Successful parsing of strings against such grammars provides derivation trees. This scenario is appropriate for programming languages and natural languages, where programs and natural language text are strings rather than trees. On the other hand, start tags and end tags in an XML document collectively represent a <highlight style="ital">tree</highlight>. Since traditional context-free (string) grammars are originally designed to describe permissible strings, they are inappropriate for describing permissible trees.</para>
<subsec1>
<title>Regular Tree Grammars and Languages</title>
<para>We borrow the definitions of regular tree languages and tree automata in <bibref refloc="BOOK-TATA-97"/>, but allow trees with "infinite arity"; that is, we allow a node to have any number of subordinate nodes, and allow the right-hand side of a production rule to have a regular expression over non-terminals.</para>
<para id="def-rtg1">
<highlight style="bold">Definition 1. </highlight>
<highlight style="bital">(Regular Tree Grammar)</highlight>
A regular tree grammar (RTG) is a 4-tuple <highlight style="ital">G =
(N, T, S, P)</highlight>, where:
<randlist>
<li>
<para><highlight style="ital">N</highlight> is a finite set of <highlight style="ital">non-terminals</highlight>,</para>
</li>
<li>
<para><highlight style="ital">T</highlight> is a finite set of <highlight style="ital">terminals</highlight>,</para>
</li>
<li>
<para><highlight style="ital">S</highlight> is a set of <highlight
style="ital">start symbols</highlight>, where <highlight style="ital">S</highlight> is a subset of <highlight style="ital">N</highlight>.</para>
</li>
<li>
<para><highlight style="ital">P</highlight> is a finite set of production rules of the form <sgml>X → a r</sgml>, where <sgml>X ∈ N</sgml>, <sgml>a ∈ T</sgml>, and <sgml>r</sgml> is a regular expression over <highlight style="ital">N</highlight>; <highlight
style="ital">X</highlight> is the <highlight style="ital">left-hand
side</highlight>, <sgml>a r</sgml> is the <highlight
style="ital">right-hand side</highlight>, and <sgml>r</sgml> is the
<highlight style="ital">content model</highlight> of this production
rule.</para>
</li>
</randlist>
</para>
<para id="ex-tree-regular">
<highlight style="bold">Example 1.</highlight> The following grammar <highlight style="ital">G<sub>1</sub> = (N, T, S, P)</highlight> is a regular tree grammar. The left-hand side, right-hand side, and content model of the first production rule is <sgml>Doc</sgml>, <sgml>Doc (Para1, Para2*)</sgml>, and <sgml>(Para1, Para2*)</sgml>, respectively.
<sgml.block>
N = {Doc, Para1, Para2, Pcdata}
T = {doc, para, pcdata}
S = {Doc}
P = {Doc → doc (Para1, Para2*), Para1 → para (Pcdata),
Para2 →para (Pcdata), Pcdata → pcdata ε}
</sgml.block></para>
<para>We represent every text value by the node <sgml>pcdata</sgml> for convenience.</para>
<para>Without loss of generality, we can assume that no two production rules have the same non-terminal in the left-hand side and the same terminal in the right-hand side at the same time. If a regular tree grammar contains such production rules, we only have to merge them into a single production rule. We also assume that every non-terminal is either a start symbol or occurs in the content model of some production rule (in other words, no non-terminals are useless).</para>
<para>We have to define how a regular tree grammar generates a set of trees over terminals. We first define interpretations.</para>
<para id="def-int">
<highlight style="bold">Definition 2. </highlight>
<highlight style="bital">(Interpretation)</highlight> An
interpretation <highlight style="ital">I</highlight> of a tree
<highlight style="ital">t</highlight> against a regular tree grammar
<highlight style="ital">G</highlight> is a mapping from each node
<highlight style="ital">e</highlight> in <highlight
style="ital">t</highlight> to a non-terminal, denoted <highlight
style="ital">I(e)</highlight>, such that:<randlist>
<li>
<para><highlight style="ital">I(e<sub>root</sub>)</highlight> is a
start symbol where <highlight style="ital">e<sub>root</sub>
</highlight> is the root of <highlight style="ital">t</highlight>, and</para>
</li>
<li>
<para>for each node <highlight style="ital">e</highlight> and
its subordinates <highlight style="ital">e<sub>0</sub>
</highlight>, <highlight style="ital">e<sub>1</sub>
</highlight>, ..., <highlight style="ital">e<sub>i</sub>
</highlight>,there exists a production rule <sgml>X
→ a r</sgml> such that <randlist>
<li>
<para><highlight style="ital">I(e)</highlight> is <sgml>X</sgml>,
</para>
</li>
<li>
<para>the terminal (label) of <highlight
style="ital">e</highlight> is <highlight style="ital">a</highlight>,
and </para>
</li>
<li>
<para><highlight style="ital">I(e<sub>0</sub>)</highlight>
<highlight style="ital">I(e<sub>1</sub>)</highlight> ...
<highlight style="ital">I(e<sub>i</sub>)</highlight> matches <highlight
style="ital">r</highlight>.
</para>
</li>
</randlist>
</para>
</li>
</randlist>
</para>
<para>Now, we are ready to define generation of trees from regular tree
grammars, and regular tree languages.</para>
<para id="def-generation">
<highlight style="bold">Definition 3.</highlight>
<highlight style="bital">(Generation)</highlight> A tree <highlight style="ital">t</highlight>
is generated by a regular tree grammar <highlight style="ital">G</highlight> if there is
an interpretation of <highlight style="ital">t</highlight> against <highlight style="ital">G</highlight>.
</para>
<para id="ex-generation">
<highlight style="bold">Example 2. </highlight>
An instance tree generated by <highlight style="ital">G<sub><highlight style="ital">1</highlight></sub>
</highlight>, and its interpretation against <highlight style="ital">G<sub><highlight style="ital">1</highlight></sub></highlight> are shown in Figure <xref type="number" refloc="fig-interpretation"/>.
</para>
<figure id="fig-interpretation">
<caption>
<para>An instance tree generated by <highlight style="ital">G<sub><highlight style="ital">1</highlight></sub></highlight>, and its interpretation against <highlight style="ital">G<sub><highlight style="ital">1</highlight></sub></highlight>. We use unique labels to represent the nodes in the instance tree.
</para>
</caption>
<graphic figname="interpretation" scale="100"/>
</figure>
<para id="def-regular-tree-lang">
<highlight style="bold">Definition 4.</highlight>
<highlight style="bital">(Regular Tree Language)</highlight> A <highlight style="ital">regular tree language</highlight> is the set of trees generated by a regular tree grammar.
</para>
</subsec1>
<subsec1>
<title>Local Tree Grammars and Languages</title>
<para>We first define competition of non-terminals, which makes validation
difficult. Then, we introduce a restricted class called ``local''
by prohibiting competition of non-terminals <bibref
refloc="IC-Takahashi-75"/>. This class roughly corresponds to DTD.</para>
<para id="def-competing-nt">
<highlight style="bold">Definition 5. </highlight>
<highlight style="bital">(Competing Non-Terminals) </highlight>
Two different non-terminals <highlight style="ital">A</highlight> and <highlight style="ital">B</highlight> are said <highlight
style="ital">competing</highlight> with each other if
<randlist>
<!--
<li>
<para><highlight style="ital">A</highlight> and <highlight
style="ital">B</highlight> are different, </para>
</li>
-->
<li>
<para>one production rule has <highlight
style="ital">A</highlight> in the
left-hand side, </para>
</li>
<li>
<para>another production rule has <highlight
style="ital">B</highlight> in the left-hand
side, and</para>
</li>
<li>
<para>these two production rules share the same terminal
in the right-hand side.</para>
</li>
</randlist>
<!--
</para>
<para id="ex-competing">
-->
<highlight style="bold">Example 3. </highlight>
Consider a regular tree grammar <highlight style="ital">G<sub>3</sub> =
(N, T, S, P)</highlight>, where:
<sgml.block>
N = {Book, Author1, Son, Article, Author2, Daughter}
T = {book, author, son, daughter}
S = {Book, Article}
P = {Book → book (Author1), Author1 → author (Son), Son → son ε,
Article → article (Author2), Author2 → author (Daughter),
Daughter →daughter ε}</sgml.block>
<highlight style="ital">Author1</highlight> and <highlight
style="ital">Author2</highlight> compete with each other, since the
production rule for <highlight style="ital">Author1</highlight> and
that for <highlight style="ital">Author2</highlight> share the
terminal <highlight style="ital">author</highlight> in the
right-hand side. There are no other
competing non-terminal pairs in this grammar.</para>
<para id="def-local-tree-grammar">
<highlight style="bold">Definition 6.</highlight>
<highlight style="bital">(Local Tree Grammar and Language) </highlight>
A <highlight style="ital">local tree grammar (LTG)</highlight> is a regular tree grammar without competing non-terminals. A tree language is a <highlight style="ital">local tree language</highlight> if it is generated by a local tree grammar.
</para>
<para id="ex-tree-non-local">
<highlight style="bold">Example 4. </highlight>
<highlight style="ital">G<sub>3</sub></highlight>
in Example 3 is not local, since <highlight
style="ital">Author1</highlight> and <highlight
style="ital">Author2</highlight> compete with each other.
</para>
<para id="ex-tree-local">
<highlight style="bold">Example 5. </highlight>
The following grammar <highlight style="ital">G<sub>
<highlight style="ital">5</highlight>
</sub> = (N, T, S, P)</highlight> is a local tree grammar:
<sgml.block>
N = {Book, Author1, Son, Pcdata}
T = {book, author, son, pcdata}
S = {Book}
P = {Book → book (Author1), Author1 → author (Son),
Son → son Pcdata, Pcdata → pcdata ε}</sgml.block>
</para>
<para>Observe that local tree grammars and extended context-free (string) grammars (ECFG) look similar. However, the former describes sets of trees, while the latter describes sets of strings. The parse tree set of an ECFG is a local tree language.</para>
</subsec1>
<subsec1>
<title>Single-Type Tree Grammars and Languages</title>
<para>Next, we introduce a less restricted class called ``single-type'' by
prohibiting competition of non-terminals within a single content
model. This class roughly corresponds to XML-Schema.</para>
<para id="def-single-type-grammar">
<highlight style="bold">Definition 7. </highlight>
<highlight style="bital">(Single-Type Tree Grammar and Language)</highlight>
A a <highlight style="ital">single-type tree grammar</highlight> is a
regular tree grammar such that
<randlist>
<li>
<para>for
each production rule, non-terminals in its content model do not
compete with each other, and </para>
</li>
<li>
<para>start symbols do not compete with each other.</para>
</li>
</randlist>
A tree language is a single-type tree language if it is generated by a single-type tree grammar.</para>
<para>
<highlight style="bold">Example 6. </highlight>
The grammar <highlight style="ital">G<sub><highlight style="ital">1</highlight></sub></highlight> shown in Example 1 is not single-type. Observe that non-terminals <highlight style="ital">Para1</highlight> and <highlight style="ital">Para2</highlight> compete, and they occur in the content model of the production rule for <highlight style="ital">Doc</highlight>.</para>
<para id="ex-single-type">
<highlight style="bold">Example 7. </highlight>
Consider <highlight style="ital">G<sub>
<highlight style="ital">3</highlight></sub></highlight>
in Example 3. This
grammar is a single-type tree grammar since no production rules have more than one non-terminal in their content models and thus there cannot be any competing non-terminals in the content models.</para>
</subsec1>
<subsec1>
<title>Restrained-Competition Tree Grammars and Languages</title>
<para>Now, we introduce an even less restricted class called
``restrained-competition''.
The key idea is to use content models for restraining
competition of non-terminals.</para>
<para id="def-Restraining">
<highlight style="bold">Definition 8. </highlight>
<highlight style="bital">(Restraining) </highlight>
A content model <highlight style="ital">r</highlight> restrains competition of two competing
non-terminals <highlight style="ital">A</highlight> and <highlight
style="ital">B</highlight> if, for any sequences <highlight
style="ital">U,
V, W</highlight> of
non-terminals, <highlight style="ital">r</highlight> do not generate
both <highlight style="ital">U A V</highlight> and
<highlight style="ital">U B W</highlight>.
</para>
<para id="def-rc-grammar">
<highlight style="bold">Definition 9. </highlight>
<highlight style="bital">(Restrained-Competition Tree Grammar and Language)
</highlight>A <highlight
style="ital">restrained-competition tree grammar</highlight>
is a regular tree grammar such that
<randlist>
<li>
<para>for each production rule, its content model restrains competition of
non-terminals occurring in the content model, and </para>
</li>
<li>
<para>start symbols do not compete with each other.</para>
</li>
</randlist>
A tree language is a <highlight style="ital">restrained-competition tree language</highlight> if it is generated by a restrained-competition tree grammar.</para>
<!--
</para>
<para id="ex-effortless">
-->
<para>
<highlight style="bold">Example 8. </highlight>
The grammar <highlight style="ital">G<sub>
<highlight style="ital">1
</highlight>
</sub></highlight>
is a restrained-competition tree grammar. Non-terminals <highlight
style="ital">Para1</highlight> and <highlight
style="ital">Para2</highlight> compete with each other, and they both
occur in the content model of the production rule for <highlight
style="ital">Doc</highlight>. However, this content model
<sgml>(Para1, Para2*)</sgml> restrains the competition between
<highlight style="ital">Para1</highlight> and <highlight
style="ital">Para2</highlight>, since <highlight
style="ital">Para1</highlight> may
occur only as the first
non-terminal and <highlight style="ital">Para2</highlight> may occur
only as the non-first non-terminal.</para>
<para id="ex-non-effortless">
<highlight style="bold">Example 9. </highlight>
The following grammar G<sub>
<highlight style="ital">9</highlight></sub>is
not a restrained-competition tree grammar.
Observe that the
content model <sgml>(Para1*, Para2*)</sgml> does not restrain the
competition of non-terminals
<highlight style="ital">Para1</highlight> and <highlight
style="ital">Para2</highlight>. For example, suppose that
<highlight style="ital">U</highlight> =
<highlight style="ital">V</highlight> =
<highlight style="ital">W</highlight> =
<sgml>ε</sgml>. Then, both <highlight style="ital">U Para1
V</highlight> and <highlight style="ital">U Para2 W</highlight> match
the content model.
<sgml.block>
N = {Doc, Para1, Para2, Pcdata}
T = {doc, para, pcdata}
S = {Doc}
P = {Doc → doc (Para1*, Para2*), Para1 → para (Pcdata),
Para2 → para (Pcdata), Pcdata → pcdata ε}
</sgml.block>
<!--
</para>
<para id="def-efforles">
-->
</para>
</subsec1>
<subsec1>
<title>Summary of Examples</title>
<para>Each of the example grammars demonstrates one of the classes of tree grammars. The following table shows to which class each example belongs.</para>
<table>
<tgroup cols="5">
<thead>
<row>
<entry/>
<entry>Regular</entry>
<entry>Restrained-Competition</entry>
<entry>Single-Type</entry>
<entry>Local</entry>
</row>
</thead>
<tbody>
<row>
<entry>
<highlight style="ital">G<sub>
<highlight style="ital">5
</highlight>
</sub>
</highlight>
</entry>
<entry>Yes</entry>
<entry>Yes</entry>
<entry>Yes</entry>
<entry>Yes</entry>
</row>
<row>
<entry>
<highlight style="ital">G<sub>
<highlight style="ital">3
</highlight>
</sub>
</highlight>
</entry>
<entry>Yes</entry>
<entry>Yes</entry>
<entry>Yes</entry>
<entry>No</entry>
</row>
<row>
<entry>
<highlight style="ital">G<sub>
<highlight style="ital">1
</highlight>
</sub>
</highlight>
</entry>
<entry>Yes</entry>
<entry>Yes</entry>
<entry>No</entry>
<entry>No</entry>
</row>
<row>
<entry>
<highlight style="ital">G<sub>
<highlight style="ital">9
</highlight>
</sub>
</highlight>
</entry>
<entry>Yes</entry>
<entry>No</entry>
<entry>No</entry>
<entry>No</entry>
</row>
</tbody>
</tgroup>
</table>
</subsec1>
<subsec1>
<title>Expressiveness and Closure</title>
<para>Having introduced four classes
of grammars and languages, we consider expressiveness and closure.
The interested reader is referred to <bibref refloc="IBM-TR-LMM-00"/>
for the proofs.</para>
<para>First, we consider expressiveness using practical examples. For
each terminal, a local tree grammar provides a single content
model for all elements of this terminal. For example,
<sgml><title></sgml> elements may be subordinate to
<sgml><section></sgml>, <sgml><chapter></sgml>, or
<sgml><author></sgml> elements, but permissible subordinate children are
always the same. A single-type tree grammar does not have this tight
restriction, but cannot allow the first paragraph and the other
paragraphs in a section to have different content models. A
restrained-competition tree grammar lifts this restriction.
Although some regular tree languages cannot be captured by
restrained-competition tree grammars, we are not aware of any
practical examples.</para>
<para>Next, we present theoretical observations about expressiveness.
<randlist>
<li>
<para>Regular
tree grammars are strictly more expressive than restrained-competition
tree grammars. That is, some regular tree grammars cannot be
rewritten as restrained-competition tree grammars.</para>
</li>
<li>
<para>Restrained-competition
tree grammars are strictly more expressive than single-type tree grammars.
That is, some restrained-competition tree grammars cannot be rewritten
as single-type tree grammars.</para>
</li>
<li>
<para>Single-type
tree grammars are strictly more expressive than local tree grammars.
That is, some single-type tree grammars cannot be rewritten as local
tree grammars.</para>
</li>
</randlist>
<!--
</para>
<para>
-->
Next, we present observations on closure. A class of languages is
said to be closed under union/intersection/difference when, for any
two languages in that class, the union/intersection/difference of the
two languages also belongs to the same class.
<randlist>
<li>
<para>The class of regular tree languages is closed under union, intersection, and difference.
</para>
</li>
<li>
<para>The class of local tree languages is closed under intersection, but is not closed under union or difference.
</para>
</li>
<li>
<para>The class of single-type tree languages is closed under intersection, but is not closed under union or difference.
</para>
</li>
<li>
<para>The class of restrained-competition languages is closed under intersection, but is not closed under union or difference.</para>
</li>
</randlist>
</para>
</subsec1>
</section>
<!-- section 4-->
<section id="sec-op">
<title>Document Validation</title>
<para>In this section, we consider algorithms
for document validation. Given a grammar and a tree, these algorithms
determine whether that tree is generated by the grammar, and also
construct interpretations of the tree. All algorithms are based on
variations of tree automata. Some algorithms require the tree model
while others do not.</para>
<subsec1>
<title>Tree Model vs. Event Model</title>
<para>Programs for handling XML documents, including those for validation, are typically based on either the tree model or event model. In the tree model, the XML parser creates a tree in memory. Application programs have access to the tree in memory via some API such as DOM <bibref refloc="WWW-DOM-00"/>, and can traverse the tree any number of times. In the event model, the XML parser does not create a tree in memory, but merely raise events for start tags or end tags. Application programs are notified of such events via some API such as SAX <bibref refloc="WWW-SAX-00"/>. In other words, application programs visit and leave elements in the document in the depth-first manner. Once an application program visits an element in the document, it cannot visit the element again. While both models work fine for small documents, the tree model has performance problems for significantly large documents.
</para>
</subsec1>
<subsec1>
<title>DTD validation and their variations</title>
<para>We begin with validation for
local tree grammars. Since DTDs correspond to local tree grammars,
such validation has been used for years. Remember that a local tree
grammar does not have competing non-terminals (see Definition 6).
Thus, for each element, we can uniquely determine a non-terminal from
the terminal of this element. </para>
<para>Validation for local tree grammars can be easily built in the event
model. Suppose that the XML processor has encountered a start tag for
an element <highlight style="ital">e</highlight> in a given document.
We can easily determine a non-terminal for <highlight
style="ital">e</highlight> from this start tag. Next, suppose that
the XML processors has encountered the end tag for <highlight
style="ital">e</highlight>. Let
<highlight style="ital">e<sub>1</sub></highlight>,
<highlight style="ital">e<sub>2</sub></highlight>, ...
<highlight style="ital">e<sub>i</sub></highlight>
be the child elements of
<highlight style="ital">e</highlight>. Since the start tags for these
elements have been already encountered by the XML parser, we can
assume that we already know the non-terminals
<highlight style="ital">n<sub>1</sub></highlight>,
<highlight style="ital">n<sub>2</sub></highlight>, ...
<highlight style="ital">n<sub>i</sub></highlight>
assigned to
<highlight style="ital">e<sub>1</sub></highlight>,
<highlight style="ital">e<sub>2</sub></highlight>, ...
<highlight style="ital">e<sub>i</sub></highlight>.
We only have to examine if
<highlight style="ital">n<sub>1</sub></highlight>,
<highlight style="ital">n<sub>2</sub></highlight>, ...
<highlight style="ital">n<sub>i</sub></highlight>
matches the content model of the (unique) production rule for
<highlight style="ital">e</highlight>.
</para>
<para>This idea is effected by the following algorithm.</para>
<!-- algo: Validation for local tree grammars -->
<para id="algo-val">
<highlight style="bold">Algorithm 1: Validation for local tree
grammars</highlight>
<seqlist>
<li><para><highlight style="bold">begin element: </highlight>
When a start tag is encountered, we first
determine a terminal, say <highlight style="ital">a</highlight>.
We then search for a production
rule <sgml>X → a r</sgml> in the given grammar. It is guaranteed
that one and at most one such production rule is found.</para></li>
<li><para><highlight style="bold">end element: </highlight>
When an end tag is encountered, the sequence of
non-terminals assigned to the children of this element is compared
against <highlight style="ital">r</highlight>. If this sequence
does not match <highlight style="ital">r</highlight>, we report that
this document is invalid
and halt.</para></li>
</seqlist>
</para>
<!--
<para>
<deflist>
<def.item>
<def.term>begin element</def.term>
<def>
<para>When a start tag is encountered, we first
determine a terminal, say <highlight style="ital">a</highlight>.
We then search for a production
rule <sgml>X → a r</sgml> in the given grammar. It is guaranteed
that
one and at most one such production rule is found.</para>
</def>
</def.item>
<def.item>
<def.term>end element</def.term>
<def>
<para>When an end tag is encountered, the sequence of
non-terminals assigned to the children of this element is compared
against <highlight style="ital">r</highlight>. If this sequence
does not match <highlight style="ital">r</highlight>, we report that
this document is invalid
and halt.</para>
</def>
</def.item>
</deflist>
-->
<para>Observe that this algorithm not only determines whether a
document is generated by the given grammar but also constructs an
interpretation of the document. Since any document has at most one
interpretation under a local tree grammar, the constructed
interpretation is the only interpretation of the input document.
</para>
<para>Now, we extend our algorithm for
single-type tree grammars and restrained-competition tree grammars.
This extension is quite simple and straightforward. </para>
<para>Remember that a single-type tree
grammar does not allow competing non-terminals within a single content
model (see Definition 7). Thus, for each element, we can uniquely
determine a non-terminal from the terminal of this element as
well as the non-terminal for the parent element. Furthermore, since a
single-type tree grammar does not allow competing start symbols, we
can uniquely determine a non-terminal from the terminal of the
root element. </para>
<para>We only have to change the
behavior for start tag events.</para>
<!-- algo: Validation for single-type tree grammars -->
<para id="algo-single">
<highlight style="bold">Algorithm 2: Validation for single-type tree
grammars.</highlight>
<seqlist>
<li><para><highlight style="bold">begin element: </highlight> When a start
tag is encountered, we first determine a terminal, say
<highlight style="ital">a</highlight>.
<seqlist number="lalpha">
<li><para><highlight style="bold">root: </highlight>
If this start tag represents the root element, we search for a
production rule <sgml>X → a r</sgml> in the given grammar such
that <highlight style="ital">X</highlight> is a start symbol. It is
guaranteed that at most one such production rule is found. If not
found, we report that this document is invalid and halt.</para></li>
<li><para><highlight style="bold">non-root: </highlight> If
this start tag represents a non-root element, we have already
encountered the start tag for the parent element and have determined
the non-terminal and content model, say <highlight
style="ital">n'</highlight> and <highlight
style="ital">r'</highlight>, for the parent element. We search for a
production rule <sgml>X → a r</sgml> such that <highlight
style="ital">X</highlight> occurs in <highlight
style="ital">r'</highlight>. It is guaranteed that at most one such
production rule is found. If not found, we report that this document
is invalid and halt.</para></li>
</seqlist>
</para></li>
<li><para><highlight style="bold">end element: </highlight>
The same as in Algorithm 1.</para></li>
</seqlist>
</para>
<!--
<para>
<deflist>
<def.item>
<def.term>begin element</def.term>
<def>
<para>When a start tag is encountered, we first determine a
terminal, say <highlight style="ital">a</highlight>.
<deflist>
<def.item>
<def.term>root</def.term>
<def>
<para>If this start tag represents the root element, we
search for a
production rule <sgml>X → a r</sgml> in the given grammar such
that <highlight style="ital">X</highlight>
is a start symbol. It is guaranteed that at most one such production
rule is found. If not found, we report that this document is invalid
and halt.
</para>
</def>
</def.item>
<def.item>
<def.term>non-root</def.term>
<def>
<para>If this start tag represents a non-root element, we
have already
encountered the start tag for the parent element and have determined
the non-terminal and content model, say <highlight
style="ital">n'</highlight> and <highlight
style="ital">r'</highlight>, for the parent
element. We search for a production rule <sgml>X → a r</sgml>
such
that <highlight style="ital">X</highlight> occurs in <highlight
style="ital">r'</highlight>. It is guaranteed that at most one such
production rule is found. If not found, we report that this document
is invalid and halt.</para>
</def>
</def.item>
</deflist>
</para>
</def>
</def.item>
<def.item>
<def.term>end element</def.term>
<def>
<para>The same as in the previous algorithm.</para>
</def>
</def.item>
</deflist>
</para>
-->
<para>Next, we consider restrained-competition tree grammars. Remember that each content model in a restrained-competition tree grammar does not allow competing non-terminals to follow the same sequence of non-terminals (see Definition 9). Thus, for each element, we can uniquely determine a non-terminal from the terminal of this element as well as the non-terminals for the parent element and the elder sibling elements.
</para>
<para>We only have to change the behavior for start tag events for
non-root elements.
</para>
<!-- algo: Validation for restrained-competition tree grammars -->
<para id="algo-restrain">
<highlight style="bold">Algorithm 3: Validation for
restrained-competition tree grammars.</highlight>
<seqlist>
<li><para><highlight style="bold">begin element: </highlight> When a
start tag is encountered, we first determine a terminal, say
<highlight style="ital">a</highlight>.
<seqlist number="lalpha">
<li><para><highlight style="bold">root: </highlight>
The same as in Algorithm 2.</para></li>
<li><para><highlight style="bold">non-root: </highlight> If
this start tag represents a non-root element, we have already
encountered the start tag for the parent element and elder sibling
elements. We thus have already determined the non-terminal and
content model, say <highlight style="ital">n'</highlight> and
<highlight style="ital">r'</highlight>, for the parent element, and
have determined the non-terminals, say <highlight
style="ital">n<sub>1</sub></highlight>, <highlight
style="ital">n<sub>2</sub></highlight>, ..., <highlight
style="ital">n<sub>i</sub></highlight> for the elder sibling elements.
We search for a production rule <sgml>X → a r</sgml> such that
<highlight style="ital">r'</highlight> allows <highlight
style="ital">X</highlight> to follow <highlight
style="ital">n<sub>1</sub> n<sub>2</sub> ...
n<sub>i</sub></highlight>. It is guaranteed that at most one such
production rule is found. If not found, we report that this document
is invalid and halt.</para>
</li>
</seqlist>
</para></li>
<li><para><highlight style="bold">end element: </highlight>
The same as in Algorithm 1.</para></li>
</seqlist>
</para>
<!--
<para>
<deflist>
<def.item>
<def.term>begin element</def.term>
<def>
<para>When a start tag is encountered, we first determine a
terminal, say <highlight style="ital">a</highlight>.
<deflist>
<def.item>
<def.term>root</def.term>
<def>
<para>The same as in the previous algorithm.</para>
</def>
</def.item>
<def.item>
<def.term>non-root</def.term>
<def>
<para>If this start tag represents a non-root element, we
have already
encountered the start tag for the parent element and elder sibling
elements. We thus have already determined the non-terminal and
content model, say <highlight style="ital">n'</highlight> and
<highlight style="ital">r'</highlight>, for the parent element, and
have
determined the non-terminals, say
<highlight style="ital">n<sub>1</sub>
</highlight>,
<highlight style="ital">n<sub>2</sub>
</highlight>, ...
<highlight style="ital">n<sub>i</sub>
</highlight> for the
elder sibling elements.</para>
<para>We search for a production rule
<sgml>X → a r</sgml> such that <highlight
style="ital">r'</highlight>
allows <highlight style="ital">X</highlight> to follow
<highlight style="ital">n<sub>1</sub> n<sub>2</sub> ...
n<sub>i</sub>
</highlight>. It is guaranteed that at
most one such production rule is found. If not found, we report that
this document is invalid and halt.</para>
</def>
</def.item>
</deflist>
</para>
</def>
</def.item>
<def.item>
<def.term>end element</def.term>
<def>
<para>The same as in the previous algorithm.</para>
</def>
</def.item>
</deflist>
</para>
-->
<para>Observe that given a document, the extended algorithms construct
an interpretation of the document. Any document has at most one
interpretation under a single-type or restrained competition tree
grammar.</para>
</subsec1>
<subsec1>
<title>Variations of Tree Automata</title>
<para>Before we study algorithms applicable to arbitrary regular tree
grammars, we consider tree automata. Validation of trees against tree
regular grammars can be considered as execution of tree automata.
Automata for regular tree grammars have been studied in the past and
present <bibref refloc="BOOK-TATA-97"/>. There are top-down tree
automata and bottom-up tree automata: the former begins with the root
node and assigns states to elements after handling superior elements,
while the latter begins with leaf nodes and assigns states to elements
after handling subordinate elements. Moreover, there are
deterministic tree automata and non-deterministic tree automata: the
former assigns a state to each element, while the latter assigns any
number of states to each element. As a result, there are four types
of tree automata:
<randlist>
<li><para>deterministic top-down,</para></li>
<li><para>non-deterministic top-down,</para></li>
<li><para>deterministic bottom-up, and</para></li>
<li><para>non-deterministic bottom-up.</para></li>
</randlist>
</para>
<para>It is known that
non-deterministic top-down, deterministic bottom-up and
non-deterministic bottom-up tree automata are equally expressive. In
other words, any regular tree language can be accepted by some
non-deterministic top-down automata. The same thing applies to
deterministic bottom-up and non-deterministic bottom-up tree automata.
On the other hand, deterministic top-down tree automata are not
equally expressive. In other words, some regular tree language cannot
be accepted by any deterministic top-down tree automaton.</para>
<para>Algorithms 1, 2, and 3 are similar to deterministic top-down
automata. However, deterministic top-down tree automata assign a
state to an element without examining that element; they only examine
the parent element and the state assigned to it. Because of this
restriction, deterministic top-down tree automata are almost useless
for XML. On the other hand, Algorithms 1, 2, and 3 examine an element
before assigning a non-terminal (state) to it. Hence, we call them
semi-deterministic top-down. </para>
<para>One approach for validation is to use deterministic bottom-up
tree automata. Given a regular tree grammar, we create a
deterministic bottom-up tree automaton. This is done by introducing a
state for each subset of the set of non-terminals of the grammar and
then constructing a transition function for these states. Execution
of thus constructed deterministic bottom-up tree automata is
straightforward, which is the biggest advantage of this approach.
However, in this paper, we do not consider this approach further for
two reasons. First, creation of deterministic bottom-up tree automata
is not easy and is beyond the scope of this paper. Second,
straightforward implementation will lead to poor error message.</para>
<para>Other than deterministic tree automata, we have non-deterministic ones and our algorithms in the next subsection are based on them.</para>
</subsec1>
<subsec1>
<title>Non-deterministic algorithms for regular tree grammars</title>
<para>In this section, we show two algorithms for regular tree
grammars. Both are based on non-deterministic tree automata.</para>
<para>Our first algorithm simulates non-deterministic top-down tree automata. It cannot be implemented in the event model but requires the tree model. Unlike Algorithms 1, 2, and 3, this algorithm does not create string automata from content models.</para>
<para>This algorithm is effected by a simple recursive procedure
<highlight style="ital">validate</highlight>. Given a content model
and a sequence of elements, <highlight
style="ital">validate</highlight> compares this sequence against the
content model and reports success or failure.</para>
<para id="algo-validate">
<highlight style="bold">Algorithm 4.1: Validate</highlight>
<seqlist>
<!-- <li><para>procedure <highlight style="ital">validate</highlight></para></li> -->
<li>
<para><highlight style="bold">input: </highlight>
a content model <sgml>r</sgml> and a sequence
<highlight style="ital">e<sub>1</sub></highlight> <highlight
style="ital">e<sub>2</sub></highlight> ... <highlight
style="ital">e<sub>i</sub></highlight> of elements.</para>
<para><highlight style="bold">output: </highlight>success or failure</para>
</li>
<li><para>Switch statement
<seqlist number="lalpha">
<li><para>Case 1: <sgml>r = ε</sgml> (the null string)</para>
<para>If <highlight style="ital">e<sub>1</sub></highlight> <highlight
style="ital">e<sub>2</sub></highlight> ... <highlight
style="ital">e<sub>i</sub></highlight> is an empty sequence (i.e.,
<highlight style="ital">i</highlight> = 0), this procedure succeeds.
Otherwise, it fails.</para></li>
<li><para>Case 2: <sgml>r = x</sgml> (a non-terminal)</para>
<para>If <highlight style="ital">e<sub>1</sub></highlight> <highlight
style="ital">e<sub>2</sub></highlight> ... <highlight
style="ital">e<sub>i</sub></highlight> ≠ <highlight
style="ital">e<sub>1</sub></highlight> (i.e., i ≠ 1), this
procedure fails. If <highlight style="ital">e<sub>1</sub></highlight>
<highlight style="ital">e<sub>2</sub></highlight> ... <highlight
style="ital">e<sub>i</sub></highlight> = <highlight
style="ital">e<sub>1</sub></highlight> (i.e., <highlight
style="ital">i</highlight> = 1), we identify those production rules
<sgml>X → a r'</sgml> such that <sgml>a</sgml> is the terminal
of <highlight style="ital">e<sub>1</sub></highlight>. For each
of these production rules, we recursively invoke <highlight
style="ital">validate</highlight> for <sgml>r'</sgml> and <highlight
style="ital">e<sub>11</sub></highlight> <highlight
style="ital">e<sub>12</sub></highlight> ... <highlight
style="ital">e<sub>1j</sub></highlight>, where <highlight
style="ital">e<sub>11</sub></highlight> <highlight
style="ital">e<sub>12</sub></highlight> ... <highlight
style="ital">e<sub>1j</sub></highlight> is the children of <highlight
style="ital">e<sub>1</sub></highlight>. If at least one of these
invocations succeeds, this procedure succeeds. Otherwise, it
fails.</para></li>
<li><para>Case 3: <sgml>r = r1 | r2</sgml></para>
<para>We invoke <highlight style="ital">validate</highlight> for
<sgml>r1</sgml> and <highlight
style="ital">e<sub>1</sub></highlight> <highlight
style="ital">e<sub>2</sub></highlight> .... <highlight
style="ital">e<sub>i</sub></highlight> and then invoke it for
<sgml>r2</sgml> and <highlight
style="ital">e<sub>1</sub></highlight> <highlight
style="ital">e<sub>2</sub></highlight> .... e<sub>i</sub>. If at
least one of the two invocations succeeds, this procedure succeeds.
Otherwise, it fails.</para></li>
<li><para>Case 4: <sgml>r = r1 r2</sgml></para>
<para>For each <highlight style="ital">k (1 ≤ k ≤
i)</highlight>, we invoke <highlight style="ital">validate</highlight>
for <sgml>r1</sgml> and <highlight
style="ital">e<sub>1</sub></highlight> <highlight
style="ital">e<sub>2</sub></highlight> ... <highlight
style="ital">e<sub>k</sub></highlight> and also invoke <highlight
style="ital">validate</highlight> for <sgml>r2</sgml> and
<highlight style="ital">e<sub>k+1</sub></highlight> <highlight
style="ital">e<sub>k+2</sub></highlight> ... <highlight
style="ital">e<sub>i</sub></highlight>. If both invocations succeed
for some <highlight style="ital">k</highlight>, this procedure
succeeds. Otherwise, it fails.</para></li>
<li><para>Case 5: <sgml>r = r1</sgml><super>*</super></para>
<para>If <highlight style="ital">e<sub>1</sub></highlight> <highlight
style="ital">e<sub>2</sub></highlight> ... <highlight
style="ital">e<sub>i</sub></highlight> = ε (i.e., <highlight
style="ital">i</highlight> = 0), this procedure succeeds. Otherwise,
for each <highlight style="ital">k ( 1 ≤ k ≤ i)</highlight>,
we invoke <highlight style="ital">validate</highlight> for
<sgml>r1</sgml> and <highlight
style="ital">e<sub>1</sub></highlight> <highlight
style="ital">e<sub>2</sub></highlight> ... <highlight
style="ital">e<sub>k</sub></highlight> and also invoke <highlight
style="ital">validate</highlight> for <sgml>r1</sgml><super>*</super> and
<highlight style="ital">e<sub>k+1</sub> e<sub>k+2</sub>
... e<sub>i</sub></highlight>. If both invocations succeed for some
<highlight style="ital">k</highlight>, this procedure succeeds.
Otherwise, it fails.</para></li>
</seqlist></para>
</li>
</seqlist>
</para>
<para id="algo-non-det-bottom-up">
<highlight style="bold">Algorithm 4.2: Non-deterministic top-down validation for
regular tree grammars.</highlight>
<seqlist>
<li>
<para>We begin with the root element. For each start
symbol <sgml>x</sgml>, we invoke Algorithm 4.1 <highlight
style="bold">validate</highlight> for <sgml>x</sgml> and the root
element. If at least one of these invocations succeeds, this document
is valid. Otherwise, it is invalid.
</para>
</li>
</seqlist>
</para>
<para>This algorithm does not require understanding of formal language theory. On the other hand, this approach has significant disadvantages. First, this algorithm may cause exhaustive search. Second, this algorithm leads to poor error message. Given an invalid document, this algorithm tries all possibilities in turn. If all of them fail, this algorithm cannot tell which of the possibilities is closest to success and thus cannot report what is a required change.
</para>
<para>Observe that this algorithm provides an interpretation of the
document, but does <highlight style="ital">not</highlight> ensure that
this is the only interpretation. In fact, more than one
interpretation may exist for a given tree. For example, suppose that
an XML document <sgml><doc><para/></doc></sgml> is validated
against the grammar in Example 9. Two non-terminals, namely
<sgml>Para1</sgml> and <sgml>Para2</sgml>, can be assigned to the element
<sgml><para/></sgml>. Algorithm 4 merely returns one of the
possible interpretations.</para>
<para>Our second algorithm simulates non-deterministic bottom-up
tree automata. It is also a significant extension of Algorithm 1; the
biggest difference is that a set of non-terminals (rather than a single
non-terminal) is assigned to each element. Although this extension is
more advanced than Algorithms 2 and 3, it can still be built on top of the
event model.
</para>
<!-- algo: Algorithm for regular tree grammars -->
<para id="algo-rtg">
<highlight style="bold">Algorithm 5: Validation for regular tree
grammars</highlight>
<seqlist>
<li><para><highlight style="bold">begin element: </highlight> When a
start tag is encountered, we identify those production rules <sgml>X
→ a r</sgml> such that <highlight style="ital">a</highlight> is
the terminal of this tag. Note that more than one production
rule may be found.</para></li>
<li><para><highlight style="bold">end element: </highlight>
When an end tag is encountered, we have already
encountered the end tags for the children
<highlight style="ital">e<sub>1</sub></highlight>,
<highlight style="ital">e<sub>2</sub></highlight>, ...
<highlight style="ital">e<sub>i</sub></highlight> of
this element. We create non-terminal sequences by choosing one of the
non-terminal assigned to each <highlight
style="ital">e<sub>j</sub></highlight>(<highlight style="ital">j ≤ i</highlight>),
<!-- $e_j \; (j \leq i)$ -->
and concatenating the chosen non-terminals. If at least one of the
created non-terminal sequences match <highlight
style="ital">r</highlight> (the content model of <sgml>X → a
r</sgml>), then <highlight style="ital">X</highlight> is one of the
non-terminals for this element. If none of the created non-terminals
match any content model for this element, we report that this document
is invalid and halt.</para>
<para>When this element is the root element, validation
succeeds if and only if at least one of the non-terminals assigned to the root element is a
start symbol.</para></li>
</seqlist>
</para>
<para>Again, this algorithm does <highlight
style="ital">not</highlight> provide a unique interpretation of the
document. When more than one interpretation is possible, this
algorithm returns all of them.</para>
</subsec1>
<subsec1>
<title>Summary of Algorithms and their properties</title>
<para>The following table shows how the various algorithms described in this section compare with respect to the tree automaton they simulate, and the model (event model or tree model) that the algorithm requires.</para>
<table>
<tgroup cols="3">
<thead>
<row>
<entry>Algorithms</entry>
<entry>Class of Tree Automaton</entry>
<entry>Model the algorithm requires</entry>
</row>
</thead>
<tbody>
<row>
<entry>Algorithms 1, 2, 3</entry>
<entry>semi- deterministic top-down</entry>
<entry>Event model</entry>
</row>
<row>
<entry>Algorithm 4</entry>
<entry>non deterministic top-down</entry>
<entry>Tree model</entry>
</row>
<row>
<entry>Not Applicable</entry>
<entry>deterministic bottom-up</entry>
<entry>Not applicable</entry>
</row>
<row>
<entry>Algorithm 5</entry>
<entry>non deterministic bottom-up</entry>
<entry>Event model</entry>
</row>
</tbody>
</tgroup>
</table>
</subsec1>
</section>
<!-- section: 5-->
<section id="sec-evaluation">
<title>Evaluating Different XML Schema Language Proposals</title>
<para>In this section, we compare six
representative XML schema language proposals: DTD, DSD, XML-Schema,
XDuce, RELAX Core, and TREX. Our focus is on the mathematical
properties of these schema languages in our framework, though we also
mention other features such as ease of use whenever possible.</para>
<para>We capture all these schema proposals by regular tree grammars.
For this purpose, we slightly modify our definition of production
rules. We allow production rules without terminals; that is,
they are of the form <sgml>x → r</sgml>,
where <sgml>x ∈ N</sgml> and <sgml>r</sgml> is a regular
expression over <highlight style="ital">N</highlight>. However, we
impose a restriction that all non-terminals described by such
production rules can be safely expanded to regular expressions over
the other non-terminals. For example, <sgml>x → ((y, x, y) |
y)</sgml> is disallowed. Note that this production rule causes
non-regular string languages.</para>
<para>The relationship between the expressive power of the various
grammar classes in the previous section helps to compare the different
XML Schema proposals (see Figure <xref type="number"
refloc="fig-set"/>).
</para>
<subsec1>
<title>DTD</title>
<para>DTD as defined in <bibref refloc="WWW-BPS-00"/>
is a local tree grammar. This is enforced by not distinguishing
between terminals and non-terminals. Element type
declarations of DTDs are production rules, and "element types" of XML
1.0 are terminals as well as non-terminals. Content
models are required to be deterministic (see Appendix E of <bibref
refloc="WWW-BPS-00"/>). Attribute-list declarations of DTDs associate
attributes to terminals. </para>
<para>As an example, consider a DTD as below:
<sgml.block>
<!ELEMENT doc (para*)>
<!ELEMENT para (#PCDATA)>
</sgml.block>
It can be captured by a local tree grammar shown below:
<sgml.block>
N = {Doc, Para Pcdata}
T = {doc, para, pcdata}
S = {Doc}
P = {Doc → doc (Para*), Para → para (Pcdata),
Pcdata → pcdata ε}
</sgml.block></para>
</subsec1>
<subsec1>
<title>DSD</title>
<para>The main features of DSD <bibref refloc="WWW-DSD-00"/> are the use of <highlight
style="ital">non-terminals</highlight>, <highlight style="ital">content
expressions</highlight> to specify unordered content, <highlight
style="ital">context patterns</highlight> which can be used to specify
structures based on other structures and also values, and specifying
relationships
using <highlight style="ital">points-to</highlight> for IDREF
attributes. Using our framework, we can capture all the features except
structure specification based on values and <highlight
style="ital">points-to</highlight> for IDREF attributes.</para>
<para>Let us consider the structural specification by DSD: DSD does
not impose any constraint on the production rules, therefore we can
express any regular tree grammar in DSD. For example, <highlight
style="ital">E = (Author<sub>1</sub>*,
Publisher<sub>1</sub>*,Author<sub>2</sub>*)</highlight> is a perfectly
valid content model in DSD, where <highlight
style="ital">Author<sub>1</sub></highlight> and <highlight
style="ital">Author<sub>2</sub></highlight> are competing
non-terminals. But the parsing in DSD uses a greedy technique (not
enough backtracking for the <highlight style="ital">*</highlight>
operator). Therefore, DSD does not accept all regular tree languages.
For example, consider a sequence of two <sgml>author</sgml> elements
such that the first matches <highlight
style="ital">Author<sub>1</sub></highlight> only and the second
matches <highlight style="ital">Author<sub>2</sub></highlight> only.
The greedy evaluation of DSD tries <highlight
style="ital">Author<sub>1</sub></highlight> for both elements, but
does not try <highlight style="ital">Author<sub>2</sub></highlight>
for the second.</para>
<para>Let us consider element definitions in DSD.
An example element definition in DSD has the form:
<sgml.block> <ElementDef ID="book_title" Name="title">
SomeContentSpecification
</ElementDef></sgml.block>
This can be converted into the grammar notation as:
<sgml>P = {book_title → title Expression}.</sgml> Here, <highlight
style="ital">Expression</highlight> is equivalent to
<sgml>SomeContentSpecification</sgml>. </para>
<para>Context patterns of DSD represent conditions on paths from the
root, and may be used in element definitions. Our framework does not
directly capture context patterns. However, a recent paper <bibref
refloc="PODS-Murata-01"/> by the first author shows that a pair of a
regular tree grammar and a regular path expression for locating nodes
can be rewritten as a single regular tree grammar. Such rewriting
allows context patterns to be captured in our framework.</para>
</subsec1>
<subsec1>
<title>XML-Schema</title>
<para>XML-Schema <bibref refloc="WWW-XSD-1-00"/> represents a
single-type tree grammar. Although XML-Schema has many complicated
mechanisms, it is not very expressive from the viewpoint of formal
language theory. Furthermore, Kawaguchi <bibref refloc="WWW-Kawaguchi"/>
recently argued that most mechanisms of XML-Schema are confusing and
should be avoided.</para>
<para>The main features of XML-Schema are <highlight
style="ital">complex type definitions</highlight>, <highlight
style="ital">anonymous type definitions</highlight>, <highlight
style="ital">group definitions</highlight>, <highlight
style="ital">subtyping by extension and restriction</highlight>,
<highlight style="ital">substitution groups</highlight>, <highlight
style="ital">abstract type definitions</highlight> and <highlight
style="ital">integrity constraints such as key, unique and keyref
constraints</highlight>. Most of these features can be described in
our framework as illustrated below.
<seqlist>
<li><para>A complex type definition defines a production rule without terminals.
For instance,
<sgml.block>
<xsd:complexType name="Book">
<xsd:sequence>
<xsd:element name="title" type="xsd:string" minOccurs="1 maxOccurs="1"/>
<xsd:element name="author" type="xsd:string" minOccurs="1" maxOccurs="unbounded"/>
<xsd:element name="publisher" type="xsd:string" minOccurs="0" maxOccurs="1"/>
</xsd:sequence>
</xsd:complexType></sgml.block>
This can be converted into production rules <sgml>Book → (Title,
Author+, Publisher?)</sgml>,
<sgml>Title → title (Pcdata)</sgml>,
<sgml>Author → author (Pcdata)</sgml>, and
<sgml>Publisher → publisher (Pcdata)</sgml>.
</para>
</li>
<li>
<para>A group definition defines a non-terminal and a production rule
without a terminal. An example (<bibref refloc="WWW-XSD-0-00"/>
Section 2.7) is shown below:
<sgml.block>
<xsd:group name="shipAndBill">
<xsd:sequence>
<xsd:element name="shipTo" type="USAddress" />
<xsd:element name="billTo" type="USAddress" />
</xsd:sequence>
</xsd:group></sgml.block>
This group definition is equivalent to production
grammar rules
<sgml>ShipTo → shipTo (USAddress)</sgml>, <sgml>BillTo → billTo (USAddress)</sgml>, and <sgml>shipAndBill → (ShipTo, BillTo)}</sgml>
</para>
</li>
<li>
<para>From object oriented programming, XML-Schema borrows the
concepts of
sub-typing. This is achieved through extension or restriction. An
example of derived types by extension (slightly modified from <bibref
refloc="WWW-XSD-0-00"/> Section 4.2)
is given below.
<sgml.block>
<xsd:complexType name="Address">
<xsd:sequence>
<xsd:element name="name" type="xsd:string"/>
<xsd:element name="street" type="xsd:string"/>
<xsd:element name="city" type="xsd:string"/>
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="UKAddress">
<xsd:complexContent>
<xsd:extension base="Address">
<xsd:sequence>
<xsd:element name="postcode" type="xsd:string"/>
</xsd:sequence>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType></sgml.block>
The above type definitions for Address and UKAddress
is equivalent to production rules
<sgml>Address → (Name, Street, City)</sgml>,
<sgml>UKAddress → (Name, Street, City, Postcode)</sgml>,
and production rules for <sgml>Name, Street, City, Postcode</sgml>.
</para>
<para>Now suppose there was an element declaration such as:
<sgml><xsd:element name="shipTo" type="Address"/></sgml>
This is equivalent to production rules
<sgml>ShipTo → shipTo (Address)</sgml> and <sgml>ShipTo → shipTo
(UKAddress)</sgml>.</para>
<para>Note that the production rule
<sgml>ShipTo → shipTo UKAddress</sgml> is automatically inserted,
true to the object-oriented programming paradigm. But this can be
considered as a ``side-effect'' in formal language theory. XML-Schema
provides an attribute called <highlight
style="ital">block</highlight> to prevent such side-effects. For
example, if Address had defined <highlight
style="ital">block=#all</highlight>, then we would not automatically
insert the rule.</para>
<para>XML-Schema also provides an attribute called <highlight
style="ital">final</highlight> which prevents
derived types by extension or restriction or both. For example, if
<highlight style="ital">Address</highlight> had defined <highlight
style="ital">final=#all</highlight>,
then we cannot derive a type called <highlight
style="ital">UKAddress</highlight> from it.</para>
</li>
<li>
<para>XML-Schema
provides a mechanism called <sgml>xsi:type</sgml> which allows it to
not satisfy the constraints of a restrained-competition tree
grammar. For example, it is legal to have the following rules in
XML-Schema.
<sgml.block>
P = {X → (Title), Y → (Title, Author1+), Z → (Title, Author2+, Publisher),
Book → book X, Book → book Y, Book → book Z}</sgml.block>
<!--<sgml.block>
P2 = {X → (Title), Y → (Title, Author1+), Z → (Title, Author2+, Publisher)}
P1 = {Title → title TITLE, Author1 → author Son*, Author2 → author Daughter*,
Book → book X, Book → book Y, Book → book Z}</sgml.block>-->
The above grammar is not a restrained competition tree grammar. But
for checking document validity, the properties of a single-type tree
grammar are maintained by requiring that the document mention
explicitly the type. For example a valid node in the instance tree
would be: <sgml><book xsi:type="Y"></sgml>.</para>
<para>It should be noted that this does not make XML-Schema equivalent to a
regular tree grammar, because XML-Schema cannot define a type
such as <sgml>BOOK → (Title, Author1*, Author2*)</sgml> with
<sgml>Author1→ author A1</sgml>, <sgml>Author2 → author
A2</sgml>.</para>
</li>
<li>
<para>A substitution group definition (previously known as
equivalence
classes) can be converted into an equivalent grammar definition as
follows. For example, consider the substitution group definition
(<bibref refloc="WWW-XSD-0-00"/> Section 4.6) (We modify it slightly
for easy explanation):
<sgml.block>
<element name="shipComment" type="Y" substitutionGroup="Ipo:comment"/>
<element name="customerComment" type="Z" substitutionGroup="Ipo:comment"/></sgml.block>
This is converted into grammar rules as:
<!--
<highlight style="ital">P<sub>1</sub></highlight>:
<sgml.block>{ShipComment → shipComment Y, customerComment → cutomerComment Z
Ipo:comment → shipComment Y, Ipo:comment → shipComment Z}
</sgml.block> -->
<sgml.block>
P = {ShipComment → shipComment Y, customerComment → cutomerComment Z
Ipo:comment → shipComment Y, Ipo:comment → shipComment Z}
</sgml.block>
where <highlight style="ital">Ipo:comment</highlight>, <highlight
style="ital">ShipComment</highlight> and <highlight
style="ital">CustomerComment</highlight> are non-terminals. Using
such substitution groups require that there be a production rule of
the form: <sgml>Ipo:comment → ipo:comment X</sgml>, and that
<highlight style="ital">Y, Z</highlight> be derived from <highlight
style="ital">X</highlight>.</para>
</li>
</seqlist>
</para>
</subsec1>
<subsec1>
<title>XDuce</title>
<para>XDuce <bibref refloc="ICFP-HJP-00"/> provides type definitions equivalent to regular tree
grammars. The key features of XDuce are their <highlight style="ital">type constructors</highlight> where we can define regular expression types as derived from other types, and subtyping using <highlight style="ital">subtagging</highlight>.</para>
<para>A type definition in XDuce that produces a tree is converted in
our framework into a production rule with a terminal. Consider the
example from <bibref refloc="ICFP-HJP-00"/>: <sgml>type Addrbook =
addrbook [Person*]</sgml> is written as <sgml>Addrbook → addrbook
(Person*)</sgml>. Any type definition that does not produce a tree is
written as a production rule without a terminal. For example,
<sgml>type X = T, X | ()</sgml> represents a production rule <sgml>X
→ (T, X + ε)</sgml>. Note that XDuce writes the above type
rules in a right-linear form, which makes every content model
definition equivalent to a regular string language.</para>
<para>XDuce provides a mechanism called <highlight
style="ital">subtagging</highlight>, which is a reflexive and transitive relation on terminals in <highlight style="ital">T</highlight>.
<!--This is
slightly difficult to convert into grammar, because it is based on
terminals in <highlight style="ital">T</highlight>, and not on
non-terminals in <highlight style="ital">N1</highlight> or <highlight style="ital">N2</highlight>. Below, we explain how we can convert the subtagging declarations into grammar rules.-->
For example, consider the subtag declaration
<sgml.block> subtag i <: fontstyle</sgml.block>
To convert this subtagging relation into grammar, consider all production rules that have <sgml>fontstyle</sgml> on the right hand side. Let these rules be <sgml>{A → fontstyle A1, B
→ fontstyle B1, ..., N → fontstyle N1}</sgml>. Now the subtag declaration adds the following additional rules to <sgml>P1: {A → i A1, B → i B1, ..., N → i N1}</sgml>
</para>
</subsec1>
<subsec1>
<title>RELAX Core</title>
<para>Any regular tree grammar can be expressed in RELAX Core <bibref refloc="Web-RELAX-00"/>. The main features of RELAX Core are <sgml>elementRule</sgml>s and <sgml>hedgeRule</sgml>s, and clear differentiation between them.</para>
<para>To convert RELAX Core to our framework, let us first consider <sgml>elementRule</sgml>, <sgml>hedgeRule</sgml> and <sgml>tag</sgml> elements, but where <sgml>tag</sgml> elements do not specify a <sgml>role</sgml> attribute. In this case, the value of the <sgml>name</sgml> attribute of the <sgml>tag</sgml> element is considered to be the value of the <sgml>role</sgml> attribute.
<!--In other words we assume that the role attribute of an <sgml>elementRule</sgml> provides a tag name.-->
<seqlist>
<li>
<para>An <sgml>elementRule</sgml> with a corresponding <sgml>tag</sgml> element defines a production rule with a terminal. For example consider the <sgml>elementRule</sgml>, slightly modified
from the one in <bibref refloc="Web-RELAX-00"/>.
<sgml.block>
<elementRule role="section" label="Section">
<ref label="paraWithFNotes" occurs="*"/>
</elementRule>
<tag name="section"/>
</sgml.block>
This can be converted into a production rule <sgml>
Section → section (paraWithFNotes*)</sgml>
</para>
</li>
<li>
<para><sgml>hedgeRule</sgml> defines a production rule without a terminal. For example,
consider the <sgml>hedgeRule</sgml> mentioned in <bibref refloc="Web-RELAX-00"/>
<sgml.block>
<hedgeRule label="blockElem">
<ref label="para"/>
</hedgeRule></sgml.block>
The above <sgml>hedgeRule</sgml> can be converted into
<sgml>blockElem → (para)</sgml></para>
<para>RELAX Core allows a <sgml>hedgeRule</sgml> to reference to
another <sgml>hedgeRule</sgml>. But it requires that there be no
recursion in the <sgml>hedgeRule</sgml>s; this ensures that the
grammar remains regular.</para>
</li>
<li>
<para>RELAX Core allows multiple <sgml>hedgeRule</sgml>s to share the same label. For example, we can specify in RELAX Core two <sgml>hedgeRule</sgml>s as
<sgml.block>
<hedgeRule label="blockElem">
<ref label="para"/>
</hedgeRule>
<hedgeRule label="blockElem">
<ref label="itemizedList"/>
</hedgeRule></sgml.block>
This can be converted into production rules
<sgml>blockElem → (para)</sgml> and
<sgml>blockElem → (itemizedList)</sgml>
</para>
</li>
<li>
<para>RELAX
Core allows multiple <sgml>elementRule</sgml>s to share the same label
as follows. For example, we can have
<sgml.block>
<elementRule role="a" label="A">
<ref label="X"/>
</elementRule>
<elementRule role="a" label="A">
<ref label="Y"/>
</elementRule>
<elementRule role="b" label="A">
<ref label="Z"/>
</elementRule>
<tag name="a"/>
</sgml.block>
These three <sgml>elementRule</sgml>s can be converted into three
production rules <sgml>A → a X, A → a Y, A → b Z</sgml>.
</para>
</li>
<li>
<para>Now let us consider when <sgml>tag</sgml> elements specify a <sgml>role</sgml> attribute.
<!--<sgml>Tag</sgml> is used to specify element tag names.--> In
effect this adds an additional level of indirection in rule specification. To
convert into the grammar representation, we need to collapse the role
attributes. This will be clear from the following example
<bibref refloc="Web-RELAX-00"/>.
<sgml.block>
<tag name="val" role="val-integer">
<attribute name="type" required="true">
<enumeration value="integer"/>
</attribute>
</tag>
<elementRule role="val-integer" label="Val">
<ref label="X"/>
</elementRule>
<tag name="val" role="val-string">
<attribute name="type" required="true">
<enumeration value="string"/>
</attribute>
</tag>
<elementRule role="val-string" label="Val">
<ref label="Y"/>
</elementRule></sgml.block>
They can be converted into production rules
<sgml>Val → val1 X</sgml> and <sgml>Val → val2 Y</sgml>,
where <sgml>val1</sgml> represents <sgml><val type="string"></sgml>
and <sgml>val2</sgml> represents <sgml><val type="integer"></sgml>.
</para>
</li>
</seqlist>
</para>
</subsec1>
<subsec1>
<title>TREX</title>
<para>TREX <bibref
refloc="WWW-TREX-01"/> is quite similar to RELAX Core. Major
differences of TREX from RELAX Core are (1) content models containing
both attributes and elements, (2) interleaving (shuffling) of regular
expressions, (3) wild cards for names, and (4) handling of namespaces.
Other differences are syntactical. Among these differences, (1) is
highly important and is based on new validation techniques. For the
lack of space, we do not further consider this issue in this
paper.</para>
<para>TREX supports the notion of non-terminals using <sgml>define</sgml>. However, unlike RELAX Core, TREX does not distinguish between non-terminals that produce trees and non-terminals that produce list of trees. For example, consider the following TREX pattern:
<sgml.block>
<grammar>
<start>
<ref name="AddressBook"/>
</start>
<define name="AddressBook">
<element name="addressBook">
<zeroOrMore>
<ref name="Card"/>
</zeroOrMore>
</element>
</define>
<define name="inline">
<zeroOrMore>
<choice>
<element name="bold">
<ref name="inline"/>
</element>
<element name="italic">
<ref name="inline"/>
</element>
</choice>
</zeroOrMore>
</define>
</grammar>
</sgml.block>
<!--<sgml.block>
<grammar>
<start>
<ref name="AddressBook"/>
</start>
<define name="AddressBook">
<element name="addressBook">
<zeroOrMore>
<ref name="Card"/>
</zeroOrMore>
</element>
</define>
<define name="Card">
<element name="card">
<ref name="Name"/>
<ref name="Email"/>
</element>
</define>
</grammar>
</sgml.block>-->
Here <sgml>AddressBook</sgml> is a non-terminal that produces a tree, and <sgml>inline</sgml> is a non-terminal that produces a list of trees. The above TREX pattern will be represented in our framework as follows:
<sgml.block>
P = {AddressBook → addressBook (Card*), Bold → bold (Inline),
Italic → italic (Inline), Inline → (Bold | Italic)*}
</sgml.block>
</para>
<para>
</para>
<figure id="fig-set">
<caption>
<para>The expressive power of the different regular tree
grammars:
(a) regular tree grammars (RELAX Core, XDuce, TREX),
(b) restrained-competition tree grammars
(c) single-type tree grammars (XML-Schema)
(d) local tree grammars (DTD)
</para>
</caption>
<graphic figname="set" scale="100"/>
</figure>
</subsec1>
<subsec1>
<title>Implementations</title>
<para>In this subsection, we describe
status of implementations of XML schema languages.
<seqlist>
<li><para><highlight style="bold">DTD: </highlight>
Membership checking against DTDs has been widely
implemented. Most implementations are based on the event model, and
use Algorithm 1.</para>
</li>
<li><para><highlight style="bold">XDuce: </highlight>
Type assignment implemented by the designers of XDuce
is based on the tree model. This implementation is similar to Algorithm 4.</para>
</li>
<li><para><highlight style="bold">DSD: </highlight>
Type assignment of DSD is similar to that of
XDuce. It is based on the tree model, and is a variation of Algorithm 4.
However, backtracking is
not thorough, and thus does not perform exhaustive search. </para>
</li>
<li><para><highlight style="bold">XML-Schema: </highlight>
Schema-validity assessment, as defined in 7.2 of XML
Schema Part 1,
is similar to Algorithm 2. To the best of our knowledge, all implementations
follow this reference model.
<seqlist number="lalpha">
<li><para><highlight style="bold">XSV: </highlight>
XSV (XML-Schema Validator) is based on the tree model.</para>
</li>
<li><para><highlight style="bold">XML-Schema Processor: </highlight>
This implementation can be combined with
a SAX parser as well as a DOM parser. When it is
combined with an event model parser it does not support
<sgml>key</sgml>, <sgml>unique</sgml>, and <sgml>keyref</sgml> as of now (2001/3/31).</para>
</li>
<li><para><highlight style="bold">Xerces Java Parser: </highlight>
This implementation is based on the event model, and does not support
<sgml>key</sgml>, <sgml>unique</sgml>, and <sgml>keyref</sgml> as of now (2001/3/31).</para>
</li>
<li><para><highlight style="bold">XMLSpy & XMLInstance: </highlight>
They are XML document editors, and
thus use the tree model. Editing of documents can be controlled by
schemas in XML-Schema.</para>
</li>
</seqlist>
</para></li>
<li><para><highlight style="bold">RELAX Core: </highlight>
Four implementations of RELAX Core are available
at <highlight style="ital">http://www.xml.gr.jp/relax/</highlight>.
They use different algorithms for type
assignment and membership checking.
<seqlist number="lalpha">
<li><para><highlight style="bold">VBRELAX: </highlight>
This program is based on the tree model. It
simulates top-down non-deterministic automata by backtracking. Such
backtracking may require exhaustive search.</para>
</li>
<li><para><highlight style="bold">RELAX Verifier for C++: </highlight>
This program is based on the event
model. It performs membership checking, but does not perform type
assignment. This program is based on Algorithm 5..</para>
</li>
<li><para><highlight style="bold">RELAX Verifier for Java:
</highlight> This program is based on the event model. Its algorithm
combines Algorithm 3 (top-down) and Algorithm 5 (bottom-up). When an
element is visited, possible non-terminals for this element are
computed; unlike Algorithm 3, more than one possible non-terminal may
be found. Then, as in Algorithm 5, this program determines which of the
possible non-terminals are allowed by the subordinate elements.
Advantages of combining Algorithms 3 and 5 are appropriate error
messages and error recovery.</para>
</li>
<li><para><highlight style="bold">RELAXER: </highlight>
Relaxer is a Java class generator. Given a RELAX
module, Relaxer generates Java classes that represent XML documents
permitted by the module. These Java classes receive XML documents as
DOM trees and perform type assignment. This type assignment uses
top-down non-deterministic automata by (limited) backtracking.</para>
</li>
</seqlist>
</para></li>
<li>
<para><highlight style="bold">TREX:</highlight> One implementation of TREX is available. The key feature of this algorithm is the efficiency and error message.
<seqlist>
<li><para><highlight style="bold">TREX Implementation by James
Clark:</highlight>This implementation can be considered as a
combination of Algorithms 3 and 5, but is more advanced; it uses
derivatives of regular expressions to construct tree automata lazily.
Furthermore, this algorithm efficiently support
<sgml><interleave</sgml> and attribute-element content
models.</para></li>
</seqlist>
</para></li>
</seqlist>
<!--
<deflist>
<def.item>
<def.term>DTD</def.term>
<def>
<para>Membership checking against DTDs has been widely
implemented. Most implementations are based on the event model, and
they simulate top-down deterministic automata with one
lookahead.</para>
</def>
</def.item>
<def.item>
<def.term>XDuce</def.term>
<def>
<para>Type assignment implemented by the designers of XDuce
is based on the tree model. This implementation simulates top-down
non-deterministic automata by backtracking. Such backtracking may
require exhaustive search.</para>
</def>
</def.item>
<def.item>
<def.term>DSD</def.term>
<def>
<para>Type assignment of DSD is similar to that of
XDuce. It is based on the tree model, and simulates top-down
non-deterministic automata by backtracking. However, backtracking is
not thorough, and thus does not perform exhaustive search. </para>
</def>
</def.item>
<def.item>
<def.term>XML-Schema</def.term>
<def>
<para>Schema-validity assessment, as defined in 7.2 of XML
Schema Part 1,
performs type assignment by using deterministic top-down tree automata
with one lookahead. To the best of our knowledge, all implementations
follow this model. Although XML-Schema allows restrained-competition
grammar satisfying the single-type
constraint only, other mechanisms of XML-Schema (<sgml>key</sgml>,
<sgml>unique</sgml>, and <sgml>keyref</sgml>) require use of the tree
model.<deflist>
<def.item>
<def.term>XSV</def.term>
<def>
<para>XSV (XML-Schema Validator) is based on the tree
model.</para>
</def>
</def.item>
<def.item>
<def.term>XML-Schema Processor</def.term>
<def>
<para>This implementation can be combined with
a SAX parser as well as a DOM parser. When it is
combined with an event model parser it does not support
<sgml>key</sgml>, <sgml>unique</sgml>, and <sgml>keyref</sgml>.</para>
</def>
</def.item>
<def.item>
<def.term>Xerces Java Parser</def.term>
<def>
<para>This implementation is based on the event model, and does not support
<sgml>key</sgml>, <sgml>unique</sgml>, and <sgml>keyref</sgml>.</para>
</def>
</def.item>
<def.item>
<def.term>XMLSpy and XMLInstance</def.term>
<def>
<para>They are XML document editors, and
thus use the tree model. Editing of documents can be controlled by
schemas in XML-Schema.</para>
</def>
</def.item>
</deflist>
</para>
</def>
</def.item>
<def.item>
<def.term>RELAX Core</def.term>
<def>
<para>Four implementations of RELAX Core are available
at <highlight style="ital">http://www.xml.gr.jp/relax/</highlight>.
They use different algorithms for type
assignment and membership checking.<deflist>
<def.item>
<def.term>VBRELAX</def.term>
<def>
<para>This program is based on the tree model. It
simulates top-down non-deterministic automata by backtracking. Such
backtracking may require exhaustive search.</para>
</def>
</def.item>
<def.item>
<def.term>RELAX Verifier for C++</def.term>
<def>
<para>This program is based on the event
model. It performs membership checking, but does not perform type
assignment. This program is based on Algorithm 5.</para>
</def>
</def.item>
<def.item>
<def.term>RELAX Verifier for Java</def.term>
<def>
<para>This program is based on the event
model. It is also based on Algorithm 5, but
further introduces top-down non-determinism. Intuitively speaking,
when an element is visited, possible non-terminals for this element
are temporarily assigned to the element. Such possible non-terminals
are chosen by examining the parent element and the non-terminals
temporarily assigned to it. When an element is left, this program
examines which of the temporarily assigned non-terminals are allowed
by the subordinate elements. Advantages of combining top-down
non-determinism and bottom-up non-determinism are appropriate error
messages and error recovery. RELAX Verifier performs type assignment as
well as membership checking.
Two linear-time algorithms for type assignment have been implemented.
One requires two-pass processing of XML documents, and the other is
dedicated to restrained-competition grammars without the deterministic
constraint.</para>
</def>
</def.item>
<def.item>
<def.term>RELAXER</def.term>
<def>
<para>Relaxer is a Java class generator. Given a RELAX
module, Relaxer generates Java classes that represent XML documents
permitted by the module. These Java classes receive XML documents as
DOM trees and perform type assignment. This type assignment uses
top-down non-deterministic automata by (limited) backtracking.</para>
</def>
</def.item>
</deflist>
</para>
</def>
</def.item>
<def.item>
<def.term>TREX</def.term>
<def>
<para>
<deflist>
<def.item>
<def.term>TREX Implementation></def.term>
<def>
<para>The TREX implementation is based on Algorithm 5. However it has a very efficient algorithm for handling non-determinism. This is based on derivatives of regular expressions, and lazy construction of automaton.
</para>
</def>
</def.item>
</deflist>
</para>
</def>
</def.item>
</deflist>
-->
</para>
</subsec1>
</section>
<!-- section: 6 -->
<section id="sec-conclusion">
<title>Conclusion</title>
<!--
<para>A mathematical framework using formal language theory to compare
various XML schema languages is presented. In our framework, a normal
form representation for regular tree grammar is first proposed and
various subclasses of regular tree language are subsequently
defined. Further, a detailed analysis of the closure properties and
expressive power of the different subclasses is presented. Finally,
results on the complexity of membership checking and type assignment
for various XML schema languages are presented.</para>
<para>There are several problems of great
theoretical and practical interest that are still open. One such
problem is: what is the ``tightest'', say, restrained-competition
grammar for a given regular tree grammar? Another interesting
research topic is <highlight style="ital">k</highlight> lookaheads in
the vertical and horizontal directions. Also the algorithmic aspects
of type checking still pose lot of challenges: for example, what is
the ``precise'' class of grammars for which we can do type assignment
in the event model? Also error recovery and error messages are
important to implementations.
</para>-->
<!--
<para>A mathematical framework using formal language theory to compare
various XML schema languages is presented. This framework enables us
to define various subclasses of regular tree languages, and study the
closure properties and expressive power of these languages in a great
detail. Also, algorithms for document validation and type assignment
for the different grammar classes are described. Finally, various
schema language proposals are compared using our framework, and the
implementations available are also discussed.</para>
<para>Our framework brings forward a very important question: Do we
need to migrate from deterministic content models of XML 1.0 in favor
of schema languages such as RELAX Core and TREX that allow
non-deterministic content models? Our work in this paper as well as
other works, with regard to document processing and XML query, make us
believe that we should allow non-deterministic content models in XML
schema languages.</para>
<para>We have currently pursuing multiple research issues beyond
presented here. We are examining ambiguity in regular tree grammars
and languages, and studying how to determine whether a given regular
tree grammar is ambiguous or not. We are also examining integrity
constraints necessary for a schema language as studied widely in the
area of database systems, and examining efficient implementations of
these constraints for XML applications.</para>
-->
<para>To compare XML schema language proposals, we have studied
four classes of tree languages, namely "local", "single-type",
"restrained-competition", and "regular". The class "regular" is the
most expressive and is closed under boolean operations, while the
other classes are weaker and are not closed under union and difference
operations. We have also presented five validation algorithms for
these classes. Then, we have shown which class is captured by DTD,
DSD, XML-Schema, XDuce, RELAX Core, and TREX, respectively.</para>
<para>Surprisingly enough, from the viewpoint of formal language
theory, XML-Schema is not as powerful as RELAX Core or TREX.
XML-Schema capture the class "single-type" only, and is not closed
under union or difference. On the other hand, RELAX Core and TREX
capture the class "regular", which is closed under boolean
operations. However, the class "regular" may provide more than one
interpretation of a document. One could argue that such multiple
interpretations are problematic, and that the class "regular" should
be avoided for this reason.</para>
<para>Finally, OASIS recently started RELAX NG <bibref
refloc="OASIS-RELAXNG-01"/>, which unifies RELAX Core and TREX. RELAX
NG inherits advantages of both languages, and is likely to become a
simple, powerful, and solid language.</para>
</section>
</body>
<!-- REAR === READ === REAR -->
<rear>
<bibliog>
<bibitem id="WWW-BFRW-01">
<bib>BFRW01</bib>
<pub>
A. Brown, M. Fuchs, J. Robie, and P. Wadler.
``MSL. A model for W3C XML Schema'',
In 10th Int'l World Wide Web Conf., Hong Kong, May 2001.
</pub>
</bibitem>
<!-- this reference is replaced to latest version (dongwon)
<bibitem id="WWW-BPS-98">
<bib>BPS98</bib>
<pub>
T. Bray, J. Paoli, and C. M. Sperberg-McQueen (Eds).
``Extensible Markup Language (XML) 1.0'', Feb. 1998.
<highlight
style="ital">http://www.w3.org/TR/1998/REC-xml-19980210</highlight>.</pub>
</bibitem>
-->
<bibitem id="WWW-BPS-00">
<bib>BPS00</bib>
<pub>
T. Bray, J. Paoli, and C. M. Sperberg-McQueen (Eds).
``Extensible Markup Language (XML) 1.0 (2nd Edition)'', W3C
Recommendation, Oct. 2000.
<highlight
style="ital">http://www.w3.org/TR/2000/REC-xml-20001006</highlight>.</pub>
</bibitem>
<bibitem id="WWW-TREX-01">
<bib>Cla01</bib>
<pub>
J. Clark.
``TREX - Tree Regular Expressions for XML", 2001.
<highlight style="ital">http://www.thaiopensource.com/trex</highlight>.</pub>
</bibitem>
<bibitem id="BOOK-TATA-97">
<bib>CDG+97</bib>
<pub>
<!-- BOOK-TATA-97 -->
H. Comon, M. Dauchet, R. Gilleron, F. Jacquemard, D. Lugiez, S. Tison,
and M. Tommasi.
``Tree Automata Techniques and Applications'', 1997.
<highlight
style="ital">http://www.grappa.univ-lille3.fr/tata</highlight>.</pub>
</bibitem>
<!-- added and cited from conclusion section (dongwon) -->
<bibitem id="OASIS-RELAXNG-01">
<bib>CM01</bib>
<pub>
J. Clark, and M. Murata (Eds).
``RELAX NG Tutorial", OASIS Working Draft, Jun. 2001.
<highlight style="ital">http://www.oasis-open.org/committees/relax-ng/tutorial.html</highlight>.</pub>
</bibitem>
<!-- replaced to the latest 2001/5 version (dongwon)
<bibitem id="WWW-XSD-0-00">
<bib>Fal00</bib>
<pub>
D. C. Fallside (Eds).
``XML Schema Part 0: Primer'', Oct. 2000.
<highlight
style="ital">http://www.w3.org/TR/xmlschema-0</highlight>.</pub>
</bibitem>
-->
<bibitem id="WWW-XSD-0-00">
<bib>Fal01</bib>
<pub>
D. C. Fallside (Eds).
``XML Schema Part 0: Primer'', W3C Recommendation, May 2001.
<highlight
style="ital">http://www.w3.org/TR/xmlschema-0</highlight>.</pub>
</bibitem>
<bibitem id="BOOK-HU-79">
<bib>HU79</bib>
<pub>
<!-- BOOK:HU:79 -->
J. E. Hopcroft and J. D. Ullman.
``Introduction to Automata Theory, Language, and
Computation''.
Addison-Wesley, 1979.
</pub>
</bibitem>
<bibitem id="ICFP-HJP-00">
<bib>HVP00</bib>
<pub>
<!-- ICFP:HJP:00 -->
H. Hosoya, J. Vouillon, and B. C. Pierce.
``Regular Expression Types for XML''.
In Int'l Conf. on Functional Programming (ICFP), Montreal,
Canada, Sep. 2000.
</pub>
</bibitem>
<bibitem id="WWW-Jell-00">
<bib>Jel00</bib>
<pub>
<!-- WWW:Jell:00 -->
R. Jelliffe.
``The XML Schema Specification in Context'', Feb. 2000.
<highlight
style="ital">http://www.ascc.net/~ricko/XMLSchemaInContext.html</highlight>.</pub>
</bibitem>
<bibitem id="WWW-Kawaguchi">
<bib>Ka01</bib>
<pub>
K. Kawaguchi.
``W3C XML Schema Made Simple'', XML.com 2001
<highlight
style="ital">http://www.xml.com/pub/a/2001/06/06/schemasimple.html</highlight>.</pub>
</bibitem>
<bibitem id="WWW-DSD-00">
<bib>KMS00</bib>
<pub>
<!--WWW:DSD:00-->
N. Klarlund, A. Moller, and M. I. Schwatzbach. ``DSD: A Schema Language for XML''.
In ACM SIGSOFT Workshop on Formal Methods in Software Practice, Portland, OR, Aug. 2000.
</pub>
</bibitem>
<bibitem id="SIGMOD-REC-LC-00">
<bib>LC00</bib>
<pub>
<!-- SIGMOD-REC:LC:00 -->
D. Lee and W. W. Chu.
``Comparative Analysis of Six XML Schema Languages''.
ACM SIGMOD Record, 29(3):76--87, Sep. 2000.
</pub>
</bibitem>
<bibitem id="IBM-TR-LMM-00">
<bib>LMM00</bib>
<pub>
<!-- IBM-TR:LMM:00 -->
D. Lee, M. Mani, and M. Murata.
``Reasoning about XML Schema Languages using Formal Language
Theory''.
Technical Report, IBM Almaden Research Center, RJ# 10197, Log#
95071, Nov. 2000.
<highlight
style="ital">http://www.cs.ucla.edu/~dongwon/paper</highlight>.</pub>
</bibitem>
<bibitem id="WWW-SAX-00">
<bib>Meg00</bib>
<pub>
<!-- WWW:SAX:00 -->
D. Megginson.
``SAX 2.0: The Simple API for XML'', May. 2000.
<highlight
style="ital">http://www.megginson.com/SAX/index.html</highlight>.</pub>
</bibitem>
<!-- Web:HEDGE:00 -->
<!-- <bibitem id="Web-HEDGE-00">
<bib>Mur00a</bib>
<pub>
M. Murata.
``Hedge Automata: a Formal Model for XML Schemata'', 2000.
<highlight
style="ital">http://www.xml.gr.jp/relax/hedge_nice.html</highlight>.</pub>
</bibitem>
-->
<bibitem id="Web-RELAX-00">
<bib>Mur00b</bib>
<pub>
<!-- Web:RELAX:00 -->
M. Murata.
``RELAX (REgular LAnguage description for XML)'', Aug. 2000.
<highlight style="ital">http://www.xml.gr.jp/relax</highlight>.</pub>
</bibitem>
<!-- PODS:Murata:01 -->
<bibitem id="PODS-Murata-01">
<bib>Mur01</bib>
<pub>
M. Murata.
``Extended Path Expressions for XML",
In ACM PODS, Santa Barbara, CA, May 2001.
<!-- this looks error. removing (dongwon)
<highlight style="ital">http://www.xml.gr.jp/relax</highlight>. -->
</pub>
</bibitem>
<bibitem id="IC-Takahashi-75">
<bib>Tak75</bib>
<pub>
<!-- IC:Takahashi:75 -->
M. Takahashi.
``Generalizations of Regular Sets and Their Applicatin to a Study of
Context-Free Languages''.
Information and Control, 27(1):1--36, Jan. 1975.
</pub>
</bibitem>
<bibitem id="WWW-XSD-1-00">
<bib>TBMM00</bib>
<pub>
<!-- WWW:XSD-1:00 -->
H. S. Thompson, D. Beech, M. Maloney, and N. Mendelsohn (Eds).
``XML Schema Part 1: Structures'', W3C Recommendation, May 2001.
<highlight
style="ital">http://www.w3.org/TR/xmlschema-1/</highlight>.</pub>
</bibitem>
<bibitem id="WWW-XML-Query-01">
<bib>CCF+01</bib>
<pub>
Don Chamberlin, James Clark, Daniela Florescu, Jonathan Robie, Jérôme Siméon, Mugur Stefanescu (Eds).
``XQuery 1.0: An XML Query Language'', W3C Working Draft, June 2001.
<highlight
style="ital">http://www.w3.org/TR/xquery/</highlight>.</pub>
</bibitem>
<bibitem id="OASIS">
<bib>OA01</bib>
<pub>
OASIS.
``SGML/XML and Forest/Hedge Automata Theory'', Web page, May 2001.
<highlight
style="ital">http://xml.coverpages.org/hedgeAutomata.html</highlight>.</pub>
</bibitem>
<bibitem id="PODS-VV-01">
<bib>VV01</bib>
<pub>
V. Vianu.
``A Web Odyssey: From Codd to XML''. In ACM PODS, Santa Barbara, CA, May 2001.
</pub>
</bibitem>
<bibitem id="WWW-DOM-00">
<bib>WHA+00</bib>
<pub>
<!-- WWW:DOM:00 -->
L. Wood, A. Le Hors, V. Apparao, S. Byrne, M. Champion, S. Isaacs, I. Jacobs, G. Nicol, J. Robie, R. Sutor, and C. Wilson (Eds).
``Document Object Model (DOM) Level 1 Specification'', W3C Recommendation, Sep. 2000.
<highlight
style="ital">http://www.w3.org/TR/2000/WD-DOM-Level-1-20000929/</highlight>.
</pub>
</bibitem>
</bibliog>
</rear>
</paper>