ACM SIGMOD Record 29(3), September 2000.


Comparative Analysis of Six XML Schema Languages1

Dongwon Lee     Wesley W. Chu

Department of Computer Science
University of California, Los Angeles
Los Angeles, CA 90095, USA
Email: {dongwon,wwc}@cs.ucla.edu


Abstract: As XML [5] is emerging as the data format of the internet era, there is an substantial increase of the amount of data in XML format. To better describe such XML data structures and constraints, several XML schema languages have been proposed. In this paper, we present a comparative analysis of six noteworthy XML schema languages.

1   Introduction

As of June 2000, there are about a dozen of XML schema languages that have been proposed. Among those, in this paper, we choose six schema languages (XML DTD [5], XML Schema [4, 9, 22], XDR [10, 15, 17], SOX [8], Schematron [11, 18], DSD [12, 13]) as representatives2.

Our rationale in choosing the representatives is as follows: 1) they are backed by substantial organizations so that their chances of survival are high (e.g., XML DTD and XML Schema by W3C, XDR by Microsoft, DSD by AT&T), 2) there are publically known usages or applications (e.g., XML DTD in XML, XDR in BizTalk, SOX in xCBL), 3) the language has a unique approach distinct from XML DTD (e.g., SOX, Schematron, DSD).

First, we briefly review each schema language.

1.1   XML DTD

XML DTD (DTD in short), a subset of SGML DTD, is the de facto standard XML schema language of the past and present and is most likely to thrive until XML Schema finally arrives. It has limited capabilities compared to other schema languages. Its main building block consists of an element and an attribute. The real world is typically represented by the use of hierarchical element structures.

1.2   XML Schema

XML Schema is an ongoing effort of W3C to aid and eventually replace DTD in the XML world. XML Schema aims to be more expressive than DTD and more usable by a wider variety of applications. It has many novel mechanisms such as inheritance for attributes and elements, user-defined datatypes, etc.

1.3   XDR



First known as XML-Data, then later trimmed and improved to XDR (XML-Data Reduced), this language is a joint effort of Microsoft and others and is being used in Microsoft's BizTalk framework. XDR is heavily influenced by another proposal co-developed by IBM and Microsoft, DCD (Document Content Description), and thus shares many similar features.

1.4   SOX

SOX (Schema for Object-Oriented XML) is an alternative schema language for defining the syntactic structure and partial semantics of XML document types. As the name implies, it extends DTD in an object-oriented way by allowing extensible data types and inheritance among element types. The current version, 2.0, is being developed by Commerce One.

1.5   Schematron

Schematron, created by Rick Jelliffe, is quite unique from others in that it focuses on validating schemas using patterns instead of defining schemas. Its schema definition is simple enough to be defined in a single page, yet provides very powerful constraint specification via XPath [7]. The latest version is 1.4.

1.6   DSD

DSD 1.0 was co-developed by AT&T Labs and BRICS with the goals of context-dependent description of elements and attributes, flexible default insertion mechanisms, expressive power close to XSLT [6], etc. Like Schematron, DSD has a strong edge on schema constraints.

1.7   Other Languages

In addition, DCD [3], DDML [2], Assertion Grammars [19], RELAX [16] have been proposed.

2   Features Classification

In the following, we denote a constant value with single quotes regardless of the language specification for simplicity. Furthermore, any attribute A or element E in the language is denoted by áAñ or áEñ. When a schema language supports a certain feature fully or partially, we denote it as Yes or Partial. Otherwise, we denote No. Furthermore, when there is no explicitly equivalent construct in the language, but the feature can be simulated using other constructs with reasonable complexity, we consider the feature supported by the language.

2.1   Schema

1. Syntax in XML: Using XML syntax for the schema language brings several benefits [14]: 1) users do not have to learn new proprietary syntax, 2) the schema language can be readily applicable to existing XML applications (e.g., editor, browser), 3) the schema file can be stored in a XML storage system along with XML documents, and 4) the schema language is extensible. All the schema languages except DTD are written in XML syntax:


DTD: No XML Schema: Yes XDR: Yes
SOX: Yes Schematron: Yes DSD: Yes




2. Namespace: All languages except DTD and DSD support namespace.


DTD: No XML Schema: Yes XDR: Yes
SOX: Yes Schematron: Yes DSD: No



Suppose one wants to define the book element by reusing the address element defined elsewhere (denoted as URI) and defining his own title element. This can be written in XML Schema as follows:
<schema xmlns:z='URI' ...>
  <element name='book'>
    <complexType>
      <element name='title' type='string'>
      <element name='address' type='z:address'>
    </complexType>
  </element>
</schema>
Similarly, in XDR:
<ElementType name='title' dt:type='string'/>
<ElementType name='book' xmlns:z='URI'>
  <element type='title'><element type='z:address'>
</ElementType>
SOX supports elements ánamespaceñ to declare namespace and two attributes áprefixñ and átypeñ to qualify names.
<namespace prefix='z' namespace='URI'/>
<elementtype name='book'>
  <model>
    <element type='title'>
    <element prefix='z' type='address'>
  </model>
</elementtype>
Similarly, Schematron provides an attribute ns for the element áschemañ.


3. Include & import: Sometimes, it is convenient to pull in externally defined schema fragments to the current schema. This is especially true when the schema gets larger; it becomes more desirable to have modular schema definitions for better maintainence and readability. Several schema languages support this feature. If the newly pulled-in fragments can have only the same target namespace as the current schema, then we refer to it as include. Otherwise, we refer to it as import. First, Include is supported as follows:


DTD: No XML Schema: Yes XDR: No
SOX: Yes Schematron: No DSD: Yes



In XML Schema, using áinclude schemaLocation='URI'ñ is conceptually equivalent to replacing the include clause with all the definitions in the URI. The namespace of the included fragments must be the same as that of the current schema. In SOX, a construct ájoinñ allows schema definitions belonging to the same namespace to be pulled in. Similarly, áincludeñ is supported in DSD as well.

Furthermore, Import is supported as follows:


DTD: No XML Schema: Yes XDR: No
SOX: Yes Schematron: No DSD: No



In XML Schema, a construct áimportñ exists. By importing multiple namespaces, XML Schema allows definitions and declarations contained in schemas under different namespaces. In SOX, special processing indicator á? import ?ñ is used to import schema that can override the default namespace declared in the current schema.

2.2   Datatype

Datatype can be categorized into two types: simple or complex. A simple type cannot have element content nor carry attributes while a complex type can. Although most schema languages support simple types separately, the support of complex type is a bit fuzzy due to the mixed definition of complex type and element type. Therefore, here, we only explicitly compare features of the simple type. The features of the complex type are interspersed through Sections 2.4 and 2.5.


1. Built-in type: This is either a primitive or derived simple type provided by the schema language specification. Most schema languages, except Schematron and DSD, support an array of built-in types including the plain string and XML-related types (e.g., ID, NMTOKEN). The number of such built-in types are:


DTD: 10 XML Schema: 37 XDR: 33
SOX: 17 Schematron: 0 DSD: 0



While DTD supports only XML-related primitive types, XML Schema supports an extensive set of 37 built-in types, covering most types being used in general programming languages. So does XDR or SOX. Since the focus of Schematron is validating XML structure, it does not provide any explicit built-in types. Similarly, DSD has no built-in types. However many types can be easily simulated through its support of regular expressions.


2. User-defined type: When schema designers consider certain types be defined as simple types in their schema, XML Schema, SOX, and DSD provide such a facility:


DTD: No XML Schema: Yes XDR: No
SOX: Yes Schematron: No DSD: Yes



In XML Schema, new simple types can be created by deriving from built-in or derived types via the inheritance. Details will be found in Section 2.5. In SOX, new datatypes can be defined using three facets áenumerationñ, áscalarñ and ávarcharñ. Although types can be simulated in Schematron, they are not treated as first-class objects as in other languages. DSD uses a construct áStringTypeDefñ along with a rich set of operators and regular expressions to support user-defined types. For instance, a 9 digit US zipcode definition can be written as follows in DSD:
<StringTypeDef ID='zipcode'>
  <Sequence>
    <Repeat value='5'>
      <CharSet Value='0123456789'>
    </Repeat>
    <Optional>
      <String Value='-'>
      <Repeat value='4'>
        <CharSet Value='0123456789'>
      </Repeat>
    </Optional>
  </Sequence>
</StringTypeDef>

3. Domain constraint: Not only the type itself, but also the legal values for the type are important. Some languages support a set of constructs to limit the valid domain values for datatypes as follows:


DTD: No XML Schema: Yes XDR: No
SOX: Partial Schematron: Yes DSD: Yes



Towards this feature, XML Schema supports a multitude of facets (e.g., range, precision, length, mask) and regular expressions. SOX provides a primitive set of facets including enumeration, min or max value, maxlength, etc. However, the pattern language is not supported. Although built-in or user-defined types are not allowed in Schematron, one can simulate such types using Schematron's support of XPath. For instance, the integer type for the element E can be simulated as follows [18]:
<rule context='E'>
  <assert test='floor(.) = number(.)'>
    E can have only integer value.</assert>
</rule>
As shown in the example of the user-defined type case, DSD supports a set of pattern-related operators to constrain the legal domain for user-defined types.


4. Explicit null: It is often preferable to differentiate among unknown, inapplicable or others by supporting the explicit ``null'' values.


DTD: No XML Schema: Yes XDR: No
SOX: No Schematron: No DSD: No



In XML Schema, there is an attribute ánullableñ to indicate that the element content is null. In a XML instance document, the element fullname carries an attribute null='true' to represent the nullness as shown below:
schema  : <element name='fullname' nullable='true'/>
instance: <fullname xsi:null='true'></fullname>

2.3   Attribute

1. Default value: All support this feature.


DTD: Yes XML Schema: Yes XDR: Yes
SOX: Yes Schematron: No DSD: Yes



In an attribute declaration of DTD, if the declaration is neither #REQUIRED nor #IMPLIED, then the attribute value contains the declared default value.
<!ATTLIST list  type   (bullets|ordered) 'ordered'>
<!ATTLIST form  method CDATA #FIXED 'POST'>
Here, the attribute type of the element list has a default value of ``ordered'' while the attribute method of the element form has a fixed value of ``POST''. Other languages support default values similarly. The following three snippets in the order of XML Schema, XDR and SOX illustrate an attribute nm with a default value ``John Doe'':
<attribute name='nm' use='default' value='John Doe'/>

<AttributeType name='nm' dt:type='string'/>
<attribute type='fullname' default='John Doe'/>

<attrdef name='nm' datatype='string'>
  <default>John Doe</default>
</attrdef>
DSD provides a more sophisticated way of defining default for attributes by associating them with a boolean expression. For instance, in DSD, one can specify a default value of ``John Doe'' for male employees as follows:
<Default>
  <Context><Element Name='employee'>
    <Attribute Name='gender' Value='M'/>
  </Element></Context>
  <DefaultAttribute Name='nm' Value='John Doe'/>
</Default>

2. Choice among attributes: This feature comes in handy when schema designers want to associate multiple attributes with an element and constrain validity to one attribute at any given time.


DTD: No XML Schema: No XDR: No
SOX: No Schematron: Yes DSD: Yes



Schematron and DSD can express the requirement that exactly one of the two attributes fn and gn must be present as an attribute of person element as follows:
<rule context='person'>
  <assert test='@fn or @gn'>Or semantics</assert>
  <assert test='count(attribute::*) = 1'>
    Only one attribute</assert>
</rule>

<ElementDef ID='person'>
  <AttributeDecl Name='fn' IDType='ID'/>
  <AttributeDecl Name='gn' IDType='ID'/>
  <OneOf>
    <Attribute Name='fn'/><Attribute Name='gn'/>
  </OneOf>
</ElementDef>

3. Optional vs. required: In all languages, whether or not an attribute definition is required in a XML document instance can be expressed.


DTD: Yes XML Schema: Yes XDR: Yes
SOX: Yes Schematron: Yes DSD: Yes



To denote an attribute must be present, DTD uses a keyword #REQUIRED while XML Schema uses áuse='required'ñ in the attribute declaration. Similarly, in XDR, an attribute árequired='yes'ñ is used while in SOX, an element árequired/ñ is used for mandatory attribute definition. Schematron can enforce this feature using a pattern áassert test='@attribute-name'ñ. Like XDR, DSD supports an attribute áOptional='no'ñ.


4. Domain constraint: Some languages can specify admissible values for attributes.


DTD: Partial XML Schema: Yes XDR: Partial
SOX: Partial Schematron: Yes DSD: Yes



DTD and XDR provide only the enumeration capability by which users can list all legal values for the attribute being defined. For instance, the following snippets show examples of DTD and XDR for such an enumerated attribute type, RGB:
<!ATTLIST spec RGB (red|green|blue)>

<AttributeType name='RGB' dt:type='enumeration'
  dt:values='red green blue'/>
In XML Schema, domain values for simple types can first be constrained using various facets and then new attributes can be defined using the simple types. SOX provides áenumerationñ to constrain the attribute domain. In Schematron, the support for an arbitrary domain constraint rule for attribute values is possible as shown in the case of the domain constraint for datatypes in Section 2.2. In DSD, one can apply numerous operators such as áUnionñ or áRepeatñ to the construct áStringTypeñ to constrain domain values.


5. Conditional definition: Often, an attribute a1 of an element E is relevant only when an attribute a2 has a certain value.


DTD: No XML Schema: No XDR: No
SOX: No Schematron: Yes DSD: Yes



For instance, the following Schematron schema states that if the element E has the attribute one, then it must have the second attribute two as well:
<rule context='E'>
  <report test='(@one) or not(@one and @two)'>
    E cannot have attribute 'one' alone.</report>
</rule>
DSD supports this feature easily using its rich boolean operators. For instance, the following snippet states that the salary attribute is defined only when the student is a ``TA'':
<ElementDef ID='student'>
  <If><Attribute Name='TA' Value='yes'>
    <Then><Optional>
      <AttributeDecl Name='salary'/>
    </Optional></Then>
  </If>
</ElementDef>

2.4   Element

1. Default value: Elements can have either simple or complex default values.


DTD: No XML Schema: Partial XDR: No
SOX: No Schematron: No DSD: Yes



In XML Schema, one can provide a string value as the default value when the element has a simple type.
<element name='fullname' type='string'
         default='John Doe'/>
DSD allows both simple and complex defaults for elements using áDefaultContentñ. For instance, one can specify that a default address is ``Los Angeles'' and ``CA'':
<Default>
  <Context><Element Name='address'/></Context>
  <DefaultContent>
    <city>Los Angeles</city><state>CA</state>
  </DefaultContent>
</Default>

2. Content model: The element content model can be 1) empty, 2) text (including datatype), 3) element, or 4) mixed (text + element).


DTD: Yes XML Schema: Yes XDR: Yes
SOX: Partial Schematron: Yes DSD: Yes



DTD supports all four content models as follows:
empty   : <!ELEMENT o EMPTY>
text    : <!ELEMENT p (#PCDATA) >
element : <!ELEMENT q (x?|y*|z+) >
mixed   : <!ELEMENT r (#PCDATA|x)* >
Similarly, XML Schema and XDR support the four content models using a construct ácontentñ which supports values such as ``empty'', ``textOnly'', ``elementOnly'' (``eltOnly'' for XDR), ``mixed''. Furthermore, XML Schema allows specification of a datatype for an element. SOX supports three content models using constructs áempty/ñ, ástring/ñ and áelement/ñ, respectively, but does not explicitly support the mixed content model. In Schematron, the following XPath expression can be used as a value for áassertñ construct to specify the four content models:
empty   : not(*)
text    : string-length(text()) > 0
element : count(element::*) = count(*)
mixed   : by default
DSD also supports all four models using constructs áEmpty/ñ, áStringType/ñ, áElement/ñ and áAnyElement/ñ, respectively.


3. Ordered sequence: The order among sub-elements is consequential.


DTD: Yes XML Schema: Yes XDR: Yes
SOX: Yes Schematron: Yes DSD: Yes



In DTD, sub-elements listed with an operator ``,'' obey the order among them. Likewise, in XML Schema, the order needs to be preserved unless otherwise specified. Otherwise, one can explicitly specify that the order is sequential using a grouping construct ásequenceñ. In XDR, the áorder='seq'ñ attribute specifies that sub-elements are required to appear in a sequential order. SOX supports ásequenceñ content models as well. For instance, in SOX, the following states that the person element must have the sub-element fn followed by the sub-element ln:
<elementtype name='person'>
  <model><sequence>
    <element name='fn'/><element name='ln'/>
  </sequence></model>
</elementtype>
The same schema can be written in Schematron as follows:
<rule context='person'>
  <assert test='(*[position()=1] = fn)
     and (*[position()=2] = ln)'>
    fn must be followed by ln.</assert>
</rule>
The ordered sequence in DSD is expressed in a similar fashion by the construct áSequenceñ in an element content definition.
<ElementDef ID='person'>
  <Sequence>
    <Element Name='fn'/><Element Name='ln'/>
  </Sequence>
</ElementDef>

4. Unordered sequence: The order among sub-elements is inconsequential.


DTD: No XML Schema: Yes XDR: Yes
SOX: No Schematron: Yes DSD: Yes



Unlike SGML which offers an operator & to create an unordered sequence, DTD does not offer an explicit operator for unordered sequence. Instead, one needs to encode all the possible combinations of the sub-elements. For instance, to express an unordered sequence of sub-element (a & b & c) of SGML in DTD, one has to write ((a,b,c)|(a,c,b)|(b,a,c)|(b,c,a)|(c,a,b)|(c,b,a)) or somewhat incorrectly (a|b|c)* [20]. Using a grouping construct áallñ in XML Schema, one can specify the unordered sequence. In XDR, the áorder='many'ñ attribute specifies that sub-elements can appear in any order. In Schematron, if one does not specify any patterns, then it takes the unordered sequence by default.

In DSD, a single content expression describes a set of allowed sequences of string data and elements. Several content expressions describe all merging of sequences, one from each expression. Thus, by cleverly using this feature, one can capture "floating elements", i.e., mixes of ordered and unordered contents. This feature in DSD is more expressive than the simple ordered or unordered content model.


5. Choice among elements: Only one sub-element among candidates is allowed.


DTD: Yes XML Schema: Yes XDR: Yes
SOX: Yes Schematron: Yes DSD: Yes



DTD uses an operator ``|'' to denote choice among elements. Using a grouping construct áchoiceñ in XML Schema, one can specify that only one of the sub-elements in the group must appear. In XDR, the áorder='one'ñ attribute specifies that only one sub-element can be used. SOX supports the áchoiceñ content model for an element. Schematron can express its choice among elements using rules similar to the case of choice among attributes in Section 2.3. In DSD, the construct áOneOfñ is supported as follows:
<ElementDef ID='person'>
  <OneOf>
    <Element Name='fn'/><Element Name='gn'/>
  </OneOf>
</ElementDef>

6. Min & Max occurrence: In this scheme, the language can support if minimum occurrence is k and maximum occurrence is l.


DTD: Partial XML Schema: Yes XDR: Yes
SOX: Yes Schematron: Yes DSD: Partial



In DTD, the occurrences of elements can be only primitively controlled by the three Kleene operators: 1) ``?'' for 0 or 1, 2) ``*'' for 0 or many and 3) ``+'' for 1 or many. In XML Schema, an element declaration carries minOccurs='k' and maxOccurs='l'. In XDR, áminOccursñ and ámaxOccursñ attributes specify how many times an element can appear within another element. In SOX, an element definition carries áoccursñ attribute that indicates the number of repetitions of the instanced element. It can take 1) the three Kleene operators (i.e., ?, *, +), 2) a value of the form ``k,l'', or 3) ``k,*''. In Schematron, this can be written as áassert test='count(E)>=k'ñ and áassert test='count(E)<=l'ñ. In DSD, the occurrences of elements can be specified as áOptionalñ, áZeroOrMoreñ, áOneOrMoreñ, and áUnionñ, but cannot be specified with respect to the exact minimum and maximum numbers.


7. Open model: An open content model enables additional elements or attributes to be present within an element without having to declare each and every element. This provides an extensibility mechanism.


DTD: No XML Schema: No XDR: Yes
SOX: No Schematron: Yes DSD: No



In XDR, the model is open by default. One has to specify a closed model explicitly with ámodel='closed'ñ. Suppose, for instance, one has the following person element definition (the city and state elements are defined elsewhere):
<ElementType name='address' model='closed'>
  <element type='city'/><element type='state'/>
</ElementType>
<ElementType name='person'>
  <element type='address'/>
</ElementType>
This definition states that the address element can have only two sub-elements city and state while the person element can have a sub-element address and possibly others since it is an ``open'' content model. Thus, the following XML document instance is valid although the unknown element name is added to the person element.
<person>
  <address>
    <city>Los Angeles</city><state>CA</state>
  </address>
  <name>John Doe</name>
</person>
In Schematron, the content model is open by default. The closed model also can be expressed using a count() function in XPath. For instance, the following schema states that the person element is closed (when the name and address are all the sub-elements of the person):
<rule context='person'>
  <assert test='count(name|address) = count(*)'>
    There is an extra element.</assert>
</rule>
In languages that support ``any'' element concept, since any well-formed XML fragment is allowed for the any element, the open model can be simulated in some sense. However, since this requires the ``any'' element be defined in the schema beforehand, it is less flexible than the explicit open model.


8. Conditional definition: Often, elements are allowed only in certain situations.


DTD: No XML Schema: No XDR: No
SOX: No Schematron: Yes DSD: Yes



For instance, the following Schematron schema states that, in HTML, the element input can appear only if it is inside the element form.
<rule context='E'>
  <report test='not(parent::form) and input'>
    Element input cannot appear.</report>
</rule>
DSD supports this feature using its boolean operators. The usage is similar to the case of the conditional definition for attributes.

2.5   Inheritance

As in object-oriented inheritance, inheritance is done by extending or restricting the base type. In this section, we divide the target of the inheritance into simple and complex types. When some languages support inheritance toward attribute and element instead, we treat them as the simple and complex type inheritance, respectively.


1. Simple type by extension: In this scheme, new simple types may be created by deriving from other simple types with more relaxed domain constraint. The set of legal values of the new type is a superset of that of the base type. No languages support this feature.


DTD: No XML Schema: No XDR: No
SOX: No Schematron: No DSD: No




2. Simple type by restriction: The set of legal values of the new type is a subset of that of the base type.


DTD: No XML Schema: Yes XDR: No
SOX: Yes Schematron: No DSD: No



In XML Schema, inheritance among simple types are allowed as shown in the following example, where a 9 digit US zipcode is created from the base type string:
<simpleType name='zipcode' base='string'>
  <pattern value='[0-9]{5}(-[0-9]{4})?'/>
</simpleType>
By constraining the domain values using the pattern expression, the legal values for the zipcode have been restricted from the string type. In SOX, new datatypes may be refined from built-in or derived types. For instance, the new datatype RGB allows only three values from the color type.
<datatype name='RGB'>
  <enumeration datatype='color'>
    <option>Red</option>
    <option>Green</option>
    <option>Blue</option>
  </enumeration>
</datatype>

3. Complex type by extension:


DTD: No XML Schema: Yes XDR: No
SOX: Yes Schematron: No DSD: No



XML Schema supports type inheritance using constructs ábaseñ and áderivedBy='extension'ñ. Newly added elements are always appended at the end. In SOX, áextends type='basetype'ñ is supported, where appending new elements and attributes are allowed. Given the person element defined elsewhere, the following example illustrates how the new element new-person inherits the content model of the person element and has an additional element address and attribute email.
<elementtype name='new-person'>
  <extends type='person'>
    <append>
      <element name='address' type='addr'/>
    </append>
    <attdef name='email' datatype='string'>
  </extends>
</elementtype>
In DSD, any definition can be redefined using the áRenewIDñ and áCurrIDRefñ constructs. However, once the new type is defined, the original type is no longer accessible. Therefore, this feature is for renewing rather than deriving.


4. Complex type by restriction:


DTD: No XML Schema: Yes XDR: No
SOX: No Schematron: No DSD: No



In XML Schema, it is possible to derive new types by restricting the content models of existing types. The values represented by the new type are a subset of the values represented by the base type. For instance, the following schema shows the newly defined element E whose type is ResItemType which is required to have at least one item sub-element as a new restriction.
<complexType name='ItemType'>
  <element name='item' minOccurs='0'>
</complexType>
<complexType name='ResItemType'
  base='ItemType' derivedBy='restriction'>
  <element name='item' minOccurs='1'>
</complexType>
<element name='E' type='ResItemType'>

2.6   Being unique or key

1. Uniqueness for attribute: All languages support this feature.


DTD: Yes XML Schema: Yes XDR: Yes
SOX: Yes Schematron: Yes DSD: Yes



DTD, XDR, SOX and DSD use ID type for an attribute to ensure uniqueness while XML Schema uses áuniqueñ where the scope and target object of the uniqueness are specified by áselectorñ and áfieldñ constructs, respectively. Since Schematron does not have an explicit construct equivalent to ID in DTD, uniqueness for an attribute must be simulated using pattern ``count()=1''.


2. Uniqueness for non-attribute: Schema languages like XML Schema, Schematron, or DSD specify uniqueness not only for attributes but also for arbitrary elements or even composite objects (attribute + element) in a portion of the document or the whole document.


DTD: No XML Schema: Yes XDR: Partial
SOX: No Schematron: Yes DSD: No



This feature can be easily expressed in XML Schema using the same construct áuniqueñ. For instance, the following schema ensures there exists a unique phone element under addr sub-elements of the person element.
<unique>
  <selector>person/addr</selector>
  <field>phone</field>
</unique>
In XDR, elements support the ID attribute type as if they are attributes albeit this is not implemented yet in Internet Explorer 5.
<ElementType name='phone' dt:type='ID'/>
However, XDR cannot support uniqueness of composite objects. In Schematron, the same constraint can be written as follows:
<rule context='person/addr'>
  <assert test='count(phone) = 1'>
    phone is not unique.</assert>
</rule>
3. Key for attribute: In databases, being a key requires being unique as well as not being null. A similar concept is defined in XML Schema.


DTD: No XML Schema: Yes XDR: No
SOX: No Schematron: Yes DSD: No



Using almost identical syntax as áuniqueñ, a construct ákeyñ can specify an attribute as a key in XML Schema. In Schematron, this feature can be simulated as follows:
<rule context='person'>
  <assert test='@ssn and count(@ssn) = 1'>
    Is ssn unique?</assert>
  <assert test='string-length(@ssn) > 0'>
    Is ssn not empty?</assert>
</rule>

4. Key for non-attribute: XML Schema allows specification of arbitrary elements or composite objects as key.


DTD: No XML Schema: Yes XDR: No
SOX: No Schematron: Yes DSD: No



For instance, the following schema in XML Schema defines the combination of an employee's department code (element) and employee's name (attribute) as a key.
<key name='ekey'>
  <selector>employee</selector>
  <field>dept/code</field><field>@name</field>
</key>
Schematron supports this feature similarly using patterns.


5. Foreign key for attribute: Foreign key states if 1) who is a referencing key and 2) who is being referenced by the referencing key.


DTD: Partial XML Schema: Yes XDR: Partial
SOX: Partial Schematron: Yes DSD: Yes



Like ID type, DTD, XDR, SOX and DSD use IDREF type for a referencing attribute. XML Schema uses ákeyrefñ. In addition to this, XML Schema and DSD support a method to specify whom the foreign key actually points to using constructs áreferñ and áPointsToñ, respectively. Furthermore, DSD even allows association of arbitrary boolean expressions with the áPointsToñ construct. Using this, for instance, one can specify ``an attribute A points to either attribute B in an element E1 or C in element E2'' in DSD. In Schematron, this feature can be expressed using patterns. For instance, the following schema states that dno attribute of employee element should reference the unique identifier of dept element.
<rule context = 'employee[@dno]'>
  <assert test='(name(id(@dno)) = 'dept')'>
    Error occurred.</assert>
</rule>

6. Foreign key for non-attribute:


DTD: No XML Schema: Yes XDR: No
SOX: No Schematron: No DSD: Yes



Similar to specifying uniqueness for non-attributes, XML Schema can specify foreign keys for arbitrary elements or composite objects using the same ákeyrefñ construct.
<keyref refer='ekey'>
  <selector>project</selector>
  <field>emp-dept</field><field>@ename</field>
</keyref>
Similarly, the following DSD example illustrates that an attribute book-ref is referencing an element book.
<AttributeDecl ID='book-ref' IDType='IDRef'>
  <PointsTo>
    <Context><Element Name='book'/></Centext>
  </PointsTo>
</AttributeDecl>

2.7   Miscellaneous Features


1. Dynamic constraint: In Schematron, one can selectively turn on and off the constraints using áphaseñ construct so that only part of the schema constraints can be dynamically evaluated at any given time.


DTD: No XML Schema: No XDR: No
SOX: No Schematron: Yes DSD: No




2. Version: Sometimes it is desirable to allow several different attribute or element definitions with the same name. That is, several versions of an attribute or element coexist.


DTD: No XML Schema: No XDR: No
SOX: No Schematron: No DSD: Yes



XML Schema has a construct áversionñ for schema definition, but the current specification does not define any further semantics for that; it is simply provided as a convenience. DSD utilizes both ``Name'' as well as ``ID'' attributes for element definition so that the attributes with same names are legal as long as their IDs are different. Furthermore, by using the áRenewIDñ and áCurrIDRefñ, any definition can be renewed, making a new version of the definition. For instance, the following schema illustrates the redefinition of the DSD constraint book-constraints:
<ConstraintDef ID='book-constraints'/>
<ConstraintDef RenewID='book-constraints'>
  <Constraint CurrIDRef='book-constraints'/>
  ... modification ...
</ConstraintDef>

3. Documentation: At minimum, all languages support commenting on schema fragments using a construct <-- comment -->. However, here we consider documentation features beyond commenting such as: 1) textual description to explain a schema fragment for human readers, 2) embedded documentation for application programs, or 3) error or hint messages to aid schema validation and debugging.


DTD: No XML Schema: Yes XDR: No
SOX: Yes Schematron: Yes DSD: Yes



XML Schema provides ádocumentationñ and áappinfoñ elements to support description for both human readers as well as application programs. SOX provides the áintroñ element to provide an introduction to the schema as a whole and áexplainñ element to provide a hook for including documentation within a schema fragment. However, there is no support for automatic debugging message or application programs. In Schematron, by using the assertion semantics provided by constructs áassertñ and áreportñ, detailed documentation for validating XML structures can be provided. DSD supports three keywords: áLabelñ, áBriefDocñ and áDocñ. Using these, it is straightforward to implement, for instance, a debugging system.


4. Embedded HTML: Due to HTML's popularity, it is often convenient to be able to embed HTML fragments inside XML documents.


DTD: No XML Schema: Yes XDR: No
SOX: Yes Schematron: Partial DSD: Yes



Using áanyñ element, XML Schema allows specification that any well-formed XML is permissible in a type's content model. Hence, well-formed HTML code can be easily embedded in XML document. SOX provides similar feature using áexplainñ element. Schematron allows a few HTML tags (e.g., <p>, <emph>), but not general ones. In DSD, one can use the documentation facility to embed HTML.


5. Self-describability: The following languages provide a meta schema (i.e., representing the schema specification using the schema itself being defined). The meta schema is useful in bootstrapping the implementation of the language.


DTD: No XML Schema: Partial XDR: No
SOX: No Schematron: Partial DSD: Yes



While XML Schema and Schematron provide meta schemas that capture only the syntactic requirements, DSD provides a meta schema that captures both the syntactic and semantic requirements.

3   Conclusion



Figure 1: XML schema languages classification.




Figure 2: Classification of the expressive power of the six languages.


Our comparative review of the features is summarized in Table 1.


From an ``ease of use'' point of view, DTD is arguably the easiest schema language to learn despite its use of proprietary syntax. Since the new additions to XDR and SOX are relatively manageable, we think the migration curve from DTD to these languages is not steep. Although the language specification of Schematron is very simple, it exhibits much power. However, this requires users to learn yet another language XPath. Due to the extensive set of features supported by XML Schema and DSD, we expect them to be more difficult to learn than others. Since DSD uses explicit operators for regular expressions (e.g., áRepeatñ, áOneOfñ), DSD schema tends to be more verbose than XML Schema or Schematron schema.


From a ``language'' point of view, the six reviewed XML schema languages can be roughly divided into two camps based on factors such as grammar-based vs. pattern-based, definition-oriented vs. validation-oriented, structure-oriented vs. constraints-oriented, etc. The classification is summarized in Figure 1. Based on our study, DTD, XML Schema, XDR and SOX belong to the grammar-based language group while Schematron belongs to the pattern-based language group. DSD stands in-between, supporting both features together. The grammar-based language group especially has an advantage in XML querying since knowing the structure and definition of the schema helps users write more optimized queries and detect errors in the queries more easily. On the other hand, the pattern-based language group is naturally superior with respect to the expressiveness of constraints in the application.


From a ``database'' point of view, no single language suffices the needs completely. The SQL DDL allows specification of not only a set of relations and attributes, but also information about the domain of values associated with each attribute, integrity constraints, indices for each relation, security, etc [21]. While XML Schema fulfills the support for a variety of built-in domain types, it could not express, for instance, an arbitrary SQL CHECK or ASSERT clause. Furthermore, although Schematron or DSD can express such integrity constraints, they have no support of physical indices specification for boosting performance. Since a substantial amount of web documents are generated from underlying databases by the user's request, it is important to be able to handle such data-centric features as SQL DDL do. We feel this is one of the areas where database researchers can contribute more.


From an ``expressive power'' point of view, the six languages can be organized into the following three classes as depicted in Figure 2.

One should keep in mind, however, that the philosophies by which each language has been designed are quite different; some try to define more semantics while others try to be more minimalistic. Therefore, the languages in a higher class should not be regarded as superior to the ones in a lower class.


In our study, we have found that the support of constraints in the schema language (e.g., Schematron, DSD) is a very attractive feature. However, at the same time, ignoring the schema definition aspect completely like Schematron raises some concern as a general purpose schema language. Although XML Schema identifies many commonly recurring schema constraints and incorporates them into the language specification, we still feel XML Schema is too rigid in that sense. It would be interesting to see if the support of constraints will be added to XML Schema in the future.


Acknowledgment


The authors wish to thank Rick Jelliffe (ASCC) for answering questions regarding
Schematron and Michael I. Schwartzbach (BRICS) for his helpful comments on DSD during the writing of this paper.

References

[1]
A. Bonifati, S. Ceri, ``Comparative Analysis of Five XML Query Languages'', ACM SIGMOD Record, 29(1), 2000.

[2]
R. Bourret, J. Cowan, I. Macherius, S. St. Laurent (ed.), ``Document Definition Markup Language (DDML) Specification, Version 1.0'', January 1999. (http://www.w3.org/TR/NOTE-ddml)

[3]
T. Bray, C. Frankston, A. Malhotra (ed.), ``Document Content Description for XML'', July 1998. (http://www.w3.org/TR/NOTE-dcd)

[4]
P. V. Biron, A. Malhotra (ed.) ``XML Schema Part 2: Datatypes'', W3C, April 2000. (http://www.w3.org/TR/xmlschema-2)

[5]
T. Bray, J. Paoli, C. M. Sperberg-McQueen (ed.), ``Extensible Markup Language (XML) 1.0'', W3C, February 1998. (http://www.w3.org/TR/REC-xml)

[6]
J. Clark (ed.) ``XML Transformations (XSLT) Version 1.0'', W3C, November 1999. (http://www.w3.org/TR/xslt)

[7]
J. Clark, S. DeRose (ed.) ``XML Path Language (XPath) Version 1.0'', W3C, November 1999. (http://www.w3.org/TR/xpath)

[8]
A. Davidson, M. Fuchs, M. Hedin, et al., ``Schema for Object-Oriented XML 2.0'', W3C, July 1999. (http://www.w3.org/TR/NOTE-SOX)

[9]
D. C. Fallside (ed.), ``XML Schema Part 0: Primer'', W3C, April 2000. (http://www.w3.org/TR/xmlschema-0)

[10]
C. Frankston, H. S. Thompson, ``XML-Data Reduced'', Internet Document, July 1998. (http://www.ltg.ed.ac.uk/~ht/XMLData-Reduced.htm)

[11]
R. Jelliffe, ``Schematron'', Internet Document, May 2000. (http://www.ascc.net/xml/resource/schematron/)

[12]
N. Klarlund, A. Moller, M. I. Schwatzbach, ``Document Structure Description 1.0'', Internet Document, 1999. (http://www.brics.dk/DSD/)

[13]
N. Klarlund, A. Moller, M. I. Schwatzbach, ``DSD: A Schema Language for XML'', Proc. 3rd ACM Workshop on Formal Methods in Software Practice, 2000.

[14]
S. St. Laurent, ``Describing Your Data: DTDs and XML Schemas'', XML.com, December 1999. (http://www.xml.com/pub/1999/12/dtd/)

[15]
A. Layman, E. Jung, et al., ``XML-Data'', W3C, January 1998. (http://www.w3.org/TR/1998/NOTE-XML-data)

[16]
M. Makoto, ``RELAX (REgular LAnguage description for XML), Internet Document, April 2000. (http://www.xml.gr.jp/relax/)

[17]
Microsoft, ``XML Schema Developer's Guide'', Internet Document, May 2000. (http://msdn.microsoft.com/xml/XMLGuide/schema-overview.asp)

[18]
N. Miloslav, ``Schematron Tutorial'', Internet Document, May 2000. (http://www.zvon.org/HTMLonly/SchematronTutorial/
General/contents.html
)

[19]
D. Raggett, ``Assertion Grammars'', Internet Document, May 1999. (http://www.w3.org/People/Raggett/dtdgen/Docs/)

[20]
A. Sahuguet ``Everything You Ever Wanted to Know About DTDs, But Were Afraid to Ask'', Proc. 3rd Int'l Workshop on the Web and Databases (WebDB), Dallas, TX, 2000

[21]
A. Silberschatz, H. F. Korth, S. Sudarshan, ``Database System Concepts (3rd Edition)'', McGraw-Hill Co., 1997.

[22]
H. S. Thompson, D. Beech, M. Maloney, N. Mendelsohn (ed.) ``XML Schema Part 1: Structures'', W3C, April 2000. (http://www.w3.org/TR/xmlschema-1)

Features DTD 1.0 XML Schema 1.0 XDR 1.0 SOX 2.0 Schematron 1.4 DSD 1.0
Schema
syntax in XML No Yes Yes Yes Yes Yes
namespace No Yes Yes Yes Yes No
include No Yes No Yes No Yes
import No Yes No Yes No No
Datatype
built-in type 10 37 33 17 0 0
user-defined type No Yes No Yes No Yes
domain constraint No Yes No Partial Yes Yes
explicit null No Yes No No No No
Attribute
default value Yes Yes Yes Yes No Yes
choice No No No No Yes Yes
optional vs. required Yes Yes Yes Yes Yes Yes
domain constraint Partial Yes Partial Partial Yes Yes
conditional definition No No No No Yes Yes
Element
default value No Partial No No No Yes
content model Yes Yes Yes Partial Yes Yes
ordered sequence Yes Yes Yes Yes Yes Yes
unordered sequence No Yes Yes No Yes Yes
choice Yes Yes Yes Yes Yes Yes
mim & max occurrence Partial Yes Yes Yes Yes Partial
open model No No Yes No Yes No
conditional definition No No No No Yes Yes
Inheritance
simple type by extension No No No No No No
simple type by restriction No Yes No Yes No No
complex type by extension No Yes No Yes No No
complex type by restriction No Yes No No No No
Being unique or key
uniqueness for attribute Yes Yes Yes Yes Yes Yes
uniqueness for non-attribute No Yes Partial No Yes No
key for attribute No Yes No No Yes No
key for non-attribute No Yes No No Yes No
foreign key for attribute Partial Yes Partial Partial Yes Yes
foreign key for non-attribute No Yes No No No Yes
Miscellaneous
dynamic constraint No No No No Yes No
version No No No No No Yes
documentation No Yes No Yes Yes Yes
embedded HTML No Yes No Yes Partial Yes
self-describability No Partial No No Partial Yes

Table 1: Summary of the feature comparisons.



1
The title and structure of this paper imitate those of [1] in the hope of being a sequel.
2
These languages are still evolving at the time of writing.

This document was translated from LATEX by HEVEA.