A Web Data Model Unifying XML and RDF

Harold Boley

Draft, 2001-09-18




This is a proposal for a unified Web data model that generalizes the data models of both XML and RDF. Such a unification can benefit applications requiring aspects of both data models, as developed for the Semantic Web. It is based on role-prefixed children in XML elements, also accommodating RDF properties. Role-prefixed children are combined with a sequence of ordinary positional children, leading to "Order-Labeled (OrdLab) Trees". RDF's URI attributes about and resource are transferred to XML to, respectively, describe and reference URI names, arriving at the most general OrdLab Graphs. Relative URIs are taken over from XML and used for XML's id; XML's idref is transcribed into an OrdLab role. The structure of OrdLab Trees is described by extended DTDs and XML Schemas. Reductions of the unified data model to XML (1.0) are discussed. The first use of the OrdLab data model, in an XML-reduced form, was for the transition from RuleML Version 0.7 to 0.8, now addressing the modeling needs of both the XML and RDF communities.

Introduction

The Semantic Web can be viewed as an attempt to bring more semantic markup into the original WWW, dominated by syntactic markup. Contrary to some popular opinions, XML without additional conventions is not (yet) a semantic markup language [Cov98]. However, XML tags can be used to specify the types of elements. They can also be used to specify the roles of subelements within an element. While such 'type tags' and 'role tags' are rarely distinguished, their orthogonal use in marking up an element should clarify its meaning. This will be supported in this article by developing 'role tags' into 'role prefixes', i.e. attribute-like prefixes for subelements similar to RDF 'properties' (the current RDF version is [LS99]; for a logic-oriented introduction to RDF see [Bol00]). The 'type tags' and 'role prefixes' of the ensuing XML-RDF unification can then be seen as a groundwork for more semantic markup.

The XML and RDF communities have developed W3C recommendations with different data models. XML is based on, possibly attributed, left-to-right ordered, node-labeled trees, reminiscent of parse (syntax!) trees, except that their hierarchical structure permits an overlay of 'horizontal' id/idref links. RDF is based on directed arc-labeled (unordered) graphs with two kinds of nodes, resources and literals, where the latter do not allow outgoing arcs. While originally the intended uses of XML and RDF seemed to be sufficiently distinct to justify two different data models, it later turned out, e.g. with the advent of the Semantic Web, that a unified data model would be advantageous.

The non-uniformity of the XML and RDF data models is reflected by the current XML serialization of RDF. This RDF/XML serialization, unlike the RDF data model, has been criticized as being hard to write and read. Previous attempts at an improved XML-RDF interplay have focussed on the development of a better XML serialization of RDF using XML elements as RDF arcs and implicit RDF nodes [BL99], and on the RDF annotation of existing XML documents via internal DTDs [Mel99].

The current proposal focusses on the XML and RDF data models, developing their unification in an algebraic/graph-theoretical fashion. Based on this, it also develops a new XML serialization of RDF and puts forward an extended XML that would directly support the unified XML and RDF data model. We develop our model in two halves, extending XML by 1. RDF-like properties ('roles') and 2. URI descriptions.

The first half can be viewed as an extension of XML by attribute-like child prefixes, called 'roles', whose values are normally tagged XML subelements. It has often been observed that the decision of whether to use children or attributes is not easy in XML 1.0; one criterion for using children is the need for tagged markup (rather than string-like CDATA), which is forbidden in attribute values. Since a direct generalization of XML attribute values to tagged markup would lead to "tag nestings within (start) tags", this does not appear to be the desired bridge between children and attributes. A better generalization, which we will also motivate from the RDF perspective, is to keep "tag nestings between (start and end) tags" but optionally allow embedded children to be prefixed by attribute-like "roles".

A Problem with XML

The XML and RDF data models have complementary strengths and weaknesses. The strengths can be combined, and the weaknesses avoided, via an initial unification encompassing both of these models.

The XML data model is strong in capturing positional collections of data since the children of an XML element are textually ordered. It is weak in capturing non-positional collections since these would suggest arc labels, absent in XML, which indicate the role each of the unordered components is playing in the collection.

For example, suppose we want to mark up power formulas as integer bases raised to integer exponents. In XML we can use a positional representation, where a powform consists of a base followed by an exponent; the base and exponent can each be marked up as an integer. Thus, the power formula 3^2 would be marked up as follows:

<powform>
  <integer>3</integer>
  <integer>2</integer>
</powform>
Graph-theoretically, we obtain a tree with left-to-right-ordered arcs and two kinds of nodes labeled by their types, here powform and integer, where oval nodes are RDF-like URI resources, here empty (since anonymous), and rectangular nodes are RDF-like literals or XML-like PCDATA, here 3 and 2 (we will arrange for typed literals in RDF):

The left-to-right ordering of arcs could also be made explicit by giving them labels 1, 2, ..., where the arc labels permit arbitrary (topology-preserving) permutations of those arcs, even in three dimensions (emphasized via the 3D arrow styles):

While powform's binary-positional convention "First child is base, second child is exponent" seems natural and easy to memorize, some doubt may remain about whether this markup could instead mean 2^3 according to a "First child is exponent, second child is base" convention. Analogous conventions for N-ary operators (N>2) need to disambiguate a combinatorially exploding number of possible interpretations. For instance, a 3-ary relation can be obtained from the power formula, 3^2, by extending it by the resulting rational value, 9. (Or, from power formula 3^-2, rational value 1/9.) Now, even if we fix a "First base, then exponent" convention, its ternary-positional powequ markup could not only be done following the often-used "Last child is value" convention corresponding to the equation 3^2 = 9:

<powequ>
  <integer>3</integer>
  <integer>2</integer>
  <rational>9</rational>
</powequ>
Instead, this could also be done following the sometimes preferable "First child is value" convention corresponding to the equivalent equation 9 = 3^2:
<powequ>
  <rational>9</rational>
  <integer>3</integer>
  <integer>2</integer>
</powequ>
Without extra information the positions of the 'roles' (base, exponent, value) of the three powequ children cannot in general be uniquely determined from some such markup. Here, since both the base and the exponent are integers, the child 'types', integer vs. rational, do not discriminate between their roles; these types should neither be relied on for discriminating between the base or exponent role and the value role (while heuristic 'role inference' could use knowledge about powers such as "Negative integer exponent leads to reciprocal rational value", it would rely on the dubious assumption that the most specific types were used in all roles). Moreover, if there was a more specific uniform type natural in all three positions, type-discrimination capability between any of these roles would be lost entirely.
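
The ambiguity can be made concrete with a small Python sketch (not part of the original proposal). The same positional child list is read under the two conventions discussed above and yields different role assignments. The function names and the pair encoding of children are illustrative assumptions.

```python
# Illustrative sketch: children are (type, content) pairs in document order.
# The reader functions are hypothetical; neither convention is part of XML.

def read_value_last(children):
    """'First base, then exponent, last child is value' convention."""
    base, exponent, value = children
    return {"base": base, "exponent": exponent, "value": value}

def read_value_first(children):
    """'First child is value, then base, then exponent' convention."""
    value, base, exponent = children
    return {"base": base, "exponent": exponent, "value": value}

# Children of the first powequ markup above.
children = [("integer", "3"), ("integer", "2"), ("rational", "9")]

as_value_last = read_value_last(children)
as_value_first = read_value_first(children)

# The two readings assign a different child to every role.
print(as_value_last)
print(as_value_first)
```

Nothing in the markup itself tells a recipient which of the two readers to apply.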

Of course, XML users can express nestings of several elements with positional children. For example, the equation (6/2)^((1 * 2 * 4)^(1/3)) = 9 could be expressed like this:

<powequ>
  <divform>
    <integer>6</integer>
    <integer>2</integer>
  </divform>
  <rootform>
    <integer>3</integer>
    <prodform>
      <integer>1</integer>
      <integer>2</integer>
      <integer>4</integer>
    </prodform>
  </rootform>
  <rational>9</rational>
</powequ>
Graph-theoretically, this is again a tree with left-to-right-ordered arcs and type-labeled nodes (the two integer-labeled, 2-marked PCDATA/literal nodes cannot be identified to a single node in XML or RDF, since node identity is considered independent of marks or labels):

The left-to-right ordering of arcs could again be made explicit by giving them labels 1, 2, ...:

Even more obvious than previously, the 'roles' of the three powequ children (base, exponent, value), the four grandchildren (nominator, denominator as well as degree, radicand), etc. cannot be uniquely determined from the above markup or trees.

In summary, role-as-type discrimination cannot be relied on: For recipients to correctly interpret these kinds of XML markup, there is a need for an explicit convention that specifies the role of each child position, which becomes less obvious with an increasing number and generation of children.

Role-Prefixed Children

In 'object-centered' modeling the way out is representing powers and other operators in a non-positional manner, making them 'objects' with explicitly indicated roles for their arguments. In XML there are many ways to emulate this, perhaps the most often used being an extra level of markup that uniquely distinguishes the argument roles. In our example we could use powform with base and exponent children as follows:

<powform>
  <base><integer>3</integer></base>
  <exponent><integer>2</integer></exponent>
</powform>
This should be considered equivalent to the following markup using the other possible powform-child permutation, but (while XSLT, XQuery, etc. can use them equivalently) this equivalence is not part of XML itself:
<powform>
  <exponent><integer>2</integer></exponent>
  <base><integer>3</integer></base>
</powform>
These markups perform the desired base/exponent distinction. They can also be naturally extended by the resulting value, as exemplified by one of six equivalent permutations:
<powequ>
  <base><integer>3</integer></base>
  <exponent><integer>2</integer></exponent>
  <value><rational>9</rational></value>
</powequ>
However, these markups mix what we regard as two kinds of tags, as often seen in XML 1.0 practice, namely the type-like (single- or multiple-child) tags powform, powequ, integer, and rational with the role-like (single-child) tags base, exponent, and value.
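
As a minimal sketch of how an application might treat the two role-tag permutations as equivalent even though XML itself does not, the following Python fragment reads each markup into a role-indexed dictionary. The helper role_view is hypothetical, and it assumes that every role tag occurs at most once and wraps exactly one typed child.

```python
import xml.etree.ElementTree as ET

def role_view(xml_text):
    """Return (type tag, {role tag: (child type, child text)}) for one element.

    Assumes each role-like tag occurs at most once and wraps a single child.
    """
    element = ET.fromstring(xml_text)
    roles = {}
    for role in element:
        child = role[0]  # the single typed child inside the role tag
        roles[role.tag] = (child.tag, child.text)
    return element.tag, roles

first = ("<powform><base><integer>3</integer></base>"
         "<exponent><integer>2</integer></exponent></powform>")
second = ("<powform><exponent><integer>2</integer></exponent>"
          "<base><integer>3</integer></base></powform>")

# Dictionary comparison ignores the textual order of the role tags.
same = role_view(first) == role_view(second)
print(same)
```

The equivalence lives entirely in the application code here, which is the situation the text describes for XSLT or XQuery.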

In RDF, the base/exponent distinction is done by corresponding properties base and exponent; these act as roles which are clearly separated from the types powform and integer:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:r="http://www.num.org/roles"
         xmlns:t="http://www.num.org/types">
  <t:powform about="http://www.num.org/comps/powform3sup2">
    <r:base><t:integer about="http://www.num.org/insts/3"/></r:base>
    <r:exponent><t:integer about="http://www.num.org/insts/2"/></r:exponent>
  </t:powform>
</rdf:RDF> 
This is equivalent to the following serialization using the other possible powform-property permutation:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:r="http://www.num.org/roles"
         xmlns:t="http://www.num.org/types">
  <t:powform about="http://www.num.org/comps/powform3sup2">
    <r:exponent><t:integer about="http://www.num.org/insts/2"/></r:exponent>
    <r:base><t:integer about="http://www.num.org/insts/3"/></r:base>
  </t:powform>
</rdf:RDF>
As triple sets and graphs, both serializations are actually indistinguishable.
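
The following Python sketch illustrates this point with hand-written, simplified triples; it is not the output of an RDF parser, and the prefixed names are kept as plain strings for readability.

```python
# Hand-written, simplified triples for the two serializations above.
# Assumption: typed nodes contribute one rdf:type triple each.
POWFORM = "http://www.num.org/comps/powform3sup2"
THREE = "http://www.num.org/insts/3"
TWO = "http://www.num.org/insts/2"

type_statement = [(POWFORM, "rdf:type", "t:powform")]
base_statements = [(POWFORM, "r:base", THREE), (THREE, "rdf:type", "t:integer")]
exponent_statements = [(POWFORM, "r:exponent", TWO), (TWO, "rdf:type", "t:integer")]

# Statement order follows the property order of each serialization.
first_serialization = type_statement + base_statements + exponent_statements
second_serialization = type_statement + exponent_statements + base_statements

same_list = first_serialization == second_serialization             # order-sensitive
same_graph = set(first_serialization) == set(second_serialization)  # order-free
print(same_list, same_graph)
```

Only the order-sensitive list comparison distinguishes the two serializations; as sets of triples they coincide.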

These XML and RDF considerations lead to a unified model that can be regarded as an RDF-extended XML. In it, besides normal children, elements can have role-prefixed children of the form role==child, where role is quoted CDATA, like in XML's attribute values, and child is any well-formed markup. In our example of the power formula 3^2 we can then use powform with base- and exponent-prefixed children as follows:

<powform>
  "base"==<integer>3</integer>
  "exponent"==<integer>2</integer>
</powform>
Graph-theoretically, we obtain a tree with unordered arcs labeled by their roles, here base and exponent, and nodes as introduced earlier:

This is again equivalent to the following commutative markup using the other correctly prefixed powform-child permutation.

<powform>
  "exponent"==<integer>2</integer>
  "base"==<integer>3</integer>
</powform>
Graph-theoretically, nothing changes when arcs are permuted with their labels and target nodes:

This equivalence is again considered part of the extended XML much like the order of "="-attributes in start tags is already considered irrelevant in the non-extended XML.

More precisely, the following equation expresses this algebraic law of commutativity for our extended XML (the ellipses, ". . .", stand for arbitrary child contexts, which could be made explicit as in constructor algebras):

<element>. . .role1==child1. . .role2==child2. . .</element>
=
<element>. . .role2==child2. . .role1==child1. . .</element>

Note that such an abstraction is also implicit in RDF graphs and serializations, since both the triples within RDF models and the pairs within rdf:Description can be permuted without information loss.

These equivalent markups directly perform the desired base/exponent distinction, without an extra level of tag nesting. They can also be naturally extended by the resulting value, as exemplified by one of six algebraically equivalent permutations corresponding to the equation 3^2 = 9:

<powequ>
  "base"==<integer>3</integer>
  "exponent"==<integer>2</integer>
  "value"==<rational>9</rational>
</powequ>

Of course, extended XML users can also express nestings of several complete elements with role-prefixed children. For example, the equation (6/2)^((1 * 2 * 4)^(1/3)) = 9 can now be expressed more clearly like this (as for RDF properties, it is allowed to have multiple occurrences of the same role within an element, here factor within prodform):

<powequ>
  "base"==<divform>
            "nominator"==<integer>6</integer>
            "denominator"==<integer>2</integer>
          </divform>
  "exponent"==<rootform>
                "degree"==<integer>3</integer>
                "radicand"==<prodform>
                              "factor"==<integer>1</integer>
                              "factor"==<integer>2</integer>
                              "factor"==<integer>4</integer>
                            </prodform>
              </rootform>
  "value"==<rational>9</rational>
</powequ>
Graph-theoretically, this is again a tree with unordered arcs labeled by their roles, here base (branching further into nominator and denominator), exponent (branching further into degree and radicand, etc.), and value:

In addition, we permit multiple roles to share a single child markup. For example, for expressing 3^3 = 27, we can shorten

<powequ>
  "base"==<integer>3</integer>
  "exponent"==<integer>3</integer>
  "value"==<rational>27</rational>
</powequ>
to
<powequ>
  "base"=="exponent"==<integer>3</integer>
  "value"==<rational>27</rational>
</powequ>
with the integer 3 being shared by the base and the exponent. This is particularly important for avoiding two or more copies of one large, nested child markup; to give a still small nested example, the base = exponent could itself be a powform element (which here could again be shortened via sharing):
<powequ>
  "base"=="exponent"==<powform>
                        "base"==<integer>2</integer>
                        "exponent"==<integer>2</integer>
                      </powform>
  "value"==<rational>256</rational>
</powequ>
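
One way to read the shared form is as an abbreviation that expands into one role-child pair per role, with all pairs pointing at a single child rather than at copies. The Python sketch below shows this reading; the function expand_shared and the tuple encoding are assumptions made for illustration.

```python
def expand_shared(entries):
    """Expand (roles, child) entries into one (role, child) pair per role.

    The child object is reused, not copied, for every role that shares it.
    """
    return [(role, child) for roles, child in entries for role in roles]

# "base"=="exponent"==<integer>3</integer> and "value"==<rational>27</rational>
shared = [
    (("base", "exponent"), ("integer", "3")),
    (("value",), ("rational", "27")),
]

expanded = expand_shared(shared)
print(expanded)

# Both roles refer to the very same child object.
print(expanded[0][1] is expanded[1][1])
```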

Positional Children

Now, suppose we want to mark up power sequences as rational numbers, rational pairs, triples, etc. In XML we can use a positional representation, where a powseq consists of one or more elements, each marked up as a rational. Thus, the binary power sequence 3, 9 would be marked up as follows:

<powseq>
  <rational>3</rational>
  <rational>9</rational>
</powseq>
Clearly, powseq's binary-positional ordering directly reflects the ordering of the original power sequence. For the extended ternary power sequence 3, 9, 27 the markup is extended as follows:
<powseq>
  <rational>3</rational>
  <rational>9</rational>
  <rational>27</rational>
</powseq>
The positional XML representation is exactly what is needed for power and other sequences; a non-positional representation would give us no added value for sequences.

Actually, in RDF, the positional representation must be simulated non-positionally by regarding the indexes 1, 2, ... of a mapping like 1 -> 3^1, 2 -> 3^2, ... as properties. In RDF these properties are written rdf:_1, rdf:_2, ... (usually generated, HTML-like, from rdf:li, rdf:li, ...). Within RDF's Seq(uence) container they act as superimposed position indicators which are substitutes for XML's built-in ordering:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:r="http://www.num.org/roles"
         xmlns:t="http://www.num.org/types">
  <t:powseq about="http://www.num.org/comps/powseq3supN">
    <r:numcont>
      <rdf:Seq>
        <rdf:_1><t:rational about="http://www.num.org/insts/3"/></rdf:_1>
        <rdf:_2><t:rational about="http://www.num.org/insts/9"/></rdf:_2>
      </rdf:Seq>
    </r:numcont>
  </t:powseq>
</rdf:RDF> 
For the extended ternary power sequence 3, 9, 27 the markup can then be extended as follows:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:r="http://www.num.org/roles"
         xmlns:t="http://www.num.org/types">
  <t:powseq about="http://www.num.org/comps/powseq3supN">
    <r:numcont>
      <rdf:Seq>
        <rdf:_1><t:rational about="http://www.num.org/insts/3"/></rdf:_1>
        <rdf:_2><t:rational about="http://www.num.org/insts/9"/></rdf:_2>
        <rdf:_3><t:rational about="http://www.num.org/insts/27"/></rdf:_3>
      </rdf:Seq>
    </r:numcont>
  </t:powseq>
</rdf:RDF> 

In our unified model normal, unprefixed children are used for such sequences, thus leaving the above examples of the non-extended XML unchanged. But these can also be viewed RDF-like as index-prefixed children as illustrated for the extended sequence:

<powseq>
  "kid1"==<rational>3</rational>
  "kid2"==<rational>9</rational>
  "kid3"==<rational>27</rational>
</powseq>
Here, the built-in roles kid1, kid2, ... (in 1-to-1 correspondence to our earlier labels 1, 2, ...) implement sequential child order, thus abstracting away from the textual order of children.
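
The correspondence between positional children and kidI-prefixed children can be sketched as a pair of conversions. The Python below is an illustration under the assumption that kid roles are numbered from 1 without gaps; the function names are hypothetical.

```python
def to_kid_roles(children):
    """Label positional children with the built-in roles kid1, kid2, ..."""
    return {f"kid{i}": child for i, child in enumerate(children, start=1)}

def from_kid_roles(roles):
    """Recover the child sequence from kidI roles, whatever their textual order."""
    kids = {int(name[3:]): child
            for name, child in roles.items()
            if name.startswith("kid") and name[3:].isdigit()}
    return [kids[index] for index in sorted(kids)]

sequence = [("rational", "3"), ("rational", "9"), ("rational", "27")]
labeled = to_kid_roles(sequence)

# A textually permuted version of the same kid-labeled children.
permuted = {"kid3": labeled["kid3"], "kid1": labeled["kid1"], "kid2": labeled["kid2"]}

print(from_kid_roles(permuted) == sequence)
```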

Order-Labeled Trees

RDF-like role-prefixed and XML's positional children can be easily combined, obtaining our basic RDF-XML integration of "Order-Labeled (OrdLab) Trees".

For example, suppose we want to combine the markup of a power sequence with the marked up formula denoting its highest power. For the original sequence this gives 3, 9 = 3^2, and we obtain the following markup:

<powseqform>
  <rational>3</rational>
  <rational>9</rational>
  "base"==<integer>3</integer>
  "exponent"==<integer>2</integer>
</powseqform>
Graph-theoretically, positional children become left-to-right-ordered arcs, here targeting the rationals 3 and 9, while role children become labels on unordered arcs, here targeting the integers 3 and 2:

For the extended sequence this gives 3, 9, 27 = 3^3, and we obtain the following markup:

<powseqform>
  <rational>3</rational>
  <rational>9</rational>
  <rational>27</rational>
  "base"==<integer>3</integer>
  "exponent"==<integer>3</integer>
</powseqform>
Again, the children prefixed by the roles base and exponent could be interchanged or, since they are equal here, shared without changing the meaning of powseqform. In addition, they can also be moved to precede the run of rational children, thus:
<powseqform>
  "base"==<integer>3</integer>
  "exponent"==<integer>3</integer>
  <rational>3</rational>
  <rational>9</rational>
  <rational>27</rational>
</powseqform>
However, to simplify interpretation, the run of positional children such as the three rationals must not be interspersed by role-prefixed children such as the base- and exponent-prefixed integers.
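
The non-interspersing rule can be checked mechanically. The following Python sketch assumes a child list in textual order in which each entry is a (role, child) pair and role is None for positional children; this encoding and the function name are illustrative choices, not part of the proposal.

```python
def positional_run_is_contiguous(children):
    """True if all positional children (role None) form one uninterrupted run."""
    flags = [role is None for role, _ in children]
    if True not in flags:
        return True  # no positional children at all
    first = flags.index(True)
    last = len(flags) - 1 - flags[::-1].index(True)
    return all(flags[first:last + 1])

roles_last = [(None, "3"), (None, "9"), (None, "27"), ("base", "3"), ("exponent", "3")]
roles_first = [("base", "3"), ("exponent", "3"), (None, "3"), (None, "9"), (None, "27")]
interspersed = [(None, "3"), ("base", "3"), (None, "9"), ("exponent", "3"), (None, "27")]

print(positional_run_is_contiguous(roles_last))
print(positional_run_is_contiguous(roles_first))
print(positional_run_is_contiguous(interspersed))
```

The first two lists correspond to the two permitted markups above; the third is the kind of interspersed arrangement the rule excludes.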

Thus, the above equivalences lead to indistinguishable graphs because the geometry of unordered labeled arcs is graph-theoretically irrelevant; the above non-interspersing principle leads to at most one tree-like left-to-right-ordered fan-out per node.

In general, our XML-RDF-unified data model consists of "Order-Labeled (OrdLab) Trees": The fan-out of each oval node is a sequence of 0 or more left-to-right-ordered arcs and a set of 0 or more labeled arcs. If there are no labeled arcs, the model corresponds to XML trees. If there are no ordered arcs, the model corresponds to RDF trees. Rectangular nodes, as leaves, have no fan-out. Our serialization syntax consists of markup linearizing these OrdLab Trees as shown via the examples. Non-empty oval nodes (RDF URI attributes) and non-tree RDF graphs will be introduced later.
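
A minimal in-memory sketch of such a tree, assuming a Python representation that is not part of the proposal: each node keeps its positional children in a list and its role-prefixed children as pairs that are compared as a multiset, so equality respects order for the former and ignores it for the latter. The class name OrdLab and its fields are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional

@dataclass(eq=False)
class OrdLab:
    tag: str                                     # type label of the node
    ordered: list = field(default_factory=list)  # positional children (sequence)
    labeled: list = field(default_factory=list)  # (role, child) pairs (multiset)
    text: Optional[str] = None                   # PCDATA/literal content of a leaf

    def key(self):
        """Hashable canonical form: ordered part as tuple, labeled part as multiset."""
        ordered = tuple(child.key() for child in self.ordered)
        labeled = Counter((role, child.key()) for role, child in self.labeled)
        return (self.tag, self.text, ordered, frozenset(labeled.items()))

    def __eq__(self, other):
        return isinstance(other, OrdLab) and self.key() == other.key()

def leaf(tag, text):
    return OrdLab(tag, text=text)

rationals = [leaf("rational", "3"), leaf("rational", "9"), leaf("rational", "27")]
a = OrdLab("powseqform", ordered=rationals,
           labeled=[("base", leaf("integer", "3")), ("exponent", leaf("integer", "3"))])
b = OrdLab("powseqform", ordered=rationals,
           labeled=[("exponent", leaf("integer", "3")), ("base", leaf("integer", "3"))])
c = OrdLab("powseqform", ordered=rationals[::-1], labeled=a.labeled)

print(a == b)  # role-prefixed children permuted
print(a == c)  # positional children permuted
```

A node with an empty labeled list behaves like an XML element, and one with an empty ordered list like a node of an RDF tree.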

In the RDF-like view with index-prefixed children the 3, 9, 27 = 3^3 markup looks thus:

<powseqform>
  "kid1"==<rational>3</rational>
  "kid2"==<rational>9</rational>
  "kid3"==<rational>27</rational>
  "base"==<integer>3</integer>
  "exponent"==<integer>3</integer>
</powseqform>
The interchanged markup looks as follows:
<powseqform>
  "base"==<integer>3</integer>
  "exponent"==<integer>3</integer>
  "kid1"==<rational>3</rational>
  "kid2"==<rational>9</rational>
  "kid3"==<rational>27</rational>
</powseqform>
Graph-theoretically, kidI-labeled arcs lead to homogeneous directed labeled graphs, permitting arbitrary arc permutations, to which we can reduce our combined data model. However, besides schema-definable user labels such as base and exponent (cf. the RDF model and RDF Schema), this would also require predefined built-in labels kidI, constituting "structural links" as already criticized in [Woo75]. We thus prefer the combined data model without kidI links.

To sum up, what we have achieved thus far is a "least general generalization" encompassing the XML and RDF data models. However, we have mostly ignored URIs and related aspects, which will be discussed next.

URIs and Physical Embedding

The second half of our model can be viewed as an extension of XML by URI descriptions. For describing and referring to metadata, RDF uses the URI attributes about and resource, respectively; we will transfer these to XML. In RDF Descriptions the built-in attribute about names the resource (URI) for which metadata are being specified. This resource can later be referred to with the built-in attribute resource. In our XML-RDF integration we adapt both of these attributes. Also, RDF's type attribute becomes the tag pair containing the about attribute, much like in the third basic RDF abbreviation (cf. [LS99]). Should the type attribute be missing in RDF, we use the generic tag pair <any>...</any> in XML. If there is more than one type in RDF, we use a new tag pair <intertag> ...</intertag> for their intersection in XML.

Besides the node that is being described, such 'type tags' are used for the targets of all of its outgoing arrows, even when they are RDF-like literals (XML-like PCDATA). Thus, as a 'named' version of our earlier 'anonymous' XML-like markups, we can transfer an earlier RDF description of the resource http://www.num.org/comps/powform3sup2, with a described-node type powform and two outgoing-arrow node types integer going to 3 and 2, i.e.

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:r="http://www.num.org/roles"
         xmlns:t="http://www.num.org/types">
  <t:powform about="http://www.num.org/comps/powform3sup2">
    <r:base><t:integer about="http://www.num.org/insts/3"/></r:base>
    <r:exponent><t:integer about="http://www.num.org/insts/2"/></r:exponent>
  </t:powform>
</rdf:RDF> 
into our extended XML as follows (without namespaces, here and below, as in the 'anonymous' versions):
<powform about="http://www.num.org/comps/powform3sup2">
  "base"==<integer>3</integer>
  "exponent"==<integer>2</integer>
</powform>
Graph-theoretically, we obtain a tree with URIs as oval nodes, here http://www.num.org/comps/powform3sup2, and all the rest as introduced earlier, i.e. about URIs are (names of) source nodes, while the containing tags become type labels on these nodes:

Similarly, another earlier markup can be extended into a powseqform3sup2-describing version, transferring the RDF description

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:r="http://www.num.org/roles"
         xmlns:t="http://www.num.org/types">
  <t:powseqform about="http://www.num.org/comps/powseqform3sup2">
    <r:numcont>
      <rdf:Seq>
        <rdf:_1><t:rational about="http://www.num.org/insts/3"/></rdf:_1>
        <rdf:_2><t:rational about="http://www.num.org/insts/9"/></rdf:_2>
      </rdf:Seq>
    </r:numcont>
    <r:base><t:integer about="http://www.num.org/insts/3"/></r:base>
    <r:exponent><t:integer about="http://www.num.org/insts/2"/></r:exponent>
  </t:powseqform>
</rdf:RDF> 
into our extended XML as follows:
<powseqform about="http://www.num.org/comps/powseqform3sup2">
  <rational>3</rational>
  <rational>9</rational>
  "base"==<integer>3</integer>
  "exponent"==<integer>2</integer>
</powseqform>
Graph-theoretically, this is a tree describing the URI http://www.num.org/comps/powseqform3sup2:

This powseqform3sup2 description can now be partitioned by having it embed the above powform3sup2-describing resource. For this a new role, formula, is introduced:

<powseqform about="http://www.num.org/comps/powseqform3sup2">
  <rational>3</rational>
  <rational>9</rational>
  "formula"==<powform about="http://www.num.org/comps/powform3sup2">
               "base"==<integer>3</integer>
               "exponent"==<integer>2</integer>
             </powform>
</powseqform>

Graph-theoretically, this is a tree with a formula-labeled arc leading from the URI node http://www.num.org/comps/powseqform3sup2 to the embedded powform subtree:

The powform URI node http://www.num.org/comps/powform3sup2 is the one defined earlier. Since node identity is considered to be established by identical URIs, the above diagrams could thus be 'glued' together via the node http://www.num.org/comps/powform3sup2:

Because of the 'physical' embedding, we could also omit the URI attribute about http://www.num.org/comps/powform3sup2 here (making this node anonymous), if there was no reference to this URI:

<powseqform about="http://www.num.org/comps/powseqform3sup2">
  <rational>3</rational>
  <rational>9</rational>
  "formula"==<powform>
               "base"==<integer>3</integer>
               "exponent"==<integer>2</integer>
             </powform>
</powseqform>
The anonymous subtree would not be accessible from the outside:

URIs and Symbolic Linking

The powseqform3sup2 description can be further modularized by having it refer to the above powform3sup2-describing resource. For this the RDF description

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:r="http://www.num.org/roles"
         xmlns:t="http://www.num.org/types">
  <t:powseqform about="http://www.num.org/comps/powseqform3sup2">
    <r:numcont>
      <rdf:Seq>
        <rdf:_1><t:rational about="http://www.num.org/insts/3"/></rdf:_1>
        <rdf:_2><t:rational about="http://www.num.org/insts/9"/></rdf:_2>
      </rdf:Seq>
    </r:numcont>
    <r:formula><t:powform resource="http://www.num.org/comps/powform3sup2"/></r:formula>
  </t:powseqform>
</rdf:RDF> 
is transferred into our extended XML, where "formula" now prefixes an empty-element type tag for powform that contains a resource attribute linking to powform3sup2:
<powseqform about="http://www.num.org/comps/powseqform3sup2">
  <rational>3</rational>
  <rational>9</rational>
  "formula"==<powform resource="http://www.num.org/comps/powform3sup2"/>
</powseqform>

Graph-theoretically, this is a tree with a formula-labeled arc leading from the URI node http://www.num.org/comps/powseqform3sup2 to the URI node http://www.num.org/comps/powform3sup2, where an arc's bullet shaft indicates that the target node is reached via 'symbolic' (resource URI) linking, not by physical embedding:

The powform URI node http://www.num.org/comps/powform3sup2 is again the one defined earlier. While the RDF and extended XML descriptions represent such nodes utilizing separate copies, since they act at the same time as (resource) targets and (about) sources, graph-theoretically they are identical nodes: The above diagrams could thus again be 'glued' together via the node http://www.num.org/comps/powform3sup2. Because of the symbolic linking, we could, of course, not omit the URI attribute about http://www.num.org/comps/powform3sup2 here.

Similarly, in our extended XML, as in RDF, the node http://www.num.org/comps/powform3sup2 could have an arc pointing back to the powseqform node http://www.num.org/comps/powseqform3sup2. Two or more extended XML descriptions can thus together represent a directed (labeled) graph, an OrdLab Graph, not just a tree. In order to simplify such sets of RDF-like descriptions for several nodes, we allow a document to be a forest of XML-like elements, not just one tree.
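
The 'gluing' of several descriptions into one graph can be sketched as follows in Python. Each description is reduced to its type, its about URI, and its outgoing resource links; this dictionary encoding is an assumption, and the back-pointing role name "sequence" is invented for the example since the text does not name it.

```python
SEQ = "http://www.num.org/comps/powseqform3sup2"
POW = "http://www.num.org/comps/powform3sup2"

# A document as a forest of descriptions (positional children omitted).
forest = [
    {"type": "powseqform", "about": SEQ, "links": [("formula", POW)]},
    # "sequence" is a hypothetical role name for the arc pointing back.
    {"type": "powform", "about": POW, "links": [("sequence", SEQ)]},
]

def glue(descriptions):
    """Identify nodes by about URI and resolve resource links against them."""
    nodes = {d["about"]: d["type"] for d in descriptions}
    arcs = [(d["about"], role, target)
            for d in descriptions for role, target in d["links"]]
    dangling = [arc for arc in arcs if arc[2] not in nodes]
    return nodes, arcs, dangling

nodes, arcs, dangling = glue(forest)

# Following the two arcs from SEQ leads back to SEQ: a graph, not a tree.
successors = {source: target for source, _, target in arcs}
is_cyclic = successors[successors[SEQ]] == SEQ
print(is_cyclic, dangling)
```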

Finally, XML's left-to-right ordered trees can be directly combined with RDF's URIs. Suppose we have described powform3sup1 besides the previous powform3sup2:

<powform about="http://www.num.org/comps/powform3sup1">
  "base"==<integer>3</integer>
  "exponent"==<integer>1</integer>
</powform>
Now, we can define power sequences all (here, two) of whose elements are formulas 3^1, 3^2, ...:
<powallform about="http://www.num.org/comps/powallform3sup2">
  <powform resource="http://www.num.org/comps/powform3sup1"/>
  <powform resource="http://www.num.org/comps/powform3sup2"/>
</powallform>

Graph-theoretically, this is a left-to-right ordered tree with an unlabeled arc leading to the URI node http://www.num.org/comps/powform3sup1 to the left of an unlabeled arc leading to the URI node http://www.num.org/comps/powform3sup2, where an ordered arc's bullet shaft again indicates that the target node is reached via symbolic (resource URI) linking:

Altogether, we permit all combinations in an orthogonal system of arrows that are, in one dimension, unlabeled, left-to-right ordered (ordinary) vs. labeled, unordered (3D-style) and, in the other dimension, physically embedding (without bullet shaft) vs. symbolically linking (with bullet shaft). As in XML, if two physically embedded nodes or subtrees are equal, they are separate copies, where one can be edited without changing the other. As in RDF, if two symbolically linked nodes or subtrees are equal, they are identical (URI) objects, which can be graphically 'glued' together, so that editing one correspondingly changes the other. By the nature of physical embedding vs. symbolic linking, each node can have at most one incoming arrow without a bullet shaft vs. arbitrarily many incoming arrows with a bullet shaft.
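
The difference in editing behavior can be mimicked in Python, under the assumption that physically embedded children are modeled as independent copies and symbolically linked children as URI strings resolved through a shared registry. The URI "#three" and the dictionary layout are invented for this sketch.

```python
import copy

three = {"type": "integer", "text": "3"}

# Physical embedding: equal children are separate copies.
embedded = {"base": copy.deepcopy(three), "exponent": copy.deepcopy(three)}
embedded["base"]["text"] = "4"
embedded_exponent_after_edit = embedded["exponent"]["text"]

# Symbolic linking: equal children are one object, reached via its URI.
registry = {"#three": {"type": "integer", "text": "3"}}  # hypothetical URI
linked = {"base": "#three", "exponent": "#three"}
registry[linked["base"]]["text"] = "4"
linked_exponent_after_edit = registry[linked["exponent"]]["text"]

print(embedded_exponent_after_edit)  # unaffected by the edit of the other copy
print(linked_exponent_after_edit)    # changed together with the shared node
```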

Relative URIs and id/idref

Besides the absolute URIs previously taken over from RDF, we now also take over relative URIs from XML. Relative URIs starting with a "#" refer to the current document. We allow the attributes about and resource to, respectively, describe and reference relative URIs as well. For this we require that the relative-URI value of an about attribute must be unique within the current document, just as is the case for an id value in XML. (Elements with an about="#LocalName" attribute contain 'metadata' that happen to describe the URI 'content' of LocalName where they, themselves, are stored: like for the self-describing about="", 'metadata' and 'content' become indistinguishable here.) "#"-URIs and RDF's about/resource attributes can then take over the functionality of XML's id/idref attributes.

XML's built-in attributes id and idref can be used to break out of XML's element-tree structure. For example, the equation (6/2)^((1 * 2 * 4)^(1/3)) = 9 could be enriched by an id attribute identifying the divform base as div6by2 and by an idref attribute referring to that base from the rootform exponent:

<powequ>
  <divform id="div6by2">
    <integer>6</integer>
    <integer>2</integer>
  </divform>
  <rootform idref="div6by2">
    <integer>3</integer>
    <prodform>
      <integer>1</integer>
      <integer>2</integer>
      <integer>4</integer>
    </prodform>
  </rootform>
  <rational>9</rational>
</powequ>

Using the above conventions, we can transcribe this XML as follows into OrdLab XML by making idref a role:

<powequ>
  <divform about="#div6by2">
    <integer>6</integer>
    <integer>2</integer>
  </divform>
  <rootform>
    "idref"==<divform resource="#div6by2"/>
    <integer>3</integer>
    <prodform>
      <integer>1</integer>
      <integer>2</integer>
      <integer>4</integer>
    </prodform>
  </rootform>
  <rational>9</rational>
</powequ>

Graph-theoretically, this becomes a directed (here, acyclic) graph with left-to-right-ordered arcs plus an idref-labeled unordered arc, where the anonymous rootform node links to the div6by2-named divform node:

Of course, idref is not a very precise role name, which becomes obvious when there are several idrefs on a single node. So, in the above example, another idref role could lead from the rootform exponent to the rational value. But it would remain unclear how, specifically, the rootform exponent refers to the two other nodes: idref-labeled arcs are equivalent to unlabeled, unordered arcs. We thus permit users to choose meaningful role names such as base and value instead of the 'dummy' name idref.
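
The transcription of id/idref can be sketched in Python as follows. Since OrdLab XML itself is not XML 1.0, this sketch emits the role through a "_"-prefixed wrapper element, as in the "_"-role reduction discussed under Reductions to XML. The function name transcribe_idrefs and its role parameter are hypothetical; the parameter lets a caller choose a meaningful role name such as base instead of the default idref.

```python
import xml.etree.ElementTree as ET

def transcribe_idrefs(xml_text, role="idref"):
    """Rewrite id -> about="#..." and idref -> a role-wrapped resource link."""
    root = ET.fromstring(xml_text)
    # Remember which element each id names, so links can reuse its tag.
    by_id = {e.get("id"): e for e in root.iter() if e.get("id") is not None}
    for elem in list(root.iter()):
        if "id" in elem.attrib:
            elem.set("about", "#" + elem.attrib.pop("id"))
        if "idref" in elem.attrib:
            target = elem.attrib.pop("idref")
            wrapper = ET.Element("_" + role)
            ET.SubElement(wrapper, by_id[target].tag, {"resource": "#" + target})
            elem.insert(0, wrapper)   # the role child comes first
    return ET.tostring(root, encoding="unicode")

source = ('<powequ><divform id="div6by2"><integer>6</integer>'
          '<integer>2</integer></divform>'
          '<rootform idref="div6by2"><integer>3</integer></rootform></powequ>')
result = transcribe_idrefs(source, role="base")
```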

OrdLab DTDs and XML Schemas

As in the case of unextended XML, it should be possible to describe the structure of OrdLab Trees by extended DTDs and XML Schemas. For this we introduce a DTD and XML Schema metasyntax for roles and a DTD metasyntax for unordered groups (already existing in XML Schema).

The DTD metasyntax for roles generalizes XML's unprefixed element content to the role-prefixed role==element content.

The Schema metasyntax for roles introduces a corresponding xsd:role tag having a name attribute for the role and containing a normal xsd:element for the element.

The DTD metasyntax for unordered groups replaces the ","-separator for sequences by a ";"-separator for sets. This is needed here only for sets of role-prefixed elements but could also be used for sets of unprefixed elements.

The Schema metasyntax for unordered groups is the normal xsd:all with both minOccurs and maxOccurs (by default) set to "1".

In our example of the power formula 3^2 we can use a powform DTD with ";"-separated, base- and exponent-prefixed children as follows:

<!ELEMENT powform ("base"==integer; "exponent"==integer)>
<!ELEMENT integer (#PCDATA)>
This is equivalent to a DTD with interchanged base- and exponent-prefixed integers; so this could have been the first line:
<!ELEMENT powform ("exponent"==integer; "base"==integer)>

The relevant part of an equivalent powform Schema is the following:

<xsd:element name="powform">
  <xsd:complexType>
    <xsd:all>
      <xsd:role name="base">
        <xsd:element name="integer" type="xsd:string"/>
      </xsd:role>
      <xsd:role name="exponent">
        <xsd:element name="integer" type="xsd:string"/>
      </xsd:role>
    </xsd:all>
  </xsd:complexType>
</xsd:element>
Again, this is equivalent to a Schema with interchanged base-role and exponent-role-embedded integer elements.

Both DTDs and both Schemas describe, e.g.,

<powform>
  "base"==<integer>3</integer>
  "exponent"==<integer>2</integer>
</powform>
and the permuted
<powform>
  "exponent"==<integer>2</integer>
  "base"==<integer>3</integer>
</powform>
as well as (since PCDATA and xsd:string, unlike xsd:integer, need not conform to an integer type)
<powform>
  "base"==<integer>three</integer>
  "exponent"==<integer>two</integer>
</powform>

A DTD describing markup having the form of our nested powform equation (6/2)^((1 * 2 * 4)^(1/3)) = 9 would look like this (multiple occurrences of role-element pairs, here three factor-integer pairs within prodform, are allowed, where the element, here integer, occurrences normally expand to different instances):

<!ELEMENT powequ   ("base"==divform; "exponent"==rootform; "value"==rational)>
<!ELEMENT divform  ("nominator"==integer; "denominator"==integer)>
<!ELEMENT rootform ("degree"==integer; "radicand"==prodform)>
<!ELEMENT prodform ("factor"==integer; "factor"==integer; "factor"==integer)>
<!ELEMENT integer  (#PCDATA)>
<!ELEMENT rational (#PCDATA)>

The Schema is analogous (but longer).

DTDs for OrdLab Trees combining role-prefixed and positional children combine the ";"-separator and the ","-separator. The separator between a set of role-prefixed elements and a sequence of positional elements is ";" since the sequence is regarded as a member of the set ("," has binding precedence over ";").
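
The precedence rule can be made concrete with a small Python sketch that splits a content model first at ";" and then at ",". The function name parse_content_model is hypothetical, and the sketch assumes flat models only: no nested groups, no occurrence indicators, and double-quoted role names.

```python
def parse_content_model(model):
    """Split an OrdLab content model into (role pairs, positional sequences)."""
    roles, sequences = [], []
    # ";" binds more loosely than ",", so the set members are split first.
    for member in model.strip("() ").split(";"):
        member = member.strip()
        if "==" in member:
            role, _, element = member.partition("==")
            roles.append((role.strip().strip('"'), element.strip()))
        else:
            # A ","-separated sequence counts as one member of the set.
            sequences.append([name.strip() for name in member.split(",")])
    # Sorting makes the result independent of the order of set members.
    return sorted(roles), sequences

first = parse_content_model(
    '(rational, rational; "base"==integer; "exponent"==integer)')
second = parse_content_model(
    '("exponent"==integer; "base"==integer; rational, rational)')
same = first == second   # set members may be written in any order
```

Keeping the role pairs as a sorted list rather than a Python set preserves repeated pairs, such as several factor-prefixed integers within one element.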

In a corresponding manner, Schemas for such OrdLab Trees combine the xsd:all and xsd:sequence tags.

For example, a DTD for the sequence 3, 9 = 3^2 is the following:

<!ELEMENT powseqform (rational, rational; "base"==integer; "exponent"==integer)>
<!ELEMENT integer  (#PCDATA)>
<!ELEMENT rational (#PCDATA)>
This is equivalent to, e.g., a DTD in which the rational sequence comes after the role-prefixed integer set; so this could have been the first line:
<!ELEMENT powseqform ("base"==integer; "exponent"==integer; rational, rational)>
The relevant part of an equivalent Schema is the following:
<xsd:element name="powseqform">
  <xsd:complexType>
    <xsd:all>
      <xsd:sequence>
        <xsd:element name="rational" type="xsd:string"/>
        <xsd:element name="rational" type="xsd:string"/>
      </xsd:sequence>
      <xsd:role name="base">
        <xsd:element name="integer" type="xsd:string"/>
      </xsd:role>
      <xsd:role name="exponent">
        <xsd:element name="integer" type="xsd:string"/>
      </xsd:role>
    </xsd:all>
  </xsd:complexType>
</xsd:element>
Again, this is equivalent to, e.g., a Schema in which the rational sequence comes after the integer set.

Both DTDs and both Schemas describe, e.g.,

<powseqform>
  <rational>3</rational>
  <rational>9</rational>
  "base"==<integer>3</integer>
  "exponent"==<integer>2</integer>
</powseqform>
and the permuted
<powseqform>
  "base"==<integer>3</integer>
  "exponent"==<integer>2</integer>
  <rational>3</rational>
  <rational>9</rational>
</powseqform>
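
That both instances satisfy the same description can be checked with a short Python sketch. Children are modeled as (role, tag) pairs, with None as the role of a positional child. The function name matches is hypothetical, and requiring the positional children to form one contiguous block is our reading of "the sequence is regarded as a member of the set", not something the DTD metasyntax states explicitly.

```python
from collections import Counter

def matches(children, role_set, sequence):
    """Check (role, tag) children against a role set plus one sequence."""
    positions = [i for i, (role, _) in enumerate(children) if role is None]
    # Assumption: the positional children must sit next to each other.
    if positions:
        expected = list(range(positions[0], positions[0] + len(positions)))
        contiguous = positions == expected
    else:
        contiguous = True
    positional = [tag for role, tag in children if role is None]
    labeled = Counter(pair for pair in children if pair[0] is not None)
    # Positional children are compared in order, labeled ones as a multiset.
    return (contiguous and positional == list(sequence)
            and labeled == Counter(role_set))

ROLES = [("base", "integer"), ("exponent", "integer")]
SEQUENCE = ["rational", "rational"]

original = [(None, "rational"), (None, "rational"),
            ("base", "integer"), ("exponent", "integer")]
permuted = [("base", "integer"), ("exponent", "integer"),
            (None, "rational"), (None, "rational")]
both_valid = matches(original, ROLES, SEQUENCE) and matches(permuted, ROLES, SEQUENCE)
```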

Reductions to XML

The extended XML introduced in the previous sections can be reduced to XML 1.0 in various ways. We will discuss here three principal reduction possibilities, reducing the new notion of roles as follows:

  1. Roles as/in superimposed elements: The role name (or URI) is specified by the tag or by an attribute of a new element superimposed above a child.
  2. Roles as/in preceding elements: The role name (or URI) is specified by the tag or by an attribute of a new element preceding a child.
  3. Roles in value children themselves: The role name (or URI) is specified by an attribute within the existing child element.

The URI attributes about and resource taken from RDF are neutral w.r.t. these XML-reduction possibilities and, being XML attributes already, pose no problem in the possible reductions.

In the following, we only expand on possibility (1) Roles as/in superimposed elements. Possibility (2) is analogous. Possibility (3) may be cuter but is less intuitive: One would have to "look inside a child's start tag" to see what role it was playing for its parent.

"_"-Role Version: A rudimentary variant of this reduction possibility was already implicit in our derivation of the role notion. A role-prefixed child of the form role==child can use the role, prefixed by a "_" to distinguish it from the child's type, as a superimposed tag pair: <_role> child </_role>. For example,

"base"==<integer>3</integer>
becomes
<_base><integer>3</integer></_base>

Meta-role Version: In a more generic version, a role-prefixed child of the form role==child can use the role in a superimposed role-tag pair with an href attribute: <role href="role"> child </role>. For example,

"base"==<integer>3</integer>
becomes
<role href="base"><integer>3</integer></role>
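
Both superimposition variants can be sketched as simple string transformations in Python. The function names underscore_role and meta_role are hypothetical, and the child is assumed to be already serialized, well-formed XML.

```python
from xml.sax.saxutils import quoteattr

def underscore_role(role, child_xml):
    """'_'-role version: the role becomes a '_'-prefixed tag pair."""
    return f"<_{role}>{child_xml}</_{role}>"

def meta_role(role, child_xml):
    """Meta-role version: a generic role tag carries the name in href."""
    return f"<role href={quoteattr(role)}>{child_xml}</role>"

child = "<integer>3</integer>"
reduced_1 = underscore_role("base", child)
# reduced_1 == '<_base><integer>3</integer></_base>'
reduced_2 = meta_role("base", child)
# reduced_2 == '<role href="base"><integer>3</integer></role>'
```

In this sketch underscore_role only produces well-formed XML when the role is a valid XML Name, while meta_role accepts any attribute-value string, since quoteattr escapes it.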

Unlike the "_"-convention, the meta-role version generically keeps the same tag name, role, and uses the role name as the value of the href attribute, for all role-child pairs. Like our XML extension, this directly allows the role name to be a complete URL, where the quoted URL string should ultimately be replaced by a URL datatype (cf. IETF [RFC2396] and XML Schema). The "_"-role version would require namespaces as tag prefixes here, as also used in RDF. Unlike the "_"-role version, the meta-role version moreover allows child sharing, as illustrated by contrasting

<powequ>
  <_base><integer>3</integer></_base>
  <_exponent><integer>3</integer></_exponent>
  <_value><rational>27</rational></_value>
</powequ>
with
<powequ>
  <role href="base exponent"><integer>3</integer></role>
  <role href="value"><rational>27</rational></role>
</powequ>
However, for the different roles played by a child, a space-separated (CDATA-)string, "base exponent", becomes necessary as the value of the href attribute. More seriously, the meta-role version cannot be defined by a (context-free) DTD: only the role context of an href attribute's value (here, base exponent vs. value) can tell what content type (here, integer vs. rational) is allowed.
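
How a consumer might unfold such a shared child into separate role-child bindings is sketched below in Python. The function name role_bindings is hypothetical, and each role element is assumed to contain exactly one child.

```python
import xml.etree.ElementTree as ET

def role_bindings(xml_text):
    """List (role name, child tag, child text) for every role played."""
    root = ET.fromstring(xml_text)
    bindings = []
    for wrapper in root.findall("role"):
        child = wrapper[0]
        # The href value is a space-separated list of role names.
        for name in wrapper.get("href", "").split():
            bindings.append((name, child.tag, child.text))
    return bindings

shared = """<powequ>
  <role href="base exponent"><integer>3</integer></role>
  <role href="value"><rational>27</rational></role>
</powequ>"""
unfolded = role_bindings(shared)
```

The unfolded list pairs each role name with its content type, which is exactly the dependency between an href value and the allowed child that a DTD cannot express.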

The "_"-role version of the superimposition reduction was thus also proposed for RuleML. For more details on this XML reduction possibility, for an RDF reduction via Seq containers, for ASCII diagrams of OrdLab Trees (where our 3D-style arrows become "*"-line arrows), and for completely different examples (business rules), readers are referred to the RuleML DTD, Version 0.8.

OrdLab DTDs and XML Schemas can be reduced, for all possibilities discussed, to normal DTDs and XML Schemas along with the OrdLab XML itself.

Conclusions

In this article we introduced native roles for XML, native sequences for RDF, and, more generally, an integration of the XML-RDF data models through OrdLab Graphs. The fact that this XML extension is not easily reduced to XML 1.0 supports our view that it should be directly incorporated into a future version of XML.

Role names could not only be enclosed by double quotes, as shown in this article, but also by single quotes, as for XML's attribute values. Role names can thus be CDATA such as full URIs, without requiring (but still allowing) the namespace prefixes of RDF properties. XML parsers could be extended to recognize text of the form '"CDATA"==<' or "'CDATA'==<", possibly with whitespace around "==" (XML's mixed content would then no longer permit "==", now the role-child separator, to precede a "<", the beginning of a tag).
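
The recognition step of such an extended parser can be approximated by a regular expression. The following Python sketch is an assumption-laden simplification, not a parser: the names ROLE_PREFIX and find_roles are hypothetical, and role names are restricted to strings without quotes or angle brackets.

```python
import re

# A quoted role name, optional whitespace around "==", then a tag start.
ROLE_PREFIX = re.compile(
    r"""(?P<q>["'])(?P<role>[^"'<>]*)(?P=q)\s*==\s*(?=<)""")

def find_roles(text):
    """Return the role names of all role prefixes found in text."""
    return [m.group("role") for m in ROLE_PREFIX.finditer(text)]

sample = """<powform>
  "base"==<integer>3</integer>
  'exponent' == <integer>2</integer>
</powform>"""
roles = find_roles(sample)
```

The lookahead (?=<) leaves the start tag itself unconsumed, so an ordinary XML tokenizer could continue from there.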

Alternatively, it would also be possible to omit quotes altogether, making role names normal XML Names. This would enable another serialization for OrdLab XML, somewhat related to possibility (3) of the above XML reductions: Instead of "RoleName==<TagName ...>" we could write "<RoleName==TagName ...>". Thus, the role-child pair

base==<integer>3</integer>
would become
<base==integer>3</integer>
This would preserve XML's 'top-level' syntax, but would require two "=="-separated parts within start-tag Names.

Of course, our choice of "==" as the role-child separator, 'generalizing' the "=" of XML attributes, could be easily revised. The choice of such a separator symbol is constrained by what kind of mixed content or start-tag name should be reserved for indicating a role-child construct.

The current OrdLab XML uses roles as prefixes 'only' for elements, not for PCDATA/xsd:string or other XML data. In RuleML's above-discussed superimposition XML reduction of OrdLab XML we have been tempted to use such role-prefixed PCDATA somewhat like this, which would not correspond to well-formed OrdLab XML:

<powform>
  <_base>three</_base>
  <_exponent>two</_exponent>
</powform>
However, role-prefixed PCDATA could always be re-represented as well-formed OrdLab XML using an explicit intermediate pcdata element (and often a more specific one), here giving us this version:
<powform>
  "base"==<pcdata>three</pcdata>
  "exponent"==<pcdata>two</pcdata>
</powform>
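
For the "_"-role reduction, this re-representation can be automated. The following Python sketch wraps the text of every childless "_"-prefixed role element in an explicit pcdata element; the function name wrap_role_pcdata is hypothetical, and the result is the XML-reduced counterpart of the OrdLab XML shown above.

```python
import xml.etree.ElementTree as ET

def wrap_role_pcdata(xml_text):
    """Wrap bare text inside '_'-role elements in a pcdata element."""
    root = ET.fromstring(xml_text)
    for elem in list(root.iter()):
        is_role = elem.tag.startswith("_")
        has_bare_text = len(elem) == 0 and (elem.text or "").strip()
        if is_role and has_bare_text:
            wrapped = ET.SubElement(elem, "pcdata")
            wrapped.text = elem.text
            elem.text = None   # the text now lives in the pcdata child
    return ET.tostring(root, encoding="unicode")

tempting = "<powform><_base>three</_base><_exponent>two</_exponent></powform>"
repaired = wrap_role_pcdata(tempting)
```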

References

[BL99] Tim Berners-Lee. A Strawman Unstriped Syntax for RDF in XML. Web Page, May 1999.

[Bol00] Harold Boley. Relationships Between Logic Programming and RDF. Proc. 1st Pacific Rim International Workshop on Intelligent Information Agents (PRIIA 2000), University of Melbourne, Australia, 2000; LNAI volume to be published.

[Cov98] Robin Cover. XML and Semantic Transparency. The XML Cover Pages, Nov. 1998.

[LS99] Ora Lassila and Ralph R. Swick. Resource Description Framework (RDF) Model and Syntax Specification. W3C Recommendation, REC-rdf-syntax-19990222, Feb. 1999.

[Mel99] Sergey Melnik. Bridging the Gap between RDF and XML. Web Page, Dec. 1999.

[Woo75] William A. Woods. What's in a Link: Foundations for Semantic Networks. In: Daniel C. Bobrow and A. M. Collins. Representation and Understanding: Studies in Cognitive Science, Academic Press, pp. 35-82, 1975.





"Practice what you preach": XML source of this homepage at xmlrdf.xml (xmlrdf.xml.txt);
transformed to HTML via the adaptation of Michael Sintek's SliML XSLT stylesheet at homepage.xsl (View | Page Source)
