XML Output in Omegahat, R S4/Splus5.

If we want to use XML as one mechanism to exchange data between applications and processes, we must be able to both parse it and generate it. Packages for parsing XML in R and S are described elsewhere. Here we discuss how we might generate XML output and the associated tools needed to do this generically.

Basic Mechanism - object output.

We define a generic function, toXML(). Its class-specific methods are responsible for returning a string containing the appropriate XML output for an object of that class.
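As a rough illustration of this mechanism, the following is a minimal sketch using S4 generics from the methods package. The tag names, the `length` attribute, and the helpers `xmlEscape()` and `wrapValues()` are assumptions made for the example; the text above fixes only the name toXML() and the fact that each method returns a string.

```r
library(methods)

# Generic function: each method returns a single character string of XML.
setGeneric("toXML", function(x, ...) standardGeneric("toXML"))

# Hypothetical helper: escape characters that are special in XML text.
# '&' is replaced first so the entities added afterwards are not re-escaped.
xmlEscape <- function(s) {
  s <- gsub("&", "&amp;", s, fixed = TRUE)
  s <- gsub("<", "&lt;", s, fixed = TRUE)
  gsub(">", "&gt;", s, fixed = TRUE)
}

# Hypothetical helper: wrap each value of x in <value> tags inside <tag>.
wrapValues <- function(tag, x) {
  items <- if (length(x) > 0) {
    paste0("<value>", xmlEscape(as.character(x)), "</value>", collapse = "")
  } else {
    ""
  }
  paste0("<", tag, " length=\"", length(x), "\">", items, "</", tag, ">")
}

# Class-specific methods (the element names are made up for this sketch).
setMethod("toXML", "numeric",   function(x, ...) wrapValues("numeric", x))
setMethod("toXML", "character", function(x, ...) wrapValues("character", x))

cat(toXML(1:3), "\n")
cat(toXML("a<b"), "\n")
```

Supporting a new class then only requires another setMethod() call; code that calls toXML() does not change.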

XML Streams

A more challenging, and potentially flexible, facility is to have an object that acts as an XML filter. It takes objects as input at different times in the session and generates XML output appropriate for the object. Depending on how it is created, it then passes this new output to one or more listeners so that it can be rendered, stored, transmitted, etc. In this way, we may have the output appear in a browser as the object is displayed in the session. The browser may mark up the XML in interesting ways to assist the user in, for example, displaying the session in outline mode, connecting variables as input to and output from different commands, clicking on an icon to activate the associated graphics device, etc.

   filter$write(read.table())
   filter$write(1:10)
   filter$write(factor(rbinom(100, 1, 0.5)))   # size and prob chosen arbitrarily
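One way such a filter might be structured is as a closure that holds a list of listener functions. The sketch below is an assumption-laden illustration, not the Omegahat implementation: the constructor `xmlFilter()`, the `addListener()` function and the trivial converter `simpleConvert()` are all hypothetical names, and the converter does no escaping or DTD checking.

```r
# Hypothetical stand-in for the object-to-XML conversion: the object's class
# becomes the tag and its values become space-separated text (no escaping).
simpleConvert <- function(obj) {
  tag <- class(obj)[1]
  paste0("<", tag, ">", paste(as.character(obj), collapse = " "), "</", tag, ">")
}

# Hypothetical constructor for a filter. Each listener is a function taking
# one XML string; write() converts an object and passes the XML to all of them.
xmlFilter <- function(listeners = list(), convert = simpleConvert) {
  write <- function(obj) {
    xml <- convert(obj)
    for (f in listeners) f(xml)
    invisible(xml)
  }
  addListener <- function(f) {
    listeners[[length(listeners) + 1]] <<- f
    invisible(NULL)
  }
  list(write = write, addListener = addListener)
}

# Two listeners: one stores the output, the other displays it.
stored <- character()
filter <- xmlFilter()
filter$addListener(function(xml) stored <<- c(stored, xml))
filter$addListener(function(xml) cat(xml, "\n", sep = ""))

filter$write(1:3)
filter$write(factor(c("a", "b")))
```

Because listeners are ordinary functions, rendering in a browser, writing to a file, or sending over a connection can each be added without changing the filter itself.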

Reading DTDs in R/S.

A filter as described above needs to create well-formed and valid XML. To do this, it must have some knowledge of the DTD to use. There are two possible ways to provide this. One is to create functions and data structures that have a particular DTD encoded in them. The alternative is to have a general mechanism for reading DTDs and interpreting them. The former requires work to be done for each DTD and also causes potential problems regarding synchronization between the external description of the DTD and the local data structures. Thus the second is preferred. This allows us to write some very general facilities which operate on arbitrary DTDs and validate content by reading the DTD description itself. This is similar to the style used in PSGML, the Emacs mode.

We can access the information within a DTD locally using the parseDTD() function and the argument of the same name to xmlTreeParse(). The DTD elements returned by both are identical, so we describe the value returned by parseDTD(). Before this, we give a very brief overview of what is in a DTD and what we can expect to see in the user-level objects.

`parseDTD()`

This function takes the name of a file which is expected to contain a Document Type Definition (DTD). This file is parsed and the resulting tables of element and entity definitions are converted to lists of user-level objects. The return value is a list containing two sub-lists, one for the elements and one for the entities. (In the case of the DTD being returned via the function xmlTreeParse(), both the internal and external DTD tables are returned. Each of these is as described here.)
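A short usage sketch follows. It assumes the XML package is installed (the code is skipped otherwise), and the DTD written to the temporary file is invented purely for illustration.

```r
# Assumes the XML package is available; the DTD content is made up.
if (requireNamespace("XML", quietly = TRUE)) {
  dtdFile <- tempfile(fileext = ".dtd")
  writeLines(c(
    '<!ENTITY testEnt "test entity">',
    '<!ELEMENT variables (variable)*>',
    '<!ELEMENT variable (#PCDATA)>',
    '<!ATTLIST variable unit CDATA #IMPLIED>'
  ), dtdFile)

  # Parse the file; the result holds the element and entity definitions.
  dtd <- XML::parseDTD(dtdFile)

  print(names(dtd))            # the two sub-lists
  print(names(dtd$elements))   # identifiers of the element definitions
  print(names(dtd$entities))   # identifiers of the entities
}
```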

Entities

There are two types of entities: internal and external/system entities. The former are used as simple text substitution or macro facilities. They allow one to define a segment of text that is used in several places in a document or elsewhere in the DTD (such as in attribute lists). Rather than repeating the text and having to modify multiple instances of it should it need to be changed, one uses entities to parameterize the segment. Internal entities are defined something like

   <!ENTITY % foo "my text to be repeated">

Internal entities of this form are converted to user-level objects of class XMLEntity. Each of these has 3 fields. The name field is the identifier used to refer to the entity. The value field is the expansion of the macro. The original field is the unexpanded value: if the definition contains references to other entities, these are not expanded. For example, the entries in the DTD

<!ENTITY % bar "for R and S">
<!ENTITY testEnt "test entity %bar;">

produce the XMLEntity object

$name
[1] "testEnt"

$value
[1] "test entity for R and S"

$original
[1] "test entity %bar;"

attr(,"class")
[1] "XMLEntity"

The entities list is a named list: the name of each entry is the identifier of the corresponding entity.
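The relationship between the original and value fields can be illustrated with a small substitution function. This is only a sketch of the idea, not how the parser computes the expansion: `expandEntities()` is a hypothetical helper, it handles only parameter-entity references of the form %name;, and it makes a single pass without resolving nested definitions.

```r
# Hypothetical helper: replace each %name; reference in 'text' with the
# corresponding entry of the named list 'entities' (single pass, literal match).
expandEntities <- function(text, entities) {
  for (nm in names(entities)) {
    text <- gsub(paste0("%", nm, ";"), entities[[nm]], text, fixed = TRUE)
  }
  text
}

entities <- list(bar = "for R and S")
original <- "test entity %bar;"

value <- expandEntities(original, entities)
print(value)
```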

External Entities

External entities are similar to regular internal entities but refer to text expansions that reside outside of this file. The location may be another file or a URL, etc. These are returned with a class XMLExternalEntity. This has the same fields as the class XMLEntity but the interpretation of the value field is left to the user-level software. One can use scan(), url.scan, and other functions for reading the value of the remote content.

Elements

While the entities usually appear at the start of the DTD and are important for building flexible, useful DTDs and documents, the most important part of a DTD is the collection of element definitions. These define the structure of a document that "obeys" the DTD and how the different pieces (nodes) of the document fit together. Each element definition specifies, first, the list of attributes that can be used within that element and their types and, second, which other elements can be nested within it and in what order. We will not try to explain the structure of a DTD in this document. See W3.org for resources explaining the structure at various levels.

A basic element definition has the following components:

  <!ELEMENT name content>

The name is the text used to introduce it in an XML document as in

  <name>   </name>
  <name />

The content is the most complicated aspect of an element, but it is relatively simple to understand in most cases. It indicates which combinations of elements can be nested within this element. It allows the author of the DTD to specify an ordering of the sub-elements as well as limited control over the number of such elements one can use in any position. The three basic structures used in the content definition are

another element,
a set of elements of which one can be used, and
an ordered sequence of elements and composite structures.

Each of these three can be qualified by an occurrence qualifier which controls the number of such structures to expect in this position.

By default, just one is expected.
(content)+ means that at least one is expected, but there can be any number of structures matching this content description after the first one.
(content)* means that there are 0 or more expected.
(content)? means zero or one.

The following example illustrates all of the basic features.

 <!ELEMENT entry3 ( (variables | (tmp, x)), (record)* , (a*, b,c,d, (e|f)) , (foo)+ ) >

Here we define an element named entry3. It has 4 basic terms that can be nested within it, in a specific order. First, we must have either a variables element or the pair tmp followed by x; exactly one of these alternatives must appear. This is followed by any number of record elements, possibly none. After this, there must be a sequence made up of zero or more a elements, then b, c and d, and then either e or f. Finally, there must be one or more foo elements.
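Because a content model of this kind describes an ordered pattern of child names, it can be mimicked by a regular expression. The sketch below hand-translates the entry3 content model into a pattern and tests candidate child sequences against it. This is purely an illustration of the semantics; the names `entry3Pattern` and `matchesEntry3()` are hypothetical, and this is not how a validating parser works.

```r
# The entry3 content model written by hand as a regular expression over child
# names. Each name is followed by one space so that names cannot run together.
entry3Pattern <- "^(variables |tmp x )(record )*(a )*b c d (e |f )(foo )+$"

# TRUE if the character vector of child element names fits the content model.
matchesEntry3 <- function(children) {
  grepl(entry3Pattern, paste0(children, " ", collapse = ""))
}

# A minimal valid sequence, and one using the optional and repeated parts.
print(matchesEntry3(c("variables", "b", "c", "d", "e", "foo")))
print(matchesEntry3(c("tmp", "x", "record", "record", "a", "a",
                      "b", "c", "d", "f", "foo", "foo")))

# Invalid: neither e nor f appears before foo.
print(matchesEntry3(c("variables", "b", "c", "d", "foo")))
```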

The attributes an element supports are listed separately via an ATTLIST declaration:

 <!ATTLIST element-name
       attributeId type  default
         ...
 >

The structure returned from parsing and converting a DTD to a user-level object is quite simple. It is a list of length 2, with one entry for the entities and the other for the elements within the DTD. If the DTD object comes from a document, it separates the entities and elements defined locally (internally) in the document from those in the external DTD, if there is one. This results in a list of length 2 which contains the internal and external DTDs. Each of these is then a list of length 2 with the entities and elements.

The entities entry in a DTD is a named list. The names are the identifiers for the entities. Each entry in this list is an object of class XMLEntity or XMLExternalEntity. In either case, each has 3 fields: name, value and original. The name is the identifier of the entity. The value is the text used to substitute in place of the entity reference. The original field is for use when reproducing or analyzing the DTD. If the value contains references to other entities, this field reflects that and is the unexpanded or literal version of the entity definition as it appears in the DTD document.

The elements list is also a named list, with the names being those of the elements. Each entry in the list is an object of class XMLElementDef. These contain 4 fields:

name: the name of the element.

type: this will almost always be 1, indicating an ELEMENT_NODE. An explanatory string is used as the name for this integer enumeration value.

contents: an object defining the restrictions on the sub-elements that can be nested within this element. This is of class XMLElementContent and has 3 fields:

   type: named integer value (with the name providing a description of the meaning) indicating what type of content this is. The usual ones are PCData, Sequence, Element, Or, and so on.
   ocur: named enumerated value indicating how many instances of this content are expected and admissible. These are Once, Zero or One, Mult and One or More.
   elements: a list of XMLContent objects that describe the feasible sub-elements within the element being defined. These are usually specializations of the class XMLContent: XMLOrContent, XMLElementContent, XMLSequenceContent. These have the same structure, just different meaning and semantics.

attributes: a named list of XMLAttributeDef objects, with the names being those of the attributes being defined for this element.

The result of converting the definition of entry3 above is given below. It is an object of class XMLSequenceContent. Hence, its type field is a named integer with value 3 and name Sequence. Since the entire content has no qualifier, the ocur field is Once.

Now we look at the sub-elements, accessible from the elements field. This is a list of length 4, one for each term in the sequence. The classes of the objects may help to explain its structure.

> sapply(d$elements$entry3$contents$elements, class)
[1] "XMLOrContent"       "XMLElementContent"  "XMLSequenceContent"
[4] "XMLElementContent"

Let's look at the third entry, the XMLSequenceContent object.

> r <- d$elements$entry3$contents$elements[[3]]

Again, this is a sequence. Its sub-entries are of different content classes.

> sapply(r$elements, class)
[1] "XMLElementContent" "XMLElementContent" "XMLElementContent"
[4] "XMLElementContent" "XMLOrContent"

The first 4 are reasonably obvious. These identify single elements and are the primitive content types.

> r$elements[[1]]
$type
Element 
      2 

$ocur
Mult 
   3 

$elements
[1] "a"

attr(,"class")
[1] "XMLElementContent"

We see that the expected element is a and that there can be zero or more of these.

The more interesting entry is the last one. Its primitive display is given below.

$type
Or 
 4 

$ocur
Once 
   1 

$elements
$elements[[1]]
$type
Element 
      2 

$ocur
Once 
   1 

$elements
[1] "e"

attr(,"class")
[1] "XMLElementContent"


$elements[[2]]
$type
Element 
      2 

$ocur
Once 
   1 

$elements
[1] "f"

attr(,"class")
[1] "XMLElementContent"


attr(,"class")
[1] "XMLOrContent"

We see that it is of type Or and that we expect exactly one instance of it. It is interpreted by expecting any of the content structures described in its elements list. Each of these is a simple XMLElementContent object and so is a "primitive".
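To make the recursive structure concrete, the sketch below rebuilds the third term of entry3 by hand and walks it to reproduce the DTD syntax. The objects are stand-ins that mimic the printed fields (type, ocur, elements); real ones would come from parseDTD(). The helpers `mkContent()`, `elem()` and `contentToString()` are hypothetical names, and the handling of the Zero or One and One or More qualifiers relies only on their names as listed earlier.

```r
# Hand-built stand-ins mimicking the printed content objects.
mkContent <- function(type, ocur, elements, class) {
  structure(list(type = type, ocur = ocur, elements = elements), class = class)
}

# A primitive: a single named element with an occurrence qualifier.
elem <- function(name, ocur = c(Once = 1L)) {
  mkContent(c(Element = 2L), ocur, name, "XMLElementContent")
}

# Recursively render a content object back into DTD content-model syntax.
contentToString <- function(x) {
  body <- switch(names(x$type),
    Element  = x$elements,
    Or       = paste0("(", paste(vapply(x$elements, contentToString, ""),
                                 collapse = " | "), ")"),
    Sequence = paste0("(", paste(vapply(x$elements, contentToString, ""),
                                 collapse = ", "), ")"),
    stop("unhandled content type: ", names(x$type)))
  suffix <- switch(names(x$ocur),
    "Once" = "", "Zero or One" = "?", "Mult" = "*", "One or More" = "+",
    stop("unhandled occurrence qualifier: ", names(x$ocur)))
  paste0(body, suffix)
}

# The third term of entry3: (a*, b, c, d, (e|f))
r <- mkContent(c(Sequence = 3L), c(Once = 1L),
  list(elem("a", c(Mult = 3L)), elem("b"), elem("c"), elem("d"),
       mkContent(c(Or = 4L), c(Once = 1L),
                 list(elem("e"), elem("f")), "XMLOrContent")),
  "XMLSequenceContent")

print(contentToString(r))
```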

Back to the Filter

Armed with the contents of a DTD, we can now easily validate XML output as the filter generates it. Suppose the following command is issued via the filter. (Such commands will most likely be issued indirectly via higher-level commands.)

  filter$output("variable", c(unit="mpg"), value)

Then, the filter will check its current state, specifically the last open/unfinished element, and examine its content specification. If the previous command was something like

  filter$open("variables", numRecords=nrow(data))

then the filter will extract the list of possible entries for this tag.

     dtd$elements[["variables"]]$contents$elements

Then it determines whether the element variable can be added. In the case of a dataset, this is a simple lookup. The only acceptable value is a variable element.

> d$elements$variables$contents
$type
Element 
      2 

$ocur
Mult 
   3 

$elements
[1] "variable"

attr(,"class")
[1] "XMLElementContent"
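The lookup described here might be sketched as follows. The DTD is mocked as a plain list shaped like the output above, and `allowedNames()` and `canAdd()` are hypothetical helpers. The check only asks whether the new element name appears anywhere in the open element's content specification; it deliberately ignores ordering and occurrence counts, which a full validator would also need to track.

```r
# Mock of the relevant part of a parsed DTD, shaped like the printed output.
d <- list(elements = list(
  variables = list(
    name = "variables",
    contents = structure(
      list(type = c(Element = 2L), ocur = c(Mult = 3L), elements = "variable"),
      class = "XMLElementContent")
  )
))

# Collect every element name mentioned in a content specification, descending
# through nested Or/Sequence structures.
allowedNames <- function(content) {
  if (is.character(content$elements)) return(content$elements)
  unique(unlist(lapply(content$elements, allowedNames)))
}

# Can an element called 'newTag' be nested directly within the open 'openTag'?
canAdd <- function(dtd, openTag, newTag) {
  def <- dtd$elements[[openTag]]
  if (is.null(def)) stop("no definition for element '", openTag, "'")
  newTag %in% allowedNames(def$contents)
}

print(canAdd(d, "variables", "variable"))
print(canAdd(d, "variables", "record"))
```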

Duncan Temple Lang, duncan@wald.ucdavis.edu

Last modified: Mon Sep 30 10:46:17 EDT 2002