If we want to use XML as a mechanism for exchanging data between
applications and processes, we must be able to both parse it
and generate it. Packages for parsing XML in R and S are already available.
Here we discuss how we might generate XML output and the associated
tools needed to do this generically.
Basic Mechanism - object output.
We define a generic function, toXML(). The class-specific methods for
it are responsible for returning a string containing the appropriate
XML output.
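As a rough illustration of this mechanism, the following is a minimal sketch using S3 dispatch. The element names (numeric, string) and the choice of methods are assumptions made for the example, not the output format of any existing package.

```r
# Generic function: dispatches on the class of x.
toXML <- function(x, ...) UseMethod("toXML")

# Fallback for classes without a method of their own.
toXML.default <- function(x, ...) {
  stop("no toXML method for class ", class(x)[1])
}

# Hypothetical method for numeric vectors: values separated by spaces.
toXML.numeric <- function(x, ...) {
  paste0("<numeric>", paste(as.character(x), collapse = " "), "</numeric>")
}

# Hypothetical method for strings: escape the characters that are
# special in XML text before wrapping them in an element.
toXML.character <- function(x, ...) {
  escaped <- gsub("&", "&amp;", x, fixed = TRUE)
  escaped <- gsub("<", "&lt;", escaped, fixed = TRUE)
  paste0("<string>", escaped, "</string>", collapse = "")
}

toXML(c(1, 2.5))
toXML("a < b")
```

Each method returns a single string, so the caller can decide whether to print it, store it or pass it on.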
A more challenging, and potentially more flexible, facility is to have an
object that acts as an XML filter. It takes objects as input at
different times in the session and generates XML output appropriate
for the object. Depending on how it is created, it then passes this
new output to one or more listeners so that it can be rendered,
stored, transmitted, etc. In this way, we may have the output appear
in a browser as the object is displayed in the session. The browser
may mark up the XML in interesting ways to assist the user in, for
example, displaying the session in outline mode, connecting variables
as input to and output from different commands, clicking on an icon to
activate the associated graphics device, etc.
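One way such a filter might be organized is as a closure that holds a list of listener functions. This is only a sketch: makeFilter, addListener and the number element are hypothetical names introduced for the example, and the function that converts an object to XML is passed in as an argument.

```r
# Hypothetical constructor for a filter object.
# toXML: function converting an object to an XML string.
# listeners: functions that each receive the generated XML.
makeFilter <- function(toXML, listeners = list()) {
  list(
    addListener = function(f) {
      listeners[[length(listeners) + 1L]] <<- f
      invisible(NULL)
    },
    output = function(obj) {
      xml <- toXML(obj)
      for (f in listeners) f(xml)   # render, store, transmit, ...
      invisible(xml)
    }
  )
}

store <- character()
f <- makeFilter(
  toXML = function(x) sprintf("<number>%s</number>", format(x)),
  listeners = list(
    function(xml) cat(xml, "\n"),            # display the output
    function(xml) store <<- c(store, xml)    # keep a copy
  )
)
f$output(3.5)
f$output(42)
```

Because the listeners are ordinary functions, the same filter could feed a browser, a file and a network connection at once.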
A filter as described above needs to create well-formed and valid
XML. To do this, it must have some knowledge of a DTD to use. There
are two possible ways to do this. One is to create functions and data
structures that have a particular DTD encoded in them. The
alternative is to have a general mechanism for reading DTDs and
interpreting them. The former requires work to be done for each DTD
and also causes potential problems regarding synchronization between
the external description of the DTD and the local data structures.
Thus the second is preferred. This allows us to write some very
general facilities which operate on arbitrary DTDs and validate
content by reading the DTD description itself.
This is similar to the style used in the Emacs mode -
We can access the information within a DTD locally using the
parseDTD() function and the argument of the same name to
xmlTreeParse(). The DTD elements returned by both are
identical, so we describe the value returned by parseDTD().
Before this, we give a very brief overview of what is in a DTD
and what we can expect to see in the user-level objects.
parseDTD() takes the name of a file which is expected to contain a
Document Type Definition (DTD). This file is parsed and the resulting
tables of element and entity definitions are converted to lists of
user-level objects. The return value is a list containing two
sub-lists, one for the elements and one for the entities.
(In the case of the DTD being returned via the function
xmlTreeParse(), both the internal and external DTD
tables are returned. Each of these is as described here.)
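A small sketch of such a call follows. It assumes the XML package is installed and uses the asText argument of parseDTD() to parse a DTD held in a string rather than in a file. The DTD itself (dataset, variable, record) is made up for the example.

```r
# A made-up DTD, held in a string for convenience.
dtdText <- paste(
  '<!ELEMENT dataset (variable+, record*)>',
  '<!ELEMENT variable (#PCDATA)>',
  '<!ELEMENT record (#PCDATA)>',
  sep = "\n")

# Only run if the XML package is available.
if (requireNamespace("XML", quietly = TRUE)) {
  d <- XML::parseDTD(dtdText, asText = TRUE)
  print(names(d))            # the sub-lists of the result
  print(names(d$elements))   # one entry per element definition
}
```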
There are two types of entities - internal and external/system
entities. The former are used as simple text substitution or macro
facilities. They allow one to define a segment of text that is used in
several places in a document or elsewhere in the DTD (such as attribute
lists). Rather than repeating the text and having to
modify multiple instances of it should it need to be changed, one uses
entities to parameterize the segment.
Internal entities are defined something like
<!ENTITY % foo "my text to be repeated">
Internal entities of this form are converted to user-level objects of
class XMLEntity. Each of these has 3 fields:
- name, the identifier used to refer to the entity.
- value, the expansion of the macro.
- orig, the unexpanded value, which means that if the value contains
  references to other entities, these will not be expanded.
For example, for the entries in the DTD
<!ENTITY % bar "for R and S">
<!ENTITY testEnt "test entity %bar;">
the value field of testEnt is
 "test entity for R and S"
and its orig field is
 "test entity %bar;"
The names of the entities list are the names of the entities.
External entities are similar to regular internal entities but refer
to text expansions that reside outside of this file. The location may
be another file or a URL, etc. These are returned with class
XMLExternalEntity. This has the same fields as the class
XMLEntity but the interpretation of the
value field is left to the user-level software.
One can use url.scan and other functions to read the remote content.
While the entities usually appear at the start of the DTD and are
important for building flexible, useful DTDs and documents, the most
important aspect of a DTD is the collection of elements that define
the structure of a document that "obeys" the DTD and how the different
pieces (nodes) of the document fit together. These are element
definitions and each specifies firstly, the list of attributes that
can be used within that element and their types, and secondly what
other elements can be nested within this one and in what order. We
will not try to explain the structure of a DTD in this document. See
W3.org for resources explaining
the structure at various different levels.
A basic element definition has the following components:
<!ELEMENT name content>
The name is the text used in the tag that introduces the element in an
XML document.
The content is the most complicated aspect of an element, but it is
relatively simple to understand in most cases.
It indicates which combinations of elements can be nested within this
element. It allows the author of the DTD to specify an ordering of the
sub-elements as well as limited control over the number of such
elements one can use in any position.
The three basic structures used in the content definition are
- another element,
- a set of elements of which one can be used, and
- an ordered sequence of elements and composite structures.
Each of these three can be qualified by an occurrence qualifier which
controls the number of such types to expect in this position:
- by default, just one is expected.
- (content)+ means that at least one is expected, but there can
  be any number of structures matching this content description
  after the first one.
- (content)* means that there are 0 or more.
- (content)? means zero or one.
The following example illustrates all of the basic features.
<!ELEMENT entry3 ( (variables | (tmp, x)), (record)* , (a*, b,c,d, (e|f)) , (foo)+ ) >
Here we define an element named entry3. This has 4 basic
types that can be nested within it, and in a specific order. First,
we must have a variables element or the pair tmp followed by x.
There should be exactly one of either of these entries.
This is followed optionally by any number of record elements.
After this, there must be a sequence of zero or more a elements,
then b, c and d, and either of e or f.
And finally, we must have at least one foo entry, but can have more.
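A content model of this kind is in effect a regular expression over element names. To make the ordering and occurrence rules concrete, here is a hand-translated sketch in base R. The pattern and the helper matchesEntry3 are written for this example only and are not produced by any parsing function; the sketch relies on element names never containing spaces.

```r
# Content model of entry3 as a regular expression over child names,
# where each name is followed by a single space.
entry3Pattern <- "^(variables |tmp x )(record )*(a )*b c d (e |f )(foo )+$"

# Hypothetical helper: does this sequence of child element names
# satisfy the content model of entry3?
matchesEntry3 <- function(children) {
  if (length(children) == 0L) return(FALSE)
  grepl(entry3Pattern, paste0(children, " ", collapse = ""))
}

matchesEntry3(c("variables", "b", "c", "d", "e", "foo"))
matchesEntry3(c("tmp", "x", "record", "record", "a", "a",
                "b", "c", "d", "f", "foo", "foo"))
matchesEntry3(c("variables", "b", "c", "d", "e"))   # no foo at the end
```

The first two calls describe acceptable children. The third is rejected because the final (foo)+ term requires at least one foo.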
The attributes an element supports are listed separately, in an
ATTLIST declaration of the form
<!ATTLIST name attributeId type default>
The structure returned from parsing and converting a DTD to a
user-level object is quite simple. It is a list of length 2, one for
the entities and the other for the elements within the DTD. If the
DTD object comes from a document, it separates the entities and
elements defined locally or internally in the document and those in
the external DTD if there is one. This results in a list of length 2
which contains the internal and external DTDs. Each of these is then a
list of length 2 with the entities and elements.
The entities element in a DTD is a named list. The names are the
identifiers for the entities.
Each entry in this list is an object of class XMLEntity or
XMLExternalEntity. In either case, each has 3 fields:
- The name is the identifier of the entity.
- The value is the text substituted in place of a reference to the
  entity.
- The orig field is for use when reproducing
  or analyzing the DTD. If the value contains references to other
  entities, this field reflects that and is the unexpanded or literal
  version of the entity definition as it appears in the DTD document.
The elements list is also a named list, with the names being those of
the elements. Each entry in the list is an object of class
XMLElementDef. These contain 4 fields:
- the name of the element.
- the type of the element. This will almost always be 1, indicating an
  ELEMENT_NODE. An explanatory string is used as the name for this
  integer enumeration value.
- an object defining the restriction on the sub-elements
  that can be nested within this element.
  This is of class XMLElementContent and has 3 fields:
  - a named integer value (with the name providing a description of
    the meaning) indicating the type of content: a single element,
    a sequence, an Or, and so on.
  - a named enumerated value indicating how many instances
    of this content are expected and admissible, for example
    Zero or One, or One or More.
  - a list of objects that describe the feasible sub-elements within
    the element being defined. These are usually specializations of
    the class XMLElementContent. They have the same structure, just
    different meaning and semantics.
- a named list of XMLAttributeDef objects, with the
  names being those of the attributes being defined for this element.
Consider the result of converting the definition of entry3 above.
It is an object of class XMLElementDef. The type field of its content
is a named integer with value 3, marking it as a sequence.
Since the entire content has no qualifier, exactly one instance is
expected.
Now we look at the sub-elements, accessible from the elements field of
the content.
This is a list of length 4, one for each term in the sequence.
The classes of the objects may help to explain its structure.
 "XMLOrContent" "XMLElementContent" "XMLSequenceContent"
Let's look at the third entry, the sequence (a*, b,c,d, (e|f)).
r <- d$elements$entry3$content$elements[[3]]
Again, this is a sequence. Its sub-entries are of different content
types.
 "XMLElementContent" "XMLElementContent" "XMLElementContent"
 "XMLElementContent" "XMLOrContent"
The first 4 are reasonably obvious. These identify single elements
and are the primitive content types.
For the first of them, the expected element is a and there can be
zero or more of these.
The more interesting entry is the last one.
It is of type Or and we expect exactly
one instance of it. It is interpreted by expecting any of the content
structures described in its elements list. Each of these
is a simple XMLElementContent object and so is a primitive content
type.
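Because the content description is a tree, it is natural to process it recursively. The sketch below rebuilds the DTD notation from such a tree. It works on a mock object built by hand in base R rather than on real parser output, and the field names (type, ocur, elements), the occurrence labels, and the convention that a single-element node stores the element name in its elements field are all assumptions made for the example.

```r
# Mock constructors mirroring the structure described above.
leaf <- function(name, ocur = "Once")
  structure(list(type = "Element", ocur = ocur, elements = name),
            class = "XMLElementContent")
group <- function(kind, parts, ocur = "Once")
  structure(list(type = kind, ocur = ocur, elements = parts),
            class = c(paste0("XML", kind, "Content"), "XMLElementContent"))

# Occurrence label -> DTD qualifier.
suffix <- c("Once" = "", "Zero or One" = "?",
            "Zero or More" = "*", "One or More" = "+")

# Recursively convert a content object back to DTD notation.
contentToString <- function(x) {
  if (inherits(x, "XMLOrContent")) {
    parts <- vapply(x$elements, contentToString, "")
    body <- paste0("(", paste(parts, collapse = " | "), ")")
  } else if (inherits(x, "XMLSequenceContent")) {
    parts <- vapply(x$elements, contentToString, "")
    body <- paste0("(", paste(parts, collapse = ", "), ")")
  } else {
    body <- x$elements          # a single element name
  }
  paste0(body, suffix[[x$ocur]])
}

# Mock of the third term of entry3.
r <- group("Sequence", list(
  leaf("a", "Zero or More"), leaf("b"), leaf("c"), leaf("d"),
  group("Or", list(leaf("e"), leaf("f")))))
contentToString(r)
```

For this mock object the function returns "(a*, b, c, d, (e | f))", which matches the third term in the definition of entry3.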
Back to the Filter
Armed with the contents of a DTD, XML output generated via a filter can
now be validated easily. Suppose the following command is
issued via the filter. (Such calls will most likely be made indirectly.)
filter$output("variable", c(unit="mpg"), value)
Then, the filter will check its current state, specifically the
last open/unfinished element, and examine its content specification.
If the previous command opened an element such as dataset,
then the filter will extract the list of possible entries for this
element. Then it determines whether the variable element
can be added.
In the case of a dataset, this is a simple lookup.
The only acceptable value is a
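The lookup described here can be sketched with a closure that keeps a stack of open elements. Everything in this example is hypothetical: the constructor makeValidatingFilter, its methods, and the table of permitted children, which stands in for information that would really be extracted from the DTD. The check is a plain membership test and ignores ordering and occurrence qualifiers, and values are not escaped.

```r
# allowed: named list mapping an element name to the names of the
# elements that may appear directly inside it (an assumed, simplified
# stand-in for the DTD content specification).
makeValidatingFilter <- function(allowed) {
  openElements <- character()   # stack of open (unfinished) elements
  out <- character()            # XML generated so far

  attrString <- function(attrs) {
    if (length(attrs) == 0L) return("")
    paste0(" ", names(attrs), '="', attrs, '"', collapse = "")
  }
  # Is 'name' permitted inside the last open element?
  check <- function(name) {
    if (length(openElements) == 0L) return(invisible(TRUE))
    parent <- openElements[length(openElements)]
    if (!(name %in% allowed[[parent]]))
      stop("element '", name, "' is not allowed inside '", parent, "'")
    invisible(TRUE)
  }
  list(
    open = function(name, attrs = character()) {
      check(name)
      out <<- c(out, paste0("<", name, attrString(attrs), ">"))
      openElements <<- c(openElements, name)
      invisible(NULL)
    },
    output = function(name, attrs = character(), value) {
      check(name)
      out <<- c(out, paste0("<", name, attrString(attrs), ">",
                            value, "</", name, ">"))
      invisible(NULL)
    },
    close = function() {
      if (length(openElements) == 0L) stop("no open element to close")
      name <- openElements[length(openElements)]
      out <<- c(out, paste0("</", name, ">"))
      openElements <<- openElements[-length(openElements)]
      invisible(NULL)
    },
    result = function() out
  )
}

filter <- makeValidatingFilter(list(dataset = "variable",
                                    variable = character()))
filter$open("dataset")
filter$output("variable", c(unit = "mpg"), 21.4)
filter$close()
filter$result()
```

With this table, an attempt to output any element other than variable while dataset is open stops with an error instead of producing invalid XML.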
Duncan Temple Lang,
Last modified: Mon Sep 30 10:46:17 EDT 2002